Short Papers

On the Distribution of Software Faults

Hongyu Zhang, Member, IEEE

Abstract: The Pareto principle is often used to describe how faults in large software systems are distributed over modules. A recent paper by Andersson and Runeson again confirmed the Pareto principle of fault distribution. In this paper, we show that the distribution of software faults can be more precisely described as the Weibull distribution.

Index Terms: Software fault distribution, empirical research, replication.

1 INTRODUCTION

In the May 2007 issue of this journal, a paper entitled “A Replicated Quantitative Analysis of Fault Distributions in Complex Software Systems” [1] described a replication of Fenton and Ohlsson’s quantitative study [3] of faults in a complex software system. One of the hypotheses it confirmed was the widespread belief that the number of faults in large software systems follows the Pareto principle [4]. We have replicated the studies in [1] and [3] and discovered that the distribution of faults over modules can be better modeled using the Weibull probability distribution function.

The Pareto principle is named after the economist Vilfredo Pareto, who proposed a model to describe the distribution of wealth among individuals. The idea is sometimes expressed more simply as the Pareto principle or the “20-80 rule,” which says that 20 percent of the population owns 80 percent of the wealth. Formally, the cumulative distribution function (CDF) of the Pareto distribution can be defined as

$$P(x) = 1 - \left(\frac{\gamma}{x}\right)^{\beta} \quad (\gamma > 0,\ \beta > 0).$$

The Weibull distribution, developed by the physicist Waloddi Weibull, is one of the most widely used probability distributions in the reliability engineering discipline [6]. The CDF of the Weibull distribution can be formally defined as

$$P(x) = 1 - \exp\left(-\left(\frac{x}{\gamma}\right)^{\beta}\right) \quad (\gamma > 0,\ \beta > 0).$$

In this short paper, we show that the Weibull distribution fits the actual data better and, therefore, that it is more appropriate to describe the distribution of software faults (including both prerelease and postrelease faults) as the Weibull distribution.

2 THE WEIBULL DISTRIBUTION OF SOFTWARE FAULTS OVER MODULES

In this research, we replicate the studies described in [1] and [3] using the public Eclipse data collected by the University of Saarland [7]. Eclipse is a widely used integrated development platform for creating Java, C++, and Web applications. The public Eclipse data sets contain measurement and fault data for Eclipse Versions 2.0, 2.1, and 3.0, which are collected from Eclipse’s bug databases and version archives. The original data sets contain data at the file and package levels. In this correspondence, we use the package-level data (Table 1) to present our findings, as the granularity level of “package” is more similar to the level of “modules” used in the earlier work (that is, a collection of files). However, we should note that our findings apply to the file-level data as well.

For each Eclipse project, we analyze the distribution of its faults across the modules (in this example, the packages). As in the original studies [1], [3], we also find that the distribution is highly skewed: a small number of modules accounts for most of the faults. Further analysis shows that the faults follow the Weibull distribution instead of the Pareto distribution. Using statistical packages such as SPSS, we are able to perform a nonlinear regression analysis and derive the parameters for each distribution. Fig. 1 shows the fitted curves of the Weibull and Pareto distributions for the prerelease faults in Eclipse 2.1. Clearly, the Weibull distribution fits the actual data better.

To statistically compare the goodness of fit of these two distributions, we compute the coefficient of determination ($R^2$) and the Standard Error of Estimate ($S_e$).
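The fitting step described above can be illustrated with a minimal sketch. It is not the author's procedure: the paper reports nonlinear regression in a statistical package, whereas this sketch swaps in an ordinary least-squares fit on the linearized Weibull form ln(-ln(1 - P)) = β ln x - β ln γ, which needs only the Python standard library. The symbols γ and β, the choice of module rank as x, the function names, and the fault counts in the demo are all assumptions made for illustration.

```python
import math


def weibull_cdf(x, gamma, beta):
    """Weibull CDF: P(x) = 1 - exp(-(x / gamma) ** beta)."""
    return 1.0 - math.exp(-((x / gamma) ** beta))


def pareto_cdf(x, gamma, beta):
    """Pareto CDF: P(x) = 1 - (gamma / x) ** beta, meaningful for x >= gamma."""
    return 1.0 - (gamma / x) ** beta


def cumulative_fault_fraction(fault_counts):
    """Order modules by decreasing fault count and return, for each rank,
    the fraction of all faults accumulated up to that rank."""
    ordered = sorted(fault_counts, reverse=True)
    total = sum(ordered)
    running, fractions = 0, []
    for count in ordered:
        running += count
        fractions.append(running / total)
    return fractions


def fit_weibull_linearized(xs, ps):
    """Estimate (gamma, beta) by least squares on the linearized form
    ln(-ln(1 - P)) = beta * ln(x) - beta * ln(gamma).
    Points with P <= 0 or P >= 1 are skipped because the transform is undefined."""
    pts = [(math.log(x), math.log(-math.log(1.0 - p)))
           for x, p in zip(xs, ps) if 0.0 < p < 1.0]
    n = len(pts)
    mean_u = sum(u for u, _ in pts) / n
    mean_v = sum(v for _, v in pts) / n
    sxx = sum((u - mean_u) ** 2 for u, _ in pts)
    sxy = sum((u - mean_u) * (v - mean_v) for u, v in pts)
    beta = sxy / sxx                      # slope of the fitted line
    intercept = mean_v - beta * mean_u    # equals -beta * ln(gamma)
    gamma = math.exp(-intercept / beta)
    return gamma, beta


if __name__ == "__main__":
    # Hypothetical per-package fault counts, not taken from the Eclipse data.
    faults = [120, 75, 40, 22, 15, 9, 6, 4, 2, 1]
    fractions = cumulative_fault_fraction(faults)
    ranks = list(range(1, len(faults) + 1))   # assumption: x is the module rank
    gamma, beta = fit_weibull_linearized(ranks, fractions)
    print(f"gamma={gamma:.3f}, beta={beta:.3f}")
```

The linearized fit minimizes error in the transformed space rather than in P itself, so its parameters will generally differ somewhat from those of a true nonlinear regression; it is best treated as a quick approximation or as a starting point for one.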
The $R^2$ statistic measures the percentage of variation that can be explained by the model. Its value is between 0 and 1, with a higher value indicating a better fit. Therefore, $R^2$ can be seen as an index of the relative goodness of fit of a sample regression curve. In Fig. 1, the Weibull distribution has an $R^2$ value of 0.998, meaning that the model accounts for 99.8 percent of the variation, whereas the $R^2$ value for the Pareto distribution is only 0.684.

TABLE 1. The Eclipse Data Sets (Package Level)

Fig. 1. The distribution of prerelease faults in Eclipse 2.1. It shows the percentage of the accumulated number of faults when the modules are ordered by decreasing number of faults.

IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 34, NO. 2, MARCH/APRIL 2008

The author is with the School of Software, Tsinghua University, Beijing 100084, China. E-mail: firstname.lastname@example.org.
Manuscript received 9 May 2007; revised 6 Nov. 2007; accepted 8 Nov. 2007; published online 19 Nov. 2007.
Recommended for acceptance by B. Littlewood.
For information on obtaining reprints of this article, please send e-mail to: email@example.com, and reference IEEECS Log Number TSE-2007-05-0157.
Digital Object Identifier no. 10.1109/TSE.2007.70771.
0098-5589/08/$25.00 © 2008 IEEE. Published by the IEEE Computer Society.

$S_e$ is a measure of the absolute prediction error and is computed as

$$S_e = \sqrt{\frac{\sum (y - y')^2}{n - 2}},$$

where $y$ and $y'$ are the actual and predicted values, respectively, and $n$ is the number of data points. A larger $S_e$ indicates a larger prediction error. In Fig. 1, the Weibull distribution has an $S_e$ value of 0.01, whereas the Pareto distribution has an $S_e$ value of 0.11. Therefore, we conclude that the Weibull distribution is the better-fitting distribution.

In the same way, we analyze all projects in the Eclipse data sets, for both prerelease and postrelease faults. The results show that the Weibull distribution can better describe the distribution of faults (Table 2).
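The two goodness-of-fit statistics used above can be computed directly from paired actual and predicted values. The sketch below is illustrative only: it assumes the usual definition of $R^2$ as one minus the ratio of the residual sum of squares to the total sum of squares about the mean (the text does not spell out the formula), and it follows the $S_e$ formula with $n - 2$ in the denominator. The function names and sample values are hypothetical.

```python
import math


def r_squared(actual, predicted):
    """Coefficient of determination, assumed here to be
    1 - (residual sum of squares) / (total sum of squares about the mean)."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((y - yp) ** 2 for y, yp in zip(actual, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    return 1.0 - ss_res / ss_tot


def standard_error_of_estimate(actual, predicted):
    """S_e = sqrt(sum((y - y')^2) / (n - 2)); requires more than two points."""
    n = len(actual)
    ss_res = sum((y - yp) ** 2 for y, yp in zip(actual, predicted))
    return math.sqrt(ss_res / (n - 2))


if __name__ == "__main__":
    # Hypothetical cumulative fault fractions and model predictions.
    actual = [0.40, 0.65, 0.80, 0.90, 0.96, 1.00]
    predicted = [0.42, 0.63, 0.79, 0.91, 0.95, 0.99]
    print(r_squared(actual, predicted))
    print(standard_error_of_estimate(actual, predicted))
```

Under this definition, $R^2$ for a poorly fitting nonlinear model can fall below 0, so the 0 to 1 range holds only for models that fit at least as well as the mean of the data.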
3 CONCLUSION

We perform a replicated study of [1] and [3] and find that the Weibull distribution describes the actual fault data well. We suggest using the Weibull distribution to precisely model the fault distribution over modules, instead of the commonly used term “Pareto principle.”

ACKNOWLEDGMENTS

The author thanks Carina Andersson, the author of [1], who confirmed his findings using the data sets described in [1]. The author gratefully acknowledges her support for this work. He also thanks the reviewers for their valuable comments. This work is partially sponsored by Chinese NSF grants 90718022 and 60703060.

REFERENCES

[1] C. Andersson and P. Runeson, “A Replicated Quantitative Analysis of Fault Distributions in Complex Software Systems,” IEEE Trans. Software Eng., vol. 33, no. 5, pp. 273-286, May 2007.
[2] R. Cooper and A. Weekes, Data, Models, and Statistical Analysis. Philip Allan Publishing, 1983.
[3] N. Fenton and N. Ohlsson, “Quantitative Analysis of Faults and Failures in a Complex Software System,” IEEE Trans. Software Eng., vol. 26, no. 8, pp. 797-814, Aug. 2000.
[4] J.M. Juran and F.M. Gryna Jr., Quality Control Handbook, fourth ed. McGraw-Hill, 1988.
[5] G. Keller and B. Warrack, Statistics for Management and Economics. Duxbury, 1999.
[6] R. Ramakumar, Engineering Reliability: Fundamentals and Applications. Prentice Hall, 1993.
[7] T. Zimmermann, R. Premraj, and A. Zeller, “Predicting Defects for Eclipse,” Proc. Third Int’l Workshop Predictor Models in Software Eng., http://www.st.cs.uni-sb.de/softevo/, May 2007.

TABLE 2. The Weibull Distribution of Eclipse Faults