Estimating LOC for Information Systems from their Conceptual Data Models

Hee Beng Kuan Tan, Yuan Zhao
School of Electrical and Electronic Engineering, Block S2, Nanyang Technological University, Nanyang Avenue, Singapore 639798
email@example.com

Hongyu Zhang
School of Computer Science and Information Technology, RMIT University, Melbourne 3001, Victoria, Australia
Hongyu.firstname.lastname@example.org

ABSTRACT
Effort and cost estimation is crucial in software management, and the estimation of software size plays a key role in it. Line of Code (LOC) is still a commonly used software size measure. Although software sizing has been recognized as an important problem for more than two decades, existing methods still have many problems. The conceptual data model is widely used in requirements analysis for information systems, and it is not difficult to construct in the early stage of development. Many characteristics of an information system are reflected in its conceptual data model. We explore the use of the conceptual data model for estimating LOC. This paper proposes a novel method for estimating the LOC of an information system from its conceptual data model through a multiple linear regression model. We have validated the method on samples collected from both industry and open-source systems.

Categories and Subject Descriptors
D.2.8 Metrics – Product metrics.

General Terms
Measurement.

Keywords
Software sizing, line of code (LOC), conceptual data model, multiple linear regression model.

1. INTRODUCTION
Estimating the required effort and cost for a software project is crucial [6, 8, 16, 23]. Overestimation may lead to the abandonment of essential projects or the loss of projects to competitors. Underestimation may result in huge financial losses. It is also likely to affect the success and quality of projects adversely.
The estimation of software size plays a key role in project effort and cost estimation. Line of Code (LOC) and Function Points (FP) are still the most commonly used size measures adopted by existing software cost estimation models. Despite the existence of well-known software sizing methods such as the Function Point method [1, 10, 13, 15] and its variants tailored for object-oriented software [11, 19], many practitioners and project managers continue to produce estimates based on ad-hoc or so-called “expert” approaches [2, 21]. The main causes cited are the lack of required information in the early stage of a project, the need for domain-specific methods and the effort required. However, ad-hoc and expert approaches are often inaccurate, which leads to problems with project budgets and schedules and affects the success of many projects.

The entity-relationship (ER) model originally proposed by Chen is generally regarded as the most widely used tool for the conceptual data modeling of information systems [14, 29]. An ER model is constructed to depict the ideal organization of data, independent of the physical organization of the data and of where and how the data are used. An ER model is specified in a diagram called an ER diagram. The class diagram is an evolution of the ER diagram: an ER diagram is equivalent to a simplified class diagram that excludes operations. In this paper, to follow the latest notation, we use the simplified class diagram that excludes operations instead of the ER diagram for conceptual data modeling. The term conceptual data model refers to such a class diagram, which models the entities and concepts in an information system and the relationships between them.

Information systems constitute one of the largest software domains. This paper proposes a novel method to estimate the LOC for an information system from its conceptual data model. The method is an enhancement of a preliminary method proposed earlier (see Section 2).
Based on samples collected from both industry and open-source systems, we have validated the proposed method for systems developed using several programming languages. We have also empirically compared the LOC estimation from the proposed method with the well-known Function Point method.

The paper is organized as follows. Section 2 gives the background information. Section 3 presents the proposed method for LOC estimation. Section 4 reports our validation of the proposed method. Section 5 empirically compares LOC estimation between the proposed method and the well-known Function Point method. Section 6 compares the proposed method with related methods to conclude the paper.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ICSE'06, May 20–28, 2006, Shanghai, China. Copyright 2006 ACM 1-59593-085-X/06/0005...$5.00.

2. BACKGROUND
Regression analysis is a classical statistical technique for building estimation models. It is one of the most commonly used methods in econometric work. It is concerned with describing and evaluating the relationship between a dependent variable and one or more independent variables. The relationship is described as a model for estimating the dependent variable from the independent variables. The model is built and evaluated by collecting sample data for these variables. The following multiple linear regression model, which expresses the estimated value of a dependent variable $y$ as a function of $k$ independent variables $x_1, x_2, \ldots, x_k$, is commonly used for regression analysis:

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_k x_k$$

where $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_k$ are the coefficients to be estimated from a random sample. In building a multiple linear regression model, as a rule of thumb, the size of the sample should not be less than five times the total number of independent variables. The following tests, which can be computed by most statistical packages, are usually carried out in building a model [17, 20, 22, 24]:

1) Significance Test: This is usually based on a 5% level of significance. An F-test should be done for the overall model. If the value of (Pr > F) is less than 0.05, the overall model is useful. That is, there is sufficient evidence that at least one of the coefficients is non-zero at the 5% level of significance. Furthermore, a t-test is conducted on each $\hat{\beta}_j$ (0 ≤ j ≤ k). If all the values of (Pr > |t|) are less than 0.05, then there is sufficient evidence of a linear relationship between $y$ and each $x_j$ (1 ≤ j ≤ k) at the 5% level of significance.

2) Fitness Test: If the adjusted multiple coefficient of determination $R_a^2$ for the model is higher than 0.75, then the least squares model has explained more than 75% of the total sample variation of the $y$ values, after adjusting for the sample size and the number of independent variables.

3) Collinearity Test: The independent variables should not be highly correlated. If the variance inflation factors (VIFs) and the condition indexes of all the independent variables are less than 10 (the threshold value), then there is no significant evidence of collinearity between the independent variables.

4) Extreme Case Test (Outlier): There should be no extreme cases (i.e., outliers). Extreme cases can distort coefficient estimation. For each datapoint, a residual analysis is conducted. If the absolute value of its studentized residual (RStudent) is below 2 (a usual threshold for identifying a large influence of an observation on the parameter estimates), then there is no significant evidence that the datapoint has a large influence on the parameter estimation. That is, there is no significant evidence that the datapoint is an extreme case. Otherwise, the datapoint should be investigated. If the datapoint itself is wrong, it should be recollected if possible; otherwise, appropriate action should be taken.

Any model built from a random sample should be validated against another random sample. As a rule of thumb, the size of the latter sample should not be less than half the size of the former sample. MMRE and PRED(0.25) are the most commonly used assessments for validation. MMRE is the mean magnitude of relative error, that is, the mean of |actual − estimate| / actual over all cases; a lower MMRE is preferred. PRED(0.25) is the number of cases in which the estimates are within 25% of the actual values, divided by the total number of cases. The main criterion for validation lies in obtaining acceptable values of MMRE and PRED(0.25). Their acceptable values are not more than 0.25 and not less than 75% respectively.

Our observation reveals that the size of an information system is well characterized by its conceptual data model. In fact, the derivation of program structure from the structure of the data that the program processes in the Jackson structured program design method (JSP) has already implied this characteristic. Based on this observation, we have for some time explored the use of multiple linear regression models to estimate the LOC of information systems from their conceptual data models. A preliminary method was proposed in . This method is based on the entity-relationship (ER) model and uses multiple linear regression models. It uses the total number of entity types (E), the total number of relationships (R) and the total number of attributes (A) to characterize an ER model.
In this method, the LOC of an information system is estimated from the three variables E, R and A. We used the method to build and validate LOC estimation models for two programming languages: Visual Basic with SQL; Java with JSP, HTML and SQL. The models were built and validated from limited datasets collected from the industry. The model that we built for Visual Basic based systems is as follows:

$$\mathit{Size} = 6.788 - 0.062\,E + 2.169\,R - 0.007\,A$$

The model that we built for Java based systems is as follows:

$$\mathit{Size} = 4.678 + 1.218\,E + 0.028\,R + 0.023\,A$$

In both models, size is in thousand lines of code (KLOC).

3. THE PROPOSED ESTIMATION OF LOC
To continue the work on estimating LOC for information systems from conceptual data models, we further examined the two models that we built for Visual Basic and Java based systems in . We observed that the coefficients for E and A in the model for Visual Basic based systems were negative. Furthermore, from the coefficients in the two LOC estimation models, one may expect to see a consistent relative order of influence of E, R and A on LOC. However, no such consistent pattern appears. In the model for Visual Basic based systems, the coefficient for R is the highest, but the coefficients for both A and E are negative. In the model for Java based systems, by contrast, the coefficient for E is the highest, R is the second highest and A is the lowest. To address these problems, we performed more statistical analysis of the datasets and also investigated their accuracy. It is well known that collinearity [3, 17] can have harmful effects on multiple regression models, both in the interpretation of the results and in how they are obtained. To investigate the above observations, we ran collinearity diagnostics for the two models. From the results computed by SAS, we noticed that for both models some of the condition indexes exceed the threshold value (10).
For the variables whose condition indexes exceeded the threshold value, we further examined their proportions of variation and found that, in both models, some exceeded the threshold value (0.90). Thus, from the statistical analysis of both models, we conclude that collinearity (including multicollinearity) may exist among E, R and A in both models.

Most organizations in the industry regard their systems as confidential and therefore do not make them accessible to external parties. Consequently, most of the data in the datasets used for the preliminary method were supplied by users directly, with no means of verification. After further investigation, we found some errors in these data. Based on this finding, we decided to recollect datasets for rebuilding and revalidating the models. Given the previous experience, for the industry datasets we decided to collect data only from systems whose documentation and implementation we were allowed to verify. For this reason, it is extremely difficult to collect larger datasets from the industry. To address this problem, in addition to the industry datasets, we have also collected data from open-source systems through reengineering.

Through much experimentation, we propose the following multiple linear regression model to estimate, from its conceptual data model, the LOC of an information system developed using a given programming language or environment, for information systems with the same major characteristics:

$$\mathit{KLOC} = \hat{\beta}_0 + \hat{\beta}_C\,C + \hat{\beta}_R\,R + \hat{\beta}_{\bar{A}}\,\bar{A}$$

The multiple linear regression model has the following three independent variables to characterize the conceptual data model:
• $C$: the total number of classes in the conceptual data model.
• $R$: the total number of relationships in the conceptual data model.
• $\bar{A}$: the average number of attributes per class, that is, $\bar{A} = A / C$, where $A$ is the total number of attributes in the conceptual data model.

$\hat{\beta}_0$, $\hat{\beta}_C$, $\hat{\beta}_R$ and $\hat{\beta}_{\bar{A}}$ are the coefficients in the multiple regression model, to be estimated from samples. Note that a separate model should be built for each programming language or development environment used, because the size of the source code also depends on the programming language or development environment.

4. VALIDATION OF THE PROPOSED LOC ESTIMATION
We have validated the proposed method through the following five pairs of datasets collected from industry and open-source data-intensive information systems. Each pair has one dataset for model building and another for model validation:

1) Industry VB-based System: These are collected from systems in the industry that were developed using Visual Basic with SQL. Both datasets have 16 systems.

2) Open-source PHP-based System: These are collected from open-source systems that were developed using PHP with HTML and SQL. The dataset for model building has 32 systems. The dataset for validation has 31 systems.

3) Java-based System: We have the following three pairs of datasets collected from systems that were developed using Java with JSP, HTML and SQL:
i) Industry Java-based System: These are collected from systems that were developed in the industry. Both datasets have 16 systems.
ii) Open-source Java-based System: These are collected from open-source systems. The dataset for model building has 30 systems. The dataset for validation has 24 systems.
iii) The combined Java-based System: These are formed by combining the corresponding datasets from the last two pairs of datasets. Therefore, the dataset for model building has 46 systems. The dataset for validation has 40 systems.

As we have three independent variables, according to the rule of thumb mentioned earlier, each dataset for model building must have at least 15 systems (5 times the number of independent variables). The main objective of collecting data from open-source systems is to have larger datasets.
Many organizations in the industry did not construct proper conceptual data models in the early stage of software development. Furthermore, for reasons of confidentiality, many industry organizations did not allow us to access their system documentation to verify the accuracy of the data they supplied. To ensure accuracy, we do not use such data. Moreover, data collected from open-source systems have the advantage that others can verify them.

In the datasets collected from the industry, all the conceptual data models used were extracted from requirement specifications (to reflect the early stage of software development). All data supplied by users were verified by us to ensure their correctness. Any questionable data that could not be further verified were excluded.

In the datasets collected from open-source systems, all the sample systems were randomly collected from SourceForge and Freshmeat [12, 26], the world’s two largest open-source software development websites. All the data were collected by our final year students in their final year projects. For each sampled open-source system, the following steps were carried out:

1) Reverse engineer its conceptual data model from the system: Students study the database schema and documentation of the system. They also test the system. From the study and testing, they identify the tables in the schema that represent domain entities and concepts. They also identify the access paths between the selected tables. Then, based on the transformation of class diagrams into relational database design discussed in Chapter 13 of , they work backward to identify all the classes and the relationships between them.

2) Count the LOC (lines of code) of the system automatically with software tools: Comments, blank lines and text files are excluded from the count.

For systems that cater for multiple languages, the above tasks are performed only for the English language. That is, any source code that is not for the English language is excluded.
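The counting rule in step 2 (exclude comments and blank lines) can be sketched as follows. This is only an illustration, not the tool used in the study, which the paper does not name. The function name `count_loc` and the comment prefixes `//` and `#` are assumptions chosen for the example, and block comments are deliberately not handled.

```python
def count_loc(source: str, comment_prefixes=("//", "#")) -> int:
    """Count lines that are neither blank nor whole-line comments.

    Assumption: a comment is a line whose first non-blank characters
    match one of `comment_prefixes`. Block comments (/* ... */) and
    comments trailing after code are not treated specially.
    """
    count = 0
    for raw_line in source.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # blank line
        if line.startswith(comment_prefixes):
            continue  # whole-line comment
        count += 1
    return count


# Hypothetical PHP-like fragment used only to exercise the function.
sample = """<?php
// connect to the database

# legacy comment style
$db = connect();
echo $db->name;  // a trailing comment still counts as code
"""

print(count_loc(sample))
```

A production counter would also need per-language comment syntax and a file-type filter to skip text files, as step 2 requires.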
All the required statistics for our model building and validation are computed by the statistical package SAS.

4.1 Model Building and Validation

4.1.1 Industry VB-Based System
This pair of datasets is collected from a number of different organizations in the industry. It is collected from systems in a variety of domains including shipment management, auction management, finance, administration, logistics management, business management, medical systems and donation management. Both datasets in this pair have similar composition. The dataset for model building is shown in Table 1. From this dataset, we built the following model:

$$\mathit{KLOC} = -11.546 + 1.374\,C + 1.126\,R + 0.311\,\bar{A}$$

The model was tested as discussed in Section 2. In the significance test, (Pr > F) < 0.0001 and (Pr > |t|) for the intercept, $C$, $R$ and $\bar{A}$ are 0.0032, <0.0001, <0.0001 and 0.0491, respectively. In the fitness test, $R_a^2$ = 0.9737. In the collinearity test, the variance inflation factors (VIFs) and the condition indexes of all the independent variables are less than 10 (the threshold). In the extreme case test, the absolute values of the studentized residuals (RStudents) for all the systems are below 2 (the threshold). Therefore, all the test results are affirmative.

Table 1. The dataset for building VB-based LOC estimation model

| System No | Actual Size (KLOC) | $C$ | $R$ | $\bar{A}$ |
|---|---|---|---|---|
| 1 | 37.54 | 27 | 8 | 8.000 |
| 2 | 14.723 | 8 | 6 | 25.375 |
| 3 | 24.667 | 12 | 10 | 16.917 |
| 4 | 42.1 | 19 | 25 | 16.526 |
| 5 | 87.23 | 35 | 38 | 8.343 |
| 6 | 31.445 | 14 | 21 | 6.214 |
| 7 | 67.04 | 35 | 27 | 16.829 |
| 8 | 30.79 | 17 | 20 | 6.176 |
| 9 | 22.402 | 13 | 14 | 5.769 |
| 10 | 69.713 | 28 | 31 | 9.571 |
| 11 | 16.17 | 6 | 9 | 27.333 |
| 12 | 90.854 | 37 | 39 | 21.270 |
| 13 | 64.35 | 27 | 33 | 5.481 |
| 14 | 27.076 | 13 | 16 | 8.000 |
| 15 | 20.933 | 10 | 10 | 6.500 |
| 16 | 40.341 | 22 | 20 | 5.818 |

The model was validated using the dataset shown in Table 2. MMRE and Pred(0.25) computed are 0.18 and 0.81 respectively. As discussed in Section 2, these values fall within the acceptable level (not more than 0.25 and not less than 75% respectively). Therefore, the evaluation result supports the validity of the LOC estimation model built.
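As an illustration of the validation step, the sketch below applies the VB-based model reported above to the Table 2 validation rows and computes MMRE and Pred(0.25) as defined in Section 2. The coefficients and data are taken from the paper. The function and variable names are our own, and the study itself used SAS rather than Python.

```python
# Coefficients of the VB-based model reported in Section 4.1.1.
INTERCEPT, COEF_C, COEF_R, COEF_A_BAR = -11.546, 1.374, 1.126, 0.311

# Table 2 rows: (actual KLOC, classes C, relationships R,
# average number of attributes per class).
TABLE_2 = [
    (27.217, 16, 8, 6.188), (14.53, 7, 6, 16.286),
    (65.872, 28, 24, 13.179), (41.435, 24, 21, 14.417),
    (72.0, 36, 37, 15.000), (29.52, 16, 16, 27.563),
    (52.76, 31, 24, 10.645), (46.92, 29, 24, 6.931),
    (97.88, 42, 44, 33.524), (38.764, 18, 12, 11.000),
    (31.665, 12, 14, 8.000), (77.52, 30, 32, 7.533),
    (16.81, 10, 7, 9.500), (59.332, 32, 26, 14.781),
    (107.8, 41, 45, 16.561), (53.66, 24, 18, 9.125),
]


def estimate_kloc(c, r, a_bar):
    """Apply the fitted linear model to one conceptual data model."""
    return INTERCEPT + COEF_C * c + COEF_R * r + COEF_A_BAR * a_bar


def mre(actual, estimated):
    """Magnitude of relative error for a single system."""
    return abs(actual - estimated) / actual


def mmre(pairs):
    """Mean MRE over (actual, estimated) pairs."""
    return sum(mre(a, e) for a, e in pairs) / len(pairs)


def pred(pairs, level=0.25):
    """Fraction of systems whose estimate is within `level` of the actual."""
    return sum(1 for a, e in pairs if mre(a, e) <= level) / len(pairs)


pairs = [(actual, estimate_kloc(c, r, a_bar))
         for actual, c, r, a_bar in TABLE_2]
print(f"MMRE = {mmre(pairs):.2f}, Pred(0.25) = {pred(pairs):.2f}")
```

The same three functions can be reused for the other models by swapping in their coefficients and validation tables.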
4.1.2 Open-Source PHP-Based System
This pair of datasets is collected from two open-source websites: SourceForge and Freshmeat [12, 26]. It is collected from systems in a variety of domains including content management, resource scheduling, inventory management, viewing and maintaining pictures, and entertainment. Both datasets in this pair have similar composition. The dataset for model building is shown in Table 3. From this dataset, we built the following model:

$$\mathit{KLOC} = -13.223 + 1.241\,C + 1.672\,R + 0.652\,\bar{A}$$

The model was tested as discussed in Section 2. In the significance test, (Pr > F) < 0.0001 and (Pr > |t|) for the intercept, $C$, $R$ and $\bar{A}$ are <0.0001, <0.0001, <0.0001 and 0.0467 respectively. In the fitness test, $R_a^2$ = 0.9438. In the collinearity test, the variance inflation factors (VIFs) and the condition indexes of all the independent variables are less than 10 (the threshold). In the extreme case test, the absolute values of the studentized residuals (RStudents) for all the systems are below 2 (the threshold). Therefore, all the test results are affirmative.

Table 2. The validation dataset for VB-based LOC estimation model

| System No | Actual Size (KLOC) | $C$ | $R$ | $\bar{A}$ | Estimated Size (KLOC) | MRE |
|---|---|---|---|---|---|---|
| 1 | 27.217 | 16 | 8 | 6.188 | 21.370 | 0.215 |
| 2 | 14.53 | 7 | 6 | 16.286 | 9.893 | 0.319 |
| 3 | 65.872 | 28 | 24 | 13.179 | 58.049 | 0.119 |
| 4 | 41.435 | 24 | 21 | 14.417 | 49.560 | 0.196 |
| 5 | 72 | 36 | 37 | 15.000 | 84.245 | 0.170 |
| 6 | 29.52 | 16 | 16 | 27.563 | 37.026 | 0.254 |
| 7 | 52.76 | 31 | 24 | 10.645 | 61.383 | 0.163 |
| 8 | 46.92 | 29 | 24 | 6.931 | 57.480 | 0.225 |
| 9 | 97.88 | 42 | 44 | 33.524 | 106.132 | 0.084 |
| 10 | 38.764 | 18 | 12 | 11.000 | 30.119 | 0.223 |
| 11 | 31.665 | 12 | 14 | 8.000 | 23.194 | 0.268 |
| 12 | 77.52 | 30 | 32 | 7.533 | 68.049 | 0.122 |
| 13 | 16.81 | 10 | 7 | 9.500 | 13.031 | 0.225 |
| 14 | 59.332 | 32 | 26 | 14.781 | 66.295 | 0.117 |
| 15 | 107.8 | 41 | 45 | 16.561 | 100.608 | 0.067 |
| 16 | 53.66 | 24 | 18 | 9.125 | 44.536 | 0.170 |

MMRE = 0.18, Pred(0.25) = 0.81

The model was validated using the dataset shown in Table 4. MMRE and Pred(0.25) computed are 0.15 and 0.81 respectively. These values fall within the acceptable level.
Therefore, the evaluation result supports the validity of the LOC estimation model built.

4.1.3 Java-Based System
There are three pairs of datasets collected for Java-based systems: Industry Java-based system, Open-source Java-based system, and Combined industry and open-source Java-based system.

4.1.3.1 Industry Java-Based System
This pair of datasets is collected from a number of different organizations in the industry. It is collected from systems in a variety of domains including food-supply management, business management, project management, schedule management, demand chain management, freight management and booking management. Both datasets in this pair have similar composition. The dataset for model building is shown in Table 5. From this dataset, we built the following model:

$$\mathit{KLOC} = -10.729 + 1.324\,C + 1.254\,R + 0.889\,\bar{A}$$

The model was tested as discussed in Section 2. In the significance test, (Pr > F) < 0.0001 and (Pr > |t|) for the intercept, $C$, $R$ and $\bar{A}$ are 0.0008, <0.0001, <0.0001 and 0.0178 respectively. In the fitness test, $R_a^2$ = 0.9910. In the collinearity test, the variance inflation factors (VIFs) and the condition indexes of all the independent variables are less than 10 (the threshold). In the extreme case test, the absolute values of the studentized residuals (RStudents) for all the systems are below 2 (the threshold). Therefore, all the test results are affirmative.

Table 3. The dataset for building open-source PHP-based LOC estimation model

| System | Actual Size (KLOC) | $C$ | $R$ | $\bar{A}$ |
|---|---|---|---|---|
| Bannerex3a | 3.038 | 5 | 2 | 10.600 |
| Castor | 22.599 | 17 | 7 | 7.000 |
| Cmsmadesimple | 32.243 | 21 | 13 | 4.524 |
| Comendar | 16.164 | 13 | 11 | 7.077 |
| Commerce | 83.862 | 35 | 24 | 6.571 |
| Coppermine | 24.22 | 13 | 9 | 8.077 |
| EclipseBB | 63.929 | 35 | 19 | 8.029 |
| Jdcms | 2.543 | 5 | 3 | 9.400 |
| Linkbase | 6.697 | 5 | 5 | 7.000 |
| Linpha | 55.537 | 25 | 14 | 8.640 |
| Lotgd | 55.752 | 39 | 10 | 9.077 |
| Mailwatch | 62.602 | 30 | 17 | 7.000 |
| Mantis | 67.111 | 23 | 22 | 14.957 |
| Mundimail | 2.552 | 3 | 1 | 8.333 |
| opendocman | 12.17 | 10 | 5 | 3.700 |
| Openrating | 12.757 | 13 | 9 | 5.000 |
| php4Flicks | 5.695 | 7 | 3 | 8.429 |
| Phpalumni | 7.744 | 9 | 6 | 9.222 |
| Phpcollegeex | 7.514 | 4 | 1 | 8.000 |
| Phpman | 11.054 | 9 | 9 | 3.667 |
| phpmylibrary | 29.77 | 17 | 15 | 3.412 |
| phpstudentcenter | 11.653 | 9 | 8 | 8.778 |
| phptimesheet | 6.847 | 5 | 4 | 3.600 |
| Poppawid | 13.389 | 7 | 5 | 11.714 |
| Refbase | 14.45 | 12 | 6 | 16.583 |
| Replex | 4.414 | 6 | 3 | 3.667 |
| Timeclock | 2.102 | 3 | 1 | 3.333 |
| Uccass | 42.819 | 20 | 18 | 3.500 |
| Uma | 4.077 | 4 | 2 | 9.000 |
| Vallheru | 57.408 | 33 | 14 | 9.242 |
| WebFileSystem | 7.428 | 7 | 3 | 7.000 |
| Winventory | 8.947 | 15 | 5 | 4.000 |

The model was validated using the dataset shown in Table 6. MMRE and Pred(0.25) computed are 0.15 and 0.81 respectively. These values fall within the acceptable level. Therefore, the evaluation result supports the validity of the LOC estimation model built.

4.1.3.2 Open-Source Java-Based System
This pair of datasets is collected from two open-source websites: SourceForge and Freshmeat [12, 26]. It is collected from systems in a variety of domains including content management, project organization, member management, job scheduling and entertainment. Both datasets in this pair have similar composition. The dataset for model building is shown in Table 7. From this dataset, we built the following model:

$$\mathit{KLOC} = -10.121 + 1.201\,C + 1.439\,R + 0.726\,\bar{A}$$

The model was tested as discussed in Section 2. In the significance test, (Pr > F) < 0.0001 and (Pr > |t|) for the intercept, $C$, $R$ and $\bar{A}$ are 0.0017, <0.0001, <0.0001 and 0.0499 respectively. In the fitness test, $R_a^2$ = 0.9587.
In the collinearity test, the variance inflation factors (VIFs) and the condition indexes of all the independent variables are less than 10 (the threshold). In the extreme case test, except for the jwp and xc-ast systems, the absolute values of the studentized residuals (RStudents) for all the systems are below 2 (the threshold). The studentized residuals for these two systems are 2.2146 and 2.3948 respectively. We checked the data for these two systems and found them correct, so no adjustment to the dataset was made. Therefore, all the test results are affirmative, except that these two systems (out of 30) show some mild extreme case behavior.

The model was validated using the dataset shown in Table 8. MMRE and Pred(0.25) computed are 0.20 and 0.79 respectively. These values fall within the acceptable level. Therefore, the evaluation result supports the validity of the LOC estimation model built.

4.1.3.3 Combined Industry and Open-Source Java-Based System
This pair of datasets is the combination of the Java-based systems collected from the industry and from open-source. From the dataset formed by combining the two model-building datasets collected from the industry and open-source Java-based systems (Tables 5 and 7), we built the following model:

$$\mathit{KLOC} = -10.576 + 1.258\,C + 1.392\,R + 0.754\,\bar{A}$$

The model was tested as discussed in Section 2. In the significance test, (Pr > F) < 0.0001 and (Pr > |t|) for the intercept, $C$, $R$ and $\bar{A}$ are <0.0001, <0.0001, <0.0001 and 0.0051 respectively. In the fitness test, $R_a^2$ = 0.9737. In the collinearity test, the variance inflation factors (VIFs) and the condition indexes of all the independent variables are less than 10 (the threshold). In the extreme case test, among the 46 systems, except for the jcv, jwp and xc-ast systems, the absolute values of the studentized residuals (RStudents) for all the systems are below 2 (the threshold). The studentized residuals for these three systems are -2.0230, 2.7722 and 2.2337 respectively.
As we confirmed the correctness of the data for these three systems, no adjustment to the dataset was made. Therefore, all the test results are affirmative, except that these three systems (out of 46) show some extreme case behavior. The behavior is moderate only for the second system; for the other two it is mild. We believe that this could be due to the combination of two datasets whose conceptual data models differ in accuracy: the conceptual data models of the open-source datasets are clearly more accurate than those of the industry dataset.

Table 4. The validation dataset for open-source PHP-based LOC estimation model

| System | Actual Size (KLOC) | $C$ | $R$ | $\bar{A}$ | Estimated Size (KLOC) | MRE |
|---|---|---|---|---|---|---|
| ackerTodo | 9.47 | 8 | 5 | 5.000 | 8.325 | 0.121 |
| Bookmark4u | 19.381 | 10 | 10 | 5.800 | 19.689 | 0.016 |
| ByteMonsoon | 4.219 | 6 | 2 | 8.000 | 2.783 | 0.340 |
| Core | 17.065 | 8 | 5 | 8.125 | 10.363 | 0.355 |
| e107 | 65.649 | 34 | 12 | 7.912 | 67.215 | 0.051 |
| eFiction | 12.443 | 9 | 6 | 6.222 | 12.035 | 0.033 |
| fdcl | 3.856 | 5 | 2 | 11.600 | 3.889 | 0.009 |
| galant | 10.225 | 6 | 4 | 11.333 | 8.300 | 0.188 |
| Infocentral | 47.734 | 19 | 13 | 8.053 | 40.136 | 0.159 |
| jamdb | 4.554 | 4 | 5 | 8.500 | 5.643 | 0.239 |
| openrealty | 16.846 | 9 | 8 | 6.333 | 15.451 | 0.083 |
| phpESP | 33.645 | 16 | 15 | 5.438 | 35.258 | 0.048 |
| Phpnews | 6.532 | 6 | 5 | 6.167 | 6.604 | 0.011 |
| phpollster | 1.818 | 4 | 3 | 7.000 | 1.321 | 0.273 |
| PhpScheduleIt | 16.006 | 6 | 9 | 8.667 | 14.922 | 0.068 |
| Phpsera | 15.846 | 6 | 6 | 7.000 | 8.819 | 0.443 |
| Phpwims | 10.507 | 6 | 4 | 9.333 | 6.996 | 0.334 |
| Phpwscookbook | 7.607 | 5 | 5 | 10.200 | 7.992 | 0.051 |
| plume | 15.875 | 11 | 7 | 6.455 | 16.340 | 0.029 |
| Quantum_Star_SE | 57.643 | 30 | 23 | 9.533 | 68.679 | 0.191 |
| Rasmp | 9.452 | 10 | 4 | 6.600 | 10.178 | 0.077 |
| Rimps | 10.81 | 10 | 6 | 6.500 | 13.457 | 0.245 |
| so-net | 38.781 | 20 | 14 | 10.100 | 41.590 | 0.072 |
| Supersurf | 1.489 | 4 | 3 | 7.000 | 1.321 | 0.113 |
| usebb | 6.118 | 6 | 3 | 13.333 | 7.932 | 0.297 |
| videodb | 24.227 | 14 | 10 | 4.643 | 23.898 | 0.014 |
| Webaddressbook | 8.363 | 3 | 2 | 20.000 | 6.884 | 0.177 |
| wikiwig | 23.457 | 15 | 9 | 5.800 | 24.222 | 0.033 |
| yabbse | 22.475 | 11 | 10 | 12.818 | 25.505 | 0.135 |
| zebraz | 8.59 | 6 | 5 | 7.833 | 7.690 | 0.105 |
| ztml | 11.724 | 10 | 3 | 9.600 | 10.462 | 0.108 |

MMRE = 0.15, Pred(0.25) = 0.81

The model was validated using the dataset formed by combining the two validation datasets collected from the industry and open-source Java-based systems (Tables 6 and 8). MMRE and Pred(0.25) computed are 0.18 and 0.78 respectively. These values fall within the acceptable level. Therefore, the evaluation result supports the validity of the LOC estimation model built.

4.2 Summary of the Results
The results of model building and validation from all five pairs of datasets are encouraging. Such results are reckoned as good by other researchers (e.g., ). We also observe that in all the models built, the coefficient of $\bar{A}$ has the lowest magnitude among the three independent variables. This is consistent with intuition. Much of the complexity in information systems is caused by maintaining class instances and navigating between these instances. The number of attributes in the classes does not usually affect the complexity much. Therefore, $C$ and $R$ have much greater influence than $\bar{A}$ on LOC.

4.3 Threats to Validity
Many organizations in the industry did not construct conceptual data models in the early stage of software development for the information systems they developed. Furthermore, to ensure that the data collected from a system are correct, we have to verify the data against the system documentation. However, for reasons of confidentiality, many industry organizations did not allow us to access their system documentation to verify the accuracy of the data they supplied. This makes data collection from industry extremely difficult. As a result, the sizes of our industry datasets only just meet the minimum criterion (15 datapoints). This may affect the accuracy of our validation. To address this issue, in addition to the industry datasets, we have also collected some datasets from open-source systems to build and validate models. The sizes of the samples collected from open-source systems are much better.
Furthermore, the datasets collected from open-source systems also have the benefit of allowing other people to verify their accuracy. However, the conceptual data models for these systems have to be constructed by reengineering their source code and database schemas. Though steps have been taken to ensure that only conceptual entities, and not implementation data, are included in the models, they are usually more accurate than models constructed in the early stage of software development. Therefore, models built from open-source samples also face some threat to validity. Fortunately, models built from open-source and industry samples that were developed in the same programming language are fairly consistent. Furthermore, it is not possible for our samples to cover the full range of possible LOC values well. So, the models that we have built are not appropriate for estimating systems with LOC values that are significantly larger or smaller than the LOC values in the respective samples. The current practice of using MMRE and PRED(0.25) as evaluation criteria is also not perfect.

Table 5. The dataset for building Industry Java-based LOC estimation model

| System No | Actual Size (KLOC) | $C$ | $R$ | $\bar{A}$ |
|---|---|---|---|---|
| 1 | 30.02 | 22 | 6 | 8.182 |
| 2 | 37.288 | 16 | 18 | 6.188 |
| 3 | 60.102 | 23 | 26 | 9.261 |
| 4 | 46.27 | 24 | 18 | 6.583 |
| 5 | 85.009 | 42 | 26 | 5.738 |
| 6 | 12.62 | 6 | 5 | 6.167 |
| 7 | 80.014 | 40 | 25 | 6.675 |
| 8 | 47.658 | 20 | 20 | 5.200 |
| 9 | 20.53 | 10 | 11 | 7.500 |
| 10 | 56.48 | 24 | 25 | 7.792 |
| 11 | 33.602 | 14 | 15 | 7.286 |
| 12 | 63.1 | 20 | 26 | 14.2 |
| 13 | 92.841 | 45 | 26 | 11.622 |
| 14 | 22.08 | 10 | 13 | 6.400 |
| 15 | 14.89 | 7 | 6 | 5.714 |
| 16 | 100.213 | 52 | 27 | 9.481 |

Table 6. The validation dataset for Industry Java-based LOC estimation model

| System No | Actual Size (KLOC) | $C$ | $R$ | $\bar{A}$ | Estimated Size (KLOC) | MRE |
|---|---|---|---|---|---|---|
| 1 | 84.89 | 49 | 22 | 5.898 | 86.978 | 0.025 |
| 2 | 20.446 | 12 | 7 | 4.250 | 17.715 | 0.134 |
| 3 | 17.29 | 8 | 6 | 7.750 | 14.277 | 0.174 |
| 4 | 34.075 | 16 | 12 | 5.125 | 30.059 | 0.118 |
| 5 | 40.113 | 18 | 14 | 6.389 | 36.339 | 0.094 |
| 6 | 28.722 | 17 | 14 | 7.412 | 35.924 | 0.251 |
| 7 | 71.37 | 31 | 27 | 6.871 | 70.281 | 0.015 |
| 8 | 60.34 | 30 | 29 | 10.300 | 74.514 | 0.235 |
| 9 | 20.45 | 11 | 11 | 5.091 | 22.155 | 0.083 |
| 10 | 93.27 | 35 | 36 | 8.486 | 88.299 | 0.053 |
| 11 | 31.69 | 19 | 20 | 4.789 | 43.765 | 0.381 |
| 12 | 77.52 | 30 | 33 | 6.267 | 75.944 | 0.020 |
| 13 | 45.384 | 21 | 24 | 7.714 | 54.029 | 0.190 |
| 14 | 52.1 | 24 | 28 | 6.000 | 61.493 | 0.180 |
| 15 | 19.36 | 7 | 9 | 4.857 | 14.143 | 0.269 |
| 16 | 56.744 | 18 | 26 | 5.944 | 50.992 | 0.101 |

MMRE = 0.15, Pred(0.25) = 0.81

In summary, we do not claim that we have built models that are ready for use. However, at the very least, our work has shown that it is useful and promising to experiment with the proposed method on a much larger scale to address the critical software sizing problem in the industry. Clearly, such an experiment requires organized cooperation between industry and the research community on a long-term basis.

5. COMPARISON OF THE PROPOSED METHOD WITH THE FP METHOD
The Function Point (FP) method is a well-known method for software sizing, including the estimation of LOC. Since the proposed method is based on the conceptual data model, it has a clear advantage over the FP method in that the required information is more readily available in the early stage of software development. Nevertheless, for additional information, we compared the LOC estimation from the proposed method with that from the FP method based on the open-source datasets. Such a comparison is not possible for the industry datasets because their FPs are unavailable. In the comparison, for the FP method, we used the adjusted Function Point (FP): FP = VAF x UFP. The value adjustment factor (VAF) is calculated as follows: VAF = (TDI x 0.01) + 0.65.
The TDI (total degree of influence) is calculated as the sum of the degrees of influence (each rated from 0 to 5) of the 14 general system characteristics. We counted all UFPs and VAFs by reading the system documents and conducting testing. The counting was carried out by an MSc student in his dissertation project.

Table 7. The dataset for building open-source Java-based LOC estimation model

| System | Actual Size (KLOC) | C | R | A |
|---|---|---|---|---|
| Chatterbox | 11.717 | 8 | 6 | 4.250 |
| churchinfo | 47.52 | 23 | 19 | 9.565 |
| cream | 84.01 | 26 | 40 | 11.462 |
| dlog4j | 26.999 | 15 | 14 | 8.933 |
| dynasite | 41.72 | 20 | 15 | 5.900 |
| e-library | 13.015 | 5 | 6 | 12.400 |
| elips | 30.402 | 18 | 7 | 6.611 |
| Forumnuke | 29.159 | 23 | 10 | 6.957 |
| GeneaPro | 53.443 | 28 | 25 | 4.179 |
| Ibatis | 18.694 | 13 | 9 | 6.615 |
| i-tor | 26.384 | 16 | 6 | 5.125 |
| Itracker | 38.721 | 19 | 16 | 6.579 |
| jcv | 75.643 | 26 | 30 | 6.154 |
| jwordnet | 46.72 | 21 | 24 | 6.048 |
| jwp | 6.413 | 7 | 5 | 4.143 |
| kaon | 79.534 | 20 | 37 | 4.850 |
| Kbvt | 36.343 | 18 | 17 | 5.333 |
| malbum | 59.684 | 22 | 31 | 6.182 |
| northstarbbs | 50.454 | 15 | 20 | 11.600 |
| Personalblog | 3.055 | 4 | 1 | 7.000 |
| planeta | 63.257 | 34 | 17 | 3.971 |
| racetrack | 91.28 | 35 | 28 | 13.571 |
| roller | 32.707 | 11 | 17 | 7.545 |
| s2j-0.94 | 11 | 5 | 5 | 3.600 |
| Sacash | 5.543 | 6 | 4 | 3.833 |
| Sixqos | 22.686 | 12 | 11 | 6.667 |
| Storyserver | 3.911 | 3 | 2 | 6.667 |
| tinapos | 20.841 | 14 | 7 | 3.000 |
| Webcockpit | 9.269 | 6 | 5 | 3.500 |
| xc-ast | 7.732 | 7 | 2 | 11.143 |

5.1 PHP Based Open-Source System

For PHP-based systems, since no existing formula is available for estimating LOC directly from Function Points, we built a regression model in the same way as we did for the proposed method (a simple linear regression model in this case, as there is only one independent variable, FP). Therefore, UFPs and VAFs were counted for both datasets in the pair of datasets for PHP-based open-source systems. The FPs and related data for the two datasets are shown in Tables 9 and 10. Using the dataset from which the proposed model was successfully built, we built a linear regression model for estimating LOC from FP in the same manner. We used the resulting model to estimate LOCs for the systems in the validation dataset.
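As a minimal sketch of the single-variable regression described above, the following Python code fits Size = b0 + b1 x FP by ordinary least squares. The use of plain least squares, the function names `fit_simple_regression` and `predict`, and the sample points are assumptions made for illustration; the sample points are hypothetical and are not taken from the study's datasets.

```python
from statistics import mean


def fit_simple_regression(xs, ys):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired observations")
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of squared deviations of x, and sum of cross-deviations of x and y.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def predict(intercept, slope, x):
    """Evaluate the fitted line at x."""
    return intercept + slope * x


# Hypothetical (FP, KLOC) pairs, for illustration only.
fp_counts = [100, 250, 400, 800, 1200]
kloc_sizes = [3.1, 11.0, 20.2, 41.5, 63.9]

b0, b1 = fit_simple_regression(fp_counts, kloc_sizes)
print(f"Size = {b0:.3f} + {b1:.3f} x FP")
print(f"Estimated size at FP = 500: {predict(b0, b1, 500):.1f} KLOC")
```

With real data, `fp_counts` would hold the FP column of the model-building dataset and `kloc_sizes` the corresponding actual sizes.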
The FP model built is as follows:

Size = -2.485 + 0.055 x FP

When the FP model is used to estimate the LOC of the systems in the validation dataset, the computed MMRE and Pred(0.25) are 0.14 and 0.77 respectively. For the LOC estimation model built with the proposed method, the MMRE and Pred(0.25) computed for the same dataset are 0.15 and 0.81 respectively. There is no clear difference between the results of the two methods.

5.2 Java Based Open-Source System

For Java-based systems, a formula for converting FP to an LOC estimate is available: the estimated LOC is the Function Point (FP) count multiplied by 53. Therefore, UFPs and VAFs were counted only for the validation dataset and are shown in Table 11.

Table 8. The validation dataset for open-source Java-based LOC estimation model

| System | Actual Size (KLOC) | C | R | A | Estimated Size (KLOC) | MRE |
|---|---|---|---|---|---|---|
| abaguibuilder | 83.176 | 29 | 23 | 7.345 | 63.137 | 0.241 |
| Art | 17.67 | 14 | 7 | 6.929 | 21.796 | 0.234 |
| Bofhms | 10.507 | 12 | 4 | 4.333 | 13.193 | 0.256 |
| Contineo | 21.067 | 11 | 11 | 7.364 | 24.265 | 0.152 |
| Dspace | 47.57 | 32 | 20 | 4.250 | 60.177 | 0.265 |
| Ejen | 4.437 | 5 | 3 | 4.600 | 3.541 | 0.202 |
| Imcms | 91.424 | 59 | 30 | 4.712 | 107.329 | 0.174 |
| Itracker | 38.721 | 19 | 14 | 6.579 | 37.620 | 0.028 |
| jdbforms | 50.72 | 19 | 15 | 6.789 | 39.212 | 0.227 |
| jmbase | 19.723 | 8 | 8 | 5.250 | 14.811 | 0.249 |
| jwma | 63.228 | 22 | 19 | 8.045 | 49.483 | 0.217 |
| Jbooks | 12.817 | 11 | 7 | 3.727 | 15.869 | 0.238 |
| Jfolder | 29.288 | 10 | 8 | 12.800 | 22.694 | 0.225 |
| Jgossip | 28.047 | 14 | 9 | 5.714 | 23.793 | 0.152 |
| Jpo | 8.54 | 5 | 5 | 5.800 | 7.290 | 0.146 |
| Jwnl | 6.913 | 7 | 5 | 3.571 | 8.074 | 0.168 |
| mp3cattle | 14.598 | 9 | 10 | 3.444 | 17.579 | 0.204 |
| openjms | 59.72 | 17 | 16 | 6.824 | 38.274 | 0.359 |
| Openhre | 9.932 | 9 | 4 | 3.444 | 8.945 | 0.099 |
| Quartz | 22.954 | 11 | 11 | 4.364 | 22.087 | 0.038 |
| sqlunit | 74.089 | 26 | 20 | 7.808 | 55.553 | 0.250 |
| tau_lastest | 48.039 | 14 | 16 | 8.000 | 35.525 | 0.260 |
| testsuite | 30.076 | 15 | 9 | 5.067 | 24.523 | 0.185 |
| Tesuji | 8.01 | 2 | 1 | 16.000 | 5.336 | 0.334 |

MMRE = 0.20, Pred(0.25) = 0.79

When the above conversion formula (FP multiplied by 53) is used to estimate LOC from FP, the computed MMRE and Pred(0.25) are 0.17 and 0.75 respectively.
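The accuracy figures quoted in this section can be illustrated with a few lines of Python. The sketch below assumes the usual definitions: MRE = |actual - estimated| / actual, MMRE as the mean of the MREs, and Pred(0.25) as the fraction of systems whose MRE is at most 0.25. These definitions agree with the MRE column of Table 6 but are not restated in this part of the paper, and the function names are illustrative. The data are the actual and estimated sizes from Table 6; the last lines apply the Java conversion of 53 LOC per Function Point to the abaguibuilder entries of Tables 8 and 11.

```python
def mre(actual, estimated):
    """Magnitude of relative error for one system."""
    return abs(actual - estimated) / actual


def mmre(pairs):
    """Mean MRE over (actual, estimated) pairs."""
    return sum(mre(a, e) for a, e in pairs) / len(pairs)


def pred(pairs, level=0.25):
    """Fraction of systems whose MRE does not exceed the given level."""
    return sum(1 for a, e in pairs if mre(a, e) <= level) / len(pairs)


# (actual KLOC, estimated KLOC) for the 16 systems in Table 6.
TABLE_6 = [
    (84.89, 86.978), (20.446, 17.715), (17.29, 14.277), (34.075, 30.059),
    (40.113, 36.339), (28.722, 35.924), (71.37, 70.281), (60.34, 74.514),
    (20.45, 22.155), (93.27, 88.299), (31.69, 43.765), (77.52, 75.944),
    (45.384, 54.029), (52.1, 61.493), (19.36, 14.143), (56.744, 50.992),
]

JAVA_LOC_PER_FP = 53  # conversion factor quoted in Section 5.2


def java_fp_to_kloc(fp):
    """Estimate size in KLOC from an adjusted Function Point count."""
    return fp * JAVA_LOC_PER_FP / 1000.0


print(f"MMRE = {mmre(TABLE_6):.2f}, Pred(0.25) = {pred(TABLE_6):.2f}")

# abaguibuilder: FP = 1634 (Table 11), actual size 83.176 KLOC (Table 8).
estimate = java_fp_to_kloc(1634)
print(f"FP-based estimate: {estimate:.3f} KLOC, MRE = {mre(83.176, estimate):.3f}")
```

For the Table 6 rows, 13 of the 16 systems have an MRE of at most 0.25, and the mean MRE rounds to the reported 0.15.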
Note that, for the LOC estimation model built with the proposed method, the MMRE and Pred(0.25) computed for the same dataset are 0.20 and 0.79 respectively. Again, there is no clear difference between the results of the two methods.

6. COMPARATIVE DISCUSSION

We have proposed and validated a novel method for estimating LOC for information systems from their conceptual data models. As the proposed method is based on the conceptual data model, the information it requires is clearly more readily available in the early stage of software development than the information required by the well-known FP method. Moreover, our comparison of accuracy shows that the proposed method is comparable with the FP method.

Since the proposed method estimates LOC from the conceptual data model, all the data required by the method is fully available at the end of requirements analysis. Furthermore, the derivation of the required data from the conceptual data model is simple and is not subject to any further judgment, decision or interpretation.

Table 9. The FP data for the model building dataset for PHP based open-source systems

| System | UFP | VAF | FP |
|---|---|---|---|
| Bannerex3a | 126 | 1.01 | 127 |
| Castor | 476 | 1.01 | 481 |
| Cmsmadesimple | 634 | 1.02 | 647 |
| Comendar | 337 | 1.02 | 344 |
| Commerce | 1517 | 1.03 | 1563 |
| Coppermine | 447 | 1.02 | 456 |
| EclipseBB | 1190 | 1.02 | 1214 |
| Jdcms | 99 | 1.01 | 100 |
| Linkbase | 172 | 1.03 | 177 |
| Linpha | 1003 | 1.02 | 1023 |
| Lotgd | 1045 | 1.01 | 1055 |
| Mailwatch | 1183 | 1.01 | 1195 |
| Mantis | 1240 | 1.03 | 1277 |
| Mundimail | 90 | 1.01 | 91 |
| opendocman | 246 | 1.01 | 248 |
| Openrating | 265 | 1.02 | 270 |
| php4Flicks | 119 | 1.01 | 120 |
| Phpalumni | 201 | 1.02 | 205 |
| Phpcollegeex | 183 | 1.01 | 185 |
| Phpman | 249 | 1.03 | 256 |
| phpmylibrary | 549 | 1.03 | 565 |
| phpstudentcenter | 244 | 1.03 | 251 |
| phptimesheet | 140 | 1.02 | 143 |
| Poppawid | 304 | 1.02 | 310 |
| Refbase | 283 | 1.01 | 286 |
| Replex | 103 | 1.01 | 104 |
| Timeclock | 81 | 1.01 | 82 |
| Uccass | 782 | 1.03 | 805 |
| Uma | 120 | 1.01 | 121 |
| Vallheru | 1065 | 1.01 | 1076 |
| WebFileSystem | 210 | 1.01 | 212 |
| Winventory | 212 | 1.01 | 214 |
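The UFP, VAF and FP columns of Table 9 follow the adjusted Function Point calculation given in Section 5 (FP = VAF x UFP, with VAF = TDI x 0.01 + 0.65). A minimal sketch of that arithmetic is shown below. The function names are illustrative, and rounding FP to the nearest integer is an assumption inferred from the table values rather than something stated in the text.

```python
def total_degree_of_influence(ratings):
    """Sum the ratings of the 14 general system characteristics (each 0 to 5)."""
    if len(ratings) != 14:
        raise ValueError("expected exactly 14 ratings")
    if any(not 0 <= r <= 5 for r in ratings):
        raise ValueError("each rating must lie between 0 and 5")
    return sum(ratings)


def value_adjustment_factor(tdi):
    """VAF = (TDI x 0.01) + 0.65."""
    if not 0 <= tdi <= 70:
        raise ValueError("TDI must lie between 0 and 70")
    return tdi * 0.01 + 0.65


def adjusted_fp(ufp, vaf):
    """FP = VAF x UFP, rounded to the nearest integer (assumed rounding rule)."""
    return round(ufp * vaf)


# A few (system, UFP, VAF, FP) rows copied from Table 9.
rows = [
    ("Bannerex3a", 126, 1.01, 127),
    ("Castor", 476, 1.01, 481),
    ("Commerce", 1517, 1.03, 1563),
    ("Mantis", 1240, 1.03, 1277),
]

for name, ufp, vaf, fp in rows:
    print(f"{name}: computed FP = {adjusted_fp(ufp, vaf)}, tabulated FP = {fp}")
```

Under this formula the VAF ranges from 0.65 (TDI = 0) to 1.35 (TDI = 70); a TDI of 36, for example, gives a VAF of 1.01.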
Estimation of software size is a crucial activity in software management. The most commonly used size measures are Line of Code (LOC) and Function Point (FP). Due to the lack of information required by existing sizing methods, many practitioners and project managers continue to produce estimates based on ad hoc or so-called “expert” approaches [2, 21]. Despite the difficulty of deriving the information it requires in the early stage of software development, the Function Point method has achieved worldwide acceptance for estimating the LOC of business systems and for directly predicting the effort and cost of software projects. Some variants of the FP method have also been proposed. In addition to the above-mentioned major problem, although the FP method was originally conceived to be independent of the methodology used to develop the system under measurement, applying it to object-oriented (OO) systems turns out to be rather unnatural. As a consequence of the use of use cases to model software requirements in the OO approach, Use Case Point has been proposed as a method for estimating software effort [25]. However, the information required by this method, such as the complexities of use cases and the details of scenarios, is difficult to predict in the early stage of software development. Recently, the Class Point (CP) method, which generalizes the FP method for OO systems, has been proposed [11]. The CP method is based on information from design documentation [11]. Clearly, much of the information required by the CP method is also available only when the design of the system is complete.

Table 10. The FP data for the validation dataset for PHP based open-source systems

| System | UFP | VAF | FP |
|---|---|---|---|
| ackerTodo | 205 | 1.02 | 209 |
| Bookmark4u | 421 | 1.04 | 438 |
| ByteMonsoon | 129 | 1.01 | 130 |
| Core | 256 | 1.02 | 261 |
| E107 | 1023 | 1.01 | 1033 |
| eFiction | 245 | 1.02 | 250 |
| fdcl | 121 | 1.01 | 122 |
| galant | 182 | 1.01 | 184 |
| Infocentral | 684 | 1.01 | 691 |
| jamdb | 132 | 1.03 | 136 |
| openrealty | 312 | 1.03 | 321 |
| phpESP | 659 | 1.03 | 679 |
| Phpnews | 193 | 1.03 | 199 |
| phpollster | 87 | 1.02 | 89 |
| PhpScheduleIt | 274 | 1.04 | 285 |
| Phpsera | 257 | 1.01 | 260 |
| Phpwims | 218 | 1.02 | 222 |
| Phpwscookbook | 169 | 1.02 | 172 |
| plume | 337 | 1.02 | 344 |
| Quantum_Star_SE | 898 | 1.02 | 916 |
| Rasmp | 224 | 1.01 | 226 |
| Rimps | 262 | 1.01 | 265 |
| so-net | 747 | 1.02 | 762 |
| Supersurf | 75 | 1.03 | 77 |
| usebb | 163 | 1.01 | 165 |
| videodb | 446 | 1.02 | 455 |
| Webaddressbook | 177 | 1.02 | 181 |
| wikiwig | 438 | 1.02 | 447 |
| yabbse | 405 | 1.03 | 417 |
| zebraz | 156 | 1.02 | 159 |
| ztml | 211 | 1.01 | 213 |

As the proposed method is based on the conceptual data model, in contrast to the above-mentioned related methods, the information it requires is more readily available in the early stage of software development. In the worst case, all the required information is fully available when requirements analysis is complete (that is, before design begins). The proposed estimation method shares the use of the class diagram with the CP method. However, the CP method requires much detailed design information about classes, which the proposed method does not require. In terms of application domain, the proposed method is domain specific: it is for information systems. Domain-specific methods have been identified as a key to improving software size estimation. The related methods are for general use. Information systems constitute a large software domain in industry. There are still many problems in using existing methods to estimate the sizes of these systems in industry. The proposed method estimates LOC from information that is more readily available in the early stage of software development than the information required by existing methods.
We believe the proposed method is promising in providing a key to addressing this crucial problem in the software industry. Based on the experience with the FP method, there is no doubt that much work is still required to fine-tune the proposed method before it can be put into practical use. It is also highly probable that project effort and cost can be estimated directly from the three independent variables used in the proposed method (C, R and A), in a similar way to the FP method. This is an important area for further research.

Table 11. The FP data for the validation dataset for Java based open-source systems

| System | UFP | VAF | FP |
|---|---|---|---|
| abaguibuilder | 1602 | 1.02 | 1634 |
| Art | 426 | 1.02 | 435 |
| Bofhms | 223 | 1.02 | 227 |
| Contineo | 337 | 1.05 | 354 |
| Dspace | 923 | 1.03 | 951 |
| Ejen | 57 | 1.06 | 60 |
| Imcms | 1767 | 1.04 | 1838 |
| Itracker | 554 | 1.06 | 587 |
| jdbforms | 794 | 1.02 | 810 |
| jmbase | 298 | 1.04 | 310 |
| jwma | 1249 | 1.02 | 1274 |
| Jbooks | 174 | 1.02 | 177 |
| Jfolder | 388 | 1.04 | 404 |
| Jgossip | 367 | 1.03 | 378 |
| Jpo | 131 | 1.06 | 139 |
| Jwnl | 142 | 1.03 | 146 |
| mp3cattle | 277 | 1.05 | 291 |
| openjms | 854 | 1.05 | 897 |
| Openhre | 201 | 1.04 | 209 |
| Quartz | 291 | 1.04 | 303 |
| sqlunit | 1583 | 1.03 | 1630 |
| tau_lastest | 682 | 1.05 | 716 |
| testsuite | 482 | 1.03 | 496 |
| Tesuji | 122 | 1.02 | 124 |

7. ACKNOWLEDGEMENT

We would like to thank Singapore Computer Systems Pte Ltd, IPACS E-Solution (S) Pte Ltd, NatSteel Ltd, Great Eastern Life Assurance Co. Limited, JTC Corporation, Adroit Pte Ltd, Institute of Technical Education of Singapore, National Institute of Education and National Computer Systems Pte Ltd for providing their project data. Without their support, this work would not have been possible. We would also like to thank our MSc student, Di Wang, and our final year BEng students, Han Chong Tan, Pua Chyuan Koh, Teck Leong Ng, Chye Yang Tan, Chiang Fong Lee and Yeow Meng Ng, for collecting the data from the open-source systems.

8. REFERENCES

[1] Albrecht, A. J., and Gaffney, J. E. Jr. Software function, source lines of code, and development effort prediction: a software science validation. IEEE Trans. Software Eng., vol. SE-9, no. 6, Nov. 1983, 639-648.
[2] Armour, P. Ten unmyths of project estimation: reconsidering some commonly accepted project management practices. Comm. ACM 45, 11 (Nov. 2002), 15-18.
[3] Belsley, D. A., Kuh, E., and Welsch, R. E. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. John Wiley, New York, 2004.
[4] Blaha, M., and Premerlani, W. Object-Oriented Modeling and Design for Database Applications. Prentice Hall, 1998.
[5] Boehm, B. W., and Fairley, R. E. Software estimation perspectives. IEEE Software, Nov./Dec. 2000, 22-26.
[6] Boehm, B. W. et al. Software Cost Estimation with COCOMO II. Prentice Hall, 2000.
[7] Burgess, R. S. Structured Program Design Using JSP. ELBS, 1988.
[8] Canfora, G., Cerulo, L., and Troiano, L. An experience of fuzzy linear regression applied to effort estimation. In Proc. 16th International Conference on Software Engineering & Knowledge Engineering, 2004, 57-61.
[9] Chen, P. P. The entity-relationship model - towards a unified view of data. ACM Trans. Database Syst. 1, 1 (Mar. 1976), 9-36.
[10] COSMIC-Full Functions – Release 2.0. September 1999.
[11] Costagliola, G., Ferrucci, F., Tortora, G., and Vitiello, G. Class point: an approach for the size estimation of object-oriented systems. IEEE Trans. Software Eng., 31, 1 (Jan. 2005), 52-74.
[12] Freshmeat. http://freshmeat.net.
[13] Garmus, D., and Herron, D. Function Point Analysis: Measurement Practices for Successful Software Projects. Addison Wesley, 2000.
[14] Ghezzi, C., Jazayeri, M., and Mandrioli, D. Fundamentals of Software Engineering. 2nd Edition, Prentice, 2003.
[15] Jeffery, D. R., Low, G. C., and Barnes, M. A comparison of function point counting techniques. IEEE Trans. Software Eng., May 1993, 529-532.
[16] Jeffery, D. R., and Walkerden, F. An empirical study of analogy-based software effort estimation. Empirical Software Engineering, Kluwer Academic Publishers, 4, 2 (June 1999), 135-158.
[17] Kennedy, P. A Guide to Econometrics. Blackwell Publishing, 5th Edition, 2003.
[18] Lai, R., and Huang, S. J. A model for estimating the size of a formal communication protocol application and its implementation. IEEE Trans. Software Eng., Jan. 2003, 46-62.
[19] Laranjeira, L. A. Software size estimation of object-oriented systems. IEEE Trans. Software Eng., May 1990, 510-522.
[20] McClave, J. T., and Sincich, T. Statistics. 9th Ed., Prentice Hall, 2003.
[21] Miranda, E. An evaluation of the paired comparisons method for software sizing. In Proc. Int. Conf. on Software Eng., 2000, 597-604.
[22] Neter, J., Kutner, M. H., Nachtsheim, C. J., and Wasserman, W. Applied Linear Regression Models. IRWIN, 1996.
[23] Ruthe, M., Jeffery, R., and Wieczorek, I. Cost estimation for web applications. In Proc. Int. Conf. on Software Eng., 2003, 285-294.
[24] SAS/STAT User’s Guide. http://www.id.unizh.ch/software/unix/statmath/sas/sasdoc/stat/.
[25] Smith, J. The estimation of effort based on use cases. Rational Software White Paper, 1999.
[26] SourceForge.net. http://sourceforge.net/.
[27] Stensrud, E., Foss, T., Kitchenham, B., and Myrtveit, I. An empirical validation of the relationship between the magnitude of relative error and project size. In Proc. IEEE Symp. Software Metrics, 2002, 3-12.
[28] Tan, H. B. K., and Zhao, Y. ER-based software sizing for data-intensive systems. In Proc. Int. Conf. on Conceptual Modeling, 2004, 180-190.
[29] Teorey, T. J., Yang, D., and Fry, J. P. A logical design methodology for relational databases using the extended entity-relationship model. ACM Computing Surveys, June 1986, 197-222.