During the software lifecycle, the maintenance of the software is usually the biggest overall expense, totaling even up to 90 percent of all life cycle costs  . Knowing this, it is rather surprising, that the software development processes do not focus more on the maintenance phase. Instead development processes focus to enhance and offer product quality and quality-in-use improvements within the development and quality assurance steps. For example, the Scrum software process model which is favored in many SME organizations  , does not take into account any activities which happen before or after the active sprints, even though majority of the software related costs are not realized within this period. This issue is glaring for example in the game software development, where the current business models such as live-ops or any other free-2-play model  , mean that basically all profits are generated during the maintenance period, not at the commitment to develop software or after delivery.
Some activity models, such as continuous delivery (CD)  or DevOps  promote more thorough integration of maintenance activities into the development activities, but the runtime monitoring and control of the quality characteristics supporting maintenance are not included. Actions such as the delivery of hotfixes, patches or customer-tailored features are part of the continuous release cycle, where development and maintenance are concurrent activities, with the development phase being one iteration ahead of the maintenance phase. However, the general infrastructure in this area is not very systematically studied. In more abstract terms, technical evaluation of the software quality is not very straightforward, since the maintenance issues and quality assurance needs are usually related to the preferred quality: Usually quality assurance during maintenance assesses, if the software system delivers the expected features or services, and achieves the necessary quality requirements. However, there are several different types of quality involved  , and if we consider quality models such as defined by the ISO/IEC 25000-family  , there are tens of different measurements and methods to assess the quality and quality-in-use from different perspectives.
Many existing software measurement frameworks are influenced by the ISO/IEC quality models. For example, the software maintainability measurements developed by Motogna et al.  , the performance measurement framework for cloud computing by Bautista et al.  , or the framework for evaluating the effect of coding practices to software maintainability by Hegedus  . However, previous research has been limited to cover only parts of quality models, concentrating around specific quality characteristics such as maintainability or performance efficiency. There is a need for further work with a general measurement framework, which aims to incorporate the characteristics of a software quality model to a software system during run time.
The aim of this research is to study the different methods of reducing the costs of the maintenance by directly lowering the amount of work required for the maintenance by predicting and identifying the changes in the quality characteristics. Changes in the quality characteristics serve as an early warning system of the problematic components and software failures. More specifically, we concentrate on developing a library of software measurement probes using the ISO/IEC 25000 standard series as a starting point. From our prior study  , applying the ISO/IEC 25000 standard, we understood that the quality model is understandable enough to warrant application in the industry. The actual research questions are: “What kind of technical infrastructure would enable identification of on-line quality characteristics and thereby maintenance issues?” and “How to incorporate a software quality model into a library of run-time metrics?”
To answer these research questions, our approach was to define a framework and implement the framework in a system to collect and monitor run-time data from an open-source application. In addition, the collected data is visualized with a separate analysis tool to monitor trends and changes between the different versions of the system and to assess, for example, resource usage for the customer environments.
In summary, this paper presents a framework for run-time software measurement. The framework consists of two different types of metrics: direct metrics, which can be recorded from a system at run time by incorporating measurement probes into the software during development; and indirect metrics, which need to be derived from the direct metrics and the knowledge of the software engineer or software specification. The framework aims to be general to warrant use in different applications but at the same time loose enough to allow developers to derive application-specific measurement.
Rest of the paper is structured as follows: In Section 2, related work is introduced; Section 3 presents the research methodology; The main contribution of this paper, the measurement framework and our proof-of-concept project are introduced in Section 4; Discussion and conclusions are given in Sections 5 and 6 respectively.
2. Related Research
Measuring a software system with different kinds of tools and metrics is not a novel idea. There are several different kinds of definitions for the use of metrics (for example Honglei et al.  ) and different quality models for selecting what to test (for example Herzig et al.  ), or how to select test cases (For example Fontana & Zanoni  , Schrettner et al.  , Kasurinen et al.  ). These all are serious concerns, since the test cases and in general ensuring the system testability can cause one third or even half of the workload in software development life-cycle. This is mostly due to the need to capture not only the normal usage but also extraordinary uses of the system  .
The very definition of what product quality or quality-in-use actually means is also a concern. There are several definitions such as value-based or manufacturing-based quality  , depending on the viewpoint or the relevant stakeholders. The users have very different views on what is software or product quality, when compared to some other quality definition, like the production quality. The customers may not care at all on how low percentage of the products are faulty or how high-quality the building components are, if the product is badly designed, overpriced against its competing products or simply feels cheap or low-grade product.
As stated, the assessment of quality relies on several measurable metrics and models, which capture the different definitions of quality. Especially for software products and their quality, there are some different models such as CISQ  or ISO/IEC 9126  , which share a number of common features. For example, in both these models the problematic aspect of defining what product quality actually is, is solved by defining a number of characteristics such as reliability or security. These characteristics in combination assess if not all, then at least most of the different aspects of software quality, and they can be selected on case-by-case basis depending on what aspects are relevant.
The ISO/IEC 9126 model is probably the most applied standardized model but it is not the most current or extensive standard in existence. The ISO/IEC 25000 Software Quality Requirements and Evaluation (SQUARE) model  in its core is the upgraded version of the ISO/IEC 9126 model, with the added definitions for quality in software product, and a separate model for software quality-in-use. Overall, the objective of the ISO/IEC 25000 standard family is to clarify the requirements, which should be identified to assess the software quality and ensure the success on the evaluation and reaching of the set quality objectives. Overall, the models cover 5 characteristics with 11 sub-characteristics for the quality-in-use, and 8 characteristics with 31 sub-characteristics for software product quality. These models and their characteristics are summarized in Figure 1 and Figure 2.
Figure 1. The ISO/IEC 25010 software product quality model  , characteristics on left, and the subcharacteristics on the right.
Figure 2. ISO/IEC 25010 software quality-in-use model  .
The characteristics of the product quality model focus on the technical aspects of the software, although there are also defined sub-characteristics for more human-centric aspects such as learnability. For all of the defined sub-characteristics, the ISO/IEC 25000 standard series also defines a number of measurement techniques and metrics, which can be used to assess the quality of that particular characteristic. For example, with the testability sub-characteristics there are defined measurements for test function completeness, autonomous testability and test restartability. Similarly, confidentiality is measured with access controllability, data encryption correctness and with the strength of the cryptographic algorithms. Additionally, all of the measurements are formatted to a model, in which the result provides a clear indicator, usually percentage, of positive outcomes versus negative outcomes. Technically, this could enable the software systems to be comparable against each other, and more importantly, allow formal measurement of every different characteristic and their sub-characteristic. Similar approach is also applied in the “Quality in use”-model, which focuses on the clients and customers.
The quality-in-use-model focuses on the client side usage and on the delivered user experience. The model follows the same principle as the product quality model, dividing the model into a set of characteristics and their sub-characteristics. Unlike the product quality model, several sub-characteristics such as trust or pleasure use measurements based on the psychometric scales which are defined by a questionnaire. However, each main characteristic have at least one aspect, which can be measured through the use of software.
These models have been studied also in the other research works. For example Motogna et al.  have been studying the maintainability-characteristic of the ISO/IEC 25010 model, since in the software life cycle model maintenance has significant effect on the software costs. Their study investigates the maintenance sub-characteristics in detail, and proposes a set of metrics, which could be applied in the assessment of the software maintainability, and provide evidence that the model is a feasible starting point for a quality assessment system. In more general studies, for example Goues and Weimer  have observed that the amount of needed test cases in the maintenance can be reduced almost by 55 percent, if the system is designed to include formalized method of collecting quality assurance-related metrics. A similar approach was used in a research project documented by Black  , where a set of explicit data sources was designed to ensure that the quality assurance criterions were met in each incremental development cycle, since there were no realistic resources to do complete regression test cycle with each test case of the software during each software development cycle.
In more practical terms, Lincke et al.  have studied the different quality models and their applicability in real-life software development projects. Their study suggests, that while the models are able to implicate the quality of the software measured to some degree, the different models provide different results and the models in general are not comparable nor compatible. The same project could yield completely opposite results between two different quality models, if the selected models and applied metrics are not carefully designed and meaningful. Similar observations have been made also by Darcy and Kemerer  , who discuss the generally applicable measurements and notify that there are only handful of universal metrics. Their studies indicate that for example in the object-oriented programming languages, concepts such amount of cohesion and coupling between the objects are the most useful metrics to assess the product quality and maintenance.
Rompaey et al.  also state that one aspect of quality, code quality, especially the concept of code smell could be transferred to the quality assurance of unit testing. Their definition of the SSVT-test cycle (set-up, stimulate, verify, tear down) could be useful in the assessment of system maintainability, test automation coverage and additional aspects such as explicitness of the system and traceability of the encountered malfunctions.
3. Research Process
During the study we constructed a framework for quality measurement and monitoring. The measurement and monitoring system aimed specifically for software maintenance using a multi-discipline approach. First, we conducted a literature review that covered, for example, software maintenance, quality assurance and software measurement methods. The review was used to identify existing solutions and proposed methods to tackle the issues raised by our research questions. In short, software quality related to the maintainability of a system is often evaluated by analyzing code quality or complexity and run-time approaches are used less often.
In addition to the literature review, we conducted a survey on the applied testing and quality assurance practices in the industry. One of the key observations was that the use of standards and formal process models seems to have declined during the last eight years across different domains in our sample of software organizations  . This observation affected our approach towards an on-line measurement framework applying an international standard.
In order to realize the framework we followed the process described in the ISO 15939 (Systems and software engineering―Measurement process)  . Our framework covers the first two activities of the ISO 15939 model: establishing measurement commitment and planning the measurement process. The other activities recommended by the ISO 15939 model, performing measurements and evaluating the measurements, are realized as a small-scale proof-of-concept system. In the proof-of-concept, we implemented the metrics as a measurement library in an open-source application. Figure 3 depicts the measurement framework and proof-of-concept implementations.
For each of the different sub-characteristics the ISO/IEC 25000 standard defines a set metrics or measurements, which can be applied in the assessment. For example, in the performance efficiency characteristic the sub-characteristic time-behavior is defined as follows: “The degree to which the response and processing times and throughput rates of a product or system, when performing its functions, meet requirements”. To assess quality of this characteristic, the system has to be able to measure and record the response and processing times. Another example could be the compatibility-interoperability characteristic, which is defined as “degree to which two or more systems, products or components can exchange information and use the information that has been exchanged”. This characteristic demands a measurement or metric to assess object interface similarities, usage of data storages and the amount of errors caused by the faulty simultaneous operations. This approach was used to establish measurements for sub-characteristics of the ISO/IEC 25000 model. Measurements were either direct measurements such as with the performance efficiency, or indicative measurements, which were used to collect information related to the characteristic.
4. Framework for Collecting and Monitoring Quality Characteristics
The concept system and the proposed testing and maintenance framework is
Figure 3. The measurement framework and proof-of-concept system (Modified from ISO/IEC 15939 Systems and software engineering―Measurement process model  ).
based on two separate components, which complement each other: the metrics library, developed as a proof-of-concept IDE plugin for NetBeans  , and the analyzer front-end, which visualizes the collected metrics.
The IDE plugin includes a shorthand and the code generators for the different types of measurement functions included in the library. The measurement functions collect data into a log file with session-relevant information, which enables the analysis tool to calculate the results and maintenance information. The objective of the IDE plugin is to offer practical tools for testers and software developers to measure and collect relevant data to satisfy their needs to verify or to validate their product, or to assess the feasibility and stability of their latest release.
The metrics library consists of different testing methods. These methods are collected from previous experiences and research work with the software industry, from different models, for example, Swebok  , Test Process Improvement (TPI) model  and Test Maturity Model integration (TMMi) model  . The objective of the library is to offer a wide list of different testing techniques and tools, and recommend at least one feasible approach for evaluating any ISO/IEC 25000 family model characteristic.
The target IDE and the used programming language Java were both selected to represent a well-known, platform-independent development environment. The developed measurements were then incorporated to a test project, which in our case was an experimental version of the Violet UML editor  . The experimental version had all of the measurements implemented, so that the system would provide real session data for the analysis tool to calculate.
Table 1 presents the measurements using the quality characteristics collection and monitoring framework. The measurements are categorized as either direct or indirect: Direct measures consist of runtime events which are used to calculate descriptive statistics; Indirect measures are derived from the direct measures, and their implementation requires additional expert information from the developer or the designer. For example, Maintainability is an indirect measure based on both runtime and static analysis, whereas Mean time between failures is a direct measurement.
The analysis tool gives longitudinal observations for the product maintenance and the reveals production issues. The tool is used to analyze the existing log-files, assess quality characteristics and provide a visualization snapshot of the current state of the system along with a view into the changes of key values between the software versions. The objective of this quality characteristics collection and monitoring framework system is to provide robust and transferable metrics, which can be used to assess the “wellbeing” of the system, and provide systematic and relevant information from the state of the environment or success rate of the system revisions against the set targets. Especially for the maintenance, one long-term objective would be the observation of system performance or feature utilization.
Table 1. Ways to measure the different quality characteristics in the proof-of-concept environment.
To evaluate the utility of the proposed framework, we developed use case scenarios to test the proof-of-concept system where the metrics library based on the framework was integrated to the Violet UML editor. In the scenarios we wanted to present simple maintenance metrics collected over time which would be beneficial for a developer monitoring a software system in use.
In the first scenario the proof-of-concept system is being used by multiple clients, with varying hardware and possibly different operating systems. The performance metric we decided to visualize was mean system startup time for each client. Figure 4 presents the data from our scenario with six different clients. In this example, the developer would be able to see if a patch or update causes system startup times to rise for all clients, and have an early warning for when to adjust loaded resources at startup. Alternatively, if a client files a bug report about slow system performance, the developer will be able to categorize if the problem appears locally for a single client only.
In another scenario, the metric we implemented was the usage of a new feature in the program. When software is in the maintenance phase old functionality seldom changes, but new features may be added. In our scenario, a new feature has been added and deployed. In the scenario there is only one client but the software could just as well be deployed as a public web service or the metric could be the sum of all clients. The software developer wants to monitor how much the new feature is being utilized since it has been launched into production. In this example, the two features being compared allow the end user to access the same functionality and have the same outcome, but through a different path of navigation in the user interface. Figure 5 illustrates the comparison between the usages of the selected features, where feature utilization is plotted by day. As observable from the graph, users in this scenario have started to favor the newly deployed feature over the old one to accomplish their task.
5. Discussion and Conclusion
The objective was to integrate a software quality measurement framework into source code as a library of measurement tools. To bridge the gap between
Figure 4. Example, a time-performance metric collected from six different clients in a test scenario.
Figure 5. A feature utilization metric collected from clients in a test scenario.
established quality evaluation models and their use in practice, we used the ISO 25,000 software quality and quality in-use models as the starting point. In order to realize our goals we implemented a framework for software measurement and a proof-of-concept prototype using an open-source project, and evaluated the work using descriptive scenarios for software in the maintenance phase of its life cycle. The framework offers a step towards integrating software development and run-time quality evaluation.
According to the quality characteristics collection and monitoring framework, the ISO/IEC 25,000 quality characteristics which can be represented or measured from technical aspects of the system can be covered by the framework. The framework offers the following novel benefits:
・ Development of a systematic interface for the measurement components.
・ Framework that is systematic and intuitive enough to warrant ease of use without extensive training.
・ Analysis tools.
Software quality related to the maintainability of a system, is often evaluated by analyzing the quality or complexity of the source code. Cyclomatic complexity  , Halstead complexity measures  , and C&K metrics  are established ways to measure code complexity. The complexity metrics are calculated directly from source code, and analysis tools often employ them. For example, Microsoft’s Visual Studio includes a maintainability index indicator, which is based on both Halstead metrics and cyclomatic complexity  . In the academic work, RTtool is a software suite used by researchers to analyze the relative thresholds for the metrics of code quality in a software project  . Unfortunately, at the moment existing code complexity metrics are poorly used in the industry  .
Model based approaches or machine learning have been identified as solutions of evaluating software and predicting defects  . Runtime metrics have been proposed as one method of quality evaluation  , and they have been applied by, for example Hegedus, whose model used run-time measures together with static measures to measure testability and analyzability by using fault proneness metrics  . However, in general run-time metrics are rarely used in software quality and maintenance evaluation.
The limitations and validity of the presented framework warrant some discussion. First, we must acknowledge that the analysis of metrics depends on the software they are used with. Not all quality characteristics are interesting in all software applications. The analysis is affected by the application context, and therefore the normalization of metrics varies case by case.
This work begun by using the ISO/IEC 25,000 software quality and software quality-in-use models as the starting point. In the presented framework, we have covered examples of quality characteristics of the models. The limitations are related to quality-in-use characteristics that have an inherent subjective nature. For example, it is difficult to quantify user trust, pleasure or comfort through source code, but indirect run-time measurements may give useful information. Quality characteristics like freedom from risk or security can only partly be covered.
Additionally, the utilization of the framework requires effort from the developer. Probes must be fitted directly to source code, as the framework is intended to be used considering the domain knowledge. In our proof-of-concept library we have tried to minimize the required manual programming work required by exposing ready-to-use API’s to the developer.
The objective of this paper was to study how the amount of maintenance effort, and thereby, cost could be reduced using a quality characteristics collection and monitoring framework. The paper presents the implementation of a framework for software measures and a proof-of-concept prototype using an open-source project. The framework provides a systematic interface, which can be used to collect runtime metrics and measure software quality-in-use.
The measurement framework and proof-of-concept project were evaluated by using descriptive scenarios for software in the maintenance phase of its life cycle. The measurement framework was implemented as a metrics library, and measurements were linked into the software as probes during development. This work maps the run-time software metrics to quality characteristics.
In future work, we are going to investigate approaches to source code modelling and defect prediction methods to automate the measurement process. In addition, the methods presented to assess the quality of the system during maintenance could also be thematically expanded to cover the software lifecycle phases of design and implementation.
This work was funded by the Technology Development center of Finland (TEKES), as part of the. Maintain project (project number 1204/31/2016).
 Kasurinen, J., Maglyas, A. and Smolander, K. (2014) Is Requirements Engineering Useless in Game Development? In: Salinesi, C. and van de Weerd, I., Eds., Requirements Engineering: Foundation for Software Quality, REFSQ 2014, Lecture Notes in Computer Science, Vol. 8396, Springer, Cham, 1-16.
 Motogna, S., Vescan, A., Serban, C. and Tirban, P. (2016) An Approach to Assess Maintainability Change. 2016 IEEE International Conference on Automation, Quality and Testing, Robotics (AQTR), Cluj-Napoca, 19-21 May 2016, 1-6.
 Bautista, L., Abran, A. and April, A. (2012) Design of a Performance Measurement Framework for Cloud Computing. Journal of Software Engineering and Applications, 5, 69-75.
 Hegedus, P. (2013) Revealing the Effect of Coding Practices on Software Maintainability. 2013 29th IEEE International Conference on Software Maintenance (ICSM), Eindhoven, 22-28 September 2013, 578-581.
 Kasurinen, J., Taipale, O., Vanhanen, J. and Smolander, K. (2012) Exploring the Perceived End-Product Quality in Software-Developing Organizations. International Journal of Information System Modeling and Design, 3, 1-32.
 Honglei, T., Wei, S. and Yanan, Z. (2009) The Research on Software Metrics and Software Complexity Metrics. International Forum on Computer Science-Technology and Applications, Chongqing, 25-27 December 2009, Vol. 1, 131-136. https://doi.org/10.1109/IFCSTA.2009.39
 Herzig, K., Greiler, M., Czerwonka, J. and Murphy, B. (2015) The Art of Testing Less without Sacrificing Quality. Proceedings of the 37th International Conference on Software Engineering, Vol. 1, Florence, 16-24 May 2015, 483-493.
 Fontana, F.A. and Zanoni, M. (2011) On Investigating Code Smells Correlations. IEEE 4th International Conference on Software Testing, Verification and Validation Workshops, Berlin, 21-25 March 2011, 474-475.
 Schrettner, L., Fülöp, L.J., Beszédes, á., Kiss, á. and Gyimóthy, T. (2012) Software Quality Model and Framework with Applications in Industrial Context. 16th European Conference on Software Maintenance and Reengineering, Szeged, 27-30 March 2012, 453-456.
 Kasurinen, J., Taipale, O. and Smolander, K. (2010) Test Case Selection and Prioritization: Risk-Based or Design-Based? Proceedings of the 2010 ACM-IEEE International Symposium on Empirical Software Engineering and Measurement, Bolzano-Bozen, 16-17 September 2010, Article No. 10.
 Rompaey, B.V., Bois, B.D., Demeyer, S. and Rieger, M. (2007) On The Detection of Test Smells: A Metrics-Based Approach for General Fixture and Eager Test. IEEE Transactions on Software Engineering, 33, 800-817.
 Black, P.E. (2006) Software Assurance during Maintenance. 22nd IEEE International Conference on Software Maintenance, Philadelphia, 24-27 September 2006, 70-72.
 Hynninen, T., Kasurinen, J., Knutas, A. and Taipale, O. (2017) Testing Practices in the Finnish Software Industry. IEEE Conference on Software Engineering Education and Training, Savannah, 7-9 November 2017.
 Oliveira, P., Lima, F.P., Valente, M.T. and Serebrenik, A. (2014) RTTool: A Tool for Extracting Relative Thresholds for Source Code Metrics. IEEE International Conference on Software Maintenance and Evolution, Victoria, 29 September-3 October 2014, 629-632.
 Antinyan, V., Staron, M. and Sandberg, A. (2017) Evaluating Code Complexity Triggers, Use of Complexity Measures and the Influence of Code Complexity on Maintenance Time. Empirical Software Engineering, 22, 3057-3087.