"Everything should be as simple as it is , but not simpler."
-- Albert Einstein
INTRODUCTION
This article explores the application of two metrics frameworks - software cost of quality (CoQ) and software defect containment - to model and manage the cost and delivered-quality consequences of alternative strategies for software defect detection and correction during the software development lifecycle. Both leading and lagging indicators are considered.
ORIGINS AND EVOLUTION OF COST OF QUALITY IDEAS
Cost of (Poor) Quality Frameworks
In a 1956 Harvard Business Review article titled "Total Quality Control," Armand Feigenbaum introduced an approach to categorizing quality-related costs that transcended traditional cost accounting practices. This framework (as it relates to software) is summarized in Figure 1.
James Harrington's 1987 book "Poor-Quality Cost" introduced a refinement of Feigenbaum's approach intended to emphasize that investment in detection and prevention leads to lower total cost.

Lean and Agile Methods
Lean methods and tools focus on identifying and reducing non-value-added elements of CoQ. Lean ideas originated in the Toyota Production System in a manufacturing context and have since been adapted to many other areas of business activity, including services, information technology (IT), and software development. Key practices of lean may be mapped to software development, as shown in Figure 2.
FIGURE 2
10 basic practices of lean, with software "translations" (linked to agile ideas)
- Eliminate waste: for example, diagrams and models that do not add value to the final deliverable ("tacit knowledge")
- Minimize inventory: for example, intermediate artifacts such as requirements and design documents ("iteration backlog")
- Maximize flow: for example, use iterative development to reduce overall cycle time ("short iterations")
- Pull from demand: for example, accommodate changing or emerging requirements ("iteration backlog")
- Empower workers: for example, allow process to "emerge" (self-organizing teams, trust)
- Meet customer requirements: for example, close collaboration; flexible response to change (the "product owner," short iterations)
- Do it right the first time: for example, test early and "refactor" (redesign) when necessary ("continuous attention to technical excellence")
- Abolish local optimization: for example, flexible scope (short iterations)
- Partner with suppliers: for example, avoid adversarial relationships (the "product owner")
- Create a continuous improvement culture: for example, build in time to reflect and improve ("regular reflection")
Value Stream Mapping and Elimination of Waste
In his work with the Toyota Production System, Shingo (1981) identified seven wastes commonly found in manufacturing processes. These wastes have parallels in software development, as described in Figure 3 (based loosely on Poppendieck (2007)).
These wastes may be discovered in any process by applying one of the most popular lean tools: value stream mapping (VSM). As illustrated by this highly simplified example, the VSM method (of which there are many variants) annotates various facts and categorizes the steps in a process map to differentiate those that are "value-added" (VA) from those that are "non-value-added" (NVA), that is, steps that are among the seven wastes (see Figure 4).
These terms are used from a customer's perspective, that is, VA activities are those that a customer would choose to pay for, given a choice. Certain activities, such as testing, are certainly necessary but are not value added from a customer's perspective. Given a choice, the customer will always prefer that companies produce a perfect product the first time. Certain other activities (for example, regulatory compliance), while not truly value added, may nonetheless be required. As applied to software development, all appraisal (defect-finding) and rework (defect-fixing) activities are, by definition, NVA; this is not to say they are unnecessary, merely that they should be minimized.
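To make the VA/NVA tally concrete, here is a minimal sketch in Python of summing VA effort over a mapped process; the step names, hours, and categories are hypothetical, not drawn from the article's example.

```python
# Hypothetical value stream: (step name, person-hours, "VA" or "NVA").
steps = [
    ("Elicit requirements",   40, "VA"),
    ("Design solution",       60, "VA"),
    ("Write code",           120, "VA"),
    ("Test (appraisal)",      50, "NVA"),  # necessary, but not value-added
    ("Fix defects (rework)",  30, "NVA"),
]

total = sum(hours for _, hours, _ in steps)
va = sum(hours for _, hours, cat in steps if cat == "VA")
print(f"VA: {va / total:.0%} of total effort; NVA: {(total - va) / total:.0%}")
```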


Software Cost of Quality Benchmarks
Harry and Schroeder (2000) estimate that CoQ for an average company is approximately 15 to 25 percent of sales. Based on available industry benchmarks, the author believes average NVA costs for software organizations are in the area of 60 percent of total effort; while not a direct comparison, it is clear that CoQ in software is higher than in nearly any other area of business activity.
In software, NVA has three principal components. As illustrated in Figure 5, canceled projects in the MIS domain account for around 20 percent of total software effort.

As illustrated in Figure 6, the second and third components are appraisal and rework activities (shaded items labeled "B") and effort devoted to "descoped" features ("C") that were partially completed and then dropped prior to delivery, typically due to schedule or cost overruns. The author is unable to find solid benchmark data for this element, but from experience something around 10 percent seems reasonable. As indicated, A+B+C equals approximately 60 percent of total effort, which implies appraisal and rework (B) alone account for roughly 30 percent. (Note: This includes pre-release rework only; when post-release rework is added, the picture is even worse.) Certainly some organizations do better than these averages suggest, but a significant number are even worse. As illustrated next, best-in-class groups using a sound combination of best practices will reduce NVA by two-thirds or more.
Many groups are skeptical that NVA is so high, but when actually measured, it is invariably much higher than believed beforehand.

Effort Accounting in a Software Cost of Quality Framework
Adapting these ideas to software development provides a robust approach to objectively answering several age-old questions retrospectively: "How are we doing?" "Are we getting better?" Clearly things are getting better if the VA percent of total effort is improving relative to a baseline. A software CoQ framework provides a comparatively simple approach to effort accounting that avoids many of the traps and complications that often defeat efforts to measure productivity. To apply this framework, one requires less effort accounting detail than is typically collected, yet one gains more understanding and insight from what is collected. Using the CoQ framework moves one to the most useful level of abstraction - to see the forest, not just the trees.
Applying this framework does not require time accounting to the task level; rather, one must collect data in a simplified set of categories for each project, as follows (a minimal record sketch in code appears after the list):
- Project ID
- Phase or iteration (note these include "post-release")
- VA effort within each phase or iteration, that is, all effort devoted to the creation of information and/or artifacts deemed necessary and appropriate (for example, requirements elicitation, architecture definition, solution design, construction)
- NVA effort within each phase or iteration:
- All appraisal effort (anything done to find defects) per appraisal method (in many instances only one form of appraisal, such as inspections or testing, will be used in a single phase; if more than one method is used during the same phase, the effort associated with each should be separately recorded).
- All rework effort (anything done to fix defects).
- All effort previously categorized as VA that is related to features or functions promised but not delivered (that is, "descoped" features) in the current project or iteration.
- Prevention effort, that is, effort devoted to activities primarily intended to prevent defects and waste in any form. Many of these activities will be projects in their own right, separate from development projects per se, for example, training, SEPG activities, and process improvement initiatives generally. Some prevention activities, such as project-specific training, may occur within a development project. Certain activities, including formal inspections such as those of IEEE Std 1028-2008, may serve a dual purpose of both appraisal and prevention. When in doubt, a conservative approach suggests categorizing these as appraisal.
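As a sketch of how this simplified effort accounting might be represented, the following Python fragment defines one record per project, phase, and category and computes the VA percentage; the field and category names are hypothetical, not mandated by the framework.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EffortRecord:
    project_id: str
    phase: str              # phase or iteration, including "post-release"
    category: str           # "VA", "appraisal", "rework", "descoped", or "prevention"
    method: Optional[str]   # appraisal method, e.g., "inspection" or "system test"
    person_hours: float

def va_percent(records: list[EffortRecord]) -> float:
    """VA share of total effort: the 'are we getting better?' indicator."""
    total = sum(r.person_hours for r in records)
    va = sum(r.person_hours for r in records if r.category == "VA")
    return 100.0 * va / total if total else 0.0
```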
Many groups are accustomed to collecting time accounting at the task level. Many believe these data will prove useful for future estimates. In practice, however, few actually use the data collected. Jones (2008) reports time is sometimes under-reported by 25 percent of total effort. A few organizations actually use task-level data for earned value analysis, and many simply collect data at that level of detail because they always did it that way and/or because time reporting is tightly integrated with project status tracking and no lower-overhead option is available. In instances where task-level reporting cannot be discontinued in favor of a simpler CoQ structure, it is nearly always possible to map tasks to CoQ categories.
The Cost of Quality Message
As illustrated by Figure 7, judiciously chosen increases in prevention and appraisal will reduce rework and lead to an overall increase in the portion of total effort that is value added. "Judiciously chosen" means the application of best practices at the right time in the development life cycle.
DEFECT CONTAINMENT FRAMEWORKS
A number of authors (for example, Jones 2008; 2010; Kan 2003; Radice 2002) have described measures of defect containment using a variety of terms, including defect removal efficiency, defect removal effectiveness (DRE), total containment efficiency/effectiveness (TCE), and phase containment efficiency (PCE). Regardless of the terms used, these metrics are always defined by a simple formula, applied to final project results and/or to specific phases or iterations.

Effectiveness
Unfortunately, the terms "efficiency" and "effectiveness" seem to be used somewhat interchangeably (可交换地) although they are actually different indicators, In the context of defect containment, one might consider any approach that removes a high percentage of defects present (for example, 99 percent) to be an effective process, even if it is very costly. Hence, defect containment metrics are a valid indication of effectiveness, but do not measure efficiency.
Containment metrics can be applied retrospectively (lagging indicators) or, if one forecasts the number of defects likely to be present, prospectively as leading indicators. Applied retrospectively to a complete project, a TCE (total containment efficiency/effectiveness) metric would be computed as follows:
TCE = defects found before release / (defects found before release + defects found after release) × 100 percent
The after-release period might be 3, 6, or 12 months, typically chosen to correspond to the usual upgrade release cycle. In some instances more than one period is used - a check of containment might be done after 3 months and again after 6 months. Large software systems (such as SAP) may merit an even longer after-release period if installation and deployment extend over several years.
Similarly, this metric can be applied retrospectively to a phase or iteration. To calculate PCE, the numerator is the number of defects found in a particular phase (for example, requirements) divided by that number plus the number of defects found later that originated in that phase. This necessarily implies that an effort is made to review defects found later and attribute them to a phase (or iteration) of origin - a process that can be both time consuming and potentially contentious. To reduce the effort, a sampling approach might be used.
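Both calculations reduce to a few lines of code. A minimal sketch, with function names chosen here as shorthand rather than taken from the article:

```python
def tce(found_before_release: int, found_after_release: int) -> float:
    """Total containment effectiveness, retrospective form."""
    total = found_before_release + found_after_release
    return 100.0 * found_before_release / total if total else 0.0

def pce(found_in_phase: int, later_escapes_from_phase: int) -> float:
    """Phase containment effectiveness for a single phase of origin."""
    total = found_in_phase + later_escapes_from_phase
    return 100.0 * found_in_phase / total if total else 0.0

# For example, 450 defects found pre-release and 50 reported in the
# first six months after release give tce(450, 50) == 90.0 percent.
```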
To apply containment metrics prospectively (as leading indicators), one requires a forecast of the number of defects likely to have been "inserted" in a particular phase or iteration of a development process.
Few organizations have these data, but they can be derived from an analysis of defect tracking records where those records are reasonably complete and accurate. Alternatively, many organizations can start with available industry benchmarks such as those found in Jones (2008), illustrated in Figure 8. The values in parentheses in the figure reflect allocation of documentation defects proportionately to phases. Later in this article the author will introduce a model that will use these data to forecast efficiency and effectiveness consequences of alternative appraisal and prevention strategies.

Given a forecast of the number of defects likely to be present, one can compare those estimates to a count of defects actually found at a point in time and compute a containment rate. This rate is a leading indicator; if, for example, one has consumed his or her effort and duration budget for a given phase but has found only 50 percent of the defects estimated to be present, that is an unmistakable "red flag." One will find those defects later, but at a greater cost in time and effort than likely has been budgeted. A containment rate determined in this manner provides a "quality adjusted" understanding of project status.
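A sketch of this leading-indicator check, using hypothetical counts:

```python
def containment_rate(found_so_far: int, forecast_present: int) -> float:
    """Leading indicator: percent of forecast defects found to date."""
    return 100.0 * found_so_far / forecast_present if forecast_present else 0.0

# Hypothetical phase: effort budget consumed, only half the forecast found.
rate = containment_rate(found_so_far=60, forecast_present=120)
if rate <= 50.0:
    print(f"Red flag: only {rate:.0f}% of forecast defects found so far")
```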
Efficiency
Efficiency takes cost into consideration. VA percentage is a valid measure of the aggregate efficiency of any process measured using the CoQ framework. VA percentage can be determined for a single project, or one can evaluate it by domain or in total across groups of projects; it can also be evaluated at the phase level.
In addition, by combining defect data and CoQ data one can evaluate the differential efficiency of alternative appraisal methods. If using the CoQ effort accounting approach, one will know the total effort devoted to each appraisal method in each phase or iteration. If using defect containment metrics, one will know how many defects have been found by each method in each phase or iteration. Given these two sets of data, one can calculate comparative cost per defect to determine relative efficiency per method and phase. These values provide guidance on which methods are most efficient and on how much of each to apply in each phase or iteration.
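For example, given per-method effort totals from the CoQ accounts and per-method defect counts from containment tracking (the figures below are hypothetical), comparative cost per defect falls out directly:

```python
# Hypothetical effort (person-hours) and defects found per appraisal
# method within one phase; values are illustrative, not benchmarks.
appraisals = {
    "design inspection": {"hours": 120, "defects": 90},
    "static analysis":   {"hours": 40,  "defects": 50},
    "unit test":         {"hours": 200, "defects": 80},
}

for method, data in appraisals.items():
    print(f"{method}: {data['hours'] / data['defects']:.1f} hours per defect")
```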
Note "cost per defect" is not a valid metric to compare projects, as it in effect penalizes projects that have high incoming quality - poor quality invariably leads to a lower cost per defect despite higher total cost. Nonetheless, comparative (比较的;相当的) cost per defect is a valid way to judge which methods are most cost effective in each phase. Relative cost per defect across alternative appraisal methods is unlikely to vary significantly across projects. Many different studies, for example, have clearly shown formal inspections ala IEEE Std 1028-2008 to be dramatically more effective than any form of testing. Absolute values vary significantly, but relative values consistently lead to the same conclusions.
Effectiveness of Alternative Appraisal Methods
Figure 9 indicates comparative containment effectiveness of several different methods of appraisal. It is evident from this table and from many other experience reports that formal inspections are significantly more effective than any type of testing. In addition, inspections are applicable to early phases of development (requirements, architecture, and design), while testing obviously cannot be employed until at least some construction has been completed.
An analysis the author made of 1,500 formal inspections completed in accordance with IEEE Std 1028-2008 indicated inspections of text materials (requirements, architecture, and design) found approximately six times more defects per inspection than did inspections of code. This data set included approximately 600 text inspections and 900 code inspections. There were no statistically significant differences among the several work product types included in the text inspections category. Approximately 75 percent of these inspection records were collected during training events. On average, each inspection required approximately 18 person-hours of effort and engaged four inspectors in most cases.
Clearly inspections are effective, but are they efficient as well? How can one combine these two metrics frameworks to optimize the combined efficiency and effectiveness of the development processes?

Modeling and Managing Both Efficiency and Effectiveness
In this section the author introduces models intended to help software organizations determine an optimal combination of defect containment methods and resource allocation. These models provide a mechanism to forecast the consequences of alternative assumptions and strategies in terms of both efficiency (measured by NVA vs. VA percent of total effort) and delivered quality (measured by TCE, that is the estimated percentage of defects removed before software release to customers).
As George Box said, "All models are wrong—some are useful" (Box 1979). This model uses parameters taken from a variety of public sources, but makes no claim that these parameter values are valid or correct in any particular situation. It is hoped that the reader will take away a thought process and perhaps make use of this or similar models using parameter values appropriate and realistic in the intended context. It is widely recognized that all benchmark values are subject to wide (and typically unstated) variation. Many parameter values will change significantly as a function of project size, application domain, and other factors. The objectives of these models include the following:
- Predict: 1) delivered quality (effectiveness); and 2) total NVA effort (efficiency) consequences of alternative appraisal strategies
- Predict defect insertion
- Focus attention on defects, which account for the largest share of total development cost.
- Enable early monitoring of the relationship between defects likely to be present and those actually found; provide early awareness using leading indicators.
- Estimate effort needed to execute the volume of appraisal necessary to find the number of defects one forecasts to remove.
- A "sanity check" on the planned level of appraisal effort, that is, is it actually plausible to remove an acceptable volume of defects with the level of appraisal effort planned?
- Forecast both pre-release (before delivery) and post-release (after delivery) NVA effort.
- When delivered quality is poor (TCE approximately 85 percent), the example discussed next indicates post-release defect repair costs can be 50 percent of the original project budget. Lower quality will lead to even higher post-release costs.
The complete model includes five tables; the first four include user-supplied parameters and calculate certain fields based on those parameters. The fifth table summarizes the results of the other four. This article will look at the summary first, and then will provide an overview of the details upon which the summary is based.
The author has defined six scenarios, all based on an assumed size of 1000 function points, to illustrate use of the model. Many other scenarios might be constructed. The first four scenarios assume defects are "inserted" at U.S. average rates (Jones 2008, 69) of a total of 5.0 defects per function point, including bad fixes and documentation errors. Scenarios 4 and 5, respectively, reflect results reported by higher maturity groups in which defects inserted are reduced to 3.6 per function point in scenario 4 (a 20 percent reduction due to learning) and 2.7 per function point in scenario 5 (an additional 20 percent reduction). These reductions in scenarios 4 and 5 are generally consistent with best-in-class results reported in Jones (2008).
Scenario 1 represents a test-only scenario in which pre-test appraisals, such as formal inspections, are not used. Scenario 1a assumes test cases are designed using a combinatorial tool based on design of experiments, and that gains realized thereby are consistent with results reported by Kuhn et al. (2009). Scenarios 2 and 3 introduce certain pre-test appraisals, including inspections and static analysis, and scenarios 3 through 5 discontinue some test activities. Inspection effectiveness (percent of defects found) is assumed to be 60 percent in scenarios 2 and 3, increasing to 70 percent in scenario 4 and 80 percent in scenario 5. Other model parameters, discussed later, remain constant across all scenarios. Inspection percentages indicate the portion of the work product inspected (see Figure 10). Note that static analysis can only be used for certain procedural languages such as C and Java.

Two summaries are developed; Figure 11 shows the impact of alternative appraisal strategies on delivered quality as measured by TCE, that is, the percentage of defects removed prior to delivery of the software. In this illustration, a "best" mix of appraisal activities (in scenario 3, 4.4 percent of defects are "delivered") reduces delivered defects by about 75 percent compared to a test-only approach typically used by average groups (in scenario 1, 16.6 percent of defects are "delivered"). Comparison of scenarios 1 and 3 isolates the impact of an improved mix of containment methods independent of reductions in defect potentials indicated in scenarios 4 and 5.

The second summary (see Figure 12) shows the impact of alternative appraisal strategies on NVA effort as defined in the CoQ framework, that is, all appraisal and rework effort is by definition NVA. Although certainly a necessary evil, the goal will always be to minimize these costs. In the interest of simplicity, the author assumes ancillary costs such as defect tracking, change management, and other defect-related activities are included in either appraisal or rework, as they are in any case a small fraction of total costs.

In this illustration a best (scenario 3) mix of appraisal activities reduces total NVA effort (including both pre- and post-release effort) by more than 40 percent compared to the scenario 1 test-only approach typically used by average groups (67.3 person-months in scenario 3 vs. 113.9 in scenario 1). More mature organizations, as a result of lower defect insertion and improved inspection effectiveness, can reduce NVA by an additional 38 percent (to 41.5 in scenario 5 vs. 67.3 in scenario 3). The largest gains in efficiency are realized when formal inspections are used, and those gains are far larger post-release than pre-release. Perhaps this dynamic explains why so many organizations try inspections but do not sustain them: the largest savings are not evident at the time of release, and may in any case be in someone else's budget.
Model Parameters
Four sets of parameter values, each in a separate Excel table, are required to generate the summary conclusions described previously. One set of tables is used for each scenario and is contained in a single Excel sheet (tab) dedicated to each scenario. These tables and the parameters in each are as follows:
(1) Defect insertion and removal forecast. This table contains a row for each distinct appraisal step, for example, requirements inspection, design inspection, code inspection, static analysis, and unit-function-integration-system-acceptance tests. Any desired set may be identified. Defects inserted are forecast on a "per size" basis, for example, using Jones' benchmark value per function point. The percentage of defects removed by each appraisal step is also forecast, again using Jones' benchmark values or other locally determined values. Given these user-supplied parameters, the model calculates defects found and remaining at each appraisal step and the final TCE percent (a minimal sketch of this cascade appears after this list).
(2) Inspection effort forecast. This table ...
(3) Static analysis effort forecast.
(4) Test effort forecast.
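To illustrate the cascade computed by table (1), here is a simplified sketch. It treats all defects as present before the first appraisal step, whereas the full model forecasts insertion phase by phase, and the removal percentages are illustrative stand-ins for Jones-style benchmarks rather than calibrated values.

```python
function_points = 1000
insertion_rate = 5.0   # defects per function point (U.S. average, Jones 2008)

# (appraisal step, assumed removal effectiveness) - illustrative values only.
steps = [
    ("requirements inspection", 0.60),
    ("design inspection",       0.60),
    ("code inspection",         0.60),
    ("static analysis",         0.50),
    ("system test",             0.35),
    ("acceptance test",         0.25),
]

inserted = function_points * insertion_rate
remaining = inserted
for name, effectiveness in steps:
    found = remaining * effectiveness
    remaining -= found
    print(f"{name}: found {found:.0f}, remaining {remaining:.0f}")

tce = 100.0 * (inserted - remaining) / inserted
print(f"TCE: {tce:.1f}% of defects removed before release")
```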
Sanity Check
Sanity checking any model is always a good idea.
In general, the sanity check seems consistent with published results (for example, Jones 2009; 2010; Humphrey 2008) if one assumes the scenarios considered are roughly representative of CMMI levels 1 to 5, respectively.