Quantifying Performance Changes with Effect Size Confidence Intervals

Paper title

Quantifying Performance Changes with Effect Size Confidence Intervals

Authors

Kalibera, Tomas; Jones, Richard

Abstract

Measuring performance and quantifying a performance change are core evaluation techniques in programming language and systems research. Of 122 recent scientific papers, as many as 65 included experimental evaluation that quantified a performance change using a ratio of execution times. Few of these papers evaluated their results with the level of rigour that has come to be expected in other experimental sciences. The uncertainty of measured results was largely ignored. Scarcely any of the papers mentioned uncertainty in the ratio of the mean execution times, and most did not even mention uncertainty in the two means themselves. Most of the papers failed to address the non-deterministic execution of computer programs (caused by factors such as memory placement, for example), and none addressed non-deterministic compilation. It turns out that the statistical methods presented in the computer systems performance evaluation literature for the design and summary of experiments do not readily allow this either. This poses a hazard to the repeatability, reproducibility and even validity of quantitative results. Inspired by statistical methods used in other fields of science, and building on results in statistics that did not make it to introductory textbooks, we present a statistical model that allows us both to quantify uncertainty in the ratio of (execution time) means and to design experiments with a rigorous treatment of those multiple sources of non-determinism that might impact measured performance. Better still, under our framework summaries can be as simple as "system A is faster than system B by 5.5% $\pm$ 2.5%, with 95% confidence", a more natural statement than those derived from typical current practice, which are often misinterpreted.

November 2013
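To make the kind of summary quoted in the abstract concrete, the sketch below estimates a confidence interval for a ratio of mean execution times. It is only an illustration under stated assumptions: it uses a generic percentile bootstrap, which is not claimed to be the statistical model of the paper, it treats all runs as independent samples (so it ignores the multiple sources of non-determinism the abstract mentions), and the function name `bootstrap_ratio_ci` and the timing values are hypothetical.

```python
import random
import statistics


def bootstrap_ratio_ci(times_a, times_b, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile-bootstrap interval for mean(times_a) / mean(times_b).

    Assumption: every measurement is an independent sample. This is a
    simplification and not the model described in the paper.
    """
    rng = random.Random(seed)
    ratios = []
    for _ in range(n_resamples):
        # Resample each system's timings with replacement.
        resample_a = rng.choices(times_a, k=len(times_a))
        resample_b = rng.choices(times_b, k=len(times_b))
        ratios.append(statistics.fmean(resample_a) / statistics.fmean(resample_b))
    ratios.sort()

    # Pick the empirical quantiles that bound the central `confidence` mass.
    alpha = 1.0 - confidence
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)

    point = statistics.fmean(times_a) / statistics.fmean(times_b)
    return point, ratios[lo_idx], ratios[hi_idx]


# Hypothetical execution times in seconds (made up for illustration only).
times_a = [9.4, 9.6, 9.5, 9.7, 9.3, 9.5, 9.6, 9.4]
times_b = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9]

point, lower, upper = bootstrap_ratio_ci(times_a, times_b)
print(f"time ratio A/B = {point:.3f}, 95% interval [{lower:.3f}, {upper:.3f}]")
```

A ratio below 1 whose whole interval also lies below 1 would support a statement of the form "system A is faster than system B", with the interval width playing the role of the $\pm$ term.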

Paper title