Abstract

Benchmarking is a core element in the toolbox of most systems researchers and is used for analyzing, comparing, and validating complex systems. In the quest for reliable benchmark results, a consensus has formed that a significant experiment must be based on multiple runs. To interpret these runs, mean and standard deviation are often used. For experiments in which each run produces a time series, however, comparing means is neither straightforward nor necessarily statistically sound, because it ignores the possibility of significant differences between runs with a similar average. To verify this hypothesis, we surveyed 1,112 publications from selected performance engineering and systems conferences, canvassing them for open data sets from performance experiments. The three data sets we identified rely purely on average and standard deviation. We therefore propose a novel analysis approach based on similarity analysis to enhance the reliability of performance evaluations. Our approach evaluates 12 (dis-)similarity measures with respect to their applicability in analyzing performance measurements and identifies four suitable similarity measures. We validate our approach by demonstrating the increase in reliability for the data sets found in the survey.
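
A minimal sketch may help illustrate why mean and standard deviation alone can hide differences between time-series runs. The two runs below are invented for illustration (they are not taken from the surveyed data sets), and the Euclidean distance is used only as a simple example of a dissimilarity measure; it is not necessarily one of the measures evaluated in the paper.

```python
import math
import statistics

# Two hypothetical benchmark runs (e.g. latency samples over time).
# run_a shifts once from a low to a high level; run_b oscillates throughout.
run_a = [10, 10, 10, 10, 20, 20, 20, 20]
run_b = [10, 20, 10, 20, 10, 20, 10, 20]

# Summary statistics are identical for both runs.
mean_a, mean_b = statistics.mean(run_a), statistics.mean(run_b)
std_a, std_b = statistics.pstdev(run_a), statistics.pstdev(run_b)

# A point-wise dissimilarity measure still separates the two runs.
distance = math.dist(run_a, run_b)

print(f"mean: {mean_a} vs {mean_b}")
print(f"population std: {std_a} vs {std_b}")
print(f"Euclidean distance: {distance}")
```

Both runs report the same mean and the same standard deviation, yet their behavior over time differs, which the non-zero distance makes visible.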

Video Presentation