Benchmarks

The Benchmarking Suite comes with a set of third-party benchmarking tools, each of them with a set of different test configurations ready to be executed. The tools are:

  • CFD: a tool realized in the CloudPerfect EU project [1] that uses OpenFOAM to run a waterbox simulation. It can be configured with different solvers, numbers of iterations and write-to-disk strategies. It is primarily a CPU-intensive benchmark;
  • DaCapo: a tool for Java benchmarking that simulates real-world applications with non-trivial memory loads. It is mainly a CPU- and memory-intensive benchmark;
  • Filebench: a powerful and flexible tool able to generate and execute a variety of filesystem workloads to simulate applications like web servers, file servers and video services. It is mainly a disk-intensive benchmark;
  • Iperf: a tool for active measurements of the maximum achievable bandwidth on IP networks;
  • Sysbench: a tool to test CPU, memory, file I/O, mutex performance and MySQL on Linux systems;
  • YCSB: a tool for database benchmarking that supports several database technologies. In the Benchmarking Suite, tests for MySQL and MongoDB are provided. It is primarily a disk-intensive benchmark;
  • WebFrameworks: a tool that tests common web framework workloads like fetching and inserting data in a database or creating/parsing JSON objects. It is mainly a memory- and network-intensive benchmark.

The following table summarizes the available tools and their compatibility with different operating systems.

Test-OS compatibility matrix
Tool Version CentOS Ubuntu 14 Ubuntu 16 Ubuntu 18 Ubuntu 20 Debian
CFD 1.0      
DaCapo 9.12      
Filebench 1.4.9.1      
Iperf 2.0.5      
Sysbench 2.1.0      
YCSB-MySQL 0.12.0      
YCSB-MongoDB 0.11.0      
WebFrameworks master      

CFD

The CFD benchmarking tool was realized in the context of the CloudPerfect EU project [1] and released as open source on GitHub [2]. The tool executes a CFD simulation on a waterbox geometry and allows several parameters to be customized in order to run different simulation scenarios.

The following combinations of parameters are used in the Benchmarking Suite tests:

100iterGAMG 100 iterations using the GAMG solver
100iterWriteAtLast 100 iterations using the GAMG solver and not writing intermediate results on the disk
500iterGAMG 500 iterations using the GAMG solver
500iterGAMGWriteAtLast 500 iterations using the GAMG solver and not writing intermediate results on the disk
500iterICCG 500 iterations using the ICCG solver
500iterPCG 500 iterations using the PCG solver

All the tests use all the CPUs available on the machine.
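
As an illustration only, such a workload could be described with a configuration section in the style the Benchmarking Suite uses for its tools (see "Adding a new benchmarking tool" below); the parameter names here are hypothetical and are not taken from the actual tool:

[500iterGAMGWriteAtLast]
# hypothetical parameter names, for illustration only
solver = GAMG
iterations = 500
write_intermediate_results = false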

Metrics

Metric Unit Description
duration s The overall duration of the simulation

DaCapo

DaCapo [3] is a tool for Java benchmarking intended for the programming language, memory management and computer architecture communities. It consists of a set of open source, real-world applications with non-trivial memory loads. The tests implemented by the tool are:

DaCapo tests (source: http://www.dacapobench.org/)
avrora simulates a number of programs run on a grid of AVR microcontrollers
batik produces a number of Scalable Vector Graphics (SVG) images based on the unit tests in Apache Batik
eclipse executes some of the (non-gui) jdt performance tests for the Eclipse IDE
fop takes an XSL-FO file, parses it and formats it, generating a PDF file.
h2 executes a JDBCbench-like in-memory benchmark, executing a number of transactions against a model of a banking application, replacing the hsqldb benchmark
jython interprets the pybench Python benchmark
luindex uses lucene to index a set of documents: the works of Shakespeare and the King James Bible
lusearch uses lucene to do a text search of keywords over a corpus of data comprising the works of Shakespeare and the King James Bible
pmd analyzes a set of Java classes for a range of source code problems
sunflow renders a set of images using ray tracing
tomcat runs a set of queries against a Tomcat server retrieving and verifying the resulting webpages
tradebeans runs the daytrader benchmark via Java Beans to a GERONIMO backend with an in-memory h2 as the underlying database
tradesoap runs the daytrader benchmark via SOAP to a GERONIMO backend with an in-memory h2 as the underlying database
xalan transforms XML documents into HTML

Each test is executed multiple times, until the execution durations converge (variance <= 3.0 over the latest 3 executions).
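
A single converging run corresponds to an invocation of the DaCapo harness along the following lines; the convergence flags shown here are an assumption based on the 9.12 harness options and should be verified against the output of java -jar dacapo-9.12-bach.jar --help:

# repeat the avrora test until the execution time converges
java -jar dacapo-9.12-bach.jar --converge --window 3 --variance 3.0 avrora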

Metrics

Metric Unit Description
timed_duration ms the duration of the latest execution
warmup_iters num the number of executions that were necessary to converge

Filebench

Filebench [4] is a very powerful tool able to generate a variety of filesystem- and storage-based workloads. It implements a set of basic primitives like createfile, readfile, mkdir, fsync, … and provides a language (the Workload Model Language, WML) to combine these primitives into complex workloads.

In the Benchmarking Suite, a set of pre-defined workloads has been used to simulate different services:

Filebench workloads (source: https://github.com/filebench/filebench/wiki/Predefined-personalities)
fileserver Emulates simple file-server I/O activity. This workload performs a sequence of creates, deletes, appends, reads, writes and attribute operations on a directory tree. 50 threads are used by default. The workload generated is somewhat similar to SPECsfs.
webproxy Emulates I/O activity of a simple web proxy server. A mix of create-write-close, open-read-close, and delete operations of multiple files in a directory tree and a file append to simulate proxy log. 100 threads are used by default.
webserver Emulates simple web-server I/O activity. Produces a sequence of open-read-close on multiple files in a directory tree plus a log file append. 100 threads are used by default.
videoserver This workload emulates a video server. It has two filesets: one contains videos that are actively served, and the second one has videos that are available but currently inactive. One thread is writing new videos to replace no longer viewed videos in the passive set. Meanwhile $nthreads threads are serving up videos from the active video fileset.
varmail Emulates I/O activity of a simple mail server that stores each e-mail in a separate file (/var/mail/ server). The workload consists of a multi-threaded set of create-append-sync, read-append-sync, read and delete operations in a single directory. 16 threads are used by default. The workload generated is somewhat similar to Postmark but multi-threaded.
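
Each of these personalities is a WML script shipped with Filebench, so running a workload boils down to an invocation like the following sketch (the workload path is an assumption and depends on the installation):

# run the pre-defined fileserver personality
filebench -f /usr/local/share/filebench/workloads/fileserver.f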

Metrics

Metric Unit Description
duration s The overall duration of the test
ops num The sum of all operations (of any type) executed
ops_throughput ops/s The average number of operations executed per second
throughput MB/s The average amount of data written/read per second during the test
cputime µs The average CPU time taken by each operation
latency_avg µs The average duration of each operation

Iperf

IPerf [5] is a benchmarking tool to measure the maximum achievable bandwidth on IP networks. It provides statistics both for TCP and UDP protocols.

In the Benchmarking Suite, the following pre-defined workloads have been created:

tcp_10_1 transfer data over a single TCP connection for 10 seconds
tcp_10_10 transfer data over 10 parallel TCP connections for 10 seconds
udp_10_1_1 transfer UDP packets over a single connection for 10 seconds with the maximum bandwidth limited to 1MBit/s
udp_10_1_10 transfer UDP packets over a single connection for 10 seconds with the maximum bandwidth limited to 10MBit/s
udp_10_10_10 transfer UDP packets over 10 parallel connections for 10 seconds with the maximum bandwidth limited to 10MBit/s
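
These workloads map onto standard iperf 2 client invocations; a sketch, assuming an iperf server is already listening on the target machine:

iperf -s                           # started on the server side
iperf -c <server> -t 10            # tcp_10_1
iperf -c <server> -t 10 -P 10      # tcp_10_10
iperf -c <server> -u -t 10 -b 1M   # udp_10_1_1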

Metrics

For the TCP workloads:

Metric Unit Description
duration s The overall duration of the test
transferred_x bytes data transferred over connection x
bandwidth_x bit/s bandwidth of connection x
transferred_sum bytes sum of data transferred over all connections
bandwidth_sum bit/s sum of the bandwidth of all connections

For the UDP workloads:

Metric Unit Description
duration s The overall duration of the test
transferred_x bytes data transferred over connection x
bandwidth_x bit/s bandwidth of connection x
total_datagrams_x num number of UDP packets sent over connection x
lost_datagrams_x num number of lost UDP packets over connection x
jitter_x ms jitter measured on connection x
outoforder_x num number of packets received by the server in the wrong order
transferred_avg bytes average data transferred by each connection
bandwidth_avg bit/s average bandwidth of each connection
total_datagrams_avg num average number of packets sent over each connection
lost_datagrams_avg num average number of packets lost for each connection
jitter_avg ms average jitter over all connections
outoforder_avg num average number of packets received in the wrong order

Sysbench

SysBench [6] is a modular, cross-platform and multi-threaded benchmark tool for evaluating CPU, memory, file I/O, mutex and even MySQL performance. At the moment, only the CPU benchmarking capabilities are integrated in the Benchmarking Suite.

cpu_10000 Verifies prime numbers between 0 and 20000 by doing standard division of the number by all numbers between 2 and the square root of the number. This is repeated 10000 times, using 1, 2, 4, 8, 16 and 32 threads
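
Under the hood, each run corresponds to a sysbench invocation of the following shape, one run per thread count (--time=0 removes the time limit so that exactly the configured number of events is executed; the exact options used by the suite are defined in its configuration files [10]):

sysbench cpu run --cpu-max-prime=20000 --events=10000 --threads=4 --time=0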

Metrics

Metric Unit Description
events_rate_X num/s the number of times prime numbers between 0 and 20000 are verified each second with X threads
total_time_X s total number of seconds it took to execute the 10000 cycles with X threads
latency_min_X ms minimum time taken by a cycle
latency_max_X ms maximum time taken by a cycle
latency_avg_X ms average time taken by a cycle over the 10000 executions. It gives a good measure of the CPU speed
latency_95_X ms 95th percentile of the latency times

YCSB

YCSB [7] is a database benchmarking tool. It supports several database technologies and provides a configuration mechanism to simulate different usages.

In the Benchmarking Suite, YCSB is used to benchmark two of the most popular database servers: MySQL and MongoDB.

For each database, the following workloads are executed:

workloada Simulates an application that performs read and update operations with a ratio of 50/50 (e.g. recent actions recording)
workloadb Simulates an application that performs read and update operations with a ratio of 95/5 (e.g. photo tagging)
workloadc Simulates a read-only database (100% read operations)
workloadd Simulates an application that performs read and insert operations with a ratio of 95/5 (e.g. user status update)
workloade Simulates an application that performs scan and insert operations with a ratio of 95/5 (e.g. threaded conversations)
workloadf Simulates an application that performs read and read-modify-write operations with a ratio of 50/50 (e.g. user database)
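
Each workload follows YCSB's usual two-phase pattern: a load phase that populates the database, followed by a run phase that executes the transactions. A sketch for MongoDB (record and operation counts are assumptions; the actual values are defined in the suite's configuration files [10]):

bin/ycsb load mongodb -s -P workloads/workloada -p recordcount=100000
bin/ycsb run mongodb -s -P workloads/workloada -p operationcount=100000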

Metrics

Metric Unit Description
duration s The overall duration of the test
read_ops num The number of read operations executed
read_latency_avg µs The average latency of the read operations
read_latency_min µs The minimum latency of the read operations
read_latency_max µs The maximum latency of the read operations
read_latency_95 µs The maximum latency for 95% of the read operations
read_latency_99 µs The maximum latency for 99% of the read operations
insert_ops num The number of insert operations executed
insert_latency_avg µs The average latency of the insert operations
insert_latency_min µs The minimum latency of the insert operations
insert_latency_max µs The maximum latency of the insert operations
insert_latency_95 µs The maximum latency for 95% of the insert operations
insert_latency_99 µs The maximum latency for 99% of the insert operations
update_ops num The number of update operations executed
update_latency_avg µs The average latency of the update operations
update_latency_min µs The minimum latency of the update operations
update_latency_max µs The maximum latency of the update operations
update_latency_95 µs The maximum latency for 95% of the update operations
update_latency_99 µs The maximum latency for 99% of the update operations

WebFrameworks

This is an open source tool [8] used to compare many web application frameworks on fundamental tasks such as JSON serialization, database access, and server-side template composition. The tool is developed and used to run the tests that generate the results published at https://www.techempower.com/benchmarks/.

Currently, the frameworks supported in the Benchmarking Suite are: Django, Spring, CakePHP, Flask, FastHttp and NodeJS.

For each framework the following tests are executed:

Test types (source: https://www.techempower.com/benchmarks/#section=code&hw=ph)
json This test exercises the framework fundamentals including keep-alive support, request routing, request header parsing, object instantiation, JSON serialization, response header generation, and request count throughput.
query This test exercises the framework’s object-relational mapper (ORM), random number generator, database driver, and database connection pool.
fortunes This test exercises the ORM, database connectivity, dynamic-size collections, sorting, server-side templates, XSS countermeasures, and character encoding.
db This test uses a testing World table. Multiple rows are fetched to more dramatically punish the database driver and connection pool. At the highest queries-per-request tested (20), this test demonstrates all frameworks’ convergence toward zero requests-per-second as database activity increases.
plaintext This test is an exercise of the request-routing fundamentals only, designed to demonstrate the capacity of high-performance platforms in particular. Requests will be sent using HTTP pipelining.
update This test exercises the ORM’s persistence of objects and the database driver’s performance at running UPDATE statements or similar. The spirit of this test is to exercise a variable number of read-then-write style database operations.

For the types json, query, fortunes and db, the tool executes six different bursts of requests. Each burst lasts 15 seconds and has a different concurrency level (the number of requests performed concurrently): 16, 32, 64, 128, 256 and 512.

For the type plaintext, the tool executes four bursts of 15 seconds each with the following concurrency levels: 256, 1024, 4096 and 16384.

For the type update, the tool executes five bursts of 15 seconds each with a concurrency level of 512, but a different number of queries to perform: 1, 5, 10, 15 and 20.
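
The TechEmpower toolchain generates this kind of load with the wrk HTTP benchmarking tool; a single 15-second burst at concurrency level 256 against the json endpoint would look like the following sketch (host and port are assumptions):

wrk -t 8 -c 256 -d 15s http://<server>:8080/json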

Metrics

Metric Unit Description
duration s The overall duration of the test
duration_N s The overall duration for the N concurrency level*. It is fixed to 15 seconds by default
totalRequests_N num The overall number of requests processed during the 15 seconds test at the N concurrency level*
timeout_N num The number of requests that timed out for the N concurrency level*
latencyAvg_N s The average latency between a request and its response for the N concurrency level*
latencyMax_N s The maximum latency between a request and its response for the N concurrency level*
latencyStdev_N s The standard deviation of the latency for the N concurrency level*

Adding a new benchmarking tool

In addition to the benchmarking tests that come with the standard Benchmarking Suite release, it is possible to add new benchmarking tools by providing a configuration file that instructs the Benchmarking Suite on how to install, configure and execute the tool.

The configuration file must contain one [DEFAULT] section with the commands to install and execute the benchmarking tool, plus one or more sections that define different sets of input parameters for the tool. In this way, the same tool can be executed to generate multiple workloads.

[DEFAULT]
class = benchsuite.stdlib.benchmark.vm_benchmark.BashCommandBenchmark


#
# install, install_ubuntu, install_centos_7 are all valid keys
install_<platform> =
    echo "these are the..."
    echo "...install %(option1)s commands"

execute_<platform> =
    echo "execute commands"

cleanup =
    echo "commands to cleanup the %(option2)s environment"



[workload_1]
option1 = value1
option2 = value2

[workload_n]
option1 = value1
option2 = valueN
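
The %(option1)s and %(option2)s placeholders follow the Python configparser interpolation syntax: when a workload section is executed, its option values are substituted into the commands inherited from [DEFAULT].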

For instance, a very minimal configuration file to integrate the Sysbench [6] benchmarking tool is shown below:

[DEFAULT]
class = benchsuite.stdlib.benchmark.vm_benchmark.BashCommandBenchmark

install =
    curl -s https://packagecloud.io/install/repositories/akopytov/sysbench/script.deb.sh | sudo bash
    sudo apt-get -yq install sysbench
    sysbench %(test)s prepare %(options)s

execute =
    sysbench %(test)s run %(options)s --time=0

cleanup =
    sysbench %(test)s cleanup %(options)s

[cpu_workload1]
test = cpu
options = --cpu-max-prime=20000 --events=10000

Configuration files of the benchmarks included in the Benchmarking Suite releases can be used as a starting point; they are available at [10].

Managing benchmarking tools through the GUI

The Benchmarking Suite comes with a set of third-party, widely-known, open source benchmarking workloads (e.g. SysBench, FileBench, DaCapo, Web Framework Benchmarking). These workloads are available to any registered user, and their use is encouraged to enable comparability of results over time and across providers and users. However, to support specific user requirements, custom workloads can be defined and, according to the user's choice, shared with others or kept private. As with the CLI, workloads registered in the Benchmarking Suite are typically wrappers around existing benchmarking applications, for which the registration process should provide installation, execution and results-parsing capabilities.

When the Benchmarking Suite is used through the web interface, benchmarking tools (a.k.a. workloads) can be added and edited as well.

From the ‘Workload’ panel, a new workload can be added via the ‘New Workload’ button, which shows a form asking for the metadata described below.

Similarly, once a workload has been selected, it can be modified, cloned into a new one or deleted (provided you have enough permissions on it).

Workload metadata

  • Workload name is a name, not necessarily unique, given to the workload;
  • Tool name is the name of the tool providing the given workload;
  • Workload ID is a unique identifier provided by the system, and is not modifiable;
  • Description is to tell what the workload does, what parameter is measured, and any other useful detail about the workload;
  • Categories allows specifying some tags to ease searching;
  • Abstract is to mark workload ‘templates’ not meant for execution but only for specialization;
  • Parent workload is a base workload definition from which properties/commands are inherited. The current workload only needs to specialize some of them. A typical usage of this feature is to define multiple workloads provided by a single tool;

Workload execution

  • Install scripts are executed just after the provisioning of the virtual machine, usually to download and install a benchmarking tool;
  • Post-create scripts are executed after the provisioning of the virtual machine to perform some general initialization (e.g. configure the DNS);
  • Execute scripts are executed to perform the benchmark of the environment;
  • Cleanup scripts are executed for multi-benchmark execution to ensure a clean environment for following benchmarks;
  • User and Support scripts and Workload parameters are meant to be used in the above scripts for readability and better coding style;

Install, post-create, execute and cleanup scripts can be specialized for a specific operating system, as well as for a specific version of the operating system. This is achieved through a mechanism that matches the environment of the target VM with the name of the script.

As an example, when executing the workload on a VM running Debian 10.8, the following install scripts are searched:

  1. install_debian_10_8
  2. install_debian_10
  3. install_debian
  4. install

The first matched script (i.e. the most-specific one) is executed; the others are ignored.
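
In a configuration file, this specialization simply takes the form of multiple keys with platform-suffixed names, as in the following sketch (the commands are placeholders):

[DEFAULT]
# the most specific matching key wins: on Debian 10.8 this one is executed
install_debian_10 =
    echo "Debian 10-specific install commands"

# used on any other Debian version
install_debian =
    echo "generic Debian install commands"

# fallback for any other platform
install =
    echo "install commands for any other platform"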

Sharing workloads

  • Sharing sets the level of visibility of the workload. A workload can be private to its creator or publicly visible and, thus, executable. Note that this applies to the workload only, not to the infrastructure being benchmarked nor to the produced results, which have their own visibility levels;

Exporting workloads

A JSON representation of the workload can be generated for offline inspection/editing. This can be done for all visible workloads (hit the ‘Export All’ button) or for individual workloads (hit the ‘Export’ button when viewing the workload).

When exported workloads are linked by some inheritance relationship, you can decide to export them so that:

  1. the hierarchy is preserved;
  2. inherited properties are collapsed into most-specific workloads (i.e. the hierarchy is lost).
[1] CloudPerfect project homepage: http://cloudperfect.eu/
[2] CFD Benchmark Case code: https://github.com/benchmarking-suite/cfd-benchmark-case
[3] DaCapo homepage: http://www.dacapobench.org/
[4] Filebench homepage: https://github.com/filebench/filebench/wiki
[5] IPerf homepage: https://iperf.fr/
[6] Sysbench homepage: https://github.com/akopytov/sysbench
[7] YCSB homepage: https://github.com/brianfrankcooper/YCSB/wiki
[8] Web Frameworks Benchmarking code: https://github.com/TechEmpower/FrameworkBenchmarks
[10] Benchmark configuration files: https://github.com/benchmarking-suite/benchsuite-stdlib/tree/master/data/benchmarks