Write your own benchmark apps
Benchmarking is a key part of any technology evaluation. Darren Yates explains how to write your own test apps.
Many of you may remember ‘PC User’ magazine, the predecessor of our sister publication, TechLife. Back in 2005, PC User, like many other publications, was using benchmark apps from a major US publisher to carry out product reviews, when we received word those apps would no longer be developed. That left us in a bit of a pickle. However, it also provided an opportunity. Instead of looking for another benchmark suite filled with other people’s ideas of what makes a decent test, I figured why not use the then-ten years’ experience I’d built up reviewing PC products and develop a benchmark suite that fits our readers’ needs?
A couple of months later, the first of PC User’s UserBench benchmark tests was born and we began using it in the magazine. Development continued and new releases were produced for another seven years. To our knowledge, UserBench was the first PC benchmark suite developed in Australia.
You don’t necessarily need to be a professional programmer to find opportunities to put your skills into practice in your workplace or business. Understanding the problem and what is needed (broadly called ‘Systems Analysis’), along with building the solution (‘Systems Design’) are skills you may be able to grow through understanding your own job or career (‘Domain Knowledge’).
Play around with PC gear long enough and you’ll eventually run into benchmark testing, either evaluating your own gear or reading the results of others. But learning how to build and run even the simplest performance tests is a skill that can go way beyond CPUs and motherboards.
PRECISION TIMING
Any software benchmark test is about repeatable testing of system performance. Essentially, you’re looking to time how long it takes for a device to perform some given task, whether it’s a game demo, processing video, Javascript, HTML or whatever, without having to resort to a stopwatch. That starts with understanding how to code precision timing. You want to be able to independently and automatically measure the time it takes for your benchmark process to complete. You can choose whatever app you like to form the basis of your benchmark process, but you want the timing precise and accurate.
Java has a number of different ways to measure time, depending on your application. For situations where sub-second resolution is required, the most common option is the System.currentTimeMillis() method (http://tinyurl.com/hgwzymp). It returns a ‘long’ (a signed 64-bit, eight-byte integer) containing the current millisecond count since the Unix epoch (midnight UTC on 1 January 1970).
However, there’s a very good argument that it’s the wrong option for benchmarking. The reason is that currentTimeMillis() forms the basis of what’s described as Java’s ‘time-of-day’ clock, which is taken from the operating system. Since operating system time is frequently adjusted for accuracy (no PC clock is perfect), the currentTimeMillis() value also gets adjusted, being Java’s representation of absolute time. Furthermore, according to the Java docs, you can’t guarantee the precision or ‘granularity’ of currentTimeMillis() will be better than 10 milliseconds, the common update interval for many operating systems. These factors can affect the accuracy of your timing.
The better alternative in this instance is to use System.nanoTime(). The original Java developers created nanoTime() to provide a precision timer that measures ‘elapsed’ time rather than ‘absolute’ time. That means it’s never updated or interfered with to account for correct time, it just measures time that has passed, making it a more suitable option. According to the Java docs, nanoTime() returns the nanoseconds since some arbitrary time origin as a ‘long’ datatype object, similar to currentTimeMillis().
HOW TO MEASURE INTERVAL TIME
The process for measuring time elapsed or time over an interval is pretty easy to learn. You start by taking the current nanoTime() reading and storing it in a ‘long’ variable: long timeStart = System.nanoTime();
You then carry out whatever task you want your benchmark to perform and take a new nanoTime() reading at the end: long timeFinish = System.nanoTime();
Subtract the former from the latter and you’ll have the precise time your process took in nanoseconds: long timeRun = timeFinish - timeStart;
Now, if your test process lasts for seconds or longer, having precision to nine decimal places is probably overkill. However, Oracle says the actual resolution of nanoTime() is only guaranteed to be at least as good as currentTimeMillis(). In practice, we always found the nearest millisecond good enough for most PC-based benchmarking requirements, so here, we divide the result by one million to convert from nanoseconds (10 to the power of minus 9 seconds) to milliseconds (10 to the power of minus 3 seconds). We’ll use the short-cut form here: timeRun /= 1000000;
WHEN TO LAUNCH THE TIMING CODE
In order to achieve the most accurate test timing, you should always sample the current system nanoTime() as the last thing you do before launching your benchmark test routine. Further, you should take the ‘finished’ time sample as the first thing you do after the test code completes. This will give you the timing that represents the actual task you’re testing.
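Putting the steps above together, the pattern looks like this in code. The workload shown here is just a placeholder — substitute whatever task you want your benchmark to perform:

```java
// A minimal sketch of the interval-timing pattern described above.
// runBenchmark() is a stand-in for your own benchmark task.
public class TimingSketch {

    // Placeholder workload -- replace with your actual test routine.
    static double runBenchmark() {
        double sum = 0;
        for (int i = 1; i <= 1_000_000; i++) {
            sum += 1.0 / i;
        }
        return sum;
    }

    // Times the task and returns the elapsed time in milliseconds.
    static long timeTaskMillis() {
        // Sample the clock as the very last thing before launching the task...
        long timeStart = System.nanoTime();
        runBenchmark();
        // ...and as the very first thing after it completes.
        long timeFinish = System.nanoTime();
        long timeRun = timeFinish - timeStart;
        return timeRun / 1_000_000; // convert nanoseconds to milliseconds
    }

    public static void main(String[] args) {
        System.out.println("Task took " + timeTaskMillis() + " ms");
    }
}
```

Keeping the two clock samples tight against the task, as above, stops setup or cleanup code from contaminating the measurement.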
HOW TO BENCHMARK THE RIGHT WAY
Over the years, I’ve read the odd few people mocking the need for
benchmark tests, decrying the ‘speeds and feeds’ concepts of reviews as a waste of time. Nonsense – the more information you know about a system, a process or a product, the more informed a decision you can make about whether to buy it, replace it or skip it.
If there’s ever a golden rule in benchmark testing, it’s this – change only one thing at a time. Normally, benchmark testing is all about trying to gain comparisons – how does one motherboard compare with another, or one graphics card with another? If you’re testing a new graphics card, you don’t go changing the motherboard, the CPU and RAM at the same time unless it’s unavoidable, for example, testing AMD versus Intel performance – and even then, you change as little as possible. If you’re testing motherboards from the same class, you don’t go changing the CPU, graphics card or RAM for the sake of it. You want any change registered in your ‘before’ and ‘after’ tests to be the result of the one item you’re testing and that item only. Otherwise, how do you know which component caused the change?
We might be talking about PC components here, but as a programmer, you could be asked to test the run-time performance of an external system, say, the time for a share transaction to be sent from a share trading system and received by a share-processing server. Understanding how a system works, the parts of the system that vary and which parts of that system can be locked down can make a huge difference to the overall accuracy of your measurements.
KNOW WHAT YOU’RE TESTING
It’s also equally important to understand what it is you’re testing – the ‘problem domain’. If you’re testing a smartphone for browser Javascript performance, you don’t necessarily want the device downloading your daily bitTorrents at the same time – unless you’re specifically testing for that condition.
Understanding how different devices work is key. For example, in a PC, you want to shut down all other apps – the only app running should be your benchmark test. But on a smartphone, the memory model is completely different, so rather than killing other apps, you want the phone in an app-stable state. You still don’t want it downloading updates or anything else, but ensuring the device is in a ‘steady state’ is important, particularly for repeatability. Again, having this ‘domain knowledge’ makes a difference.
MULTIPLE TEST RUNS
For whatever reason, you are not going to get the exact same result every time you run a benchmark test – for example, system processes can pop up for unexplained reasons at different times. That means it can be dangerous to simply run a test once, grab your run-time result and be on your way. At a minimum, you should run tests three times. There are various options you can then choose from for creating the final result – average all three results; always take the slowest, fastest or middle result; or drop the one that’s furthest away from the other two (the ‘outlier’) and average the remaining two.
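Two of the combining options just described can be sketched in a few lines of Java. The method and class names here are illustrative, not from the UserBench source:

```java
import java.util.Arrays;

// Illustrative helpers for combining three benchmark run-times.
public class RunCombiner {

    // Option: drop the result furthest from the other two (the outlier)
    // and average the remaining two.
    static double dropOutlierAndAverage(long a, long b, long c) {
        long[] r = { a, b, c };
        Arrays.sort(r);
        // After sorting, the outlier sits at whichever end is further
        // from the middle value.
        if (r[2] - r[1] > r[1] - r[0]) {
            return (r[0] + r[1]) / 2.0; // slowest run is the outlier
        }
        return (r[1] + r[2]) / 2.0;     // fastest run is the outlier
    }

    // Option: always take the middle (median) result.
    static long middleResult(long a, long b, long c) {
        long[] r = { a, b, c };
        Arrays.sort(r);
        return r[1];
    }
}
```

Whichever scheme you choose, the important thing is to pick one and apply it consistently across every device you test.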
REFERENCE POINT
Having the raw run-time in milliseconds from your test runs is excellent, but in terms of telling the story of performance to another stakeholder, it may not necessarily help. As we’ve said, benchmark testing is almost always about comparison – whether it’s comparing between two devices or products, or testing single devices and keeping a long-running on-going record. Unless you have multiple devices to test, comparing against a known reference can also give you an excellent starting point to discuss that performance.
A reference point is a known marker on which your benchmark has been tested and the result recorded. For example, UserBench Encode HD had a reference point set against a 2GHz Intel Pentium 4 desktop PC – yep, that’s ancient history dug up from the sands of Egypt kind-of stuff today, but a well-enough-known PC standard nonetheless. We first ran the benchmark test against this system on multiple occasions and set the averaged run-time as a reference point of 10.000. That run-time then became the reference point against which new devices under test were compared. If a PC scored 84.07 on UserBench Encode HD, for example, we knew immediately that system was 8.407-times faster than the 2GHz Pentium 4 reference on that test.
Here’s how it works from a coding perspective: you run your test on your reference device and it takes, say, 30 seconds to complete – this now becomes your base reference point, which you set as 10.000, and that 30 seconds becomes the baseline scaler.
You can now run the test on a new device you’re reviewing and say it takes 20 seconds. Divide your 30-second reference by the new 20-second time and you get 1.5. That tells you the new device is 50% faster than the reference. But as we’ve set a base reference point of 10.000, we can multiply the result by 10 to give a score of 15.000. Whatever you set your base reference score to is really up to you – we chose 10.000 for UserBench Encode HD because we wanted to differentiate it from the various benchmark sub-test component scores, which were referenced to 1.000. As another example, GeekBench 3.0 chose 2500.00 as their reference score against an Intel Core i5-2520M CPU.
If you now test a second device later without access to the first, it’s not a problem. Say a new device gets a run-time result of 15 seconds – we compare that against the base reference score and end up with a score of 20.000, meaning this new device is twice as fast as the reference. But we can also compare this new device with the previous one – dividing the 20-second runtime of the previous device by the 15-second time here gives us 1.333, meaning this new device is 33.3% faster than the previous unit.
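The scoring arithmetic above boils down to one line of code. This sketch uses the 30-second reference and 10.000 base score from the worked example:

```java
// Converts a raw run-time into a reference-relative score,
// using the figures from the worked example in the text.
public class ReferenceScore {

    // Averaged run-time of the reference machine on this test, in
    // milliseconds (30 seconds, as in the example above).
    static final double REFERENCE_MS = 30_000.0;

    // The arbitrary score assigned to the reference machine.
    static final double BASE_SCORE = 10.000;

    // Faster devices finish sooner, so we divide the reference
    // run-time by the device's run-time, then scale by the base score.
    static double score(double runTimeMs) {
        return REFERENCE_MS / runTimeMs * BASE_SCORE;
    }

    public static void main(String[] args) {
        System.out.println(score(20_000)); // 20-second device scores 15.0
        System.out.println(score(15_000)); // 15-second device scores 20.0
    }
}
```

Because every score is anchored to the same reference, any two devices can be compared directly by dividing their scores, even if they were tested years apart.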
BENCHMARK EXAMPLE
To bring this all together, I’ve coded up a very simple little benchmark app called ‘Simple Pi Benchmark’. It takes the Riemann zeta function and runs it through one billion iterations. Those iterations are timed using both System.currentTimeMillis() and System.nanoTime() to show the differences these two timing methods can give in practice. The benchmark runs the Riemann zeta function in a separate thread and can be aborted at any time.
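The full project is on the website, but the core idea can be sketched in a few lines. This sketch assumes the benchmark sums the zeta function at s = 2, whose limit is pi-squared over six (the Basel problem) – the class name and structure here are illustrative, not the actual Simple Pi Benchmark source:

```java
// Illustrative sketch of a zeta-based pi benchmark, timed with
// both clocks to compare them in practice.
public class PiSketch {

    // Sums the first n terms of the zeta function at s = 2, which
    // converges to pi^2 / 6, so pi is approximately sqrt(6 * sum).
    static double estimatePi(long n) {
        double sum = 0;
        for (long k = 1; k <= n; k++) {
            sum += 1.0 / (k * k);
        }
        return Math.sqrt(6.0 * sum);
    }

    public static void main(String[] args) {
        long millisStart = System.currentTimeMillis();
        long nanoStart = System.nanoTime();

        double pi = estimatePi(1_000_000_000L); // one billion iterations

        long nanoRunMs = (System.nanoTime() - nanoStart) / 1_000_000;
        long millisRunMs = System.currentTimeMillis() - millisStart;

        System.out.println("pi estimate: " + pi);
        System.out.println("currentTimeMillis() run-time: " + millisRunMs + " ms");
        System.out.println("nanoTime() run-time:          " + nanoRunMs + " ms");
    }
}
```

On most runs the two timings will agree to within a few milliseconds, but only nanoTime() is guaranteed not to jump if the system clock is adjusted mid-test.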
On my ageing 3.2GHz Intel Core i5 2300 desktop PC, the test finishes in (more or less) 12.655 seconds – we’ve set that as the reference score of 10.000. If you run the test on your system and get a final score of 20, your system is twice as fast as mine.
GETTING THE SOURCE CODE
You’ll find the Simple Pi Benchmark source project files on our website at http://apcmag.com/magstuff. If you haven’t already, download and install the NetBeans IDE and Java SE Software Development Kit (SDK) bundle from Oracle’s website (http://tinyurl.com/apc429-bundle). Next, grab the downloaded source file and unzip the outer file only, launch NetBeans, select File, Import Project, From ZIP and choose the inner ‘PcBenchmark’ zip file. Run it. If you’ve been following this series for a while, the code should be fairly easy to understand.
ALL ABOUT COMPARISONS
Performing a benchmark test in isolation isn’t going to tell you much. Almost always, you want to compare the results with another device or product to better understand the two. Particularly in PC hardware, having a known reference point provides some perspective that raw scores may not, allowing you to more quickly gauge comparative performance. The same thing goes for whatever you’re testing – starting with and comparing against a known reference can make it easier to explain performance differences.
It’s a system that’s easy to implement in code and has served me well for years.