Thursday, April 26, 2018

Factor Two Ryzen Java Performance Gain

Standard benchmarks you can find on the internet are of limited value when you use your machine
for serious computations based on your own code. My machines often run for several days utilizing
all available CPU cores, so performance is a serious issue, as is power consumption considering the
insane power price we pay here in Germany.

After AMD released the Ryzen processor, I was quite disappointed by its multi-threaded
performance in my application. Recently I found the source of the problem and learned that
it can be solved. I want to share this information in case someone faces a similar issue.
It turns out that the Ryzen Threadripper 1950X, if configured and used correctly, can handle an
incredible workload. For multi-threaded Java applications that rely heavily on JNI, Java 9 and
Java 10 bring huge improvements over Java 8, something I haven't found documented
anywhere yet. These improvements are already significant for Intel CPUs, but quite dramatic for
AMD Ryzen.

Space Flight Trajectory Optimization

Since the movie "The Martian", space flight optimization has been introduced to a wider audience.

There is a yearly worldwide competition on this topic called GTOC, in which I have participated since 2010.

Optimizing trajectories requires a lot of CPU resources. How much depends on:

a) CPU
b) Operating System
c) Efficiency of the used algorithms
d) JDK/Java/JVM version - my implementation is in Java + C++ called via JNI.

In order to be competitive I always try to improve on all four fronts.

Used Hardware / Software

As operating system I use Linux Mint 18.2 (based on Ubuntu 16.04). The base optimization algorithm I
use is called "CMA Evolution Strategy" (CMA-ES), which scales very well
with the number of CPU cores and outperforms the popular particle swarm optimization algorithm.

My two Threadripper 1950X CPUs are used in NUMA (local memory) mode,
restricting the number of threads to 16 for each optimization run. By executing two optimizations
in parallel, each using 16 threads in "NUMA" mode, I got the CPU fully utilized, which was not
possible using the default "UMA" mode. In "NUMA" mode the 1950X is more or less equivalent to
two 8-core Ryzen CPUs, but with much lower combined power consumption.

As CMA-ES implementation I use a C++ variant of my Apache Commons Math contribution of the CMA-ES
algorithm, which additionally supports mirroring and multi-threading. This C++ code is called from Java
using JNI.

Flying to 2016 HO3

To reproduce the performance issue with Java 8, I used two real-world application scenarios related to
the search for a trajectory from Earth to asteroid (469219) 2016 HO3, the smallest and closest Earth
quasi-satellite.

The trajectory is computed in two phases:

1) Monte Carlo search to find the optimal start / landing time
2) Optimization of the final transfer using a low thrust (ion thruster) flight model.

Both are quite typical space trajectory design / optimization tasks.

[Figure: a computed transfer (black) from Earth (blue) to 2016 HO3 (red)]

Java performance evaluations are tricky: you need a ramp-up phase that executes the code without
measuring, to give the JVM time to optimize / compile the code, which it does dynamically after evaluating
internal performance metrics. After that, we perform 20 runs and report both the mean value and the
standard deviation of these 20 timings.
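
The measuring procedure described above could be sketched roughly as follows. This is only a minimal illustration, not the code used for the timings in this post: the class name `WarmupBenchmark`, the number of warm-up runs, the dummy workload in `main`, and the choice of the sample standard deviation (dividing by n - 1) are all assumptions.

```java
import java.util.Arrays;

public class WarmupBenchmark {

    /** Summary of the measured timings, all values in milliseconds. */
    static final class Stats {
        final double mean, stdDev, min, max;

        Stats(double mean, double stdDev, double min, double max) {
            this.mean = mean;
            this.stdDev = stdDev;
            this.min = min;
            this.max = max;
        }
    }

    /** Runs the task warmupRuns times without measuring, then times measuredRuns executions. */
    static double[] measure(Runnable task, int warmupRuns, int measuredRuns) {
        for (int i = 0; i < warmupRuns; i++) {
            task.run(); // ramp-up: gives the JIT compiler time to optimize the hot code
        }
        double[] millis = new double[measuredRuns];
        for (int i = 0; i < measuredRuns; i++) {
            long start = System.nanoTime();
            task.run();
            millis[i] = (System.nanoTime() - start) / 1e6;
        }
        return millis;
    }

    /** Computes mean, sample standard deviation (n - 1, an assumption), minimum and maximum. */
    static Stats summarize(double[] millis) {
        double mean = Arrays.stream(millis).average().orElse(Double.NaN);
        double sumSq = 0;
        for (double m : millis) {
            sumSq += (m - mean) * (m - mean);
        }
        double stdDev = millis.length > 1 ? Math.sqrt(sumSq / (millis.length - 1)) : 0.0;
        double min = Arrays.stream(millis).min().orElse(Double.NaN);
        double max = Arrays.stream(millis).max().orElse(Double.NaN);
        return new Stats(mean, stdDev, min, max);
    }

    public static void main(String[] args) {
        // Dummy workload standing in for one optimization run (hypothetical).
        Runnable task = () -> {
            double s = 0;
            for (int i = 1; i < 5_000_000; i++) {
                s += Math.sqrt(i);
            }
            if (s < 0) {
                throw new IllegalStateException("unreachable, keeps the loop from being optimized away");
            }
        };
        Stats stats = summarize(measure(task, 5, 20));
        System.out.println("mean " + stats.mean + " ms, std dev " + stats.stdDev
                + " ms, min " + stats.min + " ms, max " + stats.max + " ms");
    }
}
```

How many warm-up runs are enough depends on the workload; a simple check is to increase the count until the measured mean stops changing.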

For Java 8 we used “Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)”,
for Java 9 “OpenJDK 64-Bit Server VM (Zulu build, mixed mode)”, and
for Java 10 “Java HotSpot(TM) 64-Bit Server VM 18.3 (build 10.0.1+10, mixed mode)”.
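
Since the results depend so strongly on the JVM, it helps to log which VM a benchmark actually runs on. A minimal sketch, assuming a hypothetical class name `JvmInfo`: `java.vm.name`, `java.vm.version` and `java.version` are standard system properties, while `java.vm.info` (the "mixed mode" part) is provided by HotSpot-based VMs and may be missing elsewhere.

```java
public class JvmInfo {

    /** Builds a one-line description of the running JVM from system properties. */
    static String describe() {
        String name = System.getProperty("java.vm.name");
        String build = System.getProperty("java.vm.version");
        // Not part of the guaranteed property set; may be null on non-HotSpot VMs.
        String info = System.getProperty("java.vm.info");
        String version = System.getProperty("java.version");
        return "Java " + version + ": " + name + " (build " + build
                + (info != null ? ", " + info : "") + ")";
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Printing this line at the start of every benchmark run makes it harder to mix up timings from different Java installations.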

Monte Carlo Search Benchmark

For the Monte Carlo search we run 1000 CMA-ES optimizations, each evaluating a
5-segment Lambert arc flight path approximation 5,000 times. On Ryzen, 16 CMA-ES optimizations are executed
in parallel, where each CMA-ES optimization is single-threaded. Since I don't have a new Intel
machine, I used an older 6-core i7-6800K CPU @ 4.0GHz for comparison, where 12 optimizations
are executed in parallel. Using Java 8, the old i7-6800K easily outperforms the Ryzen @ 3.9GHz.
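
The execution scheme of this benchmark (many independent single-threaded optimizations, a fixed number of them running at the same time) can be sketched with a fixed thread pool. This is an illustration only: `ParallelSearch` is a hypothetical name, and `singleRun` is a plain random search on a toy function standing in for a real CMA-ES run with the flight path model.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSearch {

    /** Hypothetical stand-in for one single-threaded optimization: random search, NOT CMA-ES. */
    static double singleRun(long seed, int evaluations) {
        Random rnd = new Random(seed);
        double best = Double.POSITIVE_INFINITY;
        for (int i = 0; i < evaluations; i++) {
            double x = rnd.nextDouble() * 10 - 5;
            double y = rnd.nextDouble() * 10 - 5;
            best = Math.min(best, x * x + y * y); // toy objective
        }
        return best;
    }

    /** Executes 'runs' independent optimizations, at most 'threads' of them concurrently. */
    static double bestOf(int runs, int evaluations, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Double>> futures = new ArrayList<>();
            for (int r = 0; r < runs; r++) {
                final long seed = r;
                Callable<Double> job = () -> singleRun(seed, evaluations);
                futures.add(pool.submit(job));
            }
            double best = Double.POSITIVE_INFINITY;
            for (Future<Double> f : futures) {
                best = Math.min(best, f.get()); // blocks until this run has finished
            }
            return best;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // 1000 runs with 5000 evaluations each, 16 at a time, mirroring the numbers in the text.
        System.out.println("best value found: " + bestOf(1000, 5000, 16));
    }
}
```

Because the runs share no state, this kind of workload keeps all worker threads busy for the whole benchmark, which is exactly the situation where the Java 8 problem shows up.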

Results for i7-6800K CPU @ 4.0GHz

20 runs                    Mean       Std. dev.   Min        Max
Java 8                     10462 ms   64 ms       10359 ms   10595 ms
Java 9                     8137 ms    152 ms      8024 ms    8558 ms
Java 9 2nd parallel exec   16705 ms   314 ms      16298 ms   17500 ms

On the Intel processor we already see a hefty performance gain of about 25% using Java 9.

Results for AMD Ryzen Threadripper 1950X @ 3.9GHz

20 runs                    Mean       Std. dev.   Min        Max
Java 8                     12326 ms   251 ms      11717 ms   12839 ms
Java 9                     5725 ms    360 ms      5127 ms    6622 ms
Java 9 2nd parallel exec   5874 ms    389 ms      5421 ms    6970 ms
Java 10                    5906 ms    264 ms      5496 ms    6543 ms

These results show a serious problem for the Ryzen/Java 8 combination. Only with Java 9 or Java 10
can the Ryzen outperform the old Intel i7-6800K; it is then more than twice as fast as with Java 8.
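
To make the "more than twice as fast" statement concrete, the speedup is simply the ratio of the mean run times from the tables above. The small helper below (the class name `Speedup` is just for illustration) gives roughly 1.29x for the i7-6800K and roughly 2.15x (Java 9) and 2.09x (Java 10) for the 1950X.

```java
import java.util.Locale;

public class Speedup {

    /** Ratio of two mean run times; values above 1 mean the second configuration is faster. */
    static double speedup(double baselineMillis, double improvedMillis) {
        return baselineMillis / improvedMillis;
    }

    public static void main(String[] args) {
        // Mean run times in ms, copied from the Monte Carlo result tables.
        System.out.printf(Locale.ROOT, "i7-6800K, Java 8 -> 9:  %.2fx%n", speedup(10462, 8137));
        System.out.printf(Locale.ROOT, "1950X,    Java 8 -> 9:  %.2fx%n", speedup(12326, 5725));
        System.out.printf(Locale.ROOT, "1950X,    Java 8 -> 10: %.2fx%n", speedup(12326, 5906));
    }
}
```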

CMA-ES Final Optimization Benchmark

For the final transfer optimization we execute a single multi-threaded CMA-ES optimization that evaluates
an 8-segment low-thrust flight model 200,000 times, using the GraggBulirschStoerIntegrator from
Apache Commons Math. Since only the function evaluations are executed in parallel, CPU utilization
is much lower and the Java 9 advantage is less dramatic.
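
The structure "only the function evaluations run in parallel" looks roughly like the sketch below. It is not CMA-ES and not the code used here: `ParallelEvaluation` is a hypothetical, much simplified elitist evolution strategy on a toy objective, and the population size and step size rule are arbitrary assumptions. It only shows where the parallel and the serial parts sit in each generation.

```java
import java.util.Arrays;
import java.util.Random;

public class ParallelEvaluation {

    /** Toy objective (sphere function), standing in for the expensive flight model. */
    static double objective(double[] x) {
        double sum = 0;
        for (double v : x) {
            sum += v * v;
        }
        return sum;
    }

    /** Evaluates all candidates in parallel; this is the only parallel step. */
    static double[] evaluate(double[][] population) {
        return Arrays.stream(population)
                .parallel()
                .mapToDouble(ParallelEvaluation::objective)
                .toArray(); // result order matches the population order
    }

    /** Simplified elitist evolution strategy (an assumption, not CMA-ES). */
    static double[] optimize(int dim, int lambda, int generations, long seed) {
        Random rnd = new Random(seed);
        double[] mean = new double[dim];
        Arrays.fill(mean, 3.0);
        double sigma = 1.0;
        double bestValue = objective(mean);
        for (int g = 0; g < generations; g++) {
            // Serial: sample lambda candidates around the current mean.
            double[][] population = new double[lambda][dim];
            for (double[] candidate : population) {
                for (int i = 0; i < dim; i++) {
                    candidate[i] = mean[i] + sigma * rnd.nextGaussian();
                }
            }
            // Parallel: evaluate the candidates.
            double[] values = evaluate(population);
            // Serial: select the best candidate and update the search state.
            int best = 0;
            for (int k = 1; k < lambda; k++) {
                if (values[k] < values[best]) {
                    best = k;
                }
            }
            if (values[best] < bestValue) {
                bestValue = values[best];
                mean = population[best];
            } else {
                sigma *= 0.9; // shrink the step size when no improvement was found
            }
        }
        return mean;
    }

    public static void main(String[] args) {
        double[] result = optimize(8, 16, 500, 42L);
        System.out.println("objective at result: " + objective(result));
    }
}
```

While the sampling and update steps run, all but one thread are idle, so a single optimization of this kind cannot keep the CPU as busy as many independent runs can.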

Results for i7-6800K CPU @ 4.0GHz

20 runs                    Mean       Std. dev.   Min        Max
Java 8                     15551 ms   297 ms      14705 ms   16003 ms
Java 9                     14825 ms   666 ms      13123 ms   15757 ms

On the Intel processor we now see only a moderate advantage for Java 9. But regardless of the Java
version, Ryzen is superior. It would be interesting to see how newer Intel processors perform here.

Results for AMD Ryzen Threadripper 1950X @ 3.9GHz

20 runs                    Mean       Std. dev.   Min        Max
Java 8                     10895 ms   471 ms      9818 ms    12187 ms
Java 9                     9641 ms    180 ms      9313 ms    10017 ms
Java 10                    10395 ms   974 ms      9564 ms    13213 ms

The difference between Java versions is similar to the Intel results; again Java 9 is the winner.
The Java 8 / Ryzen multi-threading problem seems to be related to CPU utilization, which is much lower in
this experiment.

We also tried a new 8-core second-generation Ryzen 2700X processor, which can be clocked slightly higher than the
1950X but nevertheless showed very similar results. Of course, it makes no sense to execute a
second run in parallel on the 2700X since it has only 8 cores.

To illustrate that the algorithms used are competitive: they produced a solution
for the GTOC4 competition
that is significantly better than the solution that won the competition.