POH Tuning (Part 5 - Preliminary results)



Top level result

After running, we can use the Jupyter notebook to analyze the result. The result is surprising. To make the data easy to analyze, they are available as pandas data frame. For those who are unfamiliar with pandas, a data frame is really just a table.

To use the notebook to get to the data frame, we need to run the first cell (as it required to setup the functions), and then we can run the cell calling the get_test_metrics_numbers_for_jupyter function, there should be exactly one such cell. And this cell will give us the run_data_frame variable.

With that, running this command is going to give us the key metrics.

run_data_frame[["benchmark_name", "PctTimePausedInGC", "speed", "HeapSizeBeforeMB_Mean", "HeapSizeAfterMB_Mean"]]
benchmark_name PctTimePausedInGC speed HeapSizeBeforeMB_Mean HeapSizeAfterMB_Mean
2gb_pinning 83.886439 56.585445 4009.213205 4009.053025
2gb_poh 60.832619 27.530971 2883.491157 2947.730790

At a glance, the speed is significantly reduced despite all other metrics shows improvement. This is just weird.

The weird speed metric

Upon further investigation, I figured out that the speed metric reported by the infrastructure is really just the geometric mean of FirstToLastGCSeconds and PauseDurationMSec_95P. Both of them are measured in seconds, so the value is the smaller the better. So in fact, we are showing an improvement in the timing aspect as well.

What caused the improvement?

Apparently we are spending much less time in GC. Let’s take a look at the number of GCs first.

run_data_frame[["benchmark_name", "TotalNumberGCs", "CountIsGen0", "CountIsGen1", "CountIsBackground", "CountIsBlockingGen2"]]
benchmark_name TotalNumberGCs CountIsGen0 CountIsGen1 CountIsBackground CountIsBlockingGen2
2gb_pinning 683 557 117 0 9
2gb_poh 435 278 149 0 8

Looking at the count data, we realize that there is much less gen 0 GCs. Sadly, while the data tell us there is reduced number of gen 0 GC, we do not understand why from the data. In a future post, I am going to talk about the how to analyze why do we have a significantly reduced number of GCs.

Are we just observing an effect from chance?

As with any scientific studies, it is important to note that a single experiment result doesn’t mean much. If it could not be reproduced or it could have occurred by chance, then it isn’t particularly meaningful. To that end, I did an experiment and run the pair of benchmarks multiple times.

Here are the top level metrics for 5 pairs:

benchmark_name PctTimePausedInGC speed HeapSizeBeforeMB_Mean HeapSizeAfterMB_Mean
2gb_pinning 84.152044 55.317034 4018.010300 4017.822676
2gb_pinning 84.332101 54.444867 4021.135100 4021.038674
2gb_pinning 84.561195 56.717987 4024.397596 4024.306514
2gb_pinning 84.601454 52.776987 4006.901215 4006.789502
2gb_pinning 84.410538 53.956592 4006.952758 4006.901196
2gb_poh 61.779599 26.968380 2926.503793 2960.753005
2gb_poh 60.978760 26.036596 2890.863907 2940.875974
2gb_poh 60.329187 22.559079 2900.956131 2956.372021
2gb_poh 61.475650 26.602588 2890.614196 2941.809857
2gb_poh 60.654861 22.511328 2876.809548 2932.633737

and here are the GC counts

benchmark_name TotalNumberGCs CountIsGen0 CountIsGen1 CountIsBackground CountIsBlockingGen2
2gb_pinning 595 491 95 0 9
2gb_pinning 611 501 101 0 9
2gb_pinning 614 510 95 0 9
2gb_pinning 611 502 100 0 9
2gb_pinning 611 504 98 0 9
2gb_poh 390 243 139 0 8
2gb_poh 412 259 145 0 8
2gb_poh 480 305 167 0 8
2gb_poh 414 259 147 0 8
2gb_poh 484 310 166 0 8

In this case, the data is fairly obvious, the observation we saw is fairly stable. To make this even more concrete, we could use the Students’ T test. Here I show that the decrease of number of gen 0 GC is statistically significant. To perform the test, we run these (it is not currently in the Jupyter notebook, but it could be)

from scipy.stats import ttest_ind

cat1 = run_data_frame[run_data_frame['benchmark_name']=='2gb_pinning']
cat2 = run_data_frame[run_data_frame['benchmark_name']=='2gb_poh']

ttest_ind(cat1['CountIsGen0'], cat2['CountIsGen0'], equal_var=False)

The code produced this result:

Ttest_indResult(statistic=16.317779324341963, pvalue=4.106432582405551e-05)

The statistics is just a number used in the calculation. The key is a small p-value. Basically, the p-value is the probability of observing the experimental result if we assume the number of gen0 collection in both cases is equal. Since the probability is so small, we choose to believe otherwise, that the number of gen0 collections is indeed different.

For those who are familiar with statistics, here are some fine details for the test. We used unequal variance because we see and obvious flucation in the gen 0 counts in the pinned handle case but not in the pinned object heap case. The p-value outputted by scipy is a two-tailed p-value. Since we wanted to prove that the number of gen 0 is indeed smaller, we should have used a one-tailed p-value instead, which should be the number divided by 2, but they are small anyway.