===============
Timing analysis
===============

As cryptographic implementations process secret data, they need to ensure
that side effects of processing that data do not reveal information about
the secret data.

When an implementation takes different amounts of time to process messages,
we consider it a timing side-channel. When such side-channels reflect the
contents of the processed messages, we call them timing oracles.

One of the oldest timing oracles is the attack described by Daniel
Bleichenbacher against RSA key exchange. You can test for it using the
`test-bleichenbacher-timing.py
<https://github.com/tomato42/tlsfuzzer/blob/master/scripts/test-bleichenbacher-timing.py>`_
script.

The other is related to de-padding and verifying MAC values in CBC ciphertexts,
the newest iteration of which is called Lucky Thirteen. You can test for it
using the
`test-lucky13.py
<https://github.com/tomato42/tlsfuzzer/blob/master/scripts/test-lucky13.py>`_
script.

Environment setup
=================

As the scripts measure the time it takes a server to reply to a message,
a server running alone on a machine, with no interruptions from other
services or processes, will provide statistically significant results with
the fewest observations.

Hardware selection
------------------

You will want a server with at least 3 physical cores: one to run
the OS, the tlsfuzzer script, etc., one to run the tcpdump process (to ensure
consistent timestamping of captured packets), and one to run the system under
test (to ensure consistent response times).

While you can run the tests against a network server, this manual
doesn't describe how to ensure low latency and low jitter
to such a system under test.

It's better to use a desktop or server system with sufficient cooling, as
thermal throttling is common for laptops running heavy workloads, resulting
in jitter and overall inconsistent results.

OS configuration
----------------

To ensure the lowest level of noise in measurement, configure the
system to isolate cores for tcpdump and the system under test.

Red Hat Enterprise Linux 8
^^^^^^^^^^^^^^^^^^^^^^^^^^

To isolate CPUs on RHEL-8, install the following packages:

.. code:: bash

   dnf install -y tuned tuned-utils tuned-profiles-cpu-partitioning

And add the following code to the ``/etc/tuned/cpu-partitioning-variables.conf``
file:

.. code::

   isolated_cores=2-10
   no_balance_cores=2-10

Then apply the profile:

.. code:: bash

   tuned-adm profile cpu-partitioning

and restart the system to apply the changes to the kernel.

Then you can install the tlsfuzzer dependencies to speed up test execution:

.. code:: bash

   dnf install python3 python3-devel tcpdump gmp-devel swig mpfr-devel \
       libmpc openssl-devel make gcc gcc-c++ git libmpc-devel python3-six
   pip3 install m2crypto gmpy2
   pip3 install --pre tlslite-ng

And the general requirements to collect and analyse timing results:

.. code:: bash

   pip install -r requirements-timing.txt

.. note::

   Because the tests use packet capture to collect timing information and
   they buffer the messages until all of them have been created, the use
   of ``m2crypto`` and ``gmpy2`` does not have an effect on the collected
   data points; using them will only make tlsfuzzer run the tests at a higher
   frequency.

Testing theory
==============

Because the measurements the test performs are statistical by nature,
the scripts can't simply take the mean of observations from one test and
compare it with the mean from another test; that would not provide
quantifiable results. This is because the measurements don't follow
a simple and well-defined distribution; in many cases they are
`multimodal
<https://en.wikipedia.org/wiki/Multimodal_distribution>`_
and not `normal <https://en.wikipedia.org/wiki/Normal_distribution>`_.
That means that the scripts need to use statistical tests to check whether the
observations differ significantly or not.

Most statistical tests work in terms of hypothesis testing.
The scripts use the
`Wilcoxon signed-rank test
<https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test>`_
and the
`Sign test
<https://en.wikipedia.org/wiki/Sign_test>`_ to compare samples.
After executing a test against two sets of observations (samples), it outputs
a "p-value": the probability of getting such samples if they were taken from
the same population.
A high p-value (close to 1) means that the samples likely came from the
same source, while a small value (close to 0, smaller than 0.05) means
that it's unlikely that they came from the same source distribution.
Generally, the scripts assume that p-values below 0.05 mean that the values
came from different distributions, i.e. the server behaves differently
for the two provided inputs.
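
For illustration, this is how the two tests can be run directly on a pair of
synthetic, paired samples using ``scipy`` (a minimal sketch of the statistics
only, not the tlsfuzzer implementation; the sample data is made up):

.. code:: python

   # Sketch: Wilcoxon signed-rank test and sign test on synthetic paired
   # samples; not tlsfuzzer code, scipy is used directly.
   import numpy as np
   from scipy import stats

   rng = np.random.default_rng(42)
   baseline = rng.normal(100e-6, 5e-6, size=1000)              # ~100 µs responses
   slower = baseline + rng.normal(0, 5e-6, size=1000) + 50e-9  # 50 ns extra

   # Wilcoxon signed-rank test on the paired samples
   w_stat, w_pvalue = stats.wilcoxon(baseline, slower)

   # Sign test: count positive differences and test against a fair coin
   diffs = baseline - slower
   positive = int(np.sum(diffs > 0))
   nonzero = int(np.sum(diffs != 0))
   s_pvalue = stats.binomtest(positive, nonzero, p=0.5).pvalue

   print("Wilcoxon p-value: {0:.4f}, sign test p-value: {1:.4f}".format(
       w_pvalue, s_pvalue))
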
But when many tests are performed, some small p-values are expected even if
all the samples were taken from the same distribution, so you need to check
that such values are no more common than expected.
If the samples did indeed come from the same population, then the distribution
of p-values will follow a
`uniform distribution
<https://en.wikipedia.org/wiki/Uniform_distribution_(continuous)>`_ with
values between 0 and 1.
You can use this property not only to check that failures (small p-values)
occur no more often than expected, but also to check for more general
inconsistency in the p-values (as a higher probability of small p-values means
that large p-values occur less often than expected).
The scripts perform the
`Kolmogorov–Smirnov test
<https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test>`_ to test
the uniformity of the p-values from the Wilcoxon tests and the sign tests.
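
As an illustration of this uniformity check (again a sketch using ``scipy``,
not the tlsfuzzer implementation), a collection of p-values can be tested
like this:

.. code:: python

   # Sketch: check whether a set of p-values is consistent with a uniform
   # distribution on [0, 1]; not the tlsfuzzer implementation.
   import numpy as np
   from scipy import stats

   rng = np.random.default_rng(1)
   p_values = rng.uniform(0, 1, size=200)  # stand-in for pairwise test p-values

   ks_stat, ks_pvalue = stats.kstest(p_values, "uniform")
   # A small p-value here means the p-values are *not* uniformly distributed,
   # i.e. the pairwise results are inconsistent with the hypothesis that all
   # samples come from the same distribution.
   print("KS statistic: {0:.4f}, p-value: {1:.4f}".format(ks_stat, ks_pvalue))
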
The test scripts allow setting the sample size, as it has an impact on the
smallest effect size that the test can detect.
Generally, with both of the used tests, the sample size must be proportional
to 1/e² to detect an effect of size e.
That is, to detect a 0.1% difference between expected values of samples, the
samples must have at least 1000 observations each.
The actual number depends on multiple factors (including the particular
samples in question), but it's a good starting point.

Also, it means that if you wish to decrease the reported confidence interval
by a factor of 10, you must execute the script with 100 times as many
repetitions (as 10²=100).

Note that this effect size is relative to the magnitude of any single
observation, while things like the size of the pre-master secret
or the size of the MAC are constant; thus configuring the server to use fast
ciphers and small key sizes for RSA will make the test detect smaller
(absolute) effect sizes, if they exist.

Finally, the scripts take the pair of samples most dissimilar to each other
and estimate the difference and the 99% confidence interval for the difference
to show the estimated effect size.

You can also use the following
`R
<https://www.r-project.org/>`_ script to calculate the confidence intervals
for the difference between a given pair of samples using the Wilcoxon test:

.. code::

   df <- read.csv('timing.csv', header=F)
   data <- df[,2:length(df[1,])]
   # print headers (names of tests)
   df[,1]
   # run Wilcoxon signed-rank test between second and third sample,
   # report 99% confidence interval for the difference:
   wilcox.test(as.numeric(data[2,]), as.numeric(data[3,]), paired=T, conf.int=T, conf.level=0.99)
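
If you prefer Python, a rough equivalent of such an estimate can be
bootstrapped with ``numpy``. The snippet below is only a sketch on synthetic
data; it does not read ``timing.csv`` and makes no assumption about that
file's layout:

.. code:: python

   # Sketch: bootstrap a 99% confidence interval for the median of paired
   # differences between two samples (synthetic data only).
   import numpy as np

   rng = np.random.default_rng(7)
   sample_a = rng.normal(100e-6, 5e-6, size=5000)
   sample_b = sample_a + 20e-9 + rng.normal(0, 5e-6, size=5000)
   diffs = sample_b - sample_a

   boot_medians = np.array([
       np.median(rng.choice(diffs, size=diffs.size, replace=True))
       for _ in range(5000)
   ])
   low, high = np.quantile(boot_medians, [0.005, 0.995])
   print("median difference: {0:.3e} s, 99% CI: ({1:.3e}, {2:.3e})".format(
       np.median(diffs), low, high))
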
To put this into practical terms, a run with 10000 observations, checking a
server with a 100µs response time, will not detect a timing side channel
that's smaller than 0.01µs (40 cycles on a 4GHz CPU).

Running the tests
=================

To run the tests:

1. Select a machine with sufficient cooling and a multi-core CPU
2. Use the methods mentioned before to create isolated cores; watch out for
   hyperthreading
3. For RSA tests use a small key (1024 bit); for CBC tests use a fast cipher
   and hash
4. Start the server on one of the isolated cores, e.g.:

   .. code::

      taskset --cpu-list 2,3 openssl s_server -key key.pem -cert cert.pem -www

5. Start the test script, providing the IDs of different isolated cores:

   .. code::

      PYTHONPATH=. python3 scripts/test-lucky13.py -i lo --repeat 100 --cpu-list 4,5

6. Wait (a long time)
7. Inspect the summary of the analysis, or move the test results to a host
   with a newer Python and analyse them there.

.. note::

   Since both using pinned cores and collecting packets requires root
   permissions, execute the previously mentioned commands as root.

.. warning::

   The tests use ``tcpdump`` to collect packets to a file and analyse it
   later.
   To process tests with a large ``--repeat`` parameter, you need a machine
   with a large amount of disk space: at least 350MiB for 20 tests at
   10000 repeats.

Test argument interface
-----------------------

Any test that collects timing information provides the following
argument interface. Specifying the network interface that packet capture
should listen on is enough to time the tests.

================ ========== ==================================================
Argument         Required   Description
================ ========== ==================================================
``-i interface`` Yes        Interface to run tcpdump on
``-o dir``       No         Output directory (default ``/tmp``)
``--repeat rep`` No         Repeat each test ``rep`` times (default 100)
``--cpu-list``   No         Core IDs to use for running tcpdump (default none)
================ ========== ==================================================

Executing the test, extraction and analysis
-------------------------------------------

Tests can be executed the same way as any non-timing tests; just make sure the
current user has permissions to run tcpdump, or use sudo. As an example, the
Bleichenbacher test is extended to use the timing functionality:

.. code:: bash

   sudo PYTHONPATH=. python scripts/test-bleichenbacher-timing.py -i lo

By default, if the ``dpkt`` dependency is available, the extraction will run
right after the timing packet capture.

In case you want to run the extraction on another machine (e.g. you were not
able to install the optional dependencies), you can do this by providing the
log, the packet capture, and the server port and hostname (or IP) to the
extraction script. The resulting file will be written to the specified folder.

.. code:: bash

   PYTHONPATH=. python tlsfuzzer/extract.py -h localhost -p 4433 \
       -c capture.pcap -l log.csv -o /tmp/results/

The timing runner will also launch the analysis, if its dependencies are
available.
Again, in case you need to run it later, you can do that by providing the
script with the output folder where the extraction step put the ``timing.csv``
file.

.. code:: bash

   PYTHONPATH=. python tlsfuzzer/analysis.py -o "/tmp/results"

With large sample sizes, to avoid exhausting available memory and to speed up
the analysis, you can skip the generation of some graphs using the
``--no-ecdf-plot``, ``--no-scatter-plot`` and ``--no-conf-interval-plot``
options.
The last option also disables generation of the ``bootstrapped_means.csv``
file.

External timing data
--------------------

The ``extract.py`` script can also process data collected by some external
source (be it a packet capture closer to the server under test or an internal
probe inside the server).

The provided csv file must have a header and one column. While the file
can contain additional data points at the beginning, the very last
data point must correspond to the last connection made by tlsfuzzer.

Place such a file (in this example named ``timings-log.csv``) in the directory
with the ``log.csv`` file and execute:

.. code:: bash

   PYTHONPATH=. python tlsfuzzer/extract.py -l /tmp/results/log.csv \
       -o /tmp/results --raw-times /tmp/results/timings-log.csv
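
For reference, such a file could be produced from a list of externally
collected raw times like this (a sketch; the column name is arbitrary, only
the header-plus-single-column layout matters):

.. code:: python

   # Sketch: write externally collected raw times in the expected layout:
   # one header line followed by one value per row.  The header text itself
   # is arbitrary.
   import csv

   raw_times = [0.000102, 0.000099, 0.000101]  # seconds, one per connection

   with open("/tmp/results/timings-log.csv", "w", newline="") as out:
       writer = csv.writer(out)
       writer.writerow(["raw times"])
       for value in raw_times:
           writer.writerow([value])
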
.. warning::

   The above mentioned command will overwrite the timings extracted from the
   ``capture.pcap`` file!

Then run ``analysis.py`` as in the case of data extracted from the
``capture.pcap`` file:

.. code:: bash

   PYTHONPATH=. python tlsfuzzer/analysis.py -o "/tmp/results"

Combining results from multiple runs
------------------------------------

You can use the ``combine.py`` script to combine the results from multiple
runs.
The script checks if the set of executed probes matches in all the files,
but you need to ensure that the environments of the test executions match
too.

To combine the runs, provide the output directory (``out-dir`` here) and
paths to one or more ``timing.csv`` files:

.. code:: bash

   PYTHONPATH=. python tlsfuzzer/combine.py -o out-dir \
       in_1596892760/timing.csv in_1596892742/timing.csv

.. warning::

   The script overwrites the ``timing.csv`` in the output directory!

After combining the ``timing.csv`` files, execute the analysis as usual.

.. tip::

   ``combine.py`` is the only script able to read the old format of
   ``timing.csv`` files. Use it with a single input file to convert from the
   old file format (where all results for a given probe were listed in a
   single line) to the new file format (where all results for a given probe
   are in a single column).

Interpreting the results
========================

You should start the inspection of test results with the ``scatter_plot.png``
graph. It plots all of the collected connection times. There is also a
zoomed-in version that will be much more readable in the presence of much
larger outliers. You can find it in the ``scatter_plot_zoom_in.png`` file.
If you can see that there is a periodicity to the collected measurements, or
the values fall into several similar-looking groups, that means that
the data is
`autocorrelated
<https://en.wikipedia.org/wiki/Autocorrelation>`_ (or, in other words,
not independent) and simple summary statistics like the
mean, median, or quartiles are not representative of the samples.

The next set of graphs shows the overall shape of the samples.
The ``box_plot.png`` shows the 5th
`percentile
<https://en.wikipedia.org/wiki/Percentile>`_, 1st `quartile
<https://en.wikipedia.org/wiki/Quartile>`_, median, 3rd
quartile and 95th percentile.

The ``ecdf_plot.png`` shows the `measured (that is, empirical) cumulative
distribution function
<https://en.wikipedia.org/wiki/Empirical_distribution_function>`_.
The ``ecdf_plot_zoom_in.png`` shows only the values between the 1st and 95th
percentiles, which is useful in case of a few very large outliers.
The "steps" visible in the graph indicate whether the distribution is
unimodal (like the common normal distribution) or
`multimodal
<https://en.wikipedia.org/wiki/Multimodal_distribution>`_.
Multimodality is another property that makes simple summary statistics
like the mean or median not representative of the sample.

To compare autocorrelated samples we need to compare the differences
between pairs of samples.
The ``diff_scatter_plot.png`` shows the differences of all the samples
when compared to the first sample (numbered 0).
The ``diff_ecdf_plot.png`` is the ECDF counterpart to the scatter plot.
Here, if the graph is
`symmetrical
<https://en.wikipedia.org/wiki/Symmetric_probability_distribution>`_, then the
results from the Wilcoxon signed-rank test are meaningful. If the graph
is asymmetric, focus on the sign test results.
The ``diff_ecdf_plot_zoom_in_98.png``, ``diff_ecdf_plot_zoom_in_33.png``,
and ``diff_ecdf_plot_zoom_in_10.png`` show just the central 98%, 33%, and 10%
of the graph, respectively (to make estimating small differences
between samples easier).

Finally, the ``conf_interval_plot_mean.png``,
``conf_interval_plot_median.png``, ``conf_interval_plot_trim_mean_05.png``,
``conf_interval_plot_trim_mean_25.png``, and ``conf_interval_plot_trimean.png``
show the mean, median, trimmed mean (5%), trimmed mean (25%), and trimean,
respectively, of the differences between samples, together with a
`bootstrapped
<https://en.wikipedia.org/wiki/Bootstrapping_(statistics)>`_ confidence
interval for them.
For an implementation without a timing side channel present, all the graphs
should intersect the horizontal 0 line.
If a graph does not intersect the 0 line, then its distance from the 0 line,
measured in multiples of the graph's own height, suggests how strong the
confidence in the presence of a side channel is, on an exponential scale.

As mentioned previously, the script executes tests in three stages: the first
is the Wilcoxon signed-rank test and the sign test between all the samples,
the second is the uniformity test of those results, and the third is the
Friedman test.

.. warning::

   The implementation of the Friedman test uses an approximation based on the
   Chi-squared distribution. That means its results are reliable only with
   many samples (at least 5, optimally 10); you should ignore it for smaller
   runs. It's also invalid in the case of just two samples (used
   conversations).

The sign test is performed in three different ways: the default, used for
determining the presence of the timing side-channel, is the two-sided variant,
saved in the ``report.csv`` file as the ``Sign test``. The two other ways,
the ``Sign test less`` and ``Sign test greater``, test the hypothesis that
one sample stochastically dominates the other. High p-values here aren't
meaningful (i.e. you can get a p-value == 1 even if the alternative is not
statistically significant at alpha=0.05).
Very low values of the ``Sign test less`` mean that the *second* sample
is unlikely to be smaller than the *first* sample.
Those tests are more sensitive than the confidence intervals for the median,
so you can use them to test whether the timing signal depends on some
parameter, like the length of the pre-master secret in RSA key exchange or the
position of the first mismatched byte in the CBC MAC.
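
The mechanics of the two-sided and one-sided sign tests can be illustrated
with ``scipy`` like this (a sketch on synthetic data, not the tlsfuzzer
implementation; see above for how tlsfuzzer maps the ``less`` and ``greater``
variants to the samples):

.. code:: python

   # Sketch: two-sided and one-sided sign tests on paired samples;
   # synthetic data, not the tlsfuzzer implementation.
   import numpy as np
   from scipy import stats

   rng = np.random.default_rng(3)
   first = rng.normal(100e-6, 5e-6, size=2000)
   second = first + 30e-9 + rng.normal(0, 5e-6, size=2000)

   # the sign test reduces to a binomial test on the signs of the differences
   diffs = first - second
   positive = int(np.sum(diffs > 0))
   nonzero = int(np.sum(diffs != 0))

   two_sided = stats.binomtest(positive, nonzero, p=0.5).pvalue
   less = stats.binomtest(positive, nonzero, p=0.5, alternative="less").pvalue
   greater = stats.binomtest(positive, nonzero, p=0.5,
                             alternative="greater").pvalue
   print("two-sided: {0:.4f}, less: {1:.4f}, greater: {2:.4f}".format(
       two_sided, less, greater))
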
The code also calculates the
`dependent t-test for paired samples
<https://en.wikipedia.org/wiki/Student%27s_t-test#Dependent_t-test_for_paired_samples>`_,
but as the timings generally don't follow the normal distribution, it severely
underestimates the difference between samples (it is strongly influenced by
outliers). Its results are not taken into account when deciding the failure of
the overall timing test.

If either the KS tests of uniformity of p-values or the Friedman test fails,
you should inspect the individual test p-values.

If one particular set of tests consistently scores low when compared to
other tests (e.g. "very long (96-byte) pre master secret" and
"very long (124-byte) pre master secret"
from ``test-bleichenbacher-timing.py``) but high when compared with each other,
that strongly points to a timing side-channel in the system under test.

If the timing signal has a high relative magnitude (one set of tests
slower than another set by 10%), then you can also use the generated
``box_plot.png`` graph to see it.
For small differences with large sample sizes, the differences will be
statistically detectable, even if not obvious from the box plot.

You can use the ``conf_interval_plot*.png`` graphs to see the differences
between each sample and the first sample, together with the 95% confidence
interval for them.

The script prints the numerical value of the confidence interval for the mean,
median, trimmed mean (with 5% of observations on either end ignored), trimmed
mean (with the 25% smallest and largest observations ignored), and trimean of
the differences of the pair of two most dissimilar probes.
It also writes them to the ``report.txt`` file.

The ``report.csv`` file includes the exact p-values for the statistical
tests executed as well as the calculated descriptive statistics of the
distribution of differences: the mean, standard deviation (SD), median,
interquartile range (IQR), as well as the
`median absolute deviation
<https://en.wikipedia.org/wiki/Median_absolute_deviation>`_ (MAD).
Note that the mean and SD are very sensitive to outliers; the other three
measures are more robust. The calculated MAD already includes the conversion
factor, so for a normal distribution it can be compared directly to the SD.
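
For example, the normal-consistent MAD (with the conversion factor applied)
can be computed with ``scipy`` like this (a sketch, not the code used by the
analysis script):

.. code:: python

   # Sketch: MAD with the normal-consistency factor (about 1.4826) applied,
   # which makes it directly comparable to the SD for normally distributed
   # data; not the tlsfuzzer implementation.
   import numpy as np
   from scipy import stats

   rng = np.random.default_rng(5)
   diffs = rng.normal(0, 1e-6, size=10000)

   sd = np.std(diffs)
   mad = stats.median_abs_deviation(diffs, scale="normal")
   print("SD: {0:.3e}, scaled MAD: {1:.3e}".format(sd, mad))
   # the two values are similar for (synthetic) normally distributed data
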
The ``sample_stats.csv`` file includes the calculated mean, median, and MAD
for the samples themselves (i.e. not the differences between samples).
You can use this data to estimate the smallest detectable difference between
samples for a given sample size.

Using R you can also manually generate the ``conf_interval_plot_mean.png``
graph, but note that this will take about an hour for 21 tests and
samples with 1 million observations each on a 4 core/8 thread 2GHz CPU:

.. code::

   library(tidyr)
   library(ggplot2)
   library(dplyr)
   library(data.table)
   library(boot)

   df <- fread('timing.csv', header=F)
   data <- data.frame(t(df[,2:length(df[1,])]))
   colnames(data) <- as.matrix(df[,1:10])[,1]
   df <- 0
   R = 5000

   rsq <- function(data, indices) {
     d <- data[indices]
     return(mean(d, trim=0.25))
   }

   data2 = replicate(R, 0)
   data2 = cbind(data2)
   date()
   for (i in c(2:length(data[1,]))) {
     a = boot(data[,1]-data[,i], rsq, R=R, parallel="multicore",
              simple=TRUE, ncpus=8)
     data2 = cbind(data2, a$t)
   }
   date()
   data2 = data.frame(data2)
   data2 %>% gather(key="MeasureType", value="Delay") %>%
     ggplot( aes(x=factor(MeasureType, level=colnames(data2)), y=Delay,
                 fill=factor(MeasureType, level=colnames(data2)))) +
     geom_violin() + xlab("Test ID") +
     ylab("Trimmed mean of differences [s]") + labs(fill="Test ID")
   colnames(data)

Writing new test scripts
========================

The ``TimingRunner`` repeatedly runs tests with
``tcpdump`` capturing packets in the background.
The timing information is then extracted from that ``tcpdump`` capture;
only the response time to the last client message is extracted from
the capture.

Test structure
--------------

After processing the arguments described above, one would proceed to write the
test as usual, probably adding a ``sanity`` test case and test cases relating
to the feature under test. The example script ``test-conversation.py`` can be
used as a starting point.

Once it is clear that all the tests pass, the timing run can be executed.
Please note that any tests with the ``sanity`` prefix will be ignored in the
timing run.

Start by importing the ``TimingRunner`` class.
Because the timing information collection adds some extra dependencies, it is
necessary to wrap everything related to timing in an if statement:

.. code:: python

   if TimingRunner.check_tcpdump():

Now, the ``TimingRunner`` class can be initialized with the name of
the currently run test, the list of conversations
(``sampled_tests`` in the reference scripts),
the output directory (the ``-o`` argument), the TLS server host and port, and
finally the network interface from the ``-i`` argument.

The next step is to generate a log with a random order of test cases for each
run. This is done by calling the function ``generate_log()`` from the
``TimingRunner`` instance. This function takes the familiar ``run_only`` and
``run_exclude`` variables that can filter which tests should be run. Note that
this function will exclude any tests named "sanity". The last argument to this
function is how many times each test should be run (the ``--repeat``
argument). The log is saved in the output directory.

The last step is to call the ``run()`` function
from the ``TimingRunner`` instance in order to launch tcpdump and begin
iterating over the tests. Provided you were able to install the timing
dependencies, this will also launch the extraction, which processes the packet
capture and outputs the timing information associated with the test class into
a csv file, and the analysis, which generates a report with statistical test
results and supporting plots.
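
Putting it all together, the timing block at the end of a test script looks
roughly like the sketch below. The variable names (``sampled_tests``,
``outdir``, ``host``, ``port``, ``interface``, ``run_only``, ``run_exclude``,
``repetitions``) follow the reference scripts and are assumed to have been
set up by the earlier argument processing; treat this as an outline, not an
exact API listing:

.. code:: python

   # Rough sketch of the timing part of a test script; the variables used
   # here are assumed to come from the script's argument parsing.
   from tlsfuzzer.timing_runner import TimingRunner

   if TimingRunner.check_tcpdump():
       # name of the run, conversations to time, output directory, server
       # address and port, and the capture interface from the -i option
       timing_runner = TimingRunner("test-example-timing", sampled_tests,
                                    outdir, host, port, interface)
       # write a log with a randomised order of test cases, each repeated
       # `repetitions` times; "sanity" probes are excluded automatically
       timing_runner.generate_log(run_only, run_exclude, repetitions)
       # run the tests under tcpdump; extraction and analysis follow
       # automatically if their dependencies are installed
       ret_val = timing_runner.run()
       print("Timing runner exited with {0}".format(ret_val))
   else:
       print("Could not run timing tests: tcpdump is not usable")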