.. _time_20141024:

20141024 -- Filtering NTP
=========================

The hard part of NTP is actually not, as many people seem to think,
the PLL, but rather the filtering of the incoming phase error samples.

In difference from most other PLL or PI regulation loops the inputs
to a NTP PLL are horrible.

For one thing, they may not even be there.

With a non-trivial probability, clients get no response from NTP
servers or if it does, the packet must be discarded because it has
been delayed so long that it no longer tell us anything relevant
about the clock relationship between the client and the server.

When they arrive, they suffer from a number of not at all gaussian
error mechanisms, which no run-of-the-mill filter types handle
particularly well, in particular not given that we would really
like a constant delay through the filter, in order for the PLL
to remain stable.

I have collected a large number of NTP packet exchanges in order
to go over all the old wisdom on NTP filtering, and as I expected,
it's not your granddads internet any more.

The most interesting way to visualize these data is also a quite
pretty way:

.. image:: 20141024_fig1.png

First, notice that the axes are logarithmic and scale from 200µs
to 0.5s, they span a range of 1:2500.

Each distinct signature are the packets to/from a particular NTP
server, plotted with the time from my lab machine to the server on
the X axis and the return trip on the Y axis.

The red blob in the low left corner is a stratum two server in my
lab, packet travel times is around half a millisecond both ways.

The rest of the blobs are packets to and from from some particular
public NTP server, the futher up and right, the longer time the
packets took to cross the Internet.

I deliberately included three NTP servers from New Zealand in
the sample, since that is about as far from Denmark as you can
get, both geographically and RTT wise.

Ohh, and notice that red arc there ?  That's a pretty shit stratum
2 server somewhere in Germany, it drifts a couple all over the place
in my dataset.  As if the packet-transit noise wasn't bad enough
our filters should try to spot and blacklist such servers.

The really interesting thing is that even though they may be
100ms away from my lab, these servers provide quite good timeservice.

In this plot a running estimate of the standard-deviation of
the packet transit time is plotted for the same dataset:

.. image:: 20141024_fig2.png

Notice the change of scale, now from [1us to .5s]

You will notice the precense of a green blob below the red one,
if you look really carefully in the first plot, you can see to
green crosses there also:  This is my labs stratum=1 server.

This plot makes the important feature of NTP phase offset
measurements very obvious:
Notice how almost all these different blobs have the general shape
of a blob near X=Y and a tail up and another tail right.

This is because the probability of a packet getting delayed on the
way to the server is quite independent of the packet going back
being delayed.  Not totally independent, but quite independent.

Due to the way we calculate the phase offset, packets on the
two tails represent half the delay one of the two packets
suffered as phase error and that's bad for our PLL.

If the packet were delayed equally in both directions, it would
be fine and cancel out, but only delayed in one direction is bad.

Not Your Grandfathers ARPANET
-----------------------------

NTP is as old The Blues Brothers ("We have both kinds of protocols,
IPv4 *and* IPv6"), older than Das Boot and Raiders of the Lost Ark,
so old in fact that people still thought X.25 was a cool protocol
and that 56 kbit/sec was "high speed communications".

At least two things of relevance for NTP have changed since then:
Network congestion has dropped and computer clocks have become much
more stable.

The reduced congestion increases the probability that packets
don't encounter delays, in particular it reduces the chances
that it happens both directions.

The higher stability was necessary in order to make it feasible
to multiply a pedestrian 14.318 MHz quartz-crystals output
frequency to near 4GHz and use it to clock a billion transistor
CPU chip.

And both things are to our advantage:  Even using a NTP server
in New Zealand, it should be possible for me to synchronize
my machine here in Denmark into the low milliseconds, and do
it in a matter of minutes.

Filter design
-------------

We do not have a theoretical model of how packets traverse the
Internet which can inform our filter design, we have to rely
on collected real-life data, hoping it is representative,
extrapolate with caution and allow plenty of margin for the
unexpected.

On the other hand, we don't have to do it on a PDP/11 computer
without hardware floating point instructions, and we can burn
1000 times as many instructions today, and still not use as
much CPU power as updating the cursor on the screen does.

The filters I am currently playing with are not median filters
like NTPD, median filters suffer from variable delay and I
would really like to avoid that.

I maintain a running estimate of the average and standard deviation
of the packet transit times in both directions and in total.

If a packet comes back and the RTT is inside 2.5 standard
deviations (for some value of 2.5) we take it and run
with it.  That doesn't work half shabby at all.

But we can actually salvage many of the packets lost to this
filter, because they were only hit in one direction, and
the other direction almost always provides credible time
information on its own.  Obviously, this doesn't work for
the first packet, since we have no way to tell that one
or the other direction was out of character, but after as
little as four packet exchanges, this kicks in.

*phk*