.. index::
   single: subsystem; Network size estimation
   see: NSE; Network size estimation

.. _NSE-Subsystem:

NSE — Network size estimation
=============================

NSE stands for Network Size Estimation. The NSE subsystem provides other
subsystems and users with a rough estimate of the number of peers
currently participating in the GNUnet overlay. The computed value is not
a precise number as producing a precise number in a decentralized,
efficient and secure way is impossible. While NSE's estimate is
inherently imprecise, NSE also gives the expected range. For a peer that
has been running in a stable network for a while, the real network size
will typically (99.7% of the time) be in the range of [2/3 estimate, 3/2
estimate]. We will now give an overview of the algorithm used to
calculate the estimate; all of the details can be found in this
technical report.

.. todo:: link to the report.

.. _Motivation:

Motivation
----------

Some subsystems, like DHT, need to know the size of the GNUnet network
to optimize some parameters of their own protocol. The decentralized
nature of GNUnet makes efficiently and securely counting the exact number
of peers infeasible. Although there are several decentralized algorithms
to count the number of peers in a system, so far there is none to do so
securely. Other protocols may allow any malicious peer to manipulate the
final result or to take advantage of the system to perform Denial of
Service (DoS) attacks against the network. GNUnet's NSE protocol avoids
these drawbacks.

.. _Security:

:index:`Security <single: NSE; security>`
Security
^^^^^^^^

The NSE subsystem is designed to be resilient against these attacks. It
uses `proofs of
work <http://en.wikipedia.org/wiki/Proof-of-work_system>`__ to prevent
one peer from impersonating a large number of participants, which would
otherwise allow an adversary to artificially inflate the estimate. The
DoS protection comes from the time-based nature of the protocol: the
estimates are calculated periodically and out-of-time traffic is either
ignored or stored for later retransmission by benign peers. In
particular, peers cannot trigger global network communication at will.

.. _Principle:

:index:`Principle <single: NSE; principle of operation>`
Principle
---------

The algorithm calculates the estimate by finding the globally closest
peer ID to a random, time-based value.

The idea is that the closer the ID is to the random value, the more
"densely packed" the ID space is, and therefore, the more peers are in
the network.

.. _Example:

Example
^^^^^^^

Suppose all peers have IDs between 0 and 100 (our ID space), and the
random value is 42. If the closest peer has the ID 70 we can imagine
that the average "distance" between peers is around 30 and therefore
there are around 3 peers in the whole ID space. On the other hand, if
the closest peer has the ID 44, we can imagine that the space is rather
packed with peers, maybe as many as 50 of them. Naturally, we could have
been rather unlucky: there may be only one peer, which happens to have
the ID 44. Thus, the current estimate is calculated as the average over
multiple rounds, and not just a single sample.
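
The toy reasoning above can be written down in a few lines. This is only
a sketch of the example, assuming a linear ID space and a hypothetical
helper name ``naive_estimate``; the formula NSE really uses is the one
described in the technical report.

.. code-block:: python

   def naive_estimate(id_space: int, target: int, closest_id: int) -> float:
       """Guess the peer count from a single sample, as in the toy example."""
       distance = abs(closest_id - target)
       if distance == 0:
           raise ValueError("a distance of zero gives no usable estimate")
       # If peers sit roughly `distance` apart, about id_space / distance fit.
       return id_space / distance


   def averaged_estimate(id_space: int, samples) -> float:
       """Average several (target, closest_id) samples to smooth out bad luck."""
       estimates = [naive_estimate(id_space, t, c) for t, c in samples]
       return sum(estimates) / len(estimates)

With the numbers from the example, ``naive_estimate(100, 42, 70)`` gives
a value between 3 and 4, and ``naive_estimate(100, 42, 44)`` gives 50.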

.. _Algorithm:

Algorithm
^^^^^^^^^

Given that example, one can imagine that the job of the subsystem is to
efficiently communicate the ID of the closest peer to the target value
to all the other peers, who will calculate the estimate from it.

.. _Target-value:

Target value
^^^^^^^^^^^^

The target value itself is generated by hashing the current time,
rounded down to an agreed value. If the rounding amount is 1h (default)
and the time is 12:34:56, the time to hash would be 12:00:00. The
process is repeated once per rounding amount (in this example, every
hour). Every repetition is called a round.
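
A minimal sketch of this derivation follows. The hash function
(SHA-512), the 8-byte big-endian encoding of the timestamp, and the
function names are assumptions made for illustration; they are not taken
from the NSE source.

.. code-block:: python

   import hashlib

   ROUND_SECONDS = 3600  # default rounding amount from the text: 1 hour


   def round_start(unix_time: int, interval: int = ROUND_SECONDS) -> int:
       """Round a timestamp down to the start of its round."""
       return unix_time - (unix_time % interval)


   def target_value(unix_time: int, interval: int = ROUND_SECONDS) -> bytes:
       """Hash the rounded time; every peer in the same round gets the same value."""
       start = round_start(unix_time, interval)
       # Assumption: encode the rounded time as 8 big-endian bytes before hashing.
       return hashlib.sha512(start.to_bytes(8, "big")).digest()

For 12:34:56 (45296 seconds after midnight), ``round_start`` returns
43200, which is 12:00:00, so any two peers whose clocks fall within the
same hour compute the same target.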

.. _Timing:

Timing
^^^^^^

The NSE subsystem has some timing control to avoid everybody
broadcasting its ID all at once. Once each peer has the target random
value, it compares its own ID to the target and calculates the
hypothetical size of the network if that peer were to be the closest.
Then it compares the hypothetical size with the estimate from the
previous rounds. For each value there is an associated point in the
period, let's call it "broadcast time". If its own hypothetical
estimate is the same as the previous global estimate, its "broadcast
time" will be in the middle of the round. If it is bigger, it will be
earlier, and if it is smaller (the most likely case), it will be later.
This ensures that the peers closest to the target value start
broadcasting their ID first.
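
The text fixes three properties of the broadcast time: an equal estimate
maps to the middle of the round, a bigger one to an earlier point, and a
smaller one to a later point. One curve with these properties is an
arctangent centred on the middle of the round. The sketch below assumes
that shape and assumes both estimates are on the logarithmic scale; the
technical report remains the reference for the function actually used.

.. code-block:: python

   import math


   def broadcast_delay(own_estimate: float,
                       previous_estimate: float,
                       round_seconds: float = 3600.0) -> float:
       """Seconds after the round start at which a peer would begin flooding.

       Both estimates are assumed to be logarithmic (base 2) sizes.
       """
       difference = own_estimate - previous_estimate
       # atan maps any difference into (-pi/2, pi/2), so the result always
       # stays strictly inside the round.
       return round_seconds / 2 - (round_seconds / math.pi) * math.atan(difference)

A peer whose hypothetical estimate matches the previous one gets a delay
of exactly half the round; the delay never reaches 0 or the full round
length, no matter how extreme the difference is.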

.. _Controlled-Flooding:

Controlled Flooding
^^^^^^^^^^^^^^^^^^^

When a peer receives a value, it first verifies that it is closer than
the closest value it had so far; if it is not, it answers the incoming
message with a message containing the better value. Then it checks a
proof of work that must be included in the incoming message, to ensure
that the other peer's ID is not made up (otherwise a malicious peer
could claim to have an ID of exactly the target value every round). Once
validated, it compares the broadcast time of the received value with the
current time and if it's not too early, sends the received value to its
neighbors. Otherwise it stores the value until the correct broadcast
time comes. This prevents unnecessary traffic of sub-optimal values,
since a better value can come before the broadcast time, rendering the
previous one obsolete and saving the traffic that would have been used
to broadcast it to the neighbors.
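
To illustrate why a proof of work makes fabricated IDs expensive, here
is a generic hash-based scheme: a nonce is valid if hashing it together
with the peer ID yields enough leading zero bits. This is a textbook
illustration only. The hash function, the encoding and all names are
assumptions, and it is not meant to describe the scheme NSE actually
uses.

.. code-block:: python

   import hashlib


   def leading_zero_bits(digest: bytes) -> int:
       """Count the zero bits at the start of a digest."""
       bits = 0
       for byte in digest:
           if byte == 0:
               bits += 8
               continue
           bits += 8 - byte.bit_length()
           break
       return bits


   def check_pow(peer_id: bytes, nonce: int, required_bits: int) -> bool:
       """Cheap verification: one hash per check."""
       digest = hashlib.sha256(peer_id + nonce.to_bytes(8, "big")).digest()
       return leading_zero_bits(digest) >= required_bits


   def find_pow(peer_id: bytes, required_bits: int) -> int:
       """Expensive search: about 2**required_bits hashes on average."""
       nonce = 0
       while not check_pow(peer_id, nonce, required_bits):
           nonce += 1
       return nonce

The asymmetry is the point: verifying costs one hash, while producing a
valid nonce for each made-up ID costs exponentially more as
``required_bits`` grows.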

.. _Calculating-the-estimate:

Calculating the estimate
^^^^^^^^^^^^^^^^^^^^^^^^

Once the closest ID has been spread across the network each peer gets
the exact distance between this ID and the target value of the round and
calculates the estimate with a mathematical formula described in the
tech report. The estimate generated with this method for a single round
is not very precise. Remember the case of the example, where the only
peer has the ID 44 and we happen to generate the target value 42,
leading us to think there are 50 peers in the network. Therefore, the
NSE subsystem remembers the last 64 estimates and calculates an average
over them, giving a result which usually has one bit of uncertainty (the
real size could be half of the estimate or twice as much). Note that the
actual network size is calculated in powers of two of the raw input,
thus one bit of uncertainty means a factor of two in the size estimate.
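
The bookkeeping for the last 64 rounds can be sketched with a bounded
queue. The class name is hypothetical, and a plain unweighted mean and
population standard deviation are assumed here; the service may weight
the rounds differently.

.. code-block:: python

   from collections import deque
   from statistics import mean, pstdev

   HISTORY = 64  # number of rounds remembered, as stated in the text


   class EstimateHistory:
       """Keeps the most recent per-round estimates on the log2 scale."""

       def __init__(self, size: int = HISTORY):
           # A deque with maxlen silently drops the oldest round when full.
           self._rounds = deque(maxlen=size)

       def add_round(self, log2_estimate: float) -> None:
           self._rounds.append(log2_estimate)

       def average(self) -> float:
           return mean(self._rounds)

       def std_dev(self) -> float:
           return pstdev(self._rounds)

Because the values are kept on the logarithmic scale, a single unlucky
round shifts the average by a fraction of a bit rather than dominating
it.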

:index:`libgnunetnse <single: libgnunet; nse>`
libgnunetnse
------------

The NSE subsystem has the simplest API of all services, with only two
calls: ``GNUNET_NSE_connect`` and ``GNUNET_NSE_disconnect``.

The connect call gets a callback function as a parameter and this
function is called each time the network agrees on an estimate. This
usually is once per round, with some exceptions: if the closest peer has
a late local clock and starts spreading its ID after everyone else
agreed on a value, the callback might be activated twice in a round, the
second value being always bigger than the first. The default round time
is set to 1 hour.

The disconnect call disconnects from the NSE subsystem and the callback
is no longer called with new estimates.

.. _Results:

Results
^^^^^^^

The callback provides two values: the average and the `standard
deviation <http://en.wikipedia.org/wiki/Standard_deviation>`__ of the
last 64 rounds. The values provided by the callback function are
logarithmic: the real estimate can be obtained by calculating 2 to the
power of the given value (``2^average``). From a statistics point of
view this means that:

-  68% of the time the real size is included in the interval
   ``[2^(average - stddev), 2^(average + stddev)]``

-  95% of the time the real size is included in the interval
   ``[2^(average - 2*stddev), 2^(average + 2*stddev)]``

-  99.7% of the time the real size is included in the interval
   ``[2^(average - 3*stddev), 2^(average + 3*stddev)]``

The expected standard deviation for 64 rounds in a network of stable
size is 0.2. Thus, we can say that normally, relative to the estimate:

-  68% of the time the real size is in the range [-13%, +15%]

-  95% of the time the real size is in the range [-24%, +32%]

-  99.7% of the time the real size is in the range [-34%, +52%]
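
These percentages follow directly from the logarithmic scale: a
deviation of ``k * stddev`` in the exponent is a factor of
``2^(k * stddev)`` in the size. A short check, with a hypothetical
helper name:

.. code-block:: python

   def relative_range(std_dev: float, sigmas: int):
       """Relative deviation of the real size from the estimate, as fractions."""
       low = 2 ** (-sigmas * std_dev) - 1
       high = 2 ** (sigmas * std_dev) - 1
       return low, high


   for sigmas in (1, 2, 3):
       low, high = relative_range(0.2, sigmas)
       print(f"{sigmas} sigma: [{low:+.0%}, {high:+.0%}]")
   # 1 sigma: [-13%, +15%]
   # 2 sigma: [-24%, +32%]
   # 3 sigma: [-34%, +52%]

The range is asymmetric because equal steps in the exponent are
multiplicative, not additive, in the size.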

As said in the introduction, we can be quite sure that usually the real
size is between two thirds and one and a half times the estimate. This
can of course vary with network conditions. Thus, applications may want
to also consider the provided standard deviation value, not only the
average (in particular, if the standard deviation is very high, the
average may be meaningless: the network size is changing rapidly).

.. _libgnunetnse-_002d-Examples:

Examples
^^^^^^^^

Let's close with a couple of examples.

Average: 10, std dev: 1
   Here the estimate would be 2^10 = 1024 peers. (The range in which we
   can be 95% sure is: [2^8, 2^12] = [256, 4096]. We can be very
   (>99.7%) sure that the network is not a hundred peers and absolutely
   sure that it is not a million peers, but somewhere around a
   thousand.)

Average: 22, std dev: 0.2
   Here the estimate would be 2^22 = 4 million peers. (The range in
   which we can be 99.7% sure is: [2^21.4, 2^22.6] = [2.8M, 6.4M]. We
   can be sure that the network size is around four million, with
   absolutely no way of it being 1 million.)

To put this in perspective: the LHC Higgs boson results were announced
with "5 sigma" and "6 sigma" certainties. In this case a 5 sigma minimum
would be about 2.1 million and a 6 sigma minimum, 1.8 million.
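
A client could turn the two callback values into absolute bounds as
sketched below. The function name is hypothetical; only the arithmetic
is taken from the text.

.. code-block:: python

   def size_range(average: float, std_dev: float, sigmas: float):
       """Lower and upper bound on the network size for a given sigma level."""
       low = 2 ** (average - sigmas * std_dev)
       high = 2 ** (average + sigmas * std_dev)
       return low, high


   # First example: average 10, std dev 1, 95% (2 sigma) range.
   print(size_range(10, 1, 2))        # (256, 4096)

   # Second example: average 22, std dev 0.2, 99.7% (3 sigma) range.
   low, high = size_range(22, 0.2, 3)
   print(f"{low / 1e6:.1f}M to {high / 1e6:.1f}M")

The same function reproduces the 5 sigma and 6 sigma minimums by passing
``sigmas=5`` or ``sigmas=6`` and reading the lower bound.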

.. _The-NSE-Client_002dService-Protocol:

The NSE Client-Service Protocol
-------------------------------

As with the API, the client-service protocol is very simple: it has only
two different messages, defined in ``src/nse/nse.h``:

-  ``GNUNET_MESSAGE_TYPE_NSE_START`` This message has no parameters and
   is sent from the client to the service upon connection.

-  ``GNUNET_MESSAGE_TYPE_NSE_ESTIMATE`` This message is sent from the
   service to the client for every new estimate and upon connection.
   It contains a timestamp for the estimate, the average and the
   standard deviation for the respective round.

When the ``GNUNET_NSE_disconnect`` API call is executed, the client
simply disconnects from the service, with no message involved.

.. _The-NSE-Peer_002dto_002dPeer-Protocol:

The NSE Peer-to-Peer Protocol
-----------------------------

The NSE subsystem only has one message in the P2P protocol, the
``GNUNET_MESSAGE_TYPE_NSE_P2P_FLOOD`` message.

The key contents of this message are the timestamp to identify the round
(differences in system clocks may cause some peers to send messages way
too early or way too late, so the timestamp allows other peers to
identify such messages easily), the `proof of
work <http://en.wikipedia.org/wiki/Proof-of-work_system>`__ used to make
it difficult to mount a `Sybil
attack <http://en.wikipedia.org/wiki/Sybil_attack>`__, and the public
key, which is used to verify the signature on the message.

Every peer stores a message for the previous, current and next round.
The messages for the previous and current round are given to peers that
connect to us. The message for the next round is simply stored until our
system clock advances to the next round. The message for the current
round is what we are flooding the network with right now. At the
beginning of each round the peer does the following:

-  calculates its own distance to the target value

-  creates, signs and stores the message for the current round (unless
   it has a better message in the "next round" slot which came early
   in the previous round)

-  calculates, based on the stored round message (own or received) when
   to start flooding it to its neighbors

Upon receiving a message the peer checks the validity of the message
(round, proof of work, signature). The next action depends on the
contents of the incoming message:

-  if the message is worse than the current stored message, the peer
   sends the current message back immediately, to stop the other peer
   from spreading suboptimal results

-  if the message is better than the current stored message, the peer
   stores the new message and calculates the new target time to start
   spreading it to its neighbors (excluding the one the message came
   from)

-  if the message is for the previous round, it is compared to the
   message stored in the "previous round" slot, which may then be
   updated

-  if the message is for the next round, it is compared to the message
   stored in the "next round" slot, which again may then be updated

Finally, when it is time to send the stored message for the current
round to the neighbors, a random delay is added for each neighbor, to
avoid traffic spikes and minimize cross-messages.
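
The slot handling described in this section can be sketched as follows.
All names are hypothetical, messages are reduced to a round timestamp
and a closeness score, validation (round, proof of work, signature) is
assumed to have happened already, and the uniform jitter is an
assumption about the shape of the random delay.

.. code-block:: python

   import random
   from dataclasses import dataclass
   from typing import Dict, Iterable, Optional


   @dataclass(frozen=True)
   class FloodMessage:
       round_start: int    # timestamp identifying the round (seconds)
       matching_bits: int  # closeness to the target; more bits is better


   class RoundSlots:
       """Keeps the best known message for the previous, current and next round."""

       def __init__(self, own: FloodMessage, interval: int = 3600):
           self.interval = interval
           self.current_round = own.round_start
           # Keys are round offsets: -1 previous, 0 current, +1 next.
           self.slots: Dict[int, Optional[FloodMessage]] = {-1: None, 0: own, 1: None}

       def handle(self, msg: FloodMessage) -> str:
           """Process a validated message and name the follow-up action."""
           offset, remainder = divmod(msg.round_start - self.current_round,
                                      self.interval)
           if remainder != 0 or offset not in self.slots:
               return "ignore"
           stored = self.slots[offset]
           if stored is not None and msg.matching_bits <= stored.matching_bits:
               worse = msg.matching_bits < stored.matching_bits
               # Only a worse message for the current round triggers a reply.
               return "reply-with-stored" if offset == 0 and worse else "keep"
           self.slots[offset] = msg
           return "schedule-flood" if offset == 0 else "updated"

       def advance(self, own: FloodMessage) -> None:
           """Start the next round, preferring a better message that came early."""
           early = self.slots[1]
           if early is not None and early.matching_bits > own.matching_bits:
               best = early
           else:
               best = own
           self.slots = {-1: self.slots[0], 0: best, 1: None}
           self.current_round += self.interval


   def neighbor_delays(neighbors: Iterable[str], max_jitter: float,
                       rng: random.Random) -> Dict[str, float]:
       """Pick an independent random delay per neighbor to avoid traffic spikes."""
       return {n: rng.uniform(0.0, max_jitter) for n in neighbors}

Returning an action name instead of sending directly keeps the sketch
free of any networking code; a real service would act on these results
using its own transport.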


