Internet-Draft | ETS | October 2025 |
Yang, et al. | Expires 18 April 2026 | [Page] |
This document presents ETS: an Extensible TimeStamps option for TCP. It allows hosts to use microseconds as the unit for timestamps to improve the precision of timestamps, and extends the information provided in the [RFC7323] TCP Timestamps Option by including the receiver delay in the TSecr echoing, so that the receiver of the ACK is able to more accurately estimate the portion of the RTT that resulted from time traveling through the network. The ETS option format is extensible, so that future extensions can add further information without the overhead of extra TCP option kind and length fields.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 18 April 2026.¶
Copyright (c) 2025 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here. In this document, these words will appear with that interpretation only when in UPPER CASE. Lower case uses of these words are not to be interpreted as carrying [RFC2119] significance.¶
Accurate round-trip time (RTT) estimation is necessary for TCP to adapt to diverse and dynamic traffic conditions.¶
The TCP timestamp option specified in [RFC7323] is designed largely for RTT samples intended for computing TCP's retransmission (RTO) timer [RFC6298].¶
Some congestion control algorithms may wish to use a form of RTT measurement as one of several congestion signals, since elevated RTT measurements can reflect increases in network queueing delays. For example, the Swift congestion control algorithm [KDJWWM20], successfully deployed in data-center environments, requires precise and accurate measurements of both network and host delays. However, the existing TCP RTT sampling mechanisms that measure the delay between data transmission and ACK receipt [RFC6298] do not separate network and host delays, and cannot measure the RTT of retransmitted data. Even the TCP timestamp option specified in [RFC7323] is not well-suited to use as a congestion signal, for a number of reasons.¶
With the TCP Timestamps Option [RFC7323], data senders can measure an RTT sample by computing the difference between the data sender's current timestamp clock value and the received TSecr value. However, there are some drawbacks in this [RFC7323] measurement method:¶
In many of the cases above, an RTT sample computed using [RFC7323] can be inflated for reasons other than network queuing. It is difficult for the ACK receiver to infer how long the non-network delay was, which makes it hard to use an [RFC7323] RTT measurement as a clean signal for congestion control.¶
Delayed ACKs, as mentioned above, are particularly problematic. TCP receivers typically implement a delayed ACK algorithm. To avoid spurious timeouts due to these delayed ACKs, TCP senders can adapt to this delayed ACK behavior by guessing the maximum delayed ACK value of the remote receiver. Historically, many implementations tended to delay ACKs by up to roughly 200ms [WS95], so some implementations have correspondingly used a minimum RTO of 200ms. However, this imposes a latency penalty that is very large compared to RTTs in some of today's datacenter networks.¶
This document presents ETS: an Extensible TimeStamps option for TCP. ETS extends the information provided in the [RFC7323] TCP Timestamps Option, adding several features. First, ETS allows connections to use microseconds as the unit for timestamps, to improve the precision of timestamps. Second, ETS allows connections to include information about the delay between data receipt and ACK generation, so that the receiver of the ACK is able to more accurately estimate the portion of the RTT that resulted from time that data and ACK segments spent traveling through the network. Third, the ETS option format is extensible, so that future extensions can add further information without the overhead of extra TCP option kind and length fields.¶
The ETS protocol has two phases: an exchange of ETS options (ETSopt) in the negotiation handshake in <SYN> and <SYN,ACK> segments, and then ETS options included in all following segments.¶
All segments include AckDelay, the delay in the TSecr echoing process that was inserted by the data receiver, helping the receiver of an ACK to estimate the portion of the RTT delay caused by the network.¶
An example of a handshake exchange is illustrated below:¶
    TCP A (Client)                                      TCP B (Server)
    ______________                                      ______________
    CLOSED                                                      LISTEN

#1  SYN-SENT    --- <SYN,TSval=X,TSecr=0,
                     AckDelay=0>          ----------->        SYN-RCVD

#2  ESTABLISHED <-- <SYN,ACK,TSval=Y,TSecr=X,
                     AckDelay=E1>         ----------          SYN-RCVD

#3  ESTABLISHED --- <ACK,TSval=Z,TSecr=Y,
                     AckDelay=E2>         --------->       ESTABLISHED¶
Active connect: An actively connecting host that wishes to negotiate ETSopt MUST include the ETSopt in the <SYN>. For backward compatibility, the endpoint performing the active connect MAY also include a [RFC7323] TSopt in the <SYN> segment, so that if the passive side or middleboxes do not support and respond to the ETSopt, the active and passive sides can proceed with the [RFC7323] TSopt negotiation for the connection.¶
Passive connect: For a passively connecting host that is willing to proceed with ETSopt negotiation, if the <SYN> includes an ETSopt, the host MUST include a TCP ETS option in the initial <SYN,ACK> segment. A retransmission of the <SYN,ACK> segment may omit the ETSopt, to increase robustness in the presence of middleboxes that block segments containing ETSopt.¶
Processing of <SYN,ACK> for active connect: If the ETSopt is absent from the <SYN,ACK> segment received by the actively connecting endpoint, suggesting that the passive endpoint does not support ETSopt, or some middlebox has stripped the option from the <SYN,ACK> segment, then the actively connecting endpoint MUST disable ETSopt for this connection. In such cases the actively connecting endpoint MAY fall back to using [RFC7323] timestamps if both the <SYN> and <SYN,ACK> segments include valid [RFC7323] timestamps.¶
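The active-side outcome of the negotiation rules above can be sketched as a small decision function. This is a minimal illustration, not part of the specification: the names `TsMode` and `active_side_mode` are hypothetical, the sketch always takes the optional (MAY) fallback to [RFC7323] timestamps, and it assumes ETS is preferred when a <SYN,ACK> carries both options.

```python
from enum import Enum


class TsMode(Enum):
    ETS = "ets"          # ETSopt negotiated for this connection
    RFC7323 = "rfc7323"  # fell back to the classic Timestamps option
    NONE = "none"        # no timestamp option in use


def active_side_mode(syn_had_ets, syn_had_tsopt,
                     synack_has_ets, synack_has_tsopt):
    """Pick the timestamp mode once the active side receives the <SYN,ACK>."""
    if syn_had_ets and synack_has_ets:
        # Assumption: ETS wins if the peer echoed both options.
        return TsMode.ETS
    # ETSopt absent from the <SYN,ACK>: ETS is disabled for this connection.
    if syn_had_tsopt and synack_has_tsopt:
        # Optional fallback (MAY in the text); this sketch always takes it.
        return TsMode.RFC7323
    return TsMode.NONE
```

For instance, a <SYN> carrying both options that is answered by a <SYN,ACK> carrying only a TSopt yields `TsMode.RFC7323`.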
The reader is expected to be familiar with the TCP Timestamps Option (TSopt), including TSval, TSecr, and TS.Recent [RFC7323].¶
Variables introduced by this document are described below:¶
TSval: Same as the TSval field described in [RFC7323] except the unit is in microseconds.¶
TSecr: The echo of TS.Recent.¶
TS.Recent: The recently received TSval sent by the remote TCP endpoint in the TSval field of an ETSopt, updated using the TS.Recent rules specified in [RFC7323].¶
AckDelay: The field quantifying the delay between data receipt and ACK generation in the TSecr echoing process, so that the receiver of the ACK is able to more accurately estimate the NetworkRTT.¶
NetworkRTT: The time from when the data segment leaves the sender until when it arrives at the receiver, plus the time from when the corresponding ACK leaves the (data) receiver until the ACK arrives at the data sender (here sender and receiver refer to the TCP layer only).¶
The header format for TCP ETS options (ETSopt) is as follows:¶
     01234567 89012345 67890123 45678901
    +--------+--------+--------+--------+
    |  Kind  | Length |      ExID       |
    +--------+--------+--------+--------+
    |               TSval               |
    +--------+--------+--------+--------+
    |               TSecr               |
    +--------+--------+--------+--------+
    |Un|  AckDelay  |R|
    +-----------------+
    |                 |
   /                   \
.-´                     `---------------------------.
|                                                   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Unit |             AckDelay              |Reserved|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 2 bits|              13 bits              | 1 bit

Kind: 1 byte, has value 254, TCP experimental option codepoint [RFC6994]

Length: 1 byte option length, value is 14 (value MAY be higher in later ETSOpt protocol versions).

ExID: 2 byte [RFC6994] experiment ID; MUST be 0x4554.

TSval and TSecr: 32 bits each, have the same definition as [RFC7323] except that both are in microseconds.

AckDelay.Unit: 2 bits, has value:
    0: indicates AckDelay is in microsecond units
    1: indicates AckDelay is in millisecond units
    2: indicates AckDelay is invalid
    3: reserved

AckDelay: 13 bits, the value of AckDelay.

Reserved: 1 bit, in this protocol version; sender MUST set to 0; receiver MUST ignore field and tolerate 0 or 1¶
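To make the layout concrete, the following sketch encodes and decodes the 14-byte option with Python's `struct` module. It is illustrative only: the function names are hypothetical, and it assumes that all multi-byte fields are in network byte order and that Unit occupies the two most significant bits of the final 16-bit word, as the figure suggests.

```python
import struct

ETS_KIND = 254     # TCP experimental option codepoint [RFC6994]
ETS_LENGTH = 14    # option length in this protocol version
ETS_EXID = 0x4554  # experiment ID given in the field list above

UNIT_USEC, UNIT_MSEC, UNIT_INVALID = 0, 1, 2
_FMT = "!BBHIIH"   # Kind, Length, ExID, TSval, TSecr, Unit/AckDelay/Reserved


def pack_etsopt(tsval, tsecr, unit, ack_delay):
    """Build the 14-byte ETSopt; the Reserved bit is always sent as 0."""
    if not 0 <= unit <= 3:
        raise ValueError("Unit is a 2-bit field")
    if not 0 <= ack_delay < (1 << 13):
        raise ValueError("AckDelay is a 13-bit field")
    delay_word = (unit << 14) | (ack_delay << 1)
    return struct.pack(_FMT, ETS_KIND, ETS_LENGTH, ETS_EXID,
                       tsval & 0xFFFFFFFF, tsecr & 0xFFFFFFFF, delay_word)


def unpack_etsopt(data):
    """Return (tsval, tsecr, unit, ack_delay), or None if not an ETSopt."""
    if len(data) < ETS_LENGTH:
        return None
    kind, length, exid, tsval, tsecr, delay_word = struct.unpack(
        _FMT, data[:ETS_LENGTH])
    if kind != ETS_KIND or exid != ETS_EXID or length < ETS_LENGTH:
        return None
    unit = delay_word >> 14
    ack_delay = (delay_word >> 1) & 0x1FFF  # drops the Reserved bit
    return tsval, tsecr, unit, ack_delay
```

Because the decoder masks out the lowest bit, it tolerates a Reserved bit of either 0 or 1, as the field description requires.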
The semantics of the option fields are as follows:¶
TSval: Same as the TSval field described in [RFC7323], except that the unit is microseconds. It contains the value of the sender host's timestamp clock, in microseconds, at the time the sender schedules transmission of the segment.¶
TSecr: The TSecr field contains the current value of TS.Recent, the recently received TSval sent by the remote TCP that is recorded using the TS.Recent update algorithm described in [RFC7323], Section 4.3. TSecr is only valid when the ACK bit is set. When the ACK bit is not set, senders MUST set this field to 0, and receivers MUST ignore the value in this field (and MUST tolerate any value in this field).¶
AckDelay: Field AckDelay contains the delay inserted by the receiver in the TSecr echoing process. Field AckDelay is only valid when the ACK bit is set, otherwise MUST be set to 0 by the sender and ignored by the receiver. When the ACK bit is set, the sender computes the AckDelay using the following algorithm:¶
(1) When TS.Recent is updated by a received segment SEG (as in [RFC7323]):
        TS.RecentClock = SEG.ArrivalTime

(2) When an ETSopt is sent in a segment ACK:
        TSecr = TS.Recent (as in [RFC7323])
        AckDelay = ACK.SendTime - TS.RecentClock¶
In (1), it is RECOMMENDED that the Network Interface Controller (NIC) receiving timestamp of the segment SEG be used as the ArrivalTime of SEG. This practice aids endpoints in more accurately estimating NetworkRTT by excluding delays unrelated to network queuing, such as host-side receive delays, including those incurred by CPU wake-up from power management "C-states".¶
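The two steps of the AckDelay algorithm can be sketched as receiver-side state. This is a hedged illustration: the class and method names are hypothetical, and the policy for choosing AckDelay.Unit (microseconds when the delay fits in 13 bits, otherwise milliseconds rounded down, otherwise "invalid") is an assumption of this sketch rather than something the text above specifies.

```python
MAX_ACK_DELAY = (1 << 13) - 1  # largest value the 13-bit AckDelay field holds
UNIT_USEC, UNIT_MSEC, UNIT_INVALID = 0, 1, 2


class EtsReceiverState:
    """Tracks TS.Recent and TS.RecentClock for one connection."""

    def __init__(self):
        self.ts_recent = 0
        self.ts_recent_clock_us = None  # local arrival time, microseconds

    def on_ts_recent_update(self, seg_tsval, seg_arrival_us):
        """Step (1): call only when the [RFC7323] rules update TS.Recent."""
        self.ts_recent = seg_tsval
        self.ts_recent_clock_us = seg_arrival_us

    def fields_for_ack(self, ack_send_us):
        """Step (2): return (TSecr, Unit, AckDelay) for an outgoing ACK."""
        if self.ts_recent_clock_us is None:
            return self.ts_recent, UNIT_INVALID, 0
        delay_us = ack_send_us - self.ts_recent_clock_us
        if delay_us < 0:
            return self.ts_recent, UNIT_INVALID, 0
        if delay_us <= MAX_ACK_DELAY:
            return self.ts_recent, UNIT_USEC, delay_us
        delay_ms = delay_us // 1000  # assumption: round down to whole ms
        if delay_ms <= MAX_ACK_DELAY:
            return self.ts_recent, UNIT_MSEC, delay_ms
        return self.ts_recent, UNIT_INVALID, 0
```

With this policy, delays up to 8191 microseconds keep full precision, and delays up to 8191 milliseconds remain representable at millisecond granularity.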
We discuss how the ACK receiver can estimate the NetworkRTT using AckDelay in the next section.¶
When an endpoint receives an ACK with TSecr=X, we define NetworkRTT as the time from when the data segment, which has TSval=X, is sent until when it arrives at the receiver, plus the time from when the corresponding ACK is sent by the (data) receiver until the ACK arrives at the data sender. Note that although an ACK may acknowledge multiple segments, only the segments involved in the TSval echoing process are relevant to the NetworkRTT calculation.¶
With AckDelay in ETSopt, when a data sender (TCP A) receives an ACK segment with ETSopt from the remote endpoint (TCP B), NetworkRTT can be estimated by:¶
NetworkRTT = ACK.ArrivalTime - ACK.TSecr - ACK.AckDelay¶
For better accuracy, it is also RECOMMENDED that the NIC receiving timestamp of the segment (ACK) be used as the ArrivalTime.¶
The following example shows how NetworkRTT is computed:¶
TCP A                                              TCP B
______________                                     ______________

                -- <TSval=1> -->                   arrives at t=2
                                                   TS.Recent=TSval=1
                                                   TS.RecentClock=2
                -- <TSval=2> -->  lost
                -- <TSval=3> -->                   arrives at t=10
arrive at t=11  <-- <ACK, TSecr=1, AckDelay=8> --  Send ACK at t=10¶
In this example, AckDelay in the last ACK segment is ACK.SendTime - TS.RecentClock = 10 - 2 = 8, and the NetworkRTT from the last ACK segment is computed as 11 - 1 - 8 = 2, which is the network RTT of the segment sent with TSval = 1.¶
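The NetworkRTT computation can be expressed as a short function. This is a sketch under stated assumptions: the function name is hypothetical, ACK.ArrivalTime is assumed to be read from the same clock that generates TSval, the subtraction is done modulo 2^32 to survive timestamp wrap, and samples with an invalid or reserved AckDelay.Unit are simply discarded.

```python
UNIT_USEC, UNIT_MSEC = 0, 1


def network_rtt_us(ack_arrival_us, tsecr, unit, ack_delay):
    """NetworkRTT = ACK.ArrivalTime - ACK.TSecr - ACK.AckDelay, or None."""
    if unit == UNIT_USEC:
        delay_us = ack_delay
    elif unit == UNIT_MSEC:
        delay_us = ack_delay * 1000
    else:
        return None  # AckDelay invalid or reserved unit: no sample
    # Elapsed time on the sender's 32-bit timestamp clock, wrap-safe.
    elapsed_us = (ack_arrival_us - tsecr) & 0xFFFFFFFF
    rtt = elapsed_us - delay_us
    return rtt if rtt >= 0 else None  # reject inconsistent samples
```

Applied to the example above, `network_rtt_us(11, 1, UNIT_USEC, 8)` returns 2.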
In order to make use of this NetworkRTT estimate as a clean signal that more precisely reflects network queuing, it is RECOMMENDED that timestamps ACK.ArrivalTime and TS.RecentClock use the time at which the segment arrives at the host, e.g. the time the NIC receives the segment, if the NIC supports hardware receive timestamping. Within this context, NetworkRTT is then defined as the time from when the data segment leaves the sender’s TCP until when it is received at the receiver’s NIC, plus the time from when the corresponding ACK leaves the (data) receiver’s TCP until the ACK is received at the data sender’s NIC.¶
Protection Against Wrapped Sequences (PAWS), introduced by [RFC7323] Section 5, is a mechanism to reject old duplicate segments that might corrupt an open TCP connection. In the PAWS mechanism, a segment can be discarded as an old duplicate if it is received with a timestamp SEG.TSval that is "before" some timestamps recently received on this connection.¶
As in [RFC7323], ETS receivers need to exercise care to avoid spurious PAWS discards due to wrapping 32-bit timestamp values during periods in which the connection is idle. When microsecond units are used, as in ETS, the 32-bit timestamp could trigger wrapping issues and spurious PAWS discards after 2^31 ticks of idleness, which is around 2147 seconds (or around 35.8 minutes). To prevent a false positive PAWS rejection of a valid segment, an ETS receiver MUST skip the PAWS check for the first segment that arrives when the timestamp used by PAWS, e.g. TS.Recent, has not been updated for 2147 seconds or more.¶
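The idle-time exemption can be sketched as follows. The function names are hypothetical, and the sketch assumes that the receiver records when TS.Recent was last updated on a local, non-wrapping microsecond clock; only the "before" comparison of timestamps uses 32-bit modular arithmetic.

```python
PAWS_IDLE_LIMIT_US = 2147 * 1_000_000  # 2147 seconds, per the text above


def ts_before(a, b):
    """True if 32-bit timestamp a is "before" b in modular arithmetic."""
    return ((a - b) & 0xFFFFFFFF) > 0x7FFFFFFF


def paws_reject(seg_tsval, ts_recent, now_us, ts_recent_updated_us):
    """Return True if the segment should be discarded by PAWS."""
    idle_us = now_us - ts_recent_updated_us
    if idle_us >= PAWS_IDLE_LIMIT_US:
        # TS.Recent is too old to be trusted: skip the PAWS check.
        return False
    return ts_before(seg_tsval, ts_recent)
```

A caller would then update TS.Recent from the accepted segment under the usual [RFC7323] rules, which restarts the idle period.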
The Eifel Detection Algorithm [RFC3522] detects a spurious recovery by comparing a received TSecr to RetransmitTS, the value of the TSval in the retransmit sent when loss recovery is initiated. ETS allows Eifel to work as-is because the fields TSval and TSecr in the ETSopt have the same semantics as in TSopt [RFC7323]. Furthermore, in sub-millisecond environments, the microsecond precision of ETS is more effective at detecting spurious retransmissions than the coarser unit of TSopt.¶
The RTT measurement used in the calculation of RTO (retransmission timeout) [RFC6298] stays the same as described in [RFC7323]. It is NOT RECOMMENDED to use only NetworkRTT measurements for RTO calculation because RTO needs to include the host side delays to avoid spurious RTO events due to host delays.¶
The ETSopt has a length of 14 bytes. Of the 40 bytes available for TCP options, this leaves 26 bytes for other options. A SACK option [RFC2018] with at most 3 SACK blocks (2 bytes of kind and length plus 8 bytes per block, 26 bytes in total) is therefore able to coexist with ETSopt in a single TCP segment, as with TSopt.¶
[HNRGHT11] shows that middleboxes could drop an unrecognized TCP option or even drop the whole segment.¶
In order to fall back on [RFC7323] TSopt, the sender MAY include a [RFC7323] TSopt in the <SYN> and <SYN,ACK> segments, so that the [RFC7323] TSopt can be adopted when the ETSopt is stripped by a middlebox.¶
Once an expected ETSopt is missing from an incoming segment, the sender MUST NOT include an ETSopt in any future segment of this TCP connection. An implementation could negatively cache such incidents to avoid using ETS on these hosts or routes on future connections.¶
Another consideration is the interaction with hardware offloads like Receive Segment Coalescing (RSC). RSC should remain compatible with ETSopt, provided that RSC coalesces segments with bitwise identical TCP headers. Some NICs need to parse the TCP options and impose additional conditions on them. In such cases, new TCP options such as ETSopt may not be recognized by the NIC, which disables RSC.¶
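One possible shape for such a negative cache is sketched below. Everything here is hypothetical: the class name, the use of an opaque peer key (for example a destination address or route identifier), and the one-hour expiry are assumptions chosen for illustration, not recommendations from this document.

```python
class EtsNegativeCache:
    """Remembers peers for which ETS negotiation recently failed."""

    def __init__(self, ttl_s=3600.0):
        self.ttl_s = ttl_s          # assumption: entries expire after 1 hour
        self._blocked_until = {}    # peer key -> expiry time in seconds

    def record_failure(self, peer, now_s):
        """Note that an expected ETSopt was missing for this peer."""
        self._blocked_until[peer] = now_s + self.ttl_s

    def may_try_ets(self, peer, now_s):
        """Return True if a new connection to peer may offer ETSopt."""
        expiry = self._blocked_until.get(peer)
        if expiry is None:
            return True
        if now_s >= expiry:
            del self._blocked_until[peer]  # entry aged out, retry ETS
            return True
        return False
```

Letting entries expire allows ETS to be retried after a path or peer has been upgraded.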
This document specifies a new TCP option that uses the shared experimental options format [RFC6994], with ExID in network-standard byte order.¶
The authors plan to request the allocation of ExID value 0x4554 for the TCP option specified in this document.¶
A malicious receiver can manipulate the sender's NetworkRTT estimate by forging an AckDelay value. However, this does not introduce a new vulnerability relative to the Timestamps option [RFC7323], because a malicious receiver could already forge the TSecr field to manipulate the RTT measured by the other side.¶