Internet-Draft | Proxy BCP | June 2025 |
DeKok | Expires 24 December 2025 | [Page] |
RADIUS proxying is terrible. We try to make it better. This document is currently a rough draft, and is a work in progress.¶
This note is to be removed before publishing as an RFC.¶
Status information for this document may be found at https://datatracker.ietf.org/doc/draft-dekok-proxy-bcp/.¶
Discussion of this document takes place on the RADEXT Working Group mailing list (mailto:radext@ietf.org), which is archived at https://mailarchive.ietf.org/arch/browse/radext/. Subscribe at https://www.ietf.org/mailman/listinfo/radext/.¶
Source for this draft and an issue tracker can be found at https://github.com/freeradius/proxy-bcp.git.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 24 December 2025.¶
Copyright (c) 2025 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
The Remote Authentication Dial In User Service (RADIUS) protocol defines proxying in [RFC2865], Section 2.3. That section gives a brief overview of how proxying on a packet-by-packet basis, but ot does not discuss larger issues of how a proxy server should operate. The discussion of those topics is largely limited to a few sentences in [RFC2865], Section 2.3:¶
A forwarding server MAY need to modify attributes to enforce local policy. Such policy is outside the scope of this document ...¶
Implementers of forwarding servers should consider carefully which values it is willing to accept for Service-Type. Careful consideration must be given to the effects of passing along Service- Types of NAS-Prompt or Administrative in a proxied Access-Accept, and implementers may wish to provide mechanisms to block those or other service types, or other attributes. Such mechanisms are outside the scope of this document.¶
Some additional discussion of issues related to proxies is in [RFC2865], Section 2:¶
The Access-Request is submitted to the RADIUS server via the network. If no response is returned within a length of time, the request is re-sent a number of times. The client can also forward requests to an alternate server or servers in the event that the primary server is down or unreachable. An alternate server can be used either after a number of tries to the primary server fail, or in a round-robin fashion. Retry and fallback algorithms are the topic of current research and are not specified in detail in this document.¶
This text goes back to the original specification of [RFC2058], in 1997. Very little research on this topic has been performed since then, with the effect that there has been very little guidance available for implementers or operators. This lack of guidance has resulted in ad-hoc solutions, and in unstable networks.¶
This document addresses that lack of guidance by providing operational considerations and best current practice recommendations for proxying. The recommendations here are the result of decades of experience by multiple RADIUS server implementers, along with decades of experience by operators of large-scale and large-volume proxy networks. These recommendations have been demonstrated to work, and are documented here in order to encourage their adoption.¶
The document first motivates the guidance by describing a set of issues with the RADIUS protocol, with associated solutions and/or recommendations for addressing these issues. It then continues with a set of recommendations for best current practices.¶
As a "best current practices" specification, this document does not make any changes to the RADIUS protocol.¶
A proxy is composed of a server to receive packets, and an associated client to forward packets to another destination. The discussion here is limited to proxies, but there is of necessity overlap with recommendations for clients and servers. However, there are additional considerations for proxies which are not seen by a stand-alone client or server. This document focusses on issues seen by proxies.¶
It is still RECOMMENDED that client and server implementations be aware of the issues raised in this document, and where relevant, implement the recommendations made herein. Where the recommendations below apply to a stand-alone client or server, that relevance is mentioned.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
This section outlines problems with the RADIUS protocol and the recommended solutions.¶
[RFC2865], Section 1.2 defines the term "silently discard" as:¶
This means the implementation discards the packet without further processing. The implementation SHOULD provide the capability of logging the error, including the contents of the silently discarded packet, and SHOULD record the event in a statistics counter.¶
While it can be necessary to silently discard packets for security reasons, there are other situations where packets are silently discarded for reasons unrelated to security. These silent discards are a large source of instability in RADIUS deployments.¶
There are cases where an implementations has to silently discard packets because there is no defined way to respond to them. The result is that a server may be operational, but not responsive. A client has no way of knowing if the server is down, slow, or if there is a problem somewhere else down a long chain of proxies.¶
Some implementations are known to silently discard packets:¶
unrecognized attributes¶
when they get packets in the middle of an EAP session which isn't theirs (see load-balancing, below).¶
Standards are silent or wrong:¶
Many issues here.¶
Retransmits have the possibility of creating packet storms.¶
Some proxies ignore client retransmits. Others only retransmit when the client retransmits. Others do both (!!!)¶
[RFC3539], Section 2.8 explains that end-to-end retransmissions are the best. [RFC6613], Section 2.5 also says¶
Intrinsically, proxy systems operate with multiple control loops instead of one end-to-end loop, and so they are less stable. This is true even for TCP-TCP proxies. As discussed in [RFC3539], the only way to achieve stability equivalent to a single TCP connection is to mimic the end-to-end behavior of a single TCP connection. This typically is not achievable with an application-layer RADIUS implementation, regardless of transport.¶
The main lack here is the negative signaling of lost packets. Protocol-Error helps here.¶
Proxies MUST suppress duplicate retransmissions from a client within a short time frame, say 1s.¶
Proxies MUST NOT do their own retransmissions, subject to transport issues. i.e. a TCP to UDP proxy MUST retransmit on the UDP side.¶
Acct-Delay-Time ([RFC2866], Section 5.2 requires that Accounting-Request packets to be updated when the packet is retransmitted. It's an artifact of 1993, when NTP was not widely used.¶
Acct-Delay-Time contributes to congestive collapse of the network. One set of accounting information with the same data (other than Acct-Delay-Time) can be sent as many different packets. When seen in conjunction with the retransmission issues noted above, it is possible to increase that number substantially.¶
Packets may take multiple paths through the network.¶
Old packets for an EAP session may appear at the home server late. This happens when there is a proxy fail-over, and one element retransmits. This is not just UDP, it's possible with TCP, too, when connections go up / down.¶
Home servers MAY cache responses to Access-Request packets based on the State attribute. And then reply them when the same State is seen for a different packet src IP/port/Code. This caching is different from the [RFC5080], Section 2.2.1 packet dedup!¶
OR a home server MAY respond with a Protocol-Error indicating "unknown State". This can also occur for home server fail-over as seen in Section 5.3.1.¶
Proxies can filter attributes in both requests and responses.¶
Access-Request and Accounting-Request packets can contain private information about the local network, which no one else needs to see.¶
Access-Accept packets can contain assignments which are not appropriate for a remote network.¶
Clients and proxies should allow servers or next hops to be marked as "internal" or "external". Servers marked "internal" should be sent all attributes. Servers marked "internal" should have private local information stripped, and replaced with Opaque-NAS-Identifier, etc.¶
In general, only send attributes which are necessary for authentication / accounting, and filter out everything else. i.e. have a positive filter, not a negative one¶
These recommendations can be superseded by agreements between parties, such as to send location information.¶
It is hard to distinguish protocol-layer errors from (e.g.) authentication errors.¶
The timeouts on the original client and proxies are generally uncorrelated. This issue means that we have multiple points in the proxy chain where fail-over and/or timeouts can occur. This behavior does not contribute to network stability.¶
proxy timeouts should be lower than NAS timeouts, and then send Protocol-Error¶
timeouts should get shorter along the path.¶
perhaps client can add indication of timeout to packets, so proxies can know when the clients will give up.¶
perhaps a separate document?¶
should be on a per-packet basis¶
we don't know what the client timeout is?¶
The only real utility of Proxy-State is loop detection. It is possible to implement a RADIUS server / proxy which does nothing more than forward / copy reply of Proxy-State as per [RFC2865], Section 5.33. A proxy doesn't in fact need to add a Proxy-State attribute in order to work in most situations.¶
Proxy implementations SHOULD create Proxy-State using a site-local 64-bit value, taken from a secure random number generator. This value can then be used to identify a particular proxy server, or set of servers. This value should be the same for all packets proxied by the server. This same value should be used by all "equivalent" proxies, such as ones used in a load-balance configuration. Using 64 bits means that the chance of a hash collision is approximately one in 2^32, which should be sufficient. i.e. there are significantly fewer than 2^32 RADIUS servers on the planet.¶
Any one of the equivalent proxy servers can then examine Proxy-State, and look for their unique token. If the token exists, then a loop has been detected.¶
This recommendation isn't perfect. Other servers can still delete the Proxy-State, or copy the values that they see. But such corner cases are best addressed by administrator intervention.¶
This recommendation allows administrators and implementations to better detect inadvertent proxy loops, rather than deliberate misconfigurations.¶
Proxies should also NOT copy the Proxy-State attributes from the proxied reply to their reply. The downstream server may have deleted or modified the attributes. Instead a proxy should copy the Proxy-State from the original request to the reply.¶
It is possible for a malicious client to send enormous amounts of packets to proxies in a roaming consortium. There are a few different cases:¶
users immediately retrying after a reject. Home servers SHOULD implement a reject delay. Proxies SHOULD also enforce a reject delay for proxied packets. Note that this reject delay should enforce AT LEAST a delta between request and response. It MUST NOT simply add a delay to the response.¶
clients overloading upstream proxies with thousands of packets. e.g. when a power failure occurs, there may be thousands or tens of thousands of devices which are all trying to authenticate in a short period of time. Such behavior can result in congestive collapse, and overload of home servers. Clients (and proxies) should implement some kind of rate limiting.¶
Load balance Access-Request packets using Calling-Station-Id where it exists, because it's usually the MAC address. The User-Name may be "@example.com", so it is unsuitable for load balancing. Use it only when Calling-Station-ID does not exist.¶
Or, load balance based on session identification attributes (NAS-IP-Address, NAS-IPv6-Address, NAS-Port). Do NOT load-balance on packet source IP, or anything which identifies the NAS itself. A NAS may have multiple sessions!¶
If the system proxies both Access-Request and Accounting-Request packets, try to load balance on the same information for both types of packets. This consistency will ensure that all of the data for one sessions shares the same path (and fate) through the network.¶
When the next hop is known to be a home server, use consistent hashing for load balancing, in order to ensure that packets for one session are sent to the same destination. This is most important for EAP.¶
Load balancing is made more complex when the next hop is a home server, and the authentication is a multi-step process (challenge-response or EAP). When a home server fails, all "in process" authentications associated with that home server will fail.¶
However, as proxies do not track authentication sessions, the load balancing algorithms recommended above will simply redirect packets from the failed home server to other home servers. Those home servers will be unable to process packets in the middle of an authentication session.¶
And as noted above in Section 4.1, implementations are known to discard these packets without response, which contributes to instability.¶
One solution is for the proxy closest to the home server to maintain state for an authentication session. This is simple to do at low packet rates, but becomes more difficult at high packet rates.¶
One solution (experimental, but it should work) is to have the home server and the local proxies agree on magic values for the State attribute.¶
[RFC2865], Section 5.24 limits the use of State:¶
... the client MUST NOT interpret the attribute locally.¶
We relax this statement to say that a proxy MAY interpret the State attibute, but it MUST NOT do so without prior consultation with the next hop server.¶
The two systems could agree (for example) that the first octet of the State attribute is an identifier for the home server. Where State does not exist, the proxy can load-balance as described above. Where State does exist, it can bypass load balancing, and send the packet directly to the specified home server.¶
If the home server is down, the proxy knows that the packet cannot processed by other home servers, and therefore does not redirect the packet to them. Instead, it returns a Protocol-Error indicating that the home server is unreachable.¶
Where the proxy cannot coordinate with the home server, it could instead add an octet (as an identifier) to the State for responses, and remove that octet from requests. This does violate the prohibitions of [RFC2865], Section 5.24, but it is likely to work in limited situations. i.e. when the only way to reach the home servers is either directly from clients (which do not do this), or when all proxies implement the same method. It is therefore fragile.¶
This behavior is likely to not work for other proxies in a multi-hop proxy chain.¶
This isn't limited to UDP to TCP proxying. Even for UDP, there may be differing limits on packets in / packets out. The best solution here is Protocol-Error.¶
There is no overload indication in RADIUS. Instead, there is only per-packet signaling.¶
Failing that, adding capacity.¶
don't use a per-packet "store and forward", as multiple accounting packets with the same data will be sent. * instead, use session-based store and forward. Keep a session table (or even the last packet for a session), and send only that.¶
Session based store and forward also helps with poorly behaved clients, which are known to send accounting packets for all open sessions at fixed intervals. This is effectively a DoS attack on the server. A store and forward proxy can quench the attack by storing the data locally, and then proxying it upstream using [RFC5080], Section 2.2 algorithms.¶
bad clients are attacking servers, with 1000's of packets a second.¶
Sometimes it's the NAS, which is allowing the user to retry immediately (i.e 1ms later) after a reject. OR allowing the supplicant to retry even before the reject is sent! so 1000's of packets a second. This appears to be a violation of 802.1X requirements on rate limiting¶
openroaming is public, and are getting attacked with Syn flood attacks.¶
Syn flood mitigation is critical for public RADIUS Servers.¶
Public RADIUS servers should have aggressive timers on new connections. If a connection doesn't complete within a short period of time, it should be closed or discarded. i.e. connections which complete and which make progress should be prioritized over connections which don't complete, or which don't make progress.¶
This document has no actions for IANA.¶
There are no privacy considerations in this specification. It does not add or change any transport protocols, and it does not change the roles of any participant in the RADIUS ecosystem.¶
This specification has no direct impact on security. It does not add or change any existing cryprography in RADIUS. It is compatible with all RADIUS standards, including (ALPN draft).¶
This specification does, however, recommend ways that networks can be better protected from a variety of attacks, misconfiguration, and improper implementations. As a result, RADIUS proxy networks which implement this specification should become more stable, and therefore more secure.¶
This document came out of discussions both in the RADEXT WG, and in the 2025 RADIUS Technical Conference in Tampere, Finland.¶