Considerations for Happy Eyeballs Error Reporting

Happy Eyeballs provides a way to reduce user-visible delay when an FQDN has multiple IP addresses and connectivity over the addresses that should be preferred performs worse than over the alternatives. However, this hides possible connectivity issues from the operator and from other parties in the chain between the client and the service being accessed: users do not notice anything broken, so they do not report it to their providers. For example, in the case of a dual-stack web site, if IPv6 connectivity is broken at any point between the client and the hosting service, Happy Eyeballs (HE in the rest of this document) will quickly fall back to IPv4.

The goal of this document is to discuss the different aspects to be considered, in order to provide a decision path towards the best possible choice for the final HE error reporting solution. That solution should provide an integral HE error reporting mechanism that enables setting up alarms and triggering further investigations, so as to improve network reliability by facilitating the detection of failures as soon as they appear, without the need for additional external monitoring.

For HE reporting, we consider five different personas and have identified the following use cases, covering the parts of the network that can cause HE fallback (IPv6 to IPv4 or QUIC to TCP):

1. Application developers and users: To validate their own applications and infrastructure, as well as to provide support, application developers may want to trace why certain communications prefer IPv4 over IPv6. In this case, it is especially important to understand whether the client received an AAAA RR, whether an IPv6 connection was actually attempted (and, if not, why not), and how the attempt failed, e.g., lost race, TCP/TLS/QUIC handshake failed, ... Especially for web applications, exposing this information through developer tools or performance logs is crucial. Technically capable users and support personnel can use the same mechanisms.

2. Corporate and other managed network operators: Corporate middleboxes such as firewalls and other endpoint security solutions often interfere with IPv6 and QUIC. Application feedback is therefore essential to diagnose issues with these solutions and their configuration.

3. Service provider operators: The service provider is usually the primary party for receiving and acting on reports. Within its administrative realm, HE fallback can be caused by issues on the (provider-managed) CE, by router and middlebox misconfiguration inside the provider network, or by issues at direct peerings/transits. In cellular dual-stack deployments, the problem is typically in the provider network rather than in the UE or customer LAN. Even when the customer manages parts of the CE or internal network, fallback reports remain useful to the provider because they can indicate customer-impacting misconfiguration.

4. Intermediate transit operators: While these networks are not the most likely source of HE fallbacks, misconfigurations in them may also lead to IPv6 or protocol degradation. Fine-grained analysis of reports is likely infeasible, but sampling may be used as an indicator of larger issues.

5. Content providers: This has been the most common source of the problem for some time. Examples include content providers that configure AAAA RRs when their IPv6 connectivity is poor or even nonexistent, misconfigured load balancers or firewalls, etc. If error reports are sent to the content provider, it will only be able to fix the problem if it is a general one, affecting any possible source address in the Internet. Content providers are usually unable to understand and fix problems along the path, and need personas 1-4 to fix problems in their own administrative domains.

In case 1, for developers, some kind of plug-in for developer tools and performance logs is needed in order to understand HE decisions. The same is true for language libraries that perform Happy Eyeballs under the hood. In this case, reporting may be something that developer code can turn on and off as part of its tracing/debugging facilities.

TBD: Discussion needed. As the privacy impact depends heavily on the persona, the reporting solutions should also differ in the information they expose. As a minimum, the source address (possibly just a prefix rather than an individual address, which may also resolve privacy issues) and the destination address (or even the FQDN) are needed. It also seems logical to report which destination address failed, which one succeeded, and whether the problem was with IPv6 (fallback to IPv4) or with QUIC (fallback to TCP). For Personas 1 and 2, it also seems useful to report the timers that caused the fallback (Resolution Delay, Connection Attempt Delay).
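As an illustration only, the fields listed above could be grouped into a single record. The following is a minimal sketch, not a proposed format: the class name, field names, fallback labels, and example values are all assumptions made for this example, and timers are included only when the receiving persona is meant to see them.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class HEFallbackReport:
    """Hypothetical HE fallback report; all field names are illustrative."""
    source_prefix: str           # a prefix, not an individual address
    destination_fqdn: str
    failed_destination: str      # address whose attempt failed or lost the race
    succeeded_destination: str   # address that was finally used
    fallback_kind: str           # assumed labels: "ipv6-to-ipv4" or "quic-to-tcp"
    # Timers, suggested in the text for Personas 1 and 2 only
    resolution_delay_ms: Optional[int] = None
    connection_attempt_delay_ms: Optional[int] = None

    def to_json(self, include_timers: bool) -> str:
        record = asdict(self)
        if not include_timers:
            # Drop timer fields for personas that should not receive them
            record.pop("resolution_delay_ms")
            record.pop("connection_attempt_delay_ms")
        return json.dumps(record, sort_keys=True)


# Example values use documentation address ranges and are made up.
report = HEFallbackReport(
    source_prefix="2001:db8:1234::/48",
    destination_fqdn="www.example.com",
    failed_destination="2001:db8:ffff::80",
    succeeded_destination="192.0.2.80",
    fallback_kind="ipv6-to-ipv4",
    resolution_delay_ms=50,
    connection_attempt_delay_ms=250,
)
print(report.to_json(include_timers=False))
```

Keeping the timers optional lets one record type serve several personas while exposing less information to those that do not need it.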

TBD: Discussion needed. For a developer or user, every failure is expected to be reported as long as reporting is enabled. For a corporate or managed network, the reporting granularity should be configurable by the network administrator through some kind of logging policy. For service providers, reporting every failure may generate too much additional load, so mechanisms such as rate limiting are advisable.
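One possible way to bound the reporting load on the client side is a token bucket. The sketch below is only an assumption about how rate limiting might be done; the class name and parameters are hypothetical, and the document does not specify any particular algorithm.

```python
import time


class ReportRateLimiter:
    """Token bucket: about `rate` reports per second, bursts up to `burst`."""

    def __init__(self, rate: float, burst: int, clock=time.monotonic):
        self.rate = rate
        self.capacity = float(burst)
        self.tokens = float(burst)   # start with a full bucket
        self.clock = clock           # injectable clock, useful for testing
        self.last = clock()

    def allow(self) -> bool:
        """Return True if a report may be sent now, False if it is dropped."""
        now = self.clock()
        # Refill for the elapsed time, capped at the bucket capacity
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Example: at most one report per minute on average, bursts of five.
limiter = ReportRateLimiter(rate=1 / 60, burst=5)
sent = sum(1 for _ in range(20) if limiter.allow())
print(f"{sent} of 20 back-to-back fallback events would be reported")
```

A developer or user persona would simply bypass the limiter, while a managed network could set `rate` and `burst` from its logging policy.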

TBD: Discussion needed. The reporting mechanisms will depend on the persona and the use case, but in general we can consider the following options:

a. Reports for developers and users, exposed in developer tools and performance logs, which developer code can turn on and off as part of its tracing/debugging facilities.

b. For corporate and managed network operators, existing mechanisms that integrate into client management, such as syslog, systemd-journal, or Windows events. Using these is crucial for integrating the reporting into existing monitoring and alerting systems.

c. A new mechanism to be used by content providers (W3C Network Error Logging).

d. For service provider operators, either a new mechanism designed specifically for HE error reporting (such as ICMP) or existing mechanisms such as those in b.

TBD - Discussion needed. Considerations for choosing a protocol: the balance of work to be done in reporting hosts versus the service provider, and the chances of being implemented in hosts versus the chances of being implemented in service providers.

TBD. Format: JSON, QLOG?

TBD. Service discovery to identify the listener of the reporting protocols: IANA dual-stack defined address?

TBD.
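To make the "turned on/off by the developer code" and "existing logging mechanisms" options more concrete, the following sketch routes fallback events through Python's standard `logging` module. The logger name, the `report_fallback` helper, and the event fields are hypothetical. An in-memory stream is used so that the example runs anywhere; in a managed deployment the handler could instead be something like `logging.handlers.SysLogHandler`.

```python
import io
import json
import logging

# Hypothetical logger name; a real client library would choose its own.
logger = logging.getLogger("he.fallback")
logger.propagate = False  # keep the example output isolated from the root logger

# In-memory sink for the demonstration; swap the handler to reach syslog etc.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(name)s %(levelname)s %(message)s"))
logger.addHandler(handler)


def report_fallback(event: dict) -> None:
    """Emit one fallback event as a JSON log line (illustrative format)."""
    logger.info(json.dumps(event, sort_keys=True))


event = {"destination": "www.example.com", "fallback": "quic-to-tcp"}

logger.setLevel(logging.WARNING)  # reporting turned off by the application
report_fallback(event)            # dropped

logger.setLevel(logging.INFO)     # reporting turned on
report_fallback(event)            # recorded

print(buffer.getvalue(), end="")
```

Using the platform's normal logging path means the on/off switch and the destination are controlled by configuration that developers and administrators already know.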

TBD. Very draft text follows, from previous work and list inputs.

The goal is to provide the operator with information about the failures detected by HE, without requiring information about specific users' traffic. To this end, it is sufficient to provide the error collector with details about the failed destination address and the source prefix, which avoids privacy issues regarding the identification of a specific device or user.

Nowadays, operators already log this information in order to comply with lawful interception regulations, and in general, data protection regulations allow this logging when technically required. Data protection regulations explicitly say that the data can't be disclosed, and there is no need to do so.

In general, vendors also collect telemetry data from devices in order to improve OSs, and in some situations regulations require offering the user the option to enable or disable that feature. We could consider offering the same option for this mechanism.

When the mechanism described in this document detects a failure, the operator will need to find out whether the problem is related to:

1. A specific subscriber (customer internal networks, or even their CE).

2. A group of subscribers or the entire service provider network (e.g., one or several parts of the service provider network).

3. Intermediate transits.

4. Content provider.

In terms of privacy considerations, those cases fall into one of the following categories:

a. Failure caused in the customer internal network: The operator may decide, depending on its country's regulations and the services offered to that customer, to inform the customer (and decide what information is provided), or to ignore the failure and include it in a "white list" (i.e., a list of "don't care" failures), so that the monitoring system doesn't keep raising alerts about it.

b. Failure caused by the service provider network: The operator will need to find the cause and fix the failure, without disclosing any personal data.

c. Failure caused by third parties (intermediate transits or content provider): The operator does not need to disclose any specific user source address/prefix, because in this case the shorter prefix from which the failure has been verified (typically the RIR-allocated prefix, or part of it when it is announced split among different BGP peers) is sufficient to re-verify the error.

In the most extreme case, a more restrictive usage of this procedure, not involving logging any user source address/prefix, would be to log only the failed destination address. In a large percentage of cases this will be enough for the service provider to detect the failure (use cases 2, 3, and 4), as experience shows that HE fallback occurs mainly because of path or destination misconfiguration or issues. The service provider could then replicate the failure from any other source address in its network to the same failed destination. If we take this approach, failures internal to a specific subscriber cannot be reported by the operator to the customer (as no source data is logged), and these, together with partial failures of the operator network, will require extra work from the operator's staff to research the cause of the failure (i.e., whether it is in the operator's network, in part of it, at a specific customer, or external).

So, is there any distinction between the privacy issues of this protocol and those of regular network operation and management, abuse reporting, etc.?

TBD: In the case of content provider reporting, could something like Network Error Logging and/or Navigation Timing help, in that more detailed information can be sent to an endpoint within a TLS connection without exposing anything to the network?
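The two privacy levels discussed above (logging only a source prefix, or logging only the failed destination) can be sketched with Python's standard `ipaddress` module. The function names and the default prefix lengths (/24 for IPv4, /48 for IPv6) are assumptions chosen for illustration; the document does not define them.

```python
import ipaddress


def source_prefix(address: str, v4_len: int = 24, v6_len: int = 48) -> str:
    """Reduce an individual source address to its covering prefix.

    The default lengths are illustrative assumptions, not recommendations.
    """
    ip = ipaddress.ip_address(address)
    length = v4_len if ip.version == 4 else v6_len
    # strict=False masks the host bits instead of raising an error
    return str(ipaddress.ip_network(f"{ip}/{length}", strict=False))


def minimal_record(source: str, failed_destination: str,
                   log_source: bool = True) -> dict:
    """Build a log record; with log_source=False only the destination is kept."""
    record = {"failed_destination": failed_destination}
    if log_source:
        record["source_prefix"] = source_prefix(source)
    return record


print(source_prefix("2001:db8:1234:5678::1"))   # 2001:db8:1234::/48
print(source_prefix("192.0.2.77"))              # 192.0.2.0/24
print(minimal_record("2001:db8:1234:5678::1", "2001:db8:ffff::80",
                     log_source=False))
```

With `log_source=False` the record matches the most restrictive case, where the operator can still reproduce the failure from another source address in its own network.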

This document does not have any specific security considerations.

This document does not have any IANA considerations.

The author would like to acknowledge the inputs of Gert Doering, Erik Nygren ...