Internet-Draft                                       Michael DALLA RIVA
Intended status: Standards Track      Janeway and Wildman Limited (UK)
Expires: May 25, 2026                                  November 25, 2025


             Service Status Resource Format for Web Services
                draft-dallariva-web-service-status-json-00

Abstract

   This specification defines a standard JSON format for service status
   resources.  It provides a machine-readable format for overall
   service health indicators, component-level status information with
   criticality classification, geographic scope identification, and
   structured incident reporting.  This specification enables
   interoperability between status monitoring tools and service
   providers.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on May 25, 2026.

Copyright Notice

   Copyright (c) 2025 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.

Table of Contents

   1.  Introduction
     1.1.  Critical Use Cases
     1.2.  Terminology
     1.3.  Problem Statement
     1.4.  Real-World Impact: Major Internet Outages
     1.5.  Security Considerations: Malicious Attacks
   2.  Status Resource Format
     2.1.  Overall Structure
     2.2.  Service Status Object
     2.3.  Component Object
     2.4.  Incident Object
   3.  Status Values and Semantics
     3.1.  Component Status Values
     3.2.  Criticality Levels
     3.3.  Geographic Scope
   4.  Examples
     4.1.  Minimal Status Resource
     4.2.  Complete Status Resource
     4.3.  Regional Incident
   5.  IANA Considerations
     5.1.  Media Type Registration
   6.  Security Considerations
     6.1.  Information Disclosure
     6.2.  Denial of Service
     6.3.  Data Integrity
     6.4.  Attack Detection and Response
     6.5.  Mitigating False Security Alerts
     6.6.  Privacy Considerations
   7.  References
     7.1.  Normative References
     7.2.  Informative References
   Appendix A.  Acknowledgments
   Appendix B.  Migration from Existing Formats
   Author's Address

1.  Introduction

   Many web services provide status pages to communicate service health
   and incident information to their users.
   While RFC 8631 [RFC8631] defined the "status" link relation to
   enable discovery of these status resources, it intentionally left
   the format of the status data unspecified, allowing flexibility for
   different implementations.

   This has resulted in a fragmented ecosystem where each service
   provider implements their own status format, making it difficult
   for:

   o  Monitoring tools to aggregate status across multiple services

   o  Users to understand the scope and severity of incidents

   o  Automated systems to respond to service degradations

   o  Organizations to compare service reliability metrics

   o  Security teams to detect coordinated infrastructure attacks

   o  Dependent services to assess cascading failure impact

   This specification defines a standard JSON format for status
   resources, building on lessons learned from existing
   implementations.

1.1.  Critical Use Cases

   This standard addresses several critical operational scenarios:

   Cascading Failure Response:  When a major infrastructure provider
      experiences an outage (e.g., AWS, Cloudflare, Azure), thousands
      of dependent services need to rapidly assess impact.  Machine-
      readable criticality and component status enable automated
      assessment of which services are affected.

   Multi-Provider Incident Correlation:  During coordinated DDoS
      attacks or widespread network issues, security teams need to
      correlate incidents across multiple providers.  A standardized
      format enables automated threat intelligence aggregation.

   Automated Failover Decision Making:  Services using multiple
      providers need to automatically fail over during outages.
      Machine-readable status with clear criticality and scope enables
      automated decision making without human intervention.

   Incident Communication to End Users:  Services dependent on
      infrastructure providers need to communicate clearly to their
      users.  Standardized geographic scope and impact metrics enable
      accurate user communication.

   Security Incident Detection:  Unusual patterns across multiple
      providers may indicate coordinated attacks.  A standardized
      format enables security monitoring tools to detect these
      patterns.

   Disaster Recovery Planning:  Organizations need to assess their
      exposure when providers have incidents.  Clear component
      criticality enables rapid risk assessment.

1.2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

   This document uses the following terms:

   Service:  A web service or API that provides functionality to
      users.

   Component:  A discrete functional unit of a service (e.g., "API",
      "Database", "CDN").

   Incident:  An unplanned event that degrades or disrupts service
      functionality.

   Status Resource:  A resource linked via the "status" link relation
      that provides service health information.
1.3.  Problem Statement

   Analysis of 50+ production status page implementations reveals
   significant interoperability challenges:

   o  Format fragmentation: Services use incompatible JSON structures,
      HTML-only pages, or proprietary APIs

   o  Missing criticality semantics: No standard way to distinguish
      critical infrastructure (API, database) from non-critical
      features (billing UI, analytics dashboard)

   o  Geographic scope ambiguity: Regional incidents (e.g., "SMS
      delays in Kenya") are indistinguishable from global outages

   o  Inconsistent impact reporting: No standard for communicating the
      percentage of users affected or the estimated resolution time

   o  Integration burden: Monitoring tools must implement custom
      parsers for each service provider

   These problems result in:

   o  False alarms when non-critical components fail

   o  Delayed incident response due to unclear severity

   o  User confusion about whether issues affect them

   o  Wasted engineering time on integration and maintenance

1.4.  Real-World Impact: Major Internet Outages

   Recent major incidents demonstrate the critical need for
   standardized status communication:

   Cloudflare Database Update (November 2025):  A database schema
      update caused widespread service disruption affecting millions
      of websites.  Human error in the deployment process led to
      cascading failures.  Standardized criticality classification
      would have helped downstream services understand impact severity
      immediately.

   Facebook/Meta BGP Incident (2021):  Incorrect BGP route
      advertisements took down Facebook, Instagram, and WhatsApp for
      six hours, affecting 3.5 billion users globally.  The lack of
      standardized status reporting made it difficult for dependent
      services to communicate impact to their users.

   AWS us-east-1 Outages (October 2025):  Multiple major outages in
      AWS's largest region caused cascading failures across thousands
      of services.  Services dependent on AWS struggled to communicate
      which of their features were affected versus operational.

   Azure Active Directory Outage (2023):  Authentication failures
      prevented users from accessing Microsoft 365, Teams, and
      thousands of dependent services.  Geographic scope was initially
      unclear, causing global panic over what was primarily a regional
      issue.

   These incidents share common challenges:

   o  Cascade effect: One service's outage impacts hundreds of
      dependent services

   o  Communication breakdown: No standard way to report which
      components are affected

   o  User confusion: Unclear whether issues are global or regional

   o  Delayed response: Time wasted parsing different status formats

   o  Amplified impact: Lack of criticality data causes overreaction

1.5.  Security Considerations: Malicious Attacks

   The increasing frequency of coordinated attacks on internet
   infrastructure highlights additional needs:

   DDoS Attacks on DNS Providers:  Distributed denial-of-service
      attacks against major DNS providers (Dyn 2016, Google Cloud
      2017, AWS Route 53 2019) demonstrate the need for rapid,
      machine-readable status updates to enable automated failover.

   Targeted Infrastructure Attacks:  State-sponsored and criminal
      groups increasingly target internet infrastructure providers.
      Standardized status reporting enables:

      *  Faster detection of coordinated multi-provider attacks

      *  Automated correlation of incidents across services

      *  Rapid triggering of disaster recovery procedures

      *  Better communication during security incidents

   Supply Chain Attacks:  When infrastructure providers are
      compromised, dependent services need clear, machine-readable
      status to assess their exposure and communicate with their
      users.
   Without standardized status reporting:

   o  Security teams cannot aggregate threat intelligence from status
      pages

   o  Automated defense systems cannot trigger based on provider
      status

   o  Incident response is delayed by manual status checking

   o  Users receive inconsistent security guidance

   A standardized format enables:

   o  Automated monitoring for coordinated attacks

   o  Rapid failover when providers are under attack

   o  Machine-readable security incident disclosure

   o  Correlation of incidents across multiple providers

   o  Faster mean time to detection (MTTD) and response (MTTR)

2.  Status Resource Format

2.1.  Overall Structure

   A status resource MUST be a JSON object containing the following
   top-level fields:

   version (string, REQUIRED):  The version of this specification.
      The value MUST be "1.0" for this specification.

   service (object, REQUIRED):  Information about the service.  See
      Section 2.2.

   status (object, REQUIRED):  Overall service status.  See
      Section 2.2.

   components (array, OPTIONAL):  Array of component status objects.
      See Section 2.3.

   incidents (array, OPTIONAL):  Array of active incident objects.
      See Section 2.4.

   updated_at (string, REQUIRED):  ISO 8601 timestamp of the last
      status update.

   Example:

   {
     "version": "1.0",
     "service": { ... },
     "status": { ... },
     "components": [ ... ],
     "incidents": [ ... ],
     "updated_at": "2025-11-23T17:00:00Z"
   }

2.2.  Service Status Object

   The "service" object MUST contain:

   name (string, REQUIRED):  Human-readable service name.

   The "service" object MAY contain:

   url (string, OPTIONAL):  URL of the main service.

   The "status" object MUST contain:

   indicator (string, REQUIRED):  Overall status indicator.  MUST be
      one of:

      *  "operational" - All systems functioning normally

      *  "degraded" - Some systems experiencing issues

      *  "down" - Major service disruption

   The "status" object MAY contain:

   description (string, OPTIONAL):  Human-readable status description.

   affected_percentage (number, OPTIONAL):  Percentage of users
      affected (0.0-100.0).

   estimated_resolution (string, OPTIONAL):  ISO 8601 timestamp of the
      estimated resolution.

   Example:

   {
     "service": {
       "name": "Example CDN",
       "url": "https://example.com"
     },
     "status": {
       "indicator": "degraded",
       "description": "Elevated API error rates in US-East region",
       "affected_percentage": 5.2,
       "estimated_resolution": "2025-11-23T18:00:00Z"
     }
   }

2.3.  Component Object

   Each component object MUST contain:

   id (string, REQUIRED):  Unique identifier for the component.

   name (string, REQUIRED):  Human-readable component name.

   status (string, REQUIRED):  Current component status.  MUST be one
      of:

      *  "operational" - Functioning normally

      *  "degraded" - Performance issues

      *  "partial_outage" - Partially unavailable

      *  "major_outage" - Completely unavailable

      *  "under_maintenance" - Planned maintenance

   Each component object SHOULD contain:

   criticality (string, RECOMMENDED):  Component importance level.
      MUST be one of:

      *  "critical" - Core service functionality (API, database, auth)

      *  "high" - Important but non-essential (dashboard,
         notifications)

      *  "medium" - Optional features (analytics, reporting)

      *  "low" - Cosmetic features (documentation, marketing pages)

   scope (string, RECOMMENDED):  Geographic scope of the component.
      MUST be one of:

      *  "global" - Worldwide availability

      *  "regional" - Specific geographic region(s)

      *  "local" - Single location or facility

   Each component object MAY contain:

   description (string, OPTIONAL):  Component description.

   regions (array, OPTIONAL):  Array of ISO 3166-1 alpha-2 country
      codes if scope is "regional".

   group (string, OPTIONAL):  Component group identifier for logical
      organization.

   Example:

   {
     "id": "api-prod",
     "name": "Production API",
     "status": "operational",
     "criticality": "critical",
     "scope": "global",
     "description": "REST API for customer applications"
   }

   {
     "id": "sms-kenya",
     "name": "SMS Delivery - Kenya",
     "status": "degraded",
     "criticality": "medium",
     "scope": "regional",
     "regions": ["KE"],
     "description": "SMS message delivery through Kenyan carriers"
   }
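   The following non-normative Python sketch illustrates how a
   monitoring tool might consume component objects as defined above,
   alerting only on impaired components whose criticality is
   "critical" or "high".  The alerting threshold and the fallback
   value for a missing "criticality" field are assumptions made for
   illustration, not requirements of this specification.

   # Non-normative sketch: filter parsed component objects so that
   # only impaired, high-importance components produce alerts.
   import json

   IMPAIRED_STATUSES = {"degraded", "partial_outage", "major_outage"}
   ALERTABLE_CRITICALITY = {"critical", "high"}

   def alertable_components(status_document: str) -> list[dict]:
       """Return components that are impaired and important enough to page on."""
       resource = json.loads(status_document)
       selected = []
       for component in resource.get("components", []):
           impaired = component.get("status") in IMPAIRED_STATUSES
           # "criticality" is RECOMMENDED; default conservatively to "critical".
           important = component.get("criticality", "critical") in ALERTABLE_CRITICALITY
           if impaired and important:
               selected.append(component)
       return selected

   if __name__ == "__main__":
       doc = """{
         "version": "1.0",
         "service": {"name": "Example"},
         "status": {"indicator": "degraded"},
         "updated_at": "2025-11-23T17:00:00Z",
         "components": [
           {"id": "api-prod", "name": "Production API",
            "status": "degraded", "criticality": "critical", "scope": "global"},
           {"id": "docs", "name": "Documentation",
            "status": "major_outage", "criticality": "low", "scope": "global"}
         ]
       }"""
       for component in alertable_components(doc):
           print("ALERT:", component["id"], component["status"])

   With this policy, the degraded "api-prod" component pages an
   operator, while the outage of the low-criticality documentation
   site does not.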
2.4.  Incident Object

   Each incident object MUST contain:

   id (string, REQUIRED):  Unique incident identifier.

   name (string, REQUIRED):  Human-readable incident title.

   status (string, REQUIRED):  Current incident status.  MUST be one
      of:

      *  "investigating" - Issue detected, investigation ongoing

      *  "identified" - Root cause found

      *  "monitoring" - Fix applied, monitoring stability

      *  "resolved" - Issue fully resolved

   impact (string, REQUIRED):  Incident severity.  MUST be one of:

      *  "none" - No user impact (informational)

      *  "minor" - Minimal user impact

      *  "major" - Significant user impact

      *  "critical" - Severe widespread impact

   started_at (string, REQUIRED):  ISO 8601 timestamp when the
      incident started.

   Each incident object SHOULD contain:

   updates (array, RECOMMENDED):  Array of incident update objects
      with the fields:

      *  timestamp (string, REQUIRED): ISO 8601 update time

      *  status (string, REQUIRED): Status at the time of the update

      *  message (string, REQUIRED): Human-readable update

   affected_components (array, RECOMMENDED):  Array of component IDs
      affected by this incident.

   Each incident object MAY contain:

   estimated_resolution (string, OPTIONAL):  ISO 8601 timestamp of the
      estimated resolution.

   resolved_at (string, OPTIONAL):  ISO 8601 timestamp when resolved.

   affected_percentage (number, OPTIONAL):  Percentage of users
      affected.

   Example:

   {
     "id": "inc-2025-11-23-001",
     "name": "API Response Time Degradation",
     "status": "monitoring",
     "impact": "minor",
     "started_at": "2025-11-23T16:45:00Z",
     "affected_components": ["api-prod"],
     "affected_percentage": 12.5,
     "updates": [
       {
         "timestamp": "2025-11-23T16:45:00Z",
         "status": "investigating",
         "message": "We are investigating elevated API response times."
       },
       {
         "timestamp": "2025-11-23T17:00:00Z",
         "status": "identified",
         "message": "Issue identified in database connection pool. Deploying fix."
       },
       {
         "timestamp": "2025-11-23T17:15:00Z",
         "status": "monitoring",
         "message": "Fix deployed. Monitoring for stability."
       }
     ]
   }
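   As a non-normative illustration, the sketch below summarizes an
   incident object: it decides whether the incident is still active
   and picks the most recent update.  Because this specification does
   not mandate an ordering for the "updates" array, the sketch sorts
   by "timestamp"; treating every status other than "resolved" as
   active is likewise an assumption made here.

   # Non-normative sketch: summarize an incident object (Section 2.4).
   from datetime import datetime

   def parse_ts(value: str) -> datetime:
       # datetime.fromisoformat() accepts the trailing "Z" only from
       # Python 3.11; normalize it for older interpreters.
       return datetime.fromisoformat(value.replace("Z", "+00:00"))

   def summarize_incident(incident: dict) -> dict:
       updates = sorted(incident.get("updates", []),
                        key=lambda u: parse_ts(u["timestamp"]))
       latest = updates[-1] if updates else None
       return {
           "id": incident["id"],
           "active": incident["status"] != "resolved"
                     and "resolved_at" not in incident,
           "impact": incident["impact"],
           "latest_message": latest["message"] if latest else None,
       }

   if __name__ == "__main__":
       incident = {
           "id": "inc-2025-11-23-001",
           "name": "API Response Time Degradation",
           "status": "monitoring",
           "impact": "minor",
           "started_at": "2025-11-23T16:45:00Z",
           "updates": [
               {"timestamp": "2025-11-23T16:45:00Z", "status": "investigating",
                "message": "We are investigating elevated API response times."},
               {"timestamp": "2025-11-23T17:15:00Z", "status": "monitoring",
                "message": "Fix deployed. Monitoring for stability."},
           ],
       }
       print(summarize_incident(incident))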
3.  Status Values and Semantics

3.1.  Component Status Values

   This specification defines five standard component status values:

   operational:  Component is functioning within normal parameters.
      Response times, error rates, and availability meet SLA targets.

   degraded:  Component is functional but experiencing performance
      issues.  Users may experience slower response times or elevated
      error rates, but core functionality remains available.

   partial_outage:  Component is partially unavailable.  Some features
      or geographic regions are affected while others remain
      operational.

   major_outage:  Component is completely unavailable or
      non-functional.  All users are affected.

   under_maintenance:  Component is undergoing planned maintenance.
      This status SHOULD only be used for scheduled maintenance
      windows.

3.2.  Criticality Levels

   Components SHOULD be classified by criticality to enable monitoring
   tools to distinguish between critical infrastructure failures and
   minor feature issues:

   critical:  Core service functionality that, if degraded, prevents
      users from accessing primary service features.  Examples: API
      endpoints, authentication systems, primary databases, CDN edge
      networks.

   high:  Important functionality that significantly impacts user
      experience but does not prevent core service usage.  Examples:
      web dashboards, notification systems, secondary databases,
      caching layers.

   medium:  Optional features that enhance user experience but are not
      essential for basic service operation.  Examples: analytics
      dashboards, reporting tools, administrative interfaces.

   low:  Cosmetic or informational features that do not impact service
      functionality.  Examples: marketing pages, documentation sites,
      status page itself, billing interfaces.

   Rationale: This classification enables monitoring systems to:

   o  Generate alerts only for critical component failures

   o  Avoid alarm fatigue from non-critical issues

   o  Calculate more accurate service availability metrics

   o  Provide users with appropriate context about impact

3.3.  Geographic Scope

   Components and incidents SHOULD specify geographic scope to
   distinguish between global outages and regional issues:

   global:  Affects all users worldwide.  This SHOULD be the default
      for components that serve all regions.

   regional:  Affects users in specific geographic regions.  The
      "regions" array MUST be populated with ISO 3166-1 alpha-2
      country codes.

   local:  Affects a single data center, facility, or point of
      presence.

   Rationale: This classification enables:

   o  Users to determine if an incident affects their region

   o  Monitoring tools to display region-specific status

   o  Service providers to maintain accurate global vs regional SLA
      metrics

   o  Automated systems to route traffic away from affected regions

4.  Examples

4.1.  Minimal Status Resource

   The simplest valid status resource:

   {
     "version": "1.0",
     "service": { "name": "Example Service" },
     "status": { "indicator": "operational" },
     "updated_at": "2025-11-23T17:00:00Z"
   }

4.2.  Complete Status Resource

   A comprehensive status resource with all optional fields:

   {
     "version": "1.0",
     "service": {
       "name": "Example Cloud Platform",
       "url": "https://example.com"
     },
     "status": {
       "indicator": "degraded",
       "description": "Elevated error rates in API",
       "affected_percentage": 8.5
     },
     "components": [
       {
         "id": "api",
         "name": "REST API",
         "status": "degraded",
         "criticality": "critical",
         "scope": "global",
         "description": "Primary REST API endpoints"
       },
       {
         "id": "dashboard",
         "name": "Web Dashboard",
         "status": "operational",
         "criticality": "high",
         "scope": "global",
         "description": "Customer management dashboard"
       },
       {
         "id": "cdn-asia",
         "name": "CDN - Asia Pacific",
         "status": "operational",
         "criticality": "critical",
         "scope": "regional",
         "regions": ["JP", "SG", "AU", "IN"]
       }
     ],
     "incidents": [
       {
         "id": "inc-001",
         "name": "API Error Rate Increase",
         "status": "monitoring",
         "impact": "minor",
         "started_at": "2025-11-23T16:00:00Z",
         "affected_components": ["api"],
         "affected_percentage": 8.5,
         "estimated_resolution": "2025-11-23T18:00:00Z",
         "updates": [
           {
             "timestamp": "2025-11-23T16:00:00Z",
             "status": "investigating",
             "message": "Investigating elevated API error rates"
           },
           {
             "timestamp": "2025-11-23T16:30:00Z",
             "status": "identified",
             "message": "Database connection pool exhaustion identified"
           },
           {
             "timestamp": "2025-11-23T17:00:00Z",
             "status": "monitoring",
             "message": "Connection pool limits increased, monitoring"
           }
         ]
       }
     ],
     "updated_at": "2025-11-23T17:00:00Z"
   }
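   A consumer can combine the "scope" and "regions" fields
   (Section 3.3) with component status to decide whether users in a
   particular country are affected.  The following non-normative
   sketch does this; defaulting a missing scope to "global" and
   skipping "local" components are assumptions made here, not rules of
   this specification.

   # Non-normative sketch: which impaired components matter for users
   # in a given country (ISO 3166-1 alpha-2 code)?
   IMPAIRED_STATUSES = {"degraded", "partial_outage", "major_outage"}

   def affects_country(resource: dict, country_code: str) -> list[str]:
       """Return IDs of impaired components relevant to the given country."""
       affected = []
       for component in resource.get("components", []):
           if component.get("status") not in IMPAIRED_STATUSES:
               continue
           scope = component.get("scope", "global")
           if scope == "global":
               affected.append(component["id"])
           elif scope == "regional" and country_code in component.get("regions", []):
               affected.append(component["id"])
           # "local" components are skipped here; mapping a facility to a
           # country would require information this format does not carry.
       return affected

   Applied to the Section 4.2 example, a user in Japan ("JP") would be
   told only about the globally degraded "api" component, since
   "cdn-asia" is operational.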
4.3.  Regional Incident

   Status resource showing a regional carrier issue:

   {
     "version": "1.0",
     "service": { "name": "SMS Gateway" },
     "status": {
       "indicator": "operational",
       "description": "Regional carrier issues (limited impact)"
     },
     "components": [
       {
         "id": "sms-global",
         "name": "SMS - Global",
         "status": "operational",
         "criticality": "critical",
         "scope": "global"
       },
       {
         "id": "sms-ke",
         "name": "SMS - Kenya (Telkom)",
         "status": "degraded",
         "criticality": "medium",
         "scope": "regional",
         "regions": ["KE"]
       }
     ],
     "incidents": [
       {
         "id": "inc-002",
         "name": "SMS Delays to Telkom Kenya",
         "status": "investigating",
         "impact": "minor",
         "started_at": "2025-11-23T15:00:00Z",
         "affected_components": ["sms-ke"],
         "updates": [
           {
             "timestamp": "2025-11-23T15:00:00Z",
             "status": "investigating",
             "message": "SMS delivery delays to Telkom Kenya. Working with carrier partner."
           }
         ]
       }
     ],
     "updated_at": "2025-11-23T17:00:00Z"
   }

5.  IANA Considerations

5.1.  Media Type Registration

   This document requests registration of the media type
   "application/vnd.service-status+json":

   Type name:  application

   Subtype name:  vnd.service-status+json

   Required parameters:  None

   Optional parameters:  version - Version of the status format
      (default "1.0")

   Encoding considerations:  UTF-8

   Security considerations:  See Section 6

   Interoperability considerations:  None

   Published specification:  This document

   Applications that use this media type:  Status monitoring systems,
      service health dashboards, incident management tools

6.  Security Considerations

6.1.  Information Disclosure

   Status resources may reveal information about service architecture,
   dependencies, and vulnerabilities.  Service providers SHOULD:

   o  Avoid exposing internal system names or implementation details

   o  Use generic component names (e.g., "Database", not "PostgreSQL
      v14.2 on prod-db-01")

   o  Consider rate limiting or authentication for detailed status
      APIs

   o  Exclude security-sensitive incidents from public status pages

6.2.  Denial of Service

   Status resources SHOULD be:

   o  Cached aggressively to reduce load during incidents

   o  Served from separate infrastructure to maintain availability

   o  Rate limited to prevent abuse

6.3.  Data Integrity

   Status resources SHOULD be served over HTTPS to prevent tampering.
   Automated monitoring systems SHOULD validate:

   o  TLS certificates

   o  JSON schema conformance

   o  Timestamp freshness to detect stale data

6.4.  Attack Detection and Response

   Standardized status reporting provides significant security
   benefits:

6.4.1.  Coordinated Attack Detection

   Machine-readable status across multiple providers enables automated
   detection of coordinated attacks:

   DDoS Pattern Recognition:  When multiple DNS providers or CDNs show
      degraded performance simultaneously, automated systems can
      recognize coordinated DDoS campaigns.

   Supply Chain Attack Indicators:  Unusual incident patterns across
      infrastructure providers may indicate supply chain compromise.

   Geographic Attack Correlation:  Multiple providers reporting
      regional issues in the same geographic area may indicate
      targeted infrastructure attacks.

   Example Detection Scenario:

   1.  Monitoring system observes that Cloudflare, Akamai, and Fastly
       all report "degraded" status for "CDN" components

   2.  All incidents started within a 15-minute window

   3.  All show scope: "global"

   4.  Automated alert: "Possible coordinated CDN attack detected"

   5.  Security team investigates and coordinates response

   Without a standardized format, this correlation requires manual
   effort and is often detected too late.
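   The detection scenario above can be expressed as a short,
   non-normative correlation check.  In the sketch below, how the
   status documents are fetched, the "CDN" keyword match, and the
   minimum number of providers required are assumptions made for
   illustration; only the field names come from this specification.

   # Non-normative sketch: flag a possible coordinated attack when
   # several providers report impaired CDN components whose incidents
   # started within a 15-minute window.
   from datetime import datetime, timedelta

   IMPAIRED = {"degraded", "partial_outage", "major_outage"}

   def incident_starts(resource: dict, component_keyword: str) -> list[datetime]:
       """Start times of incidents touching impaired, keyword-matching components."""
       impaired_ids = {c["id"] for c in resource.get("components", [])
                       if c.get("status") in IMPAIRED
                       and component_keyword.lower() in c.get("name", "").lower()}
       return [datetime.fromisoformat(i["started_at"].replace("Z", "+00:00"))
               for i in resource.get("incidents", [])
               if impaired_ids.intersection(i.get("affected_components", []))]

   def coordinated_attack_suspected(resources: list[dict],
                                    keyword: str = "CDN",
                                    window: timedelta = timedelta(minutes=15),
                                    min_providers: int = 3) -> bool:
       starts = []
       for resource in resources:
           times = incident_starts(resource, keyword)
           if times:
               starts.append(min(times))
       if len(starts) < min_providers:
           return False
       return max(starts) - min(starts) <= window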
6.4.2.  Automated Failover During Attacks

   Clear criticality and scope enable automated security responses:

   o  Automatically fail over from an attacked provider to a backup

   o  Route traffic away from regions under attack

   o  Trigger incident response procedures based on criticality

   o  Disable non-critical features to preserve core functionality

   Example Failover Scenario:

   Primary DNS provider status:

   {
     "components": [{
       "name": "DNS",
       "status": "major_outage",
       "criticality": "critical",
       "scope": "global"
     }]
   }

   Automated response:

   1.  Detect critical + major_outage + global

   2.  Automatically switch DNS to the secondary provider

   3.  Notify the operations team

   4.  Continue monitoring for recovery

6.4.3.  Incident Response Coordination

   During major incidents, a standardized format enables:

   Clear Communication:  Security teams can rapidly assess which
      components are affected and communicate clearly with
      stakeholders.

   Dependency Mapping:  Organizations can automatically determine
      their exposure when infrastructure providers have incidents.

   Prioritization:  Critical component failures trigger immediate
      response while low-criticality issues are queued.

   Historical Analysis:  Standardized incident reporting enables
      post-incident analysis of attack patterns and provider
      reliability.

6.5.  Mitigating False Security Alerts

   Standardized criticality reduces security alert fatigue:

   o  Security systems should only alert on critical component
      failures

   o  Non-critical issues (billing, UI) should not trigger a security
      response

   o  Regional issues should not trigger global security alerts

   o  Maintenance windows should be distinguished from attacks

   This reduces mean time to detection (MTTD) by eliminating noise and
   enabling security teams to focus on genuine threats.

6.6.  Privacy Considerations

   Status resources should not reveal:

   o  Individual customer data or affected user identities

   o  Specific attack vectors or vulnerabilities being exploited

   o  Internal security measures or monitoring capabilities

   o  Detailed incident response procedures

   The "affected_percentage" field should provide aggregate metrics
   only and should not enable identification of specific users or
   customers.

7.  References

7.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

7.2.  Informative References

   [RFC8631]  Wilde, E., "Link Relation Types for Web Services",
              RFC 8631, DOI 10.17487/RFC8631, July 2019,
              <https://www.rfc-editor.org/info/rfc8631>.

Appendix A.  Acknowledgments

   This specification builds on the work of Erik Wilde (RFC 8631) and
   practical experience from existing service status implementations.
   Special thanks to the operators of monitoring systems who provided
   feedback on the challenges of status page integration.

Appendix B.  Migration from Existing Formats

   Many existing status page implementations use similar JSON
   structures.  Common mappings include:

   -  Overall status indicators (operational, degraded, etc.)

   -  Component-level status tracking

   -  Incident impact levels

   This specification adds standardization for:

   -  Component criticality classification

   -  Geographic scope identification

   -  Structured region information

   Services with existing status APIs can:

   1.  Maintain existing endpoints for backward compatibility

   2.  Add a new endpoint implementing this specification

   3.  Gradually migrate clients to the standardized format

   4.  Use content negotiation to serve multiple formats, as sketched
       below
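   The non-normative sketch below shows one simple way to implement
   step 4: inspect the HTTP Accept header and prefer the media type
   registered in Section 5.1 when the client asks for it.  The legacy
   media type name and the simplified handling of wildcards and
   q-values are assumptions made for illustration.

   # Non-normative sketch: choose which status representation to serve
   # based on the request's Accept header.
   STANDARD_TYPE = "application/vnd.service-status+json"
   LEGACY_TYPE = "application/json"   # assumed name of the existing format

   def choose_representation(accept_header: str) -> str:
       """Return the media type to serve for a given Accept header value."""
       offered = {part.split(";")[0].strip().lower()
                  for part in accept_header.split(",") if part.strip()}
       # Prefer the standardized format whenever the client explicitly
       # accepts it; q-values and wildcards are not weighed in this
       # simplified sketch, so all other requests get the legacy format.
       if STANDARD_TYPE in offered:
           return STANDARD_TYPE
       return LEGACY_TYPE

   # Example:
   #   choose_representation("application/vnd.service-status+json, */*;q=0.1")
   #   -> "application/vnd.service-status+json"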
Author's Address

   Michael DALLA RIVA
   Janeway and Wildman Limited (UK)

   Email: ietf@janewayandwildman.com