Delayed Processing for a Subset of Metrics

low
Datadog
Monitoring
Feb 23, 2025 09:30
Duration: 6h 1m
api
API Issue

Summary

This incident has been resolved.

Impact

minor

Timeline

Feb 23, 2025 09:30

[investigating] We are investigating increased latency in processing Trace Metrics. As a result of this issue, some users may see delays or gaps for a subset of their metrics on graphs and for statistics in Service Catalog.

via statuspage
+31m
Feb 23, 2025 10:02

[identified] We have identified the underlying issue and are working on a fix. It is important to note that no data has been lost: data will be backfilled and available once the service is operational again.

via statuspage
+31m
Feb 23, 2025 10:33

[identified] We have identified the underlying issue and continue to work on a fix. It is important to note that no data has been lost: data will be backfilled and available once the service is operational again.

via statuspage
+50m
Feb 23, 2025 11:24

[identified] We have identified the underlying issue and continue to work on a fix. It is important to note that no data has been lost: data is being backfilled and will be available once the service is operational again.

via statuspage
+1h 42m
Feb 23, 2025 13:06

[identified] We have identified the underlying issue and continue to work on a fix. It is important to note that no data has been lost: data is being backfilled and will be available once the service is operational again.

via statuspage
+1h 12m
Feb 23, 2025 14:17

[monitoring] We have deployed a fix and we are monitoring the results. We will provide another update once the issue is fully resolved.

via statuspage
+1h 15m
Feb 23, 2025 15:32

[resolved] This incident has been resolved.

via statuspage
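
The gaps between updates and the overall duration can be rechecked from the timestamps above. The following is a minimal Python sketch, assuming all timestamps share one time zone (the page does not state which); the names STAMPS and minutes_between are illustrative and not part of any Datadog or statuspage tooling.

```python
from datetime import datetime

# Update timestamps copied from the timeline above.
# Assumption: all are in the same (unstated) time zone.
STAMPS = [
    "Feb 23, 2025 09:30",  # investigating
    "Feb 23, 2025 10:02",  # identified
    "Feb 23, 2025 10:33",  # identified
    "Feb 23, 2025 11:24",  # identified
    "Feb 23, 2025 13:06",  # identified
    "Feb 23, 2025 14:17",  # monitoring
    "Feb 23, 2025 15:32",  # resolved
]


def minutes_between(stamps):
    """Return the whole-minute gap between each pair of consecutive timestamps."""
    times = [datetime.strptime(s, "%b %d, %Y %H:%M") for s in stamps]
    return [int((b - a).total_seconds() // 60) for a, b in zip(times, times[1:])]


gaps = minutes_between(STAMPS)
total = sum(gaps)
print(gaps)
print(f"first update to resolution: {total // 60}h {total % 60}m")
```

The displayed timestamps carry only minute precision, so the recomputed gaps can differ from the listed offsets and the listed duration by about a minute.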

Lessons Learned

Datadog has experienced 33 incidents in the past year. This frequency may point to systemic reliability challenges, and teams that depend on the service may want additional monitoring of it.

📊 Incidents related to api have occurred 182 times across all providers in the past year. This is one of the most common failure categories in cloud infrastructure.

💡 This incident is categorized as: API Issue. Consider implementing preventive measures specific to this failure category.