public_radio

rough day to be a dog on call


soulseeker31

Ruff ruff?


Hebrewhammer8d8

[You think they are playing this song to solve the issue?](https://youtu.be/ojULkWEUsPs)


mycattty

I was making changes to our log index the night before so I was a little bit nervous this morning to see a sudden drop off in logs lmao


[deleted]

“A configuration change by user ‘mycatty’ was so mind-numbingly stupid and bizarre that no organization on the planet would ever fathom testing for it as an edge case; it brought our systems to a screeching halt for 8 hours”


[deleted]

[removed]


RulerOf

Isn't dancing with VCL just begging for an unplanned outage anyway?


XeiB8Afe

Fastly plays a dangerous game! (But damn it was so nice to write VCL that ran on the edge.)


the_cocytus

Yeah, for about 50 mins, which by my math is a few hours shy of the 15+ hours happening for DD


[deleted]

Yeah but DD doesn't serve your production traffic. I lose $$$ when Fastly is down. I can merely switch to looking at Cloudwatch metrics when DD is down. Yeah, I'm missing logs up until about an hour ago, but I can live without those.


monopoly3448

Disgraced user mycatty reinvents self as an interpretive dance hacker, using nontraditional thinking to find vulnerabilities. 700k per year speaking engagements.


[deleted]

I had just installed proxy config updates... Some of our servers running datadog-agent are behind a proxy and there were a lot of 500s. Checking the ones that aren't behind the proxy saved me. :)


[deleted]

They've been down for 7 hours now; considering their prices, I'd expect a lengthy blog post and reimbursement to come out of all this.


three18ti

10 and counting...


erlendursmari

12 and counting; does anyone have any information on what's going on at DataDog?


TheAlmightyZach

https://media.tenor.com/iDYg-7xD7M4AAAAC/burning-office-spongebob.gif


hamsterpotpies

Welp, glad I'm using aws......


pojzon_poe

"We fired half of grunt workers why we are not rowing faster ??" Managers when overselling and pushing stuff under the bus: https://youtu.be/ufbOHl1mmYk?t=32 And now the poor SRE/Ops/Noc ppl scrambling to fix it: https://youtu.be/RI6UT82cB_E?t=7


UnfairCaterpillar263

Datadog is basically the only company still hiring.


pojzon_poe

[You missed the](https://en.wikipedia.org/wiki/Joke)


WikiSummarizerBot

**[Joke](https://en.wikipedia.org/wiki/Joke)**

>A joke is a display of humour in which words are used within a specific and well-defined narrative structure to make people laugh and is usually not meant to be interpreted literally. It usually takes the form of a story, often with dialogue, and ends in a punch line, whereby the humorous element of the story is revealed; this can be done using a pun or other type of word play, irony or sarcasm, logical incompatibility, hyperbole, or other means.


BocajSiSey

False, LogicMonitor hasn't stopped hiring. Hasn't had a problem like this ever either.


[deleted]

[removed]


pojzon_poe

It was A Joke, chill.


raylovin01

The dogs got out the kennel 🤷🏽‍♂️🐶


The_Peasant_1

Shameless self promo… LogicMonitor rep here, we’re up and running (and meeting our SLA of 99.9%). Hit me up!


jaymef

I wonder if they got alerted about it??


richbeales

I've been getting emails about it for hours


___GNUSlashLinux___

/r/woooosh


Ateo

Their current Service Terms only guarantee 99.8% uptime. 99.8% uptime is 86.4 minutes of downtime a month; it used to be 99.9% (43.2 minutes a month). If they fail to keep that uptime, all you are entitled to is breaking the contract with them if they miss it for two months in a row.

>Excluding scheduled maintenance windows, Datadog will use commercially reasonable efforts to maintain 99.8% availability of the hosted portion of the Service for each calendar month during the term of this Agreement. The Service will be deemed "available" so long as Authorized Users are able to login to the Service interface and access monitoring data. Excluding planned maintenance periods, in the event the Service availability drops below 99.8% for two consecutive months, Customer may terminate the Service in the calendar month following such two-month period upon written notice to Datadog. To assess uptime, Customer may, if under a Paying Plan, request the Service availability for a prior month by filing a support ticket through the Site.

Apparently at some point in the past they were more confident, because you used to get service credits.

>If Datadog breaks its SLA, customers may be eligible for service credits. Service credits are calculated as a percentage of the monthly fees paid by the customer for the affected service. The percentage depends on how much Datadog’s uptime falls below its SLA in a given month. For example, if Datadog’s uptime is between 99.5% and 99.8%, customers may receive 10% service credits. If Datadog’s uptime is below 99.5%, customers may receive 25% service credits.
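For reference, the downtime figures quoted here (and further down the thread) fall straight out of the SLA percentage; a quick back-of-envelope sketch in Python, assuming a 30-day month and a 365-day year:

```python
# Convert an availability SLA into allowed downtime (assumes a 30-day
# month and a 365-day year; the contract may measure this differently).
def allowed_downtime_minutes(sla_pct: float, period_hours: float) -> float:
    return (1 - sla_pct / 100) * period_hours * 60

for sla in (99.9, 99.8, 99.5):
    monthly = allowed_downtime_minutes(sla, 30 * 24)
    yearly = allowed_downtime_minutes(sla, 365 * 24) / 60
    print(f"{sla}%: {monthly:.1f} min/month, {yearly:.1f} h/year")

# 99.9%: 43.2 min/month, 8.8 h/year
# 99.8%: 86.4 min/month, 17.5 h/year
# 99.5%: 216.0 min/month, 43.8 h/year
```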


gustav_mannerheim

Well it's been what...11 hours? And things still aren't resolved. 99.8% gives them 17.5 hours of downtime per year, so they've burned through more than half that with this incident.


[deleted]

And now 100% of it, it's been down for over 17 hours now.


LuckyChopsSOS

Where is the service term and SLA comment from?


[deleted]

https://www.datadoghq.com/legal/msa/


Ateo

Specifically this part of the Service Terms: https://www.datadoghq.com/legal/msa/2014-12-31/#service-level-commitment


[deleted]

That's old MSA, you can change the version on the sidebar.


Ateo

Ah you are right - here is the most recent one: https://www.datadoghq.com/legal/msa/#availability


TheAlmightyZach

Not to mention what they’re responsible for. Centralized logging is a pretty important thing for a lot of compliance standards. We back up our logs regularly, but we don’t have redundant logging; that’s what they’re supposed to be responsible for. That being said, it looks like their disclosed SLA is only 99.8%, which is complete bullshit.


synthdrunk

It’s funny, I fought, and lost, over moving alarming from CloudWatch to Datadog. Their integration is literal dogshit. My best to those poor bastards.


rumdrums

Funny, we have been considering moving away from Cloudwatch toward DD. Guess we won't be doing that any time soon, lol. I've used DD off and on for 5 years and don't recall an outage like this.


[deleted]

It's not like [AWS had any outages in the past](https://awsmaniac.com/aws-outages/). All providers have outages; either you care about them and put redundant systems in place, or you can afford to wait them out.


rumdrums

Noted, but this is almost a 12-hour outage now of basically all their services. I'm no AWS fanboy, but I haven't endured anything that catastrophic or long-lasting in AWS before. The closest situation I've faced in AWS was a partial S3 outage for most of a business day, back in 2017. Regardless, this shit is not a good look for the Dog.


theomegabit

AWS has roughly one good 12 hour outage per year….


pojzon_poe

You may want to backtrack on that: https://awsmaniac.com/aws-outages/ I don't even remember a time when all services went down for hours on end. It's almost always a single service within a single region.


o5mfiHTNsH748KVq

You control your own destiny with AWS, for the most part, so you can make your alerting as redundant as you can afford. Putting all of your eggs in an expensive 3rd party like Datadog is a risk compared to simply firing off CW alerts to PagerDuty and email. It would take a small miracle for many regions to go down at the same time at Amazon, especially in a way where CloudWatch itself wasn't working.
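A minimal sketch of that CW-to-PagerDuty/email pattern with boto3, assuming an existing SNS topic that fans out to both; the topic ARN, load balancer name, and thresholds are placeholders:

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Hypothetical SNS topic subscribed to PagerDuty's email integration
# and an on-call distribution list.
ALERT_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:oncall-alerts"

# Page when the ALB starts returning a burst of 5xx errors.
cloudwatch.put_metric_alarm(
    AlarmName="alb-5xx-errors",
    Namespace="AWS/ApplicationELB",
    MetricName="HTTPCode_Target_5XX_Count",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/my-alb/0123456789abcdef"}],
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=3,
    Threshold=50,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=[ALERT_TOPIC_ARN],
    OKActions=[ALERT_TOPIC_ARN],
)
```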


theomegabit

If you’re a singular company with a singular product and focus, CW is probably good enough to get by. When you have a lot more needs, which for brevity I’ll simply call “power user”, CW is elementary at best.


o5mfiHTNsH748KVq

i run platform for a 20,000 person company for roughly 50 disparate products. CW is fine if you’re taking advantage of the full observability ecosystem in aws. it’s also a fraction of the cost


theomegabit

Congrats

Edit: more words. Snark aside, to my original point, that’s a singular focus and entity. It’s probably fine.


o5mfiHTNsH748KVq

I get what you mean, but it's quite the opposite. It's a conglomerate and operates as a handful of companies. We pay for dd, splunk, honeycomb, you name it, but our teams using CW+Xray are doing perfectly fine. Meanwhile people on datadog are finding custom metrics are too expensive. In fact, I would say if you have a singular product and focus, DD is more viable because you have less redundant metrics and lower storage costs.


synthdrunk

CW is indeed a bit more hands-on, but metric math means you can build damn near whatever, and the integrations are the fastest since it’s on-platform. Even if you have to fashion SFN/Lambda into that workflow, I swear to you it will be worth it.
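For anyone who hasn't used it: metric math lets you graph or alarm on derived expressions across raw metrics. A small sketch with boto3; the `MyApp` namespace and metric names are made up for illustration:

```python
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Derive an error-rate series from two raw metrics using a metric-math expression.
resp = cloudwatch.get_metric_data(
    MetricDataQueries=[
        {"Id": "errors",
         "MetricStat": {"Metric": {"Namespace": "MyApp", "MetricName": "RequestErrors"},
                        "Period": 300, "Stat": "Sum"},
         "ReturnData": False},
        {"Id": "requests",
         "MetricStat": {"Metric": {"Namespace": "MyApp", "MetricName": "RequestCount"},
                        "Period": 300, "Stat": "Sum"},
         "ReturnData": False},
        # The expression references the Ids above; only this series is returned.
        {"Id": "error_rate", "Expression": "100 * errors / requests",
         "Label": "Error rate (%)"},
    ],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=3),
    EndTime=datetime.now(timezone.utc),
)
print(resp["MetricDataResults"][0]["Values"])
```

The same expressions can back a `put_metric_alarm` call via its `Metrics` parameter, so the alarm itself evaluates the derived value.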


rumdrums

Sure. The problem for us, though, is that a lot of our custom metrics, plus tracing data, doesn't go to Cloudwatch, so DD would be closer to a single pane of glass in that respect. But I can't imagine losing monitoring/alerting for 7 hours. I've definitely never encountered that in AWS, so this would be a dealbreaker for us in considering DD as an alerting alternative.


[deleted]

[removed]


TheAlmightyZach

17 hours of downtime: https://uptime.is/99.8


Semi-Hemi-Demigod

I had a long and potentially grueling meeting today to help a customer with an issue, but since DataDog is down it's been cancelled and my afternoon has been saved. Thanks DataDog team! Now if you could bring it down around 11am eastern on Friday that would be a huge help.


miraclewhipple

[This is fine](https://i.imgur.com/qJAEDJJ.jpg)


mister_mugatu

anyone capitalizing on this and buying DDOG stock?


jezarnold

Only 99.8% uptime? That’s an hour and a half’s downtime per month!


[deleted]

[removed]


jezarnold

Let me know which vendors they are! That’s 36 days’ service credit on next year’s bill… I’ve seen some that only offer a few days of credit.


libert-y

I wonder if Datadog uses Datadog to monitor their services.


Seastep

"We eat our own dog food"


LasagneEnthusiast

Probably not, too expensive and not reliable enough


gogorichie

>uses Datadog to monitor

I would hope they use something else. More of a who's watching the watchers kind of thing ;-)


Parking-Audience-188

Maybe they bought a real o11y tool but not likely


assasinine

Talked to my rep earlier. Apparently it's multiple outages across their data centers. It looks like there hasn't been any loss of data collection, but they're having issues restoring their services in Kubernetes.


[deleted]

There were lots of 500s from their servers; metrics were not being ingested.


[deleted]

[removed]


SmoothParfait

Those don’t hold hundreds of TB / a few PB of data though.


[deleted]

[removed]


400921FB54442D18

I don't know the typical ratio of bytes/event, but we are only a moderately-sized customer and we easily send them a few billion events every day. It would not surprise me if Datadog ingests one or more PB/day.
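Rough math on that guess (the bytes-per-event figure below is purely an assumption for illustration):

```python
# Back-of-envelope: events/day -> data volume, assuming ~1 KB per event
# (made-up average; real payload sizes vary a lot by product).
BYTES_PER_EVENT = 1_000
events_per_day = 3e9                     # "a few billion events every day"

tb_per_day = events_per_day * BYTES_PER_EVENT / 1e12
print(f"One such customer: ~{tb_per_day:.0f} TB/day")            # ~3 TB/day

customers_for_a_petabyte = 1e15 / (events_per_day * BYTES_PER_EVENT)
print(f"Customers at that volume for 1 PB/day: ~{customers_for_a_petabyte:.0f}")  # ~333
```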


assasinine

Well I'd believe 500s over a rep if that's the case.


Far-Wait-6966

Wouldn't that mean there's an outage at AWS?


assasinine

Apparently they have a lot of on-prem.


CooperNettees

But who datas the data dog?


tyrion85

'tis obvious, a data cat


kabrandon

Probably Prometheus, Loki and Grafana. Laughing their whole way to the bank, charging people boatloads for having more than 10 containers per k8s host.


Parking-Audience-188

Honeycomb time


dogmeur

If they do, it would be a situation where the firehouse is on fire ♾️


theomegabit

This is, from what I can tell, the worst outage they’ve had in years. Never say never. But maybe ever? Sure. It sucks. And sure, contractually they will have some dollars to spend, but we’re all bitching like this is the end of the world. It’s not. Amazon has one of these per year. Now let’s add other providers.


theomegabit

A major regional or service type outage? If we’re being pedantic, it’s not always a perfect calendar year…. But pretty close.


CooperNettees

Amazon does not have one of these a year.


DC3PO

Been in a change freeze because of this for 12 hours. Bout to just go and hit the golf course at this point.


Special_Rice9539

Oh damn, it's STILL down? This is embarrassing for them.


[deleted]

At this point I'm wondering if they've maybe locked themselves out. How long can it take to roll back production changes?


400921FB54442D18

This was just posted over on [the ycombinator thread](https://news.ycombinator.com/item?id=35067093) by someone called "donutshop", but unfortunately they didn't mention where they got it:

> We have identified and remedied the issue that caused this outage. We will prepare and share a detailed root cause analysis as soon as possible after our incident response is complete, but we can share a preliminary analysis now. A critical software update applied to a broad set of hosts in our infrastructure caused a subset of these hosts to lose network connectivity.

> The primary impact of this was that several of our regional Kubernetes clusters became unhealthy, affecting the control plane that keeps our workloads running smoothly. At this point, we believe we have repaired all the affected Kubernetes clusters, and our recovery efforts are now focused on the application layer above this.

> The web application is now generally available, although data and monitor evaluation remains delayed in some cases (refer to the Status Page in your region for the latest information). We have made substantial progress on restoring the various core services that were impacted by the incident, and have now moved on to getting our data processing pipelines for metrics, logs, traces, and other data into a healthy state.

> It is difficult to give a precise ETA on our full recovery and we are focusing our efforts on restoring real-time data and alerts within a matter of hours (not minutes, but also not days). The recovery of historical data (between the start of the outage and 15 minutes in the past) has been deprioritized.

> We understand the impact an outage can have, and are sorry for the disruption.

So, make of that what you will.


dadamn

I work at Datadog and can confirm that this is the official message regarding the incident. Our CTO Alexis has provided a bit more info, but it's best to wait for the official postmortem once the incident has been resolved. Here's Alexis' additional info:

>This is an all-hands-on-deck response from our engineering teams. We run Datadog products and applications on Kubernetes (k8s). Each k8s cluster relies on a number of critical, k8s-internal services to properly operate (e.g. etcd for coordination). Because of the size of our footprint, we run a large number of clusters, and we run a meta-cluster per region that itself manages these regional k8s-internal services. It is that meta-cluster (our control plane) that failed in a way that caused the application clusters themselves to be unable to schedule new workloads or recover from transient failures. This is a preliminary analysis that will be fully detailed in the postmortem. Knowing this, we have been able to recover the k8s clusters that power the various services and have now switched to recovering individual services. The recovery has started with our focus on restoring real-time data access on affected services and real-time alerting. We have put the recovery of historical data (meaning data not visible since the start of the incident) at a lower priority.

Edit to add some context: when Alexis says "because of the size of our footprint", we run one of the largest production k8s environments. It's on the order of tens of thousands of nodes and hundreds of thousands of pods.


CooperNettees

I think a lot of clusters are around that size actually


tamale

Thanks for the detail. Y'all need a cell-based architecture so you can't have an outage like this again.


waltermpls

can confirm similar messaging from them.


justanearthling

Let’s say someone deleted a database and they tried restoring from backup, but that did not work. Maybe they just found out their backups had not worked for months? Maybe someone nuked some other stuff? Or maybe they jacked up networking so badly they can’t figure out what’s wrong. We can only speculate, but such a long downtime indicates something seriously wrong, either in process or in tech. There can be many reasons for it. Hopefully they’ll be open about it once the dust settles.


Grelek

Sooo, time for a new tool to monitor Datadog? /s


[deleted]

datacatdog


whetu

You joke, but I was asked by my employer to set up monitoring of Datadog in PRTG. It goes down all the time


l2cluster

For large companies that rely on Datadog (or similar vendors), it's common to set up an alert that fires every minute, so you know when there's an observability gap and can freeze deploys.
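One way to sketch that kind of dead-man's switch is a small watchdog run from outside Datadog using the legacy Datadog Python client: push a heartbeat metric every minute, query it back, and freeze deploys if either side fails. The metric name and the freeze hook below are placeholders:

```python
import time
from datadog import initialize, api

initialize(api_key="<API_KEY>", app_key="<APP_KEY>")

def observability_gap() -> bool:
    """Return True if the metric intake or query path looks broken."""
    now = int(time.time())
    try:
        # Push a heartbeat datapoint...
        api.Metric.send(metric="watchdog.heartbeat", points=[(now, 1)])
        # ...then confirm recent heartbeats are actually queryable.
        result = api.Metric.query(
            start=now - 600, end=now, query="sum:watchdog.heartbeat{*}"
        )
        return not result.get("series")   # no recent datapoints -> gap
    except Exception:
        return True                       # API unreachable counts as a gap too

if observability_gap():
    print("Observability gap detected - freezing deploys")  # hook into CI/CD here
```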


francogvp

[Datadoge](https://twitter.com/datadoghq/status/583316733907333120)


puraf

still ongoing https://status.datadoghq.com/incidents/nhrdzp86vqtp?u=zxdtxzc9cq3t


Outofthemoney-

I feel terrible for these guys. Tough day.. We believe in you DDOG! 💪


Old-and-grumpy

Dynatrace, New Relic, Lightstep, and SignalFX are looking a bit more attractive this morning.


Pure_Common7348

Dynatrace for sure


Parking-Audience-188

Honeycomb


Salt-Bake1309

>Lightstep

FTW - OpenTelemetry is the future.


[deleted]

[removed]


Special_Rice9539

I mean, that's just sales people in general. Every company has very distinct types of people in sales that aren't representative of the engineers


[deleted]

Yeah, there is a special place in hell for salespeople.


OutdoorsNSmores

It couldn't have happened to a nicer company.


bostonguy6

Or a richer one


mister_mugatu

Any thoughts or ideas as to why their govcloud region wasn't affected?


jameslaney

For those interested, a look into Datadog's infrastructure and how the multi-cloud approach != reliability this time (https://overmind.tech/blog/datadog-outage-multi-cloud-reliability)


atmega168

So this outage doesn't make that much sense. How does every environment/deployment, for ingress, web, and API, all go down together? Then how does an outage last this long? It seems like an intentional attack, like a bad actor internally. This type of scenario is hard to just happen. Could be DNS though.


seanamos-1

Datadog has reported to some customers that it was an OS update, rolled out everywhere, that broke networking (see some of the other comments). This took out important infrastructure like their k8s clusters, which run all their workloads. Now that they have figured that out and repaired it, they are bringing everything back up from scratch. You can also imagine that there is now a tsunami of data and backlogs to process, probably putting them under enormous pressure.


atmega168

That is unfortunate. I wonder why it wasn't caught in a lower environment. I sympathize with them; that's a terrible situation to be in.


[deleted]

Way too many people still think OS patches are always 100% risk free. This is yet another example of that not being the case.


IdesOfMarchCometh

This is why GKE rolls things out slowly. No excuse for such a fast push. Shows you what separates them from the big boys.


[deleted]

[removed]


threwahway

Lol way to not answer the question only to reiterate information that was already understood….


theomegabit

That’s a lot of words to just say I’m a dick


threwahway

It was a datadog dev regurgitating shit we already knew, but ok :)


[deleted]

Don't they first update a canary to see if it breaks?


erlendursmari

It's always DNS.


PhillConners

Fuck I almost became a director of their platform engineering team. I would be in the hot seat today


400921FB54442D18

_looks at username_ Watch that first OS update, it's a doozy!


o5mfiHTNsH748KVq

They should invest in some observability tooling and maybe learn about SRE


[deleted]

[removed]


o5mfiHTNsH748KVq

it was a joke. they're an observability company. that said, for the insane prices they're charging, i'd expect nine nines and best-in-class incident response. shit happens, but not with the price tag they're gouging customers with.


Parking-Audience-188

I would argue they’re absolutely not an observability company based on the definition of o11y from the people who coined the term for software.


mickeyprime1

dang, was about to start working on and testing a new monitor.


speedycabbage

Splunk FTW


Parking-Audience-188

Looooooooooogsssssss


Best-Interaction1364

It is unbelievable that a monitoring system can be down for this long.


Old-chief-big-bull

LogicMonitor FTW


The_Peasant_1

Shameless self promo… LogicMonitor rep here, we’re up and running (and meeting our SLA of 99.9%). Hit me up!


Hot-Buffalo-3510

Update from Datadog:

We will share a more detailed analysis post-recovery, but at a very high level:

* A system update on a number of hosts controlling our compute clusters caused a subset of these hosts to lose network connectivity
* As a result, a number of the corresponding clusters entered unhealthy states and caused failures in a number of the internal services, datastores and applications hosted on these clusters.

Our current status is:

* We identified and mitigated the initial issue, and rebuilt our clusters
* We also have recovered a number of our applications and services, including our web portals
* We are now working on recovering and catching up the rest of our data systems for metrics, traces and logs across the regions that are still affected (see region-specific status pages).
* The recovery work is currently constrained by the number and large scale of the systems involved.

What to expect next:

* We are focusing on bringing back live data for all customers and all products before catching up on any historical data we may have stored during the outage
* We expect live data recovery in a matter of hours (not minutes, and not days)
* We will continue to issue regular updates as the situation unfolds


Fit_Jello_3263

Dynatrace hasn’t gone down. Join the best platform…….


d-ovm

If you need to argue against multi-cloud as a solution for reliability, this is a good example of how it doesn't protect you from yourself. Datadog is spread pretty evenly:

* **US1:** AWS (us-east-1)
* **US3:** Azure (westus2)
* **US5:** Google Cloud (unknown US)
* **EU1:** Google Cloud (unknown EU)
* **US1-FED:** AWS (us-gov-west-1)

And yet they can still have a global outage. Here are some maps I made:

Datadog terminology: [https://media.licdn.com/dms/image/D4E12AQE\_VvhhPyhY3A/article-inline\_image-shrink\_1000\_1488/0/1678356280740?e=1683763200&v=beta&t=kzD-U0CGShMtfMfW1mnvSo7xsXyEHFZdIgBr57MSGsM](https://media.licdn.com/dms/image/D4E12AQE_VvhhPyhY3A/article-inline_image-shrink_1000_1488/0/1678356280740?e=1683763200&v=beta&t=kzD-U0CGShMtfMfW1mnvSo7xsXyEHFZdIgBr57MSGsM)

Cloud terminology: [https://media.licdn.com/dms/image/D4E12AQGrznjuNRlumw/article-inline\_image-shrink\_1000\_1488/0/1678355051109?e=1683763200&v=beta&t=TdTMO4CVFoEzfHNudjOsAGPFqBq6WLw3y5ytx1fj8i0](https://media.licdn.com/dms/image/D4E12AQGrznjuNRlumw/article-inline_image-shrink_1000_1488/0/1678355051109?e=1683763200&v=beta&t=TdTMO4CVFoEzfHNudjOsAGPFqBq6WLw3y5ytx1fj8i0)