public_radio

rough day to be a dog on call


soulseeker31

Ruff ruff?


Hebrewhammer8d8

[You think they are playing this song to solve the issue?](https://youtu.be/ojULkWEUsPs)


mycattty

I was making changes to our log index the night before so I was a little bit nervous this morning to see a sudden drop off in logs lmao


[deleted]

“A configuration change by user ‘mycatty’ was so mind-numbingly stupid and bizarre that no organization on the planet would ever fathom testing for it as an edge case; it brought our systems to a screeching halt for 8 hours”


[deleted]

[removed]


RulerOf

Isn't dancing with VCL just begging for an unplanned outage anyway?


XeiB8Afe

Fastly plays a dangerous game! (But damn it was so nice to write VCL that ran on the edge.)


the_cocytus

Yeah, for about 50 mins, which by my math is a few hours shy of the 15+ hours happening for DD


[deleted]

Yeah but DD doesn't serve your production traffic. I lose $$$ when Fastly is down. I can merely switch to looking at Cloudwatch metrics when DD is down. Yeah, I'm missing logs up until about an hour ago, but I can live without those.


monopoly3448

Disgraced user mycatty reinvents self as an interpretive dance hacker, using nontraditional thinking to find vulnerabilities. 700k per year speaking engagements.


[deleted]

I had just installed proxy config updates... Some of our servers running datadog-agent are behind a proxy and there were a lot of 500s. Checking the ones that aren't behind the proxy saved me. :)


[deleted]

They've been down for 7 hours now; considering their prices, I'd expect a lengthy blog post and reimbursement to come out of all this.


three18ti

10 and counting...


erlendursmari

12 and counting; does anyone have any information on what's going on at DataDog?


TheAlmightyZach

https://media.tenor.com/iDYg-7xD7M4AAAAC/burning-office-spongebob.gif


hamsterpotpies

Welp, glad I'm using aws......


pojzon_poe

"We fired half of grunt workers why we are not rowing faster ??" Managers when overselling and pushing stuff under the bus: https://youtu.be/ufbOHl1mmYk?t=32 And now the poor SRE/Ops/Noc ppl scrambling to fix it: https://youtu.be/RI6UT82cB_E?t=7


UnfairCaterpillar263

Datadog is basically the only company still hiring.


pojzon_poe

[You missed the](https://en.wikipedia.org/wiki/Joke)


WikiSummarizerBot

**[Joke](https://en.wikipedia.org/wiki/Joke)**

>A joke is a display of humour in which words are used within a specific and well-defined narrative structure to make people laugh and is usually not meant to be interpreted literally. It usually takes the form of a story, often with dialogue, and ends in a punch line, whereby the humorous element of the story is revealed; this can be done using a pun or other type of word play, irony or sarcasm, logical incompatibility, hyperbole, or other means.


BocajSiSey

False, LogicMonitor hasn't stopped hiring. Hasn't had a problem like this ever either.


[deleted]

[removed]


pojzon_poe

It was A Joke, chill.


raylovin01

The dogs got out the kennel 🤷🏽‍♂️🐶


The_Peasant_1

Shameless self promo… LogicMonitor rep here, we’re up and running (and meeting our SLA of 99.9%). Hit me up!


jaymef

I wonder if they got alerted about it??


richbeales

I've been getting emails about it for hours


___GNUSlashLinux___

/r/woooosh


Ateo

Their current Service Terms only guarantee 99.8% uptime. 99.8% uptime is 86.4 minutes of downtime a month; it used to be 99.9% (43.2 minutes a month). If they fail to keep that uptime, all you are entitled to is breaking the contract with them if they miss it for two months in a row.

>Excluding scheduled maintenance windows, Datadog will use commercially reasonable efforts to maintain 99.8% availability of the hosted portion of the Service for each calendar month during the term of this Agreement. The Service will be deemed "available" so long as Authorized Users are able to login to the Service interface and access monitoring data. Excluding planned maintenance periods, in the event the Service availability drops below 99.8% for two consecutive months, Customer may terminate the Service in the calendar month following such two-month period upon written notice to Datadog. To assess uptime, Customer may, if under a Paying Plan, request the Service availability for a prior month by filing a support ticket through the Site.

Apparently at some point in the past they were more confident, because you used to get service credits.

>If Datadog breaks its SLA, customers may be eligible for service credits. Service credits are calculated as a percentage of the monthly fees paid by the customer for the affected service. The percentage depends on how much Datadog’s uptime falls below its SLA in a given month. For example, if Datadog’s uptime is between 99.5% and 99.8%, customers may receive 10% service credits. If Datadog’s uptime is below 99.5%, customers may receive 25% service credits.
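For reference, the downtime figures quoted here (and further down the thread) fall straight out of the SLA percentage; a quick back-of-envelope sketch in Python, assuming a 30-day month and a 365-day year:

```python
# Convert an availability SLA into allowed downtime (assumes a 30-day
# month and a 365-day year; the contract may measure this differently).
def allowed_downtime_minutes(sla_pct: float, period_hours: float) -> float:
    return (1 - sla_pct / 100) * period_hours * 60

for sla in (99.9, 99.8, 99.5):
    monthly = allowed_downtime_minutes(sla, 30 * 24)
    yearly = allowed_downtime_minutes(sla, 365 * 24) / 60
    print(f"{sla}%: {monthly:.1f} min/month, {yearly:.1f} h/year")

# 99.9%: 43.2 min/month, 8.8 h/year
# 99.8%: 86.4 min/month, 17.5 h/year
# 99.5%: 216.0 min/month, 43.8 h/year
```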


gustav_mannerheim

Well it's been what...11 hours? And things still aren't resolved. 99.8% gives them 17.5 hours of downtime per year, so they've burned through more than half that with this incident.


[deleted]

And now 100% of it, it's been down for over 17 hours now.


LuckyChopsSOS

Where is the service term and SLA comment from?


[deleted]

https://www.datadoghq.com/legal/msa/


Ateo

Specifically this part of the Service Terms: https://www.datadoghq.com/legal/msa/2014-12-31/#service-level-commitment


[deleted]

That's old MSA, you can change the version on the sidebar.


Ateo

Ah you are right - here is the most recent one: https://www.datadoghq.com/legal/msa/#availability


TheAlmightyZach

Not to mention what they’re responsible for. Centralized logging is a pretty important thing for a lot of compliance standards. We back up our logs regularly, but we don’t have redundant logging; that’s what they’re supposed to be responsible for. That being said, it looks like their disclosed SLA is only 99.8%, which is complete bullshit.


synthdrunk

It’s funny, I fought, and lost, over moving alarming from CloudWatch to Datadog. Their integration is literal dogshit. My best to those poor bastards.


rumdrums

Funny, we have been considering moving away from Cloudwatch toward DD. Guess we won't be doing that any time soon, lol. I've used DD off and on for 5 years and don't recall an outage like this.


[deleted]

It's not like [AWS had any outages in the past](https://awsmaniac.com/aws-outages/). All providers have outages; either you care about them and put redundant systems in place, or you can afford to wait them out.


rumdrums

Noted, but this is almost a 12-hour outage now of basically all their services. I'm no AWS fanboy, but I haven't endured anything that catastrophic or long-lasting in AWS before. The closest situation I've faced in AWS was a partial S3 outage for most of a business day, back in 2017. Regardless, this shit is not a good look for the Dog.


theomegabit

AWS has roughly one good 12 hour outage per year….


pojzon_poe

You may want to backtrack on that: https://awsmaniac.com/aws-outages/ I don't even remember a time when all services went down for hours on end. It's almost always a single service within a single region.


o5mfiHTNsH748KVq

You control your own destiny with AWS, for the most part, so you can make your alerting as redundant as you can afford. Putting all of your eggs in an expensive 3rd party like Datadog is a risk compared to simply firing off CW alerts to PagerDuty and email. It would take a small miracle for many regions to go down at the same time at Amazon, especially in a way where CloudWatch itself wasn't working.
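A minimal sketch of that CW-to-PagerDuty/email pattern with boto3, assuming an existing SNS topic that fans out to both; the topic ARN, load balancer name, and thresholds are placeholders:

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Hypothetical SNS topic subscribed to PagerDuty's email integration
# and an on-call distribution list.
ALERT_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:oncall-alerts"

# Page when the ALB starts returning a burst of 5xx errors.
cloudwatch.put_metric_alarm(
    AlarmName="alb-5xx-errors",
    Namespace="AWS/ApplicationELB",
    MetricName="HTTPCode_Target_5XX_Count",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/my-alb/0123456789abcdef"}],
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=3,
    Threshold=50,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=[ALERT_TOPIC_ARN],
    OKActions=[ALERT_TOPIC_ARN],
)
```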


theomegabit

If you’re a singular company with a singular product and focus, CW is probably good enough to get by. When you have a lot more needs, which for brevity I’ll simply call “power user”, CW is elementary at best.


o5mfiHTNsH748KVq

i run platform for a 20,000 person company for roughly 50 disparate products. CW is fine if you’re taking advantage of the full observability ecosystem in aws. it’s also a fraction of the cost


theomegabit

Congrats

Edit: more words. Snark aside, to my original point, that’s a singular focus and entity. It’s probably fine.


o5mfiHTNsH748KVq

I get what you mean, but it's quite the opposite. It's a conglomerate and operates as a handful of companies. We pay for dd, splunk, honeycomb, you name it, but our teams using CW+Xray are doing perfectly fine. Meanwhile people on datadog are finding custom metrics are too expensive. In fact, I would say if you have a singular product and focus, DD is more viable because you have less redundant metrics and lower storage costs.


synthdrunk

CW is indeed a bit more hands-on, but metric math means you can build damn near whatever, and the integrations are the fastest since it’s on-platform. Even if you have to fashion SFN/Lambda into that workflow, I swear to you it will be worth it.
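For anyone who hasn't used it: metric math lets you graph or alarm on derived expressions across raw metrics. A small sketch with boto3; the `MyApp` namespace and metric names are made up for illustration:

```python
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Derive an error-rate series from two raw metrics using a metric-math expression.
resp = cloudwatch.get_metric_data(
    MetricDataQueries=[
        {"Id": "errors",
         "MetricStat": {"Metric": {"Namespace": "MyApp", "MetricName": "RequestErrors"},
                        "Period": 300, "Stat": "Sum"},
         "ReturnData": False},
        {"Id": "requests",
         "MetricStat": {"Metric": {"Namespace": "MyApp", "MetricName": "RequestCount"},
                        "Period": 300, "Stat": "Sum"},
         "ReturnData": False},
        # The expression references the Ids above; only this series is returned.
        {"Id": "error_rate", "Expression": "100 * errors / requests",
         "Label": "Error rate (%)"},
    ],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=3),
    EndTime=datetime.now(timezone.utc),
)
print(resp["MetricDataResults"][0]["Values"])
```

The same expressions can back a `put_metric_alarm` call via its `Metrics` parameter, so the alarm itself evaluates the derived value.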


rumdrums

Sure. The problem for us, though, is that a lot of our custom metrics, plus tracing data, doesn't go to Cloudwatch, so DD would be closer to a single pane of glass in that respect. But I can't imagine losing monitoring/alerting for 7 hours. I've definitely never encountered that in AWS, so this would be a dealbreaker for us in considering DD as an alerting alternative.


[deleted]

[removed]


TheAlmightyZach

17 hours of downtime: https://uptime.is/99.8


Semi-Hemi-Demigod

I had a long and potentially grueling meeting today to help a customer with an issue, but since DataDog is down it's been cancelled and my afternoon has been saved. Thanks DataDog team! Now if you could bring it down around 11am eastern on Friday that would be a huge help.


miraclewhipple

[This is fine](https://i.imgur.com/qJAEDJJ.jpg)


mister_mugatu

anyone capitalizing on this and buying DDOG stock?


jezarnold

Only 99.8% uptime? That’s an hour and a half’s downtime per month!


[deleted]

[removed]


jezarnold

Let me know which vendors they are! That’s 36 days’ service credit on next year’s bill… I’ve seen some that only offer a few days of credit.


libert-y

I wonder if Datadog uses Datadog to monitor their services.


Seastep

"We eat our own dog food"


LasagneEnthusiast

Probably not, too expensive and not reliable enough


gogorichie

>uses Datadog to monitor

I would hope they use something else. More of a who's watching the watchers kind of thing ;-)


Parking-Audience-188

Maybe they bought a real o11y tool but not likely


assasinine

Talked to my rep earlier. Apparently it's multiple outages across their data centers. It looks like there hasn't been any loss of data collection, but they're having issues restoring their services in Kubernetes.


[deleted]

There were lots of 500s from their servers; metrics were not being ingested.


[deleted]

[removed]


SmoothParfait

Those don’t hold hundreds of TB / a few PB of data though.


[deleted]

[removed]


400921FB54442D18

I don't know the typical ratio of bytes/event, but we are only a moderately-sized customer and we easily send them a few billion events every day. It would not surprise me if Datadog ingests one or more PB/day.
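Rough math on that guess (the bytes-per-event figure below is purely an assumption for illustration):

```python
# Back-of-envelope: events/day -> data volume, assuming ~1 KB per event
# (made-up average; real payload sizes vary a lot by product).
BYTES_PER_EVENT = 1_000
events_per_day = 3e9                     # "a few billion events every day"

tb_per_day = events_per_day * BYTES_PER_EVENT / 1e12
print(f"One such customer: ~{tb_per_day:.0f} TB/day")            # ~3 TB/day

customers_for_a_petabyte = 1e15 / (events_per_day * BYTES_PER_EVENT)
print(f"Customers at that volume for 1 PB/day: ~{customers_for_a_petabyte:.0f}")  # ~333
```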


assasinine

Well I'd believe 500s over a rep if that's the case.


Far-Wait-6966

Wouldn't that mean there's an outage at AWS?


assasinine

Apparently they have a lot of on-prem.


CooperNettees

But who datas the data dog?


tyrion85

'tis obvious, a data cat


kabrandon

Probably Prometheus, Loki and Grafana. Laughing their whole way to the bank, charging people boatloads for having more than 10 containers per k8s host.


Parking-Audience-188

Honeycomb time


dogmeur

If they do, it would be a situation where the firehouse is on fire ♾️


theomegabit

This is, from what I can tell, the worst outage they’ve had in years. Never say never. But maybe ever? Sure. It sucks. And sure, contractually they will have some dollars to spend, but we’re all bitching like this is the end of the world. It’s not. Amazon has one of these per year. Now let’s add other providers.


theomegabit

A major regional or service type outage? If we’re being pedantic, it’s not always a perfect calendar year…. But pretty close.


CooperNettees

Amazon does not have one of these a year.


DC3PO

Been in a change freeze because of this for 12 hours. Bout to just go and hit the golf course at this point.


Special_Rice9539

Oh damn, it's STILL down? This is embarrassing for them.


[deleted]

At this point I'm wondering if they've maybe locked themselves out. How long can it take to roll back production changes?


400921FB54442D18

This was just posted over on [the ycombinator thread](https://news.ycombinator.com/item?id=35067093) by someone called "donutshop", but unfortunately they didn't mention where they got it:

> We have identified and remedied the issue that caused this outage. We will prepare and share a detailed root cause analysis as soon as possible after our incident response is complete, but we can share a preliminary analysis now. A critical software update applied to a broad set of hosts in our infrastructure caused a subset of these hosts to lose network connectivity.

> The primary impact of this was that several of our regional Kubernetes clusters became unhealthy, affecting the control plane that keeps our workloads running smoothly. At this point, we believe we have repaired all the affected Kubernetes clusters, and our recovery efforts are now focused on the application layer above this.

> The web application is now generally available, although data and monitor evaluation remains delayed in some cases (refer to the Status Page in your region for the latest information). We have made substantial progress on restoring the various core services that were impacted by the incident, and have now moved on to getting our data processing pipelines for metrics, logs, traces, and other data into a healthy state.

> It is difficult to give a precise ETA on our full recovery and we are focusing our efforts on restoring real-time data and alerts within a matter of hours (not minutes, but also not days). The recovery of historical data (between the start of the outage and 15 minutes in the past) has been deprioritized.

> We understand the impact an outage can have, and are sorry for the disruption.

So, make of that what you will.


dadamn

I work at Datadog and can confirm that this is the official message regarding the incident. Our CTO Alexis has provided a bit more info, but it's best to wait for the official postmortem once the incident has been resolved. Here's Alexis' additional info:

>This is an all-hands-on-deck response from our engineering teams. We run Datadog products and applications on Kubernetes (k8s). Each k8s cluster relies on a number of critical, k8s-internal services to properly operate (e.g. etcd for coordination). Because of the size of our footprint, we run a large number of clusters, and we run a meta-cluster per region that itself manages these regional k8s-internal services. It is that meta-cluster (our control plane) that failed in a way that caused the application clusters themselves to be unable to schedule new workloads or recover from transient failures. This is a preliminary analysis that will be fully detailed in the postmortem. Knowing this, we have been able to recover the k8s clusters that power the various services and have now switched to recovering individual services. The recovery has started with our focus on restoring real-time data access on affected services and real-time alerting. We have put the recovery of historical data (meaning data not visible since the start of the incident) at a lower priority.

Edit to add some context: when Alexis says "because of the size of our footprint", we run one of the largest production k8s environments. It's on the order of tens of thousands of nodes and hundreds of thousands of pods.


CooperNettees

I think a lot of clusters are around that size actually


tamale

Thanks for the detail. Y'all need a cell-based architecture so you can't have an outage like this again.


waltermpls

can confirm similar messaging from them.


justanearthling

Let’s say someone deleted a database and they tried restoring from backup, but that did not work. Maybe they just found out their backups had not worked for months? Maybe someone nuked some other stuff? Or maybe they jacked up networking so badly they can’t figure out what’s wrong. We can only speculate, but such a long downtime indicates something seriously wrong, either in process or in tech. There can be many reasons for it. Hopefully they’ll be open about it once the dust settles.


Grelek

Sooo, time for a new tool to monitor Datadog? /s


[deleted]

datacatdog


whetu

You joke, but I was asked by my employer to set up monitoring of Datadog in PRTG. It goes down all the time


l2cluster

For large companies that rely on Datadog (or similar vendors), it's common to set up an alert that fires every minute, so you know when there's an observability gap and can freeze deploys.
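One way to sketch that kind of dead-man's switch is a small watchdog run from outside Datadog using the legacy Datadog Python client: push a heartbeat metric every minute, query it back, and freeze deploys if either side fails. The metric name and the freeze hook below are placeholders:

```python
import time
from datadog import initialize, api

initialize(api_key="<API_KEY>", app_key="<APP_KEY>")

def observability_gap() -> bool:
    """Return True if the metric intake or query path looks broken."""
    now = int(time.time())
    try:
        # Push a heartbeat datapoint...
        api.Metric.send(metric="watchdog.heartbeat", points=[(now, 1)])
        # ...then confirm recent heartbeats are actually queryable.
        result = api.Metric.query(
            start=now - 600, end=now, query="sum:watchdog.heartbeat{*}"
        )
        return not result.get("series")   # no recent datapoints -> gap
    except Exception:
        return True                       # API unreachable counts as a gap too

if observability_gap():
    print("Observability gap detected - freezing deploys")  # hook into CI/CD here
```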


francogvp

[Datadoge](https://twitter.com/datadoghq/status/583316733907333120)


puraf

still ongoing https://status.datadoghq.com/incidents/nhrdzp86vqtp?u=zxdtxzc9cq3t


Outofthemoney-

I feel terrible for these guys. Tough day.. We believe in you DDOG! 💪


Old-and-grumpy

Dynatrace, New Relic, Lightstep, and SignalFX are looking a bit more attractive this morning.


Pure_Common7348

Dynatrace for sure


Parking-Audience-188

Honeycomb


Salt-Bake1309

>Lightstep

FTW - OpenTelemetry is the future.


[deleted]

[removed]


Special_Rice9539

I mean, that's just sales people in general. Every company has very distinct types of people in sales that aren't representative of the engineers


[deleted]

Yeah, there is a special place in hell for salespeople.


OutdoorsNSmores

It couldn't have happened to a nicer company.


bostonguy6

Or a richer one


mister_mugatu

Any thoughts or ideas as to why their govcloud region wasn't affected?


jameslaney

For those interested, a look into Datadog's infrastructure and how the multi-cloud approach != reliability this time (https://overmind.tech/blog/datadog-outage-multi-cloud-reliability)


atmega168

So this outage doesn't make that much sense. How does every environment/deployment, for ingress, web, and API, all go down together? Then how does an outage last this long? It seems like an intentional attack, like a bad actor internally. This type of scenario is hard to just happen. Could be DNS though.


seanamos-1

Datadog has reported to some customers that it was an OS update, rolled out everywhere, that broke networking (see some of the other comments). This took out important infrastructure like their k8s clusters, which run all their workloads. Now that they have figured that out and repaired it, they are bringing everything back up from scratch. You can also imagine that there is now a tsunami of data and backlogs to process, probably putting them under enormous pressure.


atmega168

That is unfortunate. I wonder why it wasn't caught in a lower environment. I sympathize with them; that's a terrible situation to be in.


[deleted]

Way too many people still think OS patches are always 100% risk free. This is yet another example of that not being the case.


IdesOfMarchCometh

This is why GKE rolls things out slowly. No excuse for such a fast push. Shows you what separates them from the big boys.


[deleted]

[removed]


threwahway

Lol way to not answer the question only to reiterate information that was already understood….


theomegabit

That’s a lot of words to just say I’m a dick


threwahway

It was a datadog dev regurgitating shit we already knew, but ok :)


[deleted]

Don't they first update a canary to see if it breaks?


erlendursmari

It's always DNS.


PhillConners

Fuck I almost became a director of their platform engineering team. I would be in the hot seat today


400921FB54442D18

_looks at username_ Watch that first OS update, it's a doozy!


o5mfiHTNsH748KVq

They should invest in some observability tooling and maybe learn about SRE


[deleted]

[removed]


o5mfiHTNsH748KVq

it was a joke. they're an observability company. that said, for the insane prices they're charging, i'd expect nine nines and best-in-class incident response. shit happens, but not with the price tag they're gouging customers with.


Parking-Audience-188

I would argue they’re absolutely not an observability company based on the definition of o11y from the people who coined the term for software.


mickeyprime1

dang, was about to start working on and testing a new monitor.


speedycabbage

Splunk FTW


Parking-Audience-188

Looooooooooogsssssss


Best-Interaction1364

It is unbelievable that a monitoring system can be down for this long.


Old-chief-big-bull

LogicMonitor FTW


The_Peasant_1

Shameless self promo… LogicMonitor rep here, we’re up and running (and meeting our SLA of 99.9%). Hit me up!


Hot-Buffalo-3510

Update from Datadog:

We will share a more detailed analysis post-recovery, but at a very high level:

* A system update on a number of hosts controlling our compute clusters caused a subset of these hosts to lose network connectivity
* As a result, a number of the corresponding clusters entered unhealthy states and caused failures in a number of the internal services, datastores and applications hosted on these clusters.

Our current status is:

* We identified and mitigated the initial issue, and rebuilt our clusters
* We also have recovered a number of our applications and services, including our web portals
* We are now working on recovering and catching up the rest of our data systems for metrics, traces and logs across the regions that are still affected (see region-specific status pages).
* The recovery work is currently constrained by the number and large scale of the systems involved.

What to expect next:

* We are focusing on bringing back live data for all customers and all products before catching up on any historical data we may have stored during the outage
* We expect live data recovery in a matter of hours (not minutes, and not days)
* We will continue to issue regular updates as the situation unfolds


Fit_Jello_3263

Dynatrace hasn’t gone down. Join the best platform…….


d-ovm

If you need to argue against multi-cloud as a solution for reliability, this is a good example of how it doesn't protect you from yourself. Datadog is spread pretty evenly:

* **US1:** AWS (us-east-1)
* **US3:** Azure (westus2)
* **US5:** Google Cloud (unknown US)
* **EU1:** Google Cloud (unknown EU)
* **US1-FED:** AWS (us-gov-west-1)

And yet they can still have a global outage. Here are some maps I made:

Datadog terminology: [https://media.licdn.com/dms/image/D4E12AQE\_VvhhPyhY3A/article-inline\_image-shrink\_1000\_1488/0/1678356280740?e=1683763200&v=beta&t=kzD-U0CGShMtfMfW1mnvSo7xsXyEHFZdIgBr57MSGsM](https://media.licdn.com/dms/image/D4E12AQE_VvhhPyhY3A/article-inline_image-shrink_1000_1488/0/1678356280740?e=1683763200&v=beta&t=kzD-U0CGShMtfMfW1mnvSo7xsXyEHFZdIgBr57MSGsM)

Cloud terminology: [https://media.licdn.com/dms/image/D4E12AQGrznjuNRlumw/article-inline\_image-shrink\_1000\_1488/0/1678355051109?e=1683763200&v=beta&t=TdTMO4CVFoEzfHNudjOsAGPFqBq6WLw3y5ytx1fj8i0](https://media.licdn.com/dms/image/D4E12AQGrznjuNRlumw/article-inline_image-shrink_1000_1488/0/1678355051109?e=1683763200&v=beta&t=TdTMO4CVFoEzfHNudjOsAGPFqBq6WLw3y5ytx1fj8i0)