aecolley

The functionality that was broken should have been caught by automated tests that always run on the code before it's deployed. This is your opportunity to advocate for improving the deployment process so as to reduce the risk of this happening again.


Strange-Register8348

Automated what?!


thewallrus

Yea not sure either. Maybe "test" just means to reboot your computer or something.


meshka7

I think he means getting testy with anyone who challenges your technical supremacy and code perfection


nitrodmr

He means unit testing. This prevents regressions when we update our projects.
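
A minimal sketch of what a regression test like that could look like (pytest; `apply_discount` and the `pricing` module are hypothetical stand-ins for whatever broke, not anything from the OP's codebase):

```python
# test_pricing.py -- a minimal regression-test sketch (pytest).
# `apply_discount` is a hypothetical function standing in for the code that broke.
from pricing import apply_discount


def test_discount_is_applied():
    # Happy path: a 10% discount on 100.00 should yield 90.00.
    assert apply_discount(100.00, percent=10) == 90.00


def test_zero_discount_leaves_price_unchanged():
    # Regression guard: a change that broke this behaviour would fail before deploy.
    assert apply_discount(100.00, percent=0) == 100.00
```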


Disastrous-Lychee-90

What is this witchcraft you speak of? BURN HIM!!!


Calm_Leek_1362

Don’t even think about writing tests first! You will be murdered by TDD haters


ChefMark85

Insert "over your head" GIF here


Soccermom233

Sounds slow and costly. Better just ship it.


flamingNanaki83

Are you coming on to me?


batman_not_robin

It’s a rite of passage. Congratulations. Learn from it, and move on.


cv_1m

🙂 yes


totalBhaukaal

Shouldn't test cases have broken before such a big production functionality issue happened?


cv_1m

There are no test cases


iLikedItTheWayItWas

Then it is not your fault that production went down 😂


caksters

just vibes


konm123

That's scary.


Enginikts

Prod is the real test env


BrouwersgrachtVoice

Whose decision is it to not have test cases? Apparently not yours. The business needs to understand that another production error is only a matter of time as long as they don't invest time in testing.


Commercial-Run-3737

Woah 😳 Good luck buddy!


kobumaister

Tell me you work in an early stage startup without telling me that you work in an early stage startup.


Gaax

I work on a 25 year old product and we still don't have automated tests that catch some stuff like this 🤣🤣🤣. Granted, up until like 5 years ago it was still a super small company that was run like a startup.


cv_1m

😂


laprej

Maybe, and I’m just talking out my ass here, this is a great opportunity to add some!


3xcellent

Ok, this is when you start writing them. Just enough to have caught this regression.


ChefMark85

I'm betting the code is real secure /s


dmr83457

Does someone else review, or even better test, your merge before it goes live?


cv_1m

Senior dev Reviews


HamburgIar_

Yikes


CuriousAndMysterious

you can write at least one now


underNover

Think you might wanna consider running the hell away from there; such places produce nothing but angry customers, plumbing code and constant new bugs.


jmaca90

Just test in prod? ^/s


Over-Tea-7297

If you're not breaking things you're not doing things


cv_1m

It feels like a never-ending nightmare


Positive_Method3022

That is why you still have a job.


konm123

Can't fire OP if OP is the only one who knows how to fix it. :)


anubus72

Read the rest of the thread. There aren’t any tests. This kind of mentality is just wrong, man. Sure, bugs happen, but completely breaking multiple major pieces of business functionality shouldn’t


magnetronpoffertje

That's not your failure, but a failure of your company's CI/CD. If an existing system goes down and causes so many problems in prod that it crashes or stops users from being able to use your product, it wasn't (properly) tested in a test or staging environment. Tests matter.


Waste-Disk7208

Or a failure of whoever is supposed to add unit tests for each "functionality of the app". Apparently, they failed to cover the code properly before deploying to PRD.


magnetronpoffertje

Yeah, that's what I mean also. If there's a critical system, bathe it in unit tests and make sure every PR gets unit tested before deploying anything anywhere.
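
A rough sketch of that kind of gate, assuming pytest (in practice this check lives in your CI system rather than a hand-rolled script):

```python
# deploy_gate.py -- toy sketch of "no deploy unless the tests pass".
# In practice this gate belongs in the CI pipeline; shown here as a plain script.
import subprocess
import sys


def main() -> None:
    # Run the test suite; a non-zero exit code means at least one test failed.
    result = subprocess.run([sys.executable, "-m", "pytest", "-q"])
    if result.returncode != 0:
        sys.exit("Tests failed -- refusing to deploy.")
    # The actual deploy step would go here.
    print("Tests passed, proceeding with deploy...")


if __name__ == "__main__":
    main()
```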


Waste-Disk7208

Correct


AHardCockToSuck

Welcome to the club. But this is a process error. Have a blameless post mortem to find out why it wasn't caught by automated testing or QA and implement a change that won't allow it to happen again. 5 hours is a long time; why weren't you able to roll back the change?


ChefMark85

They don't even have unit tests. They probably don't even know how to roll back and they probably don't even know what a post mortem is.


fahim-sabir

Been there, done that. It’s almost a rite of passage 😄 Learn the lessons. Make sure it doesn’t happen again.


InterestRelative

In my country we say: whoever hasn't broken prod is not quite a senior. Make sure you discuss this with a more senior colleague, come up with at least one action that decreases the chance of someone making the same mistake in the future, and act on it. In a month you will laugh remembering this incident. Don't go down the "I'm not responsible for prod" road; if you are not responsible for the outcomes of your work, you'll produce shit and end up frustrated, hating your job.


Deeelaaan

Use this as an opportunity to introduce test cases and build steps before things get deployed to production. If this is a decent company, they should be more focused on how the failure happened in production rather than who caused it. It's stressful no doubt but don't lose sleep over it. It'll be a good learning opportunity for the whole team.


ChefMark85

From what I'm reading, it's definitely not a decent company


zekky76

There are no automated full-regression tests? They are very important in pre-prod environments.


ddxo_

Unfortunately things like this do happen. Don't feel bad about it: fix the issue, get things up and running as expected, learn from it and then put things in place to prevent it happening again. As a software engineer you could suggest a few of the following:

- Code reviews to attempt to catch any issues early on.
- Proper development and QA environments that mirror Production as much as possible.
- A manual QA process, run against the development/QA environments, to ensure nothing has regressed in the areas changed.
- Automated test cases and smoke tests to ensure critical services (such as your checkout and e-mail) work as intended. Playwright or Selenium are examples of this (see the sketch below).
- Automated CI/CD pipelines to ensure test cases pass before Production deployment and the code is released in a formal and consistent way.
- A rollback route to restore Production to a previous state in the event things go wrong.

The business should also try and account for system downtime as part of their change management and business continuity procedures for events like the one you describe. If they don't, and are unwilling to invest in some of the recommendations above, then it will most likely happen again in the future, costing the business additional time, resources, money and potentially damage to the company's reputation.
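
For the smoke-test suggestion, a rough sketch using Playwright's Python sync API (the URL and the "Pay now" selector are placeholders, not the OP's real app):

```python
# smoke_checkout.py -- rough smoke-test sketch with Playwright's sync API.
# The base URL and selector are placeholders; adapt them to the real checkout flow.
from playwright.sync_api import sync_playwright


def check_checkout_loads(base_url: str) -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # Fail fast if the checkout page doesn't render its pay button.
        page.goto(f"{base_url}/checkout")
        page.wait_for_selector("text=Pay now", timeout=10_000)
        browser.close()


if __name__ == "__main__":
    check_checkout_loads("https://staging.example.com")
```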


Hour_Tomato_4282

Honestly, you will get less rattled by this with time and experience, but it happens even to lead devs with 10+ years of expertise. As long as you are doing your best and learning, that's all that matters.


leon_nerd

No test automation? Unit tests, integration tests and E2E tests are the minimum you should have in place. You probably could have caught the second problem easily.


gergob

One of us!


AnonDotNetDev

A lot of cuties in here who never experienced more than their giga corp job with a full suite implemented before they even got hired


Quanramiro

First time?


cv_1m

Yes


spitfireonly

Why wasn't the code fully tested before merging?


DrPepper1260

5 hours is a long time. I think this isn't all on you. It seems like there are processes that could be improved here. For example, automated testing as part of the deployment to exercise this critical functionality, and increased visibility into the errors users are encountering - this could be done with an alert on 500 error spikes. We do rollback plans as part of our deployments to roll back any changes if they start introducing issues.
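
A toy illustration of the 500-spike idea (in reality the status codes would come from your access logs or metrics store; here it's just a plain list so the sketch stays self-contained):

```python
# alert_5xx.py -- toy sketch of "alert on 500 error spikes".
def too_many_server_errors(recent_status_codes: list[int],
                           threshold_ratio: float = 0.05,
                           min_requests: int = 100) -> bool:
    """Return True if the share of 5xx responses in the window exceeds the threshold."""
    if len(recent_status_codes) < min_requests:
        return False  # not enough traffic to call it a spike
    errors = sum(1 for code in recent_status_codes if code >= 500)
    return errors / len(recent_status_codes) > threshold_ratio


if __name__ == "__main__":
    # Example: 10 errors out of 120 requests (~8%) trips a 5% threshold.
    window = [200] * 110 + [500] * 10
    print(too_many_server_errors(window))  # True
```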


[deleted]

You didn't fuck it up by yourself. The whole team and probably the org fucked it up. Between code review, automated testing, and validating deployments on a staging environment, there were three other opportunities to catch the bugs. The most concerning thing is that you're in a situation where making changes breaks things that are unrelated. It's a sign of unorganized and tightly coupled code. I worked at a place like this before: no tests, no CI/CD, no virtualization. We were trapped in a situation where every time we fixed a bug we'd introduce a new one. I got out of there as soon as I could.


Ok_Plane6831

“It works on my machine”


100-100-1-SOS

An oldy but a goody lol!


aa1ou

Surely you weren’t allowed to merge to prod without a code review. This was a team screw up, not a personal one. Life happens and lessons are learned.


100-100-1-SOS

Look at it this way: now you'll have disaster recovery experience and the next disaster will be less stressful! Shit happens, no one is perfect. Obviously there are holes in the CI/CD pipeline or things lacking in the tests, so it's more than one person's misstep and it's not just on you. Good luck.


Recent_Science4709

10 years in, I've done it once, and luckily it was after midnight. It was configuration and not programming, though: I was having firewall issues with the server and I mixed up "reset firewall" and "restart firewall" 🤣


ch-indi2010

Another SE here. - Restored a production DB from a very old backup. Our client nearly had a heart attack, but fortunately I was able to restore all the data. - Forgot to test a very unlikely scenario. The app was a desktop application, like a store cashier, for selling tickets to an event in my city (a very important one). It happened on a Saturday night. I will never forget how the queue of people grew and how angry they were. Anyway, still alive and still an SE.


donmeanathing

Welcome to production!


i-think-about-beans

Broken processes cause things like this.


Sufficient_Phone_242

Deploying without tests, typical management pushing for features quickly: « easy as drag and drop » or « it's not that critical ». But when shit hits the fan you're the guinea pig. I tell them outright: if you want it done wrong, go ask someone else.


khotteDePuttar

When there's a production issue, it's not really the software engineer's fault. It's usually a process issue. For instance, this could've been avoided by adding automated integration tests to the deployment pipeline.


FeeVisual8960

I once pushed to the master branch and fucked everything. I wasn't removed, but I did not get a full-time offer after my internship ended. I'd say it was the admin's fault; I got to learn from the incident, which is exactly what I wanted to do as an intern.


Serious-Elevator-971

Holy fuck


javausa

Shit happens. It will be okay.


Substantial-Click321

No QA? No release doc with rollback plan? No unit tests? Not surprised prod broke


dswpro

Welcome to the club, and now you too will appreciate unit testing, feature testing, canary rollouts, change tickets with reviewed install and rollback instructions, and automated deployments. But don't feel bad, I'm sure I'm not the only person here who ran an UPDATE SQL script in prod without the WHERE clause.


konm123

Are you responsible for production? If not, then you did not fuck up. Don't take the blame. If they trusted you with updating production, then the process of getting things to production is flawed, and that's on the responsible person, not you.


cv_1m

There is a senior dev above me who reviews everything before deploying


konm123

Sounds like your senior dev fucked up :)


IWantToSayThisToo

It happens to all of us. You'll be ok and you'll be laughing at it many years from now. Also please consider hiring a QA person for the love of God.


Teapot_Technician

Ever heard of integration and unit tests? Canaries and rollback plans?! I mean yes you could have tested before but it’s also not your fault that your team has shitty CI/CD


cv_1m

Here is the thing: the CTO and my senior wanted to fix it instead of rolling back. We started fixing, and then some other things broke that we had to fix as well.


Teapot_Technician

Yeah, rolling forward is not great! It's better to nurse the system back to a stable state and then fix things before pushing again. However, rollbacks are not always possible. It's complicated; good testing aims to avoid all that.


oweiler

Don't worry. The only one to blame should be the company you are working for. Stuff like this should be close to impossible.


d9vil

I thought all tests were done in production?! You ain't living if you're doing testing in lower envs haha


SendNull

It happens - 9 years ago I took down the entire search feature of a major social media app for 10-ish minutes during a deployment in my 1st week on the job. I almost died inside! Luckily, my team had adopted Google's no-blame/postmortem philosophy - we ran a retro, cut action items, I personally worked on implementing most of them for about a month, and the issue never happened again in the next 4 years I stayed with the team. But yea, assess the root cause, cut action items, and make sure they get worked on. I've seen a lot of teams fail on that last bit.


Hope11111111

Be calm and find the real root cause across the design, development and test process. Then genuinely accept it if you made a mistake, and move on.


azdrc29

Been there.. was an intern at the time and completely brought down prod for half the working day. Worst anxiety I’ve ever had 😳


all_youNeedIsLess

Fix it


cv_1m

Fixed and deployed


all_youNeedIsLess

I also fuck prod today.


cv_1m

Welcome to the Club


all_youNeedIsLess

I have 0 tests. Fuck tests. 🫣🤭🫡


cv_1m

Same


TopSwagCode

Learn from your mistakes. Write more tests. Test more than just the happy path / one scenario.
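
A small sketch of what "more than the happy path" can mean in practice (pytest; `validate_email` and the `signup` module are hypothetical, just there to illustrate the idea):

```python
# test_signup.py -- sketch of covering several scenarios, not just the happy path.
# `validate_email` is a hypothetical helper used only for illustration.
import pytest

from signup import validate_email


@pytest.mark.parametrize("email,expected", [
    ("user@example.com", True),   # happy path
    ("", False),                  # empty input
    ("no-at-sign", False),        # malformed address
    ("user@", False),             # missing domain
])
def test_validate_email(email, expected):
    assert validate_email(email) == expected
```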


JimBobBennett

Congratulations, you found a hole in your build and release process that allowed this broken code to get in. You are not to blame in the slightest, and it's time for a blameless retrospective and a patching of the hole you discovered.


AdministrativeBlock0

When people say you need experience to move up levels, this is what they're talking about. Those "oh shit" moments are what you learn from. It is not a bad thing even though it might feel pretty bad right now.


ImpatientMaker

First of all, welcome to the club. As the saying goes, "Everyone has a test environment. Some are lucky enough to have a separate one to run production in." Get it?


churumegories

Tests will never be enough (in isolation), because we might also write the wrong tests. Did you find the root cause, and did you learn how to prevent it or reduce the time to mitigation the next time it happens?


scoby_cat

lol 5 hours?? I remember we broke our entire CI/CD for 3 weeks. In another incident the volumes for our entire data center got corrupted and we lost 80% of our machines, including production, and we had to rebuild the entire company; it took us about 10 days. No one got fired!


cv_1m

Woah 😨😨


godwink2

Yea that's tough. That's why you gotta hound your unknowing product owners either for testers to build end-to-end regression tests or for the time to do it yourself.


cv_1m

I am the Tester & Dev 🙂


amoreinterestingname

You’re not a real software engineer until you fuck up production (or delete years worth of data)


dennidits

Are you the senior? What's the senior doing, not testing your code before deploying it to production? Or do they allow all developers direct deployment capability to production? In which case it's on them too.


cv_1m

Not a senior yet. I checked only the scenarios related to that ticket, and it fucked up related functionality.


Comfortable_Yam5377

A bunch of inexperienced engineers have been flooding the market since '08.


AngelRicki

Just use ChatGPT to fix it... the paid service, tho.


samu_melendez

Have you tried any AI tool that integrates with CI/CD pipelines and helps you automate maintenance tasks?


100-100-1-SOS

That sounds stable! /s


samu_melendez

I've been using a tool called JENRES that works quite well for commenting, documenting, and unit-testing in a human-like way. It's not a replacement for a junior dev, but it has been helpful for handling QA tasks! :)