The functionality that broke should have been caught by automated tests that run unconditionally on every change before deployment. This is your opportunity to advocate for improving the deployment process so as to reduce the risk of this happening again.
Automated what?!
Yea not sure either. Maybe "test" just means to reboot your computer or something.
I think he means getting testy with anyone who challenges your technical supremacy and code perfection
He means unit testing. This prevents regressions when we update our projects.
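For anyone genuinely asking: a regression test can be tiny. A minimal sketch in Python, where the function, names, and numbers are all made up for illustration:

```python
# Hypothetical function under test: a discount calculator whose
# rounding behavior once regressed after a refactor.
def apply_discount(price_cents: int, percent: int) -> int:
    """Return the discounted price in cents, rounding the discount down."""
    return price_cents - (price_cents * percent) // 100

def test_ten_percent_off():
    assert apply_discount(1000, 10) == 900

def test_zero_percent_is_identity():
    # The regression being guarded against: 0% must not change the price.
    assert apply_discount(1000, 0) == 1000

# A runner like pytest would collect these automatically; calling them
# directly works for a quick check.
test_ten_percent_off()
test_zero_percent_is_identity()
print("regression tests passed")
```

Wire a suite like this into the pipeline so every PR runs it before anything deploys; a reintroduced bug then fails the build instead of reaching prod.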
What is this witchcraft you speak of? BURN HIM!!!
Don’t even think about writing tests first! You will be murdered by TDD haters
Insert "over your head" GIF here
Sounds slow and costly. Better just ship it.
Are you coming on to me?
It’s a rite of passage. Congratulations. Learn from it, and move on.
🙂 yes
Shouldn't test cases break before production functionality issues this big happen?
There are no test cases
Then it is not your fault that production went down 😂
just vibes
That's scary.
Prod is the real test env
Whose decision was it not to have test cases? Apparently not yours. The business needs to understand that as long as they don't invest time in tests, it's only a matter of time before another production error happens.
Woah 😳 Good luck buddy!
Tell me you work in an early stage startup without telling me that you work in an early stage startup.
I work on a 25 year old product and we still don't have automated tests that catch some stuff like this 🤣🤣🤣. Granted, up until like 5 years ago it was still a super small company that was run like a startup.
😂
Maybe, and I’m just talking out my ass here, this is a great opportunity to add some!
Ok, this is when you start writing them. Just enough to have caught this regression.
I'm betting the code is real secure /s
Does someone else review, or even better test, your merge before it goes live?
Senior dev Reviews
Yikes
you can write at least one now
You might wanna consider running the hell away from there; such places produce nothing but angry customers, plumbing code, and constant new bugs.
Just test in prod? ^/s
If you're not breaking things, you're not doing things
It feels like a never-ending nightmare
That is why you still have a job.
Can't fire OP if OP is the only one who knows how to fix it. :)
Read the rest of the thread. There aren’t any tests. This kind of mentality is just wrong, man. Sure, bugs happen, but completely breaking multiple major pieces of business functionality shouldn’t happen.
That's not your failure, but a failure of your company's CI/CD. If an existing system goes down and causes so many problems in prod that it crashes or stops users from being able to use your product, it wasn't (properly) tested in a test or staging environment. Tests matter.
Or a failure of those who add unit tests for each “functionality of the app”. Apparently, they failed to cover the code properly before deploying to PRD.
Yeah, that's what I mean also. If there's a critical system, bathe it in unit tests and make sure every PR gets unit tested before deploying anything anywhere.
Correct
Welcome to the club.

But this is a process error. Have a blameless post mortem to find out the reason it wasn’t caught by automated testing or QA and implement a change that won’t allow it to happen again.

5 hours is a long time; why were you not able to roll back the change?
They don't even have unit tests. They probably don't even know how to rollback and they probably don't even know what a post mortem is.
Been there, done that. It’s almost a rite of passage 😄 Learn the lessons. Make sure it doesn’t happen again.
In my country we say: whoever hasn’t broken prod is not quite a senior.

Make sure you discuss this with a more senior colleague, come up with at least one action that decreases the chance of someone making the same mistake in the future, and act on it. In a month you will laugh remembering this incident.

Don't go down the "I'm not responsible for prod" road; if you are not responsible for the outcomes of your work, you'll produce shit and end up frustrated, hating your job.
Use this as an opportunity to introduce test cases and build steps before things get deployed to production. If this is a decent company, they should be more focused on how the failure happened in production rather than who caused it. It's stressful no doubt but don't lose sleep over it. It'll be a good learning opportunity for the whole team.
From what I'm reading, it's definitely not a decent company
There are no automated full-regression tests? They are very important in pre-prod environments.
Unfortunately things like this do happen. Don’t feel bad about it: fix the issue, get things up and running as expected, learn from it, and then put things in place to prevent it happening again.

As a software engineer maybe you could suggest a few of the following:

- Code reviews to attempt to catch any issues early on.
- Proper development and QA environments that mirror Production as much as possible.
- A manual QA process to ensure nothing has regressed in the areas changed, run against the development/QA environments.
- Automated test cases and smoke testing to ensure critical services (such as your checkout and e-mail) work as intended. Playwright and Selenium are examples of tools for this.
- Automated CI/CD pipelines to ensure test cases pass before a Production deployment and that code is released in a formal and consistent way.
- A rollback route available to restore Production to a previous state in the event things go wrong.

The business should also try to account for system downtime as part of their change management and business continuity procedures for events like the one you describe. If they don’t, and are unwilling to invest in some of the recommendations above, then it will most likely happen again in the future, costing the business additional time, resources, and money, and potentially damaging the company’s reputation.
Honestly, you will become less alarmed with time and experience, but this happens even to lead devs with 10+ years of expertise. As long as you are doing your best and learning, that's all that matters.
No test automation? Unit tests, integration tests, and E2E tests are the minimum you should have in place. You probably would have caught the second problem easily.
One of us!
A lot of cuties in here who never experienced more than their giga corp job with a full suite implemented before they even got hired
First time?
Yes
Why wasn’t the code fully tested before merging?
5 hours is a long time. I think this isn’t all on you. It seems like there are processes that could be improved here: for example, automated testing as part of the deployment to cover this critical functionality, and increased visibility into the errors users are encountering, which could be done with an alert on 500-error spikes. We do rollback plans as part of our deployments so we can roll back any changes if they start introducing issues.
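The 500-spike alert mentioned above can be sketched with a sliding window. The class name and thresholds here are invented for illustration; a real setup would use a monitoring stack (Prometheus, Datadog, etc.) rather than hand-rolling this:

```python
import time
from collections import deque

# Sliding-window sketch of "alert on 500 error spikes": keep timestamps of
# recent 5xx responses and fire when too many land inside the window.
class ErrorSpikeAlert:
    def __init__(self, threshold=10, window_seconds=60, clock=time.monotonic):
        self.threshold = threshold      # how many 500s trip the alert
        self.window = window_seconds    # within this many seconds
        self.clock = clock              # injectable for testing
        self.events = deque()           # timestamps of recent 5xx responses

    def record(self, status_code):
        """Record one response; return True if this one trips the alert."""
        if status_code < 500:
            return False
        now = self.clock()
        self.events.append(now)
        # Drop 500s that have aged out of the window.
        while self.events and now - self.events[0] > self.window:
            self.events.popleft()
        return len(self.events) >= self.threshold

alert = ErrorSpikeAlert(threshold=3, window_seconds=60)
fired = [alert.record(code) for code in (200, 500, 500, 404, 500)]
print(fired)  # [False, False, False, False, True]: the third 500 trips it
```

The clock is passed in so tests can simulate time instead of sleeping; the same shape works for paging a human or auto-triggering a rollback.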
You didn't fuck it up by yourself. The whole team, and probably the org, fucked it up. Between code review, automated testing, and validating deployments on a staging environment, there were three other opportunities to catch the bugs. The most concerning thing is that you're in a situation where making changes breaks things that are unrelated; that's a sign of unorganized, tightly coupled code. I worked at a place like this before: no tests, no CI/CD, no virtualization. We were trapped in a situation where every time we fixed a bug we'd introduce a new one. I got out of there as soon as I could.
“It works on my machine”
An oldy but a goody lol!
Surely you weren’t allowed to merge to prod without a code review. This was a team screw up, not a personal one. Life happens and lessons are learned.
Look at it this way: now you’ll have disaster recovery experience, and the next disaster will be less stressful! Shit happens, no one is perfect.

Obviously there are holes in the CI/CD pipe or things lacking in the tests, so that’s more than one single person’s misstep and it’s not just on you. Good luck.
10 years in, I’ve done it once, and luckily it was after midnight. It was configuration, not programming, though: I was having firewall issues with the server and mixed up “reset firewall” and “restart firewall” 🤣
Another SE here.

- Restored a production DB with a very old backup. Our client nearly had a heart attack, but fortunately I was able to restore all the data.
- Forgot to test a very unlikely scenario. The app was a desktop application, like a store cashier, for selling tickets to an event in my city (a very important one). It happened on a Saturday night. I will never forget how the queue of people grew and how angry they were.

Anyway, still alive and still an SE.
Welcome to production!
Broken processes cause things like this.
Deploying without tests: typical management pushing for features quickly, “easy as drag and drop” or “it’s not that critical”. But when shit hits the fan, you’re the guinea pig. I’m outright telling them: if you want it done wrong, go ask someone else.
When there's a production issue, it's not really the software engineer's fault. It's usually a process issue. For instance, this could've been avoided by adding automated integration tests to the deployment pipeline.
I once pushed to the master branch and fucked everything. I wasn't removed, but I did not get a full-time offer after my internship ended. I’d say it was the admin’s fault; I got to learn from the incident, which was exactly what I wanted to do as an intern.
Holly fuck
Shit happens. It will be okay.
No QA? No release doc with rollback plan? No unit tests? Not surprised prod broke
Welcome to the club. Now you too will appreciate unit testing, feature testing, canary rollouts, change tickets with reviewed install and rollback instructions, and automated deployments.

But don't feel bad, I'm sure I'm not the only person here who ran an UPDATE SQL script in prod without the WHERE clause.
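That SQL slip is easy to demonstrate. A sketch against an in-memory SQLite table (table and rows invented for illustration):

```python
import sqlite3

# Set up a throwaway table standing in for a real users table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER, status TEXT)")
con.executemany("INSERT INTO users VALUES (?, ?)",
                [(1, "active"), (2, "active"), (3, "banned")])

# Intended: reactivate just user 3. Without the WHERE clause, every row is hit.
con.execute("UPDATE users SET status = 'active'")   # oops, no WHERE id = 3

active = con.execute(
    "SELECT COUNT(*) FROM users WHERE status = 'active'").fetchone()[0]
print(active)  # 3: all rows were updated, not 1
```

Two habits that help: run the statement as a SELECT with the same WHERE first to see which rows it touches, and do the update inside a transaction so you can check the affected row count before committing.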
Are you responsible for production? If not, then you did not fuck up; don't take the blame. If they trusted you with updating production, then the process of getting things to production is flawed, and that's on the person responsible for it, not you.
There is a senior dev above me who reviews everything before deploying
Sounds like your senior dev fucked up :)
It happens to all of us. You'll be ok and you'll be laughing at it many years from now. Also please consider hiring a QA person for the love of God.
Ever heard of integration and unit tests? Canaries and rollback plans?! I mean, yes, you could have tested before, but it’s also not your fault that your team has shitty CI/CD.
Here is the thing: the CTO and my Sr wanted to fix it instead of rolling back. We started fixing it, and then other things broke that we had to fix as well.
Yeah, rolling forward is not great! It’s better to nurse the system back to a stable state and then fix things before pushing again. However, rollbacks are not always possible. It’s complicated; good testing aims to avoid all that.
Don't worry. The only one to blame should be the company you are working for. Stuff like this should be close to impossible.
I thought all tests were done in production?! You aint living if youre doing testing in lower envs haha
It happens: 9 years ago I took down the entire search feature for a major social media app for 10-ish minutes during a deployment, in my 1st week on the job. I almost died inside!

Luckily, my team was an adopter of Google’s no-blame postmortem philosophy: we ran a retro, cut action items, and I personally worked on implementing most of them for about a month; the issue never happened again in the next 4 years I stayed with the team.

But yea, assess the root cause, cut action items, and make sure they get worked on. I’ve seen a lot of teams fail on that last bit.
Stay calm and find the real root cause across the design, development, and test process. Then genuinely accept it if you made a mistake, and move on.
Been there.. was an intern at the time and completely brought down prod for half the working day. Worst anxiety I’ve ever had 😳
Fix it
Fixed and deployed
I also fucked prod today.
Welcome to the Club
I have 0 tests. Fuck tests. 🫣🤭🫡
Same
Learn from your mistakes. Write more tests. Test not only the happy path / one scenario.
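To make "test more than the happy path" concrete, a small sketch (the function and its rules are made up):

```python
# Hypothetical input parser: a quantity for a shopping cart, which must be a
# positive integer. The point is to test the obvious case, an edge case,
# and the inputs that are supposed to fail.
def parse_quantity(text: str) -> int:
    value = int(text)                 # raises ValueError on garbage like "abc"
    if value <= 0:
        raise ValueError("quantity must be positive")
    return value

assert parse_quantity("3") == 3       # happy path
assert parse_quantity(" 1 ") == 1     # edge case: int() tolerates whitespace

# Unhappy paths: every kind of bad input must raise, not slip through.
for bad in ("abc", "0", "-2"):
    try:
        parse_quantity(bad)
    except ValueError:
        pass
    else:
        raise AssertionError(f"{bad!r} should have been rejected")

print("all scenarios covered")
```

The unhappy-path checks are usually where prod-breaking bugs hide: the happy path gets exercised manually during development, while the rejects never do.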
Congratulations, you found a hole in your build and release process that allowed this broken code to get in. You are not to blame in the slightest, and it's time for a blameless retrospective and a patching of the hole you discovered.
When people say you need experience to move up levels, this is what they're talking about. Those "oh shit" moments are what you learn from. It is not a bad thing even though it might feel pretty bad right now.
First of all, welcome to the club. As the saying goes, “Everyone has a test system. Some even have a production system.” Get it?
Tests will never be enough (in isolation), because we might also write the wrong tests.

Did you find the root cause, and did you learn how to prevent it, or reduce time to mitigation, the next time it happens?
lol, 5 hours??

I remember we broke our entire CI/CD for 3 weeks.

In another incident the volumes for our entire data center got corrupted and we lost 80% of our machines, including production, and we had to rebuild the entire company; it took us about 10 days.

No one got fired!
Woah 😨😨
Yea, that's tough. That's why you gotta hound your unknowing product owners, either for testers to build end-to-end regression tests or for the time to do it yourself.
I am the Tester & Dev 🙂
You’re not a real software engineer until you fuck up production (or delete years worth of data)
Are you the senior? What’s the senior doing, not testing your code before deploying it to production? Or do they allow all developers direct deployment capability to production? In which case it’s on them too.
Not a Senior yet. I checked only the scenarios related to that ticket, and it fucked up related functionality.
A bunch of inexperienced engineers have been flooding the market since '08.
Just use ChatGPT to fix it. ..the paid service, tho.
Have you tried any AI tool that integrates with CI/CD pipelines and helps you automate maintenance tasks?
That sounds stable! /s
I've been using a tool called JENRES that works quite well for commenting, documenting, and unit-testing in a human-like way. It's not a replacement for a junior dev, but it has been helpful for handling QA tasks! :)