I’ve broken prod a bunch of times. Over time, I’ve learned how best to avoid this and what systems need to be in place, but breaking prod is simply a symptom of moving fast. As it was put by one of the best engineers I’ve ever worked with:
I’m keeping track of this mainly because it makes for good stories, but also to show that it happens to other people, for anyone who gets as stressed about things like this as I used to.
- Replit, first week - not even sure what I pushed, but I hit ~300 people with a website equivalent of the “blue screen of death”
- Replit - internal hackathon. I thought it would be fun to directly edit some prod code hosted on a Repl to add a surprise feature from the thing I was working on, making it available to everyone. This was a bad idea, and I woke up to a strongly worded message from the same person as above
- Synthesis - we had error logs pushed to Sentry even in dev environments. I was testing some code that logged an error every 10 ms while the app was running, couldn’t figure out the issue, then went to sleep. I woke up to a very very very high Sentry bill, and a message from the CTO to stop the shenanigans. I did this once more afterwards, oops.
- Synthesis - was messing with the auto-generated TTS system to allow speed changes, but forgot we had a caching system. People’s AI tutors were stuck on perma 2x speed. I am interested in how that affects learning, though
- Wispr - put out a release, but the release DMG was a 30 KB shortcut instead of the actual app
- Wispr - working on CD tooling and wrote the wrong bucket name → accidentally killed the app so badly for some people that they could do nothing but reinstall
This actually looks like a lot, but writing it down has made me realize that, in hindsight, these were quite small compared to all of the things I did that went well.