Recently at Kooth we ran a game day. We assembled all the engineers, and some of the product team, to practice what we would do in case of a production incident and learn how we and our systems might respond.
There are a few reasons we wanted to do this:
- Not everyone has experience dealing with incidents and we want to ensure that all our engineers feel confident responding to incidents themselves.
- Everyone can improve how they handle incidents, especially understanding how best to work collaboratively to solve them.
- It allowed us to explore our systems and better understand how they function, as well as how they can fail, in a safe environment.
- We wanted to see if we could come up with user stories or processes to mitigate failure that we hadn’t considered.
- We haven’t yet had to deal with a major security incident—thankfully!—but we wanted to be more prepared should that happen.
- It gave us an opportunity to work with people we wouldn’t normally work with and form better bonds.
How did we plan?
Three of us volunteered to plan and run the game day, and we started by figuring out what we wanted to achieve from the day. Since very few of us had taken part in a game day before, we decided to start with a set of scenarios we could control, rather than doing something like a red team/blue team exercise.
We came up with a list of scenarios we could run that would help achieve one or more of our goals. We wanted to ensure that people could practice incident response collaboratively and in as real a way as possible. Examples included:
- A data breach is discovered.
- A third-party provider we heavily rely on is down.
- One of our databases is unavailable.
We also thought about who would be participating and split people up into teams. We tried to spread out people with more experience as well as splitting up people who normally work together.
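We split people up by hand, but the idea can be sketched in a few lines of Python. This is only an illustration, not the process we actually used: the `Person` fields, the example names and the `split_into_teams` helper are all hypothetical. It places experienced people first so they end up spread across teams, and when several teams are equally small it prefers the one with the fewest of that person's usual teammates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    name: str
    usual_team: str    # hypothetical field: the team they normally work in
    experienced: bool  # hypothetical field: has handled incidents before


def split_into_teams(people, team_count):
    """Greedy heuristic: balance team sizes, spread experience, and
    avoid putting usual teammates together where that is possible."""
    # Experienced people sort first, so they are placed while teams are
    # still empty and therefore land on different teams.
    ordered = sorted(people, key=lambda p: (not p.experienced, p.usual_team, p.name))
    teams = [[] for _ in range(team_count)]
    for person in ordered:
        def cost(team):
            same_usual_team = sum(1 for m in team if m.usual_team == person.usual_team)
            # Team size comes first to keep teams balanced; usual
            # teammates only break ties between equally small teams.
            return (len(team), same_usual_team)
        min(teams, key=cost).append(person)
    return teams


if __name__ == "__main__":
    # Made-up example data, purely for illustration.
    people = [
        Person("Asha", "platform", True),
        Person("Ben", "platform", False),
        Person("Cara", "web", True),
        Person("Dev", "web", False),
        Person("Eli", "data", True),
        Person("Fay", "data", False),
    ]
    for number, team in enumerate(split_into_teams(people, 3), start=1):
        print(number, [p.name for p in team])
```

Because team size always takes priority, this sketch will sometimes still pair up people who normally work together, so the result is a starting point to adjust by hand rather than a final answer.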
One question we discussed was how we would run the scenarios: what environment should we use? Should we resize it to production scale? Should we try to simulate production levels of traffic? Should each team do the same scenario or each take a different one?
In the end, we decided to keep it as simple as possible and just use one of our non-production environments without any modifications. We figured that if it was successful, we could do something more complex at a future game day. We also decided to allocate each team a different scenario, partly because some of the scenarios might result in teams interfering with each other and partly because we felt we would learn more that way.
Finally, we discussed how to structure the day. We wanted to make sure we left time for teams to showcase what they’d learned, as well as hold a retrospective at the end of the day on how the day went.
What did we learn?
Overall the day was a great success! People had a lot of fun exploring the scenarios and we gained plenty of useful insights.
The biggest thing that came out of the retro was that the day had highlighted people’s knowledge gaps, which will allow us to focus training and future exercises.
We also learned where our processes, documentation and systems are lacking and now have ideas for how to improve them.
Finally, people really liked spending time working with colleagues from outside their teams. This sparked a lengthy discussion about team rotations.