In the last few months of 2018 we encountered growth issues and had some incidents with large customers reporting slow system response.
We responded swiftly and decisively:
- Put aside all new feature development and concentrated on reliability.
- Changed our development processes to prevent recurrence of the situation.
In this post we report on the results of our response. Here’s what we did in November and December of 2018:
The big reliability drive
To get an understanding of what steps we needed to take, we investigated where the malfunctions were occurring most often. We collected all the recent errors and grouped them:
We categorised issues into three groups and set to work on them:
- Late discovery of errors.
- Slow architecture.
- User processes interfering with one another.
The first category is issues that could have been dealt with easily if we’d found and fixed them when they first arose. For swift resolution of such issues in future we improved our monitoring for better visibility into system health:
- Detailed metrics for trigger/operation/segment health.
- Tracking which features and queries create the most server load.
- In addition to technical solutions, we assigned a dedicated reliability engineer.
As firm believers in the importance of transparency, we also introduced more detailed incident reporting on our status page (status.mindbox.ru):
We now have a clearer picture of overall system health and we can respond faster to malfunctions, but we’re not stopping there. We plan to increase system reliability even further with more monitoring improvements and a gradual transition to 24/7 supervision.
The second category of issues arose from individual parts of the system struggling under high load. Performance bottlenecks of this nature are an inevitable part of any complex system. We tackled them in decreasing order of severity.
Prevented triggers from interfering with one another and with user procedures
Let’s say that on completion of an order you want to send the customer an SMS, award points, issue a coupon code and then start an email feed. In this case triggers can start to get in each other’s way. For example, one of them wants to issue a coupon code, but it has to wait for another trigger to finish awarding loyalty points. The end result is that triggers can take considerable time to complete their operations and user procedures get slowed down.
We added a sprinkle of engineering magic dust (in the form of optimistic parallel computing) to make even large numbers of triggers work together simultaneously without hiccups. The result speaks for itself:
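The idea behind the optimistic approach can be sketched as follows: each trigger reads a snapshot of the customer record together with a version number, does its work without holding any lock, and commits only if nobody else committed in the meantime, retrying otherwise. This is a minimal illustration of the technique, not our actual implementation; `VersionedRecord` and `run_trigger` are illustrative names.

```python
import threading

class VersionedRecord:
    """A customer record with a version counter for optimistic concurrency."""
    def __init__(self, data):
        self.data = dict(data)
        self.version = 0
        self._lock = threading.Lock()  # guards only the commit, not the work

    def read(self):
        """Return a consistent (version, snapshot) pair."""
        with self._lock:
            return self.version, dict(self.data)

    def try_commit(self, expected_version, updates):
        """Apply updates only if nobody committed since we read."""
        with self._lock:
            if self.version != expected_version:
                return False  # lost the race; caller retries
            self.data.update(updates)
            self.version += 1
            return True

def run_trigger(record, compute_updates, max_retries=10):
    """Optimistically run a trigger: read, compute lock-free, commit, retry on conflict."""
    for _ in range(max_retries):
        version, snapshot = record.read()
        updates = compute_updates(snapshot)  # the slow part runs without blocking anyone
        if record.try_commit(version, updates):
            return True
    return False

# Two triggers touching the same customer no longer serialise their slow work:
customer = VersionedRecord({"points": 100, "coupons": 0})
run_trigger(customer, lambda s: {"points": s["points"] + 50})   # award loyalty points
run_trigger(customer, lambda s: {"coupons": s["coupons"] + 1})  # issue a coupon code
```

The point of the pattern is that the expensive computation happens outside any lock, so the coupon trigger never waits for the loyalty-points trigger; only the brief commit step is serialised.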
Made key reports faster
It could take hours to generate reports in some large projects, and in exceptional cases they were failing to generate altogether. We worked on resolving this for the four most popular reports:
- Messaging summary report
- Subscription sources
- RFM distribution
- CRM effectiveness
The messaging summary, RFM distribution and subscription sources reports now take a fraction of the time to generate, and they are now fully functional in the projects where they were previously failing to generate.

The CRM effectiveness report was completely rewritten using a more complex and accurate mathematical algorithm (it now uses bootstrapping instead of normal distribution) with minimal effect on performance.
To speed up report generation we implemented our machine-learning operationalisation microservice and then rewrote the reports using data analysis libraries.
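The switch from a normal approximation to bootstrapping can be sketched with a percentile-bootstrap confidence interval for the difference in mean revenue between a control group and a mailing group. This is a minimal illustration under assumed data, not our production algorithm; `bootstrap_ci` and the sample figures are hypothetical.

```python
import random

def bootstrap_ci(control, treatment, n_resamples=2000, alpha=0.05, seed=42):
    """Percentile-bootstrap confidence interval for the difference in means.

    Unlike a normal approximation, this makes no assumption about the shape
    of the underlying distributions (per-customer revenue is heavily skewed:
    most customers spend nothing, a few spend a lot)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        c = [rng.choice(control) for _ in control]      # resample with replacement
        t = [rng.choice(treatment) for _ in treatment]
        diffs.append(sum(t) / len(t) - sum(c) / len(c))
    diffs.sort()
    lo = diffs[int(n_resamples * alpha / 2)]
    hi = diffs[int(n_resamples * (1 - alpha / 2)) - 1]
    return lo, hi

# Illustrative revenue-per-customer samples, control vs mailing group:
control = [0, 0, 0, 120, 0, 300, 0, 0, 80, 0]
treatment = [0, 150, 0, 200, 90, 0, 310, 0, 60, 0]
low, high = bootstrap_ci(control, treatment)
```

If the resulting interval excludes zero, the mailing's effect is unlikely to be noise; with skewed revenue data this conclusion is far more trustworthy than one drawn from a normal approximation.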
Faster recalculation of complex segment groups
Complex segment groups are those that use complex filters, for example:
The more conditions are in the filter, the longer it will take to calculate the segment group. To speed things up, we figured out how to break the filter into parts that can be calculated in parallel. Here’s the result:
After optimisation, the majority of segment groups took less than 10 minutes to recalculate. With this in mind, we reduced the calculation timeout from four hours down to two. Plans for future improvements include interface updates to give the user more transparency into what’s going on in segment groups.
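Conceptually, the parallelisation works like this: a filter that is a conjunction of conditions is split so that each condition is evaluated independently, and the per-condition results are then intersected. The sketch below is a simplified in-memory illustration (a real implementation would push each condition down to the database); `evaluate_segment` and the sample customers are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_segment(customers, conditions, max_workers=4):
    """Evaluate a segment whose filter is an AND of independent conditions.

    Each condition is checked over the whole customer list in its own worker;
    the per-condition ID sets are then intersected to get the segment."""
    def matching_ids(cond):
        return {c["id"] for c in customers if cond(c)}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        id_sets = list(pool.map(matching_ids, conditions))
    return set.intersection(*id_sets) if id_sets else set()

customers = [
    {"id": 1, "orders": 5, "city": "Moscow", "subscribed": True},
    {"id": 2, "orders": 0, "city": "Moscow", "subscribed": True},
    {"id": 3, "orders": 7, "city": "Kazan",  "subscribed": False},
]
segment = evaluate_segment(customers, [
    lambda c: c["orders"] > 0,   # made at least one order
    lambda c: c["subscribed"],   # opted in to emails
])
# segment contains only customer 1
```

Because the conditions no longer run one after another, the wall-clock time for a many-condition filter approaches the cost of its slowest condition rather than the sum of all of them.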
Reining in rogue user processes
The next category of issues was related to badly configured user processes and integrations creating excess load and slowing down other processes.
Introduced operation quotas
Troubles were arising from improperly configured customer database synchronisation. When we handle customer data for a client who has 2,000,000 customers in their database, this data must be kept in sync between our server and the client’s. Such synchronisation would ideally be carried out once per day. In some cases, however, synchronisation was being carried out once every 10 minutes or so.
This was creating so much server load that other processes were starting to malfunction, for example triggers were being delayed and message sending was slowing down considerably.
To keep this from happening in future, every automated process now has a daily data transfer quota. If the limit is reached, the process is stopped and notifications are sent to the process owners.
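The quota mechanism can be sketched as a per-process counter that resets each day and refuses transfers once the limit is hit. This is an illustrative model, not our actual service; `DailyQuota` and the 1 GB limit are assumptions for the example.

```python
import datetime

class DailyQuota:
    """Per-process daily data transfer quota; exceeding it stops the process."""
    def __init__(self, limit_bytes):
        self.limit = limit_bytes
        self.used = 0
        self.day = datetime.date.today()

    def try_consume(self, nbytes):
        """Record a transfer; return False once the daily limit would be exceeded."""
        today = datetime.date.today()
        if today != self.day:          # a new day: reset the counter
            self.day, self.used = today, 0
        if self.used + nbytes > self.limit:
            return False               # caller stops the process and notifies its owner
        self.used += nbytes
        return True

# A sync job pushing 300 MB batches against a hypothetical 1 GB/day quota:
quota = DailyQuota(limit_bytes=1_000_000_000)
results = [quota.try_consume(300_000_000) for _ in range(4)]
# the fourth batch would exceed the quota and is refused
```

An overly eager integration that syncs every 10 minutes simply runs out of quota early in the day, instead of degrading triggers and message sending for everyone else.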
Throttling dangerous triggers
In certain cases it’s possible to create a trigger so complicated that it will eat up all available resources and other triggers simply won’t be able to fire. We updated our monitoring to identify and throttle such triggers:
Throttling the bad triggers ensures that innocent triggers can continue to go about their business as usual and user processes are not affected.
If throttling isn’t sufficient to keep a renegade trigger under control, it will be stopped and its owner notified.
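One common way to implement this kind of throttling is a token bucket: a trigger earns execution tokens at a fixed rate and may burst briefly, but a runaway trigger quickly drains its bucket and has to wait. The sketch below illustrates the general technique, not our production code; `TriggerThrottle` and the rates are hypothetical.

```python
import time

class TriggerThrottle:
    """Token-bucket throttle: a trigger gets `rate` executions per second,
    with short bursts of up to `burst`; an empty bucket means wait."""
    def __init__(self, rate, burst):
        self.rate = float(rate)
        self.burst = float(burst)
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self):
        """Refill tokens for elapsed time, then try to spend one."""
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # over budget: defer this firing (or escalate and stop the trigger)

# A runaway trigger trying to fire 100 times in a tight loop,
# limited to 10 executions/second with a burst of 5:
throttle = TriggerThrottle(rate=10, burst=5)
fired = sum(1 for _ in range(100) if throttle.allow())
# only the initial burst (plus whatever refills during the loop) gets through
```

Well-behaved triggers never notice the throttle, because their natural firing rate stays below the refill rate; only the resource hogs are slowed down.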
After making all these improvements we realised that we couldn’t just stop there.
Transition to Large Scale Scrum (LeSS)
The end of 2018 demonstrated the shortcomings of our previous working processes with autonomous products. We were unable to synchronise team backlogs due to the shared codebase. As a result we were falling short on product quality.
To address this issue we’re transitioning to Large Scale Scrum (LeSS) starting from 2019:
- Fewer epics open at once.
- One epic is split between 2-4 teams.
- The CPO decides which epic to work on next.
This way we hope to release major updates much faster, and quality will increase since experience will be shared across teams.
You can read more about the LeSS framework at https://less.works/, or come and talk to us about it during our open doors day on March 12th.
Here are our plans for the first half of 2019:
- An online demo store to showcase our automation features.
- Complete product localisation.
- Automatic billing and online customer billing account interface.
- A major recommendations epic with new algorithms and a new interface with flexible configuration and testing.
- New loyalty programme integration protocol, new conflict resolution for promotions, support for anonymous ordering.
Starting from the second half of 2019 we’ll be redesigning:
- Messaging (we’ll also be adding cascade messaging)
- Admin interface
- Project home page
P.S. We’re not completely doing away with the idea of autonomous teams; we’ve decided that the level of team autonomy should correspond to the level of autonomy of the components they work on. In other words, the further we go down the path of microservice architecture, the more autonomous the respective teams will be.
Reliability above all else
The second half of 2018 showed us that we hadn’t given due attention to process. The steps we took in November and December should take care of the most pressing issues, but in order to keep system reliability in top form we decided to dedicate one third of development resources to technical tasks – increasing reliability, speeding up development and improving architecture.