How We Almost Burned Down the House
November 2017 – we were prepared for the event of the year. Timetracker 4 was about to become publicly available.
Even though the list of new features was ultimately not that long, the significant change in architecture was quite a milestone for 7pace and the future of Timetracker. Product version 3 allowed time to be stopped only locally, on the Windows Client. Version 4 lifted this feature into the core of the server, accessible through an interface. In other words – time now ticked on the server, not on the client.
We knew that this roll-out was unlike any roll-out we had done before. Data needed to be converted. Old clients would stop working. Our servers would do something that they never did before – handle a huge number of ticking watches.
The roll-out of Timetracker for VSTS was planned in nine circles over a period of two months: starting with us internally, followed by close friends and related companies, then registered pre-release accounts, through five further enlargements of the audience, and finally everyone.
On November 28, 2017 we migrated the last circle in VSTS. For the previous circles, our load calculations had been correct. For the last circle, however, which added about 50% more users, they were not. Timetracker for VSTS is hosted in Azure, so there was significant power we could add just by selecting higher tiers and adding systems. Doing so (our hosting cost quadrupled in December), we managed to improve performance most of the time for most users, but we also saw the experience drop to a truly unacceptable state, without a clear pattern. Some users even appeared to experience bad performance all of the time. And at the time, we had no idea what the root cause of this unexpected, escalating resource consumption could be.
Technical Dive – What Caused This
With the initial release of Server Side Tracking, the new Timetracker Web Client was designed to hold a connection on all pages. Once a page was opened, its data was refreshed – even if the tab was not active in the browser – to ensure that data was always current. On every open tab. Timetracker also held a connection on every page of VSTS that displayed the “Start Tracking” button, which could easily include even more tabs. And there was one more major cause – the API. The API was designed to accept unlimited queries, and some users set up their systems to refresh full record sets every minute. All of this was by design and should have been manageable for the servers. However, it turned out that these factors multiplied, and the load grew far faster than expected.
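To make the multiplication concrete, here is a back-of-the-envelope load model in Python. All numbers are illustrative assumptions, not our real figures; the point is only that per-tab polling scales with users times tabs, on top of a flat API refresh cost.

```python
# Rough load model for per-tab polling. Every figure here is a made-up
# example, not a real Timetracker measurement.

def requests_per_minute(users, tabs_per_user, polls_per_tab_per_min, api_refreshes_per_min):
    """Each open tab polls independently; API integrations add a flat cost."""
    return users * tabs_per_user * polls_per_tab_per_min + api_refreshes_per_min

# Doubling the average number of open tabs doubles the polling load,
# even with a fixed user count.
base = requests_per_minute(10_000, 2, 1, 500)
heavy = requests_per_minute(10_000, 4, 1, 500)
print(base, heavy)  # 20500 40500
```

A per-user capacity estimate misses this: the driver is open tabs, not signed-in users, so the same audience can generate very different load depending on browsing habits.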
In mid-December, we planned the first sprint of 2018 for performance optimization, but for the time being, we decided to power up the Azure systems and distribute the load across eight web apps. This promised to do the job.
But another factor compounded the problem. As a first performance measure, we extended the cache handling with Redis, which we had previously used only for critical values in the context of fault tolerance. We released this update between Christmas and New Year's Eve, when fewer users were online. Everything looked promising. This additional caching even saved some memory on the web apps, which was a bonus – but the CPU usage of Redis under higher load would turn out to be the final blow.
On January 2nd, after the holidays, the number of requests returned to ‘normal’ and all the factors I mentioned escalated the issue. The caching drove Redis’ CPU to collapse, causing the application pool to wait endlessly for a response. A single API call could cause every user on one of the Azure web apps to experience a ten-minute page freeze.
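One general way to keep a slow cache from freezing a whole application pool is to put a hard time budget on cache calls and fall back to the primary store when the budget is exceeded. The sketch below shows the pattern in Python; the function names and the simulated cache are hypothetical, not Timetracker's actual code.

```python
import time

class SlowCacheError(Exception):
    """Raised (or simulated) when the cache cannot answer within the budget."""

def get_with_fallback(cache_get, key, load_from_db, timeout_s=0.2):
    """Try the cache first, but never let it block the request:
    if the cache call fails or exceeds the budget, serve from the
    primary store instead of waiting."""
    start = time.monotonic()
    try:
        value = cache_get(key, timeout=timeout_s)
        if time.monotonic() - start <= timeout_s:
            return value
    except SlowCacheError:
        pass
    # Cache was slow or failed: fall back to the database.
    return load_from_db(key)

# Usage with two stub caches: one healthy, one that can never answer in time.
fast = lambda key, timeout: "cached:" + key
def slow(key, timeout):
    raise SlowCacheError

print(get_with_fallback(fast, "u1", lambda k: "db:" + k))  # cached:u1
print(get_with_fallback(slow, "u1", lambda k: "db:" + k))  # db:u1
```

With a budget in place, a collapsing cache degrades into slightly slower requests against the database instead of ten-minute freezes for everyone on the same web app.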
We Learned a Lesson
Since December, we have rolled out updates – versions 4.3, 4.4 and 4.5. And the journey isn’t over – there will be a 4.6 and 4.7 in March 2018, until we have everything implemented to make Timetracker really fast. Most of the problems should already be solved; a lot of customers have confirmed that performance is back to normal.
Yes, we created the perfect storm. Needless to say, this particular cyclone will not repeat. We have since identified three factors that would have mitigated this problem – if not prevented it from ever reaching users – and they are the lesson we really learned.
Lesson 1: We had monitoring for server load, but its reports had never been critical before, and we simply assumed everything would be alright. In reality, our monitoring and alerting was insufficient. We received no notice when server load exceeded a healthy state – only after the user experience had already become bad. At the beginning of the roll-out, we didn’t even know that the problem was already present for users in some countries with longer latency. We relied on insufficient monitoring.
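In hindsight, even a crude alert rule on sustained load, firing before users feel anything, would have changed the timeline. A minimal sketch of the idea (the threshold and window are made-up examples, not our production values):

```python
# Alert on *sustained* high load rather than waiting for user complaints.
# Threshold and window size are illustrative assumptions.

def should_alert(samples, threshold=0.75, sustained=3):
    """Fire when the last `sustained` load samples all exceed `threshold`.
    Requiring several consecutive samples filters out harmless spikes."""
    recent = samples[-sustained:]
    return len(recent) == sustained and all(s > threshold for s in recent)

print(should_alert([0.5, 0.8, 0.9, 0.85]))  # True  - sustained high load
print(should_alert([0.5, 0.9, 0.6, 0.8]))   # False - only isolated spikes
```

The key property is that the rule triggers on server-side load, which degrades before user-facing latency does, so the team hears about trouble first.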
Lesson 2: We rolled out in exponentially growing tiers – something like 1, 2, 4, 8, 16, 32 accounts. As mentioned before, the last circle added only 50% more users. We did this because we expected hidden bugs to appear and wanted to keep the group of users experiencing them as small as possible. It never occurred to us that there could be a performance problem of that dimension; we believed our initial calculations. Rolling out in equal pieces, such as 7, 7, 7, 7, 7, 7, would have surfaced the issue earlier, with a smaller group newly added – hence easier to handle or roll back – and fewer users would have been affected.
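The effect of tier sizing can be shown in a few lines of Python. The capacity figure below is an illustrative assumption: with doubling circles, by the time cumulative load first exceeds capacity, far more users are already migrated than with equal circles.

```python
# Comparing roll-out tier strategies. The capacity number is a made-up
# example; the tier lists mirror the ones discussed in the text.

def users_exposed_when_limit_hit(tiers, capacity):
    """Cumulative users at the first tier that pushes the total past capacity."""
    total = 0
    for tier in tiers:
        total += tier
        if total > capacity:
            return total
    return None  # capacity is never exceeded

exponential = [1, 2, 4, 8, 16, 32]  # doubling circles
equal = [7, 7, 7, 7, 7, 7]          # equal circles

capacity = 40
print(users_exposed_when_limit_hit(exponential, capacity))  # 63
print(users_exposed_when_limit_hit(equal, capacity))        # 42
```

With doubling circles, the final step alone carries half of the whole audience, so a capacity problem that only appears near the limit hits the largest possible group on its first occurrence.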
Lesson 3: Because Server Side Tracking lets the clock tick on the server, load balancing and scaling across the system is tricky. We knew we would need to address this at some point, but it was not planned for the launch. The resources we could still add in Azure were huge – and more than sufficient, we thought. We ultimately ran eight Azure web apps at full scale, all at up to 100% load, and could not scale any further. We had accepted limited scalability because the limits seemed so harmless.
As with things like this, there was no plan for dealing with the aftermath. We could try to frame this as a success story – “too many new users at once”. But the reality is: this simply should not have happened. We know that we have stretched the patience and tolerance of a lot of you. Please accept my personal apologies – we, as a team, know that we failed to deliver perfect service. We decided to post the full story in the hope that we can at least contribute, in some way, to our fellow software developers’ experience; maybe one of you will find yourself in a similar situation some day and can learn from this case. We have learned a lot about performance and availability over the last three months.
Thank you for bearing with us.