Ask any engineer and they’d agree: with growth come performance and scale challenges. Instamojo is no different. At the rate we are growing, we have faced more than a few hiccups over time.
Our only source of performance data was the page load time reported by Google Analytics. An average load time of 18s is not good at all. We decided it was time to make performance one of our core priorities. The first basic rule of performance is to measure, and New Relic fit the bill perfectly for our stack.
At that time, our average server response time was approximately 370ms, but with many outliers crossing 30s. Over the next 2 months, we fine-tuned our New Relic integration to identify the real bottlenecks wherever we could.
It was then time to act on the insights we got from the New Relic and Google Analytics data:
1. Optimize database queries aggressively. In Django terms, we started making effective use of `prefetch_related` and `select_related`, and added proper indexes to our database tables.
2. Use a CDN for our static resources instead of serving them directly from S3. While this is easy to set up, we still needed a solution to avoid cache invalidation. Luckily, `django.contrib.staticfiles.storage.CachedStaticFilesStorage`, with some fixes, worked like a charm to generate hashed file paths for each of our files. For example, favicon.6b5a70e7628d.png is generated from favicon.png and can be cached by the CDN as well as the browser forever, without our having to deal with cache invalidation at all.
3. Serve gzipped static assets. We made a trade-off here and chose not to serve non-gzipped static assets, given that most browsers (and all modern ones) support gzip compression.
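To make items 2 and 3 concrete, here is a minimal standard-library sketch of the idea, not our actual setup and not Django's implementation. The helper names `hashed_name` and `gzip_bytes` are hypothetical, and the 12-character MD5 prefix is an assumption chosen only to match the shape of favicon.6b5a70e7628d.png.

```python
import gzip
import hashlib
from pathlib import PurePosixPath


def hashed_name(path: str, content: bytes) -> str:
    """Insert a short content hash before the file extension.

    Assumption: a 12-character MD5 prefix, picked to resemble the
    favicon.6b5a70e7628d.png example; any stable digest would do.
    """
    digest = hashlib.md5(content).hexdigest()[:12]
    p = PurePosixPath(path)
    return str(p.with_name(f"{p.stem}.{digest}{p.suffix}"))


def gzip_bytes(content: bytes) -> bytes:
    """Pre-compress an asset once, so it can be uploaded already gzipped."""
    # mtime=0 keeps the output identical across runs for identical input.
    return gzip.compress(content, mtime=0)


if __name__ == "__main__":
    css = b"body { margin: 0; }\n" * 50  # made-up asset content
    print(hashed_name("css/site.css", css))
    print(len(css), "bytes ->", len(gzip_bytes(css)), "bytes gzipped")
```

Because the name changes whenever the content changes, a file published under a hashed name never needs to be invalidated, which is what makes "cache forever" safe. A pre-gzipped file has to be served with a `Content-Encoding: gzip` header so that browsers decompress it.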
We also observed some serious memory leaks in our Django application, which caused memory thrashing and slowdowns.
Our servers are much more predictable now. While we couldn’t fix the memory leaks completely, we mitigated their effects and have workarounds in place. We also started moving our various services (both internal and external) onto separate servers, so that they have minimal impact on each other.
How are we doing today?
For the last 7 days (from New Relic):
1. Measure and monitor. You can’t fix problems that you don’t know about.
2. Scalability. Design your systems for horizontal scalability. You’ll outgrow a single server, however big that server might be. It’s only a question of when.
3. It’s equally important to monitor and measure the performance of all third-party services and modules. (I’m looking at you, sorl-thumbnail.)
4. Improving performance is a continuous process.
We have come a long way, but we still have a long way to go. Join us and help us beat rocket speed.
- Sai Prasad