Recently our team was preparing for the first launch of a massive operations-focused application. The team had run all of the basic load & performance testing required to meet our contractual obligations. The app was in a temporary code freeze – hot-fix-only time. PMs & Scrum Masters anxiously worked down the checklist of activities needed to hit a successful launch.
Let's Put It To The Test
In a dark room in the wee hours of the morning, a few days before launch, we decided to test the system under extreme load. The app had performed just fine at 1x the required future-state load in past testing efforts.
- 1.5x: Anticipated slowdown in non-mission-critical functions, but overall the app worked great.
- 2x: New bug identified. Added to the hotfix list, no big deal.
- 2.5x: Still humming along!
Then to 3x the intended future-state load. Oddly, performance degraded rapidly. Error counts rose well beyond acceptable levels, yet no recognizable errors were being recorded in the application or the service layer.
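A ramp like the one above is easy to script. The sketch below is a minimal illustration, not the team's actual harness: the request is a stub standing in for the real endpoint, and the baseline concurrency is an assumed number.

```python
import concurrent.futures
import random

BASELINE_USERS = 10  # assumed concurrency representing 1x load


def call_endpoint() -> bool:
    """Stub standing in for a real HTTP call; returns True on success.

    In a real harness this would issue a request & check the response.
    """
    return random.random() > 0.001


def error_rate(multiplier: float, requests_per_user: int = 20) -> float:
    """Run the stubbed workload at `multiplier` x baseline & return the error rate."""
    workers = int(BASELINE_USERS * multiplier)
    total = workers * requests_per_user
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda _: call_endpoint(), range(total)))
    return results.count(False) / total


# Step through the same multipliers used in the test session.
for m in (1.0, 1.5, 2.0, 2.5, 3.0):
    print(f"{m}x load: error rate {error_rate(m):.2%}")
```

In practice a tool like a commercial or open-source load tester would replace the stub, but the shape of the ramp – fixed multiples of baseline, error rate recorded at each step – is the same.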
Most developers would leave well enough alone. Heck, we were at 3x FUTURE-state load; that's years of growth. However, we knew the customer was growing & might grow faster than expected. Plus, over time the solution would likely grow horizontally as the user base grew vertically. We dug in a bit & performed some limited debugging before raising any alarm.
Initial debugging & evaluation suggested the data layer was at fault. However, the data structures under a mission-critical operations app can be complex, & there would not be time to dig deeply into performance monitoring across several of them only a few days from a production release. Still, the team was concerned with quality & wanted to see if they could find an answer quickly.
Given those data points, they made a call to the infrastructure team. Together, they decided to push a little further & see if the solution would perform at 3X if they scaled the infrastructure to meet this new demand. They chose to leverage Azure’s scaling features to run a few additional tests.
First, they scaled the front-end web servers to 6x required sizing. No change when re-running the test suite at full load. Then they scaled the service tier to 6x as well. Still no change under load testing.
Fortunately, over the course of the project, the team had compiled roughly 20,000 unit tests & full test cases. By running a series of specific regression tests at load, they were able to isolate the issue & pin it down to a simple SQL problem: a script had failed during deployment, leaving a missing index on one of the ancillary tables on the production server. The DBA quickly added the index & re-ran the tests.
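That failure mode – an ancillary table missing its index – is easy to reproduce in miniature. The sketch below uses Python's built-in sqlite3 module with invented table & column names; the actual system's database & schema were different, but the planner behavior is the same idea: without the index the engine falls back to a full table scan, & adding the index switches it to an index search.

```python
import sqlite3

# In-memory database with a small stand-in for an "ancillary" table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE call_notes (id INTEGER PRIMARY KEY, account_id INTEGER, note TEXT)"
)
conn.executemany(
    "INSERT INTO call_notes (account_id, note) VALUES (?, ?)",
    [(i % 1000, "note") for i in range(10_000)],
)


def plan(sql: str) -> str:
    """Return SQLite's query plan for `sql` as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(str(r) for r in rows)


query = "SELECT note FROM call_notes WHERE account_id = 42"
print(plan(query))  # without the index: a full table SCAN

conn.execute("CREATE INDEX ix_call_notes_account ON call_notes (account_id)")
print(plan(query))  # with the index: a SEARCH ... USING INDEX
```

At 1x load a scan like this hides inside acceptable response times; it is only under multiples of production load that the linear cost shows up as the error spike the team saw.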
Testing & Debugging Conclusion
In less than 2 hours, the team identified a previously unknown issue, tested & fixed it, & continued with the launch. Had we not been able to scale, test, & evaluate the application at an extremely high level of load, we might not have caught this basic issue until after the launch.
The impact of this particular issue was evaluated by a BA days later. While small per call, the missing index would have cost the customer's CSRs around 2 seconds per call. Over the course of a year, those 2 seconds extrapolated out to $15,000 in lost productivity. It's amazing the impact 2 hours of expertise & cheap, scalable Azure servers can make.
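The back-of-the-envelope math behind a figure like that is worth making explicit. The call volume & labor rate below are assumptions chosen for illustration, not figures from the engagement; only the 2 seconds per call comes from the story above.

```python
SECONDS_LOST_PER_CALL = 2      # per-call cost of the missing index (from the BA's evaluation)
CALLS_PER_YEAR = 900_000       # assumed annual CSR call volume
LOADED_HOURLY_RATE = 30.0      # assumed fully loaded CSR cost, $/hour

hours_lost = SECONDS_LOST_PER_CALL * CALLS_PER_YEAR / 3600
annual_cost = hours_lost * LOADED_HOURLY_RATE
print(f"{hours_lost:.0f} hours lost, about ${annual_cost:,.0f} per year")
```

With these assumed inputs the 2 seconds per call works out to roughly $15,000 a year; swap in the real call volume & labor rate & the same three lines give the actual exposure.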
Since 1981, Oakwood has been helping companies of all sizes, across all industries, solve their business problems. We bring world-class consultants to architect, design and deploy technology solutions to move your company forward. Our proven approach guarantees better business outcomes. With flexible engagement options, your project is delivered on-time and on budget. 11,000 satisfied clients can’t be wrong.