Everyone has heard about SQL Server. We share client stories in the hope that they will spark an idea, prompt someone to take action, or otherwise turn on a light bulb to alleviate some sort of corporate darkness (ha), and this story is, frankly, filled with corporate darkness – the type of darkness we see all too frequently. In fact, Rob Helvey recently wrote an article that ties right into this: The Problem with Infrastructure Operations Managing Applications. This story bears out his position.

Broken SQL Environments – A Client Story

Our client runs their ERP system on a large, mission-critical SQL Server but, over the years, has not followed best practices or performed regular maintenance on the SQL environment. They’ve been able to get away with this for some time thanks to the reliability and ease of use of SQL Server, but, well, it can’t last forever, now can it?

Here’s the situation we walked into: the database environment was in terrible shape. CPU utilization on the database server was consistently at 100%, and during the busiest time of day the system would become unresponsive. The mission-critical ERP system was at a standstill and no work could be done; the manufacturing sites, and the business itself, came to a halt.

Key Issues

Leading up to the system's current state, the client had been experiencing poor performance with their ERP system. Their IT team started working on the problem by adding numerous indexes to the database, but that didn't help and performance continued to degrade. During that process they noticed that performance improved temporarily after a reboot, so they scheduled regular server restarts; performance kept degrading until multiple restarts were needed throughout the day, which led them to attempt a different fix. Unfortunately, that fix created the CPU utilization issues mentioned above and eventually led to the system locking up during periods of heavy load. An expensive monitoring tool was purchased and installed on the SQL Server, and the tool's own monitoring activity actually made the problem worse. Their final move was to upgrade the server in the hope of better performance. The business cost was mounting, and the company's reputation was beginning to suffer.

Summary

  • Added more SQL indexes – no improvement
  • Scheduled regular server restarts – temporary relief only
  • Bought and installed an expensive monitoring tool – problem got worse
  • Added more hardware – no improvement

Once they called for help, we were able to fix the problem in one day, without buying any more software or hardware.

“How?” you might be asking right about now.

Action Steps

We quickly got to the root cause of the problem, established a plan of attack, put the plan into action, and then created a custom maintenance plan based on business rules so that this situation would never arise again. Even if you are not a SQL Server expert (as I am not), the root cause analysis below (see the end of this post) should make you shudder (as I did), and the pre- and post-remediation statistics are shudder-worthy as well.
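
To give a sense of what such a plan involves, here is a minimal sketch of threshold-based index and statistics maintenance on SQL Server. It is an illustration only, not the client's actual plan: the 10% and 30% fragmentation thresholds are common rules of thumb, and the real plan was shaped by the client's own business rules (maintenance windows, backup schedule, and so on) rather than a generic script.

    -- Minimal sketch of threshold-based index and statistics maintenance.
    -- The 10%/30% thresholds are common rules of thumb, not the client's
    -- actual business rules.
    DECLARE @sql nvarchar(max);

    DECLARE maint CURSOR FOR
        SELECT CASE WHEN ps.avg_fragmentation_in_percent >= 30
                    THEN N'ALTER INDEX ' + QUOTENAME(i.name) + N' ON '
                         + QUOTENAME(s.name) + N'.' + QUOTENAME(o.name) + N' REBUILD;'
                    ELSE N'ALTER INDEX ' + QUOTENAME(i.name) + N' ON '
                         + QUOTENAME(s.name) + N'.' + QUOTENAME(o.name) + N' REORGANIZE;'
               END
        FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ps
        JOIN sys.indexes AS i ON i.object_id = ps.object_id AND i.index_id = ps.index_id
        JOIN sys.objects AS o ON o.object_id = ps.object_id
        JOIN sys.schemas AS s ON s.schema_id = o.schema_id
        WHERE ps.avg_fragmentation_in_percent >= 10  -- ignore lightly fragmented indexes
          AND ps.index_id > 0;                       -- skip heaps

    OPEN maint;
    FETCH NEXT FROM maint INTO @sql;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        EXEC sp_executesql @sql;
        FETCH NEXT FROM maint INTO @sql;
    END
    CLOSE maint;
    DEALLOCATE maint;

    -- Refresh optimizer statistics once the index work is done.
    EXEC sp_updatestats;

Checking fragmentation first, rather than blindly rebuilding everything, is what keeps the maintenance window short on very large tables like the ones in this environment.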

The bottom line here is that SQL Servers should not behave like this, and trying to solve fundamental SQL Server issues with standard network server troubleshooting techniques will not work. This client spent more money than they needed to spend, with negative results, because they did not have a properly trained SQL Server expert involved in diagnosing and remediating the issue.

SQL Server Root Cause Analysis

  • SQL Server configuration not appropriate for workload
  • Object maintenance (index and statistics upkeep) not run within a time frame relevant to the workload
  • Excessive number of indexes (most of them unused), all fragmented and all maintained by every DML operation (very large tables with nearly 100 columns and an index on every column)
  • Long-running queries with inefficient query plans, caused by out-of-date statistics and fragmented indexes, contributing to extensive lock waits and then blocking (top wait statistics were LOCK- and parallelism-related; see the diagnostic queries after this list)
  • Backups taking more than 48 hours and never completing because successive runs overlapped
  • ‘AppDomain’ memory dumps, flushing the procedure cache every 15-30 seconds
  • Page life expectancy consistently near 0
  • DBCC checks not occurring
  • Average Runnable Tasks per Scheduler: ~10
  • CPU utilization: 95-100% consistently
  • Boost SQL priority set ON
  • Server-level and database-level configuration left entirely at defaults
  • Best-practice tempdb and data file sizing absent
  • Reporting run directly against the live production environment
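
For readers who want to look for similar symptoms in their own environment, the queries below sketch how several of these findings (the mostly unused indexes, the LOCK and parallelism wait statistics, and the priority boost setting) can be surfaced with standard SQL Server DMVs and sp_configure. They are generic illustrations, not the exact scripts we ran.

    -- Indexes that are written to far more than they are read:
    -- candidates for the "excessive, mostly unused indexes" finding above.
    SELECT OBJECT_NAME(ius.object_id)                          AS table_name,
           i.name                                              AS index_name,
           ius.user_seeks + ius.user_scans + ius.user_lookups  AS reads,
           ius.user_updates                                    AS writes
    FROM   sys.dm_db_index_usage_stats AS ius
    JOIN   sys.indexes AS i
           ON i.object_id = ius.object_id AND i.index_id = ius.index_id
    WHERE  ius.database_id = DB_ID()
      AND  i.index_id > 0
    ORDER BY writes DESC, reads ASC;

    -- Top wait types since the last restart (filter out benign system waits
    -- in practice); here, LOCK and parallelism (e.g. CXPACKET) waits dominated.
    SELECT TOP (10) wait_type, waiting_tasks_count, wait_time_ms
    FROM   sys.dm_os_wait_stats
    ORDER BY wait_time_ms DESC;

    -- Instance settings left at defaults or misconfigured, including
    -- 'priority boost', which was ON here.
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sp_configure 'priority boost';
    EXEC sp_configure 'max degree of parallelism';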

Pre-Remediation Implementation Performance Results

  • Page Life Expectancy per NUMA: <50
  • Average Runnable Tasks per Scheduler: 10-20
  • CPU utilization: 95-100% consistently
  • Large Index Fragmentation: >95%

Post-Remediation Implementation Performance Results

  • Page Life Expectancy per NUMA: >11,000
  • Average Runnable Tasks per Scheduler: 0
  • CPU utilization: normal
  • Large Index Fragmentation: <5%
  • SQL Server environment is now stable.
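
For context, the page life expectancy and runnable-task figures above come from standard SQL Server counters and DMVs. The queries below are a sketch of how such numbers are typically collected (not our monitoring script): the first reports page life expectancy per NUMA buffer node, the second the runnable task count per scheduler.

    -- Page life expectancy per NUMA (buffer) node, in seconds.
    SELECT [object_name], instance_name AS numa_node, cntr_value AS page_life_expectancy
    FROM   sys.dm_os_performance_counters
    WHERE  counter_name = N'Page life expectancy'
      AND  [object_name] LIKE N'%Buffer Node%';

    -- Runnable tasks per scheduler: work queued and waiting for CPU.
    -- Values consistently above 0 indicate CPU pressure; near 0 is healthy.
    SELECT scheduler_id, current_tasks_count, runnable_tasks_count
    FROM   sys.dm_os_schedulers
    WHERE  status = N'VISIBLE ONLINE';

Page life expectancy climbing from under 50 to over 11,000, and runnable tasks per scheduler dropping to 0, is exactly the before-and-after picture summarized above.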

Next Steps

Review our case studies and engagements to see how we have helped companies just like yours meet a variety of business needs.


About Oakwood

Since 1981, Oakwood has been helping companies of all sizes, across all industries, solve their business problems. We bring world-class consultants to architect, design, and deploy technology solutions that move your company forward. Our proven approach guarantees better business outcomes. With flexible engagement options, your project is delivered on time and on budget. 11,000 satisfied clients can't be wrong.

