A tale of performance impact and root cause analysis
Every organization faces minor and major performance issues at some point. These issues can range from impact experienced by users to systems showing sluggish or slow behavior. Root cause analysis is usually performed when issues persist or when the impact on performance cannot be ignored. Performing such an analysis can be quite cumbersome. This blog walks through a typical search for answers.
1. Problem identification
Performance-related issues are usually recognized when users complain about daily tasks running slow, devices feeling sluggish, or business processes being noticeably impacted. User complaints normally end up in ticketing systems, provided users actually bother to create tickets or a performance threshold is hit in a monitoring system. To determine the magnitude of the issue, we have to search for similar items in our systems and combine them into one overview.
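Combining similar items into one overview can start very simply. Below is a minimal sketch, assuming a hypothetical ticket export (the `tickets` list and the keyword list are illustrative, not from any specific ticketing system), that flags performance-related complaints by keyword to estimate the magnitude of the issue:

```python
# Hypothetical ticket export: in practice these records would come from
# your ticketing system's API or a CSV export.
tickets = [
    {"id": 101, "summary": "ERP order entry slow since Monday"},
    {"id": 102, "summary": "Login takes over a minute"},
    {"id": 103, "summary": "ERP reports slow to load"},
    {"id": 104, "summary": "Printer out of toner"},
]

# Illustrative keyword list; a real triage would use richer matching.
KEYWORDS = ("slow", "sluggish", "timeout", "takes over")

def is_performance_ticket(summary: str) -> bool:
    """Naive keyword match to flag performance-related complaints."""
    text = summary.lower()
    return any(kw in text for kw in KEYWORDS)

perf_tickets = [t for t in tickets if is_performance_ticket(t["summary"])]
print(f"{len(perf_tickets)} of {len(tickets)} tickets look performance-related")
```

Even a crude filter like this gives a first overview of how widespread the complaints are before deeper analysis begins.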
2. Problem description
After collecting similar issues and adding additional information, we have to write a summary description that characterizes the problem. This assumes we gathered enough data on incidents and behavior during the identification phase. Describing the problem can prove challenging, since agreed-upon baselines and run times are often not available for the digital workloads we run. As a result it can be difficult to determine what ‘slow’ means compared to ‘normal’, because we never defined what the acceptable threshold actually is.
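When no agreed threshold exists, one pragmatic workaround is to derive a provisional one from historical measurements taken while performance was considered acceptable. A minimal sketch, assuming hypothetical run times for a nightly batch job:

```python
import statistics

# Hypothetical historical run times (seconds) for a nightly batch job,
# collected while performance was considered acceptable.
baseline_runs = [118, 121, 119, 124, 117, 122, 120, 125, 119, 123]

# One pragmatic definition of "normal": the historical mean plus a
# margin of a few standard deviations.
mean = statistics.mean(baseline_runs)
stdev = statistics.stdev(baseline_runs)
threshold = mean + 3 * stdev

def is_slow(run_seconds: float) -> bool:
    """Provisional definition of 'slow' relative to the derived baseline."""
    return run_seconds > threshold

print(f"baseline mean={mean:.1f}s, threshold={threshold:.1f}s")
print(is_slow(180))
```

This does not replace an agreed service-level threshold, but it at least turns ‘slow’ from a feeling into a number that can be discussed.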
3. Gather problem data from monitoring sources
The problem description should outline which parts of the solution stack the performance issues may reside in. Since these parts are often managed by multiple teams, each using its own set of tools, gathering centralized data takes time and coordination. For example, application administrators may use a generic APM solution that monitors web performance, while database administrators use proprietary monitoring for their specific database and datacenter administrators use global monitoring solutions for all infrastructure components. Each solution delivers data in a different format, without correlation between the datasets, and interpreting the data requires specialist knowledge. In addition, teams tend to believe the issue does not reside in their part of the stack, since everything ‘works as intended’ from their perspective.
4. Analyze and correlate data
After all data is collected and centralized, going through it normally requires expertise from multiple teams to match incidents to deviations in the data. Sometimes alerts or data in certain parts of the stack lead to a clear understanding of the problem, but often this is not the case. Because the data was collected in a dispersed way, without normalization of, for example, timestamps or key performance metrics, comparing situations proves difficult and cumbersome. Still, we often find items that look different than expected and that we want to include in a change to improve performance.
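Timestamp normalization is usually the first hurdle when correlating exports from different tools. Below is a minimal sketch, assuming two hypothetical exports (one with ISO-8601 timestamps including an offset, one with epoch seconds), that converts everything to UTC and merges the events into one timeline:

```python
from datetime import datetime, timezone

# Hypothetical exports: each monitoring tool reports timestamps in its
# own format, so the first step is normalizing everything to UTC.
apm_events = [("2024-03-01T14:05:00+01:00", "web response p95 spike")]
db_events = [("1709301960", "lock wait time increased")]  # epoch seconds

def from_iso(ts: str) -> datetime:
    """Parse an ISO-8601 timestamp and convert it to UTC."""
    return datetime.fromisoformat(ts).astimezone(timezone.utc)

def from_epoch(ts: str) -> datetime:
    """Parse Unix epoch seconds as a UTC timestamp."""
    return datetime.fromtimestamp(int(ts), tz=timezone.utc)

# Merge both sources into a single chronologically sorted timeline.
timeline = sorted(
    [(from_iso(ts), "apm", msg) for ts, msg in apm_events]
    + [(from_epoch(ts), "db", msg) for ts, msg in db_events]
)

for when, source, msg in timeline:
    print(when.isoformat(), source, msg)
```

Once events share one clock, it becomes much easier to see which deviation preceded which incident, which is exactly the correlation work this step describes.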
5. Construct and implement change for improvement
To combat the performance impact, we use the outcomes of the data analysis to construct a change that could solve the issue. These changes tend to be silo-centric, as they have to be applied by the specific teams managing that part of the stack, and a single change can include multiple tasks to be performed by various teams.
6. Determine whether the issue is resolved
After the changes have been implemented, we want to determine whether they helped resolve the issue. Some changes are small, which makes it easy to capture their results; others are more complex, making it difficult to measure the effect of their various parts. This usually means we fall back on waiting for user feedback or re-examining the metrics used to identify the problem.
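Re-examining the original metrics can be as simple as comparing samples taken before and after the change. A minimal sketch, assuming hypothetical response-time samples in milliseconds:

```python
import statistics

# Hypothetical response-time samples (ms) captured before and after
# the change was implemented.
before = [840, 910, 880, 950, 870, 920]
after = [610, 580, 640, 600, 590, 630]

def pct_improvement(before_samples, after_samples) -> float:
    """Relative improvement of the median response time, in percent."""
    b = statistics.median(before_samples)
    a = statistics.median(after_samples)
    return (b - a) / b * 100

print(f"median response time improved by {pct_improvement(before, after):.0f}%")
```

Using the median rather than the mean keeps a few outliers from masking (or exaggerating) the effect of the change.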
Use Tuuring for faster root cause analysis
The Tuuring platform allows persistent data collection from a wide variety of performance sources. Because the platform normalizes, enriches, and indexes incoming data into consistent patterns, it enables analysis on equalized and documented datasets. This allows multi-silo analysis of performance data with built-in knowledge from our experts. Dynamic baselines provide objective performance measurements for all types of user interactions and processes. Machine learning capabilities such as trend analysis and anomaly detection speed up root cause analysis by comparing problematic timeframes to acceptable baseline values measured throughout the entire stack. Going through the process from problem identification to actual results becomes smoother and more accurate, reducing the mean time to resolve and improving performance faster.
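To give a feel for the idea behind anomaly detection against a baseline, here is a deliberately minimal illustration — not Tuuring's actual implementation — that flags points deviating strongly from a rolling window of recent history (the series and parameters are made up):

```python
import statistics

# Hypothetical per-minute response times (ms); illustration only,
# not Tuuring's actual algorithm.
series = [200, 205, 198, 210, 202, 207, 204, 480, 199, 203]
WINDOW = 5    # number of preceding points used as the local baseline
Z_LIMIT = 3.0 # how many standard deviations counts as anomalous

def anomalies(values, window=WINDOW, z_limit=Z_LIMIT):
    """Return indices of points that deviate strongly from the preceding window."""
    flagged = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mean = statistics.mean(history)
        stdev = statistics.stdev(history)
        if stdev and abs(values[i] - mean) / stdev > z_limit:
            flagged.append(i)
    return flagged

print(anomalies(series))
```

Production systems use far more robust techniques, but the principle is the same: a baseline learned from normal behavior turns a raw metric stream into a short list of moments worth investigating.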
See for yourself
If you are interested in seeing what a centralized AIOps platform specifically designed for digital performance can do for you, please visit our platform solution page to continue reading.