Saturday, 1 November 2014

What other product can learn from Oracle database performance management

Whenever I diagnose and tune oracle performance issues I am filled with a deep sense of respect for the engineers who designed and implemented the diagnostic capabilities of the Oracle database.

Levels of performance management can be
  • Descriptive
    •  What happened? ie Monitoring
  • Diagnostic
    • Why it happened?
  • Prescriptive
    • What to do?
Oracle does a excellent job at all the three levels

Oracle is a complex product with a more complex memory and process architecture that a general server side application. This makes the performance management more challenging. On top of it characterization of workload of a general server application is simple while in Oracle  DB even if you look at selects they can have so different resource and time foot print and can be of so many types. All this implies that the performance management of Oracle DB is very challenging. This makes the DB diagnostic solutions provided by Oracle even more admirable.

Oracle provides tuning parameters for each subsystem, be it checkpoint process, LWR process etc or the size of shared pool, Buffer cache, SGA, PGA etc. Oracle diagnostic data(AWR Reports) provides rich information to detect inefficiencies in any of the process or memory subsystems or hardware capacity. AWR report contains the top foreground waits which help in identifying which subsystem is becoming bottleneck and needs to be tuned or needs more capacity. The report identifies the high load SQLs that are candidates of tuning and optimization.

 Oracle goes beyond just diagnostics to prescriptions of solutions also. ADDM can be used analyse the AWR data and it will provide actionable recommendations to optimize the system. SQL Tuning Advisor can be used get the tuning recommendations for SQL query.

Oracle has achieved all the above three levels(Monitoring, Diagnostics, and Prescription ) where a capable customer can do it all by himself without needing support from Oracle, greatly reducing the mean time to response for performance issues as well as providing rich information for proactive performance management

Hats off to the architects and engineers of Oracle Diagnostics !

In future posts we will discuss how to do root cause analysis of Oracle issues using AWR reports

Other Posts

Thursday, 30 October 2014

CMG India 2014: 1st Annual Conference

Computer Measurement Group India is having its first annual conference CMG India 2014 in Performance Engineering and Capacity Management in Pune this December 2014


Dates : Fri Dec 12th 2014 (9am to 5:30pm + Dinner) & Sat Dec 13th 2014 (9am to 4:30pm)


Venue: The conference will be co-located at Persistent Systems and Infosys, Phase I, Rajeev Gandhi Infotech Park, Hinjewadi, Pune 411057. (The offices are opposite each other.)


The conference has three tracks


Other Posts

Sunday, 26 October 2014

Capacity Planning: Working within operational capacity


 
Final objective of all performance engineering, validation etc is to deliver performance that meets service requirements in production providing a smooth experience to the customer. One aspect is engineering of software  that has high capacity, low response time and optimally utilizes the resources(CPU, Memory, Storage and network), another aspect is managing the operations for optimal working of the software.  Even a well engineered software if used beyond its capacity will provide very poor user experience.  

Providing performance in the production environment is more important than just demonstrating it in performance labs.

Software utilizes hardware infrastructure to function and provide service. saturation in the infrastructure(easy to detect) implies saturation of software
In the world of distributed software, software modules/ components takes services of each other to provide service. Any one of the module that gets saturated will saturate the entire system or a major subsystem


Smooth operations implies all the subsystems work within their operational capacity.
Problem detection implies that monitoring is in place for the user experience so that problems are detected well before the service levels become completely unacceptable
Diagnostics implies that we can identify the subsystem(infrastructure + software service) that has saturated and take corrective actions
Predictive analytics implies that we can foresee the capacity issues before they arise.


General respons time vs  workload graph is below. The major characteristics for this curve is same for the hardware or software service.
 
If capacity and utilization of each subsystem is clear we have the decision support information to ensure smooth operations.

Operation of each subsystem should be within the operational capacity otherwise the performance will suffer.

Subsystem with the highest utilization is the one that will saturate first at higher workload and become the bottleneck.

Head room shown in the graph is the additional workload that the system can endure before it saturates. Depending on the risk tolerance additional capacity  can be added while operational head room is still remaining.

 
Sounds simple! what are the challenges?

 
In a distributed deployed solution there are too many subsystems.Lets say thousands of hardware equipment and similar number of software components
  • Are you monitoring all these subsystems?
  • what is the workload in these subsystems?
  • Do you know what is the capacity of these subsystems?
  • Do you know what is the utilization of these subsystems?
  • What is the head room in these subsystems? To endure more load

Another challenge is the workload itself. This depends on type of business. Inherently in business like eCommerce, Internet services there can be sudden surges in the end user activity. Some of these events can be anticipated like in festive season you can expect more online shopping or after a big marketing initiative with discounts there can be a big stress/workload on the eCommerce software services. So the systems that are working within operational capacity get saturated and provide poor end user experience at the critical business period.Typically govt websites for tax or online for submissions become unresponsive around the last dates. In trading systems lot of activity is driven by market volumes and critical events like corporate results, interest rate movements etc can trigger higher activity.

  • Do you have historical workload for your services and the operational baseline of workload
  • Work load trends (regular + seasonal)
  • Can you forecast the workload for important business event?
  • Is your capacity elastic ie if you know you need twice the capacity can you add it in time. More hardware + more software service + load distribution
Next challenge is do you understand the relationship between service utilization and the workload?
  • Do you have required  models?
  • Are you capturing the data required for these models?In all sub systems?
    • OS provides monitoring info about the hardware
    • What about the middleware?
    • What about the application software?
    • What about the DB?
  • Do you know the highest  load to which you can drive the system at which the response time meets the service requirements not in labs but in production.
We will discuss various models that solve these problems and data that you need to solve these issues in future posts.

Other Posts

Friday, 24 October 2014

Little's Law

The long-term average number of customers in a stable system L is equal to the long-term average effective arrival rate, λ, multiplied by the average time a customer spends in the system, W; or expressed algebraically: 
 
Stable system : 
  1. Steady state
  2. Number of arrivals equals number of departures
  3. System is not in beyond saturation state
 λ - Arrival rate :
  1. As arrivals equal departures, this is also equivalent to throughput
W -  average time a customer spends in the system :
  1. This includes the time spent in wait as well as service so it is response time
L-  average number of customers in a stable system
  1. This is work in progress 
  2. transaction in queue as well as being serviced
  3. requests in queue and in progress in the system
  4. concurrent users in the system

Requests in queue + in progress= Mean Response Time * Throughput


In a test configured for 10000 requests/ transactions per seconds if mean response time is 10 milliseconds than at a time pending + in progress requests are (10000 * 0.01=) 100.

While trying different configurations of server component the configurations in which response time is slow for the same throughput the memory consumption is higher. This is expected from little's law. At the same throughput if response time is larger than the requests in queue and in progress will be more hence higher memory consumption.

Mean Response Time = Requests in queue + in progress/Throughput

In systems where in queue/progress requests are known and throughput is known the mean response time can be calculated.


Little's law  is also used for validation of the performance test, if all the three indicators are tracked in the test than the relationship must hold


Other Posts


Sunday, 25 May 2014

Analytical thinking in performance analysis

Many performance job requirements list analytical skills as one of the requirements. For people who love analysis this is a interesting job.

Lot of analysis is done using mental models or mathematical models for the scenario. Without analysis the test results are just numbers.

Major limitation of performance testing is that you cannot test all possible scenarios and use-cases. Testing and modeling complement each other. One cannot expect major re-engineering decision to be taken based on performance flaws pointed out by analytical reasoning only, they have to be backed by actual measurements.

Good modeling requires a solid understanding of how things work. When you keep looking at data from various perspectives and keep asking questions interesting insights emerge.

Below are some of the scenarios where modeling and analysis can complement testing and measurements.

Modeling and pre-analysis helps in design of right tests

For example from the understanding of architecture, when you expect synchronization issues ,you will design the test so that saturation even while CPU resource is available is clearly established

Modeling and analysis can even provide bounds on  expected results

For example, In case where all concurrent transactions serialize on a shared data structure, we know that time spent in synchronized block is serial. So if 50 microseconds are being spent inside synchronized block then there can be at most 20000 transactions per second beyond which the system will saturate.This limit is independent of number of threads executing the transactions in the server and number of available cores. If synchronization is the primary bottleneck by reducing the time spent in synchronized block to half we can double the throughput.

Modeling and analysis help in identifying the bottleneck

Simple model above has given a upper limit on throughput which can be validated by testing. if the solution saturates before that we know that primary bottleneck is somewhere else. Quite often people justify the observed results just on hunches like “Oh! solution saturated using 50% of CPU,it must be a synchronization issue” and accept the results.Without a right quantitative approach this can hide the actual bottleneck. For example in the above example if the saturation is much before 20000 transactions per seconds asking the poor developer to optimize the synchronization will not really solve the problem as the bottleneck is somewhere else

Modeling and analysis can help in answering the what if scenarios.

For example you have tested using 10 G network and someone asks what are your projections if the customer runs on a 1 G network? What are the capacity projections for the solution in this scenario?
What do you think we need higher per core capacity or larger number of cores for the deployment of the solution? if we double the available RAM what kind of improvements can we expect?
If we double the number of available cores what is the expected capacity of the system?

Modeling and analysis help in evaluation of hypothesis

For example someone thinks a particular performance problem is due to OS scheduling, this can be analytically evaluated. For example we had a scenario in which 600 threads were scheduled on 8 core server. When there were performance issues developer felt problem should be from thread contending to get scheduled. When we created the model of how CPU was being used it was very obvious the issue was not of scheduling.By accepting a false hypothesis the real cause of issues remain hidden. 

Modeling and analysis enhances our understanding of the system

When models dont match the measurements there are new learnings as we revisit our assumptions

Modeling and analysis helps in extrapolation of results

We might want to extrapolate the results due to various reasons , there can be practical limitations like you dont have enough hardware to simulate the required number of users.

We will discuss how we can model various performance scenarios in future posts.

Related Post :

Software bottlenecks:- Understanding performance of software systems with examples from road traffic