Thursday, 30 October 2014

CMG India 2014: 1st Annual Conference

Computer Measurement Group India is having its first annual conference CMG India 2014 in Performance Engineering and Capacity Management in Pune this December 2014


Dates : Fri Dec 12th 2014 (9am to 5:30pm + Dinner) & Sat Dec 13th 2014 (9am to 4:30pm)


Venue: The conference will be co-located at Persistent Systems and Infosys, Phase I, Rajeev Gandhi Infotech Park, Hinjewadi, Pune 411057. (The offices are opposite each other.)


The conference has three tracks


Other Posts

Sunday, 26 October 2014

Capacity Planning: Working within operational capacity


 
Final objective of all performance engineering, validation etc is to deliver performance that meets service requirements in production providing a smooth experience to the customer. One aspect is engineering of software  that has high capacity, low response time and optimally utilizes the resources(CPU, Memory, Storage and network), another aspect is managing the operations for optimal working of the software.  Even a well engineered software if used beyond its capacity will provide very poor user experience.  

Providing performance in the production environment is more important than just demonstrating it in performance labs.

Software utilizes hardware infrastructure to function and provide service. saturation in the infrastructure(easy to detect) implies saturation of software
In the world of distributed software, software modules/ components takes services of each other to provide service. Any one of the module that gets saturated will saturate the entire system or a major subsystem


Smooth operations implies all the subsystems work within their operational capacity.
Problem detection implies that monitoring is in place for the user experience so that problems are detected well before the service levels become completely unacceptable
Diagnostics implies that we can identify the subsystem(infrastructure + software service) that has saturated and take corrective actions
Predictive analytics implies that we can foresee the capacity issues before they arise.


General respons time vs  workload graph is below. The major characteristics for this curve is same for the hardware or software service.
 
If capacity and utilization of each subsystem is clear we have the decision support information to ensure smooth operations.

Operation of each subsystem should be within the operational capacity otherwise the performance will suffer.

Subsystem with the highest utilization is the one that will saturate first at higher workload and become the bottleneck.

Head room shown in the graph is the additional workload that the system can endure before it saturates. Depending on the risk tolerance additional capacity  can be added while operational head room is still remaining.

 
Sounds simple! what are the challenges?

 
In a distributed deployed solution there are too many subsystems.Lets say thousands of hardware equipment and similar number of software components
  • Are you monitoring all these subsystems?
  • what is the workload in these subsystems?
  • Do you know what is the capacity of these subsystems?
  • Do you know what is the utilization of these subsystems?
  • What is the head room in these subsystems? To endure more load

Another challenge is the workload itself. This depends on type of business. Inherently in business like eCommerce, Internet services there can be sudden surges in the end user activity. Some of these events can be anticipated like in festive season you can expect more online shopping or after a big marketing initiative with discounts there can be a big stress/workload on the eCommerce software services. So the systems that are working within operational capacity get saturated and provide poor end user experience at the critical business period.Typically govt websites for tax or online for submissions become unresponsive around the last dates. In trading systems lot of activity is driven by market volumes and critical events like corporate results, interest rate movements etc can trigger higher activity.

  • Do you have historical workload for your services and the operational baseline of workload
  • Work load trends (regular + seasonal)
  • Can you forecast the workload for important business event?
  • Is your capacity elastic ie if you know you need twice the capacity can you add it in time. More hardware + more software service + load distribution
Next challenge is do you understand the relationship between service utilization and the workload?
  • Do you have required  models?
  • Are you capturing the data required for these models?In all sub systems?
    • OS provides monitoring info about the hardware
    • What about the middleware?
    • What about the application software?
    • What about the DB?
  • Do you know the highest  load to which you can drive the system at which the response time meets the service requirements not in labs but in production.
We will discuss various models that solve these problems and data that you need to solve these issues in future posts.

Other Posts

Friday, 24 October 2014

Little's Law

The long-term average number of customers in a stable system L is equal to the long-term average effective arrival rate, λ, multiplied by the average time a customer spends in the system, W; or expressed algebraically: 
 
Stable system : 
  1. Steady state
  2. Number of arrivals equals number of departures
  3. System is not in beyond saturation state
 λ - Arrival rate :
  1. As arrivals equal departures, this is also equivalent to throughput
W -  average time a customer spends in the system :
  1. This includes the time spent in wait as well as service so it is response time
L-  average number of customers in a stable system
  1. This is work in progress 
  2. transaction in queue as well as being serviced
  3. requests in queue and in progress in the system
  4. concurrent users in the system

Requests in queue + in progress= Mean Response Time * Throughput


In a test configured for 10000 requests/ transactions per seconds if mean response time is 10 milliseconds than at a time pending + in progress requests are (10000 * 0.01=) 100.

While trying different configurations of server component the configurations in which response time is slow for the same throughput the memory consumption is higher. This is expected from little's law. At the same throughput if response time is larger than the requests in queue and in progress will be more hence higher memory consumption.

Mean Response Time = Requests in queue + in progress/Throughput

In systems where in queue/progress requests are known and throughput is known the mean response time can be calculated.


Little's law  is also used for validation of the performance test, if all the three indicators are tracked in the test than the relationship must hold


Other Posts