Sunday 25 May 2014

Analytical thinking in performance analysis

Many performance job postings list analytical skills as one of the requirements. For people who love analysis, this is an interesting job.

A lot of analysis is done using mental models or mathematical models of the scenario. Without analysis, the test results are just numbers.

A major limitation of performance testing is that you cannot test all possible scenarios and use cases. Testing and modeling complement each other. One cannot expect major re-engineering decisions to be taken based on performance flaws pointed out by analytical reasoning alone; they have to be backed by actual measurements.

Good modeling requires a solid understanding of how things work. When you keep looking at the data from various perspectives and keep asking questions, interesting insights emerge.

Below are some of the scenarios where modeling and analysis can complement testing and measurements.

Modeling and pre-analysis help in designing the right tests

For example, if your understanding of the architecture leads you to expect synchronization issues, you will design the test so that saturation while CPU is still available is clearly established.

Modeling and analysis can even provide bounds on expected results

For example, in a case where all concurrent transactions serialize on a shared data structure, we know that the time spent in the synchronized block is serial. So if 50 microseconds are spent inside the synchronized block, there can be at most 20,000 transactions per second, beyond which the system will saturate. This limit is independent of the number of threads executing the transactions in the server and the number of available cores. If synchronization is the primary bottleneck, halving the time spent in the synchronized block doubles the throughput.
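A minimal sketch of this arithmetic in Java (the class and numbers are illustrative, not from any real system) shows how the serial time inside the lock caps throughput at 1/serial-time, regardless of thread or core count:

    public class SerializationBound {
        private final Object lock = new Object();
        private long counter;

        public void transaction() {
            // Work done outside the lock can run in parallel across threads.
            synchronized (lock) {
                // ~50 microseconds of work on the shared structure runs serially.
                counter++;
            }
        }

        public static void main(String[] args) {
            double serialSeconds = 50e-6;          // time spent inside the synchronized block
            double maxTps = 1.0 / serialSeconds;   // upper bound, independent of threads and cores
            System.out.printf("Throughput ceiling: %.0f transactions/sec%n", maxTps); // 20000
        }
    }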

Modeling and analysis help in identifying the bottleneck

The simple model above gives an upper limit on throughput which can be validated by testing. If the solution saturates before that, we know the primary bottleneck is somewhere else. Quite often people justify observed results purely on hunches like "Oh! the solution saturated at 50% CPU, it must be a synchronization issue" and accept the results. Without the right quantitative approach this can hide the actual bottleneck. In the example above, if saturation happens well before 20,000 transactions per second, asking the poor developer to optimize the synchronization will not really solve the problem, because the bottleneck is somewhere else.

Modeling and analysis can help in answering what-if scenarios

For example, you have tested using a 10 G network and someone asks: what are your projections if the customer runs on a 1 G network? What are the capacity projections for the solution in this scenario?
Do we need higher per-core capacity or a larger number of cores for the deployment of the solution? If we double the available RAM, what kind of improvement can we expect?
If we double the number of available cores, what is the expected capacity of the system?
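For the network question, a back-of-the-envelope sketch (the payload size below is an assumed number, purely for illustration) gives the bandwidth-bound throughput ceiling on each link:

    public class NetworkWhatIf {
        public static void main(String[] args) {
            double bytesPerTransaction = 100 * 1024;   // assumed 100 KB moved per transaction
            for (double gbps : new double[] {1.0, 10.0}) {
                double linkBytesPerSec = gbps * 1e9 / 8;               // link capacity in bytes/sec
                double maxTps = linkBytesPerSec / bytesPerTransaction; // network-bound ceiling
                System.out.printf("%.0f G link: at most %.0f transactions/sec%n", gbps, maxTps);
            }
        }
    }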

Modeling and analysis help in evaluating a hypothesis

For example, if someone thinks a particular performance problem is due to OS scheduling, this can be evaluated analytically. We had a scenario in which 600 threads were scheduled on an 8-core server. When there were performance issues, the developer felt the problem must be threads contending to get scheduled. When we created a model of how the CPU was being used, it was obvious the issue was not scheduling. By accepting a false hypothesis, the real cause of the issue remains hidden.
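A minimal sketch of that kind of model (all numbers below are assumed for illustration; they are not the figures from the actual incident) uses the utilization law to check how many cores the workload can actually keep busy:

    public class SchedulingHypothesis {
        public static void main(String[] args) {
            int cores = 8;
            int threads = 600;
            double cpuSecondsPerRequest = 0.002;   // assumed CPU demand per request
            double requestsPerSecond = 500.0;      // assumed observed throughput

            // Utilization law: busy cores = throughput * CPU demand per request.
            double busyCores = requestsPerSecond * cpuSecondsPerRequest;
            System.out.printf("%d threads, but on average only %.1f of %d cores are busy%n",
                    threads, busyCores, cores);
            // If busyCores is well below the core count, most threads are blocked
            // rather than runnable, and scheduling contention is an unlikely culprit.
        }
    }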

Modeling and analysis enhance our understanding of the system

When models don't match the measurements, there are new learnings as we revisit our assumptions.

Modeling and analysis help in extrapolation of results

We might want to extrapolate results for various reasons; there can be practical limitations, such as not having enough hardware to simulate the required number of users.

We will discuss how we can model various performance scenarios in future posts.

Related Post:

Software bottlenecks:- Understanding performance of software systems with examples from road traffic

Thursday 8 May 2014

Time scale of system latencies


In my previous post we were discussing why providing low latency is more difficult than providing scalability. Below is a scale showing the latency cost of various operations.



Event                                    Latency      Scaled
1 CPU cycle                              0.3 ns       1 s
Level 1 cache access                     0.9 ns       3 s
Level 2 cache access                     2.8 ns       9 s
Level 3 cache access                     12.9 ns      43 s
Main memory access (DRAM, from CPU)      120 ns       6 min
Solid-state disk I/O (flash memory)      50-150 µs    2-6 days
Rotational disk I/O                      1-10 ms      1-12 months
Internet: San Francisco to New York      40 ms        4 years
Internet: San Francisco to UK            81 ms        8 years
Internet: San Francisco to Australia     183 ms       19 years
TCP packet retransmit                    1-3 s        105-317 years



Unit          Abbreviation   Fraction of 1 second
Millisecond   ms             0.001
Microsecond   µs             0.000001
Nanosecond    ns             0.000000001



The data is from the book Systems Performance: Enterprise and the Cloud by Brendan Gregg.
The first table gives the cost of operations and the second table explains the units of time. The third column of the first table converts each cost to a time scale that we understand: seconds, minutes, days, months and years. The costs are scaled by asking: if one CPU cycle hypothetically took 1 second, what would the other operations cost?
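A quick sanity check of that scaling (a minimal sketch; the figures come from the table above) uses the factor 1 s / 0.3 ns, roughly 3.3 billion:

    public class ScaledLatency {
        public static void main(String[] args) {
            // If one 0.3 ns CPU cycle is stretched to 1 second...
            double scale = 1.0 / 0.3e-9;                 // ~3.3 billion

            // ...a 120 ns DRAM access stretches to about 400 s (~6-7 minutes),
            double dramSeconds = 120e-9 * scale;
            // ...and a 183 ms San Francisco-Australia hop to about 19 years.
            double sfAuYears = 183e-3 * scale / (60.0 * 60 * 24 * 365);

            System.out.printf("DRAM access: %.0f s (~%.1f min)%n", dramSeconds, dramSeconds / 60);
            System.out.printf("SF to Australia: %.0f years%n", sfAuYears);
        }
    }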

How to use knowledge of the above latencies?
  •  It can help in understanding the lower bound of latency or response time for a given scenario
    • For example, if the user is in Australia and the server is in San Francisco, the time between request and response will be greater than two one-way network trips, i.e. greater than 366 ms (183 * 2)
  •  It can show that performance objectives cannot be met using a certain technology or operation
    • For example, if you want a response in microseconds, the critical path of code execution cannot include a disk read/write.
    • If your algorithm has to do more than 10 random memory accesses to complete some work, it will take at least 1.2 microseconds (120 ns * 10).
  • Overall latencies can be estimated by knowing the cost of basic operations (a small sketch follows this list)
  • Finding reasons for performance jitter
    • For example, suppose response time jumped from microseconds to milliseconds. What could be the possible reasons? We know disk reads can take milliseconds, so a page fault is a possible reason. GC in Java could be a reason, as GC pauses range from a few to hundreds of milliseconds. But we should not suspect networking on a 1 Gbps LAN, as it will not cause delays in milliseconds, especially if utilization is low. Knowing the system latencies helps you narrow your search for the culprit.
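Here is the sketch mentioned above: a minimal example of turning such a table into lower-bound estimates (the constants are taken from the table; the class name is just for illustration):

    public class LatencyLowerBound {
        static final double DRAM_ACCESS_NS = 120;      // main memory access
        static final double SF_TO_AUSTRALIA_MS = 183;  // one-way network latency

        public static void main(String[] args) {
            // 10 dependent random memory accesses: at least ~1.2 microseconds.
            double memoryBoundUs = 10 * DRAM_ACCESS_NS / 1000.0;
            System.out.printf("10 random memory accesses >= %.1f us%n", memoryBoundUs);

            // A request/response between Australia and San Francisco needs at least
            // one round trip, i.e. >= 2 * 183 ms, before any server-side work.
            double networkBoundMs = 2 * SF_TO_AUSTRALIA_MS;
            System.out.printf("SF <-> Australia round trip >= %.0f ms%n", networkBoundMs);
        }
    }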

Related previous posts: