Performance Things: April 2013

What			Get the whole picture of the system and choose a direction for analysis based on experience
	What's the issue?		What is the business? Which product components are used? How heavy is the workload?

	Reproducible or not?
		Reproducible	Get the steps to reproduce
		Can't reproduce	Keep watching and collect the required logs
			Suggestion: set up monitoring or tuning, and try to find a regular pattern
	Environment Information?
		Component structure	platform/HA/Cluster
		Vendor & Versions	Any known issues with other products
		Business data model	Data distribution
		External services	Any backend services, e.g. web services
	Contact information	Email&Phone
	Follow-up space	1. Easy to follow next actions 2. Archive work history	1. New email thread 2. New forum thread for discussion	1. Notes 2. Lotus Connections community forum
Why			To understand the current action and decide the next step based on the results from the tools
	Reproduce the issue locally or in production, based on the steps from the customer
		In Local	Cost consideration (new or reused environment); simulate the business and data model
		In Customer	1. Give a guide to reproduce and collect the required information 2. Get access information (host/user/pwd), and take a backup first
	Issue Pattern
		Always/Sometimes	1. Always: resource shortage; monitor system resource usage. 2. Sometimes: workload (hardware resource shortage or poor design) or time range (business or backend tasks).
		Single User/Multiple User	1. Single: profile the cost of every tier to locate the bottleneck 2. Multiple: check appserver/db pool usage (thread/connection…)
	High Resource usage
		High CPU	1. Use OS tools to find which process consumes CPU 2. What are the incoming requests? 3. Frequent GC	1.nmon 2. GCMV 3.perfmon
		Crash	Which process caused it?	vmstat iostat
	High Response Time or low throughput
		Functions/operations	Confirm the function pattern: 1. One function: find which function is slow, and understand its business logic or backend service. 2. Some functions: are they similar or not (retrieve / create …)? 3. All functions: monitor system resources
		Isolate the logic tiers	1. Log timings from custom API tests to confirm whether the issue is in the product or the custom application 2. Use a system performance tool to monitor all system resource usage and confirm which tier is bad 3. Check log to find any time	1.qatool 2.nmon 3.perfmon/
		Isolate the function module	1. Profile with a single user. 2. Check the trace log; apply to a test env with a single user.	1.Jprofiler 2.WAS PMI 3.WAS Performance tuning toolkit
	Goals
		customer	Discuss with the customer to agree on a goal based on the business
		common	Set a common goal based on industry standards
How			Offer different solutions based on skills and resources
	CPU
		High	1. add CPU
		Low	Fewer network round trips
	Memory
		Swap	1. increase memory 2. increase heap size
	Disk
		Busy	1. RAID 2. less write logic	iostat/perfmon
	Network
	App server
		GC	1. GC policy 2. Heap Size

		Cache	1. increase cache size
		Pool limit	1. increase pool size : thread pool/ data source
	DB			Snapshot SQL Explain
		Buffer pool
		Poor Index
	Bad design, needs a workaround
		Bad index
		Load too much at one page
Summarize			Growth
	Personal	Skill Improvement	1. Technical skill: identify which knowledge is new and learn it 2. Troubleshooting skill: build experience
	Customer	Best practice for other customers	1. Design solutions for other customers or resolve the same problem 2. Based on the current customer's env, give suggestions to keep the system stable.
	Product	Design suggestion	1. Analyze the cause of the issue: is it possible to redesign the product? 2. Analyze why it was not caught in testing 3. Give suggestions to improve product monitoring
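
The "Isolate the logic tiers" row suggests logging time per tier to see where a single request spends its time. A minimal Python sketch of that idea follows; the tier labels and the `handle_request` body are hypothetical stand-ins, not part of any product mentioned above.

```python
import time
from contextlib import contextmanager

# Accumulated wall-clock seconds per tier label (labels are illustrative).
tier_seconds = {}

@contextmanager
def timed(tier):
    """Add the time spent inside the with-block to tier_seconds[tier]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        tier_seconds[tier] = tier_seconds.get(tier, 0.0) + elapsed

def slowest_tier(timings):
    """Return the tier label with the largest accumulated time."""
    return max(timings, key=timings.get)

def handle_request():
    # Hypothetical stand-ins for the real work done in each tier.
    with timed("app"):
        sum(range(1000))
    with timed("db"):
        time.sleep(0.02)
    with timed("backend"):
        time.sleep(0.005)

handle_request()
for tier, seconds in sorted(tier_seconds.items(), key=lambda kv: -kv[1]):
    print(f"{tier:8s} {seconds * 1000:7.1f} ms")
print("slowest tier:", slowest_tier(tier_seconds))
```

In a real system the same wrapper would go around the actual database and backend calls, and the totals would be written to the application log rather than printed.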

Issue pattern:

Means: a. system resource limitation. b. bad logic.

Always there (means some resource is insufficient)
1. Single User? (To locate which tier consumes the most time. A profiling tool is needed to figure this out.)
  - AppServer profiling tool
  - JVM profiling tool
  - Trace log
  - Application debug log
2. Multiple Users? To locate the bottleneck
  - AppServer
    - Thread pool
    - Data source
    - Cache
  - Database
    - Agents limitation
    - Locks
    - Buffer pool limitation
3. System resource limitation
  - High CPU
  - Memory
  - Disk I/O
  - Network
  - JVM(Optional)
  - Application Performance Data (to check application logic)
Sometimes
- Check what functional work is running at that time
  - Batch operation
  - Backup operation
  - Migration operation
  - Cache refresh
- Check for any effects from outside factors
Specific functions (need to know the application logic and related DB operations)
1. Authentication
2. Create
3. Update
4. Delete
5. Search
6. Retrieve
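
For the "Sometimes" case, one way to look for a regular pattern is to count slow requests by hour of day and compare the busy hours with the batch, backup, or cache-refresh schedule. The sketch below assumes the logs can be reduced to (ISO timestamp, response time in ms) pairs; the sample records and the 1000 ms threshold are made up.

```python
from collections import Counter
from datetime import datetime

def slow_requests_by_hour(records, threshold_ms):
    """Count requests slower than threshold_ms, grouped by hour of day."""
    counts = Counter()
    for timestamp, elapsed_ms in records:
        if elapsed_ms > threshold_ms:
            counts[datetime.fromisoformat(timestamp).hour] += 1
    return counts

# Made-up sample data: (ISO timestamp, response time in ms).
sample = [
    ("2013-04-29T02:05:00", 1800),
    ("2013-04-29T02:17:00", 2100),
    ("2013-04-29T09:30:00", 40),
    ("2013-04-29T14:02:00", 35),
    ("2013-04-30T02:11:00", 1950),
    ("2013-04-30T10:45:00", 38),
]

for hour, count in slow_requests_by_hour(sample, threshold_ms=1000).most_common():
    print(f"{hour:02d}:00  {count} slow requests")
```

If the slow requests cluster in one hour across several days, a scheduled task in that window is a reasonable first suspect.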

Peter Booth • I'm assuming that we're discussing the response times of real production systems, though the following is also applicable to load test results.

I routinely track both the median and the 90th percentile, and on some occasions also look at the 10th percentile. Response times aren't normally distributed, so mean and standard deviation can be quite misleading.
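
To illustrate why the mean can mislead on skewed data, here is a small Python sketch. The sample values are made up, and the nearest-rank `percentile` helper is one common definition among several, chosen here for simplicity.

```python
import statistics

def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, which avoids floating-point rounding.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]

# Made-up response times in ms: mostly 30 to 45 ms, plus two slow outliers.
times_ms = [28, 30, 31, 31, 32, 33, 33, 34, 34, 35,
            35, 36, 36, 37, 38, 39, 41, 45, 900, 1200]

print("p10   :", percentile(times_ms, 10))
print("median:", percentile(times_ms, 50))
print("p90   :", percentile(times_ms, 90))
print("mean  :", statistics.mean(times_ms))
print("stdev :", round(statistics.stdev(times_ms), 1))
```

In this sample the two outliers pull the mean well above what any typical request experienced, while the three percentiles stay inside the 30 to 45 ms band.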

The median is a useful measure for "typical experience" and is perhaps better understood than the 90th percentile by most audiences. For many purposes the 90th percentile is a "better" metric, though less familiar, and here's why:

Imagine we are tracking the response time of a web service over time, and we deploy some new code. How can we see whether the new code impacts performance? The values that we measure will vary, and part of that variation is measurement error. If our change did cause a consistent change to response times, the size of the change will tend to be larger at the 90th percentile than at the median, so we can distinguish it more quickly.
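
A small sketch of that effect, under one loudly stated assumption: the new code slows every request by the same factor (25% here). Under that assumption the absolute shift grows with the percentile. The sample values are made up, and real changes need not be multiplicative.

```python
def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]

# Made-up response times in ms before the deployment.
before_ms = [20, 22, 25, 27, 30, 34, 40, 55, 80, 120]
# Assumption: the new code makes every request 25% slower.
after_ms = [t * 1.25 for t in before_ms]

def shift(pct):
    """Absolute change in ms at the given percentile."""
    return percentile(after_ms, pct) - percentile(before_ms, pct)

print("shift at median:", shift(50), "ms")
print("shift at p90   :", shift(90), "ms")
```

The larger absolute shift at the 90th percentile is what makes a regression easier to spot there, provided the measurement noise at that percentile is not proportionally larger.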

The 10th percentile can be a useful measure of the best-case response time and, depending on workloads, of the actual service time of the service.

All three metrics are useful, and they are often driven by different factors. It's useful to view scatter plots of actual data points, which can highlight things like bimodal response times, periodicities, and absolute shifts in response times.

Manzoor Mohammed • I normally look at a number of measures, including average, 90th percentile, and standard deviation, especially when looking at test results or a system I'm not familiar with. I think looking at a single metric only tells you part of the story, while looking at a range of measures will give you a better feel for the distribution. If you're familiar with a system, you could get away with looking at averages and only look at the other metrics when you see an unusual deviation from the typical average response times.

Michael Brauner • It is nice to know that people vary in what they look at with response times, but what about other measures from the application that can and do contribute to the response times that you see? How about the backends that you depend upon, or the connection pools, for starters... :-)

Weifeng Tang • I am confused here about your "big spikes in our test, use 90% percentile".

If the 90th percentile is meaningful, it means the spikes affect less than 10% of your results, so you can eliminate the spikes by ignoring that 10%.

However, if the spikes render the median response time meaningless, the number of spikes must be much larger than 10%. Otherwise, since the median is the data point at the 50% position, even if the whole 10% lies in one direction, the median would sit at the 60% position at most (actually 55% or 45%, if we treat all of these data as abnormal). In a real production environment, I would not expect a shift from the 50% to the 60% position to make a difference big enough to render the test meaningless, especially since the position change happens in the center of the curve, unless your system falls into some very strange behavior.
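
The position argument can be checked on made-up numbers. In the sketch below, 90 clean samples are joined by 10 spikes on the high side; the median of the full data set lands at roughly the 55% position of the clean data, while the mean moves by an order of magnitude. The values are illustrative only.

```python
import statistics

# Made-up data: 90 clean samples (1..90 ms) plus 10 spikes of 5000 ms.
clean_ms = list(range(1, 91))
spikes_ms = [5000] * 10
all_ms = clean_ms + spikes_ms

median_all = statistics.median(all_ms)
median_clean = statistics.median(clean_ms)

# Fraction of the clean samples at or below the median of the full data set.
position_in_clean = sum(v <= median_all for v in clean_ms) / len(clean_ms)

print("median with spikes   :", median_all)
print("median without spikes:", median_clean)
print("position in clean data: {:.1%}".format(position_in_clean))
print("mean with spikes     :", statistics.mean(all_ms))
print("mean without spikes  :", statistics.mean(clean_ms))
```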

Weifeng Tang • I think the median and the 90th percentile are both OK to represent your test if the application under test is error free. Both of them get rid of the spikes and are better than the average.

However, the strange spikes (more than one) in the middle of the execution seem problematic. They might come from resource contention at that time, or from a pool expansion. Though performance curves are not expected to be normal, I always try to find a stable curve. I would doubt myself if I needed to present this result to my client. Anyway, I have no knowledge of your circumstances; you may have your own limitations or assumptions for doing so.

Peter Booth • Can you add some detail to this data? Is the response time curve a scatter plot of every response time, with the horizontal axis being the number of data points? How long was your test run? Did it show a spike of 3.5 seconds during the test?

What are the three columns on the percentile table showing? Is this milliseconds or seconds? Are these percentiles with and without outliers? Is it the result of only a single test run?

Peter Booth • Feng,

Some comments on the data:

1. Obviously the large spike that appears in one of the test runs is an issue. Does it represent usual behavior (say a full GC cycle) or a one-off issue?

2. The individual data points are in the tens of milliseconds. What is the request rate of the test workload?

3. Are these dynamic web requests?

4. Are you familiar with hypothesis testing, significance, and the statistical power of test designs? You can get more information from 54x 10 minute tests than 3x 3 hr tests, and will be able to see whether your results are consistent.

Peter Booth • Right now you have a few long-lived test runs at a modest request rate (33 req/sec), and they show that most requests are served within 30 to 40 ms. If we ignore the spike and use the median response time of 33 ms as an estimate of response time, then Little's Law implies there will be on average 33*(2000/60) / 1000 = 1.1 requests in the system at any time. That's a pretty small number.
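
Little's Law states that the average number of requests in the system equals the arrival rate multiplied by the average time each request spends in the system. A minimal Python sketch of the calculation above follows; the function name is illustrative, and the median is used as a stand-in for the average response time, as in the comment.

```python
def requests_in_system(arrival_rate_per_sec, response_time_sec):
    """Little's Law: L = lambda * W."""
    return arrival_rate_per_sec * response_time_sec

arrival_rate = 2000 / 60   # requests per second (about 33 req/sec)
response_time = 33 / 1000  # 33 ms expressed in seconds

in_flight = requests_in_system(arrival_rate, response_time)
print(f"average requests in the system: {in_flight:.2f}")
```

With roughly one request in flight at a time, the test is exercising the service almost serially, so it says little about behavior under concurrency.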

If it were me, I'd want to begin measuring performance under a trivially small workload (1 or 2 requests per second). The first thing is to look at a scatter plot of response times: are there any obvious trends? Banding? Cycles? I would want to estimate the EJB service time. Does your DB also expose transaction response times, AWR reports, or similar? If not, I'd want to use a tool like NewRelic to capture and cross-reference the DB performance against EJB performance.

I'd want to run a large number of short test runs to see if the performance numbers are consistent from run to run. I'd probably use a shell script to run tests using wrk or httperf and, after measuring the best-case EJB performance under a light load, would go on to measure the performance as workload increases, using autoperf or similar. I'd be measuring median, 90th percentile, and 10th percentile values. For this, my goal would be to quantify scalability using the approach described at http://www.perfdynamics.com/Manifesto/USLscalability.html or http://www.percona.com/files/white-papers/forecasting-mysql-scalability.pdf
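
The first link describes the Universal Scalability Law (USL), which models relative capacity at load N as C(N) = N / (1 + α(N − 1) + βN(N − 1)), where α is the contention coefficient and β the coherency coefficient. The Python sketch below only evaluates the model; the coefficient values are made up, and fitting α and β to measured throughput is a separate step that is not shown.

```python
import math

def usl_relative_capacity(n, alpha, beta):
    """USL: capacity at load n relative to the capacity at n = 1."""
    return n / (1 + alpha * (n - 1) + beta * n * (n - 1))

def usl_peak_load(alpha, beta):
    """Load at which the USL curve reaches its maximum (requires beta > 0)."""
    return math.sqrt((1 - alpha) / beta)

# Made-up coefficients; real values come from fitting measured throughput.
alpha, beta = 0.05, 0.001

for n in (1, 2, 4, 8, 16, 32, 64):
    print(f"N={n:3d}  relative capacity={usl_relative_capacity(n, alpha, beta):.2f}")
print(f"capacity peaks near N={usl_peak_load(alpha, beta):.1f}")
```

Multiplying the relative capacity by the throughput measured at N = 1 gives a throughput estimate at higher loads, and the peak indicates where adding load starts to reduce throughput.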


Monday 29 April 2013
