Monday 29 April 2013

Performance Troubleshooting Process

What

Get the whole picture of the system, then pick a direction for analysis based on experience

What's the issue? What is the business? Which product components are used? How heavy is the workload?



Reproducible or not?


Reproducible: get the steps to reproduce


Can't reproduce: keep watching and collect the required logs


Suggestion: propose a monitoring method or tuning, and try to find a regular pattern

Environment Information?


Component structure: platform / HA / cluster


Vendor & versions: any known issues with other products


Business data model: data distribution


External services: any backend service, e.g. a web service

Contact information: email and phone

Follow-up space: 1. Makes the next actions easy to follow. 2. Archives the work history. Options: 1. A new email thread (Notes). 2. A new forum thread for discussion (Lotus Connections community forum).
Why
Know the current action, and know what to do next based on the different results from the tools

Reproduce the issue locally or in production, based on the steps from the customer


Locally: consider the cost (new or reused environment); simulate the business and data model


At the customer: 1. Give a guide to reproduce the issue and collect the required information. 2. Get access information (host/user/password), and take a backup first.

Issue Pattern


Always/Sometimes: 1. Always: resource shortage; monitor system resource usage. 2. Sometimes: workload (hardware resource shortage or poor design) or time range (business or backend tasks).


Single user/Multiple users: 1. Single: profile the cost of every tier to locate the bottleneck. 2. Multiple: check app server/DB pool usage (thread/connection…).
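
When no profiling tool is at hand, the single-user case can be approximated by timing each tier directly. This is only a minimal sketch: the tier names, `timed`, `handle_request` and `slowest_tier` are hypothetical, and `time.sleep` stands in for the real work of each tier.

```python
import time
from contextlib import contextmanager

# Accumulated wall-clock seconds per tier. Tier names are placeholders.
timings = {}

@contextmanager
def timed(tier):
    """Add the time spent inside the `with` block to the tier's total."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[tier] = timings.get(tier, 0.0) + (time.perf_counter() - start)

def handle_request():
    """Hypothetical request path; time.sleep stands in for real work."""
    with timed("web"):
        time.sleep(0.005)
    with timed("appserver"):
        time.sleep(0.005)
    with timed("database"):
        time.sleep(0.05)

def slowest_tier():
    """Return the tier with the largest accumulated time."""
    return max(timings, key=timings.get)

handle_request()
print(slowest_tier(), {tier: round(s, 3) for tier, s in timings.items()})
```

Run it with one user only; with multiple users the totals mix waiting time in pools with real work, which is why the pool usage check comes first in that case.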

High Resource usage


High CPU: 1. Use OS tools to find which process is consuming CPU. 2. Check what the incoming requests are. 3. Check for frequent GC. Tools: 1. nmon 2. GCMV 3. perfmon


Crash: find which process caused it. Tools: vmstat, iostat
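
For the frequent GC check in the High CPU row, one quick number is the share of time the JVM spends paused in GC. A minimal sketch, assuming the pause durations were already extracted from a verbose GC log (for example with GCMV); `gc_overhead` and the sample values are hypothetical, and what counts as too high depends on the application.

```python
def gc_overhead(pauses_ms, window_ms):
    """Return the fraction of the observation window spent in GC pauses."""
    if window_ms <= 0:
        raise ValueError("window_ms must be positive")
    return sum(pauses_ms) / window_ms

# Hypothetical sample: pause durations (ms) seen during a 60-second window.
sample_pauses_ms = [120, 95, 210, 180, 150]
overhead = gc_overhead(sample_pauses_ms, 60_000)
print(f"GC overhead: {overhead:.1%}")
```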

High Response Time or low throughput


Functions/operations: confirm the function pattern. 1. One function: find which function is slow, and learn its business logic or backend service. 2. Some functions: check whether they are similar (retrieve / create ….). 3. All functions: monitor system resources.


Isolate the logic tiers: 1. Test through the custom API and log the time to confirm whether the issue is in the product or in the custom application. 2. Use system performance tools to monitor all system resource usage and confirm which tier is bad. 3. Check the log to find any time. Tools: 1. qatool 2. nmon 3. perfmon/


Isolate the function module: 1. Use a profiling tool with a single user. 2. Check the trace log; apply for a test environment with a single user. Tools: 1. JProfiler 2. WAS PMI 3. WAS Performance Tuning Toolkit
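
The one / some / all function pattern above can be checked from measured response times. A rough sketch, assuming samples of (function name, seconds) pairs are available from a log or test run; `classify_slow_functions`, the threshold and the sample data are hypothetical.

```python
from statistics import mean

def classify_slow_functions(samples, threshold_s):
    """Return (pattern, slow_names) for (function, seconds) samples.

    pattern is "none", "one", "some" or "all", judged by mean response time.
    """
    by_function = {}
    for name, seconds in samples:
        by_function.setdefault(name, []).append(seconds)
    slow = sorted(
        name for name, times in by_function.items() if mean(times) > threshold_s
    )
    if not slow:
        pattern = "none"
    elif len(slow) == len(by_function):
        pattern = "all"
    elif len(slow) == 1:
        pattern = "one"
    else:
        pattern = "some"
    return pattern, slow

# Hypothetical measurements in seconds.
samples = [("search", 4.2), ("search", 3.8), ("create", 0.3), ("retrieve", 0.4)]
print(classify_slow_functions(samples, threshold_s=2.0))
```

"one" points at the business logic or backend service of that function, "some" at what the slow functions have in common, and "all" at system resources.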

Goals


Customer: discuss with the customer to agree on a goal based on the business


Common: set a common goal based on industry standards
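
A goal is easier to agree on when it is stated as a number that can be checked. A minimal sketch, assuming the goal is expressed as a percentile of response time; the nearest-rank method, the function names, the sample data and the 2-second target are all illustrative assumptions, not an industry standard.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the value at rank ceil(pct% of n) when sorted."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]

def meets_goal(values, pct, goal_s):
    """True if the pct-th percentile response time is within goal_s seconds."""
    return percentile(values, pct) <= goal_s

# Hypothetical response times in seconds.
response_times = [0.8, 1.1, 0.9, 1.4, 3.2, 1.0, 0.7, 1.2, 1.3, 0.95]
print(percentile(response_times, 90), meets_goal(response_times, 90, goal_s=2.0))
```
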
How
Give different solutions based on skills and resources

CPU


High: 1. Add CPU


Low: fewer network round trips

Memory


Swap: 1. Increase memory 2. Increase heap size

Disk


Busy: 1. RAID 2. Less write logic. Tools: iostat/perfmon

Network

App server


GC: 1. GC policy 2. Heap size




Cache: 1. Increase cache size


Pool limit: 1. Increase pool size: thread pool / data source
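
Before increasing a pool, a rough target size can be estimated with Little's Law (items in use = arrival rate x time each item is held). This is only a sketch: `required_pool_size`, the `headroom` factor and the numbers are assumptions for illustration, and the real limit also depends on what the database or backend can accept.

```python
import math

def required_pool_size(throughput_per_s, mean_hold_time_s, headroom=1.0):
    """Estimate threads/connections in use via Little's Law (L = lambda * W).

    headroom is a hypothetical safety factor (>= 1.0) for bursts.
    """
    if throughput_per_s < 0 or mean_hold_time_s < 0 or headroom < 1.0:
        raise ValueError("inputs must be non-negative and headroom >= 1.0")
    in_use = throughput_per_s * mean_hold_time_s * headroom
    # Round first so floating-point noise does not push the ceiling up by one.
    return math.ceil(round(in_use, 9))

# Hypothetical load: 200 requests/s, each holding a connection for 50 ms.
print(required_pool_size(200, 0.05))                # steady state
print(required_pool_size(200, 0.05, headroom=1.5))  # with burst headroom
```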

DB (Tools: snapshot, SQL Explain)


Buffer pool


Poor Index

Bad design, needs a workaround


Bad index


Too much loaded on one page
Summarize
Growth

Personal: skill improvement. 1. Technical skill: identify which knowledge is new and learn it. 2. Troubleshooting skill: build up experience.

Customer: best practice for other customers. 1. Design solutions for other customers or resolve the same problem. 2. Based on the current customer's environment, give suggestions to keep the system stable.

Product: design suggestions. 1. Analyze the cause of the issue; is it possible to redesign the product? 2. Analyze why it was not caught in testing. 3. Give suggestions to improve product monitoring.
Issue pattern:
Means: a. system resource limitation. b. bad logic.

  1. Always there (means some resource is insufficient)
    1. Single User? (To locate which tier consumes the most time. A profiling tool is needed to figure this out.)
      • AppServer profiling tool
      • JVM profiling tool
      • Trace log
      • Application debug log
    2. Multiple Users? To locate the bottleneck
      • AppServer
        • Thread pool
        • Data source
        • Cache
      • Database
        • Agents limitation
        • Locks
        • Buffer pool limitation
    3. System resource limitation
      • High CPU
      • Memory
      • Disk I/O
      • Network
      • JVM(Optional)
      • Application Performance Data (to check application logic)
  2. Sometimes
    • Check what functional work is running at that time
      • Batch operation
      • Backup operation
      • Migration operation
      • Cache refresh
    • Check for the effect of any outside factors
  3. Specific functions (need to know the application logic and related DB operations)
    1. Authentication
    2. Create
    3. Update
    4. Delete
    5. Search
    6. Retrieve
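
For the "Sometimes" pattern in the outline above, grouping slow requests by hour of day can show whether they line up with a batch, backup, migration or cache refresh window. A small sketch, assuming the timestamps of slow requests were already pulled from an access log in ISO-8601 form; the function name and the sample timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime

def slow_requests_by_hour(timestamps):
    """Count slow requests per hour of day from ISO-8601 timestamp strings."""
    return Counter(datetime.fromisoformat(ts).hour for ts in timestamps)

# Hypothetical timestamps of slow requests taken from an access log.
sample = [
    "2013-04-27T02:05:11",
    "2013-04-27T02:17:40",
    "2013-04-28T02:09:03",
    "2013-04-28T14:30:00",
    "2013-04-29T02:12:55",
]
counts = slow_requests_by_hour(sample)
busiest_hour, hits = counts.most_common(1)[0]
print(f"busiest hour: {busiest_hour:02d}:00 with {hits} slow requests")
```

If one hour dominates, compare it with the schedule of backend tasks for that time before looking at anything else.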
