10 Feb
Posted by coder as Architecture, Best practices, Java, Monitoring, Patterns, Performance, Processor, Profiling, Reliability, Scalability
(Q): If you had to pick the top 10 things that you must monitor on any server to look for performance and/or scalability issues, what would they be?
Richard McDougall (A): Off the top of my head, in no particular order:
- CPU: Check idle time and run queue length.
- If there’s a CPU bottleneck, check whether it’s an application or kernel CPU utilization issue with mpstat: a high usr percentage indicates an application issue. High sys may point to high network load or lock contention.
- Memory: Check MDB’s ::memstat to ensure there is sufficient free memory.
- Network: Check that networks are not overloaded by comparing the bytes transferred against the available bandwidth per link.
- CPU for network: Check if any CPUs are 100% busy servicing network interrupts. CPUs showing 100% in mpstat or intrstat are possible candidates.
- File system latency: check the application visible latency with DTrace at the system call level (perhaps fsstat, iosnoop, or an aggregation around system calls).
- Storage latency: Check disk latency with iostat.
- Application level lock contention: Application level locks are now visible with plockstat.
- Kernel level locks: Check for hot locks with lockstat.
- MMU activity: Check MMU activity on SPARC using trapstat. Sometimes an application may be reported as running 100% in user mode, but may actually be spending a significant amount of time in kernel mode servicing TLB misses. trapstat will show the % of time spent servicing TLB misses. If a significant amount of time (>10%) is evident, then large MMU pages may help.
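
Two of the checks above come down to simple arithmetic, so here is a rough sketch of them in Python. It is not part of Richard's answer. The function names, the 10% idle floor, and the "whichever of usr or sys is larger" rule are all assumptions made for illustration, and the input numbers are whatever you have read off your own byte counters and mpstat output.

```python
def link_utilization(bytes_start, bytes_end, interval_s, link_bits_per_s):
    """Return the fraction of a link's bandwidth used over one sampling interval.

    bytes_start / bytes_end are two readings of the same byte counter,
    taken interval_s seconds apart (hypothetical inputs).
    """
    if interval_s <= 0 or link_bits_per_s <= 0:
        raise ValueError("interval and link speed must be positive")
    bits_moved = (bytes_end - bytes_start) * 8
    return bits_moved / (interval_s * link_bits_per_s)


def classify_cpu(usr_pct, sys_pct, idle_pct, idle_floor=10.0):
    """First-pass triage of usr/sys/idle percentages as reported by mpstat.

    idle_floor is an assumed threshold, not a rule from the post: above it,
    the CPU is not treated as the bottleneck at all.
    """
    if idle_pct > idle_floor:
        return "not cpu-bound"
    # Assumed heuristic: blame whichever mode dominates.
    return "application" if usr_pct >= sys_pct else "kernel"


if __name__ == "__main__":
    # 625 MB moved in 10 seconds on a 1 Gbit/s link.
    print(f"link utilization: {link_utilization(0, 625_000_000, 10, 1_000_000_000):.0%}")
    print("cpu:", classify_cpu(usr_pct=20, sys_pct=75, idle_pct=5))
```

A "kernel" result is only a pointer to the next step: as noted above, high sys may mean network load or lock contention, which is where intrstat and lockstat come in.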
One Response
Kirk
February 10th, 2006 at 1:46 pm
On point 2, high system time is typically due to a high level of interrupt handling, or to threads being unable to consume their entire time slice, which results in a lot of context switching. Some sort of very short-term lock contention, without any other underlying contention issues, is a typical cause of this problem. An I/O bottleneck will typically result in long response times with an inability to fully utilize the CPU.