Posted by coder.
Feb 10
Top 10 things that you must monitor on any server to look for performance and/or scalability issues
(Q): If you had to pick the top 10 things that you must monitor on any server to look for performance and/or scalability issues, what would they be?
Richard McDougall (A): Off the top of my head, in no particular order:
- CPU: Check idle time and run queue length.
- If there’s a CPU bottleneck, check whether it’s an application or kernel CPU utilization issue with mpstat: a high percentage of user time (usr) indicates an application issue. High system time (sys) may point to high network load or lock contention.
- Memory: Check mdb’s ::memstat to ensure there is sufficient free memory.
- Network: Check that networks are not overloaded by comparing the bytes transferred against the available bandwidth per link.
- CPU for network: Check whether any CPUs are 100% busy servicing network interrupts. CPUs showing 100% busy in mpstat or intrstat are possible candidates.
- File system latency: Check the application-visible latency with DTrace at the system call level (perhaps fsstat, iosnoop, or an aggregation around system calls).
- Storage latency: Check disk latency with iostat.
- Application-level lock contention: Application-level locks are now visible with plockstat.
- Kernel level locks: Check for hot locks with lockstat.
- Check MMU activity on SPARC using trapstat. Sometimes an application may be reported as running 100% in user mode, but may actually be spending a significant amount of time in kernel mode servicing TLB misses. trapstat will show the percentage of time spent servicing TLB misses. If a significant amount of time (>10%) is evident, then large MMU pages may help.
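A few of the checks above come down to simple arithmetic on numbers you have already collected. Here is a rough Python sketch of three of them: link utilization (bytes transferred against available bandwidth), the user vs. system split from mpstat, and the >10% TLB-miss rule of thumb for trapstat. The function names and the "whichever is larger" rule for usr vs. sys are my own assumptions for illustration, not part of any Solaris tool, and you have to read the input values off the tools' output yourself.

```python
def link_utilization(bytes_transferred, interval_s, link_bits_per_s):
    """Fraction of a link's bandwidth used during one sampling interval.

    bytes_transferred: bytes moved on the link during the interval
    interval_s:        length of the interval in seconds
    link_bits_per_s:   nominal link speed in bits per second
    """
    return (bytes_transferred * 8) / (interval_s * link_bits_per_s)


def cpu_bottleneck_kind(usr_pct, sys_pct):
    """Classify a CPU bottleneck from mpstat's usr and sys percentages.

    Assumption: whichever percentage is larger names the likely culprit.
    """
    return "application" if usr_pct >= sys_pct else "kernel"


def large_pages_may_help(tlb_miss_pct, threshold_pct=10.0):
    """Apply the >10% rule of thumb to trapstat's TLB-miss time."""
    return tlb_miss_pct > threshold_pct


if __name__ == "__main__":
    # Hypothetical sample values, not measurements from a real system.
    print(link_utilization(62_500_000, 1, 1_000_000_000))
    print(cpu_bottleneck_kind(usr_pct=80, sys_pct=15))
    print(large_pages_may_help(12.5))
```

For example, 62,500,000 bytes in one second on a 1 Gbit/s link works out to half of the link's bandwidth.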
There is 1 comment to this post.
On point 2, high system time is typically due to a high level of interrupt handling or to threads being unable to consume their entire time slice, which results in a lot of context switching. Very short-term lock contention, without any other underlying contention issues, is a typical cause of this problem. An I/O bottleneck typically results in long response times with an inability to fully utilize the CPU.