Sun Dialogue Programs

(Q): If you had to pick the top 10 things that you must monitor on any server to look for performance and/or scalability issues…what would they be?
Richard McDougall (A): Off the top of my head, in no particular order:

  1. CPU: Check idle time and run queue length.
  2. If there’s a CPU bottleneck, check whether it’s an application or kernel CPU utilization issue with mpstat: a high usr percentage indicates an application issue, while high sys may point to heavy network load or lock contention.
  3. Memory: Check mdb’s ::memstat to ensure there is sufficient free memory.
  4. Network: Check that networks are not overloaded by comparing the bytes transferred against the available bandwidth per link.
  5. CPU for network: Check whether any CPUs are 100% busy servicing network interrupts. CPUs showing 100% in mpstat or intrstat are possible candidates.
  6. File system latency: Check the application-visible latency with DTrace at the system call level (perhaps fsstat, iosnoop, or an aggregation around system calls).
  7. Storage latency: Check disk latency with iostat.
  8. Application-level lock contention: Check application-level locks, which are now visible with plockstat.
  9. Kernel level locks: Check for hot locks with lockstat.
  10. MMU activity: Check MMU activity on SPARC using trapstat. Sometimes an application may be reported as running 100% in user mode, but may actually be spending a significant amount of time in kernel mode servicing TLB misses. Trapstat will show the percentage of time spent servicing TLB misses. If a significant amount of time (>10%) is evident, then large MMU pages may help.
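
Items 4 and 10 come down to simple arithmetic, which can be sketched as follows. This is a minimal illustration in Python, not part of the original answer: the function names, the counter inputs, and the sample figures are hypothetical, and only the 10% TLB threshold comes from the list above. Where the real numbers come from (interface byte counters, trapstat output) is left to the reader.

```python
def link_utilization(bytes_start, bytes_end, interval_s, link_bits_per_s):
    """Fraction of a link's bandwidth used between two byte-counter samples.

    All four inputs are hypothetical: two readings of an interface byte
    counter, the seconds between them, and the link speed in bits per second.
    """
    if interval_s <= 0 or link_bits_per_s <= 0:
        raise ValueError("interval and link speed must be positive")
    bits_moved = (bytes_end - bytes_start) * 8  # counters are in bytes
    return bits_moved / (interval_s * link_bits_per_s)


def large_pages_may_help(tlb_miss_pct, threshold_pct=10.0):
    """True when the time spent on TLB misses exceeds the threshold.

    The 10% default mirrors the rule of thumb in item 10; it is a hint to
    investigate large MMU pages, not a guarantee that they will help.
    """
    return tlb_miss_pct > threshold_pct


if __name__ == "__main__":
    # Made-up sample: 125 MB moved in 10 seconds on a 1 Gbit/s link.
    util = link_utilization(0, 125_000_000, 10, 1_000_000_000)
    print(f"link utilization: {util:.0%}")
    # Made-up sample: 14.2% of time spent servicing TLB misses.
    print("consider large pages:", large_pages_may_help(14.2))
```

A utilization close to 1.0 suggests the link itself is the bottleneck; a low value alongside poor throughput points back at items 5 through 9.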