Consider a critical Java-based application in production that is in the middle of an outage, with noise all around. A bridge call is opened, and many people join and ask, “What is the issue?” After a short while the question changes to “What is the RCA (root cause analysis) of the issue?”
Amid all this, one person is actually responsible for finding the root cause of the issue and providing a resolution. Which areas can he or she look at quickly? In such a situation, the following statistics-gathering exercise can help.
Where could things go wrong?
Although there are many other things to look at, issues in a typical Java-based application will usually fall into one of the above categories.
Let’s start with the host (Linux), assuming you are logged in as a user who has permission to run the commands.
Run the ‘top’ command.
The data points to look at are:
- The first line has two data points:
  - Uptime – How long the server has been up since the last reboot. In the example output it is 78 days. As a rule of thumb, if the server has been up for more than 180 days, consider rebooting it.
  - Load average – A higher number indicates that the system is stressed. The three values are averages over the last 1, 5 and 15 minutes. In most cases, if the value approaches twice the number of cores on the host, you will notice queues building up and degraded performance.
- The third line shows CPU utilization, split across user, system and idle.
- The fourth line shows memory utilization, which on Linux includes buffers and cached memory. It is therefore not the memory actually used by applications, so don’t be alarmed if it shows 90% or 100%.
- The fifth line shows swap memory usage.
- The detail table shows Virtual Memory, Resident Memory, Shared Memory, %CPU utilization, %Memory utilization and the command for each process:
  - VIRT (Virtual Memory) – The total virtual memory used by the process, including memory that has been swapped out. If there is a native memory leak, you could see the virtual memory growing.
  - RES (Resident Memory) – The physical memory used by a process. It is the size of the pages actually present in RAM.
  - In the example above, the heap size has been set to 8 GB but the resident memory used is 9.9 GB. A Java process uses native memory in addition to the heap memory.
  - SHR (Shared Memory) – The memory that could potentially be shared with other processes.
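To compare VIRT and RES with the configured heap size without opening the interactive screen, something like the following could be used. It is only a sketch: the helper name `kb_to_gb` is made up for illustration, and it assumes a procps-style `ps` that supports the `-C` and `-o` options and reports sizes in KB.

```shell
# Convert a "PID VSZ RSS COMMAND" listing (sizes in KB, as printed by ps)
# into GB so it can be compared with the configured heap size.
kb_to_gb() {
  awk '{ printf "%s VIRT=%.1fGB RES=%.1fGB %s\n", $1, $2 / 1048576, $3 / 1048576, $4 }'
}

# List every process whose command name is "java".
# The trailing "=" after each column name suppresses the header line.
ps -C java -o pid= -o vsz= -o rss= -o comm= | kb_to_gb
```

A RES value noticeably above the heap size is expected to some degree, since native memory comes on top of the heap.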
Customize Screen Display
- Refresh Interval – To change the screen refresh interval, press ‘d’ and enter a numeric value (in seconds).
- Column Order – To change the order of the displayed columns, press ‘o’ and follow the on-screen instructions to move a column left or right.
- Column Sort – By default the data is sorted by %CPU. To change this, press ‘SHIFT o’ and follow the on-screen instructions. (The ‘o’ and ‘SHIFT o’ keys apply to older versions of top; in newer procps-ng versions, column order and the sort field are managed from the ‘f’ screen.)
- Individual CPU – To see the utilization of each individual CPU, press ‘1’.
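Outside the interactive screen, the load-average guideline from the top header (degradation as the value approaches twice the number of cores) can be checked with a short script. This is a minimal sketch, assuming a Linux host with `/proc/loadavg` and the coreutils `nproc` command; the function name `load_per_core` is hypothetical.

```shell
# Print a load average divided by the number of cores.
# Arguments: $1 = load average, $2 = core count.
load_per_core() {
  awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# /proc/loadavg starts with the 1-, 5- and 15-minute load averages;
# nproc reports the number of available processing units.
read -r one_min _rest < /proc/loadavg
load_per_core "$one_min" "$(nproc)"
```

By the guideline above, a result close to 2.00 would suggest that work is queuing up on the host.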
Run the command ‘free -g’; the data is displayed in GB.
In this example the total memory is 62 GB, of which 48 GB is cache. Although the ‘used’ column shows 62 GB, the actual utilization is 13 GB. Refer to the “buffers/cache” row for the actual usage and free memory; here the free memory is 48 GB. Newer versions of ‘free’ drop this row and show an ‘available’ column instead.
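The same arithmetic (used memory minus buffers and cache) can be reproduced from `/proc/meminfo`, which is handy when the output format of `free` differs between hosts. The following is a sketch under that assumption; `mem_summary` is a made-up helper name, and "cache" here is simply Buffers plus Cached.

```shell
# Summarise /proc/meminfo-style input (values in kB) as total memory,
# cache (Buffers + Cached) and memory actually used by applications, in GB.
mem_summary() {
  awk '
    $1 == "MemTotal:" { total = $2 }
    $1 == "MemFree:"  { free  = $2 }
    $1 == "Buffers:"  { buf   = $2 }
    $1 == "Cached:"   { cache = $2 }
    END {
      gb = 1048576
      printf "total=%.1fGB cache=%.1fGB used_by_apps=%.1fGB\n",
             total / gb, (buf + cache) / gb, (total - free - buf - cache) / gb
    }'
}

mem_summary < /proc/meminfo
```

With the figures from the example (62 GB total, 48 GB cache, about 1 GB completely free), this prints a `used_by_apps` value of 13 GB.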
Another way is to run the ‘vmstat’ command.
The memory values shown are in KB.
The ‘vmstat’ command also gives information on the run queue, in the first column. The column ‘r’ shows the number of processes waiting for run time. A high value indicates that many processes are waiting for CPU resources to execute.
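A single vmstat line can be misleading, so it may help to take a few samples and look at the highest run-queue value. A minimal sketch, assuming the `r` column is the first column as described above (the helper name `max_run_queue` is hypothetical):

```shell
# Read vmstat output and print the highest value seen in the "r" column
# (the first column). Header lines are skipped because their first field
# is not a number.
max_run_queue() {
  awk '$1 ~ /^[0-9]+$/ { if ($1 > max) max = $1 } END { print max + 0 }'
}

# Three samples, one second apart. The first sample reports averages since boot.
vmstat 1 3 | max_run_queue
```

Comparing the result with the number of cores gives a rough idea of how many processes are competing for CPU time.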
I/O Wait time
Run the command ‘iostat’ to get the I/O statistics.
The “%iowait” value shows the percentage of time that the CPUs were idle while waiting for an outstanding disk I/O request to complete. If reads from or writes to disk are slow, the %iowait time will increase.
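To track %iowait over a few intervals rather than reading one report by eye, the value can be pulled out of the iostat CPU report. This is a sketch that assumes the usual sysstat layout, in which %iowait is the fourth value on the line after the `avg-cpu:` header; `iowait_values` is a made-up helper name.

```shell
# Print the %iowait value from each "avg-cpu" report in iostat output.
# %iowait is the fourth value on the line that follows the avg-cpu header.
iowait_values() {
  awk '/^avg-cpu:/ { if ((getline) > 0) print $4 }'
}

# iostat ships in the sysstat package and may not be installed.
if command -v iostat >/dev/null 2>&1; then
  # CPU report only (-c), three reports one second apart.
  # The first report covers the time since boot.
  iostat -c 1 3 | iowait_values
fi
```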
Run the ‘netstat -ant’ command to capture the TCP connection information. A connection can be in different states, such as ESTABLISHED, TIME_WAIT, CLOSE_WAIT, FIN_WAIT1 and FIN_WAIT2.
‘netstat -ant | grep LISTEN’ lists all the listening sockets. The ‘netstat -l’ command can also be used.
To find which program is associated with a connection, use the -p flag (root privileges may be needed to see processes owned by other users):
‘netstat -anp’
To get a per-protocol summary of packet statistics, run the command:
‘netstat -s’
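During an outage it is often quicker to count connections per state than to scroll through the full list. One possible way to do that, assuming the standard `netstat -ant` layout in which the state is the sixth column (the helper name `count_tcp_states` is hypothetical):

```shell
# Count TCP connections per state from "netstat -ant" output.
# Only lines whose first field starts with "tcp" are counted, which skips
# the two header lines; the state is the sixth column.
count_tcp_states() {
  awk '$1 ~ /^tcp/ { count[$6]++ } END { for (s in count) print count[s], s }' | sort -rn
}

# netstat is part of net-tools and may not be installed on newer hosts.
if command -v netstat >/dev/null 2>&1; then
  netstat -ant | count_tcp_states
fi
```

A large and growing CLOSE_WAIT or TIME_WAIT count is usually the first thing to look for in the result.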
To check the disk space utilization, run the ‘df -h’ command.
To check the size of each directory within a directory, navigate to that directory and run the command:
du -sh *
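When a directory has many entries, sorting the du output makes the largest ones stand out. A small sketch, assuming GNU `sort` (its `-h` option understands human-readable sizes such as 4.0K or 2.1G); `largest_entries` is a made-up helper name.

```shell
# Keep the N largest entries from "du -sh" output (default 10).
# Relies on GNU sort's -h option to compare human-readable sizes.
largest_entries() {
  sort -rh | head -n "${1:-10}"
}

# Five largest files or directories in the current directory.
du -sh -- * 2>/dev/null | largest_entries 5
```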
Run the command ‘ulimit -a’ to check the resource limits that apply to the current user session.
The key attributes to look at are:
- Core file size – If you want a core file to be generated when a process crashes, set this value to ‘unlimited’.
- Open files – The maximum number of files that can be open concurrently.
- Max user processes – The maximum number of processes that can exist concurrently.
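The limits printed by ulimit apply to the current shell, which is not necessarily what the running application was started with. On Linux, the limits and the open file count of a specific process can be read from `/proc`. A sketch under that assumption, with a hypothetical helper name `fd_usage`:

```shell
# Print "open/limit" for a process: the number of entries in /proc/PID/fd
# against the soft "Max open files" limit listed in /proc/PID/limits.
fd_usage() {
  pid="$1"
  open=$(ls "/proc/$pid/fd" | wc -l)
  limit=$(awk '/^Max open files/ { print $4 }' "/proc/$pid/limits")
  echo "$open/$limit"
}

# Example: inspect the current shell ($$). Substitute the PID of the Java
# process; reading another user's process usually requires root.
fd_usage "$$"
```

An open count close to the limit points to a file or socket descriptor leak.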
System messages are written to the /var/log/messages file (on Debian and Ubuntu the equivalent file is /var/log/syslog). Check the file for error messages generated by the system.
Run the ‘last’ command to see when users logged in to and out of the system. This is useful when an external connection to the host is made frequently and the connecting user never logs out; the output would then show many logged-in user sessions.
Run the ‘users’ command to see the currently logged-in users. Use the ‘finger’ command to retrieve more information about a user, if it is available.
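Rather than reading the whole file, the most recent suspicious lines can be filtered out. The search patterns below are illustrative assumptions, not a complete list, and `recent_errors` is a made-up helper name.

```shell
# Keep the last N lines (default 20) that mention common trouble indicators.
# The pattern list is only an example and should be adapted to the system.
recent_errors() {
  grep -iE 'error|fail|out of memory|killed process' | tail -n "${1:-20}"
}

# The log file is often readable only by root, so check before reading it.
if [ -r /var/log/messages ]; then
  recent_errors 20 < /var/log/messages
fi
```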