ESXTOP

I am a huge fan of esxtop! I used to read a couple of pages of the esxtop bible every day before I went to bed, not anymore as the doc is unfortunately outdated (yes I have requested an update various times.). Something I, however, am always struggling with is the “thresholds” of specific metrics. I fully understand that it is not black/white, performance is the perception of a user in the end.

There must be a certain threshold however. For instance it must be safe to say that when %RDY constantly exceeds the value of 20 it is very likely that the VM responds sluggish. I want to use this article to “define” these thresholds, but I need your help. There are many people reading these articles, together we must know at least a dozen metrics lets collect and document them with possible causes if known.

Please keep in mind that these should only be used as a guideline when doing performance troubleshooting! Also be aware that some metrics are not part of the default view. You can add fields to an esxtop view by clicking “f” on followed by the corresponding character.

I used VMworld presentations, VMware whitepapers, VMware documentation, VMTN Topics and of course my own experience as a source and these are the metrics and thresholds I came up with so far. Please comment and help build the main source for esxtop thresholds.

vSphere 6.5

For vSphere 6.5 there are various different metrics added. For instance, on the “power management” section there is %A/MPERF added which indicates if Turbo Boost is being used. On CPU also in 6.5 the %RUN will be lower as the system thread which is doing work for a particular VM will now be “charged” to %RUN instead of %SYS. That way %RUN actually represents everything and is more intuitive.

Metrics and Thresholds

Display Metric Threshold Explanation
CPU %RDY 10 Overprovisioning of vCPUs, excessive usage of vSMP or a limit(check %MLMTD) has been set. Note that you will need to expand the VM Group to see how this is distributed across vCPUs. If you have many vCPUs than per vCPU may be low and this may not be an issue. 10% is per world!
CPU %CSTP 3 Excessive usage of vSMP. Decrease amount of vCPUs for this particular VM. This should lead to increased scheduling opportunities.
CPU %MLMTD 0 The percentage of time the vCPU was ready to run but deliberately wasn’t scheduled because that would violate the “CPU limit” settings. If larger than 0 the world is being throttled due to the limit on CPU.
CPU %SWPWT 5 VM waiting on swapped pages to be read from disk. Possible cause: Memory overcommitment.
MEM MCTLSZ 1 If larger than 0 hosts is forcing VMs to inflate balloon driver to reclaim memory as host is overcommited.
MEM SWCUR 1 If larger than 0 hosts has swapped memory pages in the past. Possible cause: Overcommitment.
MEM SWR/s 1 If larger than 0 host is actively reading from swap(vswp). Possible cause: Excessive memory overcommitment.
MEM SWW/s 1 If larger than 0 host is actively writing to swap(vswp). Possible cause: Excessive memory overcommitment.
MEM CACHEUSD 0 If larger than 0 hosts has compressed memory. Possible cause: Memory overcommitment.
MEM ZIP/s 0 If larger than 0 hosts is actively compressing memory. Possible cause: Memory overcommitment.
MEM UNZIP/s 0 If larger than 0 host has accessing compressed memory. Possible cause: Previously host was overcommited on memory.
MEM N%L 80 If less than 80 VM experiences poor NUMA locality. If a VM has a memory size greater than the amount of memory local to each processor, the ESX scheduler does not attempt to use NUMA optimizations for that VM and “remotely” uses memory via “interconnect”. Check “GST_ND(X)” to find out which NUMA nodes are used.
NETWORK %DRPTX 1 Dropped packets transmitted, hardware overworked. Possible cause: very high network utilization
NETWORK %DRPRX 1 Dropped packets received, hardware overworked. Possible cause: very high network utilization
DISK GAVG 25 Look at “DAVG” and “KAVG” as the sum of both is GAVG.
DISK DAVG 25 Disk latency most likely to be caused by the array.
DISK KAVG 2 Disk latency caused by the VMkernel, high KAVG usually means queuing. This is the ESXi storage stack, the vSCSI layer and the VMM. Check “QUED”.
DISK QUED 1 Queue maxed out. Possibly queue depth set to low, or controller overloaded. Check with array vendor for optimal queue depth value. (Enable this via option “F” aka QSTATS
DISK ABRTS/s 1 Aborts issued by guest(VM) because storage is not responding. For Windows VMs this happens after 60 seconds by default. Can be caused for instance when paths failed or array is not accepting any IO for whatever reason.
DISK RESETS/s 1 The number of commands resets per second.
DISK ATSF 1 The number of failed ATS commands, this value should be 0
DISK ATS 1 The number of successful ATS commands, this value should go up over time when the array supports ATS
DISK DELETE 1 The number of successful UNMAP commands, this value should go up over time when the array supports UNMAP!
DISK DELETE_F 1 The number of failed UNMAP commands, this value should be 0
DISK CONS/s 20 SCSI Reservation Conflicts per second. If many SCSI Reservation Conflicts occur performance could be degraded due to the lock on the VMFS.
VSAN SDLAT 5 Standard deviation of latency, when above 10ms latency contact support to analyze vSAN Observer details to find out what is causing the delay

Running esxtop

Although understanding all the metrics esxtop provides seem to be impossible using esxtop is fairly simple. When you get the hang of it you will notice yourself staring at the metrics/thresholds more often than ever. The following keys are the ones I use the most.

Open console session or ssh to ESX(i) and type:

esxtop

By default the screen will be refreshed every 5 seconds, change this by typing:

s 2

Changing views is easy, type the following keys for the associated views:

c = cpu
m = memory
n = network
i = interrupts
d = disk adapter
u = disk device
v = disk VM
p = power mgmt
x = vsan

V = only show virtual machine worlds
e = Expand/Rollup CPU statistics, show details of all worlds associated with group (GID)
k = kill world, for tech support purposes only!
l = limit display to a single group (GID), enables you to focus on one VM
# = limiting the number of entitites, for instance the top 5

2 = highlight a row, moving down
8 = highlight a row, moving up
4 = remove selected row from view
e = statistics broken down per world
6 = statistics broken down per world

Add/Remove fields:

f
<type appropriate character>

Changing the order:

o
<move field by typing appropriate character uppercase = left, lowercase = right>

Saving all the settings you’ve changed:

W

Keep in mind that when you don’t change the file-name it will be saved and used as default settings.
Help:

?

In very large environments esxtop can high CPU utilization due to the amount of data that will need to be gathered and calculations that will need to be done. If CPU appears to highly utilized due to the number of entities (VMs / LUNs etc) a command line option can be used which locks specific entities and keeps esxtop from gathering specific info to limit the amount of CPU power needed:

esxtop -l

More info about this command line option can be found here.

Capturing esxtop results

First things first. Make sure you only capture relevant info. Ditch the metrics you don’t need. In other words, run esxtop and remove/add(f) the fields you don’t actually need or do need! When you are finished make sure to write(W) the configuration to disk. You can either write it to the default config file(esxtop4rc) or write the configuration to a new file.

Now that you have configured esxtop as needed run it in batch mode and save the results to a .csv file:

esxtop -b -d 2 -n 100 > esxtopcapture.csv

Where “-b” stands for batch mode, “-d 2” is a delay of 2 seconds and “-n 100” is 100 iterations. In this specific case, esxtop will log all metrics for 200 seconds. If you want to record all metrics make sure to add “-a” to your string.

Or what about directly zipping the output as well? These .csv can grow fast and by zipping it a lot of precious disk space can be saved!

esxtop -b -a -d 2 -n 100 | gzip -9c > esxtopoutput.csv.gz

Please note that when a new VM is powered on, a VM is vMotion to the host or a new world is created it will not show up within esxtop when “-b” is used as the entities are locked! This behavior is similar to starting esxtop with “-l”.

Analyzing results

You can use multiple tools to analyze the captured data.

  1. VisualEsxtop
  2. perfmon
  3. excel
  4. esxplot

What is VisualEsxtop as it is a relatively new tool (published 1st of July 2013).

VisualEsxtop is an enhanced version of resxtop and esxtop. VisualEsxtop can connect to VMware vCenter Server or ESX hosts, and display ESX server stats with a better user interface and more advanced features.

That sounds nice right? Let us have a look how it works, this is what I did to get it up and running:

  • Go to “http://labs.vmware.com/flings/visualesxtop” and click “download”
  • Unzip “VisualEsxtop.zip” into a folder you want to store the tool
  • Go to the folder
  • Double click “visualesxtop.bat” when running Windows (Or follow William’s tip for the Mac)
  • Click “File” and “Connect to Live Server”
  • Enter the “Hostname”, “Username” and “Password” and hit “Connect”
  • That is it…

Now some simple tips:

  • By default, the refresh interval is set to 5 seconds. You can change this by hitting “Configuration” and then “Change Interval”
  • You can also load Batch Output, this might come in handy when you are a consultant for instance and a customer sends you captured data, you can do this under: File -> Load Batch Output
  • You can filter output, very useful if you are looking for info on a specific virtual machine/world! See the filter section.
  • When you click “Charts”  and double click “Object Types” you will see a list of metrics that you can create a chart with. Just unfold the ones you need and double click them to add them to the right pane

There are a bunch of other cool features in their like color-coding of important metrics for instance. Also, the fact that you can show multiple windows at the same time is useful if you ask me and of course the tooltips that provide a description of the counter! If you ask me, a tool everyone should download and check out.

Let’s continue with my second favorite tool, perfmon. I’ve used perfmon(part of Windows also known as “Performance Monitor”) multiple times and it’s probably the easiest as many people are already familiar with it. You can import a CSV as follows:

  1. Run: perfmon
  2. Right click on the graph and select “Properties”.
  3. Select the “Source” tab.
  4. Select the “Log files:” radio button from the “Data source” section.
  5. Click the “Add” button.
  6. Select the CSV file created by esxtop and click “OK”.
  7. Click the “Apply” button.
  8. Optionally: reduce the range of time over which the data will be displayed by using the sliders under the “Time Range” button.
  9. Select the “Data” tab.
  10. Remove all Counters.
  11. Click “Add” and select appropriate counters.
  12. Click “OK”.
  13. Click “OK”.

The result of the above would be:
Imported ESXTOP data

With MS Excel it is also possible to import the data as a CSV. Keep in mind though that the amount of captured data is insane so you might want to limit it by first importing it into perfmon and then select the correct timeframe and counters and export this to a CSV. When you have done so you can import the CSV as follows:

  1. Run: excel
  2. Click on “Data”
  3. Click “Import External Data” and click “Import Data”
  4. Select “Text files” as “Files of Type”
  5. Select File and click “Open”
  6. Make sure “Delimited” is selected and click “Next”
  7. Deselect “Tab” and select “Comma”
  8. Click “Next” and “Finish”

All data should be imported and can be shaped / modelled / diagrammed as needed.

Another option is to use a tool called “esxplot“. It hasn’t been updated in a while, and I am not sure what the state of the tool is. You can download the latest version here though, but personally I would recommend using VisualEsxtop instead of esxplot, just because it is more recent.

  1. Run: esxplot
  2. Click File -> Import -> Dataset
  3. Select file and click “Open”
  4. Double click hostname and click on metric

Using ESXPLOT for ESXTOP data
As you can clearly see in the screenshot above the legend(right of the graph) is too long. You can modify that as follows:

  1. Click on “File” -> preferences
  2. Select “Abbreviated legends”
  3. Enter appropriate value

For those using a Mac, esxplot uses specific libraries which are only available on the 32Bit version of Python. In order for esxplot to function correctly set the following environment variable:

export VERSIONER_PYTHON_PREFER_32_BIT=yes

Limiting your view

In environments with a very high consolidation ratio (high number of VMs per host), it could occur that the VM you need to have performance counters for isn’t shown on your screen. This happens purely due to the fact that height of the screen is limited in what it can display. Unfortunately, there is currently no command line option for esxtop to specify specific VMs that need to be displayed. However, you can export the current list of worlds and import it again to limit the amount of VMs shown.

esxtop -export-entity filename

Now you should be able to edit your file and comment out specific worlds that are not needed to be displayed.

esxtop -import-entity filename

I figured that there should be a way to get the info through the command line as and this is what I came up with. Please note that <virtualmachinename> needs to be replaced with the name of the virtual machine that you need the GID for.

VMWID=`vm-support -x | grep <virtualmachinename> |awk '{gsub("wid=", "");print $1}'`
VMXCARTEL=`vsish -e cat /vm/$VMWID/vmxCartelID`
vsish -e cat /sched/memClients/$VMXCARTEL/SchedGroupID

Now you can use the outcome within esxtop to limit(l) your view to that single GID. William Lam has written an article a couple of days after I added the GID section. The following is a lot simpler than what I came up with,

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s