In the name of God, the Most Gracious, the Most Merciful.
An offline vSAN demo environment for trying out most of the vSAN configuration.
I was recently upgrading a vCenter Server Appliance from 6.0 to 6.5 and ran into an interesting error during the upgrade pre-check. It identified that the appliance’s hostname had an IPv6 DNS record and indicated that IPv6 networking would not be preserved after the upgrade.
I’m guessing that the IPv6 DNS entry was a leftover from Windows dynamically registering it, because this vCenter was previously migrated from Windows to the vCSA. I knew that IPv6 was not being used and hoped there was a way to continue with the upgrade anyway, but this error does not allow you to proceed. I planned to remove the IPv6 entry as the error message suggests; however, deleting a DNS record can be a significant change in some environments. So I began looking for a workaround that would allow me to proceed with the upgrade and fix DNS later.
First I disabled IPv6 in the DCUI on both the source and destination appliance.
If you’re unfamiliar with the upgrade process, it actually deploys a new 6.5 appliance with a temporary IP address, migrates the configuration and data from the 6.0 appliance (the source) to the 6.5 appliance (the destination), powers off the 6.0 appliance, and finally the 6.5 appliance assumes the network identity (IP and hostname) of the old one.
With IPv6 disabled on both appliances I attempted the upgrade again. The error message persisted, but at this point I was even more confident that IPv6 was not being used and that the pre-upgrade script was still checking for the IPv6 address resolution anyway.
I should insert a warning before continuing – the following is purely for educational purposes, should not be attempted in a production environment, and is most likely unsupported! Running “ps aux” in my SSH session on the 6.5 appliance revealed several upgrade scripts running in “/usr/lib/vmware/cis_upgrade_runner/bootstrap_scripts”. I changed to this directory and began searching for the source of the error message. Running “grep ‘resolves to IPv6’ *.py” directed me to the file “upgrade_commands.py”, where I located the section that was checking for this condition and commented it out by prefixing each line with a “#”.
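For reference, the hunt boiled down to a few shell commands. A rough sketch of what I ran, using the paths and search string mentioned above (exact paths may differ between builds):
# See which upgrade scripts are running and where they live
ps aux | grep cis_upgrade_runner
# Search the bootstrap scripts for the error string from the pre-check
cd /usr/lib/vmware/cis_upgrade_runner/bootstrap_scripts
grep 'resolves to IPv6' *.py    # points at upgrade_commands.py
The block I ended up commenting out in upgrade_commands.py looked like this: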
#if hasNetConflicts:
#    addrStr = ', '.join(ipv6Addresses)
#    logger.error("Source hostname '%s' resolves to IPv6 address(es) '%s', but "
#                 "IPv6 networking will not be preserved during the upgrade.",
#                 sourceFqdn, addrStr)
#    self.reporter.addError(
#        text=_(IPV6_RESOLVABLE_IN_IPV4_NETCONFIG_ERROR_TEXT,
#               [sourceFqdn, addrStr]),
#        description=_(IPV6_RESOLVABLE_IN_IPV4_NETCONFIG_ERROR_DESCRIPTION,
#                      [sourceFqdn, addrStr]),
#        resolution=_(IPV6_RESOLVABLE_IN_IPV4_NETCONFIG_ERROR_RESOLUTION,
#                     sourceFqdn))
After commenting out the above section that was logging the IPv6 error I restarted the upgrade and it completed without issue. Again, I don’t recommend doing this unless it is a true test environment that you can afford to rebuild if necessary. I just wanted to document/share the experience and a few details about how the vCSA 6.5 pre-upgrade check works. If you haven’t tried the vCenter 6.5 appliance yet I highly recommend it. See more about what’s new over on the VMware vSphere Blog.
Starting with AOS 5.15.2 (LTS) and AOS 5.17.1, Nutanix introduced VMware vSphere 7 support. Now, you can upgrade your vSphere ESXi to version 7 on the Nutanix platform.
To make sure your hardware platform is supported with VMware vSphere 7, go to my.nutanix.com and check the compatibility matrix.
To upgrade the Nutanix cluster to vSphere 7, you can use the Nutanix 1-Click feature available in every Prism Element console. Go to Settings –> Upgrade Software –> Hypervisor, then upload the hypervisor binary and the JSON file. You can download the JSON file from the Nutanix support portal and the VMware ESXi 7.0 zip package from the VMware support portal.
When the upload is finished, hit the UPGRADE button and wait. In the background the system will migrate all VMs off the first host, put the host into maintenance mode, upgrade ESXi 6.7 to ESXi 7.0, reboot it, take it out of maintenance mode, and then move on to the next host.
The upgrade may take between 20 and 40 minutes per node, depending on how busy the cluster is, how big the node is (RAM, CPU, and disks), and how many VMs are hosted on the Nutanix cluster.
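Once a node comes back from its reboot, you can confirm it is running the new build before the rolling upgrade moves on. A minimal check over SSH, assuming SSH is enabled on the hosts; the host names below are just placeholders:
# Print the running ESXi version/build on each host (host names are examples)
for host in esxi-01 esxi-02 esxi-03; do
  echo "== $host =="
  ssh root@"$host" vmware -vl
done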
VMware has released vRealize Network Insight 5.1. vRealize Network Insight (vRNI) supports a large number of switch and router vendors, to name a few: Dell, Cisco (ACI, Nexus), Arista, and Juniper. It also supports firewalls such as Palo Alto, Check Point, Cisco ASA, and Fortinet.
vRNI also has Azure support, which gives you visibility into application dependency mapping and flow analysis (inter-, intra-, and hybrid-VNet, NSG, ASG, VM, subnet, etc.).
vRNI helps you accelerate micro-segmentation deployment and troubleshoot security for SDDC, native AWS, and hybrid environments.
There are three editions of vRNI: Advanced, Enterprise, and Cloud Service. The Advanced edition does not include security planning for AWS, AWS visibility and troubleshooting, the PCI compliance dashboard, NetFlow from physical devices, a configurable and extended data retention period, or Infoblox integration.
VMware SD-WAN by VeloCloud®
Application Discovery and Troubleshooting
VMware NSX-T
VMware Cloud on AWS
Containers
Other Enhancements
VMware vRealize Network Insight 5.1.0 Download
NSX-T Data Center 2.5 provides a variety of new features to provide new functionality for virtualized networking and security for private, public, and hybrid clouds. Highlights include enhancements to intent-based networking user interface, context-aware firewall, guest and network introspection features, IPv6 support, highly-available cluster management, profile-based NSX installation for vSphere compute clusters, and enhancements to migration coordinator for migrating from NSX Data Center for vSphere to NSX-T Data Center.
NSX-T Data Center 2.5 introduces NSX Intelligence v1.0, a new NSX analytics component. NSX Intelligence provides a user interface via a single management pane within NSX Manager, and provides the following features:
New API support is available for container inventory. See the API documentation.
NSX-T 2.5 continues to enhance the IPv6 routing/forwarding feature-set. This includes the support for:
Layer-7 AppID Support
NSX-T 2.5 adds more Layer-7 capabilities for distributed and gateway firewall. This includes the support for:
FQDN/URL Filtering Enhancements
NSX-T 2.5 has minor enhancements to FQDN filtering support, including:
Firewall Operations have been enhanced with the following features:
Identity Firewall
Support of single cluster designs with fully collapsed Edge+Management+Compute VMs, powered by a single N-VDS, in a cluster with a minimum of four hosts. The typical reference designs for VxRail and other cloud provider host solutions prescribe 4x10G pNICs with two host switches. One switch is dedicated to Edge+Management (VDS), whereas the other one is dedicated to compute VMs (N-VDS). Two host switches effectively separate the management traffic from the compute traffic. However, with the trending economics of 10 and 25G, many small data center and cloud provider customers are standardizing on two-pNIC hosts. Using this form factor, small data center and cloud provider customers can build an NSX-T based solution with a single N-VDS, powering all the components with two pNICs.
For compatibility and system requirements information, see the NSX-T Data Center Installation Guide.
NSX-T Data Center System Communication Port Changes
Starting with NSX-T Data Center 2.5, the NSX Messaging channel TCP port from all Transport and Edge nodes to NSX Managers has changed to TCP port 1234 from port 5671. With this change, make sure all NSX-T Transport and Edge nodes can communicate on both TCP ports 1234 to NSX Managers and TCP port 1235 to NSX Controllers before you upgrade to NSX-T Data Center 2.5. Also make sure to keep port 5671 open during the upgrade process.
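A quick way to sanity-check that connectivity before upgrading is to probe the ports from a transport node’s shell. A minimal sketch using the nc utility available on ESXi; the IP addresses below are placeholders for your NSX Manager and Controller nodes:
# Probe the NSX Manager / Controller ports from an ESXi transport node (IPs are examples)
nc -z 10.0.0.11 1234 && echo "manager port 1234 reachable"
nc -z 10.0.0.21 1235 && echo "controller port 1235 reachable"
nc -z 10.0.0.11 5671 && echo "port 5671 reachable (still needed during the upgrade)"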
L2 Networking
As a result of the enhancements for Layer-2 bridges, the ESXi bridge is deprecated. NSX-T was initially introduced with the capability of dedicating an ESXi host as a bridge to extend an overlay segment to a VLAN. This model is deprecated as of this release because the new Edge bridge supersedes it in terms of features, does not require a dedicated ESXi host, and benefits from the optimized data path of the Edge node. See “What’s New” for more information.
Transport Node Template APIs are deprecated in this release. It is recommended that you use Transport Node Profiles APIs instead. See the API Guide for the list of deprecated types and methods.
See code.vmware.com to use the NSX-T Data Center APIs or CLIs for automation.
The API documentation is available from the API Reference tab. The CLI documentation is available from the Documentation tab.
NSX-T Data Center has been localized into multiple languages: English, German, French, Japanese, Simplified Chinese, Korean, Traditional Chinese, and Spanish. Because NSX-T Data Center localization utilizes the browser language settings, ensure that your settings match the desired language.
19 September 2019. First edition.
23 September 2019. Added Known Issues 2424818 and 2419246. Added Resolved Issues 2364756, 2406018, and 2383328.
24 September 2019. Updated What’s New items.
03 October 2019. Added Resolved Issue 2313673.
12 November 2019. Added Known Issues 2362688 and 2436302. Corrected Issue 2282798 by moving it to Resolved.
For VM-based Edge transport nodes, users are unable to connect the Edge transport node uplinks to the NSX-T logical switches/segments. They can connect them only to the vCenter’s DVPGs. On the Configure NSX screen for VM-based Edge transport node’s add/edit flows, the users are presented with the option to map the uplinks only with vCenter’s DVPGs. The option to map the uplinks to the NSX-T logical switches/segments is missing.
The known issues are grouped as follows.
/var/log/proxy/reverse-proxy.log /var/log/syslog
Workaround: Close all open authentication windows/tabs and retry authentication. Verify by opening the key file: if a passphrase was entered when generating the key, the second line in the file will show something like “Proc-Type: 4,ENCRYPTED”.
This line is missing if the key file was generated without a passphrase.
Workaround: You can prevent forwarding loops by configuring the problematic router (or its peer) to block routes being advertised back to it.
Logical Networking Known Issues
Security Services Known Issues
Solution Interoperability Known Issues
NSX Intelligence does not support source or destination port parsing for GRE, ESP, and SCTP protocol flows. NSX Intelligence provides full header parsing for TCP and UDP flows along with flow related statistics. For other supported protocols (such as GRE, ESP, and SCTP) NSX Intelligence can only provide IP information without protocol specific source or destination ports. For these protocols, the source or destination port will be zero.
Workaround: None.
These errors will self-correct after the NSX Intelligence appliance has been deployed longer than the user-selected visualization period.
Workaround: None. If the user moves out of the Visualization period during which the NSX Intelligence appliance was deployed, the issue will not appear.
Failed to load requested application. Please try again or contact support if the problem persists.
Workaround:
$ GET https://{{nsx_ua_server}}/api/v1/trust-management/certificates/{{certificate ID from previous step}}
$ ssh root@nsx-pace
$ export NEW_CERT_FILE=/root/new_cert.pem
$ export HTTP_CERT_PWD_FILE=/config/http/.http_cert_pw
$ export HTTP_CERT_PW=$(cat $HTTP_CERT_PWD_FILE)
$ export CLIENT_TRUSTSTORE_FILE="/home/secureall/secureall/.store/.client_truststore"
$ cat > $NEW_CERT_FILE
-----BEGIN CERTIFICATE-----
<pem_encoded field contents>
-----END CERTIFICATE-----
Use sed to remove the newline escape sequences from the text string:
$ sed 's/\\n/\n/g' -i $NEW_CERT_FILE
$ keytool -import -alias new_nsx_cluster_key -file $NEW_CERT_FILE -keystore \
$CLIENT_TRUSTSTORE_FILE -storepass $HTTP_CERT_PW -noprompt
Verify that the new_nsx_cluster_key entry is now present in the truststore:
$ keytool -list -v -keystore $CLIENT_TRUSTSTORE_FILE -storepass $HTTP_CERT_PW -noprompt
$ systemctl restart proxy
You should now be able to refresh the Plan & Troubleshoot page and view the flow information as before.
Operations and Monitoring Services Known Issues
get service install-upgrade
dpkg --configure -a
dpkg --configure nsx-agent
dpkg --configure nsx-vdpi
As most of you have seen, vSAN 6.7 was just released together with vSphere 6.7. As such I figured it was time to write a “what’s new” article. There are a whole bunch of cool enhancements and new features, so let’s create a list of the new features first, and then look at them individually in more detail.
Yes, that is a relatively long list indeed. Let’s take a look at each of the features. First of all, HTML-5 support. I think this is something that everyone has been waiting for. The Web Client was not the most loved user interface that VMware produced, and hopefully the HTML-5 interface will be viewed as a huge step forward. I have played with it extensively over the past 6 months and I must say that it is very snappy. I like how we didn’t just port over all functionality, but also looked at whether workflows could be improved and whether the presented information/data made sense in each and every screen. This does, however, mean that new functionality from now on will only be available in the HTML-5 client, so use this going forward. Unless of course the functionality you are trying to access isn’t available yet, but most of it should be! For those who haven’t seen it yet, here’s a couple of screenshots… ain’t it pretty? 😉
For those who didn’t notice: in the above screenshot you can actually see the swap file, and the policy associated with the swap file, which is a nice improvement!
The next feature is native vROps dashboards for vSAN in the H5 client. I find this particularly useful. I don’t like context switching, and this feature allows me to see all of the data I need to do my job in a single user interface. No need to switch to the vROps UI; instead, vSphere and vSAN dashboards are now made available in the H5 client. Note that it needs the vROps Client Plugin for the vCenter H5 UI to be installed, but that is fairly straightforward.
Next up is support for Microsoft Windows Server Failover Clustering for the vSAN iSCSI service. This is very useful for those running a Microsoft cluster. Create an iSCSI target and expose it to the WSFC virtual machines. (Normally people used RDMs for this.) Of course this is also supported with physical machines. Such a small enhancement, but for customers using Microsoft clustering it is a big thing, as it now allows you to run those clusters on vSAN without any issues.
Next are a whole bunch of enhancements that have been added based on customer feedback of the past 6-12 months. Fast Network Failover was one of those. The majority of our customers have a single VMkernel interface with multiple NICs associated with it; some of our customers have a setup where they create two VMkernel interfaces on different subnets, each with a single NIC. What that last group of customers noticed is that in the previous release we waited 90 seconds before failing over to the other VMkernel interface (TCP timeout) when a network/interface had failed. In the 6.7 release we actually introduce a mechanism that allows us to fail over fast, literally within seconds. So a big improvement for customers who have this kind of network configuration (which is very similar to the traditional A/B storage fabric design).
Adaptive Resync is an optimization to the current resync function that is part of vSAN. If a failure has occurred (host, disk, flash failure) then data will need to be resynced to ensure that the impacted objects (VMs, disks, etc.) are brought back into compliance with the configured policy. Over the past 12 months the engineering team has worked hard to optimize the resync mechanism as much as possible. In vSAN 6.6.1 a big jump was already made by taking VM latency into account when it came to resync bandwidth allocation, and this has been further enhanced in 6.7. In 6.7 vSAN can calculate the total available bandwidth, and ensures quality of service for the guest VMs prevails by allocating those VMs 80% of the available bandwidth and limiting the resync traffic to 20%. Of course, this only applies when congestion is detected. Expect more enhancements in this space in the future.
A couple of releases ago we introduced Witness Traffic Separation for 2 Node configurations, and in 6.7 we introduce support for this feature for Stretched Clusters as well. This is something many stretched vSAN customers have asked for. It can be configured through the CLI only at this point (esxcli), but that shouldn’t be a huge problem. As mentioned previously, what you end up doing is tagging a vmknic for “witness traffic” only. Pretty straightforward, but very useful:
esxcli vsan network ip set -i vmk<X> -T=witness
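To confirm the tagging took effect, you can list the vSAN VMkernel interfaces afterwards. A quick check (output details can vary per release):
# List vSAN-enabled VMkernel interfaces; the tagged vmk should show witness as its traffic type
esxcli vsan network list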
Another enhancement for stretched clusters is Preferred Site Override. It is a small enhancement, but in the past, when the preferred site failed and returned for duty while only being connected to the witness, it could happen that the witness would bind itself directly to the preferred site. This by itself would result in VMs becoming unavailable. The Preferred Site Override functionality prevents this from happening. It will ensure that VMs (and all data) remain available in the secondary site. I guess one could also argue that this is not an enhancement, but much more a bug fix. And then there is the Efficient Resync for Stretched Clusters feature. This is getting a bit too much into the weeds, but essentially it is a smarter way of bringing components up to the same level within a site after the network between locations has failed. As you can imagine, one location is allowed to progress, which means that the other location needs to catch up when the network returns. With this enhancement we limit the bandwidth / resync traffic.
And as with every new release, the 6.7 release of course also has a whole new set of health checks. I think the Health Check has quickly become the favorite feature of all vSAN admins, and for a good reason. It makes life much easier if you ask me. In the 6.7 release, for instance, we will validate consistency in terms of host settings and report it if an inconsistency is found. Also, when downloading the HCL details, we will only download the differences between the current and previous version. (In the past we would simply pull the full JSON file.) There are many other small improvements around performance etc. Just give it a spin and you will see.
Something that my team has been pushing hard for (thanks Paudie) is the Enhanced Diagnostic Partition. As most of you know, when you install / run ESXi there’s a diagnostic partition. This diagnostic partition unfortunately used to be a fixed size; with the current release, ESXi will automatically resize the diagnostic partition when upgrading (or installing greenfield). This is especially useful for large-memory host configurations, and actually useful for vSAN in general. No longer do you need to run a script to resize the partition, it will happen automatically for you!
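If you want to see which diagnostic (coredump) partition a host is currently using, and which candidates exist, a quick look from the ESXi shell is enough. A small sketch; note that on some installs the coredump target is a file rather than a partition:
# Show the active diagnostic/coredump partition and all candidate partitions
esxcli system coredump partition get
esxcli system coredump partition list
# If the host uses a coredump file instead of a partition:
esxcli system coredump file get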
Another optimization that was released in vSAN 6.7 is called “Efficient Decommissioning”. This is all about being smarter in terms of consolidating replicas across hosts/fault domains to free up a host/fault domain and allow maintenance mode to occur. This means that if a component is striped for reasons other than policy, it may be consolidated. And the last optimization is what they refer to as Efficient and consistent storage policies. I am not sure I understand the name, as this is all about the swap object. As of vSAN 6.7 it will be thin provisioned by default (instead of 100% reserved), and the swap object will now also inherit the policy assigned to the VM. So if you have FTT=2 assigned to the VM, then you will have not two but three components for the swap object, still thin provisioned, so it shouldn’t really change the consumed space in most cases.
Then there are the two last items on the list: 4K Native Device Support and FIPS 140-2 Level 1 validation. I think those speak for themselves. 4K Native Device Support has been asked for by many customers, but we had to wait for vSphere to support it. vSphere supports it as of 6.7, so that means vSAN will also support it Day 0. The VMware VMkernel Cryptographic Module v1.0 has achieved FIPS 140-2 validation, and vSAN leverages the same module for vSAN Encryption. Nice collaboration by the teams, which is now showing its big benefit.
In this post we will talk about upgrading a vCenter Server Appliance with an external PSC from 6.0 U3b to 6.5 U1e. We did this upgrade before the release of 6.5 U2, which came out a few days later (see the release notes), so afterwards we also updated to 6.5 U2.
Before upgrading to 6.5, you must prepare for it:
Source System Prerequisites
• Verify that the appliance that you want to upgrade does not run on an ESXi host that is part of a fully automated DRS cluster.
• Verify that port 22 is open on the appliance that you want to upgrade. The upgrade process establishes an inbound SSH connection to download the exported data from the source appliance.
• If you are upgrading a vCenter Server Appliance that is configured with Update Manager, run the Migration Assistant on the source Update Manager machine.
• Verify that port 443 is open on the source ESXi host
• Create a snapshot of the appliance that you want to upgrade as a precaution in case of failure during the upgrade process.
• If you use an external database, back up the vCenter Server Appliance database
Target System Prerequisites
• If you plan to deploy the new appliance on an ESXi host, verify that the target ESXi host is not part of a fully automated DRS cluster.
• If you plan to deploy the new appliance on a DRS cluster of the inventory of a vCenter Server instance, verify that the cluster is not fully automated.
In this case we have two Platform Services Controllers linked behind a load balancer and two vCenter Servers linked to each other.
The first step is to upgrade the Platform Services Controller (PSC) and its partner.
The upgrade takes place in two stages.
Stage 1: a new PSC VM is created with a temporary IP address.
After stage 1 completes, stage 2 starts, in which the new PSC VM copies all configuration from the old PSC VM, takes over the same DNS name and IP address, and then shuts down the old one.
After this step completes, you can check from the VAMI interface. To do that, just navigate to https://<Platform Services Controller address>:5480 and log in with your root account.
After upgrading the PSC and its partner, we are now ready to start the vCenter upgrades concurrently.
Before upgrading vCenter, we need to run the Migration Assistant on the Update Manager VM.
Now we are ready to start the vCenter upgrade, which is also done in two stages.
Stage 1: a new vCenter VM is created.
Stage 2: data is transferred from the old vCenter to the new vCenter, including the DNS name, IP address, configuration, and Update Manager.
You can check from the VAMI interface. To do that, just navigate to https://<vCenter Server Appliance address>:5480 and log in with your root account.
We have now successfully upgraded the Platform Services Controllers and the vCenter Servers to 6.5 U1e.
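If you prefer the command line over the VAMI, the appliance REST API introduced in 6.5 is a quick way to confirm the running version. A minimal sketch with curl; the hostname and credentials are placeholders, and -k skips certificate validation for brevity:
# Log in to the appliance API and read the version/build (hostname and credentials are examples)
TOKEN=$(curl -sk -X POST -u 'administrator@vsphere.local:VMware1!' \
  https://vcsa.lab.local/rest/com/vmware/cis/session | sed 's/.*"value":"\([^"]*\)".*/\1/')
curl -sk -H "vmware-api-session-id: $TOKEN" https://vcsa.lab.local/rest/appliance/system/version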
Now, from the VAMI interface, we will check for updates and install the latest one, which is 6.5 U2.
We will follow the same sequence: first update the PSC and its partner, then the vCenter Servers concurrently.
First, the PSC upgrade.
After the packages are upgraded, we need to reboot.
Now the vCenter Servers.
To wrap up, in this blog post we showed how to upgrade a vCenter Server Appliance with an external Platform Services Controller from 6.0 U3b to 6.5 U1e and then to 6.5 U2.
You can also upgrade directly from 6.0 U3b to 6.5 U2 using the same steps as above.
We have vCenter 6.5 U2; when I try to add an ESXi 6.0 U3 host, I get the error message below:
Add standalone host A general system error occurred: Unable to get signed certificate for host: Error: Access denied, reason = rpc_s_auth_method (0x16c9a0f6). (382312694).
To solve this error, follow the steps below:
Connect to the vCenter Server using vSphere Client and an Administrative account.
Go to Configure > vCenter Server settings > Advanced Settings.
Change the value of vpxd.certmgmt.mode to thumbprint and click OK.
Add the ESXi host again.
I am a huge fan of esxtop! I used to read a couple of pages of the esxtop bible every day before I went to bed; not anymore, as the doc is unfortunately outdated (yes, I have requested an update various times). Something I am always struggling with, however, is the “thresholds” of specific metrics. I fully understand that it is not black/white; performance is the perception of the user in the end.
There must be a certain threshold, however. For instance, it must be safe to say that when %RDY constantly exceeds the value of 20 it is very likely that the VM responds sluggishly. I want to use this article to “define” these thresholds, but I need your help. There are many people reading these articles; together we must know at least a dozen metrics, so let’s collect and document them with possible causes where known.
Please keep in mind that these should only be used as a guideline when doing performance troubleshooting! Also be aware that some metrics are not part of the default view. You can add fields to an esxtop view by pressing “f” followed by the corresponding character.
I used VMworld presentations, VMware whitepapers, VMware documentation, VMTN Topics and of course my own experience as a source and these are the metrics and thresholds I came up with so far. Please comment and help build the main source for esxtop thresholds.
vSphere 6.5
For vSphere 6.5, various new metrics have been added. For instance, in the “power management” section %A/MPERF has been added, which indicates whether Turbo Boost is being used. Also for CPU in 6.5, %SYS will be lower and %RUN higher, as the system thread which is doing work for a particular VM will now be “charged” to %RUN instead of %SYS. That way %RUN actually represents everything and is more intuitive.
Display | Metric | Threshold | Explanation |
CPU | %RDY | 10 | Overprovisioning of vCPUs, excessive usage of vSMP or a limit (check %MLMTD) has been set. Note that you will need to expand the VM Group to see how this is distributed across vCPUs. If you have many vCPUs, then the value per vCPU may be low and this may not be an issue. 10% is per world! |
CPU | %CSTP | 3 | Excessive usage of vSMP. Decrease amount of vCPUs for this particular VM. This should lead to increased scheduling opportunities. |
CPU | %MLMTD | 0 | The percentage of time the vCPU was ready to run but deliberately wasn’t scheduled because that would violate the “CPU limit” settings. If larger than 0 the world is being throttled due to the limit on CPU. |
CPU | %SWPWT | 5 | VM waiting on swapped pages to be read from disk. Possible cause: Memory overcommitment. |
MEM | MCTLSZ | 1 | If larger than 0, the host is forcing VMs to inflate the balloon driver to reclaim memory as the host is overcommitted. |
MEM | SWCUR | 1 | If larger than 0, the host has swapped memory pages in the past. Possible cause: Overcommitment. |
MEM | SWR/s | 1 | If larger than 0, the host is actively reading from swap (vswp). Possible cause: Excessive memory overcommitment. |
MEM | SWW/s | 1 | If larger than 0, the host is actively writing to swap (vswp). Possible cause: Excessive memory overcommitment. |
MEM | CACHEUSD | 0 | If larger than 0, the host has compressed memory. Possible cause: Memory overcommitment. |
MEM | ZIP/s | 0 | If larger than 0, the host is actively compressing memory. Possible cause: Memory overcommitment. |
MEM | UNZIP/s | 0 | If larger than 0, the host is accessing compressed memory. Possible cause: The host was previously overcommitted on memory. |
MEM | N%L | 80 | If less than 80, the VM experiences poor NUMA locality. If a VM has a memory size greater than the amount of memory local to each processor, the ESX scheduler does not attempt to use NUMA optimizations for that VM and “remotely” uses memory via “interconnect”. Check “GST_ND(X)” to find out which NUMA nodes are used. |
NETWORK | %DRPTX | 1 | Dropped packets transmitted, hardware overworked. Possible cause: very high network utilization |
NETWORK | %DRPRX | 1 | Dropped packets received, hardware overworked. Possible cause: very high network utilization |
DISK | GAVG | 25 | Look at “DAVG” and “KAVG” as the sum of both is GAVG. |
DISK | DAVG | 25 | Disk latency most likely to be caused by the array. |
DISK | KAVG | 2 | Disk latency caused by the VMkernel, high KAVG usually means queuing. This is the ESXi storage stack, the vSCSI layer and the VMM. Check “QUED”. |
DISK | QUED | 1 | Queue maxed out. Possibly the queue depth is set too low, or the controller is overloaded. Check with the array vendor for the optimal queue depth value. (Enable this via option “F”, aka QSTATS.) |
DISK | ABRTS/s | 1 | Aborts issued by guest(VM) because storage is not responding. For Windows VMs this happens after 60 seconds by default. Can be caused for instance when paths failed or array is not accepting any IO for whatever reason. |
DISK | RESETS/s | 1 | The number of commands resets per second. |
DISK | ATSF | 1 | The number of failed ATS commands, this value should be 0 |
DISK | ATS | 1 | The number of successful ATS commands, this value should go up over time when the array supports ATS |
DISK | DELETE | 1 | The number of successful UNMAP commands, this value should go up over time when the array supports UNMAP! |
DISK | DELETE_F | 1 | The number of failed UNMAP commands, this value should be 0 |
DISK | CONS/s | 20 | SCSI Reservation Conflicts per second. If many SCSI Reservation Conflicts occur performance could be degraded due to the lock on the VMFS. |
VSAN | SDLAT | 5 | Standard deviation of latency, when above 10ms latency contact support to analyze vSAN Observer details to find out what is causing the delay |
Although understanding all the metrics esxtop provides may seem impossible, using esxtop is fairly simple. Once you get the hang of it, you will notice yourself staring at the metrics/thresholds more often than ever. The following keys are the ones I use the most.
Open console session or ssh to ESX(i) and type:
esxtop
By default the screen will be refreshed every 5 seconds, change this by typing:
s 2
Changing views is easy, type the following keys for the associated views:
c = cpu
m = memory
n = network
i = interrupts
d = disk adapter
u = disk device
v = disk VM
p = power mgmt
x = vsan
V = only show virtual machine worlds
e = Expand/Rollup CPU statistics, show details of all worlds associated with group (GID)
k = kill world, for tech support purposes only!
l = limit display to a single group (GID), enables you to focus on one VM
# = limiting the number of entities, for instance the top 5
2 = highlight a row, moving down
8 = highlight a row, moving up
4 = remove selected row from view
e = statistics broken down per world
6 = statistics broken down per world
Add/Remove fields:
f <type appropriate character>
Changing the order:
o <move field by typing appropriate character uppercase = left, lowercase = right>
Saving all the settings you’ve changed:
W
Keep in mind that when you don’t change the file-name it will be saved and used as default settings.
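If you save the settings to a custom file name instead, you can point esxtop at it later. A small sketch, assuming a configuration saved as /tmp/myesxtoprc (the path is just an example):
# Start esxtop with a previously saved custom configuration file
esxtop -c /tmp/myesxtoprc
# The same works in batch mode, so a capture only contains the fields you kept
esxtop -b -c /tmp/myesxtoprc -d 2 -n 100 > capture.csv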
Help:
?
In very large environments esxtop can cause high CPU utilization due to the amount of data that will need to be gathered and the calculations that will need to be done. If CPU appears to be highly utilized due to the number of entities (VMs / LUNs etc.), a command line option can be used which locks specific entities and keeps esxtop from gathering specific info to limit the amount of CPU power needed:
esxtop -l
More info about this command line option can be found here.
First things first. Make sure you only capture relevant info. Ditch the metrics you don’t need. In other words, run esxtop and remove/add (f) the fields you do or don’t actually need. When you are finished make sure to write (W) the configuration to disk. You can either write it to the default config file (esxtop4rc) or write the configuration to a new file.
Now that you have configured esxtop as needed run it in batch mode and save the results to a .csv file:
esxtop -b -d 2 -n 100 > esxtopcapture.csv
Where “-b” stands for batch mode, “-d 2” is a delay of 2 seconds and “-n 100” is 100 iterations. In this specific case, esxtop will log all metrics for 200 seconds. If you want to record all metrics make sure to add “-a” to your string.
Or what about directly zipping the output as well? These .csv can grow fast and by zipping it a lot of precious disk space can be saved!
esxtop -b -a -d 2 -n 100 | gzip -9c > esxtopoutput.csv.gz
Please note that when a new VM is powered on, a VM is vMotioned to the host, or a new world is created, it will not show up within esxtop when “-b” is used, as the entities are locked! This behavior is similar to starting esxtop with “-l”.
You can use multiple tools to analyze the captured data.
What is VisualEsxtop? It is a relatively new tool (published 1st of July 2013).
VisualEsxtop is an enhanced version of resxtop and esxtop. VisualEsxtop can connect to VMware vCenter Server or ESX hosts, and display ESX server stats with a better user interface and more advanced features.
That sounds nice, right? Let us have a look at how it works; this is what I did to get it up and running:
Now some simple tips:
There are a bunch of other cool features in there, like color-coding of important metrics for instance. Also, the fact that you can show multiple windows at the same time is useful, and of course there are the tooltips that provide a description of each counter! If you ask me, this is a tool everyone should download and check out.
Let’s continue with my second favorite tool, perfmon. I’ve used perfmon (part of Windows, also known as “Performance Monitor”) multiple times and it’s probably the easiest, as many people are already familiar with it. You can import a CSV as follows:
The result of the above would be:
With MS Excel it is also possible to import the data as a CSV. Keep in mind though that the amount of captured data is insane, so you might want to limit it by first importing it into perfmon, then selecting the correct timeframe and counters, and exporting that to a CSV. When you have done so you can import the CSV as follows:
All data should be imported and can be shaped / modelled / diagrammed as needed.
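If you would rather stay on the command line, the batch output is just a (very wide) CSV, so standard tools work as well. A rough sketch for pulling a single counter column out of the capture created earlier; the header strings differ per host and VM, so find the column number first, and note this simple approach assumes no commas inside VM names:
# List the column headers (first line of the capture), numbered, and find the counter you care about
head -1 esxtopcapture.csv | tr ',' '\n' | grep -in "ready" | head
# Extract the timestamp (column 1) plus the column number found above (245 is just an example)
cut -d ',' -f 1,245 esxtopcapture.csv > ready-only.csv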
Another option is to use a tool called “esxplot“. It hasn’t been updated in a while, and I am not sure what the state of the tool is. You can download the latest version here though, but personally I would recommend using VisualEsxtop instead of esxplot, just because it is more recent.
As you can clearly see in the screenshot above, the legend (right of the graph) is too long. You can modify that as follows:
For those using a Mac, esxplot uses specific libraries which are only available on the 32Bit version of Python. In order for esxplot to function correctly set the following environment variable:
export VERSIONER_PYTHON_PREFER_32_BIT=yes
In environments with a very high consolidation ratio (a high number of VMs per host), it could occur that the VM you need performance counters for isn’t shown on your screen. This happens purely due to the fact that the height of the screen limits what can be displayed. Unfortunately, there is currently no command line option for esxtop to specify specific VMs that need to be displayed. However, you can export the current list of worlds and import it again to limit the number of VMs shown.
esxtop -export-entity filename
Now you should be able to edit your file and comment out specific worlds that are not needed to be displayed.
esxtop -import-entity filename
I figured that there should be a way to get the info through the command line as well, and this is what I came up with. Please note that <virtualmachinename> needs to be replaced with the name of the virtual machine that you need the GID for.
VMWID=`vm-support -x | grep <virtualmachinename> | awk '{gsub("wid=", ""); print $1}'`
VMXCARTEL=`vsish -e cat /vm/$VMWID/vmxCartelID`
vsish -e cat /sched/memClients/$VMXCARTEL/SchedGroupID
Now you can use the outcome within esxtop to limit (l) your view to that single GID. William Lam wrote an article a couple of days after I added the GID section; his approach is a lot simpler than what I came up with.
The software-defined data center is made possible by virtualizing the key components and functionalities of the data center. This started, of course, with virtualizing compute in the initial virtualization wave.
Next, was the virtualization of the network components. VMware’s NSX-V platform has made tremendous waves in the software-defined data center and has allowed organizations to be truly freed from the underlying hardware network components for data center communication.
VMware’s NSX product has certainly matured over the last several releases, with the latest release by VMware being NSX-T 2.1. In this blog, we will walk through the differences between NSX-V and NSX-T and where each fits.
To understand the use case for NSX-T, let’s think about the requirements for NSX-V, which will help us see where NSX-T fits into the VMware SDN ecosystem.
NSX-V (NSX for “vSphere”) is designed for vSphere deployments only and is architected so that a single NSX-V manager platform is tied to a single VMware vCenter Server instance. The NSX-V platform is the original NSX platform that has been around for a few years now.
It is specifically designed with VMware virtual machines in mind as that is the legacy virtualization mechanism for workloads that has been around since the onset of server virtualization.
With NSX-V, organizations are able to mobilize network connectivity between virtual machines and allow those workloads to be connected in ways that were otherwise unable to be delivered efficiently by physical networking hardware.
For the most part, if you want to run a software-defined networking infrastructure within the realm of VMware vSphere, NSX-V is the platform that you are most likely to be using.
NSX-T (NSX “Transformers”) is designed to address many of the use cases that NSX-V was not designed for, such as multi-hypervisor environments. NSX-T is a multi-hypervisor aware SDN stack brought to the likes of vSphere, KVM, OpenStack, Kubernetes, and Docker.
It is designed to address emerging application frameworks and architectures that have heterogeneous endpoints and technology stacks. One of the major use cases for NSX-T is containers. In today’s virtualization landscape, we are seeing more and more applications running in environments outside of virtual machines.
Also important when considering the multi-hypervisor support is the fact that NSX-T has been decoupled from VMware vCenter Server. NSX-T is a standalone solution for vCenter and vSphere environments, but it can also support KVM, public cloud, and containers, and can be integrated into frameworks like Red Hat OpenShift, Pivotal, and others.
One of the major shifts in focus you will see when comparing the two products is that NSX-T is more cloud focused with forward looking functionality.
It also allows organizations more flexibility in choosing the solution that best fits their use case, whether that includes hypervisors, containers, bare metal, or public clouds.
VMware NSX-T is integrated with the VMware Photon Platform, the cloud-centric operating system that VMware developed from the ground up, with the likes of the current vCenter Server running atop this platform. NSX-T also contains the NSX-T Container Networking Interface (CNI) plugin that allows developers to configure network connectivity for container applications, which helps deliver Infrastructure as a Service.
Interestingly, with NSX-T VMware has moved away from the VXLAN-based encapsulation utilized by NSX-V and has adopted the newer “Geneve” encapsulation. This architectural difference makes NSX-T and NSX-V incompatible at the moment.
What is the Geneve encapsulation standard as opposed to the more prevalent VXLAN, especially when there are a lot of hardware devices on the market that support VXLAN?
Geneve is a newly minted encapsulation co-authored by VMware, Microsoft, Red Hat and Intel. Geneve combines the best of the current encapsulation protocols such as VXLAN, STT, and NVGRE into a single protocol. Much has been learned from current network virtualization protocols and as NSX has matured, the need for an even more extensible encapsulation protocol has come to light. Geneve allows inserting metadata as TLV fields which can be used for new features as needed.
Other NSX-T Architecture changes to note:
Thoughts
VMware NSX is certainly evolving, especially with the introduction of VMware NSX-T. VMware is showing commitment to the NSX platform moving beyond simply vSphere environments and including KVM, Openstack, and multiple public clouds. This decoupling from vSphere will certainly attract others to VMware’s ever popular network virtualization platform. There are several key points to note with NSX-T including the following.
Key Points of Interest with NSX-T
It will be interesting to see how VMware handles the two product lines between NSX-V and NSX-T and if the two products will remain separate or VMware will attempt to bring both together at some point in the future.