About VSAN 6.6

From Notes_Wiki

<yambe:breadcrumb>VMWare_VSAN|VMWare VSAN</yambe:breadcrumb>

About vSAN 6.6

vSAN uses Directly Attached Storage (DAS) disks that are physically placed in the same hosts that participate in compute. This provides policy-based, high-performance, redundant storage for VMs without requiring external storage.

The following points can be useful to learn about vSAN:

  • vSAN has a two-tiered architecture. It uses a cache tier (SSD) for caching reads and writes. It uses a capacity tier (SSD or HDD) for storage. The number of capacity drives must be greater than or equal to the number of cache drives claimed per host.
  • Each vSAN node can have 1 to 5 disk groups. Each disk group has one SSD for caching and 1 to 7 capacity disks for storage.
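    As a minimal command-line sketch (the device names below are placeholders, not taken from this article), a disk group can be created and verified directly on an ESXi host:
      # Claim one cache SSD and one capacity disk into a new disk group
      esxcli vsan storage add -s naa.cache_ssd_id -d naa.capacity_disk_id
      # List disks claimed by vSAN along with their tier and disk group membership
      esxcli vsan storage list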
  • vSAN supports two types of architecture
    All-flash
    (vSAN 6.0 onwards) In this case both cache and capacity tiers are SSDs. In this architecture 100% of the cache is used only for buffering writes by default. Reads happen directly from the capacity tier by default.
    In case of all-flash vSAN, deduplication and compression can be enabled. Deduplication is done within the disk group. Deduplication uses the SHA-1 algorithm with a fixed 4K block size. LZ4-compressed data is stored only if a 4K block compresses to <=2KB. Deduplication and compression are disabled by default and can be configured only at the cluster level. Deduplication and compression are performed when data is destaged from the cache tier to the capacity tier.
    RAID-5 and RAID-6 erasure coding can be implemented in case of all-flash vSANs.
    Hybrid
    In this case the cache tier is SSD and the capacity tier is regular HDD. Only one SSD disk can be used for the cache tier in case of hybrid deployments. Deduplication and compression are not available in case of hybrid vSAN deployments. In this type of deployment 70% of the cache capacity is reserved for reads and 30% for writes by default.
  • All-flash and hybrid hosts cannot be mixed together in same vSAN cluster.
  • Before deploying or enabling vSAN on a cluster, the following requirements must be met
    • A kernel port for vSAN should be created on all hosts which are part of the cluster. vSAN storage I/O traffic always uses the vSAN kernel port. Both standard and distributed virtual switches are supported. In case of DVS, VMware vSphere Network I/O Control should be enabled.
    • vSAN license should be present
    • At least one SSD disk should be present on each host for the cache tier. At least one or more SSD or HDD should be present on each host for the capacity tier.
    • For all-flash vSAN a 10G NIC should be available. For hybrid vSAN a 10G NIC or a dedicated 1G NIC should be available.
    • Multicast should be enabled on the network for vSAN nodes to be able to communicate, in case of vSAN versions older than 6.6. Note that multicast is used only for object creation, changes in object status and publication of statistics. It is not used for actual vSAN I/O.
      • In case multicast is used it is better to either have a separate VLAN for vSAN traffic or enable IGMP snooping on L2 switches. By ensuring all vSAN kernel ports are in the same network, the requirement of L3 multicast support from routers can be avoided.
    • In case multiple vSAN clusters use the same L2 network, the multicast address for all hosts of one of the clusters must be manually changed so that the clusters remain independent.
      • vSAN 6.6 onwards unicast is used instead of multicast for vSAN communication, with multicast supported only for backward compatibility. Once all nodes are upgraded to vSAN 6.6 the cluster converts to unicast mode. New vSAN 6.6 deployments always use unicast mode. We can go to vSAN -> Configure -> General and look at the "Networking mode" value to see whether vSAN is operating in multicast or unicast mode (the command-line sketch after this requirements list shows how to check the same from a host).
    • At least 3 hosts should be available for default vSAN storage policy of Failure To Tolerate (FTT=1). For RAID 5 at least 4 hosts are required and for RAID 6 at least 6 hosts are required.
      • Two node configuration is supported but requires a witness node
    • Underlying RAID card should support pass-through or RAID 0 with caching disabled. Passthrough controllers are preferred.
    • The hardware should be part of VMWare Hardware Compatibility List (HCL) for corresponding vSAN version.
    • At least 32GB RAM per host is recommended.
    • Ideally all nodes should run same version of vSphere and vSAN.
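    As a hedged command-line sketch of verifying the cluster and networking related requirements from an ESXi host (exact output differs per vSAN version):
      # Show vSAN cluster membership and whether the local node has joined
      esxcli vsan cluster get
      # List VMkernel interfaces tagged for vSAN traffic (includes multicast addresses when multicast is used)
      esxcli vsan network list
      # On vSAN 6.6 in unicast mode, list the unicast agents (peer hosts) known to this node
      esxcli vsan cluster unicastagent list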
  • There are VxRail and VxRack vSAN ready nodes from Dell EMC that can be used to deploy vSAN or come with vSAN deployed already.
  • Following different types of RAID are present:
    RAID 0
    Striping : Split data into blocks and write contiguous blocks to different disks for faster read/write. No redundancy. Failure of a single disk leads to data loss.
    RAID 1
    Mirroring : Mirror data of one disk to another disk. Reads are faster as data can be read from both places. Failure of a single disk does not lead to loss of data. Storage capacity is reduced by 50%.
    vSAN supports two-way, three-way and four-way mirroring
    RAID 10
    Striping + Mirroring : Better than RAID 1; even though storage capacity is still reduced by 50%, read and write throughput is better due to striping of blocks across disks.
    RAID 5
    Striping + Parity : Data is split into stripes and written. For each stripe one of the members stores parity instead of data. Thus, if four members of size X participate in RAID 5, total capacity is only 3X and not 4X, as the equivalent of one member is used for storing parity. This gives good performance and reads are spread across members. Writes have a small overhead of parity calculation. Loss of one disk can be tolerated.
    For using RAID 5 erasure coding in vSAN minimum 4 hosts are required, as data is stored in a 3+1(parity) manner across hosts. This also requires all-flash vSAN.
    RAID 6
    Striping + Double parity : Same as RAID 5 but in this case there are two parity blocks per stripe.
    For using RAID 6 erasure coding in VSAN minimum 6 hosts are required as data is stored in 4+2(parity) manner across hosts. This also requires all-flash vSAN.
  • vSphere HA, DRS and vMotion work properly on top of vSAN datastore. ESXi hosts cannot boot from vSAN datastore. Also Raw Device Mapping (RDM) is not supported for vSAN datastores.
  • Apart from vSAN, vSphere supports Fibre Channel (FC), Fibre Channel over Ethernet (FCoE), iSCSI, Directly Attached Storage (local datastores), NFS and vSphere Virtual Volumes. In case of FC, FCoE, iSCSI and DAS the VMFS clustered file-system is deployed before VM disks are created as files on top of the datastore. VMFS allows multiple hosts read/write access to the same block storage.
  • Irrespective of type of underlying physical storage device, VMs always see disks as SCSI drives connected to SCSI controllers.
  • All hosts in a vSAN cluster contribute to one single vSAN datastore; only one vSAN datastore can be created per cluster.
  • Maximum 64 hosts can be part of a single vSAN cluster. vSAN supports a maximum of 200 VMs per host. In case of a vSAN stretched cluster there can be a maximum of 30 hosts with 15 hosts on each site (excluding 1 witness).
  • vSAN datastore stores various objects (e.g. VM disk images) using Virtual Machine Storage Policies. The policy should be created before the VM or disk is created. The policy can be changed at any point in time. A policy which is used by at least one object cannot be deleted. Different disks of a VM can be stored with different vSAN storage policies. The policy only affects VM home (VM folder including .vmx file, .log file) and VM disks (.vmdk). It does not affect VSWP (.vswp) and VMEM (.vmem) objects, which always use RAID 1 and PFTT=1 (Primary Failure To Tolerate). VSWP objects also use 100% space reservation. Virtual machine snapshot deltas (-00000#-delta.vmdk) get the same policy as the disk for which the snapshot is taken.
    vSAN monitors and reports VMs against the policy for compliance. If a VM becomes non-compliant (e.g. due to host failure) or a new policy is applied, then vSAN takes remedial actions by reconfiguring objects so that the VM is again compliant.
    Multiple policies can be created for the same vSAN datastore. If a storage policy is not explicitly assigned to a VM during provisioning, the default vSAN datastore policy is used for the VM.
    Default storage policy has the following settings (the per-object-class defaults can also be inspected from the command line; see the sketch after these settings)
    FTT
    1 (0 to 3). For VSWP and VMEM FTT=1 irrespective of VM storage policy.
    stripe
    1 (1 to 12). In all-flash 1 is ideal. In hybrid configurations this can be changed, if required, only for high-performance virtual machines. Since writes go to flash (cache), increasing this has very little impact on writes unless the cache is overwhelmed and SSD garbage collection causes a write-through situation. For reads, where not many reads come from cache, having stripes can increase performance as multiple magnetic disks would participate in the read.
    VM Home object which includes (.vmx, .log, etc.) uses stripe value of 1 irrespective of value specified in VM storage policy.
    IOPS limit
    0 (unlimited). This is available from vSAN 6.2 onwards. A single normalized I/O can read/write a maximum of 32KB. Smaller block sizes do not increase the number of IOPS; any block size up to 32KB gets the configured number of IOPS. For blocks of size 64KB the effective number of IOPS would reduce to half, as each operation is treated as 2 x 32KB operations from a calculation point of view.
    FTT method
    RAID 1 (RAID 1, 5 or 6). Only modifiable in all-flash configurations. In vSAN 6.2 and later, when erasure coding is selected, RAID 5 is used in case of FTT=1 and RAID 6 is used in case of FTT=2. FTT=3 cannot be used with erasure coding; mirroring (RAID 1) must be used in case of FTT=3.
    Flash read cache reservation
    0% (0 to 100%). Only available in hybrid configurations. All-flash configurations use the cache only for write buffering (not for read caching). Setting this is not required to obtain a read cache in hybrid deployments; it is required only if a reservation should be forced for read-intensive VMs.
    VM Home object which includes (.vmx, .log, etc.) uses 0% read reservation irrespective of value specified in VM storage policy.
    Force provisioning
    no (yes or no). This allows VMs to be created with a particular storage policy even if the number of disks or hosts needed to support the corresponding policy is not present during VM deployment. Later on, if the resources become available, vSAN makes the object compliant.
    disable object checksum
    no (yes or no). Checksums are used to detect data corruption and attempt automatic repair. vSAN automatically scrubs the complete data once per year. The period can be modified using the advanced ESXi host setting 'VSAN.ObjectScrubsPerYear'.
    object space reservation
    0% (0 to 100). If compression and deduplication are enabled only 0 or 100 can be used; values in between are not available in such cases. Reserved storage is thick provisioned (lazy zeroed) and the remainder is thin provisioned. In case of VSWP, 100% space is reserved irrespective of the object space reservation value of the assigned storage policy.
    VM Home object which includes (.vmx, .log, etc.) uses 0% object space reservation irrespective of value specified in VM storage policy.
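    As a sketch, the per-object-class defaults mentioned above (e.g. forced 100% space reservation for swap objects) can be inspected on any ESXi host in the cluster:
      # Show the default vSAN policy per object class (vdisk, vmnamespace, vmswap, etc.)
      esxcli vsan policy getdefault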
  • For n failures to be tolerated n+1 copies of data are required. For n failures to be tolerated 2n+1 hosts contributing to storage are required.
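    For example, to tolerate n=2 failures with mirroring, 2+1 = 3 copies of the data and 2*2+1 = 5 hosts contributing to storage are required.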
  • When a virtual machine is created we limit the maximum amount of memory the VM can use (e.g. 4GB). We can also reserve a minimum amount of memory that the VM should get from the underlying host RAM (e.g. 1GB). Thus, the VM has a total of 4GB of RAM out of which at least 1GB is from real physical RAM of the host. In case of memory overcommitment the remaining 3GB can come from the VSWP file created on top of the datastore. Thus, if a VM with 4GB RAM and 1GB reservation is started, a 3GB thick provisioned VSWP file is created by default. This behavior can be changed to create a thin provisioned VSWP file by setting the ESXi advanced system setting VSAN.SwapThickProvisionDisabled to 1. The default value is 0 to ensure thick provisioning for swap. However, this should not be done in case memory is overcommitted and swap files are likely to be used for providing memory to VMs.
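    A minimal sketch of changing this setting on an ESXi host (advanced option path as commonly documented; verify on the target version):
      # Allow vSAN swap objects to be thin provisioned instead of 100% reserved
      esxcli system settings advanced set -o /VSAN/SwapThickProvisionDisabled -i 1
      # Verify the current value
      esxcli system settings advanced list -o /VSAN/SwapThickProvisionDisabled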
  • When virtual machine snapshots are taken while VM is running there is option to take running VM memory snapshot as well. Such snapshots of running memory are stored in VMEM files.
  • vSAN stores data (e.g. vmdk) in the form of objects. Objects have one or more components which are stored on stripes. A component cannot be larger than 255 GB (VSAN.ClomMaxComponentSizeGB). Thus any VM disk larger than 255GB must be split into two or more components. Components are transparently assigned caching and buffering capacity from the cache device. Components are stored on capacity devices at rest. A vSAN object is divided into components based on the vSAN storage policy, which dictates the number of stripes (1 to 12) and the number of copies or failures to tolerate. A single ESXi host can have a maximum of 9000 components. A stripe width of 1 is good for all-flash and most other configurations. Stripes are distributed across drives but not necessarily across hosts.
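    For example, a 600GB VM disk with stripe width 1 and FTT=1 (RAID 1) would be split into at least three components per mirror copy (600GB / 255GB rounded up), plus a witness component, all counting towards the 9000-component limit of the hosts on which they are placed.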
  • vSAN integrates with vCenter using VMware vSphere API for Storage Awareness (VASA) version 1.5. In case of vSAN the storage provider is available on the ESXi host itself. vSAN automatically registers and configures a storage provider for each host in a vSAN cluster as part of enabling vSAN. This is different from external providers which provide Virtual Volumes, where storage providers might have to be registered manually.
  • For tie-breaking in case of split-brain scenarios, witnesses are created. This indirectly implies a minimum 3-host requirement for a vSAN cluster. Ideally, since vSAN cluster nodes might have to be taken down for maintenance, a 4-node configuration is even better. Witness components are quite small (at most a few MB) and contain checksums and metadata, not the whole data. Witnesses are created automatically as per requirement.
  • vSAN 5.5 snapshots use vmfsSparse, for which performance degrades over time as the number of snapshots or the use of snapshots increases. vSAN 6.0 onwards vsanSparse delta disks are used. These give performance close to native even when the number of snapshots increases or snapshots are used for longer periods. vsanSparse supports the full limit of 32 snapshots. vSAN 5.5 used VMFS-L with on-disk format version 1. vSAN 6.2 uses vSAN FS with version 3. In vSAN 6.2 deduplication and compression, RAID 5/6 erasure coding and swap object efficiency were introduced. vSAN 6.6 uses vSAN FS with version 5.
    The vsanSparse format is used automatically if the vSAN on-disk format version in use is >= 2 and no older vmfsSparse/redo-log format snapshots exist for the particular VM.
    vsanSparse snapshot format uses 512-byte allocation unit size called a grain. (Older vmfsSparse used 1MB unit size). vsanSparse expands in 4MB chunks.
  • Commands starting with 'esxcli vsan' provide command-line options for working with vSAN.
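    A few commonly used examples (a sketch; available namespaces vary slightly between vSAN versions):
      esxcli vsan cluster get           # cluster membership and local node state
      esxcli vsan network list          # VMkernel interfaces used for vSAN
      esxcli vsan storage list          # disks claimed by vSAN and their disk groups
      esxcli vsan policy getdefault     # default policy per object class
      esxcli vsan health cluster list   # health check results (vSAN 6.6 onwards)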
  • vSAN architecture includes four major software components
    Cluster Level Object Manager (CLOM)
    CLOM manages placement and migration of objects. It also distributes components across ESXi hosts.
    Distributed Object Manager (DOM)
    DOM manages data paths for the objects. Applies configuration dictated by CLOM. It also handles various errors.
    Log Structured Object Manager (LSOM)
    Reports events for devices and their states. Also assists in recovery of objects.
    Cluster Monitoring Membership and Directory Services (CMMDS)
    Establishes and maintains a cluster of node members. It also elects the owners for the different objects.
  • vSAN Ports
    vSAN clustering service
    UDP 12345, 23451. This is required only in case of multicast mode.
    vSAN Transport
    TCP 2233
    VASA Provider
    TCP 8080
    vSAN Observer
    TCP 8010
    Reliable Datagram Transport
    Port 2233 (Only in case fault-domains are implemented and ESXi hosts are located in different DCs)
    Cluster Monitoring, Membership and Directory Service
    Ports 12345, 23451 (Only in case fault-domains are implemented and ESXi hosts are located in different DCs)
  • vSAN can use one active and one failover link. Use of two links in active-active mode for bandwidth aggregation is not supported. If bandwidth aggregation is desired, then first a LAG using Link Aggregation Control Protocol (LACP) should be formed and then the LAG should be used as the active adapter for the corresponding kernel port instead of a specific uplink.
  • vCenter Server Appliance Installer can install vCenter on a host's vSAN datastore even before vSAN is configured using vCenter. This reduces the complexity of green-field deployments by eliminating the need for external storage while deploying vCenter on nodes planned for vSAN. In this case the data-center name and cluster name must be configured during vCenter installation. Also cache and capacity disks for vSAN need to be selected during vCenter deployment.
  • While enabling or disabling vSAN on an existing cluster first vSphere HA should be disabled. vSAN cannot be enabled or disabled while vSphere HA is enabled. After changing vSAN status (enable/disable) vSphere HA can be enabled again.
  • Prior to vSAN 6.6 there was an option for automatic disk claiming during cluster creation. vSAN 6.6 onwards the automatic disk claiming option has been removed.
  • With vSAN 6.6 the VMware Host Client can be used to create a vSAN datastore on a standalone host. This is not useful for production but can be useful for testing or a small lab. This requires marking capacity disks from the command line using the isCapacity flag before enabling vSAN. The vSAN cluster configuration done on such a standalone host must be removed before the host can add its drives to a vSAN cluster in vCenter Server.
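    A hedged sketch of this standalone-host flow from the ESXi command line (device name is a placeholder; the capacity tag applies to all-flash setups):
      # Mark an SSD as a capacity tier device (the isCapacity flag mentioned above)
      esxcli vsan storage tag add -d naa.capacity_ssd_id -t capacityFlash
      # Create a single-node vSAN cluster on the standalone host
      esxcli vsan cluster new
      # Later, before the host joins a vCenter managed vSAN cluster, leave the standalone cluster
      esxcli vsan cluster leave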
  • If a vSAN host is put into maintenance mode the overall storage capacity decreases, as a host in maintenance mode does not contribute to the overall storage capacity. While putting a host in maintenance mode the following options are available (a command sketch follows these options):
    Evacuate all data
    This moves all data to other hosts, irrespective of whether the data would remain accessible if the host went offline. This should be used when removing a host permanently or doing an operation that may lead to permanent host failure.
    Ensure data accessibility
    This moves data with FTT=0 and some other metadata. Data with FTT=1 or higher is not moved, as it remains accessible via another copy. Use this in case the host is being put under maintenance temporarily for a reboot, adding RAM, adding disks, installing drivers, etc.
    No data evacuation
    This does not move any data. Some active objects might become inaccessible while using this mode. This should be used only if the host would remain online (not powered off and not disconnected) while in maintenance mode.
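    These options correspond to the vSAN data migration modes of the host maintenance-mode command; a minimal sketch:
      # Enter maintenance mode ensuring vSAN object accessibility, with a 5 minute timeout
      esxcli system maintenanceMode set -e true -m ensureObjectAccessibility -t 300
      # Other vSAN modes are evacuateAllData and noAction; exit maintenance mode with
      esxcli system maintenanceMode set -e false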
  • In vSAN 6.6 an encryption feature for capacity disks is introduced. To use encryption a Key Management Server (KMS) which operates over Key Management Interoperability Protocol (KMIP) 1.1 is required. vCenter acts as a KMIP client and requests keys over a secure channel from the KMS using the KMIP protocol, which works on top of IP. Ideally KMS and vCenter should not be placed on top of the encrypted vSAN datastore.
  • vSAN can be used to create iSCSI targets and LUNs. 1024 LUNs can be created in a single vSAN cluster. 128 targets can be created in a single vSAN cluster. 256 LUNs can be mapped to a single target. A maximum LUN size of 62TB is supported. While enabling iSCSI, the kernel port (e.g. vmk0) on which TCP port 3260 would be used to serve iSCSI targets must be selected.
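    A sketch of checking the iSCSI target service from a host, assuming the esxcli vsan iscsi namespace available from vSAN 6.5 onwards:
      esxcli vsan iscsi status get      # whether the vSAN iSCSI target service is enabled
      esxcli vsan iscsi target list     # list configured iSCSI targets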
  • vSAN has proactive rebalance and reactive rebalance to balance load across disks. If the usage difference between disks exceeds 30% (e.g. drive A with 45% load and drive B with 10% load, a 35% difference), then reactive rebalance is started automatically. Proactive rebalance can be started from the Health monitoring tab at any time. An ongoing rebalance can also be stopped from the Health page.
    If any disk is utilized more than 80%, rebalance is automatically triggered. Hardware failures and a host going into maintenance mode can also trigger reactive rebalance. Ideally at least 30% space should be free on any vSAN datastore to avoid too many rebalance operations.
  • vSAN 6.6 onwards throttle resync option is provided. This can cap bandwidth used for vSAN resync. This should be used only under direction of technical support.
  • Compression and deduplication is a cluster-wide setting and can be enabled / disabled on a running cluster. However, if a cluster has only 4 hosts and is using RAID-5, it is already at the lowest number of hosts required for FTT=1 on top of RAID 5. In such cases, for the rolling reformat the 'Allow Reduced Redundancy' option must be enabled, as during implementation the effective FTT would reduce.
  • Fault domains can be used to make vSAN rack aware. Minimum three fault domains are required. A minimum of four fault domains is recommended to support various data evacuation and data protection options. Each fault domain should have at least 1 ESXi host.
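    As a sketch, the fault domain assignment of a host can be checked from its command line with: esxcli vsan faultdomain get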
  • vSAN 6.1 introduced stretched clusters, where two DCs act as data sites and a third DC acts as witness. The two data DCs must have an identical number of vSAN hosts. The usage should be below 50% so that VMs can fail over to the working site. The witness is required only for quorum and can be created using direct OVA deployment without going through witness setup from scratch. The witness does not contribute towards compute and storage. In such cases each site is its own fault domain and only FTT=1 is supported. vSphere Fault Tolerance (FT) does not work on top of a stretched cluster.
    • Since vSAN 6.6, Failure To Tolerate has been replaced with Primary Failure To Tolerate (PFTT) to better support stretched clusters. A second setting, Secondary Failure To Tolerate (SFTT), was also introduced. PFTT=1 allows one of the two sites to fail. Only RAID-1 is supported across sites. SFTT=1 ensures that host failures within a single data site can also be tolerated. RAID 5 and 6 are also supported within the same site.
    • In a single site 10Gbps bandwidth and <5ms RTT is expected. Across data sites <5ms RTT is expected along with high bandwidth close to 10Gbps for most workloads. RTT between the witness and data sites can be higher. For a two-node cluster latency between data and witness site can be as high as 500ms. For up to 10 hosts on each data site, latency between data and witness site should be less than 200ms. If there are more than 10 hosts on each data site (max 15 supported), latency should be less than 100ms between the data sites and the witness site. The witness site should have about 2Mbps bandwidth for every 1000 objects.
    • A stretched cluster should have VMkernel connectivity for vMotion, Management and vSAN. VM network connectivity is also required for fail-over or migration. The recommended vSAN connectivity between data sites is stretched L2. Between data sites and the witness site the recommendation is to use L3 connectivity.
    • One of the two data sites should be defined as the preferred site. During a network partition, VMs of the other site are started on the preferred site.
    • Unlike local vSAN RAID 1 where reads are distributed across both copies, in case of a stretched cluster all reads are performed using the local copy to reduce latency.
    • If a physical ESXi host is configured as witness it requires a standard license. If the witness VM appliance is used it comes with its own embedded license and a separate license is not required. In either case, once a host is designated as witness, VMs cannot be started on top of that host. In case of a physical ESXi host existing VMs continue to run, but once these VMs are powered off they cannot be started on the same witness host again. Note that an ESXi host to be used as witness must be a vSAN node with cache+capacity tiers and related licensing.
    • A stretched cluster can be created using a wizard where primary and secondary fault domain ESXi hosts are specified, followed by the witness host. Once the stretched cluster is created, DRS should be used to ensure that VMs of one site run on hosts of that site.
    • When data is already replicated at the application level (such as Microsoft AD additional domain controllers, SQL Server AlwaysOn or Oracle RAC), then PFTT can be 0 and Site Affinity can be set to one of the two data sites. This ensures that data is not replicated at the storage level again, since it is already being replicated at the application level.
    • A heartbeat is sent every second between the data sites and between the witness and data sites. In a stretched cluster an available ESXi host on the preferred site is chosen as master. Similarly an available ESXi host on the secondary site is chosen as backup. If communication fails for 5 consecutive heartbeats (5 seconds), the corresponding component is assumed to have failed and a new one is tried from the same or alternate site as applicable.
    • A two-node cluster and a stretched cluster are the same architecture-wise and have similar deployment steps. The difference is that in case of a 2-node cluster both nodes are in the same site, while in case of a stretched cluster nodes are spread across sites.
  • vSAN components have following different states
    Active
    Healthy and functioning
    Reconfiguring
    Components on which storage policy changes are being implemented
    Absent
    No longer available due to a failure
    Stale
    No longer in sync with other components of same vSAN object
    Degraded
    Not expected to return due to a detected failure
  • The default ClomRepairDelay is 60 minutes. If this delay is changed, clomd must be restarted on each host. After waiting for this many minutes vSAN starts repair without waiting any longer for absent components. Note that the 60-minute waiting period is used only in case of absent components (e.g. host failure); in case of degraded components (e.g. drive failure) rebuild is started immediately.
    There is an option to "Repair Object Immediately" from the health page without waiting for the 60-minute delay.
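    A minimal sketch of changing the delay on an ESXi host (repeat on every host in the cluster and restart clomd as noted above):
      # Increase repair delay from 60 to 90 minutes
      esxcli system settings advanced set -o /VSAN/ClomRepairDelay -i 90
      # Restart clomd so that the new value takes effect
      /etc/init.d/clomd restart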
  • If an SSD acting as cache tier fails, the entire disk group is considered degraded. Hence, if possible, multiple disk groups should be created on each host so that failure of a single cache disk does not lead to the complete host being degraded from the vSAN point of view. Similarly, failure of the RAID controller on an ESXi host also causes all disk groups formed via that controller to degrade.
  • vSAN health service provides health reports to vSAN administrators. It checks health of the cluster, whether vSAN kernel ports are able to communicate between hosts, health of drives participating in vSAN, driver and firmware compatibility against the VMware HCL, etc. among other things. In case health is not good, troubleshooting steps and relevant KB article links are suggested via the "Ask VMware" button.
    vSAN 6.6 onwards a Configuration Assist tool is provided to automatically fix some of the identified issues which might cause vSAN to malfunction or which are against VMware recommendations.
  • vSAN health service is enabled by default. vSAN performance service is disabled by default.
  • Performance service monitors performance-based metrics at cluster, host, virtual-machine and virtual disk levels. Performance service creates performance history database stored as stats object. Data collected remains in stats object for 90 days. Performance history database can consume up to 255 GB storage on vSAN datastore.
  • Health tests are run every 60 minutes. A re-test can also be started manually.
  • Health service tests various items and conditions
    Cluster
    Advanced configurations, deduplication/compression consistency, drive format, disk groups, CLOMD liveness, drive balance
    Network
    VMkernel port, subnets, connectivity issues, MTU check, hosts with vSAN disabled
    Data
    Object health
    Physical disk
    Overall drive health, congestion, drive capacity, memory pools, metadata health
    Hardware compatibility
    Controller compatibility, issues retrieving hardware information, up-to-date HCL database
    Limits
    Host component limits, vSAN component limits, drive space, current cluster situation


