I was listening to an episode of the Pure Report recently where Rob Ludeman interviewed Andrew Miller:
Also a post on ransomware and FlashBlade:
It’s a good listen–and it did get me thinking about vVols (like most things do these days). Before I get into that though… We (Pure) are doing a fair amount around helping customers protect against, or at least easily recover from ransomware attacks. My personal thinking around this is certainly still evolving, and I have a fair amount to learn, but here are a few things I think are important points.
- Ransomware attacks do not begin and end with encryption of your data. Generally, once an attacker gets in they find out what they can do. What can they access? What can they disable? Can they disable your protection? It is worth their time to figure the answers to these questions out. The more damage they do to your protection, the more likely they will get paid.
- You need to ASSUME that the attacker has gained administrative credentials. In building your protection, good RBAC is a part of it, but not the be-all and end-all. The attacker could even be a disgruntled sysadmin; it doesn’t have to be a shadowy figure in a cave.
- Look at the forest and the trees. Protection requires consideration of each component (as an admin of this piece of the infrastructure how can I protect what I am in charge of?) and consideration of the entire infrastructure (how do I protect my business if an entire part of my stack gets compromised?).
- Prevention, insulation, detection, mitigation, and restore. My five phases of ransomware.
- How can I prevent it?
- How can I reduce the blast radius if one part or many get successfully attacked?
- Can I detect it?
- How can I stop it?
- How would I restore and how quickly?
- When did the attack actually start? Restoring to a non-encrypted version doesn’t mean it isn’t infected. Having access to longer-term points-in-time, while still having fast restore, is important.
So with these things in mind, I started thinking about my role. At Pure I am responsible for our VMware solutions. Is there something I can do to help? Best practices? Integrations? Guidance? The short answer is all of those things of course. And I have a lot of plans around this. But what can you do today?
So the first thing is what do we have on the FlashArray that can help? Let’s take a look at a few things:
- Immutable, efficient snapshots.
- Definition: Once a snapshot has been created, it cannot be changed. Furthermore everything on a FlashArray is globally reduced (deduped) so the footprint of this snapshot existing is not expensive. Since they are meta-data based they are independent objects that just share pointers. So there is no redirect-on-write or performance penalty to the source volume.
- Benefit: Not being able to change the data of a snapshot is critical to protection. FlashArray snapshots can also be restored from instantly, decreasing the time to restore a data set after an attack. The low-capacity footprint of a snapshot allows for more data sets to be protected, for longer.
- Protection Group Snapshot Policies.
- Definition: Protection groups allow you to have the FlashArray take snapshot creation to a policy-based level. You can have local snapshots created of a volume or specified set of volumes (or all of the volumes for a specific host or host groups) on a schedule with a specific short-term and long term retention level.
- Benefit: Automated protection of volumes, with long term protection and specified retention. Still providing instant restore.
- Protection Group FlashArray Snapshot Replication.
- Definition: Protection groups can also replicate snapshots to different targets. These can be other FlashArrays or Cloud Block Store (FlashArray software in AWS and soon Azure) instances. Provide specific replication intervals as well as short and long term retention policies.
- Benefit: Protect data sets to remote FlashArrays or CBS instances to provide additional capacity for longer term retention and/or more frequent points-in-time. Since they are still FlashArray snapshots and replication preserves data reduction, footprint is smaller and restore is quicker even from remote targets.
- 3rd Party Snapshot Offload
- Definition: Use the same protection group features (intervals, short and long term retention) to send those snapshots to generic NFS targets, or even AWS S3 or Azure Blob. Restore is fast, in particular with FlashBlade as a target.
- Benefit: Provide much more capacity that you might not have available on-premises or on your FlashArrays, for much longer retention for many more volumes. Since the snapshots preserve data reduction restore time is not limited to just the throughput of the target–we need to just pull back the unique, reduced snapshot you need.
- Volume level Data Reduction reporting.
- Definition: Report on the data reduction of a single volume or snapshot.
- Benefit: Encryption is the enemy of data reduction. Data reduction works by identifying patterns, and encryption removes the ability to identify patterns. So seeing data reduction go from 5:1 (which is common in VMware environments) to 1:1 means the data was likely encrypted. This gives you insight into something happening in your environment, when it started, and which points-in-time are not affected.
- Eradication
- Definition: When you delete a data object (volume, snapshot, protection group, protection group snapshot, pod) it goes into a 24 hour pending bucket. So a deletion is not permanent immediately.
- Benefit: Tools and integrations that leverage the FlashArray often delete objects but do not eradicate them; they let the FlashArray take care of it. Therefore, if someone deletes objects using something like our vSphere Plugin, the objects are still protected by the timer. They would need to get into the FlashArray and eradicate the objects manually. So if access to the FlashArray is secure, objects are still protected for a window.
- Safe Mode
- Definition: Disabling manual eradication. This means that no administrator (at any customer level) can log in to the FlashArray and fully eradicate objects from it. Only time, as dictated by the eradication timer, can do this.
- Benefit: This fully protects objects on the FlashArray. If someone gets administrative credentials to the FlashArray, they still cannot fully delete objects. If someone deletes everything on a FlashArray, you have 24 hours to recover from this. This allows for protection even if someone gets a hold of (or simply has) admin credentials to your entire environment.
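To make the interaction between the eradication timer and Safe Mode concrete, here is a minimal sketch in plain Python. It is a toy model of the behavior described above, not FlashArray code: the `ArrayObject` class, the `manual_eradicate` function, and the `safe_mode` flag are hypothetical names invented for illustration, and the only value taken from this post is the 24-hour window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# From the post: a deleted object sits in a 24-hour pending bucket.
ERADICATION_WINDOW = timedelta(hours=24)


@dataclass
class ArrayObject:
    """Hypothetical stand-in for a volume, snapshot, protection group, or pod."""
    name: str
    destroyed_at: Optional[datetime] = None  # None means the object is live

    def destroy(self, now: datetime) -> None:
        # A delete only marks the object as pending eradication.
        self.destroyed_at = now

    def recoverable(self, now: datetime) -> bool:
        # Recoverable while destroyed and still inside the window.
        return (
            self.destroyed_at is not None
            and now - self.destroyed_at < ERADICATION_WINDOW
        )

    def recover(self, now: datetime) -> None:
        if not self.recoverable(now):
            raise ValueError(f"{self.name} is not recoverable")
        self.destroyed_at = None


def manual_eradicate(obj: ArrayObject, safe_mode: bool) -> bool:
    """Return True if a manual eradication request would be honored."""
    if safe_mode:
        return False  # with Safe Mode, only the timer can eradicate
    return obj.destroyed_at is not None  # only destroyed objects can be eradicated
```

In this model, an attacker with admin credentials can call `destroy` on everything, but with `safe_mode=True` the objects remain recoverable until the window runs out.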
Note: FlashBlade also has Safe Mode which goes further in some wonderful ways – see the end of the post I linked above for details.
Obviously all of this is just a part of your strategy–we have more things we (Pure) can do (and are working on) to make this better. We are putting a lot of energy in understanding the issues here and how we can better support our customers in each “phase” of the ransomware cycle.
Okay, so what about vVols?
vVols and Ransomware
One of the core benefits of vVols is that it allows you to use the features of your array the way it was designed. I think this is really the core benefit that underlies everything else.
Let’s walk through some examples.
But one thing I want to re-iterate:
I am NOT arguing that this replaces your backup products. I am not arguing that this is a panacea. The point I am trying to make is how to effectively, or more effectively rather, use the features available to you to better protect your environment. In this case specifically how vVols can potentially help you do this better. This is somewhat tactical (how do I better protect the VMs hosted on my FlashArray). But remember: ransomware management must also be looked at holistically.
Snapshot Protection
Whether this is local snapshots, FlashArray snapshot replication, or offloads there is a fundamental problem here with VMFS. I assign protection to a datastore, not a VM, not a virtual disk. So if I Storage vMotion a VM to a different VMFS, is that VMFS still protected? If I moved it and something happens, was the old VMFS protected? Which snapshot is it on? If I don’t remember where it is, how do I find which VMFS snapshots have the VM?
vVols, on the other hand, do not separate the storage from the VM. When you create an array-based snapshot, you are taking a snapshot of the VM itself (or its disk). So whether it is a single snapshot or a protection group snapshot, the VM has its protection, and it is easy to see whether it was, and is, protected.
Undelete
If a VM is deleted on VMFS, the destroy/eradication isn’t really applicable. Sure if you delete the whole datastore, but if you delete a VM or a disk (far more common and a native function in vSphere), it just wipes it from the VMFS file system (ATM machine). There is no sub-object that goes into the eradication bucket. Just like the Windows recycling bin–if you delete the PPT file it will go there, but if you open up the PPT and delete all of the slides, it does not help you. vVols make the PPT like each slide being its own file.
A vVol VM is just a volume group and a bunch of volumes on the FlashArray. So when you delete the whole VM or just one disk, the volume group and/or volumes go into the eradication pending state. They can then be recovered for the eradication window. Without any pre-created snapshots or backup. This is a good example of vVols allowing you to take advantage of an array feature as it was really intended.
Restore
Since snapshots on the array (or replicated snapshots) are granular to the virtual disk, you can quickly restore the point in time of the VM or the individual disk from the snapshot on the array. The process to restore a VM or VMDK on VMFS is to find a snapshot of the datastore, hope the VM is on it, then resignature the whole datastore, register the VM and power it on. If you need a different point-in-time, you have to do the whole process again.
With vVols, it is simply find the snapshots of the VMs and re-register the VM. Or create a new blank VM and copy the snapshots to the new volumes. Boot up. If you need a different point-in-time, power down and copy from new snapshots and power back on. Only want to restore the data? Then just restore from that.
Data Reduction Reporting
The encryption process in ransomware is going to take one of two forms:
- Encrypt individual file systems in the VMs (most likely)
- Encrypt the entire datastore
The first scenario is the most likely. Let’s take the scenario that there are 500 VMs on a datastore. If one by one they get encrypted, that could take a while. Therefore, restoring the VMs might require them to be from different snapshots, therefore requiring you to bring up many copies of the same datastore to restore the VMs on it. Snapshot A from 24 hours ago has a good copy of VM A, but VM B was encrypted 25 hours ago, so I need the snapshot before that too.
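The restore-point problem above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the snapshot schedule, the VM names, and the per-VM encryption times are made up, and the helper names (`latest_clean_snapshot`, `restore_plan`, `snapshots_needed`) are hypothetical, not part of any Pure or VMware tooling.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional


def latest_clean_snapshot(
    snapshots: List[datetime], encrypted_at: datetime
) -> Optional[datetime]:
    """Newest snapshot taken strictly before encryption began, if any."""
    clean = [s for s in snapshots if s < encrypted_at]
    return max(clean) if clean else None


def restore_plan(
    snapshots: List[datetime], encrypted_at_by_vm: Dict[str, datetime]
) -> Dict[str, Optional[datetime]]:
    """Map each VM to the snapshot it should be restored from."""
    return {
        vm: latest_clean_snapshot(snapshots, started)
        for vm, started in encrypted_at_by_vm.items()
    }


def snapshots_needed(plan: Dict[str, Optional[datetime]]) -> List[datetime]:
    """Distinct snapshots the plan touches (VMs with no clean snapshot are skipped)."""
    return sorted({s for s in plan.values() if s is not None})


# Assumed example: datastore snapshots 36, 24, and 12 hours ago.
now = datetime(2021, 1, 3, 0, 0)
datastore_snapshots = [now - timedelta(hours=h) for h in (36, 24, 12)]
plan = restore_plan(
    datastore_snapshots,
    {
        "VM A": now - timedelta(hours=23),  # the 24-hour-old snapshot is still clean
        "VM B": now - timedelta(hours=25),  # needs the snapshot before that
    },
)
```

With these assumed times, `snapshots_needed(plan)` contains two different datastore snapshots, so on VMFS you would be bringing up two copies of the same datastore to recover just two VMs.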
Furthermore, as the VMs encrypt, the data reduction rate on the volume hosting the VMs will slowly change. Since the granularity in this case is of the datastore volume (all 500 VMs) the change will not be dramatic as it is averaged out. Same as if the entire datastore was encrypted outside of the VMs. Therefore early warning doesn’t really exist. If they are vVols you would see data reduction go to 1:1 on each virtual disk on the array. The trend would be apparent and you could move to protect the remaining VMs faster. The overwrite rate (of the encryption process) would likely show as well per-volume on the array.
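The averaging effect is easy to see with some rough arithmetic. The sketch below is a deliberately simplified model in Python: it assumes every VM holds the same amount of logical data, that healthy VMs reduce at 5:1, that encrypted VMs reduce at 1:1, and it ignores dedupe shared across VMs. The function names and the thresholds in `flag_suspect_volumes` are hypothetical choices for illustration, not array settings.

```python
from typing import Dict, List


def datastore_ratio(
    vm_count: int,
    encrypted_count: int,
    logical_gb_per_vm: float = 100.0,  # assumed size per VM
    healthy_ratio: float = 5.0,        # 5:1, the figure used in this post
) -> float:
    """Blended data reduction ratio of one volume hosting all the VMs."""
    healthy = vm_count - encrypted_count
    # Healthy VMs store logical/ratio physically; encrypted VMs store 1:1.
    physical_gb = (
        healthy * logical_gb_per_vm / healthy_ratio
        + encrypted_count * logical_gb_per_vm
    )
    return (vm_count * logical_gb_per_vm) / physical_gb


def flag_suspect_volumes(
    previous: Dict[str, float],
    current: Dict[str, float],
    drop_fraction: float = 0.5,  # assumed: ratio fell by at least half
    floor: float = 1.2,          # assumed: "close to 1:1"
) -> List[str]:
    """Names of volumes whose ratio collapsed toward 1:1 since the last sample."""
    suspects = []
    for name, ratio_now in current.items():
        ratio_before = previous.get(name)
        if ratio_before is None:
            continue  # no baseline for this volume yet
        if ratio_now <= floor and ratio_now <= ratio_before * (1 - drop_fraction):
            suspects.append(name)
    return sorted(suspects)
```

Under these assumptions, `datastore_ratio(500, 10)` is still roughly 4.6:1 after ten VMs have been encrypted, which is easy to miss on a trend line. With per-virtual-disk volumes, each of those ten would individually show a drop from about 5:1 to about 1:1, which is the kind of change `flag_suspect_volumes` is meant to pick out.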
On a semi-related note, with vVols there is no datastore, so that attack vector (encrypting the whole datastore) does not exist.
Day 2 Compliance
One limitation around VMFS datastores is that the configuration is tied to the datastore. And assignment of this configuration requires additional software that is not integrated into the feature management of vSphere. vVols are. vVols integrate into the provisioning mechanism of storage policy based management. So as VMs are provisioned, you can make sure they are protected by snapshot policies (for instance). Importantly, it is not just about day 0 configuration, but also compliance. If someone removes that protection from the array, VMware is alerted that the storage is out of compliance.
Final Notes
As I said before, vVols does not by itself solve the ransomware problem. But I think they can help expose some features and protections to you in ways that VMFS cannot as easily provide.
Can these problems be solved with VMFS? Yeah, sure. There are lots of other ways to protect yourself. The message here is not YOU WILL BURN UNDER A PILE OF RANSOMWARE IF YOU DON’T USE VVOLS. It is that I believe vVols can help: it is a feature available to you that can potentially make it easier to defend against this. As arrays get more features and solutions around ransomware, these are delivered at the volume level, which is what vVols is all about.
So I ask you to think about it. Leave feedback! And let me know how we can do more! We certainly plan to.
I think it’s a problem that the account that is stored in vCenter for managing vVols is an administrative account with no logical separation. It would be desirable for this account not to have access to, say, snapshots created through protection groups.
klemens
Makes sense! I opened an internal ticket for this. Only allow the account to communicate with VASA, not our GUI, CLI, or REST directly.
Any luck with separating that account out via Pure RBAC?
Once I configure the syslog section to start sending data to our SIEM, what should I tell them to be looking for if we are being attacked?