Following up on my last post, here's the first one I want to answer:
-------------
https://twitter.com/tony_bridges_el/status/1353468708358926340
"My
current bugbear is MS' Shielded VMs. I understand the marketing, but I
don't understand on a technical level how they prevent introspection by
the hardware/fabric owners.
Part of a larger topic of securing secrets in untrusted environments."
-------------
TL;DR, go to the summary at the bottom of this post.
I haven't heard of this as an offering before, but this sounds like some of the awesome difficult to inspect/detect malware ideas a friend of mine had a while ago and gave a "don't share this" presentation on. Time to see if that's the case.
First link, from 2019: https://techcommunity.microsoft.com/t5/data-center-security/what-are-shielded-vms-in-windows-server-2016-hyper-v/ba-p/372179
Best link so far (1hr, 10min): https://sec.ch9.ms/ch9/fa49/1fbd93dd-250d-47d1-a420-02cf2234fa49/BRK3124_mid.mp4
The first gives it an introduction, but doesn't tell me if they're encrypting memory. The second is long and goes into great detail, but doesn't break down process/cpu resource segmentation. Re: ram, I remember researching cold boot attacks when they came out and worrying about hardware attestation/security at a longer engagement I was on when I was at iSEC Partners.
So let's compare this to normal VM infra:
Normal VM infra:
1. VM server downloads/has access to storage VHD/virtual disk and reads the config, sends it to iron that has a shim to talk to VM server and runs it.
A. VHD isn't encrypted, so anyone with physical access to iron can copy the disk or pull it out while running, and have the data from it.
B. Can maybe do a cold boot attack on the RAM. Although ECC RAM dissipates data faster, you can do the swap quickly enough with the right equipment/budget.
C. Assuming bypassable or lacking attestation on the vm management to the iron that's getting the VHD, you can also run everything as emulated/with a debugger, although emulating the speed of iron seems like it would side-channel leak that something is going on by being slower than normal. On low workloads or without looking at that frequently/closely, that could very well go undetected.
The point of Shielded VMs is to reduce trust in the people running the data center, to give you a chance at believing that maybe "the cloud" isn't just "someone else's computer". Let's break down how shielded VMs do that and I'll see what else may be possible that they're not doing/that I didn't read thoroughly enough. I'm happy to make changes in my posts on any inaccuracies. I'm rushing through public documentation and relying on a few years old knowledge re this; that was just before they came out, so the tech was the same then.
Shielded VMs:
Main docs: https://docs.microsoft.com/en-us/windows-server/security/guarded-fabric-shielded-vm/guarded-fabric-and-shielded-vms-top-node
- Introduce a new service: Host Guardian Service (HGS)
A. This runs attestation on hosts that the VM server wishes to run a shielded vm on. It won't approve a host to run a shielded vm unless attestation passes.
Attestation has a few parts:
a. Secure boot: I assume it's configurable, but this relies on the TPM, which is/should be a physical chip (it's also available as a virtual TPM that Hyper-V offers, which I don't know much about). This chip is separate from the CPU and runs crypto operations separately from the CPU(s).
As the machine boots, the TPM has a key baked into it by the manufacturer. It uses this to compute a chain of hashes (PCR) on (OEM/sysadmin) approved software/first boot software that should all be matched by the booting parts of the system (this is before Windows starts and is hardware/BIOS/UEFI level).
If the parts running and sending their code through the CPU don't add up, then the TPM sees that one or more hashes don't match and says, "You shall not pass." It's the beginning of this blog post: https://oofhours.com/2019/07/09/tpm-attestation-what-can-possibly-go-wrong/
More general Secure Boot info: https://docs.microsoft.com/en-us/windows-hardware/design/device-experiences/oem-secure-boot
Installing some new hardware or updating bios without putting the secure boot process into a new "learning mode" is a great way to be locked out of your machine.
Ok, watching this: https://sec.ch9.ms/ch9/fa49/1fbd93dd-250d-47d1-a420-02cf2234fa49/BRK3124_mid.mp4 at ~45min it goes into how HGS adds more to the measured boot/secure boot process. The HGS takes each PCR (hash in the chain) and asks the TPM to sign each part and confirms those signatures with its known TPM public key, so we have hardware-based confirmation from a crypto chip designed to be difficult to get into that confirms that the boot happened with the expected hardware and firmware. This should prevent a replay attack where a MITM false TPM sends the "secure boot happened" message to the HGS.
b. To attest to remote things, the TPM signs something (a certificate) that the Host Guardian Service presents. If the sig can be proven by the public key the HGS already has from the client host TPM, then it can be proven that it both passed Secure/Measured Boot and that the TPM has not been tampered with in non-large-number-of-dollars and team of skilled people ways (decapping/electron microscope, possibly other unknown-to-me ways).
B. If these two things are passed, then the HGS allows the chosen guarded host to be used. It also sends the decryption key to the guarded host to decrypt the shielded VM. These keys only work for a time period (let's say 8 hours). I didn't dig into how it time-boxes these.
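To make the "chain of hashes" idea from the attestation steps concrete, here's a minimal sketch of how a measured boot register gets built up. It assumes a SHA-256 PCR that starts at all zeros and is only ever updated by an "extend" operation (new = hash of old value plus the new measurement). The component names are made up for illustration, and this is not how HGS or any particular firmware is actually implemented.

```python
import hashlib


def extend(pcr, measurement):
    """Fold one measurement into the register: new = SHA-256(old || measurement)."""
    return hashlib.sha256(pcr + measurement).digest()


def measure_boot(components):
    """Hash each boot component in order and extend it into a single PCR."""
    pcr = bytes(32)  # assumption: a SHA-256 PCR that resets to all zeros
    for blob in components:
        pcr = extend(pcr, hashlib.sha256(blob).digest())
    return pcr


# Hypothetical boot components, purely for illustration.
good_boot = [b"uefi-firmware-v1", b"bootloader-v1", b"hypervisor-v1"]
evil_boot = [b"uefi-firmware-v1", b"bootloader-with-debugger", b"hypervisor-v1"]

expected_pcr = measure_boot(good_boot)
print(measure_boot(good_boot) == expected_pcr)  # same components, same order
print(measure_boot(evil_boot) == expected_pcr)  # one link in the chain changed
```

Because each value depends on everything before it, you can't swap out or reorder one stage and still land on the expected final value without breaking the hash.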
This means that we now have trusted hardware running a VM that is encrypted on its disk. CPU-level debuggers aren't running because we confirmed that booting is secure. The HDD/SSD and other peripherals are what was expected as well. Unapproved firmware is shut out too because of secure boot.
So the iron has a measured boot and secure boot with a physical TPM. This is used to get a VM image and the key to decrypt it for a time period. The vTPM in the VM/Hyper-V can then run to attest to the booting of the VM itself. I didn't dig into these specifics on virtualized TPMs either.
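The reason a remote service can believe "the boot was secure" is that the claim is signed by the TPM and tied to something fresh, so an old answer can't be replayed. Here's a rough sketch of that check. Big assumption: a real TPM signs with an asymmetric key and the verifier only holds the public half. Python's standard library has no asymmetric signatures, so HMAC with a shared secret stands in for the signature here. `FakeTPM` and `verify_quote` are hypothetical names, not part of any Microsoft or TPM API.

```python
import hashlib
import hmac
import secrets


class FakeTPM:
    """Stand-in for a physical TPM. Assumption: HMAC replaces the real asymmetric signature."""

    def __init__(self, secret_key):
        self._key = secret_key

    def quote(self, pcr_value, nonce):
        # "Sign" the PCR value together with the verifier's nonce.
        return hmac.new(self._key, pcr_value + nonce, hashlib.sha256).digest()


def verify_quote(known_key, expected_pcr, nonce, pcr_value, signature):
    """Accept only if the signature covers this nonce and the PCR is the expected one."""
    expected_sig = hmac.new(known_key, pcr_value + nonce, hashlib.sha256).digest()
    signature_ok = hmac.compare_digest(signature, expected_sig)
    pcr_ok = hmac.compare_digest(pcr_value, expected_pcr)
    return signature_ok and pcr_ok


tpm_key = secrets.token_bytes(32)
tpm = FakeTPM(tpm_key)
good_pcr = hashlib.sha256(b"expected boot chain").digest()

# The verifier picks a fresh nonce for every attestation request.
nonce = secrets.token_bytes(16)
sig = tpm.quote(good_pcr, nonce)
print(verify_quote(tpm_key, good_pcr, nonce, good_pcr, sig))

# Replaying the old signature against a new nonce should be rejected.
fresh_nonce = secrets.token_bytes(16)
print(verify_quote(tpm_key, good_pcr, fresh_nonce, good_pcr, sig))
```

The nonce is what kills the replay: a recorded "everything is fine" message from yesterday doesn't cover today's challenge.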
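I didn't dig into how the time-boxing is really done, so treat the following as a guess at the general shape and nothing more: the key is only handed over if attestation passed, and the grant carries an expiry. `KeyGrant`, `release_key`, and `grant_is_valid` are hypothetical names I made up, and the 8 hours is just the "let's say" number from above.

```python
import secrets
from dataclasses import dataclass

KEY_LIFETIME_SECONDS = 8 * 60 * 60  # assumption: the "let's say 8 hours" figure


@dataclass(frozen=True)
class KeyGrant:
    vm_key: bytes
    expires_at: float  # seconds, same clock as the `now` values below


def release_key(attestation_passed, vm_key, now, lifetime=KEY_LIFETIME_SECONDS):
    """Hand out the VM key only to an attested host, and only for a limited time."""
    if not attestation_passed:
        return None
    return KeyGrant(vm_key=vm_key, expires_at=now + lifetime)


def grant_is_valid(grant, now):
    """A grant is usable only if it exists and has not expired yet."""
    return grant is not None and now < grant.expires_at


vm_key = secrets.token_bytes(32)
start = 1_000_000.0  # fixed timestamp so the example is repeatable

grant = release_key(True, vm_key, now=start)
print(grant_is_valid(grant, now=start + 3600))       # one hour in
print(grant_is_valid(grant, now=start + 9 * 3600))   # nine hours in
print(release_key(False, vm_key, now=start))         # host failed attestation
```

After the grant expires, the host would have to pass attestation again to get a new one, which is what keeps a host that later gets tampered with from holding a working key forever.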
What about memory? Is that encrypted?
Not that I can tell. No one encrypts memory because as far as I'm aware, the CPU loss is abysmal. The secure enclave/vms (below) does protect that from being read by other CPU processes in other VMs and/or other privilege levels.
What about CPU cores? If there's a malicious VM also running on the same hardware, can that jump cores if there's a VM breakout?
Probably not/it's more difficult. I understand the gist of Intel's SGX. This is a chip-enforced segregation of CPU (and possibly memory) that would prevent even someone running debugging software from reading a process that's been granted SGX-apportioned resources. It basically walls off part of the CPU and RAM for processing. I don't know yet if this is what VSM is doing (or if that is used in shielded VMs), but there's a captured syscall mentioning SGX that I only see mentioned on Google in two places: https://github.com/ionescu007/SimpleVisor and https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/vsm . In this video, they talk about VSM using Isolated User Mode and Secure Kernel in "trustlets" at the ~11min mark: https://sec.ch9.ms/ch9/c198/dbc5b17b-7ba3-4701-93a0-57ebd9d5c198/introductiontoshieldedvirtualmachines_mid.mp4
"Secure enclave" is a term used in reference to what SGX creates as a walled area of the CPU for processing. See the diagram (last reference to SGX, at the top of pg5) in this PDF: https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/sogeti-security-white-paper.pdf
More details about SGX: https://blog.quarkslab.com/overview-of-intel-sgx-part-1-sgx-internals.html
What about administration?
They offer making a separate forest, but let's be real, most won't. This mostly opens you up to normal AD priv-esc issues, which I think you should worry about a lot more than about some Azure employee copying your disks.
This will help though: "HGS comes with Just Enough Administration
(JEA) roles built in to help you manage it more securely.
JEA helps by allowing you to delegate admin tasks to non-admin users,
meaning the people who manage HGS policies need not actually be admins
of the entire machine or domain." https://docs.microsoft.com/en-us/windows-server/security/guarded-fabric-shielded-vm/guarded-fabric-manage-hgs
I really like their advice on setting this up. Follow it and your AD risks are much lower.
Random selection of other good security stuff Shielded VMs do:
- Disable a bunch of remote protocols for admin stuff: Guest file copy IC, registry hive injection, vmconnect, powershell direct, remotefx, some WMI calls, some KVPs (dunno what that is).
- Remove some virtual devices: serial, debug, hid, input mgr, rdp encoder, synthetic keyboard/mouse/video/rdp.
Summary
This offers you the ability to stop people who don't already have access to the running VM (end users) and VM admins from accessing the contents of a running VM. You enforce trusting the hardware with physical TPM attestation. You store your VHDs encrypted and revoke keys that have been given to VM servers after a time period. This prevents server admins and fabric admins from being able to run/access your shielded VMs/their data. Someone gets your SSD/HDD? They don't get your data, even if it's running on the machine because it's encrypted. Someone changes hardware on the server to gain access? It won't be attested for in secure/measured boot. Most physical attacks are covered.
Another VM resident on the same metal breaks out of the VM? VSM and the equivalent (same?) to Intel SGX stops the CPU from being told to read the secure enclave processing resources given to the Shielded VM. I'm not 100% sure on the validity of this statement, but it's my intuition that this is correct. If anyone can confirm/deny this, please comment below, or in the twitter thread. I'll try to check it for the next week at least.
The only thing I see this missing is cold boot attacks, but I don't know how reliable those are nowadays and haven't done one since 2013.
Shielded VMs first prove that the server you think you're talking to is running the expected hardware/firmware before delivering the shielded VM (VHD), and additionally cut out a number of potential attack points from your VM infrastructure/fabric.