Kris Meier October 20, 2020
It is no secret that moving on-premises NFS-based workflows into the cloud is top of mind for most IT leaders. Leveraging on-demand compute and networking with simplified, near-unlimited NFS storage is a compelling proposition, especially for organizations struggling to keep up with a storage footprint growing at 50% year over year or more.
As part of our journey to bring our Application Accelerator, powered by the InfiniteIO File Metadata Engine (IFME), into the cloud, I took a closer look at the metadata performance of several cloud-based NFS as a Service offerings to understand what impact our metadata acceleration could have. I deployed each offering, ran some ML picture-classification model training runs, and captured the NFS metadata operation performance using tcpdump, all within the same cloud.
What became clear is that one of the most exciting benefits of running our Application Accelerator with IFME in the cloud is that customers’ on-premises workloads can run directly in the cloud without modification. As you’ll see, InfiniteIO provides the performance benefits of file metadata offload to keep applications running fast while letting customers leverage the simplicity and cost of NFS as a Service.
What I’ve discovered is that NFS as a Service solutions really don’t perform as well as on-premises hardware-based solutions, particularly when it comes to latency. Layers of infrastructure are required to make NFS as a Service simple, elastic and resilient, and unfortunately that means latency suffers. Caches are added, either explicitly or under the covers, to alleviate the performance issues, and to some extent this is successful. However, in my tests I consistently measured 1-3 ms latencies across the most common metadata operations (ACCESS, GETATTR, LOOKUP).
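As a rough sketch of how per-operation latencies can be derived from such a capture: match each NFS CALL to its REPLY by RPC XID and bucket the round-trip times per operation. The function name, tuple shape, and the assumption that the tcpdump capture has already been parsed (e.g. via a tshark export) are mine for illustration, not part of any InfiniteIO tooling:

```python
from collections import defaultdict
from statistics import mean, stdev

def nfs_latencies(packets):
    """Compute per-operation NFS latency stats (in ms).

    `packets` is an iterable of (timestamp_s, direction, xid, op)
    tuples, where direction is "call" or "reply" -- assumed to have
    been parsed out of the tcpdump capture beforehand."""
    pending = {}                # xid -> (op, call timestamp)
    by_op = defaultdict(list)   # op  -> [latency_ms, ...]
    for ts, direction, xid, op in packets:
        if direction == "call":
            pending[xid] = (op, ts)
        elif xid in pending:
            call_op, call_ts = pending.pop(xid)
            by_op[call_op].append((ts - call_ts) * 1000.0)
    return {
        op: {"avg": mean(ls), "min": min(ls), "max": max(ls),
             "stdev": stdev(ls) if len(ls) > 1 else 0.0}
        for op, ls in by_op.items()
    }
```

Running this over a capture yields exactly the avg/min/max/stdev columns discussed below, one row per operation type.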
An example from one of my tests is below. Here, I ran the ML training test and looked at the time it took to process each of the NFS operations for one of the NFS as a Service solutions:
My data set consisted of 70,000 pictures, each 100-350 bytes, and I trained the model over 3 epochs. You can see the 1-3 ms I mentioned above and, of course, the heavyweight operations take longer. I’ve also included min, max, and standard deviation in the table. Interestingly, average NFS operation latency never drops much below 1 ms, so for this particular NFS as a Service solution, that’s likely as good as it gets.
The max column shows our outliers, particularly for ACCESS. If I remove the single outlier 1000 ms ACCESS call, the average drops to 1.18 ms and the standard deviation drops to 0.54 ms. I highlight this because average performance is not the only thing that matters; consistent performance is also very important. With standard deviations on the order of the average (or larger!), my application is seeing neither consistent nor fast metadata responses.
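To see how badly a single slow call can distort the summary statistics, here is a toy illustration with hypothetical ACCESS latencies (these numbers are invented to mirror the shape of the data, not taken from my capture):

```python
from statistics import mean, stdev

# Hypothetical ACCESS latencies in ms: a tight cluster near ~1.2 ms
# plus one ~1000 ms straggler, mimicking the outlier described above.
latencies = [1.1, 0.9, 1.3, 1.5, 1.0, 1.2, 1000.0]

avg_all, sd_all = mean(latencies), stdev(latencies)

trimmed = [l for l in latencies if l < 100.0]   # drop the outlier
avg_trim, sd_trim = mean(trimmed), stdev(trimmed)
```

One straggler out of seven samples drags the mean above 100 ms and inflates the standard deviation by orders of magnitude, which is why both the average and the spread matter when judging a storage service.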
For some applications this may be acceptable, but for today’s modern intensive applications used in AI/ML, EDA, and Life Sciences where millions (or billions) of files are created and consumed, millisecond operation latency means applications that could complete in minutes on-premises now complete in hours in the cloud.
Applications in this space are driven by file metadata. Often 90% or more of file operations are really metadata queries. As we’ve seen from my previous blog, when a client is looking for a file, it performs at least two metadata operations: a LOOKUP or GETATTR to find the file handle or grab the file’s attributes, followed by an ACCESS operation to see if it has permission to access the file. At 3 ms per operation, a client could take up to 100 minutes to discover one million files on an NFS as a Service platform. In a complex file system, even more time will be spent walking directories to get to the files.
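The 100-minute figure falls straight out of the arithmetic (using the 3 ms per-operation latency and two metadata operations per file described above):

```python
FILES = 1_000_000
OPS_PER_FILE = 2      # LOOKUP or GETATTR, then ACCESS
LATENCY_S = 0.003     # 3 ms per metadata operation

# 1,000,000 files * 2 ops * 3 ms = 6,000 s = 100 minutes
total_minutes = FILES * OPS_PER_FILE * LATENCY_S / 60
```

And this is a floor: it counts only the two per-file operations, before any time spent walking directories.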
InfiniteIO has shown our on-premises solution can respond to metadata requests in as little as 20 μs. Adding a virtual InfiniteIO Application Accelerator to the solution in the cloud could reduce this time from 100 minutes down to 2 minutes just by accelerating metadata operations, as shown in the picture below.
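Repeating the same arithmetic at 20 μs per operation shows the size of the gap. Note the metadata-only floor works out to about 40 seconds; the 2-minute figure is roughly consistent with that once directory walking and uncached operations are added on top (my reading, not a figure from a separate measurement):

```python
FILES = 1_000_000
OPS_PER_FILE = 2

service_s     = FILES * OPS_PER_FILE * 3e-3    # ~3 ms/op NFS as a Service
accelerated_s = FILES * OPS_PER_FILE * 20e-6   # ~20 us/op from the accelerator

speedup = service_s / accelerated_s            # 150x on metadata alone
```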
Even if we look at the more performant (and more expensive) NFS as a Service solutions, with file metadata latencies of around 1 ms per operation, InfiniteIO can deliver a major improvement. Of the operations generated during an ML classification model training run, fully 80% were metadata operations. Using InfiniteIO’s Application Accelerator could reduce the time CPUs/GPUs spend waiting for storage by 75%.
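The 75% figure can be sketched as a weighted average over the operation mix. One assumption here is mine: that data operations also average about 1 ms on these services, so that only the metadata share of the wait shrinks:

```python
META_FRACTION = 0.80    # share of operations that are metadata (observed)
META_MS  = 1.0          # NFS as a Service metadata latency (ms)
DATA_MS  = 1.0          # assumed average data-operation latency (ms)
ACCEL_MS = 0.02         # accelerated metadata response (~20 us)

# Average wait per operation, before and after metadata offload.
before = META_FRACTION * META_MS  + (1 - META_FRACTION) * DATA_MS
after  = META_FRACTION * ACCEL_MS + (1 - META_FRACTION) * DATA_MS

reduction = 1 - after / before    # roughly 0.78, i.e. ~75% less waiting
```

The residual wait is dominated by the 20% of operations that still move data, which is exactly the Amdahl's-law shape you would expect from offloading only the metadata path.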
This shouldn’t be too surprising. On-premises filesystems are typically not optimized for metadata operations, so how could a massively distributed, elastic file system do any better? Moving InfiniteIO File Metadata Engine into the cloud will bring application runtimes back down from hours to minutes and make the lift and shift of these modern applications into the cloud a reality.