Taming Small Data Writes to Block Storage Devices for Higher I/O Efficiency

 

While big data poses a big challenge to storage systems' capability and efficiency, small data presents an equally serious threat to their access efficiency. Examples of small data include file system metadata, lookup table entries in virtual storage devices, and small updates. Almost all storage devices use a block interface. When a piece of small data (e.g., 128 bytes or less) is substantially smaller than a block (e.g., 4KB), the actual access size is much larger than the data itself: writing a 128-byte record through a 4KB block interface amplifies the write by 32x, wasting device bandwidth and significantly reducing I/O efficiency. Because of the demand for immediate data persistence, small-data writes remain the Achilles' heel of block devices.

 

Multiple software layers in the I/O stack interact with the block interface, and small-data writes can inflict a performance penalty at each of them. The first layer is the virtual block device, which is hosted on physical block device(s) to provide a block interface to virtual machines. For every write of a data block to a virtual device, the device needs an additional small write to record where the data block is stored. The second layer is the I/O scheduler, where block write requests are received and then dispatched to the (virtual) storage device. Small writes that can be identified and removed at this layer include a small update within a data block, a data block that can be substantially shrunk by compression, and a data block that can be eliminated by online deduplication. The third layer is the file system, where metadata often must be persisted immediately after a data write. Servicing these small writes through the block interface is inefficient because of significant write amplification and the ordering often required between the writes of a data block and its corresponding metadata.

 

To address these challenges raised by small writes, the PI proposes a highly effective and fundamental approach that is applicable at any layer of the I/O stack. Instead of relying on special hardware support or demanding interface changes, the approach leverages data compression. Since small data usually accompany data blocks, they can be embedded into those blocks provided the block contents can be compressed. This use of compression deviates from its conventional role of reducing data volume, and it holds much larger performance advantages. First, because small data are often not contiguous with their data blocks, eliminating their separate writes both reduces wear on flash-based SSDs and removes extra seeks on hard disks. Second, when the small data are metadata of a data block and two (or more) writes are required to complete a write request of the block, the writes must follow a certain order to ensure on-disk data consistency. Using flushes to enforce the order seriously degrades I/O performance. By integrating the metadata and data into one block that is written in a single atomic write, the ordering issue can be addressed efficiently.
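The embedding step can be sketched in a few lines of C. The sketch below assumes a 4KB block, zlib as the compressor, and a hypothetical packed-block header; the actual on-disk formats (e.g., Selfie's, described below) differ.

    /* Minimal sketch: embed small metadata into a compressed data block so
     * that data and metadata persist in ONE atomic 4KB block write.
     * The header layout is hypothetical. */
    #include <stdint.h>
    #include <string.h>
    #include <zlib.h>

    #define BLOCK_SIZE 4096u

    struct packed_hdr {
        uint32_t magic;     /* marks a packed block (vs. a raw one) */
        uint16_t meta_len;  /* bytes of embedded small data */
        uint16_t comp_len;  /* bytes of compressed block payload */
    };

    /* Returns 0 on success, -1 if the data is too incompressible to make
     * room; the caller then falls back to a separate small write. */
    int pack_block(const uint8_t data[BLOCK_SIZE],
                   const uint8_t *meta, uint16_t meta_len,
                   uint8_t out[BLOCK_SIZE])
    {
        struct packed_hdr hdr = { 0x504B424Cu, meta_len, 0 };
        uLongf room = BLOCK_SIZE - sizeof hdr - meta_len;

        /* compress2() returns Z_BUF_ERROR when the payload will not fit. */
        if (compress2(out + sizeof hdr + meta_len, &room,
                      data, BLOCK_SIZE, Z_BEST_SPEED) != Z_OK)
            return -1;

        hdr.comp_len = (uint16_t)room;
        memcpy(out, &hdr, sizeof hdr);
        memcpy(out + sizeof hdr, meta, meta_len);
        return 0;  /* out[] now holds header + metadata + compressed data */
    }

A single block write of out[] then persists the data together with its metadata; no flush is needed to order two separate writes.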

Software

 

• Download Wormhole: A Fast Ordered Index for In-memory Data Management (Wormhole 1.0)

 

Wormhole is a new ordered index structure that takes O(log L) worst-case time to look up a key of length L. The low cost is achieved by simultaneously leveraging the strengths of three indexing structures, namely the hash table, the prefix tree, and the B+ tree, to orchestrate a single fast ordered index. Wormhole's range operations can be performed by a linear scan of a list after an initial lookup. This improvement of access efficiency does not come at the price of compromised space efficiency: Wormhole's index space is comparable to that of a B+ tree or a skip list.
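The trick behind the O(log L) bound can be illustrated with a binary search over key-prefix lengths. Below is a hedged sketch in C; prefix_exists() stands in for a hash-table probe into Wormhole's set of indexed prefixes (a hypothetical helper, as the real implementation is considerably more involved).

    /* Sketch: locate the longest indexed prefix of key[0..L) with O(log L)
     * hash probes instead of an O(L) character-by-character trie descent.
     * The binary search is valid because the probed set is prefix-closed:
     * every prefix of an indexed prefix is also indexed. */
    #include <stdbool.h>
    #include <stddef.h>

    extern bool prefix_exists(const char *key, size_t len);  /* hash probe */

    size_t longest_indexed_prefix(const char *key, size_t L)
    {
        size_t lo = 0, hi = L;  /* invariant: the length-lo prefix is indexed */
        while (lo < hi) {
            size_t mid = lo + (hi - lo + 1) / 2;
            if (prefix_exists(key, mid))
                lo = mid;       /* a longer indexed prefix exists */
            else
                hi = mid - 1;
        }
        return lo;  /* its hash entry leads to the leaf holding the key */
    }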

 

 

• Download WipDB: A Write-in-place Key-value Store that Mimics Bucket Sort (WipDB 1.0)

 

WipDB is a KV store design that leverages non-volatile memory and the relatively stable key distributions in a store to bound write amplification to a number as low as 4.15 in practice. The key idea is, instead of incrementally sorting KV items through an LSM tree's hierarchy of levels, to write KV items right in place into an approximately sorted list, much as a bucket sort algorithm does. The design also keeps most internal data reorganization operations off the critical path of read request service.
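The bucket-sort analogy can be sketched as follows; the structures are hypothetical simplifications, and the real WipDB adds NVM write buffering, bucket splits, and background reorganization.

    /* Sketch: write-in-place routing, bucket-sort style. Buckets partition
     * the key space; a write is appended directly into the bucket covering
     * its key, leaving items only approximately sorted. */
    #include <stddef.h>
    #include <string.h>

    #define KEY_LEN    16
    #define BUCKET_CAP 256

    struct kv { char key[KEY_LEN]; char val[48]; };

    struct bucket {
        char low_key[KEY_LEN];          /* smallest key the bucket covers */
        struct kv items[BUCKET_CAP];    /* kept in arrival order */
        size_t n;
    };

    /* Find the last bucket with low_key <= key, then append in place. */
    int wip_put(struct bucket *b, size_t nbuckets, const struct kv *item)
    {
        size_t lo = 0, hi = nbuckets - 1;
        while (lo < hi) {
            size_t mid = lo + (hi - lo + 1) / 2;
            if (memcmp(b[mid].low_key, item->key, KEY_LEN) <= 0)
                lo = mid;
            else
                hi = mid - 1;
        }
        if (b[lo].n == BUCKET_CAP)
            return -1;                   /* full: split or reorganize */
        b[lo].items[b[lo].n++] = *item;  /* no multi-level LSM rewrite */
        return 0;
    }

Because an item lands near its final sorted position on its first write, it is not rewritten at every LSM level, which is what keeps the write amplification at a small constant.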

 

 

• Download Selfie virtual storage code: Selfie 3.3 (also see readme)

 

Selfie is a virtual disk format that eliminates frequent metadata writes by embedding the metadata into data blocks, allowing a block and its associated metadata to be committed in one atomic block write. This is made possible by opportunistically compressing the data in a block to make room for the metadata and by effectively managing blocks that mix metadata and data. In our evaluation, Selfie achieves up to an 8x performance improvement over existing mainstream virtual disks and delivers near-raw performance with impressive scalability under concurrent I/O workloads. Selfie improves the write and read throughput of LevelDB, a state-of-the-art KV system, by up to 20x and up to 10x, respectively.
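The read path of such a packed block is the inverse of the pack_block() sketch given earlier (same hypothetical header and BLOCK_SIZE; Selfie's real format differs):

    /* Sketch: detect the packed-block header, split out the embedded
     * metadata, and decompress the payload back into a full 4KB block.
     * A real format must also disambiguate raw blocks whose first bytes
     * happen to equal the magic value. */
    int unpack_block(const uint8_t blk[BLOCK_SIZE],
                     uint8_t *meta_out, uint8_t data_out[BLOCK_SIZE])
    {
        struct packed_hdr hdr;
        memcpy(&hdr, blk, sizeof hdr);
        if (hdr.magic != 0x504B424Cu)
            return -1;                   /* raw block, no embedded metadata */

        memcpy(meta_out, blk + sizeof hdr, hdr.meta_len);

        uLongf dlen = BLOCK_SIZE;        /* zlib restores the original data */
        if (uncompress(data_out, &dlen, blk + sizeof hdr + hdr.meta_len,
                       hdr.comp_len) != Z_OK || dlen != BLOCK_SIZE)
            return -1;
        return 0;
    }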

 

• Download code for WOJ (Write-Once Full-data Journaling) in SSDs: WOJ 1.0 (also see readme)

 

Leveraging the fact that in data journaling mode every write to a home location in the file system is preceded by a corresponding write to the journal, Write-Once data Journaling (WOJ) uses weak-hashing-based deduplication, dedicated to removing those second writes, inside an SSD. WOJ reduces regular deduplication's computation and space overheads significantly without compromising correctness. To further reduce the cost of metadata persistence, WOJ is integrated with the SSD's FTL within the device.

 

The code is a prototype of a virtual SSD with the WOJ functionality enabled. In the virtual SSD, the WOJ functionality is implemented in Dmdedup, an open-source deduplication framework, as a device mapper target at the generic block device layer of Linux kernel 4.12.4. All data written to the virtual disk are first processed in the target, and those to be persisted are directed to a real SSD.
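The core mechanism can be sketched as a small remapping table in the FTL. All helper functions and types below are hypothetical; the real WOJ restricts matching to journal/home write pairs and guards against weak-hash collisions.

    /* Sketch: suppress the second (home-location) write of data journaling.
     * A journal write records (weak hash -> physical page); a later home
     * write with a matching hash is remapped to the journal copy instead
     * of being written again. */
    #include <stddef.h>
    #include <stdint.h>

    #define WOJ_SLOTS 4096u

    struct woj_entry { uint32_t hash; uint64_t phys_page; };
    static struct woj_entry slots[WOJ_SLOTS];

    extern uint32_t weak_hash32(const void *buf, size_t len);  /* e.g., CRC32 */
    extern uint64_t flash_write(const void *buf, size_t len);  /* returns page */
    extern void     ftl_map(uint64_t lba, uint64_t phys_page); /* L2P update */

    void woj_journal_write(uint64_t lba, const void *buf, size_t len)
    {
        uint64_t page = flash_write(buf, len);        /* the one real write */
        uint32_t h = weak_hash32(buf, len);
        slots[h % WOJ_SLOTS] = (struct woj_entry){ h, page };
        ftl_map(lba, page);
    }

    void woj_home_write(uint64_t lba, const void *buf, size_t len)
    {
        uint32_t h = weak_hash32(buf, len);
        struct woj_entry *e = &slots[h % WOJ_SLOTS];
        if (e->phys_page != 0 && e->hash == h) {
            ftl_map(lba, e->phys_page);   /* dedup hit: remap, no write */
            return;
        }
        ftl_map(lba, flash_write(buf, len));  /* miss: write normally */
    }

The hash can be weaker than a cryptographic fingerprint because data journaling guarantees that the home write duplicates a specific, recently journaled block; how WOJ scopes and verifies that match is elided here.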

 

• Download code for the ThinDedup deduplication system: ThinDedup 1.0 (also see readme)

 

ThinDedup compresses data and inserts metadata into data blocks to reconcile the incompatibility between the rigid block interface and small metadata. Assuming that performance-critical data are usually compressible, ThinDedup removes most separate metadata writes from the critical path of servicing user requests, making I/O deduplication much thinner (mostly only data block writes remain). In addition to metadata insertion, ThinDedup uses the persistence of data fingerprints to avoid enforcing a write order between data and metadata. It is implemented in the Linux kernel as a device mapper target that provides block-level deduplication.
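A hedged sketch of the resulting write path, with hypothetical helpers (and reusing the pack_block() sketch from earlier): a duplicate write needs only a mapping update, while a unique write carries its own mapping metadata inside the compressed block, so the common case issues exactly one block write.

    /* Sketch of a deduplicating write path with embedded metadata. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    extern void     sha1(const void *buf, size_t len, uint8_t fp[20]);
    extern bool     fp_lookup(const uint8_t fp[20], uint64_t *pba);
    extern uint64_t block_write(const uint8_t *blk);         /* returns PBA */
    extern void     map_update(uint64_t lba, uint64_t pba);  /* in memory */
    extern int      pack_block(const uint8_t *data, const uint8_t *meta,
                               uint16_t meta_len, uint8_t *out);

    void dedup_write(uint64_t lba, const uint8_t data[4096])
    {
        uint8_t fp[20], meta[28], out[4096];
        uint64_t pba;

        sha1(data, 4096, fp);
        if (fp_lookup(fp, &pba)) {        /* duplicate: no data write at all */
            map_update(lba, pba);
            return;
        }
        /* Unique block: embed the mapping record and fingerprint in the
         * block itself, so one atomic write persists data AND metadata. */
        memcpy(meta, &lba, sizeof lba);
        memcpy(meta + sizeof lba, fp, sizeof fp);
        if (pack_block(data, meta, sizeof meta, out) == 0)
            pba = block_write(out);
        else
            pba = block_write(data);      /* incompressible: metadata is
                                             persisted separately, off the
                                             critical path */
        map_update(lba, pba);
    }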

 

Publications Related to this Project

 

 

Participants


REU Program on the project (see the REU website and a Student Presentation)


Acknowledgements: