Level 2 Advanced Replacement Cache for ZFS (illumos.org)
42 points by hepha1979 on Oct 21, 2013 | 13 comments


To give more info on this: http://wiki.illumos.org/display/illumos/Persistent+L2ARC

Basically, L2ARC is the "Level 2 Adaptive Replacement Cache" for ZFS, where level 1 is in RAM and level 2 is on an SSD (usually a cheap/large MLC SSD, as opposed to an expensive/small SLC SSD for the ZIL). In short, it uses an SSD as a huge cache for reads, so they don't have to be serviced by a slower spinning-rust array.

Prior to this change, after every system reboot, the L2ARC would be cleared and not used/filled until reads from disk happened. On a system that is rebooted frequently (or even infrequently), this can result in slower performance until the cache has been primed.

My understanding is that with this change, reads can happen from the L2ARC devices after a reboot (the "persistence"), which removes the ramp-up period before the L2ARC becomes useful.
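To make the cold-vs-warm distinction concrete, here is a toy two-tier read cache (a sketch, not ZFS code; all names are made up for illustration). With a persistent L2, a "reboot" that wipes RAM but keeps the SSD tier causes zero extra disk reads:

```python
class TwoTierCache:
    def __init__(self, l1_size, l2_size):
        self.l1 = {}          # "RAM" tier
        self.l2 = {}          # "SSD" tier
        self.l1_size = l1_size
        self.l2_size = l2_size
        self.disk_reads = 0   # reads served by the slow spinning tier

    def read(self, key, backing):
        if key in self.l1:
            return self.l1[key]
        if key in self.l2:              # SSD hit: no disk I/O needed
            value = self.l2[key]
        else:                           # miss everywhere: hit the disk
            self.disk_reads += 1
            value = backing[key]
            if len(self.l2) >= self.l2_size:
                self.l2.pop(next(iter(self.l2)))  # naive FIFO eviction
            self.l2[key] = value
        if len(self.l1) >= self.l1_size:
            self.l1.pop(next(iter(self.l1)))
        self.l1[key] = value
        return value

backing = {i: i * 2 for i in range(8)}
cache = TwoTierCache(l1_size=2, l2_size=8)
for i in range(8):
    cache.read(i, backing)    # 8 cold misses fill both tiers
cache.l1.clear()              # simulate a reboot: RAM contents are gone
before = cache.disk_reads
for i in range(8):
    cache.read(i, backing)    # all served from the surviving "L2"
assert cache.disk_reads == before
```

Before this change, the reboot would effectively clear `l2` as well, and the second loop would pay the full eight disk reads again.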


This (L2ARC being considered "valid" even after a reboot) sounds quite a bit like ZFS is growing features that already exist in Linux's bcache (and maybe dm-cache? I'm not sure how it treats data).


All the parts (tiered caching, compression, checksums, redundancy, deduplication, journaling, network availability, FS migration...) likely exist separately; having them in a single filesystem (especially that doesn't require kernel patches, just a module) is quite pleasant.


And some of those things certainly benefit from being integrated into the FS.

I wonder if any knowledge at the filesystem level (as opposed to the block level, where bcache and dm-cache operate) could help L2ARC make better caching choices.


That may be true, but not everybody runs ZFS on Linux, and some of these features make more sense grouped in the filesystem driver anyway.

Plus it's slightly hypocritical to comment about how ZFS is growing features that already exist in Linux when Btrfs is the embodiment of reinventing features.


L2ARC has been around quite a while so it's much more likely that it's the other way around.


I was speaking about the preserved-across-reset portion (which this submission is related to) only.


Neat. My reading of these changes is that they finally made the L2ARC's contents survive a reboot.

For some background:

ZFS, a modern WAFL clone [1], has a replacement algorithm called ARC [2], which can concisely be described as a hybridized MRU/MFU (Most Recently Used/Most Frequently Used) replacement algorithm for deciding which pages make the most sense to keep in memory. There is considerable literature surrounding replacement algorithm design; I have little to say about ARC other than that it is patented [3] and can be outperformed by newer algorithms.
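The MRU/MFU hybrid idea can be sketched in a few lines. This is a greatly simplified flavor of ARC, not the patented algorithm: real ARC also keeps "ghost" lists of recently evicted keys and uses them to adaptively tune the split between the two lists; here the split is a crude size comparison.

```python
from collections import OrderedDict

class TinyArc:
    def __init__(self, capacity):
        self.capacity = capacity
        self.recent = OrderedDict()    # T1: pages referenced once (MRU side)
        self.frequent = OrderedDict()  # T2: pages referenced again (MFU side)

    def get(self, key):
        if key in self.recent:          # a second touch promotes to T2
            value = self.recent.pop(key)
            self.frequent[key] = value
            return value
        if key in self.frequent:        # refresh position on the MFU side
            self.frequent.move_to_end(key)
            return self.frequent[key]
        return None

    def put(self, key, value):
        if self.get(key) is not None:   # already cached: update in place
            self.frequent[key] = value
            return
        if len(self.recent) + len(self.frequent) >= self.capacity:
            # Evict from whichever list is larger -- a crude stand-in
            # for ARC's adaptive target size.
            victim = (self.recent
                      if len(self.recent) >= len(self.frequent)
                      else self.frequent)
            victim.popitem(last=False)
        self.recent[key] = value        # new pages enter on the MRU side

cache = TinyArc(4)
cache.put("hot", 1)
cache.get("hot")            # second touch: promoted to the MFU side
for k in "abcde":
    cache.put(k, 0)         # a one-pass scan churns only the MRU side
assert cache.get("hot") == 1
```

The point of the two-list design is visible in the last line: a one-time scan of cold data evicts other once-touched pages, but the repeatedly used key survives on the frequency side, which is the scan resistance that plain LRU lacks.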

Note that this is quite different from the traditional approach to FS/buffer-cache design. One usually expects the OS kernel to manage the buffer cache for you (OS X has its Unified Buffer Cache (UBC), NT has its Cache Manager (Cc), etc.). However, ZFS includes its own incredibly complex caching subsystem. I do not know why they didn't want to improve or modify the Solaris kernel's segmap subsystem, but there are consequences to this design. Notably, ZFS's memory usage is quite a bit higher because of ARC.

The idea of performing read-caching in memory with ARC seemed like such a good idea to the ZFS designers that they allow for a second level of ARC to take place: L2ARC. L2ARC essentially runs the ARC algorithm between SSDs and HDDs to, hopefully, speed up the performance of random reads in a ZFS storage pool.
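The mechanics of how the SSD tier gets filled can be sketched as follows. This is a hypothetical illustration of the "feed" idea (the function and names are made up, not illumos code): blocks about to fall off the cold end of the in-RAM cache get copied to the SSD, with the write rate throttled per pass.

```python
def feed_l2(arc_tail_entries, l2_device, batch_limit):
    """Copy up to batch_limit cold-but-still-cached blocks to the SSD
    tier, so a later RAM miss can be served from flash instead of disk.
    Returns the number of blocks written this pass."""
    written = 0
    for key, block in arc_tail_entries:
        if written >= batch_limit:      # the real L2ARC throttles writes
            break
        if key not in l2_device:        # don't rewrite what's present
            l2_device[key] = block
            written += 1
    return written

ssd = {}
tail = [("blk1", b"a"), ("blk2", b"b"), ("blk3", b"c")]
wrote = feed_l2(tail, ssd, batch_limit=2)   # only two blocks per pass
assert wrote == 2 and "blk3" not in ssd
```

Throttling matters because an SSD's write bandwidth is finite and wears the device; the real feed thread wakes up periodically and copies a bounded amount each time.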

Now to steer back towards what this code dump seems to be about. If you recall from before, ZFS's ARC is a replacement algorithm based on usage and it needs to know which things to put where. This so-called persistent L2ARC remembers where things were on a L2ARC device so that the storage pool can take advantage of the fact that data is on the SSD on, say, a reboot.

Huh? Why did this require extra code? Remember, ARC was about caching: it didn't need to remember anything. When coming back online, complicated things happen: transactions get replayed, metadata integrity needs to be rechecked, etc. Implementing a persistent cache that is crash safe is incredibly difficult but not uncommon: auto-tiering [4] solutions like Fusion Drive [5] have to provide this kind of safety.
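The crash-safety requirement can be illustrated with a tiny sketch of the *idea* (not illumos's on-disk format): periodically persist a small, checksummed index of "what lives where on the SSD," and on boot either rebuild the in-RAM lookup table from it or, if the checksum fails, fall back to an empty cache. Losing a cache is safe; trusting a torn write is not.

```python
import json
import zlib

def save_index(index):
    """Serialize the block->offset index with a CRC32 prefix."""
    payload = json.dumps(index, sort_keys=True).encode()
    return zlib.crc32(payload).to_bytes(4, "big") + payload

def load_index(blob):
    """Rebuild the index after 'reboot'; a bad checksum means the log
    was torn or corrupt, so we simply start with a cold cache."""
    stored, payload = blob[:4], blob[4:]
    if zlib.crc32(payload).to_bytes(4, "big") != stored:
        return {}
    return json.loads(payload)

index = {"blk7": 4096, "blk9": 8192}       # block -> SSD offset (made up)
blob = save_index(index)
assert load_index(blob) == index           # clean reboot: index survives
assert load_index(blob[:-1] + b"X") == {}  # corruption: safe cold start
```

The asymmetry in the last line is the key design point: unlike a journal or tiered-storage metadata, cache metadata may be thrown away on any doubt, which makes "persistent but crash-safe" tractable.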

[1] http://en.wikipedia.org/wiki/Write_Anywhere_File_Layout

[2] http://en.wikipedia.org/wiki/Adaptive_replacement_cache

[3] http://patft1.uspto.gov/netacgi/nph-Parser?patentnumber=6996...

[4] http://en.wikipedia.org/wiki/Automated_Tiered_Storage

[5] http://en.wikipedia.org/wiki/Fusion_Drive


A modern WAFL clone? I've never read that before. The article you link to doesn't assert that either. Can you provide more information?


The original ZFS paper [1] references WAFL with respect to its similarity a number of times. It seems like the biggest distinction the paper claimed was that ZFS had pooled storage while WAFL was network oriented.

WAFL's biggest idea of the day was "write-anywhere" (the WA in WAFL). Write-anywhere is another way of phrasing copy-on-write which is a fancy way of saying _never overwrite_.

The idea, while simple, can be built upon to yield features like cheap snapshots and reasonable data integrity.
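The never-overwrite idea and why it makes snapshots cheap can be shown in a toy sketch (illustration only, nothing like either filesystem's actual layout):

```python
class CowStore:
    """Updates never overwrite live blocks, so a snapshot is just a
    saved copy of the root pointer table."""
    def __init__(self):
        self.blocks = {}     # block id -> immutable contents
        self.next_id = 0
        self.root = {}       # filename -> block id (the "live" tree)

    def write(self, name, data):
        # Allocate a fresh block instead of overwriting in place.
        self.blocks[self.next_id] = data
        self.root = dict(self.root, **{name: self.next_id})
        self.next_id += 1

    def snapshot(self):
        return dict(self.root)   # copies pointers, not data

    def read(self, name, root=None):
        return self.blocks[(root or self.root)[name]]

fs = CowStore()
fs.write("a.txt", b"v1")
snap = fs.snapshot()
fs.write("a.txt", b"v2")                 # the old block is untouched
assert fs.read("a.txt") == b"v2"
assert fs.read("a.txt", snap) == b"v1"   # the snapshot still sees v1
```

Because the old block is never reused while a snapshot points at it, snapshots cost only the pointer copy up front, and data integrity improves too: a crash mid-write can never leave a half-overwritten live block.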

Perhaps "clone" is a bit too much but the similarity is definitely there.

FWIW, the NetApp folks sued Oracle because they also thought it looked similar [2].

[1] http://users.soe.ucsc.edu/~scott/courses/Fall04/221/zfs_over...

[2] http://www.netapp.com/us/company/news/press-releases/news-re...


That's correct, and Oracle (actually Sun at that point) countersued. The cases were both dismissed w/o prejudice [1]. Of course, that doesn't mean neither case was valid.

[1] http://www.theregister.co.uk/2010/09/09/oracle_netapp_zfs_di...


can be outperformed by newer algorithms

I'm interested to hear more about this. There's CAR... what else?

As an aside, ZFS's ARC algorithm differs a fair bit from IBM's -- a case of theory which then met reality. I don't recall the details though, alas.


Off of the top of my head: CLOCK-Pro [1] - an approximation of LIRS [2]

It's quite popular; I remember that MySQL and one of the BSDs use it.

[1] http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/pap...

[2] http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/pap...
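For background on the CLOCK family: CLOCK-Pro inherits the classic CLOCK structure, which approximates LRU-style policies with a circular "hand" and reference bits instead of reshuffling a list on every hit. Here is plain second-chance CLOCK as a sketch; CLOCK-Pro itself is considerably more involved (it adds hot/cold/test page states to capture reuse distance, as in LIRS):

```python
class Clock:
    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = []    # list of [key, referenced_bit]
        self.index = {}    # key -> slot position
        self.hand = 0

    def access(self, key):
        """Returns True on a cache hit, False on a miss (with fill)."""
        if key in self.index:              # hit: just set the ref bit
            self.slots[self.index[key]][1] = 1
            return True
        if len(self.slots) < self.capacity:
            self.index[key] = len(self.slots)
            self.slots.append([key, 1])
            return False
        while self.slots[self.hand][1]:    # sweep, clearing ref bits
            self.slots[self.hand][1] = 0
            self.hand = (self.hand + 1) % self.capacity
        old_key = self.slots[self.hand][0] # victim: first cleared slot
        del self.index[old_key]
        self.slots[self.hand] = [key, 1]
        self.index[key] = self.hand
        self.hand = (self.hand + 1) % self.capacity
        return False

clock = Clock(3)
for k in "abc":
    clock.access(k)
clock.access("d")   # sweep clears all bits, then evicts "a"
clock.access("b")   # re-reference "b": its bit is set again
clock.access("e")   # "b" gets a second chance; "c" is evicted instead
```

The appeal over strict LRU is that a hit only flips a bit (no lock-heavy list manipulation), which is why CLOCK descendants keep showing up in kernels and databases.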




