Very exciting to see implementation progress on HAMMER2. Some basics about the design:
- This is DragonflyBSD's next-gen filesystem.
- copy-on-write, implying snapshots and such, like ZFS, but snapshots are writable.
- compression and de-duplication, like ZFS
- a clustering system
- extra care to reduce RAM needs, in contrast to ZFS
- extra care to allow pre-allocation of files by writing zeros, something that will make SQL databases easier to run performantly on HAMMER2 than on ZFS
And much more. The design doc is an interesting read, take a look: https://gitweb.dragonflybsd.org/dragonfly.git/blob_plain/HEA...
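As a rough illustration of what "pre-allocation by writing zeros" means in practice (a sketch only, not HAMMER2's or ZFS's actual allocator behaviour):

```python
import os
import tempfile

# Pre-allocate a file by writing zeros up front. The claim above is that
# HAMMER2 takes care to make this reserve real space cheaply, whereas on
# a copy-on-write filesystem like ZFS, later overwrites still allocate
# fresh blocks, so the zero-fill buys you little.
size = 1 << 20  # 1 MiB, an arbitrary example size
fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as f:
        f.write(b"\0" * size)
    # The file's nominal size is reserved; whether the blocks stay put
    # on overwrite is the filesystem's business.
    assert os.path.getsize(path) == size
finally:
    os.remove(path)
```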
Isn't that an oxymoron? I thought the entire point of a snapshot was that it's an immutable record of the filesystem at a moment in time.
> extra care to reduce RAM needs, in contrast to ZFS
ZFS utilizes a lot of RAM for caching. That's not the same thing as needing a lot of RAM. I've seen this same complaint about modern operating systems. People will buy a lot of RAM and then get annoyed when they see the OS making use of it. As long as the memory is available to be allocated to applications, why should we care whether or not the operating system makes use of it?
Is it better if we call a snapshot a branch point?
I believe the complaint about ZFS and RAM is that the caching cannot be considered optional. Performance is substantially worse than other filesystems with more modestly sized caches, so the memory isn't really available to applications.
You can run ZFS with megabytes of RAM; it's just slow, so it's a pick-two situation. That being said, ZFS metadata is much bigger than it needs to be (most infamously, dedupe has this issue).
Right, but to then compare the memory use of ZFS with dedupe turned on against another fs without that feature would be disingenuous since dedupe is optional in ZFS. That was the whole point of my original comment.
Deduplication doesn't necessarily have to be expensive; HAMMER (the first version, not HAMMER2) has offline deduplication that is usually scheduled to run at night. This allows regular use to be quite performant, with a low memory footprint.
Of course there are tradeoffs, in particular heavy disk usage at certain hours (which can be an issue depending on the workload) and the fact that space is reclaimed only some time after it has been wasted.
I did say that ZFS metadata in general, and dedupe especially is much bigger than it needs to be. Even live dedupe can be done much more cheaply than how ZFS does it.
'btrfs subvolume snapshot <src> <dest>' is a writable snapshot; with -r flag it's read only. A Btrfs snapshot is a subvolume with stuff already in it, basically. In fact, all the inode numbers are initially identical in the original subvolume and the snapshot, and initially the snapshot is just a pointer. That's why they're cheap to create.
How does the deduplication work, in terms of user interface and performance?
For comparison:
ZFS has online-only dedup, so it can save space as data is written, but it can't combine identical pieces of already-written data. It also scatters the segments of files to the winds, needing lots of RAM and fast disks or it slows to a crawl.
BTRFS has mostly-offline dedup, so it doesn't save space as data is written, but you can combine identical pieces of data later. You can also make CoW copies of files instantly. It has minimal performance impact.
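For a concrete picture of the post-processing style, here is a toy Python sketch (hashes standing in for block references; not how Btrfs actually operates, which works on extents through ioctls):

```python
import hashlib

# Toy post-processing ("offline"-style) dedup pass: data is written
# without dedup, then a later scan collapses identical blocks into a
# single physical copy that the logical blocks reference.
blocks = [b"A" * 4096, b"B" * 4096, b"A" * 4096]  # written as-is

store = {}   # hash -> canonical physical block
refs = []    # what each logical block points at after the pass
for blk in blocks:
    key = hashlib.sha256(blk).hexdigest()
    if key not in store:
        store[key] = blk
    refs.append(key)

# Three logical blocks, but only two physical copies after the pass.
assert len(refs) == 3 and len(store) == 2
```

An inline scheme would do the same hash lookup on the write path instead, which is where the RAM cost of keeping the table hot comes from.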
> I believe the correct terms are inline vs. post-processing deduplication.
Thanks.
> There are pros and cons to each approach, and as usual with storage, a lot depends on implementation and workload.
I would probably argue that the biggest factors in different performance between ZFS and BTRFS deduplication are nearly independent of when the deduplication happens, and boil down to implementation decisions specific to those two filesystems.
If you happen to know, would the right comparison to elsewhere in computing be between something like reference count garbage collection (ala python ignoring its cycle detector) versus a sweeping garbage collector?
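For what it's worth, the analogy can be made concrete with Python's own memory management, which combines both strategies: refcounting acts immediately (like inline processing), while the cycle collector sweeps later (like a post-processing pass). A minimal sketch:

```python
import gc
import sys

gc.disable()  # make collection explicit for the demo

# Reference counting reclaims an object the instant its count hits zero.
x = []
y = x                          # second reference to the same list
assert sys.getrefcount(x) >= 2
del y                          # immediate bookkeeping, no pause needed

# A reference cycle defeats pure refcounting: counts never reach zero.
class Node:
    def __init__(self):
        self.other = None

a, b = Node(), Node()
a.other, b.other = b, a
del a, b                       # the two nodes keep each other alive

# A periodic sweep (Python's cycle collector) finds them later, the way
# a post-processing dedup pass finds duplicates in already-written data.
found = gc.collect()
assert found > 0               # the sweep reclaimed the cycle
gc.enable()
```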
Offline is a completely different term when speaking about filesystems. Offline means that the filesystem needs to be unmounted before an operation can proceed.
For example some filesystems can be grown online, but can shrink only offline.
Read-only and read-write snapshots are possible with LVM, ZFS, Btrfs, and many commercial storage solutions.
One reason why you would want a writable snapshot is during backup validation. An application may need to do crash recovery before being able to read data, and having a writable filesystem makes things a lot easier.
I know that ZFS and LVM both send writes to a separate area, so the original snapshot is never modified (don't know about Btrfs implementation details).
Yes, as I said: in ZFS you can instantly (and without using extra space) create a writable filesystem from a snapshot using 'zfs clone' (you can create multiple filesystems from one snapshot), but snapshots themselves are always read-only.
> extra care to allow pre-allocation of files by writing zeros, something that will make SQL databases easier to run performantly on HAMMER2 than on ZFS
I always thought that if you care about the performance of SQL databases, you bypass the filesystem / kernel and go directly to the LUN with something like ASM.
It's been years and it's one dude that's doing it iirc.
> Are there any preliminary benchmarks available?
He spent years just writing the doc and spec, and the post said the preliminary code will land in September, so that's probably a no.
Even with the September code posted, "preliminary" isn't going to mean much, because it will most likely have very few features compared to ZFS and Btrfs. You'll probably have to wait longer for a fair comparison...
IIRC it's a one man team and the dude is a unicorn.
The original plan for HAMMER2 included multi-master synchronous replication. I don't know if that's still on the roadmap, or if something less ambitious is planned.
IIRC, years ago Dillon sold some proprietary synchronous multi-master replication product that provided ACID guarantees while also being relatively performant. (Because synchronous multi-master had historically been quite slow.) I always thought HAMMER2's replication model was going to be an evolution of that tech.
Dillon's database was called Backplane, and most of it was open source. I was always disappointed that he never (as far as I know) finished it, because it seemed very promising.
If I were the Linux camp, I would seriously look at the work required to port this to their kernel. Since they can't mainline ZFS, this may be their next best option. Not that they will look outside their echo chamber, but they should.
The first release -- which will have very few of the planned features -- hasn't even happened yet. If I were "the Linux camp", I wouldn't even bother looking into H2 for at least another five years or so.
I had always thought Dillon was a hack when he forked FreeBSD at 4.x, but he's proven to have some novel ideas when it comes to these things, and I'm looking forward to trying out the production-ready HAMMER2 FS.
I have great respect for his ability to get shit done ever since Amiga days, so my thought was not "a hack" but rather - "how can he have this much patience and stamina?!"
Other than the design doc (which, being BSD, is bound to be the primary source of truth), does anyone know of any tech talks or more visual presentations about the design of HAMMER FS? I sure do love that it's available, but for just starting to wrap your head around an FS architecture, talking through some slides would sure be neat. I'm not immediately seeing much on YouTube...
It's a short email post. Yes, they're built into the inode structure and would be used for de-duplication. (However, I'd hope that for a production system there's still a /verify/ pass on the data being de-duped to confirm it actually __is__ a dupe.)
Interestingly, this is not what happens in ZFS [1] (unless the defaults have changed since a few years ago):
> If you accept the mathematical claim that a secure hash like SHA256 has only a 2^-256 probability of producing the same output given two different inputs, then it is reasonable to assume that when two blocks have the same checksum, they are in fact the same block. You can trust the hash. An enormous amount of the world's commerce operates on this assumption, including your daily credit card transactions. However, if this makes you uneasy, that's OK: ZFS provides a 'verify' option that performs a full comparison of every incoming block with any alleged duplicate to ensure that they really are the same, and ZFS resolves the conflict if not. To enable this variant of dedup, just specify 'verify' instead of 'on':
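The quoted 'verify' behaviour can be sketched in miniature (a toy model only; the `write_block` helper and in-memory `store` are hypothetical, not ZFS code):

```python
import hashlib

# Toy inline dedup table: hash of block -> the stored block.
# With verify=True, a hash match is confirmed byte-for-byte before
# the new write is treated as a duplicate.
store = {}

def write_block(data: bytes, verify: bool = True) -> str:
    key = hashlib.sha256(data).hexdigest()
    if key in store:
        if verify and store[key] != data:
            # A genuine collision: refuse to alias different data.
            raise RuntimeError("hash collision detected")
        return key                  # duplicate: no new space used
    store[key] = data
    return key

k1 = write_block(b"hello" * 1024)
k2 = write_block(b"hello" * 1024)   # dedup hit, verified
assert k1 == k2 and len(store) == 1
```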
E knows of some critical security update that needs to be installed in sensitive locations.
E also knows of some attack on the hashing algorithm in use by the filesystem, allowing them to craft a small block containing mostly garbage but some key bits they would like to control. (Yes, this is hypothetical, but prior algorithms /have/ fallen.)
E thus arranges to have this 'duplicate' block stored before routine and predictable maintenance patterns.
A installs the updates, and the 'duplicate' file is now E's data stream, served under A's intended credentials.
E has caused system corruption, and potentially privilege escalation.
Bah, 256 bits is already so vanishingly small it's hard to comprehend. If you put a thousand disks into a pool you might reach 2^40 blocks, leaving you with 256-80=176 bits of margin. That is never going to collide. You could make such a filesystem for each atom on Earth and the odds of a single collision would be less than 0.1% You could put a billion disks in a pool and still have 136 bits of margin.
You hit a point where the longer runtime increases your chance of error by a very small amount that is nevertheless larger than the protection you gain.
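The arithmetic in that comment can be checked with the usual birthday-bound estimate (a back-of-the-envelope sketch, not a property of any particular filesystem):

```python
# Birthday-bound estimate: with n uniformly random blocks and a b-bit
# hash, the probability of any collision is about n*(n-1) / 2^(b+1).
def collision_probability(n_blocks: int, hash_bits: int) -> float:
    return n_blocks * (n_blocks - 1) / 2 ** (hash_bits + 1)

# ~2^40 blocks (a thousand-disk pool) against SHA-256: roughly 2^-177,
# vanishingly small compared to hardware error rates.
p = collision_probability(2 ** 40, 256)
assert 0 < p < 1e-50
```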
Portability doesn't seem to have been a primary goal for HAMMER/2 - I remember reading on mailing list that the reason was to not let portability impede other goals.
Thus even with FreeBSD, a port would not be straightforward because of VFS API differences, required scheduler support, and buffer cache implementation differences[1]. Linux, I would assume, is even more work.
I started an improved port of HAMMER to Linux (I don't like the current FUSE one and started building an in-kernel version from scratch), but stopped working on it before I got it working or published it. I am planning to start working on it again with someone else. I wouldn't be surprised if someone else beats us to it, though.
I had a reasonable idea of what was needed, and a roadmap for what order to tackle things in, when I was working on this (briefly) before. I didn't work on it for long: I changed jobs and had less time for non-work projects, personal circumstances got in the way too, and I switched to focusing on smaller projects. This was several years ago, before HAMMER2, and I haven't followed developments closely since (I haven't really been looking at the code), so I expect some of my conclusions are no longer valid and I need to revisit them. I have been interested in taking this up again for a few months, but hadn't actually decided to start until a few weeks ago, when someone else I know expressed interest too.
Which utilities you need, and what you do in userspace, depends somewhat on the goal of the port. What goes in userspace also depends on what you are porting to (FUSE, or which OS). My goal was (and is again now) first-class support on Linux: you should eventually be able to do everything you can on Dragonfly with HAMMER, with as good performance as possible. Initially, performance takes a back seat to feature support (unless it were so bad that it made features unusable), but good performance is an important end goal of mine. For my goals, I think an in-kernel driver and userspace utilities based on the Dragonfly ones, but with some substantial differences in code, make sense.
There is a FUSE port of HAMMER that lets you read, without history I think. I don't believe it is maintained anymore, but it had very different goals to mine, and a port with different goals might want to pick it up or start a new FUSE port. In particular, if you don't care about supporting every feature, want to get the things you do plan to support working quickly, don't care about performance in the long run, and/or want to target platforms other than Linux as well or instead, FUSE might make more sense. I think there are reasons to work on a FUSE port as well as a Linux-specific port (and ports to other OSs); they have slightly different goals and trade-offs.
To be honest, I'm not sure exactly what you want to know here, so if I haven't addressed it, please do ask (though it's possible that what I know is too out of date to be interesting).
If you're interested in working on this together (a few people I know have already expressed interest, all based in London so far), in hearing more about progress in the near future, or if you are planning your own Linux port or a port to another OS, I'd be interested to hear. Let me know if you'd like me to provide contact details for non-public messages (there's no email on my profile right now). I am not planning to make anything I do public until we at least have read support, without history, working and at least somewhat tested, but I'm not set in stone about that. I don't see any reason to use our port over the existing FUSE one until after then, either.
The better question is when are you ready to switch to a proper kernel, DragonflyBSD? When you think about HAMMER2 you should really think about all the other improvements also.
There are plenty of reasons to want to/have to use Linux. There is clearly space in the world for more than one kernel. That said, Dragonfly deserves more use.
This is certainly the easiest way to get a working implementation, yes. There is an existing (but, I think, unmaintained) FUSE port of HAMMER, and I don't think it would be a huge amount of work to add HAMMER2 FUSE support. I am interested in a kernel port of HAMMER/HAMMER2 because I think that is likely to give the best performance in the long run.
https://gitweb.dragonflybsd.org/dragonfly.git/blob_plain/HEA...