Very exciting to see implementation progress on HAMMER2. Some basics about the design:
- This is DragonflyBSD's next-gen filesystem.
- copy-on-write, implying snapshots and such, like ZFS, but snapshots are writable.
- compression and de-duplication, like ZFS
- a clustering system
- extra care to reduce RAM needs, in contrast to ZFS
- extra care to allow pre-allocation of files by writing zeros, something that will make SQL databases easier to run performantly on HAMMER2 than on ZFS
And much more. The design doc is an interesting read, take a look: https://gitweb.dragonflybsd.org/dragonfly.git/blob_plain/HEA...
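As a rough illustration of what "pre-allocation by writing zeros" means in practice (a sketch only, not HAMMER2's or ZFS's actual allocator behaviour):

```python
import os
import tempfile

# Pre-allocate a file by writing zeros up front. The claim above is that
# HAMMER2 takes care to make this reserve real space cheaply, whereas on
# a copy-on-write filesystem like ZFS, later overwrites still allocate
# fresh blocks, so the zero-fill buys you little.
size = 1 << 20  # 1 MiB, an arbitrary example size
fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as f:
        f.write(b"\0" * size)
    # The file's nominal size is reserved; whether the blocks stay put
    # on overwrite is the filesystem's business.
    assert os.path.getsize(path) == size
finally:
    os.remove(path)
```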
Isn't that an oxymoron? I thought the entire point of a snapshot was that it's an immutable record of the filesystem at a moment in time.
> extra care to reduce RAM needs, in contrast to ZFS
ZFS utilizes a lot of RAM for caching. That's not the same thing as needing a lot of RAM. I've seen this same complaint about modern operating systems. People will buy a lot of RAM and then get annoyed when they see the OS making use of it. As long as the memory is available to be allocated to applications, why should we care whether or not the operating system makes use of it?
Is it better if we call a snapshot a branch point?
I believe the complaint about ZFS and RAM is that the caching cannot be considered optional. Performance is substantially worse than other filesystems with more modestly sized caches, so the memory isn't really available to applications.
You can run ZFS with megabytes of RAM; it's just slow, so it's a pick-two situation. That being said, ZFS metadata is much bigger than it needs to be (most infamously, dedupe has this issue).
Right, but to then compare the memory use of ZFS with dedupe turned on against another fs without that feature would be disingenuous since dedupe is optional in ZFS. That was the whole point of my original comment.
Deduplication doesn't necessarily have to be expensive; HAMMER (the first version, not HAMMER2) has offline deduplication that is usually scheduled to run at night. This allows regular use to be quite performant, with a low memory footprint.
Of course there are tradeoffs, in particular heavy disk usage at certain hours (which can be an issue depending on the workload) and the fact that space is reclaimed only some time after it has been wasted.
I did say that ZFS metadata in general, and dedupe especially is much bigger than it needs to be. Even live dedupe can be done much more cheaply than how ZFS does it.
'btrfs subvolume snapshot <src> <dest>' is a writable snapshot; with -r flag it's read only. A Btrfs snapshot is a subvolume with stuff already in it, basically. In fact, all the inode numbers are initially identical in the original subvolume and the snapshot, and initially the snapshot is just a pointer. That's why they're cheap to create.
How does the deduplication work, in terms of user interface and performance?
For comparison:
ZFS has online-only dedup, so it can save space as data is written, but it can't combine identical pieces of already-written data. It also scatters the segments of files to the winds, needing lots of RAM and fast disks or it slows to a crawl.
BTRFS has mostly-offline dedup, so it doesn't save space as data is written, but you can combine identical pieces of data later. You can also make CoW copies of files instantly. It has minimal performance impact.
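For a concrete picture of the post-processing style, here is a toy Python sketch (hashes standing in for block references; not how Btrfs actually operates, which works on extents through ioctls):

```python
import hashlib

# Toy post-processing ("offline"-style) dedup pass: data is written
# without dedup, then a later scan collapses identical blocks into a
# single physical copy that the logical blocks reference.
blocks = [b"A" * 4096, b"B" * 4096, b"A" * 4096]  # written as-is

store = {}   # hash -> canonical physical block
refs = []    # what each logical block points at after the pass
for blk in blocks:
    key = hashlib.sha256(blk).hexdigest()
    if key not in store:
        store[key] = blk
    refs.append(key)

# Three logical blocks, but only two physical copies after the pass.
assert len(refs) == 3 and len(store) == 2
```

An inline scheme would do the same hash lookup on the write path instead, which is where the RAM cost of keeping the table hot comes from.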
> I believe the correct terms are inline vs. post-processing deduplication.
Thanks.
> There are pros and cons to each approach, and as usual with storage, a lot depends on implementation and workload.
I would probably argue that the biggest factors in different performance between ZFS and BTRFS deduplication are nearly independent of when the deduplication happens, and boil down to implementation decisions specific to those two filesystems.
If you happen to know, would the right comparison to elsewhere in computing be between something like reference count garbage collection (ala python ignoring its cycle detector) versus a sweeping garbage collector?
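For what it's worth, the analogy can be made concrete with Python's own memory management, which combines both strategies: refcounting acts immediately (like inline processing), while the cycle collector sweeps later (like a post-processing pass). A minimal sketch:

```python
import gc
import sys

gc.disable()  # make collection explicit for the demo

# Reference counting reclaims an object the instant its count hits zero.
x = []
y = x                          # second reference to the same list
assert sys.getrefcount(x) >= 2
del y                          # immediate bookkeeping, no pause needed

# A reference cycle defeats pure refcounting: counts never reach zero.
class Node:
    def __init__(self):
        self.other = None

a, b = Node(), Node()
a.other, b.other = b, a
del a, b                       # the two nodes keep each other alive

# A periodic sweep (Python's cycle collector) finds them later, the way
# a post-processing dedup pass finds duplicates in already-written data.
found = gc.collect()
assert found > 0               # the sweep reclaimed the cycle
gc.enable()
```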
Offline is a completely different term when speaking about filesystems. Offline means that the filesystem needs to be unmounted before an operation can proceed.
For example some filesystems can be grown online, but can shrink only offline.
Read-only and read-write snapshots are possible with LVM, ZFS, Btrfs, and many commercial storage solutions.
One reason why you would want a writable snapshot is during backup validation. An application may need to do crash recovery before being able to read data, and having a writable filesystem makes things a lot easier.
I know that ZFS and LVM both send writes to a separate area, so the original snapshot is never modified (don't know about Btrfs implementation details).
Yes, as I said: in ZFS you can instantly (and without using extra space) create a writable filesystem from a snapshot using 'zfs clone' (you can create multiple filesystems from one snapshot), but snapshots themselves are always read-only.
> extra care to allow pre-allocation of files by writing zeros, something that will make SQL databases easier to run performantly on HAMMER2 than on ZFS
I always thought that if you care about the performance of SQL databases, you bypass the filesystem / kernel and go directly to the LUN with something like ASM.
It's been years and it's one dude that's doing it iirc.
> Are there any preliminary benchmarks available?
He spent years just writing the doc and spec, and the post said the preliminary code will land in September, so that's probably a no.
Even with the September code posted, "preliminary" isn't going to mean much, because it will most likely have very few features compared to ZFS and Btrfs. You'll probably have to wait longer for a fair comparison...
IIRC it's a one man team and the dude is a unicorn.
The original plan for HAMMER2 included multi-master synchronous replication. I don't know if that's still on the roadmap, or if something less ambitious is planned.
IIRC, years ago Dillon sold some proprietary synchronous multi-master replication product that provided ACID guarantees while also being relatively performant. (Because synchronous multi-master had historically been quite slow.) I always thought HAMMER2's replication model was going to be an evolution of that tech.
Dillon's database was called Backplane, and most of it was open source. I was always disappointed that he never (as far as I know) finished it, because it seemed very promising.
If I were the Linux camp, I would seriously look at the work required to port this to their kernel. Since they can't mainline ZFS, this may be their next best option. Not that they will look outside their echo chamber, but they should.
The first release -- which will have very few of the planned features -- hasn't even happened yet. If I were "the Linux camp", I wouldn't even bother looking into H2 for at least another five years or so.
I had always thought Dillon was a hack when he forked FreeBSD at 4.x, but he's proven to have some novel ideas when it comes to these things, and I'm looking forward to trying out the production-ready HAMMER2 FS.
I have great respect for his ability to get shit done ever since Amiga days, so my thought was not "a hack" but rather - "how can he have this much patience and stamina?!"
Other than the design doc (which, being BSD, is bound to be the primary source of truth), does anyone know of any tech talks or more visual presentations about the design of HAMMER FS? I sure do love that it's available, but for just starting to wrap your head around an FS architecture, talking through some slides would sure be neat. I'm not immediately seeing much on YouTube...
It's a short email post. Yes, they're built into the inode structure and would be used for de-duplication. (However, I'd hope that for a production system there's still a /verify/ pass on the data being de-duped to confirm it actually __is__ a dupe.)
Interestingly, this is not what happens in ZFS [1] (unless the defaults have changed since a few years ago):
> If you accept the mathematical claim that a secure hash like SHA256 has only a 2^-256 probability of producing the same output given two different inputs, then it is reasonable to assume that when two blocks have the same checksum, they are in fact the same block. You can trust the hash. An enormous amount of the world's commerce operates on this assumption, including your daily credit card transactions. However, if this makes you uneasy, that's OK: ZFS provides a 'verify' option that performs a full comparison of every incoming block with any alleged duplicate to ensure that they really are the same, and ZFS resolves the conflict if not. To enable this variant of dedup, just specify 'verify' instead of 'on':
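The quoted 'verify' behaviour can be sketched in miniature (a toy model only; the `write_block` helper and in-memory `store` are hypothetical, not ZFS code):

```python
import hashlib

# Toy inline dedup table: hash of block -> the stored block.
# With verify=True, a hash match is confirmed byte-for-byte before
# the new write is treated as a duplicate.
store = {}

def write_block(data: bytes, verify: bool = True) -> str:
    key = hashlib.sha256(data).hexdigest()
    if key in store:
        if verify and store[key] != data:
            # A genuine collision: refuse to alias different data.
            raise RuntimeError("hash collision detected")
        return key                  # duplicate: no new space used
    store[key] = data
    return key

k1 = write_block(b"hello" * 1024)
k2 = write_block(b"hello" * 1024)   # dedup hit, verified
assert k1 == k2 and len(store) == 1
```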
E knows of some critical security update that needs to be installed in sensitive locations.
E also knows of some attack on the hashing algorithm in use by the filesystem, allowing them to craft a small block containing mostly garbage but some key bits they would like to control. (Yes, this is hypothetical, but prior algorithms /have/ fallen.)
E thus arranges to have this 'duplicate' block stored before routine and predictable maintenance patterns.
A installs the updates, and the 'duplicate' file is now E's data stream, served under A's intended credentials.
E has caused system corruption, and potentially privilege escalation.
Bah, 256 bits is already so vanishingly small it's hard to comprehend. If you put a thousand disks into a pool you might reach 2^40 blocks, leaving you with 256-80=176 bits of margin. That is never going to collide. You could make such a filesystem for each atom on Earth and the odds of a single collision would be less than 0.1% You could put a billion disks in a pool and still have 136 bits of margin.
You hit a point where the longer runtime increases your chance of error by a very small amount that is nevertheless larger than the protection you gain.
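The arithmetic in that comment can be checked with the usual birthday-bound estimate (a back-of-the-envelope sketch, not a property of any particular filesystem):

```python
# Birthday-bound estimate: with n uniformly random blocks and a b-bit
# hash, the probability of any collision is about n*(n-1) / 2^(b+1).
def collision_probability(n_blocks: int, hash_bits: int) -> float:
    return n_blocks * (n_blocks - 1) / 2 ** (hash_bits + 1)

# ~2^40 blocks (a thousand-disk pool) against SHA-256: roughly 2^-177,
# vanishingly small compared to hardware error rates.
p = collision_probability(2 ** 40, 256)
assert 0 < p < 1e-50
```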
Portability doesn't seem to have been a primary goal for HAMMER/2 - I remember reading on mailing list that the reason was to not let portability impede other goals.
Thus even with FreeBSD, a port would not be straightforward because of VFS API differences, required scheduler support, and buffer cache implementation differences[1]. Linux, I would assume, is even more work.
I started an improved port of HAMMER to Linux (I don't like the current FUSE one and started building an in-kernel version from scratch), but stopped working on it before I got it working or published it. I am planning to start working on it again with someone else. I wouldn't be surprised if someone else beats us to it, though.
I had a reasonable idea of what was needed, and a roadmap for what order to tackle things in, when I was working on this (briefly) before. I didn't work on it for long: I changed jobs and had less time for non-work projects, personal circumstances got in the way too, and I switched to focusing on smaller projects. This was several years ago, before HAMMER2, and I haven't followed developments closely since (I haven't really been looking at the code), so I expect some of my conclusions are no longer valid and I need to revisit them. I have been interested in taking this up again for a few months, but hadn't actually decided to start until a few weeks ago, when someone else I know expressed interest too.
Which utilities you need, and what you do in userspace, depends somewhat on the goal of the port. What goes in userspace also depends on what you are porting to (FUSE, or which OS). My goal was (and is again now) first-class support on Linux: you should eventually be able to do everything you can on Dragonfly with HAMMER, with as good performance as possible. Initially, performance takes a back seat to feature support (unless it were so bad that it made features unusable), but good performance is an important end goal of mine. For my goals, I think an in-kernel driver and userspace utilities based on the Dragonfly ones, but with some substantial differences in code, make sense.
There is a FUSE port of HAMMER that lets you read, without history I think. I don't believe it is maintained anymore, but it had very different goals to mine, and a port with different goals might want to pick it up or start a new FUSE port. In particular, if you don't care about supporting every feature, want to get the things you do plan to support working quickly, don't care about performance in the long run, and/or want to target platforms other than Linux as well or instead, FUSE might make more sense. I think there are reasons to work on a FUSE port as well as a Linux-specific port (and ports to other OSs); they have slightly different goals and trade-offs.
To be honest, I'm not sure exactly what you want to know here, so if I haven't addressed it, please do ask (though it's possible that what I know is too out of date to be interesting).
If you're interested in working on this together (a few people I know have already expressed interest, all based in London so far), in hearing more about progress in the near future, or if you are planning your own Linux port or a port to another OS, I'd be interested to hear. Let me know if you'd like me to provide contact details for non-public messages (there's no email on my profile right now). I am not planning to make anything I do public until we at least have read support, without history, working and at least somewhat tested, but I'm not set in stone about that. I don't see any reason to use our port over the existing FUSE one until after then, either.
The better question is when are you ready to switch to a proper kernel, DragonflyBSD? When you think about HAMMER2 you should really think about all the other improvements also.
There are plenty of reasons to want to/have to use Linux. There is clearly space in the world for more than one kernel. That said, Dragonfly deserves more use.
This is certainly the easiest way to get a working implementation, yes. There is an existing (but, I think, unmaintained) FUSE port of HAMMER, and I don't think it would be a huge amount of work to add HAMMER2 FUSE support. I am interested in a kernel port of HAMMER/HAMMER2 because I think that is likely to give the best performance in the long run.
https://gitweb.dragonflybsd.org/dragonfly.git/blob_plain/HEA...