De-duplication in BTRFS

Latest response

Online de-dupe is fast becoming a requirement of large storage systems.  There have been several discussions about the pros and cons of online vs offline vs why bother at all.  I'm in the camp that says de-dupe is a *requirement* for some people, so please have it as an included option with the file system.  It makes little sense to avoid doing it online with BTRFS, so stick it in at that point.

 

The arguments as to whether it suits my environment are best answered by me, so please let me make the decision as to whether I enable it on my disks.

Responses

Deduplication should be great for us. On our fileservers & mailhubs it would save a lot of space without modifying apps.

Depending on how well your mail stores are managed, you may achieve little in the way of meaningful space savings. Enterprise-grade mail systems tend to dedupe data internally. Thus, what's written to disk may not contain much externally-deduplicatable data (i.e., data that can be deduplicated at the filesystem level, outside of the app's internal functionality). If you're running a mail system that functions like Exchange, your FS-based dedupe rates would be lower than those found on a classic sendmail/mbox-based system.

 

Where you'll tend to get your best dedupe rates (of the above two listed use cases) are on fileservers - particularly home directory stores - where users are each squirreling away their own copies (or worse, multiple copies per user) of a given document.

 

 

Overall, before you can project space savings, you have to have a good idea of how inefficiently your current storage systems are being used.
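One rough way to get that idea is to measure how much of an existing tree is made up of byte-identical files. The Python sketch below is only an illustration under stated assumptions: the function names are hypothetical, it looks at whole files only (so it will understate what block-level dedupe could find), and it treats hard links to the same file as separate copies.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def estimate_file_level_savings(root):
    """Walk `root` and return (total_bytes, reclaimable_bytes).

    reclaimable_bytes counts every copy of a file beyond the first
    one that has identical content.
    """
    sizes_by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                digest = file_digest(path)
                size = os.path.getsize(path)
            except OSError:
                continue  # unreadable file: leave it out of the estimate
            sizes_by_digest[digest].append(size)
    total = sum(sum(sizes) for sizes in sizes_by_digest.values())
    reclaimable = sum(sum(sizes[1:]) for sizes in sizes_by_digest.values())
    return total, reclaimable
```

Run against a home directory store, `estimate_file_level_savings("/home")` would return the total bytes scanned and the bytes held in redundant copies; the path here is just an example.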

That's surely a calculation for the Admin to make?

 

As with all tools, its mere availability shouldn't be a reason to implement it.  Horses for courses.

 

My personal use would probably be more geared to de-dupe of virtual images - where the OS disk across a hundred or so VM guests is likely to be largely the same.  Huge de-duplication potential right there - downside of course is that a cold boot of your entire system (after datacentre failure for example) would have a large number of systems hitting the same spindles.  Surely caching helps here though?
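To put a number on that potential, one could count how many fixed-size blocks are shared across a set of guest images. The sketch below is an assumption-laden illustration, not how BTRFS or any storage product does it: the function name and the 4 KiB block size are hypothetical choices, and it holds every block hash in memory, so it only suits a modest number of images.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size; real systems may use something else


def count_blocks(paths, block_size=BLOCK_SIZE):
    """Return (total_blocks, unique_blocks) across all files in `paths`.

    Each file is split into fixed-size blocks, and a block is counted as
    unique the first time its SHA-256 digest is seen.
    """
    seen = set()
    total = 0
    for path in paths:
        with open(path, "rb") as f:
            while True:
                block = f.read(block_size)
                if not block:
                    break
                total += 1
                seen.add(hashlib.sha256(block).digest())
    return total, len(seen)
```

If a hundred guests built from the same template yield a `unique_blocks` figure close to that of a single image, the bulk of the OS disks is shared and the potential saving is large.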

 

D

That is the way NetApp's PAM modules work. They are "dedup aware". The deduped blocks are cached in the PAM card and thus provide bootstorm protection in virtualized environments.
