I don't know of any cross-library comparison benchmark. And even when a library ships its own internal benchmark, it usually uses a different number of data/parity shards and a different shard size than other libraries' benchmarks, which makes comparison difficult.
I wrote an MIT-licensed Reed-Solomon module for Node [1], with a native C++ backend that uses multiple cores for throughput. The internal benchmark (node benchmark.js) measures latency and throughput at data=10, parity=4 across a variety of shard sizes, and the number of data/parity shards can be adjusted for comparison.
It's not yet using optimized assembler or SIMD like Klaus Post's implementation, but it's compatible with it, is likewise based on Backblaze's implementation, and also includes a pure JavaScript implementation for browser use.
In my library, the SSSE3 code also has AVX and AVX2 equivalents, providing higher throughput. There are also implementations with specifically optimized instructions for ARM NEON, ARM64, PPC64/AltiVec and POWER8. All of this is in C (with intrinsics), not assembler, so it benefits from your C compiler's instruction scheduling, loop unrolling, and so on. See https://github.com/NicolasT/reedsolomon/blob/master/cbits/re...
The one you linked to is excellent. It has hand-crafted SSSE3 assembler for enormous speed. It's a bit difficult to compare the benchmarks from the one in the original posting to Klaus's, though, as they aren't run with the same (N,K) parameters.
I'm not sure PAR2 really counts as erasure coding rather than a more general FEC. I've tried zfec in the past, but it's not really set up for backup uses such as creating an archive to burn to a set of BD discs.
For backups, what you want is to generate additional files, PAR2-style, which can be combined with the regular archive files (e.g. GPG-encrypted tarballs, as in duplicity) to repair any bitrot or lost fractions of those archives. But zfec, as I understood it, only works if you turn the archives entirely into erasure-coded shares, assume you'll lose entire shares, and use m-of-n shares in recovery/restoration (as opposed to PAR2, which handles losing individual archive files as well as arbitrary corruption within any of those files, as long as the total does not exceed the redundancy percentage). So you would have to do something like create a single giant archive file, and if you wanted the equivalent of 50% redundancy, use zfec to turn it into '50-of-100' shares and back up the 100 shares (or 500-of-1000, etc.).
Then any shares which aren't bit-identical get tossed out at restoration, and you hope you have enough shares left to reconstruct the original giant archive file. But you might not: your shares might be mostly good but missing a few chunks, which is something a PAR2 setup could cope with (if the bad chunks are in the archive files, they get repaired normally, and if they're in the PAR2 redundancy, they don't matter). This is fine in the Tahoe-LAFS or datacenter setting, where you assume your storage nodes will either store perfectly or fail entirely (given any live errors, you can just scrub the entire machine and rebuild a replacement copy from the erasure-coded shares). Not so much in a backup setting.
Ultimately, while zfec is way faster than par2, I found this setup squirrely enough that I wasn't convinced to switch.
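To make the share-loss argument concrete, here's a toy simulation (all numbers made up for illustration, and the "zfec-style" / "PAR2-style" rules are simplified models of each tool's failure behavior, not their actual code): scattered chunk corruption that PAR2-style block repair shrugs off can knock out so many whole shares that an m-of-n scheme falls below its threshold.

```python
import random

# Toy model: an archive stored as 2000 fixed-size chunks.
K, N = 50, 100          # m-of-n style: need any 50 of 100 shares
CHUNKS_PER_SHARE = 20   # each share spans 20 chunks -> 2000 chunks total
TOTAL_CHUNKS = N * CHUNKS_PER_SHARE

def zfec_style_recoverable(corrupted):
    """A whole share is unusable if ANY of its chunks is corrupted;
    recovery needs at least K fully intact shares."""
    bad_shares = {c // CHUNKS_PER_SHARE for c in corrupted}
    return N - len(bad_shares) >= K

def par2_style_recoverable(corrupted):
    """Block-level repair copes with individually corrupted chunks as
    long as total damage stays under the redundancy (here 50%)."""
    return len(corrupted) <= TOTAL_CHUNKS // 2

random.seed(0)
# 7.5% of chunks hit by scattered bitrot -- far under 50% redundancy,
# but it lands in roughly 80 of the 100 shares.
corrupted = random.sample(range(TOTAL_CHUNKS), 150)
print("share-based recoverable:", zfec_style_recoverable(corrupted))
print("block-based recoverable:", par2_style_recoverable(corrupted))
```

With only 7.5% of chunks damaged, the block-repair model recovers easily, while the all-or-nothing share model has already lost well over half its shares.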
Slightly offtopic (although the title contains "portable"): it often saddens me that purely mathematical functions like these are so tightly coupled to implementation language (even though there are three different language bindings here). Shouldn't the programming community have advanced past that point by now?
I'm not sure it's a thing to be advanced past. Using a language to implement a given pure function is inherent to deploying it.
You could develop one universal way to create and deploy functions to different languages, but now you've introduced one more language to get people to use.
And in practical terms, migrating this sort of thing from one language to another when necessary is generally a good enough solution.
Imagine a high-level, straightforward imperative language X whose only purpose is to specify libraries and data structures. This X has strong typing, memory safety and maybe also memory alignment directives but abstracts away as much as possible from any actual environment/platform/CPU layout. It has a precise semantics and even some ways of specifying pragmatics (e.g. big O runtime behavior of functions). In X you write "pure" algorithms that do not access (m)any OS-dependent features. Libraries in this language are then transpiled to various target languages and platforms.
To use all libraries that have ever been written for X, someone only needs to write a transpiler-backend for the target language/platform once.
LLVM's intermediate representation is similar but too low-level. Think of a high-level counterpart.
Just guessing at the parent's intent, but the specification of a function can be decoupled from the production of an optimized implementation of that function, especially if the structure of the function is constrained in some way.
This approach is used to some degree in producing, for example, optimized BLAS implementations and other similar routines.
Of course, you do then need to juggle multiple languages. Different languages can be better at expressing different things, though (e.g., "what the function does" versus "how the function does it"), so it's not inherently crazy.
> Under the EPC, and in particular its Article 52,[1] "programs for computers" are not regarded as inventions for the purpose of granting European patents,[2] but this exclusion from patentability only applies to the extent to which a European patent application or European patent relates to a computer program as such.[3] As a result of this partial exclusion, and despite the fact that the EPO subjects patent applications in this field to a much stricter scrutiny[4] when compared to their American counterpart, that does not mean that all inventions including some software are de jure not patentable.
Going back to the original question, I certainly wouldn't want to start an open source project knowing that Qualcomm might sue me for patent infringement. Defending their intellectual property is part of their core business strategy.
Seems like the goal was to keep the product under copyright for some time, giving an opportunity to recover costs, or to integrate it exclusively into a product for a while for some advantage over competitors, but then to make it publicly licensed after that period.
As a quick answer, the name comes from being able to recover data when some of it is "erased".
The only way to durably store data so that it survives a hardware failure (e.g. a drive dying) is to store more than one copy. Full replicas are the simplest way to do this, but the overhead is relatively high (e.g. store 1 GB of data with 3x replication and you store 3 GB). Erasure codes are a way to effectively store fractional replicas, so you only use 1.5x or 1.7x the size of the original data.
Erasure codes are great when you've got a lot of data and you need high durability but don't want to pay for the storage space required for full replicas.
Why don't we always use erasure codes for everything? EC isn't great for small pieces of data, and since there's a bit of math involved in reading and writing EC data, it has higher latency than simple replicas.
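A toy sketch of the "fractional replica" idea above, using simple XOR parity (essentially RAID-5's scheme) rather than real Reed-Solomon: two data shards plus one parity shard give 1.5x storage, and any single shard, data or parity, can be lost ("erased") and rebuilt from the other two.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

data = b"hello world!"               # 12 bytes of original data
d0, d1 = data[:6], data[6:]          # split into two 6-byte data shards
parity = xor_bytes(d0, d1)           # one parity shard -> 18 bytes stored, 1.5x

# Simulate losing shard d0 entirely: rebuild it from the survivors,
# since parity XOR d1 == (d0 XOR d1) XOR d1 == d0.
recovered_d0 = xor_bytes(parity, d1)
print((recovered_d0 + d1).decode())  # prints: hello world!
```

Real erasure codes generalize this with Galois-field arithmetic so you can pick any k-of-n (e.g. 10-of-14), but the storage math and the "survive erasures via extra math at read/write time" tradeoff are the same.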
I tried Tahoe-LAFS; the setup seems a bit weird, and from trying it on that public gateway, there appear to be some issues around what happens if you lose the link to your folder.
There seems to be no way to simply attach to it and get a listing of only your own data.
I'm favoring Camlistore over Tahoe, even though Camlistore has no good encryption yet and no erasure coding. Its foundations seem better.
I believe it's compatible with the Java implementation used by Backblaze and others.