Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

The last time I wanted an unpredictable, non-repeating sampling of 64-bit numbers, I used a symmetric cipher in CTR mode. Essentially, start with a random key and a random value, return the encrypted version of it, then the encrypted version of its increment, and so on. Returns every number in the range, with no predictability.


The thing the Vitter algorithm does which this doesn't is that it produces the samples in order. This is why it's useful with tape drives, or sampling market data through the day: you can start at the beginning, and take the samples without ever having to seek backwards.


It is also the only reason to use this algorithm :) It's a pretty important requirement that I don't think the author emphasized enough.

If you want them shuffled (or don't care if they're shuffled), better use a permutation function.

BTW there are much faster permutation functions than cryptographic ones (you don't need the security aspect for this application). Look to modern PRNGs.


This is a good approach. Thanks to bijective mapping between original and encrypted spaces the result is guaranteed to be unique. And since the mapping function is a secure encryption algorithm, the output has uniform distribution. Well done.


Indeed, but this assumes that modern crypto is secure! :)


When our crypto prof told us we're not sure if one way functions exist, everyone's jaw dropped.


I know this, as I work in crypto.

But I’d have never considered using it for this - it doesn’t register in the brain as anything other then crypto.

Awesome solution, thanks for hacking my brain.


This works, but only if you're sampling the exact same space as the block size. For example, if your cipher outputs 64-bit blocks, but you want to sample in 2^32, you're back to the drawing board, because there's no way to truncate the cipher output without risking duplicates.


You can use the "hasty pudding trick" to truncate the cipher output without risking duplicates. It is inefficient if the block space of the cipher is much larger than the sampling space, though.

However, you can also use a Feistel network to create a block cipher for an arbitrary bit length. So, you can always bound the block space to be no more than twice the sampling space, which is okay.


What's the hasty pudding trick? Is it related to the Hasty Pudding cipher?


It is. Perhaps it's not officially called that, but that's how it was described to me:

https://groups.google.com/d/msg/sci.crypt/gjYclOVJDrc/tiv4ic...


A variable-size block construction would help here. It doesn't even have to be cryptographically secure, only bijective and statistically good enough similar to PRNGs sans CS. It doesn't even have to be keyed.


Another good tool for this is the linear feedback shift register, and most programmers outside of crypto seem to be unaware of it too. A "complete" N-bit LFSR gives you 2^N non-repeating pseudorandom values. You start with some value (the seed), and feed it into your LFSR operation along with your chosen bitmask (the "taps"), and you get a difficult-to-predict value back out.

"Complete" LFSRs are created by carefully selecting your taps, and there's plenty of literature available for a broad range of taps for many different numbers of output bits.

Your cipher may have an LFSR buried somewhere in it, I guess they're a common component in crypto. (I know dick-all about crypto, mostly.)

Usual caveats for PRNGs apply: LFSRs are cool and magical and super fast, but the next number in the sequence can be computed by anyone else that can figure out your LFSR and seed, and since a complete LFSR will hit each value in 2^N exactly once per cycle, it doesn't even count as pseudorandom.

But, if you need a random-looking non-sequential non-repeating selection widget, an LFSR just might do the trick.


I’ve used this trick with a 64-bit block cipher (base-32 encoding the result) to generate short, unique public identifiers (to put in URLs) for entries in a database based on an integer primary key.


There is also Skip32 cipher for 32-bit values. Interestingly enough, it was initially designed to cover the scenarios like yours.


This is a good way to pick a few 64 bit numbers without duplicates, if 64 bit numbers is what you need.

I didn't see any replies to you mention this part, but Vitter's algorithm works on an unknown number of samples. You can pick a 64-bit number uniformly when you know in advance you have exactly 2^64 choices. But how would you pick samples from a stream with equal probability, without knowing how many items are in the stream? Maybe it's 3, maybe it's 2.493e85, you won't know until you run out, and when you do, you want to have picked some items from the stream with uniform probability.


Incorrect, you do know this number beforehand. The parameter `N` in the code is the upper bound on the number of values, and the parameter `n` is the number of samples.


I don't know which code you mean, but I was referring to Vitter's basic "Algorithm R", which does not depend on knowing the number of samples or an upper bound in advance. https://en.wikipedia.org/wiki/Reservoir_sampling#Algorithm_R


Wait, are symmetric ciphers guaranteed to cycle through the full space of block-sized numbers when applied to their output like that? I thought that was just "close to" true?


Yes, it's bijective. For every input value, it provides a unique output that will decrypt back to that input.

You might be misunderstanding the algorithm. We're only encrypting an incrementing number, not ever encrypting something that is already encrypted.


Ah, okay. Thanks! Maybe I was thinking of hash functions, for which it's not guaranteed that every output has an (n-sized) pre-image?

Edit: To clarify, I do now understand that you were proposing:

random number X -> output encrypt(X), encrypt(X+1), encrypt(X+2), etc

Which I had misinterpreted as:

random number X -> output encrypt(X), encrypt(encrypt(X)), encrypt(encrypt(encrypt(X))) etc.


Counter mode means encrypting successive (all different) numbers, not the previous output. There is no issue with cycles.


Yes, the original commenter replied to say that and I agreed to it.

https://news.ycombinator.com/item?id=20962758


This is my homespun solution whenever I need a deterministic, non-repaeting mapping of a sequence. Except I rely - shame on me - on a variable blocksize encryption scheme I've hacked together for just this purpose.

Output - of the mapping as well as of anything I run through the encryption algo - passes all pseudo-random validity testing I throw at it, but obviously I would never, ever consider using the thing where real security was a concern.


Choosing an initial random number, then keep modulo-incrementing it by a generator (a number larger than, and relatively prime to, the stack size) wouldn't work as well?


In a similar vein, I use a PRNG (based on Blake2b cryptographic hash) and mask off bits for the required domain.


Why not simply increment the CTR IV with a monotonically increasing counter?


you can also use a Lehmer random number generator of full period.


Like all LCGs, the output of the Lehmer RNG is very predictable, but it would indeed suffice for the purpose given in the article.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: