Hacker News

Why is hashmap not good enough for unique ID generation?

    const MAX_TWEET_ID = 1e12;  // upper bound on generated IDs
    let TWEET_COUNT = 100;      // how many unique IDs we still need

    const tweetIDs = {};

    while (TWEET_COUNT) {
        // Math.floor keeps IDs integral; Math.random() alone yields floats
        const rnd = Math.floor(Math.random() * MAX_TWEET_ID);
        if (!tweetIDs[rnd]) {
            tweetIDs[rnd] = true;
            TWEET_COUNT--;
        }
    }


The algorithm described in the article has other desirable properties:

* IDs are returned in sorted order

* IDs are generated one after another, i.e. the method could be implemented as a generator or coroutine

* No additional space required during the computation

Together these make it a single-pass algorithm.
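For intuition, here is a minimal sketch of what sorted, single-pass, O(1)-space sampling looks like. This is Knuth's Algorithm S (selection sampling), a simpler O(n) relative of Vitter's O(k) method, not the article's algorithm itself; the function name is illustrative:

```javascript
// Knuth's Algorithm S: draw k of the integers 0..n-1 uniformly,
// emitted in sorted order, using O(1) extra space.
function* sortedSample(k, n) {
    let remaining = k;
    for (let i = 0; i < n && remaining > 0; i++) {
        // Include i with probability (still needed) / (still available).
        if (Math.random() * (n - i) < remaining) {
            yield i;
            remaining--;
        }
    }
}
```

For example, `[...sortedSample(3, 10)]` yields three distinct integers below 10, already in ascending order. Vitter's contribution is skipping over the non-selected indices directly, so the loop runs O(k) times instead of O(n).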


Nicely said.

I was thinking of ways to eliminate the overhead of checking whether a random number had already been drawn; Vitter's solution is to make the random number generator produce the numbers in sorted order, which eliminates that overhead entirely.

EDIT: Then again... if the algorithm is to be used to shuffle a list, wouldn't returning the selection in a sorted manner defeat the entire purpose? If k = n, output would be input...


It would. The original problem is to draw a subset of tweets.

If you ever need to randomize the entire tweet archive (k = n), then what you really want is a shuffling algorithm.


> * IDs are returned in sorted order

This is key. The author does not explain this well, or at all. If you require the IDs in sorted order, you need this algorithm.

If you don't mind or want them shuffled, use a permutation function.

(The idea of shuffling afterwards is nonsense: if you can afford a full shuffle, you can also afford to check for duplicates while drawing.)
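A hedged sketch of what a permutation-function approach could look like: an affine bijection i → (a·i + b) mod n, which is a valid permutation of 0..n-1 exactly when gcd(a, n) = 1. The names and constants here are illustrative, not from the article:

```javascript
// Build a pseudo-random permutation of 0..n-1 via an affine
// bijection; `a` must be coprime to `n` for this to be a bijection.
function makePermutation(n, a, b) {
    const gcd = (x, y) => (y === 0 ? x : gcd(y, x % y));
    if (gcd(a % n, n) !== 1) throw new Error("a must be coprime to n");
    return (i) => (a * i + b) % n;
}

// The first k outputs are k distinct "shuffled" IDs, no lookup table needed.
const perm = makePermutation(1000, 617, 42);
const ids = Array.from({ length: 5 }, (_, i) => perm(i));
// ids -> [42, 659, 276, 893, 510]
```

Note the caveats: an affine map is statistically weak (evenly spaced outputs), and a*i can exceed Number.MAX_SAFE_INTEGER for large n; a serious implementation would use something like a format-preserving cipher instead.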


>The main one is that eventually you’re ignoring too many cards and the algorithm doesn’t finish on time (this is the coupon collector’s problem).

Each iteration will also take longer, and when MAX_TWEET_ID and TWEET_COUNT are both large and close to (or equal to) each other, it may take very long to complete.


That gets pretty slow and memory intensive once you sample a large fraction of the total, doesn't it?


It was not a requirement :) However it won't get slow; looking up whether a key exists is O(1).


> (Stated more formally: given non-negative integers k and n with k <= n, generate a list of k distinct random numbers, each less than n. We reasonably ask that this be done in O(k) time and O(1) additional space. Finally, the distribution must be uniform.)

As k approaches n the expected number of draws needed to get k unique values increases faster than O(k).
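To make that concrete: with i values already collected, the next new value takes n/(n−i) draws on average (a geometric distribution), so the expected total is n·(H_n − H_{n−k}). A small sketch (the function name is mine):

```javascript
// Expected number of uniform draws from 0..n-1 needed to
// collect k distinct values by rejection sampling.
function expectedDraws(k, n) {
    let total = 0;
    for (let i = 0; i < k; i++) {
        total += n / (n - i);  // geometric: success probability (n-i)/n
    }
    return total;
}
```

For k much smaller than n this is essentially k (collisions are rare), but at k = n it grows like n·ln n — e.g. collecting all of 1000 values takes roughly 7500 draws on average, not 1000.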


It will get slow towards the end, once each loop iteration has a very low chance of being productive (which only happens when you're choosing all or almost all of the elements). You can fix that, though there are potential tradeoffs.
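One such fix (my assumption, not something the commenter specified): when k > n/2, rejection-sample the n−k elements to exclude instead, so every loop iteration keeps a high success probability:

```javascript
// Draw k distinct integers from 0..n-1. When k > n/2, sample the
// complement, so rejection sampling never fights long odds.
function sampleDistinct(k, n) {
    const invert = k > n / 2;
    const target = invert ? n - k : k;
    const picked = new Set();
    while (picked.size < target) {
        picked.add(Math.floor(Math.random() * n));
    }
    if (!invert) return [...picked];
    // Return everything NOT in the excluded set.
    const out = [];
    for (let i = 0; i < n; i++) if (!picked.has(i)) out.push(i);
    return out;
}
```

The tradeoffs mentioned above show up here: the complement pass costs O(n) time, and the inverted result comes out in sorted order rather than in random draw order.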



