Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

I'm also wondering why this kind of intervention is necessary at all. The NoSQL solution we use at work has load based automatic splitting, and I'd have thought (though I haven't confirmed) that this would be an obvious feature to include.


I would speculate that it's a poorly-chosen shard key. MongoDB's built-in sharding uses range-based indexing. If you choose user_id as your shard key, and those are autoincrementing integers, then you're screwed if newer users tend to be more active on average than older ones.


Wait, people shard on a key other than something approximately random, like an sha1 hash!?


Where I work (Etsy) we keep an index server that maps each user to a shard on an individual basis. There are a number of advantages to it. For example, if one user generated a ton of activity they could in theory be moved to their own server. Approximately random works for the initial assignment.

Flickr works the same way (not by coincidence, since we have several former Flickr engineers on staff).


Sounds like a good system. I've noticed that people tend to do things like shard based on even/odd, and then they realize that they need three databases.

I've never had either problem though... but if I ever need to shard I plan on doing it based on object ID. Then one request can be handled by multiple databases, "for free", increasing both throughput and response time.


Even/odd isn't the end of the world, but you would then be best jumping to mod 4.


Actually for anonymous sharding (without a central index) a consistent hash is about the closest you can get to ideal distribution and flexibility. I haven't looked but I presume that's what mongo uses under the hood for their auto-sharding, too.


Load based splitting is on the MongoDB roadmap, but doesn't exist yet.


Just because the feature exists, doesn't mean it works.


Which NoSQL store do you use?



That looks potentially useful in some cases. Can you send me the code/docs for that?



Does it work that well for production for you? Really, that's extremely interesting.


Also look at dynamo based systems like cassandra and riak (riak seems to have a better load balancing at the moment, cassandra is a bit more "bumpy")




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: