We've considered placing some of the servers in a different location, but it's a big risk to find out whether Ceph can handle the latency. We're also considering using GitLab Geo to run a secondary installation. Data centers can have intermittent network or power issues, but they are unlikely to go down for days (for example, Backblaze runs out of a single DC). At some point we'll likely have multiple data centers, but our first focus is making GitLab.com fast as soon as possible.
You will have to move off Ceph and build middleware that routes git operations to the right shard. Take a look at libgit2[0] for that.
Such middleware is pretty easy to build for git over SSH (a simple script that looks up which shard holds the repository and then connects to that shard to run the operation), and only slightly harder for the HTTP side. At the webapp level you'd have a kind of RPC layer for git-related operations that connects to the right shard and runs them there.
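The SSH routing script described above could look roughly like this minimal sketch. It assumes git-over-SSH is funneled through a forced command, so the requested repository arrives in SSH_ORIGINAL_COMMAND; the shard map, hostnames, and namespace-based lookup are all hypothetical placeholders for whatever lookup the real installation would do (e.g. a database query).

```python
#!/usr/bin/env python3
"""Sketch of a shard-routing wrapper for git over SSH.

Run as the SSH forced command: parse the git service and repository
path out of SSH_ORIGINAL_COMMAND, look up which shard owns the repo,
and re-run the git command on that shard over SSH.
"""
import os
import re
import subprocess
import sys

# Hypothetical static shard map keyed by top-level namespace;
# a real installation would query a database instead.
SHARD_MAP = {
    "group-a": "shard1.internal",
    "group-b": "shard2.internal",
}

def shard_for(repo_path: str) -> str:
    """Return the shard host owning the repository's namespace."""
    namespace = repo_path.split("/")[0]
    return SHARD_MAP[namespace]

def main() -> int:
    # git clients send e.g.: git-upload-pack '/group-a/project.git'
    cmd = os.environ.get("SSH_ORIGINAL_COMMAND", "")
    m = re.match(r"(git-(?:upload|receive)-pack) '([^']+)'", cmd)
    if not m:
        print("invalid git command", file=sys.stderr)
        return 1
    service, repo = m.group(1), m.group(2)
    host = shard_for(repo.lstrip("/"))
    # Forward the git service invocation to the owning shard.
    return subprocess.call(["ssh", host, f"{service} '{repo}'"])

if __name__ == "__main__":
    sys.exit(main())
```

The HTTP side would do the same lookup inside the web application and proxy the smart-HTTP request to the owning shard, which is why it's only slightly more work.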
When you use Ceph you are basically running one huge filesystem at the full scale of your GitLab installation. In practice, though, that installation contains many independent datasets, so you do not need to pay the cost of Ceph's global consistency. You have many islands of data.
Do you have a disaster recovery plan that starts with "A meteor has destroyed our primary data center."?
I do; that's my default scenario. If you can survive that, you can survive all sorts of smaller issues like network congestion, data center power problems, grid power problems, and zombie plagues (or flu, which is more likely).
Depends on what you mean by 'survive'. I'd call a backup in Google Nearline sufficient for the meteor scenario, but that's going to be very slow and unpleasant to rely on for milder problems.
It's not sufficient. How quickly could you procure new hardware, install that in a datacenter, make it fully functional, and restore your backups? The answer is likely weeks/months. Could your business survive being offline that long? It sounds unlikely.
In an emergency you don't need new hardware. You can get cloud servers in minutes. If people have been practicing restores then it should not take particularly long to get the containers working again. A couple days to get things working-ish. That should be survivable while everyone focuses on the news coverage of the meteor.
But that's a true emergency situation. Don't go offline for multiple days for something that's reasonably likely to happen.