We've considered placing some of the servers in a different location, but it's a big risk to find out whether Ceph can handle the latency. We're also considering using GitLab Geo to run a secondary installation. Data centers can have intermittent network or power issues, but they are unlikely to go down for days (for example, Backblaze runs out of a single DC). At some point we'll likely have multiple data centers, but our first focus is making GitLab.com fast as soon as possible.
You will have to move off Ceph and build middleware that routes git operations to the right shard. Take a look at libgit2[0] for that.
Such middleware is pretty easy to build for git over SSH (a simple script that looks up which shard holds the repository and then connects to that shard to run the operation), and only slightly harder for the HTTP side. At the webapp level you'd have a kind of RPC layer for git-related operations that connects to the right shard and runs them there.
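The SSH routing script described above could look roughly like this minimal sketch. It assumes git-over-SSH is funneled through a forced command, so the requested repository arrives in SSH_ORIGINAL_COMMAND; the shard map, hostnames, and namespace-based lookup are all hypothetical placeholders for whatever lookup the real installation would do (e.g. a database query).

```python
#!/usr/bin/env python3
"""Sketch of a shard-routing wrapper for git over SSH.

Run as the SSH forced command: parse the git service and repository
path out of SSH_ORIGINAL_COMMAND, look up which shard owns the repo,
and re-run the git command on that shard over SSH.
"""
import os
import re
import subprocess
import sys

# Hypothetical static shard map keyed by top-level namespace;
# a real installation would query a database instead.
SHARD_MAP = {
    "group-a": "shard1.internal",
    "group-b": "shard2.internal",
}

def shard_for(repo_path: str) -> str:
    """Return the shard host owning the repository's namespace."""
    namespace = repo_path.split("/")[0]
    return SHARD_MAP[namespace]

def main() -> int:
    # git clients send e.g.: git-upload-pack '/group-a/project.git'
    cmd = os.environ.get("SSH_ORIGINAL_COMMAND", "")
    m = re.match(r"(git-(?:upload|receive)-pack) '([^']+)'", cmd)
    if not m:
        print("invalid git command", file=sys.stderr)
        return 1
    service, repo = m.group(1), m.group(2)
    host = shard_for(repo.lstrip("/"))
    # Forward the git service invocation to the owning shard.
    return subprocess.call(["ssh", host, f"{service} '{repo}'"])

if __name__ == "__main__":
    sys.exit(main())
```

The HTTP side would do the same lookup inside the web application and proxy the smart-HTTP request to the owning shard, which is why it's only slightly more work.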
When you use Ceph you are basically running one huge filesystem at the full scale of your GitLab installation. In practice, though, that installation contains many independent datasets, so you do not need to pay the cost of Ceph's global consistency. You have many islands of data.
Do you have a disaster recovery plan that starts with "A meteor has destroyed our primary data center."?
I do; that's my default scenario. If you can survive that, you can survive all sorts of smaller issues like network congestion, data center power problems, grid power problems, and zombie plagues (or flu, which is more likely).
Depends on what you mean by 'survive'. I'd call a backup in Google Nearline sufficient for the meteor scenario, but that's going to be very slow and unpleasant to rely on for milder problems.
It's not sufficient. How quickly could you procure new hardware, install that in a datacenter, make it fully functional, and restore your backups? The answer is likely weeks/months. Could your business survive being offline that long? It sounds unlikely.
In an emergency you don't need new hardware. You can get cloud servers in minutes. If people have been practicing restores then it should not take particularly long to get the containers working again. A couple days to get things working-ish. That should be survivable while everyone focuses on the news coverage of the meteor.
But that's a true emergency situation. Don't go offline for multiple days for something that's reasonably likely to happen.