Hacker News
Speed Up Git Pull (interrobeng.com)
146 points by nahname on Sept 5, 2013 | 48 comments


A 50x speedup is pretty cool in its own right. Kudos.

However, I wonder if this isn't treating a symptom versus a root cause.

Is saving that 5s round-trip so common in your workflow that you needed to optimize it, and would it be more productive to refactor the app so you and collaborators are working on different files?

Also, this offers the genuinely valuable guidance that pushing to a nearby server is much faster than pushing to a remote one. GitHub Enterprise exists as a product you can run close to you. It'd be an interesting calculation to weigh the performance hit of waiting 5s to push to a remote server against the cost of keeping your own nearby server up and patched.

But nice to read, kudos all the same!


http://xkcd.com/1205/

If you pull 10 times a day (conservative for me) then you’ll save ~5 hours over a year.


But couldn't it very well take more than 5 hours to set this up?


Doesn't matter too much. A short wait can throw you out of focus, and it's the time you spend getting back into focus that's important.


More like 5 minutes to set it up.


It is when you need to dig into the documentation to figure all of this out yourself. But if you now google 'speed up git pull', you'll find this article, repeat the commands, ?????, profit! Five minutes.


It took me less than five seconds to make the SSH changes and less than five minutes to set up the intermediate server. Now I also have a cache for when GitHub goes down (all the time.)


> Now I also have a cache for when GitHub goes down (all the time.)

Isn't the `.git` a cache for when github goes down? Git keeps all of your history inside your repository; you don't need network access to do anything.


Sort of. I've automated synchronizing it with the GitHub repo, so my local repo may be behind the cached repo. I'm also protected if a repository is deleted/moved/DMCAed.


What benefits does this provide that a cron job running `git fetch` doesn't?
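(For concreteness, I mean something like this hypothetical crontab entry, with /srv/mirrors/project.git standing in for a local clone:)

    */15 * * * * git --git-dir=/srv/mirrors/project.git fetch --all --quiet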


Redundancy, public access and not having hundreds of unused repositories on my computer. I'm not trying to sell anyone on the idea, it just works for me and I like it so whatever.


How would refactoring help? You still need to fetch before you can push to git.


If you're a heavy SSH user, using multiplexing in this manner can have negative consequences [1]. Downsides include having all your multiplexed connections exiting if the master exits!

[1] http://www.anchor.com.au/blog/2010/02/ssh-controlmaster-the-...


More recent SSH clients can use "ControlPersist" to establish the master connection in the background, so the first session doesn't control the lifetime of the connection. This makes using ControlMaster workable.

I usually set ControlPersist to 30 seconds, which may not be long enough for people hoping to get performance improvements out of GitHub. Setting it to too large a value increases the risk that you'll have stale server sockets after a network outage.
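For reference, the relevant ~/.ssh/config stanza would look something like this (a sketch; tune the ControlPersist value to your own tolerance for stale sockets):

    Host github.com
      ControlMaster auto
      ControlPath ~/.ssh/cm-%r@%h:%p
      ControlPersist 30s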


> This makes using ControlMaster workable.

Best part of reading this article. I had turned off connection sharing because of this.

So what, in more details, are the downsides to ControlPersist?


One that I frequently run into is that if you use SSH tunneling (like -L), you have to specify it the first time you ssh to that machine (i.e. when the ControlMaster is connected) and can't change it later. Using -L on later ssh invocations to the same machine silently fails, which can be infuriating if you don't realise it's happening. The best you can do at that point is to kill the ControlMaster ssh (disconnecting you across all your sessions), and then reconnect with the right -L.


You can skip the master and spawn a fresh connection for your tunnel using `-o ControlPath=none`.


In fact, even better: you can add forwarding to your existing connection. Typing ~C at the start of a line opens a command line, which accepts the following commands:

    ssh> help
    Commands:
      -L[bind_address:]port:host:hostport    Request local forward
      -R[bind_address:]port:host:hostport    Request remote forward
      -D[bind_address:]port                  Request dynamic forward
      -KR[bind_address:]port                 Cancel remote forward
(If you're not familiar with them, some of the other escape sequences are useful too. ~? lists them all.)

[EDIT] Apparently, if you have a recent enough version, you can add a forward to the master with `ssh -O forward ...` [1]

[1] http://serverfault.com/questions/237688/adding-port-forwardi...
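For example (a sketch, assuming a recent OpenSSH and a master already running for github.com; the port numbers are made up):

    ssh -O check github.com                           # verify the master is alive
    ssh -O forward -L 8080:localhost:8080 github.com  # add a forward to it
    ssh -O exit github.com                            # tear the master down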


You can limit the sharing to just GitHub with a Host block:

  Host github.com
  ControlMaster auto
  ControlPath /tmp/%r@%h:%p
  ControlPersist yes


> Downsides include having all your multiplexed connections exiting if the master exits!

I believe this is what ControlPersist is meant to solve - it may not have existed when that blog post was written.


Meta question: assuming that lots of GH users do this (nice trick), would GH have loads of dormant SSH connections? At scale, this could be a huge number. Would this be an issue?


No. I have this enabled and Github closes my connections after a very short period. I use it primarily for SSHing into my cluster of EC2 instances (which does massively speed things up).


Same here. I do a

    (ssh -fqN -o "StrictHostKeyChecking no" git@bitbucket.org >&/dev/null &)
for both bitbucket and github in my zshrc; bitbucket stays open, but github gets closed at some point. It used to stay open, however.


They could offer it as a premium feature.


Hrm, so this isn't about making git 50x faster, but about faster network communication.

> Establishing an SSH connection every time you perform a Git operation costs many round-trips

I don't really understand why the author is saying this. The whole point of git is to be distributed and not to push/pull at each commit.

That being said, he found something that speeds up his workflow tremendously, so congratulations.


Even with a short timeout (say, 10 seconds), it could be useful for a maintainer who pulls from several other users' GitHub repositories. Instead of establishing a new SSH connection for each remote, a complete "git remote update" of multiple repositories could be done over a single connection.
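(Sketch of that setup, with alice and bob as hypothetical contributors:)

    git remote add alice git@github.com:alice/project.git
    git remote add bob git@github.com:bob/project.git
    git remote update   # with multiplexing, both fetches reuse one SSH connection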


Oh, you might still want to share your commits with your coworkers, and for that you need to push and/or pull. (You could do that asynchronously, though.)


Note that if you are on centos 6, the openssh version isn't new enough to support this feature.


this is not true


Yes it is. ControlPersist was introduced in openssh 5.6.

Centos 6 ships with a patched version of 5.3.


Both ssh and sshd I guess? Do both ends need to support the feature?


ah ok. ControlMaster and ControlPath are both supported though.


...and keep your development copy of the project on a ramdisk. Have a script that you launch as you start work: it creates the ramdisk, copies files from the persistent location there using rsync, then periodically runs rsync to copy the changes you make on the ramdisk back to the persistent location.

I used this setup for almost a year. It saved me a lot of time and sanity.
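A rough sketch of such a script (assumes Linux tmpfs, a persistent copy at ~/work/project, and a mount point I've invented; the real thing would want error handling and a cleaner shutdown):

    #!/bin/sh
    # create the ramdisk and seed it from the persistent copy
    sudo mkdir -p /mnt/ramdev
    sudo mount -t tmpfs -o size=2g tmpfs /mnt/ramdev
    rsync -a ~/work/project/ /mnt/ramdev/project/
    # periodically sync changes back to persistent storage
    while true; do
        rsync -a --delete /mnt/ramdev/project/ ~/work/project/
        sleep 60
    done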



Cool. I still prefer my 5 line bash script manually started in separate console.


Does this buy you that much over a SSD?


Not sure. I was using it on a laptop with spinning rust for a fairly large Rails project.


Swap on an SSD acts as pseudo (slower) extra RAM for when you need more.

Example: a $5 Digital Ocean droplet comes without swap, but its storage is SSD, so you can create a swap file, which is much less painful than swapping to a mechanical disk.


Apologies, I think you know the answer, but I couldn't understand what it was from this post. Could you reword that?


Too bad it doesn't work on Cygwin. (I share my ssh config between Linux and Windows.) Too bad ssh doesn't have conditional configuration. (Yes, I know I could script this, but it's a little more pain than I want for this gain.)


How about VirtualBox in seamless mode instead of Cygwin?


Are there any "shorter" options for ControlPath that are still unique? I've had a few instances of silly hostnames that have caused an error about the name being too long for the socket.


Doesn't this only help if github lets you leave ssh connections open and not doing anything for long periods of time?

Surely if they do, they won't for too long if lots of people start doing this.


What can be done to prevent the "Connection to github.com closed by remote host" error that comes a few minutes after a push/pull with the Control settings enabled? Since I'm normally in vim by then, it ruins the layout.


Not sure how you can stop that error, but you can run "Ctrl-L" or ":redraw" in vim to fix your layout.


I just tried the first part of this (the ssh multiplexing) and instead of getting faster, 'git fetch' got slower (1.9 to 2.4 seconds). Any ideas why, and how I can debug/improve it?


Dear GitHub,

Please set up local SSH termination to your network.

Thanks,

Alex


    git fetch
    git merge origin/master

A much preferred workflow.



