Scaling Mercurial at Facebook (facebook.com)
164 points by chairmanwow on May 27, 2016 | 72 comments


> Our code base has grown organically and its internal dependencies are very complex.

So instead of cleaning up the internal dependencies, they decided to rewrite Mercurial. That is the kind of thing Facebook likes to do: for example, when PHP got too slow, they wrote a PHP compiler....


I think you may not grasp how many thousands of engineers Facebook employs at this point. It is literally easier and less risky for them to assemble a 5-10 person team to fix Hg than to fix the code of thousands of other engineers.


... Or hire the creator of Hg. As they did.


While that is true, it is also true that it would have made sense for them to pay for Perforce upfront. I would say that hindsight is 20/20 but I'll wager there was an engineer there who did say this at the time...


Perforce won't scale to that level either. As Google has shown. Google's "Piper" is based on Perforce but isn't Perforce, because Perforce couldn't hack it.


What do you think is a better use of Facebook's engineering resources? Having one team work on building different developer tools, or having every team put their normal work on pause to refactor their entire codebase to clean up dependencies?


The general implication in this type of story is not that the tools were too limited for the use case, but that the processes which led to the use case are probably badly broken and the apparent tooling limitation is simply a symptom of that underlying disease.


Having code bloat doesn't mean that a process is broken; it comes down to what you are optimizing for. Facebook optimizes for getting new code into production fast. I don't know if @ktRolster stopped reading right after the line he quoted, but his point is addressed in the article:

Our code base has grown organically and its internal dependencies are very complex. We could have spent a lot of time making it more modular in a way that would be friendly to a source control tool, but there are a number of benefits to using a single repository. Even at our current scale, we often make large changes throughout our code base, and having a single repository is useful for continuous modernization. Splitting it up would make large, atomic refactorings more difficult. On top of that, the idea that the scaling constraints of our source control system should dictate our code structure just doesn't sit well with us.


> Even at our current scale, we often make large changes throughout our code base, and having a single repository is useful for continuous modernization. Splitting it up would make large, atomic refactorings more difficult.

I read it, but none of those things make me think better of their code.


Ah, but the perfectionist engineer is rarely satisfied with production code quality. Hence we have another important part of a software company--the managers, who optimize for things like getting features out more quickly, things that are effective at doing what companies seek: make money.


> the managers, who optimize for things like getting features out more quickly, things that are effective at doing what companies seek: make money

Which works great until the day you wake up to realize your code base has become an incomprehensible mess and your engineering systems are outdated and your competitors are shipping better software at a faster pace.

I'm not really disagreeing, I'm just saying a balance must be struck. Really hate companies that view anything but feature work as a loss.


As someone who works at Facebook on a team dedicated to improving developer efficiency (React), I assure you that FB sees value in long-term maintenance and tooling improvements. All the teams building features do too.


Stop growing gigantic monolithic repositories. This is a very stupid way to do things. Evidence: it requires you to rewrite Mercurial. Essentially nobody except Facebook has this problem, and that isn't because Facebook is so far ahead of everyone else in its engineering practices.


I've worked at both FB (monorepo) and Amazon (massively federated repo where you only check out tiny slices at a time; a huge package dependency system pulls things together), and I can testify that monorepo is great, and federated repo with dependencies-via-packaging is awful. It's slower to work with, harder to coordinate wide scale changes, slower to work with, the cognitive load is much higher, and it's slower to work with. Did I mention it's higher friction and slower to work with?


What do people mean by monorepo exactly (e.g. at Facebook)? For instance, FB's open source projects aren't all hosted in the same "open source monorepo"; why not? And what goes into the monorepos exactly? All the closed source code?

As a "one-man-team" who uses at least 15 different repos, it's hard for me to imagine how a massive company would manage things within a single repo and no package management.

Also, aren't any of those monorepo companies concerned that a single rogue employee or stolen laptop could leak their entire source code?

I can only guess I'm misunderstanding what is meant by monorepo and how they're used...


The majority of Google's code is in one gigantic repository. Sometimes there are other repositories for very sensitive code, for example if you have a hardware vendor's closed-source driver code and they don't want many people to look at it. A few projects are in different repositories for their own reasons, like major open source projects. I don't think there's a single clear criterion you could come up with for what doesn't go in the main repository.

There are dependency management tools that help enforce public/private code on a wider scale and that help the build tools make sense of it all. There are also ownership tools that say what people and teams are qualified to review code in certain directories. The config files for all these tools are checked into the repository.
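As a rough illustration (made-up config format and names, not Google's actual tooling), an ownership check can be as simple as a mapping from directories to approvers, checked in alongside the code:

    # Rough illustration only; made-up format and names, not anyone's real tooling.
    # Maps repo directories to the people/teams allowed to approve changes there.
    OWNERS = {
        "search/ranking/": {"alice", "ranking-team"},
        "infra/build/": {"bob"},
        "": {"release-managers"},  # repo root: fallback approvers
    }

    def approvers_for(path: str) -> set[str]:
        # Longest matching directory prefix wins.
        best = max((d for d in OWNERS if path.startswith(d)), key=len)
        return OWNERS[best]

    print(sorted(approvers_for("search/ranking/model.py")))  # ['alice', 'ranking-team']
    print(sorted(approvers_for("docs/readme.md")))           # ['release-managers']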

There's no versioning though. If you want to change an internal API you just update it and all the callers at once: patches in source control are already atomic. For truly massive changes (more code than most companies have) this gets too unwieldy and there are special tools and strategies people use.

This means you can't have a project depending on an out of date (internal) library. Without that requirement, you don't have the situation where different libraries need to be synced to different versions. And without needing to sync different things differently, you can get away with just one repository.
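As a toy illustration of what "change an internal API and update all the callers at once" can look like (made-up function names; real large-scale-change tooling uses proper parsers, ownership approval, and test selection rather than a regex):

    # Toy sketch only; made-up names. Rewrites every caller of an internal API
    # in one pass so that a single commit covers the API and all of its users.
    import pathlib
    import re

    OLD, NEW = "fetch_user_profile", "load_user_profile"
    pattern = re.compile(rf"\b{OLD}\b")

    for path in pathlib.Path(".").rglob("*.py"):
        text = path.read_text()
        if pattern.search(text):
            path.write_text(pattern.sub(NEW, text))
            print("updated", path)

One patch, every caller, no version skew to manage afterwards.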

I worked at Amazon previously, which has world-class tools for dealing with versioned libraries in bulk, and Google's approach is vastly better. You spend less time worrying about breaking other people's dependencies, and you don't have someone spending a day fixing libraries every couple of weeks.


We export each of the open source projects out of the monorepo and into separate GitHub projects using a tool called ShipIt (https://code.facebook.com/posts/1715560542066337/automatical...)


    FBShipIt has been primarily designed for branches with linear histories; in particular, it does not understand merge commits.
That's sad.


Sounds wonderful to me.


Friends tell me that very sensitive code (ranking algos, etc) is not in the repo.

Personally, I also favor a single repo. You manage it the way you manage separate packages: with organization and some discipline. The magic is that command line tools like grep, sed & awk--and static analysis tools for refactoring--work really well. You can change a method signature and it just works. I've been part of monolith-breaking before (most recently at Trulia) and it definitely adds friction to the development process to work across such a rich graph of package dependencies.


Sensitive code _is_ in the repo, you just can't read it. This allows you to change library code, run the tests of the sensitive application code, and know that it worked, without being able to read the sensitive code.


What prevents you from reading it, if it's in the repo?


Access control if my sources are correct.


So you can read the interface and compile against it, but not the source of the actual algorithm? Do you know how that works?


No idea how it actually works but here are a couple of ideas about how it could work:

* Compilation is done centrally: you code against a mock or only the interface, and then submit the code for testing and the final build (a rough sketch of this idea follows the list).

* Or only libraries are supplied, possibly obfuscated.
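A toy sketch of the first idea, with entirely made-up names: the readable part of the repo holds only the interface and a mock, while the restricted implementation is linked in by the central build and test infrastructure:

    # Hypothetical sketch; all names are made up. Downstream code and its tests
    # depend only on this interface and mock; the real implementation lives in
    # an access-controlled directory and is only linked in by central CI.
    from abc import ABC, abstractmethod

    class Ranker(ABC):
        """Public interface, checked into a readable part of the repo."""

        @abstractmethod
        def score(self, item_id: str) -> float: ...

    class FakeRanker(Ranker):
        """Mock used for local development and open tests."""

        def score(self, item_id: str) -> float:
            return 0.5  # deterministic placeholder

    def top_item(ranker: Ranker, item_ids: list[str]) -> str:
        # Library code anyone can change; central CI re-runs this against the
        # restricted real Ranker and reports pass/fail without exposing it.
        return max(item_ids, key=ranker.score)

    if __name__ == "__main__":
        print(top_item(FakeRanker(), ["a", "b", "c"]))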


PHP doesn't have a proper module system. It's not like Facebook had any choice.

Even with stolen FB code, you still couldn't do anything without the dedicated infrastructure.


Wrong. Google has the exact same problem, only they chose to rewrite Perforce.

The scale of these codebases is way outside what most people have experienced, and it might behoove people to realize that a lot of conventional wisdom might not apply. E.g. last time I checked, Google's codebase was >2 billion lines of code.

Google has 30,000 engineers working on a "monolithic" codebase, and they remain relatively productive despite (or perhaps because of) it. There are whole sets of different problems at this scale.


See https://www.youtube.com/watch?v=W71BTkUbdqE for a lot more details on what Google does.


And Google has been contributing to Mercurial. Like subtree: https://bitbucket.org/Google/narrowhg


It's also very beneficial to code health, as it allows global refactorings, e.g. adopting new language features, improving shared libraries, etc.


2 billion lines of code @ 80 characters per line (with 2 bytes per character) = 320 GB of code, which fits on my Mac's SSD. I wonder whether 30,000 engineers sending 30,000 requests per second to server(s) holding 320 GB of code (which can be kept in memory) is really a huge problem in terms of scale, especially when these same engineers build systems for handling billions of requests?
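For what it's worth, the arithmetic above checks out under those (generous) assumptions:

    # Back-of-envelope check: 2 billion lines, 80 chars/line, 2 bytes/char.
    lines = 2_000_000_000
    total_bytes = lines * 80 * 2
    print(total_bytes / 10**9, "GB")  # 320.0 GB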


You're not considering that it's not just the current 2 billion lines of code - it's the history of the repo, branches, forks, possibly binaries thrown in.


I never built a source code system, so I'm not an expert, but isn't branching just snapshotting the current version of the file(s) (strings), with newer versions calculated as deltas? Aren't most of the operations in a source code management system diffs and merges of strings? Basically, my point is that the scale problems associated with these systems are not at the same level as the other scalability problems handled in Google's core competency areas, such as search, which deal with billions of requests or petabytes of data. Not that they are easy, but they're orders of magnitude smaller.


I've worked in places where we depended on external repos (Maven, Gradle, other Ivy-based things) and places where all code has to live in a mono-repo.

Thus far I've preferred the mono-repo, mostly for dependency management reasons. Whether you have a lot of internal dependencies or external dependencies, you get similar benefits:

- All the changes to internal dependencies are in your revision history, across the company. Shared internal library upgrades are picked up quickly and propagate throughout applications with low delay.

- It's easier to have shared versions of external dependencies "automatically" instead of establishing policies that need a human to enforce. This makes it easier to roll out new versions and bug fixes.

- It's easier to do system-wide improvements in code quality. Replacing common bad code with better implementations is something I've seen across code bases at both Google and Twitter, and it has been beneficial.

I think nobody should be trying to implement a system like this if you've got fewer than ~600 SW engineers, though. Small groups don't have as much drift, or as many system-wide refactorings, for it to pay off.


At Expedia, we had a gigantic Perforce repository, but the trend has moved to hundreds of smaller repositories and applications hosted in enterprise GitHub.

Package management is simple: we push common packages into Sonatype Nexus with semantic versioning.

Personally I prefer strong versioning and allowing teams to upgrade common libraries at their own pace, but we've had a few times where we needed to quickly upgrade every repository (e.g., for a security patch).

You have a few options:

1. You can unpublish the old dependency, breaking all builds until they upgrade. (Frowned upon)

2. You can write a script that identifies all repository owners that depend on you and send out an "upgrade by X date" email.

3. You can script pull request creation and submit hundreds of pulls to upgrade the dependency (rough sketch below).
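A rough sketch of option 3 (hypothetical repo and branch names; assumes a GITHUB_TOKEN environment variable and that a branch with the version bump has already been pushed to each repo):

    # Hypothetical sketch only: open an upgrade pull request against every
    # dependent repo via the GitHub REST API. Repo/branch names are made up.
    import os

    import requests

    DEPENDENT_REPOS = ["example-org/checkout", "example-org/search"]
    HEADERS = {"Authorization": f"token {os.environ['GITHUB_TOKEN']}"}

    for repo in DEPENDENT_REPOS:
        resp = requests.post(
            f"https://api.github.com/repos/{repo}/pulls",
            headers=HEADERS,
            json={
                "title": "Upgrade common-lib to 2.4.1 (security patch)",
                "head": "upgrade-common-lib",  # branch with the version bump
                "base": "master",
                "body": "Automated dependency upgrade; please merge by Friday.",
            },
        )
        print(repo, resp.status_code)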

So far, things have worked out well, and I never want to go back to the monorepo.


Google, Facebook, and Twitter all use monolithic repos (I'm taking monolithic to mean 50+% is in one repo, not necessarily everything in one repo), so no, not just Facebook. I'm going to assume there are very good reasons for doing so (and there are - you can look up past discussions on HN).


I love it when random commenters point out the obvious problems of the world's wealthiest, most successful companies.


I know, do you reckon Facebook engineers are sitting awkwardly in a meeting with this comment page up:

"Why didn't we think of this!!!?"


Implying you have to have lots of money to be correct in your software management approach. Try again.

Maybe it's because these unknowns can't fathom working with systems so tightly wound that, after all the code reviews are done, your changes are already irrelevant and need to be updated again and go through another review. In before the straw man: rather than just pointing out day-to-day issues, maybe you should try to understand the argument against trying to fit everything in one place. Do you really want to be the maintainer of all those third-party libraries you imported? Or do you disallow third-party code and adopt "not invented here"?

Try again


The point is: whatever your opinions are of Google's or Facebook's source code management process, and whatever you think the defects are of that process, and no matter what improvements you think could be made, these defects and shortcomings have demonstrably not stood in the way of these companies completing the largest software projects in history.


"Herp derp! Successful software companies that move fast and write lots of code quickly despite their size are obviously fools that don't know how to set up their codebases."

http://danluu.com/monorepo/ is a good explanation of why that's a pretty foolish take.


Either that, or they're being pressed to death under the Design Stamina Hypothesis.


Yea, essentially this, they have bigger problems: http://www.darkcoding.net/software/facebooks-code-quality-pr...


Funnily enough, the same logic didn't apply to their Android app when they were running up against Dalvik method limits... although to an extent it did, in another way.



I generally see two reactions to the "one codebase to rule them all" approach (used by Facebook, Google, et al.):

1. Holy god why would you let your code grow to such a massive, interdependent scale? Just release everything separately and versioned so that breaking changes don't affect everyone all at once. The idea of git being a bottleneck is absurd and you are using it wrong.

2. This is a very reasonable, practical approach to sharing code across a company. It reduces siloing and ensures that major refactors can happen in one pass without a ton of coordination. Better to fix the version control system than waste endless resources refactoring millions of lines of code.

Both reactions are valid. Having worked in both styles of codebase, I recognize that there are trade-offs in either case. The optimal solution depends on the project and the team.

Sometimes the path of least resistance--that is to say, the path to getting things shipped and, in turn, making money--is to let the codebase grow organically and worry about cleaning up any messy interdependencies later, once you have a better idea of what code you even need to keep around. In this scenario, it's important to recognize that developer efficiency is going to be an uphill battle in the long run, but if you are proactive about maintenance and tooling improvements then this approach can still be relatively painless.

Other times, especially when you're working on a tried-and-tested product with a clear API and a dedicated team, it can be productive to split it out and let the team manage their own versioning and releases. This becomes especially useful if the product is open source. (For instance, I wonder how Facebook manages its open source releases relative to its shared Mercurial codebase.) In this scenario, developer efficiency is usually less of a problem, as proper use of versioning can ensure faster, more agile updates to each product. But the downside is that your company as a whole can end up in a kind of versioning hell, where every project depends on a different version of every other project, and keeping everything up to date can require a huge amount of coordination.

So, in the end, pick your poison. My reaction, years ago, was more along the lines of #1, but I used to be much more of an idealist earlier in my career.


Approach #1 lands you in a dependency hell where you have to maintain multiple incompatible versions of each internal library or framework (and of each external library that's in use), it multiplies the work involved in upgrading dependencies, and in various other ways it leads to its own problems. There's no panacea, but I can see the appeal of having a single shared codebase.


Still, much easier to manage with branches than before git and mercurial were around.


It's not a question of managing source branches, it's a question of whether you have to update your OpenSSL dependency once for the whole company or a thousand times over for each individual software package.


I was super skeptical of it when I started at Google. But it just works, and along with the rigorous code review process and commitment to code health, it makes for clean code with lots of consistency and re-use.


I see the "sweeping change" argument put forward often, but I don't really understand how this can be the case.

There is no atomic deployment of a large distributed system - so even if you can check in related changes in different areas of a codebase, how do you release them?


Solving deployment is orthogonal to organizing your code. You can deploy multiple products which share the same codebase, and you can also deploy one product which draws from multiple codebases (say, from other gems or npm packages that you also maintain). Regardless of where your code goes, handling distributed releases generally requires agreed upon contracts (e.g. schemas) and proper dependency management during rollouts.


> For instance, I wonder how Facebook manages its open source releases relative to its shared Mercurial codebase.

It sanitizes and syncs them to GitHub

https://code.facebook.com/posts/1715560542066337/automatical...


The same kind of work is being done on Git these days:

http://thread.gmane.org/gmane.comp.version-control.git/29510...


I believe this is Twitter once again attempting to integrate their use of watchman (open sourced by FB) into the git core.

Their first attempt was in May 2014: http://www.spinics.net/lists/git/msg230487.html


Yeah, David Turner has been working for Twitter. But his work is based on previous work on an "index-helper" daemon by Nguyễn Thái Ngọc Duy who is not working for any company as far as I know.


He's also involved in the thread from May 2014.


> January 7, 2014

This is years old. Am I missing why this is newly relevant?


This seems to be an old article (January 7, 2014). Does anyone know if Facebook still uses Mercurial?

About securing different parts of the repo: the Mercurial server actually doesn't have any user authentication! You are meant to handle that yourself with SSH or a web server, where you should be able to restrict access to particular folders.

About using a single repo: it does make sense to have all code that interacts together in the same place. Imagine changing a variable name in some API and at the same time updating every usage of that name in the whole codebase. And imagine the bureaucracy and people management needed just to make a variable change if there were separate repos which you might not even have access to.


I kind of like to hate on Facebook because they waste my time and (try to) track me everywhere. But I'm also a big Hg fan, and Watchman looks awesome. So it's hard to completely hate them. Sigh. They're kinda like Google.


(2014)


It's funny how they did not use Perforce, which works really well for large repositories.


They did mention Perforce in passing:

> For a repository as large as ours, a major bottleneck is simply finding out what files have changed. Git examines every file and naturally becomes slower and slower as the number of files increases, while Perforce "cheats" by forcing users to tell it which files they are going to edit. The Git approach doesn't scale, and the Perforce approach isn't friendly.

I don't know if Perforce's "unfriendliness" is enough reason not to choose it, nor do I know if that was the only reason they rejected it. However, speaking for myself, I vastly prefer Subversion, Git, and Mercurial over Perforce precisely because Perforce requires you to ask for permission before you can edit a file.


For any of the major IDEs with Perforce plugins, this happens automatically when you start editing a file. It's essentially a non-issue for normal use cases. Things work a little differently when you're offline, but for most people, offline work is uncommon.


Git could use inotify (Linux) or kevent (BSD) to monitor for file changes in the background. It's not a limitation of the format or workflow, just the current tools.
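A minimal sketch of that idea using the third-party watchdog package (pip install watchdog), which wraps inotify/FSEvents/kqueue; a VCS would keep the resulting "dirty" set around so that status never has to walk and stat the whole tree, which is essentially what Watchman provides:

    # Minimal sketch; requires the third-party "watchdog" package.
    # Tracks which paths changed so a status-like command could consult this
    # set instead of scanning every file in the working copy.
    import time

    from watchdog.events import FileSystemEventHandler
    from watchdog.observers import Observer

    class DirtyTracker(FileSystemEventHandler):
        def __init__(self):
            self.dirty = set()

        def on_any_event(self, event):
            if not event.is_directory:
                self.dirty.add(event.src_path)

    tracker = DirtyTracker()
    observer = Observer()
    observer.schedule(tracker, path=".", recursive=True)
    observer.start()
    try:
        while True:
            time.sleep(2)
            if tracker.dirty:
                print("changed since last poll:", sorted(tracker.dirty))
                tracker.dirty.clear()
    except KeyboardInterrupt:
        observer.stop()
    observer.join()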


On the flip side of editing, it makes locking binary files trivial, which is something Git and Mercurial fail at spectacularly (since they are intrinsically DVCSs).


I wrote a simple Emacs function to automagically call p4 edit when I began editing a Perforce managed file. It worked OK. Obviously not as convenient as straight up editing, but good enough.


You can just "p4 edit ..." over the sub-directories you are working with. It tends to solve the problem.


Having used Perforce for nearly five years, I can say it's an absolute nightmare for many reasons.


Having also used Perforce for many years, I can say it is quite literally the best thing ever.


Augh, I've been trying to like Perforce for a year and have hated it at every turn. I invariably wind up having ~3 workspaces open so as to map to different parts (or different views of the same) of the depot, constantly swapping back and forth between them whenever there's a problem that requires attention on some stream or another. Code reviews are awful; I need to balance tonnes of CLs (and their shelves) so as to save _my_ current work, grab the reviewed code, reset everything etc (or I could just make a new workspace for each review! hahahah).

There's also this ever-present feeling of dread whenever I invoke any p4 command -- if I absentmindedly submit something with unshelving the change that I had shelved after making a previous change but reverted when trying out a new change, I can wipe my old data and lose work. Or some issue comes up in a deployed environment, so I have to grab a new workspace (or dance through shelving/unshelving all my open CLs), have a 10 minute coffee break while it syncs, sync up to the release line and work on that one. Also my IDE doesn't like being switched rapidly between repositories all the time. Also it's difficult to just _commit_ some code after making a change and wanting to save progress. Every operation is just so stress inducing; how do you handle it :/


Perforce also runs into problems for sufficiently large repos, so it would have been a very temporary solution at best.



