Changes to Ragel in Response to the CloudFlare Incident

wyldfire · on March 6, 2017

For those of us who knew about the incident but not the nitty-gritty details -- "What is Ragel?"

> Ragel compiles executable finite state machines from regular languages. Ragel targets C, C++ and ASM. Ragel state machines can not only recognize byte sequences as regular expression machines do, but can also execute code at arbitrary points in the recognition of a regular language.

> What kind of task is Ragel good for?

> Writing robust protocol implementations.

tyingq · on March 6, 2017

Cloudflare's postmortem also goes into some detail on what they did wrong with Ragel: http://blog.cloudflare.com/incident-report-on-memory-leak-ca...

runevault · on March 6, 2017

Believe they originally just stated Ragel had problems and only fixed it to state that they miswrote their stuff after the creator reached out to them.

tedivm · on March 6, 2017

Cloudflare is really failing to inspire confidence here. Between them downplaying the issue, refusing to work with the Google engineer who found it, blaming the proxy owners (such as google) for not cleaning up their mess quickly enough, and attempting to dump the blame on groups like Ragel- I just have a hard time believing that Cloudflare will ever handle a security incident properly after this mess.

ethbro · on March 6, 2017

You mean by being transparent as opposed to some of the other serious security incidents out there?

Could you please clarify exactly how they "downplayed the issue"? The impact is an exercise in statistics. They seem to have compiled what logging / statistics they could, then accurately advised everyone of what they found.

They turned around an initial fix on the issue in hours, no?

As for web caches, what are they supposed to do, other than plead for their owners to delete them ASAP?

As for Ragel, not sure how the initial version of the response read, but the one I read clearly pointed out it was a Cloudflare issue. And given the semantics of the issue, I'd cut them some slack if they misidentified the cause the day after they'd been scrambling to fix this.

I'd be interested who you do trust to properly handle a security incident if you're looking for better than this. IBM? Microsoft? Apple? Just try and get a 24-hour fix out of them.

tedivm · on March 6, 2017

They downplayed the issue repeatedly. If you read the original Google report Cloudflare did not work with the team all that well. The Google team complained that Cloudflare was not working with them on the message and was downplaying the risk.

> https://bugs.chromium.org/p/project-zero/issues/detail?id=11...

According to the Google team the draft Cloudflare sent "contains an excellent postmortem, but severely downplays the risk to customers." There was also a ton of conversation here on this site about the fact that they were downplaying the severity.

As for web caches- look, it's great that they were proactively working to get things cleaned up. However, despite the fact that Google called in engineers who cancelled their weekend plans to deal with this, and despite the fact that Google is the one who did Cloudflare's job by finding the exploit, this is what Matt Prince had to say-

> I agree it's troubling that Google is taking so long. We were working with them to coordinate disclosure after their caches were cleared. While I am thankful to the Project Zero team for their informing us of the issue quickly, I'm troubled that they went ahead with disclosure before Google crawl team could complete the refresh of their own cache. We have continued to escalate this within Google to get the crawl team to prioritize the clearing of their caches as that is the highest priority remaining remediation step.

He essentially through the Google team under the bus. It was completely unprofessional.

Finally, the Ragel team themselves had to reach out to Cloudflare to get them to change the language on their announcement because it appeared to many people that the Cloudflare discussion was shifting the blame to Ragel. If you really need more information about this just read the OP this thread is part of.

tyingq · on March 6, 2017

>not sure how the initial version of the response read, but the one I read clearly pointed out it was a Cloudflare issue

Here's a diff of the original and what's there now.

https://gist.github.com/anonymous/e8f78f537551465e54a9781d62...

To me, it doesn't look like any attempt to change history. They did try to make it more clear that it was their issue, and not Ragel's. But, any initial appearance of that seems accidental.

jefftk · on March 6, 2017

At least with "blaming the proxy owners (such as google) for not cleaning up their mess quickly enough" I think tedivim is probably referring to: https://news.ycombinator.com/item?id=13721644

ethbro · on March 7, 2017

I saw that comment when it hit the original thread. I think commenters are conflating Project Zero with Google search cache in the usual version of this sentiment.

If I was Cloudflare, and someone had made me aware of this bug, I'd be griping as loudly as possible to get remaining caches cleared ASAP. It's all they really can do.

The fact that Bing had unlocated cached documents doesn't impact on Google search being quick or not to clear their own caches.

Did Cloudflare cause all this? Yes. Are they working in the interest of their customers to gripe at remaining cachers to refresh their caches? Also yes. Is there something else they could do now that we're here? I can't think of anything more effective.

jgrahamc · on March 7, 2017

Incorrect.

I originally wrote:

    To modify the page, we need to read and parse the HTML to find 
    elements that need changing. Since the very early days of Cloudflare, 
    we’ve used a parser written using Ragel. A single .rl file contains 
    an HTML parser used for all the on-the-fly HTML modifications that 
    Cloudflare performs.

    About a year ago we decided that the Ragel parser had become too 
    complex to maintain and we started to write a new parser, named cf-html, 
    to replace it. This streaming parser works correctly with HTML5 and 
    is much, much faster and easier to maintain.

    We first used this new parser for the Automatic HTTP Rewrites feature 
    and have been slowly migrating functionality that uses the old Ragel 
    parser to cf-html.

    Both cf-html and the old Ragel parser are implemented as NGINX modules 
    compiled into our NGINX builds. These NGINX filter modules parse buffers 
    (blocks of memory) containing HTML responses, make modifications as 
    necessary, and pass the buffers onto the next filter.

    It turned out that the underlying bug that caused the memory leak had 
    been present in our Ragel-based parser for many years but no memory was 
    leaked because of the way the internal NGINX buffers were used. 
    Introducing cf-html subtly changed the buffering which enabled the 
    leakage even though there were no problems in cf-html itself.

The Ragel author contact me on Twitter and I changed the text to (stars around changes):

    To modify the page, we need to read and parse the HTML to find 
    elements that need changing. Since the very early days of Cloudflare, 
    we’ve used a parser written using Ragel. A single .rl file contains 
    an HTML parser used for all the on-the-fly HTML modifications that 
    Cloudflare performs.

    About a year ago we decided that the *Ragel-based* parser had become too 
    complex to maintain and we started to write a new parser, named cf-html, 
    to replace it. This streaming parser works correctly with HTML5 and 
    is much, much faster and easier to maintain.

    We first used this new parser for the Automatic HTTP Rewrites feature 
    and have been slowly migrating functionality that uses the old Ragel 
    parser to cf-html.

    Both cf-html and the old Ragel parser are implemented as NGINX modules 
    compiled into our NGINX builds. These NGINX filter modules parse buffers    
    (blocks of memory) containing HTML responses, make modifications as 
    necessary, and pass the buffers onto the next filter.

    *For the avoidance of doubt: the bug is not in Ragel itself. It is in   
    Cloudflare's use of Ragel. This is our bug and not the fault of Ragel.*

    It turned out that the underlying bug that caused the memory leak had 
    been present in our Ragel-based parser for many years but no memory was 
    leaked because of the way the internal NGINX buffers were used. 
    Introducing cf-html subtly changed the buffering which enabled the 
    leakage even though there were no problems in cf-html itself.

https://twitter.com/jgrahamc/status/835056414062161920

devy · on March 6, 2017

Honest question: would Ragel be more robust if it were written in Rust and to generate Rust code?

JoachimSchipper · on March 6, 2017

Not really.

Safe code generating unsafe-code-with-bugs gets you a vulnerable program, so rewriting Ragel in Rust doesn't help.

As to generating Rust - Rust isn't magic, and Rust-that-is-as-fast-as-C can't do a lot more work than C. In particular, you don't want to do memory allocation in the fast path of a network server written for performance (note that Rust uses jemalloc, which is a C memory allocation library; Rust memory allocation isn't going to beat C memory allocation.)

As such, a realistic Rust program probably needs to re-use network buffers, at which point language-level security doesn't stop you from sending out "old" data that was still in the buffer. [EDIT: but see the reply https://news.ycombinator.com/item?id=13804831 below.] A debug build in Rust could allocate-per-request and detect the problem; but allocate-per-request in C is also likely to detect such overflows (indeed, the CloudFlare server did occasionally crash on this bug), and C programmers can also use e.g. VALGRIND_MAKE_MEM_UNDEFINED() for effort comparable to maintaining an alternate allocate-per-request implementation in Rust.

The difference for the fast path - where Rust can't do a lot more work, and where you can focus your effort - just isn't very large. (Using a safe language for e.g. low-performance control-plane interfaces can help, however - but at that point, you may well be arguing e.g. Rust vs. Lua.)

Manishearth · on March 6, 2017

Rust's version of Ragel actually generates 100% safe Rust code, AFAICT without any additional cost.

http://github.com/erickt/stateful

Scaevolus · on March 6, 2017

Bounds checking slices makes code around 1% slower. It's worth it, but not free.

Manishearth · on March 6, 2017

I mean, that was the fix in this situation too, anyway. "free" depends on what you're comparing it to.

JoachimSchipper · on March 6, 2017

Are you sure? CloudFlare's problem seems to have been running off the end of a logical buffer embedded in a linked list of "physical" buffers.

(Sorry, have to run - will try to take a better look later tonight.)

Manishearth · on March 6, 2017

In Rust you'd usually use a slice for this anyway, which has bounds checks. Yes, you could use indices instead, but that's not very Rusty (and more annoying to use!), whereas using pointer arithmetic is par for C.

Edit: Yes, you could ultimately get it wrong in Rust, too, I just think it's much less likely.

JoachimSchipper · on March 6, 2017

So you're thinking of an iterator of slices of char (arrays), right?

It still somewhat tricky to get right to me; the writer wants to have access to the underlying physical buffers, so you need a conversion step from list-of-physical-buffers without allocating, etc. But you're probably not wrong that a Rust programmer is more likely to get this right; thanks!

(And it's all likelihoods anyway; there's plenty of C code that doesn't get this wrong, and e.g. the VALGRIND_MAKE_MEM_UNDEFINED() I suggested above does reliably - albeit dynamically - find this problem in C.)

(Added a reference from my top comment to your reply.)

Manishearth · on March 6, 2017

You don't need to allocate, Rust slices are safe zero-copy views into an existing buffer.

Yes, you can run under valgrind, but dynamic checks ... :)

You need to have testcases that hits those code paths, and it's really hard to make "unexpected" testcases that test code paths like these. That's why fuzzing is awesome!

pjmlp · on March 6, 2017

Logical errors cannot be avoided regardless of the language, even theorem provers might have incorrect axioms.

However memory corruption issues, like in this case (out of bounds corruption), are easily avoidable in any memory safe language.

aidenn0 · on March 6, 2017

The language Ragel is written in is fairly inconsequential. Not because memory errors don't happen in compilers, but because compilers typically don't get malicious inputs.

Generating Rust would be useful, but Ragel already has an ML backend IIRC.

steveklabnik · on March 6, 2017

There used to be a Rust backend for ragel, but it was well before 1.0 and it bitrotted and then died, as far as I know.

erickt · on March 6, 2017

I wrote it [0]. I meant to get it merged upstream, but the author took the repository offline to help build his business, so my fork bitrotted. However, it looks like Rust support may have independently landed upstream [2]?

[0]: https://github.com/erickt/ [1]: http://www.colm.net/news/2014/10/24/ragel-now-maintained-by-... [2]: http://www.colm.net/news/2015/09/26/first-dev-version-of-rag...

pornel · on March 6, 2017

If it were exact 1:1 equivalent, generating Rust code operating on a raw buffer with no bounds checks, then no.

However, existing parsers in Rust take advantage of its macros and type system to provide more guarantees at compile time, so perhaps a different approach would not allow the logic error that lead to the bug.

astrodust · on March 6, 2017

Ragel itself is not the attack vector here, it's the code that Ragel generates. That's a by-product of the grammar and code you write with it.

It's like blaming JavaScript for a bug in a JavaScript interpreter. It's not the fault of the language specification, but the implementation.

Ragel's emitted code is partially the product of Ragel, which is working code, and the user-supplied code as part of the Ragel program, which is an unknown quantity.

Apparently the side-effect of one of Ragel's directives wasn't fully understood and this wasn't exposed in testing. It sounds like Cloudflare needs to fuzz a lot more aggressively to shake out issues like this before they hit production.

kibwen · on March 6, 2017

Ragel itself may not be at fault, but it is the case that if it were generating code in a memory-safe language rather than C (including Rust code sans the `unsafe` keyword), then improper use of Ragel wouldn't be capable of dumping random memory contents, which is what cloudbleed exhibited.

astrodust · on March 6, 2017

Ragel can emit lots of different kinds of code, Ruby included, so to blame the tool here is mistaken. It's blaming how the tool was configured and used.

Emitting Rust might not be a bad idea, but I bet this code-base pre-dates Rust being a viable target.

sitkack · on March 7, 2017

When the option arises, it is nice to run the test vectors against the diversity of generated output, C, Ruby, Etc. One of those backends could have very well alerted on the error.

astrodust · on March 7, 2017

Although Ragel can emit a variety of code, it's non-trivial to rework your own code that's a part of that from C to Ruby or Rust or whatever. You'd basically have to implement the same thing N times.

sitkack · on March 8, 2017

ANTLR has the same problem. Where any non-trivial grammar will require embedded code, forking it away from ever being supported by more than one backend.

It might be nice to have Lua or a Scheme embedded in the generator platform so that, one can truly target multiple backends.