Hacker News

Just anecdotally, I've had the shittiest experience in recent memory with CoreOS. Their choice of BTRFS bites me pretty much daily whenever logs start writing stack traces for errors. A similar problem happened when Deis chose CephFS for their registry container.

While I'm a huge fan of Docker I will maintain that CoreOS is not ready for production in most capacities and recommend against using it in its current form.



Can you elaborate on "whenever logs start writing stack traces for errors."? What caused the problem? And what exactly was the problem with Deis & CephFS? I was considering using those some time ago.


Sure. Whenever log messages get fairly large, like when they start including tracebacks, and you have lots of errors happening back to back (say, high traffic causing a deadlock exception at the DB layer, which was our actual problem), the high write volume causes BTRFS to lock up. Usually this happens because of some kind of kernel-level error; our ops guy is familiar with the exact details of the exception BTRFS throws. Suffice it to say, when that happens, it's not immediately obvious the instance is unavailable. If you have plenty of errors happening in a short time span, this behavior will start to roll across your cluster, and depending on whether all of the hosts are running the offending container, your entire fleet of hosts can lock up. This sucks bad. The only real solution is to bounce the host machines, though they seem to come back just fine.
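Since an affected host isn't obviously unavailable, something that scans the kernel log (e.g. `dmesg` output) for btrfs error signatures can at least tell you which hosts to bounce. A minimal sketch; the function name and the regex are mine, not from any tool:

```python
import re

# Hypothetical signature: btrfs kernel errors usually show up in dmesg
# as lines like "BTRFS error (device sda1): ..." or "BTRFS warning ...".
BTRFS_ERROR = re.compile(r"btrfs.*(error|warn)", re.IGNORECASE)

def count_btrfs_errors(kernel_log: str) -> int:
    """Return the number of kernel-log lines matching a btrfs error signature."""
    return sum(1 for line in kernel_log.splitlines() if BTRFS_ERROR.search(line))
```

You could feed this the output of `dmesg` from each host on a timer and alert when the count goes nonzero, since the host otherwise looks healthy from the outside.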

CephFS had a similar-ish problem: if you ever ended up with fewer than 3 nodes (almost always because of the above), it would have real trouble self-healing, get confused, and re-elect bad nodes to quorum leadership, possibly clobbering all your registry data. We contributed a patch back to Deis to use S3 as the registry persistence layer because, well, having a volatile registry sucked. We were also about to build a control layer for quorum services to live on, separate from the application services we were developing, with a proxy in the quorum layer to talk to things like etcd. I would highly recommend this approach if you end up with any services that require a quorum.
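For the etcd part of that split, CoreOS's cloud-config supports running etcd in proxy mode, so application hosts forward client traffic to a dedicated quorum tier instead of voting themselves. A sketch of what the application-tier config might look like (the hostnames and IPs are made up for illustration):

```yaml
#cloud-config
coreos:
  etcd2:
    # Run as a proxy only: never join the quorum or vote in elections.
    proxy: on
    listen-client-urls: http://localhost:2379
    # The dedicated quorum tier; three members kept separate from app hosts.
    initial-cluster: quorum0=http://10.0.0.10:2380,quorum1=http://10.0.0.11:2380,quorum2=http://10.0.0.12:2380
```

That way a misbehaving application host can lock up without taking a quorum member down with it.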


Thanks for the reply.

I would prefer to avoid that many disk operations in the first place. Not sure if it applies to your problem, but have you thought about using something like Sentry (https://getsentry.com/ || https://github.com/getsentry/sentry)? Maybe there is some other tool that could help you here; I'm often impressed by the Open Source community and the vast number of different packages for (almost ;)) every technological problem.
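Even without an external service, a burst of identical tracebacks doesn't have to turn into a burst of identical disk writes. A minimal sketch of collapsing consecutive duplicate messages before they reach the log sink; the class and method names are hypothetical, not from any library:

```python
import hashlib

class DedupLogWriter:
    """Collapse bursts of identical messages (e.g. the same traceback
    repeated under load) into one line plus a repeat count, so an error
    storm produces far fewer writes. Illustrative sketch only."""

    def __init__(self, sink):
        self.sink = sink          # any object with a write(str) method
        self.last_digest = None   # digest of the last message written
        self.repeat_count = 0     # duplicates swallowed since then

    def write(self, message: str) -> None:
        digest = hashlib.sha1(message.encode()).hexdigest()
        if digest == self.last_digest:
            self.repeat_count += 1
            return  # swallow the duplicate instead of writing it
        if self.repeat_count:
            self.sink.write(
                f"... last message repeated {self.repeat_count} more times\n"
            )
        self.last_digest = digest
        self.repeat_count = 0
        self.sink.write(message)
```

Wrapping a file handle (or `sys.stderr`) in something like this would have turned our back-to-back deadlock tracebacks into one traceback and a counter.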


FYI, they switched from btrfs a while back. I think you need to reinstall with a newer version to get it, though; it won't change on upgrade.



