We're trialling BigQuery at the moment, pushed about 130TB into it last week (about ~300 billion rows) and have been blown away by the performance of it.
It's a bit of a shame they never released a paper on how it works, well, nothing since Dremel anyway. And Collosus isn't public either.
I can already see this in built ML stuff being useful for trialling models, especially as it's built right into the SQL.
Did you all work together with the Firebase team on the recent schema change? We're scrambling to mitigate the ~2x cost increase that caused for us, since it blew our analytics budget out of the water.
It was a change that Firebase implemented that went into effect about a month ago. Check this out and the look at the "Old Export Schema" at the bottom. The export schema went from multiple events per record to one event per record, with the User Parameters repeated on each record. That bloated the exports from Firebase and increased costs. https://support.google.com/firebase/answer/7029846?hl=en
Has anything changed on BigQuery since then that would warrant rerunning those benchmarks? If Amazon’s largest cluster outperforms BigQuery then is your decision about which service to use just a cost calculation of dedicated per month for Redshift vs. projected units scanned per month on BigQuery?
I've worked with both, and BigQuery is so much easier to work with. Redshift does give you more control (to a limited extent) but it takes a lot more work and specialized knowledge to perform simple tasks well.
In the article you linked, they gloss over the part that takes by far the longest when working with redshift: setting up "compression, distribution keys on large tables, and sort keys on commonly filtered columns". With BigQuery, you don't really worry about that crap, or about reloading your terabytes of data if you made a mistake somewhere in the schema. You also don't worry about vacuuming, or running out of space, or taking down the database from excessive CPU usage. Do you want to have a team of people at your company whose job is just to keep Redshift running smoothly? Or do you want another team of analysts or engineers?
I could see some companies being forced to use Redshift, especially if they're using S3 a lot, and I could see people saving money in some specialized use cases, but for most people BigQuery is almost certainly going to be cheaper and faster in almost every way.
> I could see people saving money in some specialized use cases, but for most people BigQuery is almost certainly going to be cheaper and faster in almost every way
That’s exactly what the TPC benchmarks are designed to show: how different appliances perform under the same diverse set of generic workloads. As of 2 years ago, they show that Redshift is faster.
> With BigQuery, you don't really worry about that crap
The benefit this depends on your organization’s level of expertise. If you grok sorting and distribution then you can leverage those to increase performance—but it’s not a prerequisite.
For example, in Redshift, when you bulk load data into a table, if the user didn’t specify a compression scheme in the table definition, Redshift will analyze the data and find a scheme that works best, and automatically apply it to the table for you. BigQuery almost certainly does something similar. The difference is, with BigQuery, you’re not invited to take part in that discussion. And you’re charged as if the data is uncompressed. Psychologically, this is a huge relief if you don’t (want to) know how compression works but rest assured you’re paying for it somehow.
To draw a tired analogy, vehicles with automatic transmission still have to shift gears. If you’re driving to the grocery store, not having to worry about that is a win. But if you’re racing stock cars, you’re definitely going to want a stick.
Compute resources tend to be significantly more expensive than storage. Our approach is two-fold:
- We finance compute resources required for you to ingest data into BigQuery. With Redshift, you pay for ingest directly via compute cluster consumption (again, more expensive than storage). This also increases your complexity due to on-cluster contention of resources between ingest and query.
- Separation of storage and compute gives you lots of options. With BigQuery ,you don't need to attach relatively expensive compute just for the luxury of getting more storage. Spectrum helps somewhat, but ultimately with Redshift you don't even get to pay for storage - you pay for compute/storage combos.
- BigQuery's Long-Term Storage is not an archival storage tier - it's a discount on storage, with identical performance and durability characteristics. At only $0.01 per GB per month.
This is likely a result of origins of the two technologies. BigQuery is Dremel, written and operated by Google since 2006. Redshift purchased source code to an on-premise fork of Postgres.
Or listen to Nick Caldwell, Reddit VP Engineering, moving away from AWS to BigQuery:
- "2017, which effectively brought us to the present system, we began forking all of our event data into BigQuery, after considering a lot of different alternatives" https://youtu.be/tKISLQ87GO8?t=426
Great question. At the time we had a hard time replicating results of that blog post.
Here's some food for thought.
- We recently announced that we've improved our TPC-DS 10T geomean by 5x over past 18 months [0]. The nice thing is that users never had to set maintenance windows or "upgrade their clusters" to get these performance improvements.
- Here's a session by our eng lead and CTO of Looker discussing various performance improvements, including querying a PB of data in ~5 seconds and paying a fraction of a penny for it.[1]
- Benchmarks are important, but they miss three key scenarios.
------ Effort - how much complexity is there to achieve maximum performance. As someone else stated in this thread, we feel that we differentiate here.
------ Maintenance - when data changes, new data comes in, you run DML, what happens to this pristine benchmark of yours? What is required to maintain this performance? Again, we feel that our offering is compelling (we don't ever have messy storage state that requires a vacuum, our compute is entirely stateless so re-sorting/re-distributing data is not needed, and BigQuery ingest never affects query capacity, even at PB/day ingest scale).
------ Real-world usage - a serially-executed set of queries poorly represents what happens in reality - high volatility, high concurrency workloads. Again, BigQuery has stateless compute, separation of compute and storage, and separation of compute and state.
- customers - in addition to what's been stated already, PTAL at Kabam [2] & NYT/Yahoo/BlueApron [3] for their motivations for migrating from Redshift to BigQuery
That said, lots of folks really like running on Redshift, and they've extended their platform with Spectrum and other bits. I'd invite you to find out what works best for you. Competition is good for the end user!
If you haven't read it already, I strongly suggest reading the original Dremel paper [0]. It's no doubt somewhat out of date, but I believe BQ is based in Dremel.
tl;dr for underlying storage model: distributed column store which pushes computation down a tree to leaf nodes to parallelize disk I/O. Parent nodes aggregate computations before returning to the client.
We (Sojern) have been using BigQuery since late 2015, we moved from Hive. It has been an amazing experience in terms of cost savings (managed infra, operational cost of a hadoop cluster), and performance. We use it across the org from analytics, ingestion, reporting, ML, and many more uses.
There is information about Colossus: the original GFS paper, an interview with Sean Quinlan with McKusick on Colossus improvements, as well as a talk by Denis Serenyi on more of the gory details, e.g. rebalancing.
There was also a public talk on D, the GFS chunkserver replacement, but there don't seem to be recordings or slides from that.
For this trial we are transferring some data from AWS, which is already partitioned in S3 by quite a few dimensions, including DAY.
As BQ only supports DAY partitioning, I had to create multiple tables that represent the other dimensions, but it seems to work OK, if a little clunky when you have to use wildcards/_TABLE_SUFFIX
And with the just announced clustering now you get even better costs and performance.
Disclosure: I'm https://twitter.com/felipehoffa and I work for Google Cloud. And I'm really excited to reprocess all public tables into clustered ones.
Cost (or rather bytes scanned) is sent back from the API on any queries or dry run/validation calls. There's a common chrome extension to put it in the legacy ui, and I believe the beta ui has it baked in (as part of validation)
It's a bit of a shame they never released a paper on how it works, well, nothing since Dremel anyway. And Collosus isn't public either.
I can already see this in built ML stuff being useful for trialling models, especially as it's built right into the SQL.