I wanted to note that ClickHouse Cloud results are now also being reported in the public ClickBench results:
https://benchmark.clickhouse.com/
Good to see transparent comparisons now available for Cloud performance vs. self-hosted or bare-metal results, as well as results from our peers. The ClickHouse team will continue to optimize further - scale and performance are a relentless pursuit here at ClickHouse, and something we expect to pursue transparently and reproducibly. Public benchmarking benefits all of us in the tech industry as we learn from each other and share the best techniques for attaining high performance within a cloud architecture.
Full disclosure: I work for ClickHouse, though I have also been a member of SPEC in the past, developing and advocating for public, standardized benchmarks.
To help understand the results of the benchmark, I find it helpful to look at how the benchmark is constructed, and what it tests for. From the README:
"The dataset is represented by one flat table. This is not representative of classical data warehouses, which use a normalized star or snowflake data model. The systems for classical data warehouses may get an unfair disadvantage on this benchmark."
Taking a look at the queries [0], it looks like it mostly consists of full table scans with filters, aggregations, and sorts. Since it's a single table, there are no joins.
Can you clarify what a "write unit" is? Naively it sounds like it might be blocks x partitions x replicas that actually hit disk. (Which is also probably not very clear to people not already using CH, but I have at least middling knowledge of CH's i/o patterns and I have no clue what a "write unit" is from the page's description.)
More helpful would be answers to my questions at https://news.ycombinator.com/item?id=33081502 - async_insert is a relatively new feature, we're still using buffer tables for example - but also most of our "client" inserts are actually onto multi-MV-attached null engines. Those MVs are also often doing some pre-aggregation before hitting our MTs as well. So we might insert a million rows, but the MV aggregates that down into 50k, but then that gets inserted into five persistent tables, each of which has its own sharding/partitioning so that blows up to 200k or something "rows" again. (And at some point those inserts are also going to get compacted into stuff inserted previously / concurrently by the MT itself.)
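To make the shape of that pipeline concrete, here's a stripped-down sketch (invented names, a single MV and a single target instead of our five):

  -- Clients INSERT into a Null-engine table; nothing is stored here.
  CREATE TABLE events_in (ts DateTime, user_id UInt64, value Float64)
  ENGINE = Null;

  -- Pre-aggregated destination table.
  CREATE TABLE events_agg
  (
      hour DateTime,
      user_id UInt64,
      value_sum SimpleAggregateFunction(sum, Float64)
  )
  ENGINE = AggregatingMergeTree
  PARTITION BY toYYYYMM(hour)
  ORDER BY (hour, user_id);

  -- The MV pre-aggregates each inserted block before it reaches the MergeTree.
  CREATE MATERIALIZED VIEW events_mv TO events_agg AS
  SELECT toStartOfHour(ts) AS hour, user_id, sum(value) AS value_sum
  FROM events_in
  GROUP BY hour, user_id;

One client INSERT into events_in thus becomes whatever each MV target actually writes, so "rows inserted" on the client side tells me very little about WUs.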
As I've said several times in this thread, I understand why you don't count inserts or rows. What I don't understand is what unit a WU does actually correspond to. In particular I don't understand its relation to e.g. parts or blocks, which are the units one would focus on optimizing self-hosted offerings.
I think the optimizations you focus on for self-hosted ClickHouse are the same as for Cloud. In a self-hosted deployment they improve your throughput/capacity with fixed allocated resources; in Cloud they directly affect cost.
For those complex pipelines you may find it more useful to run tests during the trial. Data distribution, partitioning and so on can change the actual cost significantly, so estimates can be too pessimistic or too optimistic.
> For those complex pipelines you may find it more useful to run tests during the trial.
Right, that's exactly what I don't want to deal with. Unless I have even just a ballpark estimate of complex pipelines both before I commit to any sales crap and afterwards when we're designing new pipelines, it's just not an option for us at all. I have no clue if it's going to cost us $10, $100, or $10000.
It doesn't mention anything about what a write unit is, except to say you can reduce write units by batching inserts (that part I guessed already.)
There's no way to think about what an actual write unit means. You could measure the costs on a sample workload, but that's far from ideal. Some transparency here would be nice.
I understand the answer is complicated, based on hairy implementation details, and subject to change. Give me the complexity and let me interpret it according to my needs.
Right, that link covers read units which is also what I expected - essentially the number of files I have to touch - but I still have no clue about write units.
Is one block on one non-partitioned non-distributed table one write unit? What about one insert that's two blocks on such a table? What about one block on a null engine with two MVs listening to insert into two non-partitioned non-distributed tables? What if the table is a replacing mergetree, do I incur WUs for compactions? etc.
My worry is that it is essentially 1 WU = 1 new part file, which I understand makes sense to bill on but is tremendously opaque for users - at least I have no clue how often we roll new part files; instead I'm focused on total network and disk I/O performance on one side and client query latency on the other.
With analytical column-store DBs the standard is to do massive batch writes of thousands to millions of records at a time, vs. inserting individual records. Inserting individual records is basically always crazy inefficient with column stores. So a single write is generally for thousands to millions of records.
Buddy, if you look just a couple posts up you'll see me comment on how ClickHouse's actual disk format works. You don't need to explain batching to me.
Nonetheless you can't insert a-whole-file-and-just-that-file in less than one write.
Where does it say that? The pricing page says on "Writes" in the info tooltip: "Each write operation (INSERT, DELETE, etc) consumes write units depending on the number of rows, columns, and partitions it writes to."
This doesn't imply to me that each individual INSERT costs 1 WU, but that it could be fractional. I guess it depends on how you read it?
Is lower time the right metric here? Normalizing by price seems like it would make a more useful metric for big data, as long as the response time is reasonable.
Yes, ClickBench results are presented as Relative Time, where lower is better. You can read more on the specifics of ClickBench methodology in the GitHub repository here: https://github.com/ClickHouse/ClickBench/
There are other responses from ClickHouse in the comments on the pricing, so I'll defer to their expertise on that topic. Thank you for your feedback and ideas - normalizing a price-based benchmark is an interesting concept (and one where ClickHouse would also expect to lead, given the architecture and efficiency).
Had a pricing question. Say we connected a log forwarder like Vector to send data to Clickhouse Cloud, once per second. If each write unit is $0.0125, and we execute 86,400 writes over the course of the day, would we end up spending $1080? Do you only really support large batch, low frequency writes?
Hi Cliff - A write unit does not correspond to a single INSERT. A single INSERT with less than 16MB of data generates ~0.01 “write units”, so a single “write unit” typically corresponds to ~100 INSERTs. In your example, that would come closer to $11 a day. Depending on how small the batches you plan to write in that example are, there may be ways to reduce that spend further, by batching even more or turning on "async_insert" inside ClickHouse.
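For reference, enabling async_insert is just a setting; a minimal sketch (the table name here is only an example, and the flush thresholds are tunable):

  -- Server-side buffering: many small INSERTs get flushed as one larger block.
  SET async_insert = 1, wait_for_async_insert = 1;
  INSERT INTO events VALUES (now(), 'login', 1);
  -- Flush behavior can be tuned with async_insert_busy_timeout_ms
  -- and async_insert_max_data_size.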
> and advocating for public, standardized benchmarks
For full transparency, I think you should do the same in ClickHouse. Or is there a strong reason not to run benchmarks on standard analytical workloads like TPC-H, TPC-DS or SSB?
You can't post results of TPC benchmarks without an official audit, which complicates publishing results. And you won't find the names that are usually compared with ClickHouse there [1]. So the open, standardized ClickBench tries to encourage benchmarking for everyone.
There are numerous benchmarks that use TPC-like queries, but those are not standardized and can be misleading. For example, a lot of work was done by Fivetran to produce this report [0], but they show only the overall geomean for those systems, so you can't understand how they actually differ. And their queries are not the original TPC ones: variables are fixed in the queries, and they run only the first query where the official query is a multi-query.
Contributors from Altinity ran SSB with flattened and original schemas [2]. SSB is not well standardized, and we see a lot of pairwise comparisons with controversial results - generally you can't just reproduce them and get all the results in a single place for the same hardware.
The TPC license states the following. Modified versions are not standardized, so you can't just compare systems; otherwise there would already be another standardized benchmark in the list you propose to run and publish.
>c. Public Disclosure: You may not publicly disclose any performance
results produced while using the Software except in the following
circumstances:
(1) as part of a TPC Benchmark Result. For purposes of this Agreement, a
"TPC Benchmark Result" is a performance test submitted to the TPC,
documented by a Full Disclosure Report and Executive Summary, claiming
to meet the requirements of an official TPC Benchmark Standard. You
agree that TPC Benchmark Results may only be published in accordance
with the TPC Policies, viewable at http://www.tpc.org
(2) as part of an academic or research effort that does not imply or
state a marketing position
(3) any other use of the Software, provided that any performance results
must be clearly identified as not being comparable to TPC Benchmark
Results unless specifically authorized by TPC.
I see, thanks for the context, it seems like a PITA.
But given that each database system has its own flavor of SQL, vanilla TPC benchmarks may not work out of the box, so one needs to tweak them a bit - and that tweaking might be what actually disqualifies the published results from any of the clauses above being applicable.
I can also imagine that the combination of clauses (2) and (3) is what some of those who publish results are taking advantage of.
It is related to the "max_threads" setting of ClickHouse; by default, it is the number of physical CPU cores, which is half the number of vCPUs.
For example, the c6a.4xlarge instance type in AWS has 16 vCPUs and 8 physical cores, so "max_threads" in ClickHouse will be 8.
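You can check or override it per session, for example:

  SELECT name, value FROM system.settings WHERE name = 'max_threads';
  SET max_threads = 16;  -- override for the current session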
Interesting set of results. Ignoring ClickHouse, StarRocks seems to be better in almost all metrics.
I was curious to compare MonetDB, DuckDB, ClickHouse-Local, Elasticsearch, DataFusion, QuestDB, Timescale, and Athena. Amazingly, MonetDB shows up better than DuckDB in all metrics (except storage size), and Athena holds its own and fares admirably well, esp given that it is stateless. While, Timescale and Quest did not come up as good as I hoped they would.
The particular way in which the data is loaded into DuckDB and the particular machine configuration on which it is run triggers a problem in DuckDB related to memory management. Essentially the standard Linux memory allocator does not like our allocation pattern when doing this load, which causes the system to run out-of-memory despite freeing more memory than we allocate. More info is provided here [1].
As it is right now the benchmark is not particularly representative of DuckDB's performance. Check back in a few months :)
Thanks. Btw, we use DuckDB (via Node/Deno) for analytics (on Parquet/JSON), and so I must point out that despite the dizzying variation among various language bindings (cpp and python seem more complete), the pace of progress, given the team size, is god-like. It has been super rewarding to follow the project. Also, thanks for permissively licensing it (unlike most other source-available databases).
Goes without saying, if there are cost advantages to be had due to DuckDB's unique strengths, then serverless DuckDB Cloud couldn't come here soon enough.
We are allocating and freeing buffers repeatedly. Despite freeing more buffers than we allocate, memory usage might still increase because of internal fragmentation in the allocator. Essentially, fragmentation can create "unused" space that still takes up memory. This phenomenon is called heap fragmentation [1].
Technically, I hope you understand that this isn't possible but maybe I am misinterpreting what you're trying to say.
auto buff = malloc(N);
free(buff);
free(buff);
is one way to free "more" buffers than were allocated, but this is undefined behavior, and depending on the underlying system allocator implementation it may or may not crash.
However, given how silly this would be I believe this is not what you're trying to convey?
Here's what mytherin wrote: "...we are allocating and freeing buffers repeatedly. Despite freeing more buffers than we allocate..."
So, I assume, the context is, DuckDB allocates x buffers, frees x - m buffers at some point later, then allocates n buffers where n <<<< m, and yet malloc fails.
In the GitHub thread mytherin linked to above, Alexey Milovidov, ClickHouse CTO, points out that ClickHouse uses jemalloc, which is a better choice than glibc malloc given the issue with fragmentation. It is likely that DuckDB will switch to jemalloc, too.
The scenario I am describing is roughly the following:
Suppose we allocate 100K buffers that all have an equal size, and our memory usage is now 10GB. After that point we free 20K buffers, but allocate 10K more. In other words, from that point on we are freeing more buffers than we are allocating.
Now, since we are freeing more than we are allocating, you would expect our memory usage to go down. However, when using the standard glibc malloc on Linux, our memory usage unexpectedly goes up. After this happens several times in a row, the system runs out of memory and new calls to malloc fail.
Happy ClickHouse user here. This is one amazing piece of software, to be honest; for anyone who has ever wanted to parse, analyse and query billions of time series entries, ClickHouse is the way to go.
The cloud offering seems like an amazing product for the companies that can afford it.
I am not sure if the billing is right, but for 5M inserts per month the total bill would be $62K.
Checking the pricing, $0.0125 per write unit. It says each write "generates one or more write units". So ... $1.25 per 100 writes? That can't be right. I wondered if it meant writes per second (like how AWS DynamoDB or Azure CosmosDB work, with their unit-based billing).
With an analytical database like ClickHouse, you can write many rows with a single INSERT statement (thousands of rows, millions of rows, and more). In fact, this kind of batching is recommended. Larger inserts will consume more write units than smaller inserts. Check out our billing FAQ for some examples, and we will be enhancing it with more detail as questions from our users come in (we'll work on clarifying this specific point): https://clickhouse.com/docs/en/manage/billing/ We also provide a $300 credit free trial to try out your workload and see how it translates to some of the usage dimensions. Finally, this is a Beta product, so keep the feedback coming!
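To illustrate the kind of batching we mean (hypothetical table; in practice you would send thousands or millions of rows per statement):

  INSERT INTO page_views (ts, url, user_id) VALUES
      (now(), '/home',    101),
      (now(), '/pricing', 102),
      (now(), '/docs',    103);
  -- Or stream a whole file as one insert from the shell:
  -- clickhouse-client --query "INSERT INTO page_views FORMAT CSV" < views.csv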
Thanks, I agree batching inserts would indeed be a good idea and it makes sense that's recommended. However that link you mention (as of now) does not specify what a write unit is. So if that could be clarified, that would be great. Since from your reply it sounds like one INSERT would indeed (at a minimum) incur one write unit. And thus 100 writes could indeed cost $1.25. Which could get expensive, fast.
An INSERT can consume less than one write unit, depending on how many rows and bytes it writes to how many partitions, columns, materialized views, data types, etc. So a "write unit", which corresponds to hundreds of low-level write operations, typically translates to many batched INSERTs. We are working to improve our examples in the FAQ to clarify - thank you so much for asking the question!
You know what would be really helpful? Some tooling around measuring "write units", like a fake local ClickHouseCloud API you could submit a write to and see how many "write units" that particular write would take. Of course, that would make pricing transparent and easy and SaaS companies don't seem like that. I challenge you folks to actually figure out a way to prevent surprise bills, until then all this "write unit" stuff is bullshit.
How about if you are streaming in from Kafka and inserting each event as it arrives? Clickhouse is ideal for rapid analytics over event data, so having to introduce batching would be disappointing.
Batch upload is of course more cost effective, but I would expect that to be more typical in a traditional data warehouse where we are uploading hourly or daily extracts. Clickhouse would likely show up in more real time and streaming architectures where events arrive sporadically.
I am a huge fan and advocate of Clickhouse, but the concept of a write unit is strange and the likely charges if you have to insert small batches would make this unviable to use. A crypto analytics platform I built on top of Clickhouse would cost in the $millions per month vs $X00 or low $X000 with competitors or self hosting.
1) I had no idea what Clickhouse was for the first 30 seconds looking at the homepage. I now understand it to be a database of some sort. I shouldn't have seen the words "performance" and "cloud" and "serverless" before seeing the word database, right? I'm starting off confused. There shouldn't be an assumption that I know what you all do.
2) I have no idea what a column oriented database is. I've been a developer for 29 years (mostly frontend but I do a lot of full stack too). If I need an explainer, a lot of devs will.
Aside from that, it looks like a nice offering and I wish you all the best!
We all have our specialties, and that is fine. It is a common pattern that a developer gets comfortable with a particular tech stack, and then uses it for many years without seeing the need for much else. Some developers use Ruby on Rails plus Postgres for everything, others use C# and .NET and SQL Server. It's fine, if that's all you need.
Still, this is the year 2022. Cassandra, to take one example, was released in 2008. For everyone who has needed these fast-read databases, they've been much discussed for 14 years, including here on Hacker News, and on every other tech forum. At this point I think a company can simply assume that most developers will have some idea what a column database is.
Cassandra is not a columnar database; columnar in this sense is about the storage layout. Values for a column are laid out next to each other in storage, allowing for optimizations in compression and computation, at the expense of reconstructing entire rows. Postgres is a row store, meaning all the columns for a row are stored next to each other in storage, which makes sense if you need all (or the vast majority) of the values for a row.
I know what a column oriented DB is, but ClickHouse was not on my radar before.
The pitch on the landing page is that ClickHouse Cloud is great if you love ClickHouse. If you don't know what ClickHouse is, you have to do some work to find out.
No. Column-oriented storage DBs make it fast to read tens to billions of “rows”, usually when the data you want to read can be independent of other columns. Ex: (stock close) - row storage isn't going to buy you much here; you'd rather read the whole “stock close” column as fast as possible.
Whereas in traditional DBs, for data like [first name, last name], the columns may have much less meaning on their own and you need both columns for the data to make sense.
A traditional DB with B-tree storage is much slower for that type of usage. Storing (stock close) in a single-column format makes that type of query much faster.
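As a sketch (made-up table), a query like this only has to read the "symbol" and "close" columns from disk, no matter how wide the table is:

  SELECT avg(close)
  FROM stock_prices
  WHERE symbol = 'AAPL';
  -- A row store would have to fetch entire rows to answer the same query.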
In places like Russia, those words are very dangerous, and could get you jailed or sent to the front. Even calling it a "war" was/is a punishable offense.
So, yes, words, but the potential consequences of these words have more significance than empty air.
They support Ukraine; however, given the company spun out of Yandex, the latter is most certainly benefitting financially from their success and is paying taxes that fund the war. AFAIR, Yandex also has two director seats on the Clickhouse board, although that could have changed.
> ClickHouse, Inc. is a Delaware company with headquarters in the San Francisco Bay Area. We have no operations in Russia, no Russian investors, and no Russian members of our Board of Directors. We do, however, have an incredibly talented team of Russian software engineers located in Amsterdam, and we could not be more proud to call them colleagues.
That's interesting, thanks for clarifying. Yandex does show up as an investor in Crunchbase, including in their most recent Series B. The cited blog post says:
> The formation of our company was made possible by an agreement under which Yandex NV, a Dutch company with operations in Russia
While Yandex NV is registered in the Netherlands, it's pretty clear that Yandex NV is directly related to Yandex (specifically, according to Wikipedia, it's a holding company of Yandex). For those who don't know, Yandex is basically the Google of Russia and holds 42% of market share among search engines in Russia.
The blog post does not seem to make any claim that Russia is not benefitting financially from the commercial success of Clickhouse, and given the above, such a claim would be unlikely to be true. As such, I still think it's pretty much safe to assume that a portion of any $$$ paid to Clickhouse ultimately goes to fund the war and kill people.
That said, I sincerely hope they could find a way to stop that flow of money from happening somehow, as otherwise it's a nice technology and a great technical team behind it...
I have (an extremely boring, but quite hands-on) video about various ways of ClickHouse optimizations on top of external storage: https://www.youtube.com/watch?v=rK2BsaaaOCA (starting from around 40:00).
I am from SelectDB. SelectDB is a cloud-native data warehouse developed by the founding team of Apache Doris. Recently we submitted test results to ClickBench and achieved a good ranking. Later, we will incorporate all of these performance optimizations into Apache Doris and submit results for Apache Doris as well.
Whether it is SelectDB or Doris, we not only have outstanding performance on large wide tables, but also perform well in scenarios such as TPC-H where there are many joins.
https://en.selectdb.com/blog/SelectDB%20Topped%20ClickBench%...
Great stuff! I know the ClickHouse team and they are world class. If you're more familiar with a relational database some things will feel weird. For example, you should not insert 1 row at a time, there's over 1,000 built-in functions, and the default connection is often over HTTPS, but not 443.
I’m a little sad to see them embrace magic “insert unit” pricing, instead of taking the approach Altinity uses where you are renting an instance size that you can compare apples-to-apples with running your own cluster on ec2.
Honestly that's something they can probably offer separately, but I really prefer this pricing for most use cases where I want a database that is always available but has bursty request/response patterns. This means I can have an analytical database available for all my small services, websites, etc without having to think too much about availability, support, and a constant price overhead. But ClickHouse is so fast you can get pretty far with a $10 VPS, I admit.
Probably the best comparison is CockroachDB Cloud. They have a "serverless" offering based on unit pricing and a dedicated offering based on provisioned servers + support/maintenance overhead. I think that would be the ideal place to go long-term, but I'm super excited for this current one. I love ClickHouse and want to support them.
ClickHouse is also an interesting case because there's lots of options to migrate clusters, use S3 as long-term storage, etc to where I don't particularly feel locked into this offering if I ever wanted to shift into my own.
> I’m a little sad to see them embrace magic “insert unit” pricing, instead of taking the approach Altinity uses where you are renting an instance size that you can compare apples-to-apples with running your own cluster on ec2.
Thanks for this comment. We'll be publishing a blog at Altinity to compare both models. My view is that they both have their place. The BigQuery pricing model is great for testing or systems that don't have constant load. The Altinity model is good for SaaS systems that run constantly and need the ability to bound costs to ensure margins.
Having a selection of vendors that offer different economics seems better for users than everyone competing for margin on the same operational model.
Serverless "pay for usage" is different than the fixed-size dedicated cluster pricing model, and can be quite a bit cheaper, especially with spiky query traffic. It should also be more reliable, since you don't have to predict and provision capacity for your peak usage ahead of time.
Outside of being open source, how does ClickHouse differ from Snowflake/BigQuery? In what scenarios would I choose ClickHouse over those existing solutions?
ClickHouse supports ad-hoc analytics, real-time reporting, and time series workloads at the same time. It is perfectly suited for user-facing analytics services. It supports low-latency (<100ms) queries for real-time analytics as well as high query throughput (500 QPS and more) - all of this with real-time data ingestion of logs, events, and time series.
Take some notable examples from the list: https://clickhouse.com/docs/en/about-us/adopters/, something around web analytics, APM, ad networks, telecom data... ClickHouse is perfectly suited for these use cases. But if you try to align these scenarios with, say, BigQuery, they will become almost impossible or prohibitively expensive or just slow.
There are specialized systems for real-time analytics like Druid and Pinot, but ClickHouse does it better:
https://benchmark.clickhouse.com/
And ClickHouse looks like a normal relational database - there is no need for multiple components for different tiers (like in Druid), no need for manual partitioning into "daily", "hourly" tables (like you do in Spark and Bigquery), no need for lambda architecture... It's refreshing how something can be both simple and fast.
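For example, instead of maintaining daily tables by hand, you declare the partitioning once on a single table (schema invented for illustration):

  CREATE TABLE events
  (
      event_time DateTime,
      user_id    UInt64,
      event_type LowCardinality(String)
  )
  ENGINE = MergeTree
  PARTITION BY toYYYYMMDD(event_time)  -- ClickHouse manages the "daily" partitions
  ORDER BY (event_type, event_time);

  -- No events_20221004, events_20221005, ... tables; just filter:
  SELECT count() FROM events WHERE event_time >= now() - INTERVAL 1 DAY;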
We are using Clickhouse Cloud at our company and the speed at which it serves up data is mind-blowing to our team, which had been using older systems for almost 10 years. Congrats on the public beta, and we can't wait to see what comes next!
I'm pretty interested in running Clickhouse, but to poll the room: is it easy to run without an infra person maintaining it? We don't need any kind of persistence guarantee or backups, happy to tear it down every day and refresh the data from sources, but does it have complicated config? How well does it work on say, a single core?
I've been running clickhouse for production as a solo dev for a year now and it's been great so far. I've had to tune it a bit for the machines I have, but everything works and it's incredibly fast. Not even comparable to postgres. For analytics, it goes way beyond what you'd expect, even under heavy writing.
If you can run MySQL you can run ClickHouse. They are both about the same level of effort operationally. Your case sounds pretty simple and can be addressed with one or more single node systems.
As far as whether it works on a single core, you probably need to load data and test. That's the only way to answer that question for sure.
IMO the real question for any database is whether you can invest the time to learn how to fix problems and deal with boring but necessary operations like upgrade. If not, it's probably better to use a managed service. We operate a SaaS for ClickHouse but use managed MySQL and Kubernetes. That choice freed up resources to focus on making ClickHouse run well.
I've been running it in a container on docker swarm (on a celeron 3205U—2x1.5GHz cores). My CH container stores read only data, and I periodically update the snapshot, which involves deploying a newer image. This is very straightforward but, because it's read only, I don't even need a persistent volume.
I've found the documentation to be pretty good and there's a really active telegram group where devs offer help (if you get noticed, but that's not too hard).
Easy for one person - you just set it up and forget about it. I have one 4-core dedicated server for clickhouse that is inserting around 100 million new rows a day; the DB size is tens of terabytes and it works without problems.
If you don't really care about the data then I'd say it's probably pretty easy to run without an infra person right? If you hit a problem just nuke it and try again?
For context, there are multiple Clickhouse companies. This is the one that spun out of Yandex and took ClickHouse as the name of the company!
Altinity has been in the space for a while offering Clickhouse services as well: https://altinity.com/
Hi! Thank you for mentioning my company Altinity. We love ClickHouse and have been supporting customers on it for years.
That said I would like to defend ClickHouse and their use of the logo. They acquired the IP and it's their right to use it as they please. It's hard to imagine them not using the ClickHouse name. There's a crowd of companies using/supporting ClickHouse so it's the obvious way to market themselves.
The interesting question is whether it's a good thing for users to have a large company controlling ClickHouse rather than having it in a foundation.
I think that’s only an interesting question to you because you directly benefit from it. From now on, your company will be regarded as an off-brand Clickhouse since you have no influence on the roadmap, etc. MongoLab and MongoHQ come to mind.
My advice would be to pivot away from the Clickhouse brand and towards a great OLAP service that happens to run on Clickhouse.
I wish you guys luck but you’ll probably be used by customers mainly to negotiate with Clickhouse on better rates. And they’ll undercut competition when needed since they have so much cash and investor mindshare.
Yes. Our offering is quite differentiated--different pricing model, ability to run in user VPCs and private cloud, baked-in enterprise DBA support, and Altinity Stable builds with long-tail support are just a few examples. This is a large market with very diverse needs.
At the same time we contribute to open source and work to make ClickHouse better, just as many other businesses built on ClickHouse do. We've done quite a bit of work including many ClickHouse PRs, maintenance of the Grafana ClickHouse plugin since 2019, Altinity Kubernetes operator for ClickHouse, etc. We plan to step up our contributions as the market grows.
I was keeping a tally of how many companies were offering "ClickHouse as a service" at one point last year. I think I got up to 7 or 8.
It will be interesting to watch this unfold from a code licensing perspective. Will Clickhouse Inc. move to a more restrictive license to block all these other ClickHouse services?
> It will be interesting to watch this unfold from a code licensing perspective. Will Clickhouse Inc. move to a more restrictive license to block all these other ClickHouse services?
Is this fully hosted or runs in my cloud? Is this running https://github.com/ClickHouse/ClickHouse or a closed-source Clickhouse Cloud variant with added features?
Billing page says AWS for storage, so I'd assume the same for compute.
If they're really doing elastic reads as suggested in another comment I don't think this can be the standard CH server (at least for reads) - or I've missed something very exciting in recent versions.
It feels like there's gotta be something proprietary beyond just using an S3 disk setup, but maybe not? Is simply having data_cache_enabled and cache_path pointing to local NVMe SSD sufficient to achieve similar speeds?
For non-elastic writes this might be sufficient (I have no clue as to performance of AWS S3 for this kind of use case) - but supposedly[0] there's some elastic read capacity. As far as I know that's not possible with standard CH at all, and may be where some secret sauce is.
I can vaguely imagine how you'd build such a thing, you'd spin up new instances that knows your write-primary's sharding/partitioning scheme - heck you could even approximate it with slightly-augmented clickhouse-clients doing the S3 pulls directly and no real additional "server" - but as far as I know there's no way out of the box.
There are a few options for migrating or synchronizing data from MySQL - I'd recommend starting with this page in the ClickHouse Docs - there is a nice video there that explains some of those options:
Depending on what you are trying to achieve, you could use clickhouse-local with the MySQL engine to move data, or you could use an ETL/ELT tool to migrate/sync.
Good point. If you just want to copy data, far and away the easiest way to transfer it is using the MySQLDatabaseEngine. You can even copy table definitions. Watch for issues with Decimal data types if you do this.
Be careful with this engine: it's easy to accidentally expose the password, since anyone with table read permissions can see it if the engine wasn't set up using an external credential file.
I also had some pretty bad join performance (CH table joined to MySQL table), the quick solution to both of these is that we instead use the table function (https://clickhouse.com/docs/en/sql-reference/table-functions...) to copy the data periodically.
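For anyone curious, the table function usage is roughly like this (host, credentials and table names are placeholders):

  -- Copy a snapshot from MySQL into a ClickHouse table.
  INSERT INTO ch_orders
  SELECT * FROM mysql('mysql-host:3306', 'shop', 'orders', 'reader', 'secret');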
Check out the Altinity Sink Connector for ClickHouse [0]. This is advancing quite quickly and already has prod deployments. Please feel free to try it out.
Not affiliated with Clickhouse, but fwiw they've been working on this for a while (I recall coming across it like +2 years ago in an old Altinity blog post):
We're currently using InfluxDB for some timeseries metrics, but the 2.0 migration path has been terrible, even for simple examples, so we're looking for something else.
At a previous company, I wrote a simple TCP server to receive LineProtocol, parse it and write to ClickHouse. I was absolutely blown away by how fast I could chart data in Grafana [1]. The compression was stellar, as well...I was able to store and chart years of history data. We basically just stopped sending data to Influx and migrated everything over to the ClickHouse backend.
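In case it's useful to the parent, a rough sketch of the sort of schema that gets that kind of compression (names and codec choices are illustrative, not exactly what we ran):

  CREATE TABLE metrics
  (
      metric LowCardinality(String),
      tags   Map(String, String),
      ts     DateTime64(3) CODEC(DoubleDelta, ZSTD),
      value  Float64       CODEC(Gorilla, ZSTD)
  )
  ENGINE = MergeTree
  PARTITION BY toYYYYMM(ts)
  ORDER BY (metric, ts);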
Clickhouse is so stupidly, mindbogglingly fast at what it does that you can often replace much more specialized databases or specialized schemas/queries with brute-force Clickhouse and come out far ahead. It really depends on the size of what you are doing. I've used Clickhouse for timeseries data at tens of millions of events per day and it worked well.
We're a company that has been using the private beta of ClickHouse Cloud.
IMHO, the unique value-prop of this offering is that it elastically scales reads (compute), like Redshift/Snowflake/BigQuery/other cloud data warehouses do, while also "being Clickhouse", and so giving you very traditional SQL querying capabilities (where those others all ask you to bend your SQL to fit the DB, sometimes in pretty ridiculous ways.)
I would suggest not thinking of this offering as "ClickHouse, but in the cloud"; but rather, thinking of this as "a cloud data warehouse, but rather than using its own proprietary query engine, it's just ClickHouse."
If you haven't evaluated other cloud data warehouses as alternatives to solving your problem (and found them wanting for one reason or another), then you're likely not in the niche that would see a positive profit margin on using ClickHouse Cloud.
I’m sure there are tons of people very willing to pay for this if it means not having to run their own k8s clusters. Managed offerings like this are expensive, but worth every penny for some.
Unsolicited feedback: I don't receive any validation email when creating an account, so I cannot log in. It is also impossible to reset the password.
Clickhouse is great but the ops and scaling make it notoriously difficult to self host.
If you have a lot of log data and want something open source and serverless you can self host, check out Matano (https://github.com/matanolabs/matano).
Exactly, getting started is easy but dealing with scaling and operations when it's deployed in a production environment is not simple, primarily because it doesn't separate storage and compute.
I think ClickHouse Cloud Beta addresses exactly this concern. It separates storage and compute and handles scaling for you. There is no sharding, so you don't need to deal with it yourself.
It's interesting how diligently they wiped out any and all mentions of Yandex from their website, to the extent that Google struggles to find meaningful mentions apart from a few exceptions like [1]. Quite peculiar, considering that just two years ago, according to that presentation, they still considered themselves a Yandex project - even the link in that tweet is https://clickhouse.yandex .
For those unaware, Yandex is a Russian internet megacorp - think Google, but in cahoots with the authoritarian government: cherry-picked news coverage friendly to the government, effectively a monopoly across multiple verticals (e.g. they bought out Uber in Russia), etc. In 2020, the year that deck seems to be from, they had already been censoring their news feed [2] and tweaking search ranking to promote pro-government results [3] for years.
Yes, it seems the split was triggered by the war, but the censorship/promotion/collaboration with the government [0] were apparently not worth breaking ties over before that.
A cynical view would be that it's only now that the association became bad for business.
[0]: By the way, Yandex had been showing Crimea as unqualifiedly Russian territory since 2014, until the war, when they simply removed borders between countries completely.