ClickHouse Cloud is now in Public Beta (clickhouse.com)
278 points by taubek on Oct 4, 2022 | 162 comments


I wanted to note that ClickHouse Cloud results are now also being reported in the public ClickBench results: https://benchmark.clickhouse.com/

Good to see transparent comparisons now available for Cloud performance vs. self-hosted or bare-metal results, as well as results from our peers. The ClickHouse team will continue to optimize further - scale and performance are a relentless pursuit here at ClickHouse, and something we expect to pursue transparently and reproducibly. Public benchmarking benefits all of us in the tech industry as we learn from each other and share the best techniques for attaining high performance within a cloud architecture.

Full disclosure: I work for ClickHouse, although I have also been a member of SPEC in the past, developing and advocating for public, standardized benchmarks.


To help understand the results of the benchmark, I find it helpful to look at how the benchmark is constructed, and what it tests for. From the README:

"The dataset is represented by one flat table. This is not representative of classical data warehouses, which use a normalized star or snowflake data model. The systems for classical data warehouses may get an unfair disadvantage on this benchmark."

Taking a look at the queries [0], it looks like it mostly consists of full table scans with filters, aggregations, and sorts. Since it's a single table, there are no joins.

[0]: https://github.com/ClickHouse/ClickBench/blob/main/snowflake...


Can you clarify what a "write unit" is? Naively it sounds like it might be blocks x partitions x replicas that actually hit disk. (Which is also probably not very clear to people not already using CH, but I have at least middling knowledge of CH's i/o patterns and I have no clue what a "write unit" is from the page's description.)


One write unit is around 100..200 INSERT queries.

If you are doing INSERTs in batches of one million rows, that works out to

    SELECT formatReadableQuantity(1000000 * 100 / 0.0125)
    
    8.00 billion
inserted rows per dollar. Pretty good, IMO.

If you are doing millions of INSERT queries with one record each, without the "async_insert" setting, it will cost much more.

That's why we have "write units" instead of just counting inserts.
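A minimal sketch of the single-record case with async_insert turned on (the table name here is just a placeholder):

    -- Session-level settings; they can also be set per query or in a user profile.
    SET async_insert = 1;           -- server buffers small inserts and flushes them in batches
    SET wait_for_async_insert = 1;  -- acknowledge only after the buffered data is written

    -- Each single-row INSERT should now be batched server-side into larger parts,
    -- so the write cost behaves more like the batched case above.
    INSERT INTO events_placeholder VALUES (now(), 'page_view', 42);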


More helpful would be answers to my questions at https://news.ycombinator.com/item?id=33081502 - async_insert is a relatively new feature, we're still using buffer tables for example - but also most of our "client" inserts are actually onto multi-MV-attached null engines. Those MVs are also often doing some pre-aggregation before hitting our MTs as well. So we might insert a million rows, but the MV aggregates that down into 50k, but then that gets inserted into five persistent tables, each of which has its own sharding/partitioning so that blows up to 200k or something "rows" again. (And at some point those inserts are also going to get compacted into stuff inserted previously / concurrently by the MT itself.)

As I've said several times in this thread, I understand why you don't count inserts or rows. What I don't understand is what unit a WU does actually correspond to. In particular I don't understand its relation to e.g. parts or blocks, which are the units one would focus on optimizing self-hosted offerings.


I think the optimizations you focus on for self-hosted ClickHouse are the same as for Cloud. In self-hosted, optimization improves your throughput/capacity with fixed allocated resources; in Cloud, it directly affects cost.

For those complex pipelines you may find it more useful to run tests during the trial. Data distribution, partitioning and so on can change the actual cost significantly, so estimates can be too pessimistic or too optimistic.


> For those complex pipelines you may find it more useful to run tests during the trial.

Right, that's exactly what I don't want to deal with. Unless I have even just a ballpark estimate of complex pipelines both before I commit to any sales crap and afterwards when we're designing new pipelines, it's just not an option for us at all. I have no clue if it's going to cost us $10, $100, or $10000.


It's Tyler from ClickHouse.

Check out the response below that has a reference to some of our billing FAQs.


It doesn't mention anything about what a write unit is, except to say you can reduce write units by batching inserts (that part I guessed already.)

There's no way to think about what an actual write unit means. You could measure the costs on a sample workload, but that's far from ideal. Some transparency here would be nice.

I understand the answer is complicated, based on hairy implementation details, and subject to change. Give me the complexity and let me interpret it according to my needs.


Absolutely.

Working on updating the FAQ and tooltips now and sharing your feedback. <3


Right, that link covers read units which is also what I expected - essentially the number of files I have to touch - but I still have no clue about write units.

Is one block on one non-partitioned non-distributed table one write unit? What about one insert that's two blocks on such a table? What about one block on a null engine with two MVs listening to insert into two non-partitioned non-distributed tables? What if the table is a replacing mergetree, do I incur WUs for compactions? etc.

My worry is that it is essentially 1 WU = 1 new part file, which I understand makes sense to bill on but is tremendously opaque for users - at least I have no clue how often we roll new part files; instead I'm focused on total network and disk I/O performance on one side and client query latency on the other.
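(For what it's worth, the rough way I'd watch part creation on a self-hosted node is something like the query below against system.parts - table name is just a placeholder - though that still doesn't tell me how parts map to WUs.)

    SELECT count() AS active_parts, sum(rows) AS total_rows
    FROM system.parts
    WHERE database = 'default' AND table = 'my_table' AND active;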


I can assure you that 1 WU is not 1 part. Not even close. You can check it using trial credits with your data.

For example, I just checked that uploading a 1.1 GB example table (cell_towers, with 14 columns) cost me 0.38 write units.


Then I'm even more confused, because the pricing page clearly says write operations consume at least one WU.


With analytical column-store DBs the standard is to do massive batched writes of thousands to millions of records at a time, vs. inserting individual records. Inserting individual records is basically always crazy inefficient with column stores. So a single write is generally for thousands to millions of records.
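For anyone unfamiliar, a batched insert is just one statement (or one streamed upload) carrying many rows - a sketch with an illustrative table:

    INSERT INTO page_views (ts, user_id, url) VALUES
        (now(), 101, '/home'),
        (now(), 102, '/pricing'),
        (now(), 103, '/docs');
    -- Real batches are typically thousands to millions of rows, often streamed
    -- from the client as a single INSERT ... FORMAT CSV or JSONEachRow.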


Buddy, if you look just a couple posts up you'll see me comment on how ClickHouse's actual disk format works. You don't need to explain batching to me.

Nonetheless you can't insert a-whole-file-and-just-that-file in less than one write.


Where does it say that? The pricing page says on "Writes" in the info tooltip: "Each write operation (INSERT, DELETE, etc) consumes write units depending on the number of rows, columns, and partitions it writes to."

This doesn't imply to me that each individual INSERT costs 1 WU, but that it could be fractional. I guess it depends on how you read it?


The tooltip has been changed since my comment was posted; it's now not incorrect, but it still doesn't really tell me more useful information.

(See https://news.ycombinator.com/item?id=33081099 for the original wording.)


Is lower time the right metric here? It seems like normalizing by price would make a more useful metric for big data, as long as the response time is reasonable.


Yes, ClickBench results are presented as Relative Time, where lower is better. You can read more on the specifics of ClickBench methodology in the GitHub repository here: https://github.com/ClickHouse/ClickBench/

There are other responses from ClickHouse in the comments on the pricing, so I'll defer to their expertise on that topic there. Thank you for your feedback and ideas - normalizing a price-based benchmark is an interesting concept (and one where ClickHouse would expect to lead, given the architecture and efficiency).


This benchmark focuses on analytical query latency for representative analytical queries, so yes - lower number is better.


Wow, I hadn't heard of StarRocks before... seems like an interesting competitor.

https://starrocks.io/blog/clickhouse_or_starrocks


See SelectDB, built on Apache Doris by the creators of Apache Doris. The performance is amazing. https://en.selectdb.com/blog/SelectDB%20Topped%20ClickBench%...


Looks really cool! Great work!

Had a pricing question. Say we connected a log forwarder like Vector to send data to Clickhouse Cloud, once per second. If each write unit is $0.0125, and we execute 86,400 writes over the course of the day, would we end up spending $1080? Do you only really support large batch, low frequency writes?


Hi Cliff - A write unit does not correspond to a single INSERT. A single INSERT with less than 16MB of data generates ~0.01 "write units", so a single "write unit" typically corresponds to ~100 INSERTs. In your example, that would come closer to $11 a day. Depending on how small your batches are in that example, there may be ways to reduce that spend further, by batching even more or turning on "async_insert" inside ClickHouse.


> and advocating for public, standardized benchmarks

For full transparency, I think you should do the same in ClickHouse. Or is there a strong reason not to run benchmarks on standard analytical workloads like TPC-H, TPC-DS or SSB?


You can't post results of TPC benchmarks without an official audit, which complicates publishing results. You also can't find the names that are usually compared with ClickHouse there [1]. So the open, standardized ClickBench tries to encourage benchmarking for everyone.

There are numerous benchmarks that use TPC-like queries, but those are not standardized and can be misleading. For example, a lot of work was done by Fivetran to get this report [0], but they show only the overall geomean for those systems, so you can't understand how they actually differ. And their queries are not original TPC: variables are fixed in the queries, and they run only the first query where the official query is a multi-query.

Contributors from Altinity ran SSB with flattened and original schemas [2]. SSB is not well standardized, and we see a lot of pairwise comparisons with controversial results - generally you can't just reproduce them and get all the results in a single place for the same hardware.

[0] https://www.fivetran.com/blog/warehouse-benchmark [1] https://www.tpc.org/tpcds/results/tpcds_results5.asp?orderby... [2] https://altinity.com/blog/clickhouse-nails-cost-efficiency-c...


There is a good reference to the available benchmarks for analytical databases: https://github.com/ClickHouse/ClickBench#similar-projects


On a couple of occasions I've seen TPC-H benchmarks with the remark that the results are not audited. Is that not possible?


The license states the following. Any other modifications are not standardized, so you can't just compare systems. Otherwise there would already be another standardized benchmark in the list you propose to run and publish.

>c. Public Disclosure: You may not publicly disclose any performance results produced while using the Software except in the following circumstances: (1) as part of a TPC Benchmark Result. For purposes of this Agreement, a "TPC Benchmark Result" is a performance test submitted to the TPC, documented by a Full Disclosure Report and Executive Summary, claiming to meet the requirements of an official TPC Benchmark Standard. You agree that TPC Benchmark Results may only be published in accordance with the TPC Policies. viewable at http: //www.tpc.org (2) as part of an academic or research effort that does not imply or state a marketing position (3) any other use of the Software, provided that any performance results must be clearly identified as not being comparable to TPC Benchmark Results unless specifically authorized by TPC.


I see, thanks for the context, it seems like a PITA.

But given that each database system has its own flavor of SQL, vanilla TPC benchmarks may not work out of the box, so one needs to tweak them a bit - and that might be what actually disqualifies the published results from any of the clauses above being applicable.

I can also imagine that the combination of clauses (2) and (3) is what some of those who publish results are taking advantage of.

[1] https://www.oracle.com/mysql/heatwave/performance/ [2] https://www.singlestore.com/blog/tpc-benchmarking-results/ [3] https://docs.pingcap.com/tidb/v6.2/v5.4-performance-benchmar... [4] https://www.monetdb.org/blogs/learning-from-benchmarking/


Why are you using 'threads' instead of vCPUs or AWS instance types, like other benchmarks do? That's really hard to compare and adds suspicion here.


It is related to the "max_threads" setting of ClickHouse, which by default is the number of physical CPU cores - half the number of vCPUs.

For example, the c6a.4xlarge instance type in AWS has 16 vCPUs and 8 physical cores, so "max_threads" in ClickHouse will be 8.
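If you want to check the effective value on your own machine, something like this should show it:

    SELECT getSetting('max_threads');

    -- or with a bit more context:
    SELECT name, value, description
    FROM system.settings
    WHERE name = 'max_threads';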


Interesting set of results. Ignoring ClickHouse, StarRocks seems to be better in almost all metrics.

I was curious to compare MonetDB, DuckDB, ClickHouse-Local, Elasticsearch, DataFusion, QuestDB, Timescale, and Athena. Amazingly, MonetDB shows up better than DuckDB in all metrics (except storage size), and Athena holds its own and fares admirably well, especially given that it is stateless. Meanwhile, Timescale and Quest did not come out as well as I hoped they would.

https://benchmark.clickhouse.com/#eyJzeXN0ZW0iOnsiQXRoZW5hIC...

It'd be interesting to see how rockset, starburst (presto/trino), and tiledb fare, if and when they get added to the benchmark.


The particular way in which the data is loaded into DuckDB and the particular machine configuration on which it is run triggers a problem in DuckDB related to memory management. Essentially the standard Linux memory allocator does not like our allocation pattern when doing this load, which causes the system to run out-of-memory despite freeing more memory than we allocate. More info is provided here [1].

As it is right now the benchmark is not particularly representative of DuckDB's performance. Check back in a few months :)

[1] https://github.com/duckdb/duckdb/issues/3969#issuecomment-11...


Thanks. Btw, we use DuckDB (via Node/Deno) for analytics (on Parquet/JSON), and so I must point out that despite the dizzying variation among various language bindings (cpp and python seem more complete), the pace of progress, given the team size, is god-like. It has been super rewarding to follow the project. Also, thanks for permissively licensing it (unlike most other source-available databases).

Goes without saying, if there are cost advantages to be had due to DuckDB's unique strengths, then serverless DuckDB Cloud couldn't come here soon enough.


> despite freeing more memory than we allocate

> despite DuckDB freeing more buffers than it is allocating

Can you please clarify how that is even possible?


We are allocating and freeing buffers repeatedly. Despite freeing more buffers than we allocate, memory usage might still increase because of internal fragmentation in the allocator. Essentially, fragmentation can create "unused" space that still takes up memory. This phenomenon is called heap fragmentation [1].

[1] https://cpp4arduino.com/2018/11/06/what-is-heap-fragmentatio...


> Despite freeing more buffers than we allocate

Technically, I hope you understand that this isn't possible but maybe I am misinterpreting what you're trying to say.

  auto buff = malloc(N);   // N: some allocation size
  free(buff);
  free(buff);              // double free: undefined behavior
is one way to free "more" buffers than were allocated, but this leads to UB and, depending on the underlying system allocator implementation, it may or may not crash.

However, given how silly this would be I believe this is not what you're trying to convey?


Here's what mytherin wrote: "...we are allocating and freeing buffers repeatedly. Despite freeing more buffers than we allocate..."

So, I assume, the context is, DuckDB allocates x buffers, frees x - m buffers at some point later, then allocates n buffers where n <<<< m, and yet malloc fails.

In the GitHub thread mytherin linked to above, Alexey Milovidov, ClickHouse CTO, points out that ClickHouse uses jemalloc, which is a better choice than glibc malloc given the issue with fragmentation. It is likely that DuckDB will switch to jemalloc, too.


You are misinterpreting it indeed.

The scenario I am describing is roughly the following:

Suppose we allocate 100K buffers that all have an equal size, and our memory usage is now 10GB. After that point we free 20K buffers, but allocate 10K more. In other words, from that point on we are freeing more buffers than we are allocating.

Now, since we are freeing more than we are allocating, you would expect our memory usage to go down. However, when using the standard glibc malloc on Linux, our memory usage unexpectedly goes up. After this happens several times in a row, the system runs out of memory and new calls to malloc fail.


surprised Spark isn't there


Happy ClickHouse user here. This is one amazing piece of software, to be honest; for anyone who has ever wanted to parse, analyse and query billions of time-series data points, ClickHouse is the way to go.

The cloud offering seems like an amazing product for companies that can afford it. I am not sure if my read on the billing is right, but for 5M inserts per month the total bill would be $62K.


That can’t be right, that’s insane. I would find it easy to do 5m inserts in 30 minutes.


A write unit is not the same thing as a single insert, if that's what you've multiplied up to get that cost.


I hate the part of my brain that has allowed the name to interfere with my interest in even looking at it.


No, write unit is not a single INSERT. A single INSERT will take around 0.01 write units.


That's quite a high price tag per insert - do you have to write a large amount of data per insert?


Checking the pricing, $0.0125 per write unit. It says each write "generates one or more write units". So ... $1.25 per 100 writes? That can't be right. I wondered if it meant writes per second (like how AWS DynamoDB or Azure CosmosDB work, with their unit-based billing).


With an analytical database like ClickHouse, you can write many rows with a single INSERT statement (thousands of rows, millions of rows, and more). In fact, this kind of batching is recommended. Larger inserts will consume more write units than smaller inserts. Check out our billing FAQ for some examples, and we will be enhancing it with more detail as questions from our users come in (we'll work on clarifying this specific point): https://clickhouse.com/docs/en/manage/billing/ We also provide a $300 credit free trial to try out your workload and see how it translates to some of the usage dimensions. Finally, this is a Beta product, so keep the feedback coming!


Thanks, I agree batching inserts would indeed be a good idea and it makes sense that's recommended. However that link you mention (as of now) does not specify what a write unit is. So if that could be clarified, that would be great. Since from your reply it sounds like one INSERT would indeed (at a minimum) incur one write unit. And thus 100 writes could indeed cost $1.25. Which could get expensive, fast.


An INSERT can consume less than one write unit, depending on how many rows and bytes it writes to how many partitions, columns, materialized views, data types, etc. So, a "write unit", which corresponds to hundreds of low-level write operations, typically translates to many batched INSERTS. We are working to improve our examples in the FAQ to clarify - thank you so much for asking the question!


You know what would be really helpful? Some tooling around measuring "write units", like a fake local ClickHouse Cloud API you could submit a write to and see how many "write units" that particular write would take. Of course, that would make pricing transparent and easy, and SaaS companies don't seem to like that. I challenge you folks to actually figure out a way to prevent surprise bills; until then all this "write unit" stuff is bullshit.


> We are working to improve our examples in the FAQ to clarify

Based on the comments here, and my own confusion, I think figuring out a different way of billing read/write operations is in order.


How about if you are streaming in from Kafka and inserting each event as it arrives? ClickHouse is ideal for rapid analytics over event data, so having to introduce batching would be disappointing.

Batch upload is of course more cost effective, but I would expect that to be more typical in a traditional data warehouse where we are uploading hourly or daily extracts. Clickhouse would likely show up in more real time and streaming architectures where events arrive sporadically.

I am a huge fan and advocate of Clickhouse, but the concept of a write unit is strange and the likely charges if you have to insert small batches would make this unviable to use. A crypto analytics platform I built on top of Clickhouse would cost in the $millions per month vs $X00 or low $X000 with competitors or self hosting.


"Each write operation (INSERT, DELETE, etc) consumes write units depending on the number of rows, columns, and partitions it writes to."


A single INSERT takes around 0.01 write units.


Some unsolicited feedback:

1) I had no idea what Clickhouse was for the first 30 seconds looking at the homepage. I now understand it to be a database of some sort. I shouldn't have seen the words "performance" and "cloud" and "serverless" before seeing the word database, right? I'm starting off confused. There shouldn't be an assumption that I know what you all do.

2) I have no idea what a column oriented database is. I've been a developer for 29 years (mostly frontend but I do a lot of full stack too). If I need an explainer, a lot of devs will.

Aside from that, it looks like a nice offering and I wish you all the best!


"column oriented database"

We all have our specialties, and that is fine. It is a common pattern that a developer gets comfortable with a particular tech stack, and then uses it for many years without seeing the need for much else. Some developers use Ruby on Rails plus Postgres for everything, others use C# and .NET and SQL Server. It's fine, if that's all you need.

Still, this is the year 2022. Cassandra, to take one example, was released in 2008. For everyone who has needed these fast-read databases, they've been much discussed for 14 years, including here on Hacker News, and on every other tech forum. At this point I think a company can simply assume that most developers will have some idea what a column database is.


Cassandra is not a columnar database; columnar in this sense is about the storage layout. Values for a column are laid out next to each other in storage, allowing for optimizations in compression and computation, at the expense of reconstructing entire rows. Postgres is a row store, meaning all the columns for a row are stored next to each other in storage, which makes sense if you need all of the values for a row (or the vast majority).


You never explained what a column database is. One row and unlimited columns?



If you don't know what a column oriented DB is, that page is probably not for you.


I know what a column oriented DB is, but ClickHouse was not on my radar before.

The pitch on the landing page is that ClickHouse Cloud is great if you love ClickHouse. If you don't know what ClickHouse is, you have to do some work to find out.


Care to elaborate? I’m a fullstack Rails dev for 12 years and I had no idea either, just like OP. Why alienate potential users from the get go?


Isn't it just a database in Second normal form (2NF)?


No. Column-oriented storage DBs make it fast to read tens to billions of "rows", usually when the data you want to read is independent of other columns. Example: stock closing prices - row storage isn't going to buy you much here; you'd rather read the whole "stock close" column as fast as possible.

Whereas in traditional DBs, with data like [first name, last name], the columns may have way less meaning on their own and you need both columns for the data to make sense.

A traditional DB with B-tree storage is much slower for that type of usage. Storing the stock close in a single-column format makes that type of query much faster.
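A (hypothetical) example of the kind of query where the columnar layout pays off - it only has to read the ts and close columns off disk, rather than every column of every row:

    SELECT toDate(ts) AS day, max(close) AS max_close
    FROM stock_ticks
    GROUP BY day
    ORDER BY day;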


Column-oriented databases really aren't esoteric knowledge. You should check out DDIA, especially if you're doing full stack.


Thanks for the well wishes!

And thanks for the honest feedback.

It's always an interesting balance of promoting a new thing (Cloud) and explaining an existing thing. This might be helpful.

https://clickhouse.com/docs/en/home

(note: I work at ClickHouse)


And https://clickhouse.com/docs/en/intro/ is a bit lower level.


That's fine. You probably never played with a data warehouse. No big deal, we all have our own areas mastered and gaps elsewhere.


Columnar DBs have been around since 1969 and predate relational DBs. Seriously, if you're not into large-scale data, it really isn't for you.


Clickhouse was spun out of Yandex, which is a Russian corporation. Given existing geopolitical tensions, is there anything to worry about there?

Does anyone know how much of the Clickhouse team (or ownership) is still located in Russia?


They have been extremely clear in their support of Ukraine https://clickhouse.com/blog/we-stand-with-ukraine


It's just a bunch of text, doesn't show any support whatsoever.

They can show support by donating to UA defence and showing proof (important - not some neutral org). Otherwise it is not support but a bunch of bs


In places like Russia, those words are very dangerous, and could get you jailed or sent to the front. Even calling it a "war" was/is a punishable offense.

So, yes, words, but the potential consequences of these words have more significance than empty air.


Why do you get to define the acceptable forms of support...?


They support Ukraine; however, given the company spun out of Yandex, the latter is most certainly financially benefitting from their success and is paying taxes that are funding the war. AFAIR, Yandex also has 2 director seats on the ClickHouse board, although that could have changed.


> ClickHouse, Inc. is a Delaware company with headquarters in the San Francisco Bay Area. We have no operations in Russia, no Russian investors, and no Russian members of our Board of Directors. We do, however, have an incredibly talented team of Russian software engineers located in Amsterdam, and we could not be more proud to call them colleagues.

From their "We Stand With Ukraine" page. [1]

[1] https://clickhouse.com/blog/we-stand-with-ukraine


That's interesting, thanks for clarifying. Yandex do show up as an investor in Crunchbase, including in their most recent Series B. The cited blog post says:

> The formation of our company was made possible by an agreement under which Yandex NV, a Dutch company with operations in Russia

While Yandex NV is registered in the Netherlands, it's pretty clear that Yandex NV is directly related to Yandex (specifically, according to Wikipedia, it's a holding company of Yandex). For those who don't know, Yandex is basically the Google of Russia and holds 42% of market share among search engines in Russia.

The blog post does not seem to make any claims that Russia is not benefitting financially from the commercial success of Clickhouse. And given the above, such claim would unlikely be true. As such, I still think it's pretty much safe to assume that a portion of any $$$ paid to Clickhouse ultimately goes to fund the war and kill people.

That said, I sincerely hope they could find a way to stop that flow of money from happening somehow, as otherwise it's a nice technology and a great technical team behind it...


You can see on their jobs page and 'our story' page, they are mostly in the US and The Netherlands.


Yandex is technically incorporated in The Netherlands and has an engineering presence there, but they are as Russian controlled as can be.



Just a bunch of words. Everyone says they support Ukraine if it benefits their business.

Show proof


What kind of proof are you looking for here?


> Does anyone know how much of the Clickhouse team (or ownership) is still located in Russia?

Zero.

No employees in Russia, no ownership from Russia, no influence from Russia.


I have (an extremely boring, but quite hands-on) video about various ways of ClickHouse optimizations on top of external storage: https://www.youtube.com/watch?v=rK2BsaaaOCA (starting from around 40:00).


I am from the SelectDB company. SelectDB is a cloud-native data warehouse developed by the founding team of Apache Doris. Recently we submitted ClickBench test results and achieved good rankings. Later, we will incorporate all performance optimizations into Apache Doris and submit test results for Apache Doris. For both SelectDB and Doris, we not only have outstanding performance on large wide tables, but also perform well in scenarios such as TPC-H where there are many joins. https://en.selectdb.com/blog/SelectDB%20Topped%20ClickBench%...


Great stuff! I know the ClickHouse team and they are world class. If you're more familiar with a relational database some things will feel weird. For example, you should not insert 1 row at a time, there are over 1,000 built-in functions, and the default connection is often over HTTPS, but not on port 443.


ClickHouse Cloud provides port 443 as well as 8443. You can insert one row at a time, it's perfectly ok with "async_insert=1".


I’m a little sad to see them embrace magic “insert unit” pricing, instead of taking the approach Altinity uses where you are renting an instance size that you can compare apples-to-apples with running your own cluster on ec2.


Honestly that's something they can probably offer separately, but I really prefer this pricing for most use cases where I want a database that is always available but has bursty request/response patterns. This means I can have an analytical database available for all my small services, websites, etc without having to think too much about availability, support, and a constant price overhead. But ClickHouse is so fast you can get pretty far with a $10 VPS, I admit.

Probably the best comparison is CockroachDB Cloud. They have a "serverless" offering based on unit pricing and a dedicated offering based on provisioned servers + support/maintenance overhead. I think that would be the ideal place to go long-term, but I'm super excited for this current one. I love ClickHouse and want to support them.

ClickHouse is also an interesting case because there's lots of options to migrate clusters, use S3 as long-term storage, etc to where I don't particularly feel locked into this offering if I ever wanted to shift into my own.


> I’m a little sad to see them embrace magic “insert unit” pricing, instead of taking the approach Altinity uses where you are renting an instance size that you can compare apples-to-apples with running your own cluster on ec2.

Thanks for this comment. We'll be publishing a blog at Altinity to compare both models. My view is that they both have their place. The BigQuery pricing model is great for testing or systems that don't have constant load. The Altinity model is good for SaaS systems that run constantly and need the ability to bound costs to ensure margins.

Having a selection of vendors that offer different economics seems better for users than everyone competing for margin on the same operational model.

Disclaimer: I'm CEO of Altinity.


Serverless "pay for usage" is different than the fixed-size dedicated cluster pricing model, and can be quite a bit cheaper, especially with spiky query traffic. It should also be more reliable, since you don't have to predict and provision capacity for your peak usage ahead of time.

Disclaimer: I work for ClickHouse.


Ok but it can also be quite a bit more expensive and your bill is less predictable.

Disclaimer: I’m a customer of altinity.


Outside of being open source, how does ClickHouse differ from Snowflake/BigQuery? In what scenarios would I choose ClickHouse over those existing solutions?


Druid and Pinot are more likely to be the peer group (e.g. see https://leventov.medium.com/comparison-of-the-open-source-ol...)


ClickHouse supports ad-hoc analytics, real-time reporting, and time series workloads at the same time. It is perfectly suited for user-facing analytics services. It supports low-latency (<100ms) queries for real-time analytics as well as high query throughput (500 QPS and more) - all of this with real-time data ingestion of logs, events, and time series.

Take some notable examples from the list: https://clickhouse.com/docs/en/about-us/adopters/, something around web analytics, APM, ad networks, telecom data... ClickHouse is perfectly suited for these use cases. But if you try to align these scenarios with, say, BigQuery, they will become almost impossible or prohibitively expensive or just slow.

There are specialized systems for real-time analytics like Druid and Pinot, but ClickHouse does it better: https://benchmark.clickhouse.com/

There are specialized systems for time-series workloads like InfluxDB and TimescaleDB, but ClickHouse does it better: https://gitlab.com/gitlab-org/incubation-engineering/apm/apm... https://arxiv.org/pdf/2204.09795.pdf http://cds.cern.ch/record/2667383/

There are specialized systems for logs and APM, but ClickHouse does it better: https://blog.cloudflare.com/log-analytics-using-clickhouse/

There are specialized systems for ad-hoc analytics, but ClickHouse does it better as well: https://github.com/crottyan/mgbench

Well, even if you want to process a text file, ClickHouse will do it better than any other tool: https://github.com/dcmoura/spyql/blob/master/notebooks/json_...

And ClickHouse looks like a normal relational database - there is no need for multiple components for different tiers (like in Druid), no need for manual partitioning into "daily", "hourly" tables (like you do in Spark and Bigquery), no need for lambda architecture... It's refreshing how something can be both simple and fast.
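On the text-file point, a minimal clickhouse-local sketch (file name, format and columns are just examples):

    SELECT count(), avg(duration_ms)
    FROM file('requests.json', 'JSONEachRow', 'url String, duration_ms UInt32');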


We are using Clickhouse Cloud at our company and the speed at which it serves up data is mind-blowing to our team, which had been using older systems for almost 10 years. Congrats on the public beta, and we can't wait to see what comes next!


Just wanted to chime in and say that I'm using a self-hosted instance of CH for a small hobby project and the performance is awesome.


I'm pretty interested in running Clickhouse, but to poll the room: is it easy to run without an infra person maintaining it? We don't need any kind of persistence guarantee or backups, happy to tear it down every day and refresh the data from sources, but does it have complicated config? How well does it work on say, a single core?


I've been running clickhouse for production as a solo dev for a year now and it's been great so far. I've had to tune it a bit for the machines I have, but everything works and it's incredibly fast. Not even comparable to postgres. For analytics, it goes way beyond what you'd expect, even under heavy writing.

Even join-heavy queries work great.


If you can run MySQL you can run ClickHouse. They are both about the same level of effort operationally. Your case sounds pretty simple and can be addressed with one or more single node systems.

As far as whether it works on a single core, you probably need to load data and test. That's the only way to answer that question for sure.

IMO the real question for any database is whether you can invest the time to learn how to fix problems and deal with boring but necessary operations like upgrade. If not, it's probably better to use a managed service. We operate a SaaS for ClickHouse but use managed MySQL and Kubernetes. That choice freed up resources to focus on making ClickHouse run well.

Disclaimer: I work for Altinity.


I've been running it in a container on docker swarm (on a celeron 3205U—2x1.5GHz cores). My CH container stores read only data, and I periodically update the snapshot, which involves deploying a newer image. This is very straightforward but, because it's read only, I don't even need a persistent volume.

I've found the documentation to be pretty good and there's a really active telegram group where devs offer help (if you get noticed, but that's not too hard).


Much, much simpler than a lot of DBs I've worked with. It's mostly fire and forget, easily doable without an infra person, imo.


Easy for one person - you just set it and forget about it. I have one 4-core dedicated server for ClickHouse that is inserting around 100 million new rows a day; the DB size is tens of terabytes, and it works without problems.


Another voice to say it's really easy on a single node. I ran it for a lot of work purely because of this.


If you don't really care about the data then I'd say it's probably pretty easy to run without an infra person right? If you hit a problem just nuke it and try again?


For context, there are multiple ClickHouse companies. This is the one that spun out of Yandex and took the ClickHouse name as the name of the company!

Altinity has been in the space for a while offering Clickhouse services as well: https://altinity.com/

EDIT: brazenly was the wrong word here :)


Hi! Thank you for mentioning my company Altinity. We love ClickHouse and have been supporting customers on it for years.

That said I would like to defend ClickHouse and their use of the logo. They acquired the IP and it's their right to use it as they please. It's hard to imagine them not using the ClickHouse name. There's a crowd of companies using/supporting ClickHouse so it's the obvious way to market themselves.

The interesting question is whether it's a good thing for users to have a large company controlling ClickHouse rather than having it in a foundation.


I think that's only an interesting question to you because you directly benefit from that. From now on, your company will be regarded as an off-brand ClickHouse since you have no influence on the roadmap, etc. MongoLab and MongoHQ come to mind.

My advice would be to pivot away from the Clickhouse brand and towards a great OLAP service that happens to run on Clickhouse.


By that definition Amazon RDS is an off-brand MySQL (and PostgreSQL too). We're perfectly happy to be in their company.


I think people are buying Amazon there.

I wish you guys luck but you’ll probably be used by customers mainly to negotiate with Clickhouse on better rates. And they’ll undercut competition when needed since they have so much cash and investor mindshare.


Will Altinity Cloud compete with ClickHouse Cloud now? :)


Yes. Our offering is quite differentiated--different pricing model, ability to run in user VPCs and private cloud, baked-in enterprise DBA support, and Altinity Stable builds with long-tail support are just a few examples. This is a large market with very diverse needs.

At the same time we contribute to open source and work to make ClickHouse better, just as many other businesses built on ClickHouse do. We've done quite a bit of work including many ClickHouse PRs, maintenance of the Grafana ClickHouse plugin since 2019, Altinity Kubernetes operator for ClickHouse, etc. We plan to step up our contributions as the market grows.


"brazenly" :)? Are you for real? Top 3 commiters work there including Alexey who created CH in the first place

alexey-milovidov 17,522 commits 5,648,537 ++ 5,580,633 --

alesapin 4,618 commits 1,024,262 ++ 932,501 --

KochetovNicolai 3,867 commits 377,420 ++ 314,035 --


Wait, wasn't Clickhouse itself spun out of Yandex anyway? I.e. Yandex opensources Clickhouse, then later starts Clickhouse (the company)?


That is literally what happened. There is nothing brazen here.


It is an interesting perspective...

FWIW, here is Alexey's perspective on the topic directly for those of you following along at home.

https://clickhouse.com/blog/introducing-click-house-inc


Strange name and domain.

It is included in many adblocker lists.


And we request removal as and where we can.

https://github.com/StevenBlack/hosts/issues/1781

Unfortunately mvps.org no longer seems to reply to emails.


I was keeping a tally of how many companies were offering "ClickHouse as a service" at one point last year. I think I got up to 7 or 8.

It will be interesting to watch this unfold from a code licensing perspective. Will Clickhouse Inc. move to a more restrictive license to block all these other ClickHouse services?


> It will be interesting to watch this unfold from a code licensing perspective. Will Clickhouse Inc. move to a more restrictive license to block all these other ClickHouse services?

I don't see it as a reasonable move.


Is this fully hosted or runs in my cloud? Is this running https://github.com/ClickHouse/ClickHouse or a closed-source Clickhouse Cloud variant with added features?


It is a fully hosted offering and is based on ClickHouse core v22.10


Curious how CH is ultimately hosting this - on a public cloud, or is this their own cloud?


Billing page says AWS for storage, so I'd assume the same for compute.

If they're really doing elastic reads as suggested in another comment I don't think this can be the standard CH server (at least for reads) - or I've missed something very exciting in recent versions.


It feels like there's gotta be something proprietary beyond just using an S3 disk setup, but maybe not? Is simply having data_cache_enabled and cache_path pointing to local NVMe SSD sufficient to achieve similar speeds?

* https://clickhouse.com/docs/en/guides/sre/configuring-s3-for...

* https://clickhouse.com/docs/en/integrations/s3/s3-merge-tree...


For non-elastic writes this might be sufficient (I have no clue as to performance of AWS S3 for this kind of use case) - but supposedly[0] there's some elastic read capacity. As far as I know that's not possible with standard CH at all, and may be where some secret sauce is.

I can vaguely imagine how you'd build such a thing - you'd spin up new instances that know your write-primary's sharding/partitioning scheme - heck, you could even approximate it with slightly-augmented clickhouse-clients doing the S3 pulls directly and no real additional "server" - but as far as I know there's no way out of the box.

[0] https://news.ycombinator.com/item?id=33081658


Is there an easy way to have ClickHouse Cloud ingest data from MySQL hosted in Amazon RDS?


There are a few options for migrating or synchronizing data from MySQL - I'd recommend starting with this page in the ClickHouse Docs - there is a nice video there that explains some of those options:

https://clickhouse.com/docs/en/integrations/migration/

Depending on what you are trying to achieve, you could use clickhouse-local with the MySQL engine to move data, or use an ETL/ELT tool to migrate/sync.


Good point. If you just want to copy data, far and away the easiest way to transfer it is using the MySQL database engine. You can even copy table definitions. Watch for issues with Decimal data types if you do this.


Be careful with this engine, it's easy to accidentally expose the password as you only need table read permissions if it wasn't set up using an external credential file.

https://github.com/ClickHouse/ClickHouse/issues/3311

I also had some pretty bad join performance (CH table joined to MySQL table); the quick solution to both of these is that we instead use the table function (https://clickhouse.com/docs/en/sql-reference/table-functions...) to copy the data periodically.
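The periodic copy is roughly this shape (host, schema and credentials are placeholders):

    INSERT INTO analytics.orders_local
    SELECT *
    FROM mysql('rds-host:3306', 'shop', 'orders', 'reader', 'password');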


Use Named Collections to protect credentials. They are very handy. Here's an article that discusses use in the JDBC Bridge.

https://altinity.com/blog/connecting-clickhouse-to-external-...

Disclaimer: I work for Altinity.


Check out the Altinity Sink Connector for ClickHouse [0]. This is advancing quite quickly and already has prod deployments. Please feel free to try it out.

[0] https://github.com/Altinity/clickhouse-sink-connector


The 2 things that makes me hesitant about ClickHouse is that:

1. Data rebalancing is not automatic.

2. It doesn't really have a concept of cold-tier pushed to S3 directly, so cluster management is not simple for a small team.

Other than that, ClickHouse looks super amazing.


ClickHouse can push cold data directly to S3, so S3 will be used as cold storage and the local filesystem as hot storage.

Another approach is to store everything in S3 with local caching; it is much easier and somewhat more efficient.

ClickHouse Cloud covers these concerns.
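For reference, the table-level side of hot/cold tiering looks roughly like this; the 'hot_and_cold' policy and 'cold' volume are placeholder names that have to be defined in the server's storage configuration, with the cold volume backed by an S3 disk:

    CREATE TABLE events_tiered
    (
        event_date Date,
        payload String
    )
    ENGINE = MergeTree
    ORDER BY event_date
    TTL event_date + INTERVAL 30 DAY TO VOLUME 'cold'
    SETTINGS storage_policy = 'hot_and_cold';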


Ah this must be a new feature? If you could share the docs related to this, I'd really appreciate it.


Not affiliated with Clickhouse, but fwiw they've been working on this for a while (I recall coming across it like +2 years ago in an old Altinity blog post):

https://clickhouse.com/docs/en/integrations/s3/s3-merge-tree...


We're currently using InfluxDB for some time-series metrics, but the 2.0 migration path has been terrible, even for simple examples, so we're looking for something else.

Has anyone migrated from InfluxDB to ClickHouse?


Disclaimer: I work at ClickHouse

At a previous company, I wrote a simple TCP server to receive LineProtocol, parse it and write to ClickHouse. I was absolutely blown away by how fast I could chart data in Grafana [1]. The compression was stellar, as well...I was able to store and chart years of history data. We basically just stopped sending data to Influx and migrated everything over to the ClickHouse backend.

[1] https://grafana.com/grafana/plugins/grafana-clickhouse-datas...


ClickHouse is so stupidly, mindbogglingly fast at what it does that you can often replace a much more specialized database or specialized schemas/queries with brute-force ClickHouse and come out far ahead. It really depends on the size of what you are doing. I've used ClickHouse for time-series data in the tens of millions of events per day and it worked well.


Clickhouse is an amazing product but this pricing looks excessive.

A single node instance with a fast disk is more than sufficient for most needs: https://hub.docker.com/r/clickhouse/clickhouse-server

If you need a cluster, https://github.com/Altinity/clickhouse-operator makes things easy


We're a company that has been using the private beta of ClickHouse Cloud.

IMHO, the unique value-prop of this offering is that it elastically scales reads (compute), like Redshift/Snowflake/BigQuery/other cloud data warehouses do, while also "being Clickhouse", and so giving you very traditional SQL querying capabilities (where those others all ask you to bend your SQL to fit the DB, sometimes in pretty ridiculous ways.)

I would suggest not thinking of this offering as "ClickHouse, but in the cloud"; but rather, thinking of this as "a cloud data warehouse, but rather than using its own proprietary query engine, it's just ClickHouse."

If you haven't evaluated other cloud data warehouses as alternatives to solving your problem (and found them wanting for one reason or another), then you're likely not in the niche that would see a positive profit margin on using ClickHouse Cloud.


I’m sure there are tons of people very willing to pay for this if it means not having to run their own k8s clusters. Managed offerings like this are expensive, but worth every penny for some.


Unsolicited feedback: I did not receive any validation email when creating an account, so I cannot log in. It is also impossible to reset the password.


So sorry about that! Could you send us an email at support@clickhouse.com and we'll get to the bottom of it?


Clickhouse is great but the ops and scaling make it notoriously difficult to self host.

If you have a lot of log data and want something open source and serverless you can self host, check out Matano (https://github.com/matanolabs/matano).


Clickhouse is known for being super easy to run and operate vs peers. A single binary and very simple configuration.

The only acknowledged downside is that it does not have automatic rebalancing of data if new nodes are added.


Exactly, getting started is easy but dealing with scaling and operations when it's deployed in a production environment is not simple, primarily because it doesn't separate storage and compute.


I think ClickHouse Cloud Beta addressed exactly this concern. It separates storage and compute and deals with scaling. There is no sharding so you don't need to deal with scaling.


Slightly aside…

OLAP/Column dbs generally work well for large bulk inserts with many analytical queries…

How do clickhouse (or other column dbs) work with larger inserts (e.g. log data type inserts)?


Uber is using it for several PB of log data at scale: https://www.uber.com/en-IN/blog/logging/


People who use CH to insert lots of JSON objects... stay back with this kind of pricing :)


Note: the final beneficiaries are Russians, and there is a high probability that taxes from ClickHouse Cloud will go to the Russian budget to sponsor the war.

ClickHouse Inc is a subsidiary of ClickHouse B.V. in the Netherlands and is controlled by a bunch of Russians.


You have left multiple comments now on this post with disparaging remarks about the ClickHouse team without any evidence to back up your claims.

Not super classy if you ask me.


Russians are bombing my city right this second, so sorry if I mentioned an issue with them more than a single time.


Under that logic all of SV sponsors war in the Middle East.

Can we leave the Russophobia in the political threads at least?


At first I was a bit confused about why The Onion’s spin-off site had a cloud offering, but that one is called ClickHole.


It's interesting how diligently they wiped out any and all mentions of Yandex from their website, to the extent that Google struggles to find meaningful mentions apart from a few exceptions like [1]. Quite peculiar considering that just two years ago, according to that presentation, they still considered themselves a Yandex project; even the link in that tweet is https://clickhouse.yandex .

For those unaware, Yandex is a Russian internet megacorp, think Google but in cahoots with the authoritarian government: cherrypicked news coverage friendly to the government, effectively a monopoly across multiple verticals (eg they bought out Uber in Russia), etc. In 2020, the year that deck seems to be from, they were already censoring their news feed [2] and tweaking search ranking to promote pro-government results [3] for years.

[1]: https://presentations.clickhouse.com/meetup40/introduction/

[2]: https://meduza.io/feature/2022/05/05/my-zamuchilis-borotsya in Russian, but google translate does a reasonable job

[3]: https://www.svoboda.org/a/30580605.html same


> think Google but in cahoots with the authoritarian government

Sure, but the split appears have been very successful in this regard.

https://clickhouse.com/blog/we-stand-with-ukraine


Yes, it seems the split was triggered by the war, but the censorship/promotion/collaboration with the government [0] was apparently not worth breaking ties over.

A cynical view would be that it's only now that the association became bad for business.

[0]: By the way, Yandex was showing Crimea as unqualifiedly Russian territory since 2014, until the war, when they just removed borders between countries completely

Edit: qualified the remark


>By the way, Yandex was showing Crimea as unqualifiedly Russian territory since 2014

Google and Apple used to show it as part of Russia to Russian users as well.

https://www.google.com/amp/s/techcrunch.com/2019/11/27/apple...



