Hacker Remix

We built a new powerful JSON data type for ClickHouse

259 points by markhneedham 13 hours ago | 48 comments

fuziontech 9 hours ago

Using ClickHouse is one of the best decisions we've made here at PostHog. It has allowed us to scale performance while building more products on the same set of data.

Since we'd been using ClickHouse long before this JSON functionality was available (or even before the earlier version of it, `Object('json')`, was available), we ended up setting up a job that would materialize JSON fields out of a JSON blob and into materialized columns, based on query patterns against the keys in the JSON blob. Then, once those materialized columns were created, we would just route queries to them at runtime if they were available. This saved us a _ton_ on CPU and IO utilization. Even though ClickHouse uses some really fast SIMD JSON functions, the best way to make a computer go faster is to make it do less, and this new JSON type does exactly that. It's so turnkey!

https://posthog.com/handbook/engineering/databases/materiali...
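
For anyone curious, the pattern looks roughly like this (a minimal sketch; the table and key names are invented, not PostHog's actual schema):

```sql
-- Events table storing the raw JSON blob as a plain string.
CREATE TABLE events
(
    ts         DateTime,
    properties String
)
ENGINE = MergeTree
ORDER BY ts;

-- Pull a frequently-queried key out of the blob into its own column.
-- It's computed at insert/merge time, so reads skip JSON parsing entirely.
ALTER TABLE events
    ADD COLUMN mat_plan String
    MATERIALIZED JSONExtractString(properties, 'plan');
```

Queries that would have computed JSONExtractString(properties, 'plan') at read time get routed to mat_plan instead.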

The team over at ClickHouse Inc., as well as the community behind it, moves surprisingly fast. I can't recommend it enough, and I'm excited for everything else on the roadmap, especially what's on the horizon with Parquet and Iceberg support.

ramraj07 10 hours ago

Great to see it in ClickHouse.

Snowflake released a white paper back in its pre-IPO days that mentioned this same feature (secretly exploding JSON into columns). It explains why Snowflake feels faster than it should: they've quietly done a lot of amazing things and just offered them as a polished product, like Apple.

statictype 41 minutes ago

Do you have a link to the Snowflake whitepaper?

breadwinner 8 hours ago

If you're evaluating ClickHouse, take a look at Apache Pinot as well. ClickHouse was designed for single-machine installations, although it has been enhanced to support clusters. But this support is lacking; for example, if you add nodes, it is not easy to redistribute existing data. Pinot is much easier to scale horizontally. Also take a look at Pinot's star-tree indexes [1]. If you're doing multi-dimensional analysis (pivot tables etc.), there is a huge difference in performance if you take advantage of star-tree.

[1] https://docs.pinot.apache.org/basics/indexing/star-tree-inde...
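
For a sense of what that looks like, a star-tree index is declared in the Pinot table config, roughly like this (a sketch based on the docs above; dimension and metric names are illustrative):

```json
{
  "tableIndexConfig": {
    "starTreeIndexConfigs": [
      {
        "dimensionsSplitOrder": ["country", "browser", "locale"],
        "skipStarNodeCreationForDimensions": [],
        "functionColumnPairs": ["COUNT__*", "SUM__impressions"],
        "maxLeafRecords": 10000
      }
    ]
  }
}
```

Pinot pre-aggregates along the dimension split order, so group-bys over those dimensions read small pre-computed aggregates instead of scanning raw rows.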

zX41ZdbW 7 hours ago

> ClickHouse was designed for single-machine installations

This is incorrect. ClickHouse was designed for distributed setups from the beginning, including cross-DC installations. It was used on large production clusters even before it was open-sourced; when it became open source in June 2016, the largest cluster was 394 machines across 6 data centers, with 25 ms RTT between the most distant ones.

cvalka 5 hours ago

Absolutely incorrect. ClickHouse was created by Yandex, and it has been cluster-ready from day one.

haolez 8 hours ago

What's the use case? Analytics on humongous quantities of data? Something besides that?

breadwinner 8 hours ago

The use case is "user-facing analytics". For example, consider ordering food from Uber Eats: you have thousands of concurrent users, latency should be in milliseconds, and things like delivery time estimates must be updated in real time.

Spark can do analysis on huge quantities of data, and so can Microsoft Fabric. What Pinot offers that those tools don't is extremely low latency (milliseconds vs. seconds), high concurrency (thousands of queries per second), and the ability to update data in real time.

Excellent intro video on Pinot: https://www.youtube.com/watch?v=_lqdfq2c9cQ

listenallyall 8 hours ago

I don't think Uber's estimated time-to-arrival is a statistic that a database vendor, or development team, should brag about. It's horribly imprecise.

akavi 6 hours ago

It's also not something that a (geo)sharded Postgres DB with the appropriate indexes couldn't handle with aplomb. The number of orders to a given restaurant can't be more than a dozen a minute or so.
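
A minimal sketch of that point (schema invented for illustration), where a composite index keeps the hot-path query a short range scan within a shard:

```sql
CREATE TABLE orders (
    id            bigserial PRIMARY KEY,
    restaurant_id bigint      NOT NULL,
    created_at    timestamptz NOT NULL DEFAULT now()
);

-- Shard by restaurant_id (or its region); within a shard, this index
-- makes recent-orders lookups an index range scan.
CREATE INDEX orders_by_restaurant_recent
    ON orders (restaurant_id, created_at DESC);

-- "How busy is restaurant 42 right now?"
SELECT count(*)
FROM orders
WHERE restaurant_id = 42
  AND created_at > now() - interval '15 minutes';
```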

SoftTalker 2 hours ago

Especially as restaurants have a limit on their capacity to prepare food; you can't just spin up another instance of a staffed kitchen. Do these mobile-food-ordering apps include any kind of backoff on order acceptance, e.g. "Joe's Diner is too busy right now, do you want to wait or try someplace else?"

cyanydeez 6 hours ago

What about its ability to choose pricing based on source-destination and projected incomes?

whalesalad 7 hours ago

I thought “humongous quantities of data” was a baseline assumption for a discussion involving ClickHouse et al.?

peteforde 1 hour ago

I admit that I didn't read the entire article in depth, but I did my best to meaningfully skim-parse it.

Can someone briefly explain how or if adding data types to JSON - a standardized grammar - leaves something that still qualifies as JSON?

I have no problem with people creating supersets of JSON, but if my standard lib JSON parser can't read your "JSON" then wouldn't it be better to call it something like "CH-JSON"?

If I am wildly missing something, I'm happy to be schooled. The end result certainly sounds cool, even though I haven't needed ClickHouse yet.

chirau 21 minutes ago

The article is about the internal storage mechanics of ClickHouse and how it optimizes the handling of JSON data behind the scenes. The Dynamic and Variant data types discussed there are part of ClickHouse's internal mechanisms to improve performance, specifically for columnar storage of JSON data. The optimizations just help ClickHouse process and store data more efficiently.

The data remains standard JSON, so standard JSON parsers wouldn't be affected; the optimizations are part of the storage layer, not the JSON structure itself.
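
Concretely, usage looks something like this (adapted from the article; the type was experimental at the time, and the gating setting may differ by version):

```sql
SET allow_experimental_json_type = 1;

CREATE TABLE test (json JSON) ENGINE = Memory;

-- What you insert (and read back) is ordinary JSON text...
INSERT INTO test VALUES ('{"a": {"b": 42}, "c": [1, 2, 3]}');

-- ...while each path is stored as a native subcolumn you can query.
SELECT json.a.b FROM test;
```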

lemax 54 minutes ago

As far as I understand, they're talking about the internal storage mechanics of ClickHouse. These aren't user-exposed JSON data types; they just power the underlying optimizations being introduced.