154 points by thebuilderjr 6 days ago | 47 comments
kristjansson 3 days ago
https://www.youtube.com/playlist?list=PLSE8ODhjZXjZc2AdXq_Lc...
GardenLetter27 3 days ago
PartiallyTyped 3 days ago
jamesblonde 3 days ago
It reminds me of 15 years ago, when there was JDBC/ODBC for data. Then, when data volumes increased, specialized databases became viable - graph, document, JSON, key-value, etc.
I don't see SQL and Spark hammers keeping their ETL monopolies for much longer.
jitl 3 days ago
SQL, though, is going the distance. Feldera, for example, is SQL-based stream processing and uses DataFusion under the hood for some data wrangling. DuckDB is also very SQL.
I have my quibbles with SQL as a language but I would prefer SQL embedded in $myLanguage to needing to use Python or (shudder) Scala to screw around with data.
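To make the "SQL embedded in $myLanguage" idea concrete, here is a minimal sketch using Python's stdlib sqlite3 as a stand-in (not DataFusion or DuckDB); the table and column names are made up for illustration:

```python
# Sketch of SQL embedded in a host language. sqlite3 stands in for an
# embedded engine like DataFusion or DuckDB; the data is invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, n INT)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("a", 1), ("a", 2), ("b", 5)])

# The query stays SQL; control flow stays in the host language.
rows = conn.execute(
    "SELECT user, SUM(n) FROM events GROUP BY user ORDER BY user"
).fetchall()
print(rows)  # [('a', 3), ('b', 5)]
```

The appeal is exactly what the comment describes: the data wrangling is declarative SQL, while the surrounding program stays in whatever language you already use.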
hipadev23 3 days ago
ignoreusernames 3 days ago
spark.sql("SELECT explode(sequence(0, 10000))").write.parquet("sample_data")
spark.read.parquet("sample_data").groupBy($"col").count().count()
after running the code, you should see a /tmp/blockmgr-{uuid} directory that holds the exchange data.
bdndndndbve 3 days ago
ignoreusernames 3 days ago
I see your point, but that's only true within a single stage. Any operator that requires partitioning (groupBys and joins, for example) requires writing to disk.
> [...] which used to be a point of comparison to MapReduce specifically.
So each mapper in hadoop wrote partial results to disk? LOL, this was way worse than I remembered. It's been a long time since I've dealt with hadoop.
> Not ground-breaking nowadays but when I was doing this stuff 10+ years
I would say that it wouldn't have been ground-breaking 20 years ago either. I feel like hadoop's influence held up our entire field for years. Most of the stuff that arrow made mainstream, and that is being used by a bunch of the engines mentioned in this thread, has been known for a long time. It's like, as a community, we had blindfolds on. Sorry about the rant, but I'm glad the hadoop fog is finally dissipating.
hipadev23 2 days ago
https://people.csail.mit.edu/matei/papers/2010/hotcloud_spar...
62951413 2 days ago
pjmlp 3 days ago
francocalvo 3 days ago
What Spark has going for it is its ecosystem. Things like Delta and Iceberg are being written for Spark first. Look at PyIceberg, for example.
krapht 6 days ago
When I have small data that fits on my laptop, Pandas is good enough.
Maybe 10% of the time I have stuff that's annoyingly slow to run with Pandas; then I might choose a different library, but needing this is rare. Even then, you can solve 9 of that 10% by dropping down to numpy and picking a better algorithm...
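A toy illustration of what "dropping down to numpy" can look like (the data and names here are made up): replace per-row Python iteration with one vectorized expression.

```python
import numpy as np

# Invented example data.
prices = np.array([10.0, 25.0, 7.5, 40.0])
qty = np.array([3, 1, 10, 2])

# Row-at-a-time style, as a pandas .apply would effectively do:
slow = [p * q for p, q in zip(prices, qty)]

# Vectorized: one call into numpy's compiled loops.
fast = prices * qty

assert fast.tolist() == slow
```

Same result, but the vectorized version avoids the Python-level loop, which is usually where the "annoyingly slow" 10% lives.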
jitl 3 days ago
But, I can visit most rows in that dataset in about 4 hours if I use an OLAP data warehouse thing, the kind of thing you build on top of DataFusion.
threeseed 3 days ago
It’s largely for companies who can’t put everything in a single database because (a) they don’t control the source schema e.g. it’s a daily export from a SaaS app, (b) the ROI is not high enough to do so and (c) it’s not in a relational format e.g. JSON, Logs, Telemetry etc.
And with the trend toward SaaS apps it’s a situation that is becoming more common.
GardenLetter27 3 days ago
thebuilderjr 6 days ago
Hugsun 3 days ago
threeseed 3 days ago
For example, you can go through, say, 1% of your data and, for each column, see if you can coerce all of the values to a float, int, date, string, etc. From there you can set the Parquet schema with proper types.
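The sampling approach described above can be sketched like this; the function names, caster list, and 1% sample rate are illustrative, not from any particular library:

```python
from datetime import datetime

def try_coerce(value, caster):
    """Return True if caster accepts the value without raising."""
    try:
        caster(value)
        return True
    except (ValueError, TypeError):
        return False

# Candidate types, tried narrowest-first so "1" becomes int, not float.
CASTERS = [
    ("int", int),
    ("float", float),
    ("date", lambda v: datetime.strptime(v, "%Y-%m-%d")),
]

def infer_column_type(values):
    # First type every sampled value coerces to wins; else fall back to string.
    for name, caster in CASTERS:
        if all(try_coerce(v, caster) for v in values):
            return name
    return "string"

def infer_schema(rows, sample_rate=0.01):
    # Inspect only a sample of the rows, as the comment suggests.
    sample = rows[: max(1, int(len(rows) * sample_rate))]
    columns = zip(*sample)  # transpose rows into columns
    return [infer_column_type(col) for col in columns]
```

For instance, `infer_schema([("1", "2.5", "2020-01-01", "abc")] * 300)` would yield `["int", "float", "date", "string"]`, which you could then translate into a typed Parquet schema.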
RobinL 3 days ago
That's not right. There are many queries that run far faster in duckdb/datafusion than in (say) postgres, even with the overhead of pulling whole large tables prior to running the query. (Or use something like pg_duckdb.)
For certain types of queries these engines can be 100x faster.
More here: https://postgres.fm/episodes/pg_duckdb
netcraft 3 days ago
chatmasta 3 days ago
Another difference is that DuckDb is written in C++ whereas DataFusion is in Rust, so all the usual memory-safety and performance arguments apply. In fact DataFusion has recently overtaken DuckDb in Clickbench results after a community push last year to optimize its performance.
jitl 3 days ago
geysersam 3 days ago
Really? I don't see it near the top.
[CH benchmarks](https://benchmark.clickhouse.com/#eyjzexn0zw0ionsiqwxsb3leqi...)
alamb 2 days ago
Most of the ClickBench leaderboard is for database-specific file formats (that you first have to load the data into).
kalendos 3 days ago
https://benchmark.clickhouse.com/#eyJzeXN0ZW0iOnsiQWxsb3lEQi...
riku_iki 2 days ago
chatmasta 2 days ago
If DuckDb is the only query engine in your analytics stack, then it makes sense to use its specialized format. But that’s not the typical Lakehouse use case.
riku_iki 2 days ago
That benchmark is also not a typical lakehouse use case, since all the data is hosted locally, so it doesn't test a significant component of the stack.
chatmasta 2 days ago
TPC-H is okay but not Lakehouse specific. I’m not aware of any benchmarks that specifically test performance of engines under common setups like external storage or scalable compute. It would be hard to design one that’s easily reproducible. (And in fairness to Clickbench, it’s intentionally simple for that exact reason - to generate a baseline score for any query engine that can query tabular data).
alamb 2 days ago
If you are looking for the nicest "run SQL on local files" experience, DuckDB is pretty hard to beat
Disclaimer: I am the PMC chair of DataFusion
There are some other interesting FAQs here too: https://datafusion.apache.org/user-guide/faq.html