
Data Version Control

213 points by shcheklein 4 days ago | 50 comments

bramathon 4 days ago

I've used DVC for most of my projects for the past five years. The good thing is that it works a lot like git. If your scientists understand branches, commits and diffs, they should be able to understand DVC. The bad thing is that it works like git. Scientists often do not, in fact, understand or use branches, commits and diffs. The best thing is that it essentially forces you to follow Ten Simple Rules for Reproducible Computational Research [1]. Reproducibility has been a huge challenge on teams I've worked on.

[1] https://journals.plos.org/ploscompbiol/article?id=10.1371/jo...

dmpetrov 4 days ago

hi there! Maintainer and author here. Excited to see DVC on the front page!

Happy to answer any questions about DVC and our sister project DataChain https://github.com/iterative/datachain which does data versioning with slightly different assumptions: no file copies and built-in data transformations.

ajoseps 4 days ago

if the data files are all just text files, what are the differences between DVC and using plain git?

miki123211 4 days ago

DVC does a lot more than git.

It essentially makes sure that your results can reproducibly be generated from your original data. If any script or data file is changed, the parts of your pipeline that depend on it, possibly recursively, get re-run and the relevant results get updated automatically.

There's no chance of e.g. changing the structure of your original dataset slightly, forgetting to regenerate one of the intermediate models by accident, not noticing that the script to regenerate it doesn't work any more due to the new dataset structure, and then getting reminded a year later when moving to a new computer and trying to regen everything from scratch.

It's a lot like Unix make, but with the ability to keep track of different git branches and the data / intermediates they need, which saves you from needing to regen everything every time you make a new checkout, lets you easily exchange large datasets with teammates etc.

In theory, you could store everything in git, but then every time you made a small change to your scripts that e.g. changed the way some model works and slightly adjusted a score for each of ten million rows, your diff would be 10m LOC, and all versions of that dataset would be stored in your repo, forever, making it unbelievably large.
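To make that concrete, here's roughly what the make-like part looks like in a dvc.yaml (script and file names here are made up for illustration):

```yaml
stages:
  prepare:
    cmd: python prepare.py data/raw.csv data/clean.csv
    deps:
      - prepare.py
      - data/raw.csv
    outs:
      - data/clean.csv
  train:
    cmd: python train.py data/clean.csv model.pkl
    deps:
      - train.py
      - data/clean.csv
    outs:
      - model.pkl
```

Running `dvc repro` walks this DAG and re-executes only the stages whose deps changed since the last run; everything else is restored from cache.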

amelius 3 days ago

Sounds like it is more a framework than a tool.

Not everybody wants a framework.

JadeNB 3 days ago

> Sounds like it is more a framework than a tool.

> Not everybody wants a framework.

The second part of this comment seems strange to me. Surely nothing on Hacker News is shared with the expectation that it will be interesting, or useful, to everyone. Equally, surely there are some people on HN who will be interested in a framework, even if it might be too heavy for other people.

amelius 3 days ago

Just saying that what makes Git so appealing is that it does one thing well, and from this view DVC seems to be in an entirely different category.

stochastastic 3 days ago

It doesn’t force you to use any of the extra functionality. My team has been using it just for the version control part for a couple years and it has worked great.

woodglyst 3 days ago

This sounds a lot like the experimental project Jacquard [0] from Ink & Switch.

[0] https://www.inkandswitch.com/jacquard/notebook/

azinman2 4 days ago

So where do the adjusted 10M rows live instead? S3?

thangngoc89 4 days ago

DVC supports multiple remotes. S3 is one of them; there are also WebDAV, local FS, Google Drive, and a bunch of others. You can see the full list here [0]. Disclaimer: not affiliated with DVC in any way, just a user.

[0] https://dvc.org/doc/user-guide/data-management/remote-storag...

dmpetrov 4 days ago

In this case, you need DVC if:

1. Files are too large for Git and Git LFS.

2. You prefer using S3/GCS/Azure as storage.

3. You need to track transformations/pipelines on the files: cleaning up text files, training models, etc.

Otherwise, vanilla Git may be sufficient.
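For case 1 and 2, the basic workflow is just a few commands (bucket name below is a placeholder):

```shell
# Track a large file with DVC instead of committing it to git
dvc add data/images.zip           # writes a small data/images.zip.dvc pointer file
git add data/images.zip.dvc data/.gitignore
git commit -m "Track dataset with DVC"

# Point DVC at an S3 bucket as the default remote, then upload
dvc remote add -d storage s3://my-bucket/dvc-store
dvc push
```

Git only ever sees the small .dvc pointer file; the actual data lives in the remote.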

agile-gift0262 4 days ago

It's not just to manage file versioning. You can define a pipeline with different stages, plus the dependencies and outputs of each stage, and DVC will figure out which stages need running based on which dependencies have changed. Stages can also output metrics and plots, and DVC has utilities to expose, explore and compare those.
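A rough sketch of what a metrics/plots stage looks like in dvc.yaml (file names are illustrative):

```yaml
stages:
  evaluate:
    cmd: python evaluate.py model.pkl metrics.json
    deps:
      - evaluate.py
      - model.pkl
    metrics:
      - metrics.json:
          cache: false
    plots:
      - roc_curve.csv
```

Then `dvc metrics show` and `dvc metrics diff` let you compare numbers across commits or branches, and `dvc plots show` renders the plot files.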

johanneskanybal 3 days ago

I mostly consult as a data engineer, not ML ops, but I'm interested in some aspects of this. We have 10 years of parquet files from 300+ different Kafka topics and we're currently migrating to Apache Iceberg. We'll backfill on a need-only basis and it would be nice to track that with git. Would this be a good fit for that?

Another potential aspect would be tracking schema evolution in a nicer way than we currently do.

thx in advance, huge fan of anything-as-code and think it’s a great fit for data (20+ years in this area).

stochastastic 3 days ago

Thanks for making and sharing DVC! It’s been a big help.

Is there any support that would be helpful? I’ll look at the project page too.

dmpetrov 3 days ago

Thank you!

Just shoot an email to support and mention HN. I’ll read and reply.

dpleban 3 days ago

Great to see DVC being discussed here! As a tool, it’s done a lot to simplify version control for data and models, and it’s been a game-changer for many in the MLOps space.

Specifically, it's a genius way to store large files in git repos directly on any object storage without custom application servers like git-lfs or rewriting git from scratch...
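For context, the pointer files DVC commits to git are tiny YAML stubs; a .dvc file looks roughly like this (hash, size, and path below are placeholders):

```yaml
outs:
  - md5: 22a1a2931c8370d3aeedd7183606fd7f
    size: 1024000
    path: images.zip
```

The actual bytes live in a content-addressed cache on whatever object storage you point DVC at.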

At DagsHub [0], we've integrated directly with DVC for a looong time, so teams can use it with added features like visualizing and labeling datasets, managing models, running experiments collaboratively, and tracking everything (code, data, models, etc.) all in one place.

Just wanted to share that for those already using or considering DVC: there are some options to use it as a building block in a more end-to-end toolchain.

[0] https://dagshub.com

jiangplus 4 days ago

How does it compare to Oxen?

https://github.com/Oxen-AI/Oxen

gregschoeninger 3 days ago

Maintainer of Oxen here, we initially built Oxen because DVC was pretty painfully slow to work with, and had a lot of extra bells and whistles that we didn’t need. Under the hood we optimized the merkle tree structure, hashing algorithms, network protocols, etc to make it speedy when it came to large datasets. We have a pretty nice front end at https://oxen.ai for viewing and querying the data as well.

Happy to answer any thoughts or questions!

bagavi 3 days ago

Can this be used with GitHub? If yes, I would shift from dvc immediately

jFriedensreich 3 days ago

Never heard of Oxen, but it looks like a super interesting alternative. Would love to hear from someone who has experience with both.

My first impression: DVC is made to use with git, where arbitrary folders INSIDE your git repo are handled by DVC, whereas Oxen is an alternative for a separate data repo. Also, Oxen has lots of integration with dataframes and tabular, AI training and inference data that DVC is missing. On the other hand, DVC has a full DAG pipeline engine integrated, as well as import/export and pluggable backends.