263 points by tosh 2 years ago | 52 comments
wskish 2 years ago
After working through several projects that used a local hnswlib index alongside different databases for text and vector persistence, I integrated hnswlib with SQLite to create an embedded vector search engine that easily scales to millions of embeddings. For self-hosted situations with under 10M embeddings and less-than-insane throughput, I think this combo is hard to beat.
Labo333 2 years ago
I'm really happy to see `hnswlib` as a Python dependency since I'm the one who implemented PyPI support: https://github.com/nmslib/hnswlib/pull/140
wskish 2 years ago
hnswlib's implementation of HNSW is faster than Faiss's. Faiss has other index methods that are faster in some cases, but they're more complex as well.
nl 2 years ago
Since lots of people don't seem to understand how useful these embedding libraries are here's an example. I built a thing that indexes bouldering and climbing competition videos, then builds an embedding of the climber's body position per frame. I then can automatically match different climbers on the same problem.
It works pretty well. Since the body positions are 3D it works reasonably well across camera angles.
The biggest problem is getting the embedding right. I simplified a lot above: I actually need to embed the problem shape itself as well, otherwise it matches too well; you get frames of people in identical positions but on different problems!
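A toy illustration of that matching step (everything here is hypothetical; real pose and problem embeddings would come from a model): concatenate a problem embedding onto each frame's pose embedding, normalize, and do cosine-similarity nearest neighbors, so frames only match when both the body position and the problem agree.

```python
import numpy as np

rng = np.random.default_rng(0)
pose_dim, problem_dim = 64, 16

def frame_embedding(pose_vec, problem_vec):
    """Concatenate pose and problem embeddings, then L2-normalize."""
    v = np.concatenate([pose_vec, problem_vec])
    return v / np.linalg.norm(v)

# Index: embeddings for frames drawn from many videos (random stand-ins).
index = np.stack([frame_embedding(rng.standard_normal(pose_dim),
                                  rng.standard_normal(problem_dim))
                  for _ in range(1000)])

# Query with one frame; cosine similarity reduces to a dot product
# because every embedding is unit-normalized.
query = index[42]
scores = index @ query
best = np.argsort(-scores)[:5]  # top-5 most similar frames
print(best)
```

At real scale you'd put these vectors in hnswlib or Faiss instead of a dense matrix, but the scoring is the same.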
antman 2 years ago
For anyone else: you pass it directly in the metadata; see https://github.com/jiggy-ai/hnsqlite/blob/main/test/test_col...
fzliu 2 years ago
You're right that it's a bit heavyweight, so we're working on making pub/sub and the other cluster components lighter and more efficient overall.
mshachkov 2 years ago
[0] https://github.com/facebookresearch/faiss/pull/2521
[1] https://github.com/rapidsai/raft
txtai 2 years ago
For example: SELECT id, text, date FROM txtai WHERE similar('machine learning') AND date >= '2023-03-30'
GitHub: https://github.com/neuml/txtai
This article is a deep dive on how the index format works: https://neuml.hashnode.dev/anatomy-of-a-txtai-index
ttt3ts 2 years ago
Or are you referring to the ability to store data in addition to the vectors? In which case, you can pair any time tested DB with the index avoiding the hype of DBs that might be gone in a year.
fzliu 2 years ago
Most vector databases focus on ANN because of scale. Once you get to around a million vectors or so, it becomes prohibitively expensive to perform brute-force querying and search.
amrb 2 years ago
https://python.langchain.com/en/latest/modules/indexes/vecto...
fzysingularity 2 years ago
I especially like their index-factory models. Once you figure out how to build it properly, you can easily push beyond 100M vectors (512-dim) on a single reasonably beefy node. Memory-mapping, sub-20ms latencies on 10M+ vectors, bring-your-own training sampling strategies, configurable memory-usage, PQ, the list goes on. Once you have this, distributing it across nodes becomes trivial and allows you to keep scaling horizontally.
Not sure if others have used their GPU bindings, but being able to train about 10x faster on your data is a game-changer for rapid-experimentation and evaluation, especially when you need to aggressively quantize at this scale. Also, the fact that you have an extremely portable GPU-trained index that you can run on a lightweight-CPU (potentially even a lambda) is very compelling to me.
That said, I'd love to see Faiss ported to the browser (using WASM) - if any of this sounded useful or intriguing, DM me, would love to share notes and learn more about how folks are using Faiss today.
Fiahil 2 years ago
I see some French names in the author list. The joke was intended, well done! :)
leobg 2 years ago
…written in Rust.
jeadie 2 years ago
There are some platforms and open source tools that handle it end to end. https://github.com/marqo-ai/marqo is one example that is both open source and has a cloud offering.
gk1 2 years ago
Usually folks use a vector database alongside a doc store like Postgres, Snowflake, Elastic…