263 points by tosh 2 years ago | 52 comments
wskish 2 years ago
After working through several projects that used a local hnswlib index alongside different databases for text and vector persistence, I integrated hnswlib with SQLite to create an embedded vector search engine that easily scales to millions of embeddings. For self-hosted situations with under 10M embeddings and less-than-insane throughput, I think this combo is hard to beat.
Labo333 2 years ago
I'm really happy to see `hnswlib` as a Python dependency since I'm the one who implemented PyPI support: https://github.com/nmslib/hnswlib/pull/140
wskish 2 years ago
hnswlib's implementation of HNSW is faster than Faiss's. Faiss has other index methods that are faster in some cases, but they're more complex as well.
nl 2 years ago
Since lots of people don't seem to understand how useful these embedding libraries are here's an example. I built a thing that indexes bouldering and climbing competition videos, then builds an embedding of the climber's body position per frame. I then can automatically match different climbers on the same problem.
It works pretty well. Since the body positions are 3D it works reasonably well across camera angles.
The biggest problem is getting the embedding right. I simplified a lot above: I actually need to embed the problem shape itself as well, otherwise it matches too well; you get frames of people in identical positions but on different problems!
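A toy illustration of that matching step (everything here is hypothetical; real pose and problem embeddings would come from a model): concatenate a problem embedding onto each frame's pose embedding, normalize, and do cosine-similarity nearest neighbors, so frames only match when both the body position and the problem agree.

```python
import numpy as np

rng = np.random.default_rng(0)
pose_dim, problem_dim = 64, 16

def frame_embedding(pose_vec, problem_vec):
    """Concatenate pose and problem embeddings, then L2-normalize."""
    v = np.concatenate([pose_vec, problem_vec])
    return v / np.linalg.norm(v)

# Index: embeddings for frames drawn from many videos (random stand-ins).
index = np.stack([frame_embedding(rng.standard_normal(pose_dim),
                                  rng.standard_normal(problem_dim))
                  for _ in range(1000)])

# Query with one frame; cosine similarity reduces to a dot product
# because every embedding is unit-normalized.
query = index[42]
scores = index @ query
best = np.argsort(-scores)[:5]  # top-5 most similar frames
print(best)
```

At real scale you'd put these vectors in hnswlib or Faiss instead of a dense matrix, but the scoring is the same.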
antman 2 years ago
For anyone else: you pass it directly in the metadata; see https://github.com/jiggy-ai/hnsqlite/blob/main/test/test_col...
fzliu 2 years ago
You're right that it's a bit heavyweight, so we're working on making pub/sub and the other cluster components lighter and more efficient overall.
mshachkov 2 years ago
[0] https://github.com/facebookresearch/faiss/pull/2521
[1] https://github.com/rapidsai/raft
txtai 2 years ago
For example: SELECT id, text, date FROM txtai WHERE similar('machine learning') AND date >= '2023-03-30'
GitHub: https://github.com/neuml/txtai
This article is a deep dive on how the index format works: https://neuml.hashnode.dev/anatomy-of-a-txtai-index
ttt3ts 2 years ago
Or are you referring to the ability to store data in addition to the vectors? In which case, you can pair any time tested DB with the index avoiding the hype of DBs that might be gone in a year.
fzliu 2 years ago
Most vector databases focus on ANN because of scale. Once you get to around a million vectors or so, it becomes prohibitively expensive to perform brute-force querying and search.
amrb 2 years ago
https://python.langchain.com/en/latest/modules/indexes/vecto...
fzysingularity 2 years ago
I especially like their index-factory models. Once you figure out how to build it properly, you can easily push beyond 100M vectors (512-dim) on a single reasonably beefy node. Memory-mapping, sub-20ms latencies on 10M+ vectors, bring-your-own training sampling strategies, configurable memory-usage, PQ, the list goes on. Once you have this, distributing it across nodes becomes trivial and allows you to keep scaling horizontally.
Not sure if others have used their GPU bindings, but being able to train about 10x faster on your data is a game-changer for rapid-experimentation and evaluation, especially when you need to aggressively quantize at this scale. Also, the fact that you have an extremely portable GPU-trained index that you can run on a lightweight-CPU (potentially even a lambda) is very compelling to me.
That said, I'd love to see Faiss ported to the browser (using WASM) - if any of this sounded useful or intriguing, DM me, would love to share notes and learn more about how folks are using Faiss today.
Fiahil 2 years ago
I see some French names in the author list. The joke was intended, well done! :)
leobg 2 years ago
…written in Rust.
jeadie 2 years ago
There are some platforms and open source tools that handle it end to end. https://github.com/marqo-ai/marqo is one example that is both open source and has a cloud offering.
gk1 2 years ago
Usually folks use a vector database alongside a doc store like Postgres, Snowflake, Elastic…