
Hacker Remix

Don't use cosine similarity carelessly

429 points by stared 4 days ago | 95 comments

pamelafox 4 days ago

If you're using cosine similarity when retrieving for a RAG application, a good approach is to then use a "semantic re-ranker" or "L2 re-ranking model" to re-rank the results to better match the user query.

There's an example in the pgvector-python that uses a cross-encoder model for re-ranking: https://github.com/pgvector/pgvector-python/blob/master/exam...

You can even use a language model for re-ranking, though it may not be as good as a model trained specifically for re-ranking purposes.

In our Azure RAG approaches, we use the AI Search semantic ranker, which uses the same model that Bing uses for re-ranking search results.
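For anyone who wants to see the shape of it, here's a minimal sketch of that two-stage pattern (vector search first, cross-encoder re-ranking second) with sentence-transformers. The model name is just an illustrative choice, not necessarily the one the pgvector example uses:

    # Sketch: re-rank cosine-similarity hits with a cross-encoder.
    from sentence_transformers import CrossEncoder

    def rerank(query, candidates, top_k=5):
        # candidates: text chunks returned by the vector search
        model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
        scores = model.predict([(query, doc) for doc in candidates])
        ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
        return [doc for doc, _ in ranked[:top_k]]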

pamelafox 4 days ago

Another tip: do NOT store vector embeddings of nothingness: mostly whitespace, a solid image, etc. We've had a few situations with RAG data stores that accidentally ingested mostly-empty content (either text or image), and those dang vectors matched EVERYTHING. As I like to think of it, there's a bit of nothing in everything... so make sure that if you are storing a vector embedding, there is some amount of signal in that embedding.
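A cheap guard against that, as a sketch (the thresholds here are made up; tune them for your data): filter near-empty chunks out before they ever reach the embedding model.

    import re

    def has_signal(chunk: str, min_chars: int = 20, min_unique_tokens: int = 5) -> bool:
        # Reject chunks that are mostly whitespace or have almost no distinct tokens.
        text = chunk.strip()
        if len(text) < min_chars:
            return False
        tokens = re.findall(r"\w+", text.lower())
        return len(set(tokens)) >= min_unique_tokens

    chunks = ["   \n\n  ", "Invoice #123: total due 45.00 EUR by March 3", "----"]
    to_embed = [c for c in chunks if has_signal(c)]  # only the middle chunk survives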

variaga 4 days ago

Interesting. On a project I worked on (audio recognition for a voice-command system), we ended up going the other way and explicitly adding an encoding of "nothingness" (actually two: one for "silence" and another for "white noise") and special-casing them ("if either 'silence' or 'noise' is in the top 3 matches, ignore the input entirely").

This was to avoid the problem where, when we only had vectors for "valid" sounds and an input arrived that didn't match anything in the training set (a foreign language, a garbage truck backing up, a dog barking, ...), the model would still return some word as the closest match (there's always a vector with the highest similarity), and frequently do so with high confidence. In other words, even though the input didn't actually match anything in the training set, it would still be enough more like one known vector than any of the others that it passed most threshold tests, leading to a lot of false positives.
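Roughly that pattern, as a sketch (the reference vectors, the labels, and the top-3 rule are illustrations of what I described, not a real API):

    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    REJECT_LABELS = {"<silence>", "<white_noise>"}  # explicit encodings of "nothingness"

    def classify(input_vec, reference_vecs):
        # reference_vecs: dict of label -> vector, including the two reject entries
        ranked = sorted(reference_vecs.items(),
                        key=lambda kv: cosine(input_vec, kv[1]),
                        reverse=True)
        top3 = [label for label, _ in ranked[:3]]
        if REJECT_LABELS & set(top3):
            return None  # ignore the input entirely
        return top3[0]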

pbhjpbhj 4 days ago

That sounds like a problem with the embedding. Would you need to renormalise so that low-signal inputs could be well represented? A white square and a red square shouldn't carry different levels of detail. Depending on the purpose of the vector embedding, there should be a difference between images of mostly white pixels and partial images.

Disclaimer, I don't know shit.

pamelafox 4 days ago

I should clarify that I experienced these issues with text-embedding-ada-002 and the Azure AI vision model (based on Florence). I have not tested many other embedding models to see if they'd have the same issue.

refulgentis 4 days ago

FWIW I think you're right; we have very different stacks, and I've observed the same thing, with a much clunkier description than your elegant way of putting it.

I do embeddings on arbitrary websites at runtime, and had a persistent problem with the last chunk of a web page matching more things. In retrospect, it's obvious that the smaller the chunk was, the more it matched everything.

Full details: MSMARCO MiniLM L6V3 inferenced using ONNX on iOS/web/android/macos/windows/linux

mattvr 4 days ago

You could also work around this by adding a scaling transformation that normalizes and centers the raw embeddings (e.g. sklearn StandardScaler), fitted on some example data points from your data set. It might introduce some bias, but I've found this helpful in some cases with off-the-shelf embeddings.
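Something like this, if I understand the suggestion correctly (a sketch: fit the scaler on a sample of your own embeddings, then re-normalize so cosine still makes sense):

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    sample = np.random.randn(1000, 384)    # stand-in for embeddings of your own data
    scaler = StandardScaler().fit(sample)  # learns per-dimension mean and scale

    def recenter(vec):
        centered = scaler.transform(vec.reshape(1, -1))[0]
        return centered / np.linalg.norm(centered)  # back onto the unit sphere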

OutOfHere 3 days ago

Use horrible quality embeddings and get horrible results. No surprise there. ada is obsolete - I would never want to use it.

jhy 4 days ago

We used to have this problem in AWS Rekognition; a poorly detected face -- e.g. a blurry face in the background -- would match with high confidence against every other blurry face. We fixed that largely by adding specific tests against this [effectively] null vector. The same will work for text or other image vectors.
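A sketch of that kind of guard (the threshold, the helper names, and the way the "null" centroid is built are assumptions for illustration, not the actual Rekognition internals):

    import numpy as np

    def unit(v):
        return v / np.linalg.norm(v)

    # Hypothetical: centroid of embeddings of known-degenerate inputs
    # (blurry faces, blank crops, ...), computed offline.
    null_vector = unit(np.load("null_centroid.npy"))

    def is_degenerate(embedding, threshold=0.9):
        # Reject embeddings that look more like "nothing" than like a real face.
        return float(np.dot(unit(embedding), null_vector)) > threshold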

short_sells_poo 4 days ago

If you imagine a cartesian coordinate space where your samples are clustered around the origin, then a zero vector will tend to be close to everything because it is the center of the cluster. Which is a different way of saying that there's a bit of nothing in everything I guess :)

pilooch 4 days ago

Statistically, you want the retriever to be trained for cosine similarity. Vision LLM retrievers such as DSE do this correctly. No need for a reranker once that's done.

OutOfHere 3 days ago

Precisely. Ranking is a "smell" in this regard. They are using the ada embeddings, which I consider to be of poor quality.

antirez 4 days ago

I propose a different technique (rough code sketch at the end of this comment):

- Use a large context LLM.

- Segment documents into chunks of about 25% of the context window.

- With RAG, retrieve fragments from all the documents, then do a first-pass semantic re-ranking like this, sending to the LLM:

I have a set of documents I can show you to reply to the user question "$QUESTION". Please tell me, from the titles and best matching fragments, which document IDs you want to see to better reply:

[Document ID 0]: "Some title / synopsis. From page 100 to 200"

... best matching fragment of document 0...

... second best fragment ...

[Document ID 1]: "Some title / synopsis. From page 200 to 300"

... fragments ...

LLM output: show me 3, 5, 13.

New query, with the full documents attached, filling up to 75% of the context window:

"Based on the attached documents in this chat, reply to $QUESTION".

datadrivenangel 3 days ago

Slow/expensive. Good idea otherwise.

danielmarkbruce 3 days ago

But inference-time compute is the new hotness.

bjourne 4 days ago

So word vectors solve the problem that two words may never appear in the same context, yet can be strongly correlated. "Python" may never be found close to "Ruby", yet "scripting" is likely to be found in both their contexts so the embedding algorithm will ensure that they are close in some vector space. Except it rarely works well because of the curse of dimensionality.

Perhaps one could represent word embeddings as vertices, rather than vectors? Suppose you find "Python" and "scripting" in the same context. You draw a weighted edge between them. If you find the same words again you reduce the weight of the edge. Then to compute the similarity between two words, just compute the weighted shortest path between their vertices. You could extend it to pair-wise sentence similarity using Steiner trees. Of course it would be much slower than cosine similarity, but probably also much more useful.
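A toy version of that with networkx, treating the edge weight as a distance that shrinks each time a pair co-occurs, and similarity as the weighted shortest path (the names and the halving rule are made up for illustration):

    import itertools
    import networkx as nx

    G = nx.Graph()

    def observe_context(words):
        # Each co-occurrence shortens the edge (weight acts as a distance).
        for a, b in itertools.combinations(set(words), 2):
            if G.has_edge(a, b):
                G[a][b]["weight"] *= 0.5
            else:
                G.add_edge(a, b, weight=1.0)

    observe_context(["python", "scripting", "language"])
    observe_context(["ruby", "scripting", "language"])

    # "python" and "ruby" never co-occur, but are two short hops apart via "scripting".
    distance = nx.shortest_path_length(G, "python", "ruby", weight="weight")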

jsenn 4 days ago

You might be interested in HippoRAG [1] which takes a graph-based approach similar to what you’re suggesting here.

[1]: https://arxiv.org/abs/2405.14831

yobbo 4 days ago

Embeddings represent more than P("found in the same context").

It is true that cosine similarity is unhelpful if you expect it to be a distance measure.

[0,0,1] and [0,1,0] are orthogonal (cosine 0) but have euclidean distance √2, and 1/3 of vector elements are identical.

It is better if embeddings also encode angles and absolute and relative distances in some meaningful way. Testing only cosine ignores all the distances.

OutOfHere 3 days ago

Modern embeddings lie on the surface of a hypersphere, making Euclidean distance equivalent to cosine. And if they don't, I probably wouldn't want to use them.
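For unit vectors the relationship is exact: ||a - b||^2 = 2 - 2*cos(a, b), so the two metrics produce the same ranking. A quick check:

    import numpy as np

    a, b = np.random.randn(384), np.random.randn(384)
    a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)  # project onto the unit hypersphere

    cos_sim = float(np.dot(a, b))
    eucl_sq = float(np.sum((a - b) ** 2))
    assert np.isclose(eucl_sq, 2 - 2 * cos_sim)  # ||a - b||^2 == 2 - 2*cos(a, b)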

yobbo 3 days ago

True: on a hypersphere, cosine and Euclidean are equivalent.

But if random embeddings are Gaussian, they are distributed in a "cloud" around the hypersphere, so the two are not equal.

tgv 4 days ago

This was called an ontology or a semantic network. See e.g. OpenCyc (although it's rather more elaborate). What you propose is rather different from word embeddings, since it can't compare word features (think: connotations) or handle ambiguity, and discovering similarities symbolically is not a well-understood problem.

bambax 4 days ago

> In the US, word2vec might tell you espresso and cappuccino are practically identical. It is not a claim you would make in Italy.

True, and quite funny. This is an excellent, well-written and very informative article, but this part is wrongly worded:

> Let's have a task that looks simple, a simple quest from our everyday life: "What did I do with my keys?" [and compare it to other notes using cosine similarity]: "Where did I put my wallet" [=> 0.6], "I left them in my pocket" [=> 0.5]

> The best approach is to directly use LLM query to compare two entries, [along the lines of]: "Is {sentence_a} similar to {sentence_b}?"

(bits in brackets paraphrased for quoting convenience)

This will produce the same, or a "worse", result, as any LLM will respond that "Where did I put my wallet" is very similar to "What did I do with my keys?", while "I left them in my pocket" is completely dissimilar.

I'm actually not sure what the author was trying to get at here. You could ask an LLM 'is that sentence a plausible answer to the question?' and then it would work; but if you ask for pure 'likeness', it seems that in many cases LLMs' responses will be close to cosine similarity.

stared 4 days ago

Well, "Is {sentence_a} similar to {sentence_b}?" is the correct query when we care about some vague similarity of statements. In this case, we should go with something in the line "Is {answer} a plausible answer to the question {question}".

In any case, I see how the example "Is {sentence_a} similar to {sentence_b}?" breaks the flow. The original example was:

    {question}
    
    # A
    
    {sentence_A}
    
    # B

    {sentence_B}

As I now see, I overzealously simplified that. Thank you for your remark! I edited the article. Let me know if it is clearer for you now.
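(For completeness, a sketch of wiring the "plausible answer" framing to a model; `llm` is a stand-in for whatever chat-completion call you use, and the exact prompt in the article differs:)

    def plausible_answer_score(question, answer, llm):
        # Ask the model directly instead of comparing embedding vectors.
        prompt = (
            f'Is the following a plausible answer to the question "{question}"?\n\n'
            f'{answer}\n\n'
            'Reply with a single number between 0 and 1.'
        )
        return float(llm(prompt))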

echoangle 4 days ago

I also don’t see the problem: if I were asked to rank the sentences by similarity to the question, I wouldn’t rank a possible answer first. In what way is an answer to a question similar to the question?

Dewey5001 2 days ago

I believe the intention here was to highlight a use case where cosine similarity falls short, leading into the next section, which introduces alternatives. That said, I would appreciate more detail in the 'Extracting the right features' section; if someone has an example, I would love to see it.

deepsquirrelnet 4 days ago

> So, what can we use instead?

> The most powerful approach

> The best approach is to directly use LLM query to compare two entries.

Cross encoders are a solution I’m quite fond of: high performing and much faster. I recently put an STS cross encoder up on Hugging Face, based on ModernBERT, that performs very well.

sroussey 4 days ago

I had to look that up… for others:

An STS cross encoder is a model that uses the CrossEncoder class to predict the semantic similarity between two sentences. STS stands for Semantic Textual Similarity.

stared 4 days ago

Technically speaking, cross encoders are LLMs: they use the last layer to predict similarity (a single number) rather than the probability of the next token. They are faster than generative models only if they are simpler; otherwise, there is no performance gain (the last layer is negligible). In any case, even the simplest cross-encoders are more computationally intensive than approaches that use a dot product on pre-computed vectors.

That said, for many applications we may be perfectly fine with some fine-tuned BERT-like model rather than the newest AGI-like SoTA, just to check whether two products are vaguely similar and worth putting in each other's suggestions.

deepsquirrelnet 3 days ago

This is true, and I’ve done quite a bit with static embeddings. You can check out my wordllama project if that’s interesting to you.

https://github.com/dleemiller/WordLlama

There’s also model2vec doing some cool things as well in that area. So it’s cool to see recent progress in 2024/5 on simple static embedding models.

On the computational performance note: the cross encoder I trained using ModernBERT base is on par with the RoBERTa large model while being about 7-8x faster. Still way more complex than static embeddings, but much more capable on benchmark datasets too.

staticautomatic 4 days ago

Link please?

deepsquirrelnet 4 days ago

Here you go!

https://huggingface.co/dleemiller/ModernCE-base-sts

There’s also the large model, which performs a bit better.
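(If anyone wants to try it, a sketch of the usual sentence-transformers interface, assuming the model follows that convention:)

    from sentence_transformers import CrossEncoder

    model = CrossEncoder("dleemiller/ModernCE-base-sts")
    pairs = [("What did I do with my keys?", "I left them in my pocket"),
             ("What did I do with my keys?", "Where did I put my wallet")]
    scores = model.predict(pairs)  # one similarity score per pair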

janalsncm 4 days ago

Cross encoders still don’t solve the fundamental problem of defining similarity that the author is referring to.

Frankly, the LLM approach the author talks about in the end doesn’t either. What does “similar” mean here?

Given inputs A, B, and C, you have to decide whether A and B are more similar or A and C are more similar. The algorithm (or architecture, depending on how you look at it) can’t do that for you. Dual encoder, cross encoder, bag of words, it doesn’t matter.

deepsquirrelnet 3 days ago

I think what you’re getting at could be addressed a few ways. One is explainability: with an LLM, you can ask it to tell you why it chose one or the other.

That’s not practical for a lot of applications, but it can do it.

For the cross encoder I trained, I have a pretty good idea what similar means because I created a semi-synthetic dataset that has variants based on 4 types of similarity.

Perhaps not a perfect solution when you’re really trying to split hairs about what is more similar between texts that are all pretty similar, but not all applications need that level of specificity either.