Hacker News

Are these sorts of similarity searches useful for classifying text?



Embeddings are good at partitioning document stores at a coarse-grained level, and they can be very useful for documents where there's a lot of keyword overlap and the semantic differentiation is distributed. They're definitely not a good primary recall mechanism, though, and they often don't even pull their weight for their cost in hybrid setups, so it's worth running evals for your specific use case.

"12+38" won't embed close to "50": as you said, embeddings capture surface-level words ("a lot of keyword overlap"), not meaning. That's why, at small scale, I prefer a folder of files and a coding agent using grep/head/tail/Python one-liners.
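The grep-style approach the parent describes can be sketched in a few lines; the paths and file contents here are made up for illustration:

```python
from pathlib import Path

# Build a tiny "folder of files" (hypothetical demo data).
root = Path("/tmp/notes_demo")
root.mkdir(exist_ok=True)
(root / "a.txt").write_text("The managing director approved the budget.\n")
(root / "b.txt").write_text("Unrelated memo.\n")

# Exact keyword recall over the folder -- no embeddings involved.
hits = [p.name for p in sorted(root.glob("*.txt"))
        if "managing director" in p.read_text()]
print(hits)  # -> ['a.txt']
```

Exact matching like this never misses a literal keyword, which is precisely where embedding lookups can be unreliable.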

It depends entirely on the quality and suitability of the embedding vectors you provide. Even with a long embedding vector from a recent model, my estimate is that the classification will be better than random but not very accurate. You would typically do better by asking a large model directly for a classification. The good news is that it's often easy to create a small human-labeled dataset and estimate the confusion matrix for each approach.
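Estimating that confusion matrix from a small hand-labeled set is a one-liner with scikit-learn; the class names and labels below are invented for illustration:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical small human-labeled set: true labels vs. labels predicted
# by one approach (embedding k-NN, or a large model asked directly).
y_true = ["billing", "billing", "support", "support", "sales", "sales"]
y_pred = ["billing", "support", "support", "support", "sales", "billing"]

labels = ["billing", "sales", "support"]
# Rows are true classes, columns are predicted classes, in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
```

Running this for each approach on the same labeled set gives a direct, apples-to-apples comparison of their error patterns.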

Yes. This is known as a k-NN (k-nearest-neighbors) classifier. k-NN classifiers are usually worse than other simple classifiers, but they are trivial to update and use.

See e.g., https://scikit-learn.org/stable/auto_examples/neighbors/plot...
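A minimal k-NN classifier over embeddings, using scikit-learn as in the linked example; the 2-D vectors below are toy stand-ins for real embedding vectors:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy stand-ins for document embeddings (real ones would come from an
# embedding model) with a class label per document.
X = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
y = ["sports", "sports", "politics", "politics"]

# Cosine distance is the usual choice for embedding vectors.
clf = KNeighborsClassifier(n_neighbors=3, metric="cosine")
clf.fit(X, y)
print(clf.predict([[0.7, 0.3]])[0])  # nearest neighbors are mostly "sports"
```

Updating the classifier is just appending rows to X and y and refitting, which is why k-NN is so easy to keep current.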


Yes, also for semantic indexes. I use one for person/role/org matching, so that CEO == chief executive ~= managing director. Good when you have grey data and multiple lookup sources that use different terms.
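A sketch of that kind of role matching by nearest embedding. The canonical roles and their vectors below are invented; in a real pipeline they would come from an embedding model, which tends to place synonyms like "CEO" and "chief executive" near each other:

```python
import numpy as np

# Hypothetical precomputed embeddings for canonical role names.
canonical = {
    "chief executive officer": np.array([0.95, 0.05, 0.0]),
    "chief financial officer": np.array([0.1, 0.9, 0.1]),
    "software engineer":       np.array([0.0, 0.1, 0.95]),
}

def match_role(query_vec):
    """Return the canonical role whose embedding is most cosine-similar."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(canonical, key=lambda name: cos(canonical[name], query_vec))

# A toy vector standing in for the embedding of the query string "CEO":
print(match_role(np.array([0.9, 0.1, 0.05])))
```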

You could assign the cluster based on what the k nearest neighbors are, if there is a clear majority. The quality will depend on the suitability of your embeddings.
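That majority-vote assignment can be written out directly; the vectors are toy stand-ins for embeddings, and the min_votes threshold is an invented knob for the "clear majority" check:

```python
import numpy as np
from collections import Counter

def assign(query, points, labels, k=3, min_votes=2):
    """Assign `query` to the majority cluster among its k nearest
    neighbors, or return None when there is no clear majority."""
    dists = np.linalg.norm(points - query, axis=1)
    nearest = np.argsort(dists)[:k]
    label, votes = Counter(labels[i] for i in nearest).most_common(1)[0]
    return label if votes >= min_votes else None

points = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1]])
labels = ["a", "a", "b", "b"]
print(assign(np.array([0.05, 0.05]), points, labels))  # -> a
```

Returning None on a split vote gives a cheap abstention mechanism, which matters when the embeddings are only loosely suited to the clusters.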



