PGVector's Missing Features

refibrillator · on Sept 14, 2024

> Pgvector is very slow, seconds to 10’s of seconds, on filter and order by queries

This is an absurd claim to make with no qualification on index type, number of rows, etc. Really undermines the authority of the post, even though the other points are fair albeit obvious from the docs (pgvector has no keyword or hybrid search, negated words, or match highlighting).

In comparison pgvecto.rs does look more promising in some of these aspects, some docs here: https://docs.pgvecto.rs/faqs/comparison-pgvector.html

skeptrune · on Sept 14, 2024

Agreed. I added a disclaimer - https://github.com/devflowinc/trieve-website/commit/010aeab3...

nikcub · on Sept 14, 2024

pgvectorscale from the timescaledb team introduces StreamingDiskANN indexes and compression to pgvector:

https://github.com/timescale/pgvectorscale

the sales blurb:

> On a benchmark dataset of 50 million Cohere embeddings with 768 dimensions each, PostgreSQL with pgvector and pgvectorscale achieves 28x lower p95 latency and 16x higher query throughput compared to Pinecone's storage optimized (s1) index for approximate nearest neighbor queries at 99% recall, all at 75% less cost when self-hosted on AWS EC2.

been using it in dev (~2.5M embeddings to date) and so far, pretty great

edit: HN announcement + thread from a couple of months ago:

https://news.ycombinator.com/item?id=40646276

viraptor · on Sept 14, 2024

I'm confused by this one:

> Postgres pgvector does not come up with a way to highlight the keyword snippets being matched on.

Yeah... That's just not what it's for. But you can do that post processing yourself if you want to - there's lots of different ways to achieve it. I'm not sure that's really a missing feature, since it doesn't belong to a vector database.

skeptrune · on Sept 14, 2024

> missing some key features desireable in *search solutions*

Agree that sub-sentence highlights are somewhat out of scope for a pure vector db, but I was specifically talking about missing features in the context of a search solution which encompasses this feature imo.

If I was integrating search into my product, I would not want to build and maintain a highlight algo myself. Especially now that I have experience doing so :( . One of the more complicated things we had to build and I still don't think it's ideal.

hakanderyal · on Sept 14, 2024

Filtering issue also prevents shared database multi-tenant setups, which is a blocker by itself for many.

I would love to not having to run another service for vector search, but it's just not there yet.

avthar · on Sept 14, 2024

There's actually quite a few ways to do shared database multi-tenant setups with pgvector. You can have separate out tenants into separate tables, schema, and even logical databases within the same database service itself.

Here's an overview of the methods [1] (all except separate database of course)

[1]: https://x.com/avthars/status/1832915959406842241

hakanderyal · on Sept 15, 2024

Yes, I know about all of them but at that point just running qdrant becomes preferable from a cost/benefit standpoint for my use cases.

Edit: Just noticed we had this discussion on X with you already.

tucnak · on Sept 14, 2024

Is this a joke? You don't have to put everything in 1 table.

avthar · on Sept 14, 2024

While I value the depth in which the blog authors went into, some of these are not problems with pgvector but problems with similarity search itself and would be problems with other vector databases as well. There are also solutions to the problems describe that the authors do not mention that I think any reader should take into account.

> 1. Required and Negated Words

This is not a bug of pgvector but a bug in embedding models and simple similarity search itself. You'd run into this issue on when doing RAG with any vectordb or ANN search library. You could probably solve this with query expansion, running multiple similarity searches in parallel, and doing filtering of results containing the required theme vs trying to get this all in a single search.

> 2. Explainability With Highlights

This is a misunderstanding of the purpose of embeddings based semantic search. The point of semantic search is not to match on exact keywords but to match on the /meaning/. If you want exact keyword matching, use full text search. This is something that can also be solved by hybrid search, combining full text search and semantic search and using a re-ranker. PostgreSQL has built in FTS with tsvector.

> 3. Performant Filters and Order By’s

The authors do not disclose any details about what index they use here or what kind of filtering they are trying to do. As other commenters point out, the StreamingDiskANN in the pgvectorscale extension [0] (complement to pgvector, you can use them together) improves on performance and accuracy of filtering vs pgvector HNSW (see details in [1]).

>4. Support for sparse vectors, BM25, and other inverse document frequency search modes

This is probably the most fair point in the post. But there exists projects like pg_search from ParadeDB which bring bm_25 to Postgres and help solve this [2]

Lastly, I respect companies trying to provide real-world examples of the trade-offs of different systems, and so thank the authors for sharing their experience and spurring discussion.

Disclaimer: I work at Timescale, where we offer pgvector, and also made other extensions for AI/ vector workloads on PostgreSQL, namely pgvectorscale, and pgai. I've tried to be as even in my analysis as possible but as with everything on the internet, you can make up your own mind and decide for yourself.

[0]: https://github.com/timescale/pgvectorscale/ [1]: https://www.timescale.com/blog/how-we-made-postgresql-as-fas... [2] https://github.com/paradedb/paradedb

skeptrune · on Sept 14, 2024

I tend to think "semantic" is a better term than "similarity" for describing dense vector search. Question to answer pairs are present in the training data which should discourage "similarity" thinking by itself. Personally, it feels like there was a shift to using "similarity" recently and I don't know why.

IDF keyword matching oriented techniques are a lot closer to "similarity" search than dense vector ones.

Dense vector search excels uniquely well for queries containing a semantic concept like a comparison with a query such as "A vs. B" [1]. Results will shift towards comparisons in general over results containing tokens A or B. Bag-of-words model where all the tokens collapse into a single vector is ideal for anything where you're trying to get an idea more than a set of tokens.

> 1. solve this with query expansion, running multiple similarity searches..

It would be a lot easier to solve by just having the required/negated word feature. That's really my point in the blog. Having to experiment and do newfangled things with high latency penalties for this common query pattern is something to seriously consider when choosing pgvector.

> Explainability with highlights

It's rare that a semantic search does not match tokens to some extent. Example imaged in the blog is actually from a dense vector query. Dense vector search isn't that magical and retains some token-level precision. Also, you can use word-level embeddings for the highlights to make them more semantically accurate. However, in our system, we use jaro-winkler distance.

In a semantic search context, you do want something matching on meaning, but when it doesn't work well, the information from highlights helps you refactor and get it closer to ideal.

> Sparse vectors, BM25, and other IDF modes

I'm a fan of pg_search (and tantivy), but SPLADE is a significant improvement and I think it's a big loss to not have easy access to it.

Appreciate that you enjoyed the post and thank you for starting a discussion on it!

[1]: https://hn.trieve.ai/about