
Post co-author here. This is actually something that we are considering implementing in future versions of pgai Vectorizer. You point the vectorizer at database A but tell it to create and store embeddings in database B. You can always do joins across the two databases with Postgres FDWs, and it would solve issues of load management if those are concerns. Neat idea and one on our radar!


Post co-author here. Really appreciate the feedback.

Your point about HNSW being resource-intensive is one we've heard. Our team actually built another extension called pgvectorscale [1], which helps scale vector search on Postgres with a new index type (StreamingDiskANN). It has binary quantization (BQ) out of the box and can also store vectors on disk rather than only in memory.
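As a rough intuition for what binary quantization buys you (this is a generic BQ sketch, not pgvectorscale's actual SBQ implementation, which is more sophisticated), each float dimension collapses to a single sign bit, and distance becomes a cheap Hamming count:

```python
# Sketch of plain binary quantization (BQ): keep only the sign of each
# dimension, then compare vectors by Hamming distance on the bit strings.
# Illustrative only; pgvectorscale's SBQ goes beyond this.

def quantize(vec):
    """Collapse each float dimension to one bit (1 if positive)."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits

def hamming(a, b):
    """Count differing bits between two quantized vectors."""
    return bin(a ^ b).count("1")

# A tiny 4-dimensional example: q is much closer to v1 than to v2.
q  = quantize([0.9, -0.1, 0.4, -0.7])
v1 = quantize([0.8, -0.2, 0.3, -0.9])
v2 = quantize([-0.5, 0.6, -0.2, 0.1])
```

The appeal is that XOR-plus-popcount over bit strings is far cheaper than float distance over full vectors, at some cost in recall.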

Another practice I've seen work well is for teams to use a read replica to serve application queries and reduce load on the primary database.

To answer your third question, if you combine Pgai Vectorizer with pgvectorscale, the limitations around filtered search in pgvector HNSW are actually no longer present. Pgvectorscale implements streaming filtering, ensuring more accurate filtered search with Postgres. See [2] for details.

[1]: https://github.com/timescale/pgvectorscale [2]: https://www.timescale.com/blog/how-we-made-postgresql-as-fas...


Thanks for your answer. I hear you on using a read replica to serve embedding-based queries, but I worry there are lots of cases where that breaks down in practice: presumably you still need to do a bunch of IO on the primary to support insertion, and presumably rebuilding an index (e.g. to test out new hyperparameters) isn't cheap. At least you can offload the memory requirements of reading big chunks of your graph onto the follower, though.

Cool to see the pgvectorscale stuff; it sounds like the approach for filtering is not dissimilar to the direction that the pgvector team are taking with 0.8.0, although the much-denser graph (relative to HNSW) may mean the approach works even better in practice?


So… maybe 15 or 20 years ago I had set up MySQL servers such that some replicas had different indexes. MySQL only had what we would now call logical replication.

So after setting up replication and getting it going, I would alter the tables to add indexes useful for special purposes, including full-text indexes, which I did not want being built on the master or other replicas.

I imagine, but cannot confirm, that you could do something similar with PostgreSQL today.


Yeah, logical replication is supported on PostgreSQL today and would support adding indices to a replica. I am not sure if that works in this case, though, because what's described here isn't just an index.


Post co-author here. The point is a little nuanced, so let me explain:

You are correct in saying that you can store embeddings and source data together in many vector databases. We actually point this out in the post. The main point is that they are not linked but merely stored alongside each other: if one changes, the other does not automatically change, leaving the relationship between the two stale.

The idea behind Pgai Vectorizer is that it actually links embeddings with the underlying source data so that changes in source data are automatically reflected in embeddings. This is a better abstraction, and it removes the burden on the engineer of ensuring embeddings stay in sync as their data changes.


I know that in Chroma this is supported out of the box with 0 lines of code. I'm pretty sure it's supported everywhere else in no more than 3 lines of code.


This is also the case with Weaviate (as you assumed). If you update the value of any previously vectorized property, Weaviate generates new vectors automatically for you.


As far as I can tell, Chroma can only store chunks, not the original documents. This is from your docs: `If the documents are too large to embed using the chosen embedding function, an exception will be raised`.

In addition, it seems that embedding happens at ingest time. So if, for example, the OpenAI endpoint is down, the insert will fail. That, in turn, means your users need a retry mechanism and a queuing system: all the complexity we describe in our blog.

Obviously, I am not an expert in Chroma. So apologies in advance if I got anything wrong. Just trying to get to the heart of the differences between the two systems.


Chroma certainly doesn't have the most advanced API in this area, but you can for sure store chunks or documents; it's up to you. If your document is too large to generate embeddings in a single forward pass, then yes, you do need to chunk in that scenario.

Oftentimes, though, even if the document does fit, you choose to chunk anyway, or further transform the data with abstractive/extractive summarization techniques to improve your search dynamics. This is why I'm not sure the complexity noted in the article is relevant in anything beyond a "naive RAG" stack. How it's stored or linked is an issue to some degree, but the bigger, more complex smell is in what happens before you even get to the point of inserting the data.

For more production-grade RAG, just blindly inserting embeddings wholesale for full documents is rarely going to get you great results (this varies a lot between document sizes and domains). So as a result, you're almost always going to be doing ahead-of-time chunking (or summarization/NER/etc) not because you have to due to document size, but because your search performance demands it. Frequently this involves more than one embeddings model for capturing different semantics or supporting different tasks, not to mention reranking after the initial sweep.

That's the complexity that I think is worth tackling in a paid product offering, but the current state of the module described in the article isn't really competitive with the rest of the field in that respect IMHO.


(Post co-author) We absolutely agree that chunking is critical for good RAG. What I think you missed in our post is that the vectorizer allows you to configure a chunking strategy of your choice. So you store the full doc, but then the system will chunk and embed it for you. We don't blindly embed the full document.


I didn't miss that detail; I just don't think chunking alone is where the complexity lies, and the pgai feature set isn't really differentiated from other offerings in that context. My commentary about full documents was responding directly to your comment here in this thread more than to the article (you claimed Chroma can only insert chunks, which isn't accurate, and I expanded from there).


This is true only in the trivial case where the entire document fits in a single chunk, correct?

That seems like a meaningful distinction.


Yes that is correct, but my position (which perhaps has been poorly-articulated) is that in the non-trivial instances, it is a distinction without difference in the greater context of the RAG stack and related pipelines.

Just allowing a chunking function to be defined and called at insertion time doesn't really alleviate the major pain points inherent to the process. It's a minor convenience, but in fact, as others have pointed out elsewhere in this thread, it's a convenience you can afford yourself in a handful of lines of code that you only ever have to write once.
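For what it's worth, that handful of lines might look something like this generic sliding-window chunker (the function name, parameters, and defaults are made up for illustration; this is not pgai's or any library's actual API):

```python
def chunk(text, size=512, overlap=64):
    """Split text into overlapping windows of at most `size` characters.

    Consecutive chunks share `overlap` characters so that sentences
    straddling a boundary still appear whole in at least one chunk.
    """
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

In practice you would usually chunk on sentence or paragraph boundaries rather than raw character offsets, but the shape of the code is the same.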


Good question. For most standalone vector search use cases, FAISS or a library like it is a good fit.

However, FAISS is not a database. It can store metadata alongside vectors, but it doesn't have the things you'd want in your app DB, like ACID compliance, non-vector indexing, and proper backup/recovery mechanisms. You're basically giving up all the DBMS capabilities.

For new RAG and search apps, many teams prefer just using a single app db with vector search capabilities included (Postgres, Mongo, MySQL etc) vs managing an app db and a separate vector db.


Hey HN! Post co-author here, excited to share our new open-source PostgreSQL tool that re-imagines vector embeddings as database indexes. It's not literally an index, but it functions like one, updating embeddings as source data gets added, deleted, or changed.

Right now the system only supports OpenAI as an embedding provider, but we plan to extend with local and OSS model support soon.

Eager to hear your feedback and reactions. If you'd like to file an issue or, better yet, a PR, you can do so here [1].

[1]: https://github.com/timescale/pgai


Hey, this is really cool! Thanks for the article and the tool itself.

One question - in the RAG projects we've done, most of the source data was scattered in various source systems, but wasn't necessarily imported into a single DB or Data Lake. For example, building an internal Q&A tool for a company that has knowledge stored in services like Zendesk, Google Drive, an internal company Wiki, etc.

In those cases, it made sense to not import the source documents, or only import metadata about them, and keep the embeddings in a dedicated Vector DB. This seems to me to be a fairly common use case - most enterprises have this kind of data scattered across various systems.

How do you envision this kind of use case working with this tool? I may have missed it, but you mention things like working with images, etc, is your assumption that everyone is storing all of that data in Postgres?


Pretty smart. Why is the DB API the abstraction layer, though? Why not two columns and a microservice? I assume you are making async calls to get the embeddings?

I say that because it seems unusual. An index would suit sync better. But async things like embeddings, geocoding an address, or deciding whether an email is from a spammer feel like app-level stuff.


(post co-author here)

The DB is the right layer from an interface point of view, because that's where the data properties should be defined. We also use the DB for bookkeeping of what needs to be done, because we can leverage transactions and triggers to make sure we never miss any data. From an implementation point of view, the actual embedding does happen outside the database, in a Python worker or cloud functions.
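The trigger-plus-queue bookkeeping pattern described here can be illustrated with stdlib sqlite3 standing in for Postgres (a toy analogue only; pgai's real implementation uses Postgres triggers and an external worker, and `fake_embed` is a stand-in for the actual embedding call):

```python
import sqlite3

# Triggers record every inserted/updated row in a work queue inside the
# same transaction as the write, so no change can be missed. A separate
# worker drains the queue out-of-band and (re)computes embeddings.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT);
    CREATE TABLE embed_queue (doc_id INTEGER);
    CREATE TRIGGER docs_ins AFTER INSERT ON docs
        BEGIN INSERT INTO embed_queue VALUES (NEW.id); END;
    CREATE TRIGGER docs_upd AFTER UPDATE OF body ON docs
        BEGIN INSERT INTO embed_queue VALUES (NEW.id); END;
""")

def fake_embed(text):
    """Stand-in for a real embedding API call."""
    return [float(len(text))]

def run_worker():
    """Drain the queue, embedding each queued doc exactly once."""
    ids = {r[0] for r in db.execute("SELECT doc_id FROM embed_queue")}
    out = {}
    for doc_id in ids:
        (body,) = db.execute("SELECT body FROM docs WHERE id=?", (doc_id,)).fetchone()
        out[doc_id] = fake_embed(body)
    db.execute("DELETE FROM embed_queue")
    return out

db.execute("INSERT INTO docs VALUES (1, 'hello world')")
db.execute("UPDATE docs SET body='hello postgres' WHERE id=1")
embeddings = run_worker()
```

The key design point is that enqueueing rides along with the write transactionally, while the slow, fallible embedding call is decoupled from it.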

Merging the embeddings and the original data into a single view allows the full feature set of SQL rather than being constrained by a REST API.


That is arguable, because while it is a calculated field, it is not a pure one (IO is required), not necessarily idempotent, not atomic, and not guaranteed to succeed.

It is certainly convenient for the end user, but it hides things. What if the API calls to OpenAI fail or get rate limited? How is that surfaced? Will I see that in my observability? Will queries just silently miss results?

If the DB did the embedding itself synchronously within the write, it would make sense. That would be more like Elasticsearch or a typical full-text index.


(co-author here) We automatically retry failures after a delay. We also log error messages in the worker (self-hosted) and have clear indicators in the cloud UI that something went wrong (with plans to add email alerts later).

The error handling is actually the hard part here. We don't believe that failing on inserts due to the endpoint being down is the right thing because that just moves the retry/error-handling logic upstream -- now you need to roll your own queuing system, backoffs etc.
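The upstream logic being avoided here is the familiar retry-with-backoff loop, something like the following (a hypothetical sketch of what application teams end up writing themselves, not pgai's worker code):

```python
import random
import time

def call_with_backoff(fn, retries=5, base=0.5, cap=30.0):
    """Call fn(), retrying on any exception with capped exponential
    backoff plus full jitter. Re-raises after the final attempt."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise
            delay = min(cap, base * 2 ** attempt) * random.random()
            time.sleep(delay)
```

And this sketch only covers retries; a durable queue for changes made while the endpoint is down is a separate piece you would also need.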


Thanks for the reply. These are compelling points.

I agree not to fail on insert too by the way. The insert is sort of an enqueuing action.

I was debating if a microservice should process that queue.

Since you are a PaaS, the distinction might be almost moot; an implementation detail. (It would affect the API, though.)

However if Postgres added this feature generally it would seem odd to me because it feels like the DB doing app stuff. The DB is fetching data for itself from an external source.

The advantage is that it is one less thing for the app to do, and it maybe handles a chore many teams otherwise have to roll their own code for.

A downside is that if I want to change how this is done, I probably can't. Say I have data residency or security requirements that affect the data I want to encode.

I think there is much to consider. Probably the "why not both" meme applies, though: use the built-in feature if you can, and roll your own where you can't.


Thank you for sharing this! I have one question: Is there any plan to add support for local LLM / embeddings models?


"Right now the system only supports OpenAI as an embedding provider, but we plan to extend with local and OSS model support soon."

In the post you responded to


Haha I feel so dumb now. Thank you!


This question keeps popping up, but I don't get it. Everyone and their dog has an OpenAI-compatible API. Why not just serve a local LLM and put `127.0.0.1 api.openai.com` in your hosts file?

I mean why is that even a question? Is there some fundamental difference between the black box that is GPT-* and say, LLaMA, that I don't grok?
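For reference, the hosts-file syntax puts the address first:

```
# /etc/hosts: address, then hostname
127.0.0.1    api.openai.com
```

Note that a client will still attempt TLS against api.openai.com, so in practice most OpenAI-compatible local servers are used by pointing the client's base URL at them instead of rewriting DNS.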


This is super cool! One suggestion for the blog: I would put "re-imagines vector embeddings as database indexes. It's not literally an index but it functions like one to update embeddings as source data gets added, deleted or changed." as a tl/dr at the top.

It wasn't clear to me why this was significantly different from using pgvector until I read that. It makes the rest of the post (e.g. why you need the custom methods in a `SELECT`) make a lot more sense in context.


I'm doing something similar with Go + Postgres.


While I value the depth the blog authors went into, some of these are not problems with pgvector but problems with similarity search itself, and would be problems with other vector databases as well. There are also solutions to the problems described that the authors do not mention, which I think any reader should take into account.

> 1. Required and Negated Words

This is not a bug in pgvector but a bug in embedding models and simple similarity search itself. You'd run into this issue when doing RAG with any vector DB or ANN search library. You could probably solve this with query expansion: running multiple similarity searches in parallel and filtering results for the required theme, rather than trying to get it all in a single search.
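The workaround sketched above (multiple expanded searches plus post-filtering on required/negated words) could look roughly like this, with `search` standing in for whatever vector search call you use (names and parameters are illustrative, not any library's API):

```python
def filtered_search(search, queries, require=(), negate=(), k=10):
    """Run several expanded queries through `search` (which returns
    (doc, score) pairs), merge by best score per doc, then enforce
    required/negated words with simple substring checks."""
    best = {}
    for q in queries:
        for doc, score in search(q):
            if score > best.get(doc, float("-inf")):
                best[doc] = score
    hits = [
        (doc, score) for doc, score in best.items()
        if all(w in doc.lower() for w in require)
        and not any(w in doc.lower() for w in negate)
    ]
    return sorted(hits, key=lambda x: -x[1])[:k]
```

Note that post-filtering like this shrinks the candidate set, which is exactly why the comment warns about latency: you may need to over-fetch from each search to end up with k survivors.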

> 2. Explainability With Highlights

This is a misunderstanding of the purpose of embedding-based semantic search. The point of semantic search is not to match exact keywords but to match on the /meaning/. If you want exact keyword matching, use full-text search. This can also be solved by hybrid search: combining full-text search and semantic search and using a re-ranker. PostgreSQL has built-in FTS with tsvector.
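One common way to combine the two result lists in hybrid search is reciprocal rank fusion; a minimal sketch of RRF in general (not a specific Postgres feature, and the k=60 default is just the commonly used constant):

```python
def rrf(rankings, k=60):
    """Reciprocal rank fusion: fuse several ranked lists of doc ids
    (best first) by summing 1/(k + rank) per list, then re-sorting."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

fts = ["d1", "d3", "d2"]   # full-text (tsvector) ranking
sem = ["d2", "d1", "d4"]   # semantic (pgvector) ranking
fused = rrf([fts, sem])
```

RRF is popular for this because it works on ranks alone, so the incomparable scores of FTS and cosine similarity never need to be normalized against each other.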

> 3. Performant Filters and Order By’s

The authors do not disclose any details about what index they use here or what kind of filtering they are trying to do. As other commenters point out, the StreamingDiskANN index in the pgvectorscale extension [0] (a complement to pgvector; you can use them together) improves on the performance and accuracy of filtering vs pgvector's HNSW (see details in [1]).

>4. Support for sparse vectors, BM25, and other inverse document frequency search modes

This is probably the most fair point in the post. But there exists projects like pg_search from ParadeDB which bring bm_25 to Postgres and help solve this [2]

Lastly, I respect companies trying to provide real-world examples of the trade-offs of different systems, and so thank the authors for sharing their experience and spurring discussion.

Disclaimer: I work at Timescale, where we offer pgvector and have also made other extensions for AI/vector workloads on PostgreSQL, namely pgvectorscale and pgai. I've tried to be as even-handed in my analysis as possible, but as with everything on the internet, you can make up your own mind and decide for yourself.

[0]: https://github.com/timescale/pgvectorscale/ [1]: https://www.timescale.com/blog/how-we-made-postgresql-as-fas... [2] https://github.com/paradedb/paradedb


I tend to think "semantic" is a better term than "similarity" for describing dense vector search. Question-to-answer pairs are present in the training data, which should discourage "similarity" thinking by itself. Personally, it feels like there was a shift to using "similarity" recently, and I don't know why.

IDF keyword matching oriented techniques are a lot closer to "similarity" search than dense vector ones.

Dense vector search excels uniquely well for queries containing a semantic concept, like a comparison query such as "A vs. B" [1]. Results will shift towards comparisons in general over results merely containing the tokens A or B. A model where all the tokens collapse into a single vector is ideal for anything where you're trying to capture an idea more than a set of tokens.

> 1. solve this with query expansion, running multiple similarity searches..

It would be a lot easier to solve by just having the required/negated-word feature. That's really my point in the blog: having to experiment and do newfangled things with high latency penalties for this common query pattern is something to seriously consider when choosing pgvector.

> Explainability with highlights

It's rare that a semantic search does not match tokens to some extent. The example image in the blog is actually from a dense vector query. Dense vector search isn't that magical and retains some token-level precision. Also, you can use word-level embeddings for the highlights to make them more semantically accurate. In our system, however, we use Jaro-Winkler distance.

In a semantic search context, you do want matching on meaning, but when it doesn't work well, the information from highlights helps you refactor the query and get it closer to ideal.
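Fuzzy token highlighting of the kind described can be sketched with stdlib difflib standing in for Jaro-Winkler (the comment's actual metric); the function, threshold, and `<b>` markup here are all made up for illustration:

```python
from difflib import SequenceMatcher

def highlights(query, text, threshold=0.8):
    """Wrap words in `text` that fuzzily match any query word.
    Uses difflib's ratio as a stand-in for Jaro-Winkler similarity."""
    q_words = query.lower().split()
    marked = []
    for word in text.split():
        w = word.lower().strip(".,;:!?")
        hit = any(SequenceMatcher(None, w, qw).ratio() >= threshold
                  for qw in q_words)
        marked.append(f"<b>{word}</b>" if hit else word)
    return " ".join(marked)
```

The fuzzy threshold lets morphological variants ("index" vs "indexes") light up while unrelated words do not, which is the feedback loop described above for refining queries.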

> Sparse vectors, BM25, and other IDF modes

I'm a fan of pg_search (and Tantivy), but SPLADE is a significant improvement, and I think it's a big loss not to have easy access to it.

Appreciate that you enjoyed the post and thank you for starting a discussion on it!

[1]: https://hn.trieve.ai/about


There are actually quite a few ways to do shared-database multi-tenant setups with pgvector. You can separate tenants into separate tables, schemas, or even logical databases within the same database service itself.

Here's an overview of the methods [1] (all except separate database of course)

[1]: https://x.com/avthars/status/1832915959406842241


Yes, I know about all of them but at that point just running qdrant becomes preferable from a cost/benefit standpoint for my use cases.

Edit: Just noticed we had this discussion on X with you already.


See discussion for pgvectorscale in this HN post: https://news.ycombinator.com/item?id=40646276


Blog co-author here (PM at Timescale).

We're excited to release pgvectorscale. Our team built this extension to make PostgreSQL a better database for AI and to challenge the notion that PostgreSQL and pgvector are not performant for vector workloads.

pgvectorscale is open-source under the PostgreSQL license and free to use on any PostgreSQL database.

Here are two helpful companion reads to the post linked by OP: A benchmark of how PostgreSQL with pgvector and pgvectorscale performs against Pinecone [1], and a technical deep dive into pgvectorscale's StreamingDiskANN index and Statistical Binary Quantization implementations [2].

Questions and feedback welcome!

[1]: https://www.timescale.com/blog/pgvector-vs-pinecone/ [2]: https://www.timescale.com/blog/how-we-made-postgresql-as-fas...


Timescale recently released Timescale Vector [0], which adds a scalable search index (DiskANN) and efficient time-based vector search on top of all the capabilities of pgvector and vanilla PostgreSQL. We plan to add the document processing and embedding creation capabilities you discuss to our Python client library [1] next, but Timescale Vector integrates with LangChain and LlamaIndex today [2], which both have document chunking and embedding creation capabilities. (I work on Timescale Vector.)

[0]: https://www.timescale.com/blog/how-we-made-postgresql-the-be... [1]: https://github.com/timescale/python-vector [2]: https://www.timescale.com/ai/#resources

