
Their self-reported benchmarks have them outperforming Pinecone by 7x in queries per second: https://zvec.org/en/docs/benchmarks/

I'd love to see those results independently verified, and I'd also love a good explanation of how they're getting such great performance.


8K QPS is probably quite trivial on their setup with a 10M dataset. I rarely use comparably small instances & datasets in my benchmarks, but on 100M-1B datasets on a larger dual-socket server, 100K QPS was easily achievable in 2023: https://www.unum.cloud/blog/2023-11-07-scaling-vector-search... ;)

Typically, the recipe is to keep the hot parts of the data structure resident in CPU caches (SRAM) and to use a lot of SIMD. At the time of those measurements, USearch used ~100 custom kernels for different data types, similarity metrics, and hardware platforms. The upcoming release of the underlying SimSIMD micro-kernels project will push this number beyond 1000, so we should be able to squeeze out a lot more performance later this year.
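
To make that concrete, here is a rough sketch of the kind of SIMD kernel involved: a plain AVX2 dot product over float32 vectors (illustrative only, not an actual USearch/SimSIMD kernel; compile with -mavx2 -mfma). Real libraries dispatch among many such kernels per data type, metric, and CPU feature set.

    // Minimal AVX2 dot-product kernel for float32 vectors (illustrative sketch).
    #include <immintrin.h>
    #include <cstddef>

    float dot_f32_avx2(const float* a, const float* b, std::size_t n) {
        __m256 acc = _mm256_setzero_ps();
        std::size_t i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);
            __m256 vb = _mm256_loadu_ps(b + i);
            acc = _mm256_fmadd_ps(va, vb, acc);   // acc += va * vb, 8 lanes at a time
        }
        // Horizontal sum of the 8 partial products.
        __m128 lo = _mm256_castps256_ps128(acc);
        __m128 hi = _mm256_extractf128_ps(acc, 1);
        lo = _mm_add_ps(lo, hi);
        lo = _mm_hadd_ps(lo, lo);
        lo = _mm_hadd_ps(lo, lo);
        float sum = _mm_cvtss_f32(lo);
        for (; i < n; ++i) sum += a[i] * b[i];    // scalar tail
        return sum;
    }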


Author here. Appreciate the context—just wanted to add some perspective on the 8K QPS figure: in the VectorDBBench setting we used (10M, 768d, on comparable hardware to the previous leader), we're seeing double their throughput—so it's far from trivial on that playing field.

That said, self-reported numbers only go so far—it'd be great to see USearch in more third-party benchmarks like VectorDBBench or ANN-Benchmarks. Those would make for a much more interesting comparison!

On the technical side, USearch has some impressive work, and you're right that SIMD and cache optimization are well-established techniques (definitely part of our toolbox too). Curious about your setup though—vector search has a pretty uniform compute pattern, so while 100+ custom kernels are great for adapting to different hardware (something we're also pursuing), I suspect most of the gain usually comes from a core set of techniques, especially when you're optimizing for peak QPS on a given machine and index type. Looking forward to seeing what your upcoming release brings!


Author here. Thanks for the interest! On the performance side: we've applied optimizations like prefetching, SIMD, and a novel batch distance computation (similar to a GEMV operation) that alone gives ~20% speedup. We're working on a detailed blog post after the Lunar New Year that dives into all the techniques—stay tuned!
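
To give a rough idea of the batching trick (a simplified sketch, not our actual implementation, and batch_dot is just an illustrative name): scoring one query against a contiguous block of candidates is essentially a matrix-vector product, which amortizes loads of the query and keeps the FMA pipelines busy.

    // GEMV-style batch scoring: one query against a block of candidates.
    // Illustrative sketch only.
    #include <cstddef>

    void batch_dot(const float* query,           // [dim]
                   const float* candidates,      // [count x dim], row-major
                   std::size_t count, std::size_t dim,
                   float* scores) {              // [count]
        for (std::size_t c = 0; c < count; ++c) {
            const float* row = candidates + c * dim;
            float acc = 0.0f;
            for (std::size_t d = 0; d < dim; ++d)
                acc += row[d] * query[d];        // a vectorizing compiler turns this into SIMD FMAs
            scores[c] = acc;
        }
    }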

And we always welcome independent verification—if you have any questions or want to discuss the results, feel free to reach out via GitHub Issues or our Discord.


It is absolutely possible, and not even that hard. If you use Redis Vector Sets you will easily see 20k-50k queries per second (depending on hardware) with tens of millions of entries, and the results don't get much worse as you scale further. Of course, all of that is with data served from memory, as Vector Sets do. Note: I'm not talking about the RediSearch vector store, but the new "vector set" data type I introduced a few months ago. The HNSW implementation of vector sets (AGPL) is quite self-contained and easy to read if you want to check how to achieve similar results.
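
For anyone who hasn't looked at HNSW internals, the core of the search is a small piece of code. Here is a simplified sketch of the greedy descent on a single layer (the textbook step, not the Redis code; the full algorithm widens this with an ef-sized candidate beam):

    // Greedy descent on one HNSW layer: from an entry point, keep moving
    // to the closest neighbor until no neighbor improves. Simplified sketch.
    #include <cstddef>
    #include <vector>

    struct Node {
        std::vector<float> vec;
        std::vector<std::size_t> neighbors;   // adjacency on this layer
    };

    static float l2(const std::vector<float>& a, const std::vector<float>& b) {
        float s = 0.0f;
        for (std::size_t i = 0; i < a.size(); ++i) {
            float d = a[i] - b[i];
            s += d * d;
        }
        return s;
    }

    std::size_t greedy_search(const std::vector<Node>& layer,
                              const std::vector<float>& query,
                              std::size_t entry) {
        std::size_t cur = entry;
        float best = l2(layer[cur].vec, query);
        bool improved = true;
        while (improved) {
            improved = false;
            for (std::size_t nb : layer[cur].neighbors) {
                float d = l2(layer[nb].vec, query);
                if (d < best) { best = d; cur = nb; improved = true; }
            }
        }
        return cur;   // local minimum on this layer
    }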

Author here. Thanks for sharing—always great to see different approaches in the space. A quick note on QPS: throughput numbers alone can be misleading without context on recall, dataset size, distribution, hardware, distance metric, and other relevant factors. For example, if we relaxed recall constraints, our QPS would also jump significantly. In the VectorDBBench results we shared, we made sure to maintain (or exceed) the recall of the previous leader while running on comparable hardware—which is why doubling their throughput at 8K QPS is meaningful in that specific setting.

You're absolutely right that a basic HNSW implementation is relatively straightforward. But achieving this level of performance required going beyond the usual techniques.


Yep, you are right. Also: quantization is a big factor here. For instance, int8 quantization has minimal effect on recall, but it makes the dot product between vectors much faster and speeds things up a lot. The number of components in the vectors also makes a huge difference. Another thing I didn't mention is that the Redis implementation (vector sets) is threaded, so the numbers I reported are not for a single core. Btw, I agree with your comment, thank you. What I wanted to say is simply that the results you get, and the results I get, are not "out of this world", and are very credible. Have a nice day :)
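
To illustrate the int8 point (a scalar sketch with symmetric per-vector scales, not the Redis code): the dot product becomes an integer accumulation that is rescaled once at the end, and 8-bit components pack 4x more lanes per SIMD register than float32.

    // int8 dot product with per-vector scale factors. Illustrative sketch.
    #include <cstdint>
    #include <cstddef>

    float dot_int8(const int8_t* a, float scale_a,
                   const int8_t* b, float scale_b,
                   std::size_t n) {
        int32_t acc = 0;
        for (std::size_t i = 0; i < n; ++i)
            acc += static_cast<int32_t>(a[i]) * static_cast<int32_t>(b[i]);
        // Rescale the integer accumulator back to the original float range.
        return static_cast<float>(acc) * scale_a * scale_b;
    }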

Appreciate the thoughtful breakdown; you're absolutely right that quantization, dimensionality, and threading all play a big role in performance numbers. Thanks for the kind words and for engaging in the discussion. Wishing you a happy Lunar New Year and a prosperous Year of the Horse!

Pinecone scales horizontally (which creates overhead, but accommodates more data).

A better comparison would be with Meta's FAISS.


Author here. Good point: OpenSearch (whose k-NN search can use FAISS as its engine) is actually included in the VectorDBBench results, so you can see how it compares there. That said, horizontal scaling for vector search is relatively straightforward; the real challenge is optimizing performance within a given (single-node) hardware budget, which is where we've focused our efforts.

PGVectorScale claims even more. I'd also like to see someone verify that.

Exactly. We should always be asking these sorts of questions and taking self-reported benchmarks with a grain of salt until independent sources can verify the claims, rather than trusting results from the creators themselves. Otherwise it falls into biased territory.

This sort of behaviour is now absolutely rampant in the AI industry.



