I like the concept, but I'm not having much luck. I entered "weird looking monkey", hoping to get Proboscis, golden snub-nosed, etc, but I just end up with the articles "Pet monkey", "List of individual monkeys", "Ethnoprimatology", "Monkey".
Whereas when I type the same query into google, I get exactly what I expected. Which is kind of disappointing, I was hoping to find out about some weird looking monkeys that I didn't know about.
Yeah it’s an off the shelf sentence-transformer model from over a year ago. The demo was more to show off the embedding database, but the embeddings themselves are slightly useful too.
I don’t keep any analytics on the page about what people find and don’t find, so I haven’t set myself up to improve the search results :/
That might actually be making things worse for longer articles. It would probably be better to index them separately and aggregate back to the article level post-query
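A minimal sketch of that aggregation idea (all names and scores here are made up for illustration): score each chunk against the query, then rank articles by their best-matching chunk.

```python
# Sketch: aggregate chunk-level similarity scores back to article level.
# The scores are toy values; in practice each would be the cosine
# similarity between the query embedding and a chunk embedding.
from collections import defaultdict

def aggregate_to_articles(chunk_hits):
    """chunk_hits: list of (article_id, chunk_score) pairs.
    Returns articles ranked by their best-matching chunk."""
    best = defaultdict(float)
    for article_id, score in chunk_hits:
        best[article_id] = max(best[article_id], score)
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)

hits = [("Proboscis monkey", 0.62), ("Monkey", 0.41),
        ("Proboscis monkey", 0.35), ("Pet monkey", 0.48)]
print(aggregate_to_articles(hits))
```

Taking the max per article keeps a long article competitive when only one section of it matches the query, which is exactly the case a single whole-article embedding washes out.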
Wikipedia editors/guidelines are generally not in favor of "opinionated" adjectives, and the use of "weird looking" in your query sounds a lot like something that would be frowned upon in a Wikipedia article. That makes it hard for your search to retrieve good results on this corpus.
It's also largely dependent on the embeddings model as others have mentioned. Even if wikipedia doesn't have any words specifically referring to that monkey as "weird", the model itself would know to correlate this monkey's embeddings with the "weird" concept. The main issue with this particular implementation is the model used (all-minilm-l6-v2) which is designed for speed and efficiency over accuracy.
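For the curious, the core retrieval step is just cosine similarity between the query embedding and document embeddings. The vectors below are toy 3-d stand-ins (the real demo would use 384-dim vectors from all-MiniLM-L6-v2 via sentence-transformers; skipped here to avoid the model download):

```python
import numpy as np

# Toy illustration of embedding search: rank documents by cosine
# similarity to the query vector. Vectors are made up for the example.
def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([0.9, 0.1, 0.3])
docs = {
    "Proboscis monkey": np.array([0.8, 0.2, 0.4]),
    "Ethnoprimatology": np.array([0.1, 0.9, 0.2]),
}
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked)  # best match first
```

Whether "weird looking" lands near the right monkeys depends entirely on how well the model's training data captured that association, which is where a small speed-oriented model like all-MiniLM-L6-v2 can fall short of a larger one.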