I like the concept, but I'm not having much luck. I entered "weird looking monkey", hoping to get Proboscis, golden snub-nosed, etc, but I just end up with the articles "Pet monkey", "List of individual monkeys", "Ethnoprimatology", "Monkey".
Whereas when I type the same query into google, I get exactly what I expected. Which is kind of disappointing, I was hoping to find out about some weird looking monkeys that I didn't know about.
Yeah it’s an off the shelf sentence-transformer model from over a year ago. The demo was more to show off the embedding database, but the embeddings themselves are slightly useful too.
I don’t keep any analytics on the page about what people find and don’t find, so I haven’t set myself up to improve the search results :/
That might actually be making things worse for longer articles. It would probably be better to index them separately and aggregate back to the article level post-query
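A minimal sketch of that aggregation idea (all names and scores here are made up for illustration): score each chunk against the query, then rank articles by their best-matching chunk.

```python
# Sketch: aggregate chunk-level similarity scores back to article level.
# The scores are toy values; in practice each would be the cosine
# similarity between the query embedding and a chunk embedding.
from collections import defaultdict

def aggregate_to_articles(chunk_hits):
    """chunk_hits: list of (article_id, chunk_score) pairs.
    Returns articles ranked by their best-matching chunk."""
    best = defaultdict(float)
    for article_id, score in chunk_hits:
        best[article_id] = max(best[article_id], score)
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)

hits = [("Proboscis monkey", 0.62), ("Monkey", 0.41),
        ("Proboscis monkey", 0.35), ("Pet monkey", 0.48)]
print(aggregate_to_articles(hits))
```

Taking the max per article keeps a long article competitive when only one section of it matches the query, which is exactly the case a single whole-article embedding washes out.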
Wikipedia editors/guidelines are generally not in favor of "opinionated" adjectives, and the use of "weird looking" in your query sounds a lot like something that would be frowned upon in a Wikipedia article. That makes it hard for your search to retrieve good results on this corpus.
It's also largely dependent on the embeddings model as others have mentioned. Even if wikipedia doesn't have any words specifically referring to that monkey as "weird", the model itself would know to correlate this monkey's embeddings with the "weird" concept. The main issue with this particular implementation is the model used (all-minilm-l6-v2) which is designed for speed and efficiency over accuracy.
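For the curious, the core retrieval step is just cosine similarity between the query embedding and document embeddings. The vectors below are toy 3-d stand-ins (the real demo would use 384-dim vectors from all-MiniLM-L6-v2 via sentence-transformers; skipped here to avoid the model download):

```python
import numpy as np

# Toy illustration of embedding search: rank documents by cosine
# similarity to the query vector. Vectors are made up for the example.
def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([0.9, 0.1, 0.3])
docs = {
    "Proboscis monkey": np.array([0.8, 0.2, 0.4]),
    "Ethnoprimatology": np.array([0.1, 0.9, 0.2]),
}
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked)  # best match first
```

Whether "weird looking" lands near the right monkeys depends entirely on how well the model's training data captured that association, which is where a small speed-oriented model like all-MiniLM-L6-v2 can fall short of a larger one.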