
This would be really cool. Can you suggest a way, given two essays, to decide if, semantically, the first should point to the second? If an author could use a tool like that it would mean that links could be discovered rather than inserted by hand.

That would be valuable ... you've made me think ... I may have a way of doing something close enough.

Hmm.



Latent semantic analysis can be used to measure the similarity of two documents.

http://en.wikipedia.org/wiki/Latent_semantic_analysis

I found examples in Ruby and Python on this blog, which is unfortunately down at the moment: http://blog.josephwilk.net/ruby/latent-semantic-analysis-in-...
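Since the blog is down, here is a rough, dependency-free sketch of the idea, not the code from that post. It assumes raw word counts (no tf-idf weighting) and gets the latent document coordinates from the top eigenvectors of the document-by-document Gram matrix, which should match what a truncated SVD gives for the documents. Every function name and the sample sentences are made up for illustration.

```python
import math
import re
from collections import Counter


def count_matrix(docs):
    """Document-by-term matrix of raw word counts (assumption: no tf-idf)."""
    counts = [Counter(re.findall(r"[a-z]+", d.lower())) for d in docs]
    vocab = sorted(set().union(*counts))
    return [[c[t] for t in vocab] for c in counts]


def matvec(m, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def top_eigenpairs(sym, k, iters=500):
    """Top-k eigenpairs of a symmetric PSD matrix: power iteration plus deflation."""
    n = len(sym)
    work = [row[:] for row in sym]
    pairs = []
    for _component in range(k):
        v = [1.0 + 0.1 * i for i in range(n)]  # fixed start keeps runs repeatable
        for _step in range(iters):
            w = matvec(work, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        length = math.sqrt(sum(x * x for x in v))
        v = [x / length for x in v]
        lam = sum(a * b for a, b in zip(v, matvec(work, v)))
        pairs.append((max(lam, 0.0), v))
        # Deflate: remove the component just found so the next pass finds the next one.
        for i in range(n):
            for j in range(n):
                work[i][j] -= lam * v[i] * v[j]
    return pairs


def lsa_coordinates(docs, k=2):
    """Coordinates of each document in a k-dimensional latent space."""
    a = count_matrix(docs)
    gram = [[sum(x * y for x, y in zip(r1, r2)) for r2 in a] for r1 in a]
    pairs = top_eigenpairs(gram, k)
    return [[math.sqrt(lam) * vec[i] for lam, vec in pairs] for i in range(len(docs))]


def cosine(p, q):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(p, q))
    norms = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norms if norms else 0.0


docs = [
    "the cat sat on the mat with another cat",
    "a cat and a kitten played on the mat",
    "stock markets fell as investors sold shares",
    "investors bought shares when markets rose",
]
coords = lsa_coordinates(docs, k=2)
print(cosine(coords[0], coords[1]))  # two cat sentences
print(cosine(coords[0], coords[2]))  # cat sentence vs. markets sentence
```

On this toy input the two cat sentences should score close to 1 and the cat/markets pair close to 0, because the two groups share no vocabulary. Real essays would need stop-word removal and some term weighting before the numbers mean much.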


Gregor Heinrich has a good paper on Latent Dirichlet Allocation, which I believe is an extension of Latent Semantic Analysis. It is a model which can be used to group documents based on semantic content. He gives the mathematical details and the "punchlines" for implementation. The model takes as input a collection of documents and outputs a topic label for each word in each document. The documents can be plotted in K-dimensional space, where K is the number of possible topics, by using the proportion of each topic in a document as its coordinates. Documents that are closer together have more similar topics. You could then use your favorite clustering algorithm or a simple distance threshold to decide which documents should link to each other.

The paper is here [PDF]: http://www.arbylon.net/publications/text-est.pdf

C++ implementation: http://gibbslda.sourceforge.net/

(note, I haven't used the C++ implementation).
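To make the last step concrete, here is a small sketch of the "distance threshold" idea. It does not run LDA; it assumes you already have a topic label for every word from some sampler. The labels, essay names, helper names, and the 0.3 threshold are all invented for illustration.

```python
import math


def topic_proportions(word_topics, k):
    """Turn per-word topic labels (ints in 0..k-1) into a K-dimensional point."""
    total = len(word_topics)
    return [word_topics.count(t) / total for t in range(k)]


def euclidean(p, q):
    """Straight-line distance between two points in topic space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def suggest_links(points, threshold):
    """Pairs of document names whose topic mixes lie within the threshold."""
    names = sorted(points)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if euclidean(points[a], points[b]) <= threshold
    ]


# Made-up topic labels for the words of three short essays, with K = 3 topics.
labels = {
    "essay_a": [0, 0, 0, 1],
    "essay_b": [0, 0, 1, 0, 0],
    "essay_c": [2, 2, 1, 2],
}
points = {name: topic_proportions(words, 3) for name, words in labels.items()}
print(suggest_links(points, threshold=0.3))
```

Distance is symmetric, so this only says that two essays are related, not which one should point to the other. Picking a direction would need something extra, such as publication order.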


That looks even better. I'm going to try this out in my pet project.

Thank you.



