Links between Paul Graham's essays - revisited (solipsys.co.uk)
48 points by RiderOfGiraffes on Dec 14, 2008 | hide | past | favorite | 33 comments


Is there a standard tool that does this? Especially for, shall we say, highly interconnected webpages? I need it for Overcoming Bias.


Yes, it's called Python.

Write a crawler that examines all local links, and builds a graph (it needn't be anything fancy; a collection of nodes and edges is enough, but you'll have to detect cycles when building the graph), and then print it out in graphviz .dot language.

Then you can feed it into neato [either through a command-line pipe, or construct one using Python's subprocess module], et voilà, instant pagemap! :)

You can also use some of the rendering modes and scrape position data for the rendered nodes, if you want to generate a clickable imagemap.
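The crawl-and-emit-dot approach above can be sketched in Python using only the standard library. This is a minimal, illustrative sketch: it shows the link extraction and .dot emission steps only, and a real crawler would add fetching (e.g. via urllib.request), a visited set to handle cycles, and rate limiting.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collect href targets from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def local_links(html, page_url):
    """Resolve hrefs against page_url and keep only same-host links."""
    parser = LinkExtractor()
    parser.feed(html)
    host = urlparse(page_url).netloc
    out = []
    for href in parser.links:
        absolute = urljoin(page_url, href)
        if urlparse(absolute).netloc == host:
            out.append(absolute)
    return out

def to_dot(edges):
    """Render a set of (src, dst) pairs in graphviz .dot syntax."""
    lines = ["digraph site {"]
    for src, dst in sorted(edges):
        lines.append('  "%s" -> "%s";' % (src, dst))
    lines.append("}")
    return "\n".join(lines)
```

The output can then be piped straight into neato, as suggested, e.g. `python crawl.py | neato -Tpng -o pagemap.png`.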


Here's a first cut implementation in ruby. Requires the "hpricot" gem. Outputs a dot file to stdout. http://gist.github.com/35841

Output from the above is at: http://jey.kottalam.net/tmp/obgraph.out

I tried rendering this with "dot -Tcmapx -oob.map -Tgif -oob.gif" but dot segfaulted after 70 minutes. The code as posted at gist only outputs nodes for articles written by Eliezer in an attempt to make the dot file a manageable size. Tips/fixes appreciated.


Not sure if it will work, but try:

1) Rendering an undirected graph via graphviz's neato tool.

2) Rendering to svg, which might let graphviz not have to keep as much rendering information in memory at any one time.

3) Asking about it on the graphviz-interest mailing list. https://mailman.research.att.com/mailman/listinfo/graphviz-i...


Yeah, it's that thing with the segfault that I worried about. "Highly interconnected." Thanks for the graph, which I may be able to use for my own purposes even if I can't show it.

There's got to be standard graphing tools for things that are highly connected...


When Daniel J. Bernstein made his DAGs of standard crypto algorithms (http://cr.yp.to/cipherdag/cipherdag-20070630.pdf) he mentioned that they'd crashed every standard drawing tool he tried. He makes a reference to "...my own drawing tools, which are much more careful in their use of memory." Unfortunately, I'm not sure where to obtain DJB's drawing tools.



Graphviz renders graphs described in its dot format. Its layout programs (dot, neato, etc.) can produce JPEG, PNG, GIF and other output formats from the command line.

I guess what he meant was tools which can do this for a particular site.

I can't think of any but as a previous poster said - a crawl of desired depth for all local pages, then feed the access log to "statviz" to generate a neat dot file along with the links.


I am home for winter break and would be willing to write this for you in the name of science. Plus, it seems that you know Marcello Herreshoff, with whom I took Computer Science in high school. Email me the requirements for your tool to <zackster>:at:<gmail.com> please.


A tool like this would be ideal for locating the moment when a person stops producing original thought and starts repackaging old thoughts in new structures.


One or two links to an old post don't deteriorate the quality or freshness of the content. In fact, a lot of the time one might be interested in getting a broader view of the author's stream of thought; those who aren't interested can simply turn a blind eye to the hyperlinks.

Maybe it would become restructured content if there were too many outward references, like the... "Undergraduate" post. But even there, when you look into the post, the hyperlinks are for keywords like "essays", "hacker", "computer science", what you "love", hacking a "blub" on Windows...


This is cool - led me to http://www.paulgraham.com/javacover.html in which I read "Historically, languages designed for other people to use have been bad" which makes me feel a little less insane for spending the morning working on my "pet language".


In your words, what does this graph imply? It would be great if PG himself could see something in this and respond.

My take:

- The most referenced (high in-link) posts are dearer to PG, or the most popular/strongest in his line of thought.

- The ones with many outgoing links might at times be restructured thought, as 'markessien' noted.

- For a personal blog like PG's, does heavier interlinking reduce readability or make the content richer? I guess the reader decides.


It measures some combination of how general an essay is, how early it was written (I usually add these links when I first write something, so most links are only backward in time), how much I like it, and sheer randomness (at least half the time I don't bother to add crosslinks).

Larger numbers of outgoing links probably don't imply recycling. If anything it's a sign I think an essay is good and will get lots of readers, so it's worth trying to spread that traffic around. (These predictions are often wrong, though. I find it practically impossible to guess which essays will get lots of traffic.)


Is it not worth considering putting "forward" links into older essays, so that when you write an essay it gets referenced by other essays, even when they're older?

Perhaps you don't have time, but would you consider adding links that other people suggest?

It would be interesting to do to this what I do with my SiteMap, and colour nodes darker if they have more traffic.

http://www.solipsys.co.uk/new/SiteMapExpt.html?YC_News


I like the idea, but it would be more meaningful if you could link for semantic reasons - that would be a very nice tool, could even be used by authors who want to put a book together from various essays they've written, etc. Or help you focus on an area of writing you're interested in by the author. I mention this because looking at the graph you can see that things that are related are not necessarily linked.


This would be really cool. Can you suggest a way, given two essays, to decide if, semantically, the first should point to the second? If an author could use a tool like that it would mean that links could be discovered rather than inserted by hand.

That would be valuable ... you've made me think ... I may have a way of doing something close enough.

Hmm.


Latent semantic analysis can be used to measure the similarity of two documents.

http://en.wikipedia.org/wiki/Latent_semantic_analysis

I found examples in ruby and python in this blog, which is unfortunately down just at the moment http://blog.josephwilk.net/ruby/latent-semantic-analysis-in-...
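Full LSA applies an SVD to the term-document matrix so that related-but-differently-worded documents end up close together. As a rough first approximation of the "how similar are two documents" step (deliberately skipping the SVD), plain cosine similarity over raw term counts looks like this; the example texts are made up:

```python
import math
import re
from collections import Counter

def term_vector(text):
    """Bag-of-words term counts, lowercased."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors:
    1.0 for identical word distributions, 0.0 for no shared words."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0
```

Pairs of essays scoring above some threshold would be candidates for cross-linking. Adding the LSA step proper would mean building the full term-document matrix and reducing its rank via SVD before comparing.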


Gregor Heinrich has a good paper on Latent Dirichlet Allocation, which I believe is an extension of Latent Semantic Analysis. It is a model which can be used to group documents based on semantic content. He gives the mathematical details and the "punchlines" for implementation. The model takes as input a collection of documents and outputs a topic label for each word in each document. The documents can be plotted in K-dimensional space, where K is the number of possible topics, by using the proportion of each topic in a document as its coordinates. Documents which are closer to each other have topics more similar to each other. You could then use your favorite clustering algorithm or a simple distance threshold to decide which documents should link to each other.

The paper is here [PDF]: http://www.arbylon.net/publications/text-est.pdf

C++ implementation: http://gibbslda.sourceforge.net/

(note, I haven't used the C++ implementation).
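The final "distance threshold" step described above is straightforward once LDA has produced per-document topic proportions. A minimal sketch (the topic proportions here are invented for illustration, not LDA output):

```python
import math

def topic_distance(p, q):
    """Euclidean distance between two topic-proportion vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def suggested_links(docs, threshold):
    """Pairs of documents whose topic mixtures are within threshold.
    docs maps a document name to its K-dimensional topic proportions."""
    names = sorted(docs)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if topic_distance(docs[a], docs[b]) < threshold:
                pairs.append((a, b))
    return pairs
```

Swapping in a proper clustering algorithm instead of the fixed threshold is the obvious refinement.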


That looks even better. I'm going to try this out in my pet project.

Thank you.


In a way, there is already gross semantic linking, in a Q/A sort of way, via the titles: e.g., Why do Startups Condense in America --> The Power of the Marginal; What Can Business Learn from Open Source --> How To Start A Startup + Inequality and Risk.


I wonder if someone can use the PageRank algorithm or something similar to determine which essays are the most prominent. It's hard to tell with this graph which is the most connected essay - is it "Taste for Makers", "What You Can't Say", "Great Hackers", or something else? It would also be interesting to correlate this with the Google PageRank for these pages, and traffic data (if it's available).


Here are the ones that get the most traffic, with daily page views:

     1760 Stuff
      770 What You'll Wish You'd Known
      403 How to Start a Startup
      365 Why to Start a Startup in a Bad Economy
      349 Why Nerds are Unpopular
      294 How to Do what You Love
      274 Web 2.0
      247 Great Hackers
      207 How to Make Wealth
      187 Lies We Tell Kids


"Stuff" is my sending the given link to my female relatives to dissuade them from, or to reduce their practice of, being packrats. I'm pretty sure there are other users out there that do the same...


You're right. There are rarely referring URLs for it, which implies that it spreads primarily by email.



URLs are malformed. Otherwise, nice. When you say it's "similar" to PageRank on your webpage, what do you mean?


Re malformed - fixed.

I just pulled the description from Wikipedia and implemented pretty much exactly that.

In essence, each page gets set to 1.0. Then distribute 80% of that down each outgoing link, and 20% to everyone equally. Lather, rinse, repeat.

I think that's pretty much the algorithm. Pages pointed to by pages with lots of juice get lots of juice. Pages pointed to by no one, or only pages with little juice, get little juice. It's linear algebra thought of as network flow.

Well, that's what I did.
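The "80% down the outgoing links, 20% to everyone equally" iteration described above can be sketched as follows. This is a minimal illustration with a made-up three-page link graph, not the commenter's actual code; the score of dead-end pages is redistributed evenly so total "juice" is conserved:

```python
def pagerank(links, damping=0.8, iterations=50):
    """Iterative PageRank: every page starts at 1.0, then repeatedly
    passes `damping` of its score along its outgoing links and shares
    the remainder (plus the whole score of dead-end pages) evenly."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new = {p: 0.0 for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new[t] += share
                leak = (1 - damping) * rank[p]
            else:
                leak = rank[p]  # dead end: redistribute everything
            for t in pages:
                new[t] += leak / n
        rank = new
    return rank
```

Pages pointed to by high-scoring pages accumulate score, exactly as described; the total across all pages stays constant at the number of pages.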


In PageRank order: http://www.google.com/search?q=site%3awww.paulgraham.com

Google has a good PageRank implementation, though it isn't open source.


> In PageRank order: http://www.google.com/search?q=site%3awww.paulgraham.com
>
> Google has a good PageRank implementation, though it isn't open source.

Yes, but doing what you've done here gives the entire site, whereas implementing it from the definition allows you to compute the actual values, as I did.

Of course, Google have tweaked their algorithm so it's no longer exactly as originally published, and they're not telling anyone what it actually does. Using the "clean" version is at least transparent.


Updated again to include new essays. Also, the RSS feed has been removed, along with a couple of other spurious links, just 'cos I was there and could do it.

I'm still interested in additional links based on semantics.


Awesome!

One small thing, it looks like you included the RSS link from the bottom of the page in the list of essays.


Re RSS: Hmm. Oh well, Wabi Sabi.



