Slaves of the feed - This is not the realtime we've been looking for (000fff.org)
26 points by ThomPete on Dec 15, 2009 | 31 comments


So how can we solve this problem of drinking from a firehose?

I think a system like Google Reader might have some traction here, wherein friends can recommend items to each other. With enough friends, you could make a metafeed out of those items. Make it a new service: you get paid for drinking from the firehose, and you pay to get human-filtered feeds. Throw in some 'liked/hated this firehose drinker' ratings and a pile of people, and you get interest spaces.


There are decades of research in information retrieval that can easily be applied to this problem. Before Google showed up with PageRank, many algorithms were based on properties of a given document and the historical properties of the set (for example, TF-IDF).
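
The TF-IDF idea mentioned above can be sketched in a few lines. This is a toy version with a whitespace tokenizer and an invented three-item corpus, not a production scorer:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Score each term in each document by TF-IDF: terms frequent in
    one document but rare across the whole set score highest."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()                       # document frequency per term
    for tokens in tokenized:
        df.update(set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores.append({t: (tf[t] / len(tokens)) * math.log(n / df[t])
                       for t in tf})
    return scores

docs = ["realtime search is hard",
        "realtime feeds need filtering",
        "filtering spam is hard"]
scores = tf_idf(docs)
# "realtime" appears in two of three docs, so in the first document it
# scores lower than "search", which appears in only one.
```

A real system would add stemming, stop-word removal, and incremental updates as the "historical properties of the set" change, but the ranking principle is the same.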

You don't need to drink from a firehose. The firehose will be throttled and filtered to give you the useful information that you want. This is where the future of real-time search lies.

Disclosure: I'm CTO of Collecta, one of these real-time search companies.


Search clearly isn't the answer. Can you create a search term that encompasses all of your interests? I hope not.

Surely filtering the "fire hose" with the query "programming news" will not yield satisfactory results.


That's pretty much what we have now, except we haven't really made it so that the monkeys picking out the good information are being rewarded proportional to the value they create. Right now, it's the monkey who can put the most ads in front of the most monkeys who gets the banana, not the monkeys who find the most worthwhile information from the most obscure sources.


He correctly identifies a problem, but no amount of interdevice communication is going to tell me which 5 of boingboing's posts are most likely to be of personal interest.

Here's what'd help me with the bottleneck: a unified inbox with a wicked smart relevance algorithm. VR? Probably not.


We've had plain old naive-Bayesian filtering for ages - I'm surprised more feed-readers don't use it.
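
A bare-bones version of such a filter, a multinomial naive Bayes with add-one smoothing, trained on invented feed items:

```python
import math
from collections import Counter, defaultdict

class FeedFilter:
    """Minimal naive-Bayes feed filter: train on items you marked as
    interesting or boring, then classify new ones."""
    def __init__(self):
        self.word_counts = defaultdict(Counter)
        self.item_counts = Counter()
        self.vocab = set()

    def train(self, text, label):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.item_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        total = sum(self.item_counts.values())
        best, best_score = None, float("-inf")
        for label in self.item_counts:
            # log prior + log likelihoods with add-one smoothing
            score = math.log(self.item_counts[label] / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in text.lower().split():
                score += math.log((self.word_counts[label][w] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best

f = FeedFilter()
f.train("new linux kernel release", "interesting")
f.train("tiger woods scandal update", "boring")
f.train("celebrity gossip tiger woods", "boring")
```

After even this tiny amount of training, `f.classify("tiger woods news")` comes back "boring", which is exactly the behavior a feed reader could learn silently from what you skip.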


Feed items that truly contain new information aren't something you can train for with Bayes in the traditional way, because novelty implies something that hasn't been seen before, rather than something similar to what has already been seen. Searching for meaningfully different stuff rather than similar stuff is a whole different problem.

I'd like a treemap view of Google Reader rather than a list, where I could visually filter the feeds in real time by adding negative and positive keywords/tags, and click around in the various levels of the treemap.

Click on "technology" => make a new treemap of significant tags in that area, click on the next tag to dive into that space, etc., and eventually see all the feed items somehow. Perhaps present them at all levels via some sidebar or hover thing. Color-code items for popularity / activity / freshness.

I'm thinking something like Dabble DB but with a treemap view and a Cappuccino JS interface.


Here's another way to approach the user's end of it: have them log in to their Google, Facebook, Twitter, et al, and shove all of it into a single stream. Let them filter by source, topic, relevance, time, and anything else you can implement. Record their preferences by watching what items they click through or spend time on or interact with, and your relevance sort gets better. Add other users' data, and you have a fantastic recommendation engine without ever making anyone rate anything.
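
The implicit-feedback part of this could start very simply. A sketch with made-up update rules (the class name, the +1/dwell bonus, and the -0.1 skip penalty are all invented, not tuned values):

```python
from collections import defaultdict

class RelevanceModel:
    """Learn per-topic weights implicitly from behavior: clicks and
    dwell time raise a topic's weight, skipped items lower it slightly."""
    def __init__(self):
        self.weights = defaultdict(float)

    def observe(self, topics, clicked, seconds_spent=0):
        # Dwelling longer on an item counts as a stronger signal.
        delta = (1.0 + seconds_spent / 30.0) if clicked else -0.1
        for t in topics:
            self.weights[t] += delta

    def score(self, topics):
        """Rank a new item by the learned weights of its topics."""
        return sum(self.weights[t] for t in topics)

m = RelevanceModel()
m.observe(["tech", "linux"], clicked=True, seconds_spent=60)
m.observe(["celebrity"], clicked=False)
```

The point the comment makes holds: nobody ever rates anything explicitly, yet `m.score` already prefers Linux stories over celebrity ones for this user.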

Another problem to solve with current news reading is duplication. Even if you do want to read about Tiger Woods, there's a lot of duplication in the echo chamber of the internet.
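
One standard way to catch echo-chamber duplicates is word shingling plus Jaccard similarity. A toy sketch (the 0.5 threshold and shingle size are arbitrary choices, not recommendations):

```python
def shingles(text, k=3):
    """Word k-shingles: the set of overlapping runs of k words."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Fraction of shingles two texts share."""
    return len(a & b) / len(a | b) if a | b else 0.0

def is_duplicate(story_a, story_b, threshold=0.5):
    """Treat two stories as echo-chamber copies if enough of their
    word sequences overlap."""
    return jaccard(shingles(story_a), shingles(story_b)) >= threshold
```

Two rewrites of the same wire story share most of their shingles and get collapsed, while genuinely different stories don't. At firehose scale you'd swap exact Jaccard for MinHash sketches, but the idea is identical.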


Yep, that's the position Google is in to analyze data; it's incredibly powerful to have almost the whole activity stream of the internet.

An interesting side note: apparently Google could activate face recognition for Google Goggles but has chosen not to for now over privacy concerns.


As much as I hate to say it, it seems like this is something that could be easily accomplished by piggybacking off of Twitter.

Just drop the link, some tags, and misc attributes along with a short description. Is it much more complicated than that?


Remember the end goal here is to deliver content to the user that they want to spend time consuming. Why use Twitter for anything but another way to find content?


I'm not sure I understand your comparison. What is the difference between finding content I want to spend time consuming and finding content another way (via some content-filtering process) that has every intention of limiting the content to what I "want" to consume?

I agree. The problem is definitely a matter of related vs. unrelated content. My suggestion (or babble) was just pointing out how easily we could implement source-content-filtering logic on top of an already existing interface, if we felt that was the best way to go about it.


I just mean that I don't see a benefit of using Twitter as a delivery system.


Of course. No doubt.


If you re-frame the problem not as searching for "different gold" but "similar chaff" (and hiding it) then simple trainable filters look more useful.


I can see an interesting parallel to searching for prime numbers, which are basically our purest "new, unknown, primitive" facts. In this model, a Bayesian filter trained to detect "sameness" would almost be a generalization of a sieve of Eratosthenes.
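
The analogy can be made concrete. The sieve repeatedly discards candidates that are "similar to" (divisible by) something already accepted; whatever survives is genuinely new:

```python
def sieve(n):
    """Sieve of Eratosthenes, written as a novelty filter: keep a
    number only if it is not 'explained by' anything kept so far."""
    candidates = list(range(2, n + 1))
    primes = []
    while candidates:
        p = candidates[0]          # the next surviving "novel" item
        primes.append(p)
        # discard everything similar to (divisible by) it
        candidates = [c for c in candidates if c % p != 0]
    return primes
```

A sameness-detecting filter over feed items would play the role of the divisibility test here, which is why it reads as a generalization of the sieve.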


That sounds like a great interface. But it doesn't stop you from having to do all of the filtering/categorizing. Maybe if after having done all of that work on your end, there would be a nice way to share that with followers.


Nah, the only categorization that really needs to be done is by topic. It's easy to tell if something is, say, a tech article, or tech/programming, or tech/linux, based on source and keywords. If your algorithm did that once per item that's in anyone's feed, everything else is refinement. Ideally, a human won't need to touch it at all. Even if the machine gets it wrong, it could be corrected with a "this is in the wrong place" button. In any case, an article shouldn't rise too high in any category that it's not relevant for.
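
The source-plus-keyword scheme described above could look something like this. All category names, keyword lists, and source mappings are hypothetical examples:

```python
# Hypothetical rules: keyword sets per category, plus fallbacks for
# sources whose topic is already known.
CATEGORY_KEYWORDS = {
    "tech/programming": {"compiler", "python", "api", "code"},
    "tech/linux": {"linux", "kernel", "ubuntu"},
    "tech": {"startup", "software", "gadget"},
}
SOURCE_CATEGORIES = {"lwn.net": "tech/linux"}

def categorize(title, source):
    """Assign the category with the most keyword hits; if nothing
    matches, fall back on what we know about the source."""
    words = set(title.lower().split())
    best, best_hits = None, 0
    for cat, keywords in CATEGORY_KEYWORDS.items():
        hits = len(words & keywords)
        if hits > best_hits:
            best, best_hits = cat, hits
    return best or SOURCE_CATEGORIES.get(source, "uncategorized")
```

The "this is in the wrong place" button from the comment would just add or remove keywords from these sets, so humans only touch the machine's mistakes.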


I'd like to see the categories automatically created via SVD. I wonder how people would react to the knowledge that their favorite posts are written by 30-something males on Saturdays, and include the word "has" twice as often as the word "my"?
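
A toy illustration of SVD-derived categories with NumPy. The corpus and counts are invented: two celebrity-ish posts and two programming-ish posts, with the term-document matrix built by hand:

```python
import numpy as np

# Rows = terms, columns = posts; entries = term counts.
terms = ["golf", "scandal", "compiler", "python"]
counts = np.array([[3, 2, 0, 0],   # golf
                   [2, 3, 0, 0],   # scandal
                   [0, 0, 3, 1],   # compiler
                   [0, 0, 1, 3]])  # python

# SVD factors the matrix into latent "categories": each singular
# direction groups terms (rows of U) and posts (rows of Vt) that
# co-occur, without anyone naming the categories up front.
U, s, Vt = np.linalg.svd(counts.astype(float))

# The strongest latent dimension loads on the first two posts (the
# celebrity cluster); signs can flip, so compare absolute loadings.
```

On real data the latent dimensions are exactly where the unnerving correlations the comment jokes about would surface, for example a dimension tied to posting day or author demographics rather than topic.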


I wouldn't mind an interface with some knobs and dials, given that one can escape the linearity of the feed, and replace that with a recombinant stratified interface which can present the information space from several different vantage points. It'd be a line of flight interface:

“This is how it should be done: Lodge yourself on a stratum, experiment with the opportunities it offers, find an advantageous place on it, find potential movements of deterritorialization, possible lines of flight, experience them, produce flow conjunctions here and there, try out continuums of intensities segment by segment, have a small plot of new land at all times. It is through a meticulous relation with the strata that one succeeds in freeing lines of flight...” – Deleuze and Guattari

http://www.linesofflight.net/linesofflight.htm


An interface that weird had better be devastatingly well-implemented, or people will just be confused.


Yep it could learn from the usage.


I don't think computers can help anyone find truly interesting things. An algorithm would have to be really complicated to discover the fact that just because so many people are interested in New Moon, or Tiger Woods, that doesn't mean I am.

When "relevancy" algorithms are attempted they don't do nearly as good a job at serving niche groups as "community" moderated algorithms such as HN.

HN is an algorithm in which each person's brain is an equation that estimates the "worth" of a piece of information. As a result, the final ranking is the output of an extremely complicated system of "equations", as it were, that is good at finding relevant information.

I don't expect computers to be good at doing that for another ten years or so.


"An algorithm would have to be really complicated to discover the fact that just because so many people are interested in New Moon, or Tiger Woods, that doesn't mean I am."

A Naive Bayesian text classifier could easily learn that. If you fed it news stories that you are interested in it would quickly discover that you never click on stories about Tiger Woods.

It would then classify incoming news for you.

In fact, my old email project POPFile has been adapted to support things like RSS and NNTP filtering with ease.


That would work to eliminate "noise", but such algorithms still do a poor job of helping you find new content and subjects you have never expressed interest in before. I still trust groups of real people to help me in this regard more than I would trust an aggregator.


Theoretically, you could have a (lazily-evaluated) subscription to the entire universe of discourse. It would start out completely hidden, but would gradually become visible in your reader as your filters trim down the stream to a tractable level. Thus, you wouldn't so much be saying "tell me something I don't know," but rather "tell me everything, and leave out the 99.9999% I don't care about."
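
The lazily-evaluated subscription maps naturally onto generators. A sketch with an invented four-item "universe" standing in for the unbounded stream:

```python
def universe():
    """Stand-in for the stream of everything published anywhere."""
    items = [
        {"topic": "tiger-woods", "score": 0.9},
        {"topic": "haskell", "score": 0.2},
        {"topic": "new-moon", "score": 0.8},
        {"topic": "compilers", "score": 0.7},
    ]
    yield from items

def subscribe(stream, wanted_topics):
    """Lazily filter the universal stream: items are only examined as
    the reader pulls them, so subscribing to 'everything' costs
    nothing up front."""
    return (item for item in stream if item["topic"] in wanted_topics)

visible = list(subscribe(universe(), {"haskell", "compilers"}))
```

This is exactly "tell me everything, and leave out what I don't care about": the filter predicate, not the subscription list, does the trimming.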


"I don't think computers can help anyone find truly interesting things. An algorithm would have to be really complicated to discover the fact that just because so many people are interested in New Moon, or Tiger Woods, that doesn't mean I am."

I don't see why, in principle, the full Bayesian inference couldn't be done. In other words, automatically compute how interesting the item is based on how the masses liked it in aggregate, how individuals with similar tastes to you liked it, how similar the item is to other items that you have liked, etc.
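
A crude stand-in for that inference is a weighted blend of the three signals. To be clear, this is a linear combination with invented weights, not a fitted Bayesian model, but it shows why mass popularity alone needn't dominate:

```python
def interest_score(item, weights=(0.2, 0.4, 0.4)):
    """Blend three signals into one relevance score: mass popularity,
    ratings from users with similar taste, and content similarity to
    items this user liked. Weights here are made up, not learned."""
    w_pop, w_nbr, w_sim = weights
    return (w_pop * item["global_popularity"]
            + w_nbr * item["neighbour_rating"]
            + w_sim * item["content_similarity"])

# A Tiger Woods story: hugely popular, but unlike anything I read.
tiger = {"global_popularity": 0.95, "neighbour_rating": 0.1,
         "content_similarity": 0.05}
# A niche story: obscure globally, loved by people with my tastes.
niche = {"global_popularity": 0.02, "neighbour_rating": 0.8,
         "content_similarity": 0.9}
```

Because the personal signals outweigh the global one, the niche story wins, which is the behavior the parent comment says a simple popularity ranking can't deliver.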


The problem with this approach is feasibility. Companies like Netflix and Amazon have built some complex recommendation engines, and I still don't think they're processing as much data as an all-encompassing, social-network-data-aggregating recommendation engine would.

I believe Reddit was at one point going to try to create a news recommendation engine but found it impractical as the number of users grew. The number of permutations of what people in just one user's network like and dislike is staggering.


Given that it's presently possible to offload data processing to a number of cloud-based computing services, I'm not sure that we should be afraid of the computation cost. If you are willing to make some assumptions about independence (for example) you can do a lot of the computations in parallel. A more subtle issue would be how much the use of social net data would gain Amazon or Netflix over their current recommendation engines. I suspect it would be quite a lot, but that's just a guess.
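
The independence point is what makes the parallelism easy: if each user's recommendations depend only on that user's own data, the per-user work can be farmed out. A sketch with a thread pool standing in for a real distributed backend (the scoring function and feed data are invented):

```python
from concurrent.futures import ThreadPoolExecutor

def score_user(user_items):
    """Score one user's feed. Assuming independence between users,
    this needs no data from anyone else, so every call can run in
    parallel (or on a separate machine entirely)."""
    user, item_scores = user_items
    return user, sum(item_scores) / len(item_scores)

feeds = {"alice": [0.9, 0.7], "bob": [0.2, 0.4, 0.6]}
with ThreadPoolExecutor() as pool:
    results = dict(pool.map(score_user, feeds.items()))
```

Collaborative signals break strict independence, of course, which is where the "more subtle issue" about blending in social data comes in.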


The idea is that interdevice communication creates knowledge of context by logging everything you do, something we don't currently have the ability to do. I know it needs more thinking, but there is something there, imho.


Lots of great comments, thanks so much.



