First, you could try DataSift or Gnip, which both sell Twitter data, and thus have...

mittermayr · on Nov 5, 2012

Hey! DataSift and Gnip both seem to supply conversational data, which I currently don't track. My focus is on user data, which neither seems to provide. And yes, they're expensive, wow.

The streaming API idea is a good one to get the most out of all of Twitter's data sources... someone else suggested this as well. Seems like it's worth a try.

Your Alice/Bob/Fred assumption is correct. I had a cache of sorts: a large MySQL table that grew well past a couple of gigabytes. It kept crashing, and each restore took half a day.

The Twitter investors haven't had a chance to see any of the service since they are still waiting for results :) tough one to ask for intros ;)

PanMan · on Nov 5, 2012

The first thing I would try is rebuilding the cache. Since you only need to do key-value lookups, there are many (easy) approaches that can scale better than MySQL. You could do JSON files on a filesystem (with some nesting, as you don't want to put 20 million files in one directory). Or Redis: 20 million x 1 KB is 20 GB. You will need more than that once you add overhead, but a few machines would work. Or Riak (we went past 1 billion items with Riak). But even MySQL should be able to handle this: we had over 100 million records in MySQL (on SSDs) before switching to something else.

That seems to be the quickest win to reduce the number of requests, but how much it helps will depend on how much your user sets overlap.
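
To make the nested JSON files idea a bit more concrete, here is a rough sketch in Python. It is only an illustration, not something we ran in production. The function names (`shard_path`, `cache_put`, `cache_get`), the two-level hash layout, and the assumption that user IDs are plain numeric strings are all mine. Two levels of two hex characters give 65,536 directories, so 20 million users works out to roughly 300 files per directory.

```python
import hashlib
import json
from pathlib import Path


def shard_path(root: Path, user_id: str) -> Path:
    """Map a user ID to root/ab/cd/<user_id>.json.

    The two directory levels come from the first four hex characters of a
    SHA-1 hash of the ID, which spreads files evenly over 256 * 256 dirs.
    Assumes user_id is a plain numeric string (no path separators).
    """
    digest = hashlib.sha1(user_id.encode("utf-8")).hexdigest()
    return root / digest[:2] / digest[2:4] / f"{user_id}.json"


def cache_put(root: Path, user_id: str, profile: dict) -> None:
    """Store one profile as a JSON file, creating shard dirs as needed."""
    path = shard_path(root, user_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file first, then rename, so a crash mid-write
    # does not leave a half-written cache entry behind.
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(profile), encoding="utf-8")
    tmp.replace(path)


def cache_get(root: Path, user_id: str):
    """Return the cached profile dict, or None on a cache miss."""
    path = shard_path(root, user_id)
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except FileNotFoundError:
        return None
```

Usage would be something like `cache_get(root, "12345")` before hitting the API, and `cache_put(root, "12345", profile)` after a successful fetch. Hashing the ID rather than using its leading digits keeps the directories balanced even if IDs cluster in certain ranges.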