Hacker News

I'm surprised nobody has mentioned Redis yet.

I'd build it in two stages. First, a cache that holds user information (essentially replicating profiles): follower count, etc. This would be shared among all users and could go into a dedicated Redis instance (and, why not, also be replicated to MySQL/InnoDB for convenience).

Then, the "graph" DB (follower lists), which I'd put into Redis. With some scripting and Redis magic, you can keep users automatically sorted (server-side) by their follower count. You'll just need a lot of RAM (get a dedicated server; look at OVH or others, since cloud is usually more expensive and less reliable when it comes to RAM).
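To make the "automatically sorted by follower count" part concrete: in Redis this is a sorted set (ZADD to upsert a member with its score, ZREVRANGE to read the top of the ranking). The sketch below mimics that behaviour in plain Python so it runs without a server; the key and member names are illustrative, not from the original post.

```python
# In-memory sketch of a Redis sorted set used as a follower-count ranking.
# The real thing would be:  ZADD followers_rank 99000 carol
#                           ZREVRANGE followers_rank 0 9   (top 10)

class FollowerRank:
    def __init__(self):
        self.scores = {}  # member -> follower count (the sorted-set score)

    def zadd(self, user, count):
        # Upsert, like ZADD: re-adding a member just updates its score.
        self.scores[user] = count

    def zrevrange(self, start, stop):
        # Like ZREVRANGE: highest score first, inclusive on both ends.
        ranked = sorted(self.scores, key=self.scores.get, reverse=True)
        return ranked[start:stop + 1]

rank = FollowerRank()
rank.zadd("bob", 50)
rank.zadd("alice", 1200)
rank.zadd("carol", 99000)
top_two = rank.zrevrange(0, 1)  # ["carol", "alice"]
```

Because Redis keeps the set ordered on every insert, the "top N users" query stays cheap no matter how often follower counts are refreshed.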

You can collect profile information before users go on to 1.1 (which forces auth), to populate the global DB. After that, you'd only have to fetch users' follower IDs (using 1.1's followers/ids, which I believe is way more reliable), and then progressively, pooling queries, populate the profile database in batches of 500 or 250 users, using the follower lists plus user details.
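The batching step above is just slicing the ID list from followers/ids into fixed-size chunks before each profile-lookup call. A minimal sketch (the batch size and variable names are illustrative):

```python
# Split the IDs returned by followers/ids into fixed-size chunks,
# one chunk per profile-lookup request.

def batches(ids, size):
    # Yield consecutive slices of at most `size` IDs.
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

follower_ids = list(range(1, 1051))  # pretend followers/ids returned 1050 IDs
groups = list(batches(follower_ids, 250))
# 1050 IDs at 250 per call -> 5 lookup requests (the last one holds 50 IDs)
```

Each chunk then becomes one API request, which keeps you inside the rate limit while the profile cache fills up in the background.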

This means that data can be queried dynamically without killing the server (or the servers, there should be more than one), therefore allowing for "partial results" (1M followers -> info about the first 10,000 just after signing up, for example).



you do have a sensible approach that you suggest here. originally, without using redis, i wanted to use mysql to cache all user data and insert/refresh details in it over time. the table quickly grew to 20M records (with meta-data taking at least 1K per record, if not more), and the database grew to multiple gigabytes. twitter has 140M accounts or more now, so i'd need headroom here, although i'd likely not touch a large number of twitter users.

also, the system started making sense after a while, once I had user ids already cached (you are correct, I get the IDs through followers/ids, which is a much more well-thought-out function in terms of limits).

but then mysql constantly crashed, and repairing/backing up a multi-gigabyte table exceeded my technical abilities, so i gave up. i split everything up into per-user sqlite databases that I back up to S3. i lose the ability to access a shared user cache though, since I can't query other users' sqlite databases in a sane way to see if they have meta data for a given user id.
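The per-user SQLite layout described above might look roughly like this (one small file per account, synced to S3; the table and column names here are hypothetical, not from the post):

```python
import sqlite3

# One SQLite file per account; ":memory:" stands in for the real path.
def open_user_db(path):
    conn = sqlite3.connect(path)
    conn.execute("""CREATE TABLE IF NOT EXISTS followers (
        user_id        INTEGER PRIMARY KEY,
        screen_name    TEXT,
        follower_count INTEGER,
        fetched_at     TEXT)""")
    return conn

conn = open_user_db(":memory:")
conn.execute("INSERT INTO followers VALUES (?, ?, ?, ?)",
             (42, "bob", 50, "2013-04-01"))
row = conn.execute(
    "SELECT screen_name, follower_count FROM followers WHERE user_id = ?",
    (42,)).fetchone()  # ("bob", 50)
```

This makes backup and repair trivial (each file is tiny and independent), at the cost the author mentions: there is no single place to ask "has anyone already fetched user X?".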

major problem is that I believe twitter will eventually shut me down if I duplicate/replicate their user database (and I constantly need to refresh, since user data will eventually go stale).


They don't need to know you replicate! ;)

By the way, which fields (nickname, avatar, follower count, following count…?) are you storing for follower representation?


well, they'd find out eventually, that's my worry. it's tough to speculate on something like this.

this is the culprit call: https://dev.twitter.com/docs/api/1.1/get/users/lookup

from that response, i store anything that might provide sensible statistics later on (so no profile colors or photos)


Well, it's intended to work with followers/ids and other calls. Twitter might come after you, but that would mean disregarding how you make your calls… If they pull the plug, it will be because of features, not because of how you use the API :(


you could easily apply sharding to that DB table and then scale horizontally as much as you like (just think big enough so that you don't have to re-shard too soon…)
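Since Twitter user IDs are integers, the simplest sharding scheme is a stable modulo mapping from ID to shard. A minimal sketch (the shard count and table-naming scheme are hypothetical):

```python
# Pick the shard count big enough up front so you don't have to
# re-shard too soon; 64 here is an arbitrary illustrative choice.
NUM_SHARDS = 64

def shard_for(user_id: int) -> int:
    # Plain modulo gives a stable, uniform-enough mapping for numeric IDs.
    return user_id % NUM_SHARDS

def table_for(user_id: int) -> str:
    # e.g. profiles_00 .. profiles_63, spread over one or more MySQL hosts
    return f"profiles_{shard_for(user_id):02d}"
```

Choosing many more shards than you have hosts lets you later move whole shards between machines without re-hashing any rows.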


"I'm surprised nobody has mentioned Redis yet."

Because it's not a very good fit for this job?


Why wouldn't it be?


The problem is Twitter API limits and processing bandwidth, and Redis isn't a +1 Wand of Scaling Magic.


Redis does help with the big database crashing; read the post.


Table too big? Redis is not a solution.



