Hacker News

I'm surprised nobody has mentioned Redis yet.

I'd build it in two stages. First, a cache that holds user information (essentially replicating profiles): follower count, etc. This would be shared among all users and could go into a dedicated Redis instance (and, why not, also be replicated to MySQL/InnoDB for convenience).

Then, the "graph" DB (follower lists), which I'd put into Redis. With some scripting and Redis magic, you can keep users automatically sorted (server-side) by their follower count. You'll just need a lot of RAM (get a dedicated server; look at OVH or others, since cloud is usually more expensive and less reliable when it comes to RAM).
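To make the "automatically sorted by follower count" part concrete: in Redis this is a sorted set (ZADD to upsert a member with its score, ZREVRANGE to read the top of the ranking). The sketch below mimics that behaviour in plain Python so it runs without a server; the key and member names are illustrative, not from the original post.

```python
# In-memory sketch of a Redis sorted set used as a follower-count ranking.
# The real thing would be:  ZADD followers_rank 99000 carol
#                           ZREVRANGE followers_rank 0 9   (top 10)

class FollowerRank:
    def __init__(self):
        self.scores = {}  # member -> follower count (the sorted-set score)

    def zadd(self, user, count):
        # Upsert, like ZADD: re-adding a member just updates its score.
        self.scores[user] = count

    def zrevrange(self, start, stop):
        # Like ZREVRANGE: highest score first, inclusive on both ends.
        ranked = sorted(self.scores, key=self.scores.get, reverse=True)
        return ranked[start:stop + 1]

rank = FollowerRank()
rank.zadd("bob", 50)
rank.zadd("alice", 1200)
rank.zadd("carol", 99000)
top_two = rank.zrevrange(0, 1)  # ["carol", "alice"]
```

Because Redis keeps the set ordered on every insert, the "top N users" query stays cheap no matter how often follower counts are refreshed.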

You can collect profile information before users go on to 1.1 (which forces auth), to populate the global DB. After that, you'd only have to fetch users' follower IDs (using 1.1's followers/ids, which I believe is way more reliable), and then progressively, pooling queries, populate the profile database in batches of 500 or 250 users, using the follower lists plus user details.
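The batching step above is just slicing the ID list from followers/ids into fixed-size chunks before each profile-lookup call. A minimal sketch (the batch size and variable names are illustrative):

```python
# Split the IDs returned by followers/ids into fixed-size chunks,
# one chunk per profile-lookup request.

def batches(ids, size):
    # Yield consecutive slices of at most `size` IDs.
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

follower_ids = list(range(1, 1051))  # pretend followers/ids returned 1050 IDs
groups = list(batches(follower_ids, 250))
# 1050 IDs at 250 per call -> 5 lookup requests (the last one holds 50 IDs)
```

Each chunk then becomes one API request, which keeps you inside the rate limit while the profile cache fills up in the background.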

This means that data can be queried dynamically without killing the server (or the servers, there should be more than one), therefore allowing for "partial results" (1M followers -> info about the first 10,000 just after signing up, for example).



you do have a sensible approach that you suggest here. originally, without using redis, i wanted to use mysql to cache all user data and insert/refresh details in it over time. the table quickly grew to 20M records (with meta-data taking at least 1K per record, if not more), and the database grew to multiple gigabytes. twitter has 140M accounts or more now, so i'd need headroom here, although i'd likely not touch a large number of twitter users.

also, the system started making sense after a while, once I had user ids already cached (you are correct, I get the IDs through followers/ids, which is a much more well-thought-out function in terms of limits).

but then mysql constantly crashed, and repairing/backing up a multi-gigabyte table exceeded my technical abilities, so i gave up. i split everything up into per-user sqlite databases that I back up to S3. i lose the ability to access a shared user cache though, since I can't query other users' sqlite databases in a sane way to see if they have meta data for a given user id.
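The per-user SQLite layout described above might look roughly like this (one small file per account, synced to S3; the table and column names here are hypothetical, not from the post):

```python
import sqlite3

# One SQLite file per account; ":memory:" stands in for the real path.
def open_user_db(path):
    conn = sqlite3.connect(path)
    conn.execute("""CREATE TABLE IF NOT EXISTS followers (
        user_id        INTEGER PRIMARY KEY,
        screen_name    TEXT,
        follower_count INTEGER,
        fetched_at     TEXT)""")
    return conn

conn = open_user_db(":memory:")
conn.execute("INSERT INTO followers VALUES (?, ?, ?, ?)",
             (42, "bob", 50, "2013-04-01"))
row = conn.execute(
    "SELECT screen_name, follower_count FROM followers WHERE user_id = ?",
    (42,)).fetchone()  # ("bob", 50)
```

This makes backup and repair trivial (each file is tiny and independent), at the cost the author mentions: there is no single place to ask "has anyone already fetched user X?".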

major problem is that I believe twitter will eventually shut me down if I duplicate/replicate their user database (and I constantly need to refresh, since user data will eventually go stale).


They don't need to know you replicate! ;)

By the way, which fields (nickname, avatar, follower count, following count…?) are you storing for follower representation?


well, they'd find out eventually, that's my worry. it's tough to speculate on something like this.

this is the culprit call: https://dev.twitter.com/docs/api/1.1/get/users/lookup

from that response, i store anything that might provide sensible statistics later on (so no profile colors or photos)


Well, it's intended to work with followers/ids and other calls. Twitter might come after you, but that would mean disregarding how you make your calls… If they pull the plug, it will be because of features, not because of how you use the API :(


you could easily apply sharding to that DB table and then scale horizontally as much as you like (just think big enough so that you don't have to re-shard too soon…)
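Since Twitter user IDs are integers, the simplest sharding scheme is a stable modulo mapping from ID to shard. A minimal sketch (the shard count and table-naming scheme are hypothetical):

```python
# Pick the shard count big enough up front so you don't have to
# re-shard too soon; 64 here is an arbitrary illustrative choice.
NUM_SHARDS = 64

def shard_for(user_id: int) -> int:
    # Plain modulo gives a stable, uniform-enough mapping for numeric IDs.
    return user_id % NUM_SHARDS

def table_for(user_id: int) -> str:
    # e.g. profiles_00 .. profiles_63, spread over one or more MySQL hosts
    return f"profiles_{shard_for(user_id):02d}"
```

Choosing many more shards than you have hosts lets you later move whole shards between machines without re-hashing any rows.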


"I'm surprised nobody has mentioned Redis yet."

Because it's not a very good fit for this job?


Why wouldn't it be?


The problem is Twitter API limits and processing bandwidth, and Redis isn't a +1 Wand of Scaling Magic.


Redis does help with the big database crashing; read the post.


Table too big? Redis is not a solution.



