old index:
problem: parts of the index were refreshed at a faster rate than others; the main layer updated only every couple of weeks, and to refresh any layer of the old index, Google had to analyze the entire web first.
impact: there was a significant delay between when Google found a page and when it became available to users.
new index:
solution: Caffeine. Google's new index analyzes the web in small portions and updates the search index on a continuous basis, globally. As Google finds new pages, or new information on existing pages, it can add them straight to the index.
improvement: users can find fresher information than ever before, no matter when or where it was published.
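A toy sketch of the contrast (purely illustrative; the names are made up, and real indexing is vastly more involved than word-to-URL sets):

    from collections import defaultdict

    def build_batch_index(all_pages):
        """Old style: analyze the whole crawl, then swap in a fresh index."""
        index = defaultdict(set)
        for url, text in all_pages.items():
            for word in text.split():
                index[word].add(url)
        return index

    class IncrementalIndex:
        """Caffeine style: fold each page into the live index as it's found."""
        def __init__(self):
            self.index = defaultdict(set)

        def add_page(self, url, text):
            for word in text.split():
                self.index[word].add(url)

    idx = IncrementalIndex()
    idx.add_page("http://example.com", "fresh caffeine news")
    print(sorted(idx.index["caffeine"]))  # the page is searchable immediately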
Recent changes to the Google search results page suggest that results are being categorized further by structured data type. This could be in line with the rollout of Caffeine.
Hmmm... come to think of it, I have been getting quite a lot of results from CodeWeblog.com in my search results. Unfortunately, that site seems to be a link farm, and it's so useless that I reported it to the Google spam page. Maybe its appearance in the top results was a result of the Caffeine changes.
Ah! It's not only me who is annoyed by these codeweblog.com results.
Is there a way to remove domains from Google results as a preference?
I know I can always add "-site:codeweblog.com"
But can I save this preference for future search queries?
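One workaround, sketched below: point your browser's keyword search at a URL with the exclusion already baked in. The helper is purely hypothetical (EXCLUDED_SITES and search_url are made-up names), just to show the idea:

    from urllib.parse import quote_plus

    EXCLUDED_SITES = ["codeweblog.com"]  # your saved "preference"

    def search_url(query):
        exclusions = " ".join("-site:" + s for s in EXCLUDED_SITES)
        return "http://www.google.com/search?q=" + quote_plus(query + " " + exclusions)

    print(search_url("java simpledateformat"))
    # http://www.google.com/search?q=java+simpledateformat+-site%3Acodeweblog.com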
Then this comment should be flagged. It (a) reduces the intelligence of the discussions on HN, (b) pushes more worthy comments down below the fold, (c) encourages similar pithy comments, and (d) makes those who would otherwise make intelligent comments less likely to comment.
This is the kind of one-liner that perhaps Jon Stewart would make...and while there is nothing wrong with The Daily Show, I thought that we, the HN community, are collectively better than that.
Sorry for the rant, but I'm just sick of wading through pithy one-liners on HN.
I felt a small amount of humor -- and a small attempt it was; I didn't consider it that amusing -- was welcome in a thread discussing the shortcomings of the accompanying infographic. Unless you'd like to have a serious discussion about those shortcomings? I'm not sure there's much intelligence to be had in a discussion responding to "Wow is that ever a terrible infographic".
Can we discuss its poor choice of color? Its lack of visual appeal? The obvious selection of a Home Depot paint swatch to represent Google's old index, and what that means for their database technology? Not really. Not in response to "Wow is that ever a terrible infographic".
I'd be hesitant to flag everything that you don't initially understand. Cut HN some slack, we're intelligent people and we enjoy humor as much as the next man. I certainly appreciate when someone makes me chuckle in the comments, and I'd loathe HN if that went away by your hand.
Given something like http://news.ycombinator.com/item?id=1403672 is it HN that has to change, or the person in the chair? I realize your time may be valuable, but you don't have to read everything and get offended by its presence -- including my comment.
Ah, sorry. My bad, I didn't get the joke at first. I thought you were simply saying the infographic was so vague that it looked like the Bohr model, but I guess you're also adding in the fact that the Caffeine user emits a quantum of energy after a successful search...? Or am I still not getting it?
Re my general complaint about one-liners, it could well be that I'm just not getting any of them....
Again, my apologies for jumping the gun. I shouldn't have been so forthright in saying your comment should be flagged.
Yeah, actually, I think my rant is more due to the fact that I don't think I get the joke...is it possible to get an explanation from somebody who gets the joke, rather than just being modded down?
Is the joke "hey this diagram looks like that diagram"?
Yes, nothing to get angry about (I probably came across as too aggressive), but I believe the prevalence of pithy one-liners like these (which get massive upvotes) reduces the quality of HN for the reasons listed above. And I don't think I'm the only one: I watched my comment fluctuate wildly between +6 and -4 over the course of a couple of hours... there are many others who feel the same way, even though we are now in the minority.
Regardless of what pg says, there are many of us who feel that HN has slowly decreased in quality over the last two years (and that's putting it diplomatically). Perhaps the time has come to fork HN.
The pigeon ranking page was an April 1st joke; it's funny on exactly one day of the year. Saying how you'll rank 85% of web searches is a little more serious business.
You think for a moment that at least maybe the different content types are allocated different colors, but, no, they're all dark blue. One of two dark blue orbits in Caffeine.
Well, search is their primary service; their web indexing algorithms are sort of their Mother of All Secret Sauces. You wouldn't expect Coke to publish the exact recipe.
On the other hand, it doesn't seem like that much when you consider today's storage density.
You can fit around 0.5 PB into one rack nowadays.
Framed as 200 racks, 100 petabytes sounds a bit less impressive.
However, that of course doesn't account for redundancy, nor for doing anything useful with such a pile of data, both of which pose interesting challenges at that scale.
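For what it's worth, the arithmetic in a few lines (the per-rack capacity is the figure above; the 3x replication is an assumed redundancy factor, not anything Google has stated):

    index_pb = 100           # Caffeine's reported ~100 PB
    pb_per_rack = 0.5        # assumed raw capacity of one rack
    replication = 3          # hypothetical redundancy factor

    raw_racks = index_pb / pb_per_rack         # 200 racks, no redundancy
    redundant_racks = raw_racks * replication  # 600 racks at 3x
    print(raw_racks, redundant_racks)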
You think Google's mainline storage of their index is on disk?
It is to laugh. I don't think that's been true since around the time they stopped building server shelves out of lego or cardboard.
They've been keeping the full text of the web in RAM (for the snippets), with indexes, several times over. With independent live siblings in multiple datacenters.
Well, the serving part of their index is in memory (or largely so) as has been publicly announced. But are you really suggesting they might do the same for indexing?
Let's suppose for a second that they can get 8 GB sticks of RAM for a third of the retail price of around $600, and, for the sake of the back-of-the-envelope calculation, round each stick up to ten GB. So call it $20/GB of RAM. Four-gigabyte sticks might seem cheaper until you consider they'd have to double the number of machines to hold them. Now, they mentioned that the size of the index is 100,000,000 GB, which means they would have spent $2 billion on this RAM alone, not to mention all the other components required to house it.
Especially considering how much their capital expenditure has fallen lately (http://www.datacenterknowledge.com/archives/2010/01/22/googl...), it doesn't seem very likely that they could afford to spend even $2 billion on this. And the assumptions I made about the price of RAM are pretty generous, considering they'd have had to acquire it over some time, when RAM was more expensive than it is now.
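Spelled out, with both numbers being the assumptions stated above:

    cost_per_gb = 20          # assumed bulk RAM price, USD/GB
    index_gb = 100_000_000    # "nearly 100 million gigabytes"

    ram_cost = cost_per_gb * index_gb
    print("$%.0f billion on RAM alone" % (ram_cost / 1e9))  # $2 billion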
The post's wording, "Caffeine takes up nearly 100 million gigabytes of storage in one database", doesn't make clear that all of that is needed for the constantly-consulted index; just that it's what the indexing system uses, perhaps including rarely-consulted information. They might also be reporting uncompressed data sizes, or counting all the raw 'storage' used to keep the data in triplicate (or more).
So the live data consulted for almost all web queries might be much, much less than 100 PiB of index data, and thus fit far more economically in RAM.
Your retail RAM prices are inflated about 3x, but I think your point is correct.
You know how Google has always had that little blurb about how long it took to process your query on the results page? Back when they were for sure running everything out of RAM, it was usually something comically small like 0.00025 seconds. I just checked and it's now more like 0.25 seconds.
Perhaps they've done testing and found that absurdly fast results don't matter (or no longer matter) as much as they thought?
Sorry, that's just what I saw on Newegg for 8GB sticks. I did not intend to mislead.
I am under the impression that most of Google's index is still served out of RAM; certainly I never saw any announcement to the contrary. The other poster also pointed out that sub-millisecond latencies are pretty unlikely. Considering all the diverse sets of data they need to pull from, it would be similarly hard to believe that tens or hundreds of disk seeks could be carried out quickly enough per query to yield a 200 ms total calculation time. If I had to guess, they have probably just started taking factors into account that they did not before. For example, perhaps they are measuring total internal latency rather than just the latency of a particular sub-system. Or maybe there are components that go to disk while the posting-list lookups are done in RAM.
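As a rough sanity check on that latency point (both numbers are assumptions picked for illustration: a typical 2010-era random seek time, and strictly serial lookups):

    seek_ms = 10      # assumed random-seek latency of a spinning disk
    lookups = 100     # "tens or hundreds" of seeks per query

    serial_ms = seek_ms * lookups  # 1000 ms if done one after another
    print(serial_ms, "ms, well over a ~200 ms query time")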
Why use RAM? You can get a 500 GB SSD for $1,000; 100 PB would cost you $200 million, which is peanuts to Google.
Or you could use PCIe SSDs: you can get a TB of PCIe flash for $3,000, and a motherboard with five PCIe x16 slots costs around $300 at Fry's. You could get 100 PB of PCIe SSD storage for around $300 million.
You could use RAM as a very fast cache, and SSD storage for faster-than-HDD indexing and retrieval.
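The same envelope math, redone with the SSD prices quoted above:

    index_gb = 100 * 1_000_000              # 100 PB in GB

    sata_cost = (1000 / 500) * index_gb     # $2/GB  -> $200 million
    pcie_cost = (3000 / 1000) * index_gb    # $3/GB  -> $300 million
    print("$%dM SATA, $%dM PCIe" % (sata_cost / 1e6, pcie_cost / 1e6))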
$200M is still a quarter of their capital expenditure in 2009 not including the other parts of the machines. And before 2009 flash was much more expensive and not as fast as it is now. It would surprise me greatly to hear that they had gone this route.
SSDs aren't reliable enough for Google scale. The rated number of writes before failure for flash would make replacement a constant effort for them and ruin any economy. You need RAM for the reliability as well as the random-access speed.
Most of that was during a time when RAM was significantly more expensive than my calculations took into account. But in any case, you and I are talking about two different things. You're saying that there is some possible universe where they are using 100 PB of ram on this project alone. I'm claiming that it's very unlikely that this is the same universe as the one we are living in. While I'll gladly admit that you are correct in your claim, I don't think it's useful.
What I dislike about Google's current algorithm is that when I search for a common term such as "javadoc simpledateformat", the API documentation for Java 1.4.2 (as opposed to 1.6) comes up, as apparently that is what most documents on the web link to. I hope this new index will allow more recent, more up-to-date documents to get to the top of the results list.
Yep, I've done exactly that for the Java 6 docs in Chrome and FF. The general point of how do you keep the latest version of a set of documents the most relevant still applies however. I hope Google can fix this :)
I had the same annoyance searching for PostgreSQL stuff -- the latest docs are almost never accessible from the Google page, just docs 2 years old or more.
Similarly, searching for Apache directives (e.g. http://www.google.com/search?q=rewriterule ) returns results for the Apache 1.3 docs. You have to add "2.2" when searching for Apache info.
Nexus One total internal storage: 512MB
'the largest iPod': 160GB
The Nexus One is a phone, not a device with a large storage capacity. If Google made one of those, I'm sure it would be referenced.
edit: the Nexus One box also includes a 4GB microSD card, but that would be comparing flash memory with hard drives, and is hardly an intuitive explanation.
I meant junk like search-engine-cheating forum and Usenet posts, and sites registered to repeat stock phrases and the content of spammy aggregator sites, rather than making judgements against lolcats and breakfast-time tweets.
I'm waiting for Google to open up an API or dashboard on top of Caffeine to compete with some of the monitoring tools like Radian6 and Jive. They could own that space and it's growing like crazy right now. Plus, we don't need any more companies coming out with "social web" search engines.
Why is every measure of storage always "demystified" with a paper-stack analogy? I find it funny that when the paper stack would not provide the right scale, they used a stack of storage devices...
Seems like the target group of this blog post is teenagers, not even students. Some street slang would fit well next to the size measurement in iPods. =)
Google uses only four languages in their codebase: C++, Java, Python, and JavaScript. Yegge has covered this pretty extensively in the past. Go Google him.