old index:
problem: parts of the index were refreshed at a faster rate than others; the main layer updated only every couple of weeks, and to refresh any layer of the old index, Google had to analyze the entire web first.
impact: there was a significant delay between when Google found a page and when it became available to users.
new index:
solution: Caffeine. Google's new index analyzes the web in small portions and updates the search index on a continuous basis, globally. As Google finds new pages, or new information on existing pages, it can add them straight to the index.
improvement: users can find fresher information than ever before, no matter when or where it was published.
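A toy sketch of the contrast (purely illustrative; the names are made up, and real indexing is vastly more involved than word-to-URL sets):

    from collections import defaultdict

    def build_batch_index(all_pages):
        """Old style: analyze the whole crawl, then swap in a fresh index."""
        index = defaultdict(set)
        for url, text in all_pages.items():
            for word in text.split():
                index[word].add(url)
        return index

    class IncrementalIndex:
        """Caffeine style: fold each page into the live index as it's found."""
        def __init__(self):
            self.index = defaultdict(set)

        def add_page(self, url, text):
            for word in text.split():
                self.index[word].add(url)

    idx = IncrementalIndex()
    idx.add_page("http://example.com", "fresh caffeine news")
    print(sorted(idx.index["caffeine"]))  # the page is searchable immediately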
Recent changes to the Google search results page suggest that results are being categorized further by structured data type. This could be in line with the rollout of Caffeine.
Hmmm... come to think of it, I have been getting quite a lot of results from CodeWeblog.com in my search results. Unfortunately, that site seems to be a link farm, and it's so useless that I reported it to the Google spam page. Maybe its appearance in the top results was a result of the Caffeine changes.
Ah! It's not only me who is annoyed by these codeweblog.com results.
Is there a way to remove domains from Google results as a preference?
I know I can always add "-site:codeweblog.com"
But can I save this preference for future search queries?
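One workaround, sketched below: point your browser's keyword search at a URL with the exclusion already baked in. The helper is purely hypothetical (EXCLUDED_SITES and search_url are made-up names), just to show the idea:

    from urllib.parse import quote_plus

    EXCLUDED_SITES = ["codeweblog.com"]  # your saved "preference"

    def search_url(query):
        exclusions = " ".join("-site:" + s for s in EXCLUDED_SITES)
        return "http://www.google.com/search?q=" + quote_plus(query + " " + exclusions)

    print(search_url("java simpledateformat"))
    # http://www.google.com/search?q=java+simpledateformat+-site%3Acodeweblog.com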
Then this comment should be flagged. It (a) reduces the intelligence of the discussions on HN, (b) pushes more worthy comments down below the fold, (c) encourages similar pithy comments, and (d) makes those who would otherwise make intelligent comments less likely to comment.
This is the kind of one-liner that perhaps Jon Stewart would make...and while there is nothing wrong with The Daily Show, I thought that we, the HN community, are collectively better than that.
Sorry for the rant, but I'm just sick of wading through pithy one-liners on HN.
I felt a small amount of humor -- and a small attempt it was; I didn't consider it that amusing -- was welcome in a thread discussing the shortcomings of the accompanying infographic. Unless you'd like to have a serious discussion about those shortcomings? I'm not sure there's much intelligence to be had in a discussion responding to "Wow is that ever a terrible infographic".
Can we discuss its poor choice of color? Its lack of visual appeal? The obvious selection of a Home Depot paint swatch to represent Google's old index, and what that means for their database technology? Not really. Not in response to "Wow is that ever a terrible infographic".
I'd be hesitant to flag everything that you don't initially understand. Cut HN some slack, we're intelligent people and we enjoy humor as much as the next man. I certainly appreciate when someone makes me chuckle in the comments, and I'd loathe HN if that went away by your hand.
Given something like http://news.ycombinator.com/item?id=1403672 is it HN that has to change, or the person in the chair? I realize your time may be valuable, but you don't have to read everything and get offended by its presence -- including my comment.
Ah, sorry. My bad, I didn't get the joke at first. I thought you were simply saying the infographic was so vague that it looked like the Bohr model, but I guess you're also adding in the fact that the Caffeine user emits a quantum of energy after a successful search...? Or am I still not getting it?
Re my general complaint about one-liners, it could well be that I'm just not getting any of them....
Again, my apologies for jumping the gun. I shouldn't have been so forthright in saying your comment should be flagged.
Yeah, actually, I think my rant is more due to the fact that I don't think I get the joke...is it possible to get an explanation from somebody who gets the joke, rather than just being modded down?
Is the joke "hey this diagram looks like that diagram"?
Yes, nothing to get angry about (I probably came across as too aggressive), but I believe the prevalence of pithy one-liners like these (which get massive upvotes) reduces the quality of HN for the reasons listed above. And I don't think I'm the only one: I watched my comment fluctuate wildly between +6 and -4 over the course of a couple of hours... there are many others who feel the same way, even though we are now in the minority.
Regardless of what pg says, there are many of us who feel that HN has slowly decreased in quality over the last two years (and that's putting it diplomatically). Perhaps the time has come to fork HN.
The pigeon ranking page was an April 1st joke; it's funny on exactly one day of the year. Saying how you'll rank 85% of web searches is a little more serious business.
You think for a moment that at least maybe the different content types are allocated different colors, but, no, they're all dark blue. One of two dark blue orbits in Caffeine.
Well, search is their primary service; their web indexing algorithms are sort of their Mother of All Secret Sauces. You wouldn't expect Coke to publish the exact recipe.
On the other hand, it doesn't seem like that much when you consider today's storage density.
You can fit around 0.5 PB into one rack nowadays.
Framed as 200 racks, 100 petabytes sounds a bit less impressive.
However, that of course doesn't account for redundancy, nor for doing anything useful with such a pile of data, both of which pose interesting challenges at that scale.
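For what it's worth, the arithmetic in a few lines (the per-rack capacity is the figure above; the 3x replication is an assumed redundancy factor, not anything Google has stated):

    index_pb = 100           # Caffeine's reported ~100 PB
    pb_per_rack = 0.5        # assumed raw capacity of one rack
    replication = 3          # hypothetical redundancy factor

    raw_racks = index_pb / pb_per_rack         # 200 racks, no redundancy
    redundant_racks = raw_racks * replication  # 600 racks at 3x
    print(raw_racks, redundant_racks)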
You think Google's mainline storage of their index is on disk?
It is to laugh. I don't think that's been true since around the time they stopped building server shelves out of lego or cardboard.
They've been keeping the full text of the web in RAM (for the snippets), with indexes, several times over. With independent live siblings in multiple datacenters.
Well, the serving part of their index is in memory (or largely so) as has been publicly announced. But are you really suggesting they might do the same for indexing?
Let's suppose for a second that they can get 8 GB sticks of RAM for a third of the retail price of around $600, and, for the sake of the back-of-the-envelope calculation, round each stick up to ten GB. So call it $20/GB of RAM. Four-gigabyte sticks might seem cheaper until you consider they'd have to double the number of machines to hold them. Now, they mentioned that the size of the index is 100,000,000 GB, which means they would have spent $2 billion on this RAM alone, not to mention all the other components required to house it.
Especially considering how much their capital expenditure has fallen lately (http://www.datacenterknowledge.com/archives/2010/01/22/googl...), it doesn't seem very likely that they could afford to spend even $2 billion on this. And the assumptions I made about the price of RAM are pretty generous, considering they'd have had to acquire it over some time, when RAM was more expensive than it is now.
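Spelled out, with both numbers being the assumptions stated above:

    cost_per_gb = 20          # assumed bulk RAM price, USD/GB
    index_gb = 100_000_000    # "nearly 100 million gigabytes"

    ram_cost = cost_per_gb * index_gb
    print("$%.0f billion on RAM alone" % (ram_cost / 1e9))  # $2 billion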
The post's wording, "Caffeine takes up nearly 100 million gigabytes of storage in one database", doesn't make clear that all of that is needed for the constantly-consulted index; just that it's what the indexing system uses, perhaps including rarely-consulted information. They might also be reporting uncompressed data sizes, or counting all the raw 'storage' used to keep the data in triplicate (or more).
So the live data consulted for almost all web queries might be much, much less than 100 PiB of index data, and thus fit far more economically in RAM.
Your retail RAM prices are inflated about 3x, but I think your point is correct.
You know how Google has always had that little blurb about how long it took to process your query on the results page? Back when they were for sure running everything out of RAM, it was usually something comically small like 0.00025 seconds. I just checked and it's now more like 0.25 seconds.
Perhaps they've done testing and found that absurdly fast results don't matter (or no longer matter) as much as they thought?
Sorry, that's just what I saw on Newegg for 8GB sticks. I did not intend to mislead.
I am under the impression that most of Google's index is still served out of RAM; certainly I never saw any announcement to the contrary. The other poster also pointed out that sub-millisecond latencies are pretty unlikely. Considering all the diverse sets of data they need to pull from, it would be similarly hard to believe that tens or hundreds of disk seeks could be carried out quickly enough per query to yield a 200 ms total calculation time. If I had to guess, they have probably just started taking factors into account that they did not before. For example, perhaps they are measuring total internal latency rather than just the latency of a particular sub-system. Or maybe there are components that go to disk while the posting-list lookups are done in RAM.
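As a rough sanity check on that latency point (both numbers are assumptions picked for illustration: a typical 2010-era random seek time, and strictly serial lookups):

    seek_ms = 10      # assumed random-seek latency of a spinning disk
    lookups = 100     # "tens or hundreds" of seeks per query

    serial_ms = seek_ms * lookups  # 1000 ms if done one after another
    print(serial_ms, "ms, well over a ~200 ms query time")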
Why use RAM? You can get a 500 GB SSD for $1,000; 100 PB would cost you $200 million, which is peanuts to Google.
Or you could use PCIe SSDs: you can get a TB of PCIe flash for $3,000, and a motherboard with five PCIe x16 slots costs around $300 at Fry's. You could get 100 PB of PCIe SSD storage for around $300 million.
You could use RAM as a very fast cache, and SSD storage for faster-than-HDD indexing and retrieval.
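The same envelope math, redone with the SSD prices quoted above:

    index_gb = 100 * 1_000_000              # 100 PB in GB

    sata_cost = (1000 / 500) * index_gb     # $2/GB  -> $200 million
    pcie_cost = (3000 / 1000) * index_gb    # $3/GB  -> $300 million
    print("$%dM SATA, $%dM PCIe" % (sata_cost / 1e6, pcie_cost / 1e6))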
$200M is still a quarter of their capital expenditure in 2009 not including the other parts of the machines. And before 2009 flash was much more expensive and not as fast as it is now. It would surprise me greatly to hear that they had gone this route.
SSDs aren't reliable enough for Google scale. The rated number of writes before failure for flash would make replacement a constant effort for them and ruin any economy. You need RAM for the reliability as well as the random-access speed.
Most of that was during a time when RAM was significantly more expensive than my calculations took into account. But in any case, you and I are talking about two different things. You're saying that there is some possible universe where they are using 100 PB of ram on this project alone. I'm claiming that it's very unlikely that this is the same universe as the one we are living in. While I'll gladly admit that you are correct in your claim, I don't think it's useful.
What I dislike about Google's current algorithm is that when I search for a common term such as "javadoc simpledateformat", the API documentation for Java 1.4.2 (as opposed to 1.6) comes up, as apparently that is what most documents on the web link to. I hope this new index will allow more recent, more up-to-date documents to get to the top of the results list.
Yep, I've done exactly that for the Java 6 docs in Chrome and FF. The general point of how do you keep the latest version of a set of documents the most relevant still applies however. I hope Google can fix this :)
I had the same annoyance searching for PostgreSQL stuff -- the latest docs are almost never accessible from the Google page, just docs 2 years old or more.
Similarly, searching for Apache directives (e.g. http://www.google.com/search?q=rewriterule ) returns results for the Apache 1.3 docs. You have to add "2.2" when searching for Apache info.
Nexus One total internal storage: 512MB
'the largest iPod': 160GB
The Nexus One is a phone, not a device with a large storage capacity. If Google made one of those, I'm sure it would be referenced.
edit: the Nexus One box also includes a 4GB microSD card, but that would be comparing flash memory with hard drives, and is hardly an intuitive explanation.
I meant junk like search-engine-cheating forum and Usenet posts, and sites registered to repeat stock phrases and the content of spammy aggregator sites, rather than making judgements against lolcats and breakfast-time tweets.
I'm waiting for Google to open up an API or dashboard on top of Caffeine to compete with some of the monitoring tools like Radian6 and Jive. They could own that space and it's growing like crazy right now. Plus, we don't need any more companies coming out with "social web" search engines.
Why is every measure of storage always "demystified" with a paper-stack analogy? I find it funny that when the paper stack would not provide the right scale, they used a stack of storage devices...
Seems like the target group of this blog post is teenagers, not even students. Some street slang would fit well next to the size measurement in iPods. =)
Google uses only four languages in their codebase: C++, Java, Python, and JavaScript. Yegge has covered this pretty extensively in the past. Go Google him.