Why search index size no longer matters
By Charlene Li
There’s already been a great deal written about the debate between Yahoo! and Google on their relative index sizes (see my previous post as well as Search Engine Watch’s two posts and John Battelle’s numerous posts on it). Like them, I was subjected to numerous phone calls and meetings with both Google and Yahoo! over the past week. Rather than add to the debate, I’d like to talk about what this debate means and the implications for the future, primarily that reported size just doesn’t matter anymore.
By way of background, when Google first called me last week to say that they couldn’t confirm Yahoo!’s index size, my first question to them was, “Are you saying that Yahoo! is lying?” Obviously, Google never said this outright, but from their discussions, they wanted analysts and press to come to that conclusion. Numerous charts and test examples provided by Google intimated that Yahoo!’s claim of 20 billion documents was based on inaccurate counting at best, deliberate obfuscation at worst.
After a week of watching Google flex its PR muscle, Yahoo! responded that it never said that its index was bigger than Google’s, only that it was 20 billion documents deep. Yahoo! said that it could never verify another search engine’s index size and so couldn’t make any such comparison. Yahoo! also strongly stated to me that any other search engine’s claim to be able to verify a competitor’s index size was unfounded.
I can understand Google’s ego taking a HUGE hit, since it has built its reputation and company culture on the fact that it is/was the “biggest” search engine out there. My advice to them was to stop challenging Yahoo! on the actual size of the index and to concentrate on relevancy instead. Yet throughout this whole debate, both Google and Yahoo! continued to focus on index size rather than provide data on how their searches are more relevant.
But on to the implications. I think Google has shot itself in the foot, because the importance of index size is now being widely disputed. Eventually, Google will come out with an update announcing that its index is at XX billion documents (presumably north of 20 billion). Rather than gasp in wonderment at how big Google is, we’ll all just shake our heads and say, “There they go again!”
Now some have called for standards and a way to audit index size, in the belief that understanding the size is important. Hypothetically, size does matter, but an audit plays directly into Google’s argument that there is a “right” way to count documents in an index (and they will argue strenuously that their way is the best). The reality is that index construction and counting are highly customized and proprietary to every index and hence can’t be standardized or audited. In the same way, relevance lies in the eye of the beholder: every search engine has its own relevancy metrics, and I can guarantee you that they all think they show up at the top of their own scales!
So we’ll continue to see “index envy” taking place between the search engines, but it’s clear to me that index size is no longer anything that outsiders can use to gauge how “good” a search engine is. Indeed, as personalized search, vertical search, and integration of content into search results become more important in determining how well we like a search engine, index size will quickly become irrelevant.