Yahoo! increases index size to 20 billion documents
By Charlene Li
Does size really matter? I think the biggest impact of the index size increase will be marketing for Yahoo! – at least until Google comes out with a similar announcement (which I expect in the next day or so at SES -- there’s too much ego at Google to let Yahoo! stand for long as having the “biggest” index.) Even if Google does come out with a large index size, Yahoo! is making a clear stand that it is willing and ready to take on Google at the core of search – the index size. Unfortunately, in a world where consumer attention is focused on simple-to-digest numbers, search leadership seemingly comes down to who has the biggest index. And nothing could be further from the truth.
Index size does matter if you’re looking for information on the fringes – yes, I’m talking about the “long tail”. Yahoo! has been making headway especially into “deep Web” content, such as its beta search through subscription databases (disclosure: Forrester Research is one of the participants in the Subscriptions service). One of the marketing implications of a larger index is that some companies may find their content now indexed by Yahoo! and no longer need to use Yahoo!’s paid inclusion service, Search Submit, to get into the index (though there are other reasons to use paid inclusion, namely frequent spidering).
But there are other considerations, such as relevancy, which gets harder and harder to determine when accessing databases of information – there are typically few links into a database that would allow a search engine to determine whether information is relevant or not.
So keep in mind that there are many dimensions to search beyond simply index size – the ability to determine relevance among different document objects and sources, as well as the ability to crawl that content regularly and quickly, are also factors. And as Yahoo! and Google have shown with their expansion into vertical search (think shopping, local, video, images), the user interface and specialization will also be major factors.
Aside: I found out about the announcement through Mike Liedtke at the Associated Press, who wrote it up; I’ve been waiting for the Y! Search Blog to post. Interestingly, I was headed to the Yahoo! campus for a briefing on a future announcement when I talked with Mike and brought up the index size announcement – it was news to the Yahoo! people as well (they weren’t in the core search group, which is presumably busy with SES).
Update: Well, that didn’t take long. I got a call this afternoon from Google, providing some background on how they go about counting the documents in their index. A couple of the issues they raised were interesting, ranging from how they de-dupe and canonicalize their index to ensure its high quality (de-duping is the process of removing URLs that point to the same page, while canonicalization is the process of taking all variations of a URL – for example, those that are dynamically generated – and recognizing that they all produce the same page). Google raised other issues, such as fully versus partially crawled URLs, and also the use of synonyms and stemming – at which point I realized that my capacity to understand technical search algorithms had reached its limit.
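To make the de-duping and canonicalization idea concrete, here is a toy sketch in Python – emphatically not how Google or Yahoo! actually do it; real engines use far more sophisticated rules, and the `IGNORED_PARAMS` list of tracking parameters here is entirely invented for illustration:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical session/tracking parameters to ignore; real engines
# learn which query parameters actually change the page content.
IGNORED_PARAMS = {"sessionid", "sid", "ref"}

def canonicalize(url: str) -> str:
    """Reduce URL variants (case, default ports, tracking params,
    trailing slashes, parameter order) to one canonical form."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    # Strip default ports (http:80, https:443).
    if (scheme, netloc.rsplit(":", 1)[-1]) in (("http", "80"), ("https", "443")):
        netloc = netloc.rsplit(":", 1)[0]
    path = parts.path.rstrip("/") or "/"
    # Drop tracking parameters and sort the rest for a stable order.
    query = urlencode(sorted(
        (k, v) for k, v in parse_qsl(parts.query)
        if k.lower() not in IGNORED_PARAMS
    ))
    return urlunsplit((scheme, netloc, path, query, ""))

def dedupe(urls):
    """Keep one URL per canonical form (de-duping)."""
    seen, unique = set(), []
    for url in urls:
        key = canonicalize(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique

urls = [
    "http://example.com:80/page/?sid=123",
    "HTTP://EXAMPLE.COM/page",
    "http://example.com/page?a=1&b=2",
    "http://example.com/page?b=2&a=1",
]
print(dedupe(urls))  # only two distinct pages survive
```

The point of the sketch is that every engine draws these lines differently – which parameters to ignore, whether a trailing slash matters – so two engines can count the “same” web very differently.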
Google’s briefing raised some good issues, and in one way was very refreshing – up to this point, they hadn’t felt they needed to defend how they calculated the size of their index, and it took a competitor’s prodding to get them to open up about it. I did sense that Google’s search ego was definitely taking a beating, and I have to hand it to them – they were pretty clever to use me to raise these issues rather than commenting directly on a competitor’s announcement.
Conveniently, I was headed over to the Yahoo! Search Night Out party, so I tracked down Tim Mayer from Yahoo! to get details on how they de-dupe and canonicalize (try saying that fast) their index. Tim assured me that of course they de-dupe and go through canonicalization. He did point out that there is an art to how engineers construct the de-duping processes, and that there are inevitably differences in how different search engines do this. This accounts in part for the great differences in search engine results, which are a function of not only how the index is created but also how the algorithms weight the results.
So I decided to conduct a few basic tests on Yahoo! and Google. I tested the searches [“angel island” “Christmas tree” light] on Yahoo! and Google and [“mt. trashmore riverview”] on Yahoo! and Google. One tip Google had was to go to the last page of the search results. For the Angel Island search, on Yahoo!, I got an estimation of 798 search results, but by the time I got to the end, there were only 117 shown out of an estimated 179 search results (hmmm, what happened to the other 619?). Google initially reported 391 search results, but the last page showed only 180 entries. When I clicked on the links to show “omitted results”, Yahoo! expanded and then dropped down to 166 out of an estimated 169 results by the last page, while Google expanded to 392 search results and showed 390.
So there are a few funky things happening with the Yahoo! estimated search results. But the proof was in the pudding, or more specifically, in the first few pages of the search results. Neither site could provide search results that told me who puts up the Christmas tree every year on Angel Island in San Francisco Bay. And Yahoo! was only slightly better at delivering Web sites that provided personal recollections of the beloved ski hill in my home town, lovingly called Mt. Trashmore.
My conclusion: for these two obscure searches, it appears that Yahoo!’s larger search index doesn’t produce significantly better search results and, in fact, delivers fewer search results overall. But I have to add a HUGE caveat: a larger index doesn’t necessarily mean only more search result depth, but potentially more breadth as well. For example, esoteric scientific research may now be searchable thanks to Yahoo!’s larger index.
In the end, without clear guidance on what standards and processes are used to create the index, index size really doesn’t matter – it’s only the results that count. I feel for Google – from a PR and marketing perspective, their hands are tied because the layperson can’t distinguish what really lies behind a search index of 11 billion documents (Google) versus 20 billion documents (Yahoo!), and Google can’t really explain all of this without sounding, well, “evil” and petty. In the end, though, it’s the search experience that matters. Tim Mayer said it best – Yahoo! hopes that the size of the index will be enough to entice consumers to give Yahoo! a spin, and that they’ll like what they see and come back again.