August 08, 2005

Yahoo! increases index size to 20 billion documents

By Charlene Li

Yahoo! announced that its search index, which it launched on July 20th, is now 20 billion documents strong. That surpasses the 11.3 billion documents that Google has announced it indexes.

Does size really matter? I think the biggest impact of the index size increase will be marketing for Yahoo! – at least until Google comes out with a similar announcement (which I expect in the next day or so at SES -- there’s too much ego at Google to let Yahoo! stand for long as having the “biggest” index.) Even if Google does come out with a large index size, Yahoo! is making a clear stand that it is willing and ready to take on Google at the core of search – the index size. Unfortunately, in a world where consumer attention is focused on simple-to-digest numbers, search leadership seemingly comes down to who has the biggest index. And nothing could be further from the truth.

Index size does matter if you’re looking for information on the fringes – yes, I’m talking about the “long tail”. Yahoo! has been making headway especially into “deep web” content, such as their beta search through subscription databases (disclosure: Forrester Research is one of the participants in the Subscriptions service). One of the marketing implications of a larger index is that some companies may find their content now indexed by Yahoo! and no longer need to use Yahoo’s paid inclusion service, Search Submit, to get into the index (but there are other reasons to use paid inclusion, namely more frequent spidering).

But there are other considerations, such as relevancy, which gets harder and harder to determine when accessing databases of information – there are typically few links into a database that would allow the search engine to determine what information is relevant.

So keep in mind that there are many dimensions to search beyond simply index size: the ability to determine relevance among different document objects and sources, as well as the ability to crawl that index regularly and quickly, are also factors. And as Yahoo! and Google have shown with their expansion into vertical search (think shopping, local, video, images), the user interface and specialization will also be major factors.

Aside: I found out about the announcement through Mike Liedtke at the Associated Press, who wrote it up, and I’ve been waiting for the Y! Search Blog to post about it. Interestingly, I was headed to the Yahoo! campus for a briefing on a future announcement when I talked with Mike and brought up the index size announcement – it was news to the Yahoo! people as well (they weren’t in the core search group, which is presumably busy with SES).

Update: Well, that didn’t take long. I got a call this afternoon from Google, providing some background on how they go about the accounting of the documents in their index. A couple of the issues they raised were interesting, including how they de-dupe and canonicalize their index to ensure its high quality (de-duping is the process of removing URLs that point to the same page, while canonicalization is the process of taking all variations of a URL, for example those that are dynamically generated, and recognizing that they all generate the same page). Google raised other issues, such as fully versus partially crawled URLs, and also the use of synonyms and stemming – at which point I realized that my capacity to understand technical search algorithms had reached its limit.
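To make the two terms concrete, here is a minimal sketch in Python. It is not how Google or Yahoo! actually do it (neither shared that level of detail); the normalization rules and the IGNORED_PARAMS list are assumptions chosen purely for illustration.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical query parameters assumed not to change the page content.
IGNORED_PARAMS = {"sessionid", "utm_source", "ref"}


def canonicalize(url: str) -> str:
    """Reduce variations of a URL to one canonical form (illustrative rules only)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]  # assumption: the www and bare hosts serve the same page
    path = parts.path.rstrip("/") or "/"
    # Drop ignorable parameters and sort the rest so ordering does not matter.
    query_pairs = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query)
        if key.lower() not in IGNORED_PARAMS
    )
    # The fragment (#...) never reaches the server, so it is discarded.
    return urlunsplit((scheme, host, path, urlencode(query_pairs), ""))


def dedupe(urls):
    """Return one canonical URL per distinct page, in first-seen order."""
    seen = {}
    for url in urls:
        seen.setdefault(canonicalize(url), url)
    return list(seen)


crawled = [
    "HTTP://www.Example.com/page/?b=2&a=1&sessionid=xyz#top",
    "http://example.com/page?a=1&b=2",
    "http://example.com/other",
]
unique_pages = dedupe(crawled)
```

In this toy example the first two crawled URLs collapse into a single entry, so three crawled URLs count as two documents. Different choices of rules would produce different counts from the same crawl, which is one reason two engines' index sizes are hard to compare.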

Google’s briefing raised some good issues, and in one way was very refreshing – up to this point, they haven’t felt they needed to defend how they calculated the size of their index, and it took a competitor’s prodding to get them to open up about it. I did sense that Google’s search ego was definitely taking a beating, and I have to hand it to them: they were pretty clever to use me to raise these issues rather than commenting directly on a competitor’s announcement.

Conveniently, I was headed over to the Yahoo! Search Night Out party so I tracked down Tim Mayer from Yahoo! to get details on how they de-dupe and canonicalize (try saying that fast) their index. Tim assured me that of course, they de-dupe and go through canonicalization. He did point out that there is an art to how engineers construct the de-duping processes, and that there are inevitably differences in how different search engines do this. This accounts in part for the great differences in search engine results, which are a function of not only how the index is created but also how the algorithms weight the results.

So I decided to conduct a few basic tests on Yahoo! and Google. I tested the searches [“angel island” “Christmas tree” light] and [“mt. trashmore riverview”] on both. One tip Google had was to go to the last page of the search results. For the Angel Island search, on Yahoo!, I got an initial estimate of 798 search results, but by the time I got to the end, there were only 117 shown out of an estimated 179 search results (hmmm, what happened to the other 619?). Google initially reported 391 search results, but the last page showed only 180 entries. When I clicked on the links to show “omitted results”, Yahoo! expanded and then dropped down to 166 out of an estimated 169 results by the last page, while Google expanded to 392 search results and showed 390.
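For anyone who wants to see the gap at a glance, the small Python sketch below compares each engine's first-page estimate with the number of results actually shown once omitted results were included. The figures are the ones I recorded above for the Angel Island search; the function and variable names are just illustrative.

```python
# Counts recorded for the Angel Island search described above.
counts = {
    "Yahoo!": {"initial_estimate": 798, "shown_with_omitted": 166},
    "Google": {"initial_estimate": 391, "shown_with_omitted": 390},
}


def shown_fraction(initial_estimate: int, shown: int) -> float:
    """Fraction of the first-page estimate that could actually be paged through."""
    return shown / initial_estimate


for engine, c in counts.items():
    fraction = shown_fraction(c["initial_estimate"], c["shown_with_omitted"])
    missing = c["initial_estimate"] - c["shown_with_omitted"]
    print(f"{engine}: {fraction:.1%} of the estimate shown, {missing} unaccounted for")
```

This is one query on one day, so it says nothing general about either engine; it only makes the size of the discrepancy in this test easier to see.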

So there are a few funky things happening with the Yahoo! estimated search results. But the proof was in the pudding, or more specifically, in the first few pages of the search results. Neither site could provide search results that told me who puts up the Christmas tree every year on Angel Island in San Francisco Bay. And Yahoo! was only slightly better at delivering Web sites that provided personal recollections of the beloved ski hill in my home town, lovingly called Mt. Trashmore.

My conclusion: for these two obscure searches, it appears that Yahoo!’s larger search index doesn’t produce significantly better search results, and in fact, delivers fewer search results overall. But I have to add a HUGE caveat that having a larger index doesn’t necessarily mean only more search result depth, but also potentially breadth. For example, esoteric scientific research may now be searchable thanks to Yahoo!’s larger index.

Without clear guidance on what standards and processes are used to create the index, index size really doesn’t matter – it’s only the results that will count. I feel for Google – from a PR and marketing perspective, their hands are tied because the layperson can’t distinguish what really lies behind a search index of 11.3 billion documents (Google) versus 20 billion documents (Yahoo!) and they can’t really explain all of this without sounding, well, “evil” and petty. In the end though, it’s the search experience that matters. Tim Mayer said it best – Yahoo! hopes that the size of the index will be enough to entice consumers to give Yahoo! a spin – and their hope is that they’ll like what they see and come back again.

TrackBack

TrackBack URL for this entry:
http://www.typepad.com/services/trackback/6a00d8341c50bf53ef00d8345a6db969e2

Listed below are links to weblogs that reference Yahoo! increases index size to 20 billion documents:

» 20 billion channels and nothing to see. from SEOToolSet Blog
Yahoo announced today that as of their last update, their index has been increased to 20 billion documents topping Google's 11.3 billion document index handily. Charlene Li has an excellent write up on why this is or isn't important. The most basic que... [Read More]

» Yahoo says to Google: "Mine's bigger than yours, nah, nah na nah nah" from The Marketing Microscope
20 BILLION pages in the Yahoo index - this is the announcement made by Yahoo on Tuesday (Forgive my reblogging ... [Read More]

Comments

Djibril

On this interesting subject, read this article from a French researcher: http://aixtal.blogspot.com/2005/03/web-yahoo-indexes-more-pages-than.html

Jack Krupansky

I sure would like to see a transparent explanation of the result count numbers (plural).

I just did a test on Yahoo and it does similar kinds of weird things with counts as Google...

1) The first page of results gives one count.
2) As I advance through the result pages the count incrementally *falls* on each page.
3) When I get to the "last" page of results the count dramatically falls to a new number that is well above the number on the last displayed result.
4) When I click on the link to show "omitted" results, the first results page has yet a new number of total results. At this point I can't easily navigate to the last page of a much longer list of results.
5) Rerunning the test, I see some variations on the numbers, but the same overall net impact.

Maybe Yahoo and Google should have a hyperlink under the result count number that takes you to a page that gives you the details, real scoop, caveats, and assorted and sundry excuses for why there is no "true" result count.

Maybe you could use your "clout" to get to the bottom of this.

Incidentally, my test search was my first and last names in quotes plus the word "blog".

-- Jack Krupansky
