When we give presentations to different business groups, we are often asked how Google works and, of course, how you can optimize for it. This video came out about a week ago, and many websites have already shared it, as it is one of the first times we have heard such an explanation from Google itself (well, actually Matt Cutts of Google, but you know what we mean).
Here is a summary of what Matt describes in the video above. We have made some edits for brevity and clarity:
So there are three things that you really want to do well if you want to be the world’s best search engine.
1) You want to crawl the web comprehensively and deeply.
2) You want to index those pages.
3) You want to rank or serve those pages and return the most relevant ones first.
The Google Dance
Crawling is actually more difficult than one might think. When Google started, when I joined back in 2000, we didn’t manage to crawl the web for something like three or four months. The more PageRank you have (that is, the more people who link to you, and the more reputable those people are), the more likely it is we’re going to discover your page relatively early in the crawl. In fact, you could imagine crawling in strict PageRank order; you’d get the CNNs of the world and The New York Times of the world and really very high PageRank sites. If you think about how things used to be, we used to crawl for 30 days.
So we’d crawl for several weeks. We would then index for about a week, and we would push that data out. (That would take about a week). Sometimes you’d hit one data center that had old data; sometimes you’d hit a data center that had new data. That’s what the Google dance was back then.
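To make the idea of crawling in reputation order a little more concrete, here is a minimal Python sketch of our own. It is not Google's crawler: the scores stand in for PageRank and are simply made up, the example URLs are hypothetical, and a real crawler would have to compute and update scores as it discovers links.

```python
import heapq

def crawl_order(scores, links, seeds):
    """Return URLs in the order a priority-driven crawler would fetch them.

    scores: hypothetical reputation score per URL (a stand-in for PageRank)
    links:  URL -> list of URLs that page links to
    seeds:  URLs the crawler already knows about
    """
    # heapq is a min-heap, so push negative scores to pop the highest first.
    frontier = [(-scores.get(url, 0.0), url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    order = []
    while frontier:
        _, url = heapq.heappop(frontier)
        order.append(url)  # "fetch" the page
        for target in links.get(url, []):
            if target not in seen:  # newly discovered page joins the frontier
                seen.add(target)
                heapq.heappush(frontier, (-scores.get(target, 0.0), target))
    return order

# Made-up scores and link graph, for illustration only.
scores = {
    "news.example": 0.9,
    "forum.example": 0.6,
    "blog.example": 0.4,
    "tiny.example": 0.1,
}
links = {
    "news.example": ["blog.example", "tiny.example"],
    "blog.example": ["forum.example"],
}
order = crawl_order(scores, links, seeds=["news.example"])
print(order)
```

Note that a page can only be fetched once something already crawled links to it, which is why the well-linked, reputable sites tend to be discovered early.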
The Fritz Update
So eventually, in 2003, I believe, we switched as part of an update called Update Fritz to crawling a fairly significant chunk of the web every day. So, if you imagine breaking the web into a certain number of segments, you could imagine crawling that part of the web and refreshing it every night. At any given point, your main base index would only be so out of date, because then you’d loop back around and you’d refresh that. That seems to work very, very well. Instead of waiting for everything to finish, you’re incrementally updating your index. We’ve gotten even better over time; at this point, we can get very, very fresh. Any time we see updates, we can usually find them very quickly.
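The segment idea can be sketched in a few lines of Python. This is only our illustration of the general approach, not how Google actually partitions the web: the hash-based assignment, the segment count, and the example URLs are all assumptions.

```python
import zlib

def segment_of(url, num_segments):
    """Assign a URL to a fixed segment using a stable hash (an assumption)."""
    return zlib.crc32(url.encode("utf-8")) % num_segments

def refresh_schedule(urls, num_segments, num_days):
    """For each day, list the URLs whose segment is due for a recrawl."""
    schedule = []
    for day in range(num_days):
        due = day % num_segments  # loop back around after the last segment
        schedule.append(
            sorted(u for u in urls if segment_of(u, num_segments) == due)
        )
    return schedule

# Hypothetical URLs, split into three segments and refreshed over six days.
urls = [f"site{i}.example/page" for i in range(10)]
schedule = refresh_schedule(urls, num_segments=3, num_days=6)
for day, due_urls in enumerate(schedule):
    print(day, due_urls)
```

With three segments, no page in this toy index is ever more than three days stale, which is the "only so out of date" property described above.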
In the old days, you would have not just a main or a base index, but you could have what were called supplemental results, or the supplemental index. That was something that we wouldn’t crawl and refresh quite as often, but it was a lot more documents. So, you could almost imagine having really fresh content, a layer of our main index, and then more documents that are not refreshed quite as often, but there’s a lot more of them.
OK, I have crawled a large fraction of the web. Within that web you have, for example, one document. Indexing is basically taking things in word order. Well, let’s just work through an example. Suppose you say Katy Perry; in a document, the words Katy and Perry appear right next to each other. What you want in an index is which documents does the word Katy appear in, and which documents does the word Perry appear in? So you might say Katy appears in documents 1, and 2, and 89, and 555, and 789. Perry might appear in documents number 2, and 8, and 73, and 555, and 1,000. The whole process of doing the index is reversing, so that instead of having the documents in word order, you have the words, and each word has its documents in document order.
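This "reversing" step is usually called building an inverted index. Here is a small Python sketch of it, assuming three tiny made-up documents and a naive split on whitespace (a real indexer handles punctuation, markup, and much more):

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each word to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

# Hypothetical documents, keyed by document number.
documents = {
    1: "Katy wrote a letter",
    2: "Katy Perry released a song",
    8: "Perry Mason is a lawyer",
}
index = build_inverted_index(documents)
print(index["katy"])   # [1, 2]
print(index["perry"])  # [2, 8]
```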
These are all the documents that a word appears in. Now when someone comes to Google and they type in Katy Perry, you want to say, “OK, what documents might match Katy Perry?” Well, document one has Katy, but it doesn’t have Perry; so it’s out. Document number two has both Katy and Perry; so that’s a possibility. Document eight has Perry but not Katy. 89 and 73 are also out because they don’t have the right combination of words. 555 actually contains both Katy and Perry. Then 789 and 1,000 are also out. So, when someone comes to Google and they type in Chicken Little, Britney Spears, Matt Cutts, Katy Perry, whatever it is; we find the documents that we believe have those words, either on the page or maybe in back links, in anchor text pointing to that document. Once you’ve done what’s called document selection, you try to figure out, “how should you rank those?” And that can be really tricky.
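Document selection for the Katy Perry example boils down to intersecting the two lists of document numbers. A minimal Python sketch, using the numbers from the example above (the function name is ours, and this ignores anchor text and everything else a real engine considers):

```python
def select_documents(postings, query):
    """Return IDs of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return []
    result = set(postings.get(words[0], []))
    for word in words[1:]:
        result &= set(postings.get(word, []))  # keep only docs with this word too
    return sorted(result)

# Document numbers taken from the example in the text.
postings = {
    "katy": [1, 2, 89, 555, 789],
    "perry": [2, 8, 73, 555, 1000],
}
matches = select_documents(postings, "Katy Perry")
print(matches)  # [2, 555]
```

Only documents 2 and 555 survive, which matches the walk-through: every other document is missing one of the two words.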
The Secret Sauce
We use PageRank as well as over 200 other factors in our rankings to try to say, “OK, maybe this document is really authoritative. It has a lot of reputation because it has a lot of PageRank. But it only has the word Perry once.” It just happens to have the word Katy somewhere else on the page. Whereas here is a document that has the word Katy and Perry right next to each other, so there’s proximity. It also has got a lot of reputation; it’s got a lot of links pointing to it. We try to balance that off. You want to find reputable documents that are also about what the user typed in. And that’s kind of the secret sauce, trying to figure out a way to combine those 200 different ranking signals in order to find the most relevant document.
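Nobody outside Google knows how those signals are actually combined, so the following Python sketch is purely illustrative. It assumes just two signals, a made-up reputation score and a simple proximity measure, blended with weights we picked arbitrarily:

```python
def min_gap(text, first, second):
    """Smallest distance in words between `first` and `second`, or None."""
    words = text.lower().split()
    first_pos = [i for i, w in enumerate(words) if w == first]
    second_pos = [i for i, w in enumerate(words) if w == second]
    if not first_pos or not second_pos:
        return None
    return min(abs(a - b) for a in first_pos for b in second_pos)

def score(doc, first, second, w_reputation=0.5, w_proximity=0.5):
    """Blend reputation and proximity; the weights are hypothetical."""
    gap = min_gap(doc["text"], first, second)
    # Adjacent words (gap of 1) get full proximity credit; farther apart gets less.
    proximity = 0.0 if gap is None else 1.0 / max(gap, 1)
    return w_reputation * doc["reputation"] + w_proximity * proximity

# Two made-up documents: one more reputable, one with the words adjacent.
docs = [
    {"name": "scattered", "reputation": 0.9,
     "text": "Perry the platypus met a singer named Katy"},
    {"name": "adjacent", "reputation": 0.7,
     "text": "Katy Perry announces a new tour"},
]
ranked = sorted(docs, key=lambda d: score(d, "katy", "perry"), reverse=True)
print([d["name"] for d in ranked])
```

With these particular weights, the document that has the two words side by side outranks the slightly more reputable one where they are far apart. Change the weights and the order can flip, which is exactly why the balancing act is hard.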
So at any given time, hundreds of millions of times a day, someone comes to Google. We try to find the closest data center to them. They type in something like Katy Perry; we send that query out to hundreds of different machines all at once, which look through their little tiny fraction of the web that we’ve indexed. Each one concludes, “OK, these are the documents that we think best match,” and all those machines return their matches. And we say, “OK, what’s the crème de la crème? What’s the needle in the haystack? What’s the best page that matches this query across our entire index?” Then we take that page and we try to show it with a useful snippet. So, you show the keywords in the context of the document and you get it all back in just under half a second.
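This fan-out pattern (send the query to many machines, let each return its best matches, then merge) can be sketched in Python as follows. It is a toy under our own assumptions: two in-memory "shards" stand in for hundreds of machines, threads stand in for network calls, and each document carries a precomputed, made-up score.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq

def search_shard(shard, query_words, k):
    """Return this shard's k best (score, doc_id) pairs for the query."""
    hits = [
        (doc["score"], doc_id)
        for doc_id, doc in shard.items()
        if query_words <= doc["words"]  # document must contain every query word
    ]
    return heapq.nlargest(k, hits)

def search(shards, query, k=2):
    """Query every shard at once, then merge the partial results."""
    query_words = set(query.lower().split())
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(
            pool.map(lambda s: search_shard(s, query_words, k), shards)
        )
    merged = [hit for partial in partials for hit in partial]
    return heapq.nlargest(k, merged)

# Each shard holds a small, hypothetical slice of the index.
shards = [
    {
        1: {"words": {"katy", "letter"}, "score": 0.9},
        2: {"words": {"katy", "perry", "song"}, "score": 0.85},
    },
    {
        8: {"words": {"perry", "mason"}, "score": 0.7},
        555: {"words": {"katy", "perry", "tour"}, "score": 0.6},
    },
]
results = search(shards, "Katy Perry")
print(results)  # [(0.85, 2), (0.6, 555)]
```

Because every shard only returns its own top few matches, the final merge step stays small no matter how large the overall index is.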
Even More Resources about How Search Works:
Read publications by Googlers: http://research.google.com/pubs/papers.html
“The Anatomy of a Large-Scale Hypertextual Web Search Engine”: http://research.google.com/pubs/archive/334.pdf
More videos from Google: http://www.youtube.com/GoogleWebmasterHelp
If you just cannot get enough of it, the same paper, written by Sergey Brin and Lawrence Page while they were at Stanford, is also hosted there: http://infolab.stanford.edu/~backrub/google.html
And for a deep dive by a third party check out: http://www.googleguide.com/google_works.html
This was a really nice insight into the history of Google’s infamous algorithm. What ‘Google Milestone’ has meant the most to you over the years?