On November 16th, 2003, Google commenced an update (the Florida update) which had a catastrophic effect on a very large number of websites and, in the process, turned search engine optimization on its head. It is usual to give Google's updates alphabetical names, in the same way that names are given to hurricanes, and this one became known as "Florida".
In a nutshell, a vast number of pages, many of which had ranked at or near the top of the results for a very long time, simply disappeared from the results altogether. The quality (relevancy) of the results for a great many searches was also reduced. In place of Google's usual relevant results, we now find pages listed that are off-topic, or whose on-topic connections are tenuous to say the least.
The theories about the Florida update
The various search engine related communities on the web went into overdrive trying to figure out what changes Google had made to cause such disastrous effects.
SEO filter (search engine optimization filter)
One of the main theories that was put forward, and that, at the time of writing, is still believed by many or most people, is that Google had implemented an 'SEO filter'. The idea is that, when a search query is made, Google gets a set of pages that match the query and then applies the SEO filter to each of them. Any pages that are found to exceed the threshold of 'allowable' SEO are dropped from the results. That's a brief summary of the theory.
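To make the theory concrete, here is a rough sketch of how such a filter would behave if it existed. This is only an illustration of the theory, not anything Google has published; the scoring function and the threshold value are invented for the example.

```python
# Illustrative sketch of the rumoured "SEO filter" theory, NOT Google's code.
# The optimization score and the threshold are hypothetical.

def seo_score(page):
    """Hypothetical over-optimization score, e.g. keyword density in the
    title, link texts and body combined into one number between 0 and 1."""
    return page.get("optimization_score", 0.0)

def apply_seo_filter(result_set, threshold=0.8):
    """Drop any page whose score exceeds the 'allowable' SEO threshold."""
    return [page for page in result_set if seo_score(page) <= threshold]

results = [
    {"url": "http://example-a.com/", "optimization_score": 0.95},  # heavily SEOed
    {"url": "http://example-b.com/", "optimization_score": 0.30},
]
print(apply_seo_filter(results))  # only example-b.com survives the filter
```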
At first I liked this idea, because it makes perfect sense for a search engine to do it. But I saw pages that were still ranked in the top 10 and that were very well optimized for the searchterms they ranked for. If an SEO filter were being applied, they wouldn't have been listed at all. Also, many pages that are not SEOed in any way were dropped from the rankings.
Searchterm list
People realized that this SEO filter was being applied to some searchterms but not to others, so they decided that Google is maintaining a list of searchterms to apply the filter to. I never liked that idea, because it doesn't make a great deal of sense to me. If an SEO filter can be applied to some searches on-the-fly, it can be applied to all searches on-the-fly.
LocalRank
Another idea that has taken hold is that Google has implemented LocalRank. LocalRank is a method of modifying the rankings based on the interconnectivity between the pages that have been selected to be ranked; i.e. pages in the selected set that are linked to from other pages in the selected set are ranked more highly. (Google took out a patent on LocalRank earlier this year.) But this idea cannot be right. A brief study of LocalRank shows that the technique does not drop pages from the results, as the Florida algorithm does; it merely rearranges them.
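To show why that matters, here is a minimal sketch of the LocalRank idea as described above: an already-selected result set is re-ordered by how many other pages in that same set link to each page. Note that nothing is dropped. The link data is invented for illustration.

```python
# A minimal sketch of LocalRank-style re-ordering. Pages are only rearranged,
# never removed, which is why LocalRank cannot explain pages vanishing.

def local_rank(selected_pages, links):
    """links: dict mapping each URL to the set of URLs it links to."""
    selected = set(selected_pages)
    def inter_set_links(url):
        return sum(1 for src in selected
                   if src != url and url in links.get(src, set()))
    # Stable sort: pages with equal counts keep their original order.
    return sorted(selected_pages, key=inter_set_links, reverse=True)

pages = ["a.com", "b.com", "c.com"]
links = {"a.com": {"c.com"}, "b.com": {"c.com"}, "c.com": set()}
print(local_rank(pages, links))  # ['c.com', 'a.com', 'b.com'] - nothing disappears
```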
Commercial list
It was noticed that many search results were biased towards information pages, and commercial pages were either dropped or moved down the rankings. From this sprang the theory that Google is maintaining a list of "money-words", and modifying the rankings of searches that are done for those words, so that informative pages are displayed at and near the top, rather than commercial ones.
Google sells advertising, and the ads are placed on the search results pages. Every time a person clicks on one of the ads, Google gets paid by the advertiser. In some markets, the cost per click is very expensive, and the idea of dropping commercial pages from the results, or lowering their rankings, when a money-word is searched on is to force commercial sites into advertising, thereby putting up the cost of each click and allowing Google to make a lot more money.
Comment on the above theories
All of the above theories are based on the premise that, when a search query is received, Google compiles a set of results and then modifies it in one way or another before presenting it as the search results. I am convinced that all of these theories are wrong, as we will see.
Stemming
Finally, there is a theory that has nothing to do with how the results set is compiled. Google has implemented stemming, which means that, in a search query, Google matches words of the same word-stem; e.g. drink is the stem of drink, drinks, drinking, drinker and drinkers. So far, this is not a theory - it's a fact, because Google says so on their website. The theory is that stemming accounts for all the Florida effects. Like the other theories, I will show why this one cannot be right.
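For readers unfamiliar with the term, here is a small sketch of what query stemming means in practice. The crude suffix-stripping stemmer below is invented for the example; real engines use something closer to the Porter stemmer.

```python
# A toy stemmer, only to illustrate what "matching words of the same word-stem" means.

def stem(word):
    for suffix in ("ers", "er", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def matches(query_word, page_words):
    return any(stem(query_word) == stem(w) for w in page_words)

print(stem("drinking"), stem("drinkers"))     # both reduce to "drink"
print(matches("drink", ["drinks", "water"]))  # True: "drinks" now matches "drink"
```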
Evidence
A number of pieces of evidence (Florida effects) can be seen in the results, but I won't go into detail about them all. One piece of evidence that everyone jumped to conclusions about is the fact that you can modify the searches to produce different results. For instance, a search for "uk holidays" (without quotes) shows one set of results, but if you tell Google not to include pages that contain a nonsense word, e.g. "uk holidays -asdqwezxc" (without quotes), you will get a different set of results for some searches, but not for others. Also, the results with the -nonsense word looked the same as they did before the update began, so they appeared to be the results from before a filter was applied.
This is what led people to come up with the idea of a list of searchterms or a list of money-words; i.e. a filter is applied to some searches but not to others. It was believed that the set of results without the -nonsense word (the normal results) was derived from the set produced with the -nonsense word. But that was a mistake.
What really happened
Although Google's results state how many matches were found for a searchterm (e.g. "1 - 10 of about 2,354,000"), they will only show a maximum of 1000 results. I decided to collect the entire sets of results produced with and without the -nonsense word, and compare them to see if I could discover why a page would be filtered out and why other pages made it to the top. I did it with a number of searchterms and I was very surprised to find that, in some cases, over 80% of the results had been filtered out and replaced by other results. Then I realised that the two sets of results were completely different - the 'filtered' sets were not derived from the 'unfiltered' sets.
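Here is a rough sketch of the kind of comparison I mean, assuming the two top-1000 lists have already been collected as ordered lists of URLs. The URLs and numbers in the example are invented; only the method of comparison is the point.

```python
# Compare the "normal" (Florida) results with the results of the -nonsense search:
# how much of one set survives in the other, and how far shared pages move.

def compare_result_sets(normal, with_nonsense):
    """Each argument is an ordered list of URLs (up to 1000)."""
    shared = set(normal) & set(with_nonsense)
    replaced_pct = 100.0 * (1 - len(shared) / len(with_nonsense))
    moves = {url: (with_nonsense.index(url) + 1, normal.index(url) + 1)
             for url in shared}
    return replaced_pct, moves

normal = ["info-site.org/page", "obscure.com/deep"]          # the Florida results
nonsense = ["shop-a.com", "shop-b.com", "obscure.com/deep"]  # with -asdqwezxc
pct, moves = compare_result_sets(normal, nonsense)
print(f"{pct:.0f}% of the -nonsense results are gone")  # 67% in this toy case
print(moves)  # obscure.com/deep: #3 with the -nonsense word, #2 without it
```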
The partners in each pair of result sets were completely different. The 'filtered' set didn't contain what was left over from the 'unfiltered' set, and very low-ranking pages in the 'unfiltered' set got very high rankings in the 'filtered' set. I saw one page, for instance, that was ranked at #800+ unfiltered and #1 filtered. That can't happen with a simple filter; a page can't jump over the other pages that weren't filtered out. All the theories about various kinds of filters and lists were wrong, because they all assumed that the result set is always compiled in the same way, regardless of the searchterm, and then modified by filters. That clearly isn't the case.
It is inescapable that the Google engine now compiles the search results for different queries in different ways. For some queries it compiles them in one way, and for others it compiles them in a different way. The different result sets are not due to filters; they are simply compiled differently in the first place. I.e. the result set without the -nonsense word and the result set with the -nonsense word are compiled in different ways and are not related to each other as the filter theories suggest. One set is not the result of filtering the other.
The most fundamental change that Google made with the Florida update is that they now compile the results set for the new results in a different way than they did before.
That's what all the previous theories failed to spot. The question now is, how does Google compile the new results set?
Back in 1999, a system for determining the rankings of pages was conceived and tested by Krishna Bharat. His paper about it is here. He called his search engine "Hilltop". At the time he wrote the paper, his address was Google's address, and people have often wondered if Google might implement the Hilltop system.
Hilltop employs an 'expert' system to rank pages. It compiles an index of expert web pages - these are pages that contain multiple links to other pages on the web of the same subject matter. The pages that end up in the rankings are those that the expert pages link to. Of course, there's much more to it than that, but it gives the general idea. Hilltop was written in 1999 and, if Google have implemented it, they have undoubtedly developed it since then. Even so, every effect that the Florida update has caused can be attributed to a Hilltop-type, expert-based system. An important thing to note is that the 'expert' system cannot create a set of results for all search queries. It can only create a set for queries of a more general nature.
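As a very rough sketch of that idea, and only as I understand it from the Hilltop paper, the ranking could look something like the following. The data, the topic lookup and the affiliation keys are invented for illustration; this is not a claim about what Google actually runs.

```python
# A sketch of expert-based ranking in the Hilltop style: find expert pages for
# the query topic, then rank target pages by how many UNAFFILIATED experts
# link to them. If there is no expert set for the query, no result set can be
# built this way and the normal mechanism has to take over.

def hilltop_style_rank(query_topic, experts, links, affiliation):
    """experts: topic -> list of expert page URLs
    links: expert URL -> set of target URLs it points to
    affiliation: URL -> affiliation key (e.g. domain or IP C-block)"""
    topic_experts = experts.get(query_topic, [])
    if not topic_experts:
        return None  # no expert set: fall back to the normal ranking mechanism
    votes = {}
    for expert in topic_experts:
        for target in links.get(expert, set()):
            votes.setdefault(target, set()).add(affiliation.get(expert, expert))
    # Count only distinct affiliations, so mirrored experts don't vote twice.
    return sorted(votes, key=lambda t: len(votes[t]), reverse=True)

experts = {"holidays": ["guide-a.org/links", "guide-b.net/resources"]}
links = {"guide-a.org/links": {"resort.com/info"},
         "guide-b.net/resources": {"resort.com/info"}}
affiliation = {"guide-a.org/links": "guide-a.org",
               "guide-b.net/resources": "guide-b.net"}
print(hilltop_style_rank("holidays", experts, links, affiliation))   # ['resort.com/info']
print(hilltop_style_rank("asdqwezxc", experts, links, affiliation))  # None: no expert set
```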
We see many search results that once contained useful commercial sites now containing much more in the way of information or authority pages. That's because expert pages have a significant tendency to point to information pages. We see that the results with and without the -nonsense word are sometimes different and sometimes the same. That's because an expert system cannot handle all search queries, as the Krishna Bharat paper states. When it can't produce a set of results, Google's normal mechanisms do it instead. We see that a great many home pages have vanished from the results (that was the first thing that everyone noticed). That's because expert pages are much more likely to point to the inner pages that contain the information than to home pages. Every effect we see in the search results can be attributed to an expert system like Hilltop.
I can see flaws in every theory that has been put forward thus far. The flaw in the SEO filter idea is that there are highly SEOed pages still ranking in the top 10 for searchterms that they should have been filtered out for. The flaw in the LocalRank theory is that LocalRank doesn't drop pages, but a great many pages have been dropped. The flaw in the list of searchterms is that, if a filter can be applied to one searchterm, it can be applied to them all, so why bother maintaining a list? The flaw in the money-words list idea is that, if it ever came out that they were doing it, Google would run the risk of going into a quick decline, and I just don't believe that the people at Google are that stupid. The flaw in the stemming theory is not that Google hasn't introduced stemming; it's that the theory doesn't take into account the fact that the Florida results set is compiled in a different way to the -nonsense set. Stemming is additional to the main change, but it isn't the main change itself.
The expert-system, or something like it, accounts for every Florida effect that we see. I am convinced that this is what Google rolled out in the Florida update. Having said that, I must also add that it is still a theory, and cannot be relied upon as fact. I cannot say that Google has implemented Hilltop, or a development of Hilltop, or even a Hilltop-like system. What I can say with confidence is that the results without a -nonsense word (the normal results) are not derived from the results with a -nonsense word, as most people currently think. They are a completely different results set and are compiled in a different way. And I can also say that every effect that the Florida update has caused would be expected with a Hilltop-like expert-based system.
Where do we go from here?
At the moment, Google's search results are in poor shape, in spite of what their representatives say. If they leave them as they are, they will lose users, and risk becoming a small engine as other top engines have done in the past. We are seeing the return of some pages that were consigned to the void, so it is clear that the people at Google are continuing to tweak the changes.
If they get the results to their satisfaction, the changes will stay and we will have to learn how to SEO Google all over again. But it can be done. There are reasons why certain pages are at the top of the search results and, if they can get there, albeit accidentally in many cases, other pages can get there too.
If it really is an expert system, then the first thing to realise is that the system cannot deal with all searchterms, so targeting non-generalised and lesser searchterms, using the usual search engine optimization basics, will still work.
For more generalised searchterms, the page needs to be linked to by multiple expert pages that are unaffiliated with the page. By "unaffiliated" I mean that they must reside on servers with different IP C-block addresses from each other and from the target page, and their URLs must not use the same domain name as each other or as the target page. These expert pages can either be found and links requested, or they can be created.
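As a concrete illustration of that affiliation test, here is a small sketch. The helper functions, and treating the first three IP octets as the "C block", are my own simplification of the rule rather than anything Google specifies in exactly this form.

```python
# Two pages count as affiliated if their servers share an IP C-block or their
# URLs share a domain name. c_block() performs a real DNS lookup.

from urllib.parse import urlparse
import socket

def c_block(host):
    ip = socket.gethostbyname(host)        # e.g. "208.77.188.166"
    return ".".join(ip.split(".")[:3])     # first three octets: "208.77.188"

def domain(url):
    host = urlparse(url).hostname or ""
    parts = host.split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else host

def unaffiliated(url_a, url_b):
    if domain(url_a) == domain(url_b):
        return False
    return c_block(urlparse(url_a).hostname) != c_block(urlparse(url_b).hostname)
```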
Latest
8th December 2003
Since soon after the Florida update began, some pages that disappeared from the results have been returning. In some cases they are back at, or close to, the rankings that they had before Florida. In other cases they are quite highly ranked but lower than before. Day after day, more of them are returning.
I put this down to Google recognizing that Florida caused a sharp decline in the quality (relevancy) of the search results. It appears that they are adjusting the algorithm's parameters, trying to find a balance between the new page selection process and good relevancy. In doing so, some of the pages that were dumped out of the results are getting back into the results set, and they are achieving high rankings because they already matched the ranking algorithm quite well, so once they are back in the results set, they do well in the rankings.
Reminder
Don't forget that all this is just theory, but what we see happening does appear to fit an expert system, although there can be other explanations. We can be sure that Google compiles the results sets in different ways depending on the searchterm, and that the Florida results are not derived, via one or more filters, from the -nonsense results, but we can't yet be certain that an expert system is used to compile the Florida results set.
22nd December 2003
Google has now dealt with the -nonsense search trick of seeing the non-Florida results, and it no longer works. That doesn't mean that the non-Florida results are no longer different from the Florida results; it's just that we can no longer see them.
5th January 2004
Dan Thies, of SEO Research Labs, came up with the interesting theory that the Florida changes are due to Google now using Topic-Sensitive PageRank (TSPR). His PDF article can be found here. It's an interesting theory because, like the 'expert system' theory, it would cause Google to use two different algorithms depending on the searchterm used. To date, it's the only other theory that I believe has a chance of being right.
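For reference, here is a rough sketch of the Topic-Sensitive PageRank idea itself (from Haveliwala's paper), as I understand it; it is not a reconstruction of Dan Thies's argument or of anything Google has confirmed. Instead of one global PageRank score per page, several topic-specific scores are precomputed and blended at query time according to how strongly the query matches each topic. The scores and weights below are invented.

```python
# A sketch of combining precomputed topic-specific PageRank scores at query time.

def tspr_score(page_topic_scores, query_topic_weights):
    """page_topic_scores: topic -> precomputed PageRank of the page within that topic.
    query_topic_weights: topic -> probability that the query belongs to that topic."""
    return sum(query_topic_weights.get(topic, 0.0) * score
               for topic, score in page_topic_scores.items())

page = {"travel": 0.008, "business": 0.001}
query_weights = {"travel": 0.9, "business": 0.1}   # e.g. a query like "uk holidays"
print(tspr_score(page, query_weights))             # about 0.0073
```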
http://www.webworkshop.net/florida-update.html
Friday, May 25, 2007
Google and Inbound Links (IBLs)
The effect of inbound links
It's common knowledge that Google evaluates many factors when working out which page to rank where in response to a search query. They claim to incorporate around 100 different ranking factors. And it's common knowledge that the most powerful of these ranking factors is link text. Link text is the text that you click on when clicking a link. Here's an example of link text:- miserable failure. The words "miserable failure" are the link text. Link text is also known as anchor text.
I used that particular example because it shows the power of link text - the link text effect. If you click on it, it searches Google for "miserable failure", and you may be surprised to see which page is ranked at #1. If you click on the "Cached" link for that #1 ranked listing, you will see Google's cache for the page, and you will see each word of the phrase "miserable failure" highlighted in yellow in the page - or that's what you would see if the page actually contained either of those words, but it doesn't.
So how come the George Bush page is ranked at #1 for a phrase that isn't anywhere to be found in the page? The cache page itself tells us. At the top of it, Google displays the words, "These terms only appear in links pointing to this page: miserable failure". The link texts of links that point to that page contain the words "miserable failure", and it's the power of those link texts that got the page to #1.
That demonstrates the power of link text in Google. Some people decided to get the George Bush page ranked #1 for "miserable failure", and they did it by linking to the page using the link text "miserable failure". It's known as "Googlebombing".
Why are inbound links so powerful?
It's because of the way that Google stores a page's data, and the way that they process a search query.
Google's Regular index consists of two indexes - the short index and the long index. They are also known as the short barrels and the long barrels. The short index is also known as the "fancy hits" index. Google also has a Supplemental index, but that's not part of the Regular index, and it's not relevant to this topic.
The short index is used to store the words in link texts that point to a page, the words in a page's title, and one or two other special things. But when they store the link text words in the short index, they are attributed to the target page, and not to the page that the link is on. In other words, if my page links to your page, using the link text "Miami hotels", then the words "Miami" and "hotels" are stored in the short index as though they appeared in your page - they belong to your page. If 100 pages link to your page, using those same words as link text, then your page will have a lot of entries in the short index for those particular words.
The long index is used to store all the other words on a page - its actual content.
And here's the point...
When Google processes a search query, they first try to get enough results from the short index. If they can't get enough results from there, they use the long index to add to what they have. It means that, if they can get enough results from the short index - that's the index that contains words in link texts and page titles - then they don't even look in the long index where the actual contents of pages are stored. Page content isn't even considered if they can get enough results from the link texts and titles index - the short index.
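A simplified model of that behaviour is sketched below. It is based on the short/long "barrels" described in the original Google paper and in the explanation above, not a claim about the current implementation; the data structures and example are invented.

```python
# Link text is credited to the TARGET page in the short index; page content
# goes in the long index; a query only reaches the long index when the short
# index can't supply enough results.

from collections import defaultdict

short_index = defaultdict(set)   # word -> pages credited via link text or title
long_index = defaultdict(set)    # word -> pages whose own content contains the word

def index_link(link_text, target_url):
    # The link text is attributed to the target page, not the page carrying the link.
    for word in link_text.lower().split():
        short_index[word].add(target_url)

def index_body(page_url, body_text):
    for word in body_text.lower().split():
        long_index[word].add(page_url)

def search(query, needed=10):
    words = query.lower().split()
    if not words:
        return set()
    results = set.intersection(*(short_index[w] for w in words))
    if len(results) < needed:  # page content is only consulted when the short index runs short
        results |= set.intersection(*(long_index[w] for w in words))
    return results

index_link("miserable failure", "whitehouse.gov/president/gwbbio.html")
index_body("whitehouse.gov/president/gwbbio.html", "biography of the president")
print(search("miserable failure"))  # the page is found although neither word appears in it
```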
That is the reason why link texts are so powerful for Google rankings. They are much more powerful than page titles, because a page can have the words from only one title in the short index, but it can have the words from a great many link texts in there. That is the reason why the George Bush page ranks #1 for "miserable failure". All the link texts from all the pages that link to the George Bush page using the "miserable failure" link text, are in the short index - and they are all attributed to the George Bush page.
Page titles are the second most powerful ranking factor, because they are stored in the short index.
URL-only listings
We sometimes see a page listed in the rankings, but its URL is shown and linked instead of its title, and there is no description snippet for it. These are known as URL-only listings. Google says that they are "partially indexed pages". I'll explain what that means, since it's relevant to this topic.
When Google spiders a page and finds a link to another page on it, but they don't yet have the other page in the index, they find themselves with some link text that they want to attribute to the other page, so that it can be used in the normal search query processing. They treat it as normal, and place it in the short index, attributing it to the other page which they haven't got. Sometimes they will store the words from more than one link to the other page before they have spidered and indexed the page itself.
Sometimes that link text data in the short index will cause the other page to be ranked for a search query before the page has been spidered and indexed. But they don't have the page itself, so they don't have its title, or anything from the page that can be used for the description snippet. So they simply display and link its URL.
That's what is meant by "partially indexed", and it's why we sometimes see those URL-only listings. Google will later spider the other page, its data will be stored as normal, and its listings in the search results will be displayed normally.
Note: When a page is indexed, not only is its content indexed, but also link texts that point to it are indexed as part of the page itself. So when links that point to a page are indexed, the page itself is partially indexed, even though it hasn't yet been spidered.
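As a rough sketch of why such a listing ends up URL-only, consider a result renderer like the one below. The field names are invented; the point is simply the fallback when no crawled content exists.

```python
# With no crawled content there is no title or snippet to show, so the URL
# stands in for both the title and the link.

def render_listing(doc):
    if doc.get("crawled"):
        return f"{doc['title']}\n{doc['snippet']}\n{doc['url']}"
    # Not yet spidered: only the link text credited to it is in the index.
    return doc["url"]

print(render_listing({"url": "http://example.com/new-page", "crawled": False}))
```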
http://www.webworkshop.net/google-and-inbound-links.html
The Madness of King Google
When Google arrived on the scene in the late 1990s, they came in with a new idea of how to rank pages. Until then, search engines had ranked each page according to what was in the page - its content - but it was easy for people to manipulate a page's content and move it up the rankings. Google's new idea was to rank pages largely by what was in the links that pointed to them - the clickable link text - which made it a little more difficult for page owners to manipulate the page's rankings.
Changing the focus from what is in a page to what other websites and pages say about a page (the link text), produced much more relevant search results than the other engines were able to produce at the time.
The idea worked very well, but it could only work well as long as it was never actually used in the real world. As soon as people realised that Google were largely basing their rankings on link text, webmasters and search engine optimizers started to find ways of manipulating the links and link text, and therefore the rankings. From that point on, Google's results deteriorated, and their fight against link manipulations has continued. We've had link exchange schemes for a long time now, and they are all about improving the rankings in Google - and in the other engines that copied Google's idea.
In the first few months of this year (2006), Google rolled out a new infrastructure for their servers. The infrastructure update was called "Big Daddy". As the update was completed, people started to notice that Google was dropping their sites' pages from the index - their pages were being dumped. Many sites that had been fully indexed for a long time were having their pages removed from Google's index, which caused traffic to deteriorate, and business to be lost. It caused a great deal of frustration, because Google kept quiet about what was happening. Speculation about what was causing it was rife, but nobody outside Google knew exactly why the pages were being dropped.
Then on the 16th May 2006, Matt Cutts, a senior Google software engineer, finally explained something about what was going on. He said that the dropping of pages is caused by the improved crawling and indexing functions in the new Big Daddy infrastructure, and he gave some examples of sites that had had their pages dropped.
Here is what Matt said about one of the sites:
Some one sent in a health care directory domain. It seems like a fine site, and it’s not linking to anything junky. But it only has six links to the entire domain. With that few links, I can believe that out toward the edge of the crawl, we would index fewer pages.
And about the same site, he went on to say:
A few more relevant links would help us know to crawl more pages from your site.
Because the site hasn't attracted enough relevant links to it, it won't have all of its pages included in Google's index, in spite of the fact that, in Matt's words, "it seems like a fine site". He also said the same about another of the examples that he gave.
Let me repeat one of the things that he said about that site. "A few more relevant links would help us know to crawl more pages from your site." What??? They know that the site is there! They know that the site has more pages that they haven't crawled and indexed! They don't need any additional help to know to crawl more pages from the site! If the site has "fine" pages then index them, dammit. That's what a search engine is supposed to do. That's what Google's users expect them to do.
Google never did crawl all sites equally. The amount of PageRank in a site has always affected how often a site is crawled. But they've now added links to the criteria, and for the first time they are dumping a site's pages OUT of the index if it doesn't have a good enough score. What sense is there in dumping perfectly good and useful pages out of the index? If they are in, leave them in. Why remove them? What difference does it make if a site has only one link pointing to it or a thousand links pointing to it? Does having only one link make it a bad site that people would rather not see? If it does, why index ANY of its pages? Nothing makes any sort of sense.
So we now have the situation where Google intentionally leaves "fine" and useful pages out of their index, simply because the sites haven't attracted enough links to them. It is grossly unfair to website owners, especially to the owners of small websites, most of whom won't even know that they are being treated so unfairly, and it short-changes Google's users, since they are being deprived of the opportunity to find many useful pages and resources.
So what now? Google has always talked against doing things to websites and pages, solely because search engines exist. But what can website owners do? Those who aren't aware of what's happening to their sites simply lose - end of story. Those who are aware of it are forced into doing something solely because search engines exist. They are forced to contrive unnatural links to their sites - something that Google is actually fighting against - just so that Google will treat them fairly.
Incidentally, link exchanges are no good, because Matt also said that too many reciprocal links cause the same negative effect: the site isn't crawled as often, and fewer pages from the site are indexed.
It's a penalty. There is no other way to see it. If a site is put on the Web, and the owner doesn't go in for search engine manipulation by doing unnatural link-building, the site gets penalised by not having all of its pages indexed. It can't be seen as anything other than a penalty.
Is that the way to run a decent search engine? Not in my opinion it isn't. Do Google's users want them to leave useful pages and resources out of the index, just because they haven't got enough links pointing to them? I don't think so. As a Google user, I certainly don't want to be short-changed like that. It is sheer madness to do it. The only winners are those who manipulate Google by contriving unnatural links to their sites. The filthy linking rich get richer, and the link-poor get poorer - and pushed by Google towards spam methods.
Google's new crawling/indexing system is lunacy. It is grossly unfair to many websites that have never even tried to manipulate the engine by building unnatural links to their sites, and it is very bad for Google's users, who are intentionally deprived of the opportunity to find many useful pages and resources. Google people always talk about improving the user's experience, but now they are intentionally depriving their users. It is sheer madness!
What's wrong with Google indexing decent pages, just because they are there? Doesn't Google want to index all the good pages for their users any more? It's what a search engine is supposed to do, it's what Google's users expect it to do, and it's what Google's users trust it to do, but it's not what Google is doing.
At the time of writing, the dropping of pages is continuing with a vengeance, and more and more perfectly good sites are being affected.
A word about Matt Cutts
Matt is a senior software engineer at Google, who currently works on the spam side of things. He is Google's main spam man. He communicates with the outside world through his blog, in which he is often very helpful and informative. Personally, I believe that he is an honest person. I have a great deal of respect for him, and I don't doubt anything that he says, but I accept that he frequently has to be economical with the truth. He may agree or disagree with some or all of the overwhelming outside opinion concerning Google's new crawl/index function, but if he agrees with any of it, he cannot voice it publicly. This article isn't about Matt Cutts, or his views and opinions; it is about what Google is doing.
The thread in Matt's blog where all of this came to light is here.
Update:
Since writing this article, it has occurred to me that I may have jumped to the wrong conclusion as to what Google is actually doing with the Big Daddy update. What I haven't been able to understand is the reason for attacking certain types of links at the point of indexing pages, instead of attacking them in the index itself, where they boost rankings. But attacking certain types of links may not be Big Daddy's primary purpose.
The growth of the Web continues at a great pace, and no search engine can possibly keep up with it. Index space has to be an issue for the engines sooner or later, and it may be that Big Daddy is Google's way of addressing the issue now. Search engines have normally tried to index as much of the Web as possible, but, since they can't keep pace with it, it may be that Google has made a fundamental change to the way they intend to index the Web. Instead of trying to index all pages from as many websites as possible, they may have decided to allow all sites to be represented in the index, but not necessarily to be fully indexed. In that way, they can index pages from more sites, and their index could be said to be more comprehensive.
Matt Cutts has stated that, with Big Daddy, they are now indexing more sites than before, and also that the index is now more comprehensive than before.
If that's what Big Daddy is about, then I would have to say that it is fair, because it may be that Google had to leave many sites out of the index due to space restrictions, and the new way would allow pages from more sites to be included in the index.
http://www.webworkshop.net/google-madness.html
Google's "Big Daddy" Update
In December 2005, Google began to roll out what they called the "Big Daddy" update, and by the end of March 2006 it had been fully deployed in all of their datacenters. It wasn't a normal update - normal updates are often algorithm changes. Big Daddy was a software/infrastructure change, largely to the way that they crawl and index websites.
As the update spread across the datacenters, people started to notice that many pages from their sites had disappeared from the regular index. Matt Cutts, a senior software engineer at Google, put it down to "sites where our algorithms had very low trust in the inlinks or the outlinks of that site. Examples that might cause that include excessive reciprocal links, linking to spammy neighborhoods on the web, or link buying/selling."
That statement pretty much sums up the way that the Big Daddy update affects websites. Links into and out of a site are being used to determine how many of the site's pages to have in the index. Matt then went on to give a few examples of sites that had been hit, and what he thought might be their problems...
About a real estate site, he said, "Linking to a free ringtones site, an SEO contest, and an Omega 3 fish oil site? I think I’ve found your problem. I’d think about the quality of your links if you’d prefer to have more pages crawled. As these indexing changes have rolled out, we’ve improving how we handle reciprocal link exchanges and link buying/selling."
About another real estate site, he said, "This time, I’m seeing links to mortgages sites, credit card sites, and exercise equipment. I think this is covered by the same guidance as above; if you were getting crawled more before and you’re trading a bunch of reciprocal links, don’t be surprised if the new crawler has different crawl priorities and doesn’t crawl as much."
And about a health care directory site, he said, "your site also has very few links pointing to you. A few more relevant links would help us know to crawl more pages from your site."
The Big Daddy update is mainly a new crawl/index function that evaluates the trustability of links into and out of a site, to determine how many of the site's pages to have in the index. Not only does it evaluate the trustability of the links, but it also takes account of the quantity of trustable links. As the health care site shows, if a site doesn't score well enough, it doesn't get all of its pages indexed, and if the site already had all of its pages indexed, many or most of them are removed.
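None of the following is Google's actual algorithm; it is just a sketch of the behaviour Matt Cutts described, in which the quantity and trustability of a site's inbound and outbound links set how deeply the site is crawled and indexed. The field names, weights and thresholds are invented.

```python
# A hypothetical crawl-depth rule in the spirit of the Big Daddy examples:
# few trusted links, heavy reciprocal linking, or spammy outlinks mean that
# only a shallow sample of the site gets indexed.

def pages_to_index(site):
    score = (site["trusted_inlinks"]
             - 0.5 * site["reciprocal_links"]    # excessive reciprocals reduce trust
             - 2.0 * site["spammy_outlinks"])    # linking to junk costs even more
    if score < 10:
        return 50                  # crawl only "out toward the edge": a shallow sample
    return site["total_pages"]     # enough trust: the whole site can be indexed

# Roughly like the six-link health care directory in Matt's example:
print(pages_to_index({"trusted_inlinks": 6, "reciprocal_links": 0,
                      "spammy_outlinks": 0, "total_pages": 5000}))  # 50
```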
I've written about the gross unfairness of evaluating links for that purpose here, so I won't go over it again, but I want to suggest a reason why Google has done it.
Since Google came on the scene with their links-based rankings system, people have increasingly arranged links solely for ranking purposes. For instance, link exchange schemes are all over the Web, and link exchange requests plague our email inboxes. Over the years, such links have increased and, because of them, the quality of Google's index/rankings has deteriorated. Google's system relies on the natural linking of the Web, but in implementing the system, they ruined natural linking, which in turn has eroded the quality of Google's index and rankings. It's my belief that Big Daddy is Google's way of addressing the problem. They are evaluating the trustability of both inbound and outbound links to try and prevent unnatural links from benefiting websites.
They had to address the problem but, in my opinion, they've done it in the wrong way. Nevertheless, it's done, and we have to live with it. We still have a lot to learn about the Big Daddy update, but the way I see it is that reciprocal and off-topic links are not dead, but they won't help a site as they did before. Perhaps those links won't count against a site, but they won't count for it, and links are now needed that count for a site, if it is to be fully indexed.
The best links to have are one-way on-topic links into the site, but because of what Google did to natural linking, they aren't easily found. Google caused people to not link naturally, and most sites don't naturally attract links, but the links must be found. The most obvious places to get them are directories. DMOZ can take a very long time to review a site, and even then the site may not be included, but it's a very good directory to be listed in, so it's always worth submitting to it (read this before submitting to DMOZ).
Other directories are well worth submitting to, and a good sized list of decent ones can be found at VileSilencer. Google may not credit all the links from all of them, but that doesn't matter as long as some of them are credited - and all of them may send some traffic.
Google isn't against link-building, and their own people suggest doing it. But it is ludicrous that we now have the situation where Google first destroyed the natural linking of the Web, and then turned around to suggest ways of unnaturally getting natural links, just so that a website can be treated fairly by them. It's a ludicrous situation, but that's the way it is. Some of the unnatural ways that Google suggests are writing articles that people will link to, writing a blog that people will link to, and creating a buzz. But most people don't want to write articles and blogs, and would have nothing to write or blog about, and very few sites can create a buzz, so for most people a buzz is a complete non-starter.
Update:
Since writing this article, it has occurred to me that I may have jumped to the wrong conclusion as to what Google is actually doing with the Big Daddy update. What I haven't been able to understand is the reason for attacking certain types of links at the point of indexing pages, instead of attacking them in the index itself, where they boost rankings. But attacking certain types of links may not be Big Daddy's primary purpose.
The growth of the Web continues at a great pace, and no search engine can possibly keep up with it. Index space has to be an issue for them sooner or later, and it may be that Big Daddy is Google's way of addressing the issue now. Search engines have normally tried to index as much of the Web as possible, but, since they can't keep pace with it, it may be that Google has made a fundamental change to the way they intend to index the Web. Instead of trying to index all pages from as many websites as possible, they may have decided to allow all sites to be represented in the index, but not necessarily to be fully indexed. In that way, they can index pages from more sites, and their index could be said to be more comprehensive.
Matt Cutts has stated that, with Big Daddy, they are now indexing more sites than before, and also that the index is now more comprehensive than before.
If that's what Big Daddy is about, then I can't find fault with it. But it doesn't make any difference to webmasters. We still need to find more of those one-way on-topic inbound links to get more of our pages in the index.
http://www.webworkshop.net/googles-big-daddy-update.html