Friday, May 25, 2007

Google's "Florida" Update

On November 16th 2003, Google commenced an update (the Florida update) which had a catastrophic effect for a very large number of websites and, in the process, turned search engine optimization on its head. It is usual to give alphabetical names to Google's updates in the same way that names are given to hurricanes, and this one became known as "Florida".
In a nutshell, a vast number of pages, many of which had ranked at or near the top of the results for a very long time, simply disappeared from the results altogether. Also, the quality (relevancy) of the results for a great many searches was reduced. In the place of Google's usual relevant results, we are now finding pages listed that are off-topic, or their on-topic connections are very tenuous to say the least.

The theories about the Florida update
The various search engine related communities on the web went into overdrive to try to figure out what changes Google had made to cause such disastrous effects.

SEO filter (search engine optimization filter)
One of the main theories that was put forward and that, at the time of writing, is still believed by many or most people, is that Google had implemented an 'seo filter'. The idea is that, when a search query is made, Google gets a set of pages that match the query and then applies the seo filter to each of them. Any pages that are found to exceed the threshold of 'allowable' seo are dropped from the results. That's a brief summary of the theory.
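The theory can be sketched in a few lines of Python. This only illustrates the idea as people described it, not anything Google is known to do; the `seo_score` values, the `SEO_THRESHOLD` and the page names are all invented for the example.

```python
# Hypothetical sketch of the 'seo filter' theory; every name and number is invented.
SEO_THRESHOLD = 0.8  # assumed level of 'allowable' seo


def apply_seo_filter(ranked_pages, seo_score, threshold=SEO_THRESHOLD):
    """Drop pages whose seo score exceeds the threshold; keep the rest in order."""
    return [page for page in ranked_pages if seo_score[page] <= threshold]


ranked = ["a.example", "b.example", "c.example", "d.example"]
scores = {"a.example": 0.9, "b.example": 0.4, "c.example": 0.85, "d.example": 0.2}

filtered = apply_seo_filter(ranked, scores)
print(filtered)  # ['b.example', 'd.example']
```

Note that a filter of this kind can only remove pages. The pages that survive keep the same order relative to each other.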

At first I liked this idea because it makes perfect sense for a search engine to do it. But I saw pages that were still ranked in the top 10, and that were very well optimized for the searchterms that they were ranked for. If an seo filter were being applied, they wouldn't have been listed at all. Also, many pages that are not SEOed in any way were dropped from the rankings.

Searchterm list
People realized that this seo filter was being applied to some searchterms but not to others, so they decided that Google is maintaining a list of searchterms to apply the filter to. I never liked that idea because it doesn't make a great deal of sense to me. If an seo filter can be applied to some searches on-the-fly, it can be applied to all searches on-the-fly.

LocalRank
Another idea that has taken hold is that Google have implemented LocalRank. LocalRank is a method of modifying the rankings based on the interconnectivity between the pages that have been selected to be ranked. I.e. pages in the selected set, that are linked to from other pages in the selected set, are ranked more highly. (Google took out a patent on LocalRank earlier this year). But this idea cannot be right. A brief study of LocalRank shows that the technique does not drop pages from the results, as the Florida algorithm does. It merely rearranges them.
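A much simplified sketch of that kind of re-ranking, in Python, shows why it only rearranges. It assumes that a page's local score is simply the number of other pages in the selected set that link to it; the function name and the sample data are hypothetical, and the patented method is more involved than this.

```python
def rerank_by_local_links(ranked_pages, links):
    """Reorder pages by how many other pages in the same set link to them.

    `links` maps a page to the set of pages it links to. Pages with equal
    votes keep their original order because Python's sort is stable.
    """
    in_set = set(ranked_pages)
    votes = {page: 0 for page in ranked_pages}
    for source, targets in links.items():
        if source not in in_set:
            continue  # links from outside the selected set do not count
        for target in targets:
            if target in in_set and target != source:
                votes[target] += 1
    return sorted(ranked_pages, key=lambda page: -votes[page])


ranked = ["p1", "p2", "p3", "p4"]
links = {"p1": {"p3"}, "p2": {"p3", "p4"}, "p4": {"p3"}}

reranked = rerank_by_local_links(ranked, links)
print(reranked)  # ['p3', 'p4', 'p1', 'p2']
```

The output contains exactly the same pages as the input, only in a different order. Nothing is dropped.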

Commercial list
It was noticed that many search results were biased towards information pages, and commercial pages were either dropped or moved down the rankings. From this sprang the theory that Google is maintaining a list of "money-words", and modifying the rankings of searches that are done for those words, so that informative pages are displayed at and near the top, rather than commercial ones.

Google sells advertising, and the ads are placed on the search results pages. Every time a person clicks on one of the ads, Google gets paid by the advertiser. In some markets, the cost per click is very expensive, and the idea of dropping commercial pages from the results, or lowering their rankings, when a money-word is searched on is to force commercial sites into advertising, thereby putting up the cost of each click and allowing Google to make a lot more money.

Comment on the above theories
All of the above theories are based on the premise that, when a search query is received, Google compiles a set of results and then modifies that set in one way or another before presenting it as the search results. I am convinced that all the above theories are wrong, as we will see.

Stemming
Finally, there is a theory that has nothing to do with how the results set is compiled. Google has implemented stemming, which means that, in a search query, Google matches words of the same word-stem; e.g. drink is the stem of drink, drinks, drinking, drinker and drinkers. So far, this is not a theory - it's a fact, because Google say it on their website. The theory is that stemming accounts for all the Florida effects. As with the other theories, I will show why this one cannot be right.
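For readers unfamiliar with stemming, here is a toy illustration in Python. The suffix list and the `toy_stem` function are assumptions made up for this example; real stemmers use far more careful rules, and nothing here is meant to describe how Google does it.

```python
SUFFIXES = ("ers", "ing", "er", "s")  # assumed toy list, longest suffix first


def toy_stem(word):
    """Strip one known suffix, provided at least three letters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def same_stem(query_word, page_word):
    """True if both words reduce to the same toy stem."""
    return toy_stem(query_word) == toy_stem(page_word)


words = ["drink", "drinks", "drinking", "drinker", "drinkers"]
stems = [toy_stem(w) for w in words]
print(stems)  # ['drink', 'drink', 'drink', 'drink', 'drink']
```

With stemming in place, a query for "drinks" can match a page that only contains "drinking", which widens the set of candidate pages for a search.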


Evidence
Several pieces of evidence (Florida effects) can be seen in the results, but I won't go into detail about them all. One piece of evidence that everyone jumped to conclusions about is the fact that you can modify the searches to produce different results. For instance, a search for "uk holidays" (without quotes) shows one set of results, but if you tell Google not to include pages that contain a nonsense word, e.g. "uk holidays -asdqwezxc" (without quotes), you will get a different set of results for some searches, but not for others. Also, the results with the -nonsense word looked the same as the results from before the update began, so they appeared to be the results before a filter was applied.

This is what led people to come up with the idea of a list of searchterms or a list of money-words; i.e. a filter is applied to some searches but not to others. It was believed that the set of results without the -nonsense word (the normal results) were derived from the set produced with the -nonsense word. But that was a mistake.

What really happened
Although Google's results state how many matches were found for a searchterm (e.g. "1 - 10 of about 2,354,000"), they will only show a maximum of 1000 results. I decided to compare the entire sets of results produced with and without the -nonsense word, to see if I could discover why a page would be filtered out and why other pages made it to the top. I did it with a number of searchterms and I was very surprised to find that, in some cases, over 80% of the results had been filtered out and replaced by other results. Then I realised that the two sets of results were completely different - the filtered sets were not derived from the unfiltered sets.

The partners in each pair of result sets were completely different. The 'filtered' set didn't contain what was left from the 'unfiltered' set, and very low ranking pages in the 'unfiltered' set got very high rankings in the 'filtered' set. I saw one page, for instance, that was ranked at #800+ unfiltered and #1 filtered. That can't happen with a simple filter. It can't jump over the other pages that weren't filtered out. All the theories about various kinds of filters and lists were wrong, because they all assumed that the result set is always compiled in the same way, regardless of the searchterm, and then modified by filters. That clearly isn't the case.
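The reasoning can be expressed as a simple check. If one list were produced by filtering another, it would be the other list with some entries removed and the survivors still in the same relative order. The Python sketch below tests for exactly that; the function names and sample data are hypothetical, and the real comparison was done on result sets of up to 1000 pages.

```python
def could_be_filtered_from(filtered, unfiltered):
    """True if `filtered` is `unfiltered` with entries removed and order kept."""
    remaining = iter(unfiltered)
    # `page in remaining` advances the iterator, so each page has to be found
    # after the position where the previous page was found.
    return all(page in remaining for page in filtered)


def overlap_fraction(a, b):
    """Fraction of the distinct pages in `a` that also appear in `b`."""
    pages_a = set(a)
    return len(pages_a & set(b)) / len(pages_a) if pages_a else 0.0


unfiltered = ["a", "b", "c", "d", "e"]

print(could_be_filtered_from(["b", "d"], unfiltered))  # True: a plain filter
print(could_be_filtered_from(["e", "b"], unfiltered))  # False: "e" jumped over "b"
print(overlap_fraction(["e", "b", "x", "y"], unfiltered))  # 0.5
```

A page moving from #800+ to #1 while other surviving pages fall behind it is the second case: no filter applied to the first list could produce that order.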

It is inescapable that the Google engine now compiles the search results for different queries in different ways. For some queries it compiles them in one way, and for others it compiles them in a different way. The different result sets are not due to filters; they are simply compiled differently in the first place. I.e. the result set without the -nonsense word, and the result set with the -nonsense word, are compiled in different ways and are not related to each other as the filter theories suggest. One set is not the result of filtering the other set.

The most fundamental change that Google made with the Florida update is that they now compile the results set for the new results in a different way than they did before.

That's what all the previous theories failed to spot. The question now is, how does Google compile the new results set?

Back in 1999, a system for determining the rankings of pages was conceived and tested by Krishna Bharat. His paper about it is here. He called his search engine "Hilltop". At the time he wrote the paper, his address was Google's address, and people have often wondered if Google might implement the Hilltop system.

Hilltop employs an 'expert' system to rank pages. It compiles an index of expert web pages - these are pages that contain multiple links to other pages on the web of the same subject matter. The pages that end up in the rankings are those that the expert pages link to. Of course, there's much more to it than that, but it gives the general idea. Hilltop was written in 1999 and, if Google have implemented it, they have undoubtedly developed it since then. Even so, every effect that the Florida update has caused can be attributed to a Hilltop-type, expert-based system. An important thing to note is that the 'expert' system cannot create a set of results for all search queries. It can only create a set for queries of a more general nature.

We see many search results that once contained useful commercial sites now containing much more in the way of information or authority pages. That's because expert pages would have a significant tendency to point to information pages. We see that the results with and without the -nonsense word are sometimes different and sometimes the same. That's because an expert system cannot handle all search queries, as the Krishna Bharat paper states. When it can't produce a set of results, Google's normal mechanisms do it instead. We see that a great many home pages have vanished from the results (that was the first thing that everyone noticed). It's because expert pages are much more likely to point to the inner pages that contain the information than to home pages. Every effect we see in the search results can be attributed to an expert system like Hilltop.
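The two-path behaviour described above can be sketched in Python. This is a deliberately crude stand-in for an expert system: it just counts how many expert pages link to each target, and it treats the query as a ready-made topic key. The `MIN_EXPERTS` value, the function names and the sample data are all assumptions for illustration, not a description of Hilltop's actual scoring or of anything Google runs.

```python
MIN_EXPERTS = 2  # assumed minimum number of expert pages needed for a query


def expert_results(query, expert_index):
    """Rank target pages by how many expert pages for the query link to them.

    `expert_index` maps a topic to {expert_page: [pages it links to]}.
    Returns None when there are too few experts to answer the query.
    """
    experts = expert_index.get(query, {})
    if len(experts) < MIN_EXPERTS:
        return None
    votes = {}
    for targets in experts.values():
        for target in set(targets):
            votes[target] = votes.get(target, 0) + 1
    return sorted(votes, key=lambda page: (-votes[page], page))


def search(query, expert_index, normal_results):
    """Use the expert path when it can answer; otherwise use the normal path."""
    results = expert_results(query, expert_index)
    return results if results is not None else normal_results(query)


def normal_results(query):
    """Stand-in for the pre-Florida ranking; returns a fixed list here."""
    return ["old-ranking.example"]


expert_index = {
    "uk holidays": {
        "dir1.example": ["info.example/guide", "shop.example/deals"],
        "dir2.example": ["info.example/guide"],
    }
}

general = search("uk holidays", expert_index, normal_results)
specific = search("blue widget model 42x", expert_index, normal_results)
print(general)   # ['info.example/guide', 'shop.example/deals']
print(specific)  # ['old-ranking.example']
```

In this sketch the general query is answered entirely from what the expert pages link to, while the specific query, for which no experts exist, gets the old results unchanged. That is the same pattern as the -nonsense comparison: different for some searchterms, identical for others.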

I can see flaws in every theory that has been put forward thus far. The flaw in the seo filter idea is that there are highly SEOed pages still ranking in the top 10 for searchterms that they should have been filtered out for. The flaw in the LocalRank theory is that LocalRank doesn't drop pages, but a great many pages have been dropped. The flaw in the list of searchterms is that if a filter can be applied to one searchterm, it can be applied to them all, so why bother maintaining a list. The flaw in the money-words list idea is that, if it ever came out that they were doing it, Google would run the risk of going into a quick decline. I just don't believe that the people at Google are that stupid. The flaw in the stemming theory is not that Google hasn't introduced stemming, it's that the theory doesn't take into account the fact that the Florida results set is compiled in a different way to the -nonsense set. Stemming is additional to the main change, but it isn't the main change itself.

The expert-system, or something like it, accounts for every Florida effect that we see. I am convinced that this is what Google rolled out in the Florida update. Having said that, I must also add that it is still a theory, and cannot be relied upon as fact. I cannot say that Google has implemented Hilltop, or a development of Hilltop, or even a Hilltop-like system. What I can say with confidence is that the results without a -nonsense word (the normal results) are not derived from the results with a -nonsense word, as most people currently think. They are a completely different results set and are compiled in a different way. And I can also say that every effect that the Florida update has caused would be expected with a Hilltop-like expert-based system.


Where do we go from here?
At the moment, Google's search results are in poor shape, in spite of what their representatives say. If they leave them as they are, they will lose users, and risk becoming a small engine as other top engines have done in the past. We are seeing the return of some pages that were consigned to the void, so it is clear that the people at Google are continuing to tweak the changes.

If they get the results to their satisfaction, the changes will stay and we will have to learn how to seo Google all over again. But it can be done. There are reasons why certain pages are at the top of the search results and, if they can get there, albeit accidentally in many cases, other pages can get there too.

If it really is an expert system, then the first thing to realise is that the system cannot deal with all searchterms, so targeting non-generalised and lesser searchterms, using the usual search engine optimization basics, will still work.

For more generalised searchterms, the page needs to be linked to by multiple expert pages that are unaffiliated with the page. By "unaffiliated" I mean that they must reside on servers with different IP C block addresses than each other and than the target page, and their URLs must not use the same domain name as each other or as the target page. These expert pages can either be found and links requested or they can be created.
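The "unaffiliated" rule as stated here can be checked mechanically. The Python sketch below assumes each page is given as a (domain, IPv4 address) pair supplied by the caller, and takes the C block to be the first three octets of the address. The function names and the sample pages are hypothetical, and working out the domain from a URL is left out on purpose because doing it properly is not trivial.

```python
from itertools import combinations


def c_block(ip):
    """First three octets of a dotted IPv4 address: '192.0.2.7' -> '192.0.2'."""
    return ".".join(ip.split(".")[:3])


def affiliated(page_a, page_b):
    """Pages are (domain, ip) pairs; affiliated if they share a domain or C block."""
    (domain_a, ip_a), (domain_b, ip_b) = page_a, page_b
    return domain_a.lower() == domain_b.lower() or c_block(ip_a) == c_block(ip_b)


def all_unaffiliated(target, experts):
    """True if no two pages among the target and its expert pages are affiliated."""
    return not any(affiliated(a, b) for a, b in combinations([target, *experts], 2))


target = ("shop.example", "192.0.2.10")
experts = [("guide.example", "198.51.100.5"), ("directory.example", "203.0.113.9")]

print(all_unaffiliated(target, experts))  # True

# An extra expert page hosted in the same C block as the target breaks the rule.
print(all_unaffiliated(target, experts + [("other.example", "192.0.2.200")]))  # False
```

The check compares every pair, not just each expert against the target, because the expert pages must also be unaffiliated with each other.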

Latest

8th December 2003

Since soon after the Florida update began, some pages that disappeared from the results have been returning. In some cases they are back at, or close to, the rankings that they had before Florida. In other cases they are quite highly ranked but lower than before. Day after day, more of them are returning.

I put this down to Google recognizing that Florida caused a sharp decline in the quality (relevancy) of the search results. It appears that they are adjusting the algorithm's parameters, trying to find a balance between the new page selection process and good relevancy. In doing so, some of the pages that were dumped out of the results are getting back into the results set, and they are achieving high rankings because they already matched the ranking algorithm quite well, so once they are back in the results set, they do well in the rankings.

Reminder
Don't forget that all this is just theory, but what we see happening does appear to fit an expert system, although there can be other explanations. We can be sure that Google compiles the results sets in different ways depending on the searchterm, and that the Florida results are not derived, via one or more filters, from the -nonsense results, but we can't yet be certain that an expert system is used to compile the Florida results set.


22nd December 2003
Google has now dealt with the -nonsense search trick of seeing the non-Florida results, and it no longer works. It doesn't mean that they are not different to the Florida results; it's just that we can no longer see them.


5th January 2004
Dan Thies, of Seo Research Labs, came up with the interesting theory that the Florida changes are due to Google now using Topic Sensitive PageRank (TSPR). His PDF article can be found here. It's an interesting theory because, like the 'expert system' theory, it would cause Google to use 2 different algorithms depending on the searchterm used. To date, it's the only other theory that I believe has a chance of being right.

http://www.webworkshop.net/florida-update.html