How Do Search Engines Work?

We’re all familiar with search engines on a superficial level, but what happens in the background – both before and after you click “Search” – to make Bing, Google, and Yahoo actually work? The answer is a lot more than most people realise. By taking the time to understand the processes search engines go through, you’ll gain a better appreciation of the hows and whys of SEO, be able to identify problems with your own site, and work better with your search engine optimisation agency. So let’s go through the process step by step.

Crawling

Before a search engine can do anything it must first discover web pages. This is the task of the search engine “spiders”, also known as ‘bots, robots or crawlers; the spiders for the three major search engines are MSNbot 2 (Bing), Googlebot 2.1 (Google) and Slurp (Yahoo!), but there are many, many more and they all perform much the same task.

These spiders are pieces of software that follow links around the internet. Each page they access is sent back to a data centre, a vast warehouse containing thousands of computers. Once a page is stored in a data centre, the search engine can begin to analyse it, and that’s where the magic starts to happen.

Conceptually, each spider will have started from a single page on the internet (historically the DMOZ directory was the starting point for many), and will have been crawling pages by following links from that day to the present. This is a massive, constant task, involving accessing and storing billions of pages every day, and the scale of the problem is one of the reasons there are so few major search engines around today.

It’s important to note that at this stage in the search engine process there is no intelligence or clever algorithm at work. The spiders are relatively simple bits of software: they follow links, harvest whatever data they can, send it back to the data centre, then follow the next set of links, and so on. It’s all very robotic, which is why search engines can so easily be stymied by non-standard content or navigation, such as Flash movies or forms.
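To make the "follow links, store the page, repeat" loop concrete, here is a minimal sketch in Python. It is an illustration only, not how any real spider is built: the PAGES dictionary is a hypothetical in-memory stand-in for the web, and the crawl function and LinkExtractor class are names invented for this example.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "web": URL -> HTML source. No network access is needed.
PAGES = {
    "/": '<a href="/about">About</a> <a href="/products">Products</a>',
    "/about": '<a href="/">Home</a>',
    # Navigation built only by script: there is no <a href> for the spider to follow.
    "/products": '<script>menu.open("/secret-offer")</script>',
    "/secret-offer": "<p>Only reachable through the scripted menu.</p>",
}


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, which is all this toy spider understands."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start):
    """Follow links breadth-first from start and return the pages stored, in order."""
    seen = {start}
    queue = deque([start])
    stored = []
    while queue:
        url = queue.popleft()
        html = PAGES.get(url)
        if html is None:
            continue  # broken link: nothing to store
        stored.append(url)  # stands in for "send the page back to the data centre"
        parser = LinkExtractor()
        parser.feed(html)
        parser.close()
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return stored


print(crawl("/"))  # ['/', '/about', '/products']
```

Notice that "/secret-offer" is never discovered. The only route to it is through a scripted menu, and this spider, like the simple software described above, only follows ordinary links.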

Key points to remember about crawling:

  • The job of the crawlers is to discover new content. They do this by following links.
  • Crawling is a massive, constant process, and the search engines crawl billions of pages every day, finding new content and recrawling old content to check if it’s changed.
  • Search engine crawlers aren’t smart; they are simple bits of software programmed to single-mindedly collect data and send it back to the search engine data centres.

Caching

Once a page has been crawled, search engines will typically take a cache of the page. This means the entire page, including its content, images, styles, scripts and source code, is stored by the search engine verbatim. This cache usually becomes available, via the cached link in the search results or via the cache: search operator, a few days after the crawl date, allowing users to access the page as it existed at the time it was crawled and as the search engine spider saw it.

In practice the cache functionality is rarely used by most users, but for those interested in SEO it can be invaluable because it serves to highlight accessibility issues with your pages. For example, if your pages don’t have a cache, or the cache is significantly different to what you see in your own browser – perhaps it has some key areas of content missing, or there is no clickable navigation to other pages on your site – then you know there is a good chance that there is some sort of issue preventing the spiders from properly accessing your pages or their content.
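The comparison described above can be pictured as a simple difference between two pieces of text. The sketch below assumes you have already copied the text of the cached page and the text you see in your browser into two strings; both strings, and the function names, are hypothetical examples rather than output from any real search engine.

```python
import re


def words(text):
    """Lower-case set of the words in a block of page text."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def missing_from_cache(cached_text, browser_text):
    """Words visible in the browser that never made it into the cached copy."""
    return sorted(words(browser_text) - words(cached_text))


# Hypothetical example: the pricing section is injected by script,
# so the spider never saw it and it is absent from the cache.
cached = "Welcome to our shop. Contact us."
browser = "Welcome to our shop. Pricing: basic plan, premium plan. Contact us."

print(missing_from_cache(cached, browser))  # ['basic', 'plan', 'premium', 'pricing']
```

An empty result suggests the spider saw roughly what you see. A long list of missing words is a hint that some content is being delivered in a way the spider can’t access.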

Key points to remember about caching:

  • The cache is a literal copy of what the spiders crawled; it’s not searchable, but is sometimes useful for assessing how spiders interact with your site.
  • If the search engines don’t have a cache of your pages, or if your cache differs in important ways from what you see in your browser, you could have an accessibility problem.

Indexing

Although the cache is a useful tool, to a search engine it has limited applications. In this state, a page can’t be searched by the search engine. To be searchable, a page must be indexed. This is the next stage, and it involves deconstructing the page into its constituent parts and storing them in a database so the page can be easily located and retrieved by the search engine later on, and compared to other pages.

It helps to think of a reference book in this respect. When looking for a specific piece of information, you don’t leaf through every page looking for mentions of a key word or phrase; instead, you look at the index. Search engine indexes function in a similar way, only across billions of documents rather than the few hundred pages that a typical book index might cover.
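The book-index idea is usually called an inverted index: a lookup from each word to the documents that contain it. Here is a minimal sketch, assuming three hypothetical documents in place of crawled pages; real search engine indexes store far more than this (word positions, emphasis and so on), and the names below are invented for the example.

```python
import re
from collections import defaultdict

# Hypothetical documents standing in for crawled pages.
DOCS = {
    "page1": "Search engine optimisation basics",
    "page2": "How a search engine crawls the web",
    "page3": "Baking bread at home",
}


def tokenize(text):
    """Split text into lower-case words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


index = build_index(DOCS)
print(sorted(index["search"]))  # ['page1', 'page2']
print(sorted(index["bread"]))   # ['page3']
```

Looking up a word is now a single dictionary access, however many documents there are, which is the same shortcut a book index gives you.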

You can check if a page is indexed by a search engine by performing a site: query on it. If this search returns no results your page isn’t indexed and, assuming enough time has passed and you’d ordinarily expect the page to be indexed, you may have an accessibility problem preventing the page from being crawled and/or indexed.

Key points to remember about indexing:

  • Search engine indexes are analogous to the index you’ll find at the back of most reference books, which allows you to quickly flip to the right page when you are looking for information on a specific topic.
  • They are the key to search engines being able to search billions of documents so quickly.
  • Your pages must be in the index before they can rank in the search engine results. If they aren’t, you may have an accessibility problem.

Retrieval

Once a page has been crawled, cached and indexed, it’s ready to be returned by the search engine in response to a user’s search query.

Let’s assume that somebody searches for “search engine optimisation”. The first thing the search engine will do is access a data centre (typically the nearest one) and retrieve every document that has been indexed that the engine considers to be relevant for the term “search engine optimisation”. This often amounts to hundreds of thousands or even millions of documents. This is the pool of results that will then be sorted by the search engine in the final stage.

This then is the second hurdle for SEO; once you’ve ensured your content is accessible in order that it can be crawled, cached and indexed, you must also make sure it’s relevant for the terms that your target market is typing into search engines. The easiest way to be considered relevant for a search term is to include that term on one of your pages. Some other signals, such as the text of links pointing to your pages, may be considered, but the vast majority of retrieved pages simply use the term in question in their content.
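As a rough illustration of retrieval, the sketch below builds the pool of candidate pages for a query. It makes two simplifying assumptions that are ours, not the search engines’: a page counts as relevant only if every query word appears in its content or in the text of links pointing to it, and the pages are scanned directly rather than looked up in an index. The PAGES data and function names are hypothetical.

```python
import re

# Hypothetical pages: body text plus the text of links pointing at each page.
PAGES = {
    "a": {"body": "Our search engine optimisation services", "anchors": []},
    "b": {"body": "Welcome to our agency", "anchors": ["search engine optimisation"]},
    "c": {"body": "Search tips for your engine room", "anchors": []},
    "d": {"body": "Holiday cottages in Wales", "anchors": []},
}


def tokenize(text):
    """Lower-case set of the words in a piece of text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, pages):
    """Return ids of pages whose body or inbound link text contains every query word."""
    wanted = tokenize(query)
    pool = []
    for page_id, page in pages.items():
        available = tokenize(page["body"])
        for anchor in page["anchors"]:
            available |= tokenize(anchor)
        if wanted <= available:  # every query word is present somewhere
            pool.append(page_id)
    return pool


print(retrieve("search engine optimisation", PAGES))  # ['a', 'b']
```

Page "a" is retrieved because it uses the term in its content, and page "b" because of the text of a link pointing to it. Page "c" mentions "search" and "engine" but not "optimisation", so it never reaches the sorting stage.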

Key points to remember about retrieval:

  • After you click “Search”, every document that is relevant to the term you searched for is retrieved from the nearest data centre.
  • If your pages aren’t relevant to the terms people search for, they can’t be retrieved by the search engine or considered during the next and final stage, ranking.

Sorting

In the final stage, search engines take all of the documents they retrieved in the previous step and pass them through their algorithms in order to sort the documents into the order that they think best serves the user’s intent. The sorted documents are returned in a SERP, or search engine results page.

For all the scale and complexity of the previous stages, the algorithms that do the sorting are the real workhorses. They analyse dozens or even hundreds of factors about each page, and they do this in mere fractions of a second. Needless to say, search engines don’t reveal any specific details about their algorithms, although we do know the general concepts behind them, which are similar for all of the major engines.

For a large number of searches most of the sorting will again be done based on the content of the pages retrieved. This is classic information retrieval based on the frequency of words appearing on the page, how they are emphasised and so on. So, again, relevance is of paramount importance.
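A very simple version of content-based sorting is to count how often the query words appear on each page. The sketch below does only that; it ignores emphasis and every other factor a real algorithm would weigh, and the POOL data and function names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical pool of pages already retrieved for the query.
POOL = {
    "guide": "SEO guide: SEO basics and SEO tools",
    "news": "Agency news and one SEO update",
    "blog": "SEO tips and more SEO tips",
}


def term_frequency_score(query, text):
    """Count how many times the query words occur in the text."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    query_words = re.findall(r"[a-z0-9]+", query.lower())
    return sum(counts[word] for word in query_words)


def sort_by_content(query, pool):
    """Order page ids from highest to lowest term-frequency score."""
    return sorted(
        pool,
        key=lambda page_id: term_frequency_score(query, pool[page_id]),
        reverse=True,
    )


print(sort_by_content("seo", POOL))  # ['guide', 'blog', 'news']
```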

For more competitive search terms, search engines will find that too many documents are equally relevant. At this point they will have to look to other signals to distinguish between the documents, and the most important of these is links from other sites. These confer credibility on your pages. This, then, is the final piece of the SEO puzzle.
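One way to picture links acting as a tie-breaker is a two-part sort: relevance first, then links. The sketch below uses a raw count of inbound links, which is a deliberate simplification; search engines don’t publish how they weigh links, and the sites, scores and the rank function here are all hypothetical.

```python
# Hypothetical pages: three share the same relevance score but differ in inbound links.
CANDIDATES = [
    {"url": "site-a.example", "relevance": 5, "inbound_links": 12},
    {"url": "site-b.example", "relevance": 5, "inbound_links": 340},
    {"url": "site-c.example", "relevance": 7, "inbound_links": 3},
    {"url": "site-d.example", "relevance": 5, "inbound_links": 45},
]


def rank(candidates):
    """Sort by relevance first, then use inbound links to separate equally relevant pages."""
    return sorted(
        candidates,
        key=lambda page: (page["relevance"], page["inbound_links"]),
        reverse=True,
    )


for page in rank(CANDIDATES):
    print(page["url"])
```

Here site-c comes first because it is the most relevant, while the three equally relevant sites are ordered b, d, a by how many other sites link to them.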

Key points to remember about sorting:

  • Every document retrieved during the previous stage is fed through a complex formula, the algorithm, which sorts the documents into the order that the search engine thinks is best for your needs.
  • All of the major search engines use similar principles in their algorithms, differing only in the details and the emphasis they apply to individual elements.
  • For many search terms, being relevant isn’t always enough to ensure high rankings; you also need credibility in the form of links from other websites.
  • Once sorted according to the algorithm, the results are returned to the user via the search engine results pages, or SERPs for short.

ARC

ARC is the cornerstone of Greenlight’s SEO methodology and, as we’ve explained, its three parts – accessibility, relevancy and credibility – correspond with how search engines work so that every potential SEO issue is considered logically and methodically. Remember:

  • Your pages and content must be accessible to search engines before they can be crawled, cached and indexed.
  • Search engines consider the relevancy of your pages during the retrieval and sorting processes.
  • Credibility is important to distinguish your site from the thousands of other sites that are equally accessible and relevant.

If you think your site falls down in one or more of these areas, we can help. Please get in touch.
