Introduction to Web Search Engines
How search engines like Google work.
By Frank McCown © 2009-2012
Abstract
This article examines, at a high level, some of the major technical aspects of how web search engines discover web pages, index their content, and provide search results.
Introduction
In 2008, Google reported that they had discovered over 1 trillion unique URLs on the Web
[1]. And as previous research has shown
[2], it is unlikely that Google, or any other search engine,
is even close to discovering all the available content on the Web. The Web is certainly a large place,
and finding information can be a daunting task without a web search engine. In this article, we will
examine how search engines work by looking at how content is collected, how it is indexed, and the
factors that play a role in ranking the content.
Web Directories
In 1994, the Web was a much smaller place. Stanford graduate students Jerry Yang and David Filo
started manually organizing their favorite websites into
a directory which soon became the foundation of Yahoo!
[3].
A
web directory allows you to find web pages by navigating a structured hierarchy.
For example,
Computers and Internet > Internet > World Wide Web > History would lead
you to a set of websites about the Web’s history.
Although hierarchical classifications
of web pages can be helpful, this is not always the ideal way to find information on the Web. For
example, if you wanted to know who founded Yahoo!, it would be time-consuming to search through the
hierarchy of web pages and then scan each page for the information you were seeking. Creating
human-powered web directories is also not a scalable strategy as the Web continues to grow and
change at a fantastic rate.
To provide search functionality for the Web, search
engines must first discover what the web pages actually say and then create an index of the page
contents. The index can then be quickly searched to reveal which pages contain the given query text.
Web Crawling
Matthew Gray’s World Wide Web Wanderer (1993) was one of the first efforts to automate the
discovery of web pages
[4].
Gray’s
web crawler would download a web page, examine it for links to other pages, and
continue downloading the newly discovered pages until no unvisited links remained.
This is how web crawlers, also called spiders, generally operate today.
Because the
Web is so large, search engines normally employ thousands of web crawlers that meticulously scour
the web day and night, downloading pages, looking for links to new pages, and revisiting old pages
that might have changed since they were visited last. Search engines will often revisit pages
based on their frequency of change in order to keep their index
fresh. This is necessary
so search engine users can always find the most up-to-date information on the Web.
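The crawl loop described above can be sketched in a few lines of Python. The sketch below is illustrative only: the class and function names are invented for the example, and a real crawler would add politeness delays, robots.txt checks, parallel fetching, and far more robust parsing and error handling.

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collects the href values of all <a> tags on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed_url, max_pages=50):
        frontier = deque([seed_url])   # URLs waiting to be downloaded
        visited = set()                # URLs already downloaded
        repository = {}                # URL -> page HTML (the "local repository")
        while frontier and len(visited) < max_pages:
            url = frontier.popleft()
            if url in visited:
                continue
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
            except Exception:
                continue               # skip pages that fail to download
            visited.add(url)
            repository[url] = html
            extractor = LinkExtractor()
            extractor.feed(html)
            for link in extractor.links:
                absolute = urljoin(url, link)      # resolve relative links
                if absolute.startswith("http") and absolute not in visited:
                    frontier.append(absolute)
        return repository

    # repository = crawl("https://example.com/")   # example.com is a placeholder seed URL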
Maintaining an accurate "snapshot" of the Web is challenging, not only because of the size of
the Web and constantly changing content, but also because pages disappear at an alarming rate
(a problem commonly called
linkrot).
Brewster Kahle, founder of the Internet Archive, estimates that web pages have an average life
expectancy of only 100 days
[5]. Some pages cannot be found by web crawling at all: pages that no other page links to, pages that
are password-protected, and pages that are generated dynamically only when a web form is submitted.
These pages reside in the
deep Web, also called the hidden or invisible Web
[6].
Some website owners don’t want their
pages indexed by search engines for any number of reasons, so they use the
Robots Exclusion Protocol (robots.txt) to tell web
crawlers which URLs are off-limits. Other website owners want to ensure certain web pages are
indexed by search engines, so they use the
Sitemap Protocol, a method supported by all
major search engines, to provide a crawler with a list of URLs they want indexed
[7]. Sitemaps are especially useful for providing the
crawler with URLs it would otherwise be unable to discover by following links.
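As a rough illustration of the robots.txt check, Python's standard library ships a parser a crawler might consult before fetching a URL; the site and user-agent names below are placeholders, not real endpoints.

    from urllib.robotparser import RobotFileParser

    # Sketch: consult a site's robots.txt before fetching one of its URLs.
    # "MyCrawler" and example.com are placeholder names for illustration.
    robots = RobotFileParser()
    robots.set_url("https://example.com/robots.txt")
    robots.read()   # download and parse the robots.txt file

    if robots.can_fetch("MyCrawler", "https://example.com/private/page.html"):
        print("Allowed to crawl this URL")
    else:
        print("robots.txt disallows this URL; skip it")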
Figure 1 below shows
how a web crawler pulls from the Web and places downloaded web resources into a local repository.
The next section will examine how this repository of web resources is then indexed and retrieved when you enter a query into a search engine.
Figure 1 - The Web is crawled and placed into a local repository where it is indexed and retrieved when using a search engine.
Indexing and Ranking
When a web crawler has downloaded a web page, the search engine will index its content. Often,
stop words (words that occur very frequently, like a, and, the, and to) are ignored. Other words might be stemmed.
Stemming is a technique that removes suffixes from a word so that different forms of the same word map to a single index entry. For example, eating, eats, and eaten may all be stemmed to eat so that a search for eat will match all its variants.
An example index (usually called an
inverted index) will look something like this, where each number corresponds to a web page that contains the term:
cat > 2, 5
dog > 1, 5, 6
fish > 1, 2
bird > 4
So a query for
dog would result in pages 1, 5, and 6. A query for
cat dog would only result in page 5 since it is the only page that contains both search terms. Some search engines provide advanced search capabilities, so a search for
cat OR dog NOT fish could be entered, which would result in pages 5 and 6, the pages that contain cat or dog but do not contain fish.
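A toy version of such an index, including stop-word removal, a very crude stand-in for stemming, and the boolean set operations just described, might look like the following Python sketch; the page texts and suffix-stripping rule are invented for illustration (a real engine would use a proper stemmer such as Porter's).

    # Toy inverted index with stop-word removal and crude stemming,
    # mirroring the cat/dog/fish example above.
    STOP_WORDS = {"a", "and", "the", "to"}

    def stem(word):
        # Crude stand-in for a real stemmer: strip a few common suffixes so
        # that "dogs" and "dog" both index under "dog".
        for suffix in ("ing", "en", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    def build_index(pages):
        """pages: dict of page id -> page text. Returns term -> set of page ids."""
        index = {}
        for page_id, text in pages.items():
            for word in text.lower().split():
                if word in STOP_WORDS:
                    continue
                index.setdefault(stem(word), set()).add(page_id)
        return index

    pages = {
        1: "the dog chased a fish",
        2: "a cat eats fish",
        4: "a bird sings",
        5: "the cat and the dog",
        6: "dogs bark",
    }
    index = build_index(pages)

    print(index["dog"])                                   # {1, 5, 6}
    print(index["cat"] & index["dog"])                    # AND query: {5}
    print((index["cat"] | index["dog"]) - index["fish"])  # (cat OR dog) NOT fish: {5, 6}

Note how the posting lists match the example index above, and how set union, intersection, and difference implement OR, AND, and NOT.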
The search engine also maintains multiple weights for each term. A weight might correspond to any number
of factors that determine how relevant the term is to its host web page.
Term frequency is one
such weight which measures how often a term appears in a web page. For example, if someone wanted to
search the Web for pages about dogs, a web page containing the term dog five times would likely be more
relevant than a page containing dog just once. However, term frequency is susceptible to spamming
(or
spamdexing), a technique that some individuals use to artificially manipulate a web page’s
ranking, so it is only one of many factors that are used
[8].
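As a rough sketch of how term frequency alone could be used to order pages (real engines normalize for page length and blend in many other signals, precisely because raw counts are so easy to spam), consider the following; the example pages are invented:

    # Sketch: rank pages by raw term frequency for a single-word query.
    from collections import Counter

    def term_frequency_scores(pages, query_term):
        """pages: dict of page id -> text. Returns page ids sorted by how often
        the query term appears, most frequent first."""
        scores = {}
        for page_id, text in pages.items():
            count = Counter(text.lower().split())[query_term]
            if count:
                scores[page_id] = count
        return sorted(scores, key=scores.get, reverse=True)

    pages = {
        1: "dog dog dog dog dog",   # mentions dog five times
        2: "my dog is friendly",    # mentions dog once
    }
    print(term_frequency_scores(pages, "dog"))   # [1, 2]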
Another weight that is given to a web page is based on the context in which the term
appears in the page. If the term appears in a large, bold font or in the title of the page, it may
be given more weight than a term that appears in a regular font. A page might also be given more
weight if links pointing to the page use the term in their anchor text. In other words, a page that
is pointed to with the link text “see the dogs” is more likely about dogs since the term dogs appears
in the link. This functionality has left search engines susceptible to a practice known as
Google-bombing, where many individuals collude to link to the same web page with the same
anchor text, usually for humorous effect. A popular Google bomb once promoted the White House website to the
first result when searching Google for “miserable failure”. Google has since implemented an
algorithmic solution capable of defusing most Google bombs
[9].
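One hypothetical way to picture these context weights is as boosts added to a term's base score. The boost values and scoring function below are invented purely for illustration; they are not any search engine's actual formula.

    # Hypothetical scoring sketch: boost a term's weight when it appears in the
    # page title or in the anchor text of links pointing at the page.
    # The boost factors are made up for illustration only.
    TITLE_BOOST = 3.0
    ANCHOR_BOOST = 2.0

    def context_score(term, body_count, title_words, anchor_texts):
        score = float(body_count)          # base: term frequency in the page body
        if term in title_words:
            score += TITLE_BOOST           # extra weight for a title match
        score += ANCHOR_BOOST * sum(term in anchor.lower().split()
                                    for anchor in anchor_texts)
        return score

    # A page about dogs: "dog" appears twice in the body, once in the title,
    # and two other pages link to it with anchor text containing "dog".
    print(context_score("dog", 2, {"all", "about", "dog"},
                        ["see the dog", "dog pictures"]))   # 2 + 3 + 2*2 = 9.0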
A final weight which most search engines will use is based on the web graph, the graph
which is created when viewing web pages as nodes and links as directed edges. Sergey Brin and Larry
Page were graduate students at Stanford University when they noted just how important the web graph
was in determining the relevancy of a web page. In 1998, they wrote a research paper
[10] about how to measure the importance of a web page by
examining a page’s position in the web graph, in particular the page’s in-links (incoming links)
and out-links (outgoing links). Essentially, they viewed links as citations. Good pages receive
many citations, and bad pages receive few. So pages that have in-links from many other pages are
probably more important and should rank higher than pages that few people link to. Weight should
also be given to pages based on who is pointing to them; an in-link from a highly cited page is
better than an in-link from a lowly cited page. Brin and Page named their ranking algorithm
PageRank, and it was instrumental in popularizing their new search engine called Google.
All search engines today take into account the web graph when ranking results.
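A minimal sketch of the idea behind PageRank is the textbook power-iteration formulation: each page repeatedly passes its current rank along its out-links, scaled by a damping factor. This toy version only illustrates the intuition above; it is not how any engine computes rank at web scale (no sparse matrices, and dangling pages are handled simplistically).

    # Minimal PageRank sketch: repeatedly distribute each page's rank over its
    # out-links, with damping factor d.
    def pagerank(links, d=0.85, iterations=50):
        """links: dict mapping page -> list of pages it links to."""
        pages = set(links) | {p for targets in links.values() for p in targets}
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}            # start with uniform rank

        for _ in range(iterations):
            new_rank = {p: (1 - d) / n for p in pages}
            for page, targets in links.items():
                if not targets:
                    continue                          # dangling page: its rank is simply not redistributed
                share = d * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share         # each out-link passes along an equal share
            rank = new_rank
        return rank

    # A, B, and C all link to D, so D ends up with the highest PageRank.
    links = {"A": ["D"], "B": ["D"], "C": ["D"], "D": ["A"]}
    for page, score in sorted(pagerank(links).items(), key=lambda x: -x[1]):
        print(page, round(score, 3))

Running this on the four-page graph gives D the highest rank, matching the intuition that a page with many in-links should rank higher than pages with few.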
Figure 2 shows an example of a web graph where web pages are nodes and links from one page to
another are directed edges. The size and color of the nodes indicate how much PageRank the web
pages have. Note that pages with high PageRank (red nodes) generally have significantly more
in-links than do pages with low PageRank (green nodes).
Figure 2 – Example web graph. Pages with higher PageRank are represented with larger nodes.
Rank Optimization
Search engines guard their weighting formulas as trade secrets, since the formulas differentiate their services from those of other search engines, and they do not want content producers (the public who produces web pages) to “unfairly” manipulate their rankings. However, many companies rely heavily on search engines for recommendations and customers, and their ranking on a
search engine results page (SERP) is very important. Most search engine users only examine the first screen of results, and they view the first few results more often than those at the bottom of the page. This naturally pits content producers in an adversarial role against search engines, since the producers have an economic incentive to rank highly in SERPs.
Competition for certain terms (e.g.,
Hawaii vacation and
flight to New York) is
particularly fierce. Because of this, most search engines provide paid-inclusion or
sponsored results along with regular (
organic) results. This allows companies to purchase space on a SERP for certain terms.
An industry based on
search engine optimization (SEO) thrives on improving its customers’ rankings by designing their pages to maximize the various weights discussed above and by increasing the number and quality of incoming links. Black-hat SEOs may use a number of questionable techniques, like spamdexing and
link farms (artificially created web pages designed to bolster the PageRank of a particular set of web pages), to increase their rankings. When detected, such behavior is often punished by search engines by removing the pages from their index and embargoing the website for a period of time
[11].
Vertical Search
Search engines like Google, Yahoo!, and Bing normally provide specialized types of web search called
vertical search
[12]. A few examples include:
- Regular web search is the most popular type of search; it searches the index of all types of web pages. Other online textual resources like PDFs and Microsoft Office documents are also available through regular web search.
- News search will search only news-related websites. Typically the search results are ordered based on the age of the story.
- Image search searches only images that were discovered when crawling the Web. Images are normally indexed using the image’s filename and the text surrounding the image. Artificial intelligence techniques for discovering what is actually pictured in an image are slowly emerging. For example, Google can now separate images of faces and line drawings from other image types.
- Video search searches the text accompanying videos on the Web. Like image search, there is heavy reliance on people to supply text which accurately describes the video.
Other specialty searches include blog search, newsgroup search, scholarly literature search, etc. Search engines also occasionally mix various types of search results together onto the same SERP. Figure 3 below shows how Ask.com displays news and images along with regular web search results when searching for
harding. The blending of results from different vertical search offerings is usually called
universal search
[13].
Figure 3 - Ask.com's universal search results.
Personalized Search
In order to provide the best possible set of search results for a searcher, many search engines today are
experimenting with techniques that take into account personal search behavior. When searching for
leopard, a user who often queries for technical information is more likely to want to see results
dealing with Leopard the operating system than leopard the animal. Research has also shown that one
third of all queries are repeat queries, and most of the time an individual will click on the same result
they clicked on before
[14]. Therefore, a search engine should ideally present the previously selected result close to the top of the SERP when it recognizes that the user has entered the same query before.
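One simple way to picture this repeat-query behavior is a re-ranking step that promotes results the user previously clicked for the same query. The sketch below is hypothetical, with made-up URLs, and ignores everything a real system must handle (privacy, click quality, recency):

    # Hypothetical sketch: promote results the user clicked on the last time
    # they issued the same query; everything else keeps its original order.
    def personalize(query, results, click_history):
        """results: ranked list of URLs. click_history: dict of query -> set of
        URLs this user clicked previously."""
        clicked = click_history.get(query, set())
        return sorted(results, key=lambda url: url not in clicked)  # stable sort

    history = {"leopard": {"os-review.example/leopard"}}
    results = ["wildlife.example/leopard", "os-review.example/leopard", "zoo.example/cats"]
    print(personalize("leopard", results, history))
    # ['os-review.example/leopard', 'wildlife.example/leopard', 'zoo.example/cats']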
Figure 4 below shows a screenshot of personalized search results via Google's SearchWiki
[15], an experiment in search personalization that Google rolled out in late 2008. The user was able to promote results higher in the list, remove poor results from the list, and add comments to specific results. The comment and removal functions are no longer available today, but Google does allow users to star results that they like, and these starred results appear prominently when the user later searches for the same content.
Figure 4 – Example
of Google's SearchWiki.
As smartphones have become increasingly popular, search engines have started providing search
results based on the user's location
[16]. A
location-aware search engine
recognizes that when a user searches for
restaurants on their mobile device, they likely want to find restaurants in their immediate vicinity.
Conclusions
Searching the entire Web in less than a second is a substantial engineering achievement.
As we have seen, search engines are not searching the live Web but rather copies of web
pages that they have previously crawled. Pages are ranked based on their physical characteristics
(title, term frequency, font size, etc.) and on their level of connectivity in the web graph.
The competition to present the most relevant set of search results is fierce, and search engines
are always looking for better search mechanisms and altering how results are presented. The
competition for ranking #1 is also fierce, guaranteeing the practice of SEO will not die anytime
soon.
References
- We knew the web was big...
http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html
- Gulli, Antonio and Signorini, Alessio (2005). The indexable web is more than 11.5 billion pages. In Proceedings of WWW 2005.
http://www.cs.uiowa.edu/~asignori/web-size/
- The History of Yahoo! - How It All Started...
http://docs.yahoo.com/info/misc/history.html
- Measuring the Growth of the Web
http://www.mit.edu/people/mkgray/growth/
- Weiss, Rick (2003). On the web, research work proves ephemeral: Electronic archivists are playing catch-up in trying to keep documents from landing in history's dustbin. The Washington Post.
http://www.washingtonpost.com/ac2/wp-dyn/A8730-2003Nov23
- Bergman, Michael K. (August 2001). "The Deep Web: Surfacing Hidden Value". The Journal of Electronic Publishing 7 (1).
http://dx.doi.org/10.3998/3336451.0007.104
- Sitemaps.org
http://www.sitemaps.org/
- Gyöngyi, Zoltán and Garcia-Molina, Hector (2005), "Web spam taxonomy", Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2005
http://airweb.cse.lehigh.edu/2005/gyongyi.pdf
- A quick word about Googlebombs
http://googlewebmastercentral.blogspot.com/2007/01/quick-word-about-googlebombs.html
- Brin, Sergey and Page, Lawrence (1998), The Anatomy of a Large-Scale Hypertextual Web Search Engine
http://infolab.stanford.edu/~backrub/google.html
- Cutts, Matt. Ramping up on international webspam (Feb 2006)
http://www.mattcutts.com/blog/ramping-up-on-international-webspam/
- Watch Out Google, Vertical Search Is Ramping Up!
http://www.readwriteweb.com/archives/vertical_search.php
- Universal search: The best answer is still the best answer
http://googleblog.blogspot.com/2007/05/universal-search-best-answer-is-still.html
- Teevan, Jaime, Adar, Eytan, Jones, Rosie, and Potts, Michael (2006). History repeats itself: repeat queries in Yahoo's logs. In Proceedings of SIGIR '06.
http://doi.acm.org/10.1145/1148170.1148326
- SearchWiki: make search your own
http://googleblog.blogspot.com/2008/11/searchwiki-make-search-your-own.html
- Harnick, Chris (2009). How location-based searches are changing mobile. Mobile Marketer.
http://www.mobilemarketer.com/cms/news/search/3994.html