An insight into the computational issues within the web search industry.
Thomas Clark, University Of Plymouth
School Of Computing
1. Introduction
This report discusses some of the main computational problems unique to the web search industry. The main focus will be on the web search engine Google, because of its majority market share as reported by a Statista GmbH survey covering May 2015 to July 2018 and published in 2018, visualised in Figure 1 (extracted from their website). Many research papers available online offer information from a view inside Google, as they are published by Google employees, so I will try to focus the majority of my research on information supplied by unbiased, third-party sources.
Figure 1: Google market share, illustrated with a line graph.
Computational issues are common within the web search industry, largely due to the sheer volume of traffic and information sorted by algorithms specific to their respective topics. While this volume makes computational issues difficult to solve, it can also be an advantage, because search engines have potentially the largest source of feedback available. Glossing over the background issues specific to search engines based in the West (political and financial influence), computational issues go to the core of the mechanics of the search engine industry.
For my report I’d like to focus on how Google ranks results using its own algorithm, PageRank. According to Google’s own definition (2011), “PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is.” PageRank was the first algorithm employed by the web giant, before its rapid expansion into other markets.
Figure 2: A comparison between searches and their results using google.co.uk.
PageRank sorts sites by importance or relevance: in brief, pages with many other pages linking to them appear higher in the search results. This is why homepages appear before other pages in non-page-specific searches (see Figure 2). Incredibly, an entire industry called ‘Search Engine Optimisation’ has developed around this single algorithm, in which companies exploit it to ensure clients’ sites appear before competitors’. It can be very lucrative; however, it also opens up the world of fraud and spam, as the algorithm can be exploited with malicious intent. It is important to note that PageRank estimates the relevance and importance of a page; it does not guarantee its authenticity, quality or reliability. This is often the reason why you have to read a few articles on a subject before finding a page of any real academic calibre.
2. Background
Patricia Briggs once quipped, “Any idiot can put up a website”, which is true nowadays; however, running a ‘successful’ site demands more effort, intelligent thinking and a much more measured approach. When Google was conceptualised, nothing of its scale existed. The internet was awash with corporations who felt it was necessary to have a website just to show they were established, and there was no real way of indexing the web, which meant that online marketing barely existed outside of email. Google changed the way companies use the internet, all from a garage in Menlo Park, California. Brin and Page saw a hole in the market where others didn’t, and used the knowledge from their education to advance their project. But what was the academic world like in the 1990s in terms of computing? To be blunt, it was limited: institutions were just as new to this area of computing as everyone else, and it was an exciting time to be part of something like Google. Innovation has always been at the core of institutions in the United States, and Stanford was no different.
The computational issues would not have become prevalent immediately, because of the smaller volume of indexed sites at that time. In 1998, link analysis was gaining a lot of attention. Jon Kleinberg, then a young researcher at Cornell, was working on a similar project called ‘HITS’. His algorithm used the hyperlink structure of web pages to improve the quality of results returned by a user search; at the time, existing search engines such as Excite, Yahoo! and WebCrawler used entirely text-based search algorithms, which made the results passable but overall poor. HITS uses hubs and authorities to capture a recursive relationship between web pages: pages with many links to authorities score highly as hubs, and pages linked to by many hubs score highly as authorities (see Figure 3; a minimal sketch of this iteration follows this paragraph). He presented his work at the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms (1998), held in San Francisco, California. At the same time, a programming duo by the names of Lawrence (Larry) Page and Sergey Brin, PhD students at Stanford University, were cooking up a similar idea named ‘PageRank’, the precursor to Google. Even though they had been working together on a search engine since early 1995, it took three years for the project to gain substantial traction, despite its ingenuity and innovative nature. Eventually they decided to take leave from their studies and focus entirely on their company, travelling to Australia to present their notable paper, “The Anatomy of a Large-Scale Hypertextual Web Search Engine” (Brin & Page, 1998). There has been some controversy over whether Kleinberg’s HITS influenced PageRank or vice versa, but PageRank emerged as the prominent link-analysis model in the end, mostly due to the success of Google as a company, something which Kleinberg did not have.
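To make the hub/authority recursion concrete, here is a minimal Python sketch of the HITS iteration on a small invented link graph (the graph, function name and fixed iteration count are illustrative assumptions, not Kleinberg's original code):

# A minimal sketch of the HITS iteration. 'links' maps each page to the
# pages it links out to; the graph is entirely made up for illustration.
def hits(links, iterations=50):
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority score is the sum of the hub scores of its backlinks.
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        # A page's hub score is the sum of the authority scores of the pages it links to.
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        # Normalise both score vectors so they do not grow without bound.
        for scores in (auth, hub):
            norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# 'guide1' and 'guide2' link to the same pages and emerge as hubs; 'docs' and
# 'wiki' are linked to by both hubs and emerge as authorities.
hub, auth = hits({"guide1": ["docs", "wiki"], "guide2": ["docs", "wiki"],
                  "docs": [], "wiki": []})
print(max(hub, key=hub.get), max(auth, key=auth.get))  # a top hub and a top authority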
3. Methods
The science behind PageRank makes for an interesting read of its own. According to Google’s own founders, Brin, S. and Page, L. (1998), the PageRank of any given page A is:
PR(A) = (1 − d) + d(PR(T1)/C(T1) + … + PR(Tn)/C(Tn))
It works by distributing probability across a set of pages; in the normalised form of the formula, the sum of all PageRanks is one. Breaking this equation down:
PR(T1) … PR(Tn): every page has a PageRank of its own, a measure of its self-relevance, from the first page linking to A (T1) through to the n’th such page (Tn).
C(T1) … C(Tn): each page distributes its ‘ballot’ evenly across all its outgoing links, so C(Tn) is the number of outgoing links on page Tn.
PR(Tn)/C(Tn): if page A has a backlink from page Tn, the share of Tn’s ballot passed to A is PR(Tn)/C(Tn).
d(…): the ballot shares are summed, and to stop some pages having a disproportionate influence over the others, the sum is multiplied by the damping factor d = 0.85, the value specified by Google.
(1 − d): this completes the probability figure of the equation by adding back the value removed by the damping factor. It also means that a page with no references (backlinks) still receives a small share of the probability: with d = 0.85, the minimum value the formula returns is 1 − 0.85 = 0.15.
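To make the recursion concrete, below is a minimal Python sketch of the iterative computation the formula implies, run on a tiny made-up link graph (the graph, function name and fixed iteration count are illustrative assumptions, not Google’s implementation, which operates on billions of pages):

# A minimal sketch of iterative PageRank using the formula above, with d = 0.85.
# 'links' maps each page to the pages it links out to; every page here has at
# least one outgoing link, so dangling pages are deliberately not handled.
def pagerank(links, d=0.85, iterations=50):
    pages = list(links)
    pr = {p: 1.0 for p in pages}  # initial guess: every page starts equal
    for _ in range(iterations):
        # New PR(A) = (1 - d) + d * sum of PR(T)/C(T) over pages T linking to A.
        pr = {a: (1 - d) + d * sum(pr[t] / len(links[t]) for t in pages if a in links[t])
              for a in pages}
    return pr

# Tiny example: A and C both link to B, so B accumulates the largest share.
ranks = pagerank({"A": ["B"], "B": ["C"], "C": ["B"]})
print(sorted(ranks.items(), key=lambda kv: -kv[1]))  # B first, then C, then A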
The ranking system employed by Google is far more complex than just sharing probability. Google takes into account the term entered by the user as a whole, right down to where certain letters are capitalised, the spacing between words, and abbreviations, which vary from country to country. For example, search results for ‘MIA (Missing In Action)’ appear in a different order on a Mandarin-language search engine compared to an English-language one (see Figures 3 and 4).
An example of how PageRank affects the prominence and relevance of a web page is the ranking of Amazon’s homepage compared with Amazon’s recruitment portal; the sketch below reproduces the effect.
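Re-using the hypothetical pagerank() sketch above on an equally hypothetical graph (the page names and link structure are invented for illustration): several outside pages link to the homepage, while only the homepage links to the careers portal, so the homepage accumulates a far larger share of the probability.

# Hypothetical "Amazon-like" graph: outside pages link to the homepage,
# but only the homepage links to the recruitment (careers) portal.
site = {
    "homepage": ["careers"],
    "careers": ["homepage"],
    "blog": ["homepage"],
    "news": ["homepage"],
    "forum": ["homepage"],
}
ranks = pagerank(site)
print(ranks["homepage"] > ranks["careers"])  # True: the homepage ranks higher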
4. Conclusions
Looking into the mechanics behind PageRank really opens up the realm of computer science: it is easy to take a search engine for granted, but there is far more ‘under the bonnet’ than appears at first glance. The documents of Google’s own founders offer a real insight into the concept of a large-scale web search engine from before anything of its kind existed. It is probably the closest you will get to time travel: a chance to see how Google was actually developed, both conceptually and literally.
Citations / References
Statista (2018): Search engine market share held by Google in the United Kingdom (UK), May 2015 – July 2018.
https://www.statista.com/statistics/279797/market-share-held-by-google-in-the-united-kingdom-uk/
Google Inc. (2011): How Google Search Works. Made available by archive.org:
https://web.archive.org/web/20111104131332/https://www.google.com/competition/howgooglesearchworks.html
Brin, S. & Page, L. (1998): The Anatomy of a Large-Scale Hypertextual Web Search Engine.
http://infolab.stanford.edu/~backrub/google.html