This blog is largely inspired by the SMX West 2016 keynote by Paul Haahr (@haahr), a ranking engineer at Google. I am writing it because I believe that if a digital marketer wants to make the most of any SEO strategy, he or she ought to understand more clearly how things actually work on the other side, the search engine's side, Google's side.
Google uses its own crawler software, called 'Googlebot', to collect documents from all over the web, or at least most of it, and to build the indices behind Google's Search Engine Results Pages (SERPs). A Search Engine Results Page is the page displayed in response to a searcher's query.
These Googlebots, also called 'spiders', crawl the web continuously, independent of any single query: they start from a set of seed websites or webpages and extract the outgoing links from them. The spiders then follow those links, landing on a whole different set of websites and extracting outgoing links from those in turn. This is how Googlebots gather documents and use links to understand the structure of the web and find relevant pages.
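To make the crawling step concrete, here is a minimal Python sketch of a breadth-first crawler that starts from a handful of seed URLs; the helpers and limits here are illustrative, not Googlebot's actual implementation.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href target of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=50):
    """Breadth-first crawl: fetch a page, extract its outgoing links,
    then follow those links to discover new pages."""
    frontier = deque(seed_urls)
    seen = set(seed_urls)
    documents = {}                       # url -> raw HTML
    while frontier and len(documents) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                     # skip pages that fail to fetch
        documents[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return documents

# Example: pages = crawl(["https://example.com"], max_pages=10)
```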
After the spiders have crawled enough of the web and the majority of its content, Google builds a specific index, what it calls a 'keyword inverted index', which collects and stores, for every keyword or phrase, all the pages it appears on, including the terms a user might type into Google as a query. Once the index is created, Google calculates which pages out of all those documents are the most relevant to the user's query. Along with this, it also checks which pages are the most popular, reliable, and trustworthy, among other factors (signals). These metrics feed into Google's ranking, of which the PageRank algorithm is the best-known part. To understand the PageRank algorithm, please read our blog What Google Humms..these days..
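To illustrate the idea of a keyword inverted index, here is a toy Python version: for every word, store the set of documents it appears in, then answer a query by intersecting those sets. This is only a sketch of the concept, not Google's data structure.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map every word to the set of document ids it appears in."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def lookup(index, query):
    """Return the documents that contain every word of the query."""
    word_sets = [index.get(word, set()) for word in query.lower().split()]
    return set.intersection(*word_sets) if word_sets else set()

docs = {
    1: "how google builds its search index",
    2: "the web is crawled by spiders that follow links",
    3: "google ranks pages in its index by relevance",
}
index = build_inverted_index(docs)
print(lookup(index, "google index"))   # -> {1, 3}
```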
Until a few years ago, Google would basically analyse crawled pages by extracting links and doing a few other things, such as rendering content, but its core job was to extract outgoing links from webpages. In recent years, however, Google has improved how its crawlers analyse web pages and build indices from that analysis. The added capability is called 'semantic search'.
What semantic search does is increase the accuracy of search results by understanding the meaning of the query in the context of the searcher's intent or actual need.
According to Google, building a 'web index' is just like building a book index: in short, for every word, it is a list of the pages that word appears on. Because the scale of the web is so large, Google breaks the web into groups of millions of pages, and each group is called an 'index shard'. The web is thus divided into thousands of index shards, which makes up a substantial part of Google's index-building process.
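The shard idea can be sketched by routing each page to one of many partitions, for example by hashing its URL; the shard count and hashing scheme below are assumptions for illustration, not how Google actually partitions its index.

```python
import hashlib

NUM_SHARDS = 1000   # illustrative; the real index spans thousands of shards

def shard_for(url, num_shards=NUM_SHARDS):
    """Assign a page to a shard by hashing its URL."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

print(shard_for("https://www.wwe.com/"))        # some shard id in [0, 999]
print(shard_for("https://en.wikipedia.org/"))   # very likely a different shard
```

Each shard then only has to index and search its own slice of the web, which is what makes querying at web scale practical.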
What does Google do after it gets a query?
It understands the query first (part of semantic search, if you will). It asks several questions, such as: "Does the query name any known entities? Are there any useful synonyms? In what context should the user's input be treated?" After understanding what the user is actually looking for, Google sends the query to all the index shards. Each shard finds its matching pages, scores them (in other words, performs what is often called 'PageRanking'), and sends back its top pages sorted by score. The returned pages are then checked against further 'signals' such as spam, duplication, and clustering of site links, and sites are demoted or promoted on that basis.
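The fan-out described above (send the query to every shard, let each shard score and return its best matches, then merge them) could be sketched like this; the scoring function is a placeholder stand-in, not Google's.

```python
import heapq

def score(page, query):
    """Placeholder relevance score: count of query-word occurrences in the page."""
    words = page["text"].lower().split()
    return sum(words.count(term) for term in query.lower().split())

def search_shard(shard_pages, query, k=3):
    """Each shard returns its own top-k pages sorted by score."""
    scored = [(score(page, query), page["url"]) for page in shard_pages]
    return heapq.nlargest(k, scored)

def search(shards, query, k=3):
    """Fan the query out to all shards and merge their top results."""
    merged = []
    for shard_pages in shards:
        merged.extend(search_shard(shard_pages, query, k))
    return heapq.nlargest(k, merged)

shards = [
    [{"url": "a.com", "text": "google search ranking"},
     {"url": "b.com", "text": "cooking recipes"}],
    [{"url": "c.com", "text": "how google ranking works"}],
]
print(search(shards, "google ranking"))   # a.com and c.com outrank b.com
```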
There are two kinds of signals identified by Google (a toy sketch of how they might be combined follows the list):
- Query-independent signals, such as PageRank, language, and mobile-friendliness
- Query-dependent signals, which depend on features of both the page and the query, such as keyword hits, synonyms, and proximity
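As a rough illustration of how the two kinds of signals might be blended into one score, here is a toy weighted sum; the signal names and weights are invented for the sketch and are not Google's actual formula.

```python
def page_score(page, query_signals):
    """Toy blend of query-independent and query-dependent signals,
    using made-up weights purely for illustration."""
    query_independent = (
        2.0 * page["pagerank"]
        + (1.0 if page["mobile_friendly"] else 0.0)
    )
    query_dependent = (
        3.0 * query_signals["keyword_hits"]
        + 1.5 * query_signals["synonym_match"]
        + 1.0 * query_signals["proximity"]
    )
    return query_independent + query_dependent

page = {"pagerank": 0.8, "mobile_friendly": True}
signals = {"keyword_hits": 0.9, "synonym_match": 0.4, "proximity": 0.7}
print(round(page_score(page, signals), 2))   # 1.6 + 1.0 + 2.7 + 0.6 + 0.7 = 6.6
```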
How does Google position pages?
Google uses something called reciprocal rank weighting, wherein each result on the SERP is given a weight, starting from the very first position (a worked example follows the list):
Site 1 – has a weightage of 1
Site 2 – has a weightage of ½
Site 3 – has a weightage of ⅓
Site 4 – has a weightage of ¼
And so on.
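As a worked example of reciprocal rank weighting: position i gets weight 1/i, so what happens near the top of the results page counts far more than what happens further down. Using these weights to score a whole results page from per-result ratings is an assumption made here to keep the example concrete.

```python
def reciprocal_rank_weights(num_results):
    """Position i (1-based) gets weight 1/i: 1, 1/2, 1/3, 1/4, ..."""
    return [1.0 / i for i in range(1, num_results + 1)]

def weighted_page_score(result_ratings):
    """Combine per-result ratings (0 to 1) into one page score,
    weighting the top positions most heavily."""
    weights = reciprocal_rank_weights(len(result_ratings))
    return sum(weight * rating for weight, rating in zip(weights, result_ratings))

ratings = [1.0, 0.5, 0.5, 0.0]                  # illustrative ratings for four results
print(round(weighted_page_score(ratings), 2))   # 1*1.0 + 0.5*0.5 + 0.33*0.5 + 0.25*0.0 ≈ 1.42
```

Swapping the first and last results in this example changes the score far more than swapping the middle two, which is exactly the behaviour the weighting is meant to capture.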
Google is known to run 'live experiments' all the time. Paul Haahr says it is very rare for a search not to be part of at least one live experiment at any given point in time.
Google also employs what it calls a 'Needs Met' rating. The scale runs from 'Fully Meets' down to 'Fails to Meet', with 'Very Highly Meets', 'Highly Meets', 'Moderately Meets', and 'Slightly Meets' in between (a sketch of how such ratings might be averaged follows the descriptions below).
Fully Meets- The need is 'fully met' when a user types in, for example, 'wwe' (World Wrestling Entertainment) and Google returns 'www.wwe.com'. In other words, the query has an absolutely unambiguous answer page; there could be no other interpretation of the given query.
Very Highly Meets- A 'very highly met' need falls short of 'Fully Meets' only because the query could have more than one interpretation. A query that could reasonably resolve to a few different web pages or websites is 'Very Highly Met'.
Highly Meets- A 'highly met' need is a query that results in a highly informative, authoritative page, such as a Wikipedia article.
More Highly Meets- Depending on the user's location, Google gives different results. For example, if a person in Kashmir (India) types 'apple', the user is likely to be shown pages about the fruit, whereas a person in California typing 'apple' into Google search is likely to be shown pages about the technology company Apple, even though the user in Kashmir could well be looking for the gadgets and the user in California could be searching for the fruit.
Moderately Meets- The query returns several useful, informative links.
Slightly Meets- The query results in a less-than-ideal informative page: acceptable but not great, relevant but somewhat ambiguous.
Fails to Meet- The search result is completely ambiguous, and neither relevant nor useful to the user at all.
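To see how a scale like this could feed a measurable metric, here is a hedged sketch that maps each Needs Met label to a number and averages the ratings several raters give the same result; the numeric values are invented for illustration and are not Google's.

```python
# Hypothetical numeric values for each Needs Met label (not Google's actual scale).
NEEDS_MET_SCORES = {
    "Fully Meets": 1.0,
    "Very Highly Meets": 0.85,
    "Highly Meets": 0.7,
    "Moderately Meets": 0.5,
    "Slightly Meets": 0.25,
    "Fails to Meet": 0.0,
}

def average_needs_met(labels):
    """Average the ratings that several human raters gave the same result."""
    return sum(NEEDS_MET_SCORES[label] for label in labels) / len(labels)

raters = ["Highly Meets", "Moderately Meets", "Highly Meets"]
print(round(average_needs_met(raters), 2))   # (0.7 + 0.5 + 0.7) / 3 ≈ 0.63
```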
Page Quality Concepts by Google
- Expertise
- Authoritativeness
- Trustworthiness
Webpages and sites that deal with subjects like medical care are expected to rank high on trustworthiness because of the possible impact of their information on users.
This is how Google classifies 'low quality pages' (the criteria are turned into a simple checklist in the sketch after the list):
- The quality of the main content is low
- There is an unsatisfying amount of content on the main page
- The author lacks expertise, or is not trustworthy or authoritative, on the topic
- The website has a negative reputation
- The secondary content is distracting or unhelpful
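Read as a checklist, these criteria could be turned into a simple rule-based flag; the page fields and thresholds below are assumptions for illustration, not Google's classifier.

```python
def low_quality_reasons(page):
    """Return the low-quality criteria that apply to a page.
    Field names and thresholds are illustrative assumptions."""
    reasons = []
    if page["main_content_quality"] < 0.4:
        reasons.append("low quality main content")
    if page["main_content_words"] < 300:
        reasons.append("unsatisfying amount of main content")
    if not page["author_has_expertise"]:
        reasons.append("author lacks expertise, authority, or trust")
    if page["site_reputation"] < 0:
        reasons.append("negative website reputation")
    if page["secondary_content_distracting"]:
        reasons.append("distracting or unhelpful secondary content")
    return reasons

page = {
    "main_content_quality": 0.3,
    "main_content_words": 150,
    "author_has_expertise": False,
    "site_reputation": -1,
    "secondary_content_distracting": True,
}
print(low_quality_reasons(page))   # every criterion fires for this example page
```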