Every search engine spider looks for instructions in the robots.txt file before it proceeds with its regular tasks. If the robots.txt file blocks the crawler, the spider cannot proceed further. Only when a website is crawlable and indexable can the complete search engine process be initiated for it.
Links play a vital role in exploring the content inside a website. The search engine spider uses multiple link types to understand and validate the quality and relevance of the content.
We shall discuss robots.txt and links in detail in later sections of this post. We will start with the basic concepts and work up to the expert level.
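To make the robots.txt check concrete, here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt content, the example.com URLs, and the helper name is_crawlable are assumptions made up for illustration; a real spider applies its own, more elaborate rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the blocked path is made up for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_crawlable(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A page outside the blocked folder versus a page inside it.
print(is_crawlable(ROBOTS_TXT, "Googlebot", "https://example.com/blog/post.html"))
print(is_crawlable(ROBOTS_TXT, "Googlebot", "https://example.com/private/data.html"))
```

The first URL is allowed and the second is blocked, which is the situation described above where the spider cannot proceed further.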
What is a Search Engine?
A search engine is a program that searches for and identifies items in a database (a collection of websites) that correspond to keywords or characters specified by the user. It is used especially for finding particular sites on the World Wide Web.
A few examples of well-known search engines are:
- Bing
- Yahoo
- DuckDuckGo
- Baidu
- Yandex
How Does a Search Engine Work?
An official video from Google gives a detailed overview of how the Google search engine functions. Matt Cutts, a quality engineer from Google, explains the procedure step by step, with clear animation showing how search works.
Do not miss watching this video by Matt Cutts.
I will explain the logic in a manner similar to what Matt Cutts has explained in the video, but with personalized examples in my own style.
The main hero of the whole process is a software program called a spider. It is commonly referred to as a search engine spider.
What is a Search Engine Spider?
The spider checks, validates, updates, and adds new websites to the search engine database. Every search engine provider has its own database of websites, built by its own spider.
Here are a few well-known search engines and the names of their spiders:
- Google ---> Googlebot
- Bing ---> Bingbot
- Yahoo ---> Yahoo! Slurp
- DuckDuckGo ---> DuckDuckBot
- Baidu ---> Baiduspider
- Yandex ---> YandexBot
A search engine spider goes through two major processes while it works:
- Crawling
- Indexing
Understanding Search Engine Spider Crawling Process
In order for the search engine spider to initiate the crawling process, a website must be known to the search engine. We can inform the search engine about a new website by:
- Generating a sitemap file and submitting it to the respective search engines.
- Getting hyperlinks from another website (backlinks).
- Using the search engine submission process.
Once a spider comes to know that a website exists, it crawls and indexes the site so that it can be served to users searching for the related product or service. Once the content of a website has been crawled, the spider schedules future crawls periodically, based on the update frequency of the website. If the website is not updated as per the update frequency mentioned, the spider assigns the frequency automatically.
The best way to communicate with a search spider is to generate a sitemap file and reference it in the robots.txt file. This helps ensure that the sitemap file is found on every crawl, because the search engine spider checks the robots.txt file first and the remaining content later.
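As a rough illustration of adding a sitemap to robots.txt, the sketch below shows a robots.txt file with a Sitemap line and reads it back with Python's standard urllib.robotparser (the site_maps() method needs Python 3.8 or later). The example.com sitemap URL and the helper name find_sitemaps are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that allows crawling and points to a sitemap file.
ROBOTS_WITH_SITEMAP = """\
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
"""


def find_sitemaps(robots_txt: str) -> list:
    """Return the sitemap URLs declared in the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []


print(find_sitemaps(ROBOTS_WITH_SITEMAP))
```

Because the spider reads robots.txt before anything else, a Sitemap line placed there is one of the first things it sees.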
Let's try to understand the crawling process in detail.
For the search engine spider to crawl the website, links play a major role. They help the spider discover and explore more content. A short summary of the link types will add more value:
- Internal Links: A hyperlink created within the same domain or website.
- External Links (Backlinks): A hyperlink from another website.
- Outbound Links: A hyperlink to another website.
The crawling process starts with exploring all the links on the web page. The spider opens each link inside the website and checks its status. It follows the links that are Do Follow and typically skips the links marked as No Follow.
No Follow is a method of letting the robot know that a link need not be considered for crawling purposes. If this is confusing, read my article, Complete Guide for Understanding Do Follow and No Follow Links.
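The link handling described above can be sketched with Python's standard html.parser module. This is only an illustration under simple assumptions: the LinkCollector class, the sample HTML, and the example.com and example.org URLs are hypothetical, and a link counts as No Follow only when its rel attribute contains nofollow.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect <a href> links and label them as internal or outbound."""

    def __init__(self, page_url: str):
        super().__init__()
        self.page_url = page_url
        self.links = []  # list of (absolute_url, kind, follow) tuples

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if not href:
            return
        absolute = urljoin(self.page_url, href)
        same_host = urlparse(absolute).netloc == urlparse(self.page_url).netloc
        kind = "internal" if same_host else "outbound"
        rel_values = (attributes.get("rel") or "").lower().split()
        follow = "nofollow" not in rel_values
        self.links.append((absolute, kind, follow))


# Hypothetical page content with one internal and two outbound links.
SAMPLE_HTML = """
<a href="/about.html">About</a>
<a href="https://other.example.org/page" rel="nofollow">Partner</a>
<a href="https://other.example.org/guide">Guide</a>
"""

collector = LinkCollector("https://example.com/index.html")
collector.feed(SAMPLE_HTML)

# Only the Do Follow links are queued for crawling.
to_crawl = [url for url, kind, follow in collector.links if follow]
print(to_crawl)
```

The link marked rel="nofollow" is collected but left out of the crawl queue, while the other two links are kept.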
The spider uses the status code of each link. Here is a short summary of the link status codes:
Status Code 200 ---> OK
Status Code 301 ---> Permanent Redirection
Status Code 302 ---> Temporary Redirection
Status Code 500 ---> Internal Server Error
Status Code 404 ---> Page Not Found
Of the above status codes, 200, 301, and 302 are considered for crawling and indexing, while 500 and 404 cannot be considered. Hence, a web page that is not working should be fixed so that its status changes from 500 or 404 to 200, 301, or 302.
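The rule above can be written down as a small lookup. This is a sketch of the rule exactly as stated in this post, not of how any particular search engine treats status codes; the names CRAWLABLE_CODES and describe_status are made up for illustration, and the phrases come from Python's standard http.HTTPStatus, so they differ slightly from the labels in the summary.

```python
from http import HTTPStatus

# Status codes this post treats as acceptable for crawling and indexing.
CRAWLABLE_CODES = {200, 301, 302}


def describe_status(code: int) -> str:
    """Return the standard phrase for a status code plus the verdict from the text."""
    phrase = HTTPStatus(code).phrase
    if code in CRAWLABLE_CODES:
        verdict = "crawl and index"
    else:
        verdict = "fix before it can be indexed"
    return f"{code} {phrase}: {verdict}"


for code in (200, 301, 302, 404, 500):
    print(describe_status(code))
```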
If the information shared above is confusing, please read my complete guide to Understanding HTTP Status Code in SEO.
In the above process, all the working links are added to the search engine database if they are not there already. All the existing links are updated with the latest information.
The non-working links are removed from the database. For all the working links, a vote or value called link juice is passed. This link juice is what accumulates into PageRank. It is also reflected in third-party metrics such as Domain Authority and Page Authority.
The higher the Domain Authority (DA) or Page Authority (PA), the higher the probability that a website or web page will rank better.
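To show how link juice can accumulate, here is a simplified sketch of the textbook PageRank iteration. It is not the production algorithm of any search engine, and it does not compute DA or PA. The page names, the link graph, the damping value of 0.85, and the fixed number of iterations are all assumptions chosen for illustration.

```python
def pagerank(links: dict, damping: float = 0.85, iterations: int = 50) -> dict:
    """Simplified PageRank: every page shares its score among the pages it links to.

    links maps each page to the list of pages it links to. Every linked page
    must also appear as a key of links.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small base score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            # A page without outgoing links spreads its score over all pages.
            targets = outgoing if outgoing else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical site: every page links to "home", but nothing links to "orphan".
GRAPH = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home"],
    "orphan": ["home"],
}

ranks = pagerank(GRAPH)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

In this toy graph the page that receives the most links ends up with the highest score, and the page that receives none ends up with the lowest.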
I would relate the crawling process to a company's interview process for a job opening. The company checks the candidate's skill set, IQ level, and, most of all, whether the candidate is the right person to work with the organization. The organization hires the candidate only when the requirement criteria are met. Similarly, a search engine spider checks the content quality during the crawling process.
Crawling is a quality check process executed by search engine spiders to check whether a website and its contents meet the quality guidelines laid down by the search engines.
From a search engine's point of view, quality content must be:
- Unique
- Detailed Information (Everything about something)
- Must add value to the existing content
- Easy to read
- Actionable
Is my Content Crawled by Search Engine?
If the content of the website is not crawlable, we cannot expect any organic results. Only when a website allows crawling is a copy of the crawled website saved as a cached copy on Google's servers. This cached copy of the website is updated every time the link is crawled.
The different ways to check the crawled status of an article are:
- cache: operator
- Cached option on Google SERP (Search Engine Results Page)
The screen below shows how to use the above methods to check the crawled status of the website.
Method 1:
Open your favorite web browser, type cache:domainname, and hit Enter. The cached copy of the website will be displayed. In the above example, the result says the website content was cached on 28 Sep 2019.
Method 2:
Open google.com in your favorite browser and search for your brand name, domain name, or any keyword related to your business. The results related to your website may be displayed. Look for your domain name, find the small downward-facing arrow next to it, and click on it. This will show the "Cached" option. Click it and it will take you to the same page that was shown in Method 1.
Understanding Search Engine Spider Indexing Process
Indexing is the second phase of the search engine spider process. We can relate the indexing process to the induction process in a company. In general, the selected candidates are enrolled on the payroll through a process called induction. Likewise, quality website articles that have been crawled are indexed in the search engine database.
We can define indexing as the process of including or adding a website in the search engine database. Once a website or an article is indexed, its content can be found via the search engine.
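One common way to picture such a database is an inverted index, which maps each word to the pages that contain it. The sketch below is a toy version under stated assumptions: the page URLs and text are made up, the function names build_index and search are hypothetical, and real search engines store far more than plain words.

```python
from collections import defaultdict


def build_index(pages: dict) -> dict:
    """Map every word to the set of page URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index: dict, query: str) -> set:
    """Return the pages that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical crawled pages: URL -> extracted text.
PAGES = {
    "https://example.com/seo-guide": "seo starter guide for beginners",
    "https://example.com/crawling": "how crawling works in seo",
}

index = build_index(PAGES)
print(search(index, "seo guide"))
print(search(index, "indexing"))
```

A page that was never added to the index can never appear in the results, which is why indexing has to happen before content can be found through search.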
We can check whether content is indexed in Google using the methods below:
- Searching your brand name in the search engine.
- site: Google operator
- URL Inspection tool in Google Search Console (formerly Google Webmaster Tools)
Searching Your Brand Name in The Search Engine
All you need to do is enter your brand name in the Google search engine, and all the websites related to your keyword will be displayed. If your website appears, it is indexed in the search engine's list of websites.
This method is not foolproof, and it cannot be used for checking the index status of a single article or piece of content.
site: Operator
The next method we are about to discuss is the site: operator.
Syntax: site:domain name or specific URL
In order to check the indexed status of a website or a specific URL, search for it as shown below:
site:https://www.amudhakumar.com
site:https://www.amudhakumar.com/2019/09/search-engine-optimization-seo-starter-guide.html
This will show the results as given below:
If the link or the content is not indexed, the message will say that the link was not found in the search engine's list or database. This is how the message will pop up.
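If you want to open such a check directly from a script or a bookmark, the site: query can be turned into a search URL. This is a small sketch that only builds the URL. The helper name site_query_url is made up, and the google.com/search?q= pattern is the ordinary search address, not an official API for checking index status.

```python
from urllib.parse import urlencode


def site_query_url(target: str) -> str:
    """Build a Google search URL that runs the site: operator for target."""
    return "https://www.google.com/search?" + urlencode({"q": f"site:{target}"})


print(site_query_url("https://www.amudhakumar.com"))
```

Opening the printed address in a browser shows the same result page as typing the site: query by hand.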