How to fix high server load caused by bot indexing

If our Technical Support team has contacted you due to your website or server being under heavy load, it could be for a number of reasons, including:

Your website is experiencing a high number of website visitors
Your website or server is under some form of DDoS attack
Search engine or other bots could be crawling your website excessively

In this article, we’ll cover the details on what you can do if bots are causing high server load.

What is a bot?

A bot or web spider is a software application that performs repetitive, automated tasks over the Internet. Search engines, such as Google, use these bots to crawl websites and collect information about them. When a bot crawls a website, it uses the same resources that a normal visitor would, including bandwidth and server resources.

Not all bots are benign in nature, though: some bots crawl websites for more nefarious purposes, such as harvesting email addresses for spamming or looking for vulnerabilities to exploit.

Identifying bot traffic

There are a number of ways to identify bot traffic on your website or server. The easiest and most effective way to do this is via the AWStats tool in konsoleH. AWStats provides a report on bot traffic, called Robots/Spiders visitors. This report lists each bot that has crawled your site, the number of hits, bandwidth consumed and date of the last crawl. Bot bandwidth usage should not exceed a few megabytes per month.

If the AWStats report does not give you the information you require, you can always look through your website or server log files. Every visit, whether it’s a human or bot, will be recorded as an entry in the log files. Note, though, that logs are not available for the current day. The key information recorded in the log entry is as follows:

Source IP address: The IP address from where the request came
Request Timestamp: The date and time the request was made
Page Requested: If a web server log, the page that was visited
User Agent: The name of the software from which the request originated. For normal visitors, this is usually the browser name and version, while for bots it’s the name of the bot.

Using this information, specifically the User Agent and Source IP address, you can usually differentiate normal visitors from bots. Other information, such as Page Requested, can also help, as bots generally only request pages and skip assets such as images and JavaScript files.
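As a rough illustration of reading these fields out of a log file, here is a minimal Python sketch. It assumes the log uses the Apache "combined" format, which this article does not confirm for your server, and the sample entries (including the bot name ExampleBot) are made up. Adjust the pattern if your log lines look different.

```python
import re
from collections import Counter

# Assumed layout: Apache "combined" log format. Adjust the pattern if your
# server writes a different format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<page>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<user_agent>[^"]*)"'
)


def parse_line(line):
    """Return the key fields of one log entry, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def hits_by_user_agent(lines):
    """Count requests per User Agent, busiest first."""
    counts = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry:
            counts[entry["user_agent"]] += 1
    return counts.most_common()


# Made-up sample entries for illustration only.
SAMPLE_LOG = [
    '203.0.113.5 - - [10/Oct/2023:13:55:36 +0200] "GET /index.html HTTP/1.1" 200 2326 "-" "ExampleBot/1.0"',
    '203.0.113.5 - - [10/Oct/2023:13:55:37 +0200] "GET /about.html HTTP/1.1" 200 1200 "-" "ExampleBot/1.0"',
    '198.51.100.7 - - [10/Oct/2023:13:56:01 +0200] "GET /logo.png HTTP/1.1" 200 5120 "http://example.com/" "Mozilla/5.0"',
]

if __name__ == "__main__":
    for agent, hits in hits_by_user_agent(SAMPLE_LOG):
        print(f"{hits:5d}  {agent}")
```

A User Agent that accounts for a large share of the hits, and that never requests images or scripts, is a good candidate for a closer look.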

Legitimate bots used by search engines often have distinct and easily identifiable names, such as Googlebot and Bingbot (for Google and Bing respectively), while others have less obvious names, such as Slurp for Yahoo.

Malicious bots are far harder to identify as they don’t stick to the rules that the legitimate bots follow. For example, malicious bots may mimic other bots or web browsers by using the same or similar User Agent. However, it’s still possible to spot bad bots as follows:

Doing a reverse DNS lookup on the source IP address should give you the hostname of the bot. All major search engine bot IP addresses should resolve to specific hostnames. For example, Googlebot addresses resolve to hostnames under googlebot.com or google.com.
Malicious bots ignore the Robots Exclusion Standard (see below for more info on this), so if you get bots hitting pages that should be excluded, it would indicate that the bot is bad.
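The reverse DNS check can be scripted. The sketch below uses Python's standard socket module and the two Googlebot domains mentioned above. It also adds a forward lookup to confirm that the hostname points back to the same IP address; that extra step is not described in this article, but it guards against a spoofed reverse DNS record. The function name and parameters are illustrative, and the forward lookup used here only handles IPv4.

```python
import socket

# Hostname suffixes taken from the Googlebot example above; add the suffixes
# published by any other search engine you want to verify.
GOOGLEBOT_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_bot(ip, suffixes=GOOGLEBOT_SUFFIXES,
                    reverse_lookup=socket.gethostbyaddr,
                    forward_lookup=socket.gethostbyname_ex):
    """Check that ip reverse-resolves to an expected hostname and that the
    hostname resolves back to the same ip (forward confirmation)."""
    try:
        hostname = reverse_lookup(ip)[0]
    except OSError:
        return False  # no reverse DNS record for this address
    if not hostname.lower().endswith(suffixes):
        return False  # hostname is outside the expected domains
    try:
        addresses = forward_lookup(hostname)[2]
    except OSError:
        return False  # hostname does not resolve at all
    return ip in addresses
```

Calling is_verified_bot with the source IP address from a log entry that claims to be Googlebot returns True only if both lookups agree. The lookup functions are passed in as parameters so the check can be tried out without network access.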

How can bots trigger high server load?

Malicious bots may deliberately cause high server resource usage as part of a DDoS attack. This includes hitting the website or server with thousands of concurrent requests or flooding the server with large data requests.
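A flood of requests from a single source usually shows up clearly in the access log. The following Python sketch counts requests per IP address per minute and reports any address that reaches a chosen threshold. It assumes Apache-style log lines (IP address first, timestamp in square brackets) and English month abbreviations; the threshold is arbitrary and the function name is illustrative.

```python
import re
from collections import Counter
from datetime import datetime

# Assumes Apache "common" or "combined" log lines: the IP address comes first
# and the timestamp sits in [square brackets].
LINE_PATTERN = re.compile(r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\]')


def busiest_minutes(lines, threshold):
    """Return (ip, minute, hits) tuples where one IP address made at least
    `threshold` requests within the same clock minute, busiest first."""
    counts = Counter()
    for line in lines:
        match = LINE_PATTERN.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        when = datetime.strptime(match["timestamp"], "%d/%b/%Y:%H:%M:%S %z")
        counts[(match["ip"], when.strftime("%Y-%m-%d %H:%M"))] += 1
    return [(ip, minute, hits)
            for (ip, minute), hits in counts.most_common()
            if hits >= threshold]
```

What counts as too many requests per minute depends on your site, so treat the threshold as a starting point rather than a rule.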

Legitimate bots usually consume a manageable amount of resources. In some cases, however, even legitimate bots can trigger high server resource usage. For example:

If lots of new content is added to your website, the search engine bots could crawl your website more aggressively to index the new content.
There could be a problem with your website, and the bots could be triggering this fault causing a resource-intensive operation, such as an infinite loop.

How to mitigate the problem

There are a number of ways you can fix the problem of bots consuming high levels of server resources. The solution is dependent on the type of bot.

Malicious bots

These bots, unfortunately, do not follow the standard protocols when it comes to web crawlers. This makes it harder to prevent these bots from crawling your website. Your best defense against these types of bots is to identify the source of the malicious traffic and block access to your site from these sources. There are a number of ways to do this:

.htaccess File

The .htaccess file is a file that sits in the root of your website and contains instructions on how your website can be accessed.

Using this file you can block access to requests that come from certain sources, based on their IP address. Using the rule…

Order Deny,Allow
Deny from 127.0.0.1
Deny from 192.168.1.1

…will block access to your website from the IP addresses 127.0.0.1 and 192.168.1.1.

You can also block access based on the User Agent. For example:

BrowserMatchNoCase BadSpamBot badspambot
Order Deny,Allow
Deny from env=badspambot

This will block access for any bot whose User Agent contains BadSpamBot (the match is not case-sensitive).

Extreme care must be taken when working with the .htaccess file, though: if the file contains an error (such as a typo), it can stop your whole website from working. Also, if you accidentally block legitimate traffic (such as a search engine bot), you run the risk of having your whole site removed from that particular search engine.
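One way to reduce the risk of a typo is to generate the Deny lines from a list of addresses and validate each one first. The Python sketch below does this with the standard ipaddress module. The function name is illustrative, the addresses in the usage note are documentation-only examples, and the output still needs to be reviewed before you paste it into your .htaccess file.

```python
import ipaddress


def build_deny_block(addresses):
    """Return .htaccess lines that deny the given IP addresses or CIDR ranges.

    Raises ValueError if an entry is not a valid address or network, so a
    mistyped address is caught before it reaches the .htaccess file.
    """
    lines = ["Order Deny,Allow"]
    for raw in addresses:
        network = ipaddress.ip_network(raw.strip(), strict=False)
        # A single host is written as a plain address, a range in CIDR form.
        if network.num_addresses == 1:
            lines.append(f"Deny from {network.network_address}")
        else:
            lines.append(f"Deny from {network}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_deny_block(["192.0.2.10", "198.51.100.0/24"]))
```

A mistyped entry such as 192.0.2.300 raises an error instead of producing a rule, which is the point of validating first.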

konsoleH IP blocker

If you don’t feel comfortable editing the .htaccess file directly, you can use our IP Blocking tool in konsoleH to stop malicious bots hitting your site, based on their IP address. For more information on using this tool, see this article.

Legitimate bots

Thankfully, with legitimate bots (such as those from search engines), there are a number of ways you can control access to your website or server.

Robots exclusion standard

The Robots Exclusion Standard is used to tell bots where they can and cannot go on your website. These rules are contained in a file called robots.txt that needs to exist in the root of your website. Here are a few examples of what you can do with the robots.txt file:

Stop all bots from crawling your website. This should only be done on sites that you don’t want to appear in search engines, as blocking all bots will prevent the site from being indexed.
Stop all bots from accessing certain parts of your website. This is useful when your site has a lot of pages that provide little value or extra content. Preventing bots from crawling these extra pages reduces the strain on your server.
Block only certain bots from your website. Some bots are used by services that may not be relevant or valuable to your website. For example, there may be no need to have Yandex, which is a Russian search engine, crawling your website. In this case you can specifically block Yandex from your website.
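The three cases above could look roughly like the robots.txt contents in the following Python sketch, which also checks each rule set with the standard urllib.robotparser module before you upload it. The paths /search/ and /calendar/ are hypothetical placeholders for low-value sections of a site, and Yandex is assumed here to be the User-agent name that crawler responds to.

```python
from urllib.robotparser import RobotFileParser


def parse_rules(text):
    """Parse robots.txt content held in a string."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# 1. Stop all bots from crawling the whole site.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

# 2. Stop all bots from accessing certain parts (hypothetical paths).
BLOCK_SECTIONS = """\
User-agent: *
Disallow: /search/
Disallow: /calendar/
"""

# 3. Block only one bot and leave every other bot unrestricted.
BLOCK_YANDEX = """\
User-agent: Yandex
Disallow: /

User-agent: *
Disallow:
"""

if __name__ == "__main__":
    rules = parse_rules(BLOCK_SECTIONS)
    for path in ("/index.html", "/search/results"):
        print(path, rules.can_fetch("Googlebot", path))
```

Only the text inside each triple-quoted string belongs in the robots.txt file itself. Remember that these rules are only honoured by bots that choose to follow the standard.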

Fix any website errors

If a bot is triggering a website error, which is causing a spike in server resources, it could be easier to fix the error. High server usage through a website error may not exclusively be triggered by a bot – normal visitors could trigger this error too. You can identify where the error is occurring by examining your server error log files.

Website caching or CDN

Server-side website caching will reduce any unnecessary hits on your server by serving static versions of your website, rather than having the content generated from a database. Caching has other benefits, including faster load times for your real visitors.

Similarly, a Content Delivery Network or CDN will off-load content from your server to high-performance nodes that are geographically closer to the source of the request. Like server-side caching, this will reduce the number of hits on your server, reducing server load.

Limit crawl rate

Most major search engines provide a way to control the rate at which their bots hit your server. This does not control how often a bot will crawl your site, but how many resources are consumed when it does. By reducing this rate you should be able to minimise the impact the bots have on your server resources.

Here are instructions for Google and Bing on how to limit crawl rates:

Google

Log into Search Console
On the Search Console Home page, click the site that you want.
Click the gear icon, then click Site Settings.
In the Crawl rate section, select the option you want and then limit the crawl rate as desired.

Bing

Log into Bing Webmaster Tools
Click on the site you want
Expand Configure My Site
Click Crawl Control
Select the preferred times at which you want Bingbot to reduce its load on your server
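Separately from the search engines' own tools, some crawlers also read a Crawl-delay line in robots.txt. This is not covered in the steps above and is not part of the original Robots Exclusion Standard, so support varies: treat it as a request that a given crawler may ignore (Googlebot, for one, does not use it). The Python sketch below shows what such a rule could look like and reads it back with urllib.robotparser; the 10-second value is an arbitrary example.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content asking one crawler to wait 10 seconds between
# requests. Crawl-delay is a non-standard directive that some bots ignore.
CRAWL_DELAY_RULES = """\
User-agent: bingbot
Crawl-delay: 10
"""


def read_crawl_delay(text, user_agent):
    """Return the Crawl-delay that applies to user_agent, or None."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser.crawl_delay(user_agent)


if __name__ == "__main__":
    print(read_crawl_delay(CRAWL_DELAY_RULES, "bingbot"))
```

As with the other robots.txt rules, only the text inside the triple-quoted string goes into the file itself.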

Source: https://xneelo.co.za/help-centre/website/bot-indexing/