how to detect search engine bots with php?

PhpWeb CrawlerBots

Php Problem Overview


How can one detect the search engine bots using php?

Php Solutions


Solution 1 - Php

I use the following code which seems to be working fine:

function _bot_detected() {

  return (
    isset($_SERVER['HTTP_USER_AGENT'])
    && preg_match('/bot|crawl|slurp|spider|mediapartners/i', $_SERVER['HTTP_USER_AGENT'])
  );
}

update 16-06-2017 https://support.google.com/webmasters/answer/1061943?hl=en

added mediapartners

Solution 2 - Php

Here's a Search Engine Directory of Spider names

Then you use $_SERVER['HTTP_USER_AGENT']; to check if the agent is said spider.

if(strstr(strtolower($_SERVER['HTTP_USER_AGENT']), "googlebot"))
{
    // what to do
}

Solution 3 - Php

Check the $_SERVER['HTTP_USER_AGENT'] for some of the strings listed here:

http://www.useragentstring.com/pages/useragentstring.php

Or more specifically for crawlers:

http://www.useragentstring.com/pages/useragentstring.php?typ=Crawler

If you want to -say- log the number of visits of most common search engine crawlers, you could use

$interestingCrawlers = array( 'google', 'yahoo' );
$pattern = '/(' . implode('|', $interestingCrawlers) .')/';
$matches = array();
$numMatches = preg_match($pattern, strtolower($_SERVER['HTTP_USER_AGENT']), $matches, 'i');
if($numMatches > 0) // Found a match
{
  // $matches[1] contains an array of all text matches to either 'google' or 'yahoo'
}

Solution 4 - Php

You can checkout if it's a search engine with this function :

<?php
function crawlerDetect($USER_AGENT)
{
$crawlers = array(
'Google' => 'Google',
'MSN' => 'msnbot',
      'Rambler' => 'Rambler',
      'Yahoo' => 'Yahoo',
      'AbachoBOT' => 'AbachoBOT',
      'accoona' => 'Accoona',
      'AcoiRobot' => 'AcoiRobot',
      'ASPSeek' => 'ASPSeek',
      'CrocCrawler' => 'CrocCrawler',
      'Dumbot' => 'Dumbot',
      'FAST-WebCrawler' => 'FAST-WebCrawler',
      'GeonaBot' => 'GeonaBot',
      'Gigabot' => 'Gigabot',
      'Lycos spider' => 'Lycos',
      'MSRBOT' => 'MSRBOT',
      'Altavista robot' => 'Scooter',
      'AltaVista robot' => 'Altavista',
      'ID-Search Bot' => 'IDBot',
      'eStyle Bot' => 'eStyle',
      'Scrubby robot' => 'Scrubby',
      'Facebook' => 'facebookexternalhit',
  );
  // to get crawlers string used in function uncomment it
  // it is better to save it in string than use implode every time
  // global $crawlers
   $crawlers_agents = implode('|',$crawlers);
  if (strpos($crawlers_agents, $USER_AGENT) === false)
      return false;
    else {
    return TRUE;
    }
}
?>

Then you can use it like :

<?php $USER_AGENT = $_SERVER['HTTP_USER_AGENT'];
  if(crawlerDetect($USER_AGENT)) return "no need to lang redirection";?>

Solution 5 - Php

I'm using this to detect bots:

if (preg_match('/bot|crawl|curl|dataprovider|search|get|spider|find|java|majesticsEO|google|yahoo|teoma|contaxe|yandex|libwww-perl|facebookexternalhit/i', $_SERVER['HTTP_USER_AGENT'])) {
    // is bot
}

In addition I use a whitelist to block unwanted bots:

if (preg_match('/apple|baidu|bingbot|facebookexternalhit|googlebot|-google|ia_archiver|msnbot|naverbot|pingdom|seznambot|slurp|teoma|twitter|yandex|yeti/i', $_SERVER['HTTP_USER_AGENT'])) {
    // allowed bot
}

An unwanted bot (= false-positive user) is then able to solve a captcha to unblock himself for 24 hours. And as no one solves this captcha, I know it does not produce false-positives. So the bot detection seem to work perfectly.

Note: My whitelist is based on Facebooks robots.txt.

Solution 6 - Php

Because any client can set the user-agent to what they want, looking for 'Googlebot', 'bingbot' etc is only half the job.

The 2nd part is verifying the client's IP. In the old days this required maintaining IP lists. All the lists you find online are outdated. The top search engines officially support verification through DNS, as explained by Google https://support.google.com/webmasters/answer/80553 and Bing http://www.bing.com/webmaster/help/how-to-verify-bingbot-3905dc26

At first perform a reverse DNS lookup of the client IP. For Google this brings a host name under googlebot.com, for Bing it's under search.msn.com. Then, because someone could set such a reverse DNS on his IP, you need to verify with a forward DNS lookup on that hostname. If the resulting IP is the same as the one of the site's visitor, you're sure it's a crawler from that search engine.

I've written a library in Java that performs these checks for you. Feel free to port it to PHP. It's on GitHub: https://github.com/optimaize/webcrawler-verifier

Solution 7 - Php

I use this function ... part of the regex comes from prestashop but I added some more bot to it.

    public function isBot()
{
    $bot_regex = '/BotLink|bingbot|AhrefsBot|ahoy|AlkalineBOT|anthill|appie|arale|araneo|AraybOt|ariadne|arks|ATN_Worldwide|Atomz|bbot|Bjaaland|Ukonline|borg\-bot\/0\.9|boxseabot|bspider|calif|christcrawler|CMC\/0\.01|combine|confuzzledbot|CoolBot|cosmos|Internet Cruiser Robot|cusco|cyberspyder|cydralspider|desertrealm, desert realm|digger|DIIbot|grabber|downloadexpress|DragonBot|dwcp|ecollector|ebiness|elfinbot|esculapio|esther|fastcrawler|FDSE|FELIX IDE|ESI|fido|H�m�h�kki|KIT\-Fireball|fouineur|Freecrawl|gammaSpider|gazz|gcreep|golem|googlebot|griffon|Gromit|gulliver|gulper|hambot|havIndex|hotwired|htdig|iajabot|INGRID\/0\.1|Informant|InfoSpiders|inspectorwww|irobot|Iron33|JBot|jcrawler|Teoma|Jeeves|jobo|image\.kapsi\.net|KDD\-Explorer|ko_yappo_robot|label\-grabber|larbin|legs|Linkidator|linkwalker|Lockon|logo_gif_crawler|marvin|mattie|mediafox|MerzScope|NEC\-MeshExplorer|MindCrawler|udmsearch|moget|Motor|msnbot|muncher|muninn|MuscatFerret|MwdSearch|sharp\-info\-agent|WebMechanic|NetScoop|newscan\-online|ObjectsSearch|Occam|Orbsearch\/1\.0|packrat|pageboy|ParaSite|patric|pegasus|perlcrawler|phpdig|piltdownman|Pimptrain|pjspider|PlumtreeWebAccessor|PortalBSpider|psbot|Getterrobo\-Plus|Raven|RHCS|RixBot|roadrunner|Robbie|robi|RoboCrawl|robofox|Scooter|Search\-AU|searchprocess|Senrigan|Shagseeker|sift|SimBot|Site Valet|skymob|SLCrawler\/2\.0|slurp|ESI|snooper|solbot|speedy|spider_monkey|SpiderBot\/1\.0|spiderline|nil|suke|http:\/\/www\.sygol\.com|tach_bw|TechBOT|templeton|titin|topiclink|UdmSearch|urlck|Valkyrie libwww\-perl|verticrawl|Victoria|void\-bot|Voyager|VWbot_K|crawlpaper|wapspider|WebBandit\/1\.0|webcatcher|T\-H\-U\-N\-D\-E\-R\-S\-T\-O\-N\-E|WebMoose|webquest|webreaper|webs|webspider|WebWalker|wget|winona|whowhere|wlm|WOLP|WWWC|none|XGET|Nederland\.zoek|AISearchBot|woriobot|NetSeer|Nutch|YandexBot|YandexMobileBot|SemrushBot|FatBot|MJ12bot|DotBot|AddThis|baiduspider|SeznamBot|mod_pagespeed|CCBot|openstat.ru\/Bot|m2e/i';
    $userAgent = empty($_SERVER['HTTP_USER_AGENT']) ? FALSE : $_SERVER['HTTP_USER_AGENT'];
    $isBot = !$userAgent || preg_match($bot_regex, $userAgent);

    return $isBot;
}

Anyway take care that some bots uses browser like user agent to fake their identity
( I got many russian ip that has this behaviour on my site )

One distinctive feature of most of the bot is that they don't carry any cookie and so no session is attached to them.
( I am not sure how but this is for sure the best way to track them )

Solution 8 - Php

If you really need to detect GOOGLE engine bots you should never rely on "user_agent" or "IP" address because "user_agent" can be changed and acording to what google said in: Verifying Googlebot

> To verify Googlebot as the caller: > > 1.Run a reverse DNS lookup on the accessing IP address from your logs, using the host command. > > 2.Verify that the domain name is in either googlebot.com or google.com > > 3.Run a forward DNS lookup on the domain name retrieved in step 1 using the host command on the retrieved domain name. Verify that it is the same as the original accessing IP address from your logs.

Here is my tested code :

<?php
$remote_add=$_SERVER['REMOTE_ADDR'];
$hostname = gethostbyaddr($remote_add);
$googlebot = 'googlebot.com';
$google = 'google.com';
if (stripos(strrev($hostname), strrev($googlebot)) === 0 or stripos(strrev($hostname),strrev($google)) === 0 ) 
{
//add your code
}

?>

In this code we check "hostname" which should contain "googlebot.com" or "google.com" at the end of "hostname" which is really important to check exact domain not subdomain. I hope you enjoy ;)

Solution 9 - Php

You could analyse the user agent ($_SERVER['HTTP_USER_AGENT']) or compare the client’s IP address ($_SERVER['REMOTE_ADDR']) with a list of IP addresses of search engine bots.

Solution 10 - Php

Use Device Detector open source library, it offers a isBot() function: https://github.com/piwik/device-detector

Solution 11 - Php

 <?php // IPCLOACK HOOK
if (CLOAKING_LEVEL != 4) {
	$lastupdated = date("Ymd", filemtime(FILE_BOTS));
	if ($lastupdated != date("Ymd")) {
		$lists = array(
		'http://labs.getyacg.com/spiders/google.txt',
		'http://labs.getyacg.com/spiders/inktomi.txt',
		'http://labs.getyacg.com/spiders/lycos.txt',
		'http://labs.getyacg.com/spiders/msn.txt',
		'http://labs.getyacg.com/spiders/altavista.txt',
		'http://labs.getyacg.com/spiders/askjeeves.txt',
		'http://labs.getyacg.com/spiders/wisenut.txt',
		);
		foreach($lists as $list) {
			$opt .= fetch($list);
		}
		$opt = preg_replace("/(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+/", "\n", $opt);
		$fp =  fopen(FILE_BOTS,"w");
		fwrite($fp,$opt);
		fclose($fp);
	}
	$ip = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : '';
	$ref = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
	$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
	$host = strtolower(gethostbyaddr($ip));
	$file = implode(" ", file(FILE_BOTS));
	$exp = explode(".", $ip);
	$class = $exp[0].'.'.$exp[1].'.'.$exp[2].'.';
	$threshold = CLOAKING_LEVEL;
	$cloak = 0;
	if (stristr($host, "googlebot") && stristr($host, "inktomi") && stristr($host, "msn")) {
		$cloak++;
	}
	if (stristr($file, $class)) {
		$cloak++;
	}
	if (stristr($file, $agent)) {
		$cloak++;
	}
	if (strlen($ref) > 0) {
		$cloak = 0;
	}
	
	if ($cloak >= $threshold) {
		$cloakdirective = 1;
	} else {
		$cloakdirective = 0;
	}
}
?>

That would be the ideal way to cloak for spiders. It's from an open source script called [YACG] - http://getyacg.com

Needs a bit of work, but definitely the way to go.

Solution 12 - Php

I made one good and fast function for this

function is_bot(){
		
		if(isset($_SERVER['HTTP_USER_AGENT']))
		{
			return preg_match('/rambler|abacho|acoi|accona|aspseek|altavista|estyle|scrubby|lycos|geona|ia_archiver|alexa|sogou|skype|facebook|twitter|pinterest|linkedin|naver|bing|google|yahoo|duckduckgo|yandex|baidu|teoma|xing|java\/1.7.0_45|bot|crawl|slurp|spider|mediapartners|\sask\s|\saol\s/i', $_SERVER['HTTP_USER_AGENT']);
		}
		
		return false;
	}

This cover 99% of all possible bots, search engines etc.

Solution 13 - Php

100% Working Bot detector. It is working on my website successfully.

function isBotDetected() {

	if ( preg_match('/abacho|accona|AddThis|AdsBot|ahoy|AhrefsBot|AISearchBot|alexa|altavista|anthill|appie|applebot|arale|araneo|AraybOt|ariadne|arks|aspseek|ATN_Worldwide|Atomz|baiduspider|baidu|bbot|bingbot|bing|Bjaaland|BlackWidow|BotLink|bot|boxseabot|bspider|calif|CCBot|ChinaClaw|christcrawler|CMC\/0\.01|combine|confuzzledbot|contaxe|CoolBot|cosmos|crawler|crawlpaper|crawl|curl|cusco|cyberspyder|cydralspider|dataprovider|digger|DIIbot|DotBot|downloadexpress|DragonBot|DuckDuckBot|dwcp|EasouSpider|ebiness|ecollector|elfinbot|esculapio|ESI|esther|eStyle|Ezooms|facebookexternalhit|facebook|facebot|fastcrawler|FatBot|FDSE|FELIX IDE|fetch|fido|find|Firefly|fouineur|Freecrawl|froogle|gammaSpider|gazz|gcreep|geona|Getterrobo-Plus|get|girafabot|golem|googlebot|\-google|grabber|GrabNet|griffon|Gromit|gulliver|gulper|hambot|havIndex|hotwired|htdig|HTTrack|ia_archiver|iajabot|IDBot|Informant|InfoSeek|InfoSpiders|INGRID\/0\.1|inktomi|inspectorwww|Internet Cruiser Robot|irobot|Iron33|JBot|jcrawler|Jeeves|jobo|KDD\-Explorer|KIT\-Fireball|ko_yappo_robot|label\-grabber|larbin|legs|libwww-perl|linkedin|Linkidator|linkwalker|Lockon|logo_gif_crawler|Lycos|m2e|majesticsEO|marvin|mattie|mediafox|mediapartners|MerzScope|MindCrawler|MJ12bot|mod_pagespeed|moget|Motor|msnbot|muncher|muninn|MuscatFerret|MwdSearch|NationalDirectory|naverbot|NEC\-MeshExplorer|NetcraftSurveyAgent|NetScoop|NetSeer|newscan\-online|nil|none|Nutch|ObjectsSearch|Occam|openstat.ru\/Bot|packrat|pageboy|ParaSite|patric|pegasus|perlcrawler|phpdig|piltdownman|Pimptrain|pingdom|pinterest|pjspider|PlumtreeWebAccessor|PortalBSpider|psbot|rambler|Raven|RHCS|RixBot|roadrunner|Robbie|robi|RoboCrawl|robofox|Scooter|Scrubby|Search\-AU|searchprocess|search|SemrushBot|Senrigan|seznambot|Shagseeker|sharp\-info\-agent|sift|SimBot|Site Valet|SiteSucker|skymob|SLCrawler\/2\.0|slurp|snooper|solbot|speedy|spider_monkey|SpiderBot\/1\.0|spiderline|spider|suke|tach_bw|TechBOT|TechnoratiSnoop|templeton|teoma|titin|topiclink|twitterbot|twitter|UdmSearch|Ukonline|UnwindFetchor|URL_Spider_SQL|urlck|urlresolver|Valkyrie libwww\-perl|verticrawl|Victoria|void\-bot|Voyager|VWbot_K|wapspider|WebBandit\/1\.0|webcatcher|WebCopier|WebFindBot|WebLeacher|WebMechanic|WebMoose|webquest|webreaper|webspider|webs|WebWalker|WebZip|wget|whowhere|winona|wlm|WOLP|woriobot|WWWC|XGET|xing|yahoo|YandexBot|YandexMobileBot|yandex|yeti|Zeus/i', $_SERVER['HTTP_USER_AGENT'])
	) {
		return true; // 'Above given bots detected'
	}

	return false;

} // End :: isBotDetected()

Solution 14 - Php

I'm using this code, pretty good. You will very easy to know user-agents visitted your site. This code is opening a file and write the user_agent down the file. You can check each day this file by go to yourdomain.com/useragent.txt and know about new user_agents and put them in your condition of if clause.

$user_agent = strtolower($_SERVER['HTTP_USER_AGENT']);
if(!preg_match("/Googlebot|MJ12bot|yandexbot/i", $user_agent)){
    // if not meet the conditions then
    // do what you need

	// here open a file and write the user_agent down the file. You can check each day this file useragent.txt and know about new user_agents and put them in your condition of if clause
	if($user_agent!=""){
		$myfile = fopen("useragent.txt", "a") or die("Unable to open file useragent.txt!");
		fwrite($myfile, $user_agent);
		$user_agent = "\n";
		fwrite($myfile, $user_agent);
		fclose($myfile);
	}
}

This is the content of useragent.txt

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (compatible; MJ12bot/v1.4.6; http://mj12bot.com/)Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (iphone; cpu iphone os 9_3 like mac os x) applewebkit/601.1.46 (khtml, like gecko) version/9.0 mobile/13e198 safari/601.1
mozilla/5.0 (windows nt 6.1; wow64) applewebkit/537.36 (khtml, like gecko) chrome/53.0.2785.143 safari/537.36
mozilla/5.0 (compatible; linkdexbot/2.2; +http://www.linkdex.com/bots/)
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64) applewebkit/537.36 (khtml, like gecko) chrome/53.0.2785.143 safari/537.36
mozilla/5.0 (windows nt 6.1; wow64) applewebkit/537.36 (khtml, like gecko) chrome/53.0.2785.143 safari/537.36
mozilla/5.0 (compatible; baiduspider/2.0; +http://www.baidu.com/search/spider.html)
zoombot (linkbot 1.0 http://suite.seozoom.it/bot.html)
mozilla/5.0 (windows nt 10.0; wow64) applewebkit/537.36 (khtml, like gecko) chrome/44.0.2403.155 safari/537.36 opr/31.0.1889.174
mozilla/5.0 (windows nt 10.0; wow64) applewebkit/537.36 (khtml, like gecko) chrome/44.0.2403.155 safari/537.36 opr/31.0.1889.174
sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)
mozilla/5.0 (windows nt 10.0; wow64) applewebkit/537.36 (khtml, like gecko) chrome/44.0.2403.155 safari/537.36 opr/31.0.1889.174

Solution 15 - Php

For Google i'm using this method.

function is_google() {
	$ip   = $_SERVER['REMOTE_ADDR'];
	$host = gethostbyaddr( $ip );
	if ( strpos( $host, '.google.com' ) !== false || strpos( $host, '.googlebot.com' ) !== false ) {

		$forward_lookup = gethostbyname( $host );

		if ( $forward_lookup == $ip ) {
			return true;
		}

		return false;
	} else {
		return false;
	}

}

var_dump( is_google() );

Credits: https://support.google.com/webmasters/answer/80553

Solution 16 - Php

might be late, but what about a hidden a link. All bots will use the rel attribute follow, only bad bots will use the nofollow rel attribute.

<a style="display:none;" rel="follow" href="javascript:void(0);" onclick="isabot();">.</a>

function isabot(){
//define a variable to pass with ajax to php
// || send bots info direct to where ever.
isabot = true;
}

for a bad bot you can use this:

<a style="display:none;" href="javascript:void(0);" rel="nofollow" onclick="isBadbot();">.</a>

for PHP specific you can remove the onclick attribute and replace the href attribute with a link to your ip detector/ bot detector like so:

<a style="display:none;" rel="follow" href="https://somedomain.com/botdetector.php">.</a>

OR

<a style="display:none;" rel="nofollow" href="https://somedomain.com/badbotdetector.php">.</a>

you can work with it and maybe use both, one detects a bot, while the other proves it to be a bad bot.

hope you find this useful

Solution 17 - Php

Verifying Googlebot

As useragent can be changed...

> the only official supported way to identify a google bot is to run a > reverse DNS lookup on the accessing IP address and run a forward DNS > lookup on the result to verify that it points to accessing IP address > and the resulting domain name is in either googlebot.com or google.com > domain.

Taken from here.

> so you must run a DNS lookup

Both, reverse and forward.

See this guide on Google Search Central.

Solution 18 - Php

function bot_detected() {

  if(preg_match('/bot|crawl|slurp|spider|mediapartners/i', $_SERVER['HTTP_USER_AGENT']){
	return true;
  }
  else{
    return false;
  }
}

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionterrificView Question on Stackoverflow
Solution 1 - PhpminnurView Answer on Stackoverflow
Solution 2 - PhpÓlafur WaageView Answer on Stackoverflow
Solution 3 - PhpJukka DahlbomView Answer on Stackoverflow
Solution 4 - PhpmacherifView Answer on Stackoverflow
Solution 5 - PhpmguttView Answer on Stackoverflow
Solution 6 - PhpFabian KesslerView Answer on Stackoverflow
Solution 7 - PhpWonderLandView Answer on Stackoverflow
Solution 8 - Phpطراحی سایت تهرانView Answer on Stackoverflow
Solution 9 - PhpGumboView Answer on Stackoverflow
Solution 10 - PhpmattabView Answer on Stackoverflow
Solution 11 - PhpL. CosioView Answer on Stackoverflow
Solution 12 - PhpIvijan Stefan StipićView Answer on Stackoverflow
Solution 13 - PhpIrshad KhanView Answer on Stackoverflow
Solution 14 - PhpLighthouse NguyenView Answer on Stackoverflow
Solution 15 - PhpMike AronView Answer on Stackoverflow
Solution 16 - PhpAntony FalencikowskiView Answer on Stackoverflow
Solution 17 - PhpyanntinocoView Answer on Stackoverflow
Solution 18 - PhpElyorView Answer on Stackoverflow