Detect Search Crawlers via JavaScript

JavascriptWeb CrawlerBots

Javascript Problem Overview


I am wondering how would I go abouts in detecting search crawlers? The reason I ask is because I want to suppress certain JavaScript calls if the user agent is a bot.

I have found an example of how to to detect a certain browser, but am unable to find examples of how to detect a search crawler:

/MSIE (\d+\.\d+);/.test(navigator.userAgent); //test for MSIE x.x

Example of search crawlers I want to block:

Google 
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) 
Googlebot/2.1 (+http://www.googlebot.com/bot.html) 
Googlebot/2.1 (+http://www.google.com/bot.html) 

Baidu 
Baiduspider+(+http://www.baidu.com/search/spider_jp.html) 
Baiduspider+(+http://www.baidu.com/search/spider.htm) 
BaiDuSpider 

Javascript Solutions


Solution 1 - Javascript

This is the regex the ruby UA agent_orange library uses to test if a userAgent looks to be a bot. You can narrow it down for specific bots by referencing the bot userAgent list here:

/bot|crawler|spider|crawling/i

For example you have some object, util.browser, you can store what type of device a user is on:

util.browser = {
   bot: /bot|googlebot|crawler|spider|robot|crawling/i.test(navigator.userAgent),
   mobile: ...,
   desktop: ...
}

Solution 2 - Javascript

Try this. It's based on the crawlers list on available on https://github.com/monperrus/crawler-user-agents

var botPattern = "(googlebot\/|bot|Googlebot-Mobile|Googlebot-Image|Google favicon|Mediapartners-Google|bingbot|slurp|java|wget|curl|Commons-HttpClient|Python-urllib|libwww|httpunit|nutch|phpcrawl|msnbot|jyxobot|FAST-WebCrawler|FAST Enterprise Crawler|biglotron|teoma|convera|seekbot|gigablast|exabot|ngbot|ia_archiver|GingerCrawler|webmon |httrack|webcrawler|grub.org|UsineNouvelleCrawler|antibot|netresearchserver|speedy|fluffy|bibnum.bnf|findlink|msrbot|panscient|yacybot|AISearchBot|IOI|ips-agent|tagoobot|MJ12bot|dotbot|woriobot|yanga|buzzbot|mlbot|yandexbot|purebot|Linguee Bot|Voyager|CyberPatrol|voilabot|baiduspider|citeseerxbot|spbot|twengabot|postrank|turnitinbot|scribdbot|page2rss|sitebot|linkdex|Adidxbot|blekkobot|ezooms|dotbot|Mail.RU_Bot|discobot|heritrix|findthatfile|europarchive.org|NerdByNature.Bot|sistrix crawler|ahrefsbot|Aboundex|domaincrawler|wbsearchbot|summify|ccbot|edisterbot|seznambot|ec2linkfinder|gslfbot|aihitbot|intelium_bot|facebookexternalhit|yeti|RetrevoPageAnalyzer|lb-spider|sogou|lssbot|careerbot|wotbox|wocbot|ichiro|DuckDuckBot|lssrocketcrawler|drupact|webcompanycrawler|acoonbot|openindexspider|gnam gnam spider|web-archive-net.com.bot|backlinkcrawler|coccoc|integromedb|content crawler spider|toplistbot|seokicks-robot|it2media-domain-crawler|ip-web-crawler.com|siteexplorer.info|elisabot|proximic|changedetection|blexbot|arabot|WeSEE:Search|niki-bot|CrystalSemanticsBot|rogerbot|360Spider|psbot|InterfaxScanBot|Lipperhey SEO Service|CC Metadata Scaper|g00g1e.net|GrapeshotCrawler|urlappendbot|brainobot|fr-crawler|binlar|SimpleCrawler|Livelapbot|Twitterbot|cXensebot|smtbot|bnf.fr_bot|A6-Indexer|ADmantX|Facebot|Twitterbot|OrangeBot|memorybot|AdvBot|MegaIndex|SemanticScholarBot|ltx71|nerdybot|xovibot|BUbiNG|Qwantify|archive.org_bot|Applebot|TweetmemeBot|crawler4j|findxbot|SemrushBot|yoozBot|lipperhey|y!j-asr|Domain Re-Animator Bot|AddThis)";
var re = new RegExp(botPattern, 'i');
var userAgent = navigator.userAgent; 
if (re.test(userAgent)) {
    console.log('the user agent is a crawler!');
}

Solution 3 - Javascript

The following regex will match the biggest search engines according to this post.

/bot|google|baidu|bing|msn|teoma|slurp|yandex/i
    .test(navigator.userAgent)

The matches search engines are:

  • Baidu
  • Bingbot/MSN
  • DuckDuckGo (duckduckbot)
  • Google
  • Teoma
  • Yahoo!
  • Yandex

Additionally, I've added bot as a catchall for smaller crawlers/bots.

Solution 4 - Javascript

This might help to detect the robots user agents while also keeping things more organized:

Javascript

const detectRobot = (userAgent) => {
  const robots = new RegExp([
    /bot/,/spider/,/crawl/,                            // GENERAL TERMS
    /APIs-Google/,/AdsBot/,/Googlebot/,                // GOOGLE ROBOTS
    /mediapartners/,/Google Favicon/,
    /FeedFetcher/,/Google-Read-Aloud/,
    /DuplexWeb-Google/,/googleweblight/,
    /bing/,/yandex/,/baidu/,/duckduck/,/yahoo/,        // OTHER ENGINES
    /ecosia/,/ia_archiver/,
    /facebook/,/instagram/,/pinterest/,/reddit/,       // SOCIAL MEDIA
    /slack/,/twitter/,/whatsapp/,/youtube/,
    /semrush/,                                         // OTHER
  ].map((r) => r.source).join("|"),"i");               // BUILD REGEXP + "i" FLAG

  return robots.test(userAgent);
};

Typescript

const detectRobot = (userAgent: string): boolean => {
  const robots = new RegExp(([
    /bot/,/spider/,/crawl/,                               // GENERAL TERMS
    /APIs-Google/,/AdsBot/,/Googlebot/,                   // GOOGLE ROBOTS
    /mediapartners/,/Google Favicon/,
    /FeedFetcher/,/Google-Read-Aloud/,
    /DuplexWeb-Google/,/googleweblight/,
    /bing/,/yandex/,/baidu/,/duckduck/,/yahoo/,           // OTHER ENGINES
    /ecosia/,/ia_archiver/,
    /facebook/,/instagram/,/pinterest/,/reddit/,          // SOCIAL MEDIA
    /slack/,/twitter/,/whatsapp/,/youtube/,
    /semrush/,                                            // OTHER
  ] as RegExp[]).map((r) => r.source).join("|"),"i");     // BUILD REGEXP + "i" FLAG

  return robots.test(userAgent);
};

Use on server:

const userAgent = req.get('user-agent');
const isRobot = detectRobot(userAgent);

Use on "client" / some phantom browser a bot might be using:

const userAgent = navigator.userAgent;
const isRobot = detectRobot(userAgent);

Overview of Google crawlers:

https://developers.google.com/search/docs/advanced/crawling/overview-google-crawlers

Solution 5 - Javascript

isTrusted property could help you.

> The isTrusted read-only property of the Event interface is a Boolean > that is true when the event was generated by a user action, and false > when the event was created or modified by a script or dispatched via > EventTarget.dispatchEvent().

eg:

isCrawler() {
  return event.isTrusted;
}

⚠ Note that IE isn't compatible.

Read more from doc: https://developer.mozilla.org/en-US/docs/Web/API/Event/isTrusted

Solution 6 - Javascript

People might light to check out the new navigator.webdriver property, which allows bots to inform you that they are bots:

https://developer.mozilla.org/en-US/docs/Web/API/Navigator/webdriver

> The webdriver read-only property of the navigator interface indicates whether the user agent is controlled by automation.

> It defines a standard way for co-operating user agents to inform the document that it is controlled by WebDriver, for example, so that alternate code paths can be triggered during automation.

It is supported by all major browsers and respected by major browser automation software like Puppeteer. Users of automation software can of course disable it, and so it should only be used to detect "good" bots.

Solution 7 - Javascript

The "test for MSIE x.x" example is just code for testing the userAgent against a Regular Expression. In your example the Regexp is the

/MSIE (\d+\.\d+);/

part. Just replace it with your own Regexp you want to test the user agent against. It would be something like

/Google|Baidu|Baiduspider/.test(navigator.userAgent)

where the vertical bar is the "or" operator to match the user agent against all of your mentioned robots. For more information about Regular Expression you can refer to this site since javascript uses perl-style RegExp.

Solution 8 - Javascript

I combined some of the above and removed some redundancy. I use this in .htaccess on a semi-private site:

(google|bot|crawl|spider|slurp|baidu|bing|msn|teoma|yandex|java|wget|curl|Commons-HttpClient|Python-urllib|libwww|httpunit|nutch|biglotron|convera|gigablast|archive|webmon|httrack|grub|netresearchserver|speedy|fluffy|bibnum|findlink|panscient|IOI|ips-agent|yanga|Voyager|CyberPatrol|postrank|page2rss|linkdex|ezooms|heritrix|findthatfile|Aboundex|summify|ec2linkfinder|facebook|slack|instagram|pinterest|reddit|twitter|whatsapp|yeti|RetrevoPageAnalyzer|sogou|wotbox|ichiro|drupact|coccoc|integromedb|siteexplorer|proximic|changedetection|WeSEE|scrape|scaper|g00g1e|binlar|indexer|MegaIndex|ltx71|BUbiNG|Qwantify|lipperhey|y!j-asr|AddThis)

Solution 9 - Javascript

I found this isbot package that has the built-in isbot() function. It seams to me that the package is properly maintained and that they keep everything up-to-date.

USAGE:

const isBot = require('isbot');

...

isBot(req.get('user-agent'));

Package: https://www.npmjs.com/package/isbot

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionJonView Question on Stackoverflow
Solution 1 - JavascriptmegawacView Answer on Stackoverflow
Solution 2 - JavascriptSergey P. aka azureView Answer on Stackoverflow
Solution 3 - JavascriptEdoView Answer on Stackoverflow
Solution 4 - JavascriptcbdeveloperView Answer on Stackoverflow
Solution 5 - JavascriptEmericView Answer on Stackoverflow
Solution 6 - JavascriptjoeView Answer on Stackoverflow
Solution 7 - Javascriptmorten.cView Answer on Stackoverflow
Solution 8 - JavascriptWes ReimerView Answer on Stackoverflow
Solution 9 - JavascriptNeNaDView Answer on Stackoverflow