Detect URLs in text with JavaScript

Javascript Problem Overview


Does anyone have suggestions for detecting URLs in a set of strings?

arrayOfStrings.forEach(function(string){
  // detect URLs in strings and do something swell,
  // like creating elements with links.
});

Update (several years later): I wound up using this regex for link detection.

var kLINK_DETECTION_REGEX = /(([a-z]+:\/\/)?(([a-z0-9\-]+\.)+([a-z]{2}|aero|arpa|biz|com|coop|edu|gov|info|int|jobs|mil|museum|name|nato|net|org|pro|travel|local|internal))(:[0-9]{1,5})?(\/[a-z0-9_\-\.~]+)*(\/([a-z0-9_\-\.]*)(\?[a-z0-9+_\-\.%=&]*)?)?(#[a-zA-Z0-9!$&'()*+.=_~:@/?-]*)?)(\s+|$)/gi;

The full helper (with optional Handlebars support) is at gist #1654670.

Javascript Solutions


Solution 1 - Javascript

First you need a good regex that matches URLs. This is hard to do, as the following quote explains:

> ...almost anything is a valid URL. There are some punctuation rules for splitting it up. Absent any punctuation, you still have a valid URL.
>
> Check the RFC carefully and see if you can construct an "invalid" URL. The rules are very flexible.
>
> For example ::::: is a valid URL. The path is ":::::". A pretty stupid filename, but a valid filename.
>
> Also, ///// is a valid URL. The netloc ("hostname") is "". The path is "///". Again, stupid. Also valid. This URL normalizes to "///" which is the equivalent.
>
> Something like "bad://///worse/////" is perfectly valid. Dumb but valid.

Anyway, this answer is not meant to give you the best regex, but rather to demonstrate how to do the string wrapping inside the text with JavaScript.

OK, so let's just use this one: /(https?:\/\/[^\s]+)/g

Again, this is a bad regex. It will have many false positives. However it's good enough for this example.

function urlify(text) {
  var urlRegex = /(https?:\/\/[^\s]+)/g;
  return text.replace(urlRegex, function(url) {
    return '<a href="' + url + '">' + url + '</a>';
  });
  // or alternatively
  // return text.replace(urlRegex, '<a href="$1">$1</a>')
}

var text = 'Find me at http://www.example.com and also at http://stackoverflow.com';
var html = urlify(text);

console.log(html)

// html now looks like:
// "Find me at <a href="http://www.example.com">http://www.example.com</a> and also at <a href="http://stackoverflow.com">http://stackoverflow.com</a>"
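
To make the weakness of this regex concrete, here is a small sketch using a made-up sample sentence. Because the pattern takes everything up to the next whitespace, punctuation that directly follows a URL ends up inside the match (and therefore inside the href):

```javascript
// Same pattern as in urlify above.
var urlRegex = /(https?:\/\/[^\s]+)/g;

// Made-up sample: the URLs are followed by a comma, a parenthesis and a period.
var sample = 'Docs are at http://www.example.com, (mirror: http://example.org/a) and http://example.net.';

var matches = sample.match(urlRegex);
console.log(matches);
// [ 'http://www.example.com,', 'http://example.org/a)', 'http://example.net.' ]
```

Whether that matters depends on your input; for free-form prose you will probably want to trim trailing punctuation before building the link.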

So in sum try:

$$('#pad dl dd').each(function(element) {
    element.innerHTML = urlify(element.innerHTML);
});

Solution 2 - Javascript

Here is what I ended up using as my regex:

var urlRegex =/(\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/ig;

This doesn't include trailing punctuation in the URL. Crescent's function works like a charm :) so:

function linkify(text) {
    var urlRegex =/(\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/ig;
    return text.replace(urlRegex, function(url) {
        return '<a href="' + url + '">' + url + '</a>';
    });
}
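
As a quick check of the "no trailing punctuation" claim, here is a sketch with a made-up sample string. The last character class in the regex leaves out `!`, `:`, `,`, `.`, `;` and `?`, so a match can never end in one of them:

```javascript
// Regex from this answer; only the sample text below is invented.
var urlRegex = /(\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/ig;

// One URL is followed by a comma, the other by a period.
var sample = 'Visit http://example.com/path, then ftp://files.example.org.';

var found = sample.match(urlRegex);
console.log(found);
// [ 'http://example.com/path', 'ftp://files.example.org' ]
```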

Solution 3 - Javascript

I googled this problem for quite a while, then it occurred to me that there is an Android method, android.text.util.Linkify, that utilizes some pretty robust regexes to accomplish this. Luckily, Android is open source.

They use a few different patterns for matching different types of urls. You can find them all here: http://grepcode.com/file/repository.grepcode.com/java/ext/com.google.android/android/2.0_r1/android/text/util/Regex.java#Regex.0WEB_URL_PATTERN

If you're only concerned with URLs that match the WEB_URL_PATTERN, that is, URLs that conform to the RFC 1738 spec, you can use this:

/((?:(http|https|Http|Https|rtsp|Rtsp):\/\/(?:(?:[a-zA-Z0-9\$\-\_\.\+\!\*\'\(\)\,\;\?\&\=]|(?:\%[a-fA-F0-9]{2})){1,64}(?:\:(?:[a-zA-Z0-9\$\-\_\.\+\!\*\'\(\)\,\;\?\&\=]|(?:\%[a-fA-F0-9]{2})){1,25})?\@)?)?((?:(?:[a-zA-Z0-9][a-zA-Z0-9\-]{0,64}\.)+(?:(?:aero|arpa|asia|a[cdefgilmnoqrstuwxz])|(?:biz|b[abdefghijmnorstvwyz])|(?:cat|com|coop|c[acdfghiklmnoruvxyz])|d[ejkmoz]|(?:edu|e[cegrstu])|f[ijkmor]|(?:gov|g[abdefghilmnpqrstuwy])|h[kmnrtu]|(?:info|int|i[delmnoqrst])|(?:jobs|j[emop])|k[eghimnrwyz]|l[abcikrstuvy]|(?:mil|mobi|museum|m[acdghklmnopqrstuvwxyz])|(?:name|net|n[acefgilopruz])|(?:org|om)|(?:pro|p[aefghklmnrstwy])|qa|r[eouw]|s[abcdeghijklmnortuvyz]|(?:tel|travel|t[cdfghjklmnoprtvwz])|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw]))|(?:(?:25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9])\.(?:25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\.(?:25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\.(?:25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[0-9])))(?:\:\d{1,5})?)(\/(?:(?:[a-zA-Z0-9\;\/\?\:\@\&\=\#\~\-\.\+\!\*\'\(\)\,\_])|(?:\%[a-fA-F0-9]{2}))*)?(?:\b|$)/gi;

Here is the full text of the source:

"((?:(http|https|Http|Https|rtsp|Rtsp):\\/\\/(?:(?:[a-zA-Z0-9\\$\\-\\_\\.\\+\\!\\*\\'\\(\\)"
+ "\\,\\;\\?\\&\\=]|(?:\\%[a-fA-F0-9]{2})){1,64}(?:\\:(?:[a-zA-Z0-9\\$\\-\\_"
+ "\\.\\+\\!\\*\\'\\(\\)\\,\\;\\?\\&\\=]|(?:\\%[a-fA-F0-9]{2})){1,25})?\\@)?)?"
+ "((?:(?:[a-zA-Z0-9][a-zA-Z0-9\\-]{0,64}\\.)+"   // named host
+ "(?:"   // plus top level domain
+ "(?:aero|arpa|asia|a[cdefgilmnoqrstuwxz])"
+ "|(?:biz|b[abdefghijmnorstvwyz])"
+ "|(?:cat|com|coop|c[acdfghiklmnoruvxyz])"
+ "|d[ejkmoz]"
+ "|(?:edu|e[cegrstu])"
+ "|f[ijkmor]"
+ "|(?:gov|g[abdefghilmnpqrstuwy])"
+ "|h[kmnrtu]"
+ "|(?:info|int|i[delmnoqrst])"
+ "|(?:jobs|j[emop])"
+ "|k[eghimnrwyz]"
+ "|l[abcikrstuvy]"
+ "|(?:mil|mobi|museum|m[acdghklmnopqrstuvwxyz])"
+ "|(?:name|net|n[acefgilopruz])"
+ "|(?:org|om)"
+ "|(?:pro|p[aefghklmnrstwy])"
+ "|qa"
+ "|r[eouw]"
+ "|s[abcdeghijklmnortuvyz]"
+ "|(?:tel|travel|t[cdfghjklmnoprtvwz])"
+ "|u[agkmsyz]"
+ "|v[aceginu]"
+ "|w[fs]"
+ "|y[etu]"
+ "|z[amw]))"
+ "|(?:(?:25[0-5]|2[0-4]" // or ip address
+ "[0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9])\\.(?:25[0-5]|2[0-4][0-9]"
+ "|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\\.(?:25[0-5]|2[0-4][0-9]|[0-1]"
+ "[0-9]{2}|[1-9][0-9]|[1-9]|0)\\.(?:25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}"
+ "|[1-9][0-9]|[0-9])))"
+ "(?:\\:\\d{1,5})?)" // plus option port number
+ "(\\/(?:(?:[a-zA-Z0-9\\;\\/\\?\\:\\@\\&\\=\\#\\~"  // plus option query params
+ "\\-\\.\\+\\!\\*\\'\\(\\)\\,\\_])|(?:\\%[a-fA-F0-9]{2}))*)?"
+ "(?:\\b|$)";

If you want to be really fancy, you can test for email addresses as well. The regex for email addresses is:

/[a-zA-Z0-9+._%\-]{1,256}@[a-zA-Z0-9][a-zA-Z0-9\-]{0,64}(\.[a-zA-Z0-9][a-zA-Z0-9\-]{0,25})+/gi

PS: The top level domains supported by above regex are current as of June 2007. For an up to date list you'll need to check https://data.iana.org/TLD/tlds-alpha-by-domain.txt.
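
Since the hard-coded TLD alternation goes stale, one option is to build that part of the pattern from a list. The following is only a sketch: the `tlds` array is a hand-picked sample (a real list would come from the IANA file), and the host pattern is a simplified stand-in, not the Android one.

```javascript
// Hypothetical, hand-picked sample; replace with entries from the IANA list.
var tlds = ['com', 'org', 'net', 'io', 'museum'];

// Sort longest first so longer names are tried before shorter ones.
var tldPattern = tlds
  .slice()
  .sort(function (a, b) { return b.length - a.length; })
  .join('|');

// Simplified host matcher: one or more dot-terminated labels, then a listed TLD.
var hostRegex = new RegExp(
  '\\b(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\\.)+(?:' + tldPattern + ')\\b',
  'gi'
);

var hosts = 'Try example.com, docs.example.io and notes.txt'.match(hostRegex);
console.log(hosts);
// [ 'example.com', 'docs.example.io' ]
```

Here notes.txt is skipped because txt is not in the sample list, which is the kind of false positive a TLD whitelist is meant to avoid.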

Solution 4 - Javascript

Based on Crescent Fresh's answer.

If you want to detect links that start with http:// (or https://) as well as links that start with a bare www., you can use the following:

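function urlify(text) {
    // Group 1 is the whole URL; group 2 is its prefix ("http://", "https://" or "www.").
    var urlRegex = /(((https?:\/\/)|(www\.))[^\s]+)/g;
    //var urlRegex = /(https?:\/\/[^\s]+)/g;
    return text.replace(urlRegex, function(url, b, c) {
        // c is the prefix; add a scheme when the link starts with a bare "www."
        var url2 = (c == 'www.') ? 'http://' + url : url;
        return '<a href="' + url2 + '" target="_blank">' + url + '</a>';
    });
}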
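
The positional callback arguments are easy to misread, so here is a sketch of the same idea that inspects the matched text directly instead. The function name `urlifyWithWww` and the sample sentence are made up for illustration:

```javascript
function urlifyWithWww(text) {
  // Non-capturing groups: the callback's first argument is the full match.
  var urlRegex = /(?:https?:\/\/|www\.)[^\s]+/g;
  return text.replace(urlRegex, function (url) {
    // A bare "www." link needs a scheme, or the browser treats the href as relative.
    var href = url.indexOf('www.') === 0 ? 'http://' + url : url;
    return '<a href="' + href + '" target="_blank">' + url + '</a>';
  });
}

var linked = urlifyWithWww('Go to www.example.com or https://example.org/docs');
console.log(linked);
// Go to <a href="http://www.example.com" target="_blank">www.example.com</a> or <a href="https://example.org/docs" target="_blank">https://example.org/docs</a>
```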

Solution 5 - Javascript

This library on NPM looks like it is pretty comprehensive https://www.npmjs.com/package/linkifyjs

> Linkify is a small yet comprehensive JavaScript plugin for finding URLs in plain-text and converting them to HTML links. It works with all valid URLs and email addresses.

Solution 6 - Javascript

The function can be further improved to render images as well:

function renderHTML(text) {
    var rawText = strip(text);
    var urlRegex = /(\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/ig;

    return rawText.replace(urlRegex, function(url) {
        // URLs containing a common image extension are rendered as images.
        if ((url.indexOf(".jpg") > 0) || (url.indexOf(".png") > 0) || (url.indexOf(".gif") > 0)) {
            return '<img src="' + url + '">' + '<br/>';
        } else {
            return '<a href="' + url + '">' + url + '</a>' + '<br/>';
        }
    });
}

Or, for a thumbnail image that links to the full-size image:

return '<a href="' + url + '"><img style="width: 100px; border: 0px; -moz-border-radius: 5px; border-radius: 5px;" src="' + url + '">' + '</a>' + '<br/>'

And here is the strip() function that pre-processes the text string for uniformity by removing any existing html.

function strip(html) {
    var tmp = document.createElement("DIV");
    tmp.innerHTML = html;
    var urlRegex = /(\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/ig;
    // Put each URL on its own line in the extracted plain text.
    return tmp.innerText.replace(urlRegex, function(url) {
        return '\n' + url;
    });
}
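
Note that the image check in renderHTML matches ".jpg", ".png" or ".gif" anywhere in the URL. If that turns out to be too loose, one possible refinement (a sketch, not part of the original answer) is to look only at the end of the URL's path. The helper name `looksLikeImage` is made up, and accepting ".jpeg" is an extra assumption:

```javascript
// Hypothetical helper: decide by the end of the path only,
// ignoring the query string and fragment.
function looksLikeImage(url) {
  var pathname;
  try {
    pathname = new URL(url).pathname;
  } catch (e) {
    return false; // not an absolute URL the parser understands
  }
  // Case-insensitive; ".jpeg" is accepted in addition to the original three.
  return /\.(?:jpe?g|png|gif)$/i.test(pathname);
}

console.log(looksLikeImage('http://example.com/photo.JPG?size=large')); // true
console.log(looksLikeImage('http://example.com/photo.jpg.html'));       // false
console.log(looksLikeImage('http://example.com/view?file=a.png'));      // false
```

The regex in renderHTML only matches absolute URLs, so the URL constructor should accept whatever it is handed here.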

Solution 7 - Javascript

There is an existing npm package, url-regex. Install it with yarn add url-regex or npm install url-regex and use it as follows:

const urlRegex = require('url-regex');

const replaced = 'Find me at http://www.example.com and also at http://stackoverflow.com or at google.com'
  .replace(urlRegex({strict: false}), function(url) {
     return '<a href="' + url + '">' + url + '</a>';
  });

Solution 8 - Javascript

let str = 'https://example.com is a great site'
str.replace(/(https?:\/\/[^\s]+)/g,"<a href='$1' target='_blank' >$1</a>")

Short code, big work!

Result (the return value of replace; str itself is not modified):

<a href='https://example.com' target='_blank' >https://example.com</a> is a great site

Solution 9 - Javascript

If you want to detect links with http:// OR without http:// OR ftp OR other possible cases like removing trailing punctuation at the end, take a look at this code.

https://jsfiddle.net/AndrewKang/xtfjn8g3/

A simple way to use it is to install it from npm:

npm install --save url-knife

Solution 10 - Javascript

Try this:

function isUrl(s) {
	if (!isUrl.rx_url) {
		// taken from https://gist.github.com/dperini/729294
		isUrl.rx_url=/^(?:(?:https?|ftp):\/\/)?(?:\S+(?::\S*)?@)?(?:(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,}))\.?)(?::\d{2,5})?(?:[/?#]\S*)?$/i;
		// valid prefixes
		isUrl.prefixes=['http:\/\/', 'https:\/\/', 'ftp:\/\/', 'www.'];
		// taken from https://w3techs.com/technologies/overview/top_level_domain/all
		isUrl.domains=['com','ru','net','org','de','jp','uk','br','pl','in','it','fr','au','info','nl','ir','cn','es','cz','kr','ua','ca','eu','biz','za','gr','co','ro','se','tw','mx','vn','tr','ch','hu','at','be','dk','tv','me','ar','no','us','sk','xyz','fi','id','cl','by','nz','il','ie','pt','kz','io','my','lt','hk','cc','sg','edu','pk','su','bg','th','top','lv','hr','pe','club','rs','ae','az','si','ph','pro','ng','tk','ee','asia','mobi'];
	}

	if (!isUrl.rx_url.test(s)) return false;
	for (let i=0; i<isUrl.prefixes.length; i++) if (s.startsWith(isUrl.prefixes[i])) return true;
	for (let i=0; i<isUrl.domains.length; i++) if (s.endsWith('.'+isUrl.domains[i]) || s.includes('.'+isUrl.domains[i]+'\/') ||s.includes('.'+isUrl.domains[i]+'?')) return true;
	return false;
}

function isEmail(s) {
	if (!isEmail.rx_email) {
		// taken from http://stackoverflow.com/a/16016476/460084
		var sQtext = '[^\\x0d\\x22\\x5c\\x80-\\xff]';
		var sDtext = '[^\\x0d\\x5b-\\x5d\\x80-\\xff]';
		var sAtom = '[^\\x00-\\x20\\x22\\x28\\x29\\x2c\\x2e\\x3a-\\x3c\\x3e\\x40\\x5b-\\x5d\\x7f-\\xff]+';
		var sQuotedPair = '\\x5c[\\x00-\\x7f]';
		var sDomainLiteral = '\\x5b(' + sDtext + '|' + sQuotedPair + ')*\\x5d';
		var sQuotedString = '\\x22(' + sQtext + '|' + sQuotedPair + ')*\\x22';
		var sDomain_ref = sAtom;
		var sSubDomain = '(' + sDomain_ref + '|' + sDomainLiteral + ')';
		var sWord = '(' + sAtom + '|' + sQuotedString + ')';
		var sDomain = sSubDomain + '(\\x2e' + sSubDomain + ')*';
		var sLocalPart = sWord + '(\\x2e' + sWord + ')*';
		var sAddrSpec = sLocalPart + '\\x40' + sDomain; // complete RFC822 email address spec
		var sValidEmail = '^' + sAddrSpec + '$'; // as whole string

		isEmail.rx_email = new RegExp(sValidEmail);
	}

	return isEmail.rx_email.test(s);
}

This will also recognize URLs such as google.com, http://www.google.bla, http://google.bla and www.google.bla, but not google.bla.

Solution 11 - Javascript

Generic Object Oriented Solution

For people like me that use frameworks like angular that don't allow manipulating DOM directly, I created a function that takes a string and returns an array of url/plainText objects that can be used to create any UI representation that you want.

URL regex

For URL matching I used a slightly adapted version of h0mayun's regex: /(?:(?:https?:\/\/)|(?:www\.))[^\s]+/g

My function also drops punctuation characters such as . and , from the end of a URL, since I believe they are more often actual punctuation than a legitimate URL ending (but they could be! This is not rigorous science, as other answers explain well). For that I apply the following regex to matched URLs: /^(.+?)([.,?!'"]*)$/
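
To show what the clean-up regex does on its own, here is a small JavaScript sketch. The helper name `splitTrailingPunctuation` and the sample URLs are made up for illustration:

```javascript
// Regex quoted above: a lazy URL part followed by optional trailing punctuation.
const cleanUrlRegex = /^(.+?)([.,?!'"]*)$/;

// Hypothetical helper, used only for this illustration.
function splitTrailingPunctuation(dirtyUrl) {
  const match = cleanUrlRegex.exec(dirtyUrl);
  if (!match) return { url: dirtyUrl, trailing: '' }; // e.g. empty input
  return { url: match[1], trailing: match[2] };
}

console.log(splitTrailingPunctuation('http://example.com/page?!'));
// { url: 'http://example.com/page', trailing: '?!' }
console.log(splitTrailingPunctuation('http://example.com/?q=1'));
// { url: 'http://example.com/?q=1', trailing: '' }
```

Only a run of punctuation at the very end is split off, so a ? that starts a query string stays part of the URL.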

Typescript code

	export function urlMatcherInText(inputString: string): UrlMatcherResult[] {
		if (! inputString) return [];

		const results: UrlMatcherResult[] = [];

		function addText(text: string) {
			if (! text) return;

			const result = new UrlMatcherResult();
			result.type = 'text';
			result.value = text;
			results.push(result);
		}

		function addUrl(url: string) {
			if (! url) return;

			const result = new UrlMatcherResult();
			result.type = 'url';
			result.value = url;
			results.push(result);
		}

		const findUrlRegex = /(?:(?:https?:\/\/)|(?:www\.))[^\s]+/g;
		const cleanUrlRegex = /^(.+?)([.,?!'"]*)$/;

		let match: RegExpExecArray | null; // exec() returns null once no match is left
		let indexOfStartOfString = 0;

		do {
			match = findUrlRegex.exec(inputString);

			if (match) {
				const text = inputString.substr(indexOfStartOfString, match.index - indexOfStartOfString);
				addText(text);
				
				var dirtyUrl = match[0];
				var urlDirtyMatch = cleanUrlRegex.exec(dirtyUrl);
				addUrl(urlDirtyMatch[1]);
				addText(urlDirtyMatch[2]);

				indexOfStartOfString = match.index + dirtyUrl.length;
			}
		}
		while (match);

		const remainingText = inputString.substr(indexOfStartOfString, inputString.length - indexOfStartOfString);
		addText(remainingText);
		
		return results;
	}

	export class UrlMatcherResult {
		public type: 'url' | 'text'
		public value: string
	}

Solution 12 - Javascript

In the strip() function from Solution 6, tmp.innerText is undefined. You should use tmp.innerHTML:

function strip(html) {
    var tmp = document.createElement("DIV");
    tmp.innerHTML = html;
    var urlRegex = /(\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/ig;
    return tmp.innerHTML.replace(urlRegex, function(url) {
        return '\n' + url;
    });
}

Solution 13 - Javascript

You can use a regex like this to extract normal url patterns.

(https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s]{2,}|www\.[a-zA-Z0-9]+\.[^\s]{2,})
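
The pattern above is shown without delimiters or flags. A minimal sketch of using it from JavaScript, assuming you want every occurrence (hence the g flag) and using a made-up sample sentence:

```javascript
// The pattern from this answer, wrapped in a regex literal with the global flag.
var urlPattern = /(https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s]{2,}|www\.[a-zA-Z0-9]+\.[^\s]{2,})/g;

var sampleText = 'Visit https://www.example.com/path and www.test.org today';

// match() with a global regex returns all matches, or null if there are none.
var urls = sampleText.match(urlPattern) || [];
console.log(urls);
// [ 'https://www.example.com/path', 'www.test.org' ]
```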

If you need more sophisticated patterns, use a library like this.

https://www.npmjs.com/package/pattern-dreamer

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | arbales | View Question on Stackoverflow
Solution 1 - Javascript | Crescent Fresh | View Answer on Stackoverflow
Solution 2 - Javascript | Niaz Mohammed | View Answer on Stackoverflow
Solution 3 - Javascript | Adam | View Answer on Stackoverflow
Solution 4 - Javascript | h0mayun | View Answer on Stackoverflow
Solution 5 - Javascript | Dan Kantor | View Answer on Stackoverflow
Solution 6 - Javascript | Gautam Sharma | View Answer on Stackoverflow
Solution 7 - Javascript | Vedmant | View Answer on Stackoverflow
Solution 8 - Javascript | Kashan Haider | View Answer on Stackoverflow
Solution 9 - Javascript | Kang Andrew | View Answer on Stackoverflow
Solution 10 - Javascript | kofifus | View Answer on Stackoverflow
Solution 11 - Javascript | eddyP23 | View Answer on Stackoverflow
Solution 12 - Javascript | Án Bình Trọng | View Answer on Stackoverflow
Solution 13 - Javascript | Kang Andrew | View Answer on Stackoverflow