Detect and extract url from a string?

Java Problem Overview

This is an easy question, but I just don't get it. I want to detect URLs in a string and replace them with shortened ones.

I found this expression on Stack Overflow, but the result is just http:

Pattern p = Pattern.compile("\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]",Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(str);
boolean result = m.find();
while (result) {
    for (int i = 1; i <= m.groupCount(); i++) {
        String url=m.group(i);
        str = str.replace(url, shorten(url));
    }
    result = m.find();
}
return html;

Is there any better idea?

Java Solutions

Solution 1 - Java

Let me go ahead and preface this by saying that I'm not a huge advocate of regex for complex cases. Trying to write the perfect expression for something like this is very difficult. That said, I do happen to have one for detecting URLs, and it's backed by a 350-line unit test case class that passes. Someone started with a simple regex and over the years we've grown the expression and test cases to handle the issues we've found. It's definitely not trivial:

// Pattern for recognizing a URL, based off RFC 3986
private static final Pattern urlPattern = Pattern.compile(
		"(?:^|[\\W])((ht|f)tp(s?):\\/\\/|www\\.)"
				+ "(([\\w\\-]+\\.){1,}?([\\w\\-.~]+\\/?)*"
				+ "[\\p{Alnum}.,%_=?&#\\-+()\\[\\]\\*$~@!:/{};']*)",
		Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);

Here's an example of using it:

Matcher matcher = urlPattern.matcher("foo bar http://example.com baz");
while (matcher.find()) {
	int matchStart = matcher.start(1);
	int matchEnd = matcher.end();
	// now you have the offsets of a URL match
}
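
To turn those offsets into the URL text, something like the following should work. It is only a sketch: the class name UrlOffsetsDemo and the helper findUrls are made up for illustration, and the pattern is copied unchanged from above. It uses start(1) rather than start() because the overall match can begin with the non-word character that precedes the URL.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlOffsetsDemo {
    // Same pattern as above, copied verbatim.
    private static final Pattern urlPattern = Pattern.compile(
            "(?:^|[\\W])((ht|f)tp(s?):\\/\\/|www\\.)"
                    + "(([\\w\\-]+\\.){1,}?([\\w\\-.~]+\\/?)*"
                    + "[\\p{Alnum}.,%_=?&#\\-+()\\[\\]\\*$~@!:/{};']*)",
            Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);

    // Hypothetical helper: returns the text of every URL match.
    static List<String> findUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = urlPattern.matcher(text);
        while (matcher.find()) {
            // start(1) skips the leading non-word character, if there is one
            urls.add(text.substring(matcher.start(1), matcher.end()));
        }
        return urls;
    }

    public static void main(String[] args) {
        // prints [http://example.com]
        System.out.println(findUrls("foo bar http://example.com baz"));
    }
}
```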

Solution 2 - Java

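import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Returns a list with all links contained in the input
 */
public static List<String> extractUrls(String text)
{
	List<String> containedUrls = new ArrayList<String>();
	// Note: "\\+-=" inside the character class is a range from '+' to '=',
	// so it also covers , - . / digits : ; and <
	String urlRegex = "((https?|ftp|gopher|telnet|file):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)";
	Pattern pattern = Pattern.compile(urlRegex, Pattern.CASE_INSENSITIVE);
	Matcher urlMatcher = pattern.matcher(text);

	while (urlMatcher.find())
	{
		// group() is the whole match, same as substring(start(0), end(0))
		containedUrls.add(urlMatcher.group());
	}

	return containedUrls;
}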

Example:

List<String> extractedUrls = extractUrls("Welcome to https://stackoverflow.com/ and here is another link http://www.google.com/ \n which is a great search engine");

for (String url : extractedUrls)
{
	System.out.println(url);
}

Prints:

https://stackoverflow.com/
http://www.google.com/

Solution 3 - Java

m.group(1) gives you the first capturing group, that is, the contents of the first pair of capturing parentheses. Here that is (https?|ftp|file), which is why you only get http.

Check what m.group(0) contains (it is the whole match), or wrap the entire pattern in parentheses and use m.group(1) again.

You need to call find() again to move on to the next match, and then read the groups of that new match.
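
To see the difference, here is a small sketch that runs the pattern from the question unchanged and prints both groups for each match. The class name GroupDemo and the helper groupOfFirstMatch are invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Pattern from the question, unchanged.
    static final Pattern P = Pattern.compile(
            "\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]",
            Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: the requested group of the first match, or null.
    static String groupOfFirstMatch(String text, int group) {
        Matcher m = P.matcher(text);
        return m.find() ? m.group(group) : null;
    }

    public static void main(String[] args) {
        Matcher m = P.matcher("see http://example.com/a and ftp://example.org/b");
        while (m.find()) { // each find() advances to the next match
            System.out.println("group(1) = " + m.group(1)); // scheme only
            System.out.println("group(0) = " + m.group(0)); // whole match
        }
        // prints:
        // group(1) = http
        // group(0) = http://example.com/a
        // group(1) = ftp
        // group(0) = ftp://example.org/b
    }
}
```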

Solution 4 - Java

Detecting URLs is not an easy task. If it's enough for you to get a string that starts with https?|ftp|file, then your expression could be fine. Your problem is that you have a capturing group, the (), and it only surrounds the first part, http...

I would make this part a non-capturing group using (?:) and put brackets around the whole thing.

"\\b((?:https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|])"

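With that pattern, the replacement step from the question could be sketched roughly as below. Assumptions: the class name UrlShortenDemo is made up, and shorten() here is only a placeholder standing in for the question's own method. Building the result with appendReplacement avoids calling str.replace inside the loop, which would replace every occurrence of the matched text, including where it is a prefix of a longer URL.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlShortenDemo {
    // Pattern from this answer: the scheme is non-capturing, group 1 is the whole URL.
    static final Pattern URL = Pattern.compile(
            "\\b((?:https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|])",
            Pattern.CASE_INSENSITIVE);

    // Placeholder for the question's shorten(); not a real URL shortener.
    static String shorten(String url) {
        return "<short:" + url.length() + ">";
    }

    static String replaceUrls(String text) {
        Matcher m = URL.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // quoteReplacement keeps any '$' or '\' in the replacement literal
            m.appendReplacement(out, Matcher.quoteReplacement(shorten(m.group(1))));
        }
        m.appendTail(out); // copy the text after the last match
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: foo <short:18> bar
        System.out.println(replaceUrls("foo http://example.com bar"));
    }
}
```
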
Solution 5 - Java

With an extra pair of brackets around the whole expression (except the word boundary at the start), group 1 should capture the entire match instead of only the scheme:

"\\b((https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|])"

I don't think that regex matches every whole URL though.
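
A quick way to check is to print the groups for a sample input. In the sketch below (NestedGroupDemo and groups are illustrative names only), groups are numbered by their opening bracket, so group 1 is the outer pair and group 2 is the scheme. For this one sample, at least, group 1 includes the path and query string.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NestedGroupDemo {
    // Pattern from this answer, unchanged.
    static final Pattern P = Pattern.compile(
            "\\b((https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|])",
            Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: {group 1, group 2} of the first match, or an empty array.
    static String[] groups(String text) {
        Matcher m = P.matcher(text);
        if (!m.find()) {
            return new String[0];
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        // prints [https://example.com/path?x=1, https]
        System.out.println(Arrays.toString(groups("go to https://example.com/path?x=1 now")));
    }
}
```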

Solution 6 - Java

I tried all the examples here on different URLs like these, and none of them works perfectly for all of them:

> http://example.com
> https://example.com.ua
> www.example.ua
> https://stackoverflow.com/question/5713558/detect-and-extract-url-from-a-string
> https://www.google.com/search?q=how+to+extract+link+from+text+java+example&rlz=1C1GCEU_en-GBUA932UA932&oq=how+to+extract+link+from+text+java+example&aqs=chrome..69i57j33i22i29i30.15020j0j7&sourceid=chrome&ie=UTF-8

So I wrote my own regex and a method that uses it, which works on text containing multiple links:

private static final String LINK_REGEX = "((http:\\/\\/|https:\\/\\/)?(www.)?(([a-zA-Z0-9-]){2,2083}\\.){1,4}([a-zA-Z]){2,6}(\\/(([a-zA-Z-_\\/\\.0-9#:?=&;,]){0,2083})?){0,2083}?[^ \\n]*)";
private static final String TEXT_WITH_LINKS_EXAMPLE = "link1:http://example.com link2: https://example.com.ua link3 www.example.ua\n" +
        "link4- https://stackoverflow.com/questions/5713558/detect-and-extract-url-from-a-string\n" +
        "link5 https://www.google.com/search?q=how+to+extract+link+from+text+java+example&rlz=1C1GCEU_en-GBUA932UA932&oq=how+to+extract+link+from+text+java+example&aqs=chrome..69i57j33i22i29i30.15020j0j7&sourceid=chrome&ie=UTF-8";

And a method that returns an ArrayList of the links:

private ArrayList<String> getAllLinksFromTheText(String text) {
    ArrayList<String> links = new ArrayList<>();
    Pattern p = Pattern.compile(LINK_REGEX, Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(text);
    while (m.find()) {
        links.add(m.group());
    }
    return links;
}

That's all. Call this method with TEXT_WITH_LINKS_EXAMPLE as the argument and you will receive the five links from the text.

Solution 7 - Java

https://github.com/linkedin/URL-Detector

        <groupId>io.github.url-detector</groupId>
        <artifactId>url-detector</artifactId>
        <version>0.1.23</version>

Solution 8 - Java

Old question, but this library might be useful to someone. It passes lots of test cases.

https://mvnrepository.com/artifact/com.linkedin.urls/url-detector/0.1.17

Additional documentation:
https://engineering.linkedin.com/blog/2016/06/open-sourcing-url-detector--a-java-library-to-detect-and-normali

https://github.com/linkedin/URL-Detector

Solution 9 - Java

This little code snippet / function will extract URL strings from a string in Java. I found the basic regex for doing it here, and used it in a Java function.

I expanded on the basic regex a bit with the part “|www[.]” in order to catch links not starting with “http://”

Enough talk (it is cheap), here’s the code:

// Pull all links from the body for easy retrieval
private ArrayList<String> pullLinks(String text) {
    ArrayList<String> links = new ArrayList<String>();

    String regex = "\\(?\\b(http://|www[.])[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]";
    Pattern p = Pattern.compile(regex);
    Matcher m = p.matcher(text);
    while (m.find()) {
        String urlStr = m.group();
        // Strip a pair of enclosing parentheses, e.g. "(http://example.com)"
        if (urlStr.startsWith("(") && urlStr.endsWith(")")) {
            urlStr = urlStr.substring(1, urlStr.length() - 1);
        }
        links.add(urlStr);
    }
    return links;
}

Content Type	Original Author	Original Content on Stackoverflow
Question	Shisoft	View Question on Stackoverflow
Solution 1 - Java	WhiteFang34	View Answer on Stackoverflow
Solution 2 - Java	BullyWiiPlaza	View Answer on Stackoverflow
Solution 3 - Java	M'vy	View Answer on Stackoverflow
Solution 4 - Java	stema	View Answer on Stackoverflow
Solution 5 - Java	Billy Moon	View Answer on Stackoverflow
Solution 6 - Java	Alexander Yushko	View Answer on Stackoverflow
Solution 7 - Java	Yuriy Barannikov	View Answer on Stackoverflow
Solution 8 - Java	Chandan	View Answer on Stackoverflow
Solution 9 - Java	lemmy njaria	View Answer on Stackoverflow