Detect and extract url from a string?

Java Regex Url

Java Problem Overview


This is an easy question, but I just don't get it. I want to detect URLs in a string and replace them with shortened ones.

I found this expression on Stack Overflow, but the only result I get is http:

Pattern p = Pattern.compile("\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(str);
boolean result = m.find();
while (result) {
    for (int i = 1; i <= m.groupCount(); i++) {
        String url = m.group(i);
        str = str.replace(url, shorten(url));
    }
    result = m.find();
}
return str;

Is there any better idea?

Java Solutions


Solution 1 - Java

Let me preface this by saying that I'm not a huge advocate of regex for complex cases. Writing the perfect expression for something like this is very difficult. That said, I do happen to have one for detecting URLs, and it's backed by a 350-line unit test class that passes. Someone started with a simple regex, and over the years we've grown the expression and the test cases to handle the issues we've found. It's definitely not trivial:

// Pattern for recognizing a URL, based off RFC 3986
private static final Pattern urlPattern = Pattern.compile(
		"(?:^|[\\W])((ht|f)tp(s?):\\/\\/|www\\.)"
				+ "(([\\w\\-]+\\.){1,}?([\\w\\-.~]+\\/?)*"
				+ "[\\p{Alnum}.,%_=?&#\\-+()\\[\\]\\*$~@!:/{};']*)",
		Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);

Here's an example of using it:

Matcher matcher = urlPattern.matcher("foo bar http://example.com baz");
while (matcher.find()) {
	int matchStart = matcher.start(1);
	int matchEnd = matcher.end();
	// now you have the offsets of a URL match
}
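
As a rough sketch of how those offsets can be turned into strings, the class below copies the pattern above and collects each match with substring. The class name UrlOffsetDemo and the helper extract are made up for illustration; they are not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlOffsetDemo {
    // Same expression as above, copied so this sketch compiles on its own.
    private static final Pattern URL_PATTERN = Pattern.compile(
            "(?:^|[\\W])((ht|f)tp(s?):\\/\\/|www\\.)"
                    + "(([\\w\\-]+\\.){1,}?([\\w\\-.~]+\\/?)*"
                    + "[\\p{Alnum}.,%_=?&#\\-+()\\[\\]\\*$~@!:/{};']*)",
            Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);

    // Hypothetical helper: returns every match as a substring of the input.
    static List<String> extract(String text) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = URL_PATTERN.matcher(text);
        while (matcher.find()) {
            // start(1) skips the non-word character consumed by (?:^|[\W])
            urls.add(text.substring(matcher.start(1), matcher.end()));
        }
        return urls;
    }

    public static void main(String[] args) {
        // prints [http://example.com]
        System.out.println(extract("foo bar http://example.com baz"));
    }
}
```

Using start(1) rather than start() matters here, because the full match may begin with the whitespace or punctuation character that precedes the URL.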

Solution 2 - Java

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Returns a list with all links contained in the input
 */
public static List<String> extractUrls(String text)
{
	List<String> containedUrls = new ArrayList<String>();
	// The hyphen is escaped so that "+-=" is not read as a character range
	String urlRegex = "((https?|ftp|gopher|telnet|file):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+\\-=\\\\\\.&]*)";
	Pattern pattern = Pattern.compile(urlRegex, Pattern.CASE_INSENSITIVE);
	Matcher urlMatcher = pattern.matcher(text);

	while (urlMatcher.find())
	{
		// group() returns the text of the whole match
		containedUrls.add(urlMatcher.group());
	}

	return containedUrls;
}

Example:

List<String> extractedUrls = extractUrls("Welcome to https://stackoverflow.com/ and here is another link http://www.google.com/ \n which is a great search engine");

for (String url : extractedUrls)
{
	System.out.println(url);
}

Prints:

https://stackoverflow.com/
http://www.google.com/

Solution 3 - Java

m.group(1) gives you the first matching group, that is, the contents of the first pair of capturing parentheses. Here that is (https?|ftp|file).

Check what m.group(0) contains, or wrap the whole pattern in parentheses and use m.group(1) again.

To reach the next match, call find() again; the groups then refer to that new match.
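
To see the difference between the two groups, here is a minimal sketch that runs the question's pattern against a sample sentence. GroupDemo and firstMatchGroups are hypothetical names used only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // The pattern from the question, unchanged.
    static final Pattern P = Pattern.compile(
            "\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]",
            Pattern.CASE_INSENSITIVE);

    // Returns {group(0), group(1)} for the first match, or null if there is none.
    static String[] firstMatchGroups(String text) {
        Matcher m = P.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(0), m.group(1) };
    }

    public static void main(String[] args) {
        String[] g = firstMatchGroups("visit http://example.com now");
        System.out.println("group(0) = " + g[0]); // whole match: http://example.com
        System.out.println("group(1) = " + g[1]); // first capturing group: http
    }
}
```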

Solution 4 - Java

Detecting URLs is not an easy task. If it's enough for you to get a string that starts with https?|ftp|file, then this could be fine. Your problem here is that you have a capturing group, the (), and it only surrounds the first part, http...

I would make this part a non-capturing group using (?:) and put parentheses around the whole thing.

"\\b((?:https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|])"

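Applied to the original goal, one possible sketch of the replacement loop uses Matcher.appendReplacement with the pattern from Solution 4, so each URL is replaced at the position where it was found rather than through str.replace. The shortener is passed in as a function because the question's shorten method is not shown; UrlReplaceDemo and replaceUrls are assumed names.

```java
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlReplaceDemo {
    // Whole URL in group 1, scheme in a non-capturing group.
    static final Pattern URL = Pattern.compile(
            "\\b((?:https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|])",
            Pattern.CASE_INSENSITIVE);

    // Replaces every matched URL with whatever the given function returns for it.
    static String replaceUrls(String text, Function<String, String> shorten) {
        Matcher m = URL.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = shorten.apply(m.group(1));
            // quoteReplacement keeps '$' and '\' in the replacement literal
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Stand-in shortener for the demo: wraps the URL in brackets.
        // prints: go to [http://example.com] now
        System.out.println(replaceUrls("go to http://example.com now", url -> "[" + url + "]"));
    }
}
```

Building the result in a separate buffer also avoids modifying the string that the matcher is still scanning.
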
Solution 5 - Java

With some extra brackets around the whole thing (except the word boundary at the start), it should match the whole domain name:

"\\b((https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|])"

I don't think that regex matches the whole URL, though.
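
One case where the match falls short, shown as a small sketch: parentheses are not in the character class, so a URL containing them is cut off at the first one. PartialMatchDemo and firstUrl are illustrative names, and the sample URL is only an example input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PartialMatchDemo {
    // The pattern from this answer, with the whole URL in group 1.
    static final Pattern URL = Pattern.compile(
            "\\b((https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|])",
            Pattern.CASE_INSENSITIVE);

    // Returns the first matched URL, or null if nothing matches.
    static String firstUrl(String text) {
        Matcher m = URL.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "see http://en.wikipedia.org/wiki/Java_(programming_language) for details";
        // prints http://en.wikipedia.org/wiki/Java_
        System.out.println(firstUrl(text));
    }
}
```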

Solution 6 - Java

I tried all the examples here on different URLs like these, and none of them works perfectly for all:

http://example.com
https://example.com.ua
www.example.ua
https://stackoverflow.com/question/5713558/detect-and-extract-url-from-a-string
https://www.google.com/search?q=how+to+extract+link+from+text+java+example&rlz=1C1GCEU_en-GBUA932UA932&oq=how+to+extract+link+from+text+java+example&aqs=chrome..69i57j33i22i29i30.15020j0j7&sourceid=chrome&ie=UTF-8

So I wrote my own regex and a method that uses it, which works on text containing multiple links:

private static final String LINK_REGEX = "((http:\\/\\/|https:\\/\\/)?(www.)?(([a-zA-Z0-9-]){2,2083}\\.){1,4}([a-zA-Z]){2,6}(\\/(([a-zA-Z-_\\/\\.0-9#:?=&;,]){0,2083})?){0,2083}?[^ \\n]*)";
private static final String TEXT_WITH_LINKS_EXAMPLE = "link1:http://example.com link2: https://example.com.ua link3 www.example.ua\n" +
        "link4- https://stackoverflow.com/questions/5713558/detect-and-extract-url-from-a-string\n" +
        "link5 https://www.google.com/search?q=how+to+extract+link+from+text+java+example&rlz=1C1GCEU_en-GBUA932UA932&oq=how+to+extract+link+from+text+java+example&aqs=chrome..69i57j33i22i29i30.15020j0j7&sourceid=chrome&ie=UTF-8";

And the method that returns an ArrayList with the links:

private ArrayList<String> getAllLinksFromTheText(String text) {
    ArrayList<String> links = new ArrayList<>();
    Pattern p = Pattern.compile(LINK_REGEX, Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(text);
    while (m.find()) {
        links.add(m.group());
    }
    return links;
}

That's all. Call this method with TEXT_WITH_LINKS_EXAMPLE as the parameter and you will receive five links from the text.

Solution 7 - Java

https://github.com/linkedin/URL-Detector

        <groupId>io.github.url-detector</groupId>
        <artifactId>url-detector</artifactId>
        <version>0.1.23</version>

Solution 8 - Java

Solution 9 - Java

This little code snippet / function will effectively extract URL strings from a string in Java. I found the basic regex for doing it here, and used it in a java function.

I expanded on the basic regex a bit with the part “|www[.]” in order to catch links not starting with “http://”

Enough talk (it is cheap), here’s the code:

// Pull all links from the body for easy retrieval
private ArrayList<String> pullLinks(String text) {
    ArrayList<String> links = new ArrayList<>();

    String regex = "\\(?\\b(http://|www[.])[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]";
    Pattern p = Pattern.compile(regex);
    Matcher m = p.matcher(text);
    while (m.find()) {
        String urlStr = m.group();
        // Strip the parentheses when the whole URL is wrapped in them
        if (urlStr.startsWith("(") && urlStr.endsWith(")")) {
            urlStr = urlStr.substring(1, urlStr.length() - 1);
        }
        links.add(urlStr);
    }
    return links;
}

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Shisoft | View Question on Stackoverflow
Solution 1 - Java | WhiteFang34 | View Answer on Stackoverflow
Solution 2 - Java | BullyWiiPlaza | View Answer on Stackoverflow
Solution 3 - Java | M'vy | View Answer on Stackoverflow
Solution 4 - Java | stema | View Answer on Stackoverflow
Solution 5 - Java | Billy Moon | View Answer on Stackoverflow
Solution 6 - Java | Alexander Yushko | View Answer on Stackoverflow
Solution 7 - Java | Yuriy Barannikov | View Answer on Stackoverflow
Solution 8 - Java | Chandan | View Answer on Stackoverflow
Solution 9 - Java | lemmy njaria | View Answer on Stackoverflow