Notepad++ regex group capture

Regex, Notepad++

Regex Problem Overview


I have a text file like this:

ххх.prontube.ru
salo.ru
bbb.antichat.ru
yyy.ru
xx.bb.prontube.ru
zzz.com
srfsf.jwbefw.com.ua

I am trying to strip all subdomains with this regex:

Find:    .+\.((.*?)\.(ru|ua|com\.ua|com|net|info))$
Replace with: \1

The result is:

prontube.ru
salo.ru
antichat.ru
yyy.ru
prontube.ru
zzz.com
com.ua

Why does the last line become com.ua instead of jwbefw.com.ua?
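
The behaviour can be reproduced outside Notepad++. The sketch below uses Python's re module as a stand-in for Notepad++'s regex engine (an assumption; the engines are not identical, though they appear to agree on this pattern). It applies the substitution to each line separately, and the leading xxx is typed as plain ASCII.

```python
import re

# Sample lines from the question (leading "xxx" typed as plain ASCII here).
lines = [
    "xxx.prontube.ru",
    "salo.ru",
    "bbb.antichat.ru",
    "yyy.ru",
    "xx.bb.prontube.ru",
    "zzz.com",
    "srfsf.jwbefw.com.ua",
]

# Original Find pattern; the greedy .+ at the front is the part in question.
ORIGINAL = re.compile(r".+\.((.*?)\.(ru|ua|com\.ua|com|net|info))$")

# Replace each match with capture group 1, like "Replace with: \1".
result = [ORIGINAL.sub(r"\1", line) for line in lines]

for line in result:
    print(line)
```

The last printed line is com.ua, matching what Notepad++ produced; lines with a single period do not match and are left unchanged.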

Regex Solutions


Solution 1 - Regex

This works without look around:

Find: [a-zA-Z0-9-.]+\.([a-zA-Z0-9-]+)\.([a-zA-Z0-9-]+)$
Replace: \1\.\2

It finds something with at least 2 periods and only letters, numbers, and dashes following the last two periods; then it replaces it with the last 2 parts. More intuitive, in my opinion.

There's something funny going on with that leading xxx. It doesn't appear to be plain ASCII. For the sake of this question, I'm going to assume that's just something funny with this site and not representative of your real data.
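
As a rough check of this pattern, here is a sketch in Python's re module (a stand-in for Notepad++, which is an assumption). The replacement is written as \1.\2 because Python leaves \. in a replacement template as a literal backslash and period. The leading xxx is typed as plain ASCII, in line with the assumption above.

```python
import re

# Sample lines from the question, with an ASCII "xxx" on the first line.
lines = [
    "xxx.prontube.ru",
    "salo.ru",
    "bbb.antichat.ru",
    "yyy.ru",
    "xx.bb.prontube.ru",
    "zzz.com",
    "srfsf.jwbefw.com.ua",
]

# Keep only the last two dot-separated parts of each host name.
SOLUTION_1 = re.compile(r"[a-zA-Z0-9-.]+\.([a-zA-Z0-9-]+)\.([a-zA-Z0-9-]+)$")

result = [SOLUTION_1.sub(r"\1.\2", line) for line in lines]

for line in result:
    print(line)
```

Because the pattern always keeps exactly two parts, the last line still comes out as com.ua in this sketch; it does not treat com.ua as a single TLD.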

Incorrect

Interestingly, I previously had an incorrect answer here that accumulated a lot of upvotes. So I think I should preserve it:

Find: [a-zA-Z0-9-]+\.([a-zA-Z0-9-]+)\.(.+)$
Replace: \1\.\2

It just finds a host name with at least 2 periods in it, then replaces it with everything after the first dot.

Solution 2 - Regex

The .+ part is matching as much as possible. Try using .+? instead, and it will capture the least possible, allowing the com.ua option to match.
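
A minimal sketch of this suggestion, again using Python's re module in place of Notepad++ (an assumption) and an ASCII xxx on the first line. The only change from the question's pattern is .+? at the front.

```python
import re

lines = [
    "xxx.prontube.ru",
    "salo.ru",
    "bbb.antichat.ru",
    "yyy.ru",
    "xx.bb.prontube.ru",
    "zzz.com",
    "srfsf.jwbefw.com.ua",
]

# Same pattern as in the question, but the leading quantifier is lazy.
LAZY = re.compile(r".+?\.((.*?)\.(ru|ua|com\.ua|com|net|info))$")

result = [LAZY.sub(r"\1", line) for line in lines]

for line in result:
    print(line)
```

In this sketch the last line becomes jwbefw.com.ua, but xx.bb.prontube.ru now becomes bb.prontube.ru, because the lazy prefix stops at the first period.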

Solution 3 - Regex

.+?\.([\w-]*?\.(?:ru|ua|com\.ua|com|net|info))$

This answer still uses the specific TLDs that the original question was looking at. Because some TLDs (top-level domains) contain a period, and a list could in theory include several levels of subdomains, whitelisting the TLDs in the regex is a good idea if it works with your data set. Both earlier answers (from 2013) fail to handle the difference between "xx.bb.prontube.ru" and "srfsf.jwbefw.com.ua" correctly.

Here is a quick explanation of why the original regex from the question isn't working as intended:
The + is greedy. .+ will first run all the way to the end of the line, capturing everything, then work its way backwards (to the left), one character at a time, until the rest of the pattern can match, ending with this group:

(ru|ua|com\.ua|com|net|info)

With srfsf.jwbefw.com.ua, the regex engine first fails at the end of the line (nothing is left after the final a), then gives back one character at a time, moving to the left. Once .+ has shrunk to srfsf.jwbefw, the rest fits: (.*?) takes com, and ua from the alternation (the second option) matches the end of the line.

The engine will not keep looking for "com.ua" because "ua" already met that requirement.
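
To see how the greedy match splits the last line, the following sketch inspects the capture groups with Python's re module (used here as a stand-in for Notepad++, which is an assumption). The name greedy_prefix is only illustrative.

```python
import re

ORIGINAL = re.compile(r".+\.((.*?)\.(ru|ua|com\.ua|com|net|info))$")

m = ORIGINAL.search("srfsf.jwbefw.com.ua")
assert m is not None

# Everything before group 1, minus the literal period, is what .+ kept.
greedy_prefix = m.string[: m.start(1) - 1]

print(greedy_prefix)  # srfsf.jwbefw
print(m.group(1))     # com.ua
print(m.group(2))     # com
print(m.group(3))     # ua
```

The greedy .+ keeps as much as it can, so group 1 is left with the shortest tail that still satisfies the pattern.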

Niet the Dark Absol's answer tells the regex to be "lazy".
.+? will match any character (at least one) and then try to match the next part of the regex. If that fails, .+? takes one more character and the rest of the regex is evaluated again.
With srfsf.jwbefw.com.ua, the .+? stops after srfsf; then the period matches, (.*?) takes jwbefw, and com.ua matches the end of the line.

But making the quantifier lazy with ? also creates an issue.

Adding the question mark makes that first .+ lazy, but then causes group 1 to match bb.prontube.ru instead of prontube.ru for xx.bb.prontube.ru.

This is because the first period (the one after xx) will match, then (.*?) inside group 1 will match bb.prontube before \.(ru|ua|com\.ua|com|net|info))$ matches .ru

To avoid this, change the second group from (.*?) to ([\w-]*?) so it cannot match a period, only letters, numbers, underscores, or a dash.
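
The difference between the two inner groups can be checked on the one problematic host name. This is a sketch in Python's re module (an assumption, standing in for Notepad++); the names TLDS, LAZY_ANY and LAZY_LABEL are only illustrative.

```python
import re

# TLD whitelist from the question, kept as a capturing group.
TLDS = r"(ru|ua|com\.ua|com|net|info)"

# Inner group may contain periods.
LAZY_ANY = re.compile(r".+?\.((.*?)\." + TLDS + r")$")
# Inner group limited to word characters and dashes.
LAZY_LABEL = re.compile(r".+?\.(([\w-]*?)\." + TLDS + r")$")

host = "xx.bb.prontube.ru"
any_group1 = LAZY_ANY.search(host).group(1)
label_group1 = LAZY_LABEL.search(host).group(1)

print(any_group1)    # bb.prontube.ru
print(label_group1)  # prontube.ru
```

Since [\w-] cannot cross a period, the lazy prefix is forced to keep growing until only one label is left in front of the TLD.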

Resulting regex:
.+?\.(([\w-]*?)\.(ru|ua|com\.ua|com|net|info))$

Note that you don't need to capture any groups other than the first. Adding ?: makes the TLD options non-capturing.

Final regex:
.+?\.([\w-]*?\.(?:ru|ua|com\.ua|com|net|info))$
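
A quick way to try this regex against the question's sample data is sketched below with Python's re module (an assumption; Notepad++ uses a different engine, though this pattern should behave the same way). The leading xxx is typed as plain ASCII.

```python
import re

lines = [
    "xxx.prontube.ru",
    "salo.ru",
    "bbb.antichat.ru",
    "yyy.ru",
    "xx.bb.prontube.ru",
    "zzz.com",
    "srfsf.jwbefw.com.ua",
]

# Lazy prefix, one label without periods, then a whitelisted TLD.
FINAL = re.compile(r".+?\.([\w-]*?\.(?:ru|ua|com\.ua|com|net|info))$")

result = [FINAL.sub(r"\1", line) for line in lines]

for line in result:
    print(line)
```

Both xx.bb.prontube.ru and srfsf.jwbefw.com.ua are reduced correctly here. Lines without a subdomain do not match and stay as they are.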

Solution 4 - Regex

Search what: .+?\.(\w+\.(?:ru|com|com\.au))
Replace with: $1
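
For comparison, here is the same search tried on the question's sample data with Python's re module (an assumption, standing in for Notepad++). Python writes the group reference as \1 where Notepad++ accepts $1, and the leading xxx is typed as plain ASCII.

```python
import re

lines = [
    "xxx.prontube.ru",
    "salo.ru",
    "bbb.antichat.ru",
    "yyy.ru",
    "xx.bb.prontube.ru",
    "zzz.com",
    "srfsf.jwbefw.com.ua",
]

# Pattern copied as given in this answer; note that it has no $ anchor.
SOLUTION_4 = re.compile(r".+?\.(\w+\.(?:ru|com|com\.au))")

result = [SOLUTION_4.sub(r"\1", line) for line in lines]

for line in result:
    print(line)
```

In this sketch the last line comes out as jwbefw.com.ua only because the pattern has no $ anchor: the match stops at .com and the trailing .ua is left untouched.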

The picture shows what each capture group refers to.
It is color-coded, so you should not need any further explanation of the regex.

[image]

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | pnslg | View Question on Stackoverflow
Solution 1 - Regex | jpmc26 | View Answer on Stackoverflow
Solution 2 - Regex | Niet the Dark Absol | View Answer on Stackoverflow
Solution 3 - Regex | davidlc | View Answer on Stackoverflow
Solution 4 - Regex | Haji Rahmatullah | View Answer on Stackoverflow