
How can I emulate regular expression's branch reset in Java?

+8
−0

I've got this sample regex:

Pattern p = Pattern.compile("(?:([aeiou]+)[0-9]+|([123]+)[a-z]+)\\W+");

It basically has the following parts:

  • one or more lowercase vowels ([aeiou]+), followed by one or more digits ([0-9]+), or
  • one or more digits 1, 2 or 3 ([123]+), followed by lowercase letters ([a-z]+)
  • all of this followed by one or more non-alphanumeric characters (\W+)

There are also two capturing groups: one for the vowels, and another one for the digits 1, 2 or 3. I'm using alternation (|), which means that only one of these groups will be captured. Example:

Matcher m = p.matcher("ae123.");
if (m.find()) {
    int n = m.groupCount();
    for (int i = 1; i <= n; i++) {
        System.out.format("group %d: %s\n", i, m.group(i));
    }
}

In this case, only the first group is captured, and the output is:

group 1: ae
group 2: null

But if the input string is "111abc!!", the second group is captured, and the output is:

group 1: null
group 2: 111

Therefore, to know which group was captured, I need to loop through them and test if they are not null.
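
A minimal sketch of that loop, wrapped in a helper. The class and method names (`FirstCapturedGroup`, `firstCapturedIndex`) are made up for illustration; only `Pattern` and `Matcher` come from the JDK:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstCapturedGroup {
    // Hypothetical helper: index of the first group that took part in the
    // match, or -1 if no group was captured.
    static int firstCapturedIndex(Matcher m) {
        for (int i = 1; i <= m.groupCount(); i++) {
            if (m.group(i) != null) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("(?:([aeiou]+)[0-9]+|([123]+)[a-z]+)\\W+");
        for (String input : new String[] {"ae123.", "111abc!!"}) {
            Matcher m = p.matcher(input);
            if (m.find()) {
                int i = firstCapturedIndex(m);
                System.out.format("captured group %d: %s%n", i, m.group(i));
            }
        }
    }
}
```

This works, but the caller still has to go through every group on every match, which is exactly what branch reset would avoid.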


Some regex engines support the branch reset feature: when the expression is placed inside (?| and ), group numbering restarts at each alternative. So the regex above could be written as:

Pattern p = Pattern.compile("(?|([aeiou]+)[0-9]+|([123]+)[a-z]+)\\W+");

The branch reset ((?|) makes both ([aeiou]+) and ([123]+) group 1 (and because of the alternation, only one of these expressions is captured). With this feature, there would be no need to loop through the groups testing for null. I could just get group 1 directly (m.group(1) would always have a value).

But Java doesn't support branch reset, and the code above throws an exception:

java.util.regex.PatternSyntaxException: Unknown inline modifier near index 2
(?|([aeiou]+)[0-9]+|([123]+)[a-z]+)\W+
  ^

I'm using Java 8, and taking a look at the Java 14 docs, we can see that this feature is still not supported (in Java 15 preview there's also no mention of it).

I also checked an alternative solution for .NET: use named groups with the same name for all groups, but it also didn't work in Java:

Pattern p = Pattern.compile("(?:(?<somename>[aeiou]+)[0-9]+|(?<somename>[123]+)[a-z]+)\\W+");

This code throws an exception, because in Java you can't have two or more groups with the same name:

java.util.regex.PatternSyntaxException: Named capturing group <somename> is already defined near index 36
(?:(?<somename>[aeiou]+)[0-9]+|(?<somename>[123]+)[a-z]+)\W+
                                          ^
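
For comparison, giving each group its own name does compile in Java, but the null check does not go away; it just moves from group numbers to group names. A minimal sketch, where the group names `vowels` and `digits` and the helper `firstNonNull` are assumptions made for this example:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DistinctNamedGroups {
    // Same regex as above, but each group has a distinct (made-up) name.
    static final Pattern PATTERN = Pattern.compile(
            "(?:(?<vowels>[aeiou]+)[0-9]+|(?<digits>[123]+)[a-z]+)\\W+");

    // Hypothetical helper: value of the first named group that was captured.
    static String firstNonNull(Matcher m, String... names) {
        for (String name : names) {
            String value = m.group(name);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    // Returns the captured text of the first match, or null if nothing matches.
    static String capture(String input) {
        Matcher m = PATTERN.matcher(input);
        return m.find() ? firstNonNull(m, "vowels", "digits") : null;
    }

    public static void main(String[] args) {
        System.out.println(capture("ae123."));
        System.out.println(capture("111abc!!"));
    }
}
```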

Is there a way to emulate branch reset in Java, or is the only solution to loop through the groups, testing if they are null?


2 answers

+6
−0

I've kinda found a very limited, not so elegant, far from ideal "solution", using replaceAll:

String regex = "(?:([aeiou]+)[0-9]+|([123]+)[a-z]+)\\W+";
System.out.println("ae123.".replaceAll(regex, "$1$2"));
System.out.println("111abc!!".replaceAll(regex, "$1$2"));

This prints:

ae
111

The trick is in the second argument: "$1$2" means that I'm concatenating groups 1 ($1) and 2 ($2). Because of the alternation (|), only one of the groups is captured and the other will be empty. And when I concatenate them, the result is always the contents of the captured group.
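
One thing to keep in mind: replaceAll only replaces the part of the string that matched, so any text around the match stays in the result. A minimal sketch of the same concatenation idea done with a Matcher, which returns only the captured text (the class and method names are made up for illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ConcatGroups {
    static final String REGEX = "(?:([aeiou]+)[0-9]+|([123]+)[a-z]+)\\W+";
    static final Pattern PATTERN = Pattern.compile(REGEX);

    // Hypothetical helper: concatenates every captured group of the first
    // match, skipping groups that did not take part in it.
    // Returns null when the input does not match at all.
    static String coalesce(String input) {
        Matcher m = PATTERN.matcher(input);
        if (!m.find()) {
            return null;
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i <= m.groupCount(); i++) {
            if (m.group(i) != null) {
                sb.append(m.group(i));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // replaceAll keeps the unmatched prefix "foo "
        System.out.println("foo ae123.".replaceAll(REGEX, "$1$2")); // foo ae
        // the Matcher version returns only the captured text
        System.out.println(coalesce("foo ae123."));                 // ae
    }
}
```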


Limitations

But as I said, this solution is very limited. Let's suppose the regex is a little more complicated, with lots of different groups. Something like this:

(1) | (2) (3) (4) | (5) (6) | (7) | (8)

In that case, I can have only group 1, or only groups 2, 3 and 4, or only groups 5 and 6, or only group 7, or only group 8.

Of course I could still use replaceAll with "$1$2$3$4$5$6$7$8", but if the regex matches groups 2, 3 and 4, they will be concatenated and I wouldn't know each group's value individually. Unless I use some separator, such as "$1,$2,$3,$4,$5,$6,$7,$8" and then split the result, but that would be too "ugly" IMO (not to mention that the separator itself can't be part of the group's value, etc).

If Java supported branch reset, the groups numbering would be like this:

(?| (1) | (1) (2) (3) | (1) (2) | (1) | (1) )

And I'd just need to loop through them, always starting with 1, until m.groupCount().

Which means I'm still waiting for better solutions 😉


+2
−0

Currently, Java 16 is the latest version, and there's still no support for branch reset. But one alternative, still far from ideal, is to use lookarounds:

Pattern pattern = Pattern.compile("([aeiou]+(?=\\d+\\W+)|[123]+(?=[a-z]+\\W+))");
Matcher matcher = pattern.matcher("ae123. 111abc!!");
while (matcher.find()) {
    System.out.println(matcher.group(1));
}

In the code above, ae and 111 are both in group 1, which kinda simulates what a branch reset does.

Basically, I use alternation (the | character, that means "or") with two options.

The first option searches for the vowels, and there's a lookahead that verifies that they are followed by \d+\W+ (digits, then \W+). As this last part is inside a lookahead, (?= ), it won't be part of the match. Lookarounds are zero-length assertions: they just check whether something exists (hence "assertion"), but what they check isn't returned as part of the match (hence "zero-length").

The second option searches for 1, 2 or 3, and the part that comes next (letters and \W+) is inside another lookahead.

Everything is inside parentheses, forming a single capturing group. This way, either the vowels or the digits 1/2/3 (but not what comes after them) will be in this group. Hence, I just need to check group 1 of the Matcher.
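
To make the "zero-length" part concrete, here is a small sketch that records where each match starts and ends. The class and method names are made up for illustration; the regex is the one from the code above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadDemo {
    // Returns one "text@start-end" entry per match.
    static List<String> describeMatches(String input) {
        Pattern pattern = Pattern.compile("([aeiou]+(?=\\d+\\W+)|[123]+(?=[a-z]+\\W+))");
        Matcher matcher = pattern.matcher(input);
        List<String> result = new ArrayList<>();
        while (matcher.find()) {
            // group() is the whole match; it is identical to group(1) here,
            // because the lookahead contributes no characters to the match.
            result.add(matcher.group() + "@" + matcher.start() + "-" + matcher.end());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(describeMatches("ae123. 111abc!!")); // [ae@0-2, 111@7-10]
    }
}
```

The first match ends at index 2, right after "ae", even though the lookahead had to read "123." to succeed.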


This might solve the simpler cases, but what if I needed two groups? Ex: if the numbers after the vowels, or the letters after 1/2/3, also need to be in one group (in this case, in group 2). With branch reset, all we need is:

(?|([aeiou]+)([0-9]+)|([123]+)([a-z]+))\W+

But using lookarounds, I have to do something similar to what I did, using another alternation:

Pattern pattern = Pattern.compile("([aeiou]+(?=\\d+\\W+)|[123]+(?=[a-z]+\\W+))(\\d+|[a-z]+)(?=\\W+)");
Matcher matcher = pattern.matcher("ae123. 111abc!!");
while (matcher.find()) {
    System.out.println(matcher.group(1) + "\t" + matcher.group(2));
}

In this case, group 2 is simpler than group 1, as it has only the digits or the letters. The problem here is the redundancy: I have to repeat the digits and letters in group 1 lookaheads, and again in group 2. That's because the lookahead just checks what's ahead, and then it "comes back" to where it was (in this case, it comes back to the point immediately after group 1). In order to have these characters in group 2, I need to put them again in the expression.

And if I needed more groups, the regex would become even more complex and redundant, with parts of the expression being repeated multiple times, turning it into a maintenance nightmare.

Also, this isn't a good solution for cases where each branch of the alternation can have a different number of groups (which would make the regex even more complicated).


Therefore, there's no good solution yet, at least not one that covers all the cases a branch reset would in a "clean and smooth" way. Perhaps there's no way to emulate it perfectly at all, and the only solution is to iterate over the groups, checking whether they are set.
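
If iterating is what's left, the loop can at least be hidden behind a helper that returns only the groups that were set, so that position 0 of the list plays the role of "group 1", position 1 of "group 2", and so on. A rough sketch, assuming a made-up helper named `capturedGroups` and the two-group regex from above without the branch reset:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CapturedGroups {
    // The two-group regex, written with ordinary groups 1 to 4.
    static final Pattern PATTERN =
            Pattern.compile("(?:([aeiou]+)([0-9]+)|([123]+)([a-z]+))\\W+");

    // Hypothetical helper: the groups that took part in the current match,
    // in order, with the unset (null) ones dropped.
    static List<String> capturedGroups(Matcher m) {
        List<String> result = new ArrayList<>();
        for (int i = 1; i <= m.groupCount(); i++) {
            String value = m.group(i);
            if (value != null) {
                result.add(value);
            }
        }
        return result;
    }

    // Captured groups of the first match, or an empty list if nothing matches.
    static List<String> captured(String input) {
        Matcher m = PATTERN.matcher(input);
        return m.find() ? capturedGroups(m) : new ArrayList<String>();
    }

    public static void main(String[] args) {
        System.out.println(captured("ae123."));   // [ae, 123]
        System.out.println(captured("111abc!!")); // [111, abc]
    }
}
```

This is not a real branch reset: if a branch contains an optional group that doesn't take part in the match, the positions in the list shift, and the regex itself can't refer back to the renumbered groups.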
