Welcome to Software Development on Codidact!
How can I emulate regular expression's branch reset in Java?
I've got this sample regex:
Pattern p = Pattern.compile("(?:([aeiou]+)[0-9]+|([123]+)[a-z]+)\\W+");
It basically has the following parts:
- one or more lowercase vowels ([aeiou]+), followed by one or more digits ([0-9]+), or
- one or more digits 1, 2 or 3 ([123]+), followed by one or more lowercase letters ([a-z]+)
- all of this followed by one or more non-word characters (\W+), that is, anything other than letters, digits and underscore
There are also two capturing groups: one for the vowels, and another one for the digits 1, 2 or 3.
I'm using alternation (|), which means that only one of these groups will be captured. Example:
Matcher m = p.matcher("ae123.");
if (m.find()) {
int n = m.groupCount();
for (int i = 1; i <= n; i++) {
System.out.format("group %d: %s\n", i, m.group(i));
}
}
In this case, only the first group is captured, and the output is:
group 1: ae
group 2: null
But if the input string is "111abc!!", the second group is captured, and the output is:
group 1: null
group 2: 111
Therefore, to know which group was captured, I need to loop through them and test if they are not null.
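For reference, the loop-and-test approach can be wrapped in a small helper. This is only a sketch: FirstCapturedGroup, firstCaptured and capture are hypothetical names of my own, not part of java.util.regex, and the regex is the one from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstCapturedGroup {
    // Same regex as in the question: two alternatives, one capturing group each.
    static final Pattern PATTERN =
            Pattern.compile("(?:([aeiou]+)[0-9]+|([123]+)[a-z]+)\\W+");

    // Hypothetical helper: returns the first group that took part in the
    // last successful match, or null if no group did.
    static String firstCaptured(Matcher m) {
        for (int i = 1; i <= m.groupCount(); i++) {
            String value = m.group(i);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    // Convenience wrapper: null when the input does not match at all.
    static String capture(String input) {
        Matcher m = PATTERN.matcher(input);
        return m.find() ? firstCaptured(m) : null;
    }

    public static void main(String[] args) {
        System.out.println("ae123. -> " + capture("ae123."));
        System.out.println("111abc!! -> " + capture("111abc!!"));
    }
}
```

Running main should print "ae123. -> ae" and "111abc!! -> 111". It hides the null checks, but it doesn't remove them.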
Some regex engines support the branch reset feature: when the expression is placed between (?| and ), the group numbering restarts at each alternative (example). So the regex above could be written as:
Pattern p = Pattern.compile("(?|([aeiou]+)[0-9]+|([123]+)[a-z]+)\\W+");
The branch reset ((?|) makes both ([aeiou]+) and ([123]+) group 1 (and because there's an alternation, just one or the other, only one of these expressions is captured). With this feature, there would be no need to loop through the groups testing whether each one is null. I could just get group 1 directly (m.group(1) would always have a value).
But Java doesn't support branch reset, and the code above throws an exception:
java.util.regex.PatternSyntaxException: Unknown inline modifier near index 2
(?|([aeiou]+)[0-9]+|([123]+)[a-z]+)\W+
  ^
I'm using Java 8, and taking a look at the Java 14 docs, we can see that this feature is still not supported (in Java 15 preview there's also no mention of it).
I also tried an alternative that works in .NET, giving all the groups the same name, but it doesn't work in Java either:
Pattern p = Pattern.compile("(?:(?<somename>[aeiou]+)[0-9]+|(?<somename>[123]+)[a-z]+)\\W+");
This code throws an exception, because in Java you can't have two or more groups with the same name:
java.util.regex.PatternSyntaxException: Named capturing group <somename> is already defined near index 36
(?:(?<somename>[aeiou]+)[0-9]+|(?<somename>[123]+)[a-z]+)\W+
                                    ^
Is there a way to emulate branch reset in Java, or is the only solution to loop through the groups, testing whether they are null?
2 answers
Currently, Java 16 is the latest version, and there's still no support for branch reset. But one alternative, still far from ideal, is to use lookarounds:
Pattern pattern = Pattern.compile("([aeiou]+(?=\\d+\\W+)|[123]+(?=[a-z]+\\W+))");
Matcher matcher = pattern.matcher("ae123. 111abc!!");
while (matcher.find()) {
System.out.println(matcher.group(1));
}
In the code above, ae and 111 are both in group 1, which kinda simulates what a branch reset does.
Basically, I use alternation (the | character, which means "or") with two options.
The first option searches for the vowels, and there's a lookahead that verifies that they are followed by \d+\W+ (digits and \W+). As this last part is inside a lookahead, inside (?= ), it won't be part of the match. Lookarounds are zero-length assertions: they just check if something exists (hence, "assertion") but their contents aren't returned as part of the match (hence, "zero length").
The second option searches for 1, 2 or 3, and the part that comes next (letters and \W+) is inside another lookahead.
Everything is inside parentheses, forming a single capturing group. This way, either the vowels or the digits 1/2/3 (but not what comes after them) will be in this group. Hence, the Matcher just needs to check group 1.
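To make the "zero length" part concrete, here is a small sketch that prints each match together with its start and end offsets. LookaheadDemo, describe and spans are hypothetical names used only for illustration; the regex and input are the same as above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadDemo {
    static final Pattern PATTERN =
            Pattern.compile("([aeiou]+(?=\\d+\\W+)|[123]+(?=[a-z]+\\W+))");

    // Formats the current match as text[start,end) so the consumed span is visible.
    static String describe(Matcher m) {
        return m.group() + "[" + m.start() + "," + m.end() + ")";
    }

    // Collects the description of every match found in the input.
    static List<String> spans(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = PATTERN.matcher(input);
        while (m.find()) {
            result.add(describe(m));
        }
        return result;
    }

    public static void main(String[] args) {
        // The lookahead text ("123." and "abc!!") is checked but not consumed,
        // so each match ends right after the vowels or the digits 1/2/3.
        System.out.println(spans("ae123. 111abc!!"));
    }
}
```

This should print [ae[0,2), 111[7,10)]: the whole match is just the content of group 1, and what the lookaheads checked is left out of it.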
This might solve the simpler cases, but what if I needed two groups? For example, if the numbers after the vowels, or the letters after 1/2/3, also need to be captured (in this case, in group 2). With branch reset, all we need is:
(?|([aeiou]+)([0-9]+)|([123]+)([a-z]+))\W+
But using lookarounds, I have to do something similar to what I did, using another alternation:
Pattern pattern = Pattern.compile("([aeiou]+(?=\\d+\\W+)|[123]+(?=[a-z]+\\W+))(\\d+|[a-z]+)(?=\\W+)");
Matcher matcher = pattern.matcher("ae123. 111abc!!");
while (matcher.find()) {
System.out.println(matcher.group(1) + "\t" + matcher.group(2));
}
In this case, group 2 is simpler than group 1, as it has only the digits or the letters. The problem here is the redundancy: I have to repeat the digits and letters in group 1 lookaheads, and again in group 2. That's because the lookahead just checks what's ahead, and then it "comes back" to where it was (in this case, it comes back to the point immediately after group 1). In order to have these characters in group 2, I need to put them again in the expression.
And if I needed more groups, the regex would become even more complex and redundant, with parts of the expression being repeated multiple times, turning it into a maintenance nightmare.
Also, this isn't a good solution for cases where each branch of the alternation can have a different number of groups (which would make the regex even more complicated).
Therefore, there's no good solution yet, at least not one that solves all the cases that a branch reset would, in a "clean and smooth" way. Perhaps there's no way to perfectly emulate it at all, and the only solution is to iterate over the groups, checking if they are set.
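If iterating is the only option, one way to make it less painful is to collect the groups that were set into a list, so that index 0 plays the role of group 1 under branch reset, index 1 the role of group 2, and so on. The sketch below assumes that every group inside the branch that matched actually participates in the match (an optional group that didn't would shift the positions). CapturedGroups, captured and allMatches are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CapturedGroups {
    // Two groups per branch, written without branch reset (groups 1-2 and 3-4).
    static final Pattern PATTERN =
            Pattern.compile("(?:([aeiou]+)([0-9]+)|([123]+)([a-z]+))\\W+");

    // Hypothetical helper: the groups that took part in the last match, in order.
    static List<String> captured(Matcher m) {
        List<String> values = new ArrayList<>();
        for (int i = 1; i <= m.groupCount(); i++) {
            String value = m.group(i);
            if (value != null) {
                values.add(value);
            }
        }
        return values;
    }

    // Applies captured() to every match found in the input.
    static List<List<String>> allMatches(String input) {
        List<List<String>> result = new ArrayList<>();
        Matcher m = PATTERN.matcher(input);
        while (m.find()) {
            result.add(captured(m));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(allMatches("ae123. 111abc!!"));
    }
}
```

This should print [[ae, 123], [111, abc]]. It also copes with branches that have a different number of groups, since each list simply gets as many elements as the matching branch captured.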
I've kinda found a very limited, not so elegant, far from ideal "solution", using replaceAll:
String regex = "(?:([aeiou]+)[0-9]+|([123]+)[a-z]+)\\W+";
System.out.println("ae123.".replaceAll(regex, "$1$2"));
System.out.println("111abc!!".replaceAll(regex, "$1$2"));
This prints:
ae
111
The trick is in the second argument: "$1$2" means that I'm concatenating groups 1 ($1) and 2 ($2). Because of the alternation (|), only one of the groups is captured, and the other is replaced by an empty string. So when I concatenate them, the result is always the contents of the captured group.
Limitations
But as I said, this solution is very limited. Let's suppose the regex is a little bit more complicated, with lots of different groups. Something like this:
(1) | (2) (3) (4) | (5) (6) | (7) | (8)
In that case, I can have only group 1, or only groups 2, 3 and 4, or only groups 5 and 6, or only group 7, or only group 8.
Of course I could still use replaceAll with "$1$2$3$4$5$6$7$8", but if the regex matches groups 2, 3 and 4, they will be concatenated and I wouldn't know each group's value individually. Unless I use some separator, such as "$1,$2,$3,$4,$5,$6,$7,$8", and then split the result, but that would be too "ugly" IMO (not to mention that the separator itself can't be part of any group's value, etc).
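Just to show what the separator idea would look like, here is a rough sketch with a smaller regex (two groups per branch). It assumes the whole input matches the regex, that a comma never appears inside a group, and that no group captures an empty string. If the input doesn't match, replaceAll returns it unchanged and the result is meaningless. SplitGroups and capturedViaReplace are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

class SplitGroups {
    // Two groups per branch: groups 1-2 for the first, 3-4 for the second.
    static final String REGEX = "(?:([aeiou]+)([0-9]+)|([123]+)([a-z]+))\\W+";

    // Joins all groups with commas, then keeps only the non-empty pieces.
    // Groups that did not take part in the match become empty strings.
    static List<String> capturedViaReplace(String input) {
        String joined = input.replaceAll(REGEX, "$1,$2,$3,$4");
        List<String> values = new ArrayList<>();
        for (String part : joined.split(",")) {
            if (!part.isEmpty()) {
                values.add(part);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(capturedViaReplace("ae123."));   // joined is "ae,123,,"
        System.out.println(capturedViaReplace("111abc!!")); // joined is ",,111,abc"
    }
}
```

This should print [ae, 123] and [111, abc], but all those assumptions are exactly why I consider it ugly.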
If Java supported branch reset, the group numbering would be like this:
(?| (1) | (1) (2) (3) | (1) (2) | (1) | (1) )
And I'd just need to loop through them, always starting with 1, until m.groupCount().
Which means I'm still waiting for better solutions 😉