Q&A How can I emulate regular expression's branch reset in Java?

posted 3y ago by hkotsubo‭

Answer
#1: Initial revision by hkotsubo‭ · 2021-06-11T18:16:03Z (almost 3 years ago)
Currently, [Java 16 is the latest version][1], and there's no support for branch reset yet. But one alternative - still far from ideal - is to use [*lookarounds*](https://www.regular-expressions.info/lookaround.html):

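```
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Each alternative captures its text in group 1; the lookaheads only check what follows
Pattern pattern = Pattern.compile("([aeiou]+(?=\\d+\\W+)|[123]+(?=[a-z]+\\W+))");
Matcher matcher = pattern.matcher("ae123. 111abc!!");
while (matcher.find()) {
    System.out.println(matcher.group(1)); // prints "ae", then "111"
}
```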

In the code above, `ae` and `111` are both in group 1, which *kinda* simulates what a branch reset does.

Basically, I use [alternation](https://www.regular-expressions.info/alternation.html) (the `|` character, which means "or") with two options.

The first option searches for the vowels, followed by a lookahead that checks whether `\\d+\\W+` (one or more digits, then one or more non-word characters) comes right after them. Since this last part is inside a lookahead - inside `(?= )` - it won't be part of the match. Lookarounds are zero-length assertions: they just check whether something exists (hence, "assertion"), but what they check isn't returned as part of the match (hence, "zero length").

The second option searches for one or more of the digits 1, 2 or 3, and the part that comes next (letters and `\W+`) is inside another lookahead.

Everything is inside parentheses, forming a single capturing group. This way, either the vowels or the digits 1/2/3 (but not what comes after them) end up in this group. Hence, the code just needs to read group 1 from the `Matcher`.
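
To make the "zero length" part more concrete, here is a small sketch comparing the same expression with and without the lookahead. The class name `LookaheadDemo` and the helper `firstMatch` are made up for this illustration; only `Pattern` and `Matcher` come from the JDK.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaheadDemo {
    // Hypothetical helper: returns the text consumed by the first match, or null if none
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "ae123. 111abc!!";

        // Without a lookahead, the digits and the non-word characters are consumed too
        System.out.println("[" + firstMatch("[aeiou]+\\d+\\W+", input) + "]");     // [ae123. ]

        // With a lookahead, the same text must follow, but it stays out of the match
        System.out.println("[" + firstMatch("[aeiou]+(?=\\d+\\W+)", input) + "]"); // [ae]
    }
}
```

In both cases the digits and the `\W+` part have to be there for the match to succeed; the lookahead only changes what ends up in the result.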

---

This might solve the simpler cases, but what if I needed two groups? For example, if the numbers after the vowels, or the letters after 1/2/3, also need to be in a group of their own (in this case, group 2). With branch reset, all we need is:

    (?|([aeiou]+)([0-9]+)|([123]+)([a-z]+))\W+
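
For reference, a quick way to check whether a given JDK accepts this syntax is to try compiling it. This is just a sketch (the class name `BranchResetCheck` and the helper `compiles` are made up here); on the Java versions I'm aware of, I'd expect `Pattern.compile` to reject the `(?|` construct with a `PatternSyntaxException`, so it should print `false`.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class BranchResetCheck {
    // Hypothetical helper: true if java.util.regex accepts the expression
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            // The regex engine doesn't recognize some construct in the expression
            return false;
        }
    }

    public static void main(String[] args) {
        // The branch reset expression from above, with the backslash escaped for a Java string
        System.out.println(compiles("(?|([aeiou]+)([0-9]+)|([123]+)([a-z]+))\\W+"));
    }
}
```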

But using lookarounds, I have to do something similar to what I did, using another alternation:

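```
// Group 1 works as before; group 2 repeats what the lookaheads of group 1 already checked
Pattern pattern = Pattern.compile("([aeiou]+(?=\\d+\\W+)|[123]+(?=[a-z]+\\W+))(\\d+|[a-z]+)(?=\\W+)");
Matcher matcher = pattern.matcher("ae123. 111abc!!");
while (matcher.find()) {
    // prints "ae" and "123", then "111" and "abc" (separated by a tab)
    System.out.println(matcher.group(1) + "\t" + matcher.group(2));
}
```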

In this case, group 2 is simpler than group 1, as it has only the digits or the letters. The problem here is the redundancy: I have to repeat the digits and letters in group 1 lookaheads, and again in group 2. That's because the lookahead just checks what's ahead, and then it "comes back" to where it was (in this case, it comes back to the point immediately after group 1). In order to have these characters in group 2, I need to put them again in the expression.

And if I needed more groups, the regex would become even more complex and redundant, with parts of the expression being repeated multiple times, turning it into a maintenance nightmare.

Also, this isn't a good solution for cases where each branch of the alternation can have a different number of groups (which would make the regex even more complicated).

---
Therefore, there's no good solution yet, at least not one that solves all the cases that a branch reset would, in a "clean and smooth" way. Perhaps there's no way to perfectly emulate it at all, and the only solution is to iterate over the groups, checking if they are set.
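
A minimal sketch of that last idea, assuming the same two-branch expression used above: the branch reset is replaced by a plain non-capturing group, so the first branch fills groups 1 and 2 and the second fills groups 3 and 4. Since `Matcher.group(int)` returns `null` for a group that didn't take part in the match, the code can loop through all groups and keep only the ones that are set. The names `GroupIteration`, `setGroups` and `findAll` are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupIteration {
    // Same alternation as the branch reset regex, but each branch has its own group numbers
    private static final Pattern PATTERN =
        Pattern.compile("(?:([aeiou]+)([0-9]+)|([123]+)([a-z]+))\\W+");

    // Collects the groups that took part in the current match, in order
    static List<String> setGroups(Matcher m) {
        List<String> result = new ArrayList<>();
        for (int i = 1; i <= m.groupCount(); i++) {
            String g = m.group(i);
            if (g != null) { // null means this group belongs to the branch that didn't match
                result.add(g);
            }
        }
        return result;
    }

    // Runs the pattern over the input and returns the set groups of every match
    static List<List<String>> findAll(String input) {
        List<List<String>> all = new ArrayList<>();
        Matcher m = PATTERN.matcher(input);
        while (m.find()) {
            all.add(setGroups(m));
        }
        return all;
    }

    public static void main(String[] args) {
        for (List<String> groups : findAll("ae123. 111abc!!")) {
            // prints "ae" and "123", then "111" and "abc" (separated by a tab)
            System.out.println(String.join("\t", groups));
        }
    }
}
```

This also copes with branches that have different numbers of groups, but the calling code no longer knows which position means what without extra bookkeeping, which is exactly what a branch reset would take care of.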


  [1]: https://openjdk.java.net/projects/jdk/16/