Post History

Q&A Regex to get text outside brackets

posted 6mo ago by hkotsubo‭ · edited 30d ago by hkotsubo‭

Answer

#8: Post edited by

hkotsubo‭ · 2024-11-26T13:02:33Z (30 days ago)

> _Perhaps regex is not the best solution. Although it's possible, the expression will be so complicated that it won't be worth the trouble, IMO._
# But if you insist on using regex...
I'm afraid we can't get all the groups in a single step. You said that the strings can have many pairs of brackets, therefore we can't know in advance how many groups there will be. Instead of trying to guess, we'd better try to catch one group at a time, and then build our list with the desired matches.
You could do something like this:
```python
import re
r = re.compile(r'\[[^\]]*\]|([^[\]]+)')
# testing with lots of different strings
strings = ['testing_[_is_]_done', 'no brackets', '[only brackets]', '[a]b[c]d[e]', 'empty bracket: []', 'with [nested [brackets] will] not work']
for s in strings:
    groups = tuple(match.group(1) for match in r.finditer(s) if match.group(1))
    print(f'{s:.<40} -> {groups}')
```
The regex uses [alternation](https://www.regular-expressions.info/alternation.html) (the `|` character, which means "or"), therefore it'll match either one of the options. Let's analyze them separately.
The first option is `\[[^\]]*\]`:
- `\[` is a literal `[` character
- Then we have `[^\]]`. The first `[` and last `]` create a [character class](https://www.regular-expressions.info/charclass.html), and the `^` after `[` means a negated character class. This means that it'll match anything that's **not** in the list of characters inside the brackets. In this case, there's only one character, which is `]`, but I had to escape it (`\]`) because otherwise it'd be interpreted as the closing bracket of the character class.
- Anyway, `[^\]]` will match anything that's not `]`
- Then we have `*`, which means "[zero or more occurrences](https://www.regular-expressions.info/repeat.html)". Therefore, `[^\]]*` means "zero or more characters that are not `]`"
- `\]` is a literal `]` character
So this first part matches a `[`, followed by zero or more characters that aren't `]`, followed by `]`. In other words, it matches any text inside brackets, including no text at all (it also matches `[]`).
The second option of the alternation is `([^[\]]+)`:
- the parentheses create a [capturing group](https://www.regular-expressions.info/brackets.html)
- then we have a negated character class, very similar to the previous one, except that this one also includes the `[` character. Hence, this matches anything that's neither `[` nor `]`
- `+` means "one or more occurrences", so it won't match empty strings. Therefore `[^[\]]+` will match one or more characters, as long as they're not `[` or `]`
The purpose of this part is to match anything that's not inside brackets, and put it in a capturing group. The negated character class guarantees that it'll stop as soon as it finds a `[` or `]`.
## How does this regex work?
The alternation matches one of the options: either text inside brackets, or text outside them. But only the latter is in a capturing group: **if the former is matched, the group will be empty** (in Python, `match.group(1)` returns `None`), and that's how we know which option was found.
That's why I tested `if match.group(1)`. If the group is not empty, it means that the second option (text outside brackets) was matched. This way, we discard the matches that contain text inside brackets.
The output for the code above is:
```none
testing_[_is_]_done..................... -> ('testing_', '_done')
no brackets............................. -> ('no brackets',)
[only brackets]......................... -> ()
[a]b[c]d[e]............................. -> ('b', 'd')
empty bracket: []....................... -> ('empty bracket: ',)
with [nested [brackets] will] not work.. -> ('with ', ' will', ' not work')
```
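To make the "empty group" idea concrete, here is a minimal sketch that prints both the whole match and group 1 for a single sample string. The variable name `pairs` is just for illustration; the pattern is the same one used above.

```python
import re

# Same pattern as above: bracketed text is matched but not captured,
# text outside brackets is captured in group 1.
r = re.compile(r'\[[^\]]*\]|([^[\]]+)')

# (whole match, group 1) for every match found in the sample string
pairs = [(m.group(0), m.group(1)) for m in r.finditer('testing_[_is_]_done')]
for whole, outside in pairs:
    print(f'{whole!r:12} -> group(1) = {outside!r}')
```

The middle match (`[_is_]`) comes from the first option of the alternation, so its group 1 is `None` and the `if match.group(1)` test discards it.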
You could also try with `split`:
```python
import re
r = re.compile(r'\[[^\]]*\]')
strings = ['testing_[_is_]_done', 'no brackets', '[only brackets]', '[a]b[c]d[e]', 'empty bracket: []', 'with [nested [brackets] will] not work']
for s in strings:
    groups = tuple(filter(None, r.split(s)))
    print(f'{s:.<40} -> {groups}')
```
The `split` method breaks the string into many parts, using the regex as a separator. In this case, the separator is "any text inside brackets" (using the same first part of the previous regex). The only problem is that `split` also [creates empty strings](https://stackoverflow.com/q/30924509) at the beginning and end of the list, so we have to filter them out.
**But note that it still doesn't work for nested brackets** (it just fails in a different way):
```none
testing_[_is_]_done..................... -> ('testing_', '_done')
no brackets............................. -> ('no brackets',)
[only brackets]......................... -> ()
[a]b[c]d[e]............................. -> ('b', 'd')
empty bracket: []....................... -> ('empty bracket: ',)
with [nested [brackets] will] not work.. -> ('with ', ' will] not work')
```
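As a quick illustration of why the `filter(None, ...)` step is there, this small sketch shows the raw list returned by `split` before and after filtering (the names `sep` and `raw` are only for this example):

```python
import re

sep = re.compile(r'\[[^\]]*\]')

# The string starts and ends with a separator, so split() returns
# an empty string at each end of the list.
raw = sep.split('[a]b[c]d[e]')
print(raw)                       # ['', 'b', 'd', '']
print(tuple(filter(None, raw)))  # ('b', 'd')
```

`filter(None, ...)` drops every falsy item, which for a list of strings means only the empty ones.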
Although it might be possible to build a regex to recognize nested patterns, it's not worth the trouble, IMO. You'll have to use [recursive patterns](https://github.com/mrabarnett/mrab-regex?tab=readme-ov-file#recursive-patterns-hg-issue-27) that are not supported by the native `re` module (so you'll have to [install one that supports it](https://pypi.org/project/regex/)).
And based on my experience, if you think you need a recursive regex, you're probably wrong. There are better and more straightforward solutions. Don't get me wrong, I like regex: it's a cool thing that can be helpful in many situations, but [it's not always the best solution](https://blog.codinghorror.com/regular-expressions-now-you-have-two-problems/).
# Solution without regex
One simple way is to loop through the characters of the string. If we find a `[`, just ignore everything until the respective `]` is found. Actually, to handle nested brackets, we ignore everything until the first `[` is closed. Everything else we add to our list, something like this:
```python
def get_text_outside_brackets(s):
    brackets = 0
    current_token = ''
    for c in s:
        if c == '[':
            brackets += 1
            if current_token: # if there's text outside brackets, return it
                yield current_token
                current_token = ''
        elif c == ']':
            if brackets > 0:
                brackets -= 1
        elif brackets == 0:
            current_token += c
    if current_token:
        yield current_token

strings = ['testing_[_is_]_done', 'no brackets', '[only brackets]', '[a]b[c]d[e]', 'empty bracket: []', 'with [nested [brackets] will] not work']
for s in strings:
    groups = tuple(get_text_outside_brackets(s))
    print(f'{s:.<40} -> {groups}')
```
Now it'll ignore the nested brackets correctly:
```none
testing_[_is_]_done..................... -> ('testing_', '_done')
no brackets............................. -> ('no brackets',)
[only brackets]......................... -> ()
[a]b[c]d[e]............................. -> ('b', 'd')
empty bracket: []....................... -> ('empty bracket: ',)
with [nested [brackets] will] not work.. -> ('with ', ' not work')
```
And, IMHO, the code is simpler and easier to understand and change than the regex.
---
# Final considerations
It's not clear if you'll face the nested brackets scenario, or even malformed strings, such as `'ab[ c ] ]]]]] def'`. Even if that's not the case, you should analyze if the regex is too complicated to be worth maintaining. And if there are such strings, you should prefer the solution without regex.
For example, I tested with this string: `'malformed[ string ] ]]]]] what now? [ '`. The regex returned `('malformed', ' ', ' what now? ', ' ')`, the other regex with `split` returned `('malformed', ' ]]]]] what now? [ ')`, and the last solution without regex returned `('malformed', '  what now? ')`. Which one would be correct in this case? Should all the `]`'s be part of the output, because they're not part of a pair (there's no corresponding `[`)?
If we want to include the `]`'s, it's easy with the last solution:
```python
def get_text_outside_brackets(s):
    brackets = 0
    current_token = ''
    for c in s:
        if c == '[':
            brackets += 1
            if current_token: # if there's text outside brackets, return it
                yield current_token
                current_token = ''
        elif c == ']':
            if brackets > 0:
                brackets -= 1
            else: # <--- not a pair, add "]" to the current token
                current_token += c
        elif brackets == 0:
            current_token += c
    if current_token:
        yield current_token
```
And now the result will be `('malformed', ' ]]]]] what now? ')`. I admit that it's debatable if the last `[` should be part of the result, but anyway, even this change is easier without regex. One could argue that `split` included the last `[`, but remember that it failed with nested brackets (so you must check if this case is relevant or not).
To change the first regex in order to achieve the same result, you'll have to <strike>go through the gates of hell</strike> build a very complicated one that checks if there's a corresponding `[` in previous positions. My guess is that a [negative lookbehind](https://www.regular-expressions.info/lookaround.html) will be needed, which makes things not only more complex, but also less efficient (lookarounds add some overhead to the matching process, as they need to go back and forth through the string). I've tried with this:
```none
\[[^\]]*\]|(([^[\]]|\[(?!.*\])|(?!\[[^[\]]*)\])+)
```
And it _seems_ to work (although it still doesn't work with the nested brackets case), but check how complex and hard to understand it is. Definitely not worth it, IMO.
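If you want to compare the approaches on the malformed string yourself, a minimal sketch could look like the one below. The function name `outside_keep_unpaired` and the `results` dictionary are made up for this example; the logic is the same as the last version of `get_text_outside_brackets` above, and the two regexes are the ones from the first sections.

```python
import re

def outside_keep_unpaired(s):
    """Yield text outside brackets, keeping any "]" that has no matching "["."""
    brackets = 0
    current_token = ''
    for c in s:
        if c == '[':
            brackets += 1
            if current_token:  # text collected so far is outside brackets
                yield current_token
                current_token = ''
        elif c == ']':
            if brackets > 0:
                brackets -= 1
            else:  # unpaired "]": keep it in the current token
                current_token += c
        elif brackets == 0:
            current_token += c
    if current_token:
        yield current_token

s = 'malformed[ string ] ]]]]] what now? [ '

results = {
    # first regex: alternation plus capturing group
    'finditer': tuple(m.group(1)
                      for m in re.finditer(r'\[[^\]]*\]|([^[\]]+)', s)
                      if m.group(1)),
    # second regex: bracketed text used as the separator
    'split': tuple(filter(None, re.split(r'\[[^\]]*\]', s))),
    # character loop, keeping unpaired "]"
    'loop': tuple(outside_keep_unpaired(s)),
}
for name, groups in results.items():
    print(f'{name:.<10} -> {groups}')
```

Each approach gives a different answer for this input, so you still have to decide which behaviour is the "correct" one for your data before picking a solution.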

#7: Post edited by

hkotsubo‭ · 2024-06-18T14:52:18Z (6 months ago)

Copy Link

Raw

Markdown

> _Perhaps regex is not the best solution. Although it's possible, the expression will be so complicated that it won't be worth the trouble, IMO._
# But if you insist on using regex...
I'm afraid we can't get all the groups in a single step. You said that the strings can have many pairs of brackets, therefore we can't know in advance how many groups there will be. Instead of trying to guess, we'd better try to catch one group at a time, and then build our list with the desired matches.
You could do something like this:
```python
import re
r = re.compile(r'\[[^\]]*\]|([^[\]]+)')
# testing with lots of different strings
strings = ['testing_[_is_]_done', 'no brackets', '[only brackets]', '[a]b[c]d[e]', 'empty bracket: []', 'with [nested [brackets] will] not work']
for s in strings:
groups = tuple(match.group(1) for match in r.finditer(s) if match.group(1))
print(f'{s:.<40} -> {groups}')
```
The regex uses [alternation](https://www.regular-expressions.info/alternation.html) (the `|` character, which means "or"), therefore it'll match either one of the options. Let's analyze them separately.
The first option is `\[[^\]]*\]`:
- `\[` is a literal `[` character
- Then we have `[^\]]`. The first `[` and last `]` creates a [character class](https://www.regular-expressions.info/charclass.html), and the `^` after `[` means a negated character class. This means that it'll match anything that's **not** the list of characters inside the brackets. In this case, there's only one character, which is `]`, but I had to escape it (`\]`) because otherwise it'll be interpreted as the closing bracket of the character class.
- Anyway, `[^\]]` will match anything that's not `]`
- Then we have `*`, which means "[zero or more occurences](https://www.regular-expressions.info/repeat.html)". Therefore, `[^\]]*` means "zero or more characters that are not `]`"
- `\]` is a literal `]` character
So this first part matches a `[`, followed by zero or more characters that aren't `]`, followed by `]`. In another words, it matches any text inside brackets, including no text at all (it also matches `[]`).
The second option of the alternation is `([^[\]]+)`:
- the parenthesis create a [capturing group](https://www.regular-expressions.info/brackets.html)
- then we have a negated character class, very similar to the previous one, except that this one also includes the `[` character. Hence, this matches anything that's neither `[` nor `]`
- `+` means "one or more occurrences", so it won't match empty strings. Therefore `[^[\]]+` will match one or more characters, as long as they're not `[` or `]`
The purpose of this part is to match anything that's not inside brackets, and put in a capturing group. The negated character class guarantees that it'll stop as soon as it finds a `[` or `]`.
## How does this regex work?
The alternation matches one of the options: either a text inside brackets, or a text outside them. But only the latter will be in a capturing group: **if the former is matched, the group will be empty**, and that's how we know which option was found.
That's why I tested `if match.group(1)`. If the group is not empty, it means that the second option (text outside brackets) was matched. This way, we discard the matches that contain text inside brackets.
The output for the code above is:
```none
testing_[_is_]_done..................... -> ('testing_', '_done')
no brackets............................. -> ('no brackets',)
[only brackets]......................... -> ()
[a]b[c]d[e]............................. -> ('b', 'd')
empty bracket: []....................... -> ('empty bracket: ',)
with [nested [brackets] will] not work.. -> ('with ', ' will', ' not work')
```
You could also use `tuple(filter(None, re.split(r'\[[^\]]*\]', s)))`: the `split` method breaks the string in many parts, using the regex as a separator. In this case, the separator is "any text inside brackets" (using the same first part of the previous regex). The only problem is that `split` also [creates empty strings](https://stackoverflow.com/q/30924509) in the beginning and end of the list, so we have to filter them out.
**But note that it doesn't work for nested brackets**. Although it might be possible to build a regex to recognize nested patterns, it's not worth the trouble, IMO. You'll have to use [recursive patterns](https://github.com/mrabarnett/mrab-regex?tab=readme-ov-file#recursive-patterns-hg-issue-27) that are not supported by the native `re` module (so you'll have to [install one that supports it](https://pypi.org/project/regex/)).
And based on my experience, if you think you need a recursive regex, you're probably wrong. There are better and more straightforward solutions. Don't get me wrong, I like regex, it's a cool thing that can be helpful in many situations, but [they're not always the best solution](https://blog.codinghorror.com/regular-expressions-now-you-have-two-problems/).
# Solution without regex
One simple way is to loop through the characters of the string. If we find a `[`, just ignore everything until the respective `]` is found. Actually, to handle nested brackets, we ignore everything until the first `[` is closed. Everything else we add to our list, something like this:
```python
def get_text_outside_brackets(s):
brackets = 0
current_token = ''
for c in s:
if c == '[':
brackets += 1
if current_token: # if there's text outside brackets, return it
yield current_token
current_token = ''
elif c == ']':
if brackets > 0:
brackets -= 1
elif brackets == 0:
current_token += c
if current_token:
yield current_token
strings = ['testing_[_is_]_done', 'no brackets', '[only brackets]', '[a]b[c]d[e]', 'empty bracket: []', 'with [nested [brackets] will] not work']
for s in strings:
groups = tuple(get_text_outside_brackets(s))
print(f'{s:.<40} -> {groups}')
```
Now it'll ignore the nested brackets correctly:
```none
testing_[_is_]_done..................... -> ('testing_', '_done')
no brackets............................. -> ('no brackets',)
[only brackets]......................... -> ()
[a]b[c]d[e]............................. -> ('b', 'd')
empty bracket: []....................... -> ('empty bracket: ',)
with [nested [brackets] will] not work.. -> ('with ', ' not work')
```
And - IMHO - the code is very simple and easier to understand and change, if compared to the regex.
---
# Final considerations
It's not clear if you'll face the nested brackets scenario, or even malformed strings, such as `'ab[ c ] ]]]]] def'`. Even if that's not the case, you should analyze if the regex is too complicated to be worth maintaning. And if there are such strings, you should prefer the solution without regex.
For example, I tested with this string: `'malformed[ string ] ]]]]] what now? [ '`. The regex returned `('malformed', ' ', ' what now?', ' ')`, the other regex with `split` returned `('malformed', ' ]]]]] what now? [ ')`, and the last solution without regex returned `('malformed', ' what now?')`. Which one would be the correct in this case? Should all the `]`'s be part of the output, because they're not part of a pair (there's no corresponding `[`)?
If we want to include the `]`'s, it's easy with the last solution:
```python
def get_text_outside_brackets(s):
brackets = 0
current_token = ''
for c in s:
if c == '[':
brackets += 1
if current_token: # if there's text outside brackets, return it
yield current_token
current_token = ''
elif c == ']':
if brackets > 0:
brackets -= 1
else: # <--- not a pair, add "]" to the current token
current_token += c
elif brackets == 0:
current_token += c
if current_token:
yield current_token
```
And now the result will be `('malformed', ' ]]]]] what now? ')`. I admit that it's debatable if the last `[` should be part of the result, but anyway, even this change is easier without regex. One could argue that `split` included the last `[`, but remember that it failed with nested brackets (so you must check if this case is relevant or not).
To change the first regex in order to achieve the same result, you'll have to <strike>go through the gates of hell</strike> build a very complicated one, that checks if there's a corresponding `[` in previous positions. My guess is that a [negative lookbehind](https://www.regular-expressions.info/lookaround.html) will be needed, which makes things not only more complex, but also less efficient (lookarounds add some overhead to the matching process, as they need to go back and forth the string).
I've tried with `\[[^\]]*\]|(([^[\]]|\[(?!.*\])|(?!\[[^[\]]*)\])+)` and it _seems_ to work (although it doesn't work with the nested brackets case), but check how complex and hard to understand it is. Definitely not worth it, IMO.

> _Perhaps regex is not the best solution. Although it's possible, the expression will be so complicated that it won't be worth the trouble, IMO._
# But if you insist on using regex...
I'm afraid we can't get all the groups in a single step. You said that the strings can have many pairs of brackets, therefore we can't know in advance how many groups there will be. Instead of trying to guess, we'd better try to catch one group at a time, and then build our list with the desired matches.
You could do something like this:
```python
import re
r = re.compile(r'\[[^\]]*\]|([^[\]]+)')
# testing with lots of different strings
strings = ['testing_[_is_]_done', 'no brackets', '[only brackets]', '[a]b[c]d[e]', 'empty bracket: []', 'with [nested [brackets] will] not work']
for s in strings:
groups = tuple(match.group(1) for match in r.finditer(s) if match.group(1))
print(f'{s:.<40} -> {groups}')
```
The regex uses [alternation](https://www.regular-expressions.info/alternation.html) (the `|` character, which means "or"), therefore it'll match either one of the options. Let's analyze them separately.
The first option is `\[[^\]]*\]`:
- `\[` is a literal `[` character
- Then we have `[^\]]`. The first `[` and last `]` creates a [character class](https://www.regular-expressions.info/charclass.html), and the `^` after `[` means a negated character class. This means that it'll match anything that's **not** the list of characters inside the brackets. In this case, there's only one character, which is `]`, but I had to escape it (`\]`) because otherwise it'll be interpreted as the closing bracket of the character class.
- Anyway, `[^\]]` will match anything that's not `]`
- Then we have `*`, which means "[zero or more occurences](https://www.regular-expressions.info/repeat.html)". Therefore, `[^\]]*` means "zero or more characters that are not `]`"
- `\]` is a literal `]` character
So this first part matches a `[`, followed by zero or more characters that aren't `]`, followed by `]`. In another words, it matches any text inside brackets, including no text at all (it also matches `[]`).
The second option of the alternation is `([^[\]]+)`:
- the parenthesis create a [capturing group](https://www.regular-expressions.info/brackets.html)
- then we have a negated character class, very similar to the previous one, except that this one also includes the `[` character. Hence, this matches anything that's neither `[` nor `]`
- `+` means "one or more occurrences", so it won't match empty strings. Therefore `[^[\]]+` will match one or more characters, as long as they're not `[` or `]`
The purpose of this part is to match anything that's not inside brackets, and put in a capturing group. The negated character class guarantees that it'll stop as soon as it finds a `[` or `]`.
## How does this regex work?
The alternation matches one of the options: either a text inside brackets, or a text outside them. But only the latter will be in a capturing group: **if the former is matched, the group will be empty**, and that's how we know which option was found.
That's why I tested `if match.group(1)`. If the group is not empty, it means that the second option (text outside brackets) was matched. This way, we discard the matches that contain text inside brackets.
The output for the code above is:
```none
testing_[_is_]_done..................... -> ('testing_', '_done')
no brackets............................. -> ('no brackets',)
[only brackets]......................... -> ()
[a]b[c]d[e]............................. -> ('b', 'd')
empty bracket: []....................... -> ('empty bracket: ',)
with [nested [brackets] will] not work.. -> ('with ', ' will', ' not work')
```
You could also use `tuple(filter(None, re.split(r'\[[^\]]*\]', s)))`: the `split` method breaks the string in many parts, using the regex as a separator. In this case, the separator is "any text inside brackets" (using the same first part of the previous regex). The only problem is that `split` also [creates empty strings](https://stackoverflow.com/q/30924509) in the beginning and end of the list, so we have to filter them out.
**But note that it doesn't work for nested brackets**. Although it might be possible to build a regex to recognize nested patterns, it's not worth the trouble, IMO. You'll have to use [recursive patterns](https://github.com/mrabarnett/mrab-regex?tab=readme-ov-file#recursive-patterns-hg-issue-27) that are not supported by the native `re` module (so you'll have to [install one that supports it](https://pypi.org/project/regex/)).
And based on my experience, if you think you need a recursive regex, you're probably wrong. There are better and more straightforward solutions. Don't get me wrong, I like regex, it's a cool thing that can be helpful in many situations, but [it's not always the best solution](https://blog.codinghorror.com/regular-expressions-now-you-have-two-problems/).
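Going back to the `split` variant, this is a quick sketch of what it returns before and after filtering (the helper name `split_outside_brackets` is made up for this example). It also shows that, with nested brackets, `split` fails in a slightly different way than `finditer`:

```python
import re

# same "text inside brackets" expression used as the first option of the alternation
BRACKETED = re.compile(r'\[[^\]]*\]')

def split_outside_brackets(s):
    """Return the raw split result and the tuple without empty strings."""
    raw = BRACKETED.split(s)
    return raw, tuple(filter(None, raw))

for s in ['[a]b[c]d[e]', 'with [nested [brackets] will] not work']:
    raw, filtered = split_outside_brackets(s)
    print(f'{s!r}\n  raw:      {raw}\n  filtered: {filtered}')
```

For the nested string, the separator ends at the first `]`, so the leftover ` will] not work` comes back as a single piece.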
# Solution without regex
One simple way is to loop through the characters of the string. If we find a `[`, just ignore everything until the respective `]` is found. Actually, to handle nested brackets, we ignore everything until the first `[` is closed. Everything else we add to our list, something like this:
```python
def get_text_outside_brackets(s):
    brackets = 0
    current_token = ''
    for c in s:
        if c == '[':
            brackets += 1
            if current_token: # if there's text outside brackets, return it
                yield current_token
                current_token = ''
        elif c == ']':
            if brackets > 0:
                brackets -= 1
        elif brackets == 0:
            current_token += c
    if current_token:
        yield current_token

strings = ['testing_[_is_]_done', 'no brackets', '[only brackets]', '[a]b[c]d[e]', 'empty bracket: []', 'with [nested [brackets] will] not work']
for s in strings:
    groups = tuple(get_text_outside_brackets(s))
    print(f'{s:.<40} -> {groups}')
```
Now it'll ignore the nested brackets correctly:
```none
testing_[_is_]_done..................... -> ('testing_', '_done')
no brackets............................. -> ('no brackets',)
[only brackets]......................... -> ()
[a]b[c]d[e]............................. -> ('b', 'd')
empty bracket: []....................... -> ('empty bracket: ',)
with [nested [brackets] will] not work.. -> ('with ', ' not work')
```
And, IMHO, the code is simpler than the regex, and easier to understand and change.
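If you want to check where the two approaches disagree, a small comparison like the one below might help. It's only a sketch: `with_regex` and `without_regex` are names I chose here, and the second one is the same loop shown above, just returning a tuple instead of yielding.

```python
import re

PATTERN = re.compile(r'\[[^\]]*\]|([^[\]]+)')

def with_regex(s):
    # keep only the matches where the capturing group took part
    return tuple(m.group(1) for m in PATTERN.finditer(s) if m.group(1))

def without_regex(s):
    result = []
    brackets = 0  # how many "[" are currently open
    current_token = ''
    for c in s:
        if c == '[':
            brackets += 1
            if current_token:
                result.append(current_token)
                current_token = ''
        elif c == ']':
            if brackets > 0:
                brackets -= 1
        elif brackets == 0:
            current_token += c
    if current_token:
        result.append(current_token)
    return tuple(result)

strings = ['testing_[_is_]_done', 'no brackets', '[only brackets]', '[a]b[c]d[e]',
           'empty bracket: []', 'with [nested [brackets] will] not work']
# strings for which the two approaches return different results
different = [s for s in strings if with_regex(s) != without_regex(s)]
print(different)
```

For these test strings, only the one with nested brackets ends up in `different`.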
---
# Final considerations
It's not clear if you'll face the nested brackets scenario, or even malformed strings, such as `'ab[ c ] ]]]]] def'`. Even if that's not the case, you should analyze if the regex is too complicated to be worth maintaining. And if there are such strings, you should prefer the solution without regex.
For example, I tested with this string: `'malformed[ string ] ]]]]] what now? [ '`. The regex returned `('malformed', ' ', ' what now? ', ' ')`, the other regex with `split` returned `('malformed', ' ]]]]] what now? [ ')`, and the last solution without regex returned `('malformed', '  what now? ')`. Which one would be correct in this case? Should all the `]`'s be part of the output, because they're not part of a pair (there's no corresponding `[`)?
If we want to include the `]`'s, it's easy with the last solution:
```python
def get_text_outside_brackets(s):
    brackets = 0
    current_token = ''
    for c in s:
        if c == '[':
            brackets += 1
            if current_token: # if there's text outside brackets, return it
                yield current_token
                current_token = ''
        elif c == ']':
            if brackets > 0:
                brackets -= 1
            else: # <--- not a pair, add "]" to the current token
                current_token += c
        elif brackets == 0:
            current_token += c
    if current_token:
        yield current_token
```
And now the result will be `('malformed', ' ]]]]] what now? ')`. I admit that it's debatable if the last `[` should be part of the result, but anyway, even this change is easier without regex. One could argue that `split` included the last `[`, but remember that it failed with nested brackets (so you must check if this case is relevant or not).
To change the first regex in order to achieve the same result, you'll have to <strike>go through the gates of hell</strike> build a very complicated one, that checks if there's a corresponding `[` in previous positions. My guess is that a [negative lookbehind](https://www.regular-expressions.info/lookaround.html) will be needed, which makes things not only more complex, but also less efficient (lookarounds add some overhead to the matching process, as they need to go back and forth through the string). I've tried with this:
```none
\[[^\]]*\]|(([^[\]]|\[(?!.*\])|(?!\[[^[\]]*)\])+)
```
And it _seems_ to work (although it doesn't work with the nested brackets case), but check how complex and hard to understand it is. Definitely not worth it, IMO.
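For reference, this is how that expression could be plugged into the same `finditer` loop used before. It's just a sketch, and `outside_brackets` is a made-up helper name:

```python
import re

# the complicated expression above, unchanged
pattern = re.compile(r'\[[^\]]*\]|(([^[\]]|\[(?!.*\])|(?!\[[^[\]]*)\])+)')

def outside_brackets(s):
    # group 1 is only set when the "text outside brackets" option matched
    return tuple(m.group(1) for m in pattern.finditer(s) if m.group(1))

print(outside_brackets('malformed[ string ] ]]]]] what now? [ '))
```

With this input, the unpaired `]`'s stay in the second token. Note that, unlike the loop above, it also keeps the trailing `[ `.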

#6: Post edited by

hkotsubo‭ · 2024-06-14T17:03:17Z (6 months ago)



#5: Post edited by

hkotsubo‭ · 2024-06-14T15:13:24Z (6 months ago)


> _Perhaps regex is not the best solution. Although it's possible, the expression will be so complicated that it won't be worth the trouble, IMO._
# But if you insist on using regex...
I'm afraid we can't get all the groups in a single step. You said that the strings can have many pairs of brackets, therefore we can't know in advance how many groups there will be. Instead of trying to guess, we'd better try to catch one group at a time, and then build our list with the desired matches.
You could do something like this:
```python
import re
r = re.compile(r'\[[^\]]*\]|([^[\]]+)')
# testing with lots of different strings
strings = ['testing_[_is_]_done', 'no brackets', '[only brackets]', '[a]b[c]d[e]', 'empty bracket: []', 'with [nested [brackets] will] not work']
for s in strings:
groups = tuple(match.group(1) for match in r.finditer(s) if match.group(1))
print(f'{s:.<40} -> {groups}')
```
The regex uses [alternation](https://www.regular-expressions.info/alternation.html) (the `|` character, which means "or"), therefore it'll match either one of the options. Let's analyze them separately.
The first option is `\[[^\]]*\]`:
- `\[` is a literal `[` character
- Then we have `[^\]]`. The first `[` and last `]` creates a [character class](https://www.regular-expressions.info/charclass.html), and the `^` after `[` means a negated character class. This means that it'll match anything that's **not** the list of characters inside the brackets. In this case, there's only one character, which is `]`, but I had to escape it (`\]`) because otherwise it'll be interpreted as the closing bracket of the character class.
- Anyway, `[^\]]` will match anything that's not `]`
- Then we have `*`, which means "[zero or more occurences](https://www.regular-expressions.info/repeat.html)". Therefore, `[^\]]*` means "zero or more characters that are not `]`"
- `\]` is a literal `]` character
So this first part matches a `[`, followed by zero or more characters that aren't `]`, followed by `]`. In another words, it matches any text inside brackets, including no text at all (it also matches `[]`).
The second option of the alternation is `([^[\]]+)`:
- the parenthesis create a [capturing group](https://www.regular-expressions.info/brackets.html)
- then we have a negated character class, very similar to the previous one, except that this one also includes the `[` character. Hence, this matches anything that's neither `[` nor `]`
- `+` means "one or more occurrences", so it won't match empty strings. Therefore `[^[\]]+` will match one or more characters, as long as they're not `[` or `]`
The purpose of this part is to match anything that's not inside brackets, and put in a capturing group. The negated character class guarantees that it'll stop as soon as it finds a `[` or `]`.
## How does this regex work?
The alternation matches one of the options: either a text inside brackets, or a text outside them. But only the latter will be in a capturing group: **if the former is matched, the group will be empty**, and that's how we know which option was found.
That's why I tested `if match.group(1)`. If the group is not empty, it means that the second option (text outside brackets) was matched. This way, we discard the matches that contain text inside brackets.
The output for the code above is:
```none
testing_[_is_]_done..................... -> ('testing_', '_done')
no brackets............................. -> ('no brackets',)
[only brackets]......................... -> ()
[a]b[c]d[e]............................. -> ('b', 'd')
empty bracket: []....................... -> ('empty bracket: ',)
with [nested [brackets] will] not work.. -> ('with ', ' will', ' not work')
```
You could also use `tuple(filter(None, re.split(r'\[[^\]]*\]', s)))`: the `split` method breaks the string in many parts, using the regex as a separator. In this case, the separator is "any text inside brackets" (using the same first part of the previous regex). The only problem is that `split` also [creates empty strings](https://stackoverflow.com/q/30924509) in the beginning and end of the list, so we have to filter them out.
**But note that it doesn't work for nested brackets**. Although it might be possible to build a regex to recognize nested patterns, it's not worth the trouble, IMO. You'll have to use [recursive patterns](https://github.com/mrabarnett/mrab-regex?tab=readme-ov-file#recursive-patterns-hg-issue-27) that are not supported by the native `re` module (so you'll have to [install one that supports it](https://pypi.org/project/regex/)).
And based on my experience, if you think you need a recursive regex, you're probably wrong. There are better and more straightforward solutions. Don't get me wrong, I like regex, it's a cool thing that can be helpful in many situations, but [it's not always the best solution](https://blog.codinghorror.com/regular-expressions-now-you-have-two-problems/).
# Solution without regex
One simple way is to loop through the characters of the string. If we find a `[`, just ignore everything until the respective `]` is found. Actually, to handle nested brackets, we ignore everything until the first `[` is closed. Everything else we add to our list, something like this:
```python
def get_text_outside_brackets(s):
    brackets = 0  # how many "[" are currently open
    current_token = ''
    for c in s:
        if c == '[':
            brackets += 1
            if current_token: # if there's text outside brackets, return it
                yield current_token
                current_token = ''
        elif c == ']':
            if brackets > 0:
                brackets -= 1
        elif brackets == 0:  # only keep characters that are outside brackets
            current_token += c
    if current_token:
        yield current_token

strings = ['testing_[_is_]_done', 'no brackets', '[only brackets]', '[a]b[c]d[e]', 'empty bracket: []', 'with [nested [brackets] will] not work']
for s in strings:
    groups = tuple(get_text_outside_brackets(s))
    print(f'{s:.<40} -> {groups}')
```
Now it'll ignore the nested brackets correctly:
```none
testing_[_is_]_done..................... -> ('testing_', '_done')
no brackets............................. -> ('no brackets',)
[only brackets]......................... -> ()
[a]b[c]d[e]............................. -> ('b', 'd')
empty bracket: []....................... -> ('empty bracket: ',)
with [nested [brackets] will] not work.. -> ('with ', ' not work')
```
And - IMHO - the code is very simple and easier to understand and change, if compared to the regex.
---
# Final considerations
It's not clear if you'll face the nested brackets scenario, or even malformed strings, such as `'ab[ c ] ]]]]] def'`. Even if that's not the case, you should analyze if the regex is too complicated to be worth maintaining. And if there are such strings, you should prefer the solution without regex.
For example, I tested with this string: `'malformed[ string ] ]]]]] what now? [ '`. The regex returned `('malformed', ' ', ' what now? ', ' ')`, the other regex with `split` returned `('malformed', ' ]]]]] what now? [ ')`, and the last solution without regex returned `('malformed', '  what now? ')`. Which one would be correct in this case? Should all the `]`'s be part of the output, because they're not part of a pair (there's no corresponding `[`)?
If we want to include the `]`'s, it's easy with the last solution:
```python
def get_text_outside_brackets(s):
    brackets = 0
    current_token = ''
    for c in s:
        if c == '[':
            brackets += 1
            if current_token: # if there's text outside brackets, return it
                yield current_token
                current_token = ''
        elif c == ']':
            if brackets > 0:
                brackets -= 1
            else: # <--- not a pair, add "]" to the current token
                current_token += c
        elif brackets == 0:
            current_token += c
    if current_token:
        yield current_token
```
And now the result will be `('malformed', ' ]]]]] what now? ')`. I admit that it's debatable if the last `[` should be part of the result, but anyway, even this change is easier without regex. One could argue that `split` included the last `[`, but remember that it failed with nested brackets (so you must check if this case is relevant or not).
To change the first regex in order to achieve the same result, you'll have to <strike>go through the gates of hell</strike> build a very complicated one, that checks if there's a corresponding `[` in previous positions. My guess is that a [negative lookbehind](https://www.regular-expressions.info/lookaround.html) will be needed, which makes things not only more complex, but also less efficient (lookarounds add some overhead to the matching process, as they need to go back and forth through the string).
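Just to illustrate how awkward this gets (the patterns below are made-up attempts, not a working solution): Python's built-in `re` module only accepts fixed-width lookbehinds, so a naive "is there a `[` somewhere before this `]`?" check isn't even accepted:

```python
import re

def compiles(pattern):
    # True if the built-in re module accepts the pattern
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# hypothetical attempt: a "]" not preceded by "[" plus any number of non-"]" chars
naive = r'(?<!\[[^\]]*)\]'
# a fixed-width lookbehind is accepted, but it only looks one character back
fixed = r'(?<!\[)\]'

print(compiles(naive))  # variable-width lookbehind is rejected
print(compiles(fixed))
```

So with `re` alone, the lookbehind can't scan back an arbitrary distance, which is one more reason to prefer the loop.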

#4: Post edited by

hkotsubo‭ · 2024-06-14T14:51:30Z (6 months ago)



#3: Post edited by

hkotsubo‭ · 2024-06-14T14:49:56Z (6 months ago)



#2: Post edited by

hkotsubo‭ · 2024-06-14T14:44:22Z (6 months ago)


> _Perhaps regex is not the best solution. Although it's possible, the expression will be so complicated that it won't be worth the trouble, IMO._
# But if you insist on using regex...
I couldn't find a way to get all the groups in a single step. You said that the strings vary a lot, therefore we can't know in advance how many groups there will be. Instead of trying to guess, we'd better try to catch one group at a time, and then build our list with the desired matches.
You could do something like this:
```python
import re
r = re.compile(r'\[[^\]]*\]|([^[\]]+)')
# testing with lots of differenet strings
strings = ['testing_[_is_]_done', 'no brackets', '[only brackets]', '[a]b[c]d[e]', 'empty bracket: []', 'with [nested [brackets] will] not work']
for s in strings:
~~groups = [ match.group(1) for match in r.finditer(s) if match.group(1) ]~~
print(f'{s:.<40} -> {groups}')
```
The regex uses [alternation](https://www.regular-expressions.info/alternation.html) (the `|` character, which means "or"), therefore it'll match either one of the options. Let's analyze them separately.
The first option is `\[[^\]]*\]`:
- `\[` is a literal `[` character
- Then we have `[^\]]`. The first `[` and last `]` create a [character class](https://www.regular-expressions.info/charclass.html), and the `^` after `[` means a negated character class. This means that it'll match anything that's **not** in the list of characters inside the brackets. In this case, there's only one character, which is `]`, but I had to escape it (`\]`) because otherwise it would be interpreted as the closing bracket of the character class.
    - Anyway, `[^\]]` will match anything that's not `]`
- Then we have `*`, which means "[zero or more occurrences](https://www.regular-expressions.info/repeat.html)". Therefore, `[^\]]*` means "zero or more characters that are not `]`"
- `\]` is a literal `]` character
So this first part matches a `[`, followed by zero or more characters that aren't `]`, followed by `]`. In other words, it matches any text inside brackets.
The second option of the alternation is `([^[\]]+)`:
- the parentheses create a [capturing group](https://www.regular-expressions.info/brackets.html)
- then we have a negated character class, very similar to the previous one, except that this one also includes the `[` character. Hence, this matches anything that's neither `[` nor `]`
- `+` means "one or more occurrences", so it won't match empty strings. Therefore `[^[\]]+` will match one or more characters, as long as they're not `[` or `]`
The purpose of this part is to match anything that's not inside brackets. The negated character class guarantees that it'll stop as soon as it finds a `[` or `]`.
## How does this regex work?
The alternation matches one of the options: either text inside brackets, or text outside them. But only the latter will be in a capturing group: **if the former is matched, the group will be empty** (`None` in Python), and that's how we know which one was found. That's why I tested `if match.group(1)`, to make sure that we discard the matches that contain text inside brackets.
The output for the code above is:
```none
~~testing_[_is_]_done..................... -> ['testing_', '_done']~~
~~no brackets............................. -> ['no brackets']~~
~~[only brackets]......................... -> []~~
~~[a]b[c]d[e]............................. -> ['b', 'd']~~
~~empty bracket: []....................... -> ['empty bracket: ']~~
~~with [nested [brackets] will] not work.. -> ['with ', ' will', ' not work']~~
```
You could also use `list(filter(None, re.split(r'\[[^\]]*\]', s)))`: the `split` method breaks the string in many parts, using the regex as a separator. In this case, the separator is "any text inside brackets" (using the same first part of the previous regex). The only problem is that `split` also [creates empty strings](https://stackoverflow.com/q/30924509) in the beginning and end of the list, so we have to filter them out.
**But note that it doesn't work for nested brackets**. Although it might be possible to build a regex to recognize nested patterns, it's not worth the trouble, IMO. You'll have to use [recursive patterns](https://github.com/mrabarnett/mrab-regex?tab=readme-ov-file#recursive-patterns-hg-issue-27) that are not supported by the native `re` module (so you'll have to [install one that supports it](https://pypi.org/project/regex/)).
And based on my experience, if you think you need a recursive regex, you're probably wrong. There are better and more straightforward solutions. Don't get me wrong, I like regex, it's a cool thing that can be helpful in many situations, but [they're not always the best solution](https://blog.codinghorror.com/regular-expressions-now-you-have-two-problems/).
# Solution without regex
One simple way is to loop through the characters of the string. If we find a `[`, just ignore everything until the respective `]` is found. Actually, to handle nested brackets, we ignore everything until the first `[` is closed. Everything else we add to our list, something like this:
```python
def get_text_outside_brackets(s):
    brackets = 0
    current_token = ''
    for c in s:
        if c == '[':
            brackets += 1
            if current_token: # if there's text outside brackets, return it
                yield current_token
                current_token = ''
        elif c == ']':
            if brackets > 0:
                brackets -= 1
        elif brackets == 0:
            current_token += c
    if current_token:
        yield current_token

strings = ['testing_[_is_]_done', 'no brackets', '[only brackets]', '[a]b[c]d[e]', 'empty bracket: []', 'with [nested [brackets] will] not work']
for s in strings:
    ~~groups = list(get_text_outside_brackets(s))~~
    print(f'{s:.<40} -> {groups}')
```
Now it'll ignore the nested brackets correctly:
```none
~~testing_[_is_]_done..................... -> ['testing_', '_done']~~
~~no brackets............................. -> ['no brackets']~~
~~[only brackets]......................... -> []~~
~~[a]b[c]d[e]............................. -> ['b', 'd']~~
~~empty bracket: []....................... -> ['empty bracket: ']~~
~~with [nested [brackets] will] not work.. -> ['with ', ' not work']~~
```
And - IMHO - the code is very simple and easier to understand and change, if compared to the regex.

> _Perhaps regex is not the best solution. Although it's possible, the expression will be so complicated that it won't be worth the trouble, IMO._
# But if you insist on using regex...
I couldn't find a way to get all the groups in a single step. You said that the strings vary a lot, therefore we can't know in advance how many groups there will be. Instead of trying to guess, we'd better try to catch one group at a time, and then build our list with the desired matches.
You could do something like this:
```python
import re
r = re.compile(r'\[[^\]]*\]|([^[\]]+)')
# testing with lots of different strings
strings = ['testing_[_is_]_done', 'no brackets', '[only brackets]', '[a]b[c]d[e]', 'empty bracket: []', 'with [nested [brackets] will] not work']
for s in strings:
    groups = tuple(match.group(1) for match in r.finditer(s) if match.group(1))
    print(f'{s:.<40} -> {groups}')
```
The regex uses [alternation](https://www.regular-expressions.info/alternation.html) (the `|` character, which means "or"), therefore it'll match either one of the options. Let's analyze them separately.
The first option is `\[[^\]]*\]`:
- `\[` is a literal `[` character
- Then we have `[^\]]`. The first `[` and last `]` create a [character class](https://www.regular-expressions.info/charclass.html), and the `^` after `[` means a negated character class. This means that it'll match anything that's **not** in the list of characters inside the brackets. In this case, there's only one character, which is `]`, but I had to escape it (`\]`) because otherwise it would be interpreted as the closing bracket of the character class.
    - Anyway, `[^\]]` will match anything that's not `]`
- Then we have `*`, which means "[zero or more occurrences](https://www.regular-expressions.info/repeat.html)". Therefore, `[^\]]*` means "zero or more characters that are not `]`"
- `\]` is a literal `]` character
So this first part matches a `[`, followed by zero or more characters that aren't `]`, followed by `]`. In other words, it matches any text inside brackets.
The second option of the alternation is `([^[\]]+)`:
- the parentheses create a [capturing group](https://www.regular-expressions.info/brackets.html)
- then we have a negated character class, very similar to the previous one, except that this one also includes the `[` character. Hence, this matches anything that's neither `[` nor `]`
- `+` means "one or more occurrences", so it won't match empty strings. Therefore `[^[\]]+` will match one or more characters, as long as they're not `[` or `]`
The purpose of this part is to match anything that's not inside brackets. The negated character class guarantees that it'll stop as soon as it finds a `[` or `]`.
## How does this regex work?
The alternation matches one of the options: either text inside brackets, or text outside them. But only the latter will be in a capturing group: **if the former is matched, the group will be empty** (`None` in Python), and that's how we know which one was found. That's why I tested `if match.group(1)`, to make sure that we discard the matches that contain text inside brackets.
The output for the code above is:
```none
testing_[_is_]_done..................... -> ('testing_', '_done')
no brackets............................. -> ('no brackets',)
[only brackets]......................... -> ()
[a]b[c]d[e]............................. -> ('b', 'd')
empty bracket: []....................... -> ('empty bracket: ',)
with [nested [brackets] will] not work.. -> ('with ', ' will', ' not work')
```
You could also use `tuple(filter(None, re.split(r'\[[^\]]*\]', s)))`: the `split` method breaks the string in many parts, using the regex as a separator. In this case, the separator is "any text inside brackets" (using the same first part of the previous regex). The only problem is that `split` also [creates empty strings](https://stackoverflow.com/q/30924509) in the beginning and end of the list, so we have to filter them out.
**But note that it doesn't work for nested brackets**. Although it might be possible to build a regex to recognize nested patterns, it's not worth the trouble, IMO. You'll have to use [recursive patterns](https://github.com/mrabarnett/mrab-regex?tab=readme-ov-file#recursive-patterns-hg-issue-27) that are not supported by the native `re` module (so you'll have to [install one that supports it](https://pypi.org/project/regex/)).
And based on my experience, if you think you need a recursive regex, you're probably wrong. There are better and more straightforward solutions. Don't get me wrong, I like regex, it's a cool thing that can be helpful in many situations, but [they're not always the best solution](https://blog.codinghorror.com/regular-expressions-now-you-have-two-problems/).
# Solution without regex
One simple way is to loop through the characters of the string. If we find a `[`, just ignore everything until the respective `]` is found. Actually, to handle nested brackets, we ignore everything until the first `[` is closed. Everything else we add to our list, something like this:
```python
def get_text_outside_brackets(s):
    brackets = 0
    current_token = ''
    for c in s:
        if c == '[':
            brackets += 1
            if current_token: # if there's text outside brackets, return it
                yield current_token
                current_token = ''
        elif c == ']':
            if brackets > 0:
                brackets -= 1
        elif brackets == 0:
            current_token += c
    if current_token:
        yield current_token

strings = ['testing_[_is_]_done', 'no brackets', '[only brackets]', '[a]b[c]d[e]', 'empty bracket: []', 'with [nested [brackets] will] not work']
for s in strings:
    groups = tuple(get_text_outside_brackets(s))
    print(f'{s:.<40} -> {groups}')
```
Now it'll ignore the nested brackets correctly:
```none
testing_[_is_]_done..................... -> ('testing_', '_done')
no brackets............................. -> ('no brackets',)
[only brackets]......................... -> ()
[a]b[c]d[e]............................. -> ('b', 'd')
empty bracket: []....................... -> ('empty bracket: ',)
with [nested [brackets] will] not work.. -> ('with ', ' not work')
```
And - IMHO - the code is very simple and easier to understand and change, if compared to the regex.

#1: Initial revision by

hkotsubo‭ · 2024-06-14T14:37:02Z (6 months ago)

> _Perhaps regex is not the best solution. Although it's possible, the expression will be so complicated that it won't be worth the trouble, IMO._

# But if you insist on using regex...

I couldn't find a way to get all the groups in a single step. You said that the strings vary a lot, therefore we can't know in advance how many groups there will be. Instead of trying to guess, we'd better try to catch one group at a time, and then build our list with the desired matches.

You could do something like this:

```python
import re

r = re.compile(r'\[[^\]]*\]|([^[\]]+)')
# testing with lots of different strings
strings = ['testing_[_is_]_done', 'no brackets', '[only brackets]', '[a]b[c]d[e]', 'empty bracket: []', 'with [nested [brackets] will] not work']
for s in strings:
    groups = [ match.group(1) for match in r.finditer(s) if match.group(1) ]
    print(f'{s:.<40} -> {groups}')
```

The regex uses [alternation](https://www.regular-expressions.info/alternation.html) (the `|` character, which means "or"), therefore it'll match either one of the options. Let's analyze them separately.

The first option is `\[[^\]]*\]`:

- `\[` is a literal `[` character
- Then we have `[^\]]`. The first `[` and last `]` create a [character class](https://www.regular-expressions.info/charclass.html), and the `^` after `[` means a negated character class. This means that it'll match anything that's **not** in the list of characters inside the brackets. In this case, there's only one character, which is `]`, but I had to escape it (`\]`) because otherwise it would be interpreted as the closing bracket of the character class.
    - Anyway, `[^\]]` will match anything that's not `]`
- Then we have `*`, which means "[zero or more occurrences](https://www.regular-expressions.info/repeat.html)". Therefore, `[^\]]*` means "zero or more characters that are not `]`"
- `\]` is a literal `]` character

So this first part matches a `[`, followed by zero or more characters that aren't `]`, followed by `]`. In other words, it matches any text inside brackets.

The second option of the alternation is `([^[\]]+)`:

- the parentheses create a [capturing group](https://www.regular-expressions.info/brackets.html)
- then we have a negated character class, very similar to the previous one, except that this one also includes the `[` character. Hence, this matches anything that's neither `[` nor `]`
- `+` means "one or more occurrences", so it won't match empty strings. Therefore `[^[\]]+` will match one or more characters, as long as they're not `[` or `]`

The purpose of this part is to match anything that's not inside brackets. The negated character class guarantees that it'll stop as soon as it finds a `[` or `]`.
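
To see what each option matches on its own, here's a small sketch. The names `inside` and `outside` are mine, just for illustration, and the second pattern is used here without the capturing group:

```python
import re

inside = re.compile(r'\[[^\]]*\]')  # first option: text inside brackets
outside = re.compile(r'[^[\]]+')    # second option, without the capturing group

sample = 'testing_[_is_]_done'
inside_matches = inside.findall(sample)
outside_matches = outside.findall(sample)
print(inside_matches)   # ['[_is_]']
print(outside_matches)  # ['testing_', '_is_', '_done']
```

Note that the second option alone also picks up `_is_`, because it only knows "not a bracket character". That's why the full regex needs the first option as well: it consumes the whole bracketed text before the second option gets a chance to see what's inside.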

## How does this regex work?

The alternation matches one of the options: either text inside brackets, or text outside them. But only the latter will be in a capturing group: **if the former is matched, the group will be empty** (`None` in Python), and that's how we know which one was found. That's why I tested `if match.group(1)`, to make sure that we discard the matches that contain text inside brackets.

The output for the code above is:

```none
testing_[_is_]_done..................... -> ['testing_', '_done']
no brackets............................. -> ['no brackets']
[only brackets]......................... -> []
[a]b[c]d[e]............................. -> ['b', 'd']
empty bracket: []....................... -> ['empty bracket: ']
with [nested [brackets] will] not work.. -> ['with ', ' will', ' not work']
```
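
To make the filtering step more visible, this sketch lists every match of the same regex together with the content of group 1. In Python, a group that didn't take part in the match is returned as `None`. The variable name `pairs` is just for illustration:

```python
import re

r = re.compile(r'\[[^\]]*\]|([^[\]]+)')

# for each match: (the whole matched text, the content of group 1)
pairs = [(m.group(0), m.group(1)) for m in r.finditer('[a]b[c]d[e]')]
print(pairs)
# [('[a]', None), ('b', 'b'), ('[c]', None), ('d', 'd'), ('[e]', None)]
```

The bracketed parts are matched too (otherwise their content would leak into the results), but their group 1 is `None`, so the `if match.group(1)` test drops them.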

You could also use `list(filter(None, re.split(r'\[[^\]]*\]', s)))`: the `split` method breaks the string in many parts, using the regex as a separator. In this case, the separator is "any text inside brackets" (using the same first part of the previous regex). The only problem is that `split` also [creates empty strings](https://stackoverflow.com/q/30924509) in the beginning and end of the list, so we have to filter them out.
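
A quick sketch of what `split` returns before and after filtering (the variable names and the second test string are mine, only for illustration):

```python
import re

separator = re.compile(r'\[[^\]]*\]')

raw_parts = separator.split('[a]b[c]d[e]')
parts = list(filter(None, raw_parts))
print(raw_parts)  # ['', 'b', 'd', '']
print(parts)      # ['b', 'd']

# two bracketed parts side by side also leave an empty string between them
adjacent = separator.split('x[a][b]y')
print(adjacent)   # ['x', '', 'y']
```

So the empty strings aren't limited to the edges of the list: they show up wherever two separators touch, and `filter(None, ...)` removes all of them.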

**But note that it doesn't work for nested brackets**. Although it might be possible to build a regex to recognize nested patterns, it's not worth the trouble, IMO. You'll have to use [recursive patterns](https://github.com/mrabarnett/mrab-regex?tab=readme-ov-file#recursive-patterns-hg-issue-27) that are not supported by the native `re` module (so you'll have to [install one that supports it](https://pypi.org/project/regex/)).

And based on my experience, if you think you need a recursive regex, you're probably wrong. There are better and more straightforward solutions. Don't get me wrong, I like regex, it's a cool thing that can be helpful in many situations, but [they're not always the best solution](https://blog.codinghorror.com/regular-expressions-now-you-have-two-problems/).

# Solution without regex

One simple way is to loop through the characters of the string. If we find a `[`, just ignore everything until the respective `]` is found. Actually, to handle nested brackets, we ignore everything until the first `[` is closed. Everything else we add to our list, something like this:

```python
def get_text_outside_brackets(s):
    brackets = 0
    current_token = ''
    for c in s:
        if c == '[':
            brackets += 1
            if current_token: # if there's text outside brackets, return it
                yield current_token
                current_token = ''
        elif c == ']':
            if brackets > 0:
                brackets -= 1
        elif brackets == 0:
            current_token += c
    if current_token:
        yield current_token

strings = ['testing_[_is_]_done', 'no brackets', '[only brackets]', '[a]b[c]d[e]', 'empty bracket: []', 'with [nested [brackets] will] not work']
for s in strings:
    groups = list(get_text_outside_brackets(s))
    print(f'{s:.<40} -> {groups}')
```

Now it'll ignore the nested brackets correctly:

```none
testing_[_is_]_done..................... -> ['testing_', '_done']
no brackets............................. -> ['no brackets']
[only brackets]......................... -> []
[a]b[c]d[e]............................. -> ['b', 'd']
empty bracket: []....................... -> ['empty bracket: ']
with [nested [brackets] will] not work.. -> ['with ', ' not work']
```

And - IMHO - the code is very simple and easier to understand and change, if compared to the regex.
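
One thing the tests above don't cover is unbalanced brackets. As a sketch, reusing the same function with a few strings I made up, this is how it behaves in those cases:

```python
def get_text_outside_brackets(s):
    brackets = 0
    current_token = ''
    for c in s:
        if c == '[':
            brackets += 1
            if current_token:  # text outside brackets ends here
                yield current_token
                current_token = ''
        elif c == ']':
            if brackets > 0:
                brackets -= 1
        elif brackets == 0:
            current_token += c
    if current_token:
        yield current_token

# made-up inputs with unbalanced brackets
unbalanced = ['a]b', 'a[b', 'a[b]]c']
results = {s: list(get_text_outside_brackets(s)) for s in unbalanced}
for s, groups in results.items():
    print(f'{s:.<10} -> {groups}')
# a]b....... -> ['ab']
# a[b....... -> ['a']
# a[b]]c.... -> ['a', 'c']
```

A stray `]` is silently dropped, and an unclosed `[` hides everything after it. If that's not what you want, the function could raise an error in those cases instead, but that's a design decision that depends on your data.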
