Welcome to Software Development on Codidact!
Regex to get text outside brackets
I am trying to capture the content outside square brackets in groups, using this regex:
(.*)\[.*?\](.*)
And it works perfectly for a simple string like this:
testing_[_is_]_done
This is the sample script I am using:
import re
groups = re.match(r"(.*)\[.*?\](.*)", "testing_[_is_]_done").groups()
print(groups)
And this is the output I am getting:
('testing_', '_done')
But some strings contain multiple pairs of square brackets, and I want to capture everything outside the brackets into groups. I have not been able to come up with a regex that does this.
This is the example:
testing_[_is_]_done_([but need to])_handle_[this]_scenario_as_well
And ideally, I want to capture everything outside of brackets in groups, like this:
('testing_', '_done_(', ')_handle_', '_scenario_as_well')
Since there can be any number of bracket pairs, I am looking for a regex that handles them dynamically, but so far I have not found one.
2 answers
First, we need to define some semantics. While it may not matter for your actual inputs, I propose that it should be valid for the output elements (the parts of the text found outside of brackets) to be empty strings. For example, if two bracketed parts are adjacent, like one[two][three]four, then a result ('one', '', 'four') makes more sense than ('one', 'four'), because it allows us to know that there were two distinct bracketed parts. Similarly, by distinguishing ('outside', '') from ('', 'outside'), we can see whether a bracketed part appeared before the outside text or after.
Aside from this, it's important to understand that classically, regex cannot handle arbitrarily nested brackets (whatever symbols are used for the open and closing "bracket"). This is a theoretical limitation (see Somnath Musib's answer on Stack Overflow). There are many variants on the original idea of regular expressions that all call themselves "regex"; some of them have extensions that make it possible to match balanced brackets, but Python's does not. However, the third-party regex package adds such support.
Using re.split
The most natural way to solve the problem is to let Python's regex library do some of the work, instead of expecting the regex itself to do everything. The split function (or method of compiled patterns) works much like the .split method of strings, except that the delimiter is whatever a regular expression matches:
Help on function split in module re:
split(pattern, string, maxsplit=0, flags=0)
    Split the source string by the occurrences of the pattern,
    returning a list containing the resulting substrings. If
    capturing parentheses are used in pattern, then the text of all
    groups in the pattern are also returned as part of the resulting
    list. If maxsplit is nonzero, at most maxsplit splits occur,
    and the remainder of the string is returned as the final element
    of the list.
We don't want to include anything from within the brackets, so we should be careful not to use any capturing groups in the pattern. Fortunately, our delimiter pattern is quite simple (note the raw string, which keeps Python from treating \[ as an invalid string escape):
>>> bracketed = re.compile(r'\[.*?\]')
>>> bracketed.split('testing_[_is_]_done_([but need to])_handle_[this]_scenario_as_well')
['testing_', '_done_(', ')_handle_', '_scenario_as_well']
If it were necessary to group parts of the regex without capturing them, that is what the appropriately named non-capturing groups, written (?:...), are for.
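To make the difference concrete, here is a small sketch (the sample string and variable names are just for illustration) that splits the same text with the delimiter wrapped in a capturing group and in a non-capturing group:

```python
import re

text = 'one[two]three'

# Capturing group around the delimiter: re.split keeps the matched
# delimiter text in the result, which is not what we want here.
with_capture = re.split(r'(\[.*?\])', text)

# Same pattern inside a non-capturing group: the delimiter is dropped.
without_capture = re.split(r'(?:\[.*?\])', text)

print(with_capture)
print(without_capture)
```

Only the second call leaves the bracketed text out of the list.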
We can see how this handles "empty" parts between the brackets as I suggested at the start:
>>> bracketed.split('one[two][three]four')
['one', '', 'four']
>>> bracketed.split('example[]')
['example', '']
>>> bracketed.split('[]example')
['', 'example']
>>> bracketed.split('[]')
['', '']
>>> bracketed.split('')
['']
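If you want a tuple like the one in the question, one option is to wrap the split in a small helper. This is only a sketch; the names BRACKETED and text_outside_brackets are made up for the example:

```python
import re

# Same delimiter pattern as above, compiled once (hypothetical constant name).
BRACKETED = re.compile(r'\[.*?\]')

def text_outside_brackets(s):
    """Return the parts of s found outside [...] as a tuple, keeping empty parts."""
    return tuple(BRACKETED.split(s))

example = 'testing_[_is_]_done_([but need to])_handle_[this]_scenario_as_well'
print(text_outside_brackets(example))
```

Empty parts are deliberately kept, so adjacent bracketed sections stay distinguishable.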
The following users marked this post as Works for me:
User | Comment | Date |
---|---|---|
TonyMontana | (no comment) | Jun 14, 2024 at 16:30 |
Perhaps regex is not the best solution. Although it's possible, the expression will be so complicated that it won't be worth the trouble, IMO.
But if you insist on using regex...
I'm afraid we can't get all the groups in a single step. You said that the strings can have many pairs of brackets, therefore we can't know in advance how many groups there will be. Instead of trying to guess, we'd better try to catch one group at a time, and then build our list with the desired matches.
You could do something like this:
import re
r = re.compile(r'\[[^\]]*\]|([^[\]]+)')
# testing with lots of different strings
strings = ['testing_[_is_]_done', 'no brackets', '[only brackets]', '[a]b[c]d[e]', 'empty bracket: []', 'with [nested [brackets] will] not work']
for s in strings:
    groups = tuple(match.group(1) for match in r.finditer(s) if match.group(1))
    print(f'{s:.<40} -> {groups}')
The regex uses alternation (the | character, which means "or"), therefore it'll match either one of the options. Let's analyze them separately.
The first option is \[[^\]]*\]:
- \[ is a literal [ character
- Then we have [^\]]. The first [ and last ] create a character class, and the ^ after [ means a negated character class. This means that it'll match anything that's not in the list of characters inside the brackets. In this case, there's only one character, which is ], and I escaped it (\]) so that it isn't taken as the closing bracket of the character class. (Python's re would also accept an unescaped ] placed right after the ^, but the escaped form is clearer and works in other regex flavors too.)
  - Anyway, [^\]] will match anything that's not ]
- Then we have *, which means "zero or more occurrences". Therefore, [^\]]* means "zero or more characters that are not ]"
- \] is a literal ] character
So this first part matches a [, followed by zero or more characters that aren't ], followed by ]. In other words, it matches any text inside brackets, including no text at all (it also matches []).
The second option of the alternation is ([^[\]]+):
- the parentheses create a capturing group
- then we have a negated character class, very similar to the previous one, except that this one also includes the [ character. Hence, this matches anything that's neither [ nor ]
- + means "one or more occurrences", so it won't match empty strings. Therefore [^[\]]+ will match one or more characters, as long as they're not [ or ]
The purpose of this part is to match anything that's not inside brackets, and put it in a capturing group. The negated character class guarantees that it'll stop as soon as it finds a [ or ].
How does this regex work?
The alternation matches one of the options: either text inside brackets, or text outside them. But only the latter is in a capturing group: if the former is matched, the group doesn't take part in the match and match.group(1) is None, and that's how we know which option was found.
That's why I tested if match.group(1). If the group is not None (and it can't be an empty string, because of the +), it means that the second option (text outside brackets) was matched. This way, we discard the matches that contain text inside brackets.
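To see this in action, here is a quick sketch that records, for every match, the whole matched text and what ended up in group 1. The input 'a[b]c' is just a sample I picked:

```python
import re

r = re.compile(r'\[[^\]]*\]|([^[\]]+)')

# For each match, keep the whole matched text and the content of group 1.
matches = [(m.group(0), m.group(1)) for m in r.finditer('a[b]c')]

for whole, outside in matches:
    print(f'{whole!r:>6} -> group 1 is {outside!r}')
```

The bracketed match is the only one whose group 1 is None, so filtering on the group keeps just the outside text.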
The output for the code above is:
testing_[_is_]_done..................... -> ('testing_', '_done')
no brackets............................. -> ('no brackets',)
[only brackets]......................... -> ()
[a]b[c]d[e]............................. -> ('b', 'd')
empty bracket: []....................... -> ('empty bracket: ',)
with [nested [brackets] will] not work.. -> ('with ', ' will', ' not work')
You could also use tuple(filter(None, re.split(r'\[[^\]]*\]', s))): the split function breaks the string into parts, using the regex as a separator. In this case, the separator is "any text inside brackets" (using the same first part of the previous regex). The only problem is that split also produces empty strings when the input starts or ends with a bracketed part (or has two adjacent ones), so we have to filter them out.
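A short sketch of what the filtering does, using one of the test strings from above (the variable names are only for illustration):

```python
import re

s = '[a]b[c]d[e]'

# Raw result of the split: note the empty strings at both ends.
parts = re.split(r'\[[^\]]*\]', s)

# filter(None, ...) drops every falsy item, i.e. the empty strings.
filtered = tuple(filter(None, parts))

print(parts)
print(filtered)
```

Keep in mind that this also drops empty strings in the middle, so adjacent bracketed parts can no longer be told apart.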
But note that it doesn't work for nested brackets. Although it might be possible to build a regex to recognize nested patterns, it's not worth the trouble, IMO. You'll have to use recursive patterns, which are not supported by the native re module (so you'll have to install a third-party module that supports them).
And based on my experience, if you think you need a recursive regex, you're probably wrong. There are better and more straightforward solutions. Don't get me wrong, I like regex, it's a cool thing that can be helpful in many situations, but they're not always the best solution.
Solution without regex
One simple way is to loop through the characters of the string. If we find a [, just ignore everything until the respective ] is found. Actually, to handle nested brackets, we ignore everything until the first [ is closed. Everything else we add to our list, something like this:
def get_text_outside_brackets(s):
    brackets = 0
    current_token = ''
    for c in s:
        if c == '[':
            brackets += 1
            if current_token: # if there's text outside brackets, return it
                yield current_token
            current_token = ''
        elif c == ']':
            if brackets > 0:
                brackets -= 1
        elif brackets == 0:
            current_token += c
    if current_token:
        yield current_token
strings = ['testing_[_is_]_done', 'no brackets', '[only brackets]', '[a]b[c]d[e]', 'empty bracket: []', 'with [nested [brackets] will] not work']
for s in strings:
    groups = tuple(get_text_outside_brackets(s))
    print(f'{s:.<40} -> {groups}')
Now it'll ignore the nested brackets correctly:
testing_[_is_]_done..................... -> ('testing_', '_done')
no brackets............................. -> ('no brackets',)
[only brackets]......................... -> ()
[a]b[c]d[e]............................. -> ('b', 'd')
empty bracket: []....................... -> ('empty bracket: ',)
with [nested [brackets] will] not work.. -> ('with ', ' not work')
And, IMHO, the code is simpler and easier to understand and change than the regex.
Final considerations
It's not clear if you'll face the nested brackets scenario, or even malformed strings, such as 'ab[ c ] ]]]]] def'. Even if that's not the case, you should analyze if the regex is too complicated to be worth maintaining. And if there are such strings, you should prefer the solution without regex.
For example, I tested with this string: 'malformed[ string ] ]]]]] what now? [ '. The regex returned ('malformed', ' ', ' what now? ', ' '), the other regex with split returned ('malformed', ' ]]]]] what now? [ '), and the last solution without regex returned ('malformed', '  what now? ') (the spaces on either side of the dropped ]'s end up in the same token). Which one would be correct in this case? Should all the ]'s be part of the output, because they're not part of a pair (there's no corresponding [)?
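If you want to reproduce the comparison, here is a sketch that runs the three approaches on that string. The wrapper names regex_finditer, regex_split and char_loop are made up for this example; char_loop uses the same logic as get_text_outside_brackets, just collecting into a tuple:

```python
import re

def regex_finditer(s):
    # First regex: keep only the matches where group 1 (outside text) took part.
    r = re.compile(r'\[[^\]]*\]|([^[\]]+)')
    return tuple(m.group(1) for m in r.finditer(s) if m.group(1))

def regex_split(s):
    # Split on bracketed text and drop the empty strings.
    return tuple(filter(None, re.split(r'\[[^\]]*\]', s)))

def char_loop(s):
    # Character loop that tracks the bracket depth; unmatched ] are dropped.
    brackets = 0
    token = ''
    result = []
    for c in s:
        if c == '[':
            brackets += 1
            if token:
                result.append(token)
            token = ''
        elif c == ']':
            if brackets > 0:
                brackets -= 1
        elif brackets == 0:
            token += c
    if token:
        result.append(token)
    return tuple(result)

s = 'malformed[ string ] ]]]]] what now? [ '
results = {f.__name__: f(s) for f in (regex_finditer, regex_split, char_loop)}
for name, groups in results.items():
    print(f'{name:.<20} -> {groups!r}')
```

Printing with repr makes the whitespace in each piece visible, which matters when comparing the three results.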
If we want to include the ]'s, it's easy with the last solution:
def get_text_outside_brackets(s):
    brackets = 0
    current_token = ''
    for c in s:
        if c == '[':
            brackets += 1
            if current_token: # if there's text outside brackets, return it
                yield current_token
            current_token = ''
        elif c == ']':
            if brackets > 0:
                brackets -= 1
            else: # <--- not a pair, add "]" to the current token
                current_token += c
        elif brackets == 0:
            current_token += c
    if current_token:
        yield current_token
And now the result will be ('malformed', ' ]]]]] what now? '). I admit that it's debatable if the last [ should be part of the result, but anyway, even this change is easier without regex. One could argue that split included the last [, but remember that it failed with nested brackets (so you must check if this case is relevant or not).
To change the first regex in order to achieve the same result, you'll have to build a very complicated one, that checks if there's a corresponding [ in previous positions. My guess is that a negative lookbehind will be needed, which makes things not only more complex, but also less efficient (lookarounds add some overhead to the matching process, as they need to go back and forth over the string). I've tried with this:
\[[^\]]*\]|(([^[\]]|\[(?!.*\])|(?!\[[^[\]]*)\])+)
And it seems to work (although it doesn't work with the nested brackets case), but check how complex and hard to understand it is. Definitely not worth it, IMO.