
Regex to get text outside brackets

+2
−0

I am trying to capture the content outside square brackets in groups, using this regex:

(.*)\[.*?\](.*)

And it works perfectly for a simple string like this:

testing_[_is_]_done

This is the sample script I am using:

import re
groups = re.match(r"(.*)\[.*?\](.*)", "testing_[_is_]_done").groups()
print(groups)

And this is the output I am getting:

('testing_', '_done')

But some strings will contain multiple pairs of square brackets, and I want to capture everything outside the brackets into groups. I can't figure out how to write a regex that does this.

This is the example:

testing_[_is_]_done_([but need to])_handle_[this]_scenario_as_well

And ideally, I want to capture everything outside of brackets in groups, like this:

('testing_', '_done_(', ')_handle_', '_scenario_as_well')

Because there can be any number of bracket pairs, I am looking for a regex that handles this dynamically, but so far I have not been able to find anything.


2 answers


+2
−0

First, we need to define some semantics. While it may not matter for your actual inputs, I propose that it should be valid for the output elements - the parts of the text found outside of brackets - to be empty strings. For example, if two bracketed parts are adjacent, like one[two][three]four, then a result ('one', '', 'four') makes more sense than ('one', 'four') - because it allows us to know that there were two distinct bracketed parts. Similarly, by distinguishing ('outside', '') from ('', 'outside'), we can see whether a bracketed part appeared before the outside text or after.

Aside from this, it's important to understand that classically, regex cannot handle arbitrarily nested brackets (whatever symbols are used for the open and closing "bracket"). This is a theoretical limitation (see Somnath Musib's answer on Stack Overflow). There are many variants on the original idea of regular expressions that all call themselves "regex"; some of them have extensions that make it possible to match balanced brackets, but Python's does not. However, the third-party regex package adds such support.
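As a small illustration of that limitation, here is a sketch using only the standard re module (the sample string is made up for this example). The lazy pattern stops at the first ] it finds, so the outer bracket is closed too early:

```python
import re

# Lazy match: from a '[' up to the nearest ']' (no notion of nesting).
bracketed = re.compile(r'\[.*?\]')

nested = 'a[b[c]d]e'
parts = bracketed.split(nested)
print(parts)  # ['a', 'd]e'] - the inner ']' ended the match too early
```

If your inputs never contain nested brackets, this does not matter and the approach below is enough.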

Using re.split

The most natural way to solve the problem is to let Python's regex library do some of the work, instead of expecting the regex itself to do everything. The split function (or method of compiled patterns) works much like the .split method of strings, except that the delimiter matches a regular expression:

Help on function split in module re:

split(pattern, string, maxsplit=0, flags=0)
    Split the source string by the occurrences of the pattern,
    returning a list containing the resulting substrings.  If
    capturing parentheses are used in pattern, then the text of all
    groups in the pattern are also returned as part of the resulting
    list.  If maxsplit is nonzero, at most maxsplit splits occur,
    and the remainder of the string is returned as the final element
    of the list.

We don't want to include anything from within the brackets, so we should be careful not to use any capturing groups in the pattern. Fortunately, our delimiter pattern is quite simple:

>>> import re
>>> bracketed = re.compile(r'\[.*?\]')  # raw string, so the backslashes reach the regex engine unchanged
>>> bracketed.split('testing_[_is_]_done_([but need to])_handle_[this]_scenario_as_well')
['testing_', '_done_(', ')_handle_', '_scenario_as_well']

If you ever need to group parts of the regex without capturing them, that is what the appropriately named non-capturing groups, written (?:...), are for.

We can see how this handles "empty" parts between the brackets as I suggested at the start:

>>> bracketed.split('one[two][three]four')
['one', '', 'four']
>>> bracketed.split('example[]')
['example', '']
>>> bracketed.split('[]example')
['', 'example']
>>> bracketed.split('[]')
['', '']
>>> bracketed.split('')
['']
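To illustrate the earlier point about capturing groups in the delimiter pattern, here is a short comparison (the sample string is made up for this example). With a capturing group, the bracketed text leaks into the result; with a non-capturing group, it does not:

```python
import re

text = 'one[two]three'

# Capturing group: the matched delimiter is returned as part of the list.
with_capture = re.split(r'(\[.*?\])', text)
print(with_capture)     # ['one', '[two]', 'three']

# Non-capturing group (?:...): grouping only, nothing extra is returned.
without_capture = re.split(r'(?:\[.*?\])', text)
print(without_capture)  # ['one', 'three']
```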

+4
−0

Perhaps regex is not the best solution. Although it's possible, the expression will be so complicated that it won't be worth the trouble, IMO.

But if you insist on using regex...

I'm afraid we can't get all the groups in a single step. You said that the strings can have many pairs of brackets, therefore we can't know in advance how many groups there will be. Instead of trying to guess, we'd better try to catch one group at a time, and then build our list with the desired matches.

You could do something like this:

import re

r = re.compile(r'\[[^\]]*\]|([^[\]]+)')
# testing with lots of different strings
strings = ['testing_[_is_]_done', 'no brackets', '[only brackets]', '[a]b[c]d[e]', 'empty bracket: []', 'with [nested [brackets] will] not work']
for s in strings:
    groups = tuple(match.group(1) for match in r.finditer(s) if match.group(1))
    print(f'{s:.<40} -> {groups}')

The regex uses alternation (the | character, which means "or"), therefore it'll match either one of the options. Let's analyze them separately.

The first option is \[[^\]]*\]:

  • \[ is a literal [ character
  • Then we have [^\]]. The first [ and last ] create a character class, and the ^ after [ means a negated character class. This means that it'll match anything that's not in the list of characters inside the brackets. In this case, there's only one character, which is ], but I had to escape it (\]) because otherwise it would be interpreted as the closing bracket of the character class.
    • Anyway, [^\]] will match anything that's not ]
  • Then we have *, which means "zero or more occurrences". Therefore, [^\]]* means "zero or more characters that are not ]"
  • \] is a literal ] character

So this first part matches a [, followed by zero or more characters that aren't ], followed by ]. In other words, it matches any text inside brackets, including no text at all (it also matches []).

The second option of the alternation is ([^[\]]+):

  • the parentheses create a capturing group
  • then we have a negated character class, very similar to the previous one, except that this one also includes the [ character. Hence, this matches anything that's neither [ nor ]
  • + means "one or more occurrences", so it won't match empty strings. Therefore [^[\]]+ will match one or more characters, as long as they're not [ or ]

The purpose of this part is to match anything that's not inside brackets, and put it in a capturing group. The negated character class guarantees that it'll stop as soon as it finds a [ or ].
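To see the two options in isolation, here is a small sketch that runs each one on its own (the sample string is made up for this example). Note that the second option alone would also pick up the text inside the brackets, which is why it is combined with the first one:

```python
import re

s = 'a[b]c[]'

# First option alone: bracketed parts, brackets included.
inside = re.findall(r'\[[^\]]*\]', s)
print(inside)  # ['[b]', '[]']

# Second option alone: runs of characters that are neither [ nor ].
runs = re.findall(r'[^[\]]+', s)
print(runs)    # ['a', 'b', 'c'] - 'b' is inside brackets, but still matched
```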

How does this regex work?

The alternation matches one of the options: either a text inside brackets, or a text outside them. But only the latter is in a capturing group: if the former is matched, the group does not take part in the match and group(1) is None, and that's how we know which option was found.

That's why I tested if match.group(1). If the group has a value, it means that the second option (text outside brackets) was matched. This way, we discard the matches that contain text inside brackets.
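A quick way to observe this is to print the whole match next to group 1 for every match (a minimal sketch with a made-up sample string):

```python
import re

r = re.compile(r'\[[^\]]*\]|([^[\]]+)')

# For each match: (whole match, content of capturing group 1)
pairs = [(m.group(0), m.group(1)) for m in r.finditer('a[b]c')]
print(pairs)  # [('a', 'a'), ('[b]', None), ('c', 'c')]
```

The bracketed match has None in group 1, so the if match.group(1) test drops it.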

The output for the code above is:

testing_[_is_]_done..................... -> ('testing_', '_done')
no brackets............................. -> ('no brackets',)
[only brackets]......................... -> ()
[a]b[c]d[e]............................. -> ('b', 'd')
empty bracket: []....................... -> ('empty bracket: ',)
with [nested [brackets] will] not work.. -> ('with ', ' will', ' not work')

You could also use tuple(filter(None, re.split(r'\[[^\]]*\]', s))): the split function breaks the string into parts, using the regex as a separator. In this case, the separator is "any text inside brackets" (using the same first part of the previous regex). The only problem is that split also produces empty strings (at the beginning or end of the list when the string starts or ends with a bracketed part, and between adjacent bracketed parts), so we have to filter them out.
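Here is that alternative applied to one of the test strings, showing the list before and after filtering (a minimal sketch; the variable names are only illustrative):

```python
import re

inside_brackets = re.compile(r'\[[^\]]*\]')

raw = inside_brackets.split('[a]b[c]d[e]')
print(raw)       # ['', 'b', 'd', '']

# filter(None, ...) drops the falsy items, i.e. the empty strings.
filtered = tuple(filter(None, raw))
print(filtered)  # ('b', 'd')
```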

But note that it doesn't work for nested brackets. Although it might be possible to build a regex to recognize nested patterns, it's not worth the trouble, IMO. You'll have to use recursive patterns that are not supported by the native re module (so you'll have to install one that supports it).

And based on my experience, if you think you need a recursive regex, you're probably wrong. There are better and more straightforward solutions. Don't get me wrong, I like regex, it's a cool tool that can be helpful in many situations, but it's not always the best solution.

Solution without regex

One simple way is to loop through the characters of the string. If we find a [, just ignore everything until the respective ] is found. Actually, to handle nested brackets, we ignore everything until the first [ is closed. Everything else we add to our list, something like this:

def get_text_outside_brackets(s):
    brackets = 0
    current_token = ''
    for c in s:
        if c == '[':
            brackets += 1
            if current_token: # if there's text outside brackets, return it
                yield current_token
                current_token = ''
        elif c == ']':
            if brackets > 0:
                brackets -= 1
        elif brackets == 0:
            current_token += c
    if current_token:
        yield current_token

strings = ['testing_[_is_]_done', 'no brackets', '[only brackets]', '[a]b[c]d[e]', 'empty bracket: []', 'with [nested [brackets] will] not work']
for s in strings:
    groups = tuple(get_text_outside_brackets(s))
    print(f'{s:.<40} -> {groups}')

Now it'll ignore the nested brackets correctly:

testing_[_is_]_done..................... -> ('testing_', '_done')
no brackets............................. -> ('no brackets',)
[only brackets]......................... -> ()
[a]b[c]d[e]............................. -> ('b', 'd')
empty bracket: []....................... -> ('empty bracket: ',)
with [nested [brackets] will] not work.. -> ('with ', ' not work')

And, IMHO, this code is simpler and easier to understand and change than the regex.


Final considerations

It's not clear if you'll face the nested brackets scenario, or even malformed strings, such as 'ab[ c ] ]]]]] def'. Even if that's not the case, you should analyze if the regex is too complicated to be worth maintaining. And if there are such strings, you should prefer the solution without regex.

For example, I tested with this string: 'malformed[ string ] ]]]]] what now? [ '. The regex returned ('malformed', ' ', ' what now? ', ' '), the other regex with split returned ('malformed', ' ]]]]] what now? [ '), and the last solution without regex returned ('malformed', '  what now? '). Which one would be correct in this case? Should all the ]'s be part of the output, because they're not part of a pair (there's no corresponding [)?

If we want to include the ]'s, it's easy with the last solution:

def get_text_outside_brackets(s):
    brackets = 0
    current_token = ''
    for c in s:
        if c == '[':
            brackets += 1
            if current_token: # if there's text outside brackets, return it
                yield current_token
                current_token = ''
        elif c == ']':
            if brackets > 0:
                brackets -= 1
            else: # <--- not a pair, add "]" to the current token
                current_token += c
        elif brackets == 0:
            current_token += c
    if current_token:
        yield current_token

And now the result will be ('malformed', ' ]]]]] what now? '). I admit that it's debatable if the last [ should be part of the result, but anyway, even this change is easier without regex. One could argue that split included the last [, but remember that it failed with nested brackets (so you must check if this case is relevant or not).
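If the unclosed [ should also be kept, one possible approach is to pair up the brackets with a stack first, and then drop only the outermost matched pairs. This is just a sketch under that assumption, not the only reasonable interpretation, and the function name outside_matched_brackets is made up for this example:

```python
def outside_matched_brackets(s):
    # First pass: pair each ']' with the most recent unpaired '['.
    stack, spans = [], []
    for i, c in enumerate(s):
        if c == '[':
            stack.append(i)
        elif c == ']' and stack:
            spans.append((stack.pop(), i))
    # Brackets left unpaired are treated as ordinary text.

    # Second pass: cut out the outermost paired spans.
    tokens, pos = [], 0
    for start, end in sorted(spans):
        if start < pos:  # nested inside a span that was already removed
            continue
        if start > pos:  # non-empty text before this span
            tokens.append(s[pos:start])
        pos = end + 1
    if pos < len(s):
        tokens.append(s[pos:])
    return tuple(tokens)

malformed = outside_matched_brackets('malformed[ string ] ]]]]] what now? [ ')
print(malformed)  # ('malformed', ' ]]]]] what now? [ ')

nested = outside_matched_brackets('with [nested [brackets] will] not work')
print(nested)     # ('with ', ' not work')
```

Like the generator above, this drops empty tokens, so adjacent bracketed parts do not produce empty strings.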

To change the first regex in order to achieve the same result, you'll have to build a very complicated one that checks if there's a corresponding [ in previous positions. My guess is that negative lookarounds will be needed, which makes things not only more complex, but also less efficient (lookarounds add some overhead to the matching process, as they need to go back and forth through the string). I've tried with this:

\[[^\]]*\]|(([^[\]]|\[(?!.*\])|(?!\[[^[\]]*)\])+)

And it seems to work (although it doesn't work with the nested brackets case), but check how complex and hard to understand it is. Definitely not worth it, IMO.
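To check what this expression does, here is a quick run with the same finditer approach used earlier, on the malformed string and on the nested one (a minimal sketch; the helper name outside is made up for this example). Note that it keeps the trailing unpaired [ as well as the unpaired ]'s:

```python
import re

complex_re = re.compile(r'\[[^\]]*\]|(([^[\]]|\[(?!.*\])|(?!\[[^[\]]*)\])+)')

def outside(s):
    # Keep only the matches where group 1 (text outside brackets) took part.
    return tuple(m.group(1) for m in complex_re.finditer(s) if m.group(1))

malformed = outside('malformed[ string ] ]]]]] what now? [ ')
print(malformed)  # ('malformed', ' ]]]]] what now? [ ')

nested = outside('with [nested [brackets] will] not work')
print(nested)     # ('with ', ' will] not work')
```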
