

Post History

Q&A Regex to get text outside brackets

posted 5mo ago by Karl Knechtel‭

Answer
#1: Initial revision by Karl Knechtel · 2024-06-14T21:52:42Z (5 months ago)
First, we need to define some semantics. While it may not matter for your actual inputs, I propose that **it should be valid for the output elements** - the parts of the text found outside of brackets - **to be empty strings**. For example, if two bracketed parts are adjacent, like `one[two][three]four`, then a result `('one', '', 'four')` makes more sense than `('one', 'four')` - because it allows us to know that there were two distinct bracketed parts. Similarly, by distinguishing `('outside', '')` from `('', 'outside')`, we can see whether a bracketed part appeared before the `outside` text or after.

Aside from this, it's important to understand that [**classically, regex cannot handle arbitrarily nested brackets**](https://stackoverflow.com/questions/546433) (whatever symbols are used for the opening and closing "bracket"). This is a theoretical limitation (see Somnath Musib's answer on Stack Overflow). There are many variants on the original idea of regular expressions that all call themselves "regex"; some of them have extensions that make it possible to match balanced brackets, but Python's standard-library `re` module does not. However, the third-party [regex](https://pypi.org/project/regex/) package adds such support.
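To make the limitation concrete, here is a minimal sketch that uses only the standard library. It shows a simple non-greedy pattern stopping at the first closing bracket when brackets nest, and contrasts it with a hand-written depth counter. The function name `split_outside_nested` and its treatment of unbalanced input are assumptions made for illustration, not something the question specifies.

```python
import re

# A non-greedy pattern ends the match at the FIRST closing bracket,
# so the tail of a nested bracketed part leaks into the output.
naive = re.split(r'\[.*?\]', 'a[b[c]d]e')  # ['a', 'd]e']


def split_outside_nested(text):
    """Return the parts of text outside (possibly nested) square brackets.

    Hypothetical helper: tracks nesting depth instead of using a regex.
    Assumption: an unmatched ']' at depth 0 is kept as literal text, and
    everything after an unclosed '[' is treated as bracketed.
    """
    parts = []
    current = []
    depth = 0
    for ch in text:
        if ch == '[':
            if depth == 0:
                # An outermost bracket starts: close off the current outside part.
                parts.append(''.join(current))
                current = []
            depth += 1
        elif ch == ']' and depth > 0:
            depth -= 1
        elif depth == 0:
            current.append(ch)
    parts.append(''.join(current))
    return parts
```

With this sketch, `split_outside_nested('a[b[c]d]e')` gives `['a', 'e']`, while the plain pattern leaves the stray `d]` in the second element.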

## Using `re.split`

The most natural way to solve the problem is to let Python's regex library do some of the work, instead of expecting the regex itself to do everything. The `split` function (or the method of the same name on [compiled patterns](https://docs.python.org/3/library/re.html#re-objects)) works much like the `.split` method of strings, except that the *delimiter* is anything that matches a regular expression:

```
Help on function split in module re:

split(pattern, string, maxsplit=0, flags=0)
    Split the source string by the occurrences of the pattern,
    returning a list containing the resulting substrings.  If
    capturing parentheses are used in pattern, then the text of all
    groups in the pattern are also returned as part of the resulting
    list.  If maxsplit is nonzero, at most maxsplit splits occur,
    and the remainder of the string is returned as the final element
    of the list.
```

We don't want to include anything from within the brackets, so we should be careful not to use any capturing groups in the pattern. Fortunately, our delimiter pattern is quite simple:

```
>>> import re
>>> bracketed = re.compile(r'\[.*?\]')
>>> bracketed.split('testing_[_is_]_done_([but need to])_handle_[this]_scenario_as_well')
['testing_', '_done_(', ')_handle_', '_scenario_as_well']
```

If it were necessary to group parts of the regex without capturing them, that is what the appropriately named [non-capturing groups](https://stackoverflow.com/questions/3512471) are for.
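As a quick illustration of why capturing groups matter here, the sketch below wraps the same delimiter pattern first in a capturing group and then in a non-capturing one. The sample string and variable names are made up for the example.

```python
import re

text = 'one[two]three'

# With a capturing group, the matched delimiter is returned too.
capturing = re.split(r'(\[.*?\])', text)        # ['one', '[two]', 'three']

# With a non-capturing group (?:...), only the outside text is returned.
non_capturing = re.split(r'(?:\[.*?\])', text)  # ['one', 'three']
```

The first result includes `'[two]'`, which is exactly what we are trying to leave out.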

We can see how `bracketed.split` handles "empty" parts between the brackets, as I suggested at the start:

```
>>> bracketed.split('one[two][three]four')
['one', '', 'four']
>>> bracketed.split('example[]')
['example', '']
>>> bracketed.split('[]example')
['', 'example']
>>> bracketed.split('[]')
['', '']
>>> bracketed.split('')
['']
```
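To tie these results back to the tuple notation used at the start, here is one possible way to wrap the split in a small helper. The names `BRACKETED`, `outside_brackets` and `count_bracketed` are hypothetical; the sketch assumes non-nested brackets, as in the rest of this answer.

```python
import re

BRACKETED = re.compile(r'\[.*?\]')


def outside_brackets(text):
    """Return the text outside square brackets as a tuple, keeping empty strings."""
    return tuple(BRACKETED.split(text))


def count_bracketed(text):
    """Count bracketed parts: n delimiter matches always produce n + 1 pieces."""
    return len(outside_brackets(text)) - 1
```

Because empty strings are preserved, `count_bracketed('one[two][three]four')` can report 2 even though only two non-empty pieces of outside text remain.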