Q&A Regex to get text outside brackets

posted 6mo ago by Karl Knechtel‭

Answer

#1: Initial revision by

Karl Knechtel‭ · 2024-06-14T21:52:42Z (6 months ago)

First, we need to define some semantics. While it may not matter for your actual inputs, I propose that **it should be valid for the output elements** - the parts of the text found outside of brackets - **to be empty strings**. For example, if two bracketed parts are adjacent, like `one[two][three]four`, then a result `('one', '', 'four')` makes more sense than `('one', 'four')` - because it allows us to know that there were two distinct bracketed parts. Similarly, by distinguishing `('outside', '')` from `('', 'outside')`, we can see whether a bracketed part appeared before the `outside` text or after.

Aside from this, it's important to understand that [**classically, regex cannot handle arbitrarily nested brackets**](https://stackoverflow.com/questions/546433) (whatever symbols are used for the opening and closing "bracket"). This is a theoretical limitation (see Somnath Musib's answer on Stack Overflow). Many variants on the original idea of regular expressions call themselves "regex"; some have extensions that make it possible to match balanced brackets, but Python's standard `re` module does not. However, the third-party [regex](https://pypi.org/project/regex/) package adds such support.
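To make the limitation concrete, here is a minimal sketch of how simple patterns misbehave on nested or repeated brackets, followed by a plain depth-counting loop that copes with nesting without any regex. The function name `split_outside_brackets` is hypothetical, and its treatment of unbalanced input (a stray `]` is kept as ordinary text; an unclosed `[` swallows the rest of the string) is an assumption, not something the question specifies.

```python
import re

# Non-greedy matching stops at the FIRST closing bracket, so a nested
# bracketed part is cut short.
nested = 'a[b[c]d]e'
lazy_match = re.search(r'\[.*?\]', nested).group()     # '[b[c]'

# Greedy matching runs to the LAST closing bracket, so it swallows the
# text between two separate bracketed parts.
separate = 'a[b]c[d]e'
greedy_match = re.search(r'\[.*\]', separate).group()  # '[b]c[d]'


def split_outside_brackets(text):
    """Return the parts of text outside [...] pairs, honoring nesting.

    Hypothetical helper. Assumptions: a stray ']' at depth 0 is kept as
    ordinary text, and an unclosed '[' hides everything after it.
    """
    parts = []
    current = []
    depth = 0
    for ch in text:
        if ch == '[':
            if depth == 0:
                # An outermost bracket closes off the current outside part.
                parts.append(''.join(current))
                current = []
            depth += 1
        elif ch == ']' and depth > 0:
            depth -= 1
        elif depth == 0:
            current.append(ch)
    parts.append(''.join(current))
    return parts


nested_parts = split_outside_brackets(nested)                    # ['a', 'e']
adjacent_parts = split_outside_brackets('one[two][three]four')   # ['one', '', 'four']
```

The loop keeps the empty-string semantics proposed above: adjacent bracketed parts still yield an empty element between them.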

## Using `re.split`

The most natural way to solve the problem is to let Python's regex library do some of the work, instead of expecting the regex itself to do everything. The `split` function (or method of [compiled patterns](https://docs.python.org/3/library/re.html#re-objects)) works much like the `.split` method of strings, except that the *delimiter* is whatever matches a regular expression:

```
Help on function split in module re:

split(pattern, string, maxsplit=0, flags=0)
    Split the source string by the occurrences of the pattern,
    returning a list containing the resulting substrings.  If
    capturing parentheses are used in pattern, then the text of all
    groups in the pattern are also returned as part of the resulting
    list.  If maxsplit is nonzero, at most maxsplit splits occur,
    and the remainder of the string is returned as the final element
    of the list.
```

We don't want to include anything from within the brackets, so we should be careful not to use any capturing groups in the pattern. Fortunately, our delimiter pattern is quite simple:

```
>>> import re
>>> bracketed = re.compile(r'\[.*?\]')
>>> bracketed.split('testing_[_is_]_done_([but need to])_handle_[this]_scenario_as_well')
['testing_', '_done_(', ')_handle_', '_scenario_as_well']
```

If it were necessary to group parts of the regex without capturing them, that is what the appropriately named [non-capturing groups](https://stackoverflow.com/questions/3512471), written `(?:...)`, are for.
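As a small illustration of why this matters for `re.split`, the sketch below wraps the inside of the brackets in a group, first capturing and then non-capturing. The grouping is artificial (the pattern does not need it) and is there only to show the difference in output.

```python
import re

text = 'one[two][three]four'

# Capturing group: the text of the group is returned as part of the
# result, so the bracket contents appear in the output.
with_capture = re.split(r'\[(.*?)\]', text)
# ['one', 'two', '', 'three', 'four']

# Non-capturing group: same grouping, but nothing extra is returned.
without_capture = re.split(r'\[(?:.*?)\]', text)
# ['one', '', 'four']
```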

We can see how this handles "empty" parts between the brackets as I suggested at the start:

```
>>> bracketed.split('one[two][three]four')
['one', '', 'four']
>>> bracketed.split('example[]')
['example', '']
>>> bracketed.split('[]example')
['', 'example']
>>> bracketed.split('[]')
['', '']
>>> bracketed.split('')
['']
```
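One consequence of keeping the empty strings is that the number of bracketed parts can be recovered from the result: it is always one less than the number of elements. A minimal sketch, assuming flat (non-nested) brackets; the helper name `outside_parts` is hypothetical:

```python
import re

bracketed = re.compile(r'\[.*?\]')


def outside_parts(text):
    """Return (parts outside brackets, number of bracketed parts)."""
    parts = bracketed.split(text)
    # n delimiters always produce n + 1 pieces, empty or not.
    return parts, len(parts) - 1


parts, bracket_count = outside_parts('one[two][three]four')
# parts == ['one', '', 'four'], bracket_count == 2

# If the empty strings are not wanted, they are easy to drop afterwards.
non_empty = [part for part in parts if part]
# ['one', 'four']
```

Filtering afterwards loses the position information, which is why it is left as a separate, optional step.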
