Communities

Writing
Writing
Codidact Meta
Codidact Meta
The Great Outdoors
The Great Outdoors
Photography & Video
Photography & Video
Scientific Speculation
Scientific Speculation
Cooking
Cooking
Electrical Engineering
Electrical Engineering
Judaism
Judaism
Languages & Linguistics
Languages & Linguistics
Software Development
Software Development
Mathematics
Mathematics
Christianity
Christianity
Code Golf
Code Golf
Music
Music
Physics
Physics
Linux Systems
Linux Systems
Power Users
Power Users
Tabletop RPGs
Tabletop RPGs
Community Proposals
Community Proposals
tag:snake search within a tag
answers:0 unanswered questions
user:xxxx search by author id
score:0.5 posts with 0.5+ score
"snake oil" exact phrase
votes:4 posts with 4+ votes
created:<1w created < 1 week ago
post_type:xxxx type of post
Search help
Notifications
Mark all as read See all your notifications »
Code Reviews

Welcome to Software Development on Codidact!

Will you help us build our independent community of developers helping developers? We're small and trying to grow. We welcome questions about all aspects of software development, from design to code to QA and more. Got questions? Got answers? Got code you'd like someone to review? Please join us.

Post History

66%
+2 −0
Code Reviews Detecting balanced parentheses in Python

Command-line timing Rather than using separate code for timing, I tried running the timeit module as a command-line tool. I wrote initial versions of the implementations based on the OP and hkotsu...

posted 1y ago by Karl Knechtel‭

Answer
#1: Initial revision by user avatar Karl Knechtel‭ · 2024-01-16T05:56:00Z (about 1 year ago)
## Command-line timing

Rather than using separate code for timing, I tried running the `timeit` module as a command-line tool. I wrote initial versions of the implementations based on the OP and hkotsubo's answer, as a file `balance.py`:

<details><summary>

`balance.py`
</summary>

```
def using_replace(s):
    if len(s) % 2 != 0:
        return False

    for i in range(len(s) // 2):
        s = s.replace("()", "").replace("[]", "").replace("{}", "")

    return len(s) == 0

def using_stack(s, brackets={ '(': ')', '[': ']', '{': '}' }):
    if len(s) % 2 != 0:
        return False

    open_brackets = set(brackets.keys())
    close_brackets = set(brackets.values())
    stack = []
    for c in s:
        if c in open_brackets:
            stack.append(c)
        elif c in close_brackets:
            # close bracket without respective opening, string is invalid
            if len(stack) == 0:
                return False
            # close bracket doesn't correspond to last open bracket, string is invalid
            if c != brackets[stack[-1]]:
                return False
            stack.pop()
        else: # invalid char
            return False

    return len(stack) == 0
```
</details>

Which I can then test like so:
```bash
$ # with balanced brackets
$ python -m timeit --setup "import balance; s='{[(())]{()[()]}}' * 2000" -- "balance.using_replace(s)"
100 loops, best of 5: 2.23 msec per loop
$ python -m timeit --setup "import balance; s='{[(())]{()[()]}}' * 2000" -- "balance.using_stack(s)"
50 loops, best of 5: 4.19 msec per loop
```

## Performance analysis

As one might expect, adding a couple imbalanced brackets at the start, middle or end of that test string has little effect on the `replace`-based algorithm, but completely determines when the stack-based algorithm can exit:

<details><summary>Test results for imbalanced inputs</summary>

```bash
$ # at the beginning
$ python -m timeit --setup "import balance; s='))' + '{[(())]{()[()]}}' * 2000" -- "balance.using_replace(s)"
100 loops, best of 5: 2.41 msec per loop
$ python -m timeit --setup "import balance; s='))' + '{[(())]{()[()]}}' * 2000" -- "balance.using_stack(s)"
500000 loops, best of 5: 719 nsec per loop
$ # in the middle
$ python -m timeit --setup "import balance; s='{[(())]{()[()]}}' * 1000 + '))' + '{[(())]{()[()]}}' * 1000" -- "balance.using_replace(s)"
100 loops, best of 5: 2.48 msec per loop
$ python -m timeit --setup "import balance; s='{[(())]{()[()]}}' * 1000 + '))' + '{[(())]{()[()]}}' * 1000" -- "balance.using_stack(s)"
100 loops, best of 5: 2.18 msec per loop
$ # at the end
$ python -m timeit --setup "import balance; s='{[(())]{()[()]}}' * 2000 + '))'" -- "balance.using_replace(s)"
100 loops, best of 5: 2.39 msec per loop
$ python -m timeit --setup "import balance; s='{[(())]{()[()]}}' * 2000 + '))'" -- "balance.using_stack(s)"
50 loops, best of 5: 4.21 msec per loop
```
</details>

The main reason that the replace-based approach is able to perform this well is because the length of the string dramatically decreases during the first few iterations. With the "balanced brackets" input, the string length is 32000 initially, 12000 after the first pass *and zero for every other pass*. (Other contributors may not have realized, or accounted for, the fact that `.replace` replaces *every non-overlapping occurrence* of the replacement string, in a single pass; and internally, the Python implementation can "cheat" and use mutable strings and string buffers.)

We get much worse performance with the "curious" test case, `'{[((ab))]{()[()]}}' * 2000`, similarly, because the length of the string drops from 36000 to 20000 in the first pass and then *stays at 20000* for each subsequent pass.

Another poorly-performing case for the replace-based approach is deeply nested parentheses:
```bash
$ python -m timeit --setup "import balance; s='((((((((((((((((' * 1000 + '))))))))))))))))' * 1000" -- "balance.using_replace(s)"
1 loop, best of 5: 630 msec per loop
$ python -m timeit --setup "import balance; s='((((((((((((((((' * 1000 + '))))))))))))))))' * 1000" -- "balance.using_stack(s)"
50 loops, best of 5: 4.14 msec per loop
```
On average, the string length is half the original length, because it can only decrease steadily with each loop iteration, and only reaches zero at the end. Mixing up bracket types improves the performance, because it allows all the `.replace` calls to do work, and allows a lot of empty string processing at the end:
```bash
$ python -m timeit --setup "import balance; s='([{' * 5000 + '}])' * 5000" -- "balance.using_replace(s)"
1 loop, best of 5: 222 msec per loop
```
And this improves even more if we take care to match the order of brackets in the input to the order of `.replace` calls:
```bash
$ python -m timeit --setup "import balance; s='{[(' * 5000 + ')]}' * 5000" -- "balance.using_replace(s)"
2 loops, best of 5: 138 msec per loop
```

## Performance improvements for the replace-based approach

As Dirk Herrmann‭ noted, iterating *until the length stops changing* avoids unnecessary iterations, which greatly improves performance especially in cases where an imbalance can be detected early. Using a modified version of `using_replace` (rearranging the loop a bit):

<details><summary>

Improved `string_replace` and basic benchmark results
</summary>

```
def using_replace(s):
    if len(s) % 2 != 0:
        return False
    prevlen = len(s)
    while True:
        if prevlen == 0:
            return True
        s = s.replace("()", "").replace("[]", "").replace("{}", "")
        if prevlen == len(s): # no improvement and didn't reach zero
            return False
        prevlen = len(s)
```
Repeating the tests:
```bash
$ # with balanced brackets
$ python -m timeit --setup "import balance; s='{[(())]{()[()]}}' * 2000" -- "balance.using_replace(s)"
1000 loops, best of 5: 234 usec per loop
$ # at the beginning
$ python -m timeit --setup "import balance; s='))' + '{[(())]{()[()]}}' * 2000" -- "balance.using_replace(s)"
1000 loops, best of 5: 250 usec per loop
$ # in the middle
$ python -m timeit --setup "import balance; s='{[(())]{()[()]}}' * 1000 + '))' + '{[(())]{()[()]}}' * 1000" -- "balance.using_replace(s)"
1000 loops, best of 5: 249 usec per loop
$ # at the end
$ python -m timeit --setup "import balance; s='{[(())]{()[()]}}' * 2000 + '))'" -- "balance.using_replace(s)"
1000 loops, best of 5: 248 usec per loop
```
</details>

The good cases are an order of magnitude faster again, by avoiding repeated `.replace` calls on empty strings or strings that contain only the unbalanced `))` pair.

However, imbalanced brackets still fare poorly, because the string length *does* continually decrease, so that an early exit isn't possible:
```bash
$ python -m timeit --setup "import balance; s='((((((((((((((((' * 1000 + '))))))))))))))))' * 1000" -- "balance.using_replace(s)"
1 loop, best of 5: 631 msec per loop
$ python -m timeit --setup "import balance; s='([{' * 5000 + '}])' * 5000" -- "balance.using_replace(s)"
1 loop, best of 5: 223 msec per loop
$ python -m timeit --setup "import balance; s='{[(' * 5000 + ')]}' * 5000" -- "balance.using_replace(s)"
2 loops, best of 5: 137 msec per loop
```

## Regex is counterproductive

Another avenue one might consider exploring is regular expressions. This would allow for multiple kinds of brackets to be replaced in a single pass, rather than making three method calls per loop; and it would mean not prioritizing one bracket type over another (at least, not to the same extent). 

<details><summary>Code for regex-based approach</summary>

```
import re

pattern = re.compile('\(\)|\[\]|\{\}')

def using_regex(s):
    if len(s) % 2 != 0:
        return False
    prevlen = len(s)
    while True:
        if prevlen == 0:
            return True
        s = pattern.sub('', s)
        if prevlen == len(s):
            return False
        prevlen = len(s)
```
</details>

But this actually makes things much worse all around. The string replacement algorithm is already highly optimized, and even though the regular expression interface uses a state machine implemented in C, it adds too much complexity.

It does, however, reduce the importance of the order of brackets, as predicted:
```bash
$ python -m timeit --setup "import balance; s='{[(' * 5000 + ')]}' * 5000" -- "balance.using_regex(s)"
1 loop, best of 5: 3.24 sec per loop
$ python -m timeit --setup "import balance; s='([{' * 5000 + '}])' * 5000" -- "balance.using_regex(s)"
1 loop, best of 5: 3.38 sec per loop
```
## Performance improvements for the stack-based approach

We don't need to test for invalid characters explicitly, because a non-bracket character will necessarily fail the other tests (for a valid open bracket or a matching close bracket).

We can also simplify the code by putting the corresponding *close* brackets on the stack when we find an open bracket. That way, the character that we should pop from the stack on a successful match, will already be what we need for the comparison - so we don't discard the result from `pop`.

We also then find that it's unnecessary to compute the `close_brackets` set, although this is trivial for long inputs.

Finally, using `collections.deque` rather than a list seems to offer a minor performance boost, even for the simple inputs where a balanced sequence is repeated many times (so that the stack should only ever grow to a few elements) - and the performance improvement is still minor, even dubious, for deeply nested inputs. I'm not entirely sure why this is; but I used it for the final version on principle.

<details><summary>

Improved `using_stack` implementation and benchmark results
</summary>

```
from collections import deque

def using_stack(s, brackets={ '(': ')', '[': ']', '{': '}' }):
    if len(s) % 2 != 0:
        return False

    open_brackets = set(brackets.keys())
    stack = deque()
    for c in s:
        if c in open_brackets:
            stack.append(brackets[c])
        # otherwise, must match the top of a non-empty stack
        elif len(stack) == 0 or c != stack.pop():
            return False

    return not stack
```
Results:
```bash
$ # with balanced brackets
$ python -m timeit --setup "import balance; s='{[(())]{()[()]}}' * 2000" -- "balance.using_stack(s)"
100 loops, best of 5: 3.59 msec per loop
$ # at the beginning
$ python -m timeit --setup "import balance; s='))' + '{[(())]{()[()]}}' * 2000" -- "balance.using_stack(s)"
500000 loops, best of 5: 503 nsec per loop
$ # in the middle
$ python -m timeit --setup "import balance; s='{[(())]{()[()]}}' * 1000 + '))' + '{[(())]{()[()]}}' * 1000" -- "balance.using_stack(s)"
200 loops, best of 5: 1.72 msec per loop
$ # at the end
$ python -m timeit --setup "import balance; s='{[(())]{()[()]}}' * 2000 + '))'" -- "balance.using_stack(s)"
100 loops, best of 5: 3.65 msec per loop
$ # deeply nested
$ python -m timeit --setup "import balance; s='((((((((((((((((' * 1000 + '))))))))))))))))' * 1000" -- "balance.using_stack(s)"
100 loops, best of 5: 3.43 msec per loop
```
</details>

So, a fairly consistent improvement of around 15%, across the board. Note that relying on "look before you leap" (catching a `KeyError` on `brackets[c]` rather than checking the key-set explicitly) makes things considerably slower (about double the runtime overall on the simple cases).

## The best of both worlds?

The replace-based approach seems better for "normal" cases, but much slower if the string length can't be rapidly reduced by individual `.replace` passes. 
What if we tried using a heuristic to start out with the replace algorithm, and switch to the stack algorithm if the string length only goes down slightly - say, by less than an eighth of the original length?

<details><summary>Combined approach and benchmark results</summary>

```
def using_heuristic(s):
    if len(s) % 2 != 0:
        return False
    prevlen = len(s)
    while True:
        if prevlen == 0:
            return True
        s = s.replace("()", "").replace("[]", "").replace("{}", "")
        currlen = len(s)
        if currlen == prevlen:
            return False
        if currlen > (prevlen - (prevlen >> 3)):
            return using_stack(s)
        prevlen = currlen
```

```bash
$ # with balanced brackets
$ python -m timeit --setup "import balance; s='{[(())]{()[()]}}' * 2000" -- "balance.using_heuristic(s)"
1000 loops, best of 5: 233 usec per loop
$ # at the beginning
$ python -m timeit --setup "import balance; s='))' + '{[(())]{()[()]}}' * 2000" -- "balance.using_replace(s)"
1000 loops, best of 5: 247 usec per loop
$ # in the middle
$ python -m timeit --setup "import balance; s='{[(())]{()[()]}}' * 1000 + '))' + '{[(())]{()[()]}}' * 1000" -- "balance.using_heuristic(s)"
1000 loops, best of 5: 248 usec per loop
$ # at the end
$ python -m timeit --setup "import balance; s='{[(())]{()[()]}}' * 2000 + '))'" -- "balance.using_heuristic(s)"
1000 loops, best of 5: 249 usec per loop
$ # deeply nested
$ python -m timeit --setup "import balance; s='((((((((((((((((' * 1000 + '))))))))))))))))' * 1000" -- "balance.using_heuristic(s)"
100 loops, best of 5: 3.43 msec per loop
```
</details>

It works as anticipated. It can't bail out immediately when the imbalance is at the beginning of the string (because it will try the replacement approach at least once before giving up on that), but it otherwise effectively chooses the best approach.

It also lets both algorithms do what they're good at, at least to some extent:
```bash
$ python -m timeit --setup "import balance; s='(' * 1000 + ')' * 1000 + '()' * 14000" -- "balance.using_heuristic(s)"
1000 loops, best of 5: 317 usec per loop
$ python -m timeit --setup "import balance; s='(' * 1000 + ')' * 1000 + '()' * 14000" -- "balance.using_stack(s)"
100 loops, best of 5: 3.01 msec per loop
$ python -m timeit --setup "import balance; s='(' * 1000 + ')' * 1000 + '()' * 14000" -- "balance.using_replace(s)"
100 loops, best of 5: 2.99 msec per loop
```


Of course, as written this means that e.g. the `if len(s) % 2 != 0:` test is repeated, and of course the `using_stack` logic could be inlined. But these changes don't make an observable difference.

We also lost the ability to bail out early on strings that contain non-brackets. Here's a quick way to add that test back in:

```
delete_bracket_mapping = str.maketrans('', '', '()[]{}')

if s.translate(delete_bracket_mapping):
    # After deleting all brackets, there are still string contents.
    # Therefore, there are non-bracket characters.
    return False
```