Communities

Writing
Writing
Codidact Meta
Codidact Meta
The Great Outdoors
The Great Outdoors
Photography & Video
Photography & Video
Scientific Speculation
Scientific Speculation
Cooking
Cooking
Electrical Engineering
Electrical Engineering
Judaism
Judaism
Languages & Linguistics
Languages & Linguistics
Software Development
Software Development
Mathematics
Mathematics
Christianity
Christianity
Code Golf
Code Golf
Music
Music
Physics
Physics
Linux Systems
Linux Systems
Power Users
Power Users
Tabletop RPGs
Tabletop RPGs
Community Proposals
Community Proposals
tag:snake search within a tag
answers:0 unanswered questions
user:xxxx search by author id
score:0.5 posts with 0.5+ score
"snake oil" exact phrase
votes:4 posts with 4+ votes
created:<1w created < 1 week ago
post_type:xxxx type of post
Search help
Notifications
Mark all as read See all your notifications »
Code Reviews

Welcome to Software Development on Codidact!

Will you help us build our independent community of developers helping developers? We're small and trying to grow. We welcome questions about all aspects of software development, from design to code to QA and more. Got questions? Got answers? Got code you'd like someone to review? Please join us.

Detecting balanced parentheses in Python

+7
−0

The problem

Given a string s containing just the characters '(', ')', '{', '}', '[' and ']', determine if the input string is valid.

An input string is valid if:

  1. Open brackets are closed by the same type of brackets.
  2. Open brackets are closed in the correct order.

The solution

def isValid(s: str) -> bool:
    if len(s) % 2 != 0:
        return False

    for i in range(len(s) // 2):
        s = s.replace("()", "").replace("[]", "").replace("{}", "")

    return len(s) == 0

My approach to the problem is replacing the pairs. The string is balanced if the string is empty after replacing len(str) // 2 times. Is this a good approach? How can I improve my algorithm?

History
Why does this post require attention from curators or moderators?
You might want to add some details to your flag.
Why should this post be closed?

0 comment threads

5 answers

+9
−0

Use a stack while just scanning your string once from left to right. No need for multiple (performance-wise) expensive string replacements. If implemented right, the stack will only ever contain at maximum len(s)/2 elements.

The algorithm is rather straightforward:

Check first whether the string is of odd or even length. If the string is of odd length, it must be an unbalanced string (Your code already does this part, i just mentioned it here again simply for having a full description of the stack-based alogrithm.)

For any opening paranthesis/bracket/brace in the string, push some value/token that represents the type of this opening bracket onto the stack (that can be the opening bracket characters themselves, the associated closing brackets, or some other identifying values such as from an enum). If before doing a push the count of elements in the stack is already len(s)/2 you know the string must be unbalanced and the routine can abort now.

For any closing bracket in the string, pop a value/token from the stack and check whether that value/token represents the same bracket type as the closing bracket. If the stack is empty before attempting to pop, or the popped value/token mismatches the type of the closing bracket in the string, you know the string must be unbalanced and the routine can abort now.

When you finished scanning the string from left to right and the stack is not empty, you again know that the string must be unbalanced.

When the stack is empty after completing the scan of the text string, you know the string was balanced (or the string was empty).

I leave it to you as an exercise to translate this algorithm into concrete Python code.

P.S.: As a further exercise, you might try implementing the same stack-based algorithm while avoiding using an explicit stack data structure. This can be achieved by implementing the algorithm in a recursive manner, with the call stack conceptually taking the role of the stack that's keeping track of the (nested) bracket pairs.

History
Why does this post require attention from curators or moderators?
You might want to add some details to your flag.

1 comment thread

Implementing the algorithm recursively is a bad idea (in languages which don't do tail recursion elim... (1 comment)
+5
−0

Instead of replacing the brackets, you could do just one loop, and keep a stack with the opening brackets. Every time you find a closing bracket, check if it corresponds to the stack top: if it's not, the string is invalid, else, pop the opening bracket and proceed to the next character.

Do it until the end of the string, and the stack will be empty if it's a valid one:

def is_balanced(s, brackets={ '(': ')', '[': ']', '{': '}' }):
    if len(s) % 2 != 0:
        return False

    open_brackets = set(brackets.keys())
    close_brackets = set(brackets.values())
    stack = []
    for c in s:
        if c in open_brackets:
            stack.append(c)
        elif c in close_brackets:
            # close bracket without respective opening, string is invalid
            if len(stack) == 0:
                return False
            # close bracket doesn't correspond to last open bracket, string is invalid
            if c != brackets[stack[-1]]:
                return False
            stack.pop()
        else: # invalid char
            return False

    return len(stack) == 0

I put the keys and values inside set's (because searching with the in operator is O(1)), but it didn't make much difference. You can remove it if you want.


I admit I initially thought that calling replace 3 times per iteration, and always do it N / 2 times, would be slower, but to my surprise I found out that it depends on the string being tested. A quick test with timeit module (using your isValid function and is_balanced as defined above):

from timeit import timeit

# run 10 times each test
params = { 'number' : 10, 'globals': globals() }

s = '{[(())]{()[()]}}' * 1000
s += '))'
s += '{[(())]{()[()]}}' * 1000
print('--------------\nUnbalanced in the middle')
print(timeit('isValid(s)', **params))
print(timeit('is_balanced(s)', **params))

s = '))'
s += '{[(())]{()[()]}}' * 1000
s += '{[(())]{()[()]}}' * 1000
print('--------------\nUnbalanced in the beginning')
print(timeit('isValid(s)', **params))
print(timeit('is_balanced(s)', **params))

s = '{[(())]{()[()]}}' * 2000
s += '))'
print('--------------\nUnbalanced in the end')
print(timeit('isValid(s)', **params))
print(timeit('is_balanced(s)', **params))

s = '{[(())]{()[()]}}' * 2000
print('--------------\nBalanced')
print(timeit('isValid(s)', **params))
print(timeit('is_balanced(s)', **params))

In my machine, I've got these results:

--------------
Unbalanced in the middle
0.03587988700019196
0.02324946300359443
--------------
Unbalanced in the beginning
0.026359478004451375
1.2465003237593919e-05
--------------
Unbalanced in the end
0.025080425999476574
0.041097565001109615
--------------
Balanced
0.022329219995299354
0.04195229199831374

For a balanced string, surprisingly - at least to me - the replace approach is faster.

For unbalanced strings, it depends on where the unbalanced brackets are. If they are in the beginning or in the middle, the stack approach wins (with a huge margin if the problem is in the beginning, due to the else clause that checks an invalid char as soon as it finds one). But if they are in the end, the replace approach wins.


I really thought that using replace would always be slower, because each call to replace needs to scan the whole string, searching for the substring to be replaced (and return a new string). Even if you're looping N / 2 times, you're calling replace 3 times per iteration, which I really thought it'd have a huge cost. But that wasn't the case, and this really surprised me.

The only case where the difference is really huge is the aforementioned case, when the unbalanced bracket is in the beginning. One particular case is curious:

s = '{[((ab))]{()[()]}}' * 2000

With this, the replace approach took 4.5 seconds, while the stack approach took around 0.00001 seconds. Not sure if there's an implementation detail for the replace method, that makes this particular case very slow (tested on Python 3.8.5, Ubuntu 20.04).

Another case where the stack approach wins by one order of magnitude is:

# lots of opening brackets, followed by lots of closing brackets
s = ('(' * 2000) + (')' * 2000)

But as I said, I was expecting this difference in all cases. Anyway, that's why we should always test everything, instead of trusting on what we believe to be "obvious".


BTW, testing in another environments, the results vary. In IdeOne.com, the stack approach is always faster for unbalanced strings, and almost the same for balanced ones. And in Repl.it, the results were similar to my machine's.

History
Why does this post require attention from curators or moderators?
You might want to add some details to your flag.

2 comment threads

Performance explanations (1 comment)
General comments (7 comments)
+4
−0

Is this a good approach? How can I improve my algorithm?

Your code is correct and simple. That is good, and it may be all that is needed. You have received some responses that indicate that there would be a performance problem. But, whether or not performance is an issue, and what the precise performance problem is, depends on how your code is meant to be used:

Will the code be executed frequently? Will it be executed with long input strings or with short ones? What is more likely: valid or invalid input strings? For which case do you have to give a fast answer: valid or invalid input strings? And, as the experiments from @hkotsubo‭ show, even the expected contents of strings can play a role...

Unless that is clear, it does not make much sense to optimize the code at the cost of clarity. In your case I assume that it is an exercise from a programming lecture. Then, the code will likely be run, say, 10-100 times or so for checking your result, likely with input strings of short or moderate length. If that is true, the total execution time of all executions ever of your code (before it is being abandoned and you will look at the next exercise) will likely be below 10min. Thus, investing 11min in optimization could already be a waste of time :-).

In fact, under this assumption your code is even more complex than it needs to be: You have an initial check for odd string lengths. This check is redundant because the rest of the function works without it. The check optimizes the function's response time for some invalid cases, but at the cost of prolonging the computation for all the other cases. Thus, I recommend to get rid of the initial check to end with the most likely simplest possible solution for the problem.

Some more remarks:

  1. In the explanation of the review request you gave a short summary of your approach:

    My approach to the problem is replacing the pairs.

    Such a short summary would would also be good as a comment in the code, maybe a bit extended as

    The function analyzes the parenthesization "from inside" by iteratively deleting valid adjacent pairs of parentheses.

    but given the simplicity of the code the value of such a comment may already be debatable.

  2. For such a simple function, short variable names as you have chosen are fine. The name of the function fits the description of the exercise and thus may be OK in this case. Generally, for symbols to be used by the users of your code, a more descriptive name should be chosen. Again, @hkotsubo‭'s answer is worth looking at.

  3. A list of test cases would be good to have.

And, finally: It is quite possible that your instructors wanted you to learn about the use of stacks for such kinds of problems. Your solution may have come as a surprise to them. In this case, certainly, and also for the general benefit of learning new techniques I can only recommend to look at the proposals brought up by the other commenters.

History
Why does this post require attention from curators or moderators?
You might want to add some details to your flag.

1 comment thread

Recursion? (2 comments)
+2
−0

You've got an inefficiency in your code, as you always do replacements 3/2 times the length of the string. That is unnecessarily expensive.

By instead testing in each iteration whether the length actually changed, you get a much improved performance:

def is_valid(s: str) -> bool:
    if len(s) % 1 != 0:
        return False

    # this initial value causes the loop to be skipped entirely for empty strings
    prevlen = 0

    # stop as soon as no further replacements have been made
    while len(s) != prevlen:
        prevlen = len(s)
        s = s.replace("()", "").replace("[]", "").replace("{}", "")

    return len(s) == 0

I've put it together with your code and hkotsubo's timing code on tio.run, and got the following:

--------------
Unbalanced in the middle
0.04957069695228711
0.002779866976197809
--------------
Unbalanced in the beginning
0.05233071401016787
0.0026999289984814823
--------------
Unbalanced in the end
0.05092682002577931
0.0026755660073831677
--------------
Balanced
0.047405752004124224
0.002398615994025022

Try it online!

History
Why does this post require attention from curators or moderators?
You might want to add some details to your flag.

0 comment threads

+2
−0

Command-line timing

Rather than using separate code for timing, I tried running the timeit module as a command-line tool. I wrote initial versions of the implementations based on the OP and hkotsubo's answer, as a file balance.py:

balance.py

def using_replace(s):
    if len(s) % 2 != 0:
        return False

    for i in range(len(s) // 2):
        s = s.replace("()", "").replace("[]", "").replace("{}", "")

    return len(s) == 0

def using_stack(s, brackets={ '(': ')', '[': ']', '{': '}' }):
    if len(s) % 2 != 0:
        return False

    open_brackets = set(brackets.keys())
    close_brackets = set(brackets.values())
    stack = []
    for c in s:
        if c in open_brackets:
            stack.append(c)
        elif c in close_brackets:
            # close bracket without respective opening, string is invalid
            if len(stack) == 0:
                return False
            # close bracket doesn't correspond to last open bracket, string is invalid
            if c != brackets[stack[-1]]:
                return False
            stack.pop()
        else: # invalid char
            return False

    return len(stack) == 0

Which I can then test like so:

$ # with balanced brackets
$ python -m timeit --setup "import balance; s='{[(())]{()[()]}}' * 2000" -- "balance.using_replace(s)"
100 loops, best of 5: 2.23 msec per loop
$ python -m timeit --setup "import balance; s='{[(())]{()[()]}}' * 2000" -- "balance.using_stack(s)"
50 loops, best of 5: 4.19 msec per loop

Performance analysis

As one might expect, adding a couple imbalanced brackets at the start, middle or end of that test string has little effect on the replace-based algorithm, but completely determines when the stack-based algorithm can exit:

Test results for imbalanced inputs
$ # at the beginning
$ python -m timeit --setup "import balance; s='))' + '{[(())]{()[()]}}' * 2000" -- "balance.using_replace(s)"
100 loops, best of 5: 2.41 msec per loop
$ python -m timeit --setup "import balance; s='))' + '{[(())]{()[()]}}' * 2000" -- "balance.using_stack(s)"
500000 loops, best of 5: 719 nsec per loop
$ # in the middle
$ python -m timeit --setup "import balance; s='{[(())]{()[()]}}' * 1000 + '))' + '{[(())]{()[()]}}' * 1000" -- "balance.using_replace(s)"
100 loops, best of 5: 2.48 msec per loop
$ python -m timeit --setup "import balance; s='{[(())]{()[()]}}' * 1000 + '))' + '{[(())]{()[()]}}' * 1000" -- "balance.using_stack(s)"
100 loops, best of 5: 2.18 msec per loop
$ # at the end
$ python -m timeit --setup "import balance; s='{[(())]{()[()]}}' * 2000 + '))'" -- "balance.using_replace(s)"
100 loops, best of 5: 2.39 msec per loop
$ python -m timeit --setup "import balance; s='{[(())]{()[()]}}' * 2000 + '))'" -- "balance.using_stack(s)"
50 loops, best of 5: 4.21 msec per loop

The main reason that the replace-based approach is able to perform this well is because the length of the string dramatically decreases during the first few iterations. With the "balanced brackets" input, the string length is 32000 initially, 12000 after the first pass and zero for every other pass. (Other contributors may not have realized, or accounted for, the fact that .replace replaces every non-overlapping occurrence of the replacement string, in a single pass; and internally, the Python implementation can "cheat" and use mutable strings and string buffers.)

We get much worse performance with the "curious" test case, '{[((ab))]{()[()]}}' * 2000, similarly, because the length of the string drops from 36000 to 20000 in the first pass and then stays at 20000 for each subsequent pass.

Another poorly-performing case for the replace-based approach is deeply nested parentheses:

$ python -m timeit --setup "import balance; s='((((((((((((((((' * 1000 + '))))))))))))))))' * 1000" -- "balance.using_replace(s)"
1 loop, best of 5: 630 msec per loop
$ python -m timeit --setup "import balance; s='((((((((((((((((' * 1000 + '))))))))))))))))' * 1000" -- "balance.using_stack(s)"
50 loops, best of 5: 4.14 msec per loop

On average, the string length is half the original length, because it can only decrease steadily with each loop iteration, and only reaches zero at the end. Mixing up bracket types improves the performance, because it allows all the .replace calls to do work, and allows a lot of empty string processing at the end:

$ python -m timeit --setup "import balance; s='([{' * 5000 + '}])' * 5000" -- "balance.using_replace(s)"
1 loop, best of 5: 222 msec per loop

And this improves even more if we take care to match the order of brackets in the input to the order of .replace calls:

$ python -m timeit --setup "import balance; s='{[(' * 5000 + ')]}' * 5000" -- "balance.using_replace(s)"
2 loops, best of 5: 138 msec per loop

Performance improvements for the replace-based approach

As Dirk Herrmann‭ noted, iterating until the length stops changing avoids unnecessary iterations, which greatly improves performance especially in cases where an imbalance can be detected early. Using a modified version of using_replace (rearranging the loop a bit):

Improved string_replace and basic benchmark results

def using_replace(s):
    if len(s) % 2 != 0:
        return False
    prevlen = len(s)
    while True:
        if prevlen == 0:
            return True
        s = s.replace("()", "").replace("[]", "").replace("{}", "")
        if prevlen == len(s): # no improvement and didn't reach zero
            return False
        prevlen = len(s)

Repeating the tests:

$ # with balanced brackets
$ python -m timeit --setup "import balance; s='{[(())]{()[()]}}' * 2000" -- "balance.using_replace(s)"
1000 loops, best of 5: 234 usec per loop
$ # at the beginning
$ python -m timeit --setup "import balance; s='))' + '{[(())]{()[()]}}' * 2000" -- "balance.using_replace(s)"
1000 loops, best of 5: 250 usec per loop
$ # in the middle
$ python -m timeit --setup "import balance; s='{[(())]{()[()]}}' * 1000 + '))' + '{[(())]{()[()]}}' * 1000" -- "balance.using_replace(s)"
1000 loops, best of 5: 249 usec per loop
$ # at the end
$ python -m timeit --setup "import balance; s='{[(())]{()[()]}}' * 2000 + '))'" -- "balance.using_replace(s)"
1000 loops, best of 5: 248 usec per loop

The good cases are an order of magnitude faster again, by avoiding repeated .replace calls on empty strings or strings that contain only the unbalanced )) pair.

However, imbalanced brackets still fare poorly, because the string length does continually decrease, so that an early exit isn't possible:

$ python -m timeit --setup "import balance; s='((((((((((((((((' * 1000 + '))))))))))))))))' * 1000" -- "balance.using_replace(s)"
1 loop, best of 5: 631 msec per loop
$ python -m timeit --setup "import balance; s='([{' * 5000 + '}])' * 5000" -- "balance.using_replace(s)"
1 loop, best of 5: 223 msec per loop
$ python -m timeit --setup "import balance; s='{[(' * 5000 + ')]}' * 5000" -- "balance.using_replace(s)"
2 loops, best of 5: 137 msec per loop

Regex is counterproductive

Another avenue one might consider exploring is regular expressions. This would allow for multiple kinds of brackets to be replaced in a single pass, rather than making three method calls per loop; and it would mean not prioritizing one bracket type over another (at least, not to the same extent).

Code for regex-based approach
import re

pattern = re.compile('\(\)|\[\]|\{\}')

def using_regex(s):
    if len(s) % 2 != 0:
        return False
    prevlen = len(s)
    while True:
        if prevlen == 0:
            return True
        s = pattern.sub('', s)
        if prevlen == len(s):
            return False
        prevlen = len(s)

But this actually makes things much worse all around. The string replacement algorithm is already highly optimized, and even though the regular expression interface uses a state machine implemented in C, it adds too much complexity.

It does, however, reduce the importance of the order of brackets, as predicted:

$ python -m timeit --setup "import balance; s='{[(' * 5000 + ')]}' * 5000" -- "balance.using_regex(s)"
1 loop, best of 5: 3.24 sec per loop
$ python -m timeit --setup "import balance; s='([{' * 5000 + '}])' * 5000" -- "balance.using_regex(s)"
1 loop, best of 5: 3.38 sec per loop

Performance improvements for the stack-based approach

We don't need to test for invalid characters explicitly, because a non-bracket character will necessarily fail the other tests (for a valid open bracket or a matching close bracket).

We can also simplify the code by putting the corresponding close brackets on the stack when we find an open bracket. That way, the character that we should pop from the stack on a successful match, will already be what we need for the comparison - so we don't discard the result from pop.

We also then find that it's unnecessary to compute the close_brackets set, although this is trivial for long inputs.

Finally, using collections.deque rather than a list seems to offer a minor performance boost, even for the simple inputs where a balanced sequence is repeated many times (so that the stack should only ever grow to a few elements) - and the performance improvement is still minor, even dubious, for deeply nested inputs. I'm not entirely sure why this is; but I used it for the final version on principle.

Improved using_stack implementation and benchmark results

from collections import deque

def using_stack(s, brackets={ '(': ')', '[': ']', '{': '}' }):
    if len(s) % 2 != 0:
        return False

    open_brackets = set(brackets.keys())
    stack = deque()
    for c in s:
        if c in open_brackets:
            stack.append(brackets[c])
        # otherwise, must match the top of a non-empty stack
        elif len(stack) == 0 or c != stack.pop():
            return False

    return not stack

Results:

$ # with balanced brackets
$ python -m timeit --setup "import balance; s='{[(())]{()[()]}}' * 2000" -- "balance.using_stack(s)"
100 loops, best of 5: 3.59 msec per loop
$ # at the beginning
$ python -m timeit --setup "import balance; s='))' + '{[(())]{()[()]}}' * 2000" -- "balance.using_stack(s)"
500000 loops, best of 5: 503 nsec per loop
$ # in the middle
$ python -m timeit --setup "import balance; s='{[(())]{()[()]}}' * 1000 + '))' + '{[(())]{()[()]}}' * 1000" -- "balance.using_stack(s)"
200 loops, best of 5: 1.72 msec per loop
$ # at the end
$ python -m timeit --setup "import balance; s='{[(())]{()[()]}}' * 2000 + '))'" -- "balance.using_stack(s)"
100 loops, best of 5: 3.65 msec per loop
$ # deeply nested
$ python -m timeit --setup "import balance; s='((((((((((((((((' * 1000 + '))))))))))))))))' * 1000" -- "balance.using_stack(s)"
100 loops, best of 5: 3.43 msec per loop

So, a fairly consistent improvement of around 15%, across the board. Note that relying on "look before you leap" (catching a KeyError on brackets[c] rather than checking the key-set explicitly) makes things considerably slower (about double the runtime overall on the simple cases).

The best of both worlds?

The replace-based approach seems better for "normal" cases, but much slower if the string length can't be rapidly reduced by individual .replace passes. What if we tried using a heuristic to start out with the replace algorithm, and switch to the stack algorithm if the string length only goes down slightly - say, by less than an eighth of the original length?

Combined approach and benchmark results
def using_heuristic(s):
    if len(s) % 2 != 0:
        return False
    prevlen = len(s)
    while True:
        if prevlen == 0:
            return True
        s = s.replace("()", "").replace("[]", "").replace("{}", "")
        currlen = len(s)
        if currlen == prevlen:
            return False
        if currlen > (prevlen - (prevlen >> 3)):
            return using_stack(s)
        prevlen = currlen
$ # with balanced brackets
$ python -m timeit --setup "import balance; s='{[(())]{()[()]}}' * 2000" -- "balance.using_heuristic(s)"
1000 loops, best of 5: 233 usec per loop
$ # at the beginning
$ python -m timeit --setup "import balance; s='))' + '{[(())]{()[()]}}' * 2000" -- "balance.using_replace(s)"
1000 loops, best of 5: 247 usec per loop
$ # in the middle
$ python -m timeit --setup "import balance; s='{[(())]{()[()]}}' * 1000 + '))' + '{[(())]{()[()]}}' * 1000" -- "balance.using_heuristic(s)"
1000 loops, best of 5: 248 usec per loop
$ # at the end
$ python -m timeit --setup "import balance; s='{[(())]{()[()]}}' * 2000 + '))'" -- "balance.using_heuristic(s)"
1000 loops, best of 5: 249 usec per loop
$ # deeply nested
$ python -m timeit --setup "import balance; s='((((((((((((((((' * 1000 + '))))))))))))))))' * 1000" -- "balance.using_heuristic(s)"
100 loops, best of 5: 3.43 msec per loop

It works as anticipated. It can't bail out immediately when the imbalance is at the beginning of the string (because it will try the replacement approach at least once before giving up on that), but it otherwise effectively chooses the best approach.

It also lets both algorithms do what they're good at, at least to some extent:

$ python -m timeit --setup "import balance; s='(' * 1000 + ')' * 1000 + '()' * 14000" -- "balance.using_heuristic(s)"
1000 loops, best of 5: 317 usec per loop
$ python -m timeit --setup "import balance; s='(' * 1000 + ')' * 1000 + '()' * 14000" -- "balance.using_stack(s)"
100 loops, best of 5: 3.01 msec per loop
$ python -m timeit --setup "import balance; s='(' * 1000 + ')' * 1000 + '()' * 14000" -- "balance.using_replace(s)"
100 loops, best of 5: 2.99 msec per loop

Of course, as written this means that e.g. the if len(s) % 2 != 0: test is repeated, and of course the using_stack logic could be inlined. But these changes don't make an observable difference.

We also lost the ability to bail out early on strings that contain non-brackets. Here's a quick way to add that test back in:

delete_bracket_mapping = str.maketrans('', '', '()[]{}')

if s.translate(delete_bracket_mapping):
    # After deleting all brackets, there are still string contents.
    # Therefore, there are non-bracket characters.
    return False
History
Why does this post require attention from curators or moderators?
You might want to add some details to your flag.

0 comment threads

Sign up to answer this question »