Welcome to Software Development on Codidact!
Detecting balanced parentheses in Python
The problem
Given a string s containing just the characters '(', ')', '{', '}', '[' and ']', determine if the input string is valid. An input string is valid if:
- Open brackets are closed by the same type of brackets.
- Open brackets are closed in the correct order.
The solution
def isValid(s: str) -> bool:
    if len(s) % 2 != 0:
        return False
    for i in range(len(s) // 2):
        s = s.replace("()", "").replace("[]", "").replace("{}", "")
    return len(s) == 0
My approach to the problem is replacing the pairs. The string is balanced if it is empty after replacing len(s) // 2 times. Is this a good approach? How can I improve my algorithm?
5 answers
Use a stack while scanning your string just once from left to right. There is no need for multiple expensive string replacements. If implemented right, the stack will never contain more than len(s)/2 elements.
The algorithm is rather straightforward:
- First check whether the string is of odd or even length. A string of odd length must be unbalanced. (Your code already does this part; I mention it again only to give a full description of the stack-based algorithm.)
- For any opening parenthesis/bracket/brace in the string, push some value/token that represents the type of this opening bracket onto the stack (that can be the opening bracket characters themselves, the associated closing brackets, or some other identifying values such as from an enum). If the stack already holds len(s)/2 elements before a push, you know the string must be unbalanced and the routine can abort now.
- For any closing bracket in the string, pop a value/token from the stack and check whether that value/token represents the same bracket type as the closing bracket. If the stack is empty before attempting to pop, or the popped value/token does not match the type of the closing bracket in the string, you know the string must be unbalanced and the routine can abort now.
- If the stack is not empty when you have finished scanning the string from left to right, you again know that the string must be unbalanced.
- If the stack is empty after completing the scan, you know the string was balanced (or the string was empty).
I leave it to you as an exercise to translate this algorithm into concrete Python code.
P.S.: As a further exercise, you might try implementing the same stack-based algorithm without an explicit stack data structure. This can be achieved by implementing the algorithm recursively, with the call stack conceptually taking the role of the stack that keeps track of the (nested) bracket pairs.
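One possible shape for the recursive variant from the P.S. is sketched below. The names CLOSER, _parse and is_balanced_recursive are made up for this illustration, and this is only one of several ways to do it. Note that deeply nested input will run into Python's recursion limit, which an explicit stack avoids.

```python
# Hypothetical names; a sketch of the recursive idea, not a reference solution.
CLOSER = {'(': ')', '[': ']', '{': '}'}

def _parse(s, i):
    """Consume balanced groups starting at index i.

    Returns the index of the first character that is not part of a
    balanced group, or -1 if an opened bracket is not closed correctly.
    """
    while i < len(s) and s[i] in CLOSER:
        expected = CLOSER[s[i]]
        i = _parse(s, i + 1)  # handle everything nested inside this bracket
        if i < 0 or i >= len(s) or s[i] != expected:
            return -1
        i += 1  # step over the matching closing bracket
    return i

def is_balanced_recursive(s):
    # Balanced only if the parse consumed the whole string.
    return _parse(s, 0) == len(s)
```

Each recursive call plays the role of one stack entry: the local variable expected remembers which closing bracket that level is waiting for.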
Instead of replacing the brackets, you could do just one loop, and keep a stack with the opening brackets. Every time you find a closing bracket, check if it corresponds to the stack top: if it's not, the string is invalid, else, pop the opening bracket and proceed to the next character.
Do it until the end of the string, and the stack will be empty if it's a valid one:
def is_balanced(s, brackets={ '(': ')', '[': ']', '{': '}' }):
    if len(s) % 2 != 0:
        return False
    open_brackets = set(brackets.keys())
    close_brackets = set(brackets.values())
    stack = []
    for c in s:
        if c in open_brackets:
            stack.append(c)
        elif c in close_brackets:
            # close bracket without respective opening, string is invalid
            if len(stack) == 0:
                return False
            # close bracket doesn't correspond to last open bracket, string is invalid
            if c != brackets[stack[-1]]:
                return False
            stack.pop()
        else: # invalid char
            return False
    return len(stack) == 0
I put the keys and values inside sets (because searching with the in operator is O(1)), but it didn't make much difference. You can remove them if you want.
I admit I initially thought that calling replace 3 times per iteration, and always doing it N / 2 times, would be slower, but to my surprise I found out that it depends on the string being tested. A quick test with the timeit module (using your isValid function and is_balanced as defined above):
from timeit import timeit
# run 10 times each test
params = { 'number' : 10, 'globals': globals() }
s = '{[(())]{()[()]}}' * 1000
s += '))'
s += '{[(())]{()[()]}}' * 1000
print('--------------\nUnbalanced in the middle')
print(timeit('isValid(s)', **params))
print(timeit('is_balanced(s)', **params))
s = '))'
s += '{[(())]{()[()]}}' * 1000
s += '{[(())]{()[()]}}' * 1000
print('--------------\nUnbalanced in the beginning')
print(timeit('isValid(s)', **params))
print(timeit('is_balanced(s)', **params))
s = '{[(())]{()[()]}}' * 2000
s += '))'
print('--------------\nUnbalanced in the end')
print(timeit('isValid(s)', **params))
print(timeit('is_balanced(s)', **params))
s = '{[(())]{()[()]}}' * 2000
print('--------------\nBalanced')
print(timeit('isValid(s)', **params))
print(timeit('is_balanced(s)', **params))
On my machine, I got these results:
--------------
Unbalanced in the middle
0.03587988700019196
0.02324946300359443
--------------
Unbalanced in the beginning
0.026359478004451375
1.2465003237593919e-05
--------------
Unbalanced in the end
0.025080425999476574
0.041097565001109615
--------------
Balanced
0.022329219995299354
0.04195229199831374
For a balanced string, surprisingly (at least to me), the replace approach is faster.
For unbalanced strings, it depends on where the unbalanced brackets are. If they are in the beginning or in the middle, the stack approach wins (with a huge margin if the problem is in the beginning, because the empty-stack check rejects the very first closing bracket). But if they are in the end, the replace approach wins.
I really thought that using replace would always be slower, because each call to replace needs to scan the whole string, searching for the substring to be replaced (and return a new string). Even if you're looping N / 2 times, you're calling replace 3 times per iteration, which I really thought would have a huge cost. But that wasn't the case, and this really surprised me.
The only case where the difference is really huge is the aforementioned one, when the unbalanced bracket is in the beginning. One particular case is curious:
s = '{[((ab))]{()[()]}}' * 2000
With this, the replace approach took 4.5 seconds, while the stack approach took around 0.00001 seconds. Not sure if there's an implementation detail of the replace method that makes this particular case very slow (tested on Python 3.8.5, Ubuntu 20.04).
Another case where the stack approach wins by one order of magnitude is:
# lots of opening brackets, followed by lots of closing brackets
s = ('(' * 2000) + (')' * 2000)
But as I said, I was expecting this difference in all cases. Anyway, that's why we should always test everything, instead of trusting what we believe to be "obvious".
BTW, when testing in other environments, the results vary. On IdeOne.com, the stack approach is always faster for unbalanced strings, and almost the same for balanced ones. And on Repl.it, the results were similar to my machine's.
Is this a good approach? How can I improve my algorithm?
Your code is correct and simple. That is good, and it may be all that is needed. You have received some responses that indicate that there would be a performance problem. But, whether or not performance is an issue, and what the precise performance problem is, depends on how your code is meant to be used:
Will the code be executed frequently? Will it be executed with long input strings or with short ones? What is more likely: valid or invalid input strings? For which case do you have to give a fast answer: valid or invalid input strings? And, as the experiments from @hkotsubo show, even the expected contents of strings can play a role...
Unless that is clear, it does not make much sense to optimize the code at the cost of clarity. In your case I assume that it is an exercise from a programming lecture. Then, the code will likely be run, say, 10-100 times or so for checking your result, likely with input strings of short or moderate length. If that is true, the total execution time of all executions ever of your code (before it is abandoned and you move on to the next exercise) will likely be below 10 minutes. Thus, investing 11 minutes in optimization could already be a waste of time :-).
In fact, under this assumption your code is even more complex than it needs to be: You have an initial check for odd string lengths. This check is redundant because the rest of the function works without it: an odd-length string can never become empty by deleting pairs. The check optimizes the function's response time for some invalid cases, but at the cost of prolonging the computation for all the other cases. Thus, I recommend removing the initial check to arrive at what is most likely the simplest possible solution for the problem.
Some more remarks:
- In the explanation of the review request you gave a short summary of your approach:
  My approach to the problem is replacing the pairs.
  Such a short summary would also be good as a comment in the code, maybe a bit extended as
  The function analyzes the parenthesization "from inside" by iteratively deleting valid adjacent pairs of parentheses.
  but given the simplicity of the code the value of such a comment may already be debatable.
- For such a simple function, short variable names as you have chosen are fine. The name of the function fits the description of the exercise and thus may be OK in this case. Generally, for symbols to be used by the users of your code, a more descriptive name should be chosen. Again, @hkotsubo's answer is worth looking at.
- A list of test cases would be good to have.
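As a rough idea of what such a list might contain, here is a small sketch. The cases and the helper name failing_cases are my own choice, not something from the exercise; isValid is copied from the question so that the block runs on its own.

```python
def isValid(s: str) -> bool:
    # Copied from the question so the example is self-contained.
    if len(s) % 2 != 0:
        return False
    for i in range(len(s) // 2):
        s = s.replace("()", "").replace("[]", "").replace("{}", "")
    return len(s) == 0

# (input, expected result) pairs; an illustrative selection only.
CASES = [
    ("", True),
    ("()", True),
    ("()[]{}", True),
    ("{[()]}", True),
    ("(", False),      # odd length
    ("(]", False),     # wrong bracket type
    ("([)]", False),   # wrong closing order
    (")(", False),     # closes before it opens
    ("))((", False),
]

def failing_cases(check):
    """Return the inputs for which check() disagrees with the expectation."""
    return [s for s, expected in CASES if check(s) != expected]

print(failing_cases(isValid))
```

The same list can be passed to any alternative implementation, which makes it easy to confirm that an optimized version still behaves like the original.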
And, finally: It is quite possible that your instructors wanted you to learn about the use of stacks for such kinds of problems. Your solution may have come as a surprise to them. In this case, certainly, and also for the general benefit of learning new techniques, I can only recommend looking at the proposals brought up by the other commenters.
You've got an inefficiency in your code, as you always do replacements 3/2 times the length of the string. That is unnecessarily expensive.
By instead testing in each iteration whether the length actually changed, you get a much improved performance:
def is_valid(s: str) -> bool:
    # an odd-length string can never be balanced
    if len(s) % 2 != 0:
        return False
    # this initial value causes the loop to be skipped entirely for empty strings
    prevlen = 0
    # stop as soon as no further replacements have been made
    while len(s) != prevlen:
        prevlen = len(s)
        s = s.replace("()", "").replace("[]", "").replace("{}", "")
    return len(s) == 0
I've put it together with your code and hkotsubo's timing code on tio.run, and got the following:
--------------
Unbalanced in the middle
0.04957069695228711
0.002779866976197809
--------------
Unbalanced in the beginning
0.05233071401016787
0.0026999289984814823
--------------
Unbalanced in the end
0.05092682002577931
0.0026755660073831677
--------------
Balanced
0.047405752004124224
0.002398615994025022
Command-line timing
Rather than using separate code for timing, I tried running the timeit module as a command-line tool. I wrote initial versions of the implementations based on the OP and hkotsubo's answer, as a file balance.py:
def using_replace(s):
    if len(s) % 2 != 0:
        return False
    for i in range(len(s) // 2):
        s = s.replace("()", "").replace("[]", "").replace("{}", "")
    return len(s) == 0

def using_stack(s, brackets={ '(': ')', '[': ']', '{': '}' }):
    if len(s) % 2 != 0:
        return False
    open_brackets = set(brackets.keys())
    close_brackets = set(brackets.values())
    stack = []
    for c in s:
        if c in open_brackets:
            stack.append(c)
        elif c in close_brackets:
            # close bracket without respective opening, string is invalid
            if len(stack) == 0:
                return False
            # close bracket doesn't correspond to last open bracket, string is invalid
            if c != brackets[stack[-1]]:
                return False
            stack.pop()
        else: # invalid char
            return False
    return len(stack) == 0
Which I can then test like so:
$ # with balanced brackets
$ python -m timeit --setup "import balance; s='{[(())]{()[()]}}' * 2000" -- "balance.using_replace(s)"
100 loops, best of 5: 2.23 msec per loop
$ python -m timeit --setup "import balance; s='{[(())]{()[()]}}' * 2000" -- "balance.using_stack(s)"
50 loops, best of 5: 4.19 msec per loop
Performance analysis
As one might expect, adding a couple of imbalanced brackets at the start, middle or end of that test string has little effect on the replace-based algorithm, but completely determines when the stack-based algorithm can exit:
Test results for imbalanced inputs
$ # at the beginning
$ python -m timeit --setup "import balance; s='))' + '{[(())]{()[()]}}' * 2000" -- "balance.using_replace(s)"
100 loops, best of 5: 2.41 msec per loop
$ python -m timeit --setup "import balance; s='))' + '{[(())]{()[()]}}' * 2000" -- "balance.using_stack(s)"
500000 loops, best of 5: 719 nsec per loop
$ # in the middle
$ python -m timeit --setup "import balance; s='{[(())]{()[()]}}' * 1000 + '))' + '{[(())]{()[()]}}' * 1000" -- "balance.using_replace(s)"
100 loops, best of 5: 2.48 msec per loop
$ python -m timeit --setup "import balance; s='{[(())]{()[()]}}' * 1000 + '))' + '{[(())]{()[()]}}' * 1000" -- "balance.using_stack(s)"
100 loops, best of 5: 2.18 msec per loop
$ # at the end
$ python -m timeit --setup "import balance; s='{[(())]{()[()]}}' * 2000 + '))'" -- "balance.using_replace(s)"
100 loops, best of 5: 2.39 msec per loop
$ python -m timeit --setup "import balance; s='{[(())]{()[()]}}' * 2000 + '))'" -- "balance.using_stack(s)"
50 loops, best of 5: 4.21 msec per loop
The main reason that the replace-based approach is able to perform this well is that the length of the string decreases dramatically during the first few iterations. With the "balanced brackets" input, the string length is 32000 initially, 12000 after the first pass, and zero after the second pass and for every pass thereafter. (Other contributors may not have realized, or accounted for, the fact that .replace replaces every non-overlapping occurrence of the search string in a single pass; and internally, the Python implementation can "cheat" and use mutable strings and string buffers.)
We get much worse performance with the "curious" test case, '{[((ab))]{()[()]}}' * 2000, for a similar reason: the length of the string drops from 36000 to 20000 in the first pass and then stays at 20000 for each subsequent pass.
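The lengths quoted above can be reproduced with a small tracing helper. This is only a sketch for illustration; the function name lengths_per_pass is made up here and is not part of balance.py.

```python
def lengths_per_pass(s, passes=2):
    """Return the string length before the first pass and after each pass."""
    lengths = [len(s)]
    for _ in range(passes):
        # One pass = the same three chained replacements as in using_replace.
        s = s.replace("()", "").replace("[]", "").replace("{}", "")
        lengths.append(len(s))
    return lengths

print(lengths_per_pass('{[(())]{()[()]}}' * 2000))    # [32000, 12000, 0]
print(lengths_per_pass('{[((ab))]{()[()]}}' * 2000))  # [36000, 20000, 20000]
```

In the first case almost all the work happens on the first pass; in the second, every later pass rescans 20000 characters without removing anything.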
Another poorly-performing case for the replace-based approach is deeply nested parentheses:
$ python -m timeit --setup "import balance; s='((((((((((((((((' * 1000 + '))))))))))))))))' * 1000" -- "balance.using_replace(s)"
1 loop, best of 5: 630 msec per loop
$ python -m timeit --setup "import balance; s='((((((((((((((((' * 1000 + '))))))))))))))))' * 1000" -- "balance.using_stack(s)"
50 loops, best of 5: 4.14 msec per loop
On average, the string length is half the original length, because it can only decrease steadily with each loop iteration, and only reaches zero at the end. Mixing up bracket types improves the performance, because it allows all the .replace calls to do work, and allows a lot of empty string processing at the end:
$ python -m timeit --setup "import balance; s='([{' * 5000 + '}])' * 5000" -- "balance.using_replace(s)"
1 loop, best of 5: 222 msec per loop
And this improves even more if we take care to match the order of brackets in the input to the order of .replace calls:
$ python -m timeit --setup "import balance; s='{[(' * 5000 + ')]}' * 5000" -- "balance.using_replace(s)"
2 loops, best of 5: 138 msec per loop
Performance improvements for the replace-based approach
As Dirk Herrmann noted, iterating until the length stops changing avoids unnecessary iterations, which greatly improves performance, especially in cases where an imbalance can be detected early. Using a modified version of using_replace (rearranging the loop a bit):
Improved using_replace and basic benchmark results
def using_replace(s):
    if len(s) % 2 != 0:
        return False
    prevlen = len(s)
    while True:
        if prevlen == 0:
            return True
        s = s.replace("()", "").replace("[]", "").replace("{}", "")
        if prevlen == len(s): # no improvement and didn't reach zero
            return False
        prevlen = len(s)
Repeating the tests:
$ # with balanced brackets
$ python -m timeit --setup "import balance; s='{[(())]{()[()]}}' * 2000" -- "balance.using_replace(s)"
1000 loops, best of 5: 234 usec per loop
$ # at the beginning
$ python -m timeit --setup "import balance; s='))' + '{[(())]{()[()]}}' * 2000" -- "balance.using_replace(s)"
1000 loops, best of 5: 250 usec per loop
$ # in the middle
$ python -m timeit --setup "import balance; s='{[(())]{()[()]}}' * 1000 + '))' + '{[(())]{()[()]}}' * 1000" -- "balance.using_replace(s)"
1000 loops, best of 5: 249 usec per loop
$ # at the end
$ python -m timeit --setup "import balance; s='{[(())]{()[()]}}' * 2000 + '))'" -- "balance.using_replace(s)"
1000 loops, best of 5: 248 usec per loop
The good cases are an order of magnitude faster again, by avoiding repeated .replace calls on empty strings or strings that contain only the unbalanced )) pair.
However, deeply nested brackets still fare poorly, because the string length does continually decrease (if only slightly per pass), so that an early exit isn't possible:
$ python -m timeit --setup "import balance; s='((((((((((((((((' * 1000 + '))))))))))))))))' * 1000" -- "balance.using_replace(s)"
1 loop, best of 5: 631 msec per loop
$ python -m timeit --setup "import balance; s='([{' * 5000 + '}])' * 5000" -- "balance.using_replace(s)"
1 loop, best of 5: 223 msec per loop
$ python -m timeit --setup "import balance; s='{[(' * 5000 + ')]}' * 5000" -- "balance.using_replace(s)"
2 loops, best of 5: 137 msec per loop
Regex is counterproductive
Another avenue one might consider exploring is regular expressions. This would allow for multiple kinds of brackets to be replaced in a single pass, rather than making three method calls per loop; and it would mean not prioritizing one bracket type over another (at least, not to the same extent).
Code for regex-based approach
import re

# raw string, so that the backslashes reach the regex engine unchanged
pattern = re.compile(r'\(\)|\[\]|\{\}')

def using_regex(s):
    if len(s) % 2 != 0:
        return False
    prevlen = len(s)
    while True:
        if prevlen == 0:
            return True
        s = pattern.sub('', s)
        if prevlen == len(s):
            return False
        prevlen = len(s)
But this actually makes things much worse all around. The string replacement algorithm is already highly optimized, and even though the regular expression interface uses a state machine implemented in C, it adds too much complexity.
It does, however, reduce the importance of the order of brackets, as predicted:
$ python -m timeit --setup "import balance; s='{[(' * 5000 + ')]}' * 5000" -- "balance.using_regex(s)"
1 loop, best of 5: 3.24 sec per loop
$ python -m timeit --setup "import balance; s='([{' * 5000 + '}])' * 5000" -- "balance.using_regex(s)"
1 loop, best of 5: 3.38 sec per loop
Performance improvements for the stack-based approach
We don't need to test for invalid characters explicitly, because a non-bracket character will necessarily fail the other tests (for a valid open bracket or a matching close bracket).
We can also simplify the code by putting the corresponding close brackets on the stack when we find an open bracket. That way, the character that we pop from the stack on a successful match will already be what we need for the comparison, so we don't discard the result from pop.
We also then find that it's unnecessary to compute the close_brackets set, although the saving is negligible for long inputs.
Finally, using collections.deque rather than a list seems to offer a minor performance boost, even for the simple inputs where a balanced sequence is repeated many times (so that the stack should only ever grow to a few elements), and the improvement is still minor, even dubious, for deeply nested inputs. I'm not entirely sure why this is; but I used it for the final version on principle.
Improved using_stack implementation and benchmark results
from collections import deque

def using_stack(s, brackets={ '(': ')', '[': ']', '{': '}' }):
    if len(s) % 2 != 0:
        return False
    open_brackets = set(brackets.keys())
    stack = deque()
    for c in s:
        if c in open_brackets:
            stack.append(brackets[c])
        # otherwise, must match the top of a non-empty stack
        elif len(stack) == 0 or c != stack.pop():
            return False
    return not stack
Results:
$ # with balanced brackets
$ python -m timeit --setup "import balance; s='{[(())]{()[()]}}' * 2000" -- "balance.using_stack(s)"
100 loops, best of 5: 3.59 msec per loop
$ # at the beginning
$ python -m timeit --setup "import balance; s='))' + '{[(())]{()[()]}}' * 2000" -- "balance.using_stack(s)"
500000 loops, best of 5: 503 nsec per loop
$ # in the middle
$ python -m timeit --setup "import balance; s='{[(())]{()[()]}}' * 1000 + '))' + '{[(())]{()[()]}}' * 1000" -- "balance.using_stack(s)"
200 loops, best of 5: 1.72 msec per loop
$ # at the end
$ python -m timeit --setup "import balance; s='{[(())]{()[()]}}' * 2000 + '))'" -- "balance.using_stack(s)"
100 loops, best of 5: 3.65 msec per loop
$ # deeply nested
$ python -m timeit --setup "import balance; s='((((((((((((((((' * 1000 + '))))))))))))))))' * 1000" -- "balance.using_stack(s)"
100 loops, best of 5: 3.43 msec per loop
So, a fairly consistent improvement of around 15%, across the board. Note that relying on "easier to ask forgiveness than permission" (catching a KeyError on brackets[c], rather than the "look before you leap" style of checking the key-set explicitly) makes things considerably slower (about double the runtime overall on the simple cases).
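For reference, the KeyError-catching variant mentioned above might look roughly like the following. This is a reconstruction under my own assumptions, with the hypothetical name using_stack_exceptions; it is not necessarily the exact code that was timed.

```python
def using_stack_exceptions(s, brackets={'(': ')', '[': ']', '{': '}'}):
    if len(s) % 2 != 0:
        return False
    stack = []
    for c in s:
        try:
            # Succeeds only for opening brackets; remember the expected closer.
            stack.append(brackets[c])
        except KeyError:
            # Anything else must match the top of a non-empty stack.
            if not stack or c != stack.pop():
                return False
    return not stack
```

In the repeated-pattern test inputs, half of all characters are closing brackets, so the exception path is taken very often, which is a plausible reason for the slowdown.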
The best of both worlds?
The replace-based approach seems better for "normal" cases, but much slower if the string length can't be rapidly reduced by individual .replace passes.
What if we tried using a heuristic to start out with the replace algorithm, and switch to the stack algorithm if a pass only shortens the string slightly, say, by less than an eighth of its length before that pass?
Combined approach and benchmark results
def using_heuristic(s):
    if len(s) % 2 != 0:
        return False
    prevlen = len(s)
    while True:
        if prevlen == 0:
            return True
        s = s.replace("()", "").replace("[]", "").replace("{}", "")
        currlen = len(s)
        if currlen == prevlen:
            return False
        if currlen > (prevlen - (prevlen >> 3)):
            return using_stack(s)
        prevlen = currlen
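The switch condition in the last if can be checked by hand for two of the test inputs. The helper name switches_to_stack below is only for illustration; it repeats the comparison from using_heuristic.

```python
def switches_to_stack(prevlen, currlen):
    # Same comparison as in using_heuristic: prevlen >> 3 is prevlen // 8,
    # so the threshold is seven eighths of the previous length.
    return currlen > prevlen - (prevlen >> 3)

# Repeated balanced pattern: 32000 -> 12000 after one pass, keep replacing.
print(switches_to_stack(32000, 12000))  # False
# Deeply nested parentheses: 32000 -> 31998 after one pass, hand over to the stack.
print(switches_to_stack(32000, 31998))  # True
```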
$ # with balanced brackets
$ python -m timeit --setup "import balance; s='{[(())]{()[()]}}' * 2000" -- "balance.using_heuristic(s)"
1000 loops, best of 5: 233 usec per loop
$ # at the beginning
$ python -m timeit --setup "import balance; s='))' + '{[(())]{()[()]}}' * 2000" -- "balance.using_replace(s)"
1000 loops, best of 5: 247 usec per loop
$ # in the middle
$ python -m timeit --setup "import balance; s='{[(())]{()[()]}}' * 1000 + '))' + '{[(())]{()[()]}}' * 1000" -- "balance.using_heuristic(s)"
1000 loops, best of 5: 248 usec per loop
$ # at the end
$ python -m timeit --setup "import balance; s='{[(())]{()[()]}}' * 2000 + '))'" -- "balance.using_heuristic(s)"
1000 loops, best of 5: 249 usec per loop
$ # deeply nested
$ python -m timeit --setup "import balance; s='((((((((((((((((' * 1000 + '))))))))))))))))' * 1000" -- "balance.using_heuristic(s)"
100 loops, best of 5: 3.43 msec per loop
It works as anticipated. It can't bail out immediately when the imbalance is at the beginning of the string (because it will try the replacement approach at least once before giving up on that), but it otherwise effectively chooses the best approach.
It also lets both algorithms do what they're good at, at least to some extent:
$ python -m timeit --setup "import balance; s='(' * 1000 + ')' * 1000 + '()' * 14000" -- "balance.using_heuristic(s)"
1000 loops, best of 5: 317 usec per loop
$ python -m timeit --setup "import balance; s='(' * 1000 + ')' * 1000 + '()' * 14000" -- "balance.using_stack(s)"
100 loops, best of 5: 3.01 msec per loop
$ python -m timeit --setup "import balance; s='(' * 1000 + ')' * 1000 + '()' * 14000" -- "balance.using_replace(s)"
100 loops, best of 5: 2.99 msec per loop
Of course, as written this means that e.g. the if len(s) % 2 != 0: test is repeated, and of course the using_stack logic could be inlined. But these changes don't make an observable difference.
We also lost the ability to bail out early on strings that contain non-brackets. Here's a quick way to add that test back in:
delete_bracket_mapping = str.maketrans('', '', '()[]{}')
if s.translate(delete_bracket_mapping):
    # After deleting all brackets, there are still string contents.
    # Therefore, there are non-bracket characters.
    return False
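Since the fragment above only makes sense inside a function body, here is the same check as a standalone helper. The names _DELETE_BRACKETS and contains_non_brackets are hypothetical, chosen for this sketch.

```python
# Build the translation table once, at import time.
_DELETE_BRACKETS = str.maketrans('', '', '()[]{}')

def contains_non_brackets(s):
    """True if anything is left after deleting every bracket character."""
    return bool(s.translate(_DELETE_BRACKETS))
```

A checker could call this first and return False right away when it reports True, at the cost of one extra pass over the input.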