Post History

Q&A Python Regex to parse multiple "word. word. word."

posted 4y ago by hkotsubo‭ · edited 3y ago by hkotsubo‭

Answer

#4: Post edited by

hkotsubo‭ · 2021-09-16T12:26:35Z (over 3 years ago)

First of all, let's understand why your regex didn't work.

The first part is `\w+\.\s`, which is "one or more alphanumeric characters" (`\w+`), followed by a dot and a space (`\.\s`). If the regex were only this, it would *match* `THIS. ` (the word "THIS", plus the dot and space after it).

But you also used the [*lookahead*](https://www.regular-expressions.info/lookaround.html) `(?=[\.\s])`, which means that after `\w+\.\s` there **must** be another character, and that character must be a dot or a space. But after `THIS. ` there's a letter "T", so the *lookahead* fails in this case, and no match is found there.

It also can't find a match after "THAT", because it's followed by two dots (but the regex is looking for one dot and one space). And after "OTHER", there are no dots, only spaces, so the regex can't find any matches in this string.
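
To check the explanation above, here's a minimal sketch. Note that the full pattern is an assumption: I rebuilt it by joining the two parts described above (`\w+\.\s` plus the lookahead), and the sample string is the one used in the examples of this answer.

```python
import re

line = 'THIS. THAT..OTHER '

# Only the first part: word, dot, whitespace.
first_part = re.search(r'\w+\.\s', line)

# ASSUMPTION: the question's full pattern, rebuilt from the two parts above.
with_lookahead = re.search(r'\w+\.\s(?=[\.\s])', line)

print(repr(first_part[0]))  # 'THIS. '
print(with_lookahead)       # None
```

The first search succeeds, and adding the lookahead is what makes the whole thing fail for this string.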

---
Not entirely clear what parts of the string you want, but anyway...

If you want the words that are followed by a dot, you could do this:

```
import re

lines = [ 'THIS. THAT..OTHER ' ]

pattern = re.compile(r'\w+\.')
for line in lines:
    matches = pattern.findall(line)
    if matches:
        print(matches) # ['THIS.', 'THAT.']
    else:
        print("REGEX FAILED: [{}]".format(line))
```

`findall` returns a list containing all the matches found. In this case, I'm searching for `\w+\.` (one or more alphanumeric characters followed by a dot). The result is the list `['THIS.', 'THAT.']`.
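
If you also need to know *where* each word was found, `finditer` might be an option. This is just a sketch using the same sample string; the `found` list is a name I made up for illustration.

```python
import re

line = 'THIS. THAT..OTHER '
pattern = re.compile(r'\w+\.')

# finditer yields Match objects, so the position of each match is available too
found = [(m[0], m.start(), m.end()) for m in pattern.finditer(line)]
print(found)  # [('THIS.', 0, 5), ('THAT.', 6, 11)]
```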

But if you want a single string containing both words, then you could do:

```
pattern = re.compile(r'(\w+\.\s?)+')
for line in lines:
    match = pattern.search(line)
    if match:
        print(match[0]) # THIS. THAT.
    else:
        print("REGEX FAILED: [{}]".format(line))
```

Now I'm searching for `\w+\.` (alphanumeric characters followed by a dot), followed by an [optional](http://regular-expressions.info/optional.html) space (`\s?`). And this whole thing (the word, dot and optional space) can be [repeated one or more times](https://www.regular-expressions.info/repeat.html) (this is indicated by the `+` at the end, which applies to everything inside the parentheses).

This finds the string `'THIS. THAT.'` (obtained from the `Match` object returned by the `search` method).
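
One thing to be aware of with the repeated group: the parentheses also create a capturing group, and a repeated capturing group keeps only its last repetition. That's why `match[0]` (the whole match) is used above. The sketch below shows the difference; the non-capturing variant `(?:...)` is my own suggestion, for the case where you'd want to use `findall` with this pattern.

```python
import re

line = 'THIS. THAT..OTHER '

capturing = re.compile(r'(\w+\.\s?)+')
non_capturing = re.compile(r'(?:\w+\.\s?)+')

match = capturing.search(line)
whole, last_repetition = match[0], match[1]
print(whole)            # THIS. THAT.
print(last_repetition)  # THAT.

# With one capturing group, findall returns the group instead of the whole match
with_group = capturing.findall(line)
without_group = non_capturing.findall(line)
print(with_group)     # ['THAT.']
print(without_group)  # ['THIS. THAT.']
```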

---
Not sure if this will work for all your cases (as you didn't specify all of them, I'm just assuming it's a "word, dot, space, word, dot" sequence), but anyway, you can make some adjustments according to your specific needs.

If you want to get the words from the beginning, [you can use `match` instead of `search`](https://docs.python.org/3/library/re.html#search-vs-match). Or, you can also make it explicit in the expression, by using the [anchor](https://www.regular-expressions.info/anchors.html) `^` (which indicates the beginning of the string):

```
pattern = re.compile(r'^(\w+\.\s?)+')
```
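
To see the difference between the three options, here's a small sketch. The input with leading dashes is hypothetical (it's not from the question); I added it only so that the words don't start at the beginning of the string.

```python
import re

pattern = re.compile(r'(\w+\.\s?)+')
anchored = re.compile(r'^(\w+\.\s?)+')

# hypothetical input: the sample string with some leading junk
line = '-- THIS. THAT..OTHER '

found_anywhere = pattern.search(line)   # scans the whole string
found_at_start = pattern.match(line)    # only tries at position 0
found_anchored = anchored.search(line)  # "^" has the same effect here

print(found_anywhere[0])  # THIS. THAT.
print(found_at_start)     # None
print(found_anchored)     # None
```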

If you want to limit to only 2 words, just change the `+` quantifier:

```
pattern = re.compile(r'^(\w+\.\s?){2}')
```

You can customize the quantity in many different ways:

- `{2}`: exactly two occurrences
- `{2,}`: two or more occurrences (no upper limit)
- `{2,5}`: at least two, at most five occurrences
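
A quick sketch of how these quantifiers behave. The sample string is made up, and I used `{2,3}` instead of `{2,5}` only to make the upper limit visible with four words. Note that the optional `\s?` also takes the space after the last matched word, hence the `repr`.

```python
import re

line = 'ONE. TWO. THREE. FOUR.'  # hypothetical sample

exactly_two = re.compile(r'^(\w+\.\s?){2}').search(line)[0]
two_or_more = re.compile(r'^(\w+\.\s?){2,}').search(line)[0]
two_to_three = re.compile(r'^(\w+\.\s?){2,3}').search(line)[0]

print(repr(exactly_two))   # 'ONE. TWO. '
print(repr(two_or_more))   # 'ONE. TWO. THREE. FOUR.'
print(repr(two_to_three))  # 'ONE. TWO. THREE. '

# With only one "word + dot" at the start, {2} has nothing to match
too_few = re.compile(r'^(\w+\.\s?){2}').search('ONE..TWO')
print(too_few)  # None
```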

---
Another caveat: in Python 3, the `\w` shorthand matches letters (from all languages, not only ASCII letters), digits (not restricted to Arabic digits, so it'll also match characters like `३` - [DEVANAGARI DIGIT THREE](https://www.fileformat.info/info/unicode/char/0969/index.htm)) and the character `_`. If you want to restrict to ASCII letters only, you could change it to:

```
pattern = re.compile(r'([a-zA-Z]+\.\s?)+')
```

And `\s` matches not only spaces, but also tabs, line breaks and [many other characters](https://docs.python.org/3/library/re.html#index-30) (such as [non-breaking spaces](https://en.wikipedia.org/wiki/Non-breaking_space) - see [here](https://software.codidact.com/posts/282065) for a more detailed explanation). If you want to match only a space, use this:

```
# note the space before "?"
pattern = re.compile(r'([a-zA-Z]+\. ?)+')
#                                  ↑
#                       there's a space here
```
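
Both caveats can be checked with a few `fullmatch` calls. This is only a sketch: the test strings are made up, and the `re.ASCII` flag is an alternative I'm adding here (it makes `\w` and `\s` ASCII-only, but `\w` still includes digits and `_`, unlike `[a-zA-Z]`).

```python
import re

# \w is Unicode-aware by default for str patterns
w_unicode = bool(re.fullmatch(r'\w+', 'ação३_'))
w_ascii = bool(re.fullmatch(r'\w+', 'ação३_', re.ASCII))
letters_only = bool(re.fullmatch(r'[a-zA-Z]+', 'THIS_1'))
print(w_unicode, w_ascii, letters_only)  # True False False

# \s matches any whitespace, not just the space character
nbsp = '\u00a0'  # non-breaking space
s_tab = bool(re.fullmatch(r'\s', '\t'))
s_nbsp = bool(re.fullmatch(r'\s', nbsp))
space_nbsp = bool(re.fullmatch(r' ', nbsp))
print(s_tab, s_nbsp, space_nbsp)  # True True False
```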

And so on... the possibilities are endless, and the correct one(s) will depend on the data you have and what exactly you're trying to extract from it.

---
After your edit, I still think that the expressions above might work. For example:

```
import re

lines = [
    'THIS. THAT..OTHER ',
    'THIS. THAT. ANOTHER..OTHER ',
    'THIS..OTHER ',
]

pattern = re.compile(r'([a-zA-Z]+\. ?)+')
for line in lines:
    match = pattern.search(line)
    if match:
        print(match[0])
    else:
        print("REGEX FAILED: [{}]".format(line))
```

The output is:

```
THIS. THAT.
THIS. THAT. ANOTHER.
THIS.
```

Basically, the regex matches one or more occurrences of "word followed by a dot, followed by an optional space", which seems to be what you want.

In the third case, it's getting only `THIS.`, because the `+` quantifier gets one or more occurrences. If you want only 2 or more words, just change it accordingly, as previously said (such as `{2,}` for "at least two", etc).
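
For instance, a sketch of the `{2,}` variation with the same three lines (the `results` list is just a name I chose to collect the output):

```python
import re

lines = [
    'THIS. THAT..OTHER ',
    'THIS. THAT. ANOTHER..OTHER ',
    'THIS..OTHER ',
]

# same expression as above, but requiring at least two "word + dot" occurrences
pattern = re.compile(r'([a-zA-Z]+\. ?){2,}')

results = []
for line in lines:
    match = pattern.search(line)
    results.append(match[0] if match else None)

print(results)  # ['THIS. THAT.', 'THIS. THAT. ANOTHER.', None]
```

Now the third line doesn't match at all, because it has only one word followed by a dot before the sequence is broken.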

#3: Post edited by

hkotsubo‭ · 2021-03-03T16:52:57Z (about 4 years ago)

#2: Post edited by

hkotsubo‭ · 2021-03-03T01:18:51Z (about 4 years ago)

#1: Initial revision by

hkotsubo‭ · 2021-03-03T01:04:18Z (about 4 years ago)
