Post History
Answer
#4: Post edited
- The complete set of characters matched by the [`\s` shorthand][1] varies according to the language/API/tool/engine you're using. In addition to that, there might be configurations that change this behaviour.
- In general, `\s` (at least in the engines I've seen) always includes the following characters:
- - [space][2]
- - <kbd>TAB</kbd> (`\t`) (AKA "horizontal tab" or ["CHARACTER TABULATION"][3])
- - *newline* (`\n`) (AKA [LINE FEED][4])
- - [*carriage return* (`\r`)][5]
- - [*form feed* (`\f`)][6]
- The *vertical tab* (`\v`) (or ["LINE TABULATION"][7]) is also matched in many languages, such as [Java][8], [JavaScript][9], [Ruby][10] and [Python][11].
- But in PHP, `\s` doesn't match a *vertical tab*. According to the [documentation][12]:
- > `\s` any whitespace character
- >
- > The "whitespace" characters are HT (9), LF (10), FF (12), CR (13), and space (32)
- Where HT is the *horizontal tab*, LF is the *line feed*, FF is the *form feed* and CR is the *carriage return*.
- And in Perl, the *vertical tab* is matched only in versions >= 5.18, according to the [documentation][13]:
- > `\s` means the five characters `[ \f\n\r\t]`, and **starting in Perl v5.18, the vertical tab**;
- In any case, this list can vary according to the language, API, tool or engine (Google Docs, for example, uses the [RE2 engine][14], which [doesn't match the *vertical tab*][15]), so checking the docs is always recommended.
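As a quick way to check the list above in one particular engine, here is a minimal sketch using Python 3's `re` module with the `ASCII` flag. The `CANDIDATES` table and the `matched_names` helper are names made up for this illustration, and the result only says something about Python's `re`, not about other engines.

```python
import re

# Hypothetical table for this illustration: the six ASCII whitespace candidates.
CANDIDATES = {
    'space': ' ',
    'horizontal tab': '\t',
    'line feed': '\n',
    'carriage return': '\r',
    'form feed': '\f',
    'vertical tab': '\v',
}

def matched_names(pattern):
    """Return the names of the candidates that the compiled pattern matches."""
    return [name for name, ch in CANDIDATES.items() if pattern.fullmatch(ch)]

ascii_ws = re.compile(r'\s', re.ASCII)
print(matched_names(ascii_ws))
```

Running the same kind of check in another language is the easiest way to see whether, for instance, the vertical tab is included there.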
- ---
- # Unicode
- Many languages have configurations that enable some kind of "Unicode Mode", which makes `\s` match many other characters.
- For example, in Java, if you set the option [`UNICODE_CHARACTER_CLASS`][16], `\s` will match all characters that have the [Unicode `White_Space` property][17] (check the full list [here][18]). So for this code:
- ```java
- Matcher matcher = Pattern.compile("\\s", Pattern.UNICODE_CHARACTER_CLASS).matcher("");
- // loop all Unicode code points
- for (int i = 0; i <= Character.MAX_CODE_POINT; i++) {
-     String s = new String(new int[] { i }, 0, 1);
-     matcher.reset(s);
-     if (matcher.find()) {
-         // if \s matches, print the codepoint and character name
-         System.out.printf("%06X, %s\n", i, Character.getName(i));
-     }
- }
- ```
- The output will be:
- ```none
- 000009, CHARACTER TABULATION
- 00000A, LINE FEED (LF)
- 00000B, LINE TABULATION
- 00000C, FORM FEED (FF)
- 00000D, CARRIAGE RETURN (CR)
- 000020, SPACE
- 000085, NEXT LINE (NEL)
- 0000A0, NO-BREAK SPACE
- 001680, OGHAM SPACE MARK
- 002000, EN QUAD
- 002001, EM QUAD
- 002002, EN SPACE
- 002003, EM SPACE
- 002004, THREE-PER-EM SPACE
- 002005, FOUR-PER-EM SPACE
- 002006, SIX-PER-EM SPACE
- 002007, FIGURE SPACE
- 002008, PUNCTUATION SPACE
- 002009, THIN SPACE
- 00200A, HAIR SPACE
- 002028, LINE SEPARATOR
- 002029, PARAGRAPH SEPARATOR
- 00202F, NARROW NO-BREAK SPACE
- 00205F, MEDIUM MATHEMATICAL SPACE
- 003000, IDEOGRAPHIC SPACE
- ```
- <sup>[See this code running][19]</sup>
- But if we remove `UNICODE_CHARACTER_CLASS`, the *default* is to consider only the aforementioned characters (`[ \t\n\r\f\v]`):
- ```java
- Matcher matcher = Pattern.compile("\\s").matcher("");
- // ... rest of the code is the same
- ```
- Now the output will be:
- ```none
- 000009, CHARACTER TABULATION
- 00000A, LINE FEED (LF)
- 00000B, LINE TABULATION
- 00000C, FORM FEED (FF)
- 00000D, CARRIAGE RETURN (CR)
- 000020, SPACE
- ```
- <sup>[See this code running][20]</sup>
- ---
- In Python it's similar, but in Python 3 the default is the opposite of Java's: the regex is already in "Unicode Mode", and [`\s`][11] matches all Unicode whitespace characters. With code similar to the previous example:
- ```python
- import unicodedata as u
- import re
- r = re.compile(r'\s')
- for i in range(0x10ffff + 1):
-     s = chr(i)
-     if r.search(s):
-         print('{:02X} {}'.format(i, u.name(s, '')))
- ```
- The output is:
- ```none
- 09
- 0A
- 0B
- 0C
- 0D
- 1C
- 1D
- 1E
- 1F
- 20 SPACE
- 85
- A0 NO-BREAK SPACE
- 1680 OGHAM SPACE MARK
- 2000 EN QUAD
- 2001 EM QUAD
- 2002 EN SPACE
- 2003 EM SPACE
- 2004 THREE-PER-EM SPACE
- 2005 FOUR-PER-EM SPACE
- 2006 SIX-PER-EM SPACE
- 2007 FIGURE SPACE
- 2008 PUNCTUATION SPACE
- 2009 THIN SPACE
- 200A HAIR SPACE
- 2028 LINE SEPARATOR
- 2029 PARAGRAPH SEPARATOR
- 202F NARROW NO-BREAK SPACE
- 205F MEDIUM MATHEMATICAL SPACE
- 3000 IDEOGRAPHIC SPACE
- ```
- <sup>[See this code running][21]</sup>
- If we want the regex to match only `[ \t\n\r\f\v]`, we need to use the [`ASCII` flag][22]:
- ```python
- r = re.compile(r'\s', re.ASCII)
- # ... rest of the code is the same
- ```
- And the output will be:
- ```none
- 09
- 0A
- 0B
- 0C
- 0D
- 20 SPACE
- ```
- <sup>[See this code running][23]</sup>
- PS: in [Python 2][24] the behaviour is the same as Java. By *default*, `\s` matches only `[ \f\n\r\v\t]` ([see here][25]), and "Unicode Mode" is enabled by setting the [`UNICODE` flag][26] ([see here][27]).
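To see exactly what the `ASCII` flag removes, the two modes can be compared as sets. This is a minimal sketch for Python 3; `whitespace_code_points` is a helper name invented here, and the contents of the Unicode-only set may differ between Python versions.

```python
import re

def whitespace_code_points(flags=0):
    # Collect every code point that \s matches under the given flags.
    pattern = re.compile(r'\s', flags)
    return {cp for cp in range(0x110000) if pattern.match(chr(cp))}

unicode_ws = whitespace_code_points()        # default "Unicode Mode"
ascii_ws = whitespace_code_points(re.ASCII)  # restricted to ASCII whitespace

# Code points matched only when the ASCII flag is absent
only_unicode = sorted(unicode_ws - ascii_ws)
print(['{:04X}'.format(cp) for cp in only_unicode])
```

Characters such as NO-BREAK SPACE (`00A0`) and IDEOGRAPHIC SPACE (`3000`) should show up only in the Unicode set.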
- ---
- One detail is that, in the tests above, the Python 3 code returned 4 characters that the Java code didn't ([1C][28], [1D][29], [1E][30] and [1F][31]). *My guess* is that it's due to the Unicode version used by each language (I've tested with Java 8, [which uses Unicode 6.2.0][32], and Python 3.8, [which uses Unicode 12.1.0](https://docs.python.org/3.8/library/unicodedata.html)), **or** due to some detail of the regex engine's internal implementation, which might or might not consider factors other than the `White_Space` property. Either way, this confirms that the `\s` shorthand can and will vary according to the programming language and its version/configuration.
- And even different libraries for the same language can have different behaviours. If I change the Python code above to use the [`regex` module][33] (an awesome module that extends the native `re`'s functionalities), [the output will be the same as the Java code][34].
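A possible explanation for the four extra characters, not confirmed here against the engine's source: recent versions of the Python documentation describe `\s` for `str` patterns in terms of `str.isspace()`, which also accepts characters whose bidirectional class is `WS`, `B` or `S`. The sketch below only inspects those four code points; `report` is a name made up for this example.

```python
import re
import unicodedata

pattern = re.compile(r'\s')

def report(cp):
    # Return (str.isspace() result, bidirectional class, matched by \s).
    ch = chr(cp)
    return ch.isspace(), unicodedata.bidirectional(ch), bool(pattern.match(ch))

for cp in (0x1C, 0x1D, 0x1E, 0x1F):
    print('{:02X}'.format(cp), report(cp))
```

If the bidirectional class really is what makes the difference, that would point to the engine's definition of whitespace rather than the Unicode version, but treat this as a hint to verify, not as a conclusion.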
- ---
- ### Final considerations
- Other languages and tools might or might not support the "Unicode Mode" (and this might or might not be the default), and they might or might not have a way to enable or disable it.
- Some engines might also support Unicode properties, such as `\p{IsWhite_Space}` to match all Unicode whitespace characters (and this might or might not be equivalent to `\s`). So always check the docs to make sure that `\s` matches what you need (and doesn't match what you don't need). As a side note, this is also true for other shorthands, such as `\d`, `\w`, `\b`, etc., because their behaviour can also vary according to the language/engine and its configuration.
- Obviously, if you're working with very controlled input and you "know for sure" which characters the text does and doesn't contain, it probably won't make much difference whether you use `\s` in Unicode or non-Unicode mode, or just a regex with a literal space instead (but if you want to match, let's say, just the spaces but not newlines, then this can make a difference).
- In addition to that, some languages support other similar shorthands, such as *POSIX character classes*. For example, [in Java][35] you can use `\p{Blank}`, and [in PHP][36], `[:blank:]`, and both match `[ \t]` (a space or a <kbd>TAB</kbd>) - although this changes in Java when Unicode Mode is enabled. And there are also engines that support the `\R` shorthand, which matches all line breaks (still, with differences: [in Java][37] it matches `\u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]`, and [in PHP][12] it matches only `\r`, `\n` or `\r\n`).
- Depending on what you need to do, these options - when available - can be more suitable than `\s`: for example, if you want to match only line breaks, ignoring spaces, or in any other situation where you don't want to match everything that `\s` covers. YMMV.
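To make the "spaces but not newlines" case concrete, here is a small Python 3 sketch that splits the same text with `\s+` and with an explicit `[ \t]+` class. The sample string and variable names are made up for the example.

```python
import re

text = 'a  b\tc\nd'

# \s+ treats the line break as a separator too
by_any_whitespace = re.split(r'\s+', text)
print(by_any_whitespace)   # ['a', 'b', 'c', 'd']

# [ \t]+ splits on spaces and tabs only, so the line break stays in the data
by_blank_only = re.split(r'[ \t]+', text)
print(by_blank_only)       # ['a', 'b', 'c\nd']
```

An explicit character class like this is also the most portable option, since it doesn't depend on how each engine defines `\s`.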
- ---
- As a final note, there's also the `\S` shorthand (note the uppercase "S"), which means "*any character that is **not** matched by `\s`*" (this follows the common pattern of having an uppercase shorthand as the opposite of the respective lowercase one, such as `\D` being "anything that is not a `\d`", `\W` being "anything that is not a `\w`", etc.).
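A short Python 3 sketch of `\S` in practice: it extracts the non-whitespace runs from a string and then checks, for a limited range of code points chosen arbitrarily for this example, that exactly one of `\s` and `\S` matches each character.

```python
import re

text = 'foo  bar\tbaz\n'
words = re.findall(r'\S+', text)
print(words)  # ['foo', 'bar', 'baz']

# For a single character, exactly one of \s and \S should match.
ws = re.compile(r'\s')
non_ws = re.compile(r'\S')
complementary = all(
    bool(ws.match(chr(cp))) != bool(non_ws.match(chr(cp)))
    for cp in range(0x3001)
)
print(complementary)
```

Because `\S` is defined relative to `\s`, everything said above about modes and engines applies to it as well.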
- [1]: https://www.regular-expressions.info/shorthand.html
- [2]: http://www.fileformat.info/info/unicode/char/0020/index.htm
- [3]: http://www.fileformat.info/info/unicode/char/0009/index.htm
- [4]: http://www.fileformat.info/info/unicode/char/000a/index.htm
- [5]: http://www.fileformat.info/info/unicode/char/000d/index.htm
- [6]: http://www.fileformat.info/info/unicode/char/000c/index.htm
- [7]: http://www.fileformat.info/info/unicode/char/000b/index.htm
- [8]: https://docs.oracle.com/javase/9/docs/api/java/util/regex/Pattern.html#predef
- [9]: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions/Character_Classes#Types
- [10]: https://ruby-doc.org/core-2.5.1/Regexp.html#class-Regexp-label-Metacharacters+and+Escapes
- [11]: https://docs.python.org/3/library/re.html#index-30
- [12]: https://www.php.net/manual/en/regexp.reference.escape.php
- [13]: https://perldoc.perl.org/perlre.html
- [14]: https://support.google.com/docs/answer/3098292?hl=en
- [15]: https://github.com/google/re2/blob/master/doc/syntax.txt
- [16]: https://docs.oracle.com/javase/9/docs/api/java/util/regex/Pattern.html#UNICODE_CHARACTER_CLASS
- [17]: https://en.wikipedia.org/wiki/Whitespace_character#Unicode
- [18]: https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt
- [19]: http://ideone.com/FZnbZ8
- [20]: http://ideone.com/BNYdci
- [21]: https://ideone.com/sxAW1G
- [22]: https://docs.python.org/3/library/re.html#re.ASCII
- [23]: https://ideone.com/0BhdAu
- [24]: https://docs.python.org/2.7/library/re.html
- [25]: https://ideone.com/NAvibh
- [26]: https://docs.python.org/2.7/library/re.html#re.UNICODE
- [27]: https://ideone.com/SFFUyV
- [28]: http://www.fileformat.info/info/unicode/char/1c/index.htm
- [29]: http://www.fileformat.info/info/unicode/char/1d/index.htm
- [30]: http://www.fileformat.info/info/unicode/char/1e/index.htm
- [31]: http://www.fileformat.info/info/unicode/char/1f/index.htm
- [32]: https://docs.oracle.com/javase/8/docs/technotes/guides/intl/enhancements.8.html#unicode
- [33]: https://pypi.org/project/regex/
- [34]: https://repl.it/@hkotsubo/QuaintCumbersomeBase#main.py
- [35]: https://docs.oracle.com/javase/9/docs/api/java/util/regex/Pattern.html#posix
- [36]: https://www.php.net/manual/en/regexp.reference.character-classes.php
- [37]: https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#lineending
#3: Post edited
- The complete set of characters matched by the [`\s` shorthand][1] varies according to the language/API/tool/engine you're using. In addition to that, there might be configurations that change this behaviour.
- In a general way, `\s` - at least in the engines that I've seen - always include the following characters:
- - [space][2]
- - <kbd>TAB</kbd> (`\t`) (AKA "horizontal tab" or ["CHARACTER TABULATION"][3])
- - *newline* (`\n`) (AKA [LINE FEED][4])
- - [*carriage return* (`\r`)][5]
- - [*form feed* (`\f`)][6]
- The *vertical tab* (`\v`) (or ["LINE TABULATION"][7]) is also matched in many languages, such as [Java][8], [JavaScript][9], [Ruby][10] and [Python][11].
- But in PHP, `\s` doesn't match a *vertical tab*. According to the [documentation][12]:
- > `\s` any whitespace character
- >
- > The "whitespace" characters are HT (9), LF (10), FF (12), CR (13), and space (32)
- Where HT is the *horizontal tab*, LF is the *line feed*, FF is the *form feed* and CR is the *carriage return*.
- And in Perl, the *vertical tab* is matched only in versions >= 5.18, according to the [documentation][13]:
- > `\s` means the five characters `[ \f\n\r\t]`, and **starting in Perl v5.18, the vertical tab**;
Anyway, this list can vary according to the languague, API, tool or engine (Google Docs, for example, uses [RE2 engine][14], that [doesn't match the *vertibal tab*][15]). So checking the docs is always recommended.- ---
- # Unicode
- Many languages have configurations that enable some kind of "Unicode Mode", which makes `\s` match many other characters.
- For example, in Java, if you set the option [`UNICODE_CHARACTER_CLASS`][16], `\s` will match all characters that have the [Unicode `White_Space` property][17] (check the full list [here][18]). So for this code:
- ```java
- Matcher matcher = Pattern.compile("\\s", Pattern.UNICODE_CHARACTER_CLASS).matcher("");
- // loop all Unicode code points
- for (int i = 0; i <= Character.MAX_CODE_POINT; i++) {
- String s = new String(new int[] { i }, 0, 1);
- matcher.reset(s);
- if (matcher.find()) {
- // if \s matches, print the codepoint and character name
- System.out.printf("%06X, %s\n", i, Character.getName(i));
- }
- }
- ```
- The output will be:
- ```none
- 000009, CHARACTER TABULATION
- 00000A, LINE FEED (LF)
- 00000B, LINE TABULATION
- 00000C, FORM FEED (FF)
- 00000D, CARRIAGE RETURN (CR)
- 000020, SPACE
- 000085, NEXT LINE (NEL)
- 0000A0, NO-BREAK SPACE
- 001680, OGHAM SPACE MARK
- 002000, EN QUAD
- 002001, EM QUAD
- 002002, EN SPACE
- 002003, EM SPACE
- 002004, THREE-PER-EM SPACE
- 002005, FOUR-PER-EM SPACE
- 002006, SIX-PER-EM SPACE
- 002007, FIGURE SPACE
- 002008, PUNCTUATION SPACE
- 002009, THIN SPACE
- 00200A, HAIR SPACE
- 002028, LINE SEPARATOR
- 002029, PARAGRAPH SEPARATOR
- 00202F, NARROW NO-BREAK SPACE
- 00205F, MEDIUM MATHEMATICAL SPACE
- 003000, IDEOGRAPHIC SPACE
- ```
- <sup>[See this code running][19]</sup>
- But if we remove `UNICODE_CHARACTER_CLASS`, the *default* is to consider only the aforementioned characters (`[ \t\n\r\f\v]`):
- ```java
- Matcher matcher = Pattern.compile("\\s").matcher("");
- ... rest of the code is the same
- ```
- Now the output will be:
- ```none
- 000009, CHARACTER TABULATION
- 00000A, LINE FEED (LF)
- 00000B, LINE TABULATION
- 00000C, FORM FEED (FF)
- 00000D, CARRIAGE RETURN (CR)
- 000020, SPACE
- ```
- <sup>[See this code running][20]</sup>
- ---
- In Python it's similar, but in Python 3 the behaviour is the opposite of Java. By default, the regex is already in "Unicode Mode", and [`\s`][11] matches all Unicode whitespace characters. Making a code similar to the previous one:
- ```python
- import unicodedata as u
- import re
- r = re.compile(r'\s')
- for i in range(0x10ffff + 1):
- s = chr(i)
- if r.search(s):
- print('{:02X} {}'.format(i, u.name(s, '')))
- ```
- The output is:
- ```none
- 09
- 0A
- 0B
- 0C
- 0D
- 1C
- 1D
- 1E
- 1F
- 20 SPACE
- 85
- A0 NO-BREAK SPACE
- 1680 OGHAM SPACE MARK
- 2000 EN QUAD
- 2001 EM QUAD
- 2002 EN SPACE
- 2003 EM SPACE
- 2004 THREE-PER-EM SPACE
- 2005 FOUR-PER-EM SPACE
- 2006 SIX-PER-EM SPACE
- 2007 FIGURE SPACE
- 2008 PUNCTUATION SPACE
- 2009 THIN SPACE
- 200A HAIR SPACE
- 2028 LINE SEPARATOR
- 2029 PARAGRAPH SEPARATOR
- 202F NARROW NO-BREAK SPACE
- 205F MEDIUM MATHEMATICAL SPACE
- 3000 IDEOGRAPHIC SPACE
- ```
- <sup>[See this code running][21]</sup>
- If we want the regex to match only `[ \t\n\r\f\v]`, we need to use the [`ASCII` flag][22]:
- ```python
- r = re.compile(r'\s', re.ASCII)
- ... rest of the code is the same
- ```
- And the output will be:
- ```none
- 09
- 0A
- 0B
- 0C
- 0D
- 20 SPACE
- ```
- <sup>[See this code running][23]</sup>
- PS: in [Python 2][24] the behaviour is the same as Java. By *default*, `\s` matches only `[ \f\n\r\v\t]` ([see here][25]), and "Unicode Mode" is enabled by setting the [`UNICODE` flag][26] ([see here][27]).
- ---
- One detail is that, in the tests above, the Python 3 code returned 4 characters that the Java code didn't ([1C][28], [1D][29], [1E][30] e [1F][31]). *My guess* is that it's due to Unicode's version used by each language (I've tested with Java 8, [which uses Unicode 6.2.0][32], and Python 3.8, [which uses Unicode 12.10](https://docs.python.org/3.8/library/unicodedata.html)), **or** due to some details regarding the regex engine's internal implementation, that might or might not consider some factors other than the `White_Space` property. Anyway, this confirms that the `\s` shorthand can and will vary according to the programming language and their versions/configurations.
- And even different libraries for the same language can have different behaviours. If I change the Python code above to use the [`regex` module][33] (an awesome module that extends the native `re`'s functionalities), [the output will be the same as the Java code][34].
- ---
- ### Final considerations
- Other languages and tools might or might not support the "Unicode Mode" (and this might or might not be the default), and they might or might not have a way to enable or disable it.
- Some engines might also support Unicode properties, such as `\p{IsWhite_Space}` to match all Unicode whitespace characters (and this might or might not be equivalent to `\s`). So always check the docs to make sure that `\s` matches what you need (and doesn't match what you don't need) - as a side note, this is also true for other shorthands, such as `\d`, `\w`, `\b`, etc, because their behaviour can also vary according to the languange/engine and their configurations.
- Obviously, if you're working with very controlled input and you "know for sure" all the characters that the text has and doesn't have, it probably won't make much difference using `\s` in Unicode or non-Unicode mode, or just use a regex with a space instead (but if you want to match, let's say, just the spaces but not newlines, then this can make a difference).
- In addition to that, some languages support other similar shorthands, such as *POSIX character classes*. For example, [in Java][35] you can use `\p{Blank}`, and [in PHP][36], `[:blank:]`, and both matches `[ \t]` (a space or a <kbd>TAB</kbd>) - although this changes in Java when Unicode Mode is enabled. And there are also engines that support the `\R` shorthand, which matches all line breaks (still, with differences: [in Java][37] it matches `\u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]`, and [in PHP][12] it matches only `\r`, `\n` or `\r\n`).
- Depending on what you need to do, these options - when available - can be more suitable than `\s`. For example, if you want to match only line breaks, ignoring spaces, or any other situation that you don't want to match everything that `\s` considers. YMMV.
- [1]: https://www.regular-expressions.info/shorthand.html
- [2]: http://www.fileformat.info/info/unicode/char/0020/index.htm
- [3]: http://www.fileformat.info/info/unicode/char/0009/index.htm
- [4]: http://www.fileformat.info/info/unicode/char/000a/index.htm
- [5]: http://www.fileformat.info/info/unicode/char/000d/index.htm
- [6]: http://www.fileformat.info/info/unicode/char/000c/index.htm
- [7]: http://www.fileformat.info/info/unicode/char/000b/index.htm
- [8]: https://docs.oracle.com/javase/9/docs/api/java/util/regex/Pattern.html#predef
- [9]: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions/Character_Classes#Types
- [10]: https://ruby-doc.org/core-2.5.1/Regexp.html#class-Regexp-label-Metacharacters+and+Escapes
- [11]: https://docs.python.org/3/library/re.html#index-30
- [12]: https://www.php.net/manual/en/regexp.reference.escape.php
- [13]: https://perldoc.perl.org/perlre.html
- [14]: https://support.google.com/docs/answer/3098292?hl=en
- [15]: https://github.com/google/re2/blob/master/doc/syntax.txt
- [16]: https://docs.oracle.com/javase/9/docs/api/java/util/regex/Pattern.html#UNICODE_CHARACTER_CLASS
- [17]: https://en.wikipedia.org/wiki/Whitespace_character#Unicode
- [18]: https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt
- [19]: http://ideone.com/FZnbZ8
- [20]: http://ideone.com/BNYdci
- [21]: https://ideone.com/sxAW1G
- [22]: https://docs.python.org/3/library/re.html#re.ASCII
- [23]: https://ideone.com/0BhdAu
- [24]: https://docs.python.org/2.7/library/re.html
- [25]: https://ideone.com/NAvibh
- [26]: https://docs.python.org/2.7/library/re.html#re.UNICODE
- [27]: https://ideone.com/SFFUyV
- [28]: http://www.fileformat.info/info/unicode/char/1c/index.htm
- [29]: http://www.fileformat.info/info/unicode/char/1d/index.htm
- [30]: http://www.fileformat.info/info/unicode/char/1e/index.htm
- [31]: http://www.fileformat.info/info/unicode/char/1f/index.htm
- [32]: https://docs.oracle.com/javase/8/docs/technotes/guides/intl/enhancements.8.html#unicode
- [33]: https://pypi.org/project/regex/
- [34]: https://repl.it/@hkotsubo/QuaintCumbersomeBase#main.py
- [35]: https://docs.oracle.com/javase/9/docs/api/java/util/regex/Pattern.html#posix
- [36]: https://www.php.net/manual/en/regexp.reference.character-classes.php
- [37]: https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#lineending
- The complete set of characters matched by the [`\s` shorthand][1] varies according to the language/API/tool/engine you're using. In addition to that, there might be configurations that change this behaviour.
- In a general way, `\s` - at least in the engines that I've seen - always include the following characters:
- - [space][2]
- - <kbd>TAB</kbd> (`\t`) (AKA "horizontal tab" or ["CHARACTER TABULATION"][3])
- - *newline* (`\n`) (AKA [LINE FEED][4])
- - [*carriage return* (`\r`)][5]
- - [*form feed* (`\f`)][6]
- The *vertical tab* (`\v`) (or ["LINE TABULATION"][7]) is also matched in many languages, such as [Java][8], [JavaScript][9], [Ruby][10] and [Python][11].
- But in PHP, `\s` doesn't match a *vertical tab*. According to the [documentation][12]:
- > `\s` any whitespace character
- >
- > The "whitespace" characters are HT (9), LF (10), FF (12), CR (13), and space (32)
- Where HT is the *horizontal tab*, LF is the *line feed*, FF is the *form feed* and CR is the *carriage return*.
- And in Perl, the *vertical tab* is matched only in versions >= 5.18, according to the [documentation][13]:
- > `\s` means the five characters `[ \f\n\r\t]`, and **starting in Perl v5.18, the vertical tab**;
- Anyway, this list can vary according to the language, API, tool or engine (Google Docs, for example, uses the [RE2 engine][14], which [doesn't match the *vertical tab*][15]). So checking the docs is always recommended.
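As a quick check of the list above, the following Python 3 sketch tests each of the six characters against `\s` using the standard `re` module. The labels in the dictionary are only illustrative, and other engines (PHP, older Perl, RE2) may give a different answer for the vertical tab, as noted above.

```python
import re

# The six ASCII whitespace candidates discussed above (names are illustrative labels).
candidates = {
    ' ': 'space',
    '\t': 'horizontal tab',
    '\n': 'line feed',
    '\r': 'carriage return',
    '\f': 'form feed',
    '\v': 'vertical tab',
}

# Map each label to whether Python's \s matches that single character.
results = {name: bool(re.fullmatch(r'\s', ch)) for ch, name in candidates.items()}

for name, matched in results.items():
    print('{:<16} {}'.format(name, 'matched' if matched else 'not matched'))
```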
- ---
- # Unicode
- Many languages have configurations that enable some kind of "Unicode Mode", which makes `\s` match many other characters.
- For example, in Java, if you set the option [`UNICODE_CHARACTER_CLASS`][16], `\s` will match all characters that have the [Unicode `White_Space` property][17] (check the full list [here][18]). So for this code:
- ```java
- Matcher matcher = Pattern.compile("\\s", Pattern.UNICODE_CHARACTER_CLASS).matcher("");
- // loop all Unicode code points
- for (int i = 0; i <= Character.MAX_CODE_POINT; i++) {
-     String s = new String(new int[] { i }, 0, 1);
-     matcher.reset(s);
-     if (matcher.find()) {
-         // if \s matches, print the codepoint and character name
-         System.out.printf("%06X, %s\n", i, Character.getName(i));
-     }
- }
- ```
- The output will be:
- ```none
- 000009, CHARACTER TABULATION
- 00000A, LINE FEED (LF)
- 00000B, LINE TABULATION
- 00000C, FORM FEED (FF)
- 00000D, CARRIAGE RETURN (CR)
- 000020, SPACE
- 000085, NEXT LINE (NEL)
- 0000A0, NO-BREAK SPACE
- 001680, OGHAM SPACE MARK
- 002000, EN QUAD
- 002001, EM QUAD
- 002002, EN SPACE
- 002003, EM SPACE
- 002004, THREE-PER-EM SPACE
- 002005, FOUR-PER-EM SPACE
- 002006, SIX-PER-EM SPACE
- 002007, FIGURE SPACE
- 002008, PUNCTUATION SPACE
- 002009, THIN SPACE
- 00200A, HAIR SPACE
- 002028, LINE SEPARATOR
- 002029, PARAGRAPH SEPARATOR
- 00202F, NARROW NO-BREAK SPACE
- 00205F, MEDIUM MATHEMATICAL SPACE
- 003000, IDEOGRAPHIC SPACE
- ```
- <sup>[See this code running][19]</sup>
- But if we remove `UNICODE_CHARACTER_CLASS`, the *default* is to consider only the aforementioned characters (`[ \t\n\r\f\v]`):
- ```java
- Matcher matcher = Pattern.compile("\\s").matcher("");
- // ... rest of the code is the same
- ```
- Now the output will be:
- ```none
- 000009, CHARACTER TABULATION
- 00000A, LINE FEED (LF)
- 00000B, LINE TABULATION
- 00000C, FORM FEED (FF)
- 00000D, CARRIAGE RETURN (CR)
- 000020, SPACE
- ```
- <sup>[See this code running][20]</sup>
- ---
- In Python it's similar, but in Python 3 the behaviour is the opposite of Java. By default, the regex is already in "Unicode Mode", and [`\s`][11] matches all Unicode whitespace characters. Using code similar to the previous example:
- ```python
- import unicodedata as u
- import re
- r = re.compile(r'\s')
- for i in range(0x10ffff + 1):
-     s = chr(i)
-     if r.search(s):
-         print('{:02X} {}'.format(i, u.name(s, '')))
- ```
- The output is:
- ```none
- 09
- 0A
- 0B
- 0C
- 0D
- 1C
- 1D
- 1E
- 1F
- 20 SPACE
- 85
- A0 NO-BREAK SPACE
- 1680 OGHAM SPACE MARK
- 2000 EN QUAD
- 2001 EM QUAD
- 2002 EN SPACE
- 2003 EM SPACE
- 2004 THREE-PER-EM SPACE
- 2005 FOUR-PER-EM SPACE
- 2006 SIX-PER-EM SPACE
- 2007 FIGURE SPACE
- 2008 PUNCTUATION SPACE
- 2009 THIN SPACE
- 200A HAIR SPACE
- 2028 LINE SEPARATOR
- 2029 PARAGRAPH SEPARATOR
- 202F NARROW NO-BREAK SPACE
- 205F MEDIUM MATHEMATICAL SPACE
- 3000 IDEOGRAPHIC SPACE
- ```
- <sup>[See this code running][21]</sup>
- If we want the regex to match only `[ \t\n\r\f\v]`, we need to use the [`ASCII` flag][22]:
- ```python
- r = re.compile(r'\s', re.ASCII)
- # ... rest of the code is the same
- ```
- And the output will be:
- ```none
- 09
- 0A
- 0B
- 0C
- 0D
- 20 SPACE
- ```
- <sup>[See this code running][23]</sup>
- PS: in [Python 2][24] the behaviour is the same as Java. By *default*, `\s` matches only `[ \f\n\r\v\t]` ([see here][25]), and "Unicode Mode" is enabled by setting the [`UNICODE` flag][26] ([see here][27]).
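To see the practical effect of the flag, here is a small sketch (Python 3, standard `re` module only) that collapses runs of whitespace. The sample string and the `collapse` helper are made up for illustration.

```python
import re

def collapse(text, flags=0):
    """Replace every run of characters matched by \\s with a single space."""
    return re.sub(r'\s+', ' ', text, flags=flags)

# Hypothetical input: 'a', NO-BREAK SPACE, 'b', EM SPACE, TAB, 'c'
sample = 'a\u00a0b\u2003\tc'

unicode_result = collapse(sample)            # default: Unicode-aware \s
ascii_result = collapse(sample, re.ASCII)    # \s restricted to [ \t\n\r\f\v]

# ascii() makes the invisible characters visible as escape sequences.
print(ascii(unicode_result))
print(ascii(ascii_result))
```

With the default flags, the no-break space and the em space are treated as whitespace; with `re.ASCII`, only the <kbd>TAB</kbd> is replaced and the other two characters are left in the text.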
- ---
- One detail is that, in the tests above, the Python 3 code returned 4 characters that the Java code didn't ([1C][28], [1D][29], [1E][30] and [1F][31]). *My guess* is that it's due to the Unicode version used by each language (I've tested with Java 8, [which uses Unicode 6.2.0][32], and Python 3.8, [which uses Unicode 12.1.0](https://docs.python.org/3.8/library/unicodedata.html)), **or** due to some details of the regex engine's internal implementation, which might or might not consider factors other than the `White_Space` property. Anyway, this confirms that the `\s` shorthand can and will vary according to the programming language and its version/configuration.
- And even different libraries for the same language can have different behaviours. If I change the Python code above to use the [`regex` module][33] (an awesome module that extends the native `re`'s functionalities), [the output will be the same as the Java code][34].
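One way to probe the second hypothesis: the Python documentation for `str.isspace()` describes a character as whitespace if its general category is `Zs` or its bidirectional class is `WS`, `B` or `S`, which is a broader test than the `White_Space` property alone. The sketch below (Python 3, standard library only) prints those fields for the four extra code points. Treat it as a hint about where the difference may come from, not as a statement about how `re` is implemented internally.

```python
import re
import unicodedata

# The four code points matched by Python's re but not by the Java test above.
extra = [0x1C, 0x1D, 0x1E, 0x1F]

rows = []
for cp in extra:
    ch = chr(cp)
    rows.append((
        cp,
        unicodedata.category(ch),        # general category
        unicodedata.bidirectional(ch),   # bidirectional class
        ch.isspace(),                    # str.isspace() verdict
        bool(re.fullmatch(r'\s', ch)),   # re's \s verdict
    ))

for cp, cat, bidi, is_space, matched in rows:
    print('{:02X} category={} bidi={} isspace={} matches_s={}'.format(
        cp, cat, bidi, is_space, matched))
```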
- ---
- ### Final considerations
- Other languages and tools might or might not support the "Unicode Mode" (and this might or might not be the default), and they might or might not have a way to enable or disable it.
- Some engines might also support Unicode properties, such as `\p{IsWhite_Space}` to match all Unicode whitespace characters (and this might or might not be equivalent to `\s`). So always check the docs to make sure that `\s` matches what you need (and doesn't match what you don't need) - as a side note, this is also true for other shorthands, such as `\d`, `\w`, `\b`, etc., because their behaviour can also vary according to the language/engine and its configuration.
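The same kind of check works for the other shorthands. As a sketch (Python 3, standard `re` module; the sample digit is just an illustration), `\d` matches a non-ASCII decimal digit by default but not when the `ASCII` flag is set:

```python
import re

# ARABIC-INDIC DIGIT THREE (U+0663), a decimal digit outside the ASCII range.
digit = '\u0663'

matches_default = bool(re.fullmatch(r'\d', digit))          # Unicode-aware \d
matches_ascii = bool(re.fullmatch(r'\d', digit, re.ASCII))  # \d limited to [0-9]

print(matches_default, matches_ascii)
```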
- Obviously, if you're working with very controlled input and you "know for sure" which characters the text can contain, it probably won't make much difference whether you use `\s` in Unicode mode, `\s` in non-Unicode mode, or a regex with a literal space (but if you want to match, let's say, just the spaces but not newlines, then the choice does matter).
- In addition to that, some languages support other similar shorthands, such as *POSIX character classes*. For example, [in Java][35] you can use `\p{Blank}`, and [in PHP][36], `[:blank:]`, and both match `[ \t]` (a space or a <kbd>TAB</kbd>) - although this changes in Java when Unicode Mode is enabled. And there are also engines that support the `\R` shorthand, which matches all line breaks (still, with differences: [in Java][37] it matches `\u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]`, and [in PHP][12] it matches only `\r`, `\n` or `\r\n`).
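Python's built-in `re` module has no `\R` shorthand, so one option there is to spell the alternation out. The sketch below copies the Java definition quoted above into a Python pattern; the name `LINE_BREAK` and the sample text are made up for illustration.

```python
import re

# Same alternation as the Java definition of \R quoted above:
# CR LF as one unit, or any single line-break character.
LINE_BREAK = re.compile(r'\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]')

# Hypothetical sample mixing several line-break styles.
text = 'one\r\ntwo\nthree\u2028four\x85five'

parts = LINE_BREAK.split(text)
print(parts)
```

Putting `\r\n` first in the alternation keeps a Windows-style line ending from being split into two separate breaks.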
- Depending on what you need to do, these options - when available - can be more suitable than `\s`: for example, if you want to match only line breaks and ignore spaces, or in any other situation where you don't want to match everything that `\s` matches. YMMV.
- [1]: https://www.regular-expressions.info/shorthand.html
- [2]: http://www.fileformat.info/info/unicode/char/0020/index.htm
- [3]: http://www.fileformat.info/info/unicode/char/0009/index.htm
- [4]: http://www.fileformat.info/info/unicode/char/000a/index.htm
- [5]: http://www.fileformat.info/info/unicode/char/000d/index.htm
- [6]: http://www.fileformat.info/info/unicode/char/000c/index.htm
- [7]: http://www.fileformat.info/info/unicode/char/000b/index.htm
- [8]: https://docs.oracle.com/javase/9/docs/api/java/util/regex/Pattern.html#predef
- [9]: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions/Character_Classes#Types
- [10]: https://ruby-doc.org/core-2.5.1/Regexp.html#class-Regexp-label-Metacharacters+and+Escapes
- [11]: https://docs.python.org/3/library/re.html#index-30
- [12]: https://www.php.net/manual/en/regexp.reference.escape.php
- [13]: https://perldoc.perl.org/perlre.html
- [14]: https://support.google.com/docs/answer/3098292?hl=en
- [15]: https://github.com/google/re2/blob/master/doc/syntax.txt
- [16]: https://docs.oracle.com/javase/9/docs/api/java/util/regex/Pattern.html#UNICODE_CHARACTER_CLASS
- [17]: https://en.wikipedia.org/wiki/Whitespace_character#Unicode
- [18]: https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt
- [19]: http://ideone.com/FZnbZ8
- [20]: http://ideone.com/BNYdci
- [21]: https://ideone.com/sxAW1G
- [22]: https://docs.python.org/3/library/re.html#re.ASCII
- [23]: https://ideone.com/0BhdAu
- [24]: https://docs.python.org/2.7/library/re.html
- [25]: https://ideone.com/NAvibh
- [26]: https://docs.python.org/2.7/library/re.html#re.UNICODE
- [27]: https://ideone.com/SFFUyV
- [28]: http://www.fileformat.info/info/unicode/char/1c/index.htm
- [29]: http://www.fileformat.info/info/unicode/char/1d/index.htm
- [30]: http://www.fileformat.info/info/unicode/char/1e/index.htm
- [31]: http://www.fileformat.info/info/unicode/char/1f/index.htm
- [32]: https://docs.oracle.com/javase/8/docs/technotes/guides/intl/enhancements.8.html#unicode
- [33]: https://pypi.org/project/regex/
- [34]: https://repl.it/@hkotsubo/QuaintCumbersomeBase#main.py
- [35]: https://docs.oracle.com/javase/9/docs/api/java/util/regex/Pattern.html#posix
- [36]: https://www.php.net/manual/en/regexp.reference.character-classes.php
- [37]: https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#lineending
#1: Initial revision
The complete set of characters matched by the [`\s` shorthand][1] varies according to the language/API/tool/engine you're using. In addition to that, there might be configurations that change this behaviour. In a general way, `\s` - at least in the engines that I've seen - always include the following characters: - [space][2] - <kbd>TAB</kbd> (`\t`) (AKA "horizontal tab" or ["CHARACTER TABULATION"][3]) - *newline* (`\n`) (AKA [LINE FEED][4]) - [*carriage return* (`\r`)][5] - [*form feed* (`\f`)][6] The *vertical tab* (`\v`) (or ["LINE TABULATION"][7]) is also matched in many languages, such as [Java][8], [JavaScript][9], [Ruby][10] and [Python][11]. But in PHP, `\s` doesn't match a *vertical tab*. According to the [documentation][12]: > `\s` any whitespace character > > The "whitespace" characters are HT (9), LF (10), FF (12), CR (13), and space (32) Where HT is the *horizontal tab*, LF is the *line feed*, FF is the *form feed* and CR is the *carriage return*. And in Perl, the *vertical tab* is matched only in versions >= 5.18, according to the [documentation][13]: > `\s` means the five characters `[ \f\n\r\t]`, and **starting in Perl v5.18, the vertical tab**; Anyway, this list can vary according to the languague, API, tool or engine (Google Docs, for example, uses [RE2 engine][14], that [doesn't match the *vertibal tab*][15]). So checking the docs is always recommended. --- # Unicode Many languages have configurations that enable some kind of "Unicode Mode", which makes `\s` match many other characters. For example, in Java, if you set the option [`UNICODE_CHARACTER_CLASS`][16], `\s` will match all characters that have the [Unicode `White_Space` property][17] (check the full list [here][18]). 
So for this code: ```java Matcher matcher = Pattern.compile("\\s", Pattern.UNICODE_CHARACTER_CLASS).matcher(""); // loop all Unicode code points for (int i = 0; i <= Character.MAX_CODE_POINT; i++) { String s = new String(new int[] { i }, 0, 1); matcher.reset(s); if (matcher.find()) { // if \s matches, print the codepoint and character name System.out.printf("%06X, %s\n", i, Character.getName(i)); } } ``` The output will be: ```none 000009, CHARACTER TABULATION 00000A, LINE FEED (LF) 00000B, LINE TABULATION 00000C, FORM FEED (FF) 00000D, CARRIAGE RETURN (CR) 000020, SPACE 000085, NEXT LINE (NEL) 0000A0, NO-BREAK SPACE 001680, OGHAM SPACE MARK 002000, EN QUAD 002001, EM QUAD 002002, EN SPACE 002003, EM SPACE 002004, THREE-PER-EM SPACE 002005, FOUR-PER-EM SPACE 002006, SIX-PER-EM SPACE 002007, FIGURE SPACE 002008, PUNCTUATION SPACE 002009, THIN SPACE 00200A, HAIR SPACE 002028, LINE SEPARATOR 002029, PARAGRAPH SEPARATOR 00202F, NARROW NO-BREAK SPACE 00205F, MEDIUM MATHEMATICAL SPACE 003000, IDEOGRAPHIC SPACE ``` <sup>[See this code running][19]</sup> But if we remove `UNICODE_CHARACTER_CLASS`, the *default* is to consider only the aforementioned characters (`[ \t\n\r\f\v]`): ```java Matcher matcher = Pattern.compile("\\s").matcher(""); ... rest of the code is the same ``` Now the output will be: ```none 000009, CHARACTER TABULATION 00000A, LINE FEED (LF) 00000B, LINE TABULATION 00000C, FORM FEED (FF) 00000D, CARRIAGE RETURN (CR) 000020, SPACE ``` <sup>[See this code running][20]</sup> --- In Python it's similar, but in Python 3 the behaviour is the opposite of Java. By default, the regex is already in "Unicode Mode", and [`\s`][11] matches all Unicode whitespace characters. 
Writing code similar to the previous example:

```python
import unicodedata as u
import re

r = re.compile(r'\s')
# loop all Unicode code points
for i in range(0x10ffff + 1):
    s = chr(i)
    if r.search(s):
        # if \s matches, print the codepoint and character name (empty if it has none)
        print('{:02X} {}'.format(i, u.name(s, '')))
```

The output is:

```none
09
0A
0B
0C
0D
1C
1D
1E
1F
20 SPACE
85
A0 NO-BREAK SPACE
1680 OGHAM SPACE MARK
2000 EN QUAD
2001 EM QUAD
2002 EN SPACE
2003 EM SPACE
2004 THREE-PER-EM SPACE
2005 FOUR-PER-EM SPACE
2006 SIX-PER-EM SPACE
2007 FIGURE SPACE
2008 PUNCTUATION SPACE
2009 THIN SPACE
200A HAIR SPACE
2028 LINE SEPARATOR
2029 PARAGRAPH SEPARATOR
202F NARROW NO-BREAK SPACE
205F MEDIUM MATHEMATICAL SPACE
3000 IDEOGRAPHIC SPACE
```

<sup>[See this code running][21]</sup>

If we want the regex to match only `[ \t\n\r\f\v]`, we need to use the [`ASCII` flag][22]:

```python
r = re.compile(r'\s', re.ASCII)
# ... rest of the code is the same
```

And the output will be:

```none
09
0A
0B
0C
0D
20 SPACE
```

<sup>[See this code running][23]</sup>

PS: in [Python 2][24] the behaviour is the same as Java. By *default*, `\s` matches only `[ \f\n\r\v\t]` ([see here][25]), and "Unicode Mode" is enabled by setting the [`UNICODE` flag][26] ([see here][27]).

---

One detail is that, in the tests above, the Python code returned 4 characters that the Java code didn't ([1C][28], [1D][29], [1E][30] and [1F][31]). *My guess* is that it's due to the Unicode version used by each language (I've tested with Java 8, [which uses Unicode 6.2.0][32], and Python 3.8, [which uses Unicode 12.1.0](https://docs.python.org/3.8/library/unicodedata.html)), **or** due to some details of the internal implementation, which might or might not consider factors other than the `White_Space` property.

Anyway, this confirms that the `\s` shorthand can and will vary according to the programming language and its version/configuration. And even different libraries for the same language can behave differently.
If I change the Python code above to use the [`regex` module][33] (an awesome module that extends the native `re`'s functionalities), [the output will be the same as the Java code][34].

---

### Final considerations

Other languages and tools might or might not support the "Unicode Mode" (and this might or might not be the default), and they might or might not have a way to enable or disable it. Some engines might also support Unicode properties, such as `\p{IsWhite_Space}` to match all Unicode whitespace characters (and this might or might not be equivalent to `\s`).

So always check the docs to make sure that `\s` matches what you need (and doesn't match what you don't need) - as a side note, this is also true for other shorthands, such as `\d`, `\w`, `\b`, etc, because their behaviour can also vary according to the language/engine and their configurations.

Obviously, if you're working with very controlled input and you "know for sure" all the characters that the text has and doesn't have, it probably won't make much difference whether you use `\s` in Unicode or non-Unicode mode, or just a literal space in the regex (but if you want to match, let's say, just the spaces but not newlines, then this can make a difference).

In addition to that, some languages support other similar shorthands, such as *POSIX character classes*. For example, [in Java][35] you can use `\p{Blank}`, and [in PHP][36], `[:blank:]`, and both match `[ \t]` (a space or a <kbd>TAB</kbd>) - although this changes in Java when Unicode Mode is enabled. And there are also engines that support the `\R` shorthand, which matches all line breaks (still, with differences: [in Java][37] it matches `\u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]`, and [in PHP][12] it matches only `\r`, `\n` or `\r\n`).

Depending on what you need to do, these options - when available - can be more suitable than `\s`.
For example, when you want to match only line breaks and ignore spaces, or in any other situation where you don't want to match everything that `\s` covers. YMMV.

[1]: https://www.regular-expressions.info/shorthand.html
[2]: http://www.fileformat.info/info/unicode/char/0020/index.htm
[3]: http://www.fileformat.info/info/unicode/char/0009/index.htm
[4]: http://www.fileformat.info/info/unicode/char/000a/index.htm
[5]: http://www.fileformat.info/info/unicode/char/000d/index.htm
[6]: http://www.fileformat.info/info/unicode/char/000c/index.htm
[7]: http://www.fileformat.info/info/unicode/char/000b/index.htm
[8]: https://docs.oracle.com/javase/9/docs/api/java/util/regex/Pattern.html#predef
[9]: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions/Character_Classes#Types
[10]: https://ruby-doc.org/core-2.5.1/Regexp.html#class-Regexp-label-Metacharacters+and+Escapes
[11]: https://docs.python.org/3/library/re.html#index-30
[12]: https://www.php.net/manual/en/regexp.reference.escape.php
[13]: https://perldoc.perl.org/perlre.html
[14]: https://support.google.com/docs/answer/3098292?hl=en
[15]: https://github.com/google/re2/blob/master/doc/syntax.txt
[16]: https://docs.oracle.com/javase/9/docs/api/java/util/regex/Pattern.html#UNICODE_CHARACTER_CLASS
[17]: https://en.wikipedia.org/wiki/Whitespace_character#Unicode
[18]: https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt
[19]: http://ideone.com/FZnbZ8
[20]: http://ideone.com/BNYdci
[21]: https://ideone.com/sxAW1G
[22]: https://docs.python.org/3/library/re.html#re.ASCII
[23]: https://ideone.com/0BhdAu
[24]: https://docs.python.org/2.7/library/re.html
[25]: https://ideone.com/NAvibh
[26]: https://docs.python.org/2.7/library/re.html#re.UNICODE
[27]: https://ideone.com/SFFUyV
[28]: http://www.fileformat.info/info/unicode/char/1c/index.htm
[29]: http://www.fileformat.info/info/unicode/char/1d/index.htm
[30]: http://www.fileformat.info/info/unicode/char/1e/index.htm
[31]: http://www.fileformat.info/info/unicode/char/1f/index.htm
[32]: https://docs.oracle.com/javase/8/docs/technotes/guides/intl/enhancements.8.html#unicode
[33]: https://pypi.org/project/regex/
[34]: https://repl.it/@hkotsubo/QuaintCumbersomeBase#main.py
[35]: https://docs.oracle.com/javase/9/docs/api/java/util/regex/Pattern.html#posix
[36]: https://www.php.net/manual/en/regexp.reference.character-classes.php
[37]: https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#lineending
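
To illustrate that last point, here's a minimal sketch, assuming Python 3's built-in `re` module (which, as far as I know, supports neither `\p{Blank}` nor `\R`, so explicit character classes stand in for them). The names `matched`, `HORIZONTAL`, `LINE_BREAK` and `sample` are made up just for this example:

```python
import re

# Hypothetical names, used only for this illustration
HORIZONTAL = r'[ \t]'        # space or TAB only (roughly what \p{Blank} / [:blank:] match)
LINE_BREAK = r'\r\n|[\n\r]'  # CRLF, LF or CR (roughly what PHP's docs describe for \R)

def matched(pattern, text, flags=0):
    """Return every substring of `text` matched by `pattern`."""
    return re.findall(pattern, text, flags)

# contains: space, TAB, space, CRLF and a NO-BREAK SPACE (U+00A0)
sample = "a \t b\r\nc\u00a0d"

print(matched(r'\s', sample))            # Unicode-aware by default in Python 3
print(matched(r'\s', sample, re.ASCII))  # restricted to [ \t\n\r\f\v]
print(matched(HORIZONTAL, sample))       # no line breaks, no NO-BREAK SPACE
print(matched(LINE_BREAK, sample))       # CRLF is returned as a single match
```

With this sample, only the first call should pick up the NO-BREAK SPACE, and only the last one treats `\r\n` as a single unit. Other engines may need different classes, so treat this as a starting point rather than a drop-in replacement for `\s`.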