
What does the "\s" shorthand match?


I've seen some regular expressions (regex) using \s when they want to match a space, but I noticed that it also matches line breaks.

Example: the regex [a-z]\s[0-9] (lowercase ASCII letter, followed by \s, followed by a digit) matches both a 1 and

b
2

Because \s matches either a space or a newline (see this regex running here).
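A minimal sketch of the behaviour described above, using Python's re module and the two sample strings from the example (the third string, c3, is an extra added here as a non-matching control):

```python
import re

# Lowercase ASCII letter, then \s, then a digit (the regex from the example).
pattern = re.compile(r'[a-z]\s[0-9]')

samples = ['a 1', 'b\n2', 'c3']
for text in samples:
    # search() returns a match object, or None when nothing matches
    print(repr(text), bool(pattern.search(text)))
```

The first two strings match because \s accepts both the space and the newline; the control has no whitespace between the letter and the digit, so it does not match.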

But I also noticed that, depending on the programming language I use and/or specific settings on their regex API, it may or may not match some other "Unicode spaces", such as the No-Break Space.

Hence, the question: what does the \s shorthand actually match? Does it depend on the language, or are there any other factors that can change its behaviour? Can I always assume that at least spaces and newlines (or any other fixed set of characters) will be matched?


1 answer


The complete set of characters matched by the \s shorthand varies according to the language/API/tool/engine you're using. In addition to that, there might be configurations that change this behaviour.

In general, \s always includes the following characters, at least in the engines that I've seen: the space, the horizontal tab (\t), the line feed (\n), the carriage return (\r) and the form feed (\f).

The vertical tab (\v) (or "LINE TABULATION") is also matched in many languages, such as Java, JavaScript, Ruby and Python.

But in PHP, \s doesn't match a vertical tab. According to the documentation:

\s any whitespace character

The "whitespace" characters are HT (9), LF (10), FF (12), CR (13), and space (32)

Where HT is the horizontal tab, LF is the line feed, FF is the form feed and CR is the carriage return.

And in Perl, the vertical tab is matched only in versions >= 5.18, according to the documentation:

\s means the five characters [ \f\n\r\t], and starting in Perl v5.18, the vertical tab;

Anyway, this list can vary according to the language, API, tool or engine (Google Docs, for example, uses the RE2 engine, which doesn't match the vertical tab). So checking the docs is always recommended.
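As a quick sanity check of the common core (space, \t, \n, \r and \f), here is a small Python sketch. It assumes Python 3's re module and tests each character both in the default mode and with the re.ASCII flag; the dictionary of names is only illustrative.

```python
import re

# The five characters that the engines discussed here all seem to match.
core = {
    ' ': 'space',
    '\t': 'horizontal tab',
    '\n': 'line feed',
    '\r': 'carriage return',
    '\f': 'form feed',
}

unicode_ws = re.compile(r'\s')          # Python 3 default for str patterns
ascii_ws = re.compile(r'\s', re.ASCII)  # ASCII-only mode

for ch, name in core.items():
    # fullmatch() succeeds only if the single character is matched by \s
    print(name, bool(unicode_ws.fullmatch(ch)), bool(ascii_ws.fullmatch(ch)))

# Python is one of the languages where the vertical tab is matched as well.
print('vertical tab', bool(ascii_ws.fullmatch('\v')))
```

This only shows what one engine does; the same check would have to be repeated in any other language or tool you care about.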


Unicode

Many languages have configurations that enable some kind of "Unicode Mode", which makes \s match many other characters.

For example, in Java, if you set the option UNICODE_CHARACTER_CLASS, \s will match all characters that have the Unicode White_Space property (check the full list here). So for this code:

Matcher matcher = Pattern.compile("\\s", Pattern.UNICODE_CHARACTER_CLASS).matcher("");
// loop all Unicode code points
for (int i = 0; i <= Character.MAX_CODE_POINT; i++) {
    String s = new String(new int[] { i }, 0, 1);
    matcher.reset(s);
    if (matcher.find()) {
        // if \s matches, print the codepoint and character name
        System.out.printf("%06X, %s\n", i, Character.getName(i));
    }
}

The output will be:

000009, CHARACTER TABULATION
00000A, LINE FEED (LF)
00000B, LINE TABULATION
00000C, FORM FEED (FF)
00000D, CARRIAGE RETURN (CR)
000020, SPACE
000085, NEXT LINE (NEL)
0000A0, NO-BREAK SPACE
001680, OGHAM SPACE MARK
002000, EN QUAD
002001, EM QUAD
002002, EN SPACE
002003, EM SPACE
002004, THREE-PER-EM SPACE
002005, FOUR-PER-EM SPACE
002006, SIX-PER-EM SPACE
002007, FIGURE SPACE
002008, PUNCTUATION SPACE
002009, THIN SPACE
00200A, HAIR SPACE
002028, LINE SEPARATOR
002029, PARAGRAPH SEPARATOR
00202F, NARROW NO-BREAK SPACE
00205F, MEDIUM MATHEMATICAL SPACE
003000, IDEOGRAPHIC SPACE

See this code running

But if we remove UNICODE_CHARACTER_CLASS, the default is to consider only the aforementioned characters ([ \t\n\r\f\v]):

Matcher matcher = Pattern.compile("\\s").matcher("");
// ... rest of the code is the same

Now the output will be:

000009, CHARACTER TABULATION
00000A, LINE FEED (LF)
00000B, LINE TABULATION
00000C, FORM FEED (FF)
00000D, CARRIAGE RETURN (CR)
000020, SPACE

See this code running


In Python it's similar, but in Python 3 the default is the opposite of Java's: the regex is already in "Unicode Mode", and \s matches all Unicode whitespace characters. With code similar to the previous one:

import unicodedata as u
import re

r = re.compile(r'\s')
for i in range(0x10ffff + 1):
    s = chr(i)
    if r.search(s):
        print('{:02X} {}'.format(i, u.name(s, '')))

The output is:

09 
0A 
0B 
0C 
0D 
1C 
1D 
1E 
1F 
20 SPACE
85 
A0 NO-BREAK SPACE
1680 OGHAM SPACE MARK
2000 EN QUAD
2001 EM QUAD
2002 EN SPACE
2003 EM SPACE
2004 THREE-PER-EM SPACE
2005 FOUR-PER-EM SPACE
2006 SIX-PER-EM SPACE
2007 FIGURE SPACE
2008 PUNCTUATION SPACE
2009 THIN SPACE
200A HAIR SPACE
2028 LINE SEPARATOR
2029 PARAGRAPH SEPARATOR
202F NARROW NO-BREAK SPACE
205F MEDIUM MATHEMATICAL SPACE
3000 IDEOGRAPHIC SPACE

See this code running

If we want the regex to match only [ \t\n\r\f\v], we need to use the ASCII flag:

r = re.compile(r'\s', re.ASCII)
# ... rest of the code is the same

And the output will be:

09 
0A 
0B 
0C 
0D 
20 SPACE

See this code running

PS: in Python 2 the behaviour is the same as Java. By default, \s matches only [ \f\n\r\v\t] (see here), and "Unicode Mode" is enabled by setting the UNICODE flag (see here).


One detail is that, in the tests above, the Python 3 code returned 4 characters that the Java code didn't (1C, 1D, 1E and 1F). My guess is that it's due either to the Unicode version used by each language (I've tested with Java 8, which uses Unicode 6.2.0, and Python 3.8, which uses Unicode 12.1.0), or to details of each regex engine's internal implementation, which might consider factors other than the White_Space property. The second seems more likely: Python documents str.isspace() as also accepting characters whose bidirectional class is WS, B or S, the code points 1C to 1F belong to classes B and S, and \s appears to follow the same rule. Anyway, this confirms that the \s shorthand can and will vary according to the programming language and its version and configuration.
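A small sketch for probing the "internal implementation" hypothesis in Python 3. For each of the four code points it prints whether \s matches, whether str.isspace() agrees, and the general category and bidirectional class reported by the unicodedata module. That \s and str.isspace() share the same rule is an assumption being tested here, not something stated in the re documentation.

```python
import re
import unicodedata

ws = re.compile(r'\s')

# The four code points that the Python test matched but the Java test did not.
for cp in (0x1C, 0x1D, 0x1E, 0x1F):
    ch = chr(cp)
    print('{:02X}'.format(cp),
          bool(ws.fullmatch(ch)),          # does \s match?
          ch.isspace(),                    # does str.isspace() agree?
          unicodedata.category(ch),        # general category
          unicodedata.bidirectional(ch))   # bidirectional class
```

If the two boolean columns agree and the bidirectional class is B or S, the difference from Java comes from how Python defines whitespace rather than from the Unicode version.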

And even different libraries for the same language can behave differently. If I change the Python code above to use the third-party regex module (an awesome module that extends the native re's functionality), the output will be the same as the Java code's.


Final considerations

Other languages and tools might or might not support the "Unicode Mode" (and this might or might not be the default), and they might or might not have a way to enable or disable it.

Some engines might also support Unicode properties, such as \p{IsWhite_Space} to match all Unicode whitespace characters (and this might or might not be equivalent to \s). So always check the docs to make sure that \s matches what you need (and doesn't match what you don't need). As a side note, this is also true for other shorthands, such as \d, \w, \b, etc., because their behaviour can also vary according to the language/engine and its configuration.
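To illustrate the side note about other shorthands, here is a short Python 3 sketch comparing \d and \w in the default mode and with the re.ASCII flag. The two sample characters are arbitrary picks for this example.

```python
import re

# One non-ASCII sample character per shorthand (chosen for illustration).
samples = {
    r'\d': '\u0663',   # ARABIC-INDIC DIGIT THREE
    r'\w': '\u00e9',   # LATIN SMALL LETTER E WITH ACUTE
}

for shorthand, ch in samples.items():
    default_mode = bool(re.fullmatch(shorthand, ch))          # Unicode mode
    ascii_mode = bool(re.fullmatch(shorthand, ch, re.ASCII))  # ASCII-only mode
    print(shorthand, repr(ch), default_mode, ascii_mode)
```

Both characters are matched in the default mode and rejected in ASCII mode, so the same caveat that applies to \s applies here too.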

Obviously, if you're working with very controlled input and you "know for sure" which characters the text can and cannot contain, it probably won't make much difference whether you use \s in Unicode or non-Unicode mode, or simply a literal space in the regex. But if you want to match, let's say, just the spaces and not the newlines, then the choice does make a difference.

In addition to that, some languages support other similar shorthands, such as POSIX character classes. For example, in Java you can use \p{Blank}, and in PHP, [:blank:]; both match [ \t] (a space or a TAB), although this changes in Java when Unicode Mode is enabled. There are also engines that support the \R shorthand, which matches line breaks, though again with differences: in Java it matches \u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029], and in PHP it matches only \r, \n or \r\n.

Depending on what you need to do, these options - when available - can be more suitable than \s. That is the case, for example, if you want to match only line breaks and ignore spaces, or in any other situation where you don't want to match everything that \s covers. YMMV.
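Python's re module has neither [:blank:] nor \R, so the sketch below uses explicit character classes as stand-ins. The names BLANK and LINE_BREAK are made up for this example, and the line-break pattern simply mirrors the Java definition of \R quoted above.

```python
import re

# Stand-in for \p{Blank} / [:blank:]: a space or a TAB.
BLANK = re.compile(r'[ \t]')

# Stand-in for \R, following the Java definition. \r\n comes first so that
# a Windows line ending counts as one break rather than two.
LINE_BREAK = re.compile(r'\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]')

text = 'a b\tc\r\nd\ne'
print(BLANK.findall(text))        # only the space and the TAB
print(LINE_BREAK.findall(text))   # only the line breaks
print(re.findall(r'\s', text))    # everything \s considers whitespace
```

With \s, the \r\n pair is reported as two separate matches and mixed in with the blanks, which is exactly the situation where a narrower pattern is easier to work with.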
