Answer
> *what ways can I get a string's length in D?*

There are many different ways, and it will depend on the content of the strings, their types, and how you define the terms "character" and "length".

If you're dealing only with [ASCII characters](https://www.asciitable.com/), using `length` - as pointed out in the other answers - will be enough. But beyond the ASCII world, things start getting a little more complicated.

---
# Different char and string types

According to the [D language documentation](https://dlang.org/spec/type.html#basic-data-types), there are three different types of char:

| Type | Description |
|:------|:-----------------------------------|
| char | unsigned 8 bit (UTF-8 code unit) |
| wchar | unsigned 16 bit (UTF-16 code unit) |
| dchar | unsigned 32 bit (UTF-32 code unit) |
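As a quick check, those code unit sizes can be read straight from the types themselves:

```d
static assert(char.sizeof == 1);  // UTF-8 code unit: 1 byte
static assert(wchar.sizeof == 2); // UTF-16 code unit: 2 bytes
static assert(dchar.sizeof == 4); // UTF-32 code unit: 4 bytes
```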
And, as stated [here](https://dlang.org/spec/arrays.html#strings) and [here](https://tour.dlang.org/tour/en/basics/alias-strings), a string is just an alias for an immutable array of chars. For each char type above, there's a respective string type:

```d
alias string = immutable(char)[];
alias wstring = immutable(wchar)[]; // UTF-16
alias dstring = immutable(dchar)[]; // UTF-32
```
That distinction is important, because it affects the way strings are stored internally, and also the results we get when measuring their lengths. To understand that, we need to know a few things about [Unicode](http://unicode.org/main.html) and encodings. I'm not going into all the details (that would require entire books), but for the sake of this question - how to get a string's length - here's a short summary:

- Unicode assigns a numeric value to each character, called a **code point**
- From 0 to 127, all code point values and their respective characters match the ASCII table
- Unicode covers well over one hundred thousand characters, which means that many code point values - most of them, actually - require more than 1 byte to be represented (so 1 byte is not always the same as 1 character)
- There are different ways to convert code point values to/from bytes. Those are called "encodings". UTF-8, UTF-16 and UTF-32 are different encodings, each using its own algorithm to do such conversions
- A [Code Unit](http://www.unicode.org/glossary/#code_unit) is "_the minimal bit combination that can represent a unit of encoded text_", and it varies for each encoding: UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units (2 bytes each), and UTF-32 uses 32-bit code units (4 bytes each) - see the snippet right after this list
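To make "code unit" more concrete, here's a minimal snippet using `std.utf.codeLength`, which reports how many code units a given code point needs in a given encoding (the character is just an example):

```d
import std.stdio;
import std.utf : codeLength;

dchar c = 'á'; // code point U+00E1

// Number of code units needed to encode this code point in each encoding:
writefln("%d", codeLength!char(c));  // 2 (UTF-8: 8-bit code units)
writefln("%d", codeLength!wchar(c)); // 1 (UTF-16: 16-bit code unit)
writefln("%d", codeLength!dchar(c)); // 1 (UTF-32: 32-bit code unit)
```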
That said, what happens when we use a string containing a non-ASCII character?

```d
string s1 = "á";
writefln("%d", s1.length); // 2
wstring s2 = "á"w;
writefln("%d", s2.length); // 1
dstring s3 = "á"d;
writefln("%d", s3.length); // 1
```
Note that `s1.length` is `2`, because that's the number of UTF-8 code units used to represent that character. Using `wstring` or `dstring`, on the other hand, results in `1`, because the character `á` encoded in both UTF-16 and UTF-32 requires just one code unit. We can check that by printing the chars directly:

```d
string s1 = "á";
wstring s2 = "á"w;
dstring s3 = "á"d;
foreach(c; s1)
{ // output: c3 a1
    writef("%x ", c);
}
foreach(c; s2)
{ // output: e1
    writef("%x ", c);
}
foreach(c; s3)
{ // output: e1
    writef("%x ", c);
}
```
Note that for UTF-8, not only is the length different, but so are the resulting bytes. That's because the algorithm does some [bit manipulation](https://en.wikipedia.org/wiki/UTF-8#Encoding) on the original value, resulting in the bytes 0xC3 and 0xA1, while UTF-16 and UTF-32 don't (so they keep the code point value U+00E1).
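For the curious, here's roughly what that bit manipulation looks like for a 2-byte UTF-8 sequence (`110yyyyy 10xxxxxx`), applied by hand to U+00E1. This is just an illustrative sketch of the encoding rule, not how the standard library actually implements it:

```d
import std.stdio;

uint cp = 0x00E1; // code point of 'á'

// First byte: 110 prefix plus the high 5 bits of the code point.
ubyte b1 = cast(ubyte)(0b1100_0000 | (cp >> 6));
// Second byte: 10 prefix plus the low 6 bits of the code point.
ubyte b2 = cast(ubyte)(0b1000_0000 | (cp & 0b0011_1111));

writefln("%x %x", b1, b2); // c3 a1
```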
But isn't `á` a single character? I don't know all the languages of the world, but at least in Portuguese, `a` and `á` are different, as they completely change the meaning of a word. I'm not sure if officially they are considered different letters, or the same letter with different semantics (I'm not a linguistics expert), but we usually count it (and consider it) as a single character. So does it make sense to count it as 2?

---

But things can get even more complicated, as nowadays there are weirder things, such as emoji:
```d
string s1 = "💩";
writefln("%d", s1.length); // 4
wstring s2 = "💩"w;
writefln("%d", s2.length); // 2
dstring s3 = "💩"d;
writefln("%d", s3.length); // 1
```
The code point for the 💩 emoji is [U+1F4A9](https://www.fileformat.info/info/unicode/char/1f4a9). In UTF-8, it requires 4 code units to be encoded; in UTF-16, it requires 2 code units (code points above U+FFFF are encoded as a [surrogate pair](https://en.wikipedia.org/wiki/UTF-16#Code_points_from_U+010000_to_U+10FFFF), and each member of the pair is a code unit); and in UTF-32, it requires just 1 code unit (recall that in UTF-32 each code unit uses 4 bytes, which is enough to store 0x1F4A9). That's why the lengths are different for each string type.

But shouldn't a single 💩 be counted as just 1? Why 4? Or 2? Or any other value?

Maybe we should count the number of code points, regardless of the string type or the kinds of characters it contains. To do that, you could use [`std.utf.count`](https://dlang.org/library/std/utf/count.html):
```d
import std.utf: count;
string s1 = "💩";
wstring s2 = "💩"w;
dstring s3 = "💩"d;
writefln("%d", count(s1)); // 1
writefln("%d", count(s2)); // 1
writefln("%d", count(s3)); // 1
```
But sometimes that's just not enough. Unicode has even weirder things:

```d
import std.utf: count;
string s1 = "Ä";
wstring s2 = "Ä"w;
dstring s3 = "Ä"d;
writefln("%d", count(s1)); // 2
writefln("%d", count(s2)); // 2
writefln("%d", count(s3)); // 2
```
The character `Ä` is being counted as 2 code points. That's because Unicode defines two ways of representing this character:

- as the code point [U+00C4 (LATIN CAPITAL LETTER A WITH DIAERESIS)](https://www.fileformat.info/info/unicode/char/c4/index.htm)
- as a combination of two code points:
    - [U+0041 (LATIN CAPITAL LETTER A)](https://www.fileformat.info/info/unicode/char/41/index.htm)
    - [U+0308 (COMBINING DIAERESIS)](https://www.fileformat.info/info/unicode/char/308/index.htm)

The former is called NFC (Normalization Form C/Canonical Composition), and the latter, NFD (Normalization Form D/Canonical Decomposition), and they are defined [here](https://unicode.org/reports/tr15/#Norm_Forms). Both are displayed the same way on screen, and just by looking you can't tell the difference. In the code above, Ä is in NFD, but in the code below, it's in NFC:
```d
string s1 = "Ä"; // now Ä is in NFC (just one code point)
wstring s2 = "Ä"w;
dstring s3 = "Ä"d;
writefln("%d", count(s1)); // 1
writefln("%d", count(s2)); // 1
writefln("%d", count(s3)); // 1
```
> You could also create the strings using the code points' hexadecimal values, such as `string s1 = "\u00c4";` to create Ä in NFC, or `string s1 = "\u0041\u0308";` to create it in NFD, and print them to see the same Ä on the screen, but different counts/lengths for each case.
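For instance, a minimal sketch of that comparison (the two escapes are exactly the representations described above):

```d
import std.stdio;
import std.utf : count;

string nfc = "\u00c4";       // Ä as a single precomposed code point
string nfd = "\u0041\u0308"; // Ä as 'A' + combining diaeresis

writeln(nfc);               // Ä
writeln(nfd);               // Ä (looks identical on screen)
writefln("%d", count(nfc)); // 1
writefln("%d", count(nfd)); // 2
```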
But is `Ä` also considered a single character? Should it be counted as 1, regardless of the normalization form being used?

And here comes another concept, as defined by Unicode: a sequence of two or more code points that together form a single "unit" is called a [Grapheme Cluster](https://unicode.org/glossary/#grapheme_cluster) (see the full definition [here](http://www.unicode.org/reports/tr29/)), also known as a "user-perceived character" (aka "_it's one single drawing/image/'thing' on the screen, so I'll count it as 1_").

To count the number of grapheme clusters, you could use the [`byGrapheme` function](https://dlang.org/library/std/uni/by_grapheme.html):
```d
import std.range : walkLength;
import std.uni: byGrapheme;
// create strings containing Ä in NFD
string s1 = "\u0041\u0308";
wstring s2 = "\u0041\u0308"w;
dstring s3 = "\u0041\u0308"d;
writefln("%d", s1.byGrapheme.walkLength); // 1
writefln("%d", s2.byGrapheme.walkLength); // 1
writefln("%d", s3.byGrapheme.walkLength); // 1
```
Or, you could normalize the strings to NFC and count the code points:

```d
import std.utf: count;
import std.uni;
string s1 = "\u0041\u0308";
wstring s2 = "\u0041\u0308"w;
dstring s3 = "\u0041\u0308"d;
writefln("%d", count(normalize!NFC(s1))); // 1
writefln("%d", count(normalize!NFC(s2))); // 1
writefln("%d", count(normalize!NFC(s3))); // 1
```
But bear in mind that, depending on the language and characters involved, counting grapheme clusters is not the same as normalizing and counting code points, because not all characters have a single-code-point NFC representation (for those, the decomposed sequence is the only representation).
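One example is g̃ ('g' followed by U+0303, COMBINING TILDE), which - as far as I know - has no precomposed single-code-point form, so NFC cannot compose it. A small sketch of the difference:

```d
import std.stdio;
import std.range : walkLength;
import std.uni; // byGrapheme, normalize, NFC
import std.utf : count;

// 'g' + COMBINING TILDE; assuming there is no precomposed "g with tilde"
// code point, NFC leaves it as two code points.
string g = "\u0067\u0303";

writefln("%d", count(normalize!NFC(g))); // 2 - still two code points after NFC
writefln("%d", g.byGrapheme.walkLength); // 1 - but one grapheme cluster
```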
Not to mention, of course, emoji. There's a specific type of grapheme cluster called an [Emoji ZWJ Sequence](https://emojipedia.org/emoji-zwj-sequence/), which is a, well, sequence of emoji joined together by the [Zero Width Joiner (ZWJ) character](https://www.fileformat.info/info/unicode/char/200d/index.htm). One example is the family emoji, such as the family with [dad, mom and 2 daughters](https://emojipedia.org/family-man-woman-girl-girl/), which is actually built from 7 code points:

- [MAN][1]
- [ZERO WIDTH JOINER][2]
- [WOMAN][3]
- [ZERO WIDTH JOINER][2]
- [GIRL][4]
- [ZERO WIDTH JOINER][2]
- [GIRL][4]
So, such a string could be created as:

```d
string s1 = "\U0001f468\u200d\U0001f469\u200d\U0001f467\u200d\U0001f467";
```
But, unfortunately, the current implementation of `byGrapheme` doesn't seem to support Emoji ZWJ Sequences: for the string above, `s1.byGrapheme.walkLength` returns `4` (each person emoji is counted as a separate grapheme cluster), and creating a `Grapheme` with it results in an invalid one:

```d
auto g = Grapheme("\U0001f468\u200d\U0001f469\u200d\U0001f467\u200d\U0001f467");
writeln(g.valid); // false
```
In that case, to count this whole thing as `1`, you would have to manually process the string, or rely on some external library (I don't know if there's any, though).
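A rough sketch of what that manual processing could look like: walk the grapheme clusters and merge the ones joined by a ZWJ. This assumes - as the counts above suggest - that `byGrapheme` keeps each ZWJ attached to the end of the preceding cluster; the helper name is made up, and this is nowhere near a complete implementation of emoji segmentation:

```d
import std.stdio;
import std.uni : byGrapheme;

// Hypothetical helper: counts user-perceived characters, treating grapheme
// clusters joined by ZERO WIDTH JOINER (U+200D) as a single unit.
size_t countMergingZwj(string s)
{
    size_t n = 0;
    bool joinWithPrevious = false;
    foreach (g; s.byGrapheme)
    {
        if (!joinWithPrevious)
            n++;
        // If this cluster ends with a ZWJ, the next cluster belongs to the
        // same emoji sequence, so it shouldn't be counted again.
        joinWithPrevious = g.length > 0 && g[g.length - 1] == '\u200d';
    }
    return n;
}

string family = "\U0001f468\u200d\U0001f469\u200d\U0001f467\u200d\U0001f467";
writefln("%d", countMergingZwj(family)); // 1
```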
---

### Conclusion

Depending on what you mean by "length" (and also the string type and its contents), the way to determine its value can vary.
Some will say that one or another approach would be the most "obvious" way, but since we started dealing with non-ASCII characters (and ended up with this whole Unicode mess), nothing is obvious anymore, IMO. The old idea of "1 char = 1 byte" is outdated (or valid only for a very limited subset of characters), and even the "1 char = 1 code point" approach might not work for all cases.

Do you want the number of bytes (in a particular encoding), or code units, or code points, or grapheme clusters? Depending on the situation, all of them will be the same (e.g. when dealing with ASCII-only characters in UTF-8), but change one of those parameters (different encoding, non-ASCII characters, strings in NFD, emoji ZWJ sequences, etc.) and things can get messy. In that case, you need to know exactly what you want to count, and you might get different results depending on the method used. Unfortunately, there's no silver bullet, and YMMV.
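To wrap up, here's a side-by-side sketch of those different "lengths" for a single string, using the functions shown above (the sample text is arbitrary):

```d
import std.stdio;
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : count;

string s = "a\u0041\u0308💩"; // 'a', then Ä in NFD, then the pile of poo emoji

writefln("%d", s.length);                // 8 - UTF-8 code units (bytes)
writefln("%d", count(s));                // 4 - code points
writefln("%d", s.byGrapheme.walkLength); // 3 - grapheme clusters
```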
[1]: http://www.fileformat.info/info/unicode/char/1f468
[2]: http://www.fileformat.info/info/unicode/char/200d
[3]: http://www.fileformat.info/info/unicode/char/1f469
[4]: http://www.fileformat.info/info/unicode/char/1f467
#4: Post edited
- > *what ways can I get a string's length in D?*
- There are many different ways, and that will depend on the content of the strings, their types, and how you define the terms "character" and "length".
- If you're dealing only with [ASCII characters](https://www.asciitable.com/), using `length` - as pointed by the other answers - will be enough. But beyond ASCII world, things start getting a little bit more complicated.
- ---
- # Different char and string types
- According to the [D language documentation](https://dlang.org/spec/type.html#basic-data-types), there are three different types of char:
- | Type | Description |
- |:------|:-----------------------------------|
- | char | unsigned 8 bit (UTF-8 code unit) |
- | wchar | unsigned 16 bit (UTF-16 code unit) |
- | dchar | unsigned 32 bit (UTF-32 code unit) |
- And, as stated [here](https://dlang.org/spec/arrays.html#strings) and [here](https://tour.dlang.org/tour/en/basics/alias-strings), a string is just an alias for an immutable array of chars. For each char type above, there's a respective string type:
- ```d
- alias string = immutable(char)[];
- alias wstring = immutable(wchar)[]; // UTF-16
- alias dstring = immutable(dchar)[]; // UTF-32
- ```
- That distinction is important, because it affects the way strings are stored internally, and also the different results we might get when getting their lengths. To understand that, we need to know some things about [Unicode](http://unicode.org/main.html) and encodings. I'm not going into all the details (which would require entire books for it) but for the sake of this question - how to get a string's length - here's a short summary:
- - Unicode assigns a numeric value to each character, called **code point**
- - From 0 to 127, all code point values and their respective characters match the ASCII table
- - Unicode covers thousands of characters, which means that many code point values - most of them, actually - will require more than 1 byte to be represented (so 1 byte is not always the same as 1 character)
- - There are different ways to convert code point values to/from bytes. Those are called "encodings". UTF-8, UTF-16 and UTF-32 are different encodings, each using their own algorithms to do such conversions
- - A [Code Unit](http://www.unicode.org/glossary/#code_unit) is "_the minimal bit combination that can represent a unit of encoded text_", and it varies for each encoding. For example, UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units (each code unit has 2 bytes), and UTF-32 uses 32-bit code units (each code unit has 4 bytes).
- That said, what happens when we use a string containing a non-ASCII character?
- ```d
- string s1 = "á";
- writefln("%d", s1.length); // 2
- wstring s2 = "á"w;
- writefln("%d", s2.length); // 1
- dstring s3 = "á"d;
- writefln("%d", s3.length); // 1
- ```
- Note that `s1.length` is `2`, because that's the number of UTF-8 code units used to represent such character. Using `wstring` or `dstring`, on the other hand, results in `1`, because the character `á` encoded in both UTF-16 and UTF-32 requires just one code unit. We can check that by printing the chars directly:
- ```d
writefln("%x %x", s1[0], s1[1]); // c3 a1writefln("%x", s2[0]); // e1writefln("%x", s3[0]); // e1- ```
Note that for UTF-8, not only the length is different, but also the resulting bytes. That's because the algorithm does some [bit manipulation](https://en.wikipedia.org/wiki/UTF-8#Encoding) on the original value, while UTF-16 and UTF-32 don't (so they keep the code point value U+00E1).- But isn't `á` a single character? I don't know all the languages of the world, but at least in Portuguese, `a` and `á` are different, as they completely change the meaning of a word. I'm not sure if officialy they are considered different letters, of the same letter with different semantics (I'm not a linguistics expert), but we usually count it (and consider it) as a single character. So does it make sense to count it as 2?
- ---
- But things can get even more complicated, as nowadays there are weirder things such as emoji:
- ```d
- string s1 = "💩";
- writefln("%d", s1.length); // 4
- wstring s2 = "💩"w;
- writefln("%d", s2.length); // 2
- dstring s3 = "💩"d;
- writefln("%d", s3.length); // 1
- ```
- The code point for 💩 emoji is [U+1F4A9](https://www.fileformat.info/info/unicode/char/1f4a9). In UTF-8, it requires 4 code units to be encoded, in UTF-16, it requires 2 code units (as code points above U+FFFF are encoded as a [surrogate pair](https://en.wikipedia.org/wiki/UTF-16#Code_points_from_U+010000_to_U+10FFFF) - each member of this pair is a code unit), and in UTF-32, it requires just 1 code unit (reminding that in UTF-32, each code unit uses 4 bytes, which is enough to store 0x1F4A9). That's why the lengths are different for each string type.
- But shouldn't a single 💩 be counted as just 1? Why 4? Or 2? Or any other value?
- Maybe we should count the number of code points, regardless of the string type or the types of characters it contains. To do that, you could use [`str.utf.count`](https://dlang.org/library/std/utf/count.html):
- ```d
- import std.utf: count;
- writefln("%d", count(s1)); // 1
- writefln("%d", count(s2)); // 1
- writefln("%d", count(s3)); // 1
- ```
- But sometimes it's just not enough. Unicode has even weirder things:
- ```d
- import std.utf: count;
- string s1 = "Ä";
- wstring s2 = "Ä"w;
- dstring s3 = "Ä"d;
- writefln("%d", count(s1)); // 2
- writefln("%d", count(s2)); // 2
- writefln("%d", count(s3)); // 2
- ```
- The character `Ä` is being counted as 2 code points. That's because Unicode defines two ways of representing this character:
- - as the code point [U+00C4 (LATIN CAPITAL LETTER A WITH DIAERESIS)](https://www.fileformat.info/info/unicode/char/c4/index.htm)
- - as a combination of two code points:
- - [U+0041 (LATIN CAPITAL LETTER A)](https://www.fileformat.info/info/unicode/char/41/index.htm)
- - [U+0308 (COMBINING DIAERESIS)](https://www.fileformat.info/info/unicode/char/308/index.htm)
- The former is called NFC (Normalization Form C/Canonical Composition), and the latter, NFD (Normalization Form D/Canonical Decomposition), and they are defined [here](https://unicode.org/reports/tr15/#Norm_Forms). Both are displayed the same way on screen, and just by looking you can't tell the difference. In the code above, Ä is in NFD, but in the code below, it's in NFC:
- ```d
- string s1 = "Ä"; // now Ä is in NFC (just one code point)
- wstring s2 = "Ä"w;
- dstring s3 = "Ä"d;
- writefln("%d", count(s1)); // 1
- writefln("%d", count(s2)); // 1
- writefln("%d", count(s3)); // 1
- ```
- > You could also create the strings using the code points hexadecimal values, such as `string s1 = "\u00c4";` to create Ä in NFC, or `string s1 = "\u0041\u0308";` to create it in NFD, and printing those to see the same Ä in the screen, but different counts/lengths for each case.
- But is `Ä` also considered a single character? Should it be counted as 1, regardless of the normalization form being used?
And here it comes another concept, as defined by Unicode: a sequence of two or more codepoints that together form a single "unit" is called a [Grapheme Cluster](https://unicode.org/glossary/#grapheme_cluster) (see full definition [here](http://www.unicode.org/reports/tr29/)), also known as "User-perceived character" (aka "_it's one single drawing on the screen, so I count it as 1_").- ```d
- import std.range : walkLength;
- import std.uni: byGrapheme;
- // create strings containing Ä in NFD
- string s1 = "\u0041\u0308";
- wstring s2 = "\u0041\u0308"w;
- dstring s3 = "\u0041\u0308"d;
- writefln("%d", s1.byGrapheme.walkLength); // 1
- writefln("%d", s2.byGrapheme.walkLength); // 1
- writefln("%d", s3.byGrapheme.walkLength); // 1
- ```
- Or, you could normalize the strings to NFC and count the code points:
- ```d
- import std.utf: count;
- import std.uni;
- string s1 = "\u0041\u0308";
- wstring s2 = "\u0041\u0308"w;
- dstring s3 = "\u0041\u0308"d;
- writefln("%d", count(normalize!NFC(s1))); // 1
- writefln("%d", count(normalize!NFC(s2))); // 1
- writefln("%d", count(normalize!NFC(s3))); // 1
- ```
- But remind that, depending on the language and characters involved, counting grapheme clusters would not be the same as normalizing, because not all characters have a single-code-point-NFC representation (for those, NFD is the only way to represent them).
- Not to mention, of course, emojis. There's a specific type of grapheme cluster called [Emoji ZWJ Sequence](https://emojipedia.org/emoji-zwj-sequence/), which is a, well, sequence of emojis, joined together by the [Zero Width Joiner character (ZWJ)](https://www.fileformat.info/info/unicode/char/200d/index.htm). One example are the family emojis, such as the family with [dad, mom and 2 daughters](https://emojipedia.org/family-man-woman-girl-girl/), which is actually built with 7 code points:
- - [MAN][1]
- - [ZERO WIDTH JOINER][2]
- - [WOMAN][3]
- - [ZERO WIDTH JOINER][2]
- - [GIRL][4]
- - [ZERO WIDTH JOINER][2]
- - [GIRL][4]
- So, such a string could be created as:
- ```d
- string s1 = "\U0001f468\u200d\U0001f469\u200d\U0001f467\u200d\U0001f467";
- ```
- But, unfortunately, the current implementation of `byGrapheme` doesn't seem to support Emoji ZWJ Sequences, as for the string above, `s1.byGrapheme.walkLength` returns `4` (each face emoji is counted as a grapheme cluster), and creating a `Grapheme` with it results in an invalid one:
- ```d
- auto g = Grapheme("\U0001f468\u200d\U0001f469\u200d\U0001f467\u200d\U0001f467");
- writeln(g.valid); // false
- ```
- In that case, to count this whole thing as `1`, you would have to manually process the string, or rely on some external library (I don't know if there's any, though).
- ---
- ### Conclusion
- Depending on what you mean by "length" (and also the string type and its contents), the way to determine its value can vary.
Some will say that one or another approach would be the most "obvious" way, but since we started dealing with non-ASCII characters (and ended up with this whole Unicode mess), nothing is obvious anymore, IMO. The old idea of "1 char = 1 byte" is outdated (or is valid only to a very limited subset of chars, if you may), and even the "1 char = 1 code point" might not work for all cases.Do you want the number of bytes (in a particular encoding), or code units, or code points, or grapheme clusters? Depending on the situation, all of them will be the same (ex: when dealing with ASCII-only chars in UTF-8 encoding), but change one of those parameters (different encoding, non-ASCII chars, strings in NFD, emoji ZWJ sequences, etc) and things will get messy. In that case, you need to know exactly what you want to count, and might get different results depending on the method used. Unfortunately, there's no silver bullet, and YMMV.- [1]: http://www.fileformat.info/info/unicode/char/1f468
- [2]: http://www.fileformat.info/info/unicode/char/200d
- [3]: http://www.fileformat.info/info/unicode/char/1f469
- [4]: http://www.fileformat.info/info/unicode/char/1f467
- > *what ways can I get a string's length in D?*
- There are many different ways, and that will depend on the content of the strings, their types, and how you define the terms "character" and "length".
- If you're dealing only with [ASCII characters](https://www.asciitable.com/), using `length` - as pointed by the other answers - will be enough. But beyond ASCII world, things start getting a little bit more complicated.
- ---
- # Different char and string types
- According to the [D language documentation](https://dlang.org/spec/type.html#basic-data-types), there are three different types of char:
- | Type | Description |
- |:------|:-----------------------------------|
- | char | unsigned 8 bit (UTF-8 code unit) |
- | wchar | unsigned 16 bit (UTF-16 code unit) |
- | dchar | unsigned 32 bit (UTF-32 code unit) |
- And, as stated [here](https://dlang.org/spec/arrays.html#strings) and [here](https://tour.dlang.org/tour/en/basics/alias-strings), a string is just an alias for an immutable array of chars. For each char type above, there's a respective string type:
- ```d
- alias string = immutable(char)[];
- alias wstring = immutable(wchar)[]; // UTF-16
- alias dstring = immutable(dchar)[]; // UTF-32
- ```
- That distinction is important, because it affects the way strings are stored internally, and also the different results we might get when getting their lengths. To understand that, we need to know some things about [Unicode](http://unicode.org/main.html) and encodings. I'm not going into all the details (which would require entire books for it) but for the sake of this question - how to get a string's length - here's a short summary:
- - Unicode assigns a numeric value to each character, called **code point**
- - From 0 to 127, all code point values and their respective characters match the ASCII table
- - Unicode covers thousands of characters, which means that many code point values - most of them, actually - will require more than 1 byte to be represented (so 1 byte is not always the same as 1 character)
- - There are different ways to convert code point values to/from bytes. Those are called "encodings". UTF-8, UTF-16 and UTF-32 are different encodings, each using their own algorithms to do such conversions
- - A [Code Unit](http://www.unicode.org/glossary/#code_unit) is "_the minimal bit combination that can represent a unit of encoded text_", and it varies for each encoding. For example, UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units (each code unit has 2 bytes), and UTF-32 uses 32-bit code units (each code unit has 4 bytes).
- That said, what happens when we use a string containing a non-ASCII character?
- ```d
- string s1 = "á";
- writefln("%d", s1.length); // 2
- wstring s2 = "á"w;
- writefln("%d", s2.length); // 1
- dstring s3 = "á"d;
- writefln("%d", s3.length); // 1
- ```
- Note that `s1.length` is `2`, because that's the number of UTF-8 code units used to represent such character. Using `wstring` or `dstring`, on the other hand, results in `1`, because the character `á` encoded in both UTF-16 and UTF-32 requires just one code unit. We can check that by printing the chars directly:
- ```d
- string s1 = "á";
- wstring s2 = "á"w;
- dstring s3 = "á"d;
- foreach(c; s1)
- { // outout: c3 a1
- writef("%x ", c);
- }
- foreach(c; s2)
- { // outout: e1
- writef("%x ", c);
- }
- foreach(c; s3)
- { // outout: e1
- writef("%x ", c);
- }
- ```
- Note that for UTF-8, not only the length is different, but also the resulting bytes. That's because the algorithm does some [bit manipulation](https://en.wikipedia.org/wiki/UTF-8#Encoding) on the original value, resulting in bytes 0xC3 and 0xA1, while UTF-16 and UTF-32 don't (so they keep the code point value U+00E1).
- But isn't `á` a single character? I don't know all the languages of the world, but at least in Portuguese, `a` and `á` are different, as they completely change the meaning of a word. I'm not sure if officialy they are considered different letters, of the same letter with different semantics (I'm not a linguistics expert), but we usually count it (and consider it) as a single character. So does it make sense to count it as 2?
- ---
- But things can get even more complicated, as nowadays there are weirder things such as emoji:
- ```d
- string s1 = "💩";
- writefln("%d", s1.length); // 4
- wstring s2 = "💩"w;
- writefln("%d", s2.length); // 2
- dstring s3 = "💩"d;
- writefln("%d", s3.length); // 1
- ```
- The code point for 💩 emoji is [U+1F4A9](https://www.fileformat.info/info/unicode/char/1f4a9). In UTF-8, it requires 4 code units to be encoded, in UTF-16, it requires 2 code units (as code points above U+FFFF are encoded as a [surrogate pair](https://en.wikipedia.org/wiki/UTF-16#Code_points_from_U+010000_to_U+10FFFF) - each member of this pair is a code unit), and in UTF-32, it requires just 1 code unit (reminding that in UTF-32, each code unit uses 4 bytes, which is enough to store 0x1F4A9). That's why the lengths are different for each string type.
- But shouldn't a single 💩 be counted as just 1? Why 4? Or 2? Or any other value?
- Maybe we should count the number of code points, regardless of the string type or the types of characters it contains. To do that, you could use [`str.utf.count`](https://dlang.org/library/std/utf/count.html):
- ```d
- import std.utf: count;
- string s1 = "💩";
- wstring s2 = "💩"w;
- dstring s3 = "💩"d;
- writefln("%d", count(s1)); // 1
- writefln("%d", count(s2)); // 1
- writefln("%d", count(s3)); // 1
- ```
- But sometimes it's just not enough. Unicode has even weirder things:
- ```d
- import std.utf: count;
- string s1 = "Ä";
- wstring s2 = "Ä"w;
- dstring s3 = "Ä"d;
- writefln("%d", count(s1)); // 2
- writefln("%d", count(s2)); // 2
- writefln("%d", count(s3)); // 2
- ```
- The character `Ä` is being counted as 2 code points. That's because Unicode defines two ways of representing this character:
- - as the code point [U+00C4 (LATIN CAPITAL LETTER A WITH DIAERESIS)](https://www.fileformat.info/info/unicode/char/c4/index.htm)
- - as a combination of two code points:
- - [U+0041 (LATIN CAPITAL LETTER A)](https://www.fileformat.info/info/unicode/char/41/index.htm)
- - [U+0308 (COMBINING DIAERESIS)](https://www.fileformat.info/info/unicode/char/308/index.htm)
- The former is called NFC (Normalization Form C/Canonical Composition), and the latter, NFD (Normalization Form D/Canonical Decomposition), and they are defined [here](https://unicode.org/reports/tr15/#Norm_Forms). Both are displayed the same way on screen, and just by looking you can't tell the difference. In the code above, Ä is in NFD, but in the code below, it's in NFC:
- ```d
- string s1 = "Ä"; // now Ä is in NFC (just one code point)
- wstring s2 = "Ä"w;
- dstring s3 = "Ä"d;
- writefln("%d", count(s1)); // 1
- writefln("%d", count(s2)); // 1
- writefln("%d", count(s3)); // 1
- ```
- > You could also create the strings using the code points hexadecimal values, such as `string s1 = "\u00c4";` to create Ä in NFC, or `string s1 = "\u0041\u0308";` to create it in NFD, and printing those to see the same Ä in the screen, but different counts/lengths for each case.
- But is `Ä` also considered a single character? Should it be counted as 1, regardless of the normalization form being used?
- And here it comes another concept, as defined by Unicode: a sequence of two or more codepoints that together form a single "unit" is called a [Grapheme Cluster](https://unicode.org/glossary/#grapheme_cluster) (see full definition [here](http://www.unicode.org/reports/tr29/)), also known as "User-perceived character" (aka "_It's one single drawing/image/'thing' on the screen, so I'll count it as 1_").
- To count the number of grapheme clusters, you could use the [`byGrapheme` function](https://dlang.org/library/std/uni/by_grapheme.html):
- ```d
- import std.range : walkLength;
- import std.uni: byGrapheme;
- // create strings containing Ä in NFD
- string s1 = "\u0041\u0308";
- wstring s2 = "\u0041\u0308"w;
- dstring s3 = "\u0041\u0308"d;
- writefln("%d", s1.byGrapheme.walkLength); // 1
- writefln("%d", s2.byGrapheme.walkLength); // 1
- writefln("%d", s3.byGrapheme.walkLength); // 1
- ```
- Or, you could normalize the strings to NFC and count the code points:
- ```d
- import std.utf: count;
- import std.uni;
- string s1 = "\u0041\u0308";
- wstring s2 = "\u0041\u0308"w;
- dstring s3 = "\u0041\u0308"d;
- writefln("%d", count(normalize!NFC(s1))); // 1
- writefln("%d", count(normalize!NFC(s2))); // 1
- writefln("%d", count(normalize!NFC(s3))); // 1
- ```
- But remind that, depending on the language and characters involved, counting grapheme clusters would not be the same as normalizing, because not all characters have a single-code-point-NFC representation (for those, NFD is the only way to represent them).
- Not to mention, of course, emojis. There's a specific type of grapheme cluster called [Emoji ZWJ Sequence](https://emojipedia.org/emoji-zwj-sequence/), which is a, well, sequence of emojis, joined together by the [Zero Width Joiner character (ZWJ)](https://www.fileformat.info/info/unicode/char/200d/index.htm). One example are the family emojis, such as the family with [dad, mom and 2 daughters](https://emojipedia.org/family-man-woman-girl-girl/), which is actually built with 7 code points:
- - [MAN][1]
- - [ZERO WIDTH JOINER][2]
- - [WOMAN][3]
- - [ZERO WIDTH JOINER][2]
- - [GIRL][4]
- - [ZERO WIDTH JOINER][2]
- - [GIRL][4]
- So, such a string could be created as:
- ```d
- string s1 = "\U0001f468\u200d\U0001f469\u200d\U0001f467\u200d\U0001f467";
- ```
- But, unfortunately, the current implementation of `byGrapheme` doesn't seem to support Emoji ZWJ Sequences, as for the string above, `s1.byGrapheme.walkLength` returns `4` (each face emoji is counted as a grapheme cluster), and creating a `Grapheme` with it results in an invalid one:
- ```d
- auto g = Grapheme("\U0001f468\u200d\U0001f469\u200d\U0001f467\u200d\U0001f467");
- writeln(g.valid); // false
- ```
- In that case, to count this whole thing as `1`, you would have to manually process the string, or rely on some external library (I don't know if there's any, though).
- ---
- ### Conclusion
- Depending on what you mean by "length" (and also the string type and its contents), the way to determine its value can vary.
- Some will say that one or another approach would be the most "obvious" way, but since we started dealing with non-ASCII characters (and ended up with this whole Unicode mess), nothing is obvious anymore, IMO. The old idea of "1 char = 1 byte" is outdated (or is valid only to a very limited subset of chars, if you may), and even the "1 char = 1 code point" approach might not work for all cases.
- Do you want the number of bytes (in a particular encoding), or code units, or code points, or grapheme clusters? Depending on the situation, all of them will be the same (ex: when dealing with ASCII-only chars in UTF-8 encoding), but change one of those parameters (different encoding, non-ASCII chars, strings in NFD, emoji ZWJ sequences, etc) and things can get messy. In that case, you need to know exactly what you want to count, and might get different results depending on the method used. Unfortunately, there's no silver bullet, and YMMV.
- [1]: http://www.fileformat.info/info/unicode/char/1f468
- [2]: http://www.fileformat.info/info/unicode/char/200d
- [3]: http://www.fileformat.info/info/unicode/char/1f469
- [4]: http://www.fileformat.info/info/unicode/char/1f467
#3: Post edited
- > *what ways can I get a string's length in D?*
- There are many different ways, and that will depend on the content of the strings, their types, and how you define the terms "character" and "length".
If you're dealing just with [ASCII characters](https://www.asciitable.com/), using `length` - as pointed by the other answers - will be enough. But beyond ASCII world, things start getting a little bit more complicated.- ---
- # Different char and string types
- According to the [D language documentation](https://dlang.org/spec/type.html#basic-data-types), there are three different types of char:
- | Type | Description |
- |:------|:-----------------------------------|
- | char | unsigned 8 bit (UTF-8 code unit) |
- | wchar | unsigned 16 bit (UTF-16 code unit) |
- | dchar | unsigned 32 bit (UTF-32 code unit) |
- And, as stated [here](https://dlang.org/spec/arrays.html#strings) and [here](https://tour.dlang.org/tour/en/basics/alias-strings), a string is just an alias for an immutable array of chars. For each char type above, there's a respective string type:
```- alias string = immutable(char)[];
- alias wstring = immutable(wchar)[]; // UTF-16
- alias dstring = immutable(dchar)[]; // UTF-32
- ```
- That distinction is important, because it affects the way strings are stored internally, and also the different results we might get when getting their lengths. To understand that, we need to know some things about [Unicode](http://unicode.org/main.html) and encodings. I'm not going into all the details (which would require entire books for it) but for the sake of this question - how to get a string's length - here's a short summary:
- - Unicode assigns a numeric value to each character, called **code point**
- - From 0 to 127, all code point values and their respective characters match the ASCII table
- - Unicode covers thousands of characters, which means that many code point values - most of them, actually - will require more than 1 byte to be represented (so 1 byte is not always the same as 1 character)
- - There are different ways to convert code point values to/from bytes. Those are called "encodings". UTF-8, UTF-16 and UTF-32 are different encodings, each using their own algorithms to do such conversions
- - A [Code Unit](http://www.unicode.org/glossary/#code_unit) is "_the minimal bit combination that can represent a unit of encoded text_", and it varies for each encoding. For example, UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units (each code unit has 2 bytes), and UTF-32 uses 32-bit code units (each code unit has 4 bytes).
- That said, what happens when we use a string containing a non-ASCII character?
```- string s1 = "á";
- writefln("%d", s1.length); // 2
- wstring s2 = "á"w;
- writefln("%d", s2.length); // 1
- dstring s3 = "á"d;
- writefln("%d", s3.length); // 1
- ```
- Note that `s1.length` is `2`, because that's the number of UTF-8 code units used to represent such character. Using `wstring` or `dstring`, on the other hand, results in `1`, because the character `á` encoded in both UTF-16 and UTF-32 requires just one code unit. We can check that by printing the chars directly:
```- writefln("%x %x", s1[0], s1[1]); // c3 a1
- writefln("%x", s2[0]); // e1
- writefln("%x", s3[0]); // e1
- ```
- Note that for UTF-8, not only the length is different, but also the resulting bytes. That's because the algorithm does some [bit manipulation](https://en.wikipedia.org/wiki/UTF-8#Encoding) on the original value, while UTF-16 and UTF-32 don't (so they keep the code point value U+00E1).
- But isn't `á` a single character? I don't know all the languages of the world, but at least in Portuguese, `a` and `á` are different, as they completely change the meaning of a word. I'm not sure if officialy they are considered different letters, of the same letter with different semantics (I'm not a linguistics expert), but we usually count it (and consider it) as a single character. So does it make sense to count it as 2?
- ---
- But things can get even more complicated, as nowadays there are weirder things such as emoji:
```- string s1 = "💩";
- writefln("%d", s1.length); // 4
- wstring s2 = "💩"w;
- writefln("%d", s2.length); // 2
- dstring s3 = "💩"d;
- writefln("%d", s3.length); // 1
- ```
- The code point for 💩 emoji is [U+1F4A9](https://www.fileformat.info/info/unicode/char/1f4a9). In UTF-8, it requires 4 code units to be encoded, in UTF-16, it requires 2 code units (as code points above U+FFFF are encoded as a [surrogate pair](https://en.wikipedia.org/wiki/UTF-16#Code_points_from_U+010000_to_U+10FFFF) - each member of this pair is a code unit), and in UTF-32, it requires just 1 code unit (reminding that in UTF-32, each code unit uses 4 bytes, which is enough to store 0x1F4A9). That's why the lengths are different for each string type.
- But shouldn't a single 💩 be counted as just 1? Why 4? Or 2? Or any other value?
- Maybe we should count the number of code points, regardless of the string type or the types of characters it contains. To do that, you could use [`str.utf.count`](https://dlang.org/library/std/utf/count.html):
```- import std.utf: count;
- writefln("%d", count(s1)); // 1
- writefln("%d", count(s2)); // 1
- writefln("%d", count(s3)); // 1
- ```
- But sometimes it's just not enough. Unicode has even weirder things:
```- import std.utf: count;
- string s1 = "Ä";
- wstring s2 = "Ä"w;
- dstring s3 = "Ä"d;
- writefln("%d", count(s1)); // 2
- writefln("%d", count(s2)); // 2
- writefln("%d", count(s3)); // 2
- ```
- The character `Ä` is being counted as 2 code points. That's because Unicode defines two ways of representing this character:
- - as the code point [U+00C4 (LATIN CAPITAL LETTER A WITH DIAERESIS)](https://www.fileformat.info/info/unicode/char/c4/index.htm)
- - as a combination of two code points:
- - [U+0041 (LATIN CAPITAL LETTER A)](https://www.fileformat.info/info/unicode/char/41/index.htm)
- - [U+0308 (COMBINING DIAERESIS)](https://www.fileformat.info/info/unicode/char/308/index.htm)
- The former is called NFC (Normalization Form C/Canonical Composition), and the latter, NFD (Normalization Form D/Canonical Decomposition), and they are defined [here](https://unicode.org/reports/tr15/#Norm_Forms). Both are displayed the same way on screen, and just by looking you can't tell the difference. In the code above, Ä is in NFD, but in the code below, it's in NFC:
```- string s1 = "Ä"; // now Ä is in NFC (just one code point)
- wstring s2 = "Ä"w;
- dstring s3 = "Ä"d;
- writefln("%d", count(s1)); // 1
- writefln("%d", count(s2)); // 1
- writefln("%d", count(s3)); // 1
- ```
- > You could also create the strings using the code points hexadecimal values, such as `string s1 = "\u00c4";` to create Ä in NFC, or `string s1 = "\u0041\u0308";` to create it in NFD, and printing those to see the same Ä in the screen, but different counts/lengths for each case.
- But is `Ä` also considered a single character? Should it be counted as 1, regardless of the normalization form being used?
- And here it comes another concept, as defined by Unicode: a sequence of two or more codepoints that together form a single "unit" is called a [Grapheme Cluster](https://unicode.org/glossary/#grapheme_cluster) (see full definition [here](http://www.unicode.org/reports/tr29/)), also known as "User-perceived character" (aka "_it's one single drawing on the screen, so I count it as 1_").
```- import std.range : walkLength;
- import std.uni: byGrapheme;
- // create strings containing Ä in NFD
- string s1 = "\u0041\u0308";
- wstring s2 = "\u0041\u0308"w;
- dstring s3 = "\u0041\u0308"d;
- writefln("%d", s1.byGrapheme.walkLength); // 1
- writefln("%d", s2.byGrapheme.walkLength); // 1
- writefln("%d", s3.byGrapheme.walkLength); // 1
- ```
- Or, you could normalize the strings to NFC and count the code points:
```- import std.utf: count;
- import std.uni;
- string s1 = "\u0041\u0308";
- wstring s2 = "\u0041\u0308"w;
- dstring s3 = "\u0041\u0308"d;
- writefln("%d", count(normalize!NFC(s1))); // 1
- writefln("%d", count(normalize!NFC(s2))); // 1
- writefln("%d", count(normalize!NFC(s3))); // 1
- ```
- But remind that, depending on the language and characters involved, counting grapheme clusters would not be the same as normalizing, because not all characters have a single-code-point-NFC representation (for those, NFD is the only way to represent them).
- Not to mention, of course, emojis. There's a specific type of grapheme cluster called [Emoji ZWJ Sequence](https://emojipedia.org/emoji-zwj-sequence/), which is a, well, sequence of emojis, joined together by the [Zero Width Joiner character (ZWJ)](https://www.fileformat.info/info/unicode/char/200d/index.htm). One example are the family emojis, such as the family with [dad, mom and 2 daughters](https://emojipedia.org/family-man-woman-girl-girl/), which is actually built with 7 code points:
- - [MAN][1]
- - [ZERO WIDTH JOINER][2]
- - [WOMAN][3]
- - [ZERO WIDTH JOINER][2]
- - [GIRL][4]
- - [ZERO WIDTH JOINER][2]
- - [GIRL][4]
- So, such a string could be created as:
string s1 = "\U0001f468\u200d\U0001f469\u200d\U0001f467\u200d\U0001f467";- But, unfortunately, the current implementation of `byGrapheme` doesn't seem to support Emoji ZWJ Sequences, as for the string above, `s1.byGrapheme.walkLength` returns `4` (each face emoji is counted as a grapheme cluster), and creating a `Grapheme` with it results in an invalid one:
```- auto g = Grapheme("\U0001f468\u200d\U0001f469\u200d\U0001f467\u200d\U0001f467");
- writeln(g.valid); // false
- ```
- In that case, to count this whole thing as `1`, you would have to manually process the string, or rely on some external library (I don't know if there's any, though).
- ---
- ### Conclusion
- Depending on what you mean by "length" (and also the string type and its contents), the way to determine its value can vary.
- Some will say that one or another approach would be the most "obvious" way, but since we started dealing with non-ASCII characters (and ended up with this whole Unicode mess), nothing is obvious anymore, IMO. The old idea of "1 char = 1 byte" is outdated (or is valid only to a very limited subset of chars, if you may), and even the "1 char = 1 code point" might not work for all cases.
- Do you want the number of bytes (in a particular encoding), or code units, or code points, or grapheme clusters? Depending on the situation, all of them will be the same (ex: when dealing with ASCII-only chars in UTF-8 encoding), but change one of those parameters (different encoding, non-ASCII chars, strings in NFD, emoji ZWJ sequences, etc) and things will get messy. In that case, you need to know exactly what you want to count, and might get different results depending on the method used. Unfortunately, there's no silver bullet, and YMMV.
- [1]: http://www.fileformat.info/info/unicode/char/1f468
- [2]: http://www.fileformat.info/info/unicode/char/200d
- [3]: http://www.fileformat.info/info/unicode/char/1f469
- [4]: http://www.fileformat.info/info/unicode/char/1f467