What ways can I get a string's length in D?

3 answers
There are many different ways, and that will depend on the content of the strings, their types, and how you define the terms "character" and "length".
If you're dealing only with ASCII characters, using `length` - as pointed out by the other answers - will be enough. But beyond the ASCII world, things start getting a little more complicated.
Different char and string types
According to the D language documentation, there are three different types of char:
| Type | Description |
|---|---|
| char | unsigned 8 bit (UTF-8 code unit) |
| wchar | unsigned 16 bit (UTF-16 code unit) |
| dchar | unsigned 32 bit (UTF-32 code unit) |
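You can check those sizes directly with the built-in `.sizeof` property:

import std.stdio;

writeln(char.sizeof);  // 1 byte
writeln(wchar.sizeof); // 2 bytes
writeln(dchar.sizeof); // 4 bytes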
And, as stated in the language documentation, a string is just an alias for an immutable array of chars. For each char type above, there's a corresponding string type:
alias string = immutable(char)[]; // UTF-8
alias wstring = immutable(wchar)[]; // UTF-16
alias dstring = immutable(dchar)[]; // UTF-32
That distinction is important, because it affects the way strings are stored internally, and also the different results we might get when measuring their lengths. To understand that, we need to know some things about Unicode and encodings. I won't go into all the details (which would require entire books), but for the sake of this question - how to get a string's length - here's a short summary:
- Unicode assigns a numeric value to each character, called code point
- From 0 to 127, all code point values and their respective characters match the ASCII table
- Unicode covers thousands of characters, which means that many code point values - most of them, actually - will require more than 1 byte to be represented (so 1 byte is not always the same as 1 character)
- There are different ways to convert code point values to/from bytes. Those are called "encodings". UTF-8, UTF-16 and UTF-32 are different encodings, each using their own algorithms to do such conversions
- A Code Unit is "the minimal bit combination that can represent a unit of encoded text", and it varies for each encoding. For example, UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units (each code unit has 2 bytes), and UTF-32 uses 32-bit code units (each code unit has 4 bytes).
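For ASCII-only text, all of those notions coincide, so every string type reports the same length:

import std.stdio;

writefln("%d", "abc".length);  // 3 (UTF-8 code units)
writefln("%d", "abc"w.length); // 3 (UTF-16 code units)
writefln("%d", "abc"d.length); // 3 (UTF-32 code units)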
That said, what happens when we use a string containing a non-ASCII character?
string s1 = "á";
writefln("%d", s1.length); // 2
wstring s2 = "á"w;
writefln("%d", s2.length); // 1
dstring s3 = "á"d;
writefln("%d", s3.length); // 1
Note that `s1.length` is `2`, because that's the number of UTF-8 code units used to represent that character. Using `wstring` or `dstring`, on the other hand, results in `1`, because the character `á` encoded in UTF-16 or UTF-32 requires just one code unit. We can check that by printing the code units directly:
string s1 = "á";
wstring s2 = "á"w;
dstring s3 = "á"d;
foreach(c; s1)
{ // output: c3 a1
writef("%x ", c);
}
foreach(c; s2)
{ // output: e1
writef("%x ", c);
}
foreach(c; s3)
{ // output: e1
writef("%x ", c);
}
Note that for UTF-8, not only is the length different, but so are the resulting bytes. That's because the UTF-8 algorithm does some bit manipulation on the original value, resulting in the bytes 0xC3 and 0xA1, while UTF-16 and UTF-32 don't (they keep the code point value U+00E1).
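To make that bit manipulation concrete, here's a minimal sketch of how the two-byte UTF-8 sequence for U+00E1 is built (this covers only code points up to U+07FF; the full algorithm has more cases):

import std.stdio;

dchar cp = 0x00E1; // á
// two-byte UTF-8 form: 110xxxxx 10xxxxxx
ubyte b1 = cast(ubyte)(0b1100_0000 | (cp >> 6));   // 0xC3
ubyte b2 = cast(ubyte)(0b1000_0000 | (cp & 0x3F)); // 0xA1
writefln("%x %x", b1, b2); // c3 a1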
But isn't `á` a single character? I don't know all the languages of the world, but at least in Portuguese, `a` and `á` are different, as they completely change the meaning of a word. I'm not sure if officially they are considered different letters, or the same letter with different semantics (I'm not a linguistics expert), but we usually count it (and consider it) as a single character. So does it make sense to count it as 2?
But things can get even more complicated, as nowadays there are weirder things such as emoji:
string s1 = "💩";
writefln("%d", s1.length); // 4
wstring s2 = "💩"w;
writefln("%d", s2.length); // 2
dstring s3 = "💩"d;
writefln("%d", s3.length); // 1
The code point for the 💩 emoji is U+1F4A9. In UTF-8, it requires 4 code units to be encoded; in UTF-16, it requires 2 code units (code points above U+FFFF are encoded as a surrogate pair, and each member of this pair is a code unit); and in UTF-32, it requires just 1 code unit (recall that in UTF-32, each code unit uses 4 bytes, which is enough to store 0x1F4A9). That's why the lengths are different for each string type.
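If you're curious, here's a minimal sketch of the surrogate-pair computation for that code point:

import std.stdio;

dchar cp = 0x1F4A9;                             // 💩
uint v = cp - 0x10000;                          // 0xF4A9
wchar high = cast(wchar)(0xD800 + (v >> 10));   // 0xD83D
wchar low  = cast(wchar)(0xDC00 + (v & 0x3FF)); // 0xDCA9
writefln("%04x %04x", high, low);               // d83d dca9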
But shouldn't a single 💩 be counted as just 1? Why 4? Or 2? Or any other value?
Maybe we should count the number of code points, regardless of the string type or the types of characters it contains. To do that, you could use `std.utf.count`:
import std.stdio;
import std.utf: count;
string s1 = "💩";
wstring s2 = "💩"w;
dstring s3 = "💩"d;
writefln("%d", count(s1)); // 1
writefln("%d", count(s2)); // 1
writefln("%d", count(s3)); // 1
But sometimes it's just not enough. Unicode has even weirder things:
import std.stdio;
import std.utf: count;
string s1 = "Ä";
wstring s2 = "Ä"w;
dstring s3 = "Ä"d;
writefln("%d", count(s1)); // 2
writefln("%d", count(s2)); // 2
writefln("%d", count(s3)); // 2
The character `Ä` is being counted as 2 code points. That's because Unicode defines two ways of representing this character:
- as the code point U+00C4 (LATIN CAPITAL LETTER A WITH DIAERESIS)
- as a combination of two code points: U+0041 (LATIN CAPITAL LETTER A) followed by U+0308 (COMBINING DIAERESIS)
The former is called NFC (Normalization Form C/Canonical Composition), and the latter, NFD (Normalization Form D/Canonical Decomposition); both are defined by the Unicode standard. They are displayed the same way on screen, and just by looking you can't tell the difference. In the code above, Ä is in NFD, but in the code below, it's in NFC:
string s1 = "Ä"; // now Ä is in NFC (just one code point)
wstring s2 = "Ä"w;
dstring s3 = "Ä"d;
writefln("%d", count(s1)); // 1
writefln("%d", count(s2)); // 1
writefln("%d", count(s3)); // 1
You could also create the strings using the code points' hexadecimal values, such as `string s1 = "\u00c4";` to create Ä in NFC, or `string s1 = "\u0041\u0308";` to create it in NFD. Printing those shows the same Ä on the screen, but different counts/lengths for each case.
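Since both forms are displayed identically, comparing such strings directly can be surprising; a short sketch using `normalize` from `std.uni`:

import std.stdio;
import std.uni;

string nfd = "\u0041\u0308"; // A + combining diaeresis (NFD)
string nfc = "\u00c4";       // precomposed Ä (NFC)
writeln(nfd == nfc);                // false: different code point sequences
writeln(normalize!NFC(nfd) == nfc); // true: same after normalization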
But is `Ä` also considered a single character? Should it be counted as 1, regardless of the normalization form being used?
And here comes another concept, as defined by Unicode: a sequence of two or more code points that together form a single "unit" is called a Grapheme Cluster (see the full definition in the Unicode standard), also known as a "user-perceived character" (i.e. "it's one single drawing/image/'thing' on the screen, so I'll count it as 1").
To count the number of grapheme clusters, you could use the `byGrapheme` function:
import std.stdio;
import std.range : walkLength;
import std.uni: byGrapheme;
// create strings containing Ä in NFD
string s1 = "\u0041\u0308";
wstring s2 = "\u0041\u0308"w;
dstring s3 = "\u0041\u0308"d;
writefln("%d", s1.byGrapheme.walkLength); // 1
writefln("%d", s2.byGrapheme.walkLength); // 1
writefln("%d", s3.byGrapheme.walkLength); // 1
Or, you could normalize the strings to NFC and count the code points:
import std.stdio;
import std.utf: count;
import std.uni;
string s1 = "\u0041\u0308";
wstring s2 = "\u0041\u0308"w;
dstring s3 = "\u0041\u0308"d;
writefln("%d", count(normalize!NFC(s1))); // 1
writefln("%d", count(normalize!NFC(s2))); // 1
writefln("%d", count(normalize!NFC(s3))); // 1
But bear in mind that, depending on the language and characters involved, counting grapheme clusters is not the same as normalizing and counting code points, because not all characters have a single-code-point NFC representation (for those, NFD is the only way to represent them).
Not to mention, of course, emojis. There's a specific type of grapheme cluster called an Emoji ZWJ Sequence, which is a, well, sequence of emojis, joined together by the Zero Width Joiner character (ZWJ). One example is the family emoji with dad, mom and 2 daughters, which is actually built with 7 code points:

- U+1F468 (MAN)
- U+200D (ZERO WIDTH JOINER)
- U+1F469 (WOMAN)
- U+200D (ZERO WIDTH JOINER)
- U+1F467 (GIRL)
- U+200D (ZERO WIDTH JOINER)
- U+1F467 (GIRL)
So, such a string could be created as:
string s1 = "\U0001f468\u200d\U0001f469\u200d\U0001f467\u200d\U0001f467";
But, unfortunately, the current implementation of `byGrapheme` doesn't seem to support Emoji ZWJ Sequences: for the string above, `s1.byGrapheme.walkLength` returns `4` (each face emoji is counted as a grapheme cluster), and creating a `Grapheme` with it results in an invalid one:
import std.stdio;
import std.uni : Grapheme;

auto g = Grapheme("\U0001f468\u200d\U0001f469\u200d\U0001f467\u200d\U0001f467");
writeln(g.valid); // false
In that case, to count this whole thing as `1`, you would have to process the string manually, or rely on some external library (I don't know if there's any, though).
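Just to illustrate what that manual processing could look like, here's a rough sketch (the helper `countWithZwj` is hypothetical, and it assumes `byGrapheme` keeps a trailing ZWJ attached to the preceding cluster, which the count of 4 above suggests):

import std.stdio;
import std.uni : byGrapheme;

// hypothetical helper: counts grapheme clusters, merging those joined by a ZWJ
size_t countWithZwj(string s)
{
    size_t n = 0;
    bool joinWithNext = false; // did the previous cluster end with U+200D?
    foreach (g; s.byGrapheme)
    {
        if (!joinWithNext)
            n++; // start of a new user-perceived character
        joinWithNext = g.length > 0 && g[g.length - 1] == '\u200D';
    }
    return n;
}

void main()
{
    writeln(countWithZwj("\U0001f468\u200d\U0001f469\u200d\U0001f467\u200d\U0001f467")); // 1
}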
Conclusion
Depending on what you mean by "length" (and also the string type and its contents), the way to determine its value can vary.
Some will say that one or another approach is the most "obvious" way, but once we start dealing with non-ASCII characters (and end up with this whole Unicode mess), nothing is obvious anymore, IMO. The old idea of "1 char = 1 byte" is outdated (or valid only for a very limited subset of chars, if you will), and even the "1 char = 1 code point" approach might not work for all cases.
Do you want the number of bytes (in a particular encoding), or code units, or code points, or grapheme clusters? Depending on the situation, all of them will be the same (e.g. when dealing with ASCII-only chars in UTF-8), but change one of those parameters (different encoding, non-ASCII chars, strings in NFD, emoji ZWJ sequences, etc.) and things can get messy. You need to know exactly what you want to count, and you might get different results depending on the method used. Unfortunately, there's no silver bullet, and YMMV.
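To wrap up, here's a single snippet contrasting the different counts discussed in this answer:

import std.stdio;
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : count;

void main()
{
    string s = "💩"; // stored as UTF-8
    writeln(s.length);                // 4: UTF-8 code units (bytes)
    writeln(count(s));                // 1: code points
    writeln(s.byGrapheme.walkLength); // 1: grapheme clusters
}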
Strings in D have a `.length` property, which gives the number of elements in the string (for ASCII text, that's the same as the number of characters).
Note: spaces are counted too.
Here is an example:
import std.stdio;

void main() {
    string s = "Hello, World!";
    write(s.length);
}
Learn more about D in the documentation.
Strings in D can be declared either as `char[]` or `string`. Both have a `.length` property, which can be accessed on the variable after assignment.
import std.stdio;

void main(string[] args) {
    string greeting1 = "Good";
    writefln("Length of string greeting1 is %d", greeting1.length);

    char[] greeting2 = "morning".dup;
    writefln("Length of string greeting2 is %d", greeting2.length);
}
Code edited and imported from TutorialsPoint.