What ways can I get a string's length in D?

3 answers
There are many different ways, and that will depend on the content of the strings, their types, and how you define the terms "character" and "length".
If you're dealing only with ASCII characters, using `length` - as pointed out by the other answers - will be enough. But beyond the ASCII world, things start getting a little more complicated.
Different char and string types
According to the D language documentation, there are three different types of char:
| Type | Description |
|---|---|
| char | unsigned 8 bit (UTF-8 code unit) |
| wchar | unsigned 16 bit (UTF-16 code unit) |
| dchar | unsigned 32 bit (UTF-32 code unit) |
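You can check those sizes directly with the built-in `.sizeof` property:

import std.stdio;

writeln(char.sizeof);  // 1 byte
writeln(wchar.sizeof); // 2 bytes
writeln(dchar.sizeof); // 4 bytes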
And, as stated in the language documentation, a string is just an alias for an immutable array of chars. For each char type above, there's a corresponding string type:
alias string = immutable(char)[]; // UTF-8
alias wstring = immutable(wchar)[]; // UTF-16
alias dstring = immutable(dchar)[]; // UTF-32
That distinction is important, because it affects the way strings are stored internally, and also the different results we might get when measuring their lengths. To understand that, we need to know some things about Unicode and encodings. I won't go into all the details (which would require entire books), but for the sake of this question - how to get a string's length - here's a short summary:
- Unicode assigns a numeric value to each character, called code point
- From 0 to 127, all code point values and their respective characters match the ASCII table
- Unicode covers thousands of characters, which means that many code point values - most of them, actually - will require more than 1 byte to be represented (so 1 byte is not always the same as 1 character)
- There are different ways to convert code point values to/from bytes. Those are called "encodings". UTF-8, UTF-16 and UTF-32 are different encodings, each using their own algorithms to do such conversions
- A Code Unit is "the minimal bit combination that can represent a unit of encoded text", and it varies for each encoding. For example, UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units (each code unit has 2 bytes), and UTF-32 uses 32-bit code units (each code unit has 4 bytes).
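For ASCII-only text, all of those notions coincide, so every string type reports the same length:

import std.stdio;

writefln("%d", "abc".length);  // 3 (UTF-8 code units)
writefln("%d", "abc"w.length); // 3 (UTF-16 code units)
writefln("%d", "abc"d.length); // 3 (UTF-32 code units)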
That said, what happens when we use a string containing a non-ASCII character?
string s1 = "á";
writefln("%d", s1.length); // 2
wstring s2 = "á"w;
writefln("%d", s2.length); // 1
dstring s3 = "á"d;
writefln("%d", s3.length); // 1
Note that `s1.length` is `2`, because that's the number of UTF-8 code units used to represent that character. Using `wstring` or `dstring`, on the other hand, results in `1`, because the character `á` encoded in UTF-16 or UTF-32 requires just one code unit. We can check that by printing the code units directly:
string s1 = "á";
wstring s2 = "á"w;
dstring s3 = "á"d;
foreach(c; s1)
{ // output: c3 a1
writef("%x ", c);
}
foreach(c; s2)
{ // output: e1
writef("%x ", c);
}
foreach(c; s3)
{ // output: e1
writef("%x ", c);
}
Note that for UTF-8, not only is the length different, but so are the resulting bytes. That's because the UTF-8 algorithm does some bit manipulation on the original value, resulting in the bytes 0xC3 and 0xA1, while UTF-16 and UTF-32 don't (they keep the code point value U+00E1).
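To make that bit manipulation concrete, here's a minimal sketch of how the two-byte UTF-8 sequence for U+00E1 is built (this covers only code points up to U+07FF; the full algorithm has more cases):

import std.stdio;

dchar cp = 0x00E1; // á
// two-byte UTF-8 form: 110xxxxx 10xxxxxx
ubyte b1 = cast(ubyte)(0b1100_0000 | (cp >> 6));   // 0xC3
ubyte b2 = cast(ubyte)(0b1000_0000 | (cp & 0x3F)); // 0xA1
writefln("%x %x", b1, b2); // c3 a1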
But isn't `á` a single character? I don't know all the languages of the world, but at least in Portuguese, `a` and `á` are different, as they completely change the meaning of a word. I'm not sure if officially they are considered different letters, or the same letter with different semantics (I'm not a linguistics expert), but we usually count it (and consider it) as a single character. So does it make sense to count it as 2?
But things can get even more complicated, as nowadays there are weirder things such as emoji:
string s1 = "💩";
writefln("%d", s1.length); // 4
wstring s2 = "💩"w;
writefln("%d", s2.length); // 2
dstring s3 = "💩"d;
writefln("%d", s3.length); // 1
The code point for the 💩 emoji is U+1F4A9. In UTF-8, it requires 4 code units to be encoded; in UTF-16, it requires 2 code units (code points above U+FFFF are encoded as a surrogate pair, and each member of this pair is a code unit); and in UTF-32, it requires just 1 code unit (recall that in UTF-32, each code unit uses 4 bytes, which is enough to store 0x1F4A9). That's why the lengths are different for each string type.
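If you're curious, here's a minimal sketch of the surrogate-pair computation for that code point:

import std.stdio;

dchar cp = 0x1F4A9;                             // 💩
uint v = cp - 0x10000;                          // 0xF4A9
wchar high = cast(wchar)(0xD800 + (v >> 10));   // 0xD83D
wchar low  = cast(wchar)(0xDC00 + (v & 0x3FF)); // 0xDCA9
writefln("%04x %04x", high, low);               // d83d dca9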
But shouldn't a single 💩 be counted as just 1? Why 4? Or 2? Or any other value?
Maybe we should count the number of code points, regardless of the string type or the types of characters it contains. To do that, you could use `std.utf.count`:
import std.stdio;
import std.utf: count;
string s1 = "💩";
wstring s2 = "💩"w;
dstring s3 = "💩"d;
writefln("%d", count(s1)); // 1
writefln("%d", count(s2)); // 1
writefln("%d", count(s3)); // 1
But sometimes it's just not enough. Unicode has even weirder things:
import std.stdio;
import std.utf: count;
string s1 = "Ä";
wstring s2 = "Ä"w;
dstring s3 = "Ä"d;
writefln("%d", count(s1)); // 2
writefln("%d", count(s2)); // 2
writefln("%d", count(s3)); // 2
The character `Ä` is being counted as 2 code points. That's because Unicode defines two ways of representing this character:
- as the code point U+00C4 (LATIN CAPITAL LETTER A WITH DIAERESIS)
- as a combination of two code points: U+0041 (LATIN CAPITAL LETTER A) followed by U+0308 (COMBINING DIAERESIS)
The former is called NFC (Normalization Form C/Canonical Composition), and the latter, NFD (Normalization Form D/Canonical Decomposition); both are defined by the Unicode standard. They are displayed the same way on screen, and just by looking you can't tell the difference. In the code above, Ä is in NFD, but in the code below, it's in NFC:
string s1 = "Ä"; // now Ä is in NFC (just one code point)
wstring s2 = "Ä"w;
dstring s3 = "Ä"d;
writefln("%d", count(s1)); // 1
writefln("%d", count(s2)); // 1
writefln("%d", count(s3)); // 1
You could also create the strings using the code points' hexadecimal values, such as `string s1 = "\u00c4";` to create Ä in NFC, or `string s1 = "\u0041\u0308";` to create it in NFD. Printing those shows the same Ä on the screen, but different counts/lengths for each case.
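Since both forms are displayed identically, comparing such strings directly can be surprising; a short sketch using `normalize` from `std.uni`:

import std.stdio;
import std.uni;

string nfd = "\u0041\u0308"; // A + combining diaeresis (NFD)
string nfc = "\u00c4";       // precomposed Ä (NFC)
writeln(nfd == nfc);                // false: different code point sequences
writeln(normalize!NFC(nfd) == nfc); // true: same after normalization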
But is `Ä` also considered a single character? Should it be counted as 1, regardless of the normalization form being used?
And here comes another concept, as defined by Unicode: a sequence of two or more code points that together form a single "unit" is called a Grapheme Cluster (see the full definition in the Unicode standard), also known as a "user-perceived character" (i.e. "it's one single drawing/image/'thing' on the screen, so I'll count it as 1").
To count the number of grapheme clusters, you could use the `byGrapheme` function:
import std.stdio;
import std.range : walkLength;
import std.uni: byGrapheme;
// create strings containing Ä in NFD
string s1 = "\u0041\u0308";
wstring s2 = "\u0041\u0308"w;
dstring s3 = "\u0041\u0308"d;
writefln("%d", s1.byGrapheme.walkLength); // 1
writefln("%d", s2.byGrapheme.walkLength); // 1
writefln("%d", s3.byGrapheme.walkLength); // 1
Or, you could normalize the strings to NFC and count the code points:
import std.stdio;
import std.utf: count;
import std.uni;
string s1 = "\u0041\u0308";
wstring s2 = "\u0041\u0308"w;
dstring s3 = "\u0041\u0308"d;
writefln("%d", count(normalize!NFC(s1))); // 1
writefln("%d", count(normalize!NFC(s2))); // 1
writefln("%d", count(normalize!NFC(s3))); // 1
But bear in mind that, depending on the language and characters involved, counting grapheme clusters is not the same as normalizing and counting code points, because not all characters have a single-code-point NFC representation (for those, NFD is the only way to represent them).
Not to mention, of course, emojis. There's a specific type of grapheme cluster called an Emoji ZWJ Sequence, which is a, well, sequence of emojis, joined together by the Zero Width Joiner character (ZWJ). One example is the family emoji with dad, mom and 2 daughters, which is actually built with 7 code points:

- U+1F468 (MAN)
- U+200D (ZERO WIDTH JOINER)
- U+1F469 (WOMAN)
- U+200D (ZERO WIDTH JOINER)
- U+1F467 (GIRL)
- U+200D (ZERO WIDTH JOINER)
- U+1F467 (GIRL)
So, such a string could be created as:
string s1 = "\U0001f468\u200d\U0001f469\u200d\U0001f467\u200d\U0001f467";
But, unfortunately, the current implementation of `byGrapheme` doesn't seem to support Emoji ZWJ Sequences: for the string above, `s1.byGrapheme.walkLength` returns `4` (each face emoji is counted as a grapheme cluster), and creating a `Grapheme` with it results in an invalid one:
import std.stdio;
import std.uni : Grapheme;

auto g = Grapheme("\U0001f468\u200d\U0001f469\u200d\U0001f467\u200d\U0001f467");
writeln(g.valid); // false
In that case, to count this whole thing as `1`, you would have to process the string manually, or rely on some external library (I don't know if there's any, though).
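Just to illustrate what that manual processing could look like, here's a rough sketch (the helper `countWithZwj` is hypothetical, and it assumes `byGrapheme` keeps a trailing ZWJ attached to the preceding cluster, which the count of 4 above suggests):

import std.stdio;
import std.uni : byGrapheme;

// hypothetical helper: counts grapheme clusters, merging those joined by a ZWJ
size_t countWithZwj(string s)
{
    size_t n = 0;
    bool joinWithNext = false; // did the previous cluster end with U+200D?
    foreach (g; s.byGrapheme)
    {
        if (!joinWithNext)
            n++; // start of a new user-perceived character
        joinWithNext = g.length > 0 && g[g.length - 1] == '\u200D';
    }
    return n;
}

void main()
{
    writeln(countWithZwj("\U0001f468\u200d\U0001f469\u200d\U0001f467\u200d\U0001f467")); // 1
}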
Conclusion
Depending on what you mean by "length" (and also the string type and its contents), the way to determine its value can vary.
Some will say that one or another approach is the most "obvious" way, but once we start dealing with non-ASCII characters (and end up with this whole Unicode mess), nothing is obvious anymore, IMO. The old idea of "1 char = 1 byte" is outdated (or valid only for a very limited subset of chars, if you will), and even the "1 char = 1 code point" approach might not work for all cases.
Do you want the number of bytes (in a particular encoding), or code units, or code points, or grapheme clusters? Depending on the situation, all of them will be the same (e.g. when dealing with ASCII-only chars in UTF-8), but change one of those parameters (different encoding, non-ASCII chars, strings in NFD, emoji ZWJ sequences, etc.) and things can get messy. You need to know exactly what you want to count, and you might get different results depending on the method used. Unfortunately, there's no silver bullet, and YMMV.
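To wrap up, here's a single snippet contrasting the different counts discussed in this answer:

import std.stdio;
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : count;

void main()
{
    string s = "💩"; // stored as UTF-8
    writeln(s.length);                // 4: UTF-8 code units (bytes)
    writeln(count(s));                // 1: code points
    writeln(s.byGrapheme.walkLength); // 1: grapheme clusters
}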
Strings in D have a `.length` property, which gives the number of elements in the string (for ASCII text, that's the same as the number of characters).
Note: spaces are counted too.
Here is an example:
import std.stdio;

void main() {
    string s = "Hello, World!";
    write(s.length);
}
Learn more about D in the documentation.
Strings in D can be declared either as `char[]` or `string`. Both have a `.length` property, which can be accessed on the variable after assignment.
import std.stdio;

void main(string[] args) {
    string greeting1 = "Good";
    writefln("Length of string greeting1 is %d", greeting1.length);

    char[] greeting2 = "morning".dup;
    writefln("Length of string greeting2 is %d", greeting2.length);
}
Code edited and imported from TutorialsPoint.