
Comments on How to get string length in D?

Parent

How to get string length in D?

+4
−1

I'm new to D and am planning to use it for golfing. I want to make a ROT13 converter and I want to determine the length of an inputted string.

Is there a function for this? If not, what ways can I get a string's length in D?


1 comment thread

https://dlang.org/phobos/std_string.html (1 comment)
Post
+8
−0

what ways can I get a string's length in D?

There are many different ways, and the right one depends on the content of the strings, their types, and how you define the terms "character" and "length".

If you're dealing only with ASCII characters, using the length property, as pointed out by the other answers, will be enough. But beyond the ASCII world, things start getting a bit more complicated.
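For the ASCII-only case, a minimal sketch looks like this (the sample text is just an illustration, not something from the question):

```d
import std.stdio : writeln;

void main()
{
    // ASCII-only text: every character is exactly one UTF-8 code unit,
    // so .length matches the number of characters you see
    string s = "Hello, World!";
    writeln(s.length); // 13
}
```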


Different char and string types

According to the D language documentation, there are three different types of char:

Type    Description
char    unsigned 8 bit (UTF-8 code unit)
wchar   unsigned 16 bit (UTF-16 code unit)
dchar   unsigned 32 bit (UTF-32 code unit)

And, as stated here and here, a string is just an alias for an immutable array of chars. For each char type above, there's a respective string type:

alias string = immutable(char)[]; // UTF-8
alias wstring = immutable(wchar)[]; // UTF-16
alias dstring = immutable(dchar)[]; // UTF-32

That distinction is important, because it affects the way strings are stored internally, and also the results we get when asking for their lengths. To understand that, we need to know some things about Unicode and encodings. I'm not going into all the details (that would require entire books), but for the sake of this question, how to get a string's length, here's a short summary:

  • Unicode assigns a numeric value to each character, called a code point
  • From 0 to 127, all code point values and their respective characters match the ASCII table
  • Unicode covers well over a hundred thousand characters, which means that most code point values require more than 1 byte to be represented (so 1 byte is not always the same as 1 character)
  • There are different ways to convert code point values to/from bytes. Those are called "encodings". UTF-8, UTF-16 and UTF-32 are different encodings, each using their own algorithms to do such conversions
  • A Code Unit is "the minimal bit combination that can represent a unit of encoded text", and it varies for each encoding. For example, UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units (each code unit has 2 bytes), and UTF-32 uses 32-bit code units (each code unit has 4 bytes).
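To make the code unit sizes concrete, here is a small sketch. The helper name byteSize is made up for this example; it multiplies the number of code units by the size of one code unit:

```d
import std.stdio : writefln;

// Hypothetical helper: storage size in bytes of any of the three string types,
// computed as (number of code units) * (size of one code unit)
size_t byteSize(S)(S s)
{
    return s.length * typeof(s[0]).sizeof;
}

void main()
{
    // ASCII-only text: one code unit per character in every encoding
    string  s = "abc";
    wstring w = "abc"w;
    dstring d = "abc"d;

    writefln("%d code units, %d bytes", s.length, byteSize(s)); // 3 code units, 3 bytes
    writefln("%d code units, %d bytes", w.length, byteSize(w)); // 3 code units, 6 bytes
    writefln("%d code units, %d bytes", d.length, byteSize(d)); // 3 code units, 12 bytes
}
```

Note that length always reports code units, never bytes, which is why the three lengths agree here while the byte sizes do not.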

That said, what happens when we use a string containing a non-ASCII character?

string s1 = "á";
writefln("%d", s1.length); // 2
wstring s2 = "á"w;
writefln("%d", s2.length); // 1
dstring s3 = "á"d;
writefln("%d", s3.length); // 1

Note that s1.length is 2, because that's the number of UTF-8 code units used to represent that character. Using wstring or dstring, on the other hand, results in 1, because á requires just one code unit in both UTF-16 and UTF-32. We can check that by printing the code units directly:

string s1 = "á";
wstring s2 = "á"w;
dstring s3 = "á"d;

foreach(c; s1)
{ // output: c3 a1
    writef("%x ", c);
}
foreach(c; s2)
{ // output: e1
    writef("%x ", c);
}
foreach(c; s3)
{ // output: e1
    writef("%x ", c);
}

Note that for UTF-8, not only is the length different, but so are the resulting bytes. That's because the UTF-8 algorithm does some bit manipulation on the original value, resulting in bytes 0xC3 and 0xA1, while UTF-16 and UTF-32 store the code point value (U+00E1) unchanged in a single code unit.
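As an illustration of that bit manipulation, here is a sketch that handles only the two-byte range of UTF-8 (U+0080 to U+07FF). The function name encodeTwoBytes is made up for this example; in real code you would rely on std.utf instead:

```d
import std.stdio : writefln;

// Hypothetical helper: encodes a code point in U+0080..U+07FF as two UTF-8 code units.
// Bit layout: 110xxxxx 10xxxxxx, where the x bits hold the code point value.
ubyte[2] encodeTwoBytes(dchar cp)
{
    assert(cp >= 0x80 && cp <= 0x7FF, "only the two-byte range is handled here");
    ubyte first  = cast(ubyte)(0xC0 | (cp >> 6));   // marker 110 + top 5 bits
    ubyte second = cast(ubyte)(0x80 | (cp & 0x3F)); // marker 10 + low 6 bits
    return [first, second];
}

void main()
{
    auto bytes = encodeTwoBytes('\u00E1'); // á
    writefln("%x %x", bytes[0], bytes[1]); // c3 a1
}
```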

But isn't á a single character? I don't know all the languages of the world, but at least in Portuguese, a and á are different, as they can completely change the meaning of a word. I'm not sure if officially they are considered different letters, or the same letter with different semantics (I'm not a linguistics expert), but we usually count it (and consider it) as a single character. So does it make sense to count it as 2?


But things can get even more complicated, as nowadays there are weirder things such as emoji:

string s1 = "💩";
writefln("%d", s1.length); // 4
wstring s2 = "💩"w;
writefln("%d", s2.length); // 2
dstring s3 = "💩"d;
writefln("%d", s3.length); // 1

The code point for the 💩 emoji is U+1F4A9. In UTF-8, it requires 4 code units to be encoded; in UTF-16, it requires 2 code units (code points above U+FFFF are encoded as a surrogate pair, and each member of the pair is a code unit); and in UTF-32, it requires just 1 code unit (recall that in UTF-32 each code unit uses 4 bytes, which is enough to store 0x1F4A9). That's why the lengths are different for each string type.
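The surrogate pair arithmetic can be sketched as follows. The function name toSurrogatePair is made up for this example and is not part of Phobos:

```d
import std.stdio : writefln;

// Hypothetical helper: splits a code point above U+FFFF into a UTF-16 surrogate pair
wchar[2] toSurrogatePair(dchar cp)
{
    assert(cp > 0xFFFF && cp <= 0x10FFFF, "only code points above U+FFFF use surrogate pairs");
    uint v = cast(uint) cp - 0x10000;               // 20-bit value
    wchar high = cast(wchar)(0xD800 + (v >> 10));   // high surrogate: top 10 bits
    wchar low  = cast(wchar)(0xDC00 + (v & 0x3FF)); // low surrogate: bottom 10 bits
    return [high, low];
}

void main()
{
    auto pair = toSurrogatePair('\U0001F4A9'); // 💩
    writefln("%x %x", cast(uint) pair[0], cast(uint) pair[1]); // d83d dca9

    // The same two code units are what the wstring literal stores
    wstring w = "\U0001F4A9"w;
    writefln("%x %x", cast(uint) w[0], cast(uint) w[1]); // d83d dca9
}
```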

But shouldn't a single 💩 be counted as just 1? Why 4? Or 2? Or any other value?

Maybe we should count the number of code points, regardless of the string type or the types of characters it contains. To do that, you could use std.utf.count:

import std.utf: count;


string s1 = "💩";
wstring s2 = "💩"w;
dstring s3 = "💩"d;
writefln("%d", count(s1)); // 1
writefln("%d", count(s2)); // 1
writefln("%d", count(s3)); // 1

But sometimes it's just not enough. Unicode has even weirder things:

import std.utf: count;

string s1 = "Ä";
wstring s2 = "Ä"w;
dstring s3 = "Ä"d;
writefln("%d", count(s1)); // 2
writefln("%d", count(s2)); // 2
writefln("%d", count(s3)); // 2

The character is being counted as 2 code points. That's because Unicode defines two ways of representing this character:

  • as a single code point: U+00C4 (LATIN CAPITAL LETTER A WITH DIAERESIS)
  • as two code points: U+0041 (LATIN CAPITAL LETTER A) followed by U+0308 (COMBINING DIAERESIS)

The former is called NFC (Normalization Form C/Canonical Composition), and the latter, NFD (Normalization Form D/Canonical Decomposition), and they are defined here. Both are displayed the same way on screen, and just by looking you can't tell the difference. In the code above, Ä is in NFD, but in the code below, it's in NFC:

string s1 = "Ä"; // now Ä is in NFC (just one code point)
wstring s2 = "Ä"w;
dstring s3 = "Ä"d;
writefln("%d", count(s1)); // 1
writefln("%d", count(s2)); // 1
writefln("%d", count(s3)); // 1

You could also create the strings using the code points' hexadecimal values, such as string s1 = "\u00c4"; to create Ä in NFC, or string s1 = "\u0041\u0308"; to create it in NFD. Printing them shows the same Ä on the screen, but the counts/lengths are different in each case.
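A short sketch of that comparison, using the two escape sequences mentioned above (the variable names are only for illustration):

```d
import std.stdio : writeln, writefln;
import std.utf : count;

void main()
{
    string nfc = "\u00c4";       // Ä as a single precomposed code point
    string nfd = "\u0041\u0308"; // A followed by COMBINING DIAERESIS

    writeln(nfc, " ", nfd); // both should be displayed as Ä

    // UTF-8 code units, then code points
    writefln("%d %d", nfc.length, count(nfc)); // 2 1
    writefln("%d %d", nfd.length, count(nfd)); // 3 2

    // The strings hold different code units, so they don't compare equal
    writeln(nfc == nfd); // false
}
```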

But is Ä also considered a single character? Should it be counted as 1, regardless of the normalization form being used?

And here comes another concept, as defined by Unicode: a sequence of one or more code points that together form a single "unit" is called a Grapheme Cluster (see full definition here), also known as a "user-perceived character" (aka "It's one single drawing/image/'thing' on the screen, so I'll count it as 1").

To count the number of grapheme clusters, you could use the byGrapheme function:

import std.range : walkLength;
import std.uni: byGrapheme;


// create strings containing Ä in NFD
string s1 = "\u0041\u0308";
wstring s2 = "\u0041\u0308"w;
dstring s3 = "\u0041\u0308"d;
writefln("%d", s1.byGrapheme.walkLength); // 1
writefln("%d", s2.byGrapheme.walkLength); // 1
writefln("%d", s3.byGrapheme.walkLength); // 1

Or, you could normalize the strings to NFC and count the code points:

import std.utf: count;
import std.uni;


string s1 = "\u0041\u0308";
wstring s2 = "\u0041\u0308"w;
dstring s3 = "\u0041\u0308"d;
writefln("%d", count(normalize!NFC(s1))); // 1
writefln("%d", count(normalize!NFC(s2))); // 1
writefln("%d", count(normalize!NFC(s3))); // 1

But keep in mind that, depending on the language and characters involved, counting grapheme clusters is not the same as normalizing and counting code points, because not all characters have a single-code-point NFC representation (for those, a sequence of code points is the only way to represent them).
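Here is a sketch of such a case. It assumes that q followed by U+0303 (COMBINING TILDE) has no precomposed form in Unicode, so NFC leaves it as two code points:

```d
import std.range : walkLength;
import std.stdio : writefln;
import std.uni;
import std.utf : count;

void main()
{
    // Assumption: there is no single code point for "q with tilde",
    // so normalizing to NFC cannot merge these two code points
    string s = "q\u0303";

    writefln("%d", count(normalize!NFC(s))); // 2
    writefln("%d", s.byGrapheme.walkLength); // 1
}
```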

Not to mention, of course, emojis. There's a specific type of grapheme cluster called Emoji ZWJ Sequence, which is a, well, sequence of emojis joined together by the Zero Width Joiner character (ZWJ). One example is the family emoji, such as the family with dad, mom and 2 daughters, which is actually built with 7 code points:

  • U+1F468 (man)
  • U+200D (ZWJ)
  • U+1F469 (woman)
  • U+200D (ZWJ)
  • U+1F467 (girl)
  • U+200D (ZWJ)
  • U+1F467 (girl)

So, such a string could be created as:

string s1 = "\U0001f468\u200d\U0001f469\u200d\U0001f467\u200d\U0001f467";

Unfortunately, the current implementation of byGrapheme doesn't seem to support Emoji ZWJ Sequences: for the string above, s1.byGrapheme.walkLength returns 4 (each face emoji is counted as a grapheme cluster), and creating a Grapheme with it results in an invalid one:

auto g = Grapheme("\U0001f468\u200d\U0001f469\u200d\U0001f467\u200d\U0001f467");
writeln(g.valid); // false

In that case, to count this whole thing as 1, you would have to manually process the string, or rely on some external library (I don't know if there's any, though).


Conclusion

Depending on what you mean by "length" (and also the string type and its contents), the way to determine its value can vary.

Some will say that one approach or another is the most "obvious" way, but since we started dealing with non-ASCII characters (and ended up with this whole Unicode mess), nothing is obvious anymore, IMO. The old idea of "1 char = 1 byte" is outdated (or is valid only for a very limited subset of chars, if you will), and even the "1 char = 1 code point" approach might not work for all cases.

Do you want the number of bytes (in a particular encoding), or code units, or code points, or grapheme clusters? Depending on the situation, all of them will be the same (e.g. when dealing with ASCII-only chars in UTF-8 encoding), but change one of those parameters (different encoding, non-ASCII chars, strings in NFD, emoji ZWJ sequences, etc.) and things can get messy. In that case, you need to know exactly what you want to count, and you might get different results depending on the method used. Unfortunately, there's no silver bullet, and YMMV.
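To see the four measures side by side, here is a sketch for UTF-8 strings. The Lengths struct and the measure function are made up for this example; only length, std.utf.count and byGrapheme come from the language and Phobos:

```d
import std.range : walkLength;
import std.stdio : writefln;
import std.uni : byGrapheme;
import std.utf : count;

// Hypothetical container for the four "lengths" of a UTF-8 string
struct Lengths
{
    size_t bytes;      // storage size in UTF-8
    size_t codeUnits;  // what .length reports
    size_t codePoints; // what std.utf.count reports
    size_t graphemes;  // grapheme clusters, as std.uni segments them
}

Lengths measure(string s)
{
    return Lengths(s.length * char.sizeof, s.length, count(s), s.byGrapheme.walkLength);
}

void show(string s)
{
    auto m = measure(s);
    writefln("%d %d %d %d", m.bytes, m.codeUnits, m.codePoints, m.graphemes);
}

void main()
{
    show("abc");          // 3 3 3 3
    show("\u0041\u0308"); // 3 3 2 1 (Ä in NFD)
}
```

For wstring or dstring the byte count would differ from the code unit count, and emoji ZWJ sequences may still be split into several grapheme clusters, as described above.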


2 comment threads

It's really hard to remember that "this whole Unicode mess" (a description I wholeheartedly endorse) ... (2 comments)
There needs to be a way to celebrate excellent answers aside from a mere upvote! (3 comments)
elgonzo‭ wrote over 3 years ago

Tipping my hat to you, this is one mighty answer about the weird, dark corners of Unicode encodings...

hkotsubo‭ wrote over 3 years ago

elgonzo‭ Thanks! Regarding "celebrate excellent answers", perhaps something similar to bounties? I don't know, there are some discussions about not having rep at all, so maybe we'll need another way to do it...

elgonzo‭ wrote over 3 years ago

Maybe not bounties, but perhaps some reaction emojis that don't have an effect on the votes/score nor users reputation/privileges. I am thinking along the lines of those reaction emojis as available in Github's issue tracker or on Steam, for example. That said, i do not want to bring this forward as a feature suggestion as of now, as i personally believe the QPixel/Codidact developer time is better spent fixing bugs, implementing important features and polishing the existing feature set. :-)