Communities

Writing
Writing
Codidact Meta
Codidact Meta
The Great Outdoors
The Great Outdoors
Photography & Video
Photography & Video
Scientific Speculation
Scientific Speculation
Cooking
Cooking
Electrical Engineering
Electrical Engineering
Judaism
Judaism
Languages & Linguistics
Languages & Linguistics
Software Development
Software Development
Mathematics
Mathematics
Christianity
Christianity
Code Golf
Code Golf
Music
Music
Physics
Physics
Linux Systems
Linux Systems
Power Users
Power Users
Tabletop RPGs
Tabletop RPGs
Community Proposals
Community Proposals
tag:snake search within a tag
answers:0 unanswered questions
user:xxxx search by author id
score:0.5 posts with 0.5+ score
"snake oil" exact phrase
votes:4 posts with 4+ votes
created:<1w created < 1 week ago
post_type:xxxx type of post
Search help
Notifications
Mark all as read See all your notifications »
Q&A

Welcome to Software Development on Codidact!

Will you help us build our independent community of developers helping developers? We're small and trying to grow. We welcome questions about all aspects of software development, from design to code to QA and more. Got questions? Got answers? Got code you'd like someone to review? Please join us.

Comments on How does Zalgo Text work, and how can I prevent my application from accepting it?

Parent

How does Zalgo Text work, and how can I prevent my application from accepting it?

+14
−0

A Zalgo Text is something like this:



T̃͟͏̧̟͓̯̘͓͙͔o̤̫͋ͯͫ̂ ̥͍̫̻͚̦͖͇̌ͪ̇ͤ̑̐͋̾̕i̢͖̩͙͐͑ͬ̄̿̍̚ͅn̵̢̼̙̳̒̄ͥ̋̐v̡̟̗̹̻̜͕̲ͣ̐ͤͤ͒́oͫ͂̆͑ͩ҉͇̰͚̹̠̫͔̗k̷̭̬̭͙̹̺̯ͩ̌̾̒̋̓ͤ͛͘͠e̥͙̓̄̕ ̵̫͈ͪţ̱̺̺̑̿̉̌͛̂̇h͙̣̬̓̂͞ę̡̲̟͎͉̟͛̓̉̆̉͘ ͍̯̱͎̬͍ͬ̒ͣͩ͟͡ḥ̗͖̝̮̗̼ͮ̋̉̃͐̿ͪͅi̞͉̯͖̞͉̙ͬͦ̄͋̈̂ͥ̊́̕v̶̝̼̫͔̬̯̯ͯ͑̈͠e̪͓͕̦̪̗̠ͯ͛͌̀̉͘ͅ-̍̉ͦ̈́͌͏̸͉͍͖̥͓̭̗̖mͣͣͪ̇͂͏̳̤̺͓̜̪͙͕i͋̏̎ͤ͆́͏̛̰͉̮̬͙̩̩ͅn̴̴̞̗ͩ̓̉̋̕d̝͇̬̣́̃ͦ̀͡ ̖̩̞̯̇̍́̆ͮͤ̄ͬ͂͜͡r̷͎͙͉̀͒͐͛́è̫̹͙̔ͭͦ͑ͩͮͯṗ̶̸͍̺̘͈́̄ͫ̅ͣ̇̃̚r̨͚͎̙̻̜ͬͬ̆͆e̩̞̹͈̱͕ͮ̈́̒̑͛̊̌͛͘ŝ̱̹̠̂ͭͦ̅̅͛̚͜e̝̳̞̊̍̐ñ͔̟̳̱͔ͣ̅̄ͮt̴̫̻̮̤̿̆̂̊͘i̷̫ͤͨ̏ͤͧͅͅn̛̜̠͋͐͆̆͑̎̍͑̚g̶̜ͧ́ͭͯ͐ͬ͛̇͜ ̳̩͉̣̤̜̰̠̌ͤ̆̚͟c̷̝̥͉͓͕͓͇̣͉̒̈̇ͧ͆̌ͣͣ͟ḩͥ̅ͥ̇ͩ͗҉̳͎̱ą̍̀͆͑̈́͂҉͍̤̘͔̹̭̱͢o̥̮̘ͨ̎̄̎́͢s̯̮̖͒.̨͋̈́͑̈́ͪ̚҉̻͓͔̤̯̲̤̺̀ ̔ͨ̋̾̒̋̚̚͏͓̰͙̘͈I̧̗͇ͮ͠n̠̬͗ͮ̿ͭ͌̉̈ͦv̦͚͇̠̹̰̭ͤo̭̞ͫ̃ͨ̏ͥ̒́̚k̯̱͛ͫ͆͂̇͡͝i̗̪͎͈͕͚̥̫ͥ͂̐n̖̤͔̟̖͇̠͋̐̀ͅͅg̛͑ͪ͏͇̪͓̘ ̎̒ͯͥͭ͂͏̠͇̮̪t̡͍̫̘̯̼̣̋ͤ̒h̨͇͕̺͍͈̠͖̤̱ͮ̃e͖̝̣̽ͬ͊̀ ͍ͫ̅̆͊́͌͡fͫ̾͏̭͎̰̣̱̺̯̥é̻̪͖͙̮͓̊̌̓ͩ̑̆ē͙̤̪͈͉̦̭͛̃ͩͧͭ̏͒̕͢ľ̡҉̪͙̻̣̪̼̲i̜ͯ̄̐͠͝n̸̨̩̰̺̱̗̍̉̄̉̆͐̎g̹͖͈ͥ͗̍͝ͅ ̙̹͉̰̪̠͍͖͔ͧ̾̈́ͫ̔͋̈̍o̵̤̥͍ͬ̐̀̿̇ͯ̅f̵̜̟͚́͛͜͠ ̺̭̆̑ͭc̶̢̫̭͓̭̐͗h͍͉̞̩͕ͮ̉́͛͑ͤͯ͒͡ä̜̱͙͎̯̟ͮ̂͑̄͊̍̑͜oͫ̋̿͌͢͏̟̲̗s̵̮̟̙̻̗̭̲͂̍͑͑͂ͤ́̚ͅ.̵̨͉̟͓ͦͯͬͤ͌ͧ͑̾ ̝̥̰͋ͬ͂̋͛͑̚W̶̹̹̣̟͕ͪͤ̽̀i̳̪͉͇͇̟̲̇̃́ͧ̾ͭ̿̆ͯ͡t̬̪̬̑̿͊͊́̚h̋͑̈̇͋҉̴͖̜ͅͅo̴̠̭̯̫̽ͨṳ̡̟͕̝̦͓͖̪̇́́͢t̬̙̖̝͈͔̻͛͊ ̵̛̗̝̝̰̫̞̠̙̽̄̊͆o̺̮͆ͯ͊͑ͭ̚̕ṙ̶̟̺̱̜̺̻̘̰̀͊ͮ̀͡ḋ̢̰̂̎ͦͥe̛̩͍̯̍ͭͩͤͦ̚r͍̲̾ͯ̊͛͡.̮̙͙̠̦̜̳̋̓ͩ͒̓̒͜͝ ̷͋͢͏̬̤̼T̸̯̞̥ͤ̽͛̈̎̚͜ȟ̨̌͋͏̲͚̪̩͚̺͖̠é̻̰̞̣̍̕͟ ̧̩͔̥̘͓̘̽̂͑̏ͣ͑̓̃͝N̬̙͈̗͍̣̻̓́ͨͫͮ͘͜͡e͍̜̠͓̹̙͓̱͑͡z̵̙̦̳̲͓ͭ̑ͮ̀p͒͐҉̹͎̱̯̰͜ȩ̨̗̝͓̥̳̣̈̽ͧ͋̎́͑ͯ̚͞r̮̖̟̆ͩḑ͍͉͚̪̲̭͈̮ͮͬͣͭ̽̐̄̓̑́͜i̶̪͎̹̰̭͎͋͗ͧ͌̎͛͜͢a͉̦͇̖͓̰ͣͭ̾͗n̮̞̣̝̤̤͔̎̊͜ ̧̗͐̃̊͂͢h̴͙̳͉̺͕͌̉ͫ͐͒̋͊i̢̳͓̤̖͎̼ͤ̃̓ͫ̐ͬ̆͌̕͢ṽ͎͉̳̭̭̠̠̬̠͂͐͊̾͊ȩ̣̺̖̒̏̌͠-̛͕̼̞͌͑ͣͨͩ̃̒͢m̶̱̣̥̥̭̉̿ͩͧ̓͛̚ͅi̢͍̙̠ͩ͐̏n̹̩̬͓̬̦̲̈́̉͊̐̎d̨̧̛̤̭̯̥͓ͧ̒ͥ ̞̙̮͌ͫͬͪͤ̌ợ̫̏̋͋͒̈̀f̶̳̿̀̊͐ͅ ̑͘҉̱̯̤̘͕̺͡c̢̭̭̜̲̙̥͊̓ͤ̍ͨ́ẖ̫̖̆̃ͪ̌ͭ͞ä̷̫̺̩̘̪͍́ͤͫͫ̒ͦ͆͂͠͡o̥̖̥͚̾̋͋͊ͨ̆ͅs̘͉͔̲̫̪͕̠̟ͯ̒͋ͦ̒͢.̣̦̦̞ͮͬͩ ̰̦̤̞̹̮̟ͩ̓̂̓̉͒̏ͫ͢Z̎́̉͘͏̗̮a̧͌͋̂͛̏͏͏̣̩͍̞l̛̟̥͎ͨ͒̐ͤ͛ͅg̰̣ͫͣ̂̋͊ô̷̶͕̩͎̳̻̣̯̘̋̊̅ͧ̏ͫ.̴̮̯̺̻̻̣̻̣̗̿ͯ͒ ̹̮̮̱͖̫̝̋ͥ͆̌͛ͬ͌̄͗͟H̢̅҉̘̠̻e̷͊͊͋ͪ̑҉̮̻̬̞̖̖̭̩̩̕ ̬̪̍͜ͅw̴̧̳͓͓̤̤ͮ̍ͨ͒͜ͅh̳̱ͪ͋ͦ͗o͚͎̭͎͍̺̽ͣͦͅ ̣͙͊̑ͣ̄W͍͔̻͉̲̹͇̼̑̇ͥ̔̎ͮ̚͟a̫̞ͪ̔͘͜į̷̶͔͍̰͙̖̻̲ͯ̓̄͌̄̒̔̆t̷͚͓̼̼̔̈̄̀̚sͭ̎͛ͨ͏̼̻̗͚̼̫ ̵̨̦ͪ͂̂B͂̈́҉̸͎̩͓e͌͂҉̤͕̲͇̮h̴̤͎̟̫̐̀̐ͬį̖̪̜̦̋̔n̨̯͚̽͗̾̀̾̿̏̀d̙͔ͬͧ̃ͯ̀ͦͫ͌ ̸̮͓͎̣̻̪̹̒̈́ͧ̀͐̓̿͘͠T̮͎̫̝ͫ̔͛̈ͯ̈ͬͣ̈́͟h̲̉̆̀e̬̭ͧ̉ͮ̓ͥ̈́ͨ ̉̆̏͠͏̖̯͚̰W͎̞͕̺ͣ̉̓ͧ̎̽͟a̢̝̝̎̅̌̂ͦ͡ḽ͖͉̽ͥ̌̓̉̆l̠̣̾ͣ̾͛̕.̷̡̩̰̀̑͑̽͋̃̌̓ ̴͙̼̭̺̙̱͚̳̪ͨ̈́́Z̵̶̲̘̗̩̜̍ͭ̀̿͒̊ͫA̴̸ͧ̊̚҉̺͖̠͖̜̠̩͇L̶̲̠̾͗ͦ͐͌ͯͤ̇G̫̞͔̭̻̖̥̜̗͊ͪͦͧ̔ͩ̀͘͜Õ̷͉͈̖̭̰̲̪̾ͭͯ͆̚



In case your browser doesn't render it correctly, here's an image:

zalgo text

And it's not that hard to create it, as there are many generators available.

Although it doesn't look like regular ("normal") text, it kinda is - you can copy-paste it normally (if it'll be correctly displayed, it depends on the editor you're using).

But how does it work? How is it created, and how can I detect it and prevent my application from accepting inputs like this?

History
Why does this post require attention from curators or moderators?
You might want to add some details to your flag.
Why should this post be closed?

1 comment thread

Why? (3 comments)
Post
+22
−0

First, let's see how Zalgo Text works.


Unicode Combining Characters

Unicode defines the concept of combining characters. Basically, some characters can be combined with others, to "make/create" different ones (you can also say that they can modify other characters).

Example: in Portuguese there's the letter a. But if you combine it with the COMBINING ACUTE ACCENT character, the result is á, which is a different letter (or the same letter, but with a different semantics - sorry for the lack of accuracy, I'm not a grammar specialist). Anyway, that makes difference in words such as "sábia" (wise), "sabia" (the verb "to know" in past tense) and "sabiá" (a bird).

Unicode defines more than 2000 combining characters, all contained in one of the following categories: Mn (Mark, Nonspacing), Me (Mark, Enclosing) and Mc (Mark, Spacing).

Zalgo Text is created by applying lots of combining characters on the same letter. In many languages, it's perfectly valid to have more than one combining character on the same letter, and that's why it's allowed by Unicode. The text in the question starts with these:

Character Code Point * Category Name
T U+0054 Lu LATIN CAPITAL LETTER T
̃ U+0303 Mn COMBINING TILDE
͟ U+035F Mn COMBINING DOUBLE MACRON BELOW
͏ U+034F Mn COMBINING GRAPHEME JOINER
̧ U+0327 Mn COMBINING CEDILLA
̟ U+031F Mn COMBINING PLUS SIGN BELOW
͓ U+0353 Mn COMBINING X BELOW
̯ U+032F Mn COMBINING INVERTED BREVE BELOW
̘ U+0318 Mn COMBINING LEFT TACK BELOW
͓ U+0353 Mn COMBINING X BELOW
͙ U+0359 Mn COMBINING ASTERISK BELOW
͔ U+0354 Mn COMBINING LEFT ARROWHEAD BELOW

* Unicode defines that each character has an unique numeric value, called code point.

So, the text in the question starts with a letter "T" followed by 11 combining characters. This sequence produces the following:

T̃͟͏̧̟͓̯̘͓͙͔


An enlarged image of the above:

letter T with 11 combining characters

The rest of the Zalgo Text follows the same pattern: a letter with lots of combining characters. The full text consists of this pattern repeated lots of times.


Another thing that makes Zalgo Texts have this peculiar appearance is the stacking algorithm (described here), that defines what happens when more than one combining character is applied to the same letter.

Each combining character can be rendered above or below the pre-existing ones - each one has its own rules about the position it should be, but the exact rendering also depends on the font being used). Let's see below what happens when we add combining characters to the letter "T" (in each line, a new combining character is added):

T    <-- letter "T" without combining characters

T̃    <-- adding COMBINING TILDE

T̃͟    <-- adding COMBINING DOUBLE MACRON BELOW

T̃͟͏    <-- adding COMBINING GRAPHEME JOINER

T̃͟͏̧    <-- adding COMBINING CEDILLA

T̃͟͏̧̟    <-- adding COMBINING PLUS SIGN BELOW

T̃͟͏̧̟͓    <-- adding COMBINING X BELOW

T̃͟͏̧̟͓̯    <-- adding COMBINING INVERTED BREVE BELOW

T̃͟͏̧̟͓̯̘    <-- adding COMBINING LEFT TACK BELOW


T̃͟͏̧̟͓̯̘͓    <-- adding COMBINING X BELOW

T̃͟͏̧̟͓̯̘͓͙    <-- adding COMBINING ASTERISK BELOW


T̃͟͏̧̟͓̯̘͓͙͔    <-- adding COMBINING LEFT ARROWHEAD BELOW


As we can see, Unicode allows lots of combining characters applied to the same letter, and those are "stacked" on top of (or at any other relative position, depending on the character) the previous ones, resulting in this peculiar appearance of Zalgo Texts.

And now that we know how it's made, we can think of ways to detect it.


So how do I detect it?

One way to detect Zalgo Text could be to verify if there are "lots" of consecutive combining characters.

Different programming languages will have their own functions/libraries to work with Unicode data. In languages that support Regex Unicode Properties, you could use something like:

// PHP supports Regex Unicode Properties
$text = // string you want to check
if (preg_match('/\p{M}{3,}/u', $text)) {
    echo "zalgo";
} else {
    echo "not zalgo";
}

I've made an example in PHP because it supports Regex Unicode Properties: \p{M} matches any character in the "Mark" categories (the three already mentioned above: Mn (Mark, Nonspacing), Me (Mark, Enclosing) and Mc (Mark, Spacing)).

And the quantifier {3,} searches for 3 or more ocurrences: so, if there are 3 or more consecutive combining characters, the text is considered to be Zalgo. This might or might not be enough, depending on the languages you want your application to accept.

In some languages, the lower bound could be higher - or lower - than 3. Unicode defines the concept of Stream-safe Text Format, that kinda defines a limit of 30 consecutive combining characters. But in real-life applications, I guess 30 is too much, because the longest known sequence is the tibetan character HAKṢHMALAWARAYAṀ: a letter followed by 8 combining characters. It's this one (and I admit that, for those who don't know it, this can easily be mistaken as Zalgo):

ཧྐྵྨླྺྼྻྂ


An enlarged image, so we can see it in all its glory:

tibetan character HAKṢHMALAWARAYAṀ

Therefore, unless your application needs to accept tibetan texts, using \p{M}{8,} would be a valid solution. Depending on how many you use, you might end up excluding valid words in another languages (in many of them, having 2, 3 or even more combining characters is perfectly valid), so you'll have to adjust the value according to the strings you want to be valid.

One could also argue that a text with only 2 or 3 combining characters per letter is not "zalgo enough", even if it's not a valid text in the languages that your application accepts. Anyway, defining an accurate criteria that works for all cases is hard and depends on the context.


Another way - if the programming language you're using doesn't support Regex Unicode Properties - is to simply loop through the string and count the combining characters:

# Python
from unicodedata import combining

text = # string I want to check
count = 0
max_allowed = 3 # maximum of 3 consecutive combining characters allowed
for c in text:
    if combining(c):
        count += 1
        if count > max_allowed:
            print('Zalgo!')
            break
    else:
        count = 0
else:
    print('not zalgo')

I used Python as an example, but the ideia is the same in any other language: count the characters, and when the limit of consecutive combining characters is reached, report the text as Zalgo.

Many programming languages - if not all, at least the "mainstream" ones - have some way to work with Unicode data, so the code above is pretty straightforward to adapt.


Both approaches above don't need to check all the string, as they stop at the first "zalgo character" found. The consequence is that they will consider this as Zalgo:

this text has only one "zalgo char": T̃͟͏̧̟͓̯̘͓͙͔ - the rest is just normal text


Because it considers a text to be Zalgo if it finds a single ocorrence of consecutive combining characters. But it's not hard to adapt the algorithms above to consider only cases when all letters are "zalgified" (or at least N letters are, with N varying according to any criteria you want).


Anyway, there's no silver bullet. The bigger max_allowed is, the less cases of potential Zalgo Texts are not detected.

Another approach to this problem would be: instead of trying to detect Zalgo Text, you could have a whitelist of letters and their respective list of allowed combining characters - and that will vary according to the languages you want to accept.

Example: in Portuguese, vowels can be followed by a COMBINING ACUTE ACCENT, COMBINING CIRCUMFLEX ACCENT or COMBINING TILDE (only one of them at a time). The letter c can be followed by a COMBINING CEDILLA, and letter a can also be followed by a COMBINING GRAVE ACCENT. So the regex will be like this:

// PHP, checking valid combining characters in Portuguese
if (preg_match('/c[^\P{M}\x{327}]|[^aeiouc]\p{M}|[eiou][^\P{M}\x{301}-\x{303}]|a[^\P{M}\x{300}-\x{303}]/iu', $texto)) {
    echo "invalid\n";
} else {
    echo "valid\n";
}
// PS: it's a "simplified" version, because some vowels don't accept all the accents

This could be harder to do, if the languages you want to accept have lots of different and complicated rules. But the trade-off is that it'll be more accurate, although it'll also reject any text that's not compliant with the language grammar (not only Zalgo, but also typos and maybe "not-zalgo-enough" texts, whatever that means).

And in the end, you must also define what your real problem is: do you want to detect Zalgo (something that has that "creepy appearance") or to reject any invalid text (given a list of accepted languages)?

Anyway, there's no one-size-fits-all solution. But once you know how a Zalgo Text is created, you can adapt the solution according to your needs.

History
Why does this post require attention from curators or moderators?
You might want to add some details to your flag.

1 comment thread

General comments (2 comments)
General comments
Olin Lathrop‭ wrote over 3 years ago

Sigh. It was so much simpler when all valid input was 7-bit ASCII.

Peter Taylor‭ wrote over 3 years ago

Re "which is a different letter (or the same letter, but with a different semantics...)": there isn't going to be a short clean way of explaining it, because it's very contextual. In Spanish, for example, a and á are officially the same letter, but n and ñ are different letters.