First, let's see how Zalgo Text works.

Unicode Combining Characters

Unicode defines the concept of combining characters. Basically, some characters can be combined with others to "create" different ones (you can also say that they modify other characters).

Example: in Portuguese there's the letter a. But if you combine it with the COMBINING ACUTE ACCENT character, the result is á, which is a different letter (or the same letter with different semantics; sorry for the lack of accuracy, I'm not a grammar specialist). Anyway, that makes a difference in words such as "sábia" (wise), "sabia" (the verb "to know" in the past tense) and "sabiá" (a bird).

Unicode defines more than 2000 combining characters, all contained in one of the following categories: Mn (Mark, Nonspacing), Me (Mark, Enclosing) and Mc (Mark, Spacing).
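To make this concrete, here's a minimal sketch in Python (using only the standard `unicodedata` module) that builds "á" from the letter a plus COMBINING ACUTE ACCENT. The variable names are mine, just for illustration:

```python
import unicodedata

# LATIN SMALL LETTER A followed by COMBINING ACUTE ACCENT (two code points)
decomposed = "a\u0301"
# NFC normalization merges them into the single precomposed letter, when one exists
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))     # number of code points in each form
print(composed == "\u00e1")               # compares with LATIN SMALL LETTER A WITH ACUTE
print(unicodedata.category("\u0301"))     # general category of the accent
```

Both forms look the same on screen, but only the decomposed one actually contains a combining character.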

Zalgo Text is created by applying lots of combining characters on the same letter. In many languages, it's perfectly valid to have more than one combining character on the same letter, and that's why it's allowed by Unicode. The text in the question starts with these:

Character	Code Point *	Category	Name
T	U+0054	Lu	LATIN CAPITAL LETTER T
̃	U+0303	Mn	COMBINING TILDE
͟	U+035F	Mn	COMBINING DOUBLE MACRON BELOW
͏	U+034F	Mn	COMBINING GRAPHEME JOINER
̧	U+0327	Mn	COMBINING CEDILLA
̟	U+031F	Mn	COMBINING PLUS SIGN BELOW
͓	U+0353	Mn	COMBINING X BELOW
̯	U+032F	Mn	COMBINING INVERTED BREVE BELOW
̘	U+0318	Mn	COMBINING LEFT TACK BELOW
͓	U+0353	Mn	COMBINING X BELOW
͙	U+0359	Mn	COMBINING ASTERISK BELOW
͔	U+0354	Mn	COMBINING LEFT ARROWHEAD BELOW

* Unicode assigns each character a unique numeric value, called a code point.
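If you want to inspect a string like this yourself, something along these lines should reproduce the table above. It's just a sketch: `describe` is a helper name I made up, and `sequence` is the first letter of the text in the question written with escapes.

```python
import unicodedata

# "T" followed by the 11 combining characters listed in the table
sequence = "T\u0303\u035f\u034f\u0327\u031f\u0353\u032f\u0318\u0353\u0359\u0354"

def describe(text):
    """Return a (code point, category, name) tuple for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]

for row in describe(sequence):
    print("\t".join(row))
```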

So, the text in the question starts with a letter "T" followed by 11 combining characters. This sequence produces the following:

T̃͟͏̧̟͓̯̘͓͙͔

An enlarged image of the above:

[image: letter T with 11 combining characters]

The rest of the Zalgo Text follows the same pattern: a letter with lots of combining characters. The full text consists of this pattern repeated lots of times.

Another thing that gives Zalgo Texts this peculiar appearance is the stacking algorithm (described here), which defines what happens when more than one combining character is applied to the same letter.

Each combining character can be rendered above or below the pre-existing ones (each one has its own rules about where it should be placed, but the exact rendering also depends on the font being used). Let's see below what happens when we add combining characters to the letter "T" (in each line, a new combining character is added):

T <-- letter "T" without combining characters

T̃ <-- adding COMBINING TILDE

T̃͟ <-- adding COMBINING DOUBLE MACRON BELOW

T̃͟͏ <-- adding COMBINING GRAPHEME JOINER

T̃͟͏̧ <-- adding COMBINING CEDILLA

T̃͟͏̧̟ <-- adding COMBINING PLUS SIGN BELOW

T̃͟͏̧̟͓ <-- adding COMBINING X BELOW

T̃͟͏̧̟͓̯ <-- adding COMBINING INVERTED BREVE BELOW

T̃͟͏̧̟͓̯̘ <-- adding COMBINING LEFT TACK BELOW

T̃͟͏̧̟͓̯̘͓ <-- adding COMBINING X BELOW

T̃͟͏̧̟͓̯̘͓͙ <-- adding COMBINING ASTERISK BELOW

T̃͟͏̧̟͓̯̘͓͙͔ <-- adding COMBINING LEFT ARROWHEAD BELOW

As we can see, Unicode allows lots of combining characters applied to the same letter, and those are "stacked" on top of (or at any other relative position, depending on the character) the previous ones, resulting in this peculiar appearance of Zalgo Texts.
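A rough hint of where each mark goes is its canonical combining class, which Python exposes as `unicodedata.combining()`. For instance, class 230 means "above", 220 means "below" and 202 means "attached below". This is only a sketch of the idea; the final placement is still up to the font and the rendering engine:

```python
import unicodedata

# Some of the marks applied to the "T" above
marks = ["\u0303", "\u0327", "\u031f", "\u034f"]

for mark in marks:
    ccc = unicodedata.combining(mark)   # canonical combining class (0 = not reordered)
    print(f"U+{ord(mark):04X} {unicodedata.name(mark)}: class {ccc}")
```

Note that COMBINING GRAPHEME JOINER has class 0 even though its category is Mn, so the combining class alone isn't a reliable test for "is this a combining character"; the general category is safer for that.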

And now that we know how it's made, we can think of ways to detect it.

So how do I detect it?

One way to detect Zalgo Text could be to verify if there are "lots" of consecutive combining characters.

Different programming languages will have their own functions/libraries to work with Unicode data. In languages that support Regex Unicode Properties, you could use something like:

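// PHP supports Regex Unicode Properties
$text = "T\u{0303}\u{035F}\u{0327}"; // string you want to check (here: "T" + 3 combining characters)
// \p{M}{3,} matches 3 or more consecutive characters in the "Mark" categories
if (preg_match('/\p{M}{3,}/u', $text)) {
    echo "zalgo";
} else {
    echo "not zalgo";
}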

I've made an example in PHP because it supports Regex Unicode Properties: \p{M} matches any character in the "Mark" categories (the three already mentioned above: Mn (Mark, Nonspacing), Me (Mark, Enclosing) and Mc (Mark, Spacing)).

And the quantifier {3,} searches for 3 or more occurrences: so, if there are 3 or more consecutive combining characters, the text is considered to be Zalgo. This might or might not be enough, depending on the languages you want your application to accept.

In some languages, the lower bound could be higher or lower than 3. Unicode defines the concept of Stream-Safe Text Format, which, roughly speaking, sets a limit of 30 consecutive combining characters. But in real-life applications, I guess 30 is too much, because the longest known sequence is the Tibetan character HAKṢHMALAWARAYAṀ: a letter followed by 8 combining characters. It's this one (and I admit that, for those who don't know it, this can easily be mistaken for Zalgo):

ཧྐྵྨླྺྼྻྂ

An enlarged image, so we can see it in all its glory:

[image: Tibetan character HAKṢHMALAWARAYAṀ]

Therefore, unless your application needs to accept Tibetan texts, using \p{M}{8,} would be a valid solution. Depending on the value you use, you might end up excluding valid words in other languages (in many of them, having 2, 3 or even more combining characters is perfectly valid), so you'll have to adjust the value according to the strings you want to be valid.

One could also argue that a text with only 2 or 3 combining characters per letter is not "zalgo enough", even if it's not a valid text in the languages that your application accepts. Anyway, defining accurate criteria that work for all cases is hard and depends on the context.
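One way to pick the threshold is to measure the longest run of combining characters in samples of text you do want to accept. Here's a minimal sketch of that; `longest_mark_run` is a name I made up, and the Tibetan string is my transcription (with escapes) of the character shown above:

```python
import unicodedata

def longest_mark_run(text):
    """Length of the longest run of consecutive characters in the Mark categories."""
    longest = current = 0
    for ch in text:
        if unicodedata.category(ch).startswith("M"):   # Mn, Mc or Me
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest

tibetan = "\u0f67\u0f90\u0fb5\u0fa8\u0fb3\u0fba\u0fbc\u0fbb\u0f82"
zalgo_t = "T\u0303\u035f\u034f\u0327\u031f\u0353\u032f\u0318\u0353\u0359\u0354"

for sample in ("sabia\u0301", tibetan, zalgo_t):
    print(longest_mark_run(sample))
```

Running it over a corpus of real text in the languages you accept should give a reasonable starting value for the quantifier.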

Another way - if the programming language you're using doesn't support Regex Unicode Properties - is to simply loop through the string and count the combining characters:

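# Python
from unicodedata import category

text = 'T\u0303\u035f\u034f\u0327\u031f'  # string I want to check (here: "T" + 5 combining characters)
count = 0
max_allowed = 3 # maximum of 3 consecutive combining characters allowed
for c in text:
    # category() is 'Mn', 'Mc' or 'Me' for combining characters (same as \p{M});
    # unicodedata.combining() returns 0 for some of them, such as COMBINING GRAPHEME JOINER
    if category(c).startswith('M'):
        count += 1
        if count > max_allowed:
            print('Zalgo!')
            break
    else:
        count = 0
else:
    # the loop finished without hitting "break"
    print('not zalgo')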

I used Python as an example, but the idea is the same in any other language: count the characters, and when the limit of consecutive combining characters is exceeded, report the text as Zalgo.

Many programming languages - if not all, at least the "mainstream" ones - have some way to work with Unicode data, so the code above is pretty straightforward to adapt.

Neither approach above needs to check the whole string, as both stop at the first "zalgo character" found. The consequence is that they will consider this as Zalgo:

this text has only one "zalgo char": T̃͟͏̧̟͓̯̘͓͙͔ - the rest is just normal text

That's because they consider a text to be Zalgo as soon as they find a single occurrence of too many consecutive combining characters. But it's not hard to adapt the algorithms above to consider only cases when all letters are "zalgified" (or at least N letters are, with N varying according to any criteria you want).
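A possible adaptation for the "at least N letters" variant could look like this. It's only a sketch: the function names and the default values for `max_allowed` and `min_hits` are arbitrary choices of mine.

```python
import unicodedata

def count_zalgified(text, max_allowed=3):
    """Count the runs of more than max_allowed consecutive combining characters."""
    hits = 0
    run = 0
    for ch in text:
        if unicodedata.category(ch).startswith("M"):
            run += 1
            if run == max_allowed + 1:   # count each over-long run only once
                hits += 1
        else:
            run = 0
    return hits

def is_zalgo(text, max_allowed=3, min_hits=2):
    """Report Zalgo only if at least min_hits letters are 'zalgified'."""
    return count_zalgified(text, max_allowed) >= min_hits

marks = "\u0303\u035f\u0327\u031f\u0353"
print(is_zalgo("only one zalgo char: T" + marks + " and normal text"))
print(is_zalgo("T" + marks + "e" + marks + "xt"))
```

Unlike the previous versions, this one has to scan the whole string, because it can't stop at the first match.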

Anyway, there's no silver bullet. The bigger max_allowed is, the more potential Zalgo Texts go undetected; the smaller it is, the more valid text is rejected.

Another approach to this problem would be: instead of trying to detect Zalgo Text, you could have a whitelist of letters and their respective list of allowed combining characters - and that will vary according to the languages you want to accept.

Example: in Portuguese, vowels can be followed by a COMBINING ACUTE ACCENT, COMBINING CIRCUMFLEX ACCENT or COMBINING TILDE (only one of them at a time). The letter c can be followed by a COMBINING CEDILLA, and letter a can also be followed by a COMBINING GRAVE ACCENT. So the regex will be like this:

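// PHP, checking valid combining characters in Portuguese
$text = "sabia\u{0301}"; // string you want to check (here: "sabiá", with the accent as a combining character)
// the regex matches any combining character that follows a letter not allowed to have it
if (preg_match('/c[^\P{M}\x{327}]|[^aeiouc]\p{M}|[eiou][^\P{M}\x{301}-\x{303}]|a[^\P{M}\x{300}-\x{303}]/iu', $text)) {
    echo "invalid\n";
} else {
    echo "valid\n";
}
// PS: it's a "simplified" version, because some vowels don't accept all the accents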

This could be harder to do, if the languages you want to accept have lots of different and complicated rules. But the trade-off is that it'll be more accurate, although it'll also reject any text that's not compliant with the language grammar (not only Zalgo, but also typos and maybe "not-zalgo-enough" texts, whatever that means).
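For languages without Regex Unicode Properties, the same whitelist idea can be sketched with a lookup table. The code below is an illustration under a few assumptions of mine: the `ALLOWED` table and the function name are made up, it uses the same simplified Portuguese rules as the regex above, and it normalizes to NFD first so that precomposed letters such as "á" are also checked.

```python
import unicodedata

ACCENTS = {"\u0301", "\u0302", "\u0303"}   # acute, circumflex, tilde

# Simplified whitelist: base letter -> combining characters it may carry
ALLOWED = {
    "a": ACCENTS | {"\u0300"},             # "a" also accepts the grave accent
    "e": ACCENTS, "i": ACCENTS, "o": ACCENTS, "u": ACCENTS,
    "c": {"\u0327"},                       # cedilla
}

def is_valid_portuguese(text):
    """True if every combining character is allowed for the letter before it."""
    base = None
    marks_on_base = 0
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.category(ch).startswith("M"):
            marks_on_base += 1
            # only one combining character per letter, and it must be whitelisted
            if base is None or marks_on_base > 1:
                return False
            if ch not in ALLOWED.get(base.lower(), ()):
                return False
        else:
            base = ch
            marks_on_base = 0
    return True

for sample in ("sábia", "coração", "T\u0303\u035f\u0327", "a\u0301\u0302"):
    print(is_valid_portuguese(sample))
```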

And in the end, you must also define what your real problem is: do you want to detect Zalgo (something that has that "creepy appearance") or to reject any invalid text (given a list of accepted languages)?

Anyway, there's no one-size-fits-all solution. But once you know how a Zalgo Text is created, you can adapt the solution according to your needs.

posted about 4 years ago

CC BY-SA 4.0

1y ago

hkotsubo‭
