Got Those Bidirectional Blues

by James Duguid | May 27, 2018

Anyone who has tried to work with mixed Hebrew and English text has probably had some difficulties, especially with punctuation (e.g., parentheses get flipped around, and show up in the wrong places). While Unicode bidirectional formatting is a wonderful thing, it has some trouble with characters which could be interpreted as requiring either right-to-left or left-to-right formatting. This post details a simple Unicode fix that should solve some of your problems.

Suppose I am writing about the following line from the Dead Sea Scrolls:

כי גדלים רחמי אל ואין קץ[...]

This line of text is formatted right-to-left, and there is an ellipsis at the end, representing a break in the text on the original scroll. Now suppose I cut and paste this text into an English left-to-right sentence:

4QInstruction mentions God’s compassion: כי גדלים רחמי אל ואין קץ[...], “For great are the mercies of God, and there is no end…” (4Q416 Frag.3.4).

What? Why has the ellipsis moved to the right of the Hebrew text? Go ahead and click on the example to see how the string of letters is interpreted letter by letter. As you can see, the English characters are placed left-to-right one after another; then, when a segment of text with Hebrew characters is reached, these are placed in right-to-left order. However, when the bracketed ellipsis is reached, it is treated as left-to-right, because its characters are directionally ambiguous (both left-to-right and right-to-left scripts use brackets and periods!) and the overall direction of the string is left-to-right. So it begins a new left-to-right sequence following the Hebrew sequence.
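If you want to check which characters are ambiguous, the directional class of each character can be inspected directly. Here is a minimal sketch in Python using the standard library's `unicodedata.bidirectional`; the helper name `bidi_classes` and the short sample string are my own illustrative choices, not anything defined by Unicode.

```python
import unicodedata

def bidi_classes(text):
    """Return (character, bidirectional class) pairs for each character in text."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]

# Hypothetical sample: the final Hebrew letter of the line (U+05E5),
# the bracketed ellipsis, and one Latin letter.
sample = "\u05e5[...]a"
for ch, cls in bidi_classes(sample):
    print(f"U+{ord(ch):04X} {cls}")
```

The Hebrew letter reports `R` and the Latin letter reports `L`, which are strong classes. The brackets report `ON` (other neutral) and the periods report `CS` (common separator); neither has a fixed direction of its own, so their placement is resolved from the surrounding context.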

How do we make this string behave the way we want it to? For that, we need to meet a couple of new Unicode friends. Our first friend is named RIGHT-TO-LEFT OVERRIDE (U+202E). Inserting this character before a string will force the whole string to be read right-to-left. So inserting it into our example from above, directly before the run of Hebrew characters, will give us this:

4QInstruction mentions God’s compassion: ‮כי גדלים רחמי אל ואין קץ[...], “For great are the mercies of God, and there is no end…” (4Q416 Frag.3.4).

So, you can see that the ellipsis occurs in the correct place, since we have forced it to display as right-to-left. However, this effect does not turn off, and so all the English characters are also being displayed right-to-left! This is still not quite what we want (again, you can click for a letter-by-letter animation). To fix this, we need to meet a second friend, POP DIRECTIONAL FORMATTING (U+202C). This Unicode character will cancel out the forced directionality defined by the previous character, and so if we put RIGHT-TO-LEFT OVERRIDE at the beginning of the run of Hebrew characters, and POP DIRECTIONAL FORMATTING at the end, we will get:

4QInstruction mentions God’s compassion: ‮כי גדלים רחמי אל ואין קץ[...]‬, “For great are the mercies of God, and there is no end…” (4Q416 Frag.3.4).
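If you are building such strings in code rather than by hand, the wrapping can be sketched as follows. This is a minimal Python illustration, assuming you already know where the Hebrew run begins and ends; the function name `force_rtl` is hypothetical.

```python
RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING

def force_rtl(run):
    """Wrap run in RLO ... PDF so the override ends where the run ends."""
    return f"{RLO}{run}{PDF}"

hebrew = "כי גדלים רחמי אל ואין קץ[...]"
sentence = (
    "4QInstruction mentions God's compassion: "
    + force_rtl(hebrew)
    + " (4Q416 Frag.3.4)."
)
print(sentence)
```

Pairing every RIGHT-TO-LEFT OVERRIDE with a POP DIRECTIONAL FORMATTING inside one helper makes it harder to forget the closing character and accidentally reverse the rest of the sentence.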

So now we have the behavior we want. However, it should be noted that these characters can be difficult to use, because they are not displayed in most programs, and so it is difficult to see where they are. This site can be used to see hidden formatting in a string if you get confused.
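If you would rather check a string locally, a short script can make the invisible characters visible. The sketch below is one possible approach in Python, not the method used by any particular site: it replaces every character in Unicode category `Cf` (format) with its name in angle brackets. The function name `reveal_format_chars` is hypothetical.

```python
import unicodedata

def reveal_format_chars(text):
    """Replace invisible format characters (category Cf) with <NAME> tags."""
    out = []
    for ch in text:
        if unicodedata.category(ch) == "Cf":
            # Fall back to the code point if the character has no name.
            fallback = f"U+{ord(ch):04X}"
            out.append("<" + unicodedata.name(ch, fallback) + ">")
        else:
            out.append(ch)
    return "".join(out)

print(reveal_format_chars("a\u202eb\u202cc"))
# prints: a<RIGHT-TO-LEFT OVERRIDE>b<POP DIRECTIONAL FORMATTING>c
```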

How do you insert these characters in the first place? The most basic way is to use the Character Map in Windows or the Emoji & Symbols window on a Mac to locate your character by its code point. Since RIGHT-TO-LEFT OVERRIDE is U+202E, I would scroll down to the '202' row and find the character under the 'E' column. For a more complete explanation of how to do this, as well as more advanced techniques, see here.
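In a programming language the same code points can be written directly, with no character picker involved. As a small Python sketch (the variable names are illustrative only), here are three equivalent ways to produce RIGHT-TO-LEFT OVERRIDE, plus a lookup by name for POP DIRECTIONAL FORMATTING:

```python
import unicodedata

# Three equivalent spellings of RIGHT-TO-LEFT OVERRIDE (U+202E).
by_number = chr(0x202E)                  # from the numeric code point
by_escape = "\u202e"                     # from a \u escape in a string literal
by_name = "\N{RIGHT-TO-LEFT OVERRIDE}"   # from the official character name

# Looking a character up by name at run time.
pop = unicodedata.lookup("POP DIRECTIONAL FORMATTING")

print(by_number == by_escape == by_name)
print(f"U+{ord(pop):04X}")
```

Writing the characters as escapes or names, rather than pasting the invisible characters themselves into source code, keeps them visible to anyone reading the file.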

If you want a deeper understanding of the dynamics of Unicode directionality, this article is a great overview. And to see the documentation for these Unicode characters, as well as some other bidirectional control characters, see here.
