Translating Source Code, part 2

Morgan Wahl

2023-09-08 00:00

See the previous post to have any clue what I'm going on about here.

Japanese

Our first attempt at "translating" Python from English was to German. While there were a few issues to consider, it was relatively easy since the two languages are somewhat similar. They have almost identical writing systems, and their word order and syntactic structures are mostly the same. Neither of those things is true of Japanese.

Let's start with writing. The Japanese writing system makes use of three or four different scripts, but that's not a huge issue for us. What's more important is that written Japanese rarely uses spaces between words. The different scripts somewhat help delineate the different parts of a sentence. And the head-last syntax (more on that later) works somewhat like a reverse-Polish calculator to help parsing.

This means one of the fundamental syntactic elements of a programming language, spaces separating tokens, is arguably out of place. To fit with this, we could just not use spaces, and when parsing look for keywords at the beginnings or ends of identifiers. This would place odd restrictions on identifiers though, and might make writing a parser more difficult.

Instead, when we need to separate tokens within a line, we'll use a middle dot ・. This is used in Japanese for those situations where separate words need to be distinguished, typically separate words within a phrase that are all written with the same script.

The upside of no spaces is that identifiers can just be whole phrases, with no need for underscores, camel casing, or any other transformations.
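As a rough sketch of what this means for a tokenizer, here is how one line could be split on the middle dot while keeping track of indentation. The function name `split_tokens` is made up for illustration, and I'm assuming indentation is still written with ordinary ASCII spaces, as in the examples later in this post.

```python
# Hypothetical sketch: split one line of the Japanese dialect into tokens.
# Assumes indentation uses ASCII spaces and that ・ is the only separator.
MIDDLE_DOT = "・"  # U+30FB KATAKANA MIDDLE DOT


def split_tokens(line: str) -> tuple[int, list[str]]:
    """Return (indent width, tokens) for a single source line."""
    stripped = line.lstrip(" ")
    indent = len(line) - len(stripped)
    # Whole phrases such as この数 stay intact because nothing splits on spaces.
    tokens = [tok for tok in stripped.rstrip().split(MIDDLE_DOT) if tok]
    return indent, tokens


print(split_tokens("    この数・無・だ・と"))
```

Since the separator is a single dedicated character, the tokenizer never has to guess where an identifier phrase ends.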

The other big difference with Japanese is that its word order is almost entirely the opposite of English. Verbs come last in a sentence.[2] Also, the equivalent of English's prepositions occur after their noun phrases instead of before them.[1]

This word order presents an opportunity to simplify the language. In English Python, conditionals have the syntax:

if conditional_expression:
    ...

In Japanese, the "if" naturally comes after the condition, which eliminates the need for the : character.

conditional_expression・と
    ...

The ・ is used to delimit an expression ending in an identifier from the と conditional keyword.
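To make the keyword-final rule concrete, here's a minimal sketch that turns a line ending in ・と back into an English-Python `if` header. The helper name `rewrite_conditional` is my own invention, and it only looks at the end of the line, which is all the rule needs.

```python
# Hypothetical sketch: rewrite "<condition>・と" as "if <condition>:".
IF_SUFFIX = "・と"


def rewrite_conditional(line: str) -> str:
    """Return the English-Python form of a keyword-final conditional line."""
    body = line.rstrip()
    if not body.endswith(IF_SUFFIX):
        return line  # not a conditional header, leave it untouched
    stripped = body.lstrip(" ")
    indent = body[: len(body) - len(stripped)]
    condition = stripped[: -len(IF_SUFFIX)]
    return f"{indent}if {condition}:"


print(rewrite_conditional("    conditional_expression・と"))
```

Because the check requires the ・ before と, an identifier that merely happens to end in と is not mistaken for a conditional.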

Japanese also has its own repertoire of punctuation, which I've swapped in for the ASCII punctuation.

Here's some example syntax to demonstrate the different word order and different punctuation:

Python	ニシキヘビ	nishikihebi
object.attribute	属性。物体	zokusei.buttai
"string"	『列』	"retsu"
'string'	「列」	'retsu'
function_call(arg1, arg2)	【話1、話2】呼び出された関数	(hanashi1, hanashi2)yobidasareta kansuu
(item1, item2)	（要素1、要素2）	(youso1, youso2)
[item1, item2]	［要素1、要素2］	[youso1, youso2]
{thing1, thing2}	｛物1、物2｝	{mono1, mono2}
{thing: "stuff"}	｛物：『やつ』｝	{mono: "yatsu"}

The third column has transliterations of the Japanese column.
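For the rows where only the punctuation changes (strings, tuples, lists, sets and dicts), the swap can be sketched as a plain character table. This is a naive illustration: the name `to_ascii_punctuation` is hypothetical, and it would also rewrite punctuation inside string literals, which a real translator would have to avoid. The 。 and 【】 rows are left out because they reorder things rather than just substituting characters.

```python
# Hypothetical mapping built from the punctuation-only rows of the table.
PUNCTUATION = str.maketrans({
    "『": '"', "』": '"',
    "「": "'", "」": "'",
    "（": "(", "）": ")",
    "［": "[", "］": "]",
    "｛": "{", "｝": "}",
    "、": ", ",
    "：": ": ",
})


def to_ascii_punctuation(source: str) -> str:
    """Swap Japanese punctuation for its ASCII counterpart, character by character."""
    return source.translate(PUNCTUATION)


print(to_ascii_punctuation("｛物：『やつ』｝"))
print(to_ascii_punctuation("［要素1、要素2］"))
```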

Function calls have their arguments before the name of the function. This matches Japanese syntax where verbs come at the end of their clause. Japanese has more kinds of brackets, so I took the opportunity to use the lenticular brackets to distinguish function calls from other uses of parentheses.
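Here's a small sketch of how the arguments-first call could be turned back into the English-Python order. The regular expression and the name `rewrite_call` are assumptions of mine, and the sketch deliberately handles only flat, non-nested calls.

```python
import re

# Hypothetical pattern: 【args】 followed immediately by the function name.
# The name stops at whitespace, a middle dot, 。, 、 or another bracket.
CALL = re.compile(r"【([^【】]*)】([^\s・。、【】]+)")


def rewrite_call(expr: str) -> str:
    """Turn 【a、b】f into f(a, b). Nested calls are not handled."""
    def to_ascii(match: re.Match) -> str:
        args = [a.strip() for a in match.group(1).split("、") if a.strip()]
        return f"{match.group(2)}({', '.join(args)})"

    return CALL.sub(to_ascii, expr)


print(rewrite_call("【話1、話2】呼び出された関数"))
print(rewrite_call("【】負だ"))
```

Having a bracket pair reserved for calls is what keeps the pattern this simple: 【 can never be the start of a tuple or a grouped expression.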

I've taken the probably controversial approach of swapping the arguments to the . operator. With the 。 operator, the attribute name comes first, then the object. I think this matches with the general pattern of the syntax, but I could be wrong.
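The swap itself is tiny when sketched in code. The one thing I'm assuming here, since the table only shows a single 。, is that a longer chain reverses completely; the function name is hypothetical too.

```python
def swap_attribute_access(expr: str) -> str:
    """Turn 属性。物体 (attribute first) into 物体.属性 (object first).

    Assumption: chains such as c。b。a reverse completely to a.b.c.
    """
    return ".".join(reversed(expr.split("。")))


print(swap_attribute_access("属性。物体"))
```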

Also, 話 is probably a terrible translation of "function argument".

To see our full example, we'll use this "translation table":

Python	ニシキヘビ	nishikihebi
x y (placeholders for identifiers)	X Y	X Y
class x:	X・種類	X shurui
def x(y1, y2):	【Y1、Y2】X・関数	(Y1, Y2)X kansuu
if x:	X・と	X to
x is y	X・Y・だ	X Y da
return x	Xを返せ	X o kaese
None	無	mu

For the identifiers below, I've put transliterations in square brackets.

# program-translation
# from: en
# to: ja
# identifiers:
#   absolute_value: 絶対値 [zettaichi]
#   is_negative: 負だ [fu da]
#   is_odd: 奇数だ [kisuu da]
#   negated_number: ネゲートした数 [negeeto shita kazu]
#   number: この数 [kono kazu]
#   Number: 数 [kazu]
#   self: 己 [ono]

数・種類
    【己】負だ・関数
        …

    【己】奇数だ・関数
        …

【この数:数】絶対値・関数
    この数・無・だ・と
        無を返せ
    【】負だ。この数・と
        ネゲートした数 = -この数
        ネゲートした数を返せ
    この数を返せ

The result is delightfully compact, even with the negated_number variable becoming a phrase that could be translated "number that was negated", and the number argument having to become "this number" to avoid collisions with the Number class.
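For comparison, here is the English Python that the example corresponds to. This is my reconstruction from the translation table and the identifier header, not a copy of the listing in the previous post, so treat the details as an approximation.

```python
# Reconstructed English source (an approximation, see the note above).
class Number:
    def is_negative(self):
        ...

    def is_odd(self):
        ...


def absolute_value(number: Number):
    if number is None:
        return None
    if number.is_negative():
        negated_number = -number
        return negated_number
    return number
```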

A second approach to Japanese could use the "halfwidth" katakana that were used on computers back when they required 1-byte-per-character text encodings. To my eyes, this gives the language a bit of an all-caps FORTRAN feel, which isn't Pythonic, but could work for other languages. I think ASCII punctuation is more appropriate here, but I'm not sure.
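One thing a tool for this variant would have to decide is whether a halfwidth identifier and its fullwidth spelling are the same name. Unicode's NFKC normalization folds halfwidth katakana into the regular forms, and Python itself normalizes identifiers to NFKC when parsing (PEP 3131). A quick sketch, with `to_fullwidth` as a made-up helper name:

```python
import unicodedata


def to_fullwidth(text: str) -> str:
    """Fold halfwidth katakana (and other compatibility forms) using NFKC."""
    return unicodedata.normalize("NFKC", text)


halfwidth = "ｶｽﾞ"  # the Number identifier in the halfwidth variant
folded = to_fullwidth(halfwidth)
print(halfwidth, len(halfwidth), folded, len(folded))
```

The folded form is shorter in code points because the separate halfwidth voicing mark ﾞ gets combined with the preceding kana.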

Python	ﾆｼｷﾍﾋﾞ	nishikihebi
x y (placeholders for identifiers)	X Y	X Y
class x:	X･ｼｭﾙｲ	X shurui
def x(y1, y2):	(Y1､Y2)X･ｶﾝｽｰ	(Y1, Y2)X kansuu
if x:	X･ﾄ	X to
x is y	X･Y･ﾀﾞ	X Y da
return x	Xｦｶｴｾ	X o kaese
None	ﾑ	mu

# program-translation
# from: en
# to: ja-Kana-x-halfwidth
# identifiers:
#   absolute_value: ｾﾞｯﾀｲﾁ [zettaichi]
#   is_negative: ﾌﾀﾞ [fu da]
#   is_odd: ｷｽｰﾀﾞ [kisuu da]
#   negated_number: ﾈｹﾞｰﾄｼﾀｶｽﾞ [negeeto shita kazu]
#   number: ｺﾉｶｽﾞ [kono kazu]
#   Number: ｶｽﾞ [kazu]
#   self: ｵﾉ [ono]

ｶｽﾞ･ｼｭﾙｲ
    (ｵﾉ)ﾌﾀﾞ･ｶﾝｽｰ
        ...

    (ｵﾉ)ｷｽｰﾀﾞ･ｶﾝｽｰ
        ...

(ｺﾉｶｽﾞ:ｶｽﾞ)ｾﾞｯﾀｲﾁ･ｶﾝｽｰ
    ｺﾉｶｽﾞ･ﾑ･ﾀﾞ･ﾄ
        ﾑｦｶｴｾ
    ()ﾌﾀﾞ.ｺﾉｶｽﾞ･ﾄ
        ﾈｹﾞｰﾄｼﾀｶｽﾞ = -ｺﾉｶｽﾞ
        ﾈｹﾞｰﾄｼﾀｶｽﾞｦｶｴｾ
    ｺﾉｶｽﾞｦｶｴｾ

Despite having more characters, this is even more compact. I have no idea how readable it is, however.
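The "more characters but more compact" effect can be checked with a rough column count. This sketch uses `unicodedata.east_asian_width` and treats wide and fullwidth characters as two columns and everything else as one, which is only an approximation of how a terminal or monospaced font lays text out. `display_columns` is a hypothetical helper.

```python
import unicodedata


def display_columns(text: str) -> int:
    """Approximate column count: wide (W) and fullwidth (F) characters take 2."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )


# The negated_number identifier in both variants.
for name in ("ネゲートした数", "ﾈｹﾞｰﾄｼﾀｶｽﾞ"):
    print(name, "code points:", len(name), "columns:", display_columns(name))
```

Under that approximation, the halfwidth spelling uses more code points, partly because the voicing marks are separate characters, but fewer columns.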

So, lessons from Japanese: explicit token separators aren't so bad, and having keywords at the end lets you use a lot less punctuation. Also, more kinds of brackets means easier-to-read code.

Hindi

Syntactically, Hindi is very similar to Japanese, so there are a few places we can use the "keyword-final" syntax again. Hindi doesn't have the complicated four-script writing system, and uses spaces between words, so that's a little more similar to English.

Hindi does have grammatical gender, with only masculine and feminine options. It doesn't come up as often as in German (or Spanish, for example), but could occasionally present a headache for choosing an identifier.

Python	अजगर	ajgar
x y (placeholders for identifiers)	X Y
class x:	TBD
def x(y1, y2):	TBD
if x:	यदि X:	yadi X:
x is y	X Y है	X Y hai
return x	X दे	X de
None	TBD
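The three rows that are already filled in can be sketched as a line-level rewrite. Since Hindi puts spaces between words, plain whitespace splitting is enough here. The function name is hypothetical, the sketch handles one construct per line, and it ignores indentation.

```python
# Hypothetical sketch covering only the rows of the table that are not TBD.
def hindi_to_python(line: str) -> str:
    """Rewrite one simple Hindi-dialect line as English Python."""
    tokens = line.split()
    if tokens[:1] == ["यदि"] and line.rstrip().endswith(":"):
        return "if " + " ".join(tokens[1:])      # यदि X:  ->  if X:
    if len(tokens) == 3 and tokens[-1] == "है":
        return f"{tokens[0]} is {tokens[1]}"      # X Y है  ->  X is Y
    if len(tokens) == 2 and tokens[-1] == "दे":
        return f"return {tokens[0]}"              # X दे    ->  return X
    return line


for sample in ("यदि X:", "X Y है", "X दे"):
    print(hindi_to_python(sample))
```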

I'm examining Hindi mainly as a nice stop on the road to a language that uses a right-to-left script (see below).

TODO: full Hindi example.

Urdu

Hindi and Urdu are closely related, and are usually described as two standardized registers of the same language, Hindustani. While Hindi is written largely phonemically in Devanagari script, Urdu uses its own variant of the Perso-Arabic script. Since Hindi and Urdu share most of their vocabulary, we can use mostly the same keywords. The main thing to contend with is the right-to-left ordering of the script.
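One way to see what the script direction means for source text: the characters are stored in the same logical order either way, and it's the Unicode bidirectional class of each character that tells a renderer to lay Perso-Arabic text out from right to left. A small sketch follows. The Urdu word اگر (agar, "if") is just my illustrative pick, not a keyword proposal, and `bidi_classes` is a made-up helper.

```python
import unicodedata


def bidi_classes(text: str) -> list[str]:
    """Return the Unicode bidirectional class of each character."""
    return [unicodedata.bidirectional(ch) for ch in text]


print(bidi_classes("अजगर"))  # Devanagari, left-to-right letters
print(bidi_classes("اگر"))   # Perso-Arabic letters, laid out right to left
```

So a parser can keep reading tokens in logical order, and the mixing of directions (for example, with ASCII operators on the same line) becomes a problem for the editor and the font rather than for the grammar.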

Also, I'm not sure what "monospaced" looks like in Arabic, especially compared to the Nastaliq style used for Urdu.

TODO: flesh out Urdu example.