Translating Source Code

Morgan Wahl

2023-09-01 00:00

I'd like to jot down a few ideas about how to enable computer programming without English.

Why would you want to do that? I can think of a few reasons:

To lower the barrier of entry for people who are fluent in a written language that isn't English.
To explore what programming languages can be like when removed from the constraints of English (and are placed in the constraints of another natural language).
To simply imagine other timelines where English is not the lingua-franca of computers.

There are two parts to enabling programming in a natural language: the identifiers devised by the programmer, and the keywords and syntax of the language the program itself is written in.

Let's start with the simpler case: identifiers.

Identifiers

In many modern programming languages, you can use just about any string for an identifier. Identifiers typically can't contain any punctuation besides "_" (I'm including spaces as punctuation), but otherwise you can more or less stick whatever you like in there.

So nothing to do here, right?

Well, the fact is English is the lingua-franca of programming, so even if you have some, say, Japanese words in mind for your identifiers, you know you have to "translate" those to English if you want to actually work with anyone else.

What if you could document in a machine-readable way the identifiers you would use in a particular natural language, alongside the English ones that are your baseline for collaboration? This could be helpful to readers who are also familiar with that language. And maybe text editing software could make the substitutions when displaying the code.

We just need a way to annotate each scope to specify the non-English version of each of its identifiers. The data involved is not particularly complicated. We can use BCP-47 to specify a natural written language, and the rest is basically just one mapping for each scope.

One way to store that data might be in special comments:

# program-translation
# from: en-US
# to: en-x-piglatin

# pt: Umbernay
class Number:
    # pt: isway_egativenay(elfsay)
    def is_negative(self):
        ...

    # pt: isway_oddway(elfsay)
    def is_odd(self):
        ...

# pt: absoluteway_aluevay(umbernay)
def absolute_value(number: Number):
    if number is None:
        return None
    if number.is_negative():
        # pt: egatedway_umbernay
        negated_number = -number
        return negated_number
    return number

This code is quite silly, but serves to demonstrate some of the challenges. Each "pt" comment gives translations for the identifier that is defined on the next line. Python's semantic whitespace means we can be sure there's at most one assignment per logical line, so writing a parser to match up the comments to the identifiers shouldn't be too difficult. Other languages with more free-form line-breaking might be trickier.
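Here's a rough sketch of what such a parser could look like. It's my own guess at an approach, not a finished tool: Python's ast module reports which names each def, class, or simple assignment introduces, and a regex finds the "pt" comments. The names collect_translations and defined_names are made up for illustration, and only plain positional parameters are handled.

```python
import ast
import re

# A "pt" comment, e.g. "# pt: isway_oddway(elfsay)".
PT_COMMENT = re.compile(r"^\s*#\s*pt:\s*(.+)$")
# Identifier-like words (letters, digits, underscores, not starting with a digit).
IDENTIFIER = re.compile(r"[^\W\d]\w*")


def defined_names(node):
    """Return the identifiers a statement introduces, in source order."""
    if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
        # Assumption: only ordinary positional parameters are translated.
        return [node.name] + [arg.arg for arg in node.args.args]
    if isinstance(node, ast.ClassDef):
        return [node.name]
    if isinstance(node, ast.Assign):
        return [t.id for t in node.targets if isinstance(t, ast.Name)]
    return []


def collect_translations(source):
    """Map each English identifier to the translation in its "pt" comment."""
    names_by_line = {}
    for node in ast.walk(ast.parse(source)):
        names = defined_names(node)
        if names:
            names_by_line[node.lineno] = names

    translations = {}
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = PT_COMMENT.match(line)
        if match is None:
            continue
        translated = IDENTIFIER.findall(match.group(1))
        # The comment describes whatever is defined on the very next line.
        originals = names_by_line.get(lineno + 1, [])
        translations.update(zip(originals, translated))
    return translations
```

Pairing names by position keeps the comments short, at the cost of silently mismatching if a comment lists its names in a different order than the definition does.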

One obvious problem with the comment-based approach is the sheer amount of noise this adds to the code. With a little more verbosity and some YAML, we could move the mappings to the comment at the top of the file where we already specified languages:

# program-translation
# from: en-US
# to: en-x-piglatin
# identifiers:
#   Number:
#     _: Umbernay
#     is_negative:
#       _: isway_egativenay
#       self: elfsay
#     is_odd:
#       _: isway_oddway
#       self: elfsay
#   absolute_value:
#     _: absoluteway_aluevay
#     number: umbernay
#     negated_number: egatedway_umbernay

class Number:
    def is_negative(self):
        ...

    def is_odd(self):
        ...

def absolute_value(number: Number):
    if number is None:
        return None
    if number.is_negative():
        negated_number = -number
        return negated_number
    return number

Well, that's kind of ugly, but at least it's in one place.

This approach has the advantage of being easily extendible to multiple languages: just use multiple comments.

There's still a problem though. In the implementation of absolute_value, how do we know what translation to use for the is_negative attribute on the number variable? In a statically typed language this wouldn't be an issue. With Python's duck typing, we could look at type hints, but that means pulling in an entire library like mypy just to do some string replacements. And it still wouldn't work if the types aren't hinted.

Let's take Python's duck-typing at face value, and just say "if it's a particular identifier in English, then there's only one translation".

# program-translation
# from: en-US
# to: en-x-piglatin
# identifiers:
#   absolute_value: absoluteway_aluevay
#   is_negative: isway_egativenay
#   is_odd: isway_oddway
#   negated_number: egatedway_umbernay
#   number: umbernay
#   Number: Umbernay
#   self: elfsay

class Number:
    def is_negative(self):
        ...

    def is_odd(self):
        ...

def absolute_value(number: Number):
    if number is None:
        return None
    if number.is_negative():
        negated_number = -number
        return negated_number
    return number

Hey, that's a bit nicer. Now we're just doing dumb string-substitution on the identifiers. It also de-duplicated the translation of self.
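To make "dumb string-substitution" a little more concrete, here's a minimal sketch using only the standard library. It assumes the flat header format shown above, and the function names read_flat_identifiers and translate_identifiers are hypothetical. Going through the tokenize module rather than a plain search-and-replace means strings and comments (including the header itself) are left alone.

```python
import io
import tokenize


def read_flat_identifiers(source):
    """Read the flat "identifiers:" mapping from the leading comment block."""
    mapping = {}
    in_identifiers = False
    for line in source.splitlines():
        if not line.startswith("#"):
            break  # the header ends at the first non-comment line
        text = line[1:].strip()
        if text == "identifiers:":
            in_identifiers = True
        elif in_identifiers and ":" in text:
            english, _, translated = text.partition(":")
            mapping[english.strip()] = translated.strip()
    return mapping


def translate_identifiers(source, mapping):
    """Replace every NAME token found in mapping, leaving other text alone."""
    lines = io.StringIO(source).readlines()
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    # Work backwards so earlier column offsets stay valid after each edit.
    for tok in reversed(tokens):
        if tok.type == tokenize.NAME and tok.string in mapping:
            row, start = tok.start
            line = lines[row - 1]
            lines[row - 1] = line[:start] + mapping[tok.string] + line[tok.end[1]:]
    return "".join(lines)
```

An editor could call translate_identifiers(source, read_flat_identifiers(source)) when displaying a file, and swap the keys and values of the mapping to go back the other way.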

Now, we may actually want different translations in different scopes. We could enable that with inline comments again, but then we have a noise problem again. Instead let's see if we can specify a scope and then some overrides.

# program-translation
# from: en-US
# to: en-x-piglatin
# identifiers:
#   absolute_value: absoluteway_aluevay
#   is_negative: isway_egativenay
#   is_odd: isway_oddway
#   negated_number: egatedway_umbernay
#   number: umbernay
#   Number: Umbernay
#   self: elfsay
# scopes:
#   absolute_value:
#     number: umberney

class Number:
    def is_negative(self):
        ...

    def is_odd(self):
        ...

def absolute_value(number: Number):
    if number is None:
        return None
    if number.is_negative():
        negated_number = -number
        return negated_number
    return number

Here, we're saying specifically in the scope of the absolute_value function, translate the identifier number differently.
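One way this could work, sketched under some assumptions of mine: scopes are keyed by the bare name of a def or class, a scope covers every line from its def/class line to the end of its body, and inner scopes win over outer ones. The names scope_ranges and translate_with_scopes are invented for this example.

```python
import ast
import io
import tokenize


def scope_ranges(source):
    """Return (first_line, last_line, name) for every def and class."""
    ranges = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            ranges.append((node.lineno, node.end_lineno, node.name))
    return ranges


def translate_with_scopes(source, identifiers, scopes):
    """Substitute identifiers, preferring overrides from the enclosing scope."""
    ranges = sorted(scope_ranges(source))  # outer scopes start first
    lines = io.StringIO(source).readlines()
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    # Work backwards so earlier column offsets stay valid after each edit.
    for tok in reversed(tokens):
        if tok.type != tokenize.NAME:
            continue
        row, start = tok.start
        mapping = dict(identifiers)
        # Layer on overrides from the outermost enclosing scope inwards.
        for first, last, name in ranges:
            if first <= row <= last:
                mapping.update(scopes.get(name, {}))
        if tok.string in mapping:
            line = lines[row - 1]
            lines[row - 1] = line[:start] + mapping[tok.string] + line[tok.end[1]:]
    return "".join(lines)
```

Keying scopes by bare name means two functions with the same name in different classes would share overrides; a dotted path like Number.is_negative would be the obvious refinement.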

What about imports? Well, the importing file could provide translations for all the identifiers it imports too, but I think it would be better to use separate files to give those translations in one place. For example, assuming our code above is in a module called number, the translations could live in a file called number.en-x-piglatin.translations.yaml. This file could be either next to the Python module, or in the same place in a hierarchy rooted at a different directory. That would allow you to provide your own translations for other people's code if they aren't packaged with it.
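The lookup described here is simple enough to sketch. This is only an illustration: the function name candidate_translation_files and its parameters are hypothetical, and the search order (next to the module first, then each alternative root) is my assumption.

```python
from pathlib import Path


def candidate_translation_files(module_path, language, source_root, extra_roots=()):
    """List the places a module's translation file might live, in search order."""
    module_path = Path(module_path)
    file_name = f"{module_path.stem}.{language}.translations.yaml"
    # First choice: right next to the Python module.
    candidates = [module_path.with_name(file_name)]
    # Otherwise: the same relative location under each alternative root.
    relative_dir = module_path.parent.relative_to(source_root)
    for root in extra_roots:
        candidates.append(Path(root) / relative_dir / file_name)
    return candidates
```

For a module at src/pkg/number.py with source root src and an extra root my_translations, this would yield src/pkg/number.en-x-piglatin.translations.yaml followed by my_translations/pkg/number.en-x-piglatin.translations.yaml.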

When displaying this source code, whatever software is doing that could give an "en-x-piglatin" option to show translated identifiers. I'm not sure what a good UX would be for editing, but I think you could imagine it.

For some commonly used identifiers, such as self, there might be some translations configured at the package level, or even globally.

Syntax

Now for the really interesting part.

So, you've got some editor or other piece of software that can show you the source code "translated" by replacing identifiers. Let's see what that would look like, using Shavian as a "translation" of our usual Latin-script English:

# program-translation
# from: en-US
# to: en-US-Shaw
# identifiers:
#   absolute_value: 𐑨𐑚𐑕𐑩𐑤𐑵𐑑_𐑝𐑨𐑤𐑘𐑵
#   is_negative: 𐑦𐑟_𐑯𐑧𐑜𐑩𐑛𐑦𐑝
#   is_odd: 𐑦𐑟_𐑷𐑛
#   negated_number: 𐑯𐑩𐑜𐑱𐑛𐑩𐑑_𐑯𐑳𐑥𐑚𐑻
#   number: 𐑳_𐑯𐑳𐑥𐑚𐑻
#   Number: 𐑯𐑳𐑥𐑚𐑻
#   self: 𐑕𐑧𐑤𐑓

class 𐑯𐑳𐑥𐑚𐑻:
    def 𐑦𐑟_𐑯𐑧𐑜𐑩𐑛𐑦𐑝(𐑕𐑧𐑤𐑓):
        ...

    def 𐑦𐑟_𐑷𐑛(𐑕𐑧𐑤𐑓):
        ...

def 𐑨𐑚𐑕𐑩𐑤𐑵𐑑_𐑝𐑨𐑤𐑘𐑵(𐑳_𐑯𐑳𐑥𐑚𐑻: 𐑯𐑳𐑥𐑚𐑻):
    if 𐑳_𐑯𐑳𐑥𐑚𐑻 is None:
        return None
    if 𐑳_𐑯𐑳𐑥𐑚𐑻.𐑦𐑟_𐑯𐑧𐑜𐑩𐑛𐑦𐑝():
        𐑯𐑩𐑜𐑱𐑛𐑩𐑑_𐑯𐑳𐑥𐑚𐑻 = -𐑳_𐑯𐑳𐑥𐑚𐑻
        return 𐑯𐑩𐑜𐑱𐑛𐑩𐑑_𐑯𐑳𐑥𐑚𐑻
    return 𐑳_𐑯𐑳𐑥𐑚𐑻

I've used Shavian so it's extra clear what "untranslated" English remains after dealing with identifiers.

This is progress, but there's still a ways to go! Even in this trivial example there are 6 keywords. Also there's the "embedded" English syntax; for example: "if" comes before a condition, "return" comes before the thing being returned, "is" is between the two things being copula'd.

This may also be a good time to point out that "def" isn't exactly English.[1] It's short for "define", but function definition is such a common and fundamental thing that someone (van Rossum?) decided to abbreviate it. So, while Python certainly has an English influence, it's not entirely beholden to it, especially when the convenience of the language is at stake. This principle will come up later when thinking about what Python would look like under the influence of other languages.

If you're super observant, you'll also notice our first translation wrinkle. The class Number was glossed as 𐑯𐑳𐑥𐑚𐑻, yet the variable number was changed to 𐑳_𐑯𐑳𐑥𐑚𐑻, which is my Shavian spelling of "a number". This is because Shavian has no casing distinctions, hence there's no way to have a capitalized convention for class names.[2] Even when dealing with just English in a different alphabet, we're encountering a way that the ergonomics of the programming language will have to change.

If we wanted to complete our Shavian version of Python, we would need to pick spellings for the 38 or so keywords in the language. Keywords aren't typically added often, so, unlike identifiers, this information could be standardized and provided from a central place.

To "translate" code into a Shavian Python, we'll use these rules for keywords and syntax:

Python (en-US)	·𐑐𐑲𐑔𐑷𐑯 (en-US-Shaw Python)
x y (placeholders for identifiers)	𐑙 𐑣
class x:	𐑒𐑤𐑨𐑕 𐑙:
def x(y1, y2):	𐑛𐑧𐑓 𐑙(𐑣1, 𐑣2):
if x:	𐑦𐑓 𐑙:
x is y	𐑙 𐑦𐑟 𐑣
return x	𐑮𐑩𐑑𐑻𐑯 𐑙
None	𐑯𐑳𐑯

# program-translation
# from: en-US
# to: en-US-Shaw
# identifiers:
#   absolute_value: 𐑨𐑚𐑕𐑩𐑤𐑵𐑑_𐑝𐑨𐑤𐑘𐑵
#   is_negative: 𐑦𐑟_𐑯𐑧𐑜𐑩𐑛𐑦𐑝
#   is_odd: 𐑦𐑟_𐑷𐑛
#   negated_number: 𐑯𐑩𐑜𐑱𐑛𐑩𐑑_𐑯𐑳𐑥𐑚𐑻
#   number: 𐑳_𐑯𐑳𐑥𐑚𐑻
#   Number: 𐑯𐑳𐑥𐑚𐑻
#   self: 𐑕𐑧𐑤𐑓

𐑒𐑤𐑨𐑕 𐑯𐑳𐑥𐑚𐑻:
    𐑛𐑧𐑓 𐑦𐑟_𐑯𐑧𐑜𐑩𐑛𐑦𐑝(𐑕𐑧𐑤𐑓):
        ...

    𐑛𐑧𐑓 𐑦𐑟_𐑷𐑛(𐑕𐑧𐑤𐑓):
        ...

𐑛𐑧𐑓 𐑨𐑚𐑕𐑩𐑤𐑵𐑑_𐑝𐑨𐑤𐑘𐑵(𐑳_𐑯𐑳𐑥𐑚𐑻: 𐑯𐑳𐑥𐑚𐑻):
    𐑦𐑓 𐑳_𐑯𐑳𐑥𐑚𐑻 𐑦𐑟 𐑯𐑳𐑯:
        𐑮𐑩𐑑𐑻𐑯 𐑯𐑳𐑯
    𐑦𐑓 𐑳_𐑯𐑳𐑥𐑚𐑻.𐑦𐑟_𐑯𐑧𐑜𐑩𐑛𐑦𐑝():
        𐑯𐑩𐑜𐑱𐑛𐑩𐑑_𐑯𐑳𐑥𐑚𐑻 = -𐑳_𐑯𐑳𐑥𐑚𐑻
        𐑮𐑩𐑑𐑻𐑯 𐑯𐑩𐑜𐑱𐑛𐑩𐑑_𐑯𐑳𐑥𐑚𐑻
    𐑮𐑩𐑑𐑻𐑯 𐑳_𐑯𐑳𐑥𐑚𐑻

Well, it's lost the syntax highlighting of course, but there we can finally see a Python program fully "translated".

While Shavian programming is a neat trick, we didn't really learn much new about programming languages. Let's try a few more natural languages that have some more substantial differences to English.

German

German presents an interesting wrinkle: in standard written German, all nouns are capitalized. Let's think about what that would mean in a programming language.

Programming languages can be thought of in (at least) two different ways.

One way is the perspective of functional programming: a function tells a computer what to do. You can define functions and invoke functions. That's it. All "data" is just defined but not-yet-invoked functions. In this perspective, we can relate function invocation to verbs, specifically imperative verbs. When the source code invokes the function, it's saying "do(this)". When the source code defines a function (or other data), we can think of that as a noun, with the copula assigning the noun to its meaning. Thus, a given identifier might have a "noun" meaning in one context, and a "verb" one in another.

In reality, even in very function-oriented languages, there are types of data besides functions, such as numbers or strings. Since those aren't executable, their identifiers always have the "noun" sense.

Another way to think of a program is as a state machine with limitless room to store its state. The state is the nouns, and the different transitions between those are verbs.

So let's define a "verb" as a function (or other "callable"), and a "noun" as everything else. Conceptually, the callables are nouns until you invoke them, but most of the time the only thing you do with them is invoke them.

If you wanted to make a Very German™ programming language, you might have identifiers always be nouns when being defined (and thus capitalized), but if their value is a function, they are lowercased on function invocation:

Tun = (x): ...
Sachen = 4
tu Sachen

This defines a function called Tun ("do") that takes one argument. It then defines a variable Sachen ("stuff") that is assigned the value 4. Then Tun is invoked with Sachen as its argument. No parentheses are necessary in the function invocation since the lowercased name indicates it's being invoked. There's also a -n suffix in the noun form. While using casing to distinguish definition from invocation is a cool idea, this is probably an example of embedding too much of a natural language in a programming language.

Let's try "translating" Python into German. Here's our keyword and syntax table:

Python (en-US)	Python[3] (de Python)
x y (placeholders for identifiers)	x y
class x:	Klasse x:
def x(y1, y2):	def x(y1, y2):
if x:	wenn x:
x is y	x ist y
return x	gib x
None	Nichts

This is pretty straightforward. The keyword for class definition, Klasse, is capitalized since it's a noun. The keyword for callable definition is still "def", but now it's short for definier.
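Since a keyword table like this is fixed and shared, applying it can be purely mechanical. Here's a small sketch, assuming only the handful of keywords from the table above; the names GERMAN_KEYWORDS, ENGLISH_KEYWORDS, and swap_keywords are my own. Python's tokenize module only splits text into tokens and doesn't check grammar, so the same function can also run in reverse on the German text.

```python
import io
import tokenize

# Keyword spellings from the table above ("def" stays as it is).
GERMAN_KEYWORDS = {
    "class": "Klasse",
    "if": "wenn",
    "is": "ist",
    "return": "gib",
    "None": "Nichts",
}

# The inverse table turns German Python back into something CPython can run.
ENGLISH_KEYWORDS = {german: english for english, german in GERMAN_KEYWORDS.items()}


def swap_keywords(source, table):
    """Replace each whole-word token found in table with its counterpart."""
    lines = io.StringIO(source).readlines()
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    # Work backwards so earlier column offsets stay valid after each edit.
    for tok in reversed(tokens):
        if tok.type == tokenize.NAME and tok.string in table:
            row, start = tok.start
            line = lines[row - 1]
            lines[row - 1] = line[:start] + table[tok.string] + line[tok.end[1]:]
    return "".join(lines)
```

Because only whole tokens are swapped, an identifier like ist_negative is untouched even though it starts with the keyword ist. The catch is that an identifier spelled exactly like a translated keyword (a variable called gib, say) would collide, so a real tool would need to reserve those words.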

Please note that, in general, I'm not fluent in these other languages I'm going to use in examples. What you see here is based on what I know from reading dictionaries and grammars. If you have suggestions for better translations, feel free to share.

Let's look at our little example program:

# program-translation
# from: en
# to: de
# identifiers:
#   absolute_value: Absolutwert
#   is_negative: ist_negative
#   is_odd: ist_ungerade
#   negated_number: negiert_Zahl
#   number: ein_Zahl
#   Number: Zahl
#   self: Selbst

Klasse Zahl:
    def ist_negative(Selbst):
        ...

    def ist_ungerade(Selbst):
        ...

def Absolutwert(ein_Zahl: Zahl):
    wenn ein_Zahl ist Nichts:
        gib Nichts
    wenn ein_Zahl.ist_negative():
        negiert_Zahl = -ein_Zahl
        gib negiert_Zahl
    gib ein_Zahl

Note the function with a noun name, Absolutwert. In code, a callable with a noun name typically means "make one of this noun out of this other noun I'm giving you". If we wanted to be pedantic, we could call it mach_Absolutwert, "make absolute value".

The verbs should be in the imperative mood, since the program is telling the computer to take some action. They can also use the "familiar" form instead of the "formal" one. The computer is du instead of Sie.[4]

Now let's consider grammatical gender. Verbs in German must change form depending on the "gender" of the subject of the clause. Thankfully, since all the verbs are orders to the computer, we can just use whatever gender "computer" is assigned in German. And really, we don't even need to worry about that, because imperative verbs don't do gender-agreement.

However, adjectives also must have gender-agreement. The word Zahl has a "feminine" grammatical gender, so the adjective negative in the method name ist_negative is the feminine form, and not ist_negativer (masculine) or ist_negatives (neuter).

However, let's imagine we write (in English) an "is negative" function that just returns whether something is less than zero.

def is_negative(thing):
    return thing < 0

How do we translate this? We don't know what gender "thing" will have. We could assume feminine since that's what Zahl is, but that's kind of against the spirit of duck typing. "thing" can be anything for which the < operator is defined. For example you could define a Vector type which works with this function. Its name would be translated as Vektor, which is masculine. Should you make three aliases for the function, ist_negatives, ist_negativer, and ist_negative? Should the language support some kind of regex-identifier like def ist_negative[rs]?(Ding): ..., that would concisely define those three aliases?
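For what it's worth, the three-aliases option can already be faked in ordinary Python. This is just a sketch of the idea, not a proposal: with_aliases is a made-up decorator that binds the same function under extra names in whatever namespace you hand it.

```python
def with_aliases(namespace, *names):
    """Decorator: also bind the function under each extra name in namespace."""
    def decorate(function):
        for name in names:
            namespace[name] = function
        return function
    return decorate


# One definition, three spellings, all pointing at the same function object.
@with_aliases(globals(), "ist_negativer", "ist_negatives")
def ist_negative(Ding):
    return Ding < 0
```

It works, but every call site now has to pick a spelling, which is the same decision the regex-identifier idea would push onto readers.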

We should probably just pick a convention and stick with it. Python doesn't always match English grammar conventions, and that's OK. I'm not a native speaker, but to me either the shortest form negative or the neuter one negatives seems like a good choice.

To Be Continued

Next time I'll look at what happens when we try to "translate" into Japanese, Hindi, and Urdu.