Translating Source Code
I'd like to jot down a few ideas about how to enable computer programming without English.
Why would you want to do that? I can think of a few reasons:
To lower the barrier of entry for people who are fluent in a written language that isn't English.
To explore what programming languages can be like when removed from the constraints of English (and are placed in the constraints of another natural language).
To simply imagine other timelines where English is not the lingua-franca of computers.
There's two parts to enabling programming in a natural language: the identifiers devised by the programmer, and the keywords and syntax of the language the program is written itself.
Let's start with the simpler case: identifiers.
Identifiers
In many modern programming languages, you can use just about any strings for identifiers. They typically can't contain any punctuation besides "_" (I'm including spaces as punctuation), but otherwise you more or less stick whatever in there.
So nothing to do here, right?
Well, the fact is English is the lingua-franca of programming, so even if you have some, say, Japanese words in mind for your identifiers, you know you have to "translate" those to English if you want to actually work with anyone else.
What if you could document in a machine-readable way the identifiers you would use in a particular natural language, alongside the English ones that are your baseline for collaboration? This could be helpful to readers who are also familiar with that language. And maybe text editing software could make the substitutions when displaying the code.
We just need a way to annotate each scope to specify the non-English version of each of its identifiers. The data involved is not particularly complicated. We can use BCP-47 to specify a natural written language, and the rest is basically just one mapping for each scope.
One way to store that data might be in special comments:
# program-translation # from: en-US # to: en-x-piglatin # pt: Umbernay class Number: # pt: isway_egativenay(elfsay) def is_negative(self): ... # pt: isway_oddway(elfsay) def is_odd(self): ... # pt: absoluteway_aluevay(umbernay): def absolute_value(number: Number): if number is None: return None if number.is_negative(): # pt: egatedway_umbernay negated_number = -number return negated_number return number
This code is quite silly, but serves to demonstrate some of the challenges. Each "pt" comment gives translations for the identifier that is defined on the next line. Python's semantic whitespace means we can be sure there's at most one assignment per logical line, so writing a parser to match up the comments to the identifiers shouldn't be too difficult. Other languages with more free-form line-breaking might be trickier.
One obvious problem with the comment-based approach is the sheer amount noise this adds to the code. With a little more verbosity and some YAML, we could move the mappings to the comment the top of the file where we already specified languages:
# program-translation # from: en-US # to: en-x-piglatin # identifiers: # Number: # _: Umbernay # is_negative: # _: isway_egativenay # self: elfsay # is_odd: # _: isway_oddway # self: elfsay # absolute_value: # _: absoluteway_aluevay # number: umbernay # negated_number: egatedway_umbernay class Number: def is_negative(self): ... def is_odd(self): ... def absolute_value(number: Number): if number is None: return None if number.is_negative(): negated_number = -number return negated_number return number
Well, that's kind of ugly, but at least it's in one place.
This approach has the advantage of being easily extendible to multiple languages: just use multiple comments.
There's still a problem though. In the implementation of absolute_value,
how do we know what translation to use for the is_negative attribute on the
number variable? In a statically typed language this wouldn't be an issue. In
Python's ducktyping, we could look at type hints, but that means pulling in an
entire library like mypy just to do some string replacements. And still
wouldn't work if the types aren't hinted.
Let's take Python's duck-typing at face value, and just say "if it's a particular identifier in English, then there's only one translation".
# program-translation # from: en-US # to: en-x-piglatin # identifiers: # absolute_value: absoluteway_aluevay # is_negative: isway_egativenay # is_odd: isway_oddway # negated_number: egatedway_umbernay # number: umbernay # Number: Umbernay # self: elfsay class Number: def is_negative(self): ... def is_odd(self): ... def absolute_value(number: Number): if number is None: return None if number.is_negative(): negated_number = -number return negated_number return number
Hey, that's a bit nicer. Now we're just doing dumb string-substitution on the
identifiers. It also de-duplicated the translation of self.
Now, we may actually want different translations in different scopes. We could enable that with inline comments again, but then we have a noise problem again. Instead let's see if we can specify a scope and then some overrides.
# program-translation # from: en-US # to: en-x-piglatin # identifiers: # absolute_value: absoluteway_aluevay # is_negative: isway_egativenay # is_odd: isway_oddway # negated_number: egatedway_umbernay # number: umbernay # Number: Umbernay # self: elfsay # scopes: # absolute_value: # number: umberney class Number: def is_negative(self): ... def is_odd(self): ... def absolute_value(number: Number): if number is None: return None if number.is_negative(): negated_number = -number return negated_number return number
Here, we're saying specifically in the scope of the absolute_value function,
translate the identifier number differently.
What about imports? Well, the file importing could provide translations of all
their identifiers too, but I think it would be better to use separate files to
give those translations in one place. For example, assuming our code above is in
a module called number, the translations could live in a file called
number.en-x-piglatin.translations.yaml. This file could be either next to
the Python module, or in the same place in a hierarchy rooted at a different
directory. That would allow you to provide your own translations for other
people's code if they aren't packaged with them.
When displaying this source code, whatever software is doing that could give a "en-x-piglatin" option to show translated identifiers. I'm not sure what a good UX would be for editing, but I think you could imagine it.
For some commonly used identifiers, such as self, there might be some
translations configured at the package level, or even globally.
Syntax
Now for the really interesting part.
So, you got some editor or other piece of software that can show you the source code "translated" by replacing identifiers. Let's see what that would look like, using Shavian as a "translation" of our usual Latin-script English:
# program-translation # from: en-US # to: en-US-Shaw # identifiers: # absolute_value: ๐จ๐๐๐ฉ๐ค๐ต๐_๐๐จ๐ค๐๐ต # is_negative: ๐ฆ๐_๐ฏ๐ง๐๐ฉ๐๐ฆ๐ # is_odd: ๐ฆ๐_๐ท๐ # negated_number: ๐ฏ๐ฉ๐๐ฑ๐๐ฉ๐_๐ฏ๐ณ๐ฅ๐๐ป # number: ๐ณ_๐ฏ๐ณ๐ฅ๐๐ป # Number: ๐ฏ๐ณ๐ฅ๐๐ป # self: ๐๐ง๐ค๐: class ๐ฏ๐ณ๐ฅ๐๐ป: def ๐ฆ๐_๐ฏ๐ง๐๐ฉ๐๐ฆ๐(๐๐ง๐ค๐): ... def ๐ฆ๐_๐ท๐(๐๐ง๐ค๐): ... def ๐จ๐๐๐ฉ๐ค๐ต๐_๐๐จ๐ค๐๐ต(๐ณ_๐ฏ๐ณ๐ฅ๐๐ป: ๐ฏ๐ณ๐ฅ๐๐ป): if ๐ณ_๐ฏ๐ณ๐ฅ๐๐ป is None: return None if ๐ณ_๐ฏ๐ณ๐ฅ๐๐ป.๐ฆ๐_๐ฏ๐ง๐๐ฉ๐๐ฆ๐(): ๐ฏ๐ฉ๐๐ฑ๐๐ฉ๐_๐ฏ๐ณ๐ฅ๐๐ป = -๐ณ_๐ฏ๐ณ๐ฅ๐๐ป return ๐ฏ๐ฉ๐๐ฑ๐๐ฉ๐_๐ฏ๐ณ๐ฅ๐๐ป return ๐ณ_๐ฏ๐ณ๐ฅ๐๐ป
I've used Shavian so it's extra clear what "untranslated" English remains after dealing with identifiers.
This is progress, but there's still a ways to go! Even in this trivial example there are 5 keywords. Also there's the "embedded" English syntax; for example: "if" comes before a condition, "return" comes before the thing being returned, "is" is between to two things being copula'd.
This also maybe a good time to point out that "def" isn't exactly English.[1] It's short for "define", but function definition is such a common and fundamental thing that someone (van Rossum?) decided to abbreviate it. So, while Python certainly has an English influence, it's not entirley beholden to it. especially with the conveinence of the language is at stake. This principle will come up later when thinking about what Python would look like under the influence of other languages.
If you're super observant, you'll also notice our first translation wrinkle.
The class Number was glossed as ๐ฏ๐ณ๐ฅ๐๐ป, yet the variable number was
changed to ๐ณ_๐ฏ๐ณ๐ฅ๐๐ป, which is my Shavian spelling of "a number". This is
because Shavian has no casing distinctions, hence there's no way to have a
capitalized convention for class names.[2] Even when dealing with
just English in a different alphabet, we're encountering a way that the
ergonomics of the programming language will have to change.
If we wanted to complete our Shavian version of Python, we would need to pick spellings for the 38 or so keywords in the language. Keywords aren't typically added often, so, unlike identifiers, this information could be standardized and provided from a central place.
To "translate" code into a Shavian Python, we'll use these rules for keywords and syntax:
Python (en-US) |
ยท๐๐ฒ๐๐ท๐ฏ (en-US-Shaw Python) |
|---|---|
x y (placeholders for identifiers) |
๐ ๐ฃ |
class x: |
๐๐ค๐จ๐ ๐: |
def x(y1, y2): |
๐๐ง๐ ๐(๐ฃ1, ๐ฃ1): |
if x: |
๐ฆ๐ ๐: |
x is y |
๐ ๐ฆ๐ ๐ฃ |
return x |
๐ฎ๐ฉ๐๐ป๐ฏ ๐ |
None |
๐ฏ๐ณ๐ฏ |
# program-translation # from: en-US # to: en-US-Shaw # identifiers: # absolute_value: ๐จ๐๐๐ฉ๐ค๐ต๐_๐๐จ๐ค๐๐ต # is_negative: ๐ฆ๐_๐ฏ๐ง๐๐ฉ๐๐ฆ๐ # is_odd: ๐ฆ๐_๐ท๐ # negated_number: ๐ฏ๐ฉ๐๐ฑ๐๐ฉ๐_๐ฏ๐ณ๐ฅ๐๐ป # number: ๐ณ_๐ฏ๐ณ๐ฅ๐๐ป # Number: ๐ฏ๐ณ๐ฅ๐๐ป # self: ๐๐ง๐ค๐: ๐๐ค๐จ๐ ๐ฏ๐ณ๐ฅ๐๐ป: ๐๐ง๐ ๐ฆ๐_๐ฏ๐ง๐๐ฉ๐๐ฆ๐(๐๐ง๐ค๐): ... ๐๐ง๐ ๐ฆ๐_๐ท๐(๐๐ง๐ค๐): ... ๐๐ง๐ ๐จ๐๐๐ฉ๐ค๐ต๐_๐๐จ๐ค๐๐ต(๐ณ_๐ฏ๐ณ๐ฅ๐๐ป: ๐ฏ๐ณ๐ฅ๐๐ป): ๐ฆ๐ ๐ณ_๐ฏ๐ณ๐ฅ๐๐ป ๐ฆ๐ ๐ฏ๐ณ๐ฏ: ๐ฎ๐ฉ๐๐ป๐ฏ ๐ฏ๐ณ๐ฏ ๐ฆ๐ ๐ณ_๐ฏ๐ณ๐ฅ๐๐ป.๐ฆ๐_๐ฏ๐ง๐๐ฉ๐๐ฆ๐(): ๐ฏ๐ฉ๐๐ฑ๐๐ฉ๐_๐ฏ๐ณ๐ฅ๐๐ป = -๐ณ_๐ฏ๐ณ๐ฅ๐๐ป ๐ฎ๐ฉ๐๐ป๐ฏ ๐ฏ๐ฉ๐๐ฑ๐๐ฉ๐_๐ฏ๐ณ๐ฅ๐๐ป ๐ฎ๐ฉ๐๐ป๐ฏ ๐ณ_๐ฏ๐ณ๐ฅ๐๐ป
Well, it's lost the syntax highlighting of course, but there we can finally see a Python program fully "translated".
While Shavian programming is a neat trick, we didn't really learn much new about programming languages. Let's try a few more natural languages that have some more substantial differences to English.
German
German presents an interesting wrinkle: in standard written German, all nouns are capitalized. Let's think about what that would mean in a progamming language.
Programming languages can be thought of in (at least) two different ways.
One way is the perspective of functional programming: a function tells a computer what to do. You can define functions and invoke functions. That's it. All "data" is just defined but not-yet-invoked functions. In this perspective, we can relate function invocation to verbs, specifically imperative verbs. When the source code invokes the function, it's saying "do(this)". When the source code defines a function (or other data), we can think of that as a noun, with the copula assigning the noun to its meaning. Thus, a given identifier might have a "noun" meaning in one context, and a "verb" one in another.
In reality, even in very function-oriented languages, there are types of data besides functions, such as numbers or strings. Since those aren't executable, their identifiers always have the "noun" sense.
Another way to think of a program is as a state machine with limitless room to store its state. The state is the nouns, and the different transitions between those are verbs.
So let's define a "verb" as a function (or other "callable"), and a "noun" as everything else. Conceptually, the callables are nouns until you invoke them, but most of the time the only thing you do with them is invoke them.
If you wanted to make a Very Germanโข progamming language, you might have identifiers always be nouns when being defined (and thus capitalized), but if their value is a function, they are lowercased on function invocation:
This defines a function called Tun ("do") that takes one argument. It then
defines a variable Sachen ("stuff") that is assigned the value 4. Then
Tun is invoked with Sachen as its argument. No parenthesis are necessary
in the function invocation since the lowercased name indicates it's being
invoked. There's also a -n suffix in the noun form. While using casing to
distinguish definition from invocation is a cool idea, this is probably an
example of embedding too much of a natural language in a programming language.
Let's "translating" Python into German. Here's our keyword and syntax table:
Python (en-US) |
Python[3] (de Python) |
|---|---|
x y (placeholders for identifiers) |
x y |
class x: |
Klasse x: |
def x(y1, y2): |
def x(y1, y1): |
if x: |
wenn x: |
x is y |
x ist y |
return x |
gib y |
None |
Nichts |
This is pretty straightforward. The keyword for class definition, Klasse, is
capitalized since it's a noun. The keyword for callable definition is still
"def", but now it's short for definier.
Please note that, in general, I'm not fluent in these other languages I'm going to use in examples. What you see here is based on what I know from reading dictionaries and grammars. If you have suggestions for better translations, feel free to share.
Let's look at our little example program:
# program-translation # from: en # to: de # identifiers: # absolute_value: Absolutwert # is_negative: ist_negative # is_odd: ist_ungerade # negated_number: negiert_Zahl # number: ein_Zahl # Number: Zahl # self: Selbst Klass Zahl: def ist_negative(Selbst): ... def ist_ungerade(Selbst): ... def Absolutwert(ein_Zahl: Zahl): wenn ein_Zahl ist Nichts: gib Nichts wenn ein_Zahl.ist_negative(): negiert_Zahl = -ein_Zahl gib negiert_Zahl gib ein_Zahl
Note the function with a noun name Absolutwert. In code typically a callable
with a noun name means "make one of this noun out of this other noun I'm giving
you". If we wanted to be pedantic, we could call it mach_Absolutwert, "make
absolute value".
The verbs should be in the imperative tense, since the program is telling the computer to take some action. They can also use the "familiar" form instead of the "formal" one. The computer is du instead of Sie.[4]
Now let's consider grammatical gender. Verbs in German must change form depending on the "gender" of the subject of the clause. Thankfully, since all the verbs are orders to the computer, we can just use whatever gender "computer" is assigned in German. And really, we don't even need to worry about that, because imperative verbs don't do gender-agreement.
However, adjectives also must have gender-agreement. The word Zahl has a
"feminine" grammatical gender, the adjective negative in the method name
ist_negative is the feminine form, and not ist_negativer (masculine) or
ist_negatives (neuter).
However let's image we write (in English) an "is negative" function that just returns whether something is less than zero.
How do we translate this? We don't know what gender "thing" will have. We could
assume feminine since that's what Zahl is, but that's kind of against the
spirit of ducktyping. "thing" can by anything for which the < operator is
defined. For example you could define a Vector type which works with this
function. Its name would be translated as Vektor, which is masculine. Should
you make three aliases for the function, ist_negatives, ist_negativer,
and ist_negative? Should the language support some kind of regex-identifier
like def ist_negative[rs]?(Ding): ..., that would concisely define those
three aliases?
We should probably just pick a convention and stick with it. Python doesn't
always match English grammar conventions, and that's OK. I'm not a native
speaker, but to me either the shortest form negative or the neuter one
negatives seems like a good choice.
To Be Continued
Next time I'll look at what happens when we try to "translate" into Japanese, Hindi, and Urdu.