.. title: Translating Source Code
.. date: 2023-09-01
.. type: text
.. category: programming
.. tags: language, programming

I'd like to jot down a few ideas about how to enable computer programming without English. Why would you want to do that? I can think of a few reasons:

1. To lower the barrier to entry for people who are fluent in a written language that isn't English.
2. To explore what programming languages can be like when removed from the constraints of English (and placed under the constraints of another natural language).
3. To simply imagine other timelines where English is not the lingua franca of computers.

There are two parts to enabling programming in a natural language: the identifiers devised by the programmer, and the keywords and syntax of the language the program itself is written in. Let's start with the simpler case: identifiers.

Identifiers
===========

In many modern programming languages, you can use just about any string as an identifier. They typically can't contain any punctuation besides "_" (I'm including spaces as punctuation), but otherwise you can more or less stick whatever in there. So nothing to do here, right?

Well, the fact is English is the lingua franca of programming, so even if you have some, say, Japanese words in mind for your identifiers, you know you have to "translate" those to English if you want to actually work with anyone else.

What if you could document, in a machine-readable way, the identifiers you would use in a particular natural language, alongside the English ones that are your baseline for collaboration? This could be helpful to readers who are also familiar with that language. And maybe text editing software could make the substitutions when displaying the code.

We just need a way to annotate each scope to specify the non-English version of each of its identifiers. The data involved is not particularly complicated.
We can use `BCP-47`_ to specify a natural written language, and the rest is basically just one mapping for each scope. One way to store that data might be in special comments:

.. code-block:: python

    # program-translation
    #   from: en-US
    #   to: en-x-piglatin

    # pt: Umbernay
    class Number:
        # pt: isway_egativenay(elfsay)
        def is_negative(self): ...

        # pt: isway_oddway(elfsay)
        def is_odd(self): ...


    # pt: absoluteway_aluevay(umbernay)
    def absolute_value(number: Number):
        if number is None:
            return None
        if number.is_negative():
            # pt: egatedway_umbernay
            negated_number = -number
            return negated_number
        return number

This code is quite silly, but serves to demonstrate some of the challenges. Each "pt" comment gives translations for the identifiers defined on the next line. Python's semantic whitespace means we can be sure there's at most one assignment per logical line, so writing a parser to match up the comments to the identifiers shouldn't be too difficult. Other languages with more free-form line-breaking might be trickier.

One obvious problem with the comment-based approach is the sheer amount of noise this adds to the code. With a little more verbosity and some YAML, we could move the mappings to the comment at the top of the file where we already specified languages:

.. code-block:: python

    # program-translation
    #   from: en-US
    #   to: en-x-piglatin
    #   identifiers:
    #     Number:
    #       _: Umbernay
    #       is_negative:
    #         _: isway_egativenay
    #         self: elfsay
    #       is_odd:
    #         _: isway_oddway
    #         self: elfsay
    #     absolute_value:
    #       _: absoluteway_aluevay
    #       number: umbernay
    #       negated_number: egatedway_umbernay

    class Number:
        def is_negative(self): ...

        def is_odd(self): ...


    def absolute_value(number: Number):
        if number is None:
            return None
        if number.is_negative():
            negated_number = -number
            return negated_number
        return number

Well, that's kind of ugly, but at least it's in one place. This approach has the advantage of being easily extensible to multiple languages: just use multiple comments.
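
As a rough illustration of how little machinery the header comment needs, here is a minimal sketch that reads a ``# program-translation`` block back into nested dictionaries. The names ``read_translation_header``, ``HEADER_MARKER``, and ``SAMPLE`` are hypothetical, and the two-space indentation inside the comment is an assumption made for this sketch. It only understands the tiny ``key: value`` subset of YAML used in this post and only looks at the first block in a file; a real tool would more likely hand the uncommented text to a proper YAML parser.

.. code-block:: python

    HEADER_MARKER = "# program-translation"  # assumed marker line, as used in this post

    SAMPLE = """\
    # program-translation
    #   from: en-US
    #   to: en-x-piglatin
    #   identifiers:
    #     Number:
    #       _: Umbernay
    #       is_negative:
    #         _: isway_egativenay
    #         self: elfsay
    #     absolute_value:
    #       _: absoluteway_aluevay
    #       number: umbernay

    class Number: ...
    """.replace("\n    ", "\n")  # undo the indentation added by this code block


    def read_translation_header(source: str) -> dict:
        """Parse the first program-translation comment block into nested dicts."""
        lines = source.splitlines()
        start = next(
            (i for i, line in enumerate(lines) if line.strip() == HEADER_MARKER),
            None,
        )
        if start is None:
            return {}

        root: dict = {}
        # Stack of (indentation, mapping); deeper indentation means a nested mapping.
        stack = [(-1, root)]
        for line in lines[start + 1:]:
            if not line.startswith("#"):
                break  # the comment block has ended
            body = line[1:]
            if not body.strip():
                continue  # skip empty comment lines
            indent = len(body) - len(body.lstrip())
            key, _, value = body.strip().partition(":")
            value = value.strip()
            # Pop back to the mapping that owns this indentation level.
            while stack[-1][0] >= indent:
                stack.pop()
            parent = stack[-1][1]
            if value:
                parent[key] = value
            else:
                child: dict = {}
                parent[key] = child
                stack.append((indent, child))
        return root


    header = read_translation_header(SAMPLE)
    print(header["to"])
    print(header["identifiers"]["Number"]["is_negative"])

Supporting several target languages would then just mean collecting every such block in the file and keying the results by their ``to`` tag.
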
There's `still` a problem though. In the implementation of ``absolute_value``, how do we know what translation to use for the ``is_negative`` attribute on the ``number`` variable? In a statically typed language this wouldn't be an issue. With Python's duck typing, we `could` look at type hints, but that means pulling in an entire tool like ``mypy`` just to do some string replacements. And it still wouldn't work if the types aren't hinted.

Let's take Python's duck typing at face value, and just say "if it's a particular identifier in English, then there's only one translation".

.. code-block:: python

    # program-translation
    #   from: en-US
    #   to: en-x-piglatin
    #   identifiers:
    #     absolute_value: absoluteway_aluevay
    #     is_negative: isway_egativenay
    #     is_odd: isway_oddway
    #     negated_number: egatedway_umbernay
    #     number: umbernay
    #     Number: Umbernay
    #     self: elfsay

    class Number:
        def is_negative(self): ...

        def is_odd(self): ...


    def absolute_value(number: Number):
        if number is None:
            return None
        if number.is_negative():
            negated_number = -number
            return negated_number
        return number

Hey, that's a bit nicer. Now we're just doing dumb string substitution on the identifiers. It also de-duplicated the translation of ``self``.

Now, we may actually want different translations in different scopes. We could enable that with inline comments again, but then we have a noise problem again. Instead let's see if we can specify a scope and then some overrides.

.. code-block:: python

    # program-translation
    #   from: en-US
    #   to: en-x-piglatin
    #   identifiers:
    #     absolute_value: absoluteway_aluevay
    #     is_negative: isway_egativenay
    #     is_odd: isway_oddway
    #     negated_number: egatedway_umbernay
    #     number: umbernay
    #     Number: Umbernay
    #     self: elfsay
    #   scopes:
    #     absolute_value:
    #       number: umberney

    class Number:
        def is_negative(self): ...

        def is_odd(self): ...


    def absolute_value(number: Number):
        if number is None:
            return None
        if number.is_negative():
            negated_number = -number
            return negated_number
        return number

Here, we're saying that specifically in the scope of the ``absolute_value`` function, the identifier ``number`` should be translated differently.

What about imports? Well, the file doing the importing could provide translations of all `their` identifiers too, but I think it would be better to use separate files to give those translations in one place. For example, assuming our code above is in a module called ``number``, the translations could live in a file called ``number.en-x-piglatin.translations.yaml``. This file could be either next to the Python module, or in the same place in a hierarchy rooted at a different directory. That would allow you to provide your own translations for other people's code if they aren't packaged with it.

When displaying this source code, whatever software is doing the displaying could give an "en-x-piglatin" option to show translated identifiers. I'm not sure what a good UX would be for editing, but I think you could imagine it.

For some commonly used identifiers, such as ``self``, there might be some translations configured at the package level, or even globally.

Syntax
======

Now for the really interesting part. So, you've got some editor or other piece of software that can show you the source code "translated" by replacing identifiers. Let's see what that would look like, using Shavian_ as a "translation" of our usual Latin-script English:

.. code-block:: python

    # program-translation
    #   from: en-US
    #   to: en-US-Shaw
    #   identifiers:
    #     absolute_value: 𐑨𐑚𐑕𐑩𐑤𐑵𐑑_𐑝𐑨𐑤𐑘𐑵
    #     is_negative: 𐑦𐑟_𐑯𐑧𐑜𐑩𐑛𐑦𐑝
    #     is_odd: 𐑦𐑟_𐑷𐑛
    #     negated_number: 𐑯𐑩𐑜𐑱𐑛𐑩𐑑_𐑯𐑳𐑥𐑚𐑻
    #     number: 𐑳_𐑯𐑳𐑥𐑚𐑻
    #     Number: 𐑯𐑳𐑥𐑚𐑻
    #     self: 𐑕𐑧𐑤𐑓

    class 𐑯𐑳𐑥𐑚𐑻:
        def 𐑦𐑟_𐑯𐑧𐑜𐑩𐑛𐑦𐑝(𐑕𐑧𐑤𐑓): ...

        def 𐑦𐑟_𐑷𐑛(𐑕𐑧𐑤𐑓): ...


    def 𐑨𐑚𐑕𐑩𐑤𐑵𐑑_𐑝𐑨𐑤𐑘𐑵(𐑳_𐑯𐑳𐑥𐑚𐑻: 𐑯𐑳𐑥𐑚𐑻):
        if 𐑳_𐑯𐑳𐑥𐑚𐑻 is None:
            return None
        if 𐑳_𐑯𐑳𐑥𐑚𐑻.𐑦𐑟_𐑯𐑧𐑜𐑩𐑛𐑦𐑝():
            𐑯𐑩𐑜𐑱𐑛𐑩𐑑_𐑯𐑳𐑥𐑚𐑻 = -𐑳_𐑯𐑳𐑥𐑚𐑻
            return 𐑯𐑩𐑜𐑱𐑛𐑩𐑑_𐑯𐑳𐑥𐑚𐑻
        return 𐑳_𐑯𐑳𐑥𐑚𐑻

I've used Shavian so it's extra clear what "untranslated" English remains after dealing with identifiers. This is progress, but there's still a ways to go! Even in this trivial example there are six keywords. Also there's the "embedded" English syntax; for example: "if" comes before a condition, "return" comes before the thing being returned, "is" goes between the two things being copula_'d.

This may also be a good time to point out that "def" isn't exactly English.\ [#def]_ It's short for "define", but function definition is such a common and fundamental thing that someone (`van Rossum`_?) decided to abbreviate it. So, while Python certainly has an English influence, it's not entirely beholden to it, especially when the convenience of the language is at stake. This principle will come up later when thinking about what Python would look like under the influence of other languages.

If you're `super` observant, you'll also notice our first translation wrinkle.
The class ``Number`` was glossed as ``𐑯𐑳𐑥𐑚𐑻``, yet the variable ``number`` was changed to ``𐑳_𐑯𐑳𐑥𐑚𐑻``, which is my Shavian spelling of "a number". This is because Shavian has no casing distinctions, hence there's no way to have a capitalized convention for class names.\ [#namingDot]_ Even when dealing with just English in a different alphabet, we're encountering a way that the ergonomics of the programming language will have to change.

If we wanted to complete our Shavian version of Python, we would need to pick spellings for the 38 or so keywords in the language. Keywords aren't typically added often, so, unlike identifiers, this information could be standardized and provided from a central place.

To "translate" code into a Shavian Python, we'll use these rules for keywords and syntax:

======================================  ==============
Python (en-US)                          ·𐑐𐑲𐑔𐑷𐑯 (en-US-Shaw Python)
======================================  ==============
*x* *y* (placeholders for identifiers)  *𐑙* *𐑣*
class *x*:                              𐑒𐑤𐑨𐑕 *𐑙*:
def *x*\ (\ *y1*, *y2*):                𐑛𐑧𐑓 *𐑙*\ (\ *𐑣1*, *𐑣2*):
if *x*:                                 𐑦𐑓 *𐑙*:
*x* is *y*                              *𐑙* 𐑦𐑟 *𐑣*
return *x*                              𐑮𐑩𐑑𐑻𐑯 *𐑙*
None                                    𐑯𐑳𐑯
======================================  ==============

.. code-block:: python

    # program-translation
    #   from: en-US
    #   to: en-US-Shaw
    #   identifiers:
    #     absolute_value: 𐑨𐑚𐑕𐑩𐑤𐑵𐑑_𐑝𐑨𐑤𐑘𐑵
    #     is_negative: 𐑦𐑟_𐑯𐑧𐑜𐑩𐑛𐑦𐑝
    #     is_odd: 𐑦𐑟_𐑷𐑛
    #     negated_number: 𐑯𐑩𐑜𐑱𐑛𐑩𐑑_𐑯𐑳𐑥𐑚𐑻
    #     number: 𐑳_𐑯𐑳𐑥𐑚𐑻
    #     Number: 𐑯𐑳𐑥𐑚𐑻
    #     self: 𐑕𐑧𐑤𐑓

    𐑒𐑤𐑨𐑕 𐑯𐑳𐑥𐑚𐑻:
        𐑛𐑧𐑓 𐑦𐑟_𐑯𐑧𐑜𐑩𐑛𐑦𐑝(𐑕𐑧𐑤𐑓): ...

        𐑛𐑧𐑓 𐑦𐑟_𐑷𐑛(𐑕𐑧𐑤𐑓): ...


    𐑛𐑧𐑓 𐑨𐑚𐑕𐑩𐑤𐑵𐑑_𐑝𐑨𐑤𐑘𐑵(𐑳_𐑯𐑳𐑥𐑚𐑻: 𐑯𐑳𐑥𐑚𐑻):
        𐑦𐑓 𐑳_𐑯𐑳𐑥𐑚𐑻 𐑦𐑟 𐑯𐑳𐑯:
            𐑮𐑩𐑑𐑻𐑯 𐑯𐑳𐑯
        𐑦𐑓 𐑳_𐑯𐑳𐑥𐑚𐑻.𐑦𐑟_𐑯𐑧𐑜𐑩𐑛𐑦𐑝():
            𐑯𐑩𐑜𐑱𐑛𐑩𐑑_𐑯𐑳𐑥𐑚𐑻 = -𐑳_𐑯𐑳𐑥𐑚𐑻
            𐑮𐑩𐑑𐑻𐑯 𐑯𐑩𐑜𐑱𐑛𐑩𐑑_𐑯𐑳𐑥𐑚𐑻
        𐑮𐑩𐑑𐑻𐑯 𐑳_𐑯𐑳𐑥𐑚𐑻

Well, it's lost the syntax highlighting of course, but there we can finally see a Python program fully "translated".

While Shavian programming is a neat trick, we didn't really learn much new about programming languages. Let's try a few more natural languages that have some more substantial differences from English.

German
------

German presents an interesting wrinkle: in standard written German, all nouns are capitalized. Let's think about what that would mean in a programming language.

Programming languages can be thought of in (at least) two different ways. One way is the perspective of functional programming: a function tells a computer what to do. You can define functions and invoke functions. That's it. All "data" is just defined but not-yet-invoked functions. In this perspective, we can relate function invocation to verbs, specifically imperative verbs. When the source code invokes the function, it's saying "do(this)". When the source code defines a function (or other data), we can think of that as a noun, with the copula assigning the noun to its meaning. Thus, a given identifier might have a "noun" meaning in one context, and a "verb" one in another.

In reality, even in very function-oriented languages, there are types of data besides functions, such as numbers or strings. Since those aren't executable, their identifiers always have the "noun" sense.

Another way to think of a program is as a state machine with limitless room to store its state.
The state is the nouns, and the different transitions between those are verbs.

So let's define a "verb" as a function (or other "callable"), and a "noun" as everything else. Conceptually, the callables are nouns until you invoke them, but most of the time the only thing you do with them is invoke them.

If you wanted to make a Very German™ programming language, you might have identifiers always be nouns when being defined (and thus capitalized), but if their value is a function, they are lowercased on function invocation:

.. code-block::

    Tun = (x): ...
    Sachen = 4
    tu Sachen

This defines a function called ``Tun`` ("do") that takes one argument. It then defines a variable ``Sachen`` ("stuff") that is assigned the value 4. Then ``Tun`` is invoked with ``Sachen`` as its argument. No parentheses are necessary in the function invocation since the lowercased name indicates it's being invoked. There's also a ``-n`` suffix in the noun form.

While using casing to distinguish definition from invocation is a cool idea, this is probably an example of embedding too much of a natural language in a programming language.

Let's try "translating" Python into German. Here's our keyword and syntax table:

======================================  ==============
Python (en-US)                          Python\ [#germanPython]_ (de Python)
======================================  ==============
*x* *y* (placeholders for identifiers)  *x* *y*
class *x*:                              Klasse *x*:
def *x*\ (\ *y1*, *y2*):                def *x*\ (\ *y1*, *y2*):
if *x*:                                 wenn *x*:
*x* is *y*                              *x* ist *y*
return *x*                              gib *x*
None                                    Nichts
======================================  ==============

This is pretty straightforward. The keyword for class definition, ``Klasse``, is capitalized since it's a noun. The keyword for callable definition is still "def", but now it's short for *definier*.

Please note that, in general, I'm not fluent in these other languages I'm going to use in examples. What you see here is based on what I know from reading dictionaries and grammars.
If you have suggestions for better translations, feel free to share.

Let's look at our little example program:

.. code-block::

    # program-translation
    #   from: en
    #   to: de
    #   identifiers:
    #     absolute_value: Absolutwert
    #     is_negative: ist_negative
    #     is_odd: ist_ungerade
    #     negated_number: negiert_Zahl
    #     number: ein_Zahl
    #     Number: Zahl
    #     self: Selbst

    Klasse Zahl:
        def ist_negative(Selbst): ...

        def ist_ungerade(Selbst): ...


    def Absolutwert(ein_Zahl: Zahl):
        wenn ein_Zahl ist Nichts:
            gib Nichts
        wenn ein_Zahl.ist_negative():
            negiert_Zahl = -ein_Zahl
            gib negiert_Zahl
        gib ein_Zahl

Note the function with a noun name, ``Absolutwert``. In code, a callable with a noun name typically means "make one of this noun out of this other noun I'm giving you". If we wanted to be pedantic, we could call it ``mach_Absolutwert``, "make absolute value".

The verbs should be in the imperative mood, since the program is telling the computer to take some action. They can also use the "familiar" form instead of the "formal" one. The computer is *du* instead of *Sie*.\ [#polite]_

Now let's consider grammatical gender. German verbs agree with the subject of the clause in person and number, not in gender, and since all our verbs are imperative orders to the computer, there's really only one form to pick. So we don't need to worry about gender there at all, whatever gender "computer" is assigned in German.

However, adjectives must have gender-agreement. Since the word *Zahl* has a "feminine" grammatical gender, the adjective *negative* in the method name ``ist_negative`` is in the feminine form, rather than ``ist_negativer`` (masculine) or ``ist_negatives`` (neuter).

However, let's imagine we write (in English) an "is negative" function that just returns whether something is less than zero.

.. code-block:: python

    def is_negative(thing):
        return thing < 0

How do we translate this? We don't know what gender "thing" will have.
We could assume feminine since that's what *Zahl* is, but that's kind of against the spirit of duck typing. "thing" can be anything for which the ``<`` operator is defined. For example, you could define a ``Vector`` type which works with this function. Its name would be translated as ``Vektor``, which is masculine.

Should you make three aliases for the function, ``ist_negatives``, ``ist_negativer``, and ``ist_negative``? Should the language support some kind of regex-identifier like ``def ist_negative[rs]?(Ding): ...`` that would concisely define those three aliases?

We should probably just pick a convention and stick with it. Python doesn't always match English grammar conventions, and that's OK. I'm not a native speaker, but to me either the shortest form ``negative`` or the neuter one ``negatives`` seems like a good choice.

To Be Continued
---------------

Next time I'll look at what happens when we try to "translate" into Japanese, Hindi, and Urdu.

.. [#def] At least, not the "working English" used in understandable source code. "Def" can be seen (or even heard) as short for "definitely", which is not what it means here. I don't think Mos Def had functions in mind when picking his stage name.

.. [#namingDot] Yes, I'm aware of the Shavian "naming dot" ("·"), and it actually would make sense as a class name convention. But I wanted to make the point about the natural language forcing changes to the programming language.

.. [#germanPython] The German word for "python" (the snake) is "Python".

.. [#polite] I feel that being polite to a computer is overly superstitious if you want to be the one programming it.

.. _`BCP-47`: https://www.rfc-editor.org/info/bcp47
.. _Shavian: https://en.wikipedia.org/wiki/Shavian_alphabet
.. _copula: https://en.wikipedia.org/wiki/Copula_(linguistics)
.. _`van Rossum`: https://en.wikipedia.org/wiki/Guido_van_Rossum