I'd like to jot down a few ideas about how to enable computer programming
without English.
There's two parts to enabling programming in a natural language: the identifiers
devised by the programmer, and the keywords and syntax of the language the
program is written itself.
Let's start with the simpler case: identifiers.
Identifiers
In many modern programming languages, you can use just about any strings for
identifiers. They typically can't contain any punctuation besides "_" (I'm
including spaces as punctuation), but otherwise you more or less stick whatever
in there.
So nothing to do here, right?
Well, the fact is English is the lingua-franca of programming, so even if you
have some, say, Japanese words in mind for your identifiers, you know you have
to "translate" those to English if you want to actually work with anyone else.
What if you could document in a machine-readable way the identifiers you would
use in a particular natural language, alongside the English ones that are your
baseline for collaboration? This could be helpful to readers who are also
familiar with that language. And maybe text editing software could make the
substitutions when displaying the code.
We just need a way to annotate each scope to specify the non-English version of
each of its identifiers. The data involved is not particularly complicated.
We can use BCP-47 to specify a natural written language, and the rest is
basically just one mapping for each scope.
One way to store that data might be in special comments:
# program-translation
# from: en-US
# to: en-x-piglatin
# pt: Umbernay
class Number:
# pt: isway_egativenay(elfsay)
def is_negative(self):
...
# pt: isway_oddway(elfsay)
def is_odd(self):
...
# pt: absoluteway_aluevay(umbernay):
def absolute_value(number: Number):
if number is None:
return None
if number.is_negative():
# pt: egatedway_umbernay
negated_number = -number
return negated_number
return number
This code is quite silly, but serves to demonstrate some of the challenges. Each
"pt" comment gives translations for the identifier that is defined on the next
line. Python's semantic whitespace means we can be sure there's at most one
assignment per logical line, so writing a parser to match up the comments to the
identifiers shouldn't be too difficult. Other languages with more free-form
line-breaking might be trickier.
One obvious problem with the comment-based approach is the sheer amount noise
this adds to the code. With a little more verbosity and some YAML, we could move
the mappings to the comment the top of the file where we already specified
languages:
# program-translation
# from: en-US
# to: en-x-piglatin
# identifiers:
# Number:
# _: Umbernay
# is_negative:
# _: isway_egativenay
# self: elfsay
# is_odd:
# _: isway_oddway
# self: elfsay
# absolute_value:
# _: absoluteway_aluevay
# number: umbernay
# negated_number: egatedway_umbernay
class Number:
def is_negative(self):
...
def is_odd(self):
...
def absolute_value(number: Number):
if number is None:
return None
if number.is_negative():
negated_number = -number
return negated_number
return number
Well, that's kind of ugly, but at least it's in one place.
This approach has the advantage of being easily extendible to multiple
languages: just use multiple comments.
There's still a problem though. In the implementation of absolute_value,
how do we know what translation to use for the is_negative attribute on the
number variable? In a statically typed language this wouldn't be an issue. In
Python's ducktyping, we could look at type hints, but that means pulling in an
entire library like mypy just to do some string replacements. And still
wouldn't work if the types aren't hinted.
Let's take Python's duck-typing at face value, and just say "if it's a
particular identifier in English, then there's only one translation".
# program-translation
# from: en-US
# to: en-x-piglatin
# identifiers:
# absolute_value: absoluteway_aluevay
# is_negative: isway_egativenay
# is_odd: isway_oddway
# negated_number: egatedway_umbernay
# number: umbernay
# Number: Umbernay
# self: elfsay
class Number:
def is_negative(self):
...
def is_odd(self):
...
def absolute_value(number: Number):
if number is None:
return None
if number.is_negative():
negated_number = -number
return negated_number
return number
Hey, that's a bit nicer. Now we're just doing dumb string-substitution on the
identifiers. It also de-duplicated the translation of self.
Now, we may actually want different translations in different scopes. We could
enable that with inline comments again, but then we have a noise problem again.
Instead let's see if we can specify a scope and then some overrides.
# program-translation
# from: en-US
# to: en-x-piglatin
# identifiers:
# absolute_value: absoluteway_aluevay
# is_negative: isway_egativenay
# is_odd: isway_oddway
# negated_number: egatedway_umbernay
# number: umbernay
# Number: Umbernay
# self: elfsay
# scopes:
# absolute_value:
# number: umberney
class Number:
def is_negative(self):
...
def is_odd(self):
...
def absolute_value(number: Number):
if number is None:
return None
if number.is_negative():
negated_number = -number
return negated_number
return number
Here, we're saying specifically in the scope of the absolute_value function,
translate the identifier number differently.
What about imports? Well, the file importing could provide translations of all
their identifiers too, but I think it would be better to use separate files to
give those translations in one place. For example, assuming our code above is in
a module called number, the translations could live in a file called
number.en-x-piglatin.translations.yaml. This file could be either next to
the Python module, or in the same place in a hierarchy rooted at a different
directory. That would allow you to provide your own translations for other
people's code if they aren't packaged with them.
When displaying this source code, whatever software is doing that could give a
"en-x-piglatin" option to show translated identifiers. I'm not sure what a good
UX would be for editing, but I think you could imagine it.
For some commonly used identifiers, such as self, there might be some
translations configured at the package level, or even globally.
Syntax
Now for the really interesting part.
So, you got some editor or other piece of software that can show you the source
code "translated" by replacing identifiers. Let's see what that would look like,
using Shavian as a "translation" of our usual Latin-script English:
# program-translation
# from: en-US
# to: en-US-Shaw
# identifiers:
# absolute_value: 𐑨𐑚𐑕𐑩𐑤𐑵𐑑_𐑝𐑨𐑤𐑘𐑵
# is_negative: 𐑦𐑟_𐑯𐑧𐑜𐑩𐑛𐑦𐑝
# is_odd: 𐑦𐑟_𐑷𐑛
# negated_number: 𐑯𐑩𐑜𐑱𐑛𐑩𐑑_𐑯𐑳𐑥𐑚𐑻
# number: 𐑳_𐑯𐑳𐑥𐑚𐑻
# Number: 𐑯𐑳𐑥𐑚𐑻
# self: 𐑕𐑧𐑤𐑓:
class 𐑯𐑳𐑥𐑚𐑻:
def 𐑦𐑟_𐑯𐑧𐑜𐑩𐑛𐑦𐑝(𐑕𐑧𐑤𐑓):
...
def 𐑦𐑟_𐑷𐑛(𐑕𐑧𐑤𐑓):
...
def 𐑨𐑚𐑕𐑩𐑤𐑵𐑑_𐑝𐑨𐑤𐑘𐑵(𐑳_𐑯𐑳𐑥𐑚𐑻: 𐑯𐑳𐑥𐑚𐑻):
if 𐑳_𐑯𐑳𐑥𐑚𐑻 is None:
return None
if 𐑳_𐑯𐑳𐑥𐑚𐑻.𐑦𐑟_𐑯𐑧𐑜𐑩𐑛𐑦𐑝():
𐑯𐑩𐑜𐑱𐑛𐑩𐑑_𐑯𐑳𐑥𐑚𐑻 = -𐑳_𐑯𐑳𐑥𐑚𐑻
return 𐑯𐑩𐑜𐑱𐑛𐑩𐑑_𐑯𐑳𐑥𐑚𐑻
return 𐑳_𐑯𐑳𐑥𐑚𐑻
I've used Shavian so it's extra clear what "untranslated" English remains after
dealing with identifiers.
This is progress, but there's still a ways to go! Even in this trivial example
there are 5 keywords. Also there's the "embedded" English syntax; for example:
"if" comes before a condition, "return" comes before the thing being returned,
"is" is between to two things being copula'd.
This also maybe a good time to point out that "def" isn't exactly English. It's short for "define", but function definition is such a common and
fundamental thing that someone (van Rossum?) decided to abbreviate it. So,
while Python certainly has an English influence, it's not entirley beholden to
it. especially with the conveinence of the language is at stake. This principle
will come up later when thinking about what Python would look like
under the influence of other languages.
If you're super observant, you'll also notice our first translation wrinkle.
The class Number was glossed as 𐑯𐑳𐑥𐑚𐑻, yet the variable number was
changed to 𐑳_𐑯𐑳𐑥𐑚𐑻, which is my Shavian spelling of "a number". This is
because Shavian has no casing distinctions, hence there's no way to have a
capitalized convention for class names. Even when dealing with
just English in a different alphabet, we're encountering a way that the
ergonomics of the programming language will have to change.
If we wanted to complete our Shavian version of Python, we would need to pick
spellings for the 38 or so keywords in the language. Keywords aren't typically
added often, so, unlike identifiers, this information could be standardized and
provided from a central place.
To "translate" code into a Shavian Python, we'll use these rules for keywords
and syntax:
Python (en-US) |
·𐑐𐑲𐑔𐑷𐑯 (en-US-Shaw Python) |
x y (placeholders for identifiers) |
𐑙 𐑣 |
class x: |
𐑒𐑤𐑨𐑕 𐑙: |
def x(y1, y2): |
𐑛𐑧𐑓 𐑙(𐑣1, 𐑣1): |
if x: |
𐑦𐑓 𐑙: |
x is y |
𐑙 𐑦𐑟 𐑣 |
return x |
𐑮𐑩𐑑𐑻𐑯 𐑙 |
None |
𐑯𐑳𐑯 |
# program-translation
# from: en-US
# to: en-US-Shaw
# identifiers:
# absolute_value: 𐑨𐑚𐑕𐑩𐑤𐑵𐑑_𐑝𐑨𐑤𐑘𐑵
# is_negative: 𐑦𐑟_𐑯𐑧𐑜𐑩𐑛𐑦𐑝
# is_odd: 𐑦𐑟_𐑷𐑛
# negated_number: 𐑯𐑩𐑜𐑱𐑛𐑩𐑑_𐑯𐑳𐑥𐑚𐑻
# number: 𐑳_𐑯𐑳𐑥𐑚𐑻
# Number: 𐑯𐑳𐑥𐑚𐑻
# self: 𐑕𐑧𐑤𐑓:
𐑒𐑤𐑨𐑕 𐑯𐑳𐑥𐑚𐑻:
𐑛𐑧𐑓 𐑦𐑟_𐑯𐑧𐑜𐑩𐑛𐑦𐑝(𐑕𐑧𐑤𐑓):
...
𐑛𐑧𐑓 𐑦𐑟_𐑷𐑛(𐑕𐑧𐑤𐑓):
...
𐑛𐑧𐑓 𐑨𐑚𐑕𐑩𐑤𐑵𐑑_𐑝𐑨𐑤𐑘𐑵(𐑳_𐑯𐑳𐑥𐑚𐑻: 𐑯𐑳𐑥𐑚𐑻):
𐑦𐑓 𐑳_𐑯𐑳𐑥𐑚𐑻 𐑦𐑟 𐑯𐑳𐑯:
𐑮𐑩𐑑𐑻𐑯 𐑯𐑳𐑯
𐑦𐑓 𐑳_𐑯𐑳𐑥𐑚𐑻.𐑦𐑟_𐑯𐑧𐑜𐑩𐑛𐑦𐑝():
𐑯𐑩𐑜𐑱𐑛𐑩𐑑_𐑯𐑳𐑥𐑚𐑻 = -𐑳_𐑯𐑳𐑥𐑚𐑻
𐑮𐑩𐑑𐑻𐑯 𐑯𐑩𐑜𐑱𐑛𐑩𐑑_𐑯𐑳𐑥𐑚𐑻
𐑮𐑩𐑑𐑻𐑯 𐑳_𐑯𐑳𐑥𐑚𐑻
Well, it's lost the syntax highlighting of course, but there we can finally see
a Python program fully "translated".
While Shavian programming is a neat trick, we didn't really learn much new about
programming languages. Let's try a few more natural languages that have some
more substantial differences to English.
German
German presents an interesting wrinkle: in standard written German, all nouns
are capitalized. Let's think about what that would mean in a progamming language.
Programming languages can be thought of in (at least) two different ways.
One way is the perspective of functional programming: a function tells a
computer what to do. You can define functions and invoke functions. That's it.
All "data" is just defined but not-yet-invoked functions. In this perspective,
we can relate function invocation to verbs, specifically imperative verbs. When
the source code invokes the function, it's saying "do(this)". When the source
code defines a function (or other data), we can think of that as a noun, with
the copula assigning the noun to its meaning. Thus, a given identifier might
have a "noun" meaning in one context, and a "verb" one in another.
In reality, even in very function-oriented languages, there are types of data
besides functions, such as numbers or strings. Since those aren't executable,
their identifiers always have the "noun" sense.
Another way to think of a program is as a state machine with limitless room to
store its state. The state is the nouns, and the different transitions between
those are verbs.
So let's define a "verb" as a function (or other "callable"), and a "noun" as
everything else. Conceptually, the callables are nouns until you invoke them,
but most of the time the only thing you do with them is invoke them.
If you wanted to make a Very German™ progamming language, you might have
identifiers always be nouns when being defined (and thus capitalized), but if
their value is a function, they are lowercased on function invocation:
Tun = (x): ...
Sachen = 4
tu Sachen
This defines a function called Tun ("do") that takes one argument. It then
defines a variable Sachen ("stuff") that is assigned the value 4. Then
Tun is invoked with Sachen as its argument. No parenthesis are necessary
in the function invocation since the lowercased name indicates it's being
invoked. There's also a -n suffix in the noun form. While using casing to
distinguish definition from invocation is a cool idea, this is probably an
example of embedding too much of a natural language in a programming language.
Let's "translating" Python into German. Here's our keyword and syntax table:
Python (en-US) |
Python (de Python) |
x y (placeholders for identifiers) |
x y |
class x: |
Klasse x: |
def x(y1, y2): |
def x(y1, y1): |
if x: |
wenn x: |
x is y |
x ist y |
return x |
gib y |
None |
Nichts |
This is pretty straightforward. The keyword for class definition, Klasse, is
capitalized since it's a noun. The keyword for callable definition is still
"def", but now it's short for definier.
Please note that, in general, I'm not fluent in these other languages I'm going
to use in examples. What you see here is based on what I know from reading
dictionaries and grammars. If you have suggestions for better translations, feel
free to share.
Let's look at our little example program:
# program-translation
# from: en
# to: de
# identifiers:
# absolute_value: Absolutwert
# is_negative: ist_negative
# is_odd: ist_ungerade
# negated_number: negiert_Zahl
# number: ein_Zahl
# Number: Zahl
# self: Selbst
Klass Zahl:
def ist_negative(Selbst):
...
def ist_ungerade(Selbst):
...
def Absolutwert(ein_Zahl: Zahl):
wenn ein_Zahl ist Nichts:
gib Nichts
wenn ein_Zahl.ist_negative():
negiert_Zahl = -ein_Zahl
gib negiert_Zahl
gib ein_Zahl
Note the function with a noun name Absolutwert. In code typically a callable
with a noun name means "make one of this noun out of this other noun I'm giving
you". If we wanted to be pedantic, we could call it mach_Absolutwert, "make
absolute value".
The verbs should be in the imperative tense, since the program is telling the
computer to take some action. They can also use the "familiar" form instead of
the "formal" one. The computer is du instead of Sie.
Now let's consider grammatical gender. Verbs in German must change form
depending on the "gender" of the subject of the clause. Thankfully, since all
the verbs are orders to the computer, we can just use whatever gender "computer"
is assigned in German. And really, we don't even need to worry about that,
because imperative verbs don't do gender-agreement.
However, adjectives also must have gender-agreement. The word Zahl has a
"feminine" grammatical gender, the adjective negative in the method name
ist_negative is the feminine form, and not ist_negativer (masculine) or
ist_negatives (neuter).
However let's image we write (in English) an "is negative" function that just
returns whether something is less than zero.
def is_negative(thing):
return thing < 0
How do we translate this? We don't know what gender "thing" will have. We could
assume feminine since that's what Zahl is, but that's kind of against the
spirit of ducktyping. "thing" can by anything for which the < operator is
defined. For example you could define a Vector type which works with this
function. Its name would be translated as Vektor, which is masculine. Should
you make three aliases for the function, ist_negatives, ist_negativer,
and ist_negative? Should the language support some kind of regex-identifier
like def ist_negative[rs]?(Ding): ..., that would concisely define those
three aliases?
We should probably just pick a convention and stick with it. Python doesn't
always match English grammar conventions, and that's OK. I'm not a native
speaker, but to me either the shortest form negative or the neuter one
negatives seems like a good choice.
To Be Continued
Next time I'll look at what happens when we try to "translate" into Japanese,
Hindi, and Urdu.