The Luna programming language

A few weeks ago I started implementing Luna, an elegant, expressive, and embeddable language written in ANSI C, with performance as a primary focus. For now it’s just a side project, but I thought it would be fun to blog about the different aspects as they unfold, building languages is a really fun process.

Goals

The current goals of Luna are the following:

  • small, simple, expressive, and explicit syntax
  • fast, fast, and fast
  • robust reflection capabilities
  • opt-in callee evaluated messages
  • register based VM
  • embeddable

About Luna’s Syntax

Luna’s syntax is inspired by Python, IO, and Lua, bits of Ruby, and of course, JavaScript.

The language itself will utilize “pure” prototypes for inheritance. When I say “pure”, I mean no confusing class/prototype hybrid monster barfing up poor design, which we all face in JavaScript. The way JavaScript handles prototypes adds to the confusion of what may already be a completely new concept to some. Contrasting with JavaScript this means losing constructors, ditching .prototype, the new operator, instanceof, typeof and other clunky “features”.

Take the following JavaScript for example:

function User(name) {
  this.name = user;
}

User.prototype.toString = function(){
 return this.name;
}

new User('tj')

This is a big W-T-F for some people, some get confused thinking it’s a class, and some get lost trying to understand that the prototype property has to do with anything at all. Simply put, a “prototype” is just an object, that’s it, it’s an object that you “clone”, which for performance reasons typically means you have a new object with the “parent” in a special prototype chain for property or “slot” lookup. That’s it! adding constructors and funky props and stupid keywords to the mix just makes things needlessly confusing, and really takes away from what would otherwise be a pretty elegant language.

Whitespace

Our earlier JavaScript User example might be written as the following, where we first clone the primitive Object, aka the prototype chain now contains Object, then we assign the slot (property) new to a function taking a single argument name. Within this function we clone the User object, so now the prototype chain contains Object, User, and assign the user’s name slot to the arg.

User = Object clone()

User new =: name
  user = User clone()
  user name = name
  user

User toString =:
  self name

tj = User new('tj')

There are a few things to note about Luna’s syntax here. The first being that linear white-space is the slot or “property” delimiter, not . like many languages. Another thing to note is the function literal syntax, which is simply a : character followed by optional params and a block.

Many of you know my thoughts regarding coffeescript, and like I’ve stated in Luna’s Readme, it’s not significant whitespace that I dislike, though it can become nasty very quickly in large doses. It’s of course an honor for any programmer to have a large number of followers, but the reality is that there’s almost nothing original about the language, and instead of striving for something elegant and simple, it just throws more sugar and syntax at the “problems”. The worst part about significant whitespace is that deep nesting, perhaps when defining methods on a class, it’s easy completely lose context, and in a big file/class that can get very annoying, very quickly. The fact that you must specify the receiver for each method reaffirms context within Luna.

EDIT: I’ve seen enough coffeescript used in practice now that looks really bad… so I’m likely going to go with a more lua-like syntax

Function Literals

So now that you’ve seen the function literal, you’re probably thinking what the hell… yes, the function literal is a single character, :. Consider the following JavaScript function:

 var greet = function(user, msg) {
   console.log('hello ' + user + ' ' + msg);
 };

The equivalent Luna script would be:

greet =: user msg
  console log('hello ' . user . ' ' . msg)

While this looks like an operator (=:), it’s simply convention, the following is valid as well:

greet = : user msg
  console log(...)

You’ll see that the params do not require a comma, this is because Luna, like JavaScript, does not allow defining defaults in the signature. Personally I find this more flexible as params are often shifted and swapped based on their types or arity, and I find the clutter of defaults makes the signature difficult to read, clouding what would otherwise at a glance give you a hint at what the function does.

greet =: user msg
  console log('hello #{user} #{msg}')

Another example would be a selection of person’s ferrets older than 4. In Ruby this might look similar to below:

person.pets.select do |pet|
  pet.species == 'ferret' and pet.age > 4
end

and the following CoffeeScript:

person.pets.filter (pet) ->
  pet.species == 'ferret' and pet.age > 4

The equivalent canonical Luna would look like this:

person pets select(: pet
  pet species == 'ferret' && pet age > 4
)

Because this looks odd, and because function literals are almost always passed as the last argument, there’s a special-case, allowing a function literal to be defined after the closing ):

person pets select(): pet
  pet species == 'ferret' && pet age > 4

Furthering this special sauce, is the optional parens when no arguments are present:

person pets select: pet
  pet species == 'ferret' && pet age > 4

Taking this even further, I have planned to add deferred evaluation support to Luna, though I haven’t decided yet if it will opt-in (most likely) or not. This feature would provide extremely expressive capabilities, consider the following:

  person pets select(species == 'ferret' && age > 4)

In most imperative languages this would be evaluated before calling the function, so the argument would simply consist of a boolean from this expression. However in Luna these expressions could be evaluated by the callee against a given context, or optionally the caller’s context. Another slick example, mapping the the names of users over 20, and joining them.

users select(age > 20) map(name) join(', ');

Followed by our lovely JavaScript:

users.filter(function(user){
  return user.age > 20;
}).map(function(user){
  return user.name;
}).join(', ');

Number Literals

The one cool thing about Ruby, is that it allows you to place underscores in numbers, which can dramatically increase readability for larger values, so I’ve added this to Luna as well.

million = 1_000_000

Luna also has the typical octal and hexadecimal literals 0755 and 0xff etc, but also implements a few that I have not seen before, because these ones come from CSS! Often when programming in JavaScript with timeouts, intervals etc, so Luna has the s and ms literals. The following are equivalent:

2.5s
2500ms
2500

I’m not a huge fan of getters, visual ambiguity is not an experienced programmers friend, merely a toy that is fun on occasion. That being said s and ms in Luna would be valid slot accessors, as . is not required so they could be implemented within the language itself if it were to have getters:

Number s =:
   self * 1000

Another feature which I first encountered in Ruby, but is no doubt in several other languages is the means to represent a character code, as the character itself. In Ruby this takes the form ?<char>:

?&
// => 38

In Luna the form is #<char>:

#&
// => 38

Operators

Luna of course provides the swath of operators we’ve all come to know and love, along with a few others. I try to stay away form “wordy” operators, such as typical aliases “and”, and “or”, etc. The reason for this is while it does read better as english, large bodies of code quickly become nothing more than a soup of words, which I find tough to look at, after all, english is a terrible language, and an even worth programming language.

Below is the operator precedence table from highest to lowest:

    operator       |  associativity
    ---------------|---------------
    [ ] ( ) sp     |  left
    ++ --          |  right
    ! ~ + -        |  right
    * / %          |  left
    .              |  left
    + -            |  left
    << >>          |  left
    < <= > >=      |  left
    == !=          |  left
    &              |  left
    ^              |  left
    |              |  left
    &&             |  left
    ||             |  left
    ?:             |  right
    = :=           |  right
    not            |  right
    ,              |  left
    if unless      |  left
    while until

The low-precedence not operator is currently the only exception to my rule. This operator is similar to !, however this low precedence is handy, for example:

 User allowed =: realm
   !(banned || blockedFrom(realm))

may become:

 User allowed =: realm
   not banned || blockedFrom(realm)

Lexical Analysis

Luna’s Lexer currently consists of a small body of code (~350 LOC), not yet entirely complete. The Lexer’s job is to transform an input string into a normalized stream of tokens. For example Luna’s internals do not care if the operator == is two or three characters, hell it could be “equal to”, all it cares is that LUNA_OP_EQ is used internally to represent it.

Semantic Analysis

The Parser is in charge of semantics. Using the token stream created by the Lexer, the parser applies rules, and builds an AST, representing the structure of the program. Luna’s parser is a very small (~900 LOC) hand-coded recursive descent parser.

AST Output

For debugging Luna is bundled with an AST pretty printer. The function luna_prettyprint(luna_node_t *node) simply visits each node recursively, incrementing and decrementing the indentation as they are invoked. For example, the following Luna would write a string followed by a LF to stdout:

puts =: str
  stdout write(str . '\\n')

When luna(1) is invoked with the --ast flag, the following is written to stdout:

(= (id puts) (function str 
  (slot
    (id stdout)
    (call
      (id write)
      (. (id str) (string '\n'))))))

The Luna VM

So we’ve talked about syntax a bit, pretty trivial stuff, let’s get to the fun part! I’ve started implementing the VM in a separate branch, once the VM is usable I’ll begin the code generation, after which both will be merged into master. The code generation or “codegen” phase consists of traversing the AST (potentially several times), performing optimizations (such as precomputing expressions such as “10 / 2” to “5”), generating instructions for our virtual machine.

The VM is register based, these faux registers allow us to dramatically reduce the number of instructions, lowering ANSI C dispatch loop overhead, thus increasing performance. The instructions consist of an opcode (operation code), and optional operands. For example the LUNA_OP_ADD instruction would add two numbers. For example with a stack machine, the expression b = a + 10 may be expressed with the following sequence of pseudo instructions, pushing arguments onto the stack:

push 'a'
get
push 3
add
push 'b'
set

As you can see, that’s 6 instructions for a simple expression. In contrast, with our register based VM, we encode the opcode and operands into a 32 bit instruction, as expressed by the following macro used to pack the operands:

#define abc(op, a, b, c) ((LUNA_OP_##op) \
  | ((a) << 8) \
  | ((b) << 16) \
  | ((c) << 24))

With this, our ADD instruction may take the form shown below, where 1 is 2nd register (b), and 0 is the first (a), while 250 is the first constant.

add 1 0 250

Ditching the overhead required by the “abstract” instructions of a stack machine, dramatically increase our performance. While less, more specific instructions is better, the performance benefit is largely due to the fact that our dispatch loop performs less work.

The assembly output below is an ANSI C switch statement with -O3. As you can see gcc create the instruction cmpl here to compare the length of our switch, in this case 12 cases.

L18:
movl    (%rdx), %eax
addq    $4, %rdx
cmpl    $12, %eax
ja  L18
mov %eax, %eax
movslq  (%rsi,%rax,4),%rax
addq    %rsi, %rax
jmp *%rax

I haven’t yet decided on the approach I will take, however I would like Luna to remain portable, but that doesn’t mean we cannot special-case. One solution may be to utilize gcc’s “first class labels” to implement direct (or indirect) threading, using &&label to grab the label’s address. Using this we can perform unconditional jumps, removing the dispatch loop all together. With this technique gcc generates the following for goto **ip++, much better!

movq    (%rbx), %rax
addq    $8, %rbx
jmp *%rax

The End

If you are interested in seeing where the language goes, or trying it out when it becomes usable you can follow the progress on Github. Later I’ll touch on each component in greater detail, but that’s all for now!

Notes

  1. paginas-web-uruguay reblogged this from tjholowaychuk
  2. aburaage reblogged this from tjholowaychuk
  3. kattrali reblogged this from tjholowaychuk
  4. adrusi reblogged this from tjholowaychuk
  5. cti97 reblogged this from hermanjunge
  6. haru012 reblogged this from tjholowaychuk
  7. hermanjunge reblogged this from tjholowaychuk
  8. tjholowaychuk posted this