'; self->val.asString = strdup(buf); return tId; } return tIllegal; Indentation Support Finally the meat of our blog post, indentation!!! Since all indents (and outdents) begin with a new line we simply switch on the line feed or “newline” character. case '\n': { } The n variable will contain our total number of spaces following the line feed, which will be used to indicate our indentation level (2 spaces per indent). Using the same pattern as we did for identifiers, we simply loop while we have spaces, then undo the final character. case '\n': { int n = 0; while (' ' == (c = next)) ++n; undo; } We proceed to our first case of error handling. We find the remainder of n % 2, if it is positive aka NOT a multiple of two, then our indentation is incorrect, return tIllegal. case '\n': { int n = 0; while (' ' == (c = next)) ++n; undo; if (n % 2) { self->err = "Invalid indentation, must be multiple of 2."; return tIllegal; } } After the if statement above, we know we have a valid multiple, so divide n by two to discover the number of indents. Our first if statement branch reports another error, which the user indents to many times. The second branch checks if we have indented in which cased we increment and return tIndent. Third branch checks if the current indentation level matches the previous, in which case it is a SEP or statement separator as it is neither an indent, nor an outdent. The final else branch is when our current indentation level is lower than the previous, and since a user may outdent several times in a row, we subtract the two. case '\n': { int n = 0; while (' ' == (c = next)) ++n; undo; if (n % 2) { self->err = "Invalid indentation, must be multiple of 2."; return tIllegal; } n /= 2; // To many if (n > self->indents + 1) { self->err = "Invalid indentation, to many successive indents."; return tIllegal; // More } else if (n > self->indents) { ++self->indents; return tIndent; // SEP } else if (n == self->indents) { return tSEP; // Less } else { self->outdents = self->indents - n - 1; self->indents = n; return tOutdent; } } You may have noticed that we finally assigned self->outdents. Due to the streaming nature of our lexer, we cannot simply return several tokens, nor would we want to expensively buffer them so we will simply increment self->outdents to indicate deferred outdent tokens. In the top of Lexer_next() above out switch() statement we can add the following conditional, which will continue returning tOutdent tokens until exhausted. if (self->outdents) { --self->outdents; return tOutdent; } The final case in this tutorial to handle is EOS or end-of-source. To make our token stream normalized for our parser we should not consider EOF the end of our token stream, since the script may not contain trailing outdents correctly. The case below shows that we continue to return _tOutdent while indentations remain, then finally EOS. case EOF: return self->indents-- ? tOutdent : tEOS; Conclusion Thats it! we are done! check out the full Source Code and feel free to keep in touch on Github or Twitter." />

C Indentation Based Language Lexer Tutorial

C Indentation Based Language Lexer Tutorial

In this tutorial I wanted to show you how you can start your own programming language, starting by creating what is commonly referred to as a lexer or scanner. The grammar for our “language” will be indentation based much like Python, however will implement a tiny subset of tokens for illustration purposes. If you prefer, you may also jump straight to the Source Code.

First lets take a look at what I mean by an indentation based grammar. Many popular languages such as C, PHP, Ruby, Java, and many others use blocks formed by braces (or similar) to indicate lexical scope, where as some use indentation as show in the comparisons below:

 block {
   based {
     language {
       yay();
     }
   }
 }

 indentation
   based
     language
       yay();

Taking a second look and inserting “>” and “<” to illustrate indents and outdents, we can see that these grammars are semantically identical in that each indent, has an associated outdent.

indentation
>based
> >language
> > >yay();
< < <
< <
<

Tokens

First things first, we need to include some headers for standard input/output, types and some “string” routines.

#include <stdio.h> // fgetc()
#include <ctype.h> // isalpha()
#include <string.h> // strdup()

Next we need to create an enumeration of tokens that we will be working with, these include EOS or end-of-source, SEP as the statement separator, Illegal for handling invalid input, and Id for variable identifiers.

typedef enum {
    tEOS
  , tSEP
  , tIllegal
  , tIndent
  , tOutdent
  , tId
} Token;

For debugging purposes we may also want some c strings (char arrays) as token names:

static char *tokenNames[] = {
    "EOS"
  , "SEP"
  , "Illegal"
  , "Indent"
  , "Outdent"
  , "Id"
};

Lexer Structure

The lexer itself is tasked with converting a sequence of characters, into a sequence of machine recognizable tokens. These consist of terminal symbols (0, 1, 2) and non-terminal symbols (digit, int). To handle the state of our lexer(s) we first need to create a struct for our data. Below we have the int indents which will hold the current indentation level, outdents which will be used to queue outdent tokens, stream which holds our file stream, filename for reporting, err as a possible exception message, and the val union which will allow us to store non-terminal values for strings, integers, floats, etc.

 typedef struct {
   short indents;
   short outdents;
   FILE *stream;
   char *filename;
   char *err;
   union {
     char *asString;
   } val;
 } Lexer;

Now we can start out main() function, although currently not to exciting:

int
main(int argc, char **argv) {
  Lexer lex;
  Lexer_init(&lex);
  return 0;
}

Above we pass the address of lex to Lexer_init() for initialization, which is defined below accepting the lexer itself as self, a file stream and filename.

void
Lexer_init(Lexer *self, FILE *stream, char *filename) {
  self->stream = stream;
  self->filename = filename;
  self->indents = 0;
  self->outdents = 0;
  self->err = NULL;
}

Lexer Debug Output

Next we will create the Lexer_dump() function, which will output our stream tokens in a human-readable representation such as the stdout below:

stdin:
  Id: just
  Indent
  Id: a
  Indent
  Id: test
  Outdent
  Outdent

To begin, we pass the address of lex again after initialization:

int
main(int argc, char **argv) {
  Lexer lex;
  Lexer_init(&lex);
  Lexer_dump(&lex);
  return 0;
}

After declaring tok, we call Lexer_next(self) to return the next token available in the stream, bailing if the token is EOS or end-of-source, because we are done :)

void
Lexer_dump(Lexer *self) {
  Token tok;
  printf("\x1b[33m%s\x1b[0m:\n", self->filename);
  while (tEOS != (tok = Lexer_next(self))) {
    // Do something with tok
  }
}

We can now begin working with our token, by switch()_ing on it and displaying errors in red, identifiers in yellow, and other tokens in green. Because we previously created our _tokenNames array, and the values are indexed in the same manor as our Token enum we can simply access it via tokenNames[tok].

void
Lexer_dump(Lexer *self) {
  Token tok;
  printf("\x1b[33m%s\x1b[0m:\n", self->filename);
  while (tEOS != (tok = Lexer_next(self))) {
    switch (tok) {
      case tIllegal:
        printf("  \x1b[31m%s\x1b[0m: %s\n", tokenNames[tok], self->err);
        break;
      case tId:
        printf("  \x1b[33m%s\x1b[0m: %s\n", tokenNames[tok], self->val.asString);
        break;
      default:
        printf("  \x1b[32m%s\x1b[0m\n", tokenNames[tok]);
        break;
    }
  }
}

Lexical Analysis

Finally we can begin actually scanning our input. Since we accept any valid FILE stream we can work from stdin, by executing:

$ ./lang < test

Where ./test contains:

if foo
  if bar
    something
  else
    something_else
else
  baz

Defined below are two small macros to ease fetching of the next character in the stream, as well as undoing or re-buffering a character. It is possible to create state machines that do not perform lookaheads, and do not unget characters, however for illustration and simplification purposes we will do so.

 #define next fgetc(self->stream)
 #define undo ungetc(c, self->stream)

The c variable will hold the current character, as switch() starts our scanning. First thing we need to do is ignore whitespace

 Token
 Lexer_next(Lexer *self) {
   int c;
   switch (c = next) {
     case ' ':
       return Lexer_next(self);
   }
 }

We can define the default to handle non-terminals such as numbers, strings and in our case identifiers. We first use isalpha(c) is alphabetic, also excepting underscores via c == ‘_’. We set our buffer index to 0, and add the first character previously matched to the buffer. Typically we need a much more robust solution for handling buffer overflows etc, but again for illustration we omit this. We then perform the same assertions as we continue buffering characters, followed by our undo macro.

Since we loop undo the condition is no longer truthy, we are left with an extra character that is not a valid id character, so we must undo this so that it is still available to the lexer.

 default:
   if (isalpha(c) || c == '_') {
     int i = 0;
     char buf[32];
     buf[i++] = c;
     while (isalpha(c = next) || c == '_') {
       buf[i++] = c;
     }
     undo;

Our last character is null, then finally we assign our union field val.asString to our buffered data and return tId.

default:
  if (isalpha(c) || c == '_') {
    int i = 0;
    char buf[32];
    buf[i++] = c;
    while (isalpha(c = next) || c == '_') {
      buf[i++] = c;
    }
    undo;
    buf[i] = '\0';
    self->val.asString = strdup(buf);
    return tId;
  }
  return tIllegal;

Indentation Support

Finally the meat of our blog post, indentation!!! Since all indents (and outdents) begin with a new line we simply switch on the line feed or “newline” character.

  case '\n': {

  }

The n variable will contain our total number of spaces following the line feed, which will be used to indicate our indentation level (2 spaces per indent). Using the same pattern as we did for identifiers, we simply loop while we have spaces, then undo the final character.

case '\n': {
  int n = 0;
  while (' ' == (c = next)) ++n;
  undo;
}

We proceed to our first case of error handling. We find the remainder of n % 2, if it is positive aka NOT a multiple of two, then our indentation is incorrect, return tIllegal.

case '\n': {
  int n = 0;
  while (' ' == (c = next)) ++n;
  undo;
  if (n % 2) {
    self->err = "Invalid indentation, must be multiple of 2.";
    return tIllegal;
  }
}

After the if statement above, we know we have a valid multiple, so divide n by two to discover the number of indents. Our first if statement branch reports another error, which the user indents to many times. The second branch checks if we have indented in which cased we increment and return tIndent. Third branch checks if the current indentation level matches the previous, in which case it is a SEP or statement separator as it is neither an indent, nor an outdent. The final else branch is when our current indentation level is lower than the previous, and since a user may outdent several times in a row, we subtract the two.

case '\n': {
  int n = 0;
  while (' ' == (c = next)) ++n;
  undo;
  if (n % 2) {
    self->err = "Invalid indentation, must be multiple of 2.";
    return tIllegal;
  }
  n /= 2;
  // To many
  if (n > self->indents + 1) {
    self->err = "Invalid indentation, to many successive indents.";
    return tIllegal;
  // More
  } else if (n > self->indents) {
    ++self->indents;
    return tIndent;
  // SEP
  } else if (n == self->indents) {
    return tSEP;
  // Less
  } else {
    self->outdents = self->indents - n - 1;
    self->indents = n;
    return tOutdent;
  }
}

You may have noticed that we finally assigned self->outdents. Due to the streaming nature of our lexer, we cannot simply return several tokens, nor would we want to expensively buffer them so we will simply increment self->outdents to indicate deferred outdent tokens.

In the top of Lexer_next() above out switch() statement we can add the following conditional, which will continue returning tOutdent tokens until exhausted.

if (self->outdents) {
  --self->outdents;
  return tOutdent;
}

The final case in this tutorial to handle is EOS or end-of-source. To make our token stream normalized for our parser we should not consider EOF the end of our token stream, since the script may not contain trailing outdents correctly. The case below shows that we continue to return _tOutdent while indentations remain, then finally EOS.

case EOF: 
  return self->indents--
    ? tOutdent
    : tEOS;

Conclusion

Thats it! we are done! check out the full Source Code and feel free to keep in touch on Github or Twitter.

Notes

  1. tjholowaychuk posted this