commander.c - C option parser
Every time I write a C program I end up rolling my own option parsing with a loop and strcmp(), so I figured it was time to port over commander.c from the original ruby commander, and there’s also a node.js port.
Commander.c is a super simple script that makes defining options declarative and does the dirty work for you, its usage looks like this:
#include <stdio.h>
#include "commander.h"
static void
verbose(command_t *self) {
printf("verbose: enabled\n");
}
static void
required(command_t *self) {
printf("required: %s\n", self->arg);
}
static void
optional(command_t *self) {
printf("optional: %s\n", self->arg);
}
int
main(int argc, const char **argv){
command_t cmd;
command_init(&cmd, argv[0], "0.0.1");
command_option(&cmd, "-v", "--verbose", "enable verbose stuff", verbose);
command_option(&cmd, "-r", "--required <arg>", "required arg", required);
command_option(&cmd, "-o", "--optional [arg]", "optional arg", optional);
command_parse(&cmd, argc, argv);
printf("additional args:\n");
for (int i = 0; i < cmd.argc; ++i) {
printf(" - '%s'\n", cmd.argv[i]);
}
return 0;
}
As a bonus it generates the --help from what it already knows about your program:
Usage: example [options]
Options:
-V, --version output program version
-h, --help output help information
-v, --verbose enable verbose stuff
-r, --required <arg> required arg
-o, --optional [arg] optional arg
Commander handles optional and required args:
$ mon --sleep
--sleep requires an argument
Warns about unrecognized flags:
$ ./test --foo
unrecognized flag --foo
Unparsed arguments are then provided as cmd.argv and of course its companion cmd.argc, which of course includes -- support to flag subsequent args as literals:
$ ./test foo -- --foo
additional args:
- 'foo'
- '--bar'
If you have a struct that you want to access within each callback just assign to cmd.data, a void *.
That’s it for now, I’ll be adding a few more things in the near future so check it out on github.
The Luna programming language
A few weeks ago I started implementing Luna, an elegant, expressive, and embeddable language written in ANSI C, with performance as a primary focus. For now it’s just a side project, but I thought it would be fun to blog about the different aspects as they unfold, building languages is a really fun process.
Goals
The current goals of Luna are the following:
- small, simple, expressive, and explicit syntax
- fast, fast, and fast
- robust reflection capabilities
- opt-in callee evaluated messages
- register based VM
- embeddable
About Luna’s Syntax
Luna’s syntax is inspired by Python, IO, and Lua, bits of Ruby, and of course, JavaScript.
The language itself will utilize “pure” prototypes for inheritance. When I say “pure”, I mean no confusing class/prototype hybrid monster barfing up poor design, which we all face in JavaScript. The way JavaScript handles prototypes adds to the confusion of what may already be a completely new concept to some. Contrasting with JavaScript this means losing constructors, ditching .prototype, the new operator, instanceof, typeof and other clunky “features”.
Take the following JavaScript for example:
function User(name) {
this.name = user;
}
User.prototype.toString = function(){
return this.name;
}
new User('tj')
This is a big W-T-F for some people, some get confused thinking it’s a class, and some get lost trying to understand that the prototype property has to do with anything at all. Simply put, a “prototype” is just an object, that’s it, it’s an object that you “clone”, which for performance reasons typically means you have a new object with the “parent” in a special prototype chain for property or “slot” lookup. That’s it! adding constructors and funky props and stupid keywords to the mix just makes things needlessly confusing, and really takes away from what would otherwise be a pretty elegant language.
Whitespace
Our earlier JavaScript User example might be written as the following, where we first clone the primitive Object, aka the prototype chain now contains Object, then we assign the slot (property) new to a function taking a single argument name. Within this function we clone the User object, so now the prototype chain contains Object, User, and assign the user’s name slot to the arg.
User = Object clone()
User new =: name
user = User clone()
user name = name
user
User toString =:
self name
tj = User new('tj')
There are a few things to note about Luna’s syntax here. The first being that linear white-space is the slot or “property” delimiter, not . like many languages. Another thing to note is the function literal syntax, which is simply a : character followed by optional params and a block.
Many of you know my thoughts regarding coffeescript, and like I’ve stated in Luna’s Readme, it’s not significant whitespace that I dislike, though it can become nasty very quickly in large doses. It’s of course an honor for any programmer to have a large number of followers, but the reality is that there’s almost nothing original about the language, and instead of striving for something elegant and simple, it just throws more sugar and syntax at the “problems”. The worst part about significant whitespace is that deep nesting, perhaps when defining methods on a class, it’s easy completely lose context, and in a big file/class that can get very annoying, very quickly. The fact that you must specify the receiver for each method reaffirms context within Luna.
EDIT: I’ve seen enough coffeescript used in practice now that looks really bad… so I’m likely going to go with a more lua-like syntax
Function Literals
So now that you’ve seen the function literal, you’re probably thinking what the hell… yes, the function literal is a single character, :. Consider the following JavaScript function:
var greet = function(user, msg) {
console.log('hello ' + user + ' ' + msg);
};
The equivalent Luna script would be:
greet =: user msg
console log('hello ' . user . ' ' . msg)
While this looks like an operator (=:), it’s simply convention, the following is valid as well:
greet = : user msg
console log(...)
You’ll see that the params do not require a comma, this is because Luna, like JavaScript, does not allow defining defaults in the signature. Personally I find this more flexible as params are often shifted and swapped based on their types or arity, and I find the clutter of defaults makes the signature difficult to read, clouding what would otherwise at a glance give you a hint at what the function does.
greet =: user msg
console log('hello #{user} #{msg}')
Another example would be a selection of person’s ferrets older than 4. In Ruby this might look similar to below:
person.pets.select do |pet|
pet.species == 'ferret' and pet.age > 4
end
and the following CoffeeScript:
person.pets.filter (pet) ->
pet.species == 'ferret' and pet.age > 4
The equivalent canonical Luna would look like this:
person pets select(: pet
pet species == 'ferret' && pet age > 4
)
Because this looks odd, and because function literals are almost always passed as the last argument, there’s a special-case, allowing a function literal to be defined after the closing ):
person pets select(): pet
pet species == 'ferret' && pet age > 4
Furthering this special sauce, is the optional parens when no arguments are present:
person pets select: pet
pet species == 'ferret' && pet age > 4
Taking this even further, I have planned to add deferred evaluation support to Luna, though I haven’t decided yet if it will opt-in (most likely) or not. This feature would provide extremely expressive capabilities, consider the following:
person pets select(species == 'ferret' && age > 4)
In most imperative languages this would be evaluated before calling the function, so the argument would simply consist of a boolean from this expression. However in Luna these expressions could be evaluated by the callee against a given context, or optionally the caller’s context. Another slick example, mapping the the names of users over 20, and joining them.
users select(age > 20) map(name) join(', ');
Followed by our lovely JavaScript:
users.filter(function(user){
return user.age > 20;
}).map(function(user){
return user.name;
}).join(', ');
Number Literals
The one cool thing about Ruby, is that it allows you to place underscores in numbers, which can dramatically increase readability for larger values, so I’ve added this to Luna as well.
million = 1_000_000
Luna also has the typical octal and hexadecimal literals 0755 and 0xff etc, but also implements a few that I have not seen before, because these ones come from CSS! Often when programming in JavaScript with timeouts, intervals etc, so Luna has the s and ms literals. The following are equivalent:
2.5s
2500ms
2500
I’m not a huge fan of getters, visual ambiguity is not an experienced programmers friend, merely a toy that is fun on occasion. That being said s and ms in Luna would be valid slot accessors, as . is not required so they could be implemented within the language itself if it were to have getters:
Number s =:
self * 1000
Another feature which I first encountered in Ruby, but is no doubt in several other languages is the means to represent a character code, as the character itself. In Ruby this takes the form ?<char>:
?&
// => 38
In Luna the form is #<char>:
#&
// => 38
Operators
Luna of course provides the swath of operators we’ve all come to know and love, along with a few others. I try to stay away form “wordy” operators, such as typical aliases “and”, and “or”, etc. The reason for this is while it does read better as english, large bodies of code quickly become nothing more than a soup of words, which I find tough to look at, after all, english is a terrible language, and an even worth programming language.
Below is the operator precedence table from highest to lowest:
operator | associativity
---------------|---------------
[ ] ( ) sp | left
++ -- | right
! ~ + - | right
* / % | left
. | left
+ - | left
<< >> | left
< <= > >= | left
== != | left
& | left
^ | left
| | left
&& | left
|| | left
?: | right
= := | right
not | right
, | left
if unless | left
while until
The low-precedence not operator is currently the only exception to my rule. This operator is similar to !, however this low precedence is handy, for example:
User allowed =: realm
!(banned || blockedFrom(realm))
may become:
User allowed =: realm
not banned || blockedFrom(realm)
Lexical Analysis
Luna’s Lexer currently consists of a small body of code (~350 LOC), not yet entirely complete. The Lexer’s job is to transform an input string into a normalized stream of tokens. For example Luna’s internals do not care if the operator == is two or three characters, hell it could be “equal to”, all it cares is that LUNA_OP_EQ is used internally to represent it.
Semantic Analysis
The Parser is in charge of semantics. Using the token stream created by the Lexer, the parser applies rules, and builds an AST, representing the structure of the program. Luna’s parser is a very small (~900 LOC) hand-coded recursive descent parser.
AST Output
For debugging Luna is bundled with an AST pretty printer. The function luna_prettyprint(luna_node_t *node) simply visits each node recursively, incrementing and decrementing the indentation as they are invoked. For example, the following Luna would write a string followed by a LF to stdout:
puts =: str
stdout write(str . '\\n')
When luna(1) is invoked with the --ast flag, the following is written to stdout:
(= (id puts) (function str
(slot
(id stdout)
(call
(id write)
(. (id str) (string '\n'))))))
The Luna VM
So we’ve talked about syntax a bit, pretty trivial stuff, let’s get to the fun part! I’ve started implementing the VM in a separate branch, once the VM is usable I’ll begin the code generation, after which both will be merged into master. The code generation or “codegen” phase consists of traversing the AST (potentially several times), performing optimizations (such as precomputing expressions such as “10 / 2” to “5”), generating instructions for our virtual machine.
The VM is register based, these faux registers allow us to dramatically reduce the number of instructions, lowering ANSI C dispatch loop overhead, thus increasing performance. The instructions consist of an opcode (operation code), and optional operands. For example the LUNA_OP_ADD instruction would add two numbers. For example with a stack machine, the expression b = a + 10 may be expressed with the following sequence of pseudo instructions, pushing arguments onto the stack:
push 'a'
get
push 3
add
push 'b'
set
As you can see, that’s 6 instructions for a simple expression. In contrast, with our register based VM, we encode the opcode and operands into a 32 bit instruction, as expressed by the following macro used to pack the operands:
#define abc(op, a, b, c) ((LUNA_OP_##op) \
| ((a) << 8) \
| ((b) << 16) \
| ((c) << 24))
With this, our ADD instruction may take the form shown below, where 1 is 2nd register (b), and 0 is the first (a), while 250 is the first constant.
add 1 0 250
Ditching the overhead required by the “abstract” instructions of a stack machine, dramatically increase our performance. While less, more specific instructions is better, the performance benefit is largely due to the fact that our dispatch loop performs less work.
The assembly output below is an ANSI C switch statement with -O3. As you can see gcc create the instruction cmpl here to compare the length of our switch, in this case 12 cases.
L18:
movl (%rdx), %eax
addq $4, %rdx
cmpl $12, %eax
ja L18
mov %eax, %eax
movslq (%rsi,%rax,4),%rax
addq %rsi, %rax
jmp *%rax
I haven’t yet decided on the approach I will take, however I would like Luna to remain portable, but that doesn’t mean we cannot special-case. One solution may be to utilize gcc’s “first class labels” to implement direct (or indirect) threading, using &&label to grab the label’s address. Using this we can perform unconditional jumps, removing the dispatch loop all together. With this technique gcc generates the following for goto **ip++, much better!
movq (%rbx), %rax
addq $8, %rbx
jmp *%rax
The End
If you are interested in seeing where the language goes, or trying it out when it becomes usable you can follow the progress on Github. Later I’ll touch on each component in greater detail, but that’s all for now!
C Indentation Based Language Lexer Tutorial
C Indentation Based Language Lexer Tutorial
In this tutorial I wanted to show you how you can start your own programming language, starting by creating what is commonly referred to as a lexer or scanner. The grammar for our “language” will be indentation based much like Python, however will implement a tiny subset of tokens for illustration purposes. If you prefer, you may also jump straight to the Source Code.
First lets take a look at what I mean by an indentation based grammar. Many popular languages such as C, PHP, Ruby, Java, and many others use blocks formed by braces (or similar) to indicate lexical scope, where as some use indentation as show in the comparisons below:
block {
based {
language {
yay();
}
}
}
indentation
based
language
yay();
Taking a second look and inserting “>” and “<” to illustrate indents and outdents, we can see that these grammars are semantically identical in that each indent, has an associated outdent.
indentation
>based
> >language
> > >yay();
< < <
< <
<
Tokens
First things first, we need to include some headers for standard input/output, types and some “string” routines.
#include <stdio.h> // fgetc()
#include <ctype.h> // isalpha()
#include <string.h> // strdup()
Next we need to create an enumeration of tokens that we will be working with, these include EOS or end-of-source, SEP as the statement separator, Illegal for handling invalid input, and Id for variable identifiers.
typedef enum {
tEOS
, tSEP
, tIllegal
, tIndent
, tOutdent
, tId
} Token;
For debugging purposes we may also want some c strings (char arrays) as token names:
static char *tokenNames[] = {
"EOS"
, "SEP"
, "Illegal"
, "Indent"
, "Outdent"
, "Id"
};
Lexer Structure
The lexer itself is tasked with converting a sequence of characters, into a sequence of machine recognizable tokens. These consist of terminal symbols (0, 1, 2) and non-terminal symbols (digit, int). To handle the state of our lexer(s) we first need to create a struct for our data. Below we have the int indents which will hold the current indentation level, outdents which will be used to queue outdent tokens, stream which holds our file stream, filename for reporting, err as a possible exception message, and the val union which will allow us to store non-terminal values for strings, integers, floats, etc.
typedef struct {
short indents;
short outdents;
FILE *stream;
char *filename;
char *err;
union {
char *asString;
} val;
} Lexer;
Now we can start out main() function, although currently not to exciting:
int
main(int argc, char **argv) {
Lexer lex;
Lexer_init(&lex);
return 0;
}
Above we pass the address of lex to Lexer_init() for initialization, which is defined below accepting the lexer itself as self, a file stream and filename.
void
Lexer_init(Lexer *self, FILE *stream, char *filename) {
self->stream = stream;
self->filename = filename;
self->indents = 0;
self->outdents = 0;
self->err = NULL;
}
Lexer Debug Output
Next we will create the Lexer_dump() function, which will output our stream tokens in a human-readable representation such as the stdout below:
stdin:
Id: just
Indent
Id: a
Indent
Id: test
Outdent
Outdent
To begin, we pass the address of lex again after initialization:
int
main(int argc, char **argv) {
Lexer lex;
Lexer_init(&lex);
Lexer_dump(&lex);
return 0;
}
After declaring tok, we call Lexer_next(self) to return the next token available in the stream, bailing if the token is EOS or end-of-source, because we are done :)
void
Lexer_dump(Lexer *self) {
Token tok;
printf("\x1b[33m%s\x1b[0m:\n", self->filename);
while (tEOS != (tok = Lexer_next(self))) {
// Do something with tok
}
}
We can now begin working with our token, by switch()_ing on it and displaying errors in red, identifiers in yellow, and other tokens in green. Because we previously created our _tokenNames array, and the values are indexed in the same manor as our Token enum we can simply access it via tokenNames[tok].
void
Lexer_dump(Lexer *self) {
Token tok;
printf("\x1b[33m%s\x1b[0m:\n", self->filename);
while (tEOS != (tok = Lexer_next(self))) {
switch (tok) {
case tIllegal:
printf(" \x1b[31m%s\x1b[0m: %s\n", tokenNames[tok], self->err);
break;
case tId:
printf(" \x1b[33m%s\x1b[0m: %s\n", tokenNames[tok], self->val.asString);
break;
default:
printf(" \x1b[32m%s\x1b[0m\n", tokenNames[tok]);
break;
}
}
}
Lexical Analysis
Finally we can begin actually scanning our input. Since we accept any valid FILE stream we can work from stdin, by executing:
$ ./lang < test
Where ./test contains:
if foo
if bar
something
else
something_else
else
baz
Defined below are two small macros to ease fetching of the next character in the stream, as well as undoing or re-buffering a character. It is possible to create state machines that do not perform lookaheads, and do not unget characters, however for illustration and simplification purposes we will do so.
#define next fgetc(self->stream)
#define undo ungetc(c, self->stream)
The c variable will hold the current character, as switch() starts our scanning. First thing we need to do is ignore whitespace
Token
Lexer_next(Lexer *self) {
int c;
switch (c = next) {
case ' ':
return Lexer_next(self);
}
}
We can define the default to handle non-terminals such as numbers, strings and in our case identifiers. We first use isalpha(c) is alphabetic, also excepting underscores via c == ‘_’. We set our buffer index to 0, and add the first character previously matched to the buffer. Typically we need a much more robust solution for handling buffer overflows etc, but again for illustration we omit this. We then perform the same assertions as we continue buffering characters, followed by our undo macro.
Since we loop undo the condition is no longer truthy, we are left with an extra character that is not a valid id character, so we must undo this so that it is still available to the lexer.
default:
if (isalpha(c) || c == '_') {
int i = 0;
char buf[32];
buf[i++] = c;
while (isalpha(c = next) || c == '_') {
buf[i++] = c;
}
undo;
Our last character is null, then finally we assign our union field val.asString to our buffered data and return tId.
default:
if (isalpha(c) || c == '_') {
int i = 0;
char buf[32];
buf[i++] = c;
while (isalpha(c = next) || c == '_') {
buf[i++] = c;
}
undo;
buf[i] = '\0';
self->val.asString = strdup(buf);
return tId;
}
return tIllegal;
Indentation Support
Finally the meat of our blog post, indentation!!! Since all indents (and outdents) begin with a new line we simply switch on the line feed or “newline” character.
case '\n': {
}
The n variable will contain our total number of spaces following the line feed, which will be used to indicate our indentation level (2 spaces per indent). Using the same pattern as we did for identifiers, we simply loop while we have spaces, then undo the final character.
case '\n': {
int n = 0;
while (' ' == (c = next)) ++n;
undo;
}
We proceed to our first case of error handling. We find the remainder of n % 2, if it is positive aka NOT a multiple of two, then our indentation is incorrect, return tIllegal.
case '\n': {
int n = 0;
while (' ' == (c = next)) ++n;
undo;
if (n % 2) {
self->err = "Invalid indentation, must be multiple of 2.";
return tIllegal;
}
}
After the if statement above, we know we have a valid multiple, so divide n by two to discover the number of indents. Our first if statement branch reports another error, which the user indents to many times. The second branch checks if we have indented in which cased we increment and return tIndent. Third branch checks if the current indentation level matches the previous, in which case it is a SEP or statement separator as it is neither an indent, nor an outdent. The final else branch is when our current indentation level is lower than the previous, and since a user may outdent several times in a row, we subtract the two.
case '\n': {
int n = 0;
while (' ' == (c = next)) ++n;
undo;
if (n % 2) {
self->err = "Invalid indentation, must be multiple of 2.";
return tIllegal;
}
n /= 2;
// To many
if (n > self->indents + 1) {
self->err = "Invalid indentation, to many successive indents.";
return tIllegal;
// More
} else if (n > self->indents) {
++self->indents;
return tIndent;
// SEP
} else if (n == self->indents) {
return tSEP;
// Less
} else {
self->outdents = self->indents - n - 1;
self->indents = n;
return tOutdent;
}
}
You may have noticed that we finally assigned self->outdents. Due to the streaming nature of our lexer, we cannot simply return several tokens, nor would we want to expensively buffer them so we will simply increment self->outdents to indicate deferred outdent tokens.
In the top of Lexer_next() above out switch() statement we can add the following conditional, which will continue returning tOutdent tokens until exhausted.
if (self->outdents) {
--self->outdents;
return tOutdent;
}
The final case in this tutorial to handle is EOS or end-of-source. To make our token stream normalized for our parser we should not consider EOF the end of our token stream, since the script may not contain trailing outdents correctly. The case below shows that we continue to return _tOutdent while indentations remain, then finally EOS.
case EOF:
return self->indents--
? tOutdent
: tEOS;
Conclusion
Thats it! we are done! check out the full Source Code and feel free to keep in touch on Github or Twitter.
C Insertion Sort
So I wanted to investigate different sorting algorithms for a few personal projects, and I thought while I was doing so that I would write little tutorials for each algo :) why not! First we have one of the most basic sorting algorithms, the Insertion Sort.

Explanation
As illustrated by this video, the insertion sort works much like you might sort a deck of cards laying infront of you on a table. For each value in a list, we compare the value with the previous value. If the current value is lesser than the previous, we move the current value backwards, repeating until considered greater than an already sorted value.
Implementation
First we need to include stdio.h for printf(), which is called in the dump() function below which simply prints the array contents to stdout.
#include <stdio.h>
#include <string.h>
void
dump(char *label, char **arr, int len) {
printf("%s:\n", label);
for (int i = 0; i < len; ++i) {
printf("%s ", arr[i]);
}
printf("\n\n");
}
Next on to insertionSort(). Here we accept a “pointer to a pointer” aka, a pointer array, and a len integer. We start our for loop at 1 because an array with a length of is already sorted. We assign key and then begin our backwards comparison, seeing if key is lesser than any previously sorted values.
void
insertionSort(char **arr, int len) {
for (int k = 1; k < len; ++k) {
char *key = arr[k];
int i = k - 1;
while (i >= 0 && strcmp(key, arr[i]) < 0) {
arr[i + 1] = arr[i--];
}
arr[i + 1] = key;
}
}
In our definition of main() we simply initialize 6 c strings and perform our insertion sort, dumping before and after the sort.
int
main(int argc, const char **argv){
char *arr[6] = { "1.0.0", "0.2.0", "2.0", "0.0.1", "0.5.0", "1.0.0beta" };
dump("before", arr, 6);
insertionSort(arr, 6);
dump("after", arr, 6);
return 0;
}
Compile
$ gcc insertionSort.c -std=c99 -o insertionSort
$ ./insertionSort
stdout:
before:
1.0.0 0.2.0 2.0 0.0.1 0.5.0 1.0.0beta
after:
0.0.1 0.2.0 0.5.0 1.0.0 1.0.0beta 2.0