TJ Holowaychuk

Month

July 2011

3 posts

Reds - light-weight full text search for nodejs backed by redis

Reds (red-s) is a very small (~300LOC) light-weight full text search library for node.js.

I wrote reds with Kue in mind, a priority job queue for node. I wanted to add search capabilities so you can easily find jobs by any of the arbitrary data provided, names, emails, anything.

API

You can use reds for multiple isolated search indexes, which is why you must pass a key to reds.createSearch(), as it’s used for namespacing.

 var search = reds.createSearch('pets');

As mentioned this library could be used with anything, to illustrate this we can even use it with a regular javascript array by indexing the value indices as shown below, where the first value passed to search.index() is the text, and the second is the id.

 var strs = [];
 strs.push('Tobi wants four dollars');
 strs.push('Tobi only wants $4');
 strs.push('Loki is really fat');
 strs.push('Loki, Jane, and Tobi are ferrets');
 strs.push('Manny is a cat');
 strs.push('Luna is a cat');
 strs.push('Mustachio is a cat');

 strs.forEach(function(str, i){ search.index(str, i); });

Within the another process, or the same one, we can then invoke search.query() with a string and callback invoked with possible error and array of ids. With these ids we can then determine their original values, be it an array, object in another data store etc.

 search.query('luna cat', function(err, ids){
   if (err) throw err;
   console.log('Search results:');
   ids.forEach(function(id){
     console.log('  - %s', strs[id]);
   });
 });

Producing:

  Search results:
     - Luna is a cat

You may also remove an id from the index:

  search.remove(id[, callback]);
Implementation

While reds is backed by Redis you can easily use reds with any other data store, as it simply indexes by arbitrary numeric or string ids. This means you could create an index of files on disk, mongodb documents, urls, anything.

The indexing process works like this:

- tokenize words
- strip stop words ("about", "after", "am", "an", ...)
- stem words
- apply metaphone
- add id to metaphone constant set 

The process of stemming the words and applying the metaphone algorithm provide leeway so the user does not need exact matches. For example thanks to metaphone the names “steven” and “stephen” both resolve to the constant STFN. The process of stemming reduces variants to it’s stem, for example “counting” becomes “count”, and “waits” becomes “wait”.

When a query is performed the same sequence is applied, resulting in an array of metaphone constants, providing us with the keys necessary to perform the Redis union or intersection to fetch our ids.

Performance

Preliminary benchmarks show that a small 1.6kb body of text is currently indexed in ~6ms, or 163 ops/s. Medium bodies such as 40kb operate around 6 ops/s, or 166ms.

Querying with a multi-word phrase, and an index containing ~3500 words operates around 5300 ops/s. Not too bad.

Indexing performance was decreased by nearly 100% by applying the porter stemmer algorithm in Chris Umbel’s natural library, however reds is still reasonably quick at indexing text. If you have huge documents, you may want to consider allowing users to specify a description instead.

Future
  • use sorted sets for ordering and priority
  • ranges
  • sorting
  • perf optimization if necessary
Jul 30, 201131 notes
#node #redis #search
Jade Mixins & Includes

The latest release of Jade 0.13.0 adds mixins and includes. Jade users have longed for some method of static include, so it’s (finally) here, and compliments mixins nicely, allowing you to store mixins in one or more separate files.

Mixins

A mixin definition takes the form mixin <name> [( params )] block, where params is identical to JavaScript function params, simply a list, as it’s converted to a JavaScript function within Jade to become part of the output function.

Using a mixin is identical to the definition, however omitting the block. Mixins allow you to encapsulate repeated chunks of a template, such as form fields as shown in the following snippet. This example has a local variable user passed to the template with the properties { name: '...', email: '...' }, which is then accessible throughout any mixins defined without explicitly passing it (though you may if you like).

mixin field(type, name, label)
  .field(class='field-' + type)
    label #{label}:
    input(type=type, name='user[#{name}]', value=user[name])

form
  mixin field('text', 'name', 'Username')
  mixin field('text', 'email', 'Email')
  mixin field('password', 'pass', 'Password')
  input(type='submit', value='Sign Up')

Outputting:

<form>
  <div class="field field-text">
    <label>Username:</label>
    <input type="text" name="user[name]" value="TJ Holowaychuk"/>
  </div>
  <div class="field field-text">
    <label>Email:</label>
    <input type="text" name="user[email]" value="tj@learnboost.com"/>
  </div>
  <div class="field field-password">
    <label>Password:</label>
    <input type="password" name="user[pass]"/>
  </div>
  <input type="submit" value="Sign Up"/>
</form>
Includes

The include <path> directive signals Jade to read and parse the file, then return it’s root node (a block), injecting it into the calling template, as if it were written in the same file. The <path> given is relative to the dirname of the calling template, which is exposed to Jade via the filename option, which is populated by jade.renderFile() and Express, otherwise you need to supply this.

For example our mixin(s) could live in ./mixins, and our previous example could look something like below:

include mixins/form-helpers

form
  mixin field('text', 'name', 'Username')
  mixin field('text', 'email', 'Email')
  mixin field('password', 'pass', 'Password')
  input(type='submit', value='Sign Up')

Or the classic header / footer example, first index.jade:

html
  include includes/head  
  body
    h1 My Site
    p Welcome to my super lame site.
    include includes/foot

includes/foot.jade:

#footer
  p Copyright (c) foobar

includes/head.jade:

head
  title My Site
  script(src='/javascripts/jquery.js')
  script(src='/javascripts/app.js')

That’s all for now, hopefully I’ll have some time in the near future to clean things up and get 1.0 (finally again :)) out the door.

Jul 13, 201136 notes
#node #jade
The Luna programming language

A few weeks ago I started implementing Luna, an elegant, expressive, and embeddable language written in ANSI C, with performance as a primary focus. For now it’s just a side project, but I thought it would be fun to blog about the different aspects as they unfold, building languages is a really fun process.

Goals

The current goals of Luna are the following:

  • small, simple, expressive, and explicit syntax
  • fast, fast, and fast
  • robust reflection capabilities
  • opt-in callee evaluated messages
  • register based VM
  • embeddable
About Luna’s Syntax

Luna’s syntax is inspired by Python, IO, and Lua, bits of Ruby, and of course, JavaScript.

The language itself will utilize “pure” prototypes for inheritance. When I say “pure”, I mean no confusing class/prototype hybrid monster barfing up poor design, which we all face in JavaScript. The way JavaScript handles prototypes adds to the confusion of what may already be a completely new concept to some. Contrasting with JavaScript this means losing constructors, ditching .prototype, the new operator, instanceof, typeof and other clunky “features”.

Take the following JavaScript for example:

function User(name) {
  this.name = user;
}

User.prototype.toString = function(){
 return this.name;
}

new User('tj')

This is a big W-T-F for some people, some get confused thinking it’s a class, and some get lost trying to understand that the prototype property has to do with anything at all. Simply put, a “prototype” is just an object, that’s it, it’s an object that you “clone”, which for performance reasons typically means you have a new object with the “parent” in a special prototype chain for property or “slot” lookup. That’s it! adding constructors and funky props and stupid keywords to the mix just makes things needlessly confusing, and really takes away from what would otherwise be a pretty elegant language.

Whitespace

Our earlier JavaScript User example might be written as the following, where we first clone the primitive Object, aka the prototype chain now contains Object, then we assign the slot (property) new to a function taking a single argument name. Within this function we clone the User object, so now the prototype chain contains Object, User, and assign the user’s name slot to the arg.

User = Object clone()

User new =: name
  user = User clone()
  user name = name
  user

User toString =:
  self name

tj = User new('tj')

There are a few things to note about Luna’s syntax here. The first being that linear white-space is the slot or “property” delimiter, not . like many languages. Another thing to note is the function literal syntax, which is simply a : character followed by optional params and a block.

Many of you know my thoughts regarding coffeescript, and like I’ve stated in Luna’s Readme, it’s not significant whitespace that I dislike, though it can become nasty very quickly in large doses. It’s of course an honor for any programmer to have a large number of followers, but the reality is that there’s almost nothing original about the language, and instead of striving for something elegant and simple, it just throws more sugar and syntax at the “problems”. The worst part about significant whitespace is that deep nesting, perhaps when defining methods on a class, it’s easy completely lose context, and in a big file/class that can get very annoying, very quickly. The fact that you must specify the receiver for each method reaffirms context within Luna.

EDIT: I’ve seen enough coffeescript used in practice now that looks really bad… so I’m likely going to go with a more lua-like syntax

Function Literals

So now that you’ve seen the function literal, you’re probably thinking what the hell… yes, the function literal is a single character, :. Consider the following JavaScript function:

 var greet = function(user, msg) {
   console.log('hello ' + user + ' ' + msg);
 };

The equivalent Luna script would be:

greet =: user msg
  console log('hello ' . user . ' ' . msg)

While this looks like an operator (=:), it’s simply convention, the following is valid as well:

greet = : user msg
  console log(...)

You’ll see that the params do not require a comma, this is because Luna, like JavaScript, does not allow defining defaults in the signature. Personally I find this more flexible as params are often shifted and swapped based on their types or arity, and I find the clutter of defaults makes the signature difficult to read, clouding what would otherwise at a glance give you a hint at what the function does.

greet =: user msg
  console log('hello #{user} #{msg}')

Another example would be a selection of person’s ferrets older than 4. In Ruby this might look similar to below:

person.pets.select do |pet|
  pet.species == 'ferret' and pet.age > 4
end

and the following CoffeeScript:

person.pets.filter (pet) ->
  pet.species == 'ferret' and pet.age > 4

The equivalent canonical Luna would look like this:

person pets select(: pet
  pet species == 'ferret' && pet age > 4
)

Because this looks odd, and because function literals are almost always passed as the last argument, there’s a special-case, allowing a function literal to be defined after the closing ):

person pets select(): pet
  pet species == 'ferret' && pet age > 4

Furthering this special sauce, is the optional parens when no arguments are present:

person pets select: pet
  pet species == 'ferret' && pet age > 4

Taking this even further, I have planned to add deferred evaluation support to Luna, though I haven’t decided yet if it will opt-in (most likely) or not. This feature would provide extremely expressive capabilities, consider the following:

  person pets select(species == 'ferret' && age > 4)

In most imperative languages this would be evaluated before calling the function, so the argument would simply consist of a boolean from this expression. However in Luna these expressions could be evaluated by the callee against a given context, or optionally the caller’s context. Another slick example, mapping the the names of users over 20, and joining them.

users select(age > 20) map(name) join(', ');

Followed by our lovely JavaScript:

users.filter(function(user){
  return user.age > 20;
}).map(function(user){
  return user.name;
}).join(', ');
Number Literals

The one cool thing about Ruby, is that it allows you to place underscores in numbers, which can dramatically increase readability for larger values, so I’ve added this to Luna as well.

million = 1_000_000

Luna also has the typical octal and hexadecimal literals 0755 and 0xff etc, but also implements a few that I have not seen before, because these ones come from CSS! Often when programming in JavaScript with timeouts, intervals etc, so Luna has the s and ms literals. The following are equivalent:

2.5s
2500ms
2500

I’m not a huge fan of getters, visual ambiguity is not an experienced programmers friend, merely a toy that is fun on occasion. That being said s and ms in Luna would be valid slot accessors, as . is not required so they could be implemented within the language itself if it were to have getters:

Number s =:
   self * 1000

Another feature which I first encountered in Ruby, but is no doubt in several other languages is the means to represent a character code, as the character itself. In Ruby this takes the form ?<char>:

?&
// => 38

In Luna the form is #<char>:

#&
// => 38
Operators

Luna of course provides the swath of operators we’ve all come to know and love, along with a few others. I try to stay away form “wordy” operators, such as typical aliases “and”, and “or”, etc. The reason for this is while it does read better as english, large bodies of code quickly become nothing more than a soup of words, which I find tough to look at, after all, english is a terrible language, and an even worth programming language.

Below is the operator precedence table from highest to lowest:

    operator       |  associativity
    ---------------|---------------
    [ ] ( ) sp     |  left
    ++ --          |  right
    ! ~ + -        |  right
    * / %          |  left
    .              |  left
    + -            |  left
    << >>          |  left
    < <= > >=      |  left
    == !=          |  left
    &              |  left
    ^              |  left
    |              |  left
    &&             |  left
    ||             |  left
    ?:             |  right
    = :=           |  right
    not            |  right
    ,              |  left
    if unless      |  left
    while until

The low-precedence not operator is currently the only exception to my rule. This operator is similar to !, however this low precedence is handy, for example:

 User allowed =: realm
   !(banned || blockedFrom(realm))

may become:

 User allowed =: realm
   not banned || blockedFrom(realm)
Lexical Analysis

Luna’s Lexer currently consists of a small body of code (~350 LOC), not yet entirely complete. The Lexer’s job is to transform an input string into a normalized stream of tokens. For example Luna’s internals do not care if the operator == is two or three characters, hell it could be “equal to”, all it cares is that LUNA_OP_EQ is used internally to represent it.

Semantic Analysis

The Parser is in charge of semantics. Using the token stream created by the Lexer, the parser applies rules, and builds an AST, representing the structure of the program. Luna’s parser is a very small (~900 LOC) hand-coded recursive descent parser.

AST Output

For debugging Luna is bundled with an AST pretty printer. The function luna_prettyprint(luna_node_t *node) simply visits each node recursively, incrementing and decrementing the indentation as they are invoked. For example, the following Luna would write a string followed by a LF to stdout:

puts =: str
  stdout write(str . '\\n')

When luna(1) is invoked with the --ast flag, the following is written to stdout:

(= (id puts) (function str 
  (slot
    (id stdout)
    (call
      (id write)
      (. (id str) (string '\n'))))))
The Luna VM

So we’ve talked about syntax a bit, pretty trivial stuff, let’s get to the fun part! I’ve started implementing the VM in a separate branch, once the VM is usable I’ll begin the code generation, after which both will be merged into master. The code generation or “codegen” phase consists of traversing the AST (potentially several times), performing optimizations (such as precomputing expressions such as “10 / 2” to “5”), generating instructions for our virtual machine.

The VM is register based, these faux registers allow us to dramatically reduce the number of instructions, lowering ANSI C dispatch loop overhead, thus increasing performance. The instructions consist of an opcode (operation code), and optional operands. For example the LUNA_OP_ADD instruction would add two numbers. For example with a stack machine, the expression b = a + 10 may be expressed with the following sequence of pseudo instructions, pushing arguments onto the stack:

push 'a'
get
push 3
add
push 'b'
set

As you can see, that’s 6 instructions for a simple expression. In contrast, with our register based VM, we encode the opcode and operands into a 32 bit instruction, as expressed by the following macro used to pack the operands:

#define abc(op, a, b, c) ((LUNA_OP_##op) \
  | ((a) << 8) \
  | ((b) << 16) \
  | ((c) << 24))

With this, our ADD instruction may take the form shown below, where 1 is 2nd register (b), and 0 is the first (a), while 250 is the first constant.

add 1 0 250

Ditching the overhead required by the “abstract” instructions of a stack machine, dramatically increase our performance. While less, more specific instructions is better, the performance benefit is largely due to the fact that our dispatch loop performs less work.

The assembly output below is an ANSI C switch statement with -O3. As you can see gcc create the instruction cmpl here to compare the length of our switch, in this case 12 cases.

L18:
movl    (%rdx), %eax
addq    $4, %rdx
cmpl    $12, %eax
ja  L18
mov %eax, %eax
movslq  (%rsi,%rax,4),%rax
addq    %rsi, %rax
jmp *%rax

I haven’t yet decided on the approach I will take, however I would like Luna to remain portable, but that doesn’t mean we cannot special-case. One solution may be to utilize gcc’s “first class labels” to implement direct (or indirect) threading, using &&label to grab the label’s address. Using this we can perform unconditional jumps, removing the dispatch loop all together. With this technique gcc generates the following for goto **ip++, much better!

movq    (%rbx), %rax
addq    $8, %rbx
jmp *%rax
The End

If you are interested in seeing where the language goes, or trying it out when it becomes usable you can follow the progress on Github. Later I’ll touch on each component in greater detail, but that’s all for now!

Jul 11, 201122 notes
#luna #c
Next page →
2012 2013
  • January
  • February 3
  • March
  • April
  • May
  • June
  • July
  • August
  • September
  • October
  • November
  • December
2011 2012 2013
  • January 2
  • February 2
  • March 4
  • April 3
  • May 2
  • June 2
  • July 4
  • August 3
  • September 1
  • October 2
  • November 1
  • December 5
2010 2011 2012
  • January
  • February 3
  • March
  • April 4
  • May 2
  • June 1
  • July 3
  • August 3
  • September 3
  • October
  • November 3
  • December 1
2010 2011
  • January
  • February
  • March
  • April 8
  • May 1
  • June 3
  • July 4
  • August 4
  • September 3
  • October
  • November 1
  • December 1