code review : BJonas' Interpreter Experiment in J

1. code review : BJonas' Interpreter Experiment in J
2. tokenizer and reader
3. mutable places
4. default environment
5. evaluator
6. putting it together
7. notes

1 code review : BJonas' Interpreter Experiment in J

This is an attempt to understand the small interpreter created by BJonas and posted to the J wiki. (It's at jwiki/Scripts/Scheme, but the full text of that page has been reproduced here.

Quoted text and the actual code belong to BJonas. Each section is quoted verbatim, but I re-arranged the sections themselves so they were juxtaposed with the original code. His code and text sections look like this:

NB. interpreter experiment in J by zsban

This is a simple interpreter that can evaluate basic scheme expressions.

Currently the following expressions exist: lambdas, function calls, variables, if, set!, numeric literals.

The library has a few simple arithmetic functions bounded. Proper lambdas seem to work (now really, after I've fixed envfind), except that varargs are not supported.

Most library functions and types like lists and vectors are missing. There's also no type distinctness or quote. These, however, could be easy to implement. Macros would be harder, but also possible; though some derived expression types could be coded directly in J. call/cc would be impossible to implement this way, and a garbage collector would be very difficult as well.

The interpreter has the following parts. […]

2 tokenizer and reader

The tok verb splits a scheme code string to tokens, but doesn't actually decode those tokens.

spcp =: e.&(9 10 11 12 13 32{a.)
tok1 =: (((~:1&|.)@:spcp+.(+.1&|.)@:(e.&'()'))<;.2])

tok =: ((#~-.@:spcp@:{.@:>"0)@:tok1@:(,&' ')) ::[:

Ok. spcp is a monad that simply checks whether its y argument corresponds to any of the numbered ascii characters (all of which represent whitespace). My guess is the name is a mnemonic for 'space predicate', following the lisp tradition.

As for the other two, my eyes still aren't used to looking at code packed this densely, so I'm going to try to expand them a bit. First, tok1:

tok1 =: (((~: 1&|.)@:spcp +. (+. 1&|.)@:(e.&'()'))  <;.2  ])

The overall structure is:

((a b)@:c  d  (d b)@:e) f ]

It's used as a monad, and this is what it does:

   tok1 '(one  ((  two)     three) (four))'
┌─┬───┬──┬─┬─┬──┬───┬─┬─────┬─────┬─┬─┬─┬────┬─┬─┐
│(│one│  │(│(│  │two│)│     │three│)│ │(│four│)│)│
└─┴───┴──┴─┴─┴──┴───┴─┴─────┴─────┴─┴─┴─┴────┴─┴─┘

The primary verb here is <;.2, where ;. is the (dyadic cut conjunction). With a verb on the left (monadic <, meaning box), and the number 2 on the right, ;. produces a new verb that takes an array of values on the right and an array of booleans on the left. The booleans indicate where to cut the values before applying the verb.

So the fork on the left (the a…e slots) is just checking for spaces and parentheses (in its left and right sides, respectively). Dyadic +. is boolean 'or', ~: is xor/not-equal, and 1&|. is shifting the input one character to the left. So the right side says cut before and after each paren character, and the left side says cut on whitespace, but only if we're not already in whitespace.

So now this gets used in tok, which I'll expand for readability here:

tok =: ((  #~    -.@: spcp@: {.@: >   "0  ) @: tok1 @: (,&' ')) :: [:

The use of the :: conjunction (adverse) is surprising to me. It's purpose is to specify what to do when the verb on the left throws an error. Since [: itself throws an error, I don't yet understand why he's doing this.

The rest is one long pipeline. Reading right to left:

append a space to the input and run tok1
let tok1 cut the string into boxes
for each box:
- open the box
- ignore the first character
- check the rest to see if they're spaces
- invert the result (-. means boolean 'not')
- strip the spaces (make 1 copy for nonspaces, 0 copies for spaces)

Fair enough!

The rdr verb reads a scheme code string to a tree, but the leafs of the tree are still the undecoded tokens.

rdrc =: ('()'i.{.@:>)
rdrs =: rdrb`rdre`rdra@.(rdrc@:{.)
rdrb =: (<@:}: , rdrs@:>@:{:) @: rdrs@:}.
rdre =: <@:}.
rdra =: {. , rdrs@:}.
rdr1 =: {.@:rdrs@:(,&(<')'))

rdr =: (rdr1 @: tok) ::[:

Here we see the mysterious ::[: again. More importantly, it's just composing rdr1 with tok.

This is simply a recursive descent parser. Let's follow the call chain, starting with rdr:

rdr applies tok to the input, then applies rdr1 to the result
rdr1 appends a boxed right paren to the chain of tokens, then calls rdrs and takes the head of the result.
rdrs selects the first token and then, based on the result of rdrc, calls either rdrb, rdre, or rdra.
rdrc simply unboxes a token and returns the index of its first character in the string '()':
- '(' yields 0, so the 'b' in rdrb stands for begin.
- ')' yields 1, so the 'e' in rdre stands for end.
- anything else yields 2 so the 'a' in rdra means any

Presumably the input is a well-formed s-expression, so the first token is going to be an opening paren. So let's look at rdrb:

rdrb =: (<@:}: , rdrs@:>@:{:) @: rdrs@:}.

This is a pipeline. From right to left, behead the input (so remove the opening paren token), then call rdrs on the rest of the tokens.

Let's trace it through with a specific example:

   ]ts =: '(';'a';'(';'b';'c';')';')'
┌─┬─┬─┬─┬─┬─┬─┐
│(│a│(│b│c│)│)│
└─┴─┴─┴─┴─┴─┴─┘

So far, we've chopped off the first '(' and are now looking at an 'a'. So we need to push rdrb onto a mental stack for a moment, and look at rdra, since that's what rdrs is going to call when it sees an 'a'.

rdra =: {. , rdrs@:}.

This is a fork. It's going to append the head of the list (the boxed 'a') to the result of running rdrs on the tail. So we're recursing again. Since the tail starts with '(', we're doing another rdrb, then two rdra calls for the 'b' and 'c'. All of these cases are recursive, so at this point we're several levels deep into the call stack for each token. We only start to unwind the stack once we hit the first ')'.

Calling rdrs with ')' as the first token invokes rdre:

rdre =: <@:}.

This will behead the token stream (removing the leading ')' token) and then put the entire rest of the token stream inside a new box.

So in our example, we're looking at:

   ')';')';')'
┌─┬─┬─┐
│)│)│)│
└─┴─┴─┘

(The extra ')' would have been appended by rdr1)

And the result will be:

   rdre ')';')';')'
┌─────┐
│┌─┬─┐│
││)│)││
│└─┴─┘│
└─────┘

This is returned up the chain to the innermost call of rdra for the 'c' token, where rdra simply appends it to the 'c'.

   (<'c') , rdre ')';')';')'
┌─┬─────┐
│c│┌─┬─┐│
│ ││)│)││
│ │└─┴─┘│
└─┴─────┘

Same thing happens for 'b':

   ] sofar =. (<'b'), (<'c'), rdre ')';')';')'
┌─┬─┬─────┐
│b│c│┌─┬─┐│
│ │ ││)│)││
│ │ │└─┴─┘│
└─┴─┴─────┘

And now we're back at the innermost call of rdrb.

rdrb =: (<@:}: , rdrs@:>@:{:) @: rdrs@:}.
                              ^^^^^^^^^^^ this part is done.

This leaves us with the fork, which we'll apply to the structure above.

(<@:}: , rdrs@:>@:{:)

The left side curtails and boxes:

   <@:}: sofar
┌─────┐
│┌─┬─┐│
││b│c││
│└─┴─┘│
└─────┘

The right side takes the tail of the array, unboxes it, and calls rdrs on that recursively.

Note that f you're used to thinking of the head and tail of a list, remember that in J, the tail is the last item in the array, not a chain of nested cons cells.

So, we can build up the result from right to left ourselves:

   {: sofar
┌─────┐
│┌─┬─┐│
││)│)││
│└─┴─┘│
└─────┘

   >@:{: sofar
┌─┬─┐
│)│)│
└─┴─┘

   rdrs@:>@:{: sofar
┌───┐
│┌─┐│
││)││
│└─┘│
└───┘

Now we can complete the fork by appending this value to the the result on the left side:

   ] sofar2 =. (<@:}: , rdrs@:>@:{:) sofar
┌─────┬───┐
│┌─┬─┐│┌─┐│
││b│c│││)││
│└─┴─┘│└─┘│
└─────┴───┘

So this is the result of the inner call to rdrb and now we climb back up the call stack to rdra, which simply appends this value to the 'a' token:

   ] sofar3 =. (<'a'), sofar2
┌─┬─────┬───┐
│a│┌─┬─┐│┌─┐│
│ ││b│c│││)││
│ │└─┴─┘│└─┘│
└─┴─────┴───┘

And now we're back where we started with rdrb.

rdrb =: (<@:}: , rdrs@:>@:{:) @: rdrs@:}.
                              ^^^^^^^^^^^ here again but at top level

   (<@:}: , rdrs@:>@:{:) sofar3
┌─────────┬┐
│┌─┬─────┐││
││a│┌─┬─┐│││
││ ││b│c││││
││ │└─┴─┘│││
│└─┴─────┘││
└─────────┴┘

Finally, we walk back up to rdr1, which returns the head.

   {. (<@:}: , rdrs@:>@:{:) sofar3
┌─────────┐
│┌─┬─────┐│
││a│┌─┬─┐││
││ ││b│c│││
││ │└─┴─┘││
│└─┴─────┘│
└─────────┘

The following test string was commented out in the code:

   echo rdr '(lambda (x) (+ 1 (* x x) x))'
┌────────────────────────────┐
│┌──────┬───┬───────────────┐│
││lambda│┌─┐│┌─┬─┬───────┬─┐││
││      ││x│││+│1│┌─┬─┬─┐│x│││
││      │└─┘││ │ ││*│x│x││ │││
││      │   ││ │ │└─┴─┴─┘│ │││
││      │   │└─┴─┴───────┴─┘││
│└──────┴───┴───────────────┘│
└────────────────────────────┘

So now we have the recursive descent parser.

3 mutable places

The placmak, placref, and placset functions create, get, and set the contents of mutable cells: these cells are used to implement set!. The cells are indexed by integers, and are never destroyed, so we don't have garbage-collection.

placv =: i.0
placmak =: 3 :'<:#placv=:placv,<y'
placref =: 3 :'>y{placv'
placset =: 4 :'0:placv=:(<y) x}placv'

These are fairly straightforward.

The noun placv is initialized as an empty array, to which boxed values are appended when placmak is called. The <:# symbols on the left of placmak cause it to return the index of the newly created cell.

The other two functions simply get and set boxed values in this array.

4 default environment

The environment is a rank 2 array whose first column contains the boxed names of variables in the environment, and second column has the boxed indices of the cell in the cell vector that will always contain the contents of that variable.

denv =: i.0 2
denvadd =: 4 :'0:denv=:denv,(,x);placmak y'
'+' denvadd +/@:>`''
'-' denvadd ({.-+/@:}.)`(-@:{.)@.(1=#)@:>`''
'*' denvadd */@:>`''
'/' denvadd ({.%*/@:}.)`(%@:{.)@.(1=#)@:>`''
'floor' denvadd <.@:{.@:>`''
'exp' denvadd ^@:{.@:>`''
'log' denvadd ^.@:{.@:>`''
'<' denvadd ([:*./2</\])@:>`''
'=' denvadd ([:*./2=/\])@:>`''
'<=' denvadd ([:*./2<:/\])@:>`''
'not' denvadd -.@:{.@:>`''
'g0' denvadd 0

These are also straightforward. The - and / verbs are longer than their + and * equivalents because they have special cases for one argument (producing negative values or an inverse).

5 evaluator

Scheme procedures are represented as J gerunds of monadic functions that accept a list of boxed scheme arguments as its argument. The lambda verb creates such a function from the environment and the function body.

runl =: [ <@:run"_ 0 >@:]
envfind =: ([:>[:{:[{~{."1@:[i:])
match =: ([ , <@:placmak@:>@:])"0
lambda1 =: 2 :'>@:{: (u , (>@:{.v)match y) runl (<@:}.v)'
NB.lambda1 =: 2 :'(u , (>@:{.v)match y) ; (<@:}.v) ; 9'

lambda =: 4 :'(x lambda1 y)`(i.0)'

Okay so it looks like runl is a dyad that applies the verb run to each item in a list.

The run verb runs a scheme source tree (returned by rdr) in an environment.

This function dispatches to one of the six functions runnum, runsym, runset, runcall, runif, runlambda depending on the type of the expression.

runnum =: {.@:,@:(_.&".)@:>@:]
runsym =: placref @: envfind
runset =: [: 0: ([envfind 1{>@:]) placset ([run 2{>@:])
runcall =: [: (>@:{. 4 :'x@.0 y' }.) runl
runif =: [ run ([:-.[run 1{>@:]) { ((<'0'),~2}.>@:])
runlambda =: [ lambda }.@:>@:]

keywd =: ('lambda';'if';'set!')&i.
runo =: runlambda`runif`runset`runcall@.(keywd@:{.@:>@:])
runa =: runsym`runnum@.(((e.&'0123456789+-'@:{.@:>)>(e.&(+`-)))@:])

run =: runo`runa@.(1=L.@:])

6 putting it together

eval =: denv&run @: rdr ::[:

echo eval 0 :0
        (((lambda (fact) (set! fact
                (lambda (n) (if (< n 1) 1 (* n (fact (- n 1)))))) fact) 0) 5)
)
echo eval '((lambda (a) ((lambda (a) a) 2)) 5)' NB. must give 2

7 notes

runl looks similar to (run &. >)… how are they different?