code review : BJonas' Interpreter Experiment in J
Table of Contents
1 code review : BJonas' Interpreter Experiment in J
This is an attempt to understand the small interpreter created by BJonas and posted to the J wiki. (It's at jwiki/Scripts/Scheme, but the full text of that page has been reproduced here.
Quoted text and the actual code belong to BJonas. Each section is quoted verbatim, but I re-arranged the sections themselves so they were juxtaposed with the original code. His code and text sections look like this:
NB. interpreter experiment in J by zsban
This is a simple interpreter that can evaluate basic scheme expressions.
Currently the following expressions exist: lambdas, function calls, variables,
if
,set!
, numeric literals.The library has a few simple arithmetic functions bounded. Proper lambdas seem to work (now really, after I've fixed
envfind
), except that varargs are not supported.Most library functions and types like lists and vectors are missing. There's also no type distinctness or quote. These, however, could be easy to implement. Macros would be harder, but also possible; though some derived expression types could be coded directly in J.
call/cc
would be impossible to implement this way, and a garbage collector would be very difficult as well.The interpreter has the following parts. […]
2 tokenizer and reader
The
tok
verb splits a scheme code string to tokens, but doesn't actually decode those tokens.
spcp =: e.&(9 10 11 12 13 32{a.) tok1 =: (((~:1&|.)@:spcp+.(+.1&|.)@:(e.&'()'))<;.2]) tok =: ((#~-.@:spcp@:{.@:>"0)@:tok1@:(,&' ')) ::[:
Ok. spcp
is a monad that simply checks whether its y
argument corresponds to any of the numbered ascii characters (all of which represent whitespace). My guess is the name is a mnemonic for 'space predicate', following the lisp tradition.
As for the other two, my eyes still aren't used to looking at code packed this densely, so I'm going to try to expand them a bit. First, tok1
:
tok1 =: (((~: 1&|.)@:spcp +. (+. 1&|.)@:(e.&'()')) <;.2 ])
The overall structure is:
((a b)@:c d (d b)@:e) f ]
It's used as a monad, and this is what it does:
tok1 '(one (( two) three) (four))' ┌─┬───┬──┬─┬─┬──┬───┬─┬─────┬─────┬─┬─┬─┬────┬─┬─┐ │(│one│ │(│(│ │two│)│ │three│)│ │(│four│)│)│ └─┴───┴──┴─┴─┴──┴───┴─┴─────┴─────┴─┴─┴─┴────┴─┴─┘
The primary verb here is <;.2
, where ;.
is the (dyadic cut conjunction). With a verb on the left (monadic <
, meaning box), and the number 2 on the right, ;.
produces a new verb that takes an array of values on the right and an array of booleans on the left. The booleans indicate where to cut the values before applying the verb.
So the fork on the left (the a…e slots) is just checking for spaces and parentheses (in its left and right sides, respectively). Dyadic +.
is boolean 'or', ~:
is xor/not-equal, and 1&|.
is shifting the input one character to the left. So the right side says cut before and after each paren character, and the left side says cut on whitespace, but only if we're not already in whitespace.
So now this gets used in tok
, which I'll expand for readability here:
tok =: (( #~ -.@: spcp@: {.@: > "0 ) @: tok1 @: (,&' ')) :: [:
The use of the ::
conjunction (adverse) is surprising to me. It's purpose is to specify what to do when the verb on the left throws an error. Since [:
itself throws an error, I don't yet understand why he's doing this.
The rest is one long pipeline. Reading right to left:
- append a space to the input and run
tok1
- let
tok1
cut the string into boxes - for each box:
- open the box
- ignore the first character
- check the rest to see if they're spaces
- invert the result (
-.
means boolean 'not') - strip the spaces (make 1 copy for nonspaces, 0 copies for spaces)
Fair enough!
The
rdr
verb reads a scheme code string to a tree, but the leafs of the tree are still the undecoded tokens.
rdrc =: ('()'i.{.@:>) rdrs =: rdrb`rdre`rdra@.(rdrc@:{.) rdrb =: (<@:}: , rdrs@:>@:{:) @: rdrs@:}. rdre =: <@:}. rdra =: {. , rdrs@:}. rdr1 =: {.@:rdrs@:(,&(<')')) rdr =: (rdr1 @: tok) ::[:
Here we see the mysterious ::[:
again. More importantly, it's just composing rdr1
with tok
.
This is simply a recursive descent parser. Let's follow the call chain, starting with rdr
:
rdr
appliestok
to the input, then appliesrdr1
to the resultrdr1
appends a boxed right paren to the chain of tokens, then callsrdrs
and takes the head of the result.rdrs
selects the first token and then, based on the result ofrdrc
, calls eitherrdrb
,rdre
, orrdra
.rdrc
simply unboxes a token and returns the index of its first character in the string '()':- '(' yields 0, so the 'b' in
rdrb
stands for begin. - ')' yields 1, so the 'e' in
rdre
stands for end. - anything else yields 2 so the 'a' in
rdra
means any
- '(' yields 0, so the 'b' in
Presumably the input is a well-formed s-expression, so the first token is going to be an opening paren. So let's look at rdrb
:
rdrb =: (<@:}: , rdrs@:>@:{:) @: rdrs@:}.
This is a pipeline. From right to left, behead the input (so remove the opening paren token), then call rdrs
on the rest of the tokens.
Let's trace it through with a specific example:
]ts =: '(';'a';'(';'b';'c';')';')' ┌─┬─┬─┬─┬─┬─┬─┐ │(│a│(│b│c│)│)│ └─┴─┴─┴─┴─┴─┴─┘
So far, we've chopped off the first '(' and are now looking at an 'a'. So we need to push rdrb
onto a mental stack for a moment, and look at rdra
, since that's what rdrs
is going to call when it sees an 'a'.
rdra =: {. , rdrs@:}.
This is a fork. It's going to append the head of the list (the boxed 'a') to the result of running rdrs
on the tail. So we're recursing again. Since the tail starts with '(', we're doing another rdrb
, then two rdra
calls for the 'b' and 'c'. All of these cases are recursive, so at this point we're several levels deep into the call stack for each token. We only start to unwind the stack once we hit the first ')'.
Calling rdrs
with ')' as the first token invokes rdre
:
rdre =: <@:}.
This will behead the token stream (removing the leading ')' token) and then put the entire rest of the token stream inside a new box.
So in our example, we're looking at:
')';')';')' ┌─┬─┬─┐ │)│)│)│ └─┴─┴─┘
(The extra ')' would have been appended by rdr1
)
And the result will be:
rdre ')';')';')' ┌─────┐ │┌─┬─┐│ ││)│)││ │└─┴─┘│ └─────┘
This is returned up the chain to the innermost call of rdra
for the 'c' token, where rdra
simply appends it to the 'c'.
(<'c') , rdre ')';')';')' ┌─┬─────┐ │c│┌─┬─┐│ │ ││)│)││ │ │└─┴─┘│ └─┴─────┘
Same thing happens for 'b':
] sofar =. (<'b'), (<'c'), rdre ')';')';')' ┌─┬─┬─────┐ │b│c│┌─┬─┐│ │ │ ││)│)││ │ │ │└─┴─┘│ └─┴─┴─────┘
And now we're back at the innermost call of rdrb
.
rdrb =: (<@:}: , rdrs@:>@:{:) @: rdrs@:}. ^^^^^^^^^^^ this part is done.
This leaves us with the fork, which we'll apply to the structure above.
(<@:}: , rdrs@:>@:{:)
The left side curtails and boxes:
<@:}: sofar ┌─────┐ │┌─┬─┐│ ││b│c││ │└─┴─┘│ └─────┘
The right side takes the tail of the array, unboxes it, and calls rdrs
on that recursively.
Note that f you're used to thinking of the head and tail of a list, remember that in J, the tail is the last item in the array, not a chain of nested cons cells.
So, we can build up the result from right to left ourselves:
{: sofar ┌─────┐ │┌─┬─┐│ ││)│)││ │└─┴─┘│ └─────┘
>@:{: sofar ┌─┬─┐ │)│)│ └─┴─┘
rdrs@:>@:{: sofar ┌───┐ │┌─┐│ ││)││ │└─┘│ └───┘
Now we can complete the fork by appending this value to the the result on the left side:
] sofar2 =. (<@:}: , rdrs@:>@:{:) sofar ┌─────┬───┐ │┌─┬─┐│┌─┐│ ││b│c│││)││ │└─┴─┘│└─┘│ └─────┴───┘
So this is the result of the inner call to rdrb
and now we climb back up the call stack to rdra
, which simply appends this value to the 'a' token:
] sofar3 =. (<'a'), sofar2 ┌─┬─────┬───┐ │a│┌─┬─┐│┌─┐│ │ ││b│c│││)││ │ │└─┴─┘│└─┘│ └─┴─────┴───┘
And now we're back where we started with rdrb
.
rdrb =: (<@:}: , rdrs@:>@:{:) @: rdrs@:}. ^^^^^^^^^^^ here again but at top level
(<@:}: , rdrs@:>@:{:) sofar3 ┌─────────┬┐ │┌─┬─────┐││ ││a│┌─┬─┐│││ ││ ││b│c││││ ││ │└─┴─┘│││ │└─┴─────┘││ └─────────┴┘
Finally, we walk back up to rdr1
, which returns the head.
{. (<@:}: , rdrs@:>@:{:) sofar3 ┌─────────┐ │┌─┬─────┐│ ││a│┌─┬─┐││ ││ ││b│c│││ ││ │└─┴─┘││ │└─┴─────┘│ └─────────┘
The following test string was commented out in the code:
echo rdr '(lambda (x) (+ 1 (* x x) x))' ┌────────────────────────────┐ │┌──────┬───┬───────────────┐│ ││lambda│┌─┐│┌─┬─┬───────┬─┐││ ││ ││x│││+│1│┌─┬─┬─┐│x│││ ││ │└─┘││ │ ││*│x│x││ │││ ││ │ ││ │ │└─┴─┴─┘│ │││ ││ │ │└─┴─┴───────┴─┘││ │└──────┴───┴───────────────┘│ └────────────────────────────┘
So now we have the recursive descent parser.
3 mutable places
The
placmak
,placref
, andplacset
functions create, get, and set the contents of mutable cells: these cells are used to implementset!
. The cells are indexed by integers, and are never destroyed, so we don't have garbage-collection.
placv =: i.0 placmak =: 3 :'<:#placv=:placv,<y' placref =: 3 :'>y{placv' placset =: 4 :'0:placv=:(<y) x}placv'
These are fairly straightforward.
The noun placv
is initialized as an empty array, to which boxed values are appended when placmak
is called. The <:#
symbols on the left of placmak
cause it to return the index of the newly created cell.
The other two functions simply get and set boxed values in this array.
4 default environment
The environment is a rank 2 array whose first column contains the boxed names of variables in the environment, and second column has the boxed indices of the cell in the cell vector that will always contain the contents of that variable.
denv =: i.0 2 denvadd =: 4 :'0:denv=:denv,(,x);placmak y' '+' denvadd +/@:>`'' '-' denvadd ({.-+/@:}.)`(-@:{.)@.(1=#)@:>`'' '*' denvadd */@:>`'' '/' denvadd ({.%*/@:}.)`(%@:{.)@.(1=#)@:>`'' 'floor' denvadd <.@:{.@:>`'' 'exp' denvadd ^@:{.@:>`'' 'log' denvadd ^.@:{.@:>`'' '<' denvadd ([:*./2</\])@:>`'' '=' denvadd ([:*./2=/\])@:>`'' '<=' denvadd ([:*./2<:/\])@:>`'' 'not' denvadd -.@:{.@:>`'' 'g0' denvadd 0
These are also straightforward. The - and / verbs are longer than their + and * equivalents because they have special cases for one argument (producing negative values or an inverse).
5 evaluator
Scheme procedures are represented as J gerunds of monadic functions that accept a list of boxed scheme arguments as its argument. The
lambda
verb creates such a function from the environment and the function body.
runl =: [ <@:run"_ 0 >@:] envfind =: ([:>[:{:[{~{."1@:[i:]) match =: ([ , <@:placmak@:>@:])"0 lambda1 =: 2 :'>@:{: (u , (>@:{.v)match y) runl (<@:}.v)' NB.lambda1 =: 2 :'(u , (>@:{.v)match y) ; (<@:}.v) ; 9' lambda =: 4 :'(x lambda1 y)`(i.0)'
Okay so it looks like runl
is a dyad that applies the verb run
to each item in a list.
The
run
verb runs a scheme source tree (returned byrdr
) in an environment.This function dispatches to one of the six functions
runnum
,runsym
,runset
,runcall
,runif
,runlambda
depending on the type of the expression.
runnum =: {.@:,@:(_.&".)@:>@:] runsym =: placref @: envfind runset =: [: 0: ([envfind 1{>@:]) placset ([run 2{>@:]) runcall =: [: (>@:{. 4 :'x@.0 y' }.) runl runif =: [ run ([:-.[run 1{>@:]) { ((<'0'),~2}.>@:]) runlambda =: [ lambda }.@:>@:] keywd =: ('lambda';'if';'set!')&i. runo =: runlambda`runif`runset`runcall@.(keywd@:{.@:>@:]) runa =: runsym`runnum@.(((e.&'0123456789+-'@:{.@:>)>(e.&(+`-)))@:]) run =: runo`runa@.(1=L.@:])
6 putting it together
eval =: denv&run @: rdr ::[: echo eval 0 :0 (((lambda (fact) (set! fact (lambda (n) (if (< n 1) 1 (* n (fact (- n 1)))))) fact) 0) 5) ) echo eval '((lambda (a) ((lambda (a) a) 2)) 5)' NB. must give 2
7 notes
runl
looks similar to (run &. >
)… how are they different?