Building a language in Typescript
So you want to create your own programming language but you have no clue where to start?
Then this guide is for you.
We're going to step through how to create a transpiler, converting our new language's syntax into something that is Javascript compatible.
WIP Warning: This guide is very much a work in progress; sections may be missing and information may be incomplete or inaccurate. Proceed at your own risk, but also feel free to file issues for anything you encounter.
Prerequisites
We're going to assume basic knowledge around Javascript/Typescript. This guide will try and cover all language aspects we use, but you should have a high level understanding of how types and generics work.
Not Covered
There are a few things this guide currently does not cover:
- We're not going to cover how to compile to 'binary'
- Our language is not going to support types
This is not to say this guide will never cover these things - if somebody wants to help contribute, pull requests would be more than welcome.
Overview
Before we get started discussing why and how we do specific things, we need to come up with a super high level overview of what we want our language to look like.
We should start simple, and evolve our language as we better understand how to build our compiler.
new hello = 'world'
print hello
Above, we have a variable assignment, and then a statement to print it to stdout. This will be the basis of our language.
The equivalent in Javascript would look like:
const hello = "world"
console.log(hello)
DSLs and languages are typically processed in three distinct phases:
- Lexing
- Parsing
- Generation
The Overview section will go over these at a high level, while the following chapters will deep-dive into each one, where we will discuss the hows and whys.
Lexing
This is the process of converting your program into a distinct set of base elements.
Think of things like:
string
start block
end block
for loop
assignment
The above list is certainly not exhaustive, and depends entirely on the syntax and semantics of the language that we're building.
An important thing to note about our lexing phase: we are not assigning meaning to anything here; we are only describing what our program is.
new hello = 'world'
print hello
Take our above DSL as an example: when we lex it, we may end up with a list of tokens similar to the following:
[
{ type: 'LineBreak' },
{ type: 'VariableDeclaration' },
{ type: 'Literal', value: 'hello' },
{ type: 'AssignmentOperator' },
{ type: 'String', value: 'world' },
{ type: 'LineBreak' },
{ type: 'Log' },
{ type: 'Literal', value: 'hello' },
{ type: 'LineBreak' }
]
Ignore the Literal distinction if you need to - think of it as just a name. We'll cover this in a later chapter.
As mentioned above, we're not parsing/storing context or meaning here; it's simply a one-dimensional array describing our source code.
Parsing
Now that we have run a lexer on our source code, it's time to assign meaning, and to further understand how elements in our program interact with each other.
This step should return what we call an AST - an abstract syntax tree.
An AST is a tree-like representation of our source code, linking relevant tokens together in a way that makes sense specific to our language.
[
{ type: 'LineBreak' },
{ type: 'VariableDeclaration' },
{ type: 'Literal', value: 'hello' },
{ type: 'AssignmentOperator' },
{ type: 'String', value: 'world' },
{ type: 'LineBreak' },
{ type: 'Log' },
{ type: 'Literal', value: 'hello' },
{ type: 'LineBreak' }
]
Taking our output from our lexer example, we can infer that there are probably two different things going on here.
The first one being that we're creating a variable, and then assigning it a value.
{ type: 'VariableDeclaration' },
{ type: 'Literal', value: 'hello' },
{ type: 'AssignmentOperator' },
{ type: 'String', value: 'world' },
You can see we have our VariableDeclaration, then the name of our variable, then finally the actual value we wish to assign to it.
The second action that our code seems to be taking is that we're logging a value.
{ type: 'Log' },
{ type: 'Literal', value: 'hello' },
Can you see how we can semantically recognise how these tokens should be grouped together? This is what our parser is going to do.
Now, back to our AST.
Taking the above 2 actions, and giving them the necessary context they require to operate, we may end up with an AST that looks similar to this:
{
type: 'Program',
children: [
{
type: 'Assignment',
name: 'hello',
value: {
type: 'String',
value: 'world'
}
},
{
type: 'Log',
children: [
{
type: 'Literal',
value: 'hello'
}
]
}
]
}
We end up with 3 nodes.
- A root level Program node (remember, our AST is a tree, so it needs something at the very top)

And then, as children of that node:
- An Assignment node
- A Log node

All 3 of these AST entries have context as well: they either have a list of children, or other relevant contextual information (like the required name and the value for our assignment).
Generation
This is how we're actually going to output our Javascript.
We're going to take our language's AST and transform it into an AST that a Javascript codegen library understands. We'll be using escodegen to generate the actual JS output.
If everything goes to plan, we’ll end up with something similar to:
const hello = "world"
console.log(hello)
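To make this a little more concrete, here's a minimal sketch of what that last step can look like. The node shapes below are standard ESTree (the format escodegen consumes); the AST we feed it will eventually be derived from our own AST, but here it's written out by hand:

import * as escodegen from 'escodegen'

// A hand-written ESTree program equivalent to our example output.
const estree = {
  type: 'Program',
  body: [
    {
      // const hello = 'world'
      type: 'VariableDeclaration',
      kind: 'const',
      declarations: [{
        type: 'VariableDeclarator',
        id: { type: 'Identifier', name: 'hello' },
        init: { type: 'Literal', value: 'world' }
      }]
    },
    {
      // console.log(hello)
      type: 'ExpressionStatement',
      expression: {
        type: 'CallExpression',
        callee: {
          type: 'MemberExpression',
          computed: false,
          object: { type: 'Identifier', name: 'console' },
          property: { type: 'Identifier', name: 'log' }
        },
        arguments: [{ type: 'Identifier', name: 'hello' }]
      }
    }
  ]
}

console.log(escodegen.generate(estree))
// const hello = 'world';
// console.log(hello);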
In the future this book may be expanded to build upon alternative output languages.
Lexing
As per our previous chapters, our language is going to look something like:
new hello = 'world'
print hello
We’ll extend it eventually, but for the moment, while we go over the basic constructs, we’ll keep this as the entirety of the supported syntax.
Before we discuss how to build our lexer, we need to understand the different parts of our source, and how we should interpret them.
Tokens
For reference, our source:
new hello = 'world'
print hello
Let's break down the above into things.
A thing is any distinct value, property, symbol etc that could change how our program is interpreted.
Things
new
This is the start of a variable declaration. Now, remember, in our lexer we don't really care about what it's declaring, or what its value is; we only care that it exists in this spot.
hello
This is what we will call a Literal. A literal is something that isn’t syntactically special, and may be in our program because it designates something like a variable or a function name.
So why is this not just called Variable?
Looking at the source we can tell that this will be a variable name, but when our lexer is doing its thing, it won't know if this is a variable, or a function call, or a file reference, or anything else it could potentially be in our language.
So we come up with a 'higher level' value type that could be any of these.
=
We now have an assignment operator. In our language (the same as Javascript), anything on the right side of a = is considered to be a value.
It's entirely possible to remove this if you want, and change our DSL to work with:
new hello 'world'
but for the sake of this book we will keep it in.
'world'
This is a string! Fairly simple.
print
This is another Literal.
Again, looking at our source we can see that this looks suspiciously like a function call, but our lexer won't know this.
We'll include this in our "standard library" for our language, when we get around to parsing our AST and generating code.
hello
Surprise, it's another literal. This time we're passing this as an argument, referencing a variable that we created previously.
Soo..
Some of these values are static, like our 'world' string; some are syntactically important, like our = assignment character. Some things may eventually control program flow (for example, an if statement), and some may just be Literals.
LineBreak
The last tokens that we're actually missing in the above are the line breaks in our source.
These may or may not be important depending on the semantics of your language, but they’re useful to keep in your list of tokens nevertheless.
Our types
enum TokenType {
VariableDeclaration = 'VariableDeclaration',
AssignmentOperator = 'AssignmentOperator',
Literal = 'Literal',
String = 'String',
LineBreak = 'LineBreak',
Log = 'Log'
}
interface TokenNode<T extends TokenType> {
type: T
}
interface TokenValueNode<T extends TokenType> extends TokenNode<T> {
value: string
}
type Token =
TokenNode<TokenType.AssignmentOperator> |
TokenNode<TokenType.VariableDeclaration> |
TokenNode<TokenType.LineBreak> |
TokenNode<TokenType.Log> |
TokenValueNode<TokenType.Literal> |
TokenValueNode<TokenType.String>
Phew, some actual code.
Take a minute to read over this, and we'll step through it to explain what it's doing, and why we're doing it.
TokenType
enum TokenType {
VariableDeclaration = 'VariableDeclaration',
AssignmentOperator = 'AssignmentOperator',
Literal = 'Literal',
String = 'String',
LineBreak = 'LineBreak',
Log = 'Log'
}
We start with all the possible types of tokens we may encounter.
You may notice that we've explicitly called out Log as its own type; this makes things much easier while we're still building the basics. Once we implement proper function calls, this can be removed.
This list will grow as we expand our language, but it's small enough to just inline for the moment.
TokenNode
interface TokenNode<T extends TokenType> {
type: T
}
A basic interface that will be the "lowest level" node type that we support.
If you're not super familiar with Typescript or generics syntax, here's a quick recap:
T
Let's assume for the minute that the JS/TS standard library didn't come with a type for an array, and we wanted to create one that supported all of the basic Javascript primitives: string, number, boolean etc.
To accomplish this, we may write:
interface StringArray {
elements: string[]
}
interface NumberArray {
elements: number[]
}
const strings: StringArray = { elements: [ 'hello', 'world' ] }
const numbers: NumberArray = { elements: [ 1, 2, 3 ] }
This would work, but would mean that we're duplicating code for every single type that we may want to support.
To make this easier, Typescript allows us to pass types as arguments to interfaces & functions. These arguments can then be used as values for types inside that definition.
Take the following for example:
interface Array<T> {
elements: T[]
}
Here we are taking a single generic parameter T, and we are using it as the value for our elements key.
This means that we can now use this interface like so:
const strings: Array<string> = { elements: [ 'hello', 'world' ] }
const numbers: Array<number> = { elements: [ 1, 2, 3 ] }
No more duplicating types.
T extends TokenType
So now that we know that T is a generic, what is this extends TokenType thing?
This is called a type constraint.
At a high level, this is equivalent to 'is one of X', where X is the type after extends.
So in this case, we're saying that T needs to be one of the possible values of TokenType.
Works:
type VariableDeclarationNode = TokenNode<TokenType.VariableDeclaration>
Will not work:
type MyAwesomeNode = TokenNode<'awesome'>
// Type '"awesome"' does not satisfy the constraint 'TokenType'
TokenNode
So our interface here is just an object with a single field, type, whose type is whatever we pass in when constructing our TokenNode.
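For example, constructing one of these by hand (a throwaway snippet, not part of our final code) looks like:

// The type parameter pins down exactly which token this is.
const lineBreak: TokenNode<TokenType.LineBreak> = {
  type: TokenType.LineBreak
}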
TokenValueNode
interface TokenValueNode<T extends TokenType> extends TokenNode<T> {
value: string
}
We now create another interface, this one being an extension of our base TokenNode, with an additional field called value.
This will be used when we need to capture what a thing references, or what it is called, when we build our token.
Example
If we take our (shortened) DSL of:
new hello
We could tokenise this as:
[
{ type: TokenType.VariableDeclaration },
{ type: TokenType.Literal }
]
But.. when we go forward, we have no idea what this literal is actually called.
When we introduce our TokenValueNode, our result would look like this:
[
{ type: TokenType.VariableDeclaration },
{ type: TokenType.Literal, value: 'hello' }
]
Now we know what the literal's value is.
Token
type Token =
TokenNode<TokenType.AssignmentOperator> |
TokenNode<TokenType.VariableDeclaration> |
TokenNode<TokenType.LineBreak> |
TokenNode<TokenType.Log> |
TokenValueNode<TokenType.Literal> |
TokenValueNode<TokenType.String>
Let's put all of these possible types together, and create one super type called Token.
This is the type we'll be using whenever we reference a token, as a token can be any of these different values.
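Because type acts as a discriminant, Typescript will also narrow a Token for us whenever we check it - something we'll lean on constantly in the parser. A quick illustrative sketch:

function describeToken (token: Token): string {
  if (token.type === TokenType.String) {
    // In this branch, Typescript knows token has a value field.
    return `a string containing '${token.value}'`
  }
  return `a ${token.type} token`
}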
A basic lexer
export function tokeniser (input: string): Token[] {
const out: Token[] = []
let currentPosition = 0
while (currentPosition < input.length) {
// Process token, increment currentPosition
}
return out
}
This will be the main body for our tokeniser.
We take in our program - input - and return an array of valid tokens.
We start at position 0 in our input, and loop over each character, building up a list of tokens and identifying what each one means in the context of our language.
Lookaheads
Now, we don’t want to explicitly check each character in our program against our list of possible tokens manually, so lets create a basic helper function.
function lookaheadString (str: string): boolean {
const parts = str.split('')
for (let i = 0; i < parts.length; i++) {
if (input[currentPosition + i] !== parts[i]) {
return false
}
}
return true
}
A couple of our tokens can be matched against their string equivalent:
- new
- print
- =
- line breaks
Some of our more advanced tokens - ones that might have other tokens between them (think comments), or ones that require us to capture their values (like literals) - will be processed separately.
This lookaheadString function will take a string, and then 'look ahead' in our input to see if the characters following our current position match it.
Something to note is that we're using currentPosition + i instead of currentPosition++. We don't actually want to increment currentPosition unless our string completely matches, as we'd otherwise have to roll it back to where it started in order to keep checking other strings.
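As an aside: since our keys are plain strings, this particular helper could also be written with the built-in String.prototype.startsWith, which accepts a position to start matching from. A minimal equivalent:

function lookaheadString (str: string): boolean {
  // Checks str against input without moving currentPosition.
  return input.startsWith(str, currentPosition)
}

We'll keep the manual loop in the book, as the same shape generalises nicely to the regex-based lookahead we'll need shortly.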
To help facilitate matching strings to tokens, let's create a map for the tokens that can be matched against fixed strings - ones we don't want to be classed as literal values:
const tokenStringMap: Array<{
key: string,
value: Token
}> = [
{ key: '\n', value: { type: TokenType.LineBreak } },
{ key: 'new', value: { type: TokenType.VariableDeclaration } },
{ key: '=', value: { type: TokenType.AssignmentOperator } },
{ key: 'print', value: { type: TokenType.Log } }
]
Now that we have this, we can use it to loop over things, and actually start processing our input.
Processing tokens
Let's modify the while loop from the previous chapter to:
while (currentPosition < input.length) {
const currentToken = input[currentPosition]
// Our language doesn't care about whitespace.
if (currentToken === ' ') {
currentPosition++
continue
}
let didMatch: boolean = false
for (const { key, value } of tokenStringMap) {
if (!lookaheadString(key)) {
continue
}
out.push(value)
currentPosition += key.length
didMatch = true
}
if (didMatch) continue
}
Our tokeniser now processes roughly half of our defined language!
Let's step through what we're doing here.
const currentToken = input[currentPosition]
if (currentToken === ' ') {
currentPosition++
continue
}
Here’s our first real check.
As our language semantics don’t change on whitespace (unlike, for example, Python), we can completely ignore it if it’s the current token.
Whitespace is still significant if we’re doing things like lookaheads - the space might be part of a string - so we only skip it if the current token is a string.
You might notice that we're assigning the current input character to currentToken
, this is just for convenience.
let didMatch: boolean = false
for (const { key, value } of tokenStringMap) {
if (!lookaheadString(key)) {
continue
}
out.push(value)
currentPosition += key.length
didMatch = true
}
if (didMatch) continue
Here we iterate over tokenStringMap and look for a match.
If we find one, we add the equivalent token to our return array (out), increment currentPosition by the number of characters that we "consumed", and then we continue.
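If you want to sanity-check this intermediate state, feeding in only the syntax we can already match should produce tokens. Something like the following (our own throwaway test - note it avoids literals and strings, which we can't handle yet):

console.log(tokeniser('new = print\n'))
// [
//   { type: 'VariableDeclaration' },
//   { type: 'AssignmentOperator' },
//   { type: 'Log' },
//   { type: 'LineBreak' }
// ]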
Matching strings & literals
Now that we have implemented checks for our known token types, it’s time to match things we aren’t aware of ahead of time, or special values like strings.
For this to work, we need another lookahead-type utility function. This one will take a regex instead of a string.
function lookahead (match: RegExp, matchNext?: RegExp): string[] {
  const bucket: string[] = []
  while (true) {
    const nextIndex = currentPosition + bucket.length
    const nextToken = input[nextIndex]
    if (!nextToken) {
      break
    }
    let m: RegExp = match
    if (matchNext && bucket.length) {
      m = matchNext
    }
    if (!m.test(nextToken)) {
      break
    }
    bucket.push(nextToken)
  }
  return bucket
}
This may look complicated to start with, but it’s basically just doing what our main while loop is!
It’s iterating over the next characters in our input, trying to see if they match either our start or end regex, and then returns them.
After implementing this function, matching both string and literal values should be fairly straightforward.
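To get a feel for it before we wire it in, here's an illustrative call (assuming input is 'abc1 = ...' and currentPosition is 0):

// The second regex allows digits for every character after the first.
const bucket = lookahead(/[a-zA-Z]/, /[a-zA-Z0-9]/)
// bucket -> ['a', 'b', 'c', '1'] - it stops at the space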
strings
What are strings?
In our language, a string is going to be denoted by a ', then any number of characters, and then another '.
So.. if our current token is ', we're starting a new string, and we want to match everything until the next '.
How?
Utilising the lookahead function we just created, we can collect everything up until something we specify. In this case, we want to match everything that is not a '.
Code
if (currentToken === "'") {
currentPosition++
const bucket = lookahead(/[^']/)
out.push({
type: TokenType.String,
value: bucket.join('')
})
currentPosition += bucket.length + 1
continue
}
Firstly, we only run this if our current token is the start of a string, which is ' in this case.
We could easily implement matching double-quote strings as well, but that can be left as an exercise for the reader.
You may notice that immediately inside our if statement, we do currentPosition++; that's because we don't actually care about the ' token now that we know we're inside a string, so we just skip it.
Next, we collect an array of the characters that exist up until the next ' we find! This is what our regex of /[^']/ matches: anything that is not a '.
From here we add our new string token to our list, then we increment currentPosition by bucket.length + 1.
Why + 1?
Well, if we didn't do that, the next time the while loop runs we'd be matching the closing ' of the string we just consumed - our tokeniser would think it's the start of a new string, and we'd end up matching the exact inverse of what we wanted to!
literals
The code for parsing literals is extremely similar to the string matching code.
const literalRegex = /[a-zA-Z]/
const literalRegexNext = /[a-zA-Z0-9]/
if (literalRegex.test(currentToken)) {
const bucket = lookahead(literalRegex, literalRegexNext)
out.push({
type: TokenType.Literal,
value: bucket.join('')
})
currentPosition += bucket.length
continue
}
A quick note: this should be put after our string check, and should nearly always be the last check (if) that occurs in our while loop, as we're basically "catch-all'ing" syntax we don't otherwise recognise and assuming it's a literal.
You may notice that we're actually passing two regexes to our lookahead function here. This is because we want to differentiate between what a literal is allowed to start with and what it can contain.
In our language (exactly the same as Javascript), a literal won't be allowed to start with a number, but will be allowed to contain numbers.
This isn't strictly necessary - as our compiler gets more advanced we'll be able to substitute our variable names with "valid" Javascript ones - but to start with (for both ease and development speed) we're just going to map them 1-1.
Testing
Now that we have all of the parts of our lexer together, we should do one last thing.
At the very end of your outer while loop, add the following line:
throw new Error(`Unknown input character: ${currentToken}`)
An extremely simple error message, but it should help us notice if there's anything in our input program that we can't tokenise just yet.
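For reference, here's the whole tokeniser assembled from the pieces above. Nesting the lookahead helpers inside the function is one reasonable way to give them access to input and currentPosition - your file layout may differ (tokenStringMap, literalRegex and literalRegexNext are the definitions from earlier):

export function tokeniser (input: string): Token[] {
  const out: Token[] = []
  let currentPosition = 0

  function lookaheadString (str: string): boolean {
    const parts = str.split('')
    for (let i = 0; i < parts.length; i++) {
      if (input[currentPosition + i] !== parts[i]) {
        return false
      }
    }
    return true
  }

  function lookahead (match: RegExp, matchNext?: RegExp): string[] {
    const bucket: string[] = []
    while (true) {
      const nextIndex = currentPosition + bucket.length
      const nextToken = input[nextIndex]
      if (!nextToken) {
        break
      }
      let m: RegExp = match
      if (matchNext && bucket.length) {
        m = matchNext
      }
      if (!m.test(nextToken)) {
        break
      }
      bucket.push(nextToken)
    }
    return bucket
  }

  while (currentPosition < input.length) {
    const currentToken = input[currentPosition]

    // Our language doesn't care about whitespace.
    if (currentToken === ' ') {
      currentPosition++
      continue
    }

    // Fixed strings: keywords, operators and line breaks.
    let didMatch: boolean = false
    for (const { key, value } of tokenStringMap) {
      if (!lookaheadString(key)) {
        continue
      }
      out.push(value)
      currentPosition += key.length
      didMatch = true
    }
    if (didMatch) continue

    // Strings: consume everything between a pair of single quotes.
    if (currentToken === "'") {
      currentPosition++
      const bucket = lookahead(/[^']/)
      out.push({
        type: TokenType.String,
        value: bucket.join('')
      })
      currentPosition += bucket.length + 1
      continue
    }

    // Literals: our catch-all, so this check stays last.
    if (literalRegex.test(currentToken)) {
      const bucket = lookahead(literalRegex, literalRegexNext)
      out.push({
        type: TokenType.Literal,
        value: bucket.join('')
      })
      currentPosition += bucket.length
      continue
    }

    throw new Error(`Unknown input character: ${currentToken}`)
  }

  return out
}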
Now, we can call our function with what we came up with above:
console.log(tokeniser(`
new hello = 'world'
print hello
`))
And if everything works, the output should look something like:
[
{ type: 'LineBreak' },
{ type: 'VariableDeclaration' },
{ type: 'Literal', value: 'hello' },
{ type: 'AssignmentOperator' },
{ type: 'String', value: 'world' },
{ type: 'LineBreak' },
{ type: 'Log' },
{ type: 'Literal', value: 'hello' },
{ type: 'LineBreak' }
]
You can see that each part of our program has been tokenised correctly, and values have been saved where required!
Parsing
Now that we have a working lexer, we need to take these resulting tokens and build our AST.
As mentioned in our intro chapter, our parser adds context and meaning to our program.
If everything goes right, we should end up with an AST like this:
{
type: 'Program',
children: [
{
type: 'Assignment',
name: 'hello',
value: {
type: 'String',
value: 'world'
}
},
{
type: 'Log',
children: [
{
type: 'Literal',
value: 'hello'
}
]
}
]
}
Our types
enum ASTNodeType {
Program = 'Program',
Literal = 'Literal',
String = 'String',
Assignment = 'Assignment',
Log = 'Log'
}
interface ASTValueNode<T extends ASTNodeType, K> {
type: T,
value: K
}
interface ASTProgramNode {
type: ASTNodeType.Program,
children: ASTNode[]
}
interface ASTAssignmentNode {
type: ASTNodeType.Assignment,
name: string,
value: ASTNode
}
interface ASTLogNode {
type: ASTNodeType.Log,
children: ASTNode[]
}
type ASTNode =
ASTValueNode<ASTNodeType.String, string> |
ASTValueNode<ASTNodeType.Literal, string> |
ASTProgramNode |
ASTAssignmentNode |
ASTLogNode
These types are very similar to our lexer types, so we'll only go over these briefly.
ASTNodeType
Another enum, this one containing our possible AST node types.
Like our lexer types, this contains both String and Literal entries, as well as other types.
ASTValueNode
The AST equivalent of TokenValueNode. Used for strings and literals to capture values.
ASTProgramNode
Here we define our Program node; this will be the root node in whatever AST we return.
It contains an array of children - other AST nodes - sequenced in execution order.
ASTAssignmentNode
This is the resulting AST entry for when we parse an assignment.
It will capture both a VariableDeclaration and an AssignmentOperator, converting these into a single node format.
ASTLogNode
Like our Log token, this will contain necessary information around what we're actually going to log.
ASTNode
A union of all the above types, so we have a single reference.
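To see these shapes in action, here's a couple of hand-written nodes (again, just a throwaway illustration):

// 'world' as a value node...
const world: ASTNode = {
  type: ASTNodeType.String,
  value: 'world'
}

// ...used as the value of an assignment node.
const assignHello: ASTNode = {
  type: ASTNodeType.Assignment,
  name: 'hello',
  value: world
}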
Structure
function toAST (tokens: Token[]): ASTNode {
let currentIndex = 0
function process (): ASTNode | null {
const currentToken = tokens[currentIndex]
// Process our current token, and return an AST node
return null
}
const children: ASTNode[] = []
while (currentIndex < tokens.length) {
const next = process()
if (next) {
children.push(next)
}
}
return {
type: ASTNodeType.Program,
children
}
}
This may look similar to our lexer code. We're iterating over our tokens and processing them one-by-one.
Let's break this down quickly.
function process (): ASTNode | null {
const currentToken = tokens[currentIndex]
// Process our current token, and return an AST node
return null
}
We define a function called process. This will be called each time we want to visit - or "consume" - a token.
Each call may either return a new AST node, or nothing.
return {
type: ASTNodeType.Program,
children
}
As mentioned in our type descriptions, we always need a root AST node. Since we don't define a program 'start' in our DSL, this root is hardcoded as the node we return.
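Even at this skeleton stage the function is callable - an empty token list simply yields an empty program:

console.log(toAST([]))
// { type: 'Program', children: [] }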
Basic nodes
We have a few token types that are very easy to parse into AST nodes, as we can either skip over them, or they just map to an existing AST node.
Line Breaks
In the same way that our lexer skips line breaks, we want to do the same thing in our AST generator.
if (currentToken.type === TokenType.LineBreak) {
currentIndex++
return null
}
Increment our position, and then return nothing. We don't care about individual line breaks.
Literals / Strings
As we have these as both tokens and AST nodes, we are able to simply map them to their corresponding type.
if (currentToken.type === TokenType.Literal) {
currentIndex++
return {
type: ASTNodeType.Literal,
value: currentToken.value
}
}
if (currentToken.type === TokenType.String) {
currentIndex++
return {
type: ASTNodeType.String,
value: currentToken.value
}
}
Log Call
Now that we have our basic tokens being parsed, it's time to tackle something a little more difficult.
Looking at our DSL again:
print hello
Taking a look at this, we can immediately see that we want to print the thing after our print call.
As we haven't really defined our language semantics yet, we're able to do that right now. Let's assume that we want to log everything on the same line as our print statement.
if (currentToken.type === TokenType.Log) {
let nextNode = tokens[currentIndex++]
const children: ASTNode[] = []
// Process
return {
type: ASTNodeType.Log,
children
}
}
First, we set up nextNode. Note the post-increment: nextNode starts out as the Log token itself, while currentIndex moves past it to the token we actually want to process next :)
Then we'll define an array of child nodes - the nodes that we will actually be logging.
while (nextNode.type !== TokenType.LineBreak) {
// Process
nextNode = tokens[currentIndex]
}
Now, we're going to continuously loop over our current tokens. That is, until we find one that is a line break.
Once we hit our line break, we know that we're at the end of our log statement.
const next = process()
if (next) {
children.push(next)
}
To get our actual child node, we now need to call process(). As we've already incremented currentIndex, the processing logic will run against the following tokens (and potentially again and again, depending on the complexity of the language that we end up with) while iterating and generating our child tree.
Once we have this, we need to check it's not null, and then add it to our array of children.
This would be a perfect place to add some validation - what happens if the node we get back is null? We'll cover adding checks in a later section of the guide.
Final Code
if (currentToken.type === TokenType.Log) {
let nextNode = tokens[currentIndex++]
const children: ASTNode[] = []
while (nextNode.type !== TokenType.LineBreak) {
const next = process()
if (next) {
children.push(next)
}
nextNode = tokens[currentIndex]
}
return {
type: ASTNodeType.Log,
children
}
}
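One thing worth flagging: if a print statement is the very last thing in our input, with no trailing line break, tokens[currentIndex] will eventually be undefined and the while condition will throw. A slightly more defensive loop (our tweak - not strictly needed for our test input, which ends with a line break) would be:

// Stop at a line break *or* when we run out of tokens entirely.
while (nextNode && nextNode.type !== TokenType.LineBreak) {
  const next = process()
  if (next) {
    children.push(next)
  }
  nextNode = tokens[currentIndex]
}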
Variable Assignment
Now that we know the basic premise of converting tokens to an AST node, let's walk through our assignment node.
As a quick recap, here's what the DSL looks like:
new hello = 'world'
Which we break up into 4 different tokens:
DSL Text | Token | AST Node |
---|---|---|
new | VariableDeclaration | ? |
hello | Literal | Literal |
= | AssignmentOperator | ? |
'world' | String | String |
We've already established that our Literals and Strings map to the same AST nodes, so what about VariableDeclaration and AssignmentOperator?
If we look at our DSL, exchanging text for tokens, we end up with:
<VariableDeclaration> <Literal> <AssignmentOperator> <String>
Now, this is true for our case, but we may not always be assigning a string as a value:
<VariableDeclaration> <Literal> <AssignmentOperator> <Token>
Looking at this, we can see that our AssignmentOperator token is actually pretty useless, even though it makes our DSL easier to read.
From an AST perspective, we can completely ignore it, as we know that after we encounter a VariableDeclaration, the next token should be a Literal (the variable name), and then the next token after that will be our assignment value.
So, let's just noop our AssignmentOperator token for the moment:
if (currentToken.type === TokenType.AssignmentOperator) {
currentIndex++
return null
}
Next, let's check for our variable declaration token:
if (currentToken.type === TokenType.VariableDeclaration) {
currentIndex++
// Process our name and value tokens
return {
type: ASTNodeType.Assignment,
name: ??,
value: ??
}
}
We know from above that we need to process the next 3 tokens.
const variableNameNode = process()
const assignmentOperatorNode = process()
const variableValueNode = process()
Typescript forces us to check if they're null or not, so while we do this, let's actually check that they're the correct types as well.
const variableNameNode = process()
if (!variableNameNode || variableNameNode.type !== ASTNodeType.Literal) {
throw new Error('Invalid variable node')
}
// Process our AssignmentOperator
process()
const variableValueNode = process()
if (!variableValueNode) {
throw new Error('Invalid variable value')
}
There are a couple of things you may notice here:
- We're not actually checking our assignment operator node.
- We're not checking that our variable value is valid as a value.
Assignment Operator check
Our first check needs to be a little different. Even though we 'process' our assignment operator, we actually return null, so what do we check?
The easiest solution here is to check our actual token, instead of the processed AST node.
Change our single process() call to the following:
const assignmentNode = tokens[currentIndex++]
if (assignmentNode.type !== TokenType.AssignmentOperator) {
throw new Error('Must use = operator to assign value')
}
We know that the correct token structure means that the next (and only the next) token is an AssignmentOperator, so we manually increment currentIndex and check this token. If it's not correct, we can now throw an error.
Variable value check
If we have a look at the return type of variableValueNode, it's the following:
ASTValueNode<ASTNodeType.String, string> |
ASTValueNode<ASTNodeType.Literal, string> |
ASTProgramNode |
ASTAssignmentNode |
ASTLogNode |
null
This is what ASTNode is (our union of all types), in addition to the null from our process() type signature.
Why is this a problem?
Because not all of our AST nodes have a value key on them, Typescript won't let us actually access the value key unless we first check that it exists.
For the sake of this check, let's assume that any AST node that contains a value key is valid as a value for an assignment.
How do we do this check?
By using the x in y Javascript syntax - which Typescript understands as a type guard.
const variableValueNode = process()
if (!variableValueNode || !('value' in variableValueNode)) {
throw new Error('Invalid variable value')
}
If we now take a look at the type of variableValueNode after this if statement, we'll see that it's now restricted to the few types that have a value key:
ASTValueNode<ASTNodeType.String, string> |
ASTValueNode<ASTNodeType.Literal, string> |
ASTAssignmentNode
Technically ASTAssignmentNode isn't a valid node here, but this check is likely good enough for now - let's continue.
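If you'd rather tighten this up now, an explicit whitelist of node types works too - a small variation on the check above:

const variableValueNode = process()
// Only strings and literals are valid assignment values for now.
if (
  !variableValueNode ||
  (variableValueNode.type !== ASTNodeType.String &&
    variableValueNode.type !== ASTNodeType.Literal)
) {
  throw new Error('Invalid variable value')
}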
Returning our AST node
The final step in our variable assignment parser is to return our new AST node.
Up above, we had placeholders in our return code:
return {
type: ASTNodeType.Assignment,
name: ??,
value: ??
}
But now that we have our two required nodes (variableNameNode & variableValueNode) parsed and confirmed as the correct types, we can fill this out.
return {
type: ASTNodeType.Assignment,
name: variableNameNode.value,
value: variableValueNode
}
You may notice that we actually use these two nodes differently.
We know for a fact that our variable name is always going to be a string, so we're actually going to convert our name Literal into a string.
For our value though, we're not entirely sure what it will be. It could be a string, like in our example DSL ('world'), or it could be another literal:
new hello1 = 'world'
new hello2 = hello1
So instead of returning just the value, we need to capture the type of that value as well, so we are just going to return the entire AST node as our value.
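Putting all of those pieces together, the complete variable declaration branch of process() looks something like:

if (currentToken.type === TokenType.VariableDeclaration) {
  currentIndex++

  // The variable name must be a literal.
  const variableNameNode = process()
  if (!variableNameNode || variableNameNode.type !== ASTNodeType.Literal) {
    throw new Error('Invalid variable node')
  }

  // The next token must be our = operator; we check the raw token
  // directly, since processing it would just return null.
  const assignmentNode = tokens[currentIndex++]
  if (assignmentNode.type !== TokenType.AssignmentOperator) {
    throw new Error('Must use = operator to assign value')
  }

  // Anything with a value key is (for now) a valid assignment value.
  const variableValueNode = process()
  if (!variableValueNode || !('value' in variableValueNode)) {
    throw new Error('Invalid variable value')
  }

  return {
    type: ASTNodeType.Assignment,
    name: variableNameNode.value,
    value: variableValueNode
  }
}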
Testing
In basically the same way as our lexer, we can test our parser.
Depending on how you have your code split up, make sure you have access to both our tokeniser and toAST functions, then do something like the following:
const DSL = `
new hello = 'world'
print hello
`
const tokens = tokeniser(DSL)
console.log({ tokens })
const AST = toAST(tokens)
console.log({ AST })
If everything goes according to plan, it should print our tokens, and then our AST, which should look similar to:
{
type: 'Program',
children: [
{
type: 'Assignment',
name: 'hello',
value: {
type: 'String',
value: 'world'
}
},
{
type: 'Log',
children: [
{
type: 'Literal',
value: 'hello'
}
]
}
]
}