A basic lexer

export function tokeniser (input: string): Token[] {
  const out: Token[] = []
  let currentPosition = 0

  while (currentPosition < input.length) {
    // Process token, increment currentPosition
  }

  return out
}

This will be the main body of our tokeniser.

We take in our program - input - and return an array of valid tokens.

We start at position 0 in our input and loop over each character, working out what it means in the context of our language and building up our list of tokens as we go.
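One thing before we go further: we reference Token and TokenType throughout without showing them again. In case it helps, here’s a minimal sketch of shapes that are consistent with how we use them below - your definitions from earlier may differ slightly:

enum TokenType {
  LineBreak,
  VariableDeclaration,
  AssignmentOperator,
  ConsoleLog
  // The literal value types we'll handle later would live here too.
}

interface Token {
  type: TokenType
  // Some tokens (literal values, for instance) will eventually need to
  // carry the text they matched.
  value?: string
}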

To help us match strings to tokens, let’s create a map of the strings that we don’t want to be treated as literal values, each paired with the token it represents.

const tokenStringMap: Array<{
  key: string,
  value: Token
}> = [
  { key: '\n', value: { type: TokenType.LineBreak } },
  { key: 'new', value: { type: TokenType.VariableDeclaration } },
  { key: '=', value: { type: TokenType.AssignmentOperator } },
  { key: 'print', value: { type: TokenType.ConsoleLog } }
]

Now that we have this, we can loop over it and actually start processing our input. Let’s change our while loop to something like this:

while (currentPosition < input.length) {
  const currentToken = input[currentPosition]

  // Our language doesn't care about whitespace.
  if (currentToken === ' ') {
    currentPosition++
    continue
  }

  let didMatch: boolean = false

  for (const { key, value } of tokenStringMap) {
    if (!lookaheadString(key)) {
      continue
    }

    out.push(value)
    currentPosition += key.length
    didMatch = true

    // We've consumed this part of the input, so there's nothing
    // left to look for on this pass.
    break
  }

  if (didMatch) continue

  // Anything that isn't whitespace or a known token string (literal
  // values, for example) will be handled later on.
}
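The loop above leans on a lookaheadString helper, which checks whether the next key.length characters of our input match a given string. If you don’t already have one, here’s a minimal sketch - it assumes the function is declared inside tokeniser, so it can see input and currentPosition:

function lookaheadString (str: string): boolean {
  // Compare the next str.length characters of input against str.
  for (let i = 0; i < str.length; i++) {
    if (input[currentPosition + i] !== str[i]) {
      return false
    }
  }
  return true
}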

Our tokeniser now processes roughly half of our defined language!

Let’s step through what we’re doing here.

const currentToken = input[currentPosition]
This is just a useful shortcut, as we’ll often be referencing the current character that we’re parsing.

if (currentToken === ' ') {
  currentPosition++
  continue
}

Here’s our first real check. As our language’s semantics don’t change on whitespace (unlike, for example, Python’s), we can completely ignore it when it’s the current character. Whitespace is still significant when we’re doing things like lookaheads - the space might be part of a string, in which case we would be required to capture it.

let didMatch: boolean = false

for (const { key, value } of tokenStringMap) {
  if (!lookaheadString(key)) {
    continue
  }

  out.push(value)
  currentPosition += key.length
  didMatch = true
  break
}

if (didMatch) continue

This is the rest of the code we added. First, we loop over the tokenStringMap array that we created above. For each entry, we check whether the next key.length characters in our input match the key. If they don’t, we just skip that entry.

If we do end up finding a match, we add the equivalent token to our list of tokens in out, increment our currentPosition counter by the length of the string we just matched (we don’t want to accidentally reprocess this!), set didMatch to true, and break out of the loop, since there’s nothing left to look for on this pass.

Finally, if didMatch was set to true, we continue the outer while loop, as we’ve already attributed the current input token to its correct match.
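As a quick sanity check, here’s what the tokeniser produces for a small input that only uses the pieces we’ve wired up so far. It isn’t a meaningful program (we can’t tokenise variable names or literal values yet) - it just exercises the string matcher, and the output shown assumes the Token shape sketched earlier:

console.log(tokeniser('new = print\n'))
// [
//   { type: TokenType.VariableDeclaration },
//   { type: TokenType.AssignmentOperator },
//   { type: TokenType.ConsoleLog },
//   { type: TokenType.LineBreak }
// ]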