123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196 |
- :orphan:
- =====================================================
- Kaleidoscope: Kaleidoscope Introduction and the Lexer
- =====================================================
- .. contents::
- :local:
- The Kaleidoscope Language
- =========================
- This tutorial is illustrated with a toy language called
- "`Kaleidoscope <http://en.wikipedia.org/wiki/Kaleidoscope>`_" (derived
- from "meaning beautiful, form, and view"). Kaleidoscope is a procedural
- language that allows you to define functions, use conditionals, math,
- etc. Over the course of the tutorial, we'll extend Kaleidoscope to
- support the if/then/else construct, a for loop, user defined operators,
- JIT compilation with a simple command line interface, debug info, etc.
- We want to keep things simple, so the only datatype in Kaleidoscope
- is a 64-bit floating point type (aka 'double' in C parlance). As such,
- all values are implicitly double precision and the language doesn't
- require type declarations. This gives the language a very nice and
- simple syntax. For example, the following simple example computes
- `Fibonacci numbers: <http://en.wikipedia.org/wiki/Fibonacci_number>`_
- ::
- # Compute the x'th fibonacci number.
- def fib(x)
- if x < 3 then
- 1
- else
- fib(x-1)+fib(x-2)
- # This expression will compute the 40th number.
- fib(40)
- We also allow Kaleidoscope to call into standard library functions - the
- LLVM JIT makes this really easy. This means that you can use the
- 'extern' keyword to define a function before you use it (this is also
- useful for mutually recursive functions). For example:
- ::
- extern sin(arg);
- extern cos(arg);
- extern atan2(arg1 arg2);
- atan2(sin(.4), cos(42))
- A more interesting example is included in Chapter 6 where we write a
- little Kaleidoscope application that `displays a Mandelbrot
- Set <LangImpl06.html#kicking-the-tires>`_ at various levels of magnification.
- Let's dive into the implementation of this language!
- The Lexer
- =========
- When it comes to implementing a language, the first thing needed is the
- ability to process a text file and recognize what it says. The
- traditional way to do this is to use a
- "`lexer <http://en.wikipedia.org/wiki/Lexical_analysis>`_" (aka
- 'scanner') to break the input up into "tokens". Each token returned by
- the lexer includes a token code and potentially some metadata (e.g. the
- numeric value of a number). First, we define the possibilities:
- .. code-block:: c++
- // The lexer returns tokens [0-255] if it is an unknown character, otherwise one
- // of these for known things.
- enum Token {
- tok_eof = -1,
- // commands
- tok_def = -2,
- tok_extern = -3,
- // primary
- tok_identifier = -4,
- tok_number = -5,
- };
- static std::string IdentifierStr; // Filled in if tok_identifier
- static double NumVal; // Filled in if tok_number
- Each token returned by our lexer will either be one of the Token enum
- values or it will be an 'unknown' character like '+', which is returned
- as its ASCII value. If the current token is an identifier, the
- ``IdentifierStr`` global variable holds the name of the identifier. If
- the current token is a numeric literal (like 1.0), ``NumVal`` holds its
- value. We use global variables for simplicity, but this is not the
- best choice for a real language implementation :).
- The actual implementation of the lexer is a single function named
- ``gettok``. The ``gettok`` function is called to return the next token
- from standard input. Its definition starts as:
- .. code-block:: c++
- /// gettok - Return the next token from standard input.
- static int gettok() {
- static int LastChar = ' ';
- // Skip any whitespace.
- while (isspace(LastChar))
- LastChar = getchar();
- ``gettok`` works by calling the C ``getchar()`` function to read
- characters one at a time from standard input. It eats them as it
- recognizes them and stores the last character read, but not processed,
- in LastChar. The first thing that it has to do is ignore whitespace
- between tokens. This is accomplished with the loop above.
- The next thing ``gettok`` needs to do is recognize identifiers and
- specific keywords like "def". Kaleidoscope does this with this simple
- loop:
- .. code-block:: c++
- if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
- IdentifierStr = LastChar;
- while (isalnum((LastChar = getchar())))
- IdentifierStr += LastChar;
- if (IdentifierStr == "def")
- return tok_def;
- if (IdentifierStr == "extern")
- return tok_extern;
- return tok_identifier;
- }
- Note that this code sets the '``IdentifierStr``' global whenever it
- lexes an identifier. Also, since language keywords are matched by the
- same loop, we handle them here inline. Numeric values are similar:
- .. code-block:: c++
- if (isdigit(LastChar) || LastChar == '.') { // Number: [0-9.]+
- std::string NumStr;
- do {
- NumStr += LastChar;
- LastChar = getchar();
- } while (isdigit(LastChar) || LastChar == '.');
- NumVal = strtod(NumStr.c_str(), 0);
- return tok_number;
- }
- This is all pretty straightforward code for processing input. When
- reading a numeric value from input, we use the C ``strtod`` function to
- convert it to a numeric value that we store in ``NumVal``. Note that
- this isn't doing sufficient error checking: it will incorrectly read
- "1.23.45.67" and handle it as if you typed in "1.23". Feel free to
- extend it! Next we handle comments:
- .. code-block:: c++
- if (LastChar == '#') {
- // Comment until end of line.
- do
- LastChar = getchar();
- while (LastChar != EOF && LastChar != '\n' && LastChar != '\r');
- if (LastChar != EOF)
- return gettok();
- }
- We handle comments by skipping to the end of the line and then return
- the next token. Finally, if the input doesn't match one of the above
- cases, it is either an operator character like '+' or the end of the
- file. These are handled with this code:
- .. code-block:: c++
- // Check for end of file. Don't eat the EOF.
- if (LastChar == EOF)
- return tok_eof;
- // Otherwise, just return the character as its ascii value.
- int ThisChar = LastChar;
- LastChar = getchar();
- return ThisChar;
- }
- With this, we have the complete lexer for the basic Kaleidoscope
- language (the `full code listing <LangImpl02.html#full-code-listing>`_ for the Lexer
- is available in the `next chapter <LangImpl02.html>`_ of the tutorial).
- Next we'll `build a simple parser that uses this to build an Abstract
- Syntax Tree <LangImpl02.html>`_. When we have that, we'll include a
- driver so that you can use the lexer and parser together.
- `Next: Implementing a Parser and AST <LangImpl02.html>`_
|