Parsing in Go
Mihalis Tsoukalos explains the required theory – and he really goes to town – before presenting parsing applications in Go, you lucky devils.
This tutorial has two aspects: a theoretical one and a practical one. In the theoretical part, you will learn about parsing, grammar and regular expressions; this is how languages are built and therefore understood in terms of construction and usage. In the practical part you will see how the Go programming language parses its code, and how you can use the related Go packages to explore your Go code and reveal handy information from it.
Get regular
The term ‘regular expression’ (regex) comes from computational theory. Something called a Deterministic Finite Automaton (DFA) can implement such expressions. A DFA is a finite state machine that does not use backtracking. What? Well, backtracking occurs when the regular expression engine encounters a regex token that does not match the next character in the string. The regex engine will then back up part of what it has matched so far to try other alternatives.
In other words, the regex engine knows it can backtrack to where it selected the option and can continue with the match by trying a different option. Perl regular expressions are implemented using a Nondeterministic Finite Automaton (NFA) and can therefore use backtracking. Without the use of regular expressions and grammars, it would be impossible to specify all the valid programs of a programming language.
Figure 1 (above right) shows the representation of an example grammar that is pretty simple. What this grammar does is to define a simple four-function integer calculator. However, this grammar is far from perfect because it cannot represent floating-point numbers, despite the fact that it supports integer division. Therefore, it is not enough to just create a grammar: it should be logically correct and cover all desired cases.
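Figure 1 is not reproduced here, but a grammar for a four-function integer calculator of this kind typically looks something like the following EBNF sketch (the rule names are illustrative, not taken from the figure):

```ebnf
expr    = term { ( "+" | "-" ) term } ;
term    = factor { ( "*" | "/" ) factor } ;
factor  = "(" expr ")" | integer ;
integer = digit { digit } ;
digit   = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
```

Note how the grammar encodes operator precedence structurally: `term` binds multiplication and division more tightly than the addition and subtraction handled by `expr`.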
In order to parse the code of a programming language, you will need a grammar and some rules that will specify whether the given representation of the code is syntactically and logically correct. As an example, note that the Go programming language supports Unicode by design, which means that you can use Unicode characters in variables, function names, and so on without bother. However, this happens because the Go compiler is programmed to support that feature.
When the C programming language was created, Unicode was not a standard, which means that C does not support Unicode by design. Additionally, the grammar of C compared to the grammar of C++ is much simpler and easier to understand. This means that the more capabilities a programming language has, the more complex its syntax, its rules and its grammar will need to be.
The required theory
Parsing a programming language requires two main phases. The first is about breaking up the input into tokens (lexical analysis), and the second is about feeding the parser with all these tokens in order to make sure that these tokens make sense and are in the right order (syntax analysis). After all, combining valid English words without any rules does not necessarily create valid and meaningful English sentences.
A dictionary is the set of the special characters and words that a programming language uses. Usually what belongs to the dictionary cannot be used elsewhere. As an example, you can think of the word ‘int’, which in C has only one meaning and which cannot be used as a variable name.
Meanwhile, a token is a word or a punctuation symbol. A grammar is a set of production rules for making strings in a formal language. A grammar does not describe the meaning of a string or what can be done with it – only its form.
In a context-free grammar F, if there is a production rule of the form X → Xa, where X is a non-terminal and ‘a’ is a string of terminals, it is called a left recursive production. Similarly, production rules of the form X → aX define a right recursive grammar. A non-terminal symbol is a symbol that must be further expanded in order to create a sequence of terminal symbols.
A compiler is a program that converts source code from one programming language into another computer language known as object code. Next, a linker creates an executable binary file from object code files and programming language libraries. Most compilers automatically call the linker so that you do not have to do it yourself. An interpreter is a program that directly executes source code line by line. Both compilers and interpreters have to do the same amount of work in the background, so do not underestimate an interpreter!
Parsing water
A parser reads the output of a scanner – a stream of tokens – in order to generate a structure from those tokens. Parsers use a grammar that describes a programming language to make sure that the given tokens compose a valid program. That structure is represented as a tree, known as the abstract syntax tree.
A compiler gets one or more source files and processes them using the scanner that does the lexical analysis of the source files. The scanner produces tokens – actually a stream of tokens – which are used as the input to the parser that does the syntax analysis and produces the parse tree. Then semantic analysis takes place and code generation begins. You now have a syntax tree, or an equivalent of a syntax tree.
The last steps are all about code generation, code optimisation and creating the executable file where, among other tools, a linker is involved. Put simply, tools such as lex and flex do the lexical analysis and split the input into tokens, whereas tools such as yacc and bison do the syntactic processing using the tokens that flex and lex produce.
Parsing the buck
This section contains some advanced computer science-related terminology – so do not worry if you do not understand everything in this section of the tutorial. The central idea is that there exist various ways to parse a given text, which in the case of a programming language is one or more source code files. Note that the parsing of a program is often referred to as the syntactical analysis of a program; these parsing techniques are divided into ‘top-down’ and ‘bottom-up’.
In the top-down parsing strategy, the parser first looks at the highest level of the parse tree and works down the parse tree. A bottom-up parser recognises the lowest-level items first, before working with mid-level structures and leaving the highest-level overall structure to last.
There are also various types of top-down and bottom-up parsing techniques. The two most popular ones are the LL and LR parsing techniques. An LL parser (Left-to-right, Leftmost derivation) is a top-down parser for a subset of context-free languages that parses its input from left to right. A grammar is called an LL(k) grammar if an LL(k) parser – one that uses k tokens of lookahead – can be created from it. Many computer languages are designed to be LL(1), because parsers for these grammars are easy to construct. An LR parser (Left-to-right, Rightmost derivation in reverse) reads input text from left to right without backing up, and produces a rightmost derivation in reverse. By definition, LR parsers are deterministic and produce a single correct parse without guesswork or backtracking.
Enough with the theory! The sections that follow will present the Go approach to parsing and the Go packages that allow you to discover how Go parses its own code.
Parse Go, collect £200
Go translates each source code file into an AST representation. AST stands for Abstract Syntax Tree and is a structured representation of the source code of a Go program. This tree is constructed according to some rules that are specified in the language specification; this means that Go knows what to expect from an AST.
The go/ast package declares the data types required for representing abstract syntax trees and their nodes. The go/scanner package is used for reading Go programs and generating a series of tokens, whereas the go/parser package processes the output of the scanner; the go/token package declares the constants that represent the lexical tokens of Go. In the sections that follow you are going to see all these Go packages in action, as well as two practical examples.
We’ll begin with a scanning example (scango.go) and continue with a parsing example (parsego.go). You will need a Go source code file that will be scanned and parsed; if you want something simple you can use the Go version of the Hello World program (hw.go), whereas if you need something more realistic you can analyse scango.go and parsego.go. However, nothing is stopping you from scanning or parsing your own Go code, so feel free to experiment.
The logic of scango.go is in the following code:

	for {
		pos, tok, lit := myScanner.Scan()
		if tok == token.EOF {
			break
		}
		fmt.Printf("%s\t%s\t%q\n", fset.Position(pos), tok, lit)
	}
This for loop processes each token generated by the scanner. Figure 2 (page 93) shows some of the output generated by scango.go. Notice that
scango.go can scan any type of file, even binary files, which means that it does not know too many things about programming languages and regular grammars and that it does not care about the kind of file that it is scanning.
However, if you scan a binary file you might get less readable output, mainly because binary files contain fewer characters that can be used by the scanner for separating words and lines. As you can see from the output, the Go scanner adds semicolons automatically. Note that IDENT denotes an identifier, which is the most popular kind of token.
Let’s talk now about parsing, using the code of
parsego.go. The most important code here is:

	func (v visitor) Visit(n ast.Node) ast.Visitor {
		if n == nil {
			return nil
		}
		fmt.Printf("%s%T\n", strings.Repeat("\t", int(v)), n)
		return v + 1
	}
The Visit() method is executed on every node of the AST that is generated. Figure 3 (below left) shows some of the output of parsego.go. Although the output is still pretty readable, it is totally different from the output of
scango.go. Additionally, you can see that the output of
parsego.go is structured as a tree. If you try to get
parsego.go to process a source file from a different programming language, which in this example is a small C program named hw.c, you get output somewhat similar to the following:
$ go run parsego.go hw.c
Processing: hw.c
hw.c:1:1: illegal character U+0023 ‘#’
Note that Go contains internal parsers for parsing dates and times, JSON and XML data. Although none of these have anything to do with the parsing of the Go code itself, they require the implementation of separate parsers internally.
Additionally, bear in mind that for the presented Go utilities to process Go source code files, you should first compile them with go build and then execute them as binary files. Using go run will not work because the Go compiler will think that it has to compile multiple Go source code files, instead of treating them as command line arguments to the first source file. The error message you get if you use go run with multiple Go source files is similar to:
$ go run parsego.go scango.go
# command-line-arguments
./scango.go:11:6: main redeclared in this block
	previous declaration at ./parsego.go:23:6
The problem here is that both parsego.go and scango.go contain a main() function, which perplexes the Go compiler. However, if the first command line argument does not look like a Go source file – because it is numeric, say – go run stops collecting source files at that point and passes the remaining arguments to the program instead. The interpretation of the command line arguments by the Go compiler is another parsing example that illustrates how tricky parsing can be. You can learn more about the internals of Go at http://bit.ly/lxf253parsego
Backstage parse
The sections that follow present two practical and very interesting examples that illustrate how you can use the Go packages to learn information about your Go code.
Let’s learn how to count the number of times a keyword appears in the input Go files, using the timeskey.go utility. The most crucial part is:

	for {
		_, tok, lit := myScanner.Scan()
		if tok == token.EOF {
			break
		}
		if lit == KEYWORD {
			COUNT++
			localcount++
		}
	}
In the presented for loop, the utility scans the entire Go code until it reaches the end of a file. If it finds a keyword that matches the desired keyword, it updates the counters and continues scanning.
Figure 4 (page 94, top left) shows the entire Go code of timeskey.go, whereas Figure 5 (above) shows the output of timeskey.go when scanning for the func keyword – note that the output will vary depending on the input Go source files. Additionally, note that timeskey.go can process hw.c.
Parse the duchy
In the last section of this tutorial you will learn how to find variable names that have a given string length; the desired string length is provided as a command line argument to the program varlength.go. The most important Go code here is the implementation of the Visit() method, which processes all the nodes of the generated AST. Additionally, the isitlocal() method checks whether a variable is local or global.
Figure 6 (right) shows the output of varlength.go when processing various Go source code files. Note
that varlength.go is much more complex than both timeskey.go and parsego.go because it constructs and processes all the nodes of the generated AST.
Knowing the theory behind the compilation process and how your code is treated by a programming language can make you a better developer. Go offers and uses its own packages for parsing Go code, and gives you the opportunity to reveal the secrets of the Go compilation process.