Linux Format

Parsing in Go

Mihalis Tsoukalos explains the required theory before presenting parsing applications in Go, you lucky devils.

- Mihalis Tsoukalos is a UNIX person and the author of Go Systems Programming and Mastering Go. You can reach him at www.mtsoukalos.eu and @mactsouk.


This tutorial has two aspects: a theoretical one and a practical one. In the theoretical part, you will learn about parsing, grammar and regular expressions; this is how languages are built and therefore understood in terms of construction and usage. In the practical part you will see how the Go programming language parses its code, and how you can use the related Go packages to explore your Go code and reveal handy information from it.

Get regular

The term ‘regular expression’ (regex) comes from computational theory. Something called a Deterministic Finite Automaton (DFA) can implement such expressions. A DFA is a finite state machine that does not use backtracking. What? Well, backtracking occurs when the regular expression engine encounters a regex token that does not match the next character in the string. The regex engine will then back up part of what it has matched so far to try other alternatives.

In other words, the regex engine knows it can backtrack to where it selected the option and can continue with the match by trying a different option. Perl regular expressions are implemented using a Nondeterministic Finite Automaton (NFA) and therefore can use backtracking. Without the use of regular expressions and grammars, it would be impossible to specify all the valid programs of a programming language.

Figure 1 (above right) shows the representation of an example grammar that is pretty simple. This grammar defines a simple four-function integer calculator. However, the grammar is far from perfect because it cannot represent floating-point numbers, despite the fact that it supports integer division. Therefore, it is not enough to just create a grammar: it should be logically correct and cover all desired cases.
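A four-function calculator grammar along these lines is usually written in a BNF-like notation. The following is our own reconstruction of such a grammar, not necessarily the exact one shown in Figure 1:

```
expr   → expr '+' term | expr '-' term | term
term   → term '*' factor | term '/' factor | factor
factor → '(' expr ')' | NUMBER
```

If NUMBER only matches integers, there is no rule anywhere that can produce something like 3.14, which is exactly the flaw described above.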

In order to parse the code of a programming language, you will need a grammar and some rules that will specify whether the given representation of the code is syntactically and logically correct. As an example, note that the Go programming language supports Unicode by design, which means that you can use Unicode characters in variables, function names, and so on without bother. However, this happens because the Go compiler is programmed to support that feature.

When the C programmin­g language was created, Unicode was not a standard, which means that C does not support Unicode by design. Additional­ly, the grammar of C compared to the grammar of C++ is much simpler and easier to understand. This means that the more capabiliti­es a programmin­g language has, the more complex its syntax, its rules and its grammar will need to be.

The required theory

Parsing a programmin­g language requires two main phases. The first is about breaking up the input into tokens (lexical analysis), and the second is about feeding the parser with all these tokens in order to make sure that these tokens make sense and are in the right order (semantic analysis). After all, combining valid English words without any rules does not necessaril­y create valid and meaningful English sentences.

A dictionary is the set of the special characters and words that a programming language uses. Usually what belongs to the dictionary cannot be used elsewhere. As an example, you can think of the word ‘int’, which in C has only one meaning and which cannot be used as a variable name.

Meanwhile, a token is a word or a punctuation symbol. A grammar is a set of production rules for making strings in a formal language. A grammar does not describe the meaning of a string or what can be done with it – only its form.

In a context-free grammar F, if there is a production rule in the form X → Xa, where X is a non-terminal and ‘a’ is a string of terminals, it is called a left recursive production. Similarly, production rules of the form X → aX define a right recursive grammar. A non-terminal symbol is a symbol that must be further expanded in order to create a sequence of terminal symbols.

A compiler is a program that converts source code from one programming language into another computer language known as object code. Next, a linker creates an executable binary file from object code files and programming language libraries. Most compilers automatically call the linker so that you do not have to do it yourself. An interpreter is a program that directly executes source code line by line. Both compilers and interpreters have to do the same amount of work in the background, so do not underestimate an interpreter!

Parsing water

A parser reads the tokens produced by a scanner in order to generate a structure from them. Parsers use a grammar that describes a programming language to make sure that the given tokens compose a valid program. That structure is represented as a tree, known as the abstract syntax tree (AST).

A compiler gets one or more source files and processes them using the scanner that does the lexical analysis of the source files. The scanner produces tokens – actually a stream of tokens – which are used as the input to the parser that does the syntax analysis and produces the parse tree. Then semantic analysis takes place and code generation begins. You now have a syntax tree, or an equivalent of a syntax tree.

The last steps are all about code generation, code optimisation and creating the executable file where, among other tools, a linker is involved. Put simply, tools such as lex and flex do the lexical analysis and split the input into tokens, whereas tools such as yacc and bison do the parsing using the tokens that lex and flex produce.

Parsing the buck

This section contains some advanced computer science-related terminology – so do not worry if you do not understand everything in this section of the tutorial. The central idea is that there exist various ways to parse a given text, which in the case of a programming language is one or more source code files. Note that the parsing of a program is often referred to as the syntactical analysis of a program; these parsing techniques are divided into ‘top-down’ and ‘bottom-up’.

In the top-down parsing strategy, the parser first looks at the highest level of the parse tree and works down the parse tree. A bottom-up parser recognises the lowest-level items first, before working with mid-level structures and leaving the highest-level overall structure to last.

There are also various types of top-down and bottom-up parsing techniques. The two most popular are the LL and LR parsing techniques. An LL parser (Left-to-right, Leftmost derivation) is a top-down parser for a subset of context-free languages that parses its input from left to right. A grammar is called an LL(k) grammar if an LL(k) parser can be constructed from it. Many computer languages are designed to be LL(1), because parsers for these grammars are easy to construct. An LR parser (Left-to-right, Rightmost derivation in reverse) reads input text from left to right without backing up, and produces a rightmost derivation in reverse. By definition, LR parsers are deterministic and produce a single correct parse without guesswork or backtracking.

Enough with the theory! The sections that follow will present the Go approach to parsing and the Go packages that allow you to discover how Go parses its own code.

Parse Go, collect £200

Go translates each source code file into an AST representation. AST stands for Abstract Syntax Tree and is a structured representation of the source code of a Go program. This tree is constructed according to some rules that are specified in the language specification; this means that Go knows what to expect from an AST.

The go/ast package declares the data types required for representing abstract syntax trees and their nodes, while the go/token package defines the constants that represent the lexical tokens of Go. The go/scanner package reads Go programs and generates a series of tokens, whereas the go/parser package turns that stream of tokens into an AST. In the sections that follow you are going to see all these Go packages in action, as well as two practical examples.

We’ll begin with a scanning example (scango.go) and continue with a parsing example (parsego.go). You will need a Go source code file that will be scanned and parsed; if you want something simple you can use the Go version of the Hello World program (hw.go), whereas if you need something more realistic you can analyse scango.go and parsego.go themselves. However, nothing is stopping you from scanning or parsing your own Go code, so feel free to experiment.

The logic of scango.go is in the following code:

for {
	pos, tok, lit := myScanner.Scan()
	if tok == token.EOF {
		break
	}
	fmt.Printf("%s\t%s\t%q\n", one.Position(pos), tok, lit)
}

This for loop processes each token generated by the scanner. Figure 2 (page 93) shows some of the output generated by scango.go. Notice that scango.go can scan any type of file, even binary files, which means that it does not know much about programming languages and regular grammars, and that it does not care about the kind of file it is scanning.

However, if you scan a binary file you might get less readable output, mainly because binary files contain fewer characters that the scanner can use for separating words and lines. As you can see from the output, the Go scanner adds semicolons automatically. Note that IDENT denotes an identifier, which is the most common kind of token.

Let’s talk now about parsing, using the code of parsego.go. The most important code here is:

func (v visitor) Visit(n ast.Node) ast.Visitor {
	if n == nil {
		return nil
	}
	fmt.Printf("%s%T\n", strings.Repeat("\t", int(v)), n)
	return v + 1
}

The Visit() method is executed on every node of the AST that is generated. Figure 3 (below left) shows some of the output of parsego.go. Although the output is still pretty readable, it is totally different from the output of scango.go. Additionally, you can see that the output of parsego.go is structured as a tree. If you try to get parsego.go to process a source file from a different programming language, which in this example is a small C program named hw.c, you get output somewhat similar to the following:

$ go run parsego.go hw.c
Processing: hw.c
hw.c:1:1: illegal character U+0023 '#'

Note that Go contains internal parsers for parsing dates and times, as well as JSON and XML data. Although none of these have anything to do with the parsing of Go code itself, each requires the implementation of a separate parser internally.

Additional­ly, bear in mind that for the presented Go utilities to process Go source code files, you should first compile them with go build and then execute them as binary files. Using go run will not work because the Go compiler will think that it has to compile multiple Go source code files, instead of treating them as command line arguments to the first source file. The error message you get if you use go run with multiple Go source files is similar to:

$ go run parsego.go scango.go
# command-line-arguments
./scango.go:11:6: main redeclared in this block
	previous declaration at ./parsego.go:23:6

The problem here is that both parsego.go and scango.go contain a main() function, which perplexes the Go compiler. However, this will not happen if the extra argument does not look like a Go source file (if it is numeric, for example). The interpretation of the command line arguments by the Go compiler is another parsing example that illustrates how tricky parsing can be. You can learn more about the internals of Go at http://bit.ly/lxf253parsego

Backstage parse

The sections that follow present two practical and very interesting examples that illustrate how you can use the Go packages to learn information about your Go code.

Let’s learn how to count the number of times a keyword appears in the input Go files, using the timeskey.go utility. The most crucial part is:

for {
	_, tok, lit := myScanner.Scan()
	if tok == token.EOF {
		break
	}
	if lit == KEYWORD {
		COUNT++
		localCount++
	}
}

In the presented for loop, the utility scans the entire Go code until it reaches the end of a file. If it finds a keyword that matches the desired keyword, it updates the counters and continues scanning.

Figure 4 (page 94, top left) shows the entire Go code of timeskey.go, whereas Figure 5 (above) shows the output of timeskey.go when scanning for the func keyword – note that the output will vary depending on the input Go source files. Additionally, note that timeskey.go can process hw.c.

Parse the duchy

In the last section of this tutorial you will learn how to find variable names that have a given string length; the desired string length is provided as a command line argument to the program varlength.go. The most important Go code here is the implementation of the Visit() method, which processes all the nodes of the generated AST. Additionally, the isitlocal() method checks whether a variable is local or global.

Figure 6 (right) shows the output of varlength.go when processing various Go source code files. Note that varlength.go is much more complex than both timeskey.go and parsego.go because it constructs and processes all the nodes of the generated AST.

Knowing the theory behind the compilation process and how your code is treated by a programming language can make you a better developer. Go offers and uses its own packages for parsing Go code, and gives you the opportunity to reveal the secrets of the Go compilation process.

Figure 1: This shows a simple grammar that supports a calculator with four functions. However, if you study it closely you’ll see that the grammar (like me!–ed) has some serious flaws.
Figure 2: The output of scango.go when processing a Go file, as well as a C file. Generally speaking, scanners are language-independent up to a point.
Figure 4: This is the Go source code of timeskey.go, which illustrates how you can use the go/scanner and go/token Go packages to count the number of times a Go keyword appears in your source code files.
Figure 3: This illustrates the use of the parsego.go program that parses Go source files and generates output in the AST format.
Figure 6: This illustrates the use of varlength.go, which parses Go source files and searches for variables with a given length. Programs such as this cannot work with source files written in other programming languages.
Figure 5: This shows timeskey.go in action. Note that timeskey.go can also process C files without any issues, mainly because scanning is not strictly related to a programming language.
