Parsing XML files
John Schwartzman shows how to parse and display XML documents using code written in Python and Go.
One of the easiest ways to learn a new programming language is to build a program first in a familiar language (Python) and then convert it to the language you want to learn (Go). That makes it easy to compare the features and idiosyncrasies of the two programming languages.
Figure 1 shows the output of our Python program when reading an XML file. The Python parser, xmlparse.py, can’t display the comments in an XML file. It knows how to ignore comments (after all, comments are for people, not programs), but not how to display them. Figure 2 shows the output of our Go program, xmlparse.go, on exactly the same XML file, postfix.xml. The Go version clearly does know how to display comments.
Let’s start by looking at xmlparse.py. Create a working directory and place xmlparse.py there along with the sample XML files, addressbook.xml and postfix.xml. Make sure that your copy of xmlparse.py is executable. If it’s not, use the alias mx (alias mx='chmod +x') to make it executable for everyone, then run it:
mx xmlparse.py
./xmlparse.py postfix.xml
In the main section (the if __name__ == "__main__": block at the bottom of the script), we first look at the command line arguments to check whether the user has entered the correct number of arguments and whether the user wants help. If the user wants help, we invoke the usage() function with an exit code of 0, which tells the system that we exited normally. If the user has entered the wrong number of command line arguments, we invoke the usage() function with an exit code of 1 to indicate to the system that an error occurred. If we pass through this error checking successfully, we invoke main() with sys.argv[1], which is the filename of the XML file we want to view.
sys.argv[0] is, of course, the program file path for xmlparse.py.
The main() function opens the specified XML file, instantiates an Xmlcontenthandler object and passes both to xml.sax.parse(). Since these are actions that could conceivably fail, we wrap them in a try/except block to handle potential errors. These blocks are similar to the try/catch blocks in C++, Java and C#. One of the programmer’s most important jobs is to anticipate potential errors and to let the user exit from them gracefully.
    def usage(exitflag):
        print('\nUSAGE: ./xmlparse.py xmlfiletoview\n\n')
        sys.exit(exitflag)

    def main(sourcefilename):
        try:
            source = open(sourcefilename)
            # instantiate Xmlcontenthandler and hand it to the SAX parser
            xml.sax.parse(source, Xmlcontenthandler())
        except xml.sax.SAXParseException as e:  # handle parser error
            print('\nfailed to parse ' + sourcefilename +
                  '. This appears not to be a valid xml document: ' +
                  e.getMessage() + '\n\n')
            sys.exit(2)
        except OSError as e:  # handle os error
            print('\nfailed to open ' + sourcefilename + ': ' + str(e) + '\n\n')
            sys.exit(3)

    if __name__ == "__main__":
        # check the count first, so that sys.argv[1] is known to exist
        if len(sys.argv) != 2:  # there must be 2 arguments
            usage(1)
        if sys.argv[1] == "-h" or sys.argv[1] == "--help":  # does user want help?
            usage(0)
        main(sys.argv[1])
        print('')
        sys.exit(0)
The Xmlcontenthandler (which we instantiate in main()) is driven by the parser as it walks through the XML file: the parser invokes the handler functions that correspond to the artifacts found in the file. When it encounters a start element, which looks like <name>, it invokes the startElement() member function.
The startElement() member function also handles element attributes. When the parser encounters character data, it invokes the characters() member function, which appends the characters to charbuffer. This buffer holds the element data, which is read and printed when the handler encounters an end element. Each time the handler encounters a start element, it pushes the name onto elementstack and uses the size of the stack to determine where on the X axis it should print the element name. At that point, it moves to a new row and adds spaces to reach the column position where the element name should be printed.

    def startElement(self, name, attributes):  # we've encountered a start element
        pos = self.pushelementtostack(name)
        self.writenewline()
        self.writespaces(pos, '   ')  # write 3 spaces per index
        self.writestartname(name)
        self.writeattributes(attributes)
        self.nlastwritepos = pos
A stack is a last in, first out (LIFO) data structure. We are constructing our stack with a list data structure. When we push a name onto the stack, it is added to the top of the stack, which is the end of the list. When we pop a name from the stack, it is removed from the top of the stack. The depth of a name in the stack determines where on the X axis we’re going to print the start element and end element names. Elements that are deeply nested will be further to the right than less deeply nested elements. The name of the first start element will be placed on the stack with a depth of 0. It should appear on the left edge of the display (X position = 0). The last end element in the file (which must correspond to the first start element) should also have a depth of 0, so it should also appear on the left edge of the display.
Each time the handler encounters an end element, it pops the name from elementstack. It then checks to see if there is data associated with the element. If so, it prints the data. It may then add spaces to change the column position where the closing element name should be printed. If we’ve just written at the same depth as that of the element on the stack, we write at the current X position. Otherwise, we’re in a new Y position and we use the depth of the element on the stack to indent the closing element name.

    def endElement(self, name):  # we've encountered an end element
        pos = self.popelementfromstack(name)
        charstr = self.getcharacterdata()
        self.writeelementdata(charstr)  # write it
        if pos < self.nlastwritepos:  # write name at current x pos?
            self.writespaces(pos, '   ')  # position end element on x-axis
        self.writeendname(name)
That’s pretty much all there is to the Python program. When the parser runs out of elements to process, control returns through main() to the main section, which exits with a return code of 0 to indicate success.
The Go program works in much the same way as the Python program. Let’s first look at some of the obvious differences between the two programs. In Go, a function or constant whose name starts with a capital letter is exported from its package. Since we are using ours locally, inside the main package, we start their names with a lower-case letter.
In Python, we can delimit strings using single quotes or double quotes. In Go, double quotes are used for strings (backquotes delimit raw strings) and single quotes are used for single characters. Note also that function bodies in Go are delimited with curly braces {} rather than a colon and a new indentation level, as in Python.
Further, the opening curly brace must appear on the same line as the function definition. You are forced into a Kernighan and Ritchie programming style (as in the authors of the original C) whether you like it or not. There are no semicolons at the end of statements in Go as there are in C and C++. They are injected automatically and invisibly by the Go compiler.
Compare the constant declarations in the Python program and in the Go program. In the Go program the names are lower-case, and the strings are delimited by double quotes and joined with the + operator. Even the comments are different. In Python, we use the # symbol to start a comment, as we do in other scripting languages. Like C and C++ programs, Go uses the // comment delimiter and also the /* this is a comment */ notation.
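To make these syntax differences concrete, here is a minimal sketch. The constant names and values are hypothetical and chosen only for illustration; they are not taken from xmlparse.go.

```go
package main

import "fmt"

// Hypothetical constants for illustration only; xmlparse.go defines its own.
// The names start with a lower-case letter, so they are not exported.
const (
	indentUnit = "   " // a string literal uses double quotes
	openAngle  = '<'   // a single character (a rune) uses single quotes
)

/* A comment can also be written in this block form. */

// describe reports the Go types of the two constants.
func describe() string { // the opening brace must be on this line
	return fmt.Sprintf("%T %T", indentUnit, openAngle)
}

func main() {
	fmt.Println(describe()) // prints "string int32"
}
```

A rune is Go’s name for a character value, and it is stored as a 32-bit integer, which is why the second type is reported as int32.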
Go is a compiled language, but you can compile and run a program in a single step. When you are starting a project, invoke go run xmlparse.go xmlfilename at the command line; this builds a temporary executable and runs it straight away, which feels much like using an interpreter. To keep the executable, invoke go build xmlparse.go, which creates the program xmlparse in your working directory. You then run the compiled version by invoking ./xmlparse xmlfilename. Go programs don’t have classes per se, but they do have structs. Functions (known as methods) can be attached to structs in much the same way that member functions are attached to classes in an object-orientated language.
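As a rough illustration of attaching functions to a struct, the sketch below defines a hypothetical position type with two methods. Neither the type nor its methods appear in xmlparse.go, which keeps its functions at package level; this only shows the mechanism.

```go
package main

import "fmt"

// position is a hypothetical struct that tracks where the next character
// would be written. It is not part of xmlparse.go.
type position struct {
	row, col int
}

// newline moves to the start of the next row. The pointer receiver (p) lets
// the method modify the caller's struct, much as self does in Python.
func (p *position) newline() {
	p.row++
	p.col = 0
}

// indent advances three columns for each level of nesting.
func (p *position) indent(depth int) {
	p.col += 3 * depth
}

func main() {
	var p position // fields start at their zero values
	p.newline()
	p.indent(2)
	fmt.Println(p.row, p.col) // prints "1 6"
}
```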
Create a working directory and place xmlparse.go there along with the sample XML files addressbook.xml and postfix.xml. Compile the program, copy it to your ~/bin directory and run it:
go build xmlparse.go
cp xmlparse ~/bin
xmlparse postfix.xml
Let’s begin our exploration of the Go version of xmlparse with the main() function. In Go, main() takes no arguments and does not return a value. In main() we check that the user has entered two arguments and then determine whether the user wants help. In either case we invoke the usage() function with an exit code. This function is at the same level as the main() function and all of the other functions. Functions don’t take a class instance as an argument as they do in Python, so they are much simpler. Notice also how all of the action takes place in main().
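The argument checks described above might be sketched as follows. The checkArgs helper is hypothetical, and the exit codes are assumed to match the Python version (0 for help, 1 for the wrong number of arguments). In the real program the slice would be os.Args and the result would feed usage() and os.Exit(); here the sketch runs over sample argument lists so that it is safe to try.

```go
package main

import "fmt"

// checkArgs is a hypothetical helper. args plays the role of os.Args, so
// args[0] is the program path. It returns the file to view, or
// showUsage == true together with the exit code usage() would be given.
func checkArgs(args []string) (filename string, exitCode int, showUsage bool) {
	if len(args) != 2 {
		return "", 1, true // wrong number of arguments: an error
	}
	if args[1] == "-h" || args[1] == "--help" {
		return "", 0, true // help was requested: not an error
	}
	return args[1], 0, false
}

func main() {
	samples := [][]string{
		{"xmlparse"},
		{"xmlparse", "--help"},
		{"xmlparse", "postfix.xml"},
	}
	for _, args := range samples {
		name, code, usage := checkArgs(args)
		fmt.Printf("%q -> file=%q exit=%d usage=%v\n", args, name, code, usage)
	}
}
```

Checking the length before looking at args[1] matters: indexing past the end of a slice stops a Go program with a run-time panic.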
After checking the run-time arguments, we try to open the designated XML file for reading. Note that Go does not have the equivalent of try/except error handling. Instead, a function call that could fail returns two separate values: its result and an error. For example, in the main function we have:

    xmlfile, e := os.Open(os.Args[1]) // os.Args[1] is the xml file to view
    if e != nil {
        fmt.Printf("\nproblem reading %s: %v\n\n", os.Args[1], e)
        os.Exit(2)
    }
os.Open() returns a file object and, if it couldn’t open the XML file, an error. If you didn’t want to handle errors on your first attempt through the code, you could assign the error to the blank identifier _ and simply let any failure surface later in the program. You can get away with this if it’s you running it. If you’re writing programs for others to use, you have to be more careful. The quick and dirty method would look like this:

    xmlfile, _ := os.Open(os.Args[1])
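Here is a small, self-contained sketch of the two-value pattern. The tryOpen helper and the file name are hypothetical; the file is assumed not to exist, so that the error path is the one that runs.

```go
package main

import (
	"fmt"
	"os"
)

// tryOpen is a hypothetical helper: it opens path for reading, closes it
// again, and hands any error back to the caller instead of exiting.
func tryOpen(path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err // the caller decides what to do about it
	}
	return f.Close()
}

func main() {
	// Assumed not to exist, so we expect the error branch.
	if err := tryOpen("no-such-dir/no-such-file.xml"); err != nil {
		fmt.Printf("problem reading file: %v\n", err)
		return
	}
	fmt.Println("opened without error")
}
```

Returning the error rather than calling os.Exit() inside the helper leaves the decision to the caller, which is the usual Go convention for anything other than main().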
Next, we call xml.NewDecoder() and pass the open XML file to it for processing. Then we initialise the empty elementstack, which we’ll use in the same way that we did in the Python program. We now enter a for loop where we check each token that the decoder finds in the XML file. In this loop we wait for tokens as the decoder processes the XML file.

    for { // while there are tokens, stay in for loop
        // get a new token
        t, err := decoder.Token()
        if err != nil && err.Error() != "EOF" {
            fmt.Printf("error: %s\n", err)
        }
        if t == nil {
            // we've reached the end of the xml document
            break // exit the for loop
        }
        // inspect the type of the token
        switch se := t.(type) {
        case xml.StartElement: // we've encountered a start element
            pos := push(elementstack, se.Name.Local) // push it onto the stack
            writenewline()
            writespaces(pos, "   ") // write three spaces per index position
            writestartname(se.Name.Local)
            for _, a := range se.Attr {
                writeattribute(a.Name.Local, a.Value)
            }
            nlastwritepos = pos
        case xml.EndElement: // ooh, an end element
            pos := pop(elementstack, se.Name.Local) // pop it
            if pos < nlastwritepos { // write name at x pos?
                writespaces(pos, "   ") // set position of end element
            }
            writeendname(se.Name.Local)
        case xml.CharData: // element data
            // remove any surrounding whitespace
            data := strings.TrimSpace(string(t.(xml.CharData)))
            if data != "" {
                writecharacterdata(data) // write it at current x,y
            }
        case xml.Comment: // comments
            data := string(t.(xml.Comment)) // write it at x,y
            writecomment(data)
        case xml.Directive: // we've encountered a directive
            data := string(t.(xml.Directive))
            writedirective(data)
        } // end of switch statement
    } // end of for loop
    fmt.Printf(white) // restore normal screen formatting
    writenewline()
    }
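To experiment with the token loop without the rest of the program, the following cut-down sketch may help. It is not the article’s code: the outline() function is hypothetical, it reads from a string rather than a file, it tracks depth with a plain counter instead of a stack, and it returns indented lines instead of writing coloured text.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// outline walks the tokens of doc and returns one line per start element,
// comment or non-blank run of text, indented three spaces per level.
func outline(doc string) ([]string, error) {
	decoder := xml.NewDecoder(strings.NewReader(doc))
	var lines []string
	depth := 0
	for {
		t, err := decoder.Token()
		if err == io.EOF {
			return lines, nil // end of the document
		}
		if err != nil {
			return lines, err // e.g. the document is not well-formed
		}
		indent := strings.Repeat("   ", depth)
		switch tok := t.(type) {
		case xml.StartElement:
			lines = append(lines, indent+"<"+tok.Name.Local+">")
			depth++
		case xml.EndElement:
			depth--
		case xml.CharData:
			if text := strings.TrimSpace(string(tok)); text != "" {
				lines = append(lines, indent+text)
			}
		case xml.Comment:
			lines = append(lines, indent+"<!--"+string(tok)+"-->")
		}
	}
}

func main() {
	doc := `<book><!-- friends --><name>Ann</name></book>`
	lines, err := outline(doc)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, line := range lines {
		fmt.Println(line)
	}
}
```

Comparing err with io.EOF is the conventional way to detect the end of input, and it is equivalent here to checking for the "EOF" message text.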
In this way, we process each of the tokens we encounter in the XML file. The main difference between the Python parser and the Go parser is that Go recognises comments and directives and it delivers element data all at once. We respond to each of the tokens we encounter in almost the same way we do with the Python parser. Notice that there are no break statements in the switch statement. A case in Go does not fall through to the next one, so it’s as if Go inserts the breaks automatically and invisibly. The for statement is the only looping construct in Go. Notice that in the function writespaces() we use i := 0 for the initial assignment to i, followed by the condition to check, followed by the action to take on each iteration. These are separated by semicolons.

    func writespaces(pos int, chars string) { // position the cursor column
        for i := 0; i < pos; i++ {
            fmt.Printf(spaces, chars)
        }
    }
The constant spaces is equal to black + "%s" + white. The constant black makes text black on a black background, and therefore invisible. We print pos copies of "   " (three spaces) on the screen, each followed by a return to our default colour assignment, white, which is white text on a black background. writespaces() is how we position the cursor where we want it on the X axis.
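The colour constants themselves aren’t shown here, so the sketch below guesses at them using standard ANSI escape sequences. The actual values in xmlparse.go may well differ, and the indent() helper is hypothetical: it builds the string that writespaces() would print so that the result can be inspected.

```go
package main

import (
	"fmt"
	"strings"
)

// Assumed values: standard ANSI colour escape sequences. The constants in
// xmlparse.go may be defined differently.
const (
	black  = "\x1b[30;40m" // black text on a black background
	white  = "\x1b[37;40m" // white text on a black background
	spaces = black + "%s" + white
)

// indent returns pos copies of chars, each wrapped in the black and white
// sequences, which is what writespaces() sends to the terminal.
func indent(pos int, chars string) string {
	var b strings.Builder
	for i := 0; i < pos; i++ {
		fmt.Fprintf(&b, spaces, chars)
	}
	return b.String()
}

func main() {
	// %q shows the escape sequences as text instead of sending them
	// to the terminal.
	fmt.Printf("%q\n", indent(2, "   "))
}
```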
Notice, also, that all of the colour constants end with white. This is so that if the program breaks, your terminal should be restored to a normal condition and you should be able to see anything you type on the console. Compare the Python version of writespaces() to the Go version. Notice that in the Python version, we iterate over the range() function and use a dummy variable for the index.
We now turn to the functions that are invoked when the Go parser reaches an xml.StartElement or an xml.EndElement token.

    func push(s *list.List, name string) int {
        pos := s.Len()   // use the index before push
        s.PushBack(name) // push it onto the stack
        return pos
    }

    func pop(s *list.List, name string) int {
        e := s.Back() // get the last element in the list (nil if the list is empty)
        if e != nil && e.Value == name {
            s.Remove(e) // pop it from the stack
        } else {
            fmt.Printf("%s\nerror: %s was not at the top of the stack.\n\n",
                white, name)
            os.Exit(4)
        }
        return s.Len() // use the index after pop
    }
These functions are designed to use the list.List structure that we imported from the container/list package. We specify that these functions take a pointer to a list (s *list.List) and a string (name string) as arguments, and return an int which corresponds to the depth of the stack. Note the function signatures for push and pop; they tell us the names and types of the arguments and the type of the return value. This form of function signature, with the type written after each name, differs from that of C and C++. What does it mean to pass a pointer to a List? It means that we’re passing the address of the List. We don’t want to pass a copy of the List around: push and pop need to change the caller’s List, and a copy would leave the original untouched. Passing the address of the List is also more economical.
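The following sketch exercises push and pop on their own. It differs from the listing above in one deliberate way: pop returns an error instead of calling os.Exit(), which makes it easier to try out. The trace() helper is hypothetical and exists only to show the depths that come back.

```go
package main

import (
	"container/list"
	"fmt"
)

// push adds name to the top of the stack and returns its depth.
func push(s *list.List, name string) int {
	pos := s.Len() // depth before the push
	s.PushBack(name)
	return pos
}

// pop removes name from the top of the stack and returns the new depth.
// Unlike the article's version, it reports a mismatch as an error.
func pop(s *list.List, name string) (int, error) {
	e := s.Back()
	if e == nil || e.Value != name {
		return 0, fmt.Errorf("%s was not at the top of the stack", name)
	}
	s.Remove(e)
	return s.Len(), nil
}

// trace pushes each name in order, pops them in reverse order, and
// records the depth returned by every call.
func trace(names ...string) ([]int, error) {
	s := list.New() // list.New returns a *list.List
	var depths []int
	for _, n := range names {
		depths = append(depths, push(s, n))
	}
	for i := len(names) - 1; i >= 0; i-- {
		d, err := pop(s, names[i])
		if err != nil {
			return depths, err
		}
		depths = append(depths, d)
	}
	return depths, nil
}

func main() {
	depths, err := trace("addressbook", "person", "name")
	fmt.Println(depths, err) // prints "[0 1 2 2 1 0] <nil>"
}
```

Because push and pop receive the same pointer, every call works on the one list created in trace(). The matching depths for each push and pop are what let a start element and its end element line up in the same column.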
When the parser encounters an xml.StartElement, we push the element name onto elementstack. When the parser encounters an xml.EndElement, we pop the element name off elementstack. The depth of the stack tells us where on the X axis we should print the element name. After the push we move to a new line and call writespaces(pos, "   "), where pos is the value returned from the push function. writespaces() indents three spaces for every level that the element name is nested on the stack. In this way we try to print the start element name and the end element name at the same horizontal position. Notice that we don’t always print an end element name at the same horizontal position as its start element name. When it’s written on the same line as attributes or data, we print it at the current X position.
Figure 3 shows an XML file used by Eclipse CDT to keep track of project source file members. Note that the attributes of the resource elements each specify a C++ program in the project path. Eclipse uses this and other XML files to automatically generate make files for your C/C++ projects. Similarly, when you add a new printer or printer driver to your system, a dialogue box is populated with a long list of printers. That dialogue is populated from an XML file (see Figure 4).
XML documents are used to store data, configuration information, translations and really almost anything else. We’ve used XML files to update a database using Java and Hibernate. Grab a few assorted XML files from your distro and try to view them using xmlparse, but beware: not every XML file is well-formed. There are lots of files in your distro that are not. xmlparse will reject them if they don’t follow the rules, while the programs that use the XML files may be more forgiving. Our two parsers are simply general-purpose XML file viewers that organise and colour-code the contents of XML documents.