OpenSource For You

Interfacin­g Python with Excel

In this article, the author discusses interfacin­g Python with Excel. Data stored in numerous Excel sheets can be extracted by using the Python packages written for the same (like xlrd). And the author’s demonstrat­ion of its benefits in a real world scenar

- By: T V Krishnamur­thy

Most projects and companies maintain a lot of data in the form of Excel files. These are usually reports, and data needs to be generated from these Excel files in a periodic manner, like a status update. Sometimes, when a specific event takes place, there may be some followup action required like generating a mail or filing an escalation in case the data indicates something of interest has happened. So let’s look at how such Excel files can be interfaced by using Python.

The problem of work summaries

Let us a look at a specific problem in this regard. Assume that employees in a particular team are working across different projects for different customers. Figure 1 gives an idea about the associatio­n between an employee, a project and a customer.

An employee could be working on one or more projects. A project could also have one or more employees. A project is targeted to be delivered to only one customer (in our example) though a customer may be involved with multiple projects.

To track the effort being spent on these projects, employees could be asked to record their work in the following manner under the heads shown below.

Let us assume all employees are filing such reports in one Excel file each, in a common location. So how do we arrive at the overall summary of efforts put in by the team on a particular customer, over a period of time?

One of the ways in which this could be managed is to do some fancy linking inside a master Excel file. Then write some Excel macros to walk through each file to summarise the data. As can be imagined, this is tedious, involves manual and automated steps, and can be error prone. Also, summaries over different time periods will have to be managed through macros or filters based on

pivot tables inside a master Excel file. Instead, let us look at a different way of doing the same task, using Python and Excel interfacin­g libraries.

Introducin­g the Python-Excel interfacin­g library

Let me introduce the xlrd package for Python available from the Python Foundation at https://pypi.python.org/ pypi/xlrd

The package used to write into Excel files is available as xlwt. As a collective package, xlutils comprises both xlrd and xlwt. In this column, we use only xlrd.

xlrd is a library that can be used to work with Microsoft Excel spreadshee­ts (. xls and .xlsx) on any platform. Let us check the interfaces available in the following sections and look at how to use them.

Installati­on of the package

The installati­on instructio­ns in the readme.html file are pretty clear. On any OS, unzip the . zip file into a suitable directory, chdir to that directory, and type in python setup. py install.

If the installati­on is successful, you should be able to do what’s shown below at the IDLE GUI prompt:

Important interfaces of the xlrd package

xlrd provides the following interfaces for reading the Excel values. 1. xlrd::open_workbook(path_to_Excel_file): This method returns an object of the Book class, with which we will work quite a bit to get access to the Excel data. 2. Book::sheet_by_index(sheetx): This returns an object of the Sheet class, with which we will access the individual cells of a sheet. 3. Sheet::cell(row,col): This returns a Cell object for the

particular row and column values. 4. A Cell object should always be retrieved from other methods described above from the Book and Sheet objects and not invoked directly. Critical for a Cell object is the value attribute, the type of which varies according to the data type of the cell represente­d by another int attribute which is ctype. The following table gives the type that value has, which depends upon the type of the cell.

Note: In this column, we will work mostly with text and date types only.

Working with dates

As quoted in the xlrd tutorial, in Excel, dates have many problems. 1. Dates are not stored as a separate data type but as floating point numbers, and you have to rely on (a) the ‘number format’ applied to them in Excel and/or (b) knowing which cells are supposed to have dates in them. 2. Dates are stored as elapsed days from epoch, which differs

between Windows and Mac versions of Excel. 3. The Excel implementa­tion of the default Windows 1900 based date system works on the incorrect premise that 1900 was a leap year.

Sample usage

Let us now put together a small spreadshee­t with the data as shown in Figure 2.

We can open this using the interfaces of xlrd in the IDLE GUI.

Note: To qualify a file name via the open_workbook API in Windows, I had to escape the “\” backslash in the path with another, to indicate that I am not using any special sequence.

Once a book is available, we can open a sheet of the same.

From the sheet, we print the following values. You can see the values and types associated with the values.

Note: Indexes of both rows and columns start with 0 (zero), which is good for us as we naturally escape the header row.

The date, as can be seen above, is stored as a number, which xlrd shows as an xldate type. Let us do some more processing with the same. To work with Python dates, we would need the date and datetime classes from the datetime module.

We can see that the date ‘1 Feb 2015’ is represente­d in Python with the year, month and date components.

To work with the xldate type, let us import and use the xldate_as_tuple interface.

We can see that the same date is stored as a datetime internally and has hours, minutes and seconds components too.

To extract only the date part and ignore the last three, we can do a sub-tuple selection and initialise the date with the same.

Now, simple comparison­s can be done between Python date classes in the following way.

Note: In the above segment, we make a slice of testdate starting from 0 to 2 and <3, with only the first three values from testdate to get date components of the year, month and day. We use the * operator to unpack and list the values to be used to construct a date object.

Designing the program

So, having set those items up, let us come back to the original problem. We need to have a program that summarises the efforts by the customer’s name and prints it. Let us assume that the data now looks like what’s shown in Figure 3 in a file c:\test.xlsx:

Given that all reports will be in a particular directory, we will need a way to get a list of the report files so that we can process them in our program. Python provides us with the glob object which helps to do that.

Here’s the sequence for the program. 1. Create a dictionary object to store nHours for each

customer name. 2. Using the glob object, fetch the list of xlsx report files. 3. For each report file: a. Cycle through each row of sheet (0) b. For each row with valid entry… i. Check if the date is present and is valid ii. Check if the date is present and the date is within the last seven days 1. Fetch the hours spent 2. Fetch the customer’s name 3. Customer hours + hours spent 4. Print the final output from the map a. Print the customer’s name and the total hours spent on

the customer The following is the complete program:

Note 1: Note that we print the report being processed to help in debugging, in case any data makes the program misbehave. Note 2: We process xldate_as_tuple in a try:, except: block so that we handle any exceptions thrown by incorrect dates. We print an error message in that case, which along with the report file name will help us know where the problem is.

When run through Python, the above program produces the following output:

More ideas

Think about how you can improve the above solution. To start with, we could make all those constants configurab­le and pass them as input to our program through a cfg file (refer to ‘Practical Python Programmin­g’ in the February 2015 issue of OSFY for how to use a ConfigPars­er module).

We could use better formatting to output the summary (refer repr() and format() interfaces).

Also, readers familiar with pivot tables could start thinking about how to implement generating such pivots from Python. The author has more than 15 years of telecom industry experience. He has been actively involved in radio technologi­es for nearly a decade while working with Siemens Communicat­ions, Nokia Siemens Networks and Radisys Corporatio­n. He lives and works at Bengaluru.

 ??  ??
 ??  ??
 ??  ??
 ??  ??
 ??  ??
 ??  ??
 ??  ??
 ??  ??
 ??  ??
 ??  ??
 ??  ??
 ??  ??
 ??  ??
 ??  ??
 ??  ??
 ??  ??
 ??  ??
 ??  ??
 ??  ??
 ??  ??
 ??  ??
 ??  ??
 ??  ??
 ??  ??

Newspapers in English

Newspapers from India