In­ter­fac­ing Python with Excel

In this ar­ti­cle, the au­thor dis­cusses in­ter­fac­ing Python with Excel. Data stored in nu­mer­ous Excel sheets can be ex­tracted by us­ing the Python packages writ­ten for the same (like xlrd). And the au­thor’s demon­stra­tion of its benefits in a real world sce­nar

OpenSource For You - - CONTENTS - By: T V Kr­ish­na­murthy

Most projects and com­pa­nies main­tain a lot of data in the form of Excel files. Th­ese are usu­ally re­ports, and data needs to be gen­er­ated from th­ese Excel files in a pe­ri­odic man­ner, like a sta­tus up­date. Some­times, when a spe­cific event takes place, there may be some fol­low­up ac­tion re­quired like gen­er­at­ing a mail or fil­ing an es­ca­la­tion in case the data in­di­cates some­thing of in­ter­est has hap­pened. So let’s look at how such Excel files can be in­ter­faced by us­ing Python.

The prob­lem of work sum­maries

Let us a look at a spe­cific prob­lem in this re­gard. As­sume that em­ploy­ees in a par­tic­u­lar team are work­ing across dif­fer­ent projects for dif­fer­ent cus­tomers. Fig­ure 1 gives an idea about the as­so­ci­a­tion be­tween an em­ployee, a project and a cus­tomer.

An em­ployee could be work­ing on one or more projects. A project could also have one or more em­ploy­ees. A project is tar­geted to be de­liv­ered to only one cus­tomer (in our ex­am­ple) though a cus­tomer may be in­volved with mul­ti­ple projects.

To track the ef­fort be­ing spent on th­ese projects, em­ploy­ees could be asked to record their work in the fol­low­ing man­ner un­der the heads shown be­low.

Let us as­sume all em­ploy­ees are fil­ing such re­ports in one Excel file each, in a com­mon lo­ca­tion. So how do we ar­rive at the over­all sum­mary of ef­forts put in by the team on a par­tic­u­lar cus­tomer, over a pe­riod of time?

One of the ways in which this could be man­aged is to do some fancy link­ing in­side a mas­ter Excel file. Then write some Excel macros to walk through each file to sum­marise the data. As can be imag­ined, this is te­dious, in­volves man­ual and au­to­mated steps, and can be er­ror prone. Also, sum­maries over dif­fer­ent time pe­ri­ods will have to be man­aged through macros or fil­ters based on

pivot ta­bles in­side a mas­ter Excel file. In­stead, let us look at a dif­fer­ent way of do­ing the same task, us­ing Python and Excel in­ter­fac­ing li­braries.

In­tro­duc­ing the Python-Excel in­ter­fac­ing li­brary

Let me in­tro­duce the xlrd pack­age for Python avail­able from the Python Foun­da­tion at https://pypi.python.org/ pypi/xlrd

The pack­age used to write into Excel files is avail­able as xlwt. As a col­lec­tive pack­age, xlu­tils com­prises both xlrd and xlwt. In this col­umn, we use only xlrd.

xlrd is a li­brary that can be used to work with Mi­crosoft Excel spread­sheets (. xls and .xlsx) on any plat­form. Let us check the in­ter­faces avail­able in the fol­low­ing sec­tions and look at how to use them.

In­stal­la­tion of the pack­age

The in­stal­la­tion in­struc­tions in the readme.html file are pretty clear. On any OS, un­zip the . zip file into a suit­able di­rec­tory, chdir to that di­rec­tory, and type in python setup. py in­stall.

If the in­stal­la­tion is suc­cess­ful, you should be able to do what’s shown be­low at the IDLE GUI prompt:

Im­por­tant in­ter­faces of the xlrd pack­age

xlrd pro­vides the fol­low­ing in­ter­faces for read­ing the Excel val­ues. 1. xlrd::open_­work­book(path_­to_Ex­cel_­file): This method re­turns an ob­ject of the Book class, with which we will work quite a bit to get ac­cess to the Excel data. 2. Book::sheet_by_in­dex(sheetx): This re­turns an ob­ject of the Sheet class, with which we will ac­cess the in­di­vid­ual cells of a sheet. 3. Sheet::cell(row,col): This re­turns a Cell ob­ject for the

par­tic­u­lar row and col­umn val­ues. 4. A Cell ob­ject should al­ways be re­trieved from other meth­ods de­scribed above from the Book and Sheet ob­jects and not in­voked di­rectly. Crit­i­cal for a Cell ob­ject is the value at­tribute, the type of which varies ac­cord­ing to the data type of the cell rep­re­sented by an­other int at­tribute which is ctype. The fol­low­ing ta­ble gives the type that value has, which de­pends upon the type of the cell.

Note: In this col­umn, we will work mostly with text and date types only.

Work­ing with dates

As quoted in the xlrd tu­to­rial, in Excel, dates have many prob­lems. 1. Dates are not stored as a sep­a­rate data type but as float­ing point num­bers, and you have to rely on (a) the ‘num­ber for­mat’ ap­plied to them in Excel and/or (b) know­ing which cells are sup­posed to have dates in them. 2. Dates are stored as elapsed days from epoch, which dif­fers

be­tween Win­dows and Mac ver­sions of Excel. 3. The Excel im­ple­men­ta­tion of the de­fault Win­dows 1900 based date sys­tem works on the in­cor­rect premise that 1900 was a leap year.

Sam­ple us­age

Let us now put to­gether a small spread­sheet with the data as shown in Fig­ure 2.

We can open this us­ing the in­ter­faces of xlrd in the IDLE GUI.

Note: To qual­ify a file name via the open_­work­book API in Win­dows, I had to es­cape the “\” back­slash in the path with an­other, to in­di­cate that I am not us­ing any spe­cial se­quence.

Once a book is avail­able, we can open a sheet of the same.

From the sheet, we print the fol­low­ing val­ues. You can see the val­ues and types as­so­ci­ated with the val­ues.

Note: In­dexes of both rows and col­umns start with 0 (zero), which is good for us as we nat­u­rally es­cape the header row.

The date, as can be seen above, is stored as a num­ber, which xlrd shows as an xl­date type. Let us do some more pro­cess­ing with the same. To work with Python dates, we would need the date and date­time classes from the date­time mod­ule.

We can see that the date ‘1 Feb 2015’ is rep­re­sented in Python with the year, month and date com­po­nents.

To work with the xl­date type, let us im­port and use the xl­date_as_­tu­ple in­ter­face.

We can see that the same date is stored as a date­time in­ter­nally and has hours, min­utes and sec­onds com­po­nents too.

To ex­tract only the date part and ig­nore the last three, we can do a sub-tu­ple se­lec­tion and ini­tialise the date with the same.

Now, sim­ple com­par­isons can be done be­tween Python date classes in the fol­low­ing way.

Note: In the above seg­ment, we make a slice of test­date start­ing from 0 to 2 and <3, with only the first three val­ues from test­date to get date com­po­nents of the year, month and day. We use the * op­er­a­tor to un­pack and list the val­ues to be used to con­struct a date ob­ject.

Designing the pro­gram

So, hav­ing set those items up, let us come back to the orig­i­nal prob­lem. We need to have a pro­gram that sum­marises the ef­forts by the cus­tomer’s name and prints it. Let us as­sume that the data now looks like what’s shown in Fig­ure 3 in a file c:\test.xlsx:

Given that all re­ports will be in a par­tic­u­lar di­rec­tory, we will need a way to get a list of the re­port files so that we can process them in our pro­gram. Python pro­vides us with the glob ob­ject which helps to do that.

Here’s the se­quence for the pro­gram. 1. Cre­ate a dic­tio­nary ob­ject to store nHours for each

cus­tomer name. 2. Us­ing the glob ob­ject, fetch the list of xlsx re­port files. 3. For each re­port file: a. Cy­cle through each row of sheet (0) b. For each row with valid en­try… i. Check if the date is present and is valid ii. Check if the date is present and the date is within the last seven days 1. Fetch the hours spent 2. Fetch the cus­tomer’s name 3. Cus­tomer hours + hours spent 4. Print the fi­nal out­put from the map a. Print the cus­tomer’s name and the to­tal hours spent on

the cus­tomer The fol­low­ing is the com­plete pro­gram:

Note 1: Note that we print the re­port be­ing pro­cessed to help in de­bug­ging, in case any data makes the pro­gram mis­be­have. Note 2: We process xl­date_as_­tu­ple in a try:, ex­cept: block so that we han­dle any ex­cep­tions thrown by in­cor­rect dates. We print an er­ror mes­sage in that case, which along with the re­port file name will help us know where the prob­lem is.

When run through Python, the above pro­gram pro­duces the fol­low­ing out­put:

More ideas

Think about how you can im­prove the above so­lu­tion. To start with, we could make all those con­stants con­fig­urable and pass them as in­put to our pro­gram through a cfg file (re­fer to ‘Prac­ti­cal Python Pro­gram­ming’ in the Fe­bru­ary 2015 is­sue of OSFY for how to use a Con­figParser mod­ule).

We could use bet­ter for­mat­ting to out­put the sum­mary (re­fer repr() and for­mat() in­ter­faces).

Also, read­ers familiar with pivot ta­bles could start think­ing about how to im­ple­ment gen­er­at­ing such piv­ots from Python. The au­thor has more than 15 years of tele­com in­dus­try ex­pe­ri­ence. He has been ac­tively in­volved in ra­dio tech­nolo­gies for nearly a decade while work­ing with Siemens Com­mu­ni­ca­tions, Nokia Siemens Net­works and Radisys Cor­po­ra­tion. He lives and works at Ben­galuru.

Newspapers in English

Newspapers from India

© PressReader. All rights reserved.