OpenSource For You

Using Python for Data Mining

This article presents a few examples of the use of the Python programming language in the field of data mining. The first section is mainly dedicated to the use of GNU Emacs, and the other sections to two widely used techniques: hierarchical cluster analysis and principal component analysis.

- By: Stefano Rizzi The author works in the areas of analytical chemistry and textile chemistry. He has been a Linux user since 1998.

This article is introductory: some topics, such as varimax, oblimin, etc., are not included here and will be discussed in the future. The complete code is too long for a printed article, but is freely available at https://github.com/astonfe/iris.

The toolbox used in this article is based on WinPython 3.4.4.2 and GNU Emacs 24.5 on Windows. My Emacs configuration for the Python language is very simple; the following lines are added to the .emacs file:

(setq python-indent-guess-indent-offset nil)
(org-babel-do-load-languages 'org-babel-load-languages '((python . t)))
(add-to-list 'exec-path "C:\\WinPython32bit-3.4.4.2\\python-3.4.4")
(global-set-key (kbd "<f8>") (kbd "C-u C-c C-c"))
(setenv "PYTHONIOENCODING" "utf-8")
(setenv "LANG" "en_US.UTF-8")

The first line is useful to avoid the Emacs warning message ‘Can’t guess python-indent-offset, using defaults: 4’. The next lines are to use Python in org-mode, and the last ones are to use Emacs as an IDE. In the following org file, text, code, figures and a table are present at the same time; this is not very different from a Jupyter Notebook. Each code section can be evaluated with C-c C-c. Exporting the whole file as HTML (C-c C-e h h) produces the output shown in Figure 1.

#+title: PAH method GCMS
#+options: toc:nil
#+options: num:nil
#+options: html-postamble:nil

<...some-text-here...>

#+begin_src python :var filename="method.png" :results file :exports results
<...some-python-code-here...>
#+end_src

#+results:
[[file:method.png]]

#+begin_src python :var filename="chromatogram.png" :results file :exports results
<...some-python-code-here...>
#+end_src

#+results:
[[file:chromatogram.png]]

#+attr_html: :frame border :border 1 :class center
<...a-table-here...>

To do this, it is necessary that Python is recognised by the system. You can do this (on Windows 7) by going to Start → Control panel → System → Advanced system settings → Environment variables → User variables for <yourusername>; create the Path variable if it is not present, or modify it, adding C:\WinPython32bit-3.4.4.2\python-3.4.4;
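As a quick sanity check that the interpreter resolved from the Path is the intended one, Python can report its own version and location (a minimal sketch; the printed paths will of course differ per machine):

```python
# Verify which interpreter the system actually resolves after editing
# the Path variable; the output differs from machine to machine.
import sys

print(sys.version)       # interpreter version string
print(sys.executable)    # full path of the running Python binary
```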

Another method is to use Emacs as an IDE. A Python file can simply be evaluated by pressing the F8 function key (see the above mentioned kbd "<f8>" option). Figure 2 shows an Emacs session with three buffers open: on the left, the Python code; on the top right, a dired buffer as a file manager; and on the bottom right, the Python console with a tabular output. This is not very different from the Spyder IDE (which is included in the WinPython distribution) shown in Figure 3, with the same three buffers open.

Hierarchical cluster analysis

This example is about agglomerative hierarchical clustering. The data table is the famous iris flower data set and is taken from http://en.wikipedia.org/wiki/Iris_flower_data_set. It has 150 rows and five columns: sepal length, sepal width, petal length, petal width and species name (Iris setosa from row 1 to 50, Iris versicolor from row 51 to 100, Iris virginica from row 101 to 150). The code is short, as shown below:

import scipy.cluster.hierarchy as hca
import xlrd
from pylab import *

rcParams["font.family"]=["DejaVu Sans"]
rcParams["font.size"]=10

w=xlrd.open_workbook("iris.xls").sheet_by_name("Sheet1")
data=array([[w.cell_value(r,c) for c in range(w.ncols)]
            for r in range(w.nrows)])
dataS=data-mean(data,axis=0)

o=range(1,w.nrows+1)
y=hca.linkage(dataS,metric="euclidean",method="ward")
hca.dendrogram(y,labels=o,color_threshold=10,
               truncate_mode="lastp",p=25)
xticks(rotation=90)
tick_params(top=False,right=False,direction="out")
ylabel("Distance")
figtext(0.5,0.95,"Iris flower data set",ha="center",fontsize=12)
figtext(0.5,0.91,"Dendrogram (center, euclidean, ward)",
        ha="center",fontsize=10)
savefig("figure.png",format="png",dpi=300)

First, the table is read as an array with a nested loop; then, column centring is performed. There are various other scaling techniques; column centring is just one example of them, and the ‘S’ in dataS is for ‘scaled’. In this example, the euclidean metric and the ward linkage method are chosen. Many other metrics are available, for example canberra, cityblock, mahalanobis, etc. There are also many other linkage methods, for example average, complete, single, etc. Finally, the dendrogram is plotted as shown in Figure 4. In this example, the clusters are coloured by cutting the dendrogram at a distance equal to 10, using the option color_threshold. To enhance its readability, the dendrogram has been condensed a bit using the option truncate_mode.
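Cutting the dendrogram at a fixed distance can also produce flat cluster labels, via scipy's fcluster. The following self-contained sketch uses synthetic data (two well-separated blobs, not the iris table) so it can run without the spreadsheet:

```python
# Sketch: cut a ward linkage at distance 10 into flat cluster labels.
# Synthetic data is used here instead of the iris table.
import numpy as np
import scipy.cluster.hierarchy as hca

rng=np.random.default_rng(0)
# two well-separated 2D blobs of 20 points each
data=np.vstack([rng.normal(0,0.5,(20,2)),
                rng.normal(5,0.5,(20,2))])
dataS=data-data.mean(axis=0)          # column centring, as in the article

y=hca.linkage(dataS,metric="euclidean",method="ward")
# same kind of threshold used for colouring the dendrogram
labels=hca.fcluster(y,t=10,criterion="distance")
print(sorted(set(labels)))            # the distinct cluster labels
```

With the blobs this far apart, the cut at distance 10 separates exactly the two groups.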

Principal component analysis

This second example is about three different techniques: matrix algebra, singular value decomposition (SVD) and the modular toolkit for data processing (MDP). For the first, the covariance matrix is calculated on the scaled data. Then, eigenvalues and eigenvectors are calculated from the covariance matrix and sorted. The scores are calculated by a dot product of the scaled data with the eigenvectors. The percentage of variance explained and its running total are also calculated.

covmat=cov(dataS,rowvar=False)
eigval,eigvec=linalg.eig(covmat)
idx=eigval.argsort()[::-1]
eigval=eigval[idx]
eigvec=eigvec[:,idx]
scores=dot(dataS,eigvec)
percentage=[0]*w.ncols
for i in range(0,w.ncols):
    percentage[i]=eigval[i]/sum(eigval)*100
runtot=cumsum(percentage)

The number of components (N), the variance explained by each component (VAR), its percentage (PCT) and the percentage running total (SUM) can be presented as a table. This table can be drawn using the package prettytable. The results are formatted with a certain number of decimal places and then each column is added to the table.

from prettytable import *

o=range(1,w.nrows+1)
v=range(1,w.ncols+1)
e=["%.2f" % i for i in eigval]
p=["%.4f" % i for i in percentage]
r=["%.2f" % i for i in runtot]

pt=PrettyTable()
pt.add_column("N",v)
pt.add_column("VAR",e)
pt.add_column("PCT",p)
pt.add_column("SUM",r)
pt.align="r"
print(pt)

The result is a well-formatted table:

The scree plot is drawn with a simple bar plot (Figure 5), and the scores (Figure 6) and the loadings (Figure 7) with plot. For the scores, the colours are chosen according to the different iris species, because in this example the data are already categorised.
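A minimal sketch of the colouring idea, using hypothetical random numbers in place of the real PCA scores and assuming the 150 rows are ordered in three 50-row species blocks, as in the iris table:

```python
# Sketch: colour a 2D scores plot by known class membership.
# s1 and s2 are hypothetical stand-ins for the first two score columns.
import numpy as np
import matplotlib
matplotlib.use("Agg")                 # headless backend for this sketch
import matplotlib.pyplot as plt

rng=np.random.default_rng(3)
s1=rng.normal(size=150)
s2=rng.normal(size=150)
colours=["r"]*50+["g"]*50+["b"]*50    # one colour per 50-row species block

plt.scatter(s1,s2,c=colours,s=12)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.savefig("scores_sketch.png",dpi=150)
```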

A bit more complex is the scores plot with clipart, shown in Figure 8 as an example. The original clipart is taken from http://www.worldartsme.com/images/irisflowerclipart1.jpg, and then processed with ImageMagick. Each clipart is read with imread, zoomed with OffsetImage and then placed on the plot at the scores coordinates with AnnotationBbox, according to the following code:

from matplotlib.image import imread
from matplotlib.offsetbox import AnnotationBbox,OffsetImage

i1=imread("iris1.png")
i2=imread("iris2.png")
i3=imread("iris3.png")
o=range(1,w.nrows+1)
ax=subplot(111)
for i,j,o in zip(s1,s2,o):
    if o<51:
        ib=OffsetImage(i1,zoom=0.75)
    elif o>50 and o<101:
        ib=OffsetImage(i2,zoom=0.75)
    elif o>100:
        ib=OffsetImage(i3,zoom=0.75)
    ab=AnnotationBbox(ib,[i,j],xybox=None,xycoords="data",
                      frameon=False,boxcoords=None)
    ax.add_artist(ab)

The two plots of scores and loadings can be overlapped to obtain a particular plot called the biplot. The example presented here is based on a scaling of the scores, as in the following code:

xS=(1/(max(s1)-min(s1)))*1.15 yS=(1/(max(s2)-min(s2)))*1.15

Then the loadings are plotted with arrow over the scores, and the result is shown in Figure 9. This solution is based on the one proposed at http://sukhbinder.wordpress.com/2015/08/05/biplotwithpython; it probably is not the best way, but it works.
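The overall idea can be sketched in a self-contained way with hypothetical scores and loadings (random numbers standing in for the real PCA output): the scores are rescaled with the factors above, and each loading is drawn as an arrow from the origin.

```python
# Sketch of a biplot: rescaled scores as points, loadings as arrows.
# s1, s2 and load are hypothetical stand-ins for the real PCA results.
import numpy as np
import matplotlib
matplotlib.use("Agg")                 # headless backend for this sketch
import matplotlib.pyplot as plt

rng=np.random.default_rng(2)
s1=rng.normal(size=50)                # hypothetical scores, PC 1
s2=rng.normal(size=50)                # hypothetical scores, PC 2
load=rng.normal(size=(4,2))*0.8       # hypothetical 4-variable loadings

# scaling factors, as in the article's code
xS=(1/(max(s1)-min(s1)))*1.15
yS=(1/(max(s2)-min(s2)))*1.15

plt.scatter(s1*xS,s2*yS,s=10)
for k in range(load.shape[0]):
    plt.arrow(0,0,load[k,0],load[k,1],color="r",head_width=0.02)
    plt.text(load[k,0]*1.1,load[k,1]*1.1,"v%d"%(k+1),color="r")
plt.savefig("biplot_sketch.png",dpi=150)
```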

The 3D plots (Figures 10 and 11) do not present any particular problems, and can be done according to the following code:

from mpl_toolkits.mplot3d import Axes3D

ax=Axes3D(figure(0),azim=-70,elev=20)
ax.scatter(s1,s2,s3,marker="")
for i,j,h,o in zip(s1,s2,s3,o):
    if o<51:
        k="r"
    elif o>50 and o<101:
        k="g"
    elif o>100:
        k="b"
    ax.text(i,j,h,"%.0f"%o,color=k,ha="center",va="center",fontsize=8)

Using the singular value decomposition (SVD) is very easy: just call pcasvd on the scaled data. The result is shown in Figure 12.

from statsmodels.sandbox.tools.tools_pca import pcasvd

xreduced,scores,evals,evecs=pcasvd(dataS)

The modular toolkit for data processing (MDP) package (see References 4 and 5) is not included in WinPython, so it's necessary to download the source MDP-3.5.tar.gz from https://pypi.python.org/pypi/MDP. Then open the WinPython control panel and go to the install/upgrade packages tab. Drag the source file and drop it there, then click on ‘Install packages’. Last, test the installation with the following command:

import mdp
mdp.test()

This is a bit time-consuming; another test is the following command:

import bimdp
bimdp.test()

In the following example, the scores are calculated using the singular value decomposition, so Figures 12 and 13 are equal to each other but rotated compared with Figure 6. This is explained at http://stackoverflow.com/questions/17998228, where, to quote user Hong Ooi: "The signs of the eigenvectors are arbitrary. You can flip them without changing the meaning of the result, only their direction matters."

import mdp

pca=mdp.nodes.PCANode(svd=True)
scores=pca.execute(array(dataS))
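The sign ambiguity can be checked numerically without MDP: scores computed via the covariance eigendecomposition and via SVD agree column by column up to a sign flip. The sketch below uses synthetic data (random numbers, not the iris table) so it is self-contained:

```python
# Illustration of the sign ambiguity: PCA scores from the covariance
# eigendecomposition and from SVD match up to the sign of each column.
# Synthetic data; neither MDP nor the iris spreadsheet is needed here.
import numpy as np

rng=np.random.default_rng(1)
data=rng.normal(size=(30,4))
dataS=data-data.mean(axis=0)          # column centring

# route 1: eigendecomposition of the covariance matrix
eigval,eigvec=np.linalg.eigh(np.cov(dataS,rowvar=False))
idx=eigval.argsort()[::-1]            # sort components by variance
scores_eig=dataS@eigvec[:,idx]

# route 2: SVD of the centred data
u,s,vt=np.linalg.svd(dataS,full_matrices=False)
scores_svd=dataS@vt.T

# each column matches the other route's column up to a sign flip
for c in range(4):
    col=scores_svd[:,c]
    assert (np.allclose(col,scores_eig[:,c],atol=1e-6) or
            np.allclose(col,-scores_eig[:,c],atol=1e-6))
print("scores agree up to sign")
```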

The MDP package is more complex than described here. Many other things can be done with it. It's also well documented; for example, the tutorial is more than 200 pages long.

The examples presented here are also typical applications for another very widely used, free and open source software, R. An interesting comparison of Python and R for data analysis was published some time ago (Reference 7). I can't make a choice, because I like them both. Currently, I use Python almost exclusively, but in the past, R was my preferred language. It's useful to develop the same script for both and then compare the results.

Figure 1: Python and org-mode
Figure 2: Python and Emacs
Figure 3: Spyder IDE
Figure 4: Agglomerative hierarchical clustering
Figure 5: Scree plot
Figure 6: Scores - PC 1 vs PC 2
Figure 7: Loadings - PC 1 vs PC 2
Figure 8: Scores with clipart
Figure 9: Biplot
Figure 10: Scores 3D
Figure 11: Loadings 3D
Figure 12: Scores SVD
Figure 13: Scores MDP
