OpenSource For You

An Introduction to Regression Models in Python

Python is a very versatile language. It offers many choices for Web applications and is extended by thousands of third-party modules. Python is also an effective tool for regression analysis.


I think that machine learning, data mining, chemometrics and statistics are not the same thing, but there are aspects common to all four topics. For example, a chapter about regression analysis (the techniques for estimating the numeric relationship among data elements) can be found in books on each of them; some examples include Reference 1 Chapter 6, Reference 2 Chapter 13, Reference 3 Chapter 2, and Reference 4 Chapter 7. In the real world, there are sometimes problems with the fundamentals of linear algebra. There is also the idea that a higher explained variance or a higher correlation means a better model, or that the only way to do regression is with a commercial spreadsheet. For these reasons, I think it's necessary to clarify some things with an introductory article.

From a more technical point of view, if I need to measure the concentration of a certain substance in an underground water sample, I must be capable of measuring a concentration at least 10 times lower than the admitted limit. For example, if the contamination limit is 1.1 micrograms per litre, I must at least be capable of measuring a concentration equal to 0.11 micrograms per litre (these values depend on national laws). With a limit settled by law, the choice of a regression model becomes even more important. That choice also matters in ion chromatography, because there it's better to speak of calibration rather than linearity: the calibration can't always be represented by a first order regression, but can sometimes be represented by a second degree curve. So the concept of linearity should be carefully considered, particularly with regard to measurement by suppressed conductivity (see Reference 5). This problem is not considered in those methods in which the calibration is valid only if it's linear, discarding a priori any other type of model.

The toolbox

This article is based on Mint 18 Xfce, Emacs 24.5, Geany and Anaconda. The last one can be freely downloaded from https://www.continuum.io/downloads. Then, from a terminal window, type bash Anaconda3-4.1.1-Linux-x86.sh and just follow the instructions on the screen. Because the PrettyTable and Seaborn packages are not available in the Anaconda distribution, I installed them easily by typing the following in a terminal window:
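The original listing is not reproduced in this text; a plausible sketch, assuming the Anaconda environment's pip is on your PATH, would be:

```shell
# install the two packages missing from the Anaconda distribution
pip install prettytable seaborn
```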

My Emacs configuration for Python on Linux is almost the same as on Windows (see the September 2016 issue of OSFY). There is only one difference: I have replaced the line given below:

… with the following ones:

With respect to Geany, I have only set the 'Execute' command to /home/<your-username>/anaconda_4.1.1/bin/python "%f".

Regression models

The first way to build a polynomial model is by using matrix algebra. Let's consider the following code, which is used instead of equations:
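The article's listing is missing here; the following is a minimal sketch of the matrix-algebra approach described below, on made-up data, reusing the variable names the text refers to (degree, weight, fTf, coef, xx, yp):

```python
import numpy as np

# Hypothetical example data, exactly y = 2*x + 0.05
x = np.array([0.1, 0.5, 1.0, 5.0, 10.0])
y = np.array([0.25, 1.05, 2.05, 10.05, 20.05])

degree = 1             # 1 = linear model, 2 = quadratic model
weight = 1 / x**2      # weighted model; use 1/x**0 for an unweighted fit

# Design matrix f with columns x**0 .. x**degree
# (for fitting through zero, use range(1, degree + 1) instead)
f = np.column_stack([x**j for j in range(degree + 1)])

# Weighted normal equations: (f' W f) coef = f' W y
W = np.diag(weight)
fTf = f.T @ W @ f
coef = np.linalg.solve(fTf, f.T @ W @ y)   # [intercept, slope] for degree 1

# Smooth curve xx vs yp for plotting
xx = np.linspace(x.min(), x.max(), 50)
yp = np.polyval(coef[::-1], xx)            # polyval wants highest degree first
```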

In the example, the degree is equal to 1 (a linear model) and the weight is equal to 1/x² (a weighted model). For a simple linear model, degree=1 and weight=1 (or weight=1/array(x)**0). For a simple quadratic model, degree=2 and weight=1. With some simple matrix calculations, it's possible to obtain the coefficients coef of the model and then plot the curve xx vs yp. For fitting through zero, set weight=1/array(x)**0 and for j in range(1,degree+1); add a small constant, for example 0.001, to the diagonal of fTf.

For a linear model, the predicted values (x values back calculated) are calculated with predicted=(y-intercept)/slope and, for a quadratic model, with predicted=(-b+sqrt(b**2-(4*a*delta)))/(2*a), in which delta=c-y. The a, b and c values are the coefficient of the second degree term, the coefficient of the first degree term and the constant value, respectively. The accuracy is then calculated as accuracy=predicted*100/x.

The following tables present a simple linear and a weighted linear model for the same experimental data. I'm not a big fan of the squared correlation coefficient but, considering only that, I should choose the simple linear model, because it's the one with the highest value of R². I think it would be better to consider the 'Accuracy' column in both tables and choose the weighted model. In this particular example, we must also consider what was already mentioned in the introduction about the values established by law. Carrying out these calculations with a spreadsheet (and without macros) is, I think, unnecessarily complicated.
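As a small illustration of the back-calculation just described, assuming a linear model fitted with polyfit on made-up, exactly linear data:

```python
import numpy as np

# Hypothetical data: exactly y = 2*x + 0.05
x = np.array([0.1, 0.5, 1.0, 5.0, 10.0])
y = 2 * x + 0.05

# Fit the linear model, then back-calculate x from y
slope, intercept = np.polyfit(x, y, 1)
predicted = (y - intercept) / slope

# Per cent recovery of each back-calculated value against the true x
accuracy = predicted * 100 / x
```

On exact data the accuracy column is 100 per cent everywhere; on real calibration data it is the column worth inspecting row by row.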

Figure 1 shows the plot for another data set, a simulation more or less typical of a pharmacokinetics study.

Another way is to build a suitable matrix via vstack and then apply lstsq to it. Each A matrix created with vstack has the structure shown in the following examples. There are two important things: if the model is built to fit through zero, there is a column of zeros; and if the model is quadratic, there is a column of squared x. The coefficients of each model are a, b and c (if quadratic).
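The matrix examples themselves are not reproduced here; a minimal sketch on made-up data, showing the three A structures the text mentions, could look like this:

```python
import numpy as np

# Hypothetical data: exactly y = 2*x + 1
x = np.array([1.0, 2.0, 4.0, 8.0])
y = 2 * x + 1

# Linear model y = a*x + b: a column of x and a column of ones
A = np.vstack([x, np.ones(len(x))]).T
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)

# Fit through zero: a column of zeros instead of the ones
# (lstsq returns the minimum-norm solution, so that coefficient comes out 0)
A0 = np.vstack([x, np.zeros(len(x))]).T

# Quadratic model y = a*x**2 + b*x + c: an extra column of squared x
Aq = np.vstack([x**2, x, np.ones(len(x))]).T
```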

Probably the simplest way is to use polyfit, with the syntax coef=polyfit(x,y,degree). For a better graphical presentation, I would like to say something about the use of LaTeX and about the Pandas + Seaborn pair, specifying also that the Pandas and Seaborn packages have more complex applications than the one presented here. LaTeX must already be installed on your system; then just add rc("text",usetex=True) to your script. The result is shown in Figure 2. About Pandas: in the following example, a DataFrame with three columns is created. Then the x vs y data are plotted using Seaborn with regplot (lw=0 and marker="o") and, last, the linear model is plotted again with regplot but with different options (lw=1 and marker=""). The result is shown in Figure 3, which is practically the same as with ggplot for R. Note that the ggplot plotting system also exists for Python and is available at http://ggplot.yhathq.com. Another way to obtain a ggplot-like plot is to use a style sheet; for example, put style.use("ggplot") before the plot command. To see all the available styles, just use print(style.available).
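Since the listing is missing from this text, here is a hedged sketch of the polyfit call and the three-column DataFrame, on made-up data; the Seaborn plotting calls are indicated as comments because their exact options in the original are not reproduced here:

```python
import numpy as np
import pandas as pd

# Hypothetical data: exactly y = 2*x + 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2 * x + 1

degree = 1
coef = np.polyfit(x, y, degree)     # highest-degree coefficient first

# DataFrame with three columns: the data and the fitted model
df = pd.DataFrame({"x": x, "y": y, "model": np.polyval(coef, x)})

# Plotting would then use Seaborn, for example:
# import seaborn as sns
# sns.regplot(x="x", y="y", data=df, fit_reg=False, marker="o")
```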

The data frame here is printed via PrettyTable:

I have never used the confidence band in practice, but there are several ways to calculate and plot it. Here, a calculation is proposed based on the t-distribution from scipy.stats, where ip is the part below (inferior to) the model and sp the part above (superior to) it. The data set is taken from Reference 7. A nice explanation of confidence intervals is given, for example, in Reference 2, pages 86-91.
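The proposed calculation itself is missing from this text; a sketch of one standard way to build ip and sp for a linear fit from the t-distribution, on made-up data (Reference 7's data set is not reproduced), is:

```python
import numpy as np
from scipy.stats import t

# Made-up data with a little scatter
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 5.8])

n = len(x)
coef = np.polyfit(x, y, 1)
resid = y - np.polyval(coef, x)
s = np.sqrt(np.sum(resid**2) / (n - 2))    # residual standard error
tval = t.ppf(0.975, n - 2)                 # 95% two-sided quantile

xx = np.linspace(x.min(), x.max(), 100)
yy = np.polyval(coef, xx)
half = tval * s * np.sqrt(1/n + (xx - x.mean())**2 / np.sum((x - x.mean())**2))
ip = yy - half   # inferior part, below the model
sp = yy + half   # superior part, above the model
```

The band is narrowest near the mean of x and widens towards the extremes, which is the behaviour visible in Figure 4.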

The StatsModels package

Another way to build a regression model is by using the StatsModels package in combination with the Pandas package. An example is shown in the following code. The values for x and y are read from an xls file, then the weight is defined as 1/x². Using the Pandas package, a DataFrame is defined for the x, y couple and a Series for the weights. Two types of regression are then calculated: OLS (Ordinary Least Squares, the one previously called the simple linear fit) and WLS (Weighted Least Squares). Last, both models are plotted with a simple plot. More information can be printed: the slope with ols_fit.params[1], the intercept with ols_fit.params[0], the r-squared with ols_fit.rsquared, and a little report with ols_fit.summary(), or the equivalents for the weighted model using 'wls' instead of 'ols'. Two examples are shown in Figures 7 and 8. Further information, such as residuals and Cook's distance, can be printed or plotted using resid, resid_pearson and get_influence() respectively. A nice and large collection of examples is presented in Reference 8.

[Figure 1: A linear weighted model]
[Figure 4: Confidence band]
[Figure 2: Python and LaTeX]
[Figure 3: Pandas and Seaborn]
