A Quick Look at Data Mining with Weka

With an abundance of data available from different sources, data mining for various purposes is all the rage these days. Weka is a collection of machine learning algorithms that can be used for data mining tasks. It is open source software, and can be used via a GUI, a Java API or a command line interface.

OpenSource For You - Developers

Waikato Environment for Knowledge Analysis (Weka) is free software licensed under the GNU General Public License. It has been developed by the Department of Computer Science, University of Waikato, New Zealand. Weka has a collection of machine learning algorithms, including data preprocessing tools, classification/regression algorithms, clustering algorithms, algorithms for finding association rules, and algorithms for feature selection. It is written in Java and runs on almost any platform.

Let’s look at the various options of machine learning and data mining available in Weka, and discover how the Weka GUI can be used by a newbie to learn various data mining techniques. Weka can be used in three different ways: via the GUI, a Java API and a command line interface. The GUI has three components: Explorer, Experimenter and Knowledge Flow, apart from a simple command line interface.

The components of Explorer

Explorer has the following components.

Preprocess: The first component of Explorer provides an option for data preprocessing. Various formats of data like ARFF, CSV, C4.5, binary, etc, can be imported. ARFF stands for attribute-relation file format, and it was developed for use with the Weka machine learning software. Figure 1 explains the various components of the ARFF format, using the Iris data set that comes along with Weka as an example. The first part is the relation name. The ‘attribute’ section lists the names of the attributes and their data types, and the ‘data’ section that follows holds the actual instances.
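To make the layout concrete, an abbreviated sketch of the Iris ARFF file (attribute names as in the copy bundled with Weka; only two of the 150 data rows are shown):

```
@relation iris

@attribute sepallength numeric
@attribute sepalwidth  numeric
@attribute petallength numeric
@attribute petalwidth  numeric
@attribute class {Iris-setosa,Iris-versicolor,Iris-virginica}

@data
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
```

Numeric attributes are declared with the `numeric` keyword, while the nominal class attribute enumerates its possible values in braces.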

Data can also be imported from a URL or from a SQL database (using JDBC). The Explorer component provides an option to edit the data set, if required. Weka has specific tools for data preprocessing, called filters.

Filters come in two groups: supervised and unsupervised. Each group is further divided into attribute filters and instance filters. Filters are used to remove attributes or instances that meet a certain condition, and can be used for discretisation, normalisation, resampling, attribute selection, and transforming and combining attributes. Data discretisation is a data reduction technique, which is used to convert a large domain of numerical values to categorical values.
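As a rough illustration of what an unsupervised discretisation filter does (this is a hand-rolled sketch of equal-width binning, not Weka’s own Discretize code), each numeric value is mapped to one of a fixed number of equal-width intervals:

```java
// Illustrative sketch of equal-width discretisation, the kind of
// transformation an unsupervised discretisation filter performs.
public class EqualWidthDiscretizer {
    // Map a numeric value into one of `bins` equal-width intervals
    // spanning [min, max]. Values at max fall into the last bin.
    public static int bin(double value, double min, double max, int bins) {
        double width = (max - min) / bins;
        int index = (int) ((value - min) / width);
        return Math.min(index, bins - 1); // clamp the maximum value
    }

    public static void main(String[] args) {
        // Sepal length in the Iris data ranges roughly from 4.3 to 7.9 cm.
        System.out.println(bin(5.0, 4.3, 7.9, 3)); // lands in bin 0
        System.out.println(bin(7.9, 4.3, 7.9, 3)); // clamped into bin 2
    }
}
```

After such a transformation, an attribute like sepal length becomes a three-valued categorical attribute (say, ‘low’, ‘medium’, ‘high’).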

Classify: The next option in Weka Explorer is the Classifier, which builds a model for predicting nominal or numeric quantities. It includes various machine learning techniques like decision trees and lists, instance-based classifiers, support vector machines, multi-layer perceptrons, logistic regression, Bayesian networks, etc. Figure 3 shows an example of a decision tree built with the J48 algorithm, which classifies the Iris data set into different types of Iris plants depending upon attributes like sepal length and width, petal length and width, etc. The panel provides options to use the training set or a supplied test set from an existing file, as well as to cross-validate or to split the data into training and testing sets based on a given percentage. The classifier output gives a detailed summary of correctly/incorrectly classified instances, mean absolute error, root mean square error, etc.
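Two of those summary statistics are easy to compute by hand. The sketch below (my own simplification; Weka’s actual evaluation works over full class-probability distributions) shows accuracy over predicted labels and mean absolute error over predicted probabilities for a two-class problem:

```java
// A sketch of two summary statistics reported in the classifier
// output: accuracy (correctly classified instances) and mean
// absolute error over predicted class probabilities.
public class EvalSketch {
    public static double accuracy(int[] actual, int[] predicted) {
        int correct = 0;
        for (int i = 0; i < actual.length; i++)
            if (actual[i] == predicted[i]) correct++;
        return (double) correct / actual.length;
    }

    // Mean absolute error: average of |p - y|, where p is the
    // predicted probability of the positive class and y is 0 or 1.
    public static double meanAbsoluteError(double[] prob, int[] actual) {
        double sum = 0;
        for (int i = 0; i < prob.length; i++)
            sum += Math.abs(prob[i] - actual[i]);
        return sum / prob.length;
    }

    public static void main(String[] args) {
        int[] y    = {1, 0, 1, 1};
        int[] yhat = {1, 0, 0, 1};
        System.out.println(accuracy(y, yhat)); // 3 of 4 correct -> 0.75
        System.out.println(meanAbsoluteError(new double[]{0.9, 0.2, 0.4, 0.8}, y));
    }
}
```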

Cluster: The Cluster panel is similar to the Classify panel. Many techniques like k-means, EM, Cobweb, X-means and Farthest First are implemented. The output in this tab contains the confusion matrix, which shows how many errors there would be if the clusters were used instead of the true class.
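The core loop of k-means, the simplest of these techniques, fits in a few lines. This is a minimal one-dimensional sketch of the idea (not Weka’s SimpleKMeans implementation): assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat until nothing moves.

```java
import java.util.Arrays;

// A minimal one-dimensional k-means: assignment step, update step,
// repeat until the centroids stop moving.
public class KMeans1D {
    public static double[] cluster(double[] points, double[] centroids) {
        double[] c = centroids.clone();
        for (int iter = 0; iter < 100; iter++) {
            double[] sum = new double[c.length];
            int[] count = new int[c.length];
            for (double p : points) {            // assignment step
                int best = 0;
                for (int j = 1; j < c.length; j++)
                    if (Math.abs(p - c[j]) < Math.abs(p - c[best])) best = j;
                sum[best] += p;
                count[best]++;
            }
            boolean moved = false;               // update step
            for (int j = 0; j < c.length; j++) {
                if (count[j] == 0) continue;
                double mean = sum[j] / count[j];
                if (mean != c[j]) { c[j] = mean; moved = true; }
            }
            if (!moved) break;
        }
        return c;
    }

    public static void main(String[] args) {
        double[] pts = {1.0, 1.2, 0.8, 9.0, 9.5, 8.5};
        // Centroids converge near 1.0 and 9.0 for these two clumps.
        System.out.println(Arrays.toString(cluster(pts, new double[]{0.0, 10.0})));
    }
}
```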

Associate: To find associations in the given set of input data, ‘Associate’ can be used. It contains an implementation of the Apriori algorithm for learning association rules. Such algorithms can identify statistical dependencies between groups of attributes, and compute all the rules that have a given minimum support and exceed a given confidence level. Here, an association describes how one set of attributes determines another: support is the fraction of all transactions that contain every item in the rule, and confidence is the fraction of transactions containing the first set of items that also contain the second.
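These two measures can be sketched directly (a toy market-basket example of my own, not Weka’s Apriori code) for a candidate rule X ⇒ Y:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Support and confidence, the two measures Apriori uses to keep or
// discard a candidate association rule X => Y.
public class RuleMeasures {
    // support(X => Y): fraction of transactions containing X and Y.
    public static double support(List<Set<String>> txns, Set<String> x, Set<String> y) {
        Set<String> both = new HashSet<>(x);
        both.addAll(y);
        long hits = txns.stream().filter(t -> t.containsAll(both)).count();
        return (double) hits / txns.size();
    }

    // confidence(X => Y) = support(X and Y) / support(X).
    public static double confidence(List<Set<String>> txns, Set<String> x, Set<String> y) {
        long xHits = txns.stream().filter(t -> t.containsAll(x)).count();
        Set<String> both = new HashSet<>(x);
        both.addAll(y);
        long bothHits = txns.stream().filter(t -> t.containsAll(both)).count();
        return (double) bothHits / xHits;
    }

    public static void main(String[] args) {
        List<Set<String>> txns = List.of(
            Set.of("bread", "milk"),
            Set.of("bread", "butter"),
            Set.of("bread", "milk", "butter"),
            Set.of("milk"));
        // bread => milk holds in 2 of 4 transactions...
        System.out.println(support(txns, Set.of("bread"), Set.of("milk")));
        // ...and in 2 of the 3 transactions that contain bread.
        System.out.println(confidence(txns, Set.of("bread"), Set.of("milk")));
    }
}
```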

Select Attributes: This tab can be used to identify the most important attributes. It has two parts: a search method, such as best-first, forward selection, random, exhaustive, genetic algorithm or ranking, and an evaluation method, such as correlation-based, wrapper, information gain, chi-squared, etc.
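Information gain, one of those evaluation measures, is simply the drop in class entropy produced by splitting on an attribute. A hand-rolled sketch (my own numbers, not Weka’s InfoGainAttributeEval):

```java
// A sketch of information gain: how much splitting on an attribute
// reduces the entropy of the class distribution.
public class InfoGain {
    // Entropy (in bits) of a class distribution given as counts.
    public static double entropy(int... counts) {
        int total = 0;
        for (int c : counts) total += c;
        double h = 0;
        for (int c : counts) {
            if (c == 0) continue;
            double p = (double) c / total;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    public static void main(String[] args) {
        // Parent node: 8 positive, 8 negative -> entropy of exactly 1 bit.
        double parent = entropy(8, 8);
        // A split into (7,1) and (1,7) leaves far less entropy behind.
        double children = 0.5 * entropy(7, 1) + 0.5 * entropy(1, 7);
        System.out.println(parent - children); // the information gain
    }
}
```

An attribute whose split produces purer child nodes gets a higher gain, and therefore a higher rank in the Select Attributes output.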

Visualize: This tab can be used to visualise the results. It displays a matrix of scatter plots, one for every pair of attributes.

The components of Experimenter

The Experimenter option available in Weka enables the user to perform experiments on the data set by choosing different algorithms and analysing the output. It has the following components.

Setup: The first component is used to set up the data sets, algorithms, output destination, etc. Figure 4 shows an example of comparing the J48 decision tree with ZeroR on the Iris data set. We can add more data sets and compare the outcome using more algorithms, if required.
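ZeroR makes a useful baseline precisely because it is so simple: it ignores every attribute and always predicts the most frequent class in the training data. A minimal sketch of that idea (not Weka’s own ZeroR class):

```java
import java.util.HashMap;
import java.util.Map;

// The ZeroR baseline: ignore all attributes and always predict the
// majority class seen in the training labels.
public class ZeroRSketch {
    public static String majorityClass(String[] labels) {
        Map<String, Integer> counts = new HashMap<>();
        String best = labels[0];
        for (String l : labels) {
            int c = counts.merge(l, 1, Integer::sum); // increment this label's count
            if (c > counts.get(best)) best = l;       // track the current majority
        }
        return best;
    }

    public static void main(String[] args) {
        String[] labels = {"setosa", "versicolor", "setosa", "setosa", "virginica"};
        System.out.println(majorityClass(labels)); // prints "setosa"
    }
}
```

Any classifier worth keeping, such as J48, should beat this majority-class accuracy; the Experimenter makes that comparison systematic.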

Figure 1: ARFF file format

Figure 2: Weka - Explorer (preprocess)

Figure 3: Classification using Weka

Figure 4: Weka Experiment Environment
