OpenSource For You

A Quick Look at Data Mining with Weka

With an abundance of data from different sources, data mining for various purposes is the rage these days. Weka is a collection of machine learning algorithms that can be used for data mining tasks. It is open source software and can be used via a GUI, Ja


Waikato Environment for Knowledge Analysis (Weka) is free software licensed under the GNU General Public License. It was developed by the Department of Computer Science at the University of Waikato, New Zealand. Weka is a collection of machine learning tools that includes data preprocessing tools, classification and regression algorithms, clustering algorithms, algorithms for finding association rules, and algorithms for feature selection. It is written in Java and runs on almost any platform.

Let’s look at the machine learning and data mining options available in Weka and see how a newcomer can use the Weka GUI to learn various data mining techniques. Weka can be used in three different ways: via the GUI, a Java API and a command line interface. The GUI has three components (Explorer, Experimenter and Knowledge Flow), apart from a simple command line interface.

The components of Explorer

Explorer has the following components.

Preprocess: The first component of Explorer provides options for data preprocessing. Data can be imported in various formats such as ARFF, CSV, C4.5 and binary. ARFF stands for attribute-relation file format, and it was developed for use with the Weka machine learning software. Figure 1 explains the components of the ARFF format, using the Iris data set that ships with Weka as an example. The first part is the relation name. The ‘attribute’ section contains the names of the attributes and their data types, and the ‘data’ section that follows contains the actual instances.
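
To make the three parts of an ARFF file concrete, here is a minimal Python sketch that reads the relation name, the attribute declarations and the data rows from a small Iris-style sample. It is an illustration only and not Weka's own parser: the function name parse_arff and the sample text are assumptions, and quoted values, sparse rows and date attributes are not handled.

```python
# Minimal ARFF reader: a sketch, not Weka's own parser.
# Handles only @relation, @attribute and comma-separated @data rows.

SAMPLE_ARFF = """\
% Three sample rows in the style of the Iris data set
@relation iris

@attribute sepallength numeric
@attribute sepalwidth numeric
@attribute petallength numeric
@attribute petalwidth numeric
@attribute class {Iris-setosa,Iris-versicolor,Iris-virginica}

@data
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""


def parse_arff(text):
    """Return (relation name, list of (attribute, type) pairs, data rows)."""
    relation = None
    attributes = []
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lower = line.lower()
        if in_data:
            rows.append(line.split(","))
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, attr_type = line.split(None, 2)
            attributes.append((name, attr_type))
        elif lower.startswith("@data"):
            in_data = True  # everything after this line is an instance
    return relation, attributes, rows


relation, attributes, rows = parse_arff(SAMPLE_ARFF)
print(relation, len(attributes), len(rows))
```

Each row is kept as a list of strings; converting numeric columns to floats is left out to keep the sketch short.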

Data can also be imported from a URL or from a SQL database (using JDBC). The Explorer component provides an option to edit the data set, if required. Weka has specific tools for data preprocessing, called filters.

Filters are either supervised or unsupervised. Within each of these two groups there are two categories: attribute filters and instance filters. Filters can remove attributes or instances that meet a certain condition, and they can also be used for discretisation, normalisation, resampling, attribute selection, and transforming and combining attributes. Data discretisation is a data reduction technique that converts a large domain of numerical values into categorical values.
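
As a rough illustration of what two such filters do, the following Python sketch applies min-max normalisation and equal-width discretisation to a single numeric attribute. The function names and the choice of equal-width bins are assumptions made for this example; Weka's own filters offer more options and may behave differently.

```python
def normalise(values):
    """Rescale numeric values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]  # a constant attribute carries no spread
    return [(v - lo) / span for v in values]


def discretise(values, bins=3):
    """Replace each numeric value with the index of its equal-width bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    if width == 0:
        return [0 for _ in values]
    labels = []
    for v in values:
        index = int((v - lo) / width)
        labels.append(min(index, bins - 1))  # the maximum falls in the last bin
    return labels


petal_lengths = [1.0, 2.0, 3.0, 4.0, 5.0, 7.0]  # made-up values for illustration
print(normalise(petal_lengths))
print(discretise(petal_lengths, bins=3))
```

After discretisation the attribute has only three possible values, which is the sense in which it acts as a data reduction step.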

Classify: The next panel in Weka Explorer is Classify. A classifier is a model for predicting nominal or numeric quantities, and Weka includes various machine learning techniques such as decision trees and lists, instance-based classifiers, support vector machines, multi-layer perceptrons, logistic regression and Bayesian networks. Figure 3 shows an example of a decision tree built with the J4.8 algorithm to classify the Iris data set into different types of Iris plants, based on attributes such as sepal length and width, and petal length and width. The panel provides options to evaluate on the training set, to use a supplied test set from an existing file, to cross-validate, or to split the data into training and testing sets based on a given percentage. The classifier output gives a detailed summary of correctly/incorrectly classified instances, mean absolute error, root mean square error, etc.
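
The summary figures mentioned above are simple to compute by hand. The sketch below shows a percentage split together with accuracy, mean absolute error and root mean square error. All function names are assumptions for illustration, the split takes rows in their given order without shuffling, and the error measures are shown for plain numeric predictions, which may not match exactly how Weka reports them for a classifier.

```python
import math


def percentage_split(rows, train_percent):
    """Split rows into (train, test) by percentage, keeping the given order."""
    cut = round(len(rows) * train_percent / 100)
    return rows[:cut], rows[cut:]


def accuracy(actual, predicted):
    """Fraction of instances whose predicted label equals the actual label."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


def mean_absolute_error(actual, predicted):
    """Average absolute difference between numeric predictions and targets."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared difference."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


train, test = percentage_split(list(range(10)), 70)
print(len(train), len(test))
print(accuracy(["a", "b", "a", "c"], ["a", "b", "c", "c"]))
print(mean_absolute_error([1, 2, 3], [1, 2, 6]))
print(root_mean_squared_error([1, 2, 3], [1, 2, 6]))
```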

Cluster: The Cluster panel is similar to the Classify panel. Many techniques like k-Means, EM, Cobweb, X-means and Farthest First are implemented. When clusters are evaluated against a known class, the output in this tab contains a confusion matrix, which shows how many errors there would be if the clusters were used instead of the true class.
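
The idea behind such a confusion matrix can be sketched in a few lines of Python. The example below counts how often each true class falls into each cluster, labels every cluster with its most frequent class, and counts the instances that disagree with that label. The majority-label rule and the function name are simplifying assumptions for this sketch, and Weka's own class-to-cluster mapping may differ.

```python
from collections import Counter, defaultdict


def classes_to_clusters(true_classes, cluster_ids):
    """Return (cluster -> class counts, cluster -> majority class, error count)."""
    matrix = defaultdict(Counter)
    for cls, cluster in zip(true_classes, cluster_ids):
        matrix[cluster][cls] += 1
    # Label each cluster with the class that occurs in it most often.
    majority = {
        cluster: counts.most_common(1)[0][0]
        for cluster, counts in matrix.items()
    }
    # An instance is an error if its class differs from its cluster's label.
    errors = sum(
        1 for cls, cluster in zip(true_classes, cluster_ids)
        if majority[cluster] != cls
    )
    return matrix, majority, errors


# Made-up assignments: six instances, two classes, two clusters.
classes = ["a", "a", "a", "b", "b", "b"]
clusters = [0, 0, 1, 1, 1, 1]
matrix, majority, errors = classes_to_clusters(classes, clusters)
print(dict(matrix), majority, errors)
```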

Associate: The ‘Associate’ panel can be used to find associations in the given input data. It contains an implementation of the Apriori algorithm for learning association rules. Such algorithms can identify statistical dependencies between groups of attributes, and compute all the rules that have a given minimum support as well as exceed a given confidence level. Here, association means how one set of attribute values implies another. Support is the fraction of all transactions that contain every item in a rule, so setting a minimum support keeps only the rules whose items occur together often enough. Confidence is the fraction of transactions matching the rule’s condition for which its conclusion also holds.
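
Support and confidence are easy to work out on a toy example. The Python sketch below computes both for a single rule over four made-up shopping baskets. It only evaluates a rule that is already given and does not implement the Apriori search itself; the function names and the basket data are assumptions for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base


# Hypothetical baskets, invented for this example.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support(baskets, {"bread", "milk"}))       # bread and milk together
print(confidence(baskets, {"bread"}, {"milk"}))  # rule: bread => milk
```

Here bread and milk appear together in two of the four baskets, and two of the three baskets with bread also contain milk, so the rule has a support of one half and a confidence of two thirds.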

Select Attributes: This tab can be used to identify the important attributes. It has two parts: a search method such as best-first, forward selection, random, exhaustive, genetic algorithm or ranking, and an evaluation method such as correlation-based, wrapper, information gain or chi-squared.
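
As one example of an evaluation method paired with ranking, the sketch below scores nominal attributes by information gain and sorts them from most to least informative. It is a plain textbook calculation rather than Weka's implementation; the function names and the tiny data set are assumptions for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(labels).values()
    )


def information_gain(attribute_values, labels):
    """Reduction in class entropy after splitting on a nominal attribute."""
    total = len(labels)
    remainder = 0.0
    for value in set(attribute_values):
        subset = [
            label for a, label in zip(attribute_values, labels) if a == value
        ]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def rank_attributes(columns, labels):
    """Return (attribute name, gain) pairs, highest gain first."""
    scores = {
        name: information_gain(values, labels)
        for name, values in columns.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up data: 'perfect' separates the classes, 'useless' does not.
labels = ["yes", "yes", "no", "no"]
columns = {
    "useless": ["x", "y", "x", "y"],
    "perfect": ["a", "a", "b", "b"],
}
print(rank_attributes(columns, labels))
```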

Visualize: This tab can be used to visualise the data. It displays a matrix of scatter plots, one for every pair of attributes.

The components of Experimenter

The Experimenter option available in Weka enables the user to run experiments on a data set by choosing different algorithms and analysing the output. It has the following components.

Setup: The first one is used to set up the data sets, algorithms, output destination, etc. Figure 4 shows an example of comparing the J4.8 decision tree with ZeroR on the Iris data set. We can add more data sets and compare the outcome using more algorithms, if required.
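
ZeroR is a useful yardstick in such comparisons because it ignores every attribute and always predicts the most frequent class. The short Python sketch below shows that behaviour for nominal classes; the function names and labels are assumptions for illustration, and it is not Weka's code.

```python
from collections import Counter


def zero_r_fit(train_labels):
    """Return the most frequent class label in the training data."""
    return Counter(train_labels).most_common(1)[0][0]


def zero_r_accuracy(train_labels, test_labels):
    """Accuracy obtained by predicting the training majority class everywhere."""
    majority = zero_r_fit(train_labels)
    hits = sum(1 for label in test_labels if label == majority)
    return hits / len(test_labels)


# Made-up labels for illustration.
train = ["a", "a", "b"]
test = ["a", "b", "b", "a"]
print(zero_r_fit(train), zero_r_accuracy(train, test))
```

Any learning algorithm worth using on a data set should beat this baseline.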

Figure 1: ARFF file format
Figure 2: Weka - Explorer (preprocess)
Figure 3: Classification using Weka
Figure 4: Weka Experimental environment
