OpenSource For You

Top 10 Open Source Data Mining Tools

Data remains as raw text until it is mined and the information contained within it is harnessed. Mining data to make sense out of it has applications in varied fields of industry and academia. In this article, we explore the best open source tools that ca

- By: Shravan. I.V. The author currently works as a software developer at Cisco Systems India. He is interested in open source technologies and can be reached at iv.shravan@gmail.com

Data mining, also known as knowledge discovery in databases, is the process of analysing enormous amounts of data and extracting information from it. Data mining can quickly answer business questions that would otherwise have consumed a lot of time. Some of its applications include market segmentation (like identifying the characteristics of customers who buy a certain product from a certain brand), fraud detection (identifying transaction patterns that could probably result in online fraud), and market basket and trend analysis (finding which products or services are frequently purchased together). This article focuses on the various open source options available and their significance in different contexts.

A brief look at mining tasks

For those who are new to data mining, let’s take a brief look at some of the common mining tasks.

Pre-processing: This involves all the preliminary tasks that can help in getting started with any of the actual mining tasks. Pre-processing could involve removing anomalies and noise from the data that’s about to be mined, filling in missing values, normalising the data or compressing data using techniques like generalisation and aggregation.
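To make two of these steps concrete, here is a minimal sketch in plain Python of filling in missing values and normalising a column. The `ages` column and both helper names are made up for illustration; filling with the mean and min-max scaling are just one common choice each, not the only way to do it.

```python
from statistics import mean


def fill_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_normalise(values):
    """Rescale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical column with one missing entry
ages = [20, None, 40, 60]
filled = fill_missing(ages)
scaled = min_max_normalise(filled)
print(filled)
print(scaled)
```

In practice, the mining tools covered in this article ship their own filters for these steps, so hand-written helpers like these are mainly useful for understanding what those filters do.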

Clustering: This is partitioning a huge set of data into sub-classes of related items, without the categories being defined in advance.
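As a rough illustration, the sketch below runs k-means, one widely used clustering algorithm, on one-dimensional data in plain Python. The data points, the starting centroids and the function name `kmeans_1d` are assumptions chosen for the example; real tools work on many dimensions and pick starting centroids more carefully.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Lloyd's k-means on 1-D data, starting from the given centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid)
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical data with two obvious groups
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = kmeans_1d(points, centroids=[1.0, 12.0])
print(centroids)
print(clusters)
```

With this data the two centroids settle at 2.0 and 11.0, one per group.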

Classification: This is tagging or classifying data items into different user-defined categories.

Outlier analysis helps in identifying those data elements which are deviant or distant from the rest of the elements in a data set. This can help in anomaly detection.
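One simple way to express 'distant from the rest' is a z-score test, sketched below in plain Python. The readings, the threshold of 2.0 and the name `find_outliers` are assumptions for illustration; dedicated tools offer many more robust methods.

```python
from statistics import mean, pstdev


def find_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values are identical, so nothing can be an outlier
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical sensor readings with one suspicious value
readings = [10, 12, 11, 13, 12, 11, 95]
outliers = find_outliers(readings)
print(outliers)
```

Here only the reading of 95 is flagged. Note that a single extreme value also inflates the mean and the standard deviation, which is why this test can miss outliers in very small samples.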

Associative analysis helps in bringing out hidden relationships among data items in a large data set. This can help in predicting the occurrence of a particular item in a transaction or an event whenever some other item is present. You can think of this as a conditional probability: the chance of seeing one item, given that another item is already present.
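The conditional probability view can be sketched with the two standard measures used in association rule mining, support and confidence. The shopping baskets and function names below are made up for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) as support(both) / support(antecedent)."""
    sup_a = support(transactions, antecedent)
    if sup_a == 0:
        return 0.0
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / sup_a


# Hypothetical shopping baskets
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
]
sup = support(baskets, {"bread", "butter"})
conf = confidence(baskets, {"bread"}, {"butter"})
print(sup, conf)
```

Bread and butter appear together in two of the four baskets (support 0.5), and butter appears in two of the three baskets that contain bread, so the rule 'bread implies butter' has a confidence of about 0.67.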

Regression is used to predict values of a dependent variable by constructing a model or a mathematical function out of independent variables.
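For a single independent variable, the simplest such function is a straight line fitted by ordinary least squares. The sketch below assumes made-up data points and a hypothetical helper named `fit_line`.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations that lie exactly on y = 2x + 1
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
prediction = slope * 5 + intercept  # predicted y for an unseen x = 5
print(slope, intercept, prediction)
```

The fitted line recovers a slope of 2 and an intercept of 1, and predicts 11 for x = 5.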

Summarisation helps in coming up with a compact description for the whole data set.

Data mining is a combination of various techniques like pattern recognition, statistics, machine learning, etc. There is a good amount of intersection between machine learning and data mining, as both go hand in hand and machine learning algorithms are used for mining data. Even so, we will restrict ourselves in this article to only those tools specialised for data mining.

Weka

Weka is Java based free and open source software licensed under the GNU GPL and available for use on Linux, Mac OS X and Windows. It comprises a collection of machine learning algorithms for data mining. It packages tools for data pre-processing, classification, regression, clustering, association rules and visualisation. The various ways of accessing it are Weka Knowledge Explorer, Experimenter, Knowledge Flow and a simple CLI. Explorer is a user-friendly graphical interface for two-dimensional visualisation of mined data. It lets you import the raw data from various file formats, and supports well known algorithms for different mining actions like filtering, clustering, classification and attribute selection. However, when dealing with large data sets, it is best to use a CLI based approach, as Explorer tries to load the whole data set into the main memory, causing performance issues. This software also provides a Java API for use in applications and can connect to databases using JDBC.

Weka has proved to be an ideal choice for educational and research purposes, as well as for rapid prototyping.

RapidMiner

RapidMiner is available in both FOSS and commercial editions and is a leading predictive analytics platform. Gartner, the US research and advisory firm, recognised RapidMiner and Knime as leaders in its 2016 Magic Quadrant for advanced analytics platforms. RapidMiner helps enterprises embed predictive analysis in their business processes through a user-friendly, rich library of data science and machine learning algorithms, delivered in all-in-one environments like RapidMiner Studio. Besides the standard data mining features like data cleansing, filtering, clustering, etc, the software also features built-in templates, repeatable workflows, a professional visualisation environment, and seamless integration of languages like Python and R into workflows, all of which aid rapid prototyping. The tool is also compatible with Weka scripts. RapidMiner is used for business/commercial applications, research and education.

Orange

Python users playing around with data science might be familiar with Orange. It is a Python library that powers Python scripts with its rich compilation of mining and machine learning algorithms for data pre-processing, classification, modelling, regression, clustering and other miscellaneous functions. Orange also comes with a visual programming environment, and its workbench consists of tools for importing data, and for dragging and dropping widgets and linking them together to complete a workflow. The visual programming environment comes with an easy-to-use UI, with plenty of online tutorials for assistance. Due to the ease of programming and integration in Python, Orange can be a great starting point for novices and experts to plunge into data mining.

Knime

Knime is one of the leading open source analytics, integration and reporting platforms, and comes as free software as well as in a commercial version. Written in Java and built upon Eclipse, it is accessed through a GUI that provides options to create the data flow and conduct data pre-processing, collection, analysis, modelling and reporting. A Gartner survey reveals that customers are happy with the platform’s flexibility, openness and smooth integration with other software like Weka and R. Despite the small size of the company, Knime has a large user base and an active community. It makes use of Eclipse’s extension mechanism to add plugins for required functionalities like text and image mining. This software is ideal for enterprise use.

DataMelt

DataMelt or DMelt does much more than just data mining. It is a computational platform, offering statistics, numeric and symbolic computations, scientific visualisation, etc. To avoid digressing from our topic, I’ll restrict myself to only covering its data mining capabilities. DMelt provides data mining features like linear regression, curve fitting, cluster analysis, neural networks, fuzzy algorithms, analytic calculations and interactive visualisations using 2D/3D plots and histograms. One can play around with its IDE (integrated development environment), or its functions can be called from applications using its Java API. Both community and commercial editions of DMelt are available on Linux, Mac OS, Windows and Android platforms. DMelt is a successor to the jHepWork and SCaVis programs, which some people working in data analysis might be familiar with. This software is well suited for students, engineers and scientists.

Apache Mahout

Mahout is primarily a library of machine learning algorithms that can help in clustering, classification and frequent pattern mining. It can be used in a distributed mode, which eases integration with Hadoop. Mahout is currently being used by some of the giants in the tech industry like Adobe, AOL, Drupal and Twitter, and it has also made an impact in research and academics. It can be a great choice for anyone looking for easy integration with Hadoop and to mine huge volumes of data.

ELKI

ELKI is open source software written in Java and licensed under AGPLv3. This software focuses especially on cluster analysis and outlier detection with a compilation of numerous algorithms from both these domains. The software is accessed through a GUI that displays the results once the selected algorithm is run. ELKI’s design goals are performance, scalability, completeness, extensibility and a modular design to welcome contributions. ELKI currently doesn’t offer professional support and the software is optimised for use in science and research. Hence, this option works best for those in research.

MOA

Massive Online Analysis (MOA), as the name suggests, is primarily data stream mining software that is well suited for applications that need to handle large volumes of real-time data streams at high speed. MOA is distributed under the GNU GPL, and can be used via the command line, GUI or Java API. It is a rich compilation of machine learning algorithms and has proved to be a great choice during the design of real-time applications. Stream mining algorithms typically have to compute quickly without storing the whole data set in memory, and have to get the work done within a limited time. MOA is well suited for these requirements. Weka and MOA can be closely linked to each other, and classifiers from either tool can be called from the other. For those looking to analyse and mine information from real-time data, MOA can be the best choice.
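The constraint of not storing the whole data set can be illustrated with a running mean and variance that are updated one value at a time (Welford's method). This is a plain Python sketch of the general idea only; it does not use or reflect MOA's own API, and the class name and sample stream are assumptions.

```python
class RunningStats:
    """Single-pass mean and variance that never stores the stream itself."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new value into the statistics in constant time and memory."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of the values seen so far."""
        return self._m2 / self.n if self.n else 0.0


def sensor_stream():
    """Hypothetical stream; a real one would arrive over a network or a queue."""
    yield from [2, 4, 4, 4, 5, 5, 7, 9]


stats = RunningStats()
for value in sensor_stream():
    stats.update(value)
print(stats.n, stats.mean, stats.variance)
```

Each value is seen once and then discarded, so memory use stays constant however long the stream runs. For this sample stream the mean works out to 5 and the variance to 4.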

KEEL

KEEL (Knowledge Extraction based on Evolutionary Learning) is a Java based open source tool distributed under GPLv3. It is powered by a well-organised GUI that lets you manage (import, export, edit and visualise) data in different file formats, and experiment with the data (through its data pre-processing, statistical libraries and some standard data mining and evolutionary learning algorithms). Since KEEL is based on Java, a JVM has to be installed on the system to run its GUI and do data mining experiments. You may visit http://keel.es/ for the complete list of supported algorithms. KEEL is ideal for research and educational purposes. It serves as a useful aid for teachers.

Rattle

Rattle, expanded to ‘R Analytical Tool To Learn Easily’, has been developed using the R statistical programming language. The software can run on Linux, Mac OS and Windows, and features statistics, clustering, modelling and visualisation with the computing power of R. Rattle is currently being used in business, commercial enterprises and for teaching purposes in Australian and American universities.

The tools and software discussed so far are not the only ones available; the list keeps growing. While I have covered only those tools exclusively meant for mining data, there are a few other machine learning, NLP and data analytics tools that could aid in mining, like scikit-learn, NLTK, GraphLab, Neural Designer, Pandas and SPMF, which readers could explore.
