MATLAB to Weka Interface: Introduction

 

Weka is an open-source platform providing various machine learning algorithms for data mining tasks. Although Weka provides excellent graphical user interfaces (GUIs), I sometimes wished I had more flexibility in programming Weka. For instance, I often needed to perform analyses based on leave-one-subject-out cross-validation, which is quite difficult to do in the Weka GUI. I do most of my analyses in MATLAB, so I searched for an interface between MATLAB and Weka. Fortunately, Weka is implemented in Java, and MATLAB can call Java classes directly.

Here I introduce an efficient MATLAB to Weka interface, which was implemented based on the initial work of Matt Dunham.

This work is still in progress, and I have only included the code that I mainly use for my own work. If you would like to collaborate to improve the code, or if you find any bugs, please don't hesitate to reach me at "silee {at} partners {dot} org".

You may also find the same information at MATLAB Central. Please click here.

 

Download

 

MATLAB CODE & EXAMPLES

 

Revisions

 

7/22/2015: 1) Paths to Weka have been updated to work for Mac users. Thanks to Giovanni Mascia. 2) The "crossvalind" function, which requires the Bioinformatics Toolbox, has been replaced with idxCV = ceil(rand([1 N])*K)+1;. Thanks to Igor Varfolomeev.

4/21/2015: 1) The input files for the example code have been added, since some older versions of MATLAB don't have them built in. 2) The classifier and cost-sensitive classifier now produce "nominal outputs" rather than "numerical outputs". Thanks to Mr. Giovanni Mascia for these modifications!

3/23/2015: There was a small bug in wekaRegression.m and regression_example.m, which is now fixed.


 

Useful Tips

 

Java Memory Issue

Some of the functions in Weka (e.g., Gaussian Process Regression) require a large Java heap size within the MATLAB environment. If you receive an error message similar to "java.lang.OutOfMemoryError: Java heap space", you need to increase the heap size manually. You can increase (or decrease) the heap size at File -> Preferences -> General -> Java Heap Memory. More information can be found here.
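Before running a memory-hungry Weka algorithm, it can help to check how much heap the JVM inside MATLAB actually has. The following is a minimal sketch that queries the standard java.lang.Runtime class; the variable names are illustrative only.

```matlab
% Query the JVM that MATLAB is running (requires MATLAB started with Java).
rt = java.lang.Runtime.getRuntime();

% maxMemory() is the upper limit the heap may grow to, in bytes.
maxHeapMB  = double(rt.maxMemory()) / 2^20;
% totalMemory() - freeMemory() approximates the heap currently in use.
usedHeapMB = double(rt.totalMemory() - rt.freeMemory()) / 2^20;

fprintf('Java heap: %.0f MB used of %.0f MB maximum\n', usedHeapMB, maxHeapMB);
```

If the reported maximum is small compared with the size of your dataset, raising the heap size in the preferences before training is a reasonable first step.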

 

Tutorial

 

classifier_example.m

 

This is an example that performs multi-class classification on the IRIS dataset using a classifier available in Weka.

 

costSensitiveClassifier_example.m

 

This is an example that performs cost-sensitive classification on the IRIS dataset using a classifier available in Weka.

 

regression_example.m

 

This is an example that performs regression on the Imported-Car dataset using a regression algorithm available in Weka.

 

clustering_example.m

 

This is an example that performs unsupervised clustering on the IRIS dataset using a clustering algorithm available in Weka.

 

matlab2weka / matlab2weka.jar

 

JAVA SOURCE CODE HERE

JAVA DOC HERE

This is Java code that converts a MATLAB dataset into a Weka Instances object. It was motivated by the work of Matt Dunham, who used a MATLAB file to perform the same conversion. However, I found that code extremely slow because it relies heavily on loops. I therefore implemented the same conversion in Java, and it runs much faster.

The matlab2weka.jar can handle both nominal and numerical attribute values as well as nominal and numerical class values. Although the detailed Java Doc can be found here, I briefly discuss a number of examples to help you use "matlab2weka.jar" in the MATLAB environment.

First, in order to use this code, you must add the JAR file to MATLAB's Java class path and import the matlab2weka package using the following code.

javaaddpath([pwd filesep 'matlab2weka' filesep 'matlab2weka.jar']);
import matlab2weka.*;
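The path in the lines above is relative to the current folder, so a wrong working directory is an easy mistake to make. The following is a minimal sketch of a guarded version, assuming the same folder layout as above; the warning identifier is made up for illustration.

```matlab
% Build the JAR path relative to the current folder, as in the lines above.
jarPath = fullfile(pwd, 'matlab2weka', 'matlab2weka.jar');

if exist(jarPath, 'file') ~= 2
    % The JAR is not where we expect it; check the working directory.
    warning('matlab2weka:jarNotFound', 'Cannot find %s', jarPath);
elseif ~any(strcmp(javaclasspath('-dynamic'), jarPath))
    % Add the JAR only if it is not already on the dynamic class path.
    javaaddpath(jarPath);
end
```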

 

1. Converting a dataset composed of both numerical and nominal attributes and a nominal target (i.e., class) variable.

convert2wekaObj = convert2weka(datasetName, attrNameNumeric, dataNumeric, attrNameNominal, dataNominal, classNominal, hasClass);

where datasetName is a string describing the dataset (e.g., 'training_dataset'),
attrNameNumeric is a (M by 1) cell vector containing the names of the numerical attributes as strings,
dataNumeric is a (M by N) (not N by M!!!) numerical matrix containing the values of the numerical attributes,
attrNameNominal is a (D by 1) cell vector containing the names of the nominal attributes as strings,
dataNominal is a (D by N) (not N by D!!!) cell matrix containing the string values of the nominal attributes,
classNominal is a (N by 1) cell vector containing the string labels of the class,
hasClass is a boolean variable indicating whether or not to include the "class" attribute in the dataset (e.g., true for classification/regression, and false for clustering).
Note that M represents the number of numerical attributes, D represents the number of nominal attributes, and N represents the number of instances in this context.

This call invokes the following constructor within matlab2weka.jar, which creates a Java object of class convert2weka.

convert2weka(java.lang.String name, java.lang.String[] attrNameNumeric, double[][] dataNumeric, java.lang.String[] attrNameString, java.lang.String[][] dataString, java.lang.String[] classLabel, boolean hasClass)

Then, we retrieve the Weka Instances object by calling the following function in MATLAB.

wekaDataset = convert2wekaObj.getInstances();
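As an illustration of the expected shapes, here is a minimal sketch with a made-up toy dataset (N = 4 instances, M = 2 numerical attributes, D = 2 nominal attributes). All names and values are hypothetical, and the conversion itself runs only if both matlab2weka.jar and Weka are on the Java class path.

```matlab
% Toy dataset: N = 4 instances, M = 2 numerical and D = 2 nominal attributes.
datasetName     = 'toy_dataset';
attrNameNumeric = {'height'; 'weight'};               % M by 1 cell vector
dataNumeric     = [170 182 165 158;                   % M by N: one row per attribute,
                    65  80  59  52];                  % one column per instance
attrNameNominal = {'smoker'; 'sex'};                  % D by 1 cell vector
dataNominal     = {'no',     'yes',  'no',     'yes'; % D by N cell matrix
                   'female', 'male', 'female', 'male'};
classNominal    = {'low'; 'high'; 'low'; 'high'};     % N by 1 cell vector
hasClass        = true;

% Check the shapes before crossing into Java, where errors are harder to read.
N = numel(classNominal);
assert(isequal(size(dataNumeric), [numel(attrNameNumeric), N]));
assert(isequal(size(dataNominal), [numel(attrNameNominal), N]));

% Run the conversion only if the required Java classes are available.
if exist('matlab2weka.convert2weka', 'class') == 8 && ...
        exist('weka.core.Instances', 'class') == 8
    obj = matlab2weka.convert2weka(datasetName, attrNameNumeric, dataNumeric, ...
        attrNameNominal, dataNominal, classNominal, hasClass);
    wekaDataset = obj.getInstances();
    disp(wekaDataset.numInstances());   % number of instances in the Weka dataset
end
```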

 

2. Converting a dataset composed only of numerical attributes and a numerical dependent (i.e., class) variable.

convert2wekaObj = convert2weka(datasetName, attrNameNumeric, dataNumeric, classNumerical, hasClass);

The matlab2weka JAR file supports datasets that have only numerical attributes together with a numerical target variable; it also supports datasets that have only nominal attributes. The call above invokes the following constructor within matlab2weka.jar, which creates a Java object of class convert2weka.

convert2weka(java.lang.String name, java.lang.String[] attrNameNumeric, double[][] dataNumeric, double[] classLabel, boolean hasClass)

Then again, we retrieve the Weka Instances object by calling the following function in MATLAB.

wekaDataset = convert2wekaObj.getInstances();
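MATLAB data is usually stored with one row per instance (N by M), whereas the constructor expects M by N, so a transpose is typically needed. The following is a minimal sketch with made-up values; the dataset name and attribute names are hypothetical.

```matlab
% Typical MATLAB layout: one row per instance (N by M).
featData        = [5.1 3.5; 4.9 3.0; 6.2 2.9];     % N = 3 instances, M = 2 attributes
classNumerical  = [1.4; 1.5; 4.3];                 % N by 1 numerical target
attrNameNumeric = {'sepal_length'; 'sepal_width'}; % M by 1 cell vector

% convert2weka expects M by N, so transpose the feature matrix.
dataNumeric = featData.';
assert(size(dataNumeric, 1) == numel(attrNameNumeric));
assert(size(dataNumeric, 2) == numel(classNumerical));

% Run the conversion only if the required Java classes are available.
if exist('matlab2weka.convert2weka', 'class') == 8 && ...
        exist('weka.core.Instances', 'class') == 8
    obj = matlab2weka.convert2weka('toy_regression', attrNameNumeric, ...
        dataNumeric, classNumerical, true);
    wekaDataset = obj.getInstances();
end
```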

 

Please refer to the example code, such as matlab2weka/wekaClassification.m and matlab2weka/wekaRegression.m, for the use of the matlab2weka JAR file to convert data to Weka Instances objects.

 

matlab2weka / wekaClassification.m

 

This function receives MATLAB numerical training and testing data as its input, converts the data into Weka Instances objects, trains a classification model using the training data, and predicts the class values of the testing data. Currently, this function only supports numerical input data, but it can be easily modified to accept nominal inputs (since the matlab2weka.jar file supports nominal inputs).

This function can be called by executing the following MATLAB code.

[actual, predicted, probDistr] = wekaClassification(featTrain, classTrain, featTest, classTest, featName, classifier);

where featTrain is a (Ntr by M) numerical matrix of training features,
classTrain is a (Ntr by 1) cell vector of strings containing the class labels of the training data,
featTest is a (Nts by M) numerical matrix of testing features,
classTest is a (Nts by 1) cell vector of strings containing the class labels of the testing data,
featName is a (1 by M) cell vector of strings containing the names of the attributes,
classifier is a variable that selects a classifier from the Weka package. For now, classifier = 1 selects Random Forest, classifier = 2 selects J48 Decision Tree, classifier = 3 selects Support Vector Machine, and classifier = 4 selects Logistic Regression.
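Because the training and testing sets are passed in explicitly, custom validation schemes such as leave-one-subject-out cross-validation become a simple loop. The following is a minimal sketch with made-up data; subjectId is a hypothetical vector that is not part of the toolbox, and the sketch assumes that actual and predicted are returned as cell arrays of class-label strings.

```matlab
% Hypothetical toy data: 6 instances from 3 subjects, M = 2 features.
feat       = [1.0 2.0; 1.1 2.1; 3.0 0.5; 3.2 0.4; 5.0 5.0; 5.1 4.9];
labels     = {'a'; 'a'; 'b'; 'b'; 'a'; 'b'};
subjectId  = [1; 1; 2; 2; 3; 3];       % which subject each row came from
featName   = {'f1', 'f2'};
classifier = 1;                        % 1 = Random Forest

subjects = unique(subjectId);
nCorrect = 0;
nTested  = 0;
for s = 1:numel(subjects)
    isTest = (subjectId == subjects(s));   % hold out every row of one subject
    if exist('wekaClassification', 'file') == 2
        [actual, predicted] = wekaClassification(feat(~isTest, :), labels(~isTest), ...
            feat(isTest, :), labels(isTest), featName, classifier);
        % Assumption: both outputs are cell arrays of label strings.
        nCorrect = nCorrect + sum(strcmp(actual, predicted));
    end
    nTested = nTested + sum(isTest);
end
fprintf('Held out %d instances across %d subjects (%d correct)\n', ...
    nTested, numel(subjects), nCorrect);
```

Every instance is held out exactly once, and no subject contributes to both the training and the testing set of the same fold.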

 

matlab2weka / wekaCostSensitiveClassification.m

 

This function receives MATLAB numerical training and testing data as its input, converts the data into Weka Instances objects, trains a cost-sensitive classification model using the training data and the cost matrix, and predicts the class values of the testing data. Currently, this function only supports numerical input data, but it can be easily modified to accept nominal inputs (since the matlab2weka.jar file supports nominal inputs).

This function can be called by executing the following MATLAB code.

[actual, predicted, probDistr] = wekaCostSensitiveClassification(featTrain, classTrain, featTest, classTest, featName, costMatrix, classifier);

where featTrain is a (Ntr by M) numerical matrix of training features,
classTrain is a (Ntr by 1) cell vector of strings containing the class labels of the training data,
featTest is a (Nts by M) numerical matrix of testing features,
classTest is a (Nts by 1) cell vector of strings containing the class labels of the testing data,
featName is a (1 by M) cell vector of strings containing the names of the attributes,
costMatrix is a (C by C) numerical matrix representing the cost matrix, where C is the number of classes. For example, in the cost matrix [0, C12; C21, 0], C12 represents the cost of misclassifying 'class_1' as 'class_2', and C21 represents the cost of misclassifying 'class_2' as 'class_1',
classifier is a variable that selects a classifier from the Weka package. For now, classifier = 1 selects Random Forest, classifier = 2 selects J48 Decision Tree, classifier = 3 selects Support Vector Machine, and classifier = 4 selects Logistic Regression.
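As a concrete case of the cost matrix described above, here is a minimal sketch for two classes in which one kind of mistake is five times as costly as the other. The cost values are purely illustrative.

```matlab
% Two classes; the numbers are made up for illustration.
C12 = 1;    % cost of labeling a true 'class_1' instance as 'class_2'
C21 = 5;    % cost of labeling a true 'class_2' instance as 'class_1'
costMatrix = [0, C12; C21, 0];

% Sanity checks: square, non-negative, and no cost for correct predictions.
assert(size(costMatrix, 1) == size(costMatrix, 2));
assert(all(costMatrix(:) >= 0));
assert(all(diag(costMatrix) == 0));
```

With these values, the trained model should be more reluctant to predict 'class_1' for instances that may belong to 'class_2'.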

 

matlab2weka / wekaRegression.m

 

This function receives MATLAB numerical training and testing data as its input, converts the data into Weka Instances objects, trains a regression model using the training data, and predicts the target values of the testing data. Currently, this function only supports numerical input data, but it can be easily modified to accept nominal inputs (since the matlab2weka.jar file supports nominal inputs).

This function can be called by executing the following MATLAB code.

[actual, predicted, probDistr] = wekaRegression(featTrain, classTrain, featTest, classTest, featName, classifier);

where featTrain is a (Ntr by M) numerical matrix of training features,
classTrain is a (Ntr by 1) numerical vector containing the values of the dependent variable of the training data,
featTest is a (Nts by M) numerical matrix of testing features,
classTest is a (Nts by 1) numerical vector containing the values of the dependent variable of the testing data,
featName is a (1 by M) cell vector of strings containing the names of the attributes,
classifier is a variable that selects a regression algorithm from the Weka package. For now, classifier = 1 selects Support Vector Regression, classifier = 2 selects Nearest Neighbor Regression, and classifier = 3 selects Gaussian Process Regression.
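Once the actual and predicted values are available, common error measures can be computed directly in MATLAB. The following is a minimal sketch that uses stand-in numbers in place of real wekaRegression output and assumes both outputs are numerical vectors of the same length.

```matlab
% Stand-in values; in practice these come from wekaRegression.
actual    = [2.0; 3.5; 5.0; 4.0];
predicted = [2.5; 3.0; 5.5; 4.0];

err  = predicted(:) - actual(:);   % per-instance prediction error
rmse = sqrt(mean(err .^ 2));       % root-mean-square error
mae  = mean(abs(err));             % mean absolute error
fprintf('RMSE = %.4f, MAE = %.4f\n', rmse, mae);
```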

 

matlab2weka / wekaClustering.m

 

This function receives MATLAB numerical data as its input, converts the data into Weka Instances objects, and performs unsupervised clustering. Currently, this function only supports numerical input data, but it can be easily modified to accept nominal inputs (since the matlab2weka.jar file supports nominal inputs).

This function can be called by executing the following MATLAB code.

[predicted, probDistr, numClusters] = wekaClustering(featData, featName, numClusters, clusterer);

where featData is a (N by M) numerical matrix of features,
featName is a (1 by M) cell vector of strings containing the names of the attributes,
numClusters is a variable that represents the pre-defined number of clusters. If this number is not known and you would like the algorithm to determine it, set the value to -1,
clusterer is a variable that selects a clusterer from the Weka package. For now, clusterer = 1 selects Expectation Maximization, and clusterer = 2 selects K-Means Clustering.
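A call with an unknown number of clusters might look like the following minimal sketch. The data are made up (two well-separated groups of ten points each), and the call runs only if wekaClustering.m is on the MATLAB path.

```matlab
% Hypothetical toy data: two groups around (0, 0) and (5, 5), M = 2 features.
% A deterministic perturbation is used so the example is reproducible.
featData = [repmat([0 0], 10, 1); repmat([5 5], 10, 1)] ...
    + 0.1 * [sin(1:20).', cos(1:20).'];
featName    = {'x', 'y'};
numClusters = -1;   % -1 asks the algorithm to choose the number of clusters
clusterer   = 1;    % 1 = Expectation Maximization

if exist('wekaClustering', 'file') == 2
    [predicted, probDistr, numClusters] = ...
        wekaClustering(featData, featName, numClusters, clusterer);
    fprintf('Clusters found: %d\n', numClusters);
end
```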

 

matlab2weka / wekaFeatureSelection.m

 

This function receives MATLAB numerical training and testing data as its input, converts the data into Weka Instances objects, and performs attribute selection. Currently, it only supports attribute selectors that do not transform the attribute dimensionality (i.e., attribute selectors that return the indices of the selected attributes). For instance, this file does not support attribute selection based on Principal Component Analysis. Furthermore, this function only supports numerical input data, but it can be easily modified to accept nominal inputs (since the matlab2weka.jar file supports nominal inputs).

This function can be called by executing the following MATLAB code.

[selectedAttr] = wekaFeatureSelection(featTrain, featTest, classTrain, classTest, featName, selector);

where featTrain is a (Ntr by M) numerical matrix of training features,
featTest is a (Nts by M) numerical matrix of testing features,
classTrain is a (Ntr by 1) cell vector of strings containing the class labels of the training data,
classTest is a (Nts by 1) cell vector of strings containing the class labels of the testing data,
featName is a (1 by M) cell vector of strings containing the names of the attributes,
selector is a variable that selects an attribute selector from the Weka package. For now, selector = 1 selects CfsSubsetEval.
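The returned indices can then be used to reduce the feature matrices before training. The following is a minimal sketch with stand-in values, assuming that selectedAttr holds 1-based column indices; if the function returns Weka's 0-based indices instead, add 1 before indexing.

```matlab
% Stand-in result; in practice this comes from wekaFeatureSelection.
selectedAttr = [1 3];                   % assumed 1-based column indices
featTrain = [1 2 3 4; 5 6 7 8];         % Ntr = 2, M = 4
featTest  = [9 10 11 12];               % Nts = 1, M = 4
featName  = {'f1', 'f2', 'f3', 'f4'};

% Keep the same columns in the training data, the testing data, and the names.
featTrainSel = featTrain(:, selectedAttr);
featTestSel  = featTest(:, selectedAttr);
featNameSel  = featName(selectedAttr);
```

The reduced matrices and names can be passed to wekaClassification in place of the originals.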