Sunbelt Computer Software

Classification with Shark-ML machine learning library

Shark-ML is an open-source machine learning library. Its site describes it as follows: "It provides methods for linear and nonlinear optimization, kernel-based learning algorithms, neural networks, and various other machine learning techniques. It serves as a powerful toolbox for real world applications as well as for research. Shark works on Windows, MacOS X, and Linux. It comes with extensive documentation. Shark is licensed under the GNU Lesser General Public License." I can confirm that it offers a wide range of machine learning algorithms together with good documentation, tutorials, and samples. This article therefore covers only the basic API concepts; details can easily be found in the official documentation.

In this article I will show how to use the library to solve a classification problem. I used the Iris data set in this example, so the data loading code depends on its format.

Library installation

The library should be compiled from sources; build scripts can be generated with CMake. Details about the available CMake options can be found in the documentation. Boost must also be installed, because the library depends on it. I did not run into any problems with compilation. I also used the CMAKE_INSTALL_PREFIX option to change the default installation path, which makes it easy to remove the library artifacts after experiments.

Loading data

In this sample I downloaded the data set from the internet. The library has a shark::download function, but it doesn't support the HTTPS protocol, so I used a function written with the curl library API:

static const std::string train_data_url =
"https://raw.githubusercontent.com/pandas-dev/pandas/master/pandas/tests/data/iris.csv";
...
const std::string data_path{"iris.csv"};
if (!fs::exists(data_path)) {
    if (!utils::DownloadFile(train_data_url, data_path)) {
      std::cerr << "Unable to download the file " << train_data_url
                << std::endl;
      return 1;
    }
}

After the file is downloaded, you can use shark::csvStringToData to load the data into an object of type shark::ClassificationDataset. Before doing so I had to prepare this particular data set: remove the row with column names and replace the string label names with numbers:

// ---------- Load data from file to the string
std::ifstream data_file(data_path);
std::string train_data_str((std::istreambuf_iterator<char>(data_file)),
                            std::istreambuf_iterator<char>());

// ----------- Remove first line - columns labels
train_data_str.erase(0, train_data_str.find_first_of("\n") + 1);

// ----------- Replace string labels with integers
train_data_str =
 std::regex_replace(train_data_str, std::regex("Iris-setosa"), "0");
train_data_str =
 std::regex_replace(train_data_str, std::regex("Iris-versicolor"), "1");
train_data_str =
 std::regex_replace(train_data_str, std::regex("Iris-virginica"), "2");
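
The same clean-up can be wrapped into one helper. The sketch below is a minimal stand-alone version that uses only the standard library; the function name PrepareIrisCsv is hypothetical and is not part of Shark-ML or of the original sample.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: drops the header row and maps Iris class names to
// integer labels so that the text can be parsed as purely numeric CSV.
std::string PrepareIrisCsv(std::string csv) {
  // Remove the first line (column names). If there is no newline at all,
  // the whole string is treated as a header and removed.
  const auto eol = csv.find('\n');
  csv.erase(0, eol == std::string::npos ? csv.size() : eol + 1);

  // Replace every class name with its index in this array.
  const char* names[] = {"Iris-setosa", "Iris-versicolor", "Iris-virginica"};
  for (int label = 0; label < 3; ++label) {
    csv = std::regex_replace(csv, std::regex(names[label]),
                             std::to_string(label));
  }
  return csv;
}
```

The returned string can then be passed to shark::csvStringToData in the same way as train_data_str.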

Now I was ready to create a data set object. I passed shark::LAST_COLUMN as the last parameter to tell the function that the last column of the data set contains the labels:

shark::ClassificationDataset train_data;
shark::csvStringToData(train_data, train_data_str, shark::LAST_COLUMN);

Pre-processing data

Before training classifiers or using other ML algorithms, it is usually a good idea to normalize and shuffle your training data. Shark-ML already has a good tutorial about normalization. First I shuffled the data and cut off part of it for a test set (the Iris data set has 150 samples, so 120 remain for training and 30 go to the test set):
```
train_data.shuffle();
auto test_data = shark::splitAtElement(train_data, 120);
```
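
To make the semantics of this step explicit, here is a stand-alone sketch of "shuffle, then split at element n" on a plain std::vector. It is only an analogue written for illustration: the names Shuffle and SplitAtElement below are my own, and I assume (as the code above suggests) that the first n elements stay in the original container while the rest are returned.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <random>
#include <vector>

// Shuffles the samples in place with a seeded generator, so that the
// resulting split is reproducible.
template <typename T>
void Shuffle(std::vector<T>& data, unsigned int seed) {
  std::mt19937 rng(seed);
  std::shuffle(data.begin(), data.end(), rng);
}

// Keeps the first n samples in `data` and returns the remaining ones.
template <typename T>
std::vector<T> SplitAtElement(std::vector<T>& data, std::size_t n) {
  assert(n <= data.size());
  const auto split = data.begin() + static_cast<std::ptrdiff_t>(n);
  std::vector<T> tail(std::make_move_iterator(split),
                      std::make_move_iterator(data.end()));
  data.erase(split, data.end());
  return tail;
}
```

Shuffling before the split matters for this data set because the CSV file is ordered by class; without it the tail would contain samples of a single class only.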
Then I defined a normalizer object and a trainer for it. Shark-ML commonly uses separate types for an algorithm (model) and for its trainer:
```
bool remove_mean = true;
shark::Normalizer<shark::RealVector> normalizer;
shark::NormalizeComponentsUnitVariance<shark::RealVector>
    normalizing_trainer(remove_mean);
//---------- Learn mean and variance from data
normalizing_trainer.train(normalizer, train_data.inputs());
```
After the trainer has learned the mean and variance and configured the normalizer, we can use it to transform our data:
```
train_data = shark::transformInputs(train_data, normalizer);
```
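
The arithmetic behind this trainer/model pair can be sketched without the library. The code below is not Shark-ML's implementation; it is a minimal illustration that assumes the population variance (division by N) and components that are not constant, and all names in it are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Sample = std::vector<double>;

// Per-component statistics learned from the training inputs.
struct ComponentStats {
  std::vector<double> mean;
  std::vector<double> stddev;
};

// "Trainer" step: estimate mean and standard deviation of every component.
ComponentStats LearnStats(const std::vector<Sample>& inputs) {
  assert(!inputs.empty());
  const std::size_t dim = inputs.front().size();
  const double n = static_cast<double>(inputs.size());
  ComponentStats stats{std::vector<double>(dim, 0.0),
                       std::vector<double>(dim, 0.0)};
  for (const Sample& x : inputs)
    for (std::size_t j = 0; j < dim; ++j) stats.mean[j] += x[j] / n;
  for (const Sample& x : inputs)
    for (std::size_t j = 0; j < dim; ++j) {
      const double d = x[j] - stats.mean[j];
      stats.stddev[j] += d * d / n;  // accumulates the variance first
    }
  for (double& s : stats.stddev) s = std::sqrt(s);
  return stats;
}

// "Model" step: apply the learned map (x - mean) / stddev to one sample.
Sample Normalize(const Sample& x, const ComponentStats& stats) {
  Sample out(x.size());
  for (std::size_t j = 0; j < x.size(); ++j)
    out[j] = (x[j] - stats.mean[j]) / stats.stddev[j];
  return out;
}
```

The important point is the same as in the Shark-ML code: the statistics are learned from the training inputs only and are then reused unchanged for the test inputs.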
There are also trainers without a train method, such as the PCA class, which can be used for dimensionality reduction:
```
shark::PCA pca(data.inputs());
shark::LinearModel<> enc;
pca.encoder(enc, 2);
shark::Data<shark::RealVector> encoded_data = enc(data.inputs());
```
Here the pca object took the data for learning in its constructor and configured a model for dimensionality reduction with the encoder method.

SVM

In this sample I used cross-validation for training and grid search to find optimal parameters for the SVM classifier. My code is based on the official samples for SVM and model selection. First of all, let's divide the data into a number of folds:

const unsigned int k = 5;  // number of folds
shark::CVFolds<shark::ClassificationDataset> folds =
        shark::createCVSameSizeBalanced(train_data, k);
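
The idea of class-balanced folds can be illustrated with a few lines of plain C++. This is a sketch of the concept only, not of how Shark-ML builds its folds; the function name AssignBalancedFolds is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Returns, for every sample, the index of the fold it is assigned to.
// Samples of each class are dealt to the folds in round-robin order, so
// every fold receives nearly the same number of samples from every class.
std::vector<std::size_t> AssignBalancedFolds(
    const std::vector<unsigned int>& labels, std::size_t k) {
  assert(k > 0);
  std::map<unsigned int, std::size_t> seen;  // samples seen so far per class
  std::vector<std::size_t> fold(labels.size());
  for (std::size_t i = 0; i < labels.size(); ++i) {
    fold[i] = seen[labels[i]]++ % k;
  }
  return fold;
}
```

During cross-validation each fold in turn serves as the validation part while the remaining k - 1 folds are used for training, and the k validation errors are averaged.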

Next I defined the SVM trainer and classification model objects:

double c{1.0};
double gamma{0.5};
bool offset = true;
bool unconstrained = true;

auto svm = std::make_shared<SVMModel>(gamma, unconstrained);

shark::CSvmTrainer<shark::RealVector> trainer(&svm->kernel, c, offset,
                                            unconstrained);
trainer.setMcSvmType(shark::McSvm::OVA);  // one-versus-all

Please pay attention to the SVMModel type. I defined it because an SVM model consists of a kernel and a classifier, and the kernel must stay alive for as long as the classification model is used. When you pass a pointer to the kernel to the trainer, the trainer neither takes ownership of it nor passes ownership to the classifier object.

struct SVMModel {
    SVMModel(double gamma, bool unconstrained) : kernel(gamma, unconstrained) {}
    shark::GaussianRbfKernel<> kernel;
    //---------- Template parameter is an input type
    shark::KernelClassifier<shark::RealVector> model;
};

After that I had everything required to instantiate the cross-validation error object:

shark::ZeroOneLoss<unsigned int> loss;
shark::CrossValidationError<shark::KernelClassifier<shark::RealVector>,
                          unsigned int>
  cv_error(folds, &trainer, &svm->model, &trainer, &loss);

Now I was ready to use grid search for parameter estimation. Let's define the initial grid parameters:

//---------- Estimate initial gamma value
shark::JaakkolaHeuristic ja(train_data);
double ljg = log(ja.gamma());
//---------- We have two hyperparameters so define the grid accordingly
shark::GridSearch grid;
std::vector<double> min(2);
std::vector<double> max(2);
std::vector<size_t> sections(2);
//---------- Kernel parameter gamma
min[0] = ljg - 4.;
max[0] = ljg + 4;
sections[0] = 9; // number of values in the interval
//---------- Regularization parameter C
min[1] = 0.0;
max[1] = 10.0;
sections[1] = 11; // number of values in the interval
grid.configure(min, max, sections);
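
To see which candidate values such a configuration describes, the sketch below enumerates evenly spaced points for one parameter. It is written under the assumption that the grid is uniform with both end points included, which matches the "number of values in the interval" comments above; GridValues is a hypothetical helper, not a Shark-ML function.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Evenly spaced values covering [min, max] with `sections` points,
// both end points included.
std::vector<double> GridValues(double min, double max, std::size_t sections) {
  assert(sections >= 2 && min <= max);
  std::vector<double> values(sections);
  const double step = (max - min) / static_cast<double>(sections - 1);
  for (std::size_t i = 0; i < sections; ++i)
    values[i] = min + step * static_cast<double>(i);
  return values;
}
```

With the settings above this gives 9 values for the kernel parameter, spaced 1 apart around the logarithm of the Jaakkola estimate, and 11 values from 0 to 10 for the regularization parameter, so the search evaluates the cross-validation error at 9 x 11 = 99 points. As far as I understand, because both the kernel and the trainer were created with unconstrained = true, these numbers are logarithms of gamma and C rather than the values themselves.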

Finally I ran the search:

grid.step(cv_error);

After the search finished, I used the estimated parameters to train the final model:

trainer.setParameterVector(grid.solution().point);
trainer.train(svm->model, train_data);

Finally, I evaluated the model on the training and test data to see the errors:

auto output = svm->model(train_data.inputs());
auto train_error = loss.eval(train_data.labels(), output);
std::cout << name << " train error = " << train_error << std::endl;
    
output = svm->model(normalizer(test_data.inputs()));
auto test_error = loss.eval(test_data.labels(), output);
std::cout << name << " test error = " << test_error << std::endl;
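
For reference, the quantity reported here can be sketched as a plain misclassification rate. The helper below is a stand-alone illustration with a hypothetical name, written under the assumption that the zero-one loss evaluated over a whole data set is averaged over the samples.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fraction of samples whose predicted label differs from the true label.
double ZeroOneError(const std::vector<unsigned int>& labels,
                    const std::vector<unsigned int>& predictions) {
  assert(labels.size() == predictions.size() && !labels.empty());
  std::size_t mistakes = 0;
  for (std::size_t i = 0; i < labels.size(); ++i)
    if (labels[i] != predictions[i]) ++mistakes;
  return static_cast<double>(mistakes) / static_cast<double>(labels.size());
}
```

Under that assumption, an error of 0.25 would mean that one prediction out of four was wrong.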

Random Forest

To compare the SVM classifier with other models and to show more of the Shark-ML API, I defined a Random Forest classifier. As in the previous example, I defined a trainer and a model:

//---------- Template parameter is a label type
shark::RFTrainer<unsigned int> trainer;
auto rf = std::make_shared<shark::RFClassifier<unsigned int>>();
trainer.train(*rf, train_data);

To be able to run and compare different models with the same code, I extracted a general model type for these input and output types:

using Model = shark::AbstractModel<remora::vector<double, remora::cpu_tag>,
                                   unsigned int,
                                   remora::vector<double, remora::cpu_tag>>;

The third parameter here is the type of the model's internal coefficients. A general evaluation function can then be defined like this:

void EvaluateModel(const Model& model,
                   const shark::Normalizer<shark::RealVector>& normalizer,
                   const shark::ClassificationDataset& train_data,
                   const shark::ClassificationDataset& test_data) {
    auto output = model(train_data.inputs());
    ...
    //----------- Use normalizer in case of unprocessed data
    output = model(normalizer(test_data.inputs()));
}

Visualizing data

To visualize the classification I used my wrapper library for the gnuplot program. It works with coordinates given by STL-compatible iterators. I didn't find a way to get STL-compatible iterators to the data stored in a shark::ClassificationDataset object, so I defined a class that holds references to the data vectors from the data set object and iterates over the required dimension in an STL-like manner (the next coordinate is chosen according to the given labels vector). This let me define the visualization function quite simply:

//---------- Coordinates taken according to the original labels
ClassIterator di_0_x(&encoded_data, &labels, 0, 0); // first class, x coord
ClassIterator di_0_y(&encoded_data, &labels, 0, 1); // first class, y coord
...   
//---------  Coordinates taken according to the predicted labels
ClassIterator pdi_0_x(&encoded_data, &predictions, 0, 0); // first class, x coord
ClassIterator pdi_0_y(&encoded_data, &predictions, 0, 1); // first class, y coord    
...    
plotcpp::Plot plt(true);
plt.SetTerminal("qt");
plt.SetAutoscale();
plt.GnuplotCommand("set grid");
plt.Draw2D(plotcpp::Points(di_0_x, ClassIterator(), di_0_y, "class 0", "lc rgb 'red' pt 4"),
           plotcpp::Points(di_1_x, ClassIterator(), di_1_y, "class 1", "lc rgb 'green' pt 4"),
           plotcpp::Points(di_2_x, ClassIterator(), di_2_y, "class 2", "lc rgb 'blue' pt 4"),
           plotcpp::Points(pdi_0_x, ClassIterator(), pdi_0_y, "predict 0", "lc rgb 'red' pt 1"),
           plotcpp::Points(pdi_1_x, ClassIterator(), pdi_1_y, "predict 1", "lc rgb 'green' pt 1"),
           plotcpp::Points(pdi_2_x, ClassIterator(), pdi_2_y, "predict 2", "lc rgb 'blue' pt 1"));
plt.Flush();

Point types were configured according to the gnuplot format: empty boxes are used for the original data and crosses for the predicted ones. So if the color of a box differs from the color of its cross, you can see where the classifier's prediction failed.
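
The class-filtering iterator mentioned above could look roughly like the following. This is a simplified, hypothetical sketch over plain std::vector containers rather than Shark-ML data types, so it differs from the class_iterator.h used in the sample; it only shows the idea of skipping points of other classes while exposing one coordinate.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Walks over one coordinate (`dim`) of the points whose label equals `cls`.
// A default-constructed iterator acts as the end marker.
class ClassIterator {
 public:
  using iterator_category = std::input_iterator_tag;
  using value_type = double;
  using difference_type = std::ptrdiff_t;
  using pointer = const double*;
  using reference = double;

  ClassIterator() = default;
  ClassIterator(const std::vector<std::vector<double>>* points,
                const std::vector<unsigned int>* labels, unsigned int cls,
                std::size_t dim)
      : points_(points), labels_(labels), cls_(cls), dim_(dim) {
    assert(points->size() == labels->size());
    SkipOtherClasses();  // position on the first point of the class
  }

  double operator*() const { return (*points_)[pos_][dim_]; }
  ClassIterator& operator++() {
    ++pos_;
    SkipOtherClasses();
    return *this;
  }
  ClassIterator operator++(int) {
    ClassIterator tmp = *this;
    ++*this;
    return tmp;
  }
  // Two exhausted iterators are equal; otherwise positions are compared.
  bool operator==(const ClassIterator& other) const {
    return AtEnd() == other.AtEnd() && (AtEnd() || pos_ == other.pos_);
  }
  bool operator!=(const ClassIterator& other) const {
    return !(*this == other);
  }

 private:
  bool AtEnd() const { return points_ == nullptr || pos_ >= points_->size(); }
  void SkipOtherClasses() {
    while (!AtEnd() && (*labels_)[pos_] != cls_) ++pos_;
  }

  const std::vector<std::vector<double>>* points_ = nullptr;
  const std::vector<unsigned int>* labels_ = nullptr;
  unsigned int cls_ = 0;
  std::size_t dim_ = 0;
  std::size_t pos_ = 0;
};
```

Passing the predicted labels instead of the original ones as the second argument yields the coordinates grouped by predicted class, which is what the two sets of iterators in the plotting code rely on.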
