Shark-ML is an open-source machine learning library. Its site describes it as follows: "It provides methods for linear and nonlinear optimization, kernel-based learning algorithms, neural networks, and various other machine learning techniques. It serves as a powerful toolbox for real world applications as well as for research. Shark works on Windows, MacOS X, and Linux. It comes with extensive documentation. Shark is licensed under the GNU Lesser General Public License." I can confirm that it offers a wide range of machine learning algorithms together with good documentation, tutorials and samples. This article therefore covers only the basic API concepts; the details are easy to find in the official documentation.
In this article I will show how to use this library to solve a classification problem. I used the Iris data set in this example, so the data loading code depends on its format.
-
Library installation
The library has to be compiled from sources; build scripts can be generated with CMake. Details about the available CMake options can be found in the documentation. You also need Boost installed, because the library depends on it. I did not run into any problems with compilation. I also used the CMAKE_INSTALL_PREFIX option to change the default installation path, which makes it easy to remove the library artifacts after experiments.
-
Loading data
In this sample I download the data set from the internet. The library has a shark::download function, but it doesn't support the https protocol, so I used a function written with the curl library API:

    static const std::string train_data_url =
        "https://raw.githubusercontent.com/pandas-dev/pandas/master/pandas/tests/data/iris.csv";
    ...
    const std::string data_path{"iris.csv"};
    if (!fs::exists(data_path)) {
      if (!utils::DownloadFile(train_data_url, data_path)) {
        std::cerr << "Unable to download the file " << train_data_url << std::endl;
        return 1;
      }
    }
After the file is downloaded, you can use shark::csvStringToData to load the data into an object of type shark::ClassificationDataset. Before doing that I had to prepare this particular data set: remove the row with column names and replace the string labels with numbers:

    // ---------- Load data from file to the string
    std::ifstream data_file(data_path);
    std::string train_data_str((std::istreambuf_iterator<char>(data_file)),
                               std::istreambuf_iterator<char>());
    // ----------- Remove first line - columns labels
    train_data_str.erase(0, train_data_str.find_first_of("\n") + 1);
    // ----------- Replace string labels with integers
    train_data_str = std::regex_replace(train_data_str, std::regex("Iris-setosa"), "0");
    train_data_str = std::regex_replace(train_data_str, std::regex("Iris-versicolor"), "1");
    train_data_str = std::regex_replace(train_data_str, std::regex("Iris-virginica"), "2");
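The same preprocessing can be wrapped in a small helper that is easy to check on a tiny sample. This is only a sketch: the name PrepareIrisCsv is hypothetical and not part of Shark-ML, and it assumes the file has exactly one header line and the three label strings used above.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not part of Shark-ML): drops the header row and
// replaces the textual Iris labels with the integers 0, 1 and 2.
std::string PrepareIrisCsv(std::string csv) {
  // Remove the first line; if there is no newline at all, keep the text as is.
  const auto header_end = csv.find('\n');
  if (header_end != std::string::npos) {
    csv.erase(0, header_end + 1);
  }
  csv = std::regex_replace(csv, std::regex("Iris-setosa"), "0");
  csv = std::regex_replace(csv, std::regex("Iris-versicolor"), "1");
  csv = std::regex_replace(csv, std::regex("Iris-virginica"), "2");
  return csv;
}
```

The returned string has the shape expected by the loading step: numeric feature columns followed by an integer label in the last column.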
Now I was ready to create a data set object. I passed shark::LAST_COLUMN as the last parameter to tell the function that the last column of the data set holds the labels:

    shark::ClassificationDataset train_data;
    shark::csvStringToData(train_data, train_data_str, shark::LAST_COLUMN);
-
Pre-processing data
Before training classifiers or using other ML algorithms, it is usually a good idea to normalize and shuffle the training data. Shark-ML already has a good tutorial about normalization. First I shuffled the data and cut off a part of it for a test set:

    train_data.shuffle();
    auto test_data = shark::splitAtElement(train_data, 120);
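To make the effect of these two calls concrete, here is a rough standard-library sketch of the same idea on a plain std::vector. The helpers Shuffle and SplitAtElement are hypothetical stand-ins, not the Shark-ML implementations; the sketch assumes that the split keeps the first elements in the original container and returns the rest, so a 150-sample Iris set split at 120 leaves 120 samples for training and 30 for testing.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Shuffles `data` in place with a seeded generator (for reproducibility).
template <typename T>
void Shuffle(std::vector<T>& data, unsigned int seed) {
  std::mt19937 rng(seed);
  std::shuffle(data.begin(), data.end(), rng);
}

// Keeps the first `index` elements in `data` and returns the remaining ones.
template <typename T>
std::vector<T> SplitAtElement(std::vector<T>& data, std::size_t index) {
  index = std::min(index, data.size());
  const auto cut = data.begin() + static_cast<std::ptrdiff_t>(index);
  std::vector<T> tail(cut, data.end());
  data.erase(cut, data.end());
  return tail;
}
```

Shuffling before the split matters here because the Iris file is ordered by class, so an unshuffled tail would contain samples of a single class only.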
Then I defined a normalizer object and a trainer for it. Using separate types for an algorithm and for its trainer is a common approach in Shark-ML:

    bool remove_mean = true;
    shark::Normalizer<shark::RealVector> normalizer;
    shark::NormalizeComponentsUnitVariance<shark::RealVector> normalizing_trainer(remove_mean);
    //---------- Learn mean and variance from data
    normalizing_trainer.train(normalizer, train_data.inputs());
After the trainer has learned the mean and variance and configured the normalizer, we can use the normalizer to transform our data:

    train_data = shark::transformInputs(train_data, normalizer);
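The split between a trainer that learns statistics and a model that applies them can be sketched with the standard library alone. The names below (NormalizerParams, FitNormalizer, Normalize) are hypothetical, and the sketch assumes mean removal plus scaling by the population standard deviation (division by n); the exact variance estimator used by the library is not something this sketch claims to reproduce.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Sample = std::vector<double>;

// Per-feature statistics learned from the training set only.
struct NormalizerParams {
  std::vector<double> mean;
  std::vector<double> stddev;
};

// Plays the role of the trainer: learns mean and standard deviation.
// Assumes a non-empty data set whose samples all have the same size.
NormalizerParams FitNormalizer(const std::vector<Sample>& data) {
  const std::size_t dim = data.front().size();
  const double n = static_cast<double>(data.size());
  NormalizerParams params{std::vector<double>(dim, 0.0),
                          std::vector<double>(dim, 0.0)};
  for (const Sample& s : data)
    for (std::size_t j = 0; j < dim; ++j) params.mean[j] += s[j] / n;
  for (const Sample& s : data)
    for (std::size_t j = 0; j < dim; ++j) {
      const double d = s[j] - params.mean[j];
      params.stddev[j] += d * d / n;  // accumulates the variance first
    }
  // Turn variances into standard deviations; guard constant features.
  for (double& v : params.stddev) v = v > 0.0 ? std::sqrt(v) : 1.0;
  return params;
}

// Plays the role of the model: applies the learned transformation.
Sample Normalize(const NormalizerParams& params, const Sample& s) {
  Sample out(s.size());
  for (std::size_t j = 0; j < s.size(); ++j)
    out[j] = (s[j] - params.mean[j]) / params.stddev[j];
  return out;
}
```

The important design point is that the statistics come from the training data only and are then reused unchanged for the test data, which is why the normalizer object is kept around for the evaluation step.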
There are also trainers without a train method, such as the PCA class, which can be used for dimensionality reduction:

    shark::PCA pca(data.inputs());
    shark::LinearModel<> enc;
    pca.encoder(enc, 2);
    shark::Data<shark::RealVector> encoded_data = enc(data.inputs());
Here the pca object takes the data for learning in its constructor and configures a model for dimensionality reduction with the encoder method.
-
SVM
In this sample I used cross-validation for training and grid search to find optimal parameters for the SVM classifier. My code is based on the official samples for SVM and model selection. First of all, let's divide the data into a number of folds:

    const unsigned int k = 5;  // number of folds
    shark::CVFolds<shark::ClassificationDataset> folds =
        shark::createCVSameSizeBalanced(train_data, k);
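As an illustration of what "same size, balanced" folds mean, the sketch below deals the samples of each class round-robin across the folds. AssignBalancedFolds is a hypothetical helper written for this article, not the algorithm the library actually uses; it only demonstrates the property that every fold ends up with nearly the same number of samples from every class.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Returns, for every sample, the index of the fold it is assigned to.
// Samples of each class are dealt round-robin across the k folds, so every
// fold receives nearly the same number of samples from every class.
std::vector<std::size_t> AssignBalancedFolds(
    const std::vector<unsigned int>& labels, std::size_t k) {
  std::vector<std::size_t> fold_of(labels.size(), 0);
  std::map<unsigned int, std::size_t> seen_per_class;
  for (std::size_t i = 0; i < labels.size(); ++i) {
    // Use the running count of this class to pick the next fold.
    fold_of[i] = seen_per_class[labels[i]]++ % k;
  }
  return fold_of;
}
```

During cross-validation each fold is held out once for validation while the remaining k - 1 folds are used for training, and the validation errors are averaged.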
Next I defined the SVM trainer and classification model objects:

    double c{1.0};
    double gamma{0.5};
    bool offset = true;
    bool unconstrained = true;
    auto svm = std::make_shared<SVMModel>(gamma, unconstrained);
    shark::CSvmTrainer<shark::RealVector> trainer(&svm->kernel, c, offset, unconstrained);
    trainer.setMcSvmType(shark::McSvm::OVA);  // one-versus-all
Note the SVMModel type. I defined it because an SVM model consists of a kernel and a classifier, and both must stay alive for as long as the classification model is in use. When you pass a pointer to the kernel to the trainer, the trainer neither takes ownership of it nor passes ownership to the classifier object.

    struct SVMModel {
      SVMModel(double gamma, bool unconstrained) : kernel(gamma, unconstrained) {}
      shark::GaussianRbfKernel<> kernel;
      //---------- Template parameter is an input type
      shark::KernelClassifier<shark::RealVector> model;
    };
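The lifetime issue can be shown in isolation with a few hypothetical types. Nothing below is Shark-ML code: RbfKernel, BorrowingClassifier and KernelModelBundle are made-up stand-ins, and the kernel formula exp(-gamma * |x - y|^2) is the usual Gaussian RBF definition assumed for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a Gaussian RBF kernel:
// k(x, y) = exp(-gamma * |x - y|^2).
struct RbfKernel {
  double gamma;
  double operator()(const std::vector<double>& x,
                    const std::vector<double>& y) const {
    double sq_dist = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i)
      sq_dist += (x[i] - y[i]) * (x[i] - y[i]);
    return std::exp(-gamma * sq_dist);
  }
};

// Hypothetical classifier that only borrows the kernel through a raw pointer.
struct BorrowingClassifier {
  const RbfKernel* kernel = nullptr;  // non-owning: must not outlive the kernel
  std::vector<double> support_vector;
  double Similarity(const std::vector<double>& x) const {
    return (*kernel)(support_vector, x);
  }
};

// Bundling both members ties their lifetimes together.
struct KernelModelBundle {
  explicit KernelModelBundle(double gamma) : kernel{gamma} {
    model.kernel = &kernel;
  }
  // A copy would keep pointing at the kernel of the original object.
  KernelModelBundle(const KernelModelBundle&) = delete;
  KernelModelBundle& operator=(const KernelModelBundle&) = delete;

  RbfKernel kernel;
  BorrowingClassifier model;
};
```

Holding such a bundle in a std::shared_ptr, as the article does with SVMModel, means the kernel cannot be destroyed while any part of the program still uses the classifier.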
After that I had everything required to instantiate the cross-validation error object:

    shark::ZeroOneLoss<unsigned int> loss;
    shark::CrossValidationError<shark::KernelClassifier<shark::RealVector>, unsigned int>
        cv_error(folds, &trainer, &svm->model, &trainer, &loss);
Now I was ready to use grid search for parameter estimation. Let's define the initial grid parameters:

    //---------- Estimate initial gamma value
    shark::JaakkolaHeuristic ja(train_data);
    double ljg = log(ja.gamma());
    //---------- We have two hyperparameters so define the grid accordingly
    shark::GridSearch grid;
    std::vector<double> min(2);
    std::vector<double> max(2);
    std::vector<size_t> sections(2);
    //---------- Kernel parameter gamma
    min[0] = ljg - 4.;
    max[0] = ljg + 4;
    sections[0] = 9;  // number of values in the interval
    //---------- Regularization parameter C
    min[1] = 0.0;
    max[1] = 10.0;
    sections[1] = 11;  // number of values in the interval
    grid.configure(min, max, sections);
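To show what such a grid amounts to, here is a minimal sketch of an exhaustive two-parameter search. GridValues and GridSearch2D are hypothetical helpers, and the sketch assumes that the values are evenly spaced with both interval ends included, as the "number of values in the interval" comments suggest. Under that assumption the grid above contains 9 x 11 = 99 points, and each of them is scored with a full cross-validation run.

```cpp
#include <cstddef>
#include <functional>
#include <limits>
#include <utility>
#include <vector>

// Evenly spaced values from min to max with both ends included.
std::vector<double> GridValues(double min, double max, std::size_t sections) {
  if (sections < 2) return {min};
  std::vector<double> values;
  values.reserve(sections);
  for (std::size_t i = 0; i < sections; ++i) {
    values.push_back(min + (max - min) * static_cast<double>(i) /
                               static_cast<double>(sections - 1));
  }
  return values;
}

// Evaluates the objective on every grid point and returns the best pair.
std::pair<double, double> GridSearch2D(
    const std::vector<double>& first, const std::vector<double>& second,
    const std::function<double(double, double)>& objective) {
  std::pair<double, double> best{0.0, 0.0};
  double best_value = std::numeric_limits<double>::infinity();
  for (double a : first) {
    for (double b : second) {
      const double value = objective(a, b);
      if (value < best_value) {
        best_value = value;
        best = {a, b};
      }
    }
  }
  return best;
}
```

In the article's setting the objective would be the cross-validation error, so the cost of the search grows with the product of the section counts and the number of folds.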
Finally I ran the search:

    grid.step(cv_error);
After the search finished, I used the estimated parameters to train the final model:

    trainer.setParameterVector(grid.solution().point);
    trainer.train(svm->model, train_data);
Finally, I evaluated the model on the training and test data to see the errors:

    auto output = svm->model(train_data.inputs());
    auto train_error = loss.eval(train_data.labels(), output);
    std::cout << name << " train error = " << train_error << std::endl;
    output = svm->model(normalizer(test_data.inputs()));
    auto test_error = loss.eval(test_data.labels(), output);
    std::cout << name << " test error = " << test_error << std::endl;
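For reference, a zero-one error can be computed by hand as the fraction of misclassified samples. ZeroOneError below is a hypothetical helper, and it assumes the error is reported as a mean over all samples, so 0.0 means every prediction was correct.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fraction of predictions that differ from the true labels.
double ZeroOneError(const std::vector<unsigned int>& labels,
                    const std::vector<unsigned int>& predictions) {
  assert(labels.size() == predictions.size());
  if (labels.empty()) return 0.0;
  std::size_t mistakes = 0;
  for (std::size_t i = 0; i < labels.size(); ++i) {
    if (labels[i] != predictions[i]) ++mistakes;
  }
  return static_cast<double>(mistakes) / static_cast<double>(labels.size());
}
```

Comparing this value on the training and test sets gives a quick indication of overfitting: a training error much lower than the test error suggests the parameters fit the training folds too closely.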
-
Random Forest
To compare the SVM classifier with other models and to show more of the Shark-ML API, I also defined a Random Forest classifier. As in the previous example, I defined a trainer and a model:

    //---------- Template parameter is a label type
    shark::RFTrainer<unsigned int> trainer;
    auto rf = std::make_shared<shark::RFClassifier<unsigned int>>();
    trainer.train(*rf, train_data);
To be able to run and compare different models with the same code, I extracted the general classification model type for these input and output types:

    using Model = shark::AbstractModel<remora::vector<double, remora::cpu_tag>,
                                       unsigned int,
                                       remora::vector<double, remora::cpu_tag>>;
The third parameter here is the type of the model's internal coefficients. A general evaluation function can then look like this:

    void EvaluateModel(const Model& model,
                       const shark::Normalizer<shark::RealVector>& normalizer,
                       const shark::ClassificationDataset& train_data,
                       const shark::ClassificationDataset& test_data) {
      auto output = model(train_data.inputs());
      ...
      //----------- Use normalizer in case of unprocessed data
      output = model(normalizer(test_data.inputs()));
    }
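The idea of evaluating different models through one base type can be sketched without the library. All types below are hypothetical toys (AbstractClassifier, ThresholdClassifier, ConstantClassifier); they only illustrate that a single evaluation routine works for any model behind a shared interface.

```cpp
#include <cstddef>
#include <vector>

using Input = std::vector<double>;

// Hypothetical common interface, standing in for the shared base type.
struct AbstractClassifier {
  virtual ~AbstractClassifier() = default;
  virtual unsigned int Predict(const Input& x) const = 0;
};

// Toy model: thresholds the first feature.
struct ThresholdClassifier : AbstractClassifier {
  explicit ThresholdClassifier(double t) : threshold(t) {}
  unsigned int Predict(const Input& x) const override {
    return x[0] > threshold ? 1u : 0u;
  }
  double threshold;
};

// Toy model: always answers with the same class.
struct ConstantClassifier : AbstractClassifier {
  explicit ConstantClassifier(unsigned int c) : label(c) {}
  unsigned int Predict(const Input&) const override { return label; }
  unsigned int label;
};

// One evaluation routine works for every model behind the interface.
// Assumes `inputs` is non-empty and has the same size as `labels`.
double ErrorRate(const AbstractClassifier& model,
                 const std::vector<Input>& inputs,
                 const std::vector<unsigned int>& labels) {
  std::size_t mistakes = 0;
  for (std::size_t i = 0; i < inputs.size(); ++i) {
    if (model.Predict(inputs[i]) != labels[i]) ++mistakes;
  }
  return static_cast<double>(mistakes) / static_cast<double>(inputs.size());
}
```

With this arrangement, adding another classifier to the comparison only requires a new derived type; the evaluation code stays untouched.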
-
Visualizing data
To visualize the classification I used my wrapper library for the gnuplot program. It works with coordinates given by STL-compatible iterators. I didn't find a way to get STL-compatible iterators to the data stored in a shark::ClassificationDataset object, so I defined a class which holds references to the data vectors from the data set object and iterates over the required dimension in an STL-like manner (the next coordinate is chosen according to the given labels vector). This made the visualization function pretty simple:

    //---------- Coordinates taken according to the original labels
    ClassIterator di_0_x(&encoded_data, &lables, 0, 0);  // first class, x coord
    ClassIterator di_0_y(&encoded_data, &lables, 0, 1);  // first class, y coord
    ...
    //--------- Coordinates taken according to the predicted labels
    ClassIterator pdi_0_x(&encoded_data, &predictions, 0, 0);  // first class, x coord
    ClassIterator pdi_0_y(&encoded_data, &predictions, 0, 1);  // first class, y coord
    ...
    plotcpp::Plot plt(true);
    plt.SetTerminal("qt");
    plt.SetAutoscale();
    plt.GnuplotCommand("set grid");
    plt.Draw2D(
        plotcpp::Points(di_0_x, ClassIterator(), di_0_y, "class 0", "lc rgb 'red' pt 4"),
        plotcpp::Points(di_1_x, ClassIterator(), di_1_y, "class 1", "lc rgb 'green' pt 4"),
        plotcpp::Points(di_2_x, ClassIterator(), di_2_y, "class 2", "lc rgb 'blue' pt 4"),
        plotcpp::Points(pdi_0_x, ClassIterator(), pdi_0_y, "predict 0", "lc rgb 'red' pt 1"),
        plotcpp::Points(pdi_1_x, ClassIterator(), pdi_1_y, "predict 1", "lc rgb 'green' pt 1"),
        plotcpp::Points(pdi_2_x, ClassIterator(), pdi_2_y, "predict 2", "lc rgb 'blue' pt 1"));
    plt.Flush();
Point types were configured according to the gnuplot format: transparent boxes are used for the original data and crosses for the predicted ones. So wherever the color of a box differs from the color of its cross, the classifier prediction failed.
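The iterator class itself is not listed in the article, so the following is only a guess at its general shape, written against plain std::vector data instead of the Shark containers. ClassCoordIterator is a hypothetical name; the sketch walks one coordinate of all points that carry a given label and uses a default-constructed iterator as the end marker, in the same way the plotting calls above pass ClassIterator() as the range end.

```cpp
#include <cstddef>
#include <iterator>
#include <vector>

using Point = std::vector<double>;

// Hypothetical sketch: iterates over one coordinate of all points that
// carry a given label. A default-constructed iterator marks the end.
class ClassCoordIterator {
 public:
  using iterator_category = std::forward_iterator_tag;
  using value_type = double;
  using difference_type = std::ptrdiff_t;
  using pointer = const double*;
  using reference = const double&;

  ClassCoordIterator() = default;
  ClassCoordIterator(const std::vector<Point>* points,
                     const std::vector<unsigned int>* labels,
                     unsigned int label, std::size_t dim)
      : points_(points), labels_(labels), label_(label), dim_(dim) {
    SkipToMatch();
  }

  reference operator*() const { return (*points_)[pos_][dim_]; }
  ClassCoordIterator& operator++() {
    ++pos_;
    SkipToMatch();
    return *this;
  }
  ClassCoordIterator operator++(int) {
    ClassCoordIterator tmp = *this;
    ++(*this);
    return tmp;
  }
  friend bool operator==(const ClassCoordIterator& a,
                         const ClassCoordIterator& b) {
    return a.points_ == b.points_ && a.pos_ == b.pos_;
  }
  friend bool operator!=(const ClassCoordIterator& a,
                         const ClassCoordIterator& b) {
    return !(a == b);
  }

 private:
  // Advances to the next point with the requested label and turns the
  // iterator into the end marker once the data is exhausted.
  void SkipToMatch() {
    while (pos_ < labels_->size() && (*labels_)[pos_] != label_) ++pos_;
    if (pos_ >= labels_->size()) {
      points_ = nullptr;
      labels_ = nullptr;
      pos_ = 0;
    }
  }

  const std::vector<Point>* points_ = nullptr;
  const std::vector<unsigned int>* labels_ = nullptr;
  unsigned int label_ = 0;
  std::size_t dim_ = 0;
  std::size_t pos_ = 0;
};
```

Passing the predicted labels instead of the original ones to the same iterator selects the points by predicted class, which is what makes the box and cross overlay in the plot possible.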
