
Classification with Shark-ML machine learning library

Shark-ML is an open-source machine learning library. Its site says: "It provides methods for linear and nonlinear optimization, kernel-based learning algorithms, neural networks, and various other machine learning techniques. It serves as a powerful toolbox for real world applications as well as for research. Shark works on Windows, MacOS X, and Linux. It comes with extensive documentation. Shark is licensed under the GNU Lesser General Public License." I can confirm that it offers a wide range of machine learning algorithms together with nice documentation, tutorials and samples. In this article I will show only the basic API concepts; details can easily be found in the official documentation.

Specifically, I will show how to use this library to solve a classification problem. I used the Iris data set in this example, so the data loading code depends on its format.

  1. Library installation

    The library should be compiled from sources; build scripts can be generated with CMake. Details about the available CMake options can be found in the documentation. You should also have Boost installed, because the library depends on it. I did not encounter any problems with compilation. I also used the CMAKE_INSTALL_PREFIX option to change the default installation path, which makes it easy to remove the library artifacts after experiments.

  2. Loading data

    In this sample I download the data set from the internet. The library has a shark::download function, but it doesn't support the https protocol, so I used a function written with the curl library API:

    static const std::string train_data_url =
    "https://raw.githubusercontent.com/pandas-dev/pandas/master/pandas/tests/data/iris.csv";
    ...
    const std::string data_path{"iris.csv"};
    if (!fs::exists(data_path)) {
        if (!utils::DownloadFile(train_data_url, data_path)) {
          std::cerr << "Unable to download the file " << train_data_url
                    << std::endl;
          return 1;
        }
    }

    After the file is downloaded, you can use shark::csvStringToData to load the data into an object of the shark::ClassificationDataset type. But before doing that I had to prepare this particular data set: remove the row with column names and replace the string label names with numbers:

    // ---------- Load data from file to the string
    std::ifstream data_file(data_path);
    std::string train_data_str((std::istreambuf_iterator<char>(data_file)),
                                std::istreambuf_iterator<char>());
    
    // ----------- Remove first line - columns labels
    train_data_str.erase(0, train_data_str.find_first_of("\n") + 1);
    
    // ----------- Replace string labels with integers
    train_data_str =
     std::regex_replace(train_data_str, std::regex("Iris-setosa"), "0");
    train_data_str =
     std::regex_replace(train_data_str, std::regex("Iris-versicolor"), "1");
    train_data_str =
     std::regex_replace(train_data_str, std::regex("Iris-virginica"), "2");
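
    The same preparation can be wrapped in one small function that is easy to check on an in-memory string. This is only a sketch: PrepareCsv and its label table parameter are hypothetical names, not part of Shark-ML, and unlike the snippet above it treats an input without any newline as a header-only file.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not part of Shark-ML): drops the header row and
// replaces every class name with its numeric code.
std::string PrepareCsv(
    std::string csv,
    const std::vector<std::pair<std::string, std::string>>& label_codes) {
  // Remove the first line (column names). If there is no newline at all,
  // the whole input is treated as a header and nothing remains.
  const std::string::size_type header_end = csv.find('\n');
  csv.erase(0, header_end == std::string::npos ? csv.size() : header_end + 1);

  // Note: each class name is used as a regular expression, so names with
  // special regex characters would need escaping.
  for (const auto& entry : label_codes) {
    assert(!entry.first.empty());
    csv = std::regex_replace(csv, std::regex(entry.first), entry.second);
  }
  return csv;
}
```

    The returned string can then be passed on to the loading step in place of train_data_str.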

    Now I was ready to create a data set object. I passed the shark::LAST_COLUMN value as the last parameter to tell the function that the last column in the data set holds the labels:

    shark::ClassificationDataset train_data;
    shark::csvStringToData(train_data, train_data_str, shark::LAST_COLUMN);

  3. Pre-processing data

    Before training classifiers or using other ML algorithms it's usually a good idea to normalize and shuffle your training data. Shark-ML already has a good tutorial about normalization. First I shuffled the data and split off part of it for a test set:

    train_data.shuffle();
    auto test_data = shark::splitAtElement(train_data, 120);
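
    To make the effect of these two calls explicit, here is a library-independent sketch of the same idea on a plain std::vector. The Sample type and the ShuffleAndSplit function are hypothetical, and the sketch assumes that the split keeps the first 120 elements in the original container and returns the remaining ones (30 for the 150-row Iris set) as the test part.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical stand-in for one labeled sample.
struct Sample {
  std::vector<double> features;
  unsigned int label;
};

// Shuffles the samples in place, keeps the first `train_size` of them in
// `data` and returns the rest as the test set.
std::vector<Sample> ShuffleAndSplit(std::vector<Sample>& data,
                                    std::size_t train_size,
                                    unsigned int seed) {
  assert(train_size <= data.size());
  std::mt19937 rng(seed);
  std::shuffle(data.begin(), data.end(), rng);

  const auto split_point =
      data.begin() + static_cast<std::ptrdiff_t>(train_size);
  std::vector<Sample> test(split_point, data.end());
  data.erase(split_point, data.end());
  return test;
}
```

    Shuffling before the split matters for this data set because the rows in the file are grouped by class.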

    Then I defined a normalizer object and a trainer for it. It's a common approach in Shark-ML to use separate types for an algorithm and for its trainer:

    bool remove_mean = true;
    shark::Normalizer<shark::RealVector> normalizer;
    shark::NormalizeComponentsUnitVariance<shark::RealVector>
        normalizing_trainer(remove_mean);
    //---------- Learn mean and variance from data
    normalizing_trainer.train(normalizer, train_data.inputs());

    After the trainer has learned the mean and variance and configured the normalizer, we can use it to transform our data:

    train_data = shark::transformInputs(train_data, normalizer);
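
    What the trainer and the normalizer do together can be sketched without the library. The names below (ComponentNormalizer, FitNormalizer) are hypothetical, and two details are my assumptions rather than documented Shark-ML behaviour: the variance is computed with division by n, and constant components keep a scale of 1.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Row = std::vector<double>;

// Hypothetical stand-in for a fitted normalizer: a per-component shift
// (mean) and scale (standard deviation).
struct ComponentNormalizer {
  Row mean;
  Row stddev;

  Row operator()(const Row& x) const {
    assert(x.size() == mean.size());
    Row out(x.size());
    for (std::size_t j = 0; j < x.size(); ++j) {
      out[j] = (x[j] - mean[j]) / stddev[j];
    }
    return out;
  }
};

// "Training" step: learn the mean and standard deviation of every component.
ComponentNormalizer FitNormalizer(const std::vector<Row>& rows) {
  assert(!rows.empty());
  const std::size_t dim = rows.front().size();
  const double n = static_cast<double>(rows.size());

  ComponentNormalizer norm{Row(dim, 0.0), Row(dim, 1.0)};
  for (const Row& r : rows) {
    assert(r.size() == dim);
    for (std::size_t j = 0; j < dim; ++j) norm.mean[j] += r[j] / n;
  }

  Row variance(dim, 0.0);
  for (const Row& r : rows) {
    for (std::size_t j = 0; j < dim; ++j) {
      const double d = r[j] - norm.mean[j];
      variance[j] += d * d / n;
    }
  }
  // Assumption: constant components keep scale 1 to avoid division by zero.
  for (std::size_t j = 0; j < dim; ++j) {
    if (variance[j] > 0.0) norm.stddev[j] = std::sqrt(variance[j]);
  }
  return norm;
}
```

    The important point is that the statistics are learned from the training part only, and the same fitted object is later applied to the test inputs.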

    Not every trainer has to be used through a train method. For example, the PCA class, which can be used for dimensionality reduction, can take the data directly in its constructor:

    shark::PCA pca(data.inputs());
    shark::LinearModel<> enc;
    pca.encoder(enc, 2);
    shark::Data<shark::RealVector> encoded_data = enc(data.inputs());

    Here the pca object took the data for learning in its constructor and configured a model for dimensionality reduction with the encoder method.

  4. SVM

    In this sample I used cross-validation for training and grid search to find optimal parameters for the SVM classifier. My code is based on the official samples for SVM and model selection. First of all, let's divide our data into a number of folds:

    const unsigned int k = 5;  // number of folds
    shark::CVFolds<shark::ClassificationDataset> folds =
            shark::createCVSameSizeBalanced(train_data, k);
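
    The idea of balanced folds is that every fold gets roughly the same share of each class. The sketch below shows one simple way to achieve this by dealing the samples of each class round-robin into k folds. AssignBalancedFolds is a hypothetical function, and this is not necessarily how Shark-ML partitions the data internally.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Returns, for every sample index, the fold (0..k-1) it is assigned to.
std::vector<std::size_t> AssignBalancedFolds(
    const std::vector<unsigned int>& labels, std::size_t k) {
  assert(k > 0);

  // Group sample indices by class label.
  std::map<unsigned int, std::vector<std::size_t>> by_class;
  for (std::size_t i = 0; i < labels.size(); ++i) {
    by_class[labels[i]].push_back(i);
  }

  // Deal the samples of each class round-robin into the folds. The counter
  // continues across classes so that fold sizes stay close to each other.
  std::vector<std::size_t> fold_of(labels.size(), 0);
  std::size_t next_fold = 0;
  for (const auto& entry : by_class) {
    for (std::size_t index : entry.second) {
      fold_of[index] = next_fold;
      next_fold = (next_fold + 1) % k;
    }
  }
  return fold_of;
}
```

    During cross-validation each fold is used once as the validation part while the model is trained on the remaining k - 1 folds.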

    Next I defined the SVM trainer and classification model objects:

    double c{1.0};
    double gamma{0.5};
    bool offset = true;
    bool unconstrained = true;
    
    auto svm = std::make_shared<SVMModel>(gamma, unconstrained);
    
    shark::CSvmTrainer<shark::RealVector> trainer(&svm->kernel, c, offset,
                                                unconstrained);
    trainer.setMcSvmType(shark::McSvm::OVA);  // one-versus-all

    Please pay attention to the SVMModel type. I defined it because an SVM model consists of a kernel and a classifier, and both must stay alive for as long as you use the classification model. When you pass a pointer to the kernel to the trainer, the trainer doesn't take ownership of it and doesn't pass ownership to the classifier object either.

    struct SVMModel {
        SVMModel(double gamma, bool unconstrained) : kernel(gamma, unconstrained) {}
        shark::GaussianRbfKernel<> kernel;
        //---------- Template parameter is an input type
        shark::KernelClassifier<shark::RealVector> model;
    };

    After that I had everything required to instantiate the cross-validation error object:

    shark::ZeroOneLoss<unsigned int> loss;
    shark::CrossValidationError<shark::KernelClassifier<shark::RealVector>,
                              unsigned int>
      cv_error(folds, &trainer, &svm->model, &trainer, &loss);
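
    The zero-one loss used above simply counts wrong predictions. As a library-independent illustration, the hypothetical ZeroOneError function below reports the error as the fraction of misclassified samples; that the value is averaged over the data set is my assumption about how the result is reported.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fraction of samples whose predicted label differs from the true label.
double ZeroOneError(const std::vector<unsigned int>& labels,
                    const std::vector<unsigned int>& predictions) {
  assert(labels.size() == predictions.size());
  assert(!labels.empty());

  std::size_t wrong = 0;
  for (std::size_t i = 0; i < labels.size(); ++i) {
    if (labels[i] != predictions[i]) ++wrong;
  }
  return static_cast<double>(wrong) / static_cast<double>(labels.size());
}
```

    The cross-validation error is then this kind of loss measured on each held-out fold and combined over all folds.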

    Now I was ready to use grid search for parameter estimation. Let's define the initial grid parameters:

    //---------- Estimate initial gamma value
    shark::JaakkolaHeuristic ja(train_data);
    double ljg = log(ja.gamma());
    //---------- We have two hyperparameters so define the grid accordingly
    shark::GridSearch grid;
    std::vector<double> min(2);
    std::vector<double> max(2);
    std::vector<size_t> sections(2);
    //---------- Kernel parameter gamma
    min[0] = ljg - 4.;
    max[0] = ljg + 4;
    sections[0] = 9; // number of values in the interval
    //---------- Regularization parameter C
    min[1] = 0.0;
    max[1] = 10.0;
    sections[1] = 11; // number of values in the interval
    grid.configure(min, max, sections);
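
    To see which candidate points such a configuration describes, here is a small sketch that enumerates an evenly spaced grid. GridValues and GridPoints are hypothetical helpers, and the midpoint used for a single section is my assumption. With 9 values for the first parameter and 11 for the second, the search has 9 * 11 = 99 points to evaluate.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Evenly spaced values in [min, max], both ends included.
std::vector<double> GridValues(double min, double max, std::size_t sections) {
  assert(sections >= 1);
  assert(min <= max);
  std::vector<double> values;
  values.reserve(sections);
  if (sections == 1) {
    // Assumption: a single section is placed in the middle of the interval.
    values.push_back(0.5 * (min + max));
    return values;
  }
  const double step = (max - min) / static_cast<double>(sections - 1);
  for (std::size_t i = 0; i < sections; ++i) {
    values.push_back(min + static_cast<double>(i) * step);
  }
  return values;
}

// All combinations of the values of two parameters.
std::vector<std::pair<double, double>> GridPoints(
    const std::vector<double>& first, const std::vector<double>& second) {
  std::vector<std::pair<double, double>> points;
  points.reserve(first.size() * second.size());
  for (double a : first) {
    for (double b : second) points.emplace_back(a, b);
  }
  return points;
}
```

    A grid search evaluates the objective, here the cross-validation error, at every such point and keeps the best one.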

    Finally I ran the search:

    grid.step(cv_error);

    After the search finished, I used the estimated parameters to get the final model:

    trainer.setParameterVector(grid.solution().point);
    trainer.train(svm->model, train_data);

    Finally, I evaluated the model on the train and test data to see the errors:

    auto output = svm->model(train_data.inputs());
    auto train_error = loss.eval(train_data.labels(), output);
    std::cout << name << " train error = " << train_error << std::endl;
        
    output = svm->model(normalizer(test_data.inputs()));
    auto test_error = loss.eval(test_data.labels(), output);
    std::cout << name << " test error = " << test_error << std::endl;

  5. Random Forest

    To compare the SVM classifier with other models and to show more of the Shark-ML API, I also defined a Random Forest classifier. As in the previous example, I defined a trainer and a model:

    //---------- Template parameter is a label type
    shark::RFTrainer<unsigned int> trainer;
    auto rf = std::make_shared<shark::RFClassifier<unsigned int>>();
    trainer.train(*rf, train_data);

    To be able to run and compare different models with the same code, I defined a general model type for these input and output types:

    using Model = shark::AbstractModel<remora::vector<double, remora::cpu_tag>,
                                       unsigned int,
                                       remora::vector<double, remora::cpu_tag>>;

    The third parameter here is the type of the model's internal parameter vector. A general evaluation function can then be defined like this:

    void EvaluateModel(const Model& model,
                       const shark::Normalizer<shark::RealVector>& normalizer,
                       const shark::ClassificationDataset& train_data,
                       const shark::ClassificationDataset& test_data) {
        auto output = model(train_data.inputs());
        ...
        //----------- Use normalizer in case of unprocessed data
        output = model(normalizer(test_data.inputs()));
    }

  6. Visualizing data

    To visualize the classification I used my wrapper library for the gnuplot program. It works with coordinates given by STL-compatible iterators. But I didn't find a way to get STL-compatible iterators to the data stored in a shark::ClassificationDataset object, so I defined a class which holds references to the data vectors from the data set object and can iterate over the required dimension in an STL-like manner (the next coordinate is chosen according to the given labels vector). This made the visualization function pretty simple to define:

    //---------- Coordinates taken according to the original labels
    ClassIterator di_0_x(&encoded_data, &labels, 0, 0); // first class, x coord
    ClassIterator di_0_y(&encoded_data, &labels, 0, 1); // first class, y coord
    ...   
    //---------  Coordinates taken according to the predicted labels
    ClassIterator pdi_0_x(&encoded_data, &predictions, 0, 0); // first class, x coord
    ClassIterator pdi_0_y(&encoded_data, &predictions, 0, 1); // first class, y coord    
    ...    
    plotcpp::Plot plt(true);
    plt.SetTerminal("qt");
    plt.SetAutoscale();
    plt.GnuplotCommand("set grid");
    plt.Draw2D(plotcpp::Points(di_0_x, ClassIterator(), di_0_y, "class 0", "lc rgb 'red' pt 4"),
               plotcpp::Points(di_1_x, ClassIterator(), di_1_y, "class 1", "lc rgb 'green' pt 4"),
               plotcpp::Points(di_2_x, ClassIterator(), di_2_y, "class 2", "lc rgb 'blue' pt 4"),
               plotcpp::Points(pdi_0_x, ClassIterator(), pdi_0_y, "predict 0", "lc rgb 'red' pt 1"),
               plotcpp::Points(pdi_1_x, ClassIterator(), pdi_1_y, "predict 1", "lc rgb 'green' pt 1"),
               plotcpp::Points(pdi_2_x, ClassIterator(), pdi_2_y, "predict 2", "lc rgb 'blue' pt 1"));
    plt.Flush();

    Point types were configured according to the gnuplot format: empty boxes are used for the original data and crosses for the predicted ones. So if the color of a box differs from the color of its cross, you can see where the classifier prediction failed.

    [Figure: svm_classification]
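
    The iterator class itself is not shown in this article, so here is a minimal sketch of how such a label-filtering iterator could look. It is not the actual ClassIterator: the name LabelFilterIterator, the use of plain std::vector containers instead of Shark-ML data types, and the equality rules are all assumptions made for illustration. As with ClassIterator above, a default-constructed object serves as the end marker.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

using Point = std::vector<double>;

// Walks over one coordinate (`dim`) of the points whose label equals `label`.
class LabelFilterIterator {
 public:
  using iterator_category = std::forward_iterator_tag;
  using value_type = double;
  using difference_type = std::ptrdiff_t;
  using pointer = const double*;
  using reference = const double&;

  LabelFilterIterator() = default;  // end marker

  LabelFilterIterator(const std::vector<Point>* points,
                      const std::vector<unsigned int>* labels,
                      unsigned int label, std::size_t dim)
      : points_(points), labels_(labels), label_(label), dim_(dim) {
    assert(points_ != nullptr && labels_ != nullptr);
    assert(points_->size() == labels_->size());
    SkipToMatch();
  }

  reference operator*() const {
    assert(!AtEnd());
    return (*points_)[pos_][dim_];
  }
  LabelFilterIterator& operator++() {
    ++pos_;
    SkipToMatch();
    return *this;
  }
  LabelFilterIterator operator++(int) {
    LabelFilterIterator old = *this;
    ++*this;
    return old;
  }

  friend bool operator==(const LabelFilterIterator& a,
                         const LabelFilterIterator& b) {
    // All exhausted iterators compare equal to the end marker.
    if (a.AtEnd() || b.AtEnd()) return a.AtEnd() && b.AtEnd();
    return a.points_ == b.points_ && a.labels_ == b.labels_ &&
           a.label_ == b.label_ && a.dim_ == b.dim_ && a.pos_ == b.pos_;
  }
  friend bool operator!=(const LabelFilterIterator& a,
                         const LabelFilterIterator& b) {
    return !(a == b);
  }

 private:
  bool AtEnd() const { return points_ == nullptr || pos_ >= points_->size(); }
  // Advances to the next point with the requested label, if any.
  void SkipToMatch() {
    while (!AtEnd() && (*labels_)[pos_] != label_) ++pos_;
  }

  const std::vector<Point>* points_ = nullptr;
  const std::vector<unsigned int>* labels_ = nullptr;
  unsigned int label_ = 0;
  std::size_t dim_ = 0;
  std::size_t pos_ = 0;
};
```

    Passing the vector of true labels or the vector of predictions as the second argument gives the two kinds of point series used in the plot.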