- scikit-learn library provides varietes of classification, regression and clustering algorithms
- provides user-friendly and consistent interface for using those algorithms efficiently and productively
- popular in academia as well as in industry
- for more info see https://scikit-learn.org/
-
the following are the typical steps:
- Load data
- Analyze data
- Visualize data
- Split data into train-test set
- Train and test a handful of models
- Select the best model
- Deploy the model
-
next we'll dive into each step
- loading the Iris dataset from scikit-learn
- the third column represents the petal length, and the fourth column the petal width of the flower examples
- the classes are already converted to integer labels where:
- 0=Iris-Setosa, 1=Iris-Versicolor, 2=Iris-Virginica
- https://archive.ics.uci.edu/ml/datasets/iris
sklearn.datasets.load_irisAPI - https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html#sklearn.datasets.load_iris
from sklearn import datasets
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as snsfrom IPython.display import Image
%matplotlib inlineiris = datasets.load_iris()iris{'data': array([[5.1, 3.5, 1.4, 0.2],
[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2],
[5. , 3.6, 1.4, 0.2],
[5.4, 3.9, 1.7, 0.4],
[4.6, 3.4, 1.4, 0.3],
[5. , 3.4, 1.5, 0.2],
[4.4, 2.9, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.1],
[5.4, 3.7, 1.5, 0.2],
[4.8, 3.4, 1.6, 0.2],
[4.8, 3. , 1.4, 0.1],
[4.3, 3. , 1.1, 0.1],
[5.8, 4. , 1.2, 0.2],
[5.7, 4.4, 1.5, 0.4],
[5.4, 3.9, 1.3, 0.4],
[5.1, 3.5, 1.4, 0.3],
[5.7, 3.8, 1.7, 0.3],
[5.1, 3.8, 1.5, 0.3],
[5.4, 3.4, 1.7, 0.2],
[5.1, 3.7, 1.5, 0.4],
[4.6, 3.6, 1. , 0.2],
[5.1, 3.3, 1.7, 0.5],
[4.8, 3.4, 1.9, 0.2],
[5. , 3. , 1.6, 0.2],
[5. , 3.4, 1.6, 0.4],
[5.2, 3.5, 1.5, 0.2],
[5.2, 3.4, 1.4, 0.2],
[4.7, 3.2, 1.6, 0.2],
[4.8, 3.1, 1.6, 0.2],
[5.4, 3.4, 1.5, 0.4],
[5.2, 4.1, 1.5, 0.1],
[5.5, 4.2, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.2],
[5. , 3.2, 1.2, 0.2],
[5.5, 3.5, 1.3, 0.2],
[4.9, 3.6, 1.4, 0.1],
[4.4, 3. , 1.3, 0.2],
[5.1, 3.4, 1.5, 0.2],
[5. , 3.5, 1.3, 0.3],
[4.5, 2.3, 1.3, 0.3],
[4.4, 3.2, 1.3, 0.2],
[5. , 3.5, 1.6, 0.6],
[5.1, 3.8, 1.9, 0.4],
[4.8, 3. , 1.4, 0.3],
[5.1, 3.8, 1.6, 0.2],
[4.6, 3.2, 1.4, 0.2],
[5.3, 3.7, 1.5, 0.2],
[5. , 3.3, 1.4, 0.2],
[7. , 3.2, 4.7, 1.4],
[6.4, 3.2, 4.5, 1.5],
[6.9, 3.1, 4.9, 1.5],
[5.5, 2.3, 4. , 1.3],
[6.5, 2.8, 4.6, 1.5],
[5.7, 2.8, 4.5, 1.3],
[6.3, 3.3, 4.7, 1.6],
[4.9, 2.4, 3.3, 1. ],
[6.6, 2.9, 4.6, 1.3],
[5.2, 2.7, 3.9, 1.4],
[5. , 2. , 3.5, 1. ],
[5.9, 3. , 4.2, 1.5],
[6. , 2.2, 4. , 1. ],
[6.1, 2.9, 4.7, 1.4],
[5.6, 2.9, 3.6, 1.3],
[6.7, 3.1, 4.4, 1.4],
[5.6, 3. , 4.5, 1.5],
[5.8, 2.7, 4.1, 1. ],
[6.2, 2.2, 4.5, 1.5],
[5.6, 2.5, 3.9, 1.1],
[5.9, 3.2, 4.8, 1.8],
[6.1, 2.8, 4. , 1.3],
[6.3, 2.5, 4.9, 1.5],
[6.1, 2.8, 4.7, 1.2],
[6.4, 2.9, 4.3, 1.3],
[6.6, 3. , 4.4, 1.4],
[6.8, 2.8, 4.8, 1.4],
[6.7, 3. , 5. , 1.7],
[6. , 2.9, 4.5, 1.5],
[5.7, 2.6, 3.5, 1. ],
[5.5, 2.4, 3.8, 1.1],
[5.5, 2.4, 3.7, 1. ],
[5.8, 2.7, 3.9, 1.2],
[6. , 2.7, 5.1, 1.6],
[5.4, 3. , 4.5, 1.5],
[6. , 3.4, 4.5, 1.6],
[6.7, 3.1, 4.7, 1.5],
[6.3, 2.3, 4.4, 1.3],
[5.6, 3. , 4.1, 1.3],
[5.5, 2.5, 4. , 1.3],
[5.5, 2.6, 4.4, 1.2],
[6.1, 3. , 4.6, 1.4],
[5.8, 2.6, 4. , 1.2],
[5. , 2.3, 3.3, 1. ],
[5.6, 2.7, 4.2, 1.3],
[5.7, 3. , 4.2, 1.2],
[5.7, 2.9, 4.2, 1.3],
[6.2, 2.9, 4.3, 1.3],
[5.1, 2.5, 3. , 1.1],
[5.7, 2.8, 4.1, 1.3],
[6.3, 3.3, 6. , 2.5],
[5.8, 2.7, 5.1, 1.9],
[7.1, 3. , 5.9, 2.1],
[6.3, 2.9, 5.6, 1.8],
[6.5, 3. , 5.8, 2.2],
[7.6, 3. , 6.6, 2.1],
[4.9, 2.5, 4.5, 1.7],
[7.3, 2.9, 6.3, 1.8],
[6.7, 2.5, 5.8, 1.8],
[7.2, 3.6, 6.1, 2.5],
[6.5, 3.2, 5.1, 2. ],
[6.4, 2.7, 5.3, 1.9],
[6.8, 3. , 5.5, 2.1],
[5.7, 2.5, 5. , 2. ],
[5.8, 2.8, 5.1, 2.4],
[6.4, 3.2, 5.3, 2.3],
[6.5, 3. , 5.5, 1.8],
[7.7, 3.8, 6.7, 2.2],
[7.7, 2.6, 6.9, 2.3],
[6. , 2.2, 5. , 1.5],
[6.9, 3.2, 5.7, 2.3],
[5.6, 2.8, 4.9, 2. ],
[7.7, 2.8, 6.7, 2. ],
[6.3, 2.7, 4.9, 1.8],
[6.7, 3.3, 5.7, 2.1],
[7.2, 3.2, 6. , 1.8],
[6.2, 2.8, 4.8, 1.8],
[6.1, 3. , 4.9, 1.8],
[6.4, 2.8, 5.6, 2.1],
[7.2, 3. , 5.8, 1.6],
[7.4, 2.8, 6.1, 1.9],
[7.9, 3.8, 6.4, 2. ],
[6.4, 2.8, 5.6, 2.2],
[6.3, 2.8, 5.1, 1.5],
[6.1, 2.6, 5.6, 1.4],
[7.7, 3. , 6.1, 2.3],
[6.3, 3.4, 5.6, 2.4],
[6.4, 3.1, 5.5, 1.8],
[6. , 3. , 4.8, 1.8],
[6.9, 3.1, 5.4, 2.1],
[6.7, 3.1, 5.6, 2.4],
[6.9, 3.1, 5.1, 2.3],
[5.8, 2.7, 5.1, 1.9],
[6.8, 3.2, 5.9, 2.3],
[6.7, 3.3, 5.7, 2.5],
[6.7, 3. , 5.2, 2.3],
[6.3, 2.5, 5. , 1.9],
[6.5, 3. , 5.2, 2. ],
[6.2, 3.4, 5.4, 2.3],
[5.9, 3. , 5.1, 1.8]]),
'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]),
'frame': None,
'target_names': array(['setosa', 'versicolor', 'virginica'], dtype='<U10'),
'DESCR': '.. _iris_dataset:\n\nIris plants dataset\n--------------------\n\n**Data Set Characteristics:**\n\n:Number of Instances: 150 (50 in each of three classes)\n:Number of Attributes: 4 numeric, predictive attributes and the class\n:Attribute Information:\n - sepal length in cm\n - sepal width in cm\n - petal length in cm\n - petal width in cm\n - class:\n - Iris-Setosa\n - Iris-Versicolour\n - Iris-Virginica\n\n:Summary Statistics:\n\n============== ==== ==== ======= ===== ====================\n Min Max Mean SD Class Correlation\n============== ==== ==== ======= ===== ====================\nsepal length: 4.3 7.9 5.84 0.83 0.7826\nsepal width: 2.0 4.4 3.05 0.43 -0.4194\npetal length: 1.0 6.9 3.76 1.76 0.9490 (high!)\npetal width: 0.1 2.5 1.20 0.76 0.9565 (high!)\n============== ==== ==== ======= ===== ====================\n\n:Missing Attribute Values: None\n:Class Distribution: 33.3% for each of 3 classes.\n:Creator: R.A. Fisher\n:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)\n:Date: July, 1988\n\nThe famous Iris database, first used by Sir R.A. Fisher. The dataset is taken\nfrom Fisher\'s paper. Note that it\'s the same as in R, but not as in the UCI\nMachine Learning Repository, which has two wrong data points.\n\nThis is perhaps the best known database to be found in the\npattern recognition literature. Fisher\'s paper is a classic in the field and\nis referenced frequently to this day. (See Duda & Hart, for example.) The\ndata set contains 3 classes of 50 instances each, where each class refers to a\ntype of iris plant. One class is linearly separable from the other 2; the\nlatter are NOT linearly separable from each other.\n\n.. dropdown:: References\n\n - Fisher, R.A. "The use of multiple measurements in taxonomic problems"\n Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to\n Mathematical Statistics" (John Wiley, NY, 1950).\n - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.\n (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.\n - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System\n Structure and Classification Rule for Recognition in Partially Exposed\n Environments". IEEE Transactions on Pattern Analysis and Machine\n Intelligence, Vol. PAMI-2, No. 1, 67-71.\n - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions\n on Information Theory, May 1972, 431-433.\n - See also: 1988 MLC Proceedings, 54-64. Cheeseman et al"s AUTOCLASS II\n conceptual clustering system finds 3 classes in the data.\n - Many, many more ...\n',
'feature_names': ['sepal length (cm)',
'sepal width (cm)',
'petal length (cm)',
'petal width (cm)'],
'filename': 'iris.csv',
'data_module': 'sklearn.datasets.data'}
iris.data[:10]array([[5.1, 3.5, 1.4, 0.2],
[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2],
[5. , 3.6, 1.4, 0.2],
[5.4, 3.9, 1.7, 0.4],
[4.6, 3.4, 1.4, 0.3],
[5. , 3.4, 1.5, 0.2],
[4.4, 2.9, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.1]])
# dataset is a data Bunch type in sci-kit learn
type(iris)sklearn.utils._bunch.Bunch
# the data itself is numpy's ndarray
type(iris.data)numpy.ndarray
# list all the attributes of iris Bunch object
dir(iris)['DESCR',
'data',
'data_module',
'feature_names',
'filename',
'frame',
'target',
'target_names']
iris.target_namesarray(['setosa', 'versicolor', 'virginica'], dtype='<U10')
iris.feature_names['sepal length (cm)',
'sepal width (cm)',
'petal length (cm)',
'petal width (cm)']
iris.targetarray([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
# list all the members of data (ndarray) object
dir(iris.data)['T',
'__abs__',
'__add__',
'__and__',
'__array__',
'__array_finalize__',
'__array_function__',
'__array_interface__',
'__array_namespace__',
'__array_priority__',
'__array_struct__',
'__array_ufunc__',
'__array_wrap__',
'__bool__',
'__buffer__',
'__class__',
'__class_getitem__',
'__complex__',
'__contains__',
'__copy__',
'__deepcopy__',
'__delattr__',
'__delitem__',
'__dir__',
'__divmod__',
'__dlpack__',
'__dlpack_device__',
'__doc__',
'__eq__',
'__float__',
'__floordiv__',
'__format__',
'__ge__',
'__getattribute__',
'__getitem__',
'__getstate__',
'__gt__',
'__hash__',
'__iadd__',
'__iand__',
'__ifloordiv__',
'__ilshift__',
'__imatmul__',
'__imod__',
'__imul__',
'__index__',
'__init__',
'__init_subclass__',
'__int__',
'__invert__',
'__ior__',
'__ipow__',
'__irshift__',
'__isub__',
'__iter__',
'__itruediv__',
'__ixor__',
'__le__',
'__len__',
'__lshift__',
'__lt__',
'__matmul__',
'__mod__',
'__mul__',
'__ne__',
'__neg__',
'__new__',
'__or__',
'__pos__',
'__pow__',
'__radd__',
'__rand__',
'__rdivmod__',
'__reduce__',
'__reduce_ex__',
'__repr__',
'__rfloordiv__',
'__rlshift__',
'__rmatmul__',
'__rmod__',
'__rmul__',
'__ror__',
'__rpow__',
'__rrshift__',
'__rshift__',
'__rsub__',
'__rtruediv__',
'__rxor__',
'__setattr__',
'__setitem__',
'__setstate__',
'__sizeof__',
'__str__',
'__sub__',
'__subclasshook__',
'__truediv__',
'__xor__',
'all',
'any',
'argmax',
'argmin',
'argpartition',
'argsort',
'astype',
'base',
'byteswap',
'choose',
'clip',
'compress',
'conj',
'conjugate',
'copy',
'ctypes',
'cumprod',
'cumsum',
'data',
'device',
'diagonal',
'dot',
'dtype',
'dump',
'dumps',
'fill',
'flags',
'flat',
'flatten',
'getfield',
'imag',
'item',
'itemset',
'itemsize',
'mT',
'max',
'mean',
'min',
'nbytes',
'ndim',
'newbyteorder',
'nonzero',
'partition',
'prod',
'ptp',
'put',
'ravel',
'real',
'repeat',
'reshape',
'resize',
'round',
'searchsorted',
'setfield',
'setflags',
'shape',
'size',
'sort',
'squeeze',
'std',
'strides',
'sum',
'swapaxes',
'take',
'to_device',
'tobytes',
'tofile',
'tolist',
'tostring',
'trace',
'transpose',
'var',
'view']
- use Pandas DataFrame object to analyze and visualize dataset
- convert data Bunch object to Pandas DataFrame
- use DataFrame object's various methods and attibutes to learn about the data:
- typically:
- learn the dimension of the dataset
- peek at the data itself
- see the statistical summary of all attributes
- see breakdown of the data by the class variable
# use np.c_ numpy's concatenate function
df = pd.DataFrame(data=np.c_[iris['data'], iris['target']], columns=iris.feature_names + ['target'])type(df)pandas.core.frame.DataFrame
df
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
</style>
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
150 rows × 5 columns
# Convert target column to string dtype to avoid dtype conflict
#df['target'] = df['target'].astype(str)
# update target id with target name
for i in range(3):
# df.loc[df['target'] == str(i), ['target']] = iris.target_names[i]
df.loc[df['target'] == i, ['target']] = iris.target_names[i]/tmp/ipykernel_14919/2951524723.py:7: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'setosa' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
df.loc[df['target'] == i, ['target']] = iris.target_names[i]
df['target']0 setosa
1 setosa
2 setosa
3 setosa
4 setosa
...
145 virginica
146 virginica
147 virginica
148 virginica
149 virginica
Name: target, Length: 150, dtype: object
# peek the data; head and tail
df
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
</style>
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target | |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |
150 rows × 5 columns
# find the dimension
df.shape(150, 5)
# statistical summary
df.describe()
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
</style>
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | |
|---|---|---|---|---|
| count | 150.000000 | 150.000000 | 150.000000 | 150.000000 |
| mean | 5.843333 | 3.057333 | 3.758000 | 1.199333 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 |
| min | 4.300000 | 2.000000 | 1.000000 | 0.100000 |
| 25% | 5.100000 | 2.800000 | 1.600000 | 0.300000 |
| 50% | 5.800000 | 3.000000 | 4.350000 | 1.300000 |
| 75% | 6.400000 | 3.300000 | 5.100000 | 1.800000 |
| max | 7.900000 | 4.400000 | 6.900000 | 2.500000 |
# class distribution
print(df.groupby('target').size())target
setosa 50
versicolor 50
virginica 50
dtype: int64
- visualize may help us understand the dataset visually
- two types of plots:
- Univariate plot to better understand each attribute
- Multivariate plot to better understand the relationships between attributes
# create boxplot as univariate plot
fig, ax = plt.subplots(figsize=(8, 5))
sns.boxplot(data=df.loc[:, iris.feature_names], ax=ax)<Axes: >
# histograms of each feature
fig, ax = plt.subplots(figsize=(8, 5))
sns.histplot(data=df.loc[:, iris.feature_names], ax=ax)<Axes: ylabel='Count'>
# use seaborn's pairplot for pair-wise scatter matrix
sns.set_theme(style="ticks")
sns.pairplot(data=df, hue="target")<seaborn.axisgrid.PairGrid at 0x79ab646c0d30>
# just use petal-length and petal width features
# X and y are conventions used to represet X as a matrix of features and y as a vector of labels
X = iris.data
y = iris.target# print the list of unique labels
print('Class labels:', np.unique(y))Class labels: [0 1 2]
# let's look at the first 10 rows of data
X[:10]array([[5.1, 3.5, 1.4, 0.2],
[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2],
[5. , 3.6, 1.4, 0.2],
[5.4, 3.9, 1.7, 0.4],
[4.6, 3.4, 1.4, 0.3],
[5. , 3.4, 1.5, 0.2],
[4.4, 2.9, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.1]])
- training set is used to fit a model
- test set needs to be set aside to evaluate the model
- gives some metric on how well the model will likely to perform if deployed on unseen real-world data
- typical split is 80/20 or 75/25 - 80% or 75% training and 20% or 25% test data
- rule of thumb: if you've fewer data, use 80/20
- if you've larger data set, use 75/25 or 70/30 or 60/40
sklearn.model_selection.train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)- https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
- use random seed to reproduce the same random numbers
- use stratification; means the training and test subset splits have the same proportions of class labels as the input dataset
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=1, stratify=y)print('Labels counts in y:', np.bincount(y))
print('Labels counts in y_train:', np.bincount(y_train))
print('Labels counts in y_test:', np.bincount(y_test))Labels counts in y: [50 50 50]
Labels counts in y_train: [40 40 40]
Labels counts in y_test: [10 10 10]
- choosing appropriate algorithm for a particular problem requires practice and experience
- each algorithm has its own quirks and is based on certain assumptions
- No free lunch theorem by David H. Wolpert
- no single classifier works best across all possible scenarios
- it is always recommended that you compare the performance of at least a handful of different learning algorithms to select the best model for the particular problem
- perceptron is one of the earliest machine learning algorithms that models human brain neurons
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html
- in order to design AI, the first concept of a simplified brain cell called McCulloch-Pitts (MCP) neuron in 1943
- named after the scientists
- Neurons are interconnected nerve cells in the brain that are involved in the processing and transmitting of chemical and electrical signals

- nerve cell is a simple logic gate with binary outputs
- multiple signals arrive at the dendrites, then integrated into the cell body, and, if the accumulated signal exceeds a certain threshold, an output signal is generated that will be passed on by the axon
- in 1957, Frank Rosenblatt published the first concept of the perceptron learning rule based on the MCP neuron model
- the proposed algorithm is a rule that would automatically learn the optimal weight coefficients that would then be multiplied with the input features in order to make a decision of whether a neuron fires (transmits a signal) or not
- in supervised learning and binary classification problem, such an algorithm could then be used to predict whether a new data point belongs to one class or the other
- we'll use the Perceptron and a few other classification algorithms implemented in scikit-learn library
- most algorithms in scikit-learn already support multiclass classification by default using one-vs.-rest (OvR) method
- allowing us to feed all samples with 3 different classes/labels to the perceptron algorithm
from sklearn.linear_model import Perceptron
# eta0 - constant by which updates are multiplied
# random_state - used to shuffle the training data, when shuffle is True (by default)
ppn = Perceptron(eta0=0.1, random_state=1)
ppn.fit(X_train, y_train)Perceptron(eta0=0.1, random_state=1)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
y_pred = ppn.predict(X_test)
print('Misclassified examples: %d' % (y_test != y_pred).sum())Misclassified examples: 1
- sometime it's worth looking at the samples that are misclassified
- for the sake of understanding the nature of the samples and how they're classified
# find the indices of all the misclassified samples
misclassified = np.where(y_test != y_pred)misclassified(array([21]),)
# let's create DataFrame to examine the data
# combin all DataFrame along the axis-1 vertically
mis_df = pd.concat([pd.DataFrame(misclassified[0].transpose(), columns=['ID']),
pd.DataFrame(X_test[misclassified], columns=iris.feature_names),
pd.DataFrame(y_test[misclassified], columns=['y_true']),
pd.DataFrame(y_pred[misclassified], columns=['y_pred'])
],
axis=1)mis_df
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
</style>
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| ID | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | y_true | y_pred | |
|---|---|---|---|---|---|---|---|
| 0 | 21 | 5.6 | 3.0 | 4.5 | 1.5 | 1 | 2 |
# convert target id and predicted id to string dtype
#mis_df['y_true'] = mis_df['y_true'].astype(str)
#mis_df['y_pred'] = mis_df['y_pred'].astype(str)
# let's convert target id to target names
for i in range(3):
mis_df.loc[mis_df['y_true'] == i, 'y_true'] = iris.target_names[i]
mis_df.loc[mis_df['y_pred'] == i, 'y_pred'] = iris.target_names[i]/tmp/ipykernel_3005/2903258562.py:6: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'setosa' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.
mis_df.loc[mis_df['y_true'] == i, 'y_true'] = iris.target_names[i]
/tmp/ipykernel_3005/2903258562.py:7: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'setosa' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.
mis_df.loc[mis_df['y_pred'] == i, 'y_pred'] = iris.target_names[i]
mis_df
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
</style>
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| ID | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | y_true | y_pred | |
|---|---|---|---|---|---|---|---|
| 0 | 21 | 5.6 | 3.0 | 4.5 | 1.5 | versicolor | virginica |
# let's create a function to find the misclassified samples
# we can use it for the results from all the classifiers
def getMisClassifiedSamples(X_test, y_test, y_pred):
# find the indices of all the misclassified samples
misclassified = np.where(y_test != y_pred)
# let's create DataFrame to examine the data
mis_df = pd.concat([pd.DataFrame(misclassified[0].transpose(), columns=['ID']),
pd.DataFrame(X_test[misclassified], columns=iris.feature_names),
pd.DataFrame(y_test[misclassified], columns=['y_true']),
pd.DataFrame(y_pred[misclassified], columns=['y_pred'])
],
axis=1)
# convert target id and predicted id to string dtype
#mis_df['y_true'] = mis_df['y_true'].astype(str)
#mis_df['y_pred'] = mis_df['y_pred'].astype(str)
for i in range(3):
mis_df.loc[mis_df['y_true'] == i, 'y_true'] = iris.target_names[i]
mis_df.loc[mis_df['y_pred'] == i, 'y_pred'] = iris.target_names[i]
return mis_df# let's test the function
mis_df = getMisClassifiedSamples(X_test, y_test, y_pred)/tmp/ipykernel_2900/1608365914.py:17: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'setosa' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.
mis_df.loc[mis_df['y_true'] == i, 'y_true'] = iris.target_names[i]
/tmp/ipykernel_2900/1608365914.py:18: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'setosa' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.
mis_df.loc[mis_df['y_pred'] == i, 'y_pred'] = iris.target_names[i]
mis_df
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
</style>
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| ID | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | y_true | y_pred | |
|---|---|---|---|---|---|---|---|
| 0 | 21 | 5.6 | 3.0 | 4.5 | 1.5 | versicolor | virginica |
- scale the feature values using StandardScaler
- estimate sample mean and standard deviation and use those parameters to scale both the training and test samples
-
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler
- for each sample:
$\vec{x} = \frac{(\vec{x}-u)}{s}$ where$u$ = mean and$s$ = standard deviation
- for each sample:
- More on feature scaling in Feature Engineering Chapter
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(X) # fit the whole data to calculate mean and standard deviation
X_train_sc = sc.transform(X_train) # transform training set
X_test_sc = sc.transform(X_test) # transform validation setX_test_scarray([[ 1.76501198, -0.36217625, 1.44480739, 0.79067065],
[-1.14301691, 0.09821729, -1.2833891 , -1.44707648],
[-0.90068117, -1.28296331, -0.4308277 , -0.13075464],
[-1.26418478, 0.78880759, -1.22655167, -1.3154443 ],
[-1.02184904, 1.01900435, -1.22655167, -0.78891556],
[-0.90068117, 1.01900435, -1.34022653, -1.3154443 ],
[ 0.4321654 , 0.78880759, 0.93327055, 1.44883158],
[ 0.67450115, -0.82256978, 0.87643312, 0.92230284],
[-0.29484182, -0.59237301, 0.64908342, 1.05393502],
[ 1.15917263, -0.59237301, 0.59224599, 0.26414192],
[-0.53717756, 1.93979142, -1.39706395, -1.05217993],
[-0.41600969, -1.74335684, 0.13754657, 0.13250973],
[ 1.15917263, -0.13197948, 0.99010798, 1.18556721],
[ 0.18982966, -1.97355361, 0.13754657, -0.26238682],
[-0.17367395, -1.28296331, 0.70592084, 1.05393502],
[-0.17367395, 3.09077525, -1.2833891 , -1.05217993],
[ 1.52267624, -0.13197948, 1.21745768, 1.18556721],
[ 0.31099753, -0.59237301, 0.13754657, 0.13250973],
[-1.14301691, -1.51316008, -0.26031542, -0.26238682],
[ 0.31099753, -0.13197948, 0.64908342, 0.79067065],
[ 0.67450115, -0.36217625, 0.30805885, 0.13250973],
[-0.29484182, -0.13197948, 0.42173371, 0.3957741 ],
[-1.14301691, 1.24920112, -1.34022653, -1.44707648],
[-1.74885626, -0.36217625, -1.34022653, -1.3154443 ],
[ 0.79566902, -0.13197948, 0.8195957 , 1.05393502],
[-1.14301691, -1.28296331, 0.42173371, 0.65903847],
[-0.53717756, 1.93979142, -1.16971425, -1.05217993],
[-1.26418478, -0.13197948, -1.34022653, -1.44707648],
[ 0.55333328, 0.55861082, 0.53540856, 0.52740629],
[ 0.79566902, -0.59237301, 0.47857113, 0.3957741 ]])
- let's retrain the Perceptron with the same parameter on the scaled features
# eta0 - learning rate
ppn1 = Perceptron(eta0=0.1, random_state=1)
ppn1.fit(X_train_sc, y_train)
y_pred_sc = ppn1.predict(X_test_sc)
print('Misclassified examples: %d' % (y_test != y_pred_sc).sum())Misclassified examples: 5
# let's use the X_test so we can see the original measurements of
mis_df = getMisClassifiedSamples(X_test, y_test, y_pred_sc)/tmp/ipykernel_2900/1608365914.py:17: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'setosa' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.
mis_df.loc[mis_df['y_true'] == i, 'y_true'] = iris.target_names[i]
/tmp/ipykernel_2900/1608365914.py:18: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'setosa' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.
mis_df.loc[mis_df['y_pred'] == i, 'y_pred'] = iris.target_names[i]
mis_df
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
</style>
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| ID | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | y_true | y_pred | |
|---|---|---|---|---|---|---|---|
| 0 | 2 | 5.1 | 2.5 | 3.0 | 1.1 | versicolor | setosa |
| 1 | 9 | 6.8 | 2.8 | 4.8 | 1.4 | versicolor | virginica |
| 2 | 21 | 5.6 | 3.0 | 4.5 | 1.5 | versicolor | setosa |
| 3 | 28 | 6.3 | 3.3 | 4.7 | 1.6 | versicolor | virginica |
| 4 | 29 | 6.5 | 2.8 | 4.6 | 1.5 | versicolor | virginica |
# if you want to examine the scaled features pass scaled data
mis_df_sc = getMisClassifiedSamples(X_test_sc, y_test, y_pred_sc)/tmp/ipykernel_2900/1608365914.py:17: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'setosa' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.
mis_df.loc[mis_df['y_true'] == i, 'y_true'] = iris.target_names[i]
/tmp/ipykernel_2900/1608365914.py:18: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'setosa' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.
mis_df.loc[mis_df['y_pred'] == i, 'y_pred'] = iris.target_names[i]
mis_df_sc
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
</style>
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| ID | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | y_true | y_pred | |
|---|---|---|---|---|---|---|---|
| 0 | 2 | -0.900681 | -1.282963 | -0.430828 | -0.130755 | versicolor | setosa |
| 1 | 9 | 1.159173 | -0.592373 | 0.592246 | 0.264142 | versicolor | virginica |
| 2 | 21 | -0.294842 | -0.131979 | 0.421734 | 0.395774 | versicolor | setosa |
| 3 | 28 | 0.553333 | 0.558611 | 0.535409 | 0.527406 | versicolor | virginica |
| 4 | 29 | 0.795669 | -0.592373 | 0.478571 | 0.395774 | versicolor | virginica |
- there are various metrics to evalute the performance of a model
- for classification problems, accuracy, confusion matrix, precision, recall, and F1 score are commonly used
- percentage of samples misclassified
- works well on balanced dataset where each class has same number samples
$ERR = \frac {FP + FN}{P+N}$
-
accuracy is defined as the percentage of samples correctly classified
-
works well on balanced dataset where each class has same number of samples
-
$ACC = \frac{TP+TN}{P+N} = 1 - ERR $
- the number of positive samples that the model correctly classified as positive
- the number of negative samples that the model correctly classified as negative
- the number of negative samples that the model incorrectly classified as positive
- the number of positive examples that the model incorrectly classified as negative
- fraction of correctly classified positive instances among all the instances predicted as positive
- $ precision = \frac{TP}{TP+FP}$
- fraction of correctively classified positive instances among all the positive instances
- $ recall = \frac{TP}{P}$
- harmonic mean of precision and recall
$f_1 = \frac{2TP}{2TP+FP+FN} = 2\times\frac{precision \times recall}{precision + recall}$
# let's look at the accuracy score first
from sklearn.metrics import accuracy_score
print('Accuracy of Perceptron on original data: %.3f' % accuracy_score(y_test, y_pred))Accuracy of Perceptron on original data: 0.967
# each classifier object also has score method that can be used to calculate the accuracy
print('Accuracy of Perceptron on original data: %.3f' % ppn.score(X_test, y_test))Accuracy of Perceptron on original data: 0.967
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))
# not very informative[[10 0 0]
[ 0 9 1]
[ 0 0 10]]
# let's plot the confusion matrix
from sklearn.metrics import ConfusionMatrixDisplayfig = ConfusionMatrixDisplay.from_estimator(ppn, X_test, y_test,
display_labels=iris.target_names,
cmap=plt.cm.Blues,
)
fig.ax_.set_title("Confusion Matrix on Oringiral Data using Perceptron")
plt.show()# calculate and print the other metrics with classification_report function
from sklearn.metrics import classification_reportprint('Classification report on original data using Perceptron')
print(classification_report(y_test, y_pred))Classification report on original data using Perceptron
precision recall f1-score support
0 1.00 1.00 1.00 10
1 1.00 0.90 0.95 10
2 0.91 1.00 0.95 10
accuracy 0.97 30
macro avg 0.97 0.97 0.97 30
weighted avg 0.97 0.97 0.97 30
- decision tree a supervised learning method
- model is easier to visualize and understand
- tree is composed of nodes and those nodes are chosen looking for the optimum split of the features
- two criteria are used:
giniandentropy
- two criteria are used:
- see https://quantdare.com/decision-trees-gini-vs-entropy/ for the differnce between
giniandentropy
Image(filename='images/03_17.png', width=500)from sklearn.tree import DecisionTreeClassifier
tree_model = DecisionTreeClassifier(criterion='gini',
random_state=1)
tree_model.fit(X_train, y_train)DecisionTreeClassifier(random_state=1)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier
DecisionTreeClassifier(random_state=1)
from sklearn import tree
tree.plot_tree(tree_model)
#plt.savefig('images/03_21_1.pdf')
plt.show()- install Graphviz on the system
- install graphviz binary and Python API library
- see system specific answer - https://stackoverflow.com/questions/35064304/runtimeerror-make-sure-the-graphviz-executables-are-on-your-systems-path-aft
! sudo apt updateHit:2 http://archive.ubuntu.com/ubuntu focal InRelease �[0m
Get:3 http://security.ubuntu.com/ubuntu focal-security InRelease [128 kB] �[0m
Get:4 https://packages.microsoft.com/repos/microsoft-ubuntu-focal-prod focal InRelease [3632 B]
Hit:5 https://dl.yarnpkg.com/debian stable InRelease �[0m�[33m
Get:6 http://archive.ubuntu.com/ubuntu focal-updates InRelease [128 kB] �[0m
Hit:1 https://packagecloud.io/github/git-lfs/ubuntu focal InRelease �[0m
Get:7 http://archive.ubuntu.com/ubuntu focal-backports InRelease [128 kB] �[0mm�[33m
Hit:8 https://repo.anaconda.com/pkgs/misc/debrepo/conda stable InRelease
Get:9 https://packages.microsoft.com/repos/microsoft-ubuntu-focal-prod focal/main amd64 Packages [334 kB]
Get:10 http://security.ubuntu.com/ubuntu focal-security/main amd64 Packages [4223 kB]
Get:11 http://archive.ubuntu.com/ubuntu focal-updates/universe amd64 Packages [1590 kB]
Get:12 http://security.ubuntu.com/ubuntu focal-security/universe amd64 Packages [1298 kB]
Get:13 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 Packages [4699 kB]
Get:14 http://security.ubuntu.com/ubuntu focal-security/restricted amd64 Packages [4337 kB]
Get:15 http://archive.ubuntu.com/ubuntu focal-updates/restricted amd64 Packages [4526 kB]
Fetched 21.4 MB in 2s (8681 kB/s) �[0m�[33m�[33m�[33m�[33m�[33m�[33m
Reading package lists... Done
Building dependency tree
Reading state information... Done
48 packages can be upgraded. Run 'apt list --upgradable' to see them.
! sudo apt install -y graphviz! pip install graphvizRequirement already satisfied: graphviz in /home/codespace/.local/lib/python3.12/site-packages (0.20.3)
�[1m[�[0m�[34;49mnotice�[0m�[1;39;49m]�[0m�[39;49m A new release of pip is available: �[0m�[31;49m24.3.1�[0m�[39;49m -> �[0m�[32;49m25.0.1�[0m
�[1m[�[0m�[34;49mnotice�[0m�[1;39;49m]�[0m�[39;49m To update, run: �[0m�[32;49mpython3 -m pip install --upgrade pip�[0m
! python3 -m pip install --upgrade pipRequirement already satisfied: pip in /home/codespace/.python/current/lib/python3.12/site-packages (24.3.1)
Collecting pip
Downloading pip-25.0.1-py3-none-any.whl.metadata (3.7 kB)
Downloading pip-25.0.1-py3-none-any.whl (1.8 MB)
�[2K �[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━�[0m �[32m1.8/1.8 MB�[0m �[31m29.7 MB/s�[0m eta �[36m0:00:00�[0m
�[?25hInstalling collected packages: pip
Attempting uninstall: pip
Found existing installation: pip 24.3.1
Uninstalling pip-24.3.1:
Successfully uninstalled pip-24.3.1
Successfully installed pip-25.0.1
! pip --versionpip 25.0.1 from /home/codespace/.python/current/lib/python3.12/site-packages/pip (python 3.12)
import graphviz
dot_data = tree.export_graphviz(tree_model, out_file=None,
feature_names=iris.feature_names,
class_names=iris.target_names,
filled=True, rounded=True, special_characters=True)graph = graphviz.Source(dot_data)graphy_pred = tree_model.predict(X_test)
# how many samples are correctly classified?
print('Correctly classified samples on original data: %d' % (y_test == y_pred).sum())Correctly classified samples on original data: 29
# how many samples are misclassified?
print(f'Misclassified samples on original data: {(y_test != y_pred).sum()}')Misclassified samples on original data: 1
mis_df = getMisClassifiedSamples(X_test, y_test, y_pred)/tmp/ipykernel_2900/1608365914.py:17: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'setosa' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.
mis_df.loc[mis_df['y_true'] == i, 'y_true'] = iris.target_names[i]
/tmp/ipykernel_2900/1608365914.py:18: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'setosa' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.
mis_df.loc[mis_df['y_pred'] == i, 'y_pred'] = iris.target_names[i]
mis_df
# is it the same sample that was also mis-classified by Perceptron?
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
</style>
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| ID | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | y_true | y_pred | |
|---|---|---|---|---|---|---|---|
| 0 | 25 | 4.9 | 2.5 | 4.5 | 1.7 | virginica | versicolor |
# what is the accuracy?
print('Accuracy on original data: %.3f' % accuracy_score(y_test, y_pred))Accuracy on original data: 0.967
# accuracy = correctly classified samples/total number of samples
print(f'Accuracy on original data: = {29/30:.3f}')Accuracy on original data: = 0.967
# plot the confusion matrix
fig = ConfusionMatrixDisplay.from_estimator(tree_model, X_test, y_test,
display_labels=iris.target_names,
cmap=plt.cm.Blues,
)
fig.ax_.set_title("Confusion Matrix on Original Data using DT")
plt.show()print('Classification report on original data using DT')
print(classification_report(y_test, y_pred))Classification report on original data using DT
precision recall f1-score support
0 1.00 1.00 1.00 10
1 0.91 1.00 0.95 10
2 1.00 0.90 0.95 10
accuracy 0.97 30
macro avg 0.97 0.97 0.97 30
weighted avg 0.97 0.97 0.97 30
- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
- a ensemble-based meta classifier that fits a number of decision tree classifiers on various sub-samples of the dataset
- uses majority voting from decision trees to do the final prediction
- RF applied to Titanic dataset - https://www.kaggle.com/alexisbcook/getting-started-with-titanic

- let's apply RF classifier to the IRIS dataset
# import the classifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, ConfusionMatrixDisplayRF = RandomForestClassifier() # use all default valuesRF.fit(X_train, y_train)RandomForestClassifier()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
y_pred = RF.predict(X_test)print(f'Accuracy on original data using RF = {accuracy_score(y_test, y_pred):.3f}')Accuracy on original data using RF = 0.967
# let's look at the misclassified samples
mis_df = getMisClassifiedSamples(X_test, y_test, y_pred)/tmp/ipykernel_2900/1608365914.py:17: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'setosa' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.
mis_df.loc[mis_df['y_true'] == i, 'y_true'] = iris.target_names[i]
/tmp/ipykernel_2900/1608365914.py:18: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'setosa' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.
mis_df.loc[mis_df['y_pred'] == i, 'y_pred'] = iris.target_names[i]
getMisClassifiedSamples(X_test, y_test, y_pred)
# is it the same sample that was misclassified by Perceptron and Decision Tree?/tmp/ipykernel_2900/1608365914.py:17: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'setosa' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.
mis_df.loc[mis_df['y_true'] == i, 'y_true'] = iris.target_names[i]
/tmp/ipykernel_2900/1608365914.py:18: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'setosa' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.
mis_df.loc[mis_df['y_pred'] == i, 'y_pred'] = iris.target_names[i]
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
</style>
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| ID | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | y_true | y_pred | |
|---|---|---|---|---|---|---|---|
| 0 | 25 | 4.9 | 2.5 | 4.5 | 1.7 | virginica | versicolor |
# plot the confusion matrix
fig = ConfusionMatrixDisplay.from_estimator(RF, X_test, y_test,
display_labels=iris.target_names,
cmap=plt.cm.Blues,
)
fig.ax_.set_title("Confusion Matrix on original data using RF")
plt.show()print('Classification report on original data using RF')
print(classification_report(y_test, y_pred))Classification report on original data using RF
precision recall f1-score support
0 1.00 1.00 1.00 10
1 0.91 1.00 0.95 10
2 1.00 0.90 0.95 10
accuracy 0.97 30
macro avg 0.97 0.97 0.97 30
weighted avg 0.97 0.97 0.97 30
- https://xgboost.readthedocs.io/en/stable/
- XGBoost (Extreme Gradient Boosting) is a powerful machine learning algorithm based on gradient boosting. It is widely used for structured data tasks like classification and regression due to its efficiency and performance.
XGBoost is based on the gradient boosting framework, which builds an ensemble of decision trees sequentially, where each new tree tries to correct the errors of the previous ones.
- starts with an initial prediction, usually the average of the target values for regression or log odds for classification.
- It calculates the residuals (errors) by taking the difference between actual values and predicted values.
- The gradient of the loss function with respect to the predictions is computed.
- A new decision tree is trained to predict these residuals (i.e., the negative gradient).
- The leaves of the tree contain values that minimize the loss.
The new tree’s predictions are scaled by a learning rate (𝛼) and added to the previous predictions.
- Steps 2–4 are repeated for a specified number of iterations (trees) or until performance stops improving.
- Regularization (L1 & L2 penalties): Helps prevent overfitting.
- Shrinkage (Learning Rate): Controls how much each tree contributes to the final prediction.
- Column and Row Subsampling: Reduces variance and improves generalization.
- Tree Pruning: Uses maximum depth instead of pre-pruning, which makes it efficient.
- Handling Missing Values: Automatically learns how to handle missing data.
- Parallel Computing: Uses optimized algorithms for fast computation.
- Speed & Scalability: Efficient implementation using parallelism.
- Accuracy: Often achieves high performance in machine learning competitions.
- Flexibility: Supports regression, classification, ranking, and more.
! pip install xgboostCollecting xgboost
Downloading xgboost-2.1.4-py3-none-manylinux_2_28_x86_64.whl.metadata (2.1 kB)
Requirement already satisfied: numpy in /home/codespace/.local/lib/python3.12/site-packages (from xgboost) (2.1.3)
Collecting nvidia-nccl-cu12 (from xgboost)
Downloading nvidia_nccl_cu12-2.25.1-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.8 kB)
Requirement already satisfied: scipy in /home/codespace/.local/lib/python3.12/site-packages (from xgboost) (1.15.1)
Downloading xgboost-2.1.4-py3-none-manylinux_2_28_x86_64.whl (223.6 MB)
�[2K �[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━�[0m �[32m223.6/223.6 MB�[0m �[31m50.8 MB/s�[0m eta �[36m0:00:00�[0m00:01�[0m00:01�[0m
�[?25hDownloading nvidia_nccl_cu12-2.25.1-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (201.4 MB)
�[2K �[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━�[0m �[32m201.4/201.4 MB�[0m �[31m41.9 MB/s�[0m eta �[36m0:00:00�[0m00:01�[0m00:01�[0m
�[?25hInstalling collected packages: nvidia-nccl-cu12, xgboost
Successfully installed nvidia-nccl-cu12-2.25.1 xgboost-2.1.4
�[1m[�[0m�[34;49mnotice�[0m�[1;39;49m]�[0m�[39;49m A new release of pip is available: �[0m�[31;49m24.3.1�[0m�[39;49m -> �[0m�[32;49m25.0�[0m
�[1m[�[0m�[34;49mnotice�[0m�[1;39;49m]�[0m�[39;49m To update, run: �[0m�[32;49mpython3 -m pip install --upgrade pip�[0m
import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Convert data into XGBoost DMatrix format (optimized for XGBoost)
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
# Set XGBoost parameters
params = {
'objective': 'multi:softmax', # Multi-class classification
'num_class': 3, # Number of classes in target variable
'max_depth': 3, # Depth of trees
'learning_rate': 0.1, # Step size shrinkage
'n_estimators': 100, # Number of trees
'eval_metric': 'mlogloss' # Metric for evaluation
}
# Train the model
model = xgb.train(params, dtrain, num_boost_round=100)
# Test model
y_pred = model.predict(dtest)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'XGBoost Model Accuracy: {accuracy:.2f}')XGBoost Model Accuracy: 0.93
/home/codespace/.python/current/lib/python3.12/site-packages/xgboost/core.py:158: UserWarning: [17:37:04] WARNING: /workspace/src/learner.cc:740:
Parameters: { "n_estimators" } are not used.
warnings.warn(smsg, UserWarning)
# let's look at the misclassified samples
getMisClassifiedSamples(X_test, y_test, y_pred)/tmp/ipykernel_2900/1608365914.py:17: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'setosa' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.
mis_df.loc[mis_df['y_true'] == i, 'y_true'] = iris.target_names[i]
/tmp/ipykernel_2900/1608365914.py:18: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'setosa' has dtype incompatible with float32, please explicitly cast to a compatible dtype first.
mis_df.loc[mis_df['y_pred'] == i, 'y_pred'] = iris.target_names[i]
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
</style>
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| ID | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | y_true | y_pred | |
|---|---|---|---|---|---|---|---|
| 0 | 2 | 5.1 | 2.5 | 3.0 | 1.1 | versicolor | setosa |
| 1 | 25 | 4.9 | 2.5 | 4.5 | 1.7 | virginica | versicolor |
- XGBoost can be used with scikit-learn's API, making it easy to integrate into existing workflows
xgb_model = xgb.XGBClassifier() # use all default values
xgb_model.fit(X_train, y_train)
# test model
y_pred = xgb_model.predict(X_test)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'XGBoost Model Accuracy: {accuracy:.2f}')XGBoost Model Accuracy: 0.93
# let's look at the misclassified samples
getMisClassifiedSamples(X_test, y_test, y_pred)/tmp/ipykernel_2900/1608365914.py:17: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'setosa' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.
mis_df.loc[mis_df['y_true'] == i, 'y_true'] = iris.target_names[i]
/tmp/ipykernel_2900/1608365914.py:18: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'setosa' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.
mis_df.loc[mis_df['y_pred'] == i, 'y_pred'] = iris.target_names[i]
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
</style>
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
- Model selection chapter dives into the details
- for now, we'll use the accuracy score to select the best model and use 80/20 split










