kNN algorithm for beginners- Building an Iris species prediction model

5 min read · Oct 20, 2020

If you are a beginner, this is a step-by-step guide to the KNN (k-nearest neighbors) algorithm, in which we use KNN to predict the species of an iris flower. KNN is one of the simplest supervised ML algorithms. The picture on the left shows an Iris plant. The dataset covers 3 species, which are told apart by the dimensions of their sepals and petals.

Here we build a simple model that predicts which species a flower with the given dimensions belongs to, with an explanation of the code to help you throughout.

STEP 1- LOADING THE NECESSARY LIBRARIES

Pandas- You have probably heard a lot about it, but what actually is it? The name pandas comes from “panel data”, a term for multi-dimensional structured data sets. A pandas table is much like an Excel sheet, so pandas is a way to look at the data and make changes to it, much as you would in an Excel file. It is built around a data structure called the DataFrame (modeled after R's data.frame). It can ingest data from formats like CSV files or SQL databases. (A great book for learning pandas is Python for Data Analysis by Wes McKinney.)
NumPy- used for scientific computing in Python. It provides multidimensional arrays and high-level mathematical functions such as Fourier transforms and pseudorandom number generators.
Scikit-learn- an open source project (so it is free to use and distribute). It contains many widely used machine learning algorithms, along with documentation for each of them. In scikit-learn, the NumPy array is the fundamental data structure.
Matplotlib- As the name suggests, this is the primary scientific plotting library in Python. You can use it to make histograms, scatter plots, etc.
SciPy- a collection of functions for scientific computing in Python. For scikit-learn, the most important part of SciPy is scipy.sparse, which provides sparse matrices (another data structure used alongside the NumPy array).

So, in the picture below, I load all the libraries and check their versions.
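For readers who cannot see the picture, here is a minimal sketch of what loading the libraries and checking their versions can look like. The `versions` dictionary is only a helper name chosen for this illustration, and the version numbers you see will depend on your own installation.

```python
import matplotlib
import numpy as np
import pandas as pd
import scipy as sp
import sklearn

# Collect the version string of each library (helper dict is illustrative only).
versions = {
    "pandas": pd.__version__,
    "NumPy": np.__version__,
    "scikit-learn": sklearn.__version__,
    "matplotlib": matplotlib.__version__,
    "SciPy": sp.__version__,
}

for name, version in versions.items():
    print(f"{name} version: {version}")
```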

STEP 2- MEET THE DATA

What happened over here?

The load_iris() function returns a Bunch object, which behaves like a dictionary: it contains keys and values. Because of that, I can access its keys, so next I print them.

What is DESCR? It is a short description of the iris dataset.
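The loading step described above can be sketched roughly as follows. The variable name `iris_dataset` is an assumption made for this example, and only the beginning of DESCR is printed to keep the output short.

```python
from sklearn.datasets import load_iris

# load_iris() returns a Bunch object, which can be used like a dictionary.
iris_dataset = load_iris()

# Print the keys, just as we would for a normal dictionary.
print("Keys of iris_dataset:", list(iris_dataset.keys()))

# DESCR holds a text description of the dataset; show only the first part.
print(iris_dataset["DESCR"][:200])

# The measurements themselves live under the "data" key.
print("Shape of data:", iris_dataset["data"].shape)
```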

A sneak peek into our data (don’t be confused: 0, 1 and 2 are the labels of the iris species):

0 is for setosa

1 is for versicolor

2 is for virginica
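To see this mapping in the data itself, a short sketch like the one below can help. The name `iris_dataset` is again an assumption for this example. The position of each name in target_names is the integer label used in target.

```python
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# The index of each name in target_names is its integer label (0, 1, 2).
for label, name in enumerate(iris_dataset["target_names"]):
    print(label, "is for", name)

# The four measurements recorded for every flower.
print("Feature names:", iris_dataset["feature_names"])

# First five rows of measurements and the label of every flower.
print("First five rows:\n", iris_dataset["data"][:5])
print("Labels:", iris_dataset["target"])
```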

STEP 3- SPLIT INTO TRAINING AND TEST

The train_test_split function shuffles the dataset using a pseudorandom number generator before splitting it (why did we need this? Answer in the comments). The random_state parameter is just the seed for that generator, so a fixed value gives the same split on every run. The output is X_train, X_test, y_train and y_test (these are all NumPy arrays). By default, the training set contains 75% of the data and the test set contains the remaining 25%.
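A minimal sketch of the split is shown here. The seed value `random_state=0` is an assumption for this example (any fixed integer makes the split reproducible), and the 75%/25% proportions come from the function's defaults.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()

# Shuffle and split; random_state=0 is an assumed seed so the split is repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset["data"], iris_dataset["target"], random_state=0)

# Of the 150 flowers, 112 go to the training set and 38 to the test set.
print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)
print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)
```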

INTERMEDIATE STEP- LET US TAKE A LOOK AT THE DATA

We use a scatter matrix (a grid of pairwise scatter plots of the features), which shows our data like this:
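One possible way to draw such a scatter matrix is sketched below, using pandas' plotting helper on the training data. The styling arguments, the `iris_dataframe` name, the output file name and the "Agg" backend line are all assumptions made for this example. In a notebook you can drop the backend line and the plot will appear inline.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: file-only backend so this also runs without a display
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset["data"], iris_dataset["target"], random_state=0)

# Put the training data in a DataFrame so every column carries its feature name.
iris_dataframe = pd.DataFrame(X_train, columns=iris_dataset["feature_names"])

# One scatter plot per pair of features, colored by species label (y_train);
# the diagonal shows a histogram of each single feature.
axes = pd.plotting.scatter_matrix(
    iris_dataframe, c=y_train, figsize=(10, 10), marker="o",
    hist_kwds={"bins": 20}, s=60, alpha=0.8)

# Save the figure to a file (hypothetical file name).
plt.savefig("iris_scatter_matrix.png")
```

If the three colors form mostly separate groups in these plots, a model based on nearby points has a good chance of telling the species apart.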

STEP 4- BUILDING THE KNN MODEL

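There is a lot going on here, so let us go through it step by step:

knn- an object that holds the algorithm used to build the model from the training set, as well as the algorithm used to make predictions on new data points.
KNeighborsClassifier- the class that implements the algorithm. For KNN, building the model just means storing the training set.
As mentioned earlier, scikit-learn contains many algorithms, so we import KNeighborsClassifier from it (from the sklearn.neighbors module).
n_neighbors- tells the model how many nearest neighbors to consult when it makes a prediction.
knn.fit- trains the model on the training data. It returns the knn object itself (modified in place), which is why we see a string representation of our classifier as the output.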
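The points above can be sketched in code as follows. The value `n_neighbors=1` and the seed `random_state=0` are assumptions for this example, and the `fitted` variable exists only to show what fit returns.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset["data"], iris_dataset["target"], random_state=0)

# Create the classifier; n_neighbors=1 (assumed here) means each prediction
# copies the label of the single closest training point.
knn = KNeighborsClassifier(n_neighbors=1)

# fit stores the training set inside knn and returns that same object.
fitted = knn.fit(X_train, y_train)

print("fit returned the knn object itself:", fitted is knn)
print(knn)
```

Choosing a larger n_neighbors makes each prediction a majority vote among several nearby training points instead of copying just one.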

STEP 5- MAKING PREDICTIONS (PROBABLY THE LAST STEP. PROBABLY :))

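I gave the model a new data point with the measurements 5, 2.9, 1 and 0.2 (sepal length, sepal width, petal length and petal width, in cm). You can give values of your liking. The .shape attribute (it is an attribute, not a method) tells the number of rows and columns, which happen to be 1 and 4 for the data I gave right now.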
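A sketch of this prediction step is given below, reusing the same assumed settings as before (`random_state=0`, `n_neighbors=1`). The names `X_new` and `prediction` are chosen for this example. Note that the measurements go into a two-dimensional array, because scikit-learn expects one row per flower.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset["data"], iris_dataset["target"], random_state=0)

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

# One new flower: sepal length, sepal width, petal length, petal width (cm).
X_new = np.array([[5, 2.9, 1, 0.2]])
print("X_new.shape:", X_new.shape)  # 1 row, 4 columns

# predict returns an array with one integer label per row of X_new.
prediction = knn.predict(X_new)
print("Prediction:", prediction)

# Translate the integer label back into the species name.
print("Predicted species:", iris_dataset["target_names"][prediction])
```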