KNN algorithm for beginners- Building an Iris species prediction model
If you are a beginner, this is a step-by-step guide to the KNN (k-nearest neighbors) algorithm, where we use KNN to predict the species of an iris flower. KNN is one of the simplest supervised ML algorithms. The accompanying picture shows an iris plant. The dataset covers 3 species of iris, which can be told apart by the dimensions of their sepals and petals.
Here we build a simple model that predicts which species a flower with given dimensions belongs to, with an explanation of the code to help you throughout.
STEP 1- LOADING THE NECESSARY LIBRARIES
- Pandas- You have probably heard a lot about it, but what actually is it? The name comes from “panel data”, and working with it feels much like working with an Excel sheet: pandas is a way to look at tabular data and make changes to it. It is built around a data structure called DataFrame (modeled after R's data.frame). You can ingest data from formats like CSV files or SQL databases. (A great book for learning pandas is Python for Data Analysis by Wes McKinney.)
- NumPy- Used for scientific computing in Python. It provides multidimensional arrays and high-level mathematical functions such as Fourier transforms and pseudorandom number generators.
- Scikit-learn- It is an open source project (so free to use and distribute). It contains many widely used machine learning algorithms, along with documentation for learning them. In scikit-learn, the NumPy array is the fundamental data structure (DS).
- Matplotlib- As the name suggests, this is the primary scientific plotting library in Python. You can use it to make histograms, scatter plots etc.
- SciPy- It is a collection of functions for scientific computing in Python. For our purposes, the most important part of SciPy is scipy.sparse, which provides sparse matrices (another DS apart from the NumPy array).
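To make the three data structures mentioned above concrete, here is a minimal sketch. The variable names and the tiny made-up table are only for illustration and are not part of the Iris dataset.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# NumPy: a 2-D array with 2 rows and 3 columns
arr = np.array([[1, 2, 3], [4, 5, 6]])
print("NumPy array shape:", arr.shape)

# SciPy: a sparse matrix stores only the non-zero entries
eye = np.eye(4)                      # 4x4 matrix, ones on the diagonal
sparse_eye = sparse.csr_matrix(eye)  # compressed sparse row format
print("Non-zero entries stored:", sparse_eye.nnz)

# pandas: a small table (illustrative values only)
df = pd.DataFrame({"name": ["a", "b"], "petal_length": [1.4, 4.7]})
long_petals = df[df["petal_length"] > 2.0]  # filter rows, much like in a spreadsheet
print(long_petals)
```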
So down below is a picture where I load all the libraries and check their versions.
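In text form, a version check along those lines might look like the sketch below. The `versions` dictionary is just a convenient way to print everything in one loop. The numbers you see will depend on your own installation.

```python
import sys

import matplotlib
import numpy as np
import pandas as pd
import scipy
import sklearn

# Collect the version string of each library used in this article
versions = {
    "pandas": pd.__version__,
    "NumPy": np.__version__,
    "scikit-learn": sklearn.__version__,
    "matplotlib": matplotlib.__version__,
    "SciPy": scipy.__version__,
}

print("Python version:", sys.version)
for name, version in versions.items():
    print(f"{name} version: {version}")
```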
STEP 2- MEET THE DATA
What happens in this step?
The load_iris() function returns a Bunch object, which is much like a dictionary: it contains keys and values. Because it behaves like a dictionary, I can print its keys next.
What is DESCR? It is a short description of the iris dataset.
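A minimal sketch of loading the dataset and inspecting it is shown below. The variable name `iris_dataset` is my own choice, and only the first few hundred characters of DESCR are printed to keep the output short.

```python
from sklearn.datasets import load_iris

# load_iris() returns a Bunch, which can be used like a dictionary
iris_dataset = load_iris()

# The keys tell us what the Bunch contains
print("Keys of iris_dataset:", list(iris_dataset.keys()))

# DESCR is a plain-text description of the dataset
print(iris_dataset["DESCR"][:300])

# The measurements themselves: one row per flower, one column per feature
print("Feature names:", iris_dataset["feature_names"])
print("Shape of data:", iris_dataset["data"].shape)
```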
A sneak peek into our data (don’t be confused: 0, 1 and 2 are the labels of the iris species.
0 is for setosa
1 is for versicolor
2 is for virginica)
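The label-to-species mapping above can be read straight from the dataset. This is a small sketch, and `label_to_name` is a helper dictionary I introduce only for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# target holds one integer label per flower, target_names holds the species names
target = iris_dataset["target"]
target_names = iris_dataset["target_names"]

# Build a label -> species lookup (hypothetical helper for readability)
label_to_name = {label: str(name) for label, name in enumerate(target_names)}
print(label_to_name)

# How many flowers carry each label
print("Flowers per label:", np.bincount(target))
```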
STEP 3- SPLIT INTO TRAINING AND TEST SETS
The train_test_split function shuffles the dataset using a pseudorandom number generator (why did we need this? Answer in the comments). The random_state parameter is just the seed for that generator, so a fixed value gives the same split every time. The output is X_train, X_test, y_train, y_test (these are all NumPy arrays). By default, the training set contains 75% of the data and the test set contains the remaining 25%.
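A minimal sketch of the split is shown below. The value `random_state=0` is an assumption on my part. Any fixed integer works, but different seeds give different splits.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()

# Shuffle and split; with no test_size given, 25% of the rows go to the test set.
# random_state=0 is an assumed seed, chosen only to make the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset["data"], iris_dataset["target"], random_state=0
)

print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)
print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)
```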
INTERMEDIATE STEP- LET US TAKE A LOOK AT THE DATA
We use a scatter matrix, which plots every pair of features against each other so we can see how well the species separate.
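One way to draw such a scatter matrix is with pandas, as sketched below. The figure size, marker size, transparency and the `random_state=0` seed are my own assumptions, so tune them to taste.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset["data"], iris_dataset["target"], random_state=0  # assumed seed
)

# Put the training data in a DataFrame so the columns carry the feature names
iris_dataframe = pd.DataFrame(X_train, columns=iris_dataset["feature_names"])

# One scatter plot per pair of features, colored by species label;
# the diagonal shows a histogram of each single feature
axes = pd.plotting.scatter_matrix(
    iris_dataframe,
    c=y_train,
    figsize=(10, 10),
    marker="o",
    hist_kwds={"bins": 20},
    s=40,
    alpha=0.8,
)
```

In a script, call matplotlib's `plt.show()` afterwards to display the figure. In a notebook it usually appears on its own.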
STEP 4- BUILDING THE KNN MODEL
A lot is going on here, so let us go step by step:
- knn- it is an object that holds the algorithm used to build the model from the training set and to make predictions on new data points.
- KNeighborsClassifier- the classifier class. For this algorithm, building the model just means storing the training set.
- As mentioned earlier, scikit-learn contains many algorithms, so we import KNeighborsClassifier from it (from the sklearn.neighbors module).
- n_neighbors tells the classifier how many neighbors to consult when making a prediction.
- knn.fit- fits the model on the training data and returns the knn object itself, which is why we see a string representation of our classifier as output.
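The points above can be sketched as follows. Note that `n_neighbors=1` and `random_state=0` are assumptions on my part. The right number of neighbors depends on your data, so try a few values.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset["data"], iris_dataset["target"], random_state=0  # assumed seed
)

# Create the classifier; n_neighbors=1 is an assumed setting for this sketch
knn = KNeighborsClassifier(n_neighbors=1)

# fit() stores the training set and returns the classifier object itself
fitted = knn.fit(X_train, y_train)
print(fitted)
```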
STEP 5- MAKING PREDICTIONS (PROBABLY THE LAST STEP. PROBABLY :))
I made up a new flower with the measurements 5, 2.9, 1, 0.2 (sepal length, sepal width, petal length and petal width, in cm). You can pick values as per your liking. The .shape attribute tells the number of rows and columns, which happen to be 1 and 4 for the data I gave right now. Scikit-learn expects a two-dimensional array even for a single sample.
To make a prediction, we use the predict() method of the knn object.
As you can see, it gives me the prediction 0, which is the label for ‘setosa’. Our model is made!
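Here is a minimal sketch of this prediction step. The measurements are the ones from the text, while `n_neighbors=1`, `random_state=0` and the variable names are my assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset["data"], iris_dataset["target"], random_state=0  # assumed seed
)
knn = KNeighborsClassifier(n_neighbors=1)  # assumed setting
knn.fit(X_train, y_train)

# One new flower: note the double brackets, which make a 2-D array of shape (1, 4)
X_new = np.array([[5, 2.9, 1, 0.2]])
print("X_new.shape:", X_new.shape)

# predict() returns one label per row of X_new
prediction = knn.predict(X_new)
predicted_name = iris_dataset["target_names"][prediction[0]]
print("Prediction:", prediction)
print("Predicted species:", predicted_name)
```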
The last step is to see how accurate our model is. You can do it in 2 ways:
- using the score() method of the knn object
- using np.mean() to compute the fraction of test predictions that match y_test
As you see, our model gives 0.97 as output, which means 97% accuracy on the test set. So, it works pretty well.
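Both ways of measuring accuracy can be sketched as below. As before, `n_neighbors=1` and `random_state=0` are assumptions, and the exact number you get depends on the split and on the number of neighbors.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset["data"], iris_dataset["target"], random_state=0  # assumed seed
)
knn = KNeighborsClassifier(n_neighbors=1)  # assumed setting
knn.fit(X_train, y_train)

# Way 1: score() predicts on X_test and compares with y_test internally
score_accuracy = knn.score(X_test, y_test)

# Way 2: predict ourselves, then take the mean of the True/False comparison
y_pred = knn.predict(X_test)
mean_accuracy = np.mean(y_pred == y_test)

print("Accuracy via score():  {:.2f}".format(score_accuracy))
print("Accuracy via np.mean(): {:.2f}".format(mean_accuracy))
```

Both numbers are the same quantity, the fraction of test flowers whose species was predicted correctly, so they should always agree.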
I hope this article was useful for you. I am open to solving any doubts which you may have.
Keep Learning.