k-Nearest Neighbors (kNN) is a non-parametric, lazy learning algorithm.

  • Non-parametric means it doesn't make any assumptions about the underlying data distribution.
  • Lazy learning means it doesn't build a "model" during training. Instead, it just stores the entire training dataset. The real work happens during prediction.

The core idea is simple: A data point is likely to be similar to the data points closest to it. 🏡

How kNN Works

To classify a new, unseen data point, kNN follows these steps:

  1. Choose a value for 'k': 'k' is the number of neighbors to consider. This is a hyperparameter you choose. Let's say we pick k=5.
  2. Calculate the distance: Find the distance between the new data point and every point in the training dataset. The most common distance metric is Euclidean distance:

     d(p, q) = √((q₁ − p₁)² + (q₂ − p₂)² + ⋯ + (qₙ − pₙ)²)

  3. Find the 'k' nearest neighbors: Identify the 'k' data points from the training set that have the smallest distances to the new point.
  4. Vote for the class: For classification, look at the class labels of these 'k' neighbors. The new data point is assigned the class with the most votes (the majority class) among its neighbors. A from-scratch sketch of these steps follows this list.

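To make the steps concrete, here is a minimal from-scratch sketch of the algorithm in plain NumPy. The function name knn_predict and the toy arrays are illustrative assumptions, not part of any library:

Python

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    # Step 2: Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 3: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote among the k neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy data (made up for illustration): two features, two classes
X_train = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [6.0, 7.0], [7.0, 8.0]])
y_train = np.array([0, 0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([2.5, 3.0]), k=3))  # -> 0
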
The Importance of 'k' and Feature Scaling

  • Choosing 'k': A small 'k' (e.g., k=1) makes the model very sensitive to noise and outliers, leading to high variance. A large 'k' oversmooths the decision boundary, leading to high bias. The right 'k' is usually found through experimentation, typically with cross-validation (see the sketch after this list).
  • Feature Scaling: Since kNN relies on distances, features with larger scales dominate the calculation. For example, a salary feature (e.g., 50,000) will have a much bigger influence than an age feature (e.g., 30). It is crucial to scale your data (e.g., using standardization or normalization) before applying kNN.

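One common way to run that experiment for 'k' is a grid search with cross-validation. Below is a minimal sketch using scikit-learn's GridSearchCV; the synthetic dataset, the parameter grid, and cv=5 are illustrative assumptions, not requirements:

Python

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

# Synthetic data just to make the sketch runnable
X, y = make_classification(n_samples=200, n_features=4, random_state=42)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier())
])

# Search over odd values of k (odd k avoids vote ties in binary classification)
param_grid = {'knn__n_neighbors': [1, 3, 5, 7, 9, 11]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)  # e.g. {'knn__n_neighbors': 5}
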
Python


# Python code with scikit-learn
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np

# Sample Data: Classifying fruit based on width and height
# Features: [width, height], Label: 0=Apple, 1=Orange
X = np.array([[6, 6], [7, 5], [5.5, 7], [8, 8], [9, 7.5], [8.5, 9]])
y = np.array([0, 0, 0, 1, 1, 1])

# Create a pipeline to scale data and then apply kNN
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=3))
])

# Fit the pipeline (the scaler learns its statistics; kNN just stores the data)
pipeline.fit(X, y)

# Predict a new fruit
new_fruit = [[7, 7]]
prediction = pipeline.predict(new_fruit)
print(f"The new fruit at {new_fruit[0]} is predicted as: {'Apple' if prediction[0] == 0 else 'Orange'}")

Note: kNN can also be used for regression. Instead of a majority vote, the prediction is the mean (or distance-weighted mean) of the 'k' nearest neighbors' target values.
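
For completeness, a minimal regression sketch with scikit-learn's KNeighborsRegressor; the toy house-price data is made up for illustration:

Python

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy data (made up): predict a house price (in $1000s) from its size (in m²)
X = np.array([[50], [60], [80], [100], [120]])
y = np.array([150, 180, 240, 300, 360])

reg = KNeighborsRegressor(n_neighbors=2)
reg.fit(X, y)

# The prediction is the mean of the 2 nearest neighbors' prices
print(reg.predict([[70]]))  # -> [210.]  (average of 180 and 240)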