This post gives an overview of k-Nearest Neighbor (kNN) classification using the Wisconsin Breast Cancer Dataset from the UCI Machine Learning Repository. k-Nearest Neighbor is a non-parametric, instance-based learning algorithm: it "memorizes" the training data and classifies a new instance by majority vote among the k training instances most similar to it.
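To make the idea concrete, here is a minimal sketch of the classification step in base R. It assumes Euclidean distance and a simple majority vote; `knn_sketch` and the toy data are hypothetical names and values used only for illustration, not part of the `class` or `caret` packages.

```r
# Minimal k-NN classifier in base R (illustrative; knn_sketch is a hypothetical helper)
knn_sketch <- function(train_x, train_y, new_x, k = 3) {
  train_x <- as.matrix(train_x)
  # Euclidean distance from the new instance to every training instance
  dists <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  # Labels of the k closest training instances
  nearest <- train_y[order(dists)[seq_len(k)]]
  # Majority vote; ties go to whichever label table() lists first
  names(which.max(table(nearest)))
}

# Toy data: two well-separated groups in two dimensions (made-up values)
toy_x <- rbind(c(1, 1), c(1, 2), c(2, 1), c(8, 8), c(8, 9), c(9, 8))
toy_y <- c("Benign", "Benign", "Benign", "Malignant", "Malignant", "Malignant")

pred_low  <- knn_sketch(toy_x, toy_y, new_x = c(2, 2), k = 3)
pred_high <- knn_sketch(toy_x, toy_y, new_x = c(7, 9), k = 3)
```

There is no model-fitting step: all the work happens at prediction time, which is why kNN is called instance-based.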
The Wisconsin Breast Cancer Data, collected by Dr. William H. Wolberg at the University of Wisconsin Hospitals, Madison, has 11 columns: a sample code number, nine cytological attributes of a fine needle aspirate (FNA) of a breast mass, each scored from 1 to 10, and the class label. The nine cell attributes are clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli and mitoses. These features are used to differentiate between benign and malignant samples, as recorded in the Class variable. For further details about feature definition and data collection, see publication.
Import packages
library(class)
library(caret)
Reading and Cleaning Wisconsin Breast Cancer Dataset from UCI Machine Learning Repository
breast_cancer <- read.csv("https://raw.githubusercontent.com/azkajavaid/BreastCancerWisconsinData-UCI/master/BreastCancerData.txt?token=ANkFlyu6ncAg-xoOxqhgB6wSw0GRnuBOks5atcg2wA%3D%3D")
colnames(breast_cancer) <- c("Code", "ClumpThickness", "UniformCellSize", "UniformCellShape", "MarginalAdhesion", "EpithelialCellSize", "BareNuclei", "BlandChromatin", "NormalNucleoli", "Mitoses", "Class")
# Recode Class (2 = benign, 4 = malignant) as a labeled factor; base R, so revalue() from plyr is not needed
breast_cancer$Class <- factor(breast_cancer$Class, levels = c(2, 4), labels = c("Benign", "Malignant"))
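breast_cancer <- breast_cancer[breast_cancer$BareNuclei != "?", ] # dropping all observations with BareNuclei value of "?"
# as.character() first: if BareNuclei was read as a factor, as.integer() alone would return level codes, not the values
breast_cancer$BareNuclei <- as.integer(as.character(breast_cancer$BareNuclei))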
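An alternative way to handle the "?" placeholder is to declare it as a missing-value marker when reading the file, so the column is parsed as integer from the start. The sketch below uses a few made-up rows in the same 11-column layout rather than the real file, and it assumes the file has no header row.

```r
# Toy rows in the same 11-column layout; the values are made up for illustration
raw <- "1001,5,1,1,1,2,1,3,1,1,2
1002,8,10,10,8,7,?,9,7,1,4
1003,3,1,1,1,2,2,3,1,1,2"

# na.strings = "?" turns the placeholder into NA, so column 7 is read as integer
toy <- read.csv(text = raw, header = FALSE, na.strings = "?")
colnames(toy)[7] <- "BareNuclei"

# Keep only rows where BareNuclei is present (same effect as dropping the "?" rows)
toy_clean <- toy[!is.na(toy$BareNuclei), ]
```

With this approach no character-to-integer conversion is needed afterwards, which sidesteps the factor level-code pitfall entirely.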
Split data in training and test sets
set.seed(90)
index <- createDataPartition(breast_cancer$Class, p = 0.75, list = FALSE, times = 1)
train <- breast_cancer[index, ]
test <- breast_cancer[-index, ]
paste("Observations in training data: ", nrow(train), sep = "")
## [1] "Observations in training data: 513"
paste("Observations in testing data: ", nrow(test), sep = "")
## [1] "Observations in testing data: 169"
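createDataPartition samples within each level of the outcome, so the benign/malignant balance is similar in both sets. The sketch below imitates that idea in base R on a made-up outcome vector; `toy_class` and the 130/70 class counts are assumptions for illustration, and the code is not caret's actual implementation.

```r
set.seed(90)
# Toy outcome with a 65/35 class balance (made-up counts)
toy_class <- factor(rep(c("Benign", "Malignant"), times = c(130, 70)))

# Sample 75% of the row indices within each class, then combine them
idx_by_class <- split(seq_along(toy_class), toy_class)
train_idx <- unlist(lapply(idx_by_class, function(i) {
  sample(i, size = ceiling(0.75 * length(i)))
}))

# Class proportions in the training and test portions
train_prop <- prop.table(table(toy_class[train_idx]))
test_prop  <- prop.table(table(toy_class[-train_idx]))
```

Comparing `train_prop` and `test_prop` is a quick check that neither set ended up with a skewed class balance.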
Applying k-Nearest Neighbor (knn) Classification to predict Class
trainData <- train[, 2:10] # columns 2 to 10 are the nine predictors; Code (column 1) and the outcome Class (column 11) are excluded
testData <- test[, c(2:10)]
knn_pred <- knn(train = trainData, test = testData, cl = train$Class, k = 3)
acc_tab <- base::table(knn_pred, test$Class)
round(sum(diag(acc_tab))/sum(acc_tab), 3)
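The accuracy above is the share of the confusion table that lies on its diagonal. For a diagnostic task it can also help to look at each class separately. The sketch below computes accuracy, sensitivity and specificity from a small table of made-up labels, assuming "Malignant" is treated as the positive class; the vectors `predicted` and `actual` are hypothetical.

```r
# Made-up predictions and true labels, for illustration only
lv <- c("Benign", "Malignant")
predicted <- factor(c("Benign", "Benign", "Malignant", "Malignant", "Benign", "Malignant"), levels = lv)
actual    <- factor(c("Benign", "Benign", "Malignant", "Benign", "Malignant", "Malignant"), levels = lv)

tab <- table(predicted, actual) # rows = predictions, columns = true classes

accuracy    <- sum(diag(tab)) / sum(tab)
# Share of truly malignant samples that were predicted malignant
sensitivity <- tab["Malignant", "Malignant"] / sum(tab[, "Malignant"])
# Share of truly benign samples that were predicted benign
specificity <- tab["Benign", "Benign"] / sum(tab[, "Benign"])
```

The same three lines can be applied to acc_tab, since it has the same layout (predictions in rows, true classes in columns).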
Alternatively, applying knn using caret
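# With default settings, caret tunes k over a small grid using bootstrap resampling and keeps the best value
knn_mod <- train(trainData, train$Class, method = "knn")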
knn_pred <- predict(knn_mod, testData)
acc_tab <- base::table(knn_pred, test$Class)
round(sum(diag(acc_tab))/sum(acc_tab), 3)
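The main difference from the direct knn call is that k is chosen by resampling instead of being fixed at 3. As a rough illustration of that idea, the sketch below picks k by leave-one-out cross-validation with knn.cv from the class package. This is a swapped-in technique, not caret's own procedure, and it uses the built-in iris data as a stand-in; `candidate_k` is an arbitrary grid chosen for the example.

```r
library(class)

set.seed(90) # knn.cv breaks voting ties at random
x <- iris[, 1:4]  # stand-in predictors (built-in dataset)
y <- iris$Species # stand-in outcome

candidate_k <- c(1, 3, 5, 7, 9)
# Leave-one-out accuracy for each candidate k
loo_acc <- sapply(candidate_k, function(k) {
  mean(knn.cv(train = x, cl = y, k = k) == y)
})
best_k <- candidate_k[which.max(loo_acc)]
```

Odd values of k are a common choice for two-class problems because they avoid tied votes.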