This post gives an overview of Naive Bayes using the Wisconsin Breast Cancer Dataset from the UCI Machine Learning Repository. The Naive Bayes algorithm applies Bayes' theorem under a simplifying ("naive") assumption: given the class, the effect of each attribute is independent of the values of the other attributes.
The Wisconsin Breast Cancer Data, collected by Dr. William H. Wolberg at the University of Wisconsin Hospitals, Madison, has 11 columns: a sample code number, nine cytological attributes of a fine needle aspirate (FNA) of a breast mass, each scored from 1 to 10, and the class label. The nine attributes are clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli and mitoses. These features are used to differentiate between benign and malignant samples, as recorded in the Class variable (2 = benign, 4 = malignant). For further details about feature definition and data collection, see the publication.


Import packages

library(plyr) # loading plyr for the revalue function
library(e1071) # loading naiveBayes function


Reading and Cleaning Wisconsin Breast Cancer Dataset from UCI Machine Learning Repository

# The file has no header row, so header = FALSE keeps the first record as data
breast_cancer <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data", header = FALSE)
colnames(breast_cancer) <- c("Code", "ClumpThickness", "UniformCellSize", "UniformCellShape", "MarginalAdhesion", "EpithelialCellSize", "BareNuclei", "BlandChromatin", "NormalNucleoli", "Mitoses", "Class")
breast_cancer$Class <- as.character(breast_cancer$Class)
breast_cancer$Class <- revalue(breast_cancer$Class, c("2" = "Benign", "4" = "Malignant"))
breast_cancer$Class <- as.factor(breast_cancer$Class)
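# Drop all observations whose BareNuclei value is missing (coded as "?")
breast_cancer <- breast_cancer[!(breast_cancer$BareNuclei == "?"), ]
# Convert via character so that a factor column yields its values, not its level codes
breast_cancer$BareNuclei <- as.integer(as.character(breast_cancer$BareNuclei))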
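
An alternative way to handle the "?" markers is to let read.csv treat them as missing values at import time, so the column is read as integer directly. The sketch below illustrates the idea on three made-up rows in the same 11-column layout (the rows, the `raw_text` and `toy` names are assumptions for illustration, not records from the UCI file).

```r
# Hypothetical rows in the 11-column layout of the data file (not real records)
raw_text <- "1,5,1,1,1,2,1,3,1,1,2
2,8,7,5,10,7,?,5,5,4,4
3,3,1,1,1,2,2,3,1,1,2"

# na.strings = "?" turns the missing-value marker into NA while reading
toy <- read.csv(text = raw_text, header = FALSE, na.strings = "?")
colnames(toy) <- c("Code", "ClumpThickness", "UniformCellSize", "UniformCellShape",
                   "MarginalAdhesion", "EpithelialCellSize", "BareNuclei",
                   "BlandChromatin", "NormalNucleoli", "Mitoses", "Class")

# Drop incomplete rows; here that is the row whose BareNuclei was "?"
toy <- na.omit(toy)

# Recode 2/4 into labelled factor levels in a single step
toy$Class <- factor(toy$Class, levels = c(2, 4), labels = c("Benign", "Malignant"))
str(toy)
```

With this approach no character-to-integer conversion is needed afterwards, and the recoding does not depend on plyr.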


Split Data into Training (75%) and Test (25%) Sets

n <- nrow(breast_cancer)
index <- sample(seq_len(n), size = round(0.75 * n), replace = FALSE)
train <- breast_cancer[index, ]
test <- breast_cancer[-index, ]
paste("Observations in training data: ", nrow(train), sep = "")
## [1] "Observations in training data: 512"
paste("Observations in testing data: ", nrow(test), sep = "")
## [1] "Observations in testing data: 170"
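
Because sample() draws at random, each run of the split above selects different rows. A minimal sketch of a reproducible split follows, assuming a toy data frame and a hypothetical helper `split_data` with an arbitrary seed of 42 (none of these names come from the original analysis).

```r
# Toy stand-in for breast_cancer: 20 rows with an id column (hypothetical data)
toy <- data.frame(id = 1:20, value = seq(0.5, 10, by = 0.5))

# Hypothetical helper: fix the seed, then split by row index
split_data <- function(df, train_fraction = 0.75, seed = 42) {
  set.seed(seed)                       # fix the random number stream
  n <- nrow(df)
  index <- sample(seq_len(n), size = round(train_fraction * n))
  list(train = df[index, ], test = df[-index, ])
}

first  <- split_data(toy)
second <- split_data(toy)

identical(first$train$id, second$train$id)  # same seed, same training rows
nrow(first$train)                           # 15 of the 20 rows
nrow(first$test)                            # the remaining 5 rows
```

Calling set.seed() once before sample() in the main script would have the same effect on the real data.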


Fit a Naive Bayes model to predict cancer Class

naive <- naiveBayes(Class ~ ., data = train[, -1]) # drop the Code identifier column
naive
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##    Benign Malignant 
## 0.6757812 0.3242188 
## 
## Conditional probabilities:
##            ClumpThickness
## Y               [,1]     [,2]
##   Benign    3.008671 1.667224
##   Malignant 7.295181 2.432776
## 
##            UniformCellSize
## Y               [,1]      [,2]
##   Benign    1.326590 0.8649292
##   Malignant 6.487952 2.7385861
## 
##            UniformCellShape
## Y               [,1]      [,2]
##   Benign    1.384393 0.8877193
##   Malignant 6.542169 2.5455901
## 
##            MarginalAdhesion
## Y               [,1]      [,2]
##   Benign    1.355491 0.9406667
##   Malignant 5.578313 3.2140618
## 
##            EpithelialCellSize
## Y               [,1]      [,2]
##   Benign    2.101156 0.8771818
##   Malignant 5.403614 2.5701761
## 
##            BareNuclei
## Y               [,1]     [,2]
##   Benign    2.419075 1.177364
##   Malignant 4.614458 2.673467
## 
##            BlandChromatin
## Y               [,1]     [,2]
##   Benign    2.095376 1.068458
##   Malignant 5.939759 2.316473
## 
##            NormalNucleoli
## Y               [,1]      [,2]
##   Benign    1.286127 0.9849202
##   Malignant 5.795181 3.4702512
## 
##            Mitoses
## Y               [,1]      [,2]
##   Benign    1.069364 0.5338818
##   Malignant 2.536145 2.5693805
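
For numeric predictors, naiveBayes reports the per-class mean in column [,1] and the standard deviation in column [,2], and treats each predictor as Gaussian within a class. The sketch below shows how those numbers enter Bayes' theorem, using only the priors and the ClumpThickness rows printed above. The score `x = 8` is a hypothetical value chosen for illustration, and the calculation uses a single feature, whereas the fitted model multiplies the likelihoods of all nine.

```r
# Values copied from the model printout above
prior      <- c(Benign = 0.6757812, Malignant = 0.3242188)
clump_mean <- c(Benign = 3.008671,  Malignant = 7.295181)
clump_sd   <- c(Benign = 1.667224,  Malignant = 2.432776)

x <- 8  # hypothetical ClumpThickness score

# Gaussian likelihood of x under each class
likelihood <- dnorm(x, mean = clump_mean, sd = clump_sd)

# Bayes' theorem: posterior is proportional to prior times likelihood
unnormalised <- prior * likelihood
posterior <- unnormalised / sum(unnormalised)
round(posterior, 3)
```

A thick clump sits far from the benign mean and close to the malignant one, so the malignant class receives most of the posterior probability despite its smaller prior.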


Predict Class on the test data

prediction <- predict(naive, test, type = "class")
naive_tab <- base::table(prediction, test$Class)
naive_tab
##            
## prediction  Benign Malignant
##   Benign        92         2
##   Malignant      5        71
round(sum(diag(naive_tab))/sum(naive_tab), 3)
## [1] 0.959
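
Accuracy alone does not show which kind of error the model makes. As a sketch, the confusion matrix above can be rebuilt by hand to derive sensitivity and specificity, treating Malignant as the positive class (that choice, and the variable names, are assumptions made here).

```r
# Confusion matrix copied from the output above: rows = prediction, columns = actual
naive_tab <- matrix(c(92, 5, 2, 71), nrow = 2,
                    dimnames = list(prediction = c("Benign", "Malignant"),
                                    actual     = c("Benign", "Malignant")))

# Share of all test samples classified correctly
accuracy <- sum(diag(naive_tab)) / sum(naive_tab)

# Share of truly malignant samples that were predicted Malignant
sensitivity <- naive_tab["Malignant", "Malignant"] / sum(naive_tab[, "Malignant"])

# Share of truly benign samples that were predicted Benign
specificity <- naive_tab["Benign", "Benign"] / sum(naive_tab[, "Benign"])

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)
# accuracy 0.959, sensitivity 0.973, specificity 0.948
```

In this run, 2 of the 73 malignant samples were missed and 5 of the 97 benign samples were flagged as malignant.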