Flights Delay Shiny Application

The Flights Delay Application studies United States flight delays of over 90 minutes from 2008-2016, using data from the United States Department of Transportation's Bureau of Transportation Statistics. In 2015 alone, severe weather and security concerns resulted in delays of about 17.5 million minutes, and undetermined causes like a previously delayed …
more ...

Function Name Conflicts Shiny Application

The Function Name Conflicts Application uses the R documentation API to identify function conflicts (functions with the same name) between two packages. For example, since the function mutate is found in both plyr and dplyr, it is identified as a conflicting function. An awareness of same-name functions could alleviate conflicts resulting from …
more ...
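
As a rough illustration of the idea (not the RDocumentation API call the app itself makes), conflicting exports between two locally installed packages can be found by intersecting their namespace exports; plyr and dplyr are used here only as an example:

# Assumes plyr and dplyr are installed locally; the app instead queries the R documentation API
exports_plyr <- getNamespaceExports("plyr")
exports_dplyr <- getNamespaceExports("dplyr")
sort(intersect(exports_plyr, exports_dplyr)) # functions exported by both packages, e.g. mutate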

Location Explorer Shiny Application

The Location Explorer Application uses the Google Places, Eventbrite, Meetup and Yelp APIs to provide a coherent and systematic map of events, transportation, food and business vendors for a specified geographic location. The application is hosted via shinyapps.io, RStudio's hosting service for Shiny apps, accessible at Shiny Application. This application uses the httr …
more ...
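
A minimal sketch of the kind of httr request involved, using the Yelp business search endpoint as an assumed example (the endpoint, query parameters and the YOUR_API_KEY placeholder are illustrative, not the app's exact calls):

library(httr)
library(jsonlite)

# Illustrative request; replace YOUR_API_KEY with a real Yelp Fusion key
resp <- GET("https://api.yelp.com/v3/businesses/search",
            add_headers(Authorization = paste("Bearer", "YOUR_API_KEY")),
            query = list(term = "coffee", location = "New York, NY", limit = 5))
businesses <- fromJSON(content(resp, as = "text", encoding = "UTF-8"), flatten = TRUE)$businesses
head(businesses[, c("name", "rating")])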

Twitter Analytics Shiny Application

The Twitter Analytics Application uses the twitteR package to perform a live analysis of a user's profile (timeline and favorite tweets) and a topic's tweets. This analysis is accomplished via sentiment, text, emoji and geographic analysis of a user's tweets, as well as their tweeting habits across time metrics like hour, week, month and year …
more ...
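
A minimal sketch of pulling a user's timeline with twitteR, assuming registered Twitter app credentials (the placeholders below are not real keys) and a handle chosen only for illustration:

library(twitteR)

setup_twitter_oauth("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")
tweets <- userTimeline("rstudio", n = 50)   # recent tweets from a user's timeline
tweets_df <- twListToDF(tweets)             # convert the list of status objects to a data frame
table(format(tweets_df$created, "%H"))      # tweeting habits by hour of day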

About Me


"image"

Hi! I am Azka Javaid. I recently graduated from Amherst College and currently work as a Data Scientist at IBM within Watson Health Oncology. At Watson Health, I have developed an acute understanding of the health domain, business intuition and client needs. This understanding has allowed me to better communicate …

more ...

Bigram Analysis


For this bigram analysis, The Time Machine, a science fiction piece by H.G. Wells, was analyzed using Project Gutenberg, which offers over 56,000 free e-books. The gutenbergr package downloads and processes public domain works from Project Gutenberg. All other works from Project Gutenberg can be retrieved …
more ...
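
A minimal sketch of the download-and-tokenize step, assuming gutenbergr and tidytext; the Gutenberg ID is looked up by title rather than hard-coded:

library(gutenbergr)
library(tidytext)
library(dplyr)

tm_id <- gutenberg_works(title == "The Time Machine")$gutenberg_id[1] # resolve the book ID by title
time_machine <- gutenberg_download(tm_id)
bigrams <- time_machine %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% # split text into two-word sequences
  count(bigram, sort = TRUE)
head(bigrams)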

Decision Tree (Classification)

The following post shows an overview of Decision Trees using the Wisconsin Breast Cancer Dataset from the UCI Machine Learning Repository. Decision trees segment the predictor space into regions using splitting rules that can be visualized as a tree. In classification decision trees, each observation is assigned to the most commonly occurring class in its region.
Wisconsin …
more ...
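
As a quick, hedged sketch of a classification tree fit with rpart (shown on iris for brevity rather than the breast cancer data used in the post):

library(rpart)

set.seed(1)
index <- sample(nrow(iris), round(0.75 * nrow(iris)))
tree_mod <- rpart(Species ~ ., data = iris[index, ], method = "class") # grow a classification tree
pred <- predict(tree_mod, iris[-index, ], type = "class")
table(pred, iris$Species[-index]) # confusion matrix on the held-out observations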

K-Nearest Neighbor (Classification)

The following post shows an overview of k-Nearest Neighbors using the Wisconsin Breast Cancer Dataset from the UCI Machine Learning Repository. k-Nearest Neighbors is a non-parametric, instance-based learning algorithm that "memorizes" the training space and uses the k most similar training instances to classify a new instance.
Wisconsin Breast Cancer Data …
more ...
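
A minimal sketch of k-NN with the class package (shown on iris for brevity rather than the breast cancer data used in the post); predictors are scaled because k-NN is distance-based:

library(class)

set.seed(1)
index <- sample(nrow(iris), round(0.75 * nrow(iris)))
iris_scaled <- as.data.frame(scale(iris[, 1:4]))   # put predictors on a common scale
pred <- knn(train = iris_scaled[index, ], test = iris_scaled[-index, ],
            cl = iris$Species[index], k = 5)       # classify by the 5 nearest training instances
table(pred, iris$Species[-index])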

Naive Bayes (Classification)

The following post shows an overview of Naive Bayes using the Wisconsin Breast Cancer Dataset from the UCI Machine Learning Repository. The Naive Bayes algorithm is based on Bayes' Theorem and assumes that the effect of each predictor on a given class is independent of the other predictors.
Wisconsin Breast Cancer Data …
more ...
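
A minimal sketch of a Naive Bayes fit with e1071 (shown on iris for brevity rather than the breast cancer data used in the post):

library(e1071)

set.seed(1)
index <- sample(nrow(iris), round(0.75 * nrow(iris)))
nb_mod <- naiveBayes(Species ~ ., data = iris[index, ]) # per-class conditional distributions of each predictor
pred <- predict(nb_mod, iris[-index, ])
table(pred, iris$Species[-index])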

Plot Continuous Predictors using Melt


Import packages

library(plyr)
library(reshape2) # for melt function
library(ggplot2)


Reading and Cleaning the Wisconsin Breast Cancer Dataset from the UCI Machine Learning Repository

breast_cancer <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data", header = FALSE) # the raw file has no header row
colnames(breast_cancer) <- c("Code", "ClumpThickness", "UniformCellSize", "UniformCellShape", "MarginalAdhesion", "EpithelialCellSize", "BareNuclei", "BlandChromatin", "NormalNucleoli", "Mitoses", "Class …
more ...
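
For reference, a minimal self-contained sketch of the melt-then-facet idea on iris (an assumed stand-in for the truncated breast cancer code above), using the packages imported at the top of the post:

iris_melt <- melt(iris, id.vars = "Species")   # long format: one row per predictor value
ggplot(iris_melt, aes(x = value, fill = Species)) +
  geom_density(alpha = 0.4) +
  facet_wrap(~ variable, scales = "free") +    # one panel per continuous predictor
  theme_bw()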

Principal Component Analysis (PCA)

The following post shows an overview of Principal Component Analysis (PCA) using the Chronic Kidney Disease dataset from the UCI Machine Learning Repository. PCA is a dimensionality reduction technique that converts correlated predictors into a set of uncorrelated predictors, called principal components, using orthogonal transformations.
Chronic Kidney Disease Data, collected from a …
more ...
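
A minimal sketch of PCA with prcomp (shown on the numeric columns of iris rather than the kidney disease data used in the post):

pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE) # scale so each predictor contributes equally
summary(pca)        # proportion of variance explained by each principal component
head(pca$x[, 1:2])  # scores on the first two principal components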

Random Forest (Classification)

The following post shows an overview of Random Forests using the Wisconsin Breast Cancer Dataset from the UCI Machine Learning Repository. Random Forests are an ensemble learning method that trains a number of individual decision trees and aggregates their predictions, thereby reducing model variance and improving performance.
Wisconsin Breast Cancer Data, collected from …
more ...
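
A minimal sketch of a random forest fit with randomForest (shown on iris for brevity rather than the breast cancer data used in the post):

library(randomForest)

set.seed(1)
index <- sample(nrow(iris), round(0.75 * nrow(iris)))
rf_mod <- randomForest(Species ~ ., data = iris[index, ], ntree = 500) # ensemble of 500 trees
pred <- predict(rf_mod, iris[-index, ])
table(pred, iris$Species[-index])
importance(rf_mod) # variable importance aggregated across the ensemble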

Support Vector Machines (SVM)

The following post shows an overview of Support Vector Machines (SVM) using the Wisconsin Breast Cancer Dataset from the UCI Machine Learning Repository. SVMs generalize the maximal margin classifier to non-linear class boundaries, using hyperplanes to separate classes.
Wisconsin Breast Cancer Data, collected from the University of Wisconsin Hospitals, Madison from Dr. William …
more ...
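
A minimal sketch of an SVM fit with e1071 (shown on iris for brevity rather than the breast cancer data used in the post); a radial kernel is one common choice for non-linear boundaries:

library(e1071)

set.seed(1)
index <- sample(nrow(iris), round(0.75 * nrow(iris)))
svm_mod <- svm(Species ~ ., data = iris[index, ], kernel = "radial")
pred <- predict(svm_mod, iris[-index, ])
table(pred, iris$Species[-index])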


Word Cloud


For this word cloud analysis, Great Expectations by Charles Dickens was studied. Great Expectations charts the personal development of Pip, an orphan, exploring universal themes like guilt, persistence and social advancement, as well as historical constructs like wealth, poverty, morality, good versus evil, and Victorian social structures.
This post will provide a brief bigram …
more ...
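
A minimal sketch of building the word cloud with gutenbergr, tidytext and wordcloud; the Gutenberg ID is looked up by title, and the stop-word removal step is assumed:

library(gutenbergr)
library(tidytext)
library(dplyr)
library(wordcloud)

ge_id <- gutenberg_works(title == "Great Expectations")$gutenberg_id[1]
great_exp <- gutenberg_download(ge_id)
word_counts <- great_exp %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>% # drop common stop words like "the" and "and"
  count(word, sort = TRUE)
wordcloud(word_counts$word, word_counts$n, max.words = 100)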

Scatter plot (ggplot)


Import packages

library(ggplot2)


Scatter plot of miles per gallon, mpg, against weight, wt, by number of forward gears, gear, from the mtcars dataset

mtcars$gear <- as.character(mtcars$gear) # convert gear to character 
ggplot(mtcars, aes(x = wt, y = mpg, color = gear)) + geom_point(size = 0.7) + geom_smooth(method = lm, se …
more ...

Bar Plot (ggplot)

Import packages

library(ggplot2)


Bar plot of the count of forward gears, gear, by transmission status, am, from the mtcars dataset

class(mtcars$am)
## [1] "numeric"
class(mtcars$gear)
## [1] "numeric"
mtcars$am <- as.character(mtcars$am) # convert from numeric to character 
mtcars$gear <- as.character(mtcars$gear) # convert from numeric to character …
more ...

Box plot (ggplot)

Import packages

library(ggplot2)


Boxplot of miles per gallon, mpg, against number of forward gears, gear, from the mtcars dataset

mtcars$gear <- as.character(mtcars$gear) # convert gear to character 
ggplot(mtcars, aes(x = gear, y = mpg, color = gear)) + geom_point(size = 0.7) + geom_boxplot() + ggtitle("Gear vs. miles per gallon") + theme_bw …
more ...

Density plot (ggplot)

Import packages

library(ggplot2)


Density plot of sepal length, Sepal.Length, by species, Species, from the iris dataset

ggplot(iris, aes(x = Sepal.Length, fill = Species)) + 
  geom_density(color = "black", alpha = 0.4) + # specify alpha indicating density plot shading frequency
  ggtitle("Distribution of Sepal Length") + theme_bw() + theme(text = element_text(size = 20 …
more ...

Heat map (ggplot)

Import packages

library(datasets)
library(magrittr)
library(dplyr)
library(data.table)
library(reshape2)
library(tidyr)
library(ggplot2)


Reshaping data

# Convert row names to column
mtcars_sc <- mtcars
mtcars_sc[1:11] <- as.data.frame(sapply(mtcars_sc[1:11], as.numeric))
mtcars_scale <- as.data.frame(scale(mtcars_sc))
mtcars_data <- data.table::setDT(data.frame …
more ...
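
As an assumed, self-contained variant of the idea, a correlation heat map of mtcars built with melt and geom_tile, using the packages imported above (the post itself reshapes scaled raw values rather than correlations):

corr_melt <- melt(cor(mtcars))                 # long format: Var1, Var2, correlation value
ggplot(corr_melt, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red") +
  theme_bw()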

Histogram (ggplot)

Import packages

library(ggplot2)


Histogram of sepal length, Sepal.Length, by species, Species, from the iris dataset

Default histogram with overlapping bars

ggplot(iris, aes(x = Sepal.Length, fill = Species)) + 
  geom_histogram(bins = 25, color = "black", alpha = 0.7) + # specify alpha indicating histogram bar shading frequency
  ggtitle("Distribution of Sepal Length …
more ...

Pie chart (ggplot)

Import packages

library(ggplot2)
library(plyr)


Pie chart of transmission status, am, frequencies from the mtcars dataset

tab <- data.frame(table(mtcars$am))
colnames(tab) <- c("Transmission", "Frequency")
tab$Transmission <- revalue(tab$Transmission, c("1" = "manual", "0" = "automatic"))
ggplot(tab, aes …
more ...

Apply a function to all variables


Find mean of all numeric columns of a Data Frame

lapply(mtcars, class) # find class of all predictors
## $mpg
## [1] "numeric"
## 
## $cyl
## [1] "character"
## 
## $disp
## [1] "numeric"
## 
## $hp
## [1] "numeric"
## 
## $drat
## [1] "numeric"
## 
## $wt
## [1] "numeric"
## 
## $qsec
## [1] "numeric"
## 
## $vs
## [1] "numeric"
## 
## $am
## [1] "character"
## 
## $gear
## [1] "character"
## 
## $carb
## [1 …
more ...
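
Continuing the idea, a short sketch of taking the mean of only the numeric columns (guarding against the character columns identified above):

numeric_cols <- sapply(mtcars, is.numeric)  # TRUE for numeric predictors only
sapply(mtcars[, numeric_cols], mean)        # column means for the numeric predictors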

Create and Parse Lists of Lists


Creating a lists of lists

list1 <- list(attr = "Fruits", value = c("mango", "apple", "strawberries"))
list2 <- list(attr = "Vegetables", value = c("tomato", "potato"))
list3 <- list(attr = "Tidyverse", value = c("dplyr", "plyr", "ggplot2"))
list4 <- list(attr = "Workflow", value = c("Data Cleaning", "Modeling", "Visualization", "Communication"))
list_val <- list(list1, list2, list3, list4)
list_val
## [[1 …
more ...
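
A short sketch of the parsing side: pulling a named element out of every sub-list with sapply:

sapply(list_val, function(x) x$attr)      # the attr element of each sub-list
sapply(list_val, function(x) x$value[1])  # the first value of each sub-list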

Plotting Geographic Data with Leaflet (Color by Categorical Predictor)


Import packages

library(leaflet) 
library(jsonlite) 
library(tibble) 
library(plyr)
library(dplyr)
library(data.table)
library(datasets) # loading datasets package for mtcars and Iris data
library(webshot)


Scrape latitudes and longitudes for US states (from Michelle Hertzfeld's GitHub)

url <- "http://gist.githubusercontent.com/ajav17/dee0dd44357862c75ee2872038119f17/raw/0109432d22f28fd1a669a3fd113e41c4193dbb5d/USstates_avg_latLong"
statesLocation <- fromJSON …
more ...
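
A minimal sketch of coloring leaflet markers by a categorical variable with the packages loaded above, using made-up coordinates and groups rather than the scraped state centroids:

pts <- data.frame(lat = c(40.84, 40.67, 40.73),
                  lng = c(-73.86, -73.94, -73.81),
                  group = c("A", "B", "A"))
pal <- colorFactor(c("red", "blue"), domain = pts$group) # map each level of the categorical predictor to a color
leaflet(pts) %>%
  addTiles() %>%
  addCircleMarkers(~lng, ~lat, color = ~pal(group), radius = 6)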

Appending Columns to a Data Frame


Appending columns using cbind

# Random day, month and year predictors, indicating time of CO2 measurements
day <- sample(c(1:30), nrow(CO2), TRUE)
month <- sample(c(1:12), nrow(CO2), TRUE)
year <- sample(c(2013, 2014, 2015), nrow(CO2), TRUE)
CO2_time <- cbind(CO2, day, month, year) # binding the time predictors …
more ...

Change Reference Level of a Categorical Predictor


Import packages

library(plyr) # loading plyr for mutate function


Check current reference level of Species

levels(iris$Species) # current reference level is setosa
## [1] "setosa"     "versicolor" "virginica"


Change reference level from setosa to versicolor

iris$Species <- relevel(iris$Species, ref = "versicolor")
levels(iris$Species) # reference level changed to versicolor
## [1 …
more ...

Compare Two Datasets (Find Common/Different Observations)


Import packages

library(dplyr)


Creating sample datasets

LatLong <- c("40.841885, -73.856621",
             "40.675026, -73.944855", 
             "40.726253, -73.806710",
             "40.725375, -73.789845", 
             "40.845456, -73.876555")
Location <- c("Bronx", "Brooklyn", 
              "Manhattan", "Queens", "Staten Island")
geoData <- data.frame(LatLong, Location)
geoData
##                 LatLong      Location
## 1 40.841885, -73.856621 …
more ...
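
To illustrate the comparison itself, a second (made-up) data frame and dplyr's set operations:

geoData2 <- geoData[1:3, ]              # hypothetical second dataset sharing some rows
dplyr::intersect(geoData, geoData2)     # observations common to both
dplyr::setdiff(geoData, geoData2)       # observations in geoData but not in geoData2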

Convert String to Upper and Lower Case


Convert String to Upper Case

string <- "lower case"
up <- toupper(string)
up
## [1] "LOWER CASE"


Convert String to Lower Case

string <- "UPPER CASE"
low <- tolower(string)
low
## [1] "upper case"
more ...

Converting Rownames to Column


Import packages

library(data.table)
library(jsonlite)


Generating sample data

url <- paste("https://rdocumentation.org/api/packages/", "dplyr", "/versions/", "0.7.3", sep = "")
dat <- fromJSON(txt = url)
metrics <- data.frame(dat$package_name, dat$version, dat$title, dat$description, 
                  dat$release_date, dat$license, dat$maintainer$name, dat$maintainer$email)
colnames(metrics …
more ...
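
For the rownames-to-column step itself, a minimal sketch on mtcars with data.table (an assumed stand-in for the truncated code above):

mtcars_copy <- mtcars
mtcars_dt <- data.table::setDT(mtcars_copy, keep.rownames = TRUE) # rownames become an 'rn' column
head(mtcars_dt[, 1:3])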

Count Number of Elements in String, List and Data Frame


Count characters in string

nchar("pomegranate")
## [1] 11


Count elements in a list

fruits <- c("mango", "pomegranate", "berries", "orange")
length(fruits) # number of elements in list
## [1] 4
nchar(fruits) # count of characters in string 
## [1]  5 11  7  6


Count observations in data frame, iris

nrow(iris)
## [1] 150 …
more ...

Cox Proportional Hazards Model


Import packages

library(survival)


Cox proportional hazards model (predicting survival from heart dataset)

cox_mod <- coxph(Surv(start, stop, event) ~ age + year + surgery + transplant, data =  heart)
summary(cox_mod)
## Call:
## coxph(formula = Surv(start, stop, event) ~ age + year + surgery + 
##     transplant, data = heart)
## 
##   n= 172, number of events= 75 
## 
##                 coef exp(coef) se …
more ...

Create New Predictors


Import packages

library(plyr) # loading plyr for mutate function


Creating new variables using mutate

Calculating Body Mass Index (BMI) from height and weight in the women dataset

# converting weight from pounds to kilogram
women_BMI <- mutate(women, weight_kg = weight / 2.2) # weight_kg = weight in kilograms

# converting height from inches to meters 
women_BMI …
more ...
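
A minimal sketch completing the BMI calculation (the conversion factors and final formula are standard, but the post's exact code is truncated above):

women_BMI <- mutate(women,
                    weight_kg = weight / 2.2,     # pounds to kilograms
                    height_m  = height * 0.0254,  # inches to meters
                    BMI = weight_kg / height_m^2)
head(women_BMI)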

Create new variables using mutate and ifelse


Import packages

library(dplyr)


Use mutate and ifelse syntax to create a new variable

iris_mutate <- mutate(iris, SepalLengthCat = ifelse(Sepal.Length > mean(Sepal.Length), "High", 
                                    ifelse(Sepal.Length < mean(Sepal.Length), "Low", "Equal"))) 
head(iris_mutate)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species SepalLengthCat
## 1          5.1         3.5 …
more ...

Create Sample Observations


Create a sample Yes/No character vector of length 10

sample(c("Yes", "No"), 10, TRUE) #sample with replacement 
##  [1] "No"  "No"  "Yes" "Yes" "No"  "Yes" "No"  "No"  "Yes" "No"


Create sample numeric vector of length 15

sample(c(1:15), 15, FALSE) #sample without replacement 
##  [1] 15  2  1  7 11  3 …
more ...

Drop Row and Column by Index and Value


Drop row by index from iris

iris_row_index <- iris[-2, ]
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa …
more ...
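
For the column half of the title, a short sketch of dropping columns by index and by name:

iris_drop_index <- iris[, -5]                                # drop the fifth column (Species) by index
iris_drop_name  <- iris[, !(names(iris) %in% "Sepal.Width")] # drop a column by name
head(iris_drop_name)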

Dropping Levels of a Factor


Import packages

library(plyr) # loading plyr for mutate function


Subset only setosa species from Iris

iris_setosa <- subset(iris, Species == "setosa")
paste("Unique species in iris_setosa:", unique(iris_setosa$Species), sep = " ")
## [1] "Unique species in iris_setosa: setosa"
levels(iris_setosa$Species)
## [1] "versicolor" "setosa"     "virginica"


levels() shows all species even though the filtered dataset …

more ...
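
A minimal sketch of the fix the post builds toward: drop the unused levels after subsetting:

iris_setosa$Species <- droplevels(iris_setosa$Species)
levels(iris_setosa$Species) # now only "setosa" remains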

Impute Categorical Missing Values using Mode


Import packages

library(dplyr)


Create data frame with missing categorical features

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0 …
more ...
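
A minimal sketch of mode imputation on a copy of iris with artificially introduced missing species (an assumed setup, since the post's data preparation is truncated above):

iris_na <- iris
set.seed(1)
iris_na$Species[sample(nrow(iris_na), 10)] <- NA                      # introduce missing values
mode_val <- names(sort(table(iris_na$Species), decreasing = TRUE))[1] # most frequent level (the mode)
iris_na$Species[is.na(iris_na$Species)] <- mode_val                   # impute with the mode
sum(is.na(iris_na$Species))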

Impute Numeric Missing Values using Mean


Check data frame for missing values

head(airquality) # airquality dataset contains missing values 
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62 …
more ...
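
A minimal sketch of the mean imputation step on the Ozone column:

air <- airquality
air$Ozone[is.na(air$Ozone)] <- mean(air$Ozone, na.rm = TRUE) # replace NAs with the column mean
sum(is.na(air$Ozone))                                        # 0 after imputation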

Iterate over elements using for-loop


Loop over list of integers

for (i in 1:5)
{
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5


Iterate over column names to display first observation from every column

for (i in names(mtcars))
{
  print(paste("Value for ", i, ": ", mtcars[1, i], sep = ""))
}
## [1] "Value for …
more ...

Linear Regression


Import packages

library(stats)


Split Data into Training (75%) and Test (25%) Sets

n <- nrow(mtcars)
index = sample(1:n, size = round(0.75*n), replace = FALSE)
train = mtcars[index, ]
test = mtcars[-index, ]
paste("Observations in training data: ", nrow(train), sep = "")
## [1] "Observations in training data: 24"
paste("Observations in …
more ...
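
A minimal sketch of fitting and evaluating a model on the split above (the predictors wt and hp are chosen here only for illustration, not necessarily the post's):

lm_mod <- lm(mpg ~ wt + hp, data = train)      # illustrative predictors
summary(lm_mod)
pred <- predict(lm_mod, newdata = test)
sqrt(mean((test$mpg - pred)^2))                # test-set RMSE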

Logistic Regression


Import packages

library(stats)


Split data into training and test sets

set.seed(90)
n <- nrow(CO2)
index = sample(1:n, size = round(0.75*n), replace = FALSE)
train = CO2[index, ]
test = CO2[-index, ]


Logistic model predicting Treatment from CO2 train dataset

class(CO2$Treatment)
## [1] "factor"
log_mod_train <- glm(Treatment …
more ...
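
A minimal sketch of the fit and test-set predictions, with conc and uptake assumed as predictors only for illustration:

log_mod <- glm(Treatment ~ conc + uptake, data = train, family = binomial)
probs <- predict(log_mod, newdata = test, type = "response") # probability of the second Treatment level
pred_class <- ifelse(probs > 0.5,
                     levels(CO2$Treatment)[2], levels(CO2$Treatment)[1])
table(pred_class, test$Treatment)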

Partition Data Frame in Training and Test Data


Import packages

library(caret)


Randomly Split Data into Training (75%) and Test (25%) Sets

n <- nrow(iris)
index = sample(1:n, size = round(0.75*n), replace = FALSE)
train = iris[index, ]
test = iris[-index, ]
paste("Observations in training data: ", nrow(train), sep = "")
## [1] "Observations in training data: 112"
paste("Observations …
more ...
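
Since caret is loaded above, a short sketch of a stratified split with createDataPartition, which preserves the class balance (an alternative to the purely random split shown):

set.seed(1)
idx <- createDataPartition(iris$Species, p = 0.75, list = FALSE)
train_strat <- iris[idx, ]
test_strat  <- iris[-idx, ]
table(train_strat$Species) # class proportions preserved in the training set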

Pearson's Chi-squared Test


Import packages

library(stats)


Prepare data

mtcars_data <- mtcars
mtcars_data$am <- as.factor(mtcars_data$am) # converting predictors to factor 
mtcars_data$cyl <- as.factor(mtcars_data$cyl)


Pearson's chi-squared test, assessing independence between two categorical variables

chisq.test(table(mtcars_data$am, mtcars_data$cyl)) # Assess Transmission differences by number of cylinders
## Warning in chisq …
more ...

Plotting Scatter Plot via Facet Wrap (ggplot)


Import packages

library(ggplot2)


Plotting quantitative predictors against Species

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa …
more ...
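
A minimal sketch of the reshape-and-facet plot itself (an assumed reconstruction, since the post's plotting code is truncated above), adding reshape2 for the melt step:

library(reshape2)

iris_long <- melt(iris, id.vars = "Species")           # long format for faceting
ggplot(iris_long, aes(x = Species, y = value, color = Species)) +
  geom_point(size = 0.7) +
  facet_wrap(~ variable, scales = "free_y") +          # one panel per quantitative predictor
  theme_bw()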

Split by Character/Separator


Creating a sample dataframe

LatLong <- c("40.841885, -73.856621",
             "40.675026, -73.944855", 
             "40.726253, -73.806710",
             "40.725375, -73.789845", 
             "40.845456, -73.876555")
Location <- c("Bronx", "Brooklyn", 
              "Manhattan", "Queens", "Staten Island")
geoData <- data.frame(LatLong, Location)
geoData
##                 LatLong      Location
## 1 40.841885, -73.856621         Bronx
## 2 40 …
more ...
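
A short sketch of the split itself using tidyr::separate (one of several ways to do this; the post's own approach is truncated above):

library(tidyr)

separate(geoData, LatLong, into = c("Latitude", "Longitude"), sep = ", ")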

Stepwise Regression (Forward, Backward, Both)


Import packages

library(olsrr)
library(MASS) # stepAIC function


Stepwise Regression using olsrr package

Forward Stepwise Regression

mod_forward <- lm(mpg ~ ., data = mtcars)
step_forward <- ols_step_forward(mod_forward)
## We are selecting variables based on p value...
## 1 variable(s) added....
## 1 variable(s) added...
## 1 variable(s) added...
## No more variables satisfy the condition …
more ...
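
Since MASS is loaded above for stepAIC, a short sketch of AIC-based stepwise selection in both directions (complementing the p-value-based olsrr run):

full_mod <- lm(mpg ~ ., data = mtcars)
step_both <- stepAIC(full_mod, direction = "both", trace = FALSE) # add/drop terms to minimize AIC
summary(step_both)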

Student's t-Test


Prepare data

mtcars_data <- mtcars
mtcars_data$am <- as.factor(mtcars_data$am)


Student's t-test, assessing difference in means for two groups

t.test(mpg ~ am, data = mtcars_data) # Assess mpg differences by transmission status (automatic vs. manual)
## 
##  Welch Two Sample t-test
## 
## data:  mpg by am
## t = -3.7671, df = 18.332, p-value = 0 …
more ...

Substitute a Pattern in String


Substitute first instance of a pattern in a text

text = "Apples and oranges are fruits"
sub("p", "b", text) # replace first instance of letter p with b
## [1] "Abples and oranges are fruits"


Substitute all instances of a pattern in a text

gsub("p", "b", text) # replace all instances of …
more ...

Survival Analysis & Kaplan Meier


Import packages

library(survival)


Load sample dataset

head(heart)
##   start stop event        age      year surgery transplant id
## 1     0   50     1 -17.155373 0.1232033       0          0  1
## 2     0    6     1   3.835729 0.2546201       0          0  2
## 3     0    1     0   6.297057 0.2655715       0          0  3 …
more ...
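
A minimal sketch of a Kaplan-Meier fit on the heart data, grouping by transplant status (the grouping variable is an assumption for illustration):

km_fit <- survfit(Surv(start, stop, event) ~ transplant, data = heart)
summary(km_fit, times = c(50, 100, 500))  # survival estimates at selected follow-up times
plot(km_fit, col = c("black", "red"), xlab = "Days", ylab = "Survival probability")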

Tokenize String


Import packages

library(tidytext)
library(dplyr)


Create data for analysis

text <- "Dplyr provides the ability to process and wrangle data, facilitating convenient data transformations through functions like arrange, select and mutate."
data <- data.frame(count = 5, text)
data$text <- as.character(data$text)


Tokenize text

tokenize <- data %>% unnest_tokens(word, text …
more ...
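
Continuing the pipeline above, a short (assumed) follow-up that removes stop words and counts tokens:

tokenize %>%
  anti_join(stop_words, by = "word") %>% # drop stop words using tidytext's stop_words lexicon
  count(word, sort = TRUE)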

Wilcoxon Rank Sum Test


Import packages

library(datasets)
library(stats)


Prepare data

mtcars_data <- datasets::mtcars
mtcars_data$am <- as.factor(mtcars_data$am)


Wilcoxon Rank Sum Test, a non-parametric alternative to the t-test (assesses a location difference between two groups)

wilcox.test(cyl ~ am, data = mtcars_data) # Assess cyl (number of cylinders) differences by transmission status (automatic vs. manual)
## 
##  Wilcoxon rank sum …
more ...