Synopsis

This report contains research conducted for the Practical Machine Learning course, part of the Data Science Specialization offered through Coursera by Johns Hopkins University. The writeup was built in RStudio and published as an HTML report with the knitr package.

The source code for this project can be found on GitHub: Practical Machine Learning Course Project

Introduction

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it.

The goal for this assignment is to analyze the Weight Lifting Exercise dataset and develop a machine learning algorithm using biometric data to predict the manner in which 6 participants performed a particular dumbbell exercise. This is the “classe” variable in the training set.

The biometric data was collected from accelerometers on the belt, forearm, arm, and dumbbell of the 6 participants, who were asked to perform dumbbell lifts correctly and incorrectly in 5 different ways.

For more information about the dataset used for this analysis, see section Weight Lifting Exercises Dataset from the Human Activity Recognition (HAR) project website.

The resulting prediction model will also serve as the basis for the “Course Project Prediction Quiz Portion” of this assignment, where the machine learning algorithm will be applied to 20 test cases available in the test data. The predictions will be submitted in the appropriate format for automated grading.

Environment Setup

Prepare the session by loading all necessary packages and clearing the global workspace (including hidden objects).

# load required packages
library(knitr)
library(ggplot2)
library(caret)
library(rpart)
library(rpart.plot)
library(corrplot)
library(e1071)
library(randomForest)
library(gbm)
library(parallel)
library(doParallel)

# clear the global workspace, including hidden objects
rm(list = ls(all.names = TRUE))

# set the working directory for the analysis
setwd("~/repos/coursera/github-assignments/practical-machine-learning-course-project")

Prepare the Data

Load the Data

trainURL      <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
validationURL <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"

trainDataFile <- "data/pml-training.csv"
validationDataFile <- "data/pml-testing.csv"

if (!file.exists('data')) {
    dir.create('data')
}
if (!file.exists(trainDataFile)) {
    download.file(url = trainURL, destfile = trainDataFile)
}
if (!file.exists(validationDataFile)) {
    download.file(url = validationURL, destfile = validationDataFile)
}

# verify the downloaded files are intact before parsing
stopifnot(file.size(trainDataFile) == 12202745)
stopifnot(file.size(validationDataFile) == 15113)

trainData <- read.csv(trainDataFile, sep = ",", header = TRUE)
validationData <- read.csv(validationDataFile, sep = ",", header = TRUE)

dim(trainData)
## [1] 19622   160
dim(validationData)
## [1]  20 160

The training dataset contains 19622 observations and 160 variables while the testing dataset (used for the prediction quiz portion of the assignment) contains 20 observations and 160 variables.

Clean the Data

An initial review of both datasets revealed a large number of missing values, as well as columns not relevant for prediction. Reduce the feature set to include only the data columns that make sense for prediction.

# remove variables with near-zero variance
nzv <- nearZeroVar(trainData)
trainData <- trainData[, -nzv]
validationData <- validationData[, -nzv]

# remove variables that are mostly NA (more than 95% missing)
mostlyNA <- sapply(trainData, function(x) mean(is.na(x))) > 0.95
trainData <- trainData[, !mostlyNA]
validationData <- validationData[, !mostlyNA]

# remove the first 7 columns (identification, timestamp, and window
# fields not useful for prediction)
trainData <- trainData[, -c(1:7)]
validationData <- validationData[, -c(1:7)]

# ensure the outcome is a factor (in R >= 4.0, read.csv no longer
# converts strings to factors by default)
trainData$classe <- as.factor(trainData$classe)

# record the data variables that will be used to fit the models
dataVariables <- names(trainData[, 1:(length(trainData) - 1)])

After cleaning, the models will be fit using the following 51 variables:

##  [1] "pitch_belt"           "yaw_belt"             "total_accel_belt"    
##  [4] "gyros_belt_x"         "gyros_belt_y"         "gyros_belt_z"        
##  [7] "accel_belt_x"         "accel_belt_y"         "accel_belt_z"        
## [10] "magnet_belt_x"        "magnet_belt_y"        "magnet_belt_z"       
## [13] "roll_arm"             "pitch_arm"            "yaw_arm"             
## [16] "total_accel_arm"      "gyros_arm_x"          "gyros_arm_y"         
## [19] "gyros_arm_z"          "accel_arm_x"          "accel_arm_y"         
## [22] "accel_arm_z"          "magnet_arm_x"         "magnet_arm_y"        
## [25] "magnet_arm_z"         "roll_dumbbell"        "pitch_dumbbell"      
## [28] "yaw_dumbbell"         "total_accel_dumbbell" "gyros_dumbbell_x"    
## [31] "gyros_dumbbell_y"     "gyros_dumbbell_z"     "accel_dumbbell_x"    
## [34] "accel_dumbbell_y"     "accel_dumbbell_z"     "magnet_dumbbell_x"   
## [37] "magnet_dumbbell_y"    "magnet_dumbbell_z"    "roll_forearm"        
## [40] "pitch_forearm"        "yaw_forearm"          "total_accel_forearm" 
## [43] "gyros_forearm_x"      "gyros_forearm_y"      "gyros_forearm_z"     
## [46] "accel_forearm_x"      "accel_forearm_y"      "accel_forearm_z"     
## [49] "magnet_forearm_x"     "magnet_forearm_y"     "magnet_forearm_z"

Partition the Data

Since the “classe” variable is the outcome we will be predicting, partition the training dataset (trainData) into two sets: 60% of the training data for the modeling process and the remaining 40% for the test set. The split is performed by random subsampling without replacement. Cross-validation within the training set will be used to tune each model, followed by an out-of-sample evaluation on the test set.

# make the partition reproducible
set.seed(660067)
inTrain  <- createDataPartition(trainData$classe, p = 0.6, list = FALSE)
trainSet <- trainData[inTrain, ]
testSet  <- trainData[-inTrain, ]

The validationData test dataset will only be used for the “Course Project Prediction Quiz Portion” of this assignment and thus will remain unchanged.
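
As a quick sanity check (not part of the original workflow), we can confirm that every modeling variable built above also survives in the validation set; a minimal sketch:

# confirm every modeling variable is present in the validation set
stopifnot(all(dataVariables %in% names(validationData)))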

Exploratory Data Analysis

The 6 young and healthy participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in the following 5 different fashions:

  • Class A: exactly according to the specification
  • Class B: throwing the elbows to the front
  • Class C: lifting the dumbbell only halfway
  • Class D: lowering the dumbbell only halfway
  • Class E: throwing the hips to the front

This is the “classe” variable in the training set and is used as an indicator of the training outcome. Class A corresponds to the specified (correct) execution of the exercise, while the other 4 classes correspond to common mistakes.

The following plot shows the distribution of the classe variable in the training set.

g <- ggplot(trainSet, aes(classe))
g <- g + geom_bar(fill = rgb(0.2, 0.4, 0.6, 0.8))
g <- g + xlab("Classe Level")
g <- g + ylab("Frequency")
g <- g + theme(plot.title = element_text(size = 14, hjust = 0.5, vjust = 0.5),
               axis.text.x = element_text(hjust = 0.5, vjust = 0.5),
               axis.text.y = element_text(hjust = 0.5, vjust = 0.5))
g <- g + ggtitle("Distribution of the Classe Variable in the Training Set")
print(g)
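
Although not required for model fitting, a correlation check among the remaining predictors can help flag highly correlated features. Below is a minimal sketch using the corrplot package loaded earlier; the plotting options are illustrative choices, not part of the original analysis.

# correlation matrix of the 51 predictors in the training set
corMatrix <- cor(trainSet[, dataVariables])
corrplot(corMatrix, method = "color", type = "lower",
         order = "FPC", tl.cex = 0.6, tl.col = "black")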

Prediction Model Building

The following three modeling algorithms will be used in this analysis to determine which one provides the best out-of-sample accuracy:

  • Decision Tree
  • Random Forest
  • Generalized Boosted Model

Parallel processing will be used to improve efficiency and reduce build time.

At the end of each analysis, a confusion matrix will be shown to better visualize the accuracy of the predictions.

After applying each model to the test set, the model with the highest accuracy will be used for the prediction quiz portion of the assignment.

Cross Validation

3-fold cross-validation will be used within the training set for each model to reduce overfitting and provide a more reliable estimate of out-of-sample accuracy.

# decision tree: rpart performs its own internal cross-validation,
# controlled by the xval parameter (trainControl arguments do not apply)
fitControlDT <- rpart.control(xval = 3)

# random forest
fitControlRF <- trainControl(method = "cv", number = 3, verboseIter = FALSE, 
                             allowParallel = TRUE)

# generalized boosted model
fitControlGBM <- trainControl(method = "repeatedcv", number = 3, repeats = 1,
                              verboseIter = FALSE, allowParallel = TRUE)

Decision Tree

Build the model and plot the resulting decision tree.

set.seed(660067)
modFitDT <- rpart(as.factor(classe) ~ .,
                  data = trainSet,
                  control = fitControlDT,
                  method = "class")

prp(modFitDT,
    faclen = 0,
    box.palette = "GnBu",
    cex = 0.50,
    legend.x = 0,
    legend.y = 1,
    legend.cex = 1)

Validate the Decision Tree model on the test set (testSet) to determine how well it performed and the accuracy of the results.

predictionDT <- predict(modFitDT, newdata = testSet, type = "class")
confMatrixDT <- confusionMatrix(predictionDT, testSet$classe)
print(confMatrixDT)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1845  191   22   75  127
##          B   87  905   71   95   90
##          C    8  133 1026  190   99
##          D  182   99   57  790   99
##          E  110  190  192  136 1027
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7128          
##                  95% CI : (0.7027, 0.7228)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6368          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8266   0.5962   0.7500   0.6143   0.7122
## Specificity            0.9261   0.9458   0.9336   0.9334   0.9019
## Pos Pred Value         0.8164   0.7252   0.7047   0.6438   0.6205
## Neg Pred Value         0.9307   0.9071   0.9465   0.9251   0.9330
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2352   0.1153   0.1308   0.1007   0.1309
## Detection Prevalence   0.2880   0.1591   0.1856   0.1564   0.2109
## Balanced Accuracy      0.8763   0.7710   0.8418   0.7738   0.8071

Plot the matrix results.

g <- ggplot(data = as.data.frame(confMatrixDT$table),
            aes(x = Prediction, y = Reference))
g <- g + geom_tile(aes(fill = Freq), colour = "white")
g <- g + scale_fill_gradient(low = "white", high = "steelblue", name = "Frequency")
g <- g + geom_text(aes(x = Prediction, y = Reference, label = Freq))
g <- g + theme(plot.title = element_text(size = 14, hjust = 0.5, vjust = 0.5),
               axis.text.x = element_text(hjust = 0.5, vjust = 0.5),
               axis.text.y = element_text(hjust = 0.5, vjust = 0.5))
g <- g + ggtitle(paste0("Decision Tree Accuracy ",
                        round(confMatrixDT$overall['Accuracy'], 3) * 100,
                        "%"))
print(g)

The confusion matrix for the Decision Tree model shows an estimated out-of-sample accuracy of 0.713, giving an out-of-sample error rate of 0.287, which is considerably high.
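
For reference, the quoted error rate can be derived directly from the confusion matrix object rather than computed by hand:

# the out-of-sample error rate is simply 1 minus the accuracy
errorRateDT <- unname(1 - confMatrixDT$overall["Accuracy"])
round(errorRateDT, 3)
## [1] 0.287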

Random Forest

Build the Random Forest model.

set.seed(660067)
# initiate the cluster, leave 1 core for OS
# NOTE: make certain "127.0.0.1 localhost" exists in /etc/hosts
numCores <- detectCores() - 1
cluster <- makeCluster(numCores)
registerDoParallel(cluster)
timerStart <- Sys.time()
modFitRF <- train(classe ~ .,
                  data = trainSet,
                  method = "rf",
                  trControl = fitControlRF,
                  verbose = FALSE)
timerEnd <- Sys.time()
stopCluster(cluster)
registerDoSEQ()
buildTime <- paste(round(timerEnd - timerStart, 2), attr(timerEnd - timerStart, "units"))

Build time: 1.76 mins using 7 cores on macOS.

Review the final model.

print(modFitRF$finalModel)
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry, verbose = FALSE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 26
## 
##         OOB estimate of  error rate: 0.87%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 3343    2    1    1    1 0.001493429
## B   26 2247    5    1    0 0.014041246
## C    0   18 2029    6    1 0.012171373
## D    0    1   28 1900    1 0.015544041
## E    0    0    3    8 2154 0.005080831
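
Though not part of the original analysis, caret's varImp function can rank the predictors driving the forest; a minimal sketch (output omitted):

# rank the predictors by importance in the fitted forest
importanceRF <- varImp(modFitRF)
plot(importanceRF, top = 20)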

Validate the Random Forest model on the test set (testSet) to determine how well it performed and the accuracy of the results.

predictionRF <- predict(modFitRF, newdata = testSet)
confMatrixRF <- confusionMatrix(predictionRF, testSet$classe)
print(confMatrixRF)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2231   14    0    0    0
##          B    1 1501    6    0    1
##          C    0    2 1358   13    3
##          D    0    0    4 1270   10
##          E    0    1    0    3 1428
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9926          
##                  95% CI : (0.9905, 0.9944)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9906          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9996   0.9888   0.9927   0.9876   0.9903
## Specificity            0.9975   0.9987   0.9972   0.9979   0.9994
## Pos Pred Value         0.9938   0.9947   0.9869   0.9891   0.9972
## Neg Pred Value         0.9998   0.9973   0.9985   0.9976   0.9978
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2843   0.1913   0.1731   0.1619   0.1820
## Detection Prevalence   0.2861   0.1923   0.1754   0.1637   0.1825
## Balanced Accuracy      0.9985   0.9938   0.9950   0.9927   0.9948

Plot the matrix results.

g <- ggplot(data = as.data.frame(confMatrixRF$table),
            aes(x = Prediction, y = Reference))
g <- g + geom_tile(aes(fill = Freq), colour = "white")
g <- g + scale_fill_gradient(low = "white", high = "steelblue", name = "Frequency")
g <- g + geom_text(aes(x = Prediction, y = Reference, label = Freq))
g <- g + theme(plot.title = element_text(size = 14, hjust = 0.5, vjust = 0.5),
               axis.text.x = element_text(hjust = 0.5, vjust = 0.5),
               axis.text.y = element_text(hjust = 0.5, vjust = 0.5))
g <- g + ggtitle(paste0("Random Forest Accuracy ",
                        round(confMatrixRF$overall['Accuracy'], 3) * 100,
                        "%"))
print(g)

The confusion matrix for the Random Forest model shows an estimated out-of-sample accuracy of 0.993, giving an out-of-sample error rate of 0.007, which indicates highly accurate predictions.

Generalized Boosted Model

Build the Generalized Boosted Model, an implementation of extensions to Freund and Schapire’s AdaBoost algorithm and Friedman’s gradient boosting machine.

set.seed(660067)
# initiate the cluster, leave 1 core for OS
# NOTE: make certain "127.0.0.1 localhost" exists in /etc/hosts
numCores <- detectCores() - 1
cluster <- makeCluster(numCores)
registerDoParallel(cluster)
timerStart <- Sys.time()
modFitGBM <- train(classe ~ .,
                   data = trainSet,
                   method = "gbm",
                   trControl = fitControlGBM,
                   verbose = FALSE)
timerEnd <- Sys.time()
stopCluster(cluster)
registerDoSEQ()
buildTime <- paste(round(timerEnd - timerStart, 2), attr(timerEnd - timerStart, "units"))

Build time: 44.97 secs using 7 cores on macOS.

Review the final model.

print(modFitGBM$finalModel)
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 51 predictors of which 51 had non-zero influence.
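
For completeness, the tuning parameter combination that caret selected by cross-validation for the final GBM fit can be inspected directly (output omitted here):

# tuning parameter combination chosen by cross-validation
modFitGBM$bestTune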

Validate the Generalized Boosted Model on the test set (testSet) to determine how well it performed and the accuracy of the results.

predictionGBM <- predict(modFitGBM, newdata = testSet)
confMatrixGBM <- confusionMatrix(predictionGBM, testSet$classe)
print(confMatrixGBM)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2187   45    1    2    2
##          B   23 1425   50    5   19
##          C   12   43 1303   53   19
##          D   10    1   12 1220   22
##          E    0    4    2    6 1380
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9578          
##                  95% CI : (0.9531, 0.9622)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9466          
##                                           
##  Mcnemar's Test P-Value : 4.157e-14       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9798   0.9387   0.9525   0.9487   0.9570
## Specificity            0.9911   0.9847   0.9804   0.9931   0.9981
## Pos Pred Value         0.9776   0.9363   0.9112   0.9644   0.9914
## Neg Pred Value         0.9920   0.9853   0.9899   0.9900   0.9904
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2787   0.1816   0.1661   0.1555   0.1759
## Detection Prevalence   0.2851   0.1940   0.1823   0.1612   0.1774
## Balanced Accuracy      0.9855   0.9617   0.9664   0.9709   0.9776

Plot the matrix results.

g <- ggplot(data = as.data.frame(confMatrixGBM$table),
            aes(x = Prediction, y = Reference))
g <- g + geom_tile(aes(fill = Freq), colour = "white")
g <- g + scale_fill_gradient(low = "white", high = "steelblue", name = "Frequency")
g <- g + geom_text(aes(x = Prediction, y = Reference, label = Freq))
g <- g + theme(plot.title = element_text(size = 14, hjust = 0.5, vjust = 0.5),
               axis.text.x = element_text(hjust = 0.5, vjust = 0.5),
               axis.text.y = element_text(hjust = 0.5, vjust = 0.5))
g <- g + ggtitle(paste0("Generalized Boosted Model Accuracy ",
                        round(confMatrixGBM$overall['Accuracy'], 3) * 100,
                        "%"))
print(g)

The confusion matrix for the Generalized Boosted Model shows an estimated out-of-sample accuracy of 0.958, giving an out-of-sample error rate of 0.042, which also indicates very accurate predictions.

Conclusion

The goal for this project was to predict the manner in which a participant performed a particular dumbbell exercise. Using biometric data from the Weight Lifting Exercise dataset, we were able to fit three models that yielded the following accuracy rates:

  • Decision Tree: 0.713
  • Random Forest: 0.993
  • Generalized Boosted Model: 0.958
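
These figures can also be collected programmatically from the three confusion matrix objects; a minimal sketch:

# assemble the out-of-sample accuracy of each model
accuracyResults <- data.frame(
    Model    = c("Decision Tree", "Random Forest", "Generalized Boosted Model"),
    Accuracy = unname(c(confMatrixDT$overall["Accuracy"],
                        confMatrixRF$overall["Accuracy"],
                        confMatrixGBM$overall["Accuracy"]))
)
print(accuracyResults, row.names = FALSE)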

The Random Forest model proved to be the most accurate model for predicting the class of dumbbell exercise. As such, I will use the previously created Random Forest model to predict the 20 cases for the quiz portion of the assignment.

predictionQuiz <- predict(modFitRF, newdata = validationData)
predictionQuiz
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
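
Should the predictions need to be saved for submission, a helper along the following lines could write one text file per test case. This writer function is a hypothetical sketch, not part of the graded workflow, and the file-naming convention is an assumption.

# hypothetical helper: write each prediction to its own text file
writeQuizFiles <- function(predictions) {
    for (i in seq_along(predictions)) {
        filename <- paste0("problem_id_", i, ".txt")
        write.table(predictions[i], file = filename, quote = FALSE,
                    row.names = FALSE, col.names = FALSE)
    }
}
writeQuizFiles(predictionQuiz)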