This text explains how all of the scripts work and how they are connected.
The data for this course project are available from Dataset.zip.
You can read each of the text files using the read.table() function in R. For example, reading in 'features.txt' (from the UCI HAR Dataset folder) can be done with the following code:
features <- read.table("./UCI HAR Dataset/features.txt")
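To see what read.table() returns without having the files on disk, here is a minimal sketch that parses two lines laid out like features.txt (an index followed by a name). The sample lines and the name features_demo are assumptions made for illustration; they are passed in through the text argument rather than read from the data set.

```r
# Illustrative sample in the layout of features.txt (index, then name).
# These two lines are assumed for the example, not read from the real file.
sample_lines <- "1 tBodyAcc-mean()-X
2 tBodyAcc-mean()-Y"

# read.table() splits on whitespace by default and, with no header,
# names the columns V1 and V2.
features_demo <- read.table(text = sample_lines,
                            colClasses = c("integer", "character"))

str(features_demo)
```

Passing col.names, as the script does later, replaces the default V1 and V2 with meaningful names.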
The packages required to run the script are as follows:
- plyr
- reshape2
- tidyr
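Before running the script it can help to confirm that these packages are installed. The following is a small sketch of one way to check, not part of the original script; the variable names required, installed and missing_pkgs are hypothetical.

```r
# Packages listed above. requireNamespace() reports whether a package can be
# loaded, without attaching it to the search path.
required <- c("plyr", "reshape2", "tidyr")
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required[!installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "),
          ". Install them with install.packages().")
} else {
  message("All required packages are available.")
}
```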
Set the working directory to the folder in which the UCI HAR Dataset folder is saved.
- Read the list of all features and clean up the feature names
features <- read.table("./UCI HAR Dataset/features.txt", col.names = c("Feature_Num", "Feature_Name"), colClasses = "character")
# Remove empty parentheses, then replace hyphens, commas and remaining parentheses with underscores
features[,2] <- gsub("\\(\\)", "", features[,2])
features[,2] <- gsub("-|,|\\(|\\)", "_", features[,2])
- Read the training data set
X_train <- read.table("UCI HAR Dataset/train/X_train.txt", colClasses = "numeric", col.names = features[,2])
- Read the training labels
y_train <- read.table("UCI HAR Dataset/train/y_train.txt", col.names = "Activity_Label")
- Read the training subjects' identifiers
subject_train <- read.table("UCI HAR Dataset/train/subject_train.txt", col.names = "Subject")
- Create a data frame for training data
training <- cbind(subject_train, y_train, X_train)
- Read the test set
X_test <- read.table("UCI HAR Dataset/test/X_test.txt", colClasses = "numeric", col.names = features[,2])
- Read the test labels
y_test <- read.table("UCI HAR Dataset/test/y_test.txt", col.names = "Activity_Label")
- Read the test subjects' identifiers
subject_test <- read.table("UCI HAR Dataset/test/subject_test.txt", col.names = "Subject")
- Create a data frame for test data
test <- cbind(subject_test, y_test, X_test)
- Combine the two data frames, training and test
comb_data <- rbind(training, test)
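The name clean-up and the cbind()/rbind() assembly above can be tried on a toy example. This is a sketch under stated assumptions: the three sample names only follow the style of features.txt, the numeric values are made up, and the toy_* objects are hypothetical stand-ins for the real training and test pieces.

```r
# Sample names in the style of features.txt (assumed for illustration).
raw_names <- c("tBodyAcc-mean()-X", "tBodyAcc-std()-X", "fBodyAcc-bandsEnergy()-1,8")

# Two substitutions mirroring the step above: drop "()", then turn hyphens,
# commas and any remaining parentheses into underscores.
clean_names <- gsub("\\(\\)", "", raw_names)
clean_names <- gsub("-|,|\\(|\\)", "_", clean_names)
print(clean_names)

# Toy stand-ins for the training and test pieces (values are made up).
toy_training <- cbind(
  data.frame(Subject = c(1, 3)),
  data.frame(Activity_Label = c(5, 2)),
  setNames(data.frame(matrix(c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6), nrow = 2)), clean_names)
)
toy_test <- cbind(
  data.frame(Subject = 2),
  data.frame(Activity_Label = 4),
  setNames(data.frame(matrix(c(0.7, 0.8, 0.9), nrow = 1)), clean_names)
)

# rbind() stacks the rows; both pieces must have identical column names.
toy_comb <- rbind(toy_training, toy_test)
print(dim(toy_comb))
```

Because both pieces are built with the same column names, rbind() can stack them directly, which is why the script passes the same cleaned names as col.names for both X_train and X_test.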
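- Get the indices of the mean and standard deviation measurements, keeping the Subject and Activity_Label columns (1 and 2)
col_index <- c(1, 2, grep("[Mm]ean|[Ss]td", names(comb_data)))
# Drop columns 557 to 563: the angle variables, which only mention "Mean" in their arguments
col_index <- col_index[! col_index %in% c(557:563)]
- Extract the data
extr_data <- comb_data[, col_index]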
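The column selection can be illustrated on a handful of names. The sketch below is an alternative formulation, not the script's own code: it excludes the angle variables by name pattern rather than by hard-coded position. The toy_names vector is an assumed sample in the style of the cleaned feature names.

```r
# Assumed sample of column names after the clean-up in Step I.
toy_names <- c("Subject", "Activity_Label", "tBodyAcc_mean_X", "tBodyAcc_std_X",
               "tBodyAcc_max_X", "fBodyAcc_meanFreq_X", "angle_tBodyAccMean_gravity_")

# Positions of names that mention a mean or a standard deviation.
mean_std <- grep("[Mm]ean|[Ss]td", toy_names)

# Positions of the angle variables, which match only because "Mean"
# appears in their arguments.
angle_cols <- grep("^angle", toy_names)

# Keep the two identifier columns plus the matches that are not angle variables.
keep <- c(1, 2, setdiff(mean_std, angle_cols))
print(toy_names[keep])
```

Selecting by pattern avoids depending on the exact column positions, at the cost of assuming that every unwanted match starts with "angle".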
- Read the activity labels with their activity names
activity_labels <- read.table("./UCI HAR Dataset/activity_labels.txt", col.names = c("Activity_Label", "Activity"))
- Update extr_data with descriptive activity names
extr_data_upd <- join(activity_labels, extr_data, by = "Activity_Label")
extr_data_upd$Activity_Label <- NULL
- Move the Subject column to the front of the data frame extr_data_upd
extr_data_upd <- cbind(extr_data_upd$Subject, extr_data_upd)
extr_data_upd$Subject <- NULL
names(extr_data_upd)[1] <- "Subject"
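For readers without plyr installed, the same relabelling idea can be sketched with base R's merge(). This is an alternative shown on made-up toy data, not the script's method; toy_extr, toy_labels and toy_upd are hypothetical names, and unlike the script's join() call, merge() sorts the result by the key column.

```r
# Toy stand-in for extr_data (values are made up).
toy_extr <- data.frame(Subject = c(1, 1, 2),
                       Activity_Label = c(2, 1, 2),
                       tBodyAcc_mean_X = c(0.1, 0.2, 0.3))

# Toy stand-in for activity_labels.
toy_labels <- data.frame(Activity_Label = 1:2,
                         Activity = c("WALKING", "WALKING_UPSTAIRS"),
                         stringsAsFactors = FALSE)

# Attach the activity name to every row, then drop the numeric label.
toy_upd <- merge(toy_labels, toy_extr, by = "Activity_Label")
toy_upd$Activity_Label <- NULL

# Reorder by name so that Subject comes first.
toy_upd <- toy_upd[, c("Subject", setdiff(names(toy_upd), "Subject"))]
print(toy_upd)
```

Reordering columns by name does the same job as the cbind() and rename sequence above in a single step.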
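This was already done in Step I. The names listed in features.txt are descriptive enough, so they are used as the variable names of the data set after a light clean-up: empty parentheses are removed, and hyphens, commas and remaining parentheses are replaced with underscores.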
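One way to confirm that cleaned names are usable as R column names is to compare them with the output of make.names(). The sketch below uses an assumed sample of names, including a deliberate duplicate, to show what happens when the same cleaned name occurs more than once; nm, valid and unique_nm are hypothetical names.

```r
# Assumed sample of cleaned names; the last one is repeated on purpose.
nm <- c("tBodyAcc_mean_X", "tBodyAcc_std_X",
        "fBodyAcc_bandsEnergy_1_8", "fBodyAcc_bandsEnergy_1_8")

# A name is syntactically valid if make.names() leaves it unchanged.
valid <- make.names(nm) == nm
print(all(valid))

# With check.names = TRUE (the default), read.table() passes col.names through
# make.names(unique = TRUE), so a repeated name receives a numeric suffix.
unique_nm <- make.names(nm, unique = TRUE)
print(unique_nm)
```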
Step V - From the data set in Step IV, create a second, independent tidy data set with the average of each variable for each activity and each subject
- Split the data set by Subject and by Activity
sp_list <- split(extr_data_upd, list(extr_data_upd$Subject, extr_data_upd$Activity))
- Calculate the average of each variable for each subject and each activity
avg_mat <- t(sapply(sp_list, function(df) {colMeans(df[, 3:81])}))
- Transform the matrix avg_mat into a data frame avg_df
avg_df <- as.data.frame(avg_mat)
- Update the column names
colnames(avg_df) <- paste("mean(", colnames(avg_df), ")", sep = "")
- Extract the row names of avg_df
subject_activity <- rownames(avg_df)
- Update the data frame avg_df by adding the row names as a column
avg_df_upd <- cbind(subject_activity, avg_df)
- Get the tidy data set by splitting subject_activity into Subject and Activity columns
tidy_data <- separate(avg_df_upd, subject_activity, into = c("Subject", "Activity"), sep = "\\.")
- Get the final tidy data set, sorted by subject
final_tidy_data <- tidy_data[order(as.numeric(tidy_data$Subject)), ]
rownames(final_tidy_data) <- NULL
- Write the final tidy data set to a tab-separated text file
write.table(final_tidy_data,
            file = "final_tidy_data.txt",
            sep = "\t",
            row.names = FALSE)
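The per-subject, per-activity averaging can also be sketched with base R's aggregate(), followed by a round trip through a tab-separated file. This is an alternative shown on made-up toy data, not the script's split()/sapply() approach; the toy, avg, out_file and check names are hypothetical, and the file is written to a temporary path rather than final_tidy_data.txt.

```r
# Toy stand-in for extr_data_upd (values are made up).
toy <- data.frame(Subject = c(1, 1, 2, 2),
                  Activity = c("WALKING", "WALKING", "LAYING", "LAYING"),
                  tBodyAcc_mean_X = c(1, 3, 2, 4),
                  tBodyAcc_std_X = c(-1, -3, 0.5, 0.5),
                  stringsAsFactors = FALSE)

# Average every remaining column within each Subject/Activity pair.
avg <- aggregate(. ~ Subject + Activity, data = toy, FUN = mean)

# Sort by subject and reset the row names, as in the final step above.
avg <- avg[order(avg$Subject), ]
rownames(avg) <- NULL
print(avg)

# Write to a temporary tab-separated file and read it back to check the layout.
out_file <- tempfile(fileext = ".txt")
write.table(avg, file = out_file, sep = "\t", row.names = FALSE)
check <- read.table(out_file, header = TRUE, sep = "\t")
print(check)
```

Reading the file back with header = TRUE and sep = "\t" is also how the real final_tidy_data.txt can be loaded for inspection.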
