This repository contains files for the implementation of the Coursera class Getting and Cleaning Data.
The repository contains the following files:-
- README.md - This readme file.
- CodeBook.md - The code book explaining the data contained in the data set.
- run_analysis.R - The R script to generate a tidy data set based on the raw data.
- tidy.data.txt - The output of the script. This is a tidy data set with the average of each variable for each activity and each subject.
This Coursera project requires one R script called run_analysis.R that does the following:-
- Merges the training and the test sets to create one data set.
- Extracts only the measurements on the mean and standard deviation for each measurement.
- Uses descriptive activity names to name the activities in the data set.
- Appropriately labels the data set with descriptive variable names.
- From the data set in step 4, creates a second, independent tidy data set with the average of each variable for each activity and each subject.
The raw data for this project can be downloaded from https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip.
For information on the contents of the raw data files, review the included Code Book.
The data should be extracted in its original structure into the project folder.
    └───UCI HAR Dataset
        ├───test
        │   └───Inertial Signals
        └───train
            └───Inertial Signals
The data in the 'UCI HAR Dataset\test\Inertial Signals' and 'UCI HAR Dataset\train\Inertial Signals' directories are not used in this implementation.
The following libraries are required:-
- data.table
- reshape2
These libraries can be installed by running the following commands in the R console.
install.packages('data.table')
install.packages('reshape2')
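To avoid reinstalling packages that are already present, the check can be scripted. The following is a minimal sketch, not part of run_analysis.R; the helper name `missing_packages` is hypothetical.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("data.table", "reshape2"))
if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install them
}
```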
- Download the raw data files from https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip
- Unzip the files into the project folder, keeping the folder structure.
- Download the file `run_analysis.R` from this repository into the project folder.
- Run the script `run_analysis.R`: `source('run_analysis.R')`
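The download and unzip steps can also be done from the R console. The following is a minimal sketch under a few assumptions: the helper name `fetch_raw_data` and the local zip file name are hypothetical, and the archive is assumed to unpack into a top-level 'UCI HAR Dataset' folder as shown in the directory tree.

```r
# Hypothetical helper: download and unzip the raw data unless it is already present.
fetch_raw_data <- function(project_dir = ".",
                           url = "https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip") {
  data_dir <- file.path(project_dir, "UCI HAR Dataset")
  if (!file.exists(data_dir)) {
    zip_path <- file.path(project_dir, "UCI_HAR_Dataset.zip")  # assumed local file name
    download.file(url, destfile = zip_path, mode = "wb")       # binary mode keeps the zip intact on Windows
    unzip(zip_path, exdir = project_dir)                       # keeps the archive's folder structure
  }
  data_dir
}

# Usage, from the project folder:
# fetch_raw_data()
# source("run_analysis.R")
```

The helper skips the download when the data folder already exists, so it is safe to call repeatedly.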
The script generates a text file called tidy.data.txt that contains a tidy data set with the average of each variable for each activity and each subject.
The code book for this project is the CodeBook.md file in this repository.
This script was written using R version 3.1.2 (2014-10-31) ("Pumpkin Helmet")
The run_analysis.R script performs the following actions:-
- Read in the raw data sets.
- Combine the training and test sets into a single data set using the `cbind` and `rbind` functions. [Assignment step 1]
- Keep only the subject.id, activity and any column that contains a mean or standard deviation. Note: any variable whose name contains "mean" is included, so the `meanFreq` data is also kept. [Assignment step 2]
- The activity values are replaced with descriptive names. [Assignment step 3]
- The variable names are changed to descriptive labels. [Assignment step 4]
- The data is melted and recast into a data set showing the average of each variable for each activity and each subject. The data is then written to a text file. [Assignment step 5]
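The steps above can be illustrated on a toy data set. This is a minimal sketch, not the actual run_analysis.R: the values, the `make_set` helper, and the two-activity label list are made up for illustration, and the renaming in step 4 is omitted. Only `rbind`, `cbind`, and the reshape2 melt and cast approach come from the description above.

```r
library(reshape2)

# Toy stand-ins for the raw files (hypothetical values, not the real data).
features <- c("tBodyAcc-mean()-X", "tBodyAcc-std()-X", "tBodyAcc-max()-X")
activity_names <- c("WALKING", "SITTING")

# Hypothetical helper: bind subject, activity and measurements column-wise.
make_set <- function(subject, activity, values) {
  measurements <- as.data.frame(matrix(values, ncol = length(features), byrow = TRUE))
  names(measurements) <- features
  cbind(subject.id = subject, activity = activity, measurements)
}

train <- make_set(subject = c(1, 1), activity = c(1, 1),
                  values = c(1, 10, 100,
                             3, 30, 300))
test <- make_set(subject = c(2, 2), activity = c(2, 2),
                 values = c(5, 50, 500,
                            7, 70, 700))

# Step 1: stack the training and test rows into one data set.
combined <- rbind(train, test)

# Step 2: keep the id columns plus any column whose name contains "mean" or "std"
# (this pattern would also keep meanFreq columns).
keep <- grepl("mean|std", names(combined))
combined <- combined[, c("subject.id", "activity", names(combined)[keep])]

# Step 3: replace the activity codes with descriptive names.
combined$activity <- factor(combined$activity, levels = 1:2, labels = activity_names)

# Step 5: melt to long form, then cast back with the mean as the aggregate.
molten <- melt(combined, id.vars = c("subject.id", "activity"))
tidy <- dcast(molten, subject.id + activity ~ variable, fun.aggregate = mean)

# Write the result to a text file (a temporary location in this sketch).
write.table(tidy, file.path(tempdir(), "tidy.data.txt"), row.names = FALSE)
```

In this toy example `tidy` has one row per subject and activity pair, and each measurement column holds the average of that pair's rows.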
The script modifies the variable names in the output file to make them more descriptive as follows:-
- Names starting with 't' are preceded by 'TimeDomain'.
- Names starting with 'f' are preceded by 'FrequencyDomain'.
- 'std' is replaced with 'StandardDeviation'.
- 'Acc' is replaced with 'Acceleration'.
- 'Gyro' is replaced with 'Gyroscope'.
- 'Mag' is replaced with 'Magnitude'.
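The renaming rules above can be sketched with `gsub`. This is a hedged illustration rather than the code in run_analysis.R: the helper name `describe_names` is hypothetical, and it assumes the leading 't' or 'f' is replaced by the domain label rather than kept alongside it.

```r
# Hypothetical helper: apply the renaming rules to a vector of variable names.
describe_names <- function(x) {
  x <- gsub("^t", "TimeDomain", x)       # assumption: the leading 't' is replaced
  x <- gsub("^f", "FrequencyDomain", x)  # assumption: the leading 'f' is replaced
  x <- gsub("std", "StandardDeviation", x, fixed = TRUE)
  x <- gsub("Acc", "Acceleration", x, fixed = TRUE)
  x <- gsub("Gyro", "Gyroscope", x, fixed = TRUE)
  x <- gsub("Mag", "Magnitude", x, fixed = TRUE)
  x
}

describe_names(c("tBodyAcc-std()-X", "fBodyAccMag-mean()"))
# "TimeDomainBodyAcceleration-StandardDeviation()-X"
# "FrequencyDomainBodyAccelerationMagnitude-mean()"
```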