Teaching Data Science: Basic data analysis: Part I

[Editor's note: Under construction - 03/06/2013.]
There are a number of functions that will be helpful for this example. Please examine them through the help system (e.g. 'help(command)'):

read.csv()
head()
names()
as.numeric()
c()
data.frame() or as.data.frame()
sapply()
class()
print()
levels()
droplevels()
subset()
order()
plot()
lines()

Understanding bracket (e.g. '[]') notation[1] and the for command is also important for this exercise.[2] This exercise uses nested functions, where the result of an interior function becomes an argument to the exterior function. The general form is:

function1(function2(function3()))

Here, function3 provides arguments for function2 which provides arguments for function1. Sometimes this form appears as:

function1(function2(function3(argsFUN3), other_argsFUN2), other_argsFUN1)

where additional function arguments are passed, each inside the parentheses of the function it belongs to. The result is usually a datatype determined by the outermost function.[3]
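As a small illustration of the nested form, assume a made-up vector of mileage figures (the numbers are hypothetical and do not come from the EPA file). Here c() builds the vector, mean() consumes it, and round() consumes the mean:

```r
# Hypothetical combined-mpg values, for illustration only
mpg <- c(41.2, 46.8, 50.1)

# Nested form: c() feeds mean(), which feeds round()
nested <- round(mean(c(41.2, 46.8, 50.1)), 1)

# The same computation written one step at a time
step1 <- mean(mpg)
stepwise <- round(step1, 1)

print(nested)
print(class(nested))   # the class is decided by the outermost function, round()
```

Both forms give the same result. The nested form saves intermediate variables, while the stepwise form is easier to debug.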

The premise of this exercise is quite straightforward. You are 16 years old. Your parents have offered to purchase you a vehicle. They will pay the purchase price, taxes, and insurance. However, you must pay for your fuel. You want to examine the EPA's fuel economy database to better understand your options for high-mileage vehicles.

The dataframe class[4] is a row/column data structure that accepts mixed or heterogeneous columnar datatypes or data classes.
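A minimal sketch of a mixed-type dataframe and of bracket notation follows. The dataframe 'toy_cars' and its contents are invented for illustration and are not part of the EPA data:

```r
# A toy dataframe with made-up rows; the column names are illustrative only
toy_cars <- data.frame(
  model = c("Alpha", "Beta", "Gamma"),
  year  = c(2011L, 2012L, 2013L),
  mpg   = c(41.5, 38.0, 47.2),
  stringsAsFactors = FALSE
)

# Each column keeps its own class
col_classes <- sapply(toy_cars, class)
print(col_classes)

# Bracket notation is [rows, columns]
first_row  <- toy_cars[1, ]        # first row, all columns
mpg_column <- toy_cars[, "mpg"]    # all rows, one column (returned as a vector)
high_mpg   <- toy_cars[toy_cars$mpg > 40, c("model", "mpg")]  # filtered rows, two columns
print(high_mpg)
```

Leaving the row or column position empty means "all rows" or "all columns".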

# Download http://www.fueleconomy.gov/feg/epadata/vehicles.csv.zip
# unzip vehicles.csv to your working directory

AllVehicles <- read.csv("vehicles.csv")
AllVehicles$comb08 <- as.numeric(AllVehicles$comb08)
# You can show the names, the depth and the datatypes of the 71 columns and 33,184 rows in the dataframe 'AllVehicles' with:
nrow(AllVehicles)
ncol(AllVehicles)
length(names(AllVehicles))
AllVehicles[0,]   # zero rows: prints only the column names
AllVehicles[,0]   # zero columns: reports only the number of rows
names(AllVehicles)
# You can show the class (datatype) of each column with:
sapply(AllVehicles,class)
class(AllVehicles$comb08)
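The as.numeric() call above is safe when read.csv() has already read 'comb08' as a number. As a cautionary sketch, the made-up factor below shows what would happen if a column were read as a factor instead (for example because of a stray text entry). Whether this ever occurs in the EPA file is not claimed here; the vector 'raw' is hypothetical:

```r
# Hypothetical column that was read as a factor because of a stray text entry
raw <- factor(c("48", "50", "41", "n/a"))

# as.numeric() on a factor returns the internal level codes, not the numbers
codes <- as.numeric(raw)

# Converting through as.character() first recovers the numbers;
# the non-numeric entry becomes NA (the coercion warning is suppressed here)
values <- suppressWarnings(as.numeric(as.character(raw)))

print(levels(raw))
print(codes)
print(values)
```

Checking class() and levels() before converting shows which case you are in.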

Extract separate dataframes for those cars whose combined mileage is greater than forty mpg and greater than forty-five mpg.

GTR40 <- data.frame(subset(AllVehicles, comb08 > 40, select = c(model,make,year,comb08,comb08U,combA08,combA08U,combE)))
GTR45 <- data.frame(subset(AllVehicles, comb08 > 45, select = c(model,make,year,comb08,comb08U,combA08,combA08U,combE)))
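The same subset() pattern can be tried on a small made-up dataframe before running it on the full file. The dataframe 'toy' below is hypothetical, and only the column name 'comb08' is borrowed from the EPA data:

```r
# Made-up rows for illustration; 'other' stands in for columns we do not need
toy <- data.frame(
  model  = c("Alpha", "Beta", "Gamma", "Delta"),
  year   = c(2012, 2013, 2013, 2013),
  comb08 = c(39, 42, 47, 50),
  other  = c("a", "b", "c", "d")
)

# The second argument filters rows; 'select' keeps only the named columns
over40 <- subset(toy, comb08 > 40, select = c(model, year, comb08))
over45 <- subset(toy, comb08 > 45, select = c(model, year, comb08))

print(over40)
print(rownames(over40))   # row names are inherited from the parent dataframe
```

The child dataframes keep the row names of the parent, so their index no longer starts at 1.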

For examination purposes, print to the screen separate lists, one per model year from 2008 through 2013, of vehicles with fuel economy estimates of over forty and over forty-five mpg. Note the use of the 'row.names=NULL' argument to the data.frame function and the nested droplevels function. Together these rebuild each subset as an object separate from the parent dataframe: 'row.names=NULL' discards the row names inherited from the parent and renumbers the rows from 1, and droplevels removes factor levels that no longer occur in the child. The 'for' control structure does not require braces ({}) when its body is a single expression, which is usually written on the same line. The two typical forms are:

for (i in (some numerical range,list or function)) do this

for (i in (some numerical range,list or function)) {

do this on the next line inside braces

}
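A minimal runnable sketch of both forms, using small made-up ranges:

```r
# One-line form: a single expression needs no braces
for (i in 2011:2013) print(i)

# Braced form: several expressions run on each pass
squares <- numeric(0)            # empty vector to collect the results
for (i in 1:3) {
  sq <- i^2
  squares <- c(squares, sq)
}
print(squares)
```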

'2008:2013' specifies a numerical range in the source below:

for (i in (2008:2013)) print(droplevels(data.frame((subset(GTR40,year == i)),row.names=NULL)))
for (i in (2008:2013)) print(droplevels(data.frame((subset(GTR45,year == i)),row.names=NULL)))
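To see what 'row.names=NULL' and droplevels() each contribute, the sketch below applies them to a made-up dataframe. The dataframe 'toy' and its make names are hypothetical:

```r
# Made-up rows; 'make' is stored as a factor with three levels
toy <- data.frame(
  make   = factor(c("Aaa", "Bbb", "Ccc", "Bbb")),
  year   = c(2012, 2013, 2012, 2013),
  comb08 = c(41, 44, 47, 50)
)

# A plain subset keeps the parent's row names and all of its factor levels
child <- subset(toy, year == 2013)
print(rownames(child))      # "2" "4"
print(levels(child$make))   # "Aaa" "Bbb" "Ccc"

# Rebuilding the subset renumbers the rows and drops the unused levels
clean <- droplevels(data.frame(child, row.names = NULL))
print(rownames(clean))      # "1" "2"
print(levels(clean$make))   # "Bbb"
```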

Now we will sort, extract and plot our data. R is case sensitive. Pay special attention to your typing. [Editor's note: Discussion of the order function and the use of brackets should go here.]
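A minimal sketch of order() combined with bracket notation, on made-up values. order() does not return the sorted values themselves; it returns the positions that would sort them, and those positions are then used inside the brackets:

```r
# Made-up combined-mpg values
mpg <- c(47, 41, 50, 44)

idx <- order(mpg)          # positions of the values from smallest to largest
sorted_mpg <- mpg[idx]     # use the positions as a row index

# The same idea on a dataframe: order the rows, keep all columns
toy <- data.frame(model = c("Alpha", "Beta", "Gamma", "Delta"), comb08 = mpg)
ascending  <- toy[order(toy$comb08), ]
descending <- toy[order(toy$comb08, decreasing = TRUE), ]

print(idx)
print(ascending)
```

The sorted dataframe still carries its original row names, which is why the rows are renumbered afterwards with 'row.names=NULL'.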

# Plot 2013 GTR 40 MPG

for (i in (2013)) GTR40_2013 <-(droplevels(data.frame((subset(GTR40,year == i)),row.names=NULL)))
# Sort by 'comb08' (combined MPG) and renumber the row index:
GTR40_2013 <- droplevels(data.frame(GTR40_2013[order(GTR40_2013$comb08),],row.names=NULL))
# You can show the dataframe sorted by 'comb08' by typing 'GTR40_2013'.

The 'plot' and 'lines' functions are part of the default graphics package in R.

plot(GTR40_2013$comb08,xlab="GTR40 2013 Car Index",ylab="2013 Combined MPG from EPA 'comb08'",type="h")
lines(GTR40_2013$comb08)

# Plot 2013 GTR 45 MPG

for (i in (2013)) GTR45_2013 <-(droplevels(data.frame((subset(GTR45,year == i)),row.names=NULL)))
# Sort by 'comb08' (combined MPG) and renumber the row index:
GTR45_2013 <- droplevels(data.frame(GTR45_2013[order(GTR45_2013$comb08),],row.names=NULL))
# You can show the dataframe sorted by 'comb08' by typing 'GTR45_2013'.

plot(GTR45_2013$comb08,xlab="GTR45 2013 Car Index",ylab="2013 Combined MPG from EPA 'comb08'",type="h")
lines(GTR45_2013$comb08)

We can look ahead to some advanced graphic library functions. The 'lattice' graphics library gives another way of looking at this data:

# load lattice library

library(lattice)
histogram(model ~ comb08, data = GTR40_2013)
barchart(model ~ comb08 ,data = GTR40_2013)

End Notes for Basic Data Analysis: Part I
[1] On subsetting and the use of brackets in R: http://www.ats.ucla.edu/stat/r/modules/subsetting.htm
[2] See Roger D. Peng on Control Structures in R
[3] See Roger D. Peng on Functions
[4] For more on dataframes and R data structures, please see Lam, Longhow: An Introduction to R


Wednesday, March 6, 2013
