Teaching Data Science

Data Science Club Description

2014-01-28T09:42:00.000-08:00

Title: “The Data Knights”
Subject: “How computer science looks at 'Big Data' and why understanding 'Big Data' is important.”
Instructor: Ryan Ferris (Isabel's Dad!)
Blog: http://www.teachingdatascience.blogspot.com
Requirements: 5th - 8th grade; high schoolers at SPA; a willingness to work with data, statistics, programming languages, and symbolic math. There is no official workload; however, suggested assignments will provide an introduction to programming, statistics, mathematics, and database theory. Access to a home computer (PC, Mac, or Unix) is important to get the most out of this club. A tablet reader/browser may also be helpful.

Dates and Time: 3:30 - 5:00 PM on the following Tuesdays in the Computer Lab:
2/5
2/26
3/5
3/19
3/26
4/9
4/23

“The Data Knights” Club will take place once a week for 1.5 hours in the Upper School computer lab. A possible format will be 45 minutes of lecture followed by 45 minutes of lab work. A reading list, blog or website will provide updated links, information, and some suggested assignments. 'Data Science' is rapidly becoming one of the most important fields of computer science. It is seen as critical to help manage, marketize, analyze, and understand large volumes of data in an increasingly interconnected world. An introduction to 'data science' gives a parent or instructor an important opportunity to talk about how 'real world' computing uses 'big data' and 'data science' in diverse fields. 'Data Science' also gives us an opening to introduce algorithms and approaches to understanding data with mathematics and statistics.

'Data Science' revolves around skill sets in a number of fields including statistics, mathematics, network analysis, database theory, software engineering and modeling. 'Data Science' is being used to understand and explore fields as diverse as energy supplies, the human genome, habitable planets, climate change, financial markets, the propagation of disease, population demographics and many others. It is almost certain that the ability to think broadly and flexibly about 'big data' will be an important trait of the next generation of engineers, scientists, technologists and political leaders. It may also be important for all young students to understand the breadth and complexity of a world that may contain 10 billion of us by the end of this century.

Software (possible list):

R Statistics
PostgreSQL
Python
Scilab
Spreadsheets (Excel, Scalc)
Octave
AWK
Graphic Presentation Software

Notes: I can extend the Tuesday sessions in February and March to help with data analysis on your science project if your sponsor or parent finds that useful. There are only 14 desks in the SPA computer lab.

Elections Data and Accumulators

2013-05-08T21:58:00.003-07:00

I've loaded some code and charts I produced for a piece on my political blog. R Programming wizardry will take some time and practice to accumulate. In fact, it is a more modern programming structure that I refer to as the 'accumulator' that I had to hack up to make this cruft work. I often use an "accumulator" variable while doing data analysis with PowerShell. It allows me to continuously stuff similarly typed data into an array, like so:

rv -ea 0 a
# create one database stored as a variable (e.g. '$out') by merging all candidate donations.
# Add a 'member' or field (e.g. Candidate Name) to each record
$out+=$Varlist.name | % {
$a=(ls variable:/$PSItem).value;
$Name=(ls variable:/$PSItem).name;
$a | add-member -force -passthru -NotePropertyName Candidate -NotePropertyValue $PSItem;
}

# An $out record now is

$out[0]

Contributor : ATU
Date : 06/22/12
Amount : 900
P/G : P
City : WASHINGTON
State : DC
Zip : 20016
Employer :
Occupation :
Candidate : NM

I couldn't find something analogous in R Programming, so I resorted to code like this, which creates a zero-based numeric data.frame ('ddf'), stuffs it with data (via rbind()), and then removes the original zeroed data row and re-levels (via droplevels()) before finally returning the function value.

accyear <- function() {
#ddf <- data.frame(cbind(Year=NULL,Freq=NULL)))
ddf <- data.frame(cbind(Year=0,Freq=0))
for(i in list) {
dd <- (data.frame(cbind(Year=i,(subset(fvl2, Year == i, select= Freq)))))
dd <- (data.frame(cbind(Year=(unique(dd$Year)),Freq=(cumsum(dd$Freq)[nrow(dd)]))))
ddf <- rbind(ddf,dd)
}
ddf <- droplevels(data.frame(ddf[-1,],row.names=NULL))
return(ddf)
}
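For comparison, here is a minimal sketch of two other ways to get the same per-year totals in base R. It is not the code used for the charts: 'fvl2_demo' is a made-up stand-in for 'fvl2', and both aggregate() and the list-then-bind pattern are offered only as possible alternatives to the zero-row trick.

```r
# Toy stand-in for 'fvl2' (made-up values, not the voter data)
fvl2_demo <- data.frame(Year = c(1950, 1950, 1951, 1952, 1952),
                        Freq = c(3, 2, 4, 1, 6))

# Option 1: let aggregate() sum Freq within each Year
by_year <- aggregate(Freq ~ Year, data = fvl2_demo, FUN = sum)

# Option 2: an explicit "accumulator": collect one row per year in a list,
# then bind the pieces together once at the end
pieces <- lapply(sort(unique(fvl2_demo$Year)), function(y) {
  data.frame(Year = y, Freq = sum(fvl2_demo$Freq[fvl2_demo$Year == y]))
})
by_year_acc <- do.call(rbind, pieces)

print(by_year)
print(by_year_acc)
```

Binding once at the end avoids the dummy first row, so there is nothing to remove afterwards.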

Full code for the charts is below. Data is from a Whatcom County voter database and a Census sex and age database for Washington counties. The original CSV files were quite large: 5 to 7 million 'observations' or separate data fields. (Here I am describing nrow() * ncol().) So I pared them down sequentially, a process that is handled differently in SQL but to the same effect. Then I used table() to count voters by birth date and stack() to reshape the age columns into a single column for plotting. In my political blog, I overlaid the barplot()s with GIMP's transparent layer functionality, but in reality the visualization doesn't quite line up with the data. It is close enough, though, to suggest that a more accurate and interesting approach to correlating the separate information in one graph could be powerful.

fvl <- read.delim("ferrisvoterlist_20121204.txt")
fvl1 <- subset(fvl,select= c(1,3,5,8,15,16,17,18,19,21))
as.matrix(sapply(fvl1,class))

fvl2 <- as.data.frame(table(fvl1$BirthDate))

fvl2$Year <- substr((fvl2$Var1),7,10)
fvl2$Year <- as.numeric(fvl2$Year)
list <- sort(unique(fvl2$Year))

accyear <- function() {
#ddf <- data.frame(cbind(Year=NULL,Freq=NULL)))
ddf <- data.frame(cbind(Year=0,Freq=0))
for(i in list) {
dd <- (data.frame(cbind(Year=i,(subset(fvl2, Year == i, select= Freq)))))
dd <- (data.frame(cbind(Year=(unique(dd$Year)),Freq=(cumsum(dd$Freq)[nrow(dd)]))))
ddf <- rbind(ddf,dd)
}
ddf <- droplevels(data.frame(ddf[-1,],row.names=NULL))
return(ddf)
}

ddf <- accyear()
barplot(ddf$Freq,names.arg=ddf$Year,xlab="Voter Birth Year",ylab="Registration Count")

WA_AGESEX <- read.csv("CC-EST2011-AGESEX-53.csv")
as.matrix(grep(pattern="TOT",(as.list(names(WA_AGESEX))),value=TRUE))
Whatcom <- subset(WA_AGESEX, CTYNAME == "Whatcom County")
Whatcom4 <- subset(Whatcom, YEAR == 4)
WhatcomAGE <- (subset(Whatcom4,select=c(AGE1824_TOT,AGE2544_TOT,AGE4564_TOT,AGE65PLUS_TOT,AGE85PLUS_TOT)))
WhatcomAGE <- (droplevels(data.frame(WhatcomAGE,row.names=NULL)))
WhatcomAGE <- stack(WhatcomAGE)
# barplot(WhatcomAGE$values,names.arg=WhatcomAGE$ind)
barplot(WhatcomAGE$values[5:1],names.arg=WhatcomAGE$ind[5:1])


Basic Data Analysis and Lattice Graphics Framework

2013-03-29T18:04:00.000-07:00

(post under construction)
This post is very generic cruft on basic data analysis, designed around combining data with:

cbind()
rbind()
matrix()
as.matrix()
data.frame()
as.data.frame()

and visualizing data with the lattice graphics package.

Some quick and dirty notes on learning R. I have found that virtual libraries exist both online and offline for learning R. However, I have also found that R is a peculiar and specific language. I would compare its semantics most to SQL, but that comparison stops being useful quickly. Ironically, given the power of R for quantitative analysis, I have found that the user really needs to get the "feel of R" into his forearms to become useful and self-confident. Spending time manipulating and re-organizing data is essential at each step of learning R. Functionally, R is a mathematical platform and benefits from domain-specific packages and knowledge. But the R language is also a unique type of engine with limits on what it can be programmed to do. R does certain things very well. Other functionality, perhaps more typical of general-purpose programming languages, is simply outside the scope of R. There is an art to the successful use of R. There is an important 'R' mentality that only serious practice will instill.

This post follows from my last post on combining data for analysis. I am using BEA, BLS, and Census data to understand 20-year macro-economic flows. Some examples of how to use the cbind, rbind, matrix, and as.data.frame commands to re-organize this data are here. Below are some functions I have created to help explicate the data set ('dd'). They are slightly more concise than the function 'str(dd)'. The reader will recall that I have concatenated data from multiple sets into one dataframe. I have used prefixes (BLS, PI, NS) to designate "Bureau of Labor Statistics", (BEA) Personal Income, and (BEA) Net (residential) Stock.

getdf <- function(x) {as.data.frame(cbind(Class=sapply(x,class)),optional=TRUE)}

dfclass <- function(x,y) {as.matrix(grep(pattern=x,(as.character(names(y))),value=TRUE))}

print_matrix <- function(x) {
matrix1 <- matrix(sapply(x,class))
matrix1 <- cbind(matrix1,matrix(names(x)))

print(matrix1)

}

> getdf(dd)
Class
Year integer
BLS.Employment integer
BLS.CivLabForce integer
BLS.Census.Population integer
BLS.EmployPopRatio numeric
BLS.Census.EMP_CLF numeric
BLS.Census.EMP_POP numeric
BLS.Census.CLF_POP numeric
PI.PersonalInc numeric
PI.EmployeeComp numeric
PI.ProprietorInc numeric
PI.RentalInc numeric
PI.AssetReceipts numeric
PI.InterestInc numeric
PI.DividendInc numeric
PI.TransferReceipts numeric
PI.GovBenefits numeric
PI.OtherTransfer numeric
NS.ResidentialFixedAssets numeric
NS.Private numeric
NS.Corporate numeric
NS.Noncorp numeric
NS.Sole_prop_partner numeric
NS.Nonprofit numeric
NS.Households numeric
NS.Government numeric
NS.Federal numeric
NS.StateLocal numeric
NS.OwnerOccupied numeric
NS.TenantOccupied numeric

> dfclass("PI",dd)
[,1]
[1,] "PI.PersonalInc"
[2,] "PI.EmployeeComp"
[3,] "PI.ProprietorInc"
[4,] "PI.RentalInc"
[5,] "PI.AssetReceipts"
[6,] "PI.InterestInc"
[7,] "PI.DividendInc"
[8,] "PI.TransferReceipts"
[9,] "PI.GovBenefits"
[10,] "PI.OtherTransfer"

> print_matrix(dd)
[,1] [,2]
[1,] "integer" "Year"
[2,] "integer" "BLS.Employment"
[3,] "integer" "BLS.CivLabForce"
[4,] "integer" "BLS.Census.Population"
[5,] "numeric" "BLS.EmployPopRatio"
[6,] "numeric" "BLS.Census.EMP_CLF"
[7,] "numeric" "BLS.Census.EMP_POP"
[8,] "numeric" "BLS.Census.CLF_POP"
[9,] "numeric" "PI.PersonalInc"
[10,] "numeric" "PI.EmployeeComp"
[11,] "numeric" "PI.ProprietorInc"
[12,] "numeric" "PI.RentalInc"
...

str() also works well to enumerate a dataframe:

> str(dd)

'data.frame':   20 obs. of  30 variables:
 $ Year                     : int  1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 ...
 $ BLS.Employment           : int  118492 120259 123060 124900 126708 129558 131463 133488 136891 136933 ...
 $ BLS.CivLabForce          : int  128105 129200 131056 132304 133943 136297 137673 139368 142583 143734 ...
 $ BLS.Census.Population    : int  256894 260255 263436 266557 269667 272912 276115 279295 282385 285309 ...
 $ BLS.EmployPopRatio       : num  0.615 0.617 0.625 0.629 0.632 0.638 0.641 0.643 0.644 0.637 ...
 $ BLS.Census.EMP_CLF       : num  0.925 0.931 0.939 0.944 0.946 0.951 0.955 0.958 0.96 0.953 ...
 $ BLS.Census.EMP_POP       : num  0.925 0.931 0.939 0.944 0.946 0.951 0.955 0.958 0.96 0.953 ...
 $ BLS.Census.CLF_POP       : num  0.499 0.496 0.497 0.496 0.497 0.499 0.499 0.499 0.505 0.504 ...
 $ PI.PersonalInc           : num  5347 5568 5875 6201 6592 ...
 $ PI.EmployeeComp          : num  3647 3791 3981 4179 4388 ...
 $ PI.ProprietorInc         : num  415 450 485 516 584 ...
 $ PI.RentalInc             : num  84.6 114.1 142.9 154.6 170.4 ...
 $ PI.AssetReceipts         : num  910 900 948 1005 1081 ...
 $ PI.InterestInc           : num  722 698 713 752 784 ...
 $ PI.DividendInc           : num  188 202 235 253 296 ...
 $ PI.TransferReceipts      : num  746 791 826 879 924 ...
 $ PI.GovBenefits           : num  730 777 813 860 901 ...
 $ PI.OtherTransfer         : num  16.3 14.1 13.3 18.7 22.9 19.4 26 34 42.4 46.8 ...
 $ NS.ResidentialFixedAssets: num  6744 7160 7668 8009 8449 ...
 $ NS.Private               : num  6586 6990 7486 7821 8253 ...
 $ NS.Corporate             : num  69.6 71.5 74.6 77.4 81.1 ...
 $ NS.Noncorp               : num  6516 6918 7411 7743 8172 ...
 $ NS.Sole_prop_partner     : num  617 634 656 681 714 ...
 $ NS.Nonprofit             : num  110 112 116 117 120 ...
 $ NS.Households            : num  5788 6172 6639 6945 7337 ...
 $ NS.Government            : num  159 170 182 188 196 ...
 $ NS.Federal               : num  52.7 56.4 59.9 61.6 63.8 66 68.7 72.2 75.3 79.1 ...
 $ NS.StateLocal            : num  106 114 122 126 133 ...
 $ NS.OwnerOccupied         : num  4918 5267 5694 5975 6333 ...
 $ NS.TenantOccupied        : num  1801 1866 1945 2005 2087 ...

Once we have peeked at our data, we can start using graphics packages like lattice to visualize it. I find that visualizing is a critical step in understanding the relationships in data, the specifics of which are not always immediately clear. The lattice graphics package (which ships with the standard installation) lets us rather easily plot multiple data sets against the same X axis (e.g. "Year") by concatenating them with "+" in the formula. Also added are a specific Y label, key, and line type (ylab, auto.key, type). I use the lattice package "xyplot" function. All measurements are in billions:

library(lattice)
xyplot(NS.Households + PI.PersonalInc + PI.GovBenefits ~ Year,data=dd,ylab="Non Specific",auto.key=TRUE,type="b")

What we perceive at first glance wouldn't surprise many observers of the American economy over the last twenty years. Nominally, we have seen increasing valuation of housing stock, personal income, and government benefits. 2008 - 2009 brought a "crash" in housing stock valuation and personal income and a simultaneous increase in government spending. Even this non-specific chart shows the "bubble" in housing stock valuation whose "bursting" resulted in a precipitous decline in personal income. Now let us try something a bit more complex. Below, I scale and create data sets to help me visually understand and weight macro-economic change. Notice that inside parentheses the operators "+" and "-" perform math; otherwise "+" concatenates another series onto the Y axis.

xyplot(BLS.Census.Population/100 + PI.GovBenefits + (BLS.Census.CLF_POP * 5000) + (NS.Households - PI.PersonalInc) ~ Year,data=dd,ylab="Divergence",auto.key=TRUE,type="b")

The orange line represents the increasing difference between household value and personal income we saw in the first chart. This time, we can readily see that difference crest at about $5 trillion in 2007 before taking a dive. Total population is originally measured in thousands, so I divide it by 100 to smack it front and center in this graph of billions. BLS statistics give us the employed labor force divided by the potential working force as a percentage. (These are not the vaunted unemployment measures U-3 or U-6!) I multiply this percentage by 5000 so the relationship between total population and a shrinking work force percentage (over time) is made clear. Government benefits fit into this graph without scaling. Clearly, they had been rising nominally well before Barack Obama took office and government stimulus packages were deployed.
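The formula behavior used in these charts can be tried on a small data set. The sketch below uses made-up numbers and hypothetical column names ('big', 'small', 'income'); it only illustrates that a top-level "+" adds another series while arithmetic inside parentheses is evaluated first. The lattice documentation also describes wrapping a term in I() for the same purpose.

```r
library(lattice)

# Made-up series on very different scales (illustration only)
toy <- data.frame(Year = 2001:2005,
                  big = c(1000, 1100, 1250, 1400, 1500),
                  small = c(0.50, 0.49, 0.48, 0.47, 0.46),
                  income = c(600, 640, 700, 760, 800))

# 'big', '(small * 2000)' and '(big - income)' become three separate series;
# the math inside each pair of parentheses is done before plotting
p <- xyplot(big + (small * 2000) + (big - income) ~ Year, data = toy,
            ylab = "Toy units", auto.key = TRUE, type = "b")
print(p)
```

Scaling 'small' by 2000 plays the same role as multiplying the labor force ratio by 5000 above: it moves a small-valued series into the range of the others so that its trend is visible.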

From this graph we could deduce more sharply the American economic dilemma: a disjointed housing bubble, a shrinking workforce percentage (e.g. an aging population), a steadily increasing population, and an economy dependent upon increasing government benefits. Let us try another approach to visualizing similar data, but before we do, some comments on employment and population percentages may be useful. The total number of employed persons in the United States is always some percentage of the civilian labor force, which in turn is always some percentage of the total population. These numbers are always much different from U-3 or U-6, which are the usual measures of unemployment. Let us take a look at this data for the first and last years in my data set from the BLS and Census. To do this, I am going to bind three columns as matrices from my dataframe. This code formats indexed dataframe information nicely:

> cbind(matrix(names(ddBLS)),matrix(ddBLS[1,1:8]), matrix(ddBLS[20,1:8]))
[,1] [,2] [,3]
[1,] "Year" 1992 2011
[2,] "BLS.Employment" 118492 139869
[3,] "BLS.CivLabForce" 128105 153617
[4,] "BLS.Census.Population" 256894 312603
[5,] "BLS.EmployPopRatio" 0.615 0.584
[6,] "BLS.Census.EMP_CLF" 0.925 0.911
[7,] "BLS.Census.EMP_POP" 0.461 0.447
[8,] "BLS.Census.CLF_POP" 0.499 0.491
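As a check on how the three ratio columns relate to the raw counts, the sketch below recomputes them from the 1992 and 2011 figures printed above. The variable names are mine, and the counts are assumed to be in thousands of persons.

```r
# Counts copied from the table above (assumed to be thousands of persons)
employment      <- c("1992" = 118492, "2011" = 139869)
civ_labor_force <- c("1992" = 128105, "2011" = 153617)
population      <- c("1992" = 256894, "2011" = 312603)

# Each ratio is a simple elementwise division of two columns
ratios <- rbind(EMP_CLF = employment / civ_labor_force,
                EMP_POP = employment / population,
                CLF_POP = civ_labor_force / population)
print(round(ratios, 3))
```

Employment over labor force is much higher than employment over population because the labor force is itself only about half of the total population.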

We can use the lattice package ('xyplot') to look at these percentages:

xyplot(BLS.Census.EMP_CLF + BLS.EmployPopRatio + BLS.Census.CLF_POP + BLS.Census.EMP_POP ~ Year,type="b",ylab="BLS Employment %",auto.key=TRUE)

But maybe we want to look at related data on four separate charts with separate Y axes. We can use commands from the base graphics package to do this:

par(mfrow=c(2,2), pch=16)
attach(USEmployPop.1992.2011)
plot(Year,BLS.Employment,type="b")
plot(Year,BLS.CivLabForce,type="b")
plot(Year,BLS.Census.CLF_POP, type="b")
plot(Year,BLS.Census.EMP_POP,type="b")
detach(USEmployPop.1992.2011)
par(mfrow=c(1,1), pch=1)

diff1 <- as.data.frame(cbind(Year,'BLS.Total.Employment(M)'=BLS.Employment/100,NS.ResidentialFixedAssets,PI.PersonalInc,'Diff(NS.RFA - PI.PInc)' = (NS.ResidentialFixedAssets - PI.PersonalInc)))
diff1

ggplot2

2013-03-21T22:03:00.005-07:00

Hadley Wickham on using R and ggplot2:

Combining DataFrames in R Programming

2013-03-18T17:06:00.000-07:00

The screencast below discusses combining dataframes from disparate sources in R Programming. Full screen is probably best. Data files for this screencast can be found here.

This is the code that accompanies the screencast above.

# data science exercise combining dataframes from different sources
# data science exercise using lattice graphics system

list.files()
list.files(pattern="csv")

USPerInc.1992.2011 <- data.frame(read.csv("PersonalIncomeDisposition1992-2011.csv"))
USResidentialAsset.1992.2011 <- data.frame(read.csv("Current-Cost_Net_Stock_Residential_Fixed_Assets.csv"))
USEmployPop.1992.2011 <- data.frame(read.csv("BLS_Census.csv"))

USComb <- cbind(USPerInc.1992.2011[c(1,6)])
names(USComb)
USComb <- cbind(USResidentialAsset.1992.2011[c(2,3,4,8)])
names(USComb)

USComb <- cbind(USPerInc.1992.2011[c(1,6)])
USComb <- cbind(USComb,USResidentialAsset.1992.2011[c(2,3,4,8)])
matrix(names(USComb))

USComb1 <- data.frame(read.csv("BLS_Census.csv"))
USComb1 <- cbind(USComb1,(data.frame(read.csv("PersonalIncomeDisposition1992-2011.csv"))))
USComb1 <- cbind(USComb1,(data.frame(read.csv("Current-Cost_Net_Stock_Residential_Fixed_Assets.csv"))))
names(USComb1)

grep(pattern="Year",(as.character(names(USComb1))),value=TRUE)
grepl(pattern="Year",(as.character(names(USComb))))
matrix(grepl(pattern="Year",(as.character(names(USComb1)))))

USComb1 <- cbind(USComb1[c(-9,-20)])

matrix1 <- matrix(sapply(USComb,class))
matrix1 <- cbind(matrix1,matrix(names(USComb)))
matrix2 <- matrix(sapply(USComb1,class))
matrix2 <- cbind(matrix2,matrix(names(USComb1)))
matrix1
matrix2

dd <- USComb1
str(dd)
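The screencast code binds the files side by side with cbind() and then drops the repeated "Year" columns by index. A possible alternative, not used in the screencast, is merge(), which matches rows on a shared key. The sketch below uses two small made-up data frames ('income' and 'assets' are hypothetical names) rather than the CSV files.

```r
# Small stand-ins for two of the CSV files (illustrative values only)
income <- data.frame(Year = 1992:1994, PersonalInc = c(5347, 5568, 5875))
assets <- data.frame(Year = 1992:1994, Households = c(5788, 6172, 6639))

# cbind() keeps both 'Year' columns, which then have to be dropped by index
bound <- cbind(income, assets)
print(names(bound))

# merge() matches rows on the shared key and keeps a single 'Year' column
merged <- merge(income, assets, by = "Year")
print(names(merged))
```

cbind() pairs rows purely by position, so it relies on every file covering the same years in the same order; merge() makes that matching explicit.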

2013-03-11T13:58:00.001-07:00

"Great coders are today's rock stars."

I just had to post this video on learning to code from an all star cast at code.org...

RGui, Rstudio, Notepad++; Creating and Using Functions

2013-03-08T19:20:00.001-08:00

The code below is from the screencast published on YouTube above. This screencast is available as a high resolution WMV file. For more data see the Week4 folder.

# * ** Basic R Programming Function and Plotting Demonstration ** * #

# create a function to find the square of any number
sqr <- function(a) {a^2}
# works on 'scalar' or iterates over 'vector'
sqr(100)
sqr(1:100)
# use 'sapply' to iterate range as a data.frame for your function 'sqr' and built-in function 'sqrt'
data.frame(sapply((1:100),sqr))
data.frame(sapply((1:100),sqr),(sapply((1:100),(sqrt))))
# plot the data.frame with histogram-like vertical lines (type="h") and custom X,Y labels
plot(data.frame(sapply((1:100),sqr),(sapply((1:100),(sqrt)))),xlab="Square 1:100", ylab="Square root 1:100",type="h")

# use 'cbind' (column bind) to apply XY labels to the columns
data.frame(cbind(sqr=sapply((1:100),sqr)),sqrt=sapply((1:100),(sqrt)))
# pump the same dataframe to 'dd' and plot 'dd'
dd <- data.frame(cbind(sqr=sapply((1:100),sqr)),sqrt=sapply((1:100),(sqrt)))
plot(dd, type="h")
lines(dd)

# create separate numeric vectors and combine them into a dataframe
s1 <- sapply((1:100),sqr)
s2 <- sapply((1:100),sqrt)
dd <- data.frame(s1,s2)
plot(dd, type="h")
lines(dd)

# use terms more relevant to the X and Y labels
Square <- sapply((1:100),sqr)
SquareRoot <- sapply((1:100),sqrt)
dd <- data.frame(Square,SquareRoot)
plot(dd, type="h")
lines(dd)

# create functions and values for all XY graph quadrants
sqr <- function(a) {a^2}
sqr_neg <- function(a) {-(a^2)}

Square <- sapply((1:100),sqr)
Square_neg <- -(sapply((1:100),sqr))
SquareRoot <- sapply((1:100),sqrt)
SquareRoot_neg <- -(sapply((1:100),sqrt))

# plot a four by four series of charts as a dataframe
dd <- data.frame(Square,SquareRoot,Square_neg,SquareRoot_neg)
plot(dd)

# quadrant (both XY are positive numbers)
plot(dd$Square,dd$SquareRoot, type="h")
lines(dd$Square,dd$SquareRoot)

# quadrant (both XY are negative numbers)
plot(dd$Square_neg,dd$SquareRoot_neg, type="h")
lines(dd$Square_neg,dd$SquareRoot_neg)

# use points instead of lines
plot(dd$Square,dd$SquareRoot, type="h")
points(dd$Square,dd$SquareRoot)

plot(dd$Square_neg,dd$SquareRoot_neg, type="h")
points(dd$Square_neg,dd$SquareRoot_neg)

# More Plotting and plotting functions
plot(Square ~ SquareRoot)

plot(Square,SquareRoot)
plotfunc <- function(x) {plot((Square ~ SquareRoot),subset = x)}
plotfunc <- function(x) {plot((Square ~ SquareRoot),dd,subset = x)}
plotfuncSquaregtr <- function(x) {plot((Square ~ SquareRoot),subset=Square > x)}
plotfuncSquarelt <- function(x) {plot((Square ~ SquareRoot),subset=Square < x)}

Basic data analysis: Part I

2013-03-06T21:03:00.001-08:00

[Editor's note: Under construction - 03/06/2013.]
There are a number of functions that will be helpful for this example. Please examine them through use of the help system (e.g. 'help(command)'):

read.csv()
head()
names()
as.numeric()
c()
data.frame() or as.data.frame()
sapply()
class()
print()
levels()
droplevels()
subset()
order()
plot()
lines()

Understanding bracket (e.g. '[]') notation [1] and the for command is also important for this exercise.[2] This exercise uses nested interior functions whose results provide arguments for exterior functions. The general form is:

function1(function2(function3()))

Here, function3 provides arguments for function2, which provides arguments for function1. Sometimes this form appears as:

function1(function2(function3(argsFUN3), other_argsFUN2), other_argsFUN1)

where additional function arguments are passed, still inside the parentheses specific to each function. The result is usually a datatype determined by the last exterior function.[3]
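A small sketch of this nesting, using only base R functions and a made-up vector, may help. The step-by-step version and the nested version compute the same thing.

```r
values <- c(16, 4, 9, 25)

# Step by step, with intermediate variables
roots   <- sqrt(values)                    # plays the role of function3
sorted  <- sort(roots, decreasing = TRUE)  # function2, with an extra argument
top_two <- head(sorted, 2)                 # function1, with an extra argument

# The same work as one nested call
top_two_nested <- head(sort(sqrt(values), decreasing = TRUE), 2)

print(top_two_nested)
```

Reading a nested call from the innermost parentheses outward gives the order in which R does the work.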

The premise of this exercise is quite straightforward. You are 16 years old. Your parents have offered to purchase you a vehicle. They will pay for the purchase price, taxes, and insurance. However, you must pay for your fuel. You want to examine the EPA's fuel economy database to better understand your options for high mileage vehicles.

The dataframe class[4] is a row/column data structure that accepts mixed or heterogeneous columnar datatypes or data classes.
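As an illustration of mixed column classes, here is a minimal sketch with a made-up data frame ('cars_demo' and its values are hypothetical, not taken from the EPA file).

```r
# A small made-up data frame: each column has its own class
cars_demo <- data.frame(model  = c("Alpha", "Beta", "Gamma"),
                        year   = c(2011L, 2012L, 2013L),
                        comb08 = c(41.5, 46.0, 50.2),
                        hybrid = c(FALSE, TRUE, TRUE),
                        stringsAsFactors = FALSE)

# One class per column
print(sapply(cars_demo, class))

# Number of rows and columns
print(dim(cars_demo))
```

A matrix, by contrast, would force all four columns into a single class.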

# Download http://www.fueleconomy.gov/feg/epadata/vehicles.csv.zip
# unzip vehicles.csv to your working directory

AllVehicles <- read.csv("vehicles.csv")
AllVehicles$comb08 <- as.numeric(AllVehicles$comb08)
# You can show the names, depth and datatypes of the 71 columns and 33,184 rows in the dataframe 'AllVehicles' with:
nrow(AllVehicles)
ncol(AllVehicles)
length(names(AllVehicles))
AllVehicles[0,]
AllVehicles[,0]
names(AllVehicles)
# You can show the class (datatype) of each column with:
sapply(AllVehicles,class)
class(AllVehicles$comb08)

Extract separate dataframes for those cars whose combined mileage is greater than forty mpg and greater than forty-five mpg.

GTR40 <- data.frame(subset(AllVehicles, comb08 > 40, select = c(model,make,year,comb08,comb08U,combA08,combA08U,combE)))
GTR45 <- data.frame(subset(AllVehicles, comb08 > 45, select = c(model,make,year,comb08,comb08U,combA08,combA08U,combE)))

For examination purposes, output to the screen separate lists, for model years 2008 through 2013, of vehicles with fuel economy estimates of over forty and forty-five mpg. Note the use of the 'row.names=NULL' argument to the data.frame function and the nested droplevels function. These commands reformat the dataframe as a separate object from the parent dataframe, stripping metadata specific to the parent and rebuilding it for the child object. The 'for' control structure does not require braces ({}) if the body is a single command on one line. The two typical forms are:

for (i in (some numerical range, list or function)) do this

for (i in (some numerical range, list or function)) {
do this on the next line inside braces
}

'2008:2013' specifies a numerical range in the source below:

for (i in (2008:2013)) print(droplevels(data.frame((subset(GTR40,year == i)),row.names=NULL)))
for (i in (2008:2013)) print(droplevels(data.frame((subset(GTR45,year == i)),row.names=NULL)))
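The two 'for' forms described above can be compared in a short sketch. The years and the 'totals' vector are made up for illustration.

```r
# Single expression: braces are optional
for (i in 2011:2013) print(i)

# Several statements: braces are required
totals <- numeric(0)
for (i in 2011:2013) {
  squared <- i^2
  totals <- c(totals, squared)
}
print(totals)
```

Without the braces, only the first statement after the for() would be repeated.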

Now we will sort, extract and plot our data. R is case sensitive. Pay special attention to your typing. [Editor's note: Discussion of the order function and the use of brackets should go here.]
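Until that discussion is written, here is a minimal sketch of how order() and brackets work together. It uses a made-up data frame ('mpg_demo') rather than the EPA data.

```r
mpg_demo <- data.frame(model  = c("Alpha", "Beta", "Gamma", "Delta"),
                       comb08 = c(46, 41, 50, 43),
                       stringsAsFactors = FALSE)

# order() returns the row positions that would sort the column
idx <- order(mpg_demo$comb08)
print(idx)

# Brackets are [rows, columns]; leaving the column slot empty keeps all columns
sorted_demo <- mpg_demo[idx, ]

# Reset the row names so the index runs 1, 2, 3, ... again
rownames(sorted_demo) <- NULL
print(sorted_demo)

# Descending order, a single column, first two entries
print(mpg_demo[order(mpg_demo$comb08, decreasing = TRUE), "model"][1:2])
```

Note that order() does not sort anything by itself; the sorting happens when its result is used inside the brackets.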

# Plot 2013 GTR 40 MPG

for (i in (2013)) GTR40_2013 <-(droplevels(data.frame((subset(GTR40,year == i)),row.names=NULL)))
# Sort or order by 'comb08' or Combined MPG and reorder index:
GTR40_2013 <- droplevels(data.frame(GTR40_2013[order(GTR40_2013$comb08),],row.names=NULL))
#You can show the 'comb08' sorted dataframe with 'GTR40_2013':

The 'plot' and 'lines' functions are part of the default graphics package in R.

plot(GTR40_2013$comb08,xlab="GTR40 2013 Car Index",ylab="2013 Combined MPG from EPA 'comb08'",type="h")
lines(GTR40_2013$comb08)

# Plot 2013 GTR 45 MPG

for (i in (2013)) GTR45_2013 <-(droplevels(data.frame((subset(GTR45,year == i)),row.names=NULL)))
# Sort or order by comb08 or Combined MPG reorder index:
GTR45_2013 <- droplevels(data.frame(GTR45_2013[order(GTR45_2013$comb08),],row.names=NULL))
#You can show the 'comb08' sorted dataframe with 'GTR45_2013' :

plot(GTR45_2013$comb08,xlab="GTR45 2013 Car Index",ylab="2013 Combined MPG from EPA 'comb08'",type="h")
lines(GTR45_2013$comb08)

We can look ahead to some advanced graphic library functions. The 'lattice' graphics library gives another way of looking at this data:

# load lattice library

library(lattice)
histogram(model ~ comb08, data = GTR40_2013)
barchart(model ~ comb08 ,data = GTR40_2013)

End Notes for Basic Data Analysis: Part I
[1] On subsetting and the use of brackets in R : http://www.ats.ucla.edu/stat/r/modules/subsetting.htm
[2] See Roger D. Peng on Control Structures in R
[3] See Roger D. Peng on Functions
[4] For more on dataframes and R data structures, please see Lam, Longhow: An Introduction to R

Basic Graphing in R: Combining, Plotting and Smoothing

2013-03-01T16:31:00.002-08:00

R Graphs from left to right: Price of Imported Oil per Quarter 1976:2012; Price of Retail Gasoline per Quarter 1976:2012; Ratio of Retail Gas / Imported Oil per Quarter 1976:2012. Source: U.S. EIA: "Short-Term Energy Outlook Real and Nominal Prices, February 13, 2012"

The data files, images and R Script for this blog are here. These examples use R 2.14 64 bit for Windows. Because I am neither a statistician nor an energy professional, the results of the following analysis will have to be taken with a "grain of salt". The purpose of this post is to demonstrate basic exporting, reformatting, combining, plotting, and smoothing of data in R.

Finding and Importing Data

I found historical prices of United States energy consumption at the Energy Information Administration.[1] I wanted to understand better the rise in the price of gasoline in the United States and how closely it relates to the historic rise in the market price of crude oil. I used the quarterly worksheets Crude Oil - Q and Gasoline - Q from the EIA spreadsheet "real_prices.xls"; data current as of February, 2013. I found it simplest to reformat the dates to numeric columns in CSV ('comma-separated values') format. I synchronized both worksheets to cover the same date range and created a simplified numeric date range using Open Office Scalc's left and right functions to reformat the quarter dates, thus sidestepping the issue of date formatting (for now). After importing the data into R with these commands:

> QTR_Imp_Oil_Price <- read.csv("ImportedOilPrice_datereformat_simple.csv")
> QTR_Retail_Gas_Price <- read.csv("QuarterRetailGas_datereformat_simple.csv")

I then had two data frames as below. Since all the columns now contain only numbers, 'read.csv' imports them as numeric (or integer) columns:

> head(QTR_Imp_Oil_Price)
Q Year Index84 Nominal Real
1 1 1976 0.5590 13.3500 55.3500
2 2 1976 0.5640 13.4296 55.1742
3 3 1976 0.5730 13.5194 54.6710
4 4 1976 0.5813 13.5948 54.1876
5 5 1977 0.5920 14.3847 56.3033
6 6 1977 0.6023 14.5384 55.9284
...

> head(QTR_Retail_Gas_Price)
Q Year Index84 Nominal Real
1 1 1976 0.5590 0.60 2.49
2 2 1976 0.5640 0.60 2.48
3 3 1976 0.5730 0.63 2.53
4 4 1976 0.5813 0.63 2.50
5 5 1977 0.5920 0.64 2.49
6 6 1977 0.6023 0.66 2.53
...

For this post, we can ignore "Index84" and "Real" data columns. However, we will create a new dataframe by combining two columns from separate dataframes:

Oil_Gas_Nominal <- data.frame(QTR_Imp_Oil_Price$Nominal,QTR_Retail_Gas_Price$Nominal)
# copy to a more readable name
Oil_Gas_Nominal_Price <- Oil_Gas_Nominal
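One variation on this step, sketched below with made-up stand-ins ('oil_demo' and 'gas_demo' are hypothetical, with values loosely based on the first rows shown above), is to name the columns when building the combined dataframe and to confirm that both sources cover the same quarters before pairing them.

```r
# Made-up stand-ins for the two imported data frames
oil_demo <- data.frame(Q = 1:4, Nominal = c(13.35, 13.43, 13.52, 13.59))
gas_demo <- data.frame(Q = 1:4, Nominal = c(0.60, 0.60, 0.63, 0.63))

# Guard against silently pairing rows from different quarters
stopifnot(identical(oil_demo$Q, gas_demo$Q))

# Naming the arguments gives the new columns short, readable names
prices_demo <- data.frame(Oil = oil_demo$Nominal, Gas = gas_demo$Nominal)
print(names(prices_demo))
print(prices_demo)
```

Without explicit names, data.frame() builds the column names from the expressions themselves, which become long and awkward to type.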

Plotting in R

The commands

help(plot)
methods(plot)
help(lines)
library(help="stats")
help(lowess)

help us understand the versatility of plotting in R. In the examples below I am using the plot.ts command (i.e. 'plot time series'). Because the dataframe's rows are already in time order, one per quarter, separate x and y arguments are not needed. Type Oil_Gas_Nominal_Price[1] at the R console to see why. The lines function allows me to apply scatterplot smoothing to the graph. The plot.ts function allows for x and y axis labels as well as chart type. Here type="h" specifies histogram-like vertical lines. Here are some examples from the EIA derived dataframes:

require(stats)
plot.ts(Oil_Gas_Nominal_Price[1], xlab="By Quarter: 1976:2012",type="h")
lines(stats::lowess(Oil_Gas_Nominal_Price[1]))

require(stats)
plot.ts(Oil_Gas_Nominal_Price[2], xlab="By Quarter: 1976:2012", type="h")
lines(stats::lowess(Oil_Gas_Nominal_Price[2]))

This last plot shows how dividing one dataframe column by another is a vectorized operation.

require(stats)
Ratio_Nominal <- data.frame(Oil_Gas_Nominal[2]/Oil_Gas_Nominal[1])
plot.ts(data.frame(Ratio_Nominal),main="Retail Gas/Imported Oil",xlab="By Quarter: 1976:2012",ylab="Retail Gas/Imported Oil",type="h")
lines(stats::lowess(Ratio_Nominal))
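The same idea can be seen on plain vectors. The sketch below uses made-up quarterly prices (not the EIA data) to show the elementwise division and what lowess() returns before it is handed to lines().

```r
# Made-up quarterly prices (illustration only)
oil <- c(13.4, 13.5, 14.4, 14.5, 20.1, 28.9, 33.0, 31.2)
gas <- c(0.60, 0.63, 0.64, 0.66, 0.88, 1.19, 1.31, 1.22)

# Division is elementwise: one ratio per quarter, no loop needed
ratio <- gas / oil

# lowess() returns a list with components x and y: the smoothed curve
smooth <- lowess(ratio)
print(names(smooth))

plot.ts(ratio, type = "h", xlab = "Quarter index", ylab = "Gas / Oil")
lines(smooth)
```

Because lowess() returns x and y together, its result can be passed directly to lines() to overlay the smoothed curve.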

More information on DataFrames in R:

[1] http://timhesterberg.home.comcast.net/~timhesterberg/Rpackages/TwoPackages5.pdf
[2] http://cran.r-project.org/doc/manuals/R-intro.html#Data-frames
[3] http://cran.r-project.org/web/packages/dataframe/dataframe.pdf
[4] http://www3.nd.edu/~steve/Rcourse/Lecture2v1.pdf
[5] http://www.dummies.com/how-to/content/how-to-create-a-data-frame-from-scratch-in-r.html
[6] http://www.rochester.edu/College/gradstudents/bkenkel//data/rcourse_chap03.pdf
[7] http://rwiki.sciviews.org/doku.php?id=tips:data-frames
[8] http://rwiki.sciviews.org/doku.php?id=tips:data-frames:sort

More information on Graphs in R:

[1] http://www.harding.edu/fmccown/r/
[2] http://www.statmethods.net/graphs/scatterplot.html
[3] http://stat.ethz.ch/R-manual/R-devel/library/graphics/html/plot.html
[4] http://www.cyclismo.org/tutorial/R/plotting.html
[5] http://www.sr.bham.ac.uk/~ajrs/R/r-plot_data.html
[6] http://stackoverflow.com/questions/2564258/plot-2-graphs-in-same-plot-in-r
[7] http://flowingdata.com/2012/12/17/getting-started-with-charts-in-r/

Datatypes, consistent data, reshaping data, dirty data

2013-02-25T17:01:00.001-08:00

[Data for this post can be found here]

This post is on one of the nasty prerequisites for all data professionals: "Understanding Dirty Data and How to Clean it Up". Dirty data is sometimes called "bad data", and cleaning it up is sometimes described as the quest for "tidy data". Most 'relational data' uses the row and column format you are familiar with from a spreadsheet. Ideally, all data would be arranged neatly in such a format. Let us take a look at such data from R's data editor window. You can click on these pictures to see them in the blogger slide viewer:

So this data is part of the EPA 2013 fuel economy ratings. You can see the steps I took to get this data into the data editor here (click to enlarge):

All well so far. Now let us try some data analysis! Let us say, for example, that you want a list of all vehicles whose combined MPG is 45 or more. The following syntax would work just fine if your data were "clean":

> subset(EPADelim, Cmb.MPG >= 45 , select = c(Model,Cmb.MPG))

[1] Model Cmb.MPG

<0 rows> (or 0-length row.names)

Warning message:

In Ops.factor(Cmb.MPG, 45) : >= not meaningful for factors

However, a warning message is returned. So let us take a look at 'Cmb.MPG'. Right away, we see some problems. Many of the data fields are marked "N/A". By default, R only treats the string "NA" as a missing value, so "N/A" is read in as ordinary text. More troublesome for those of us who would like to do numeric comparisons with the subset function are entries such as "16/24". Because of these non-numeric entries the whole column was imported as a "factor", and factors can not be subject to numeric comparison by R.
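Here is a small sketch of how the "N/A" entries affect the import. The three-row table is made up for illustration, and the 'na.strings' setting is my suggestion rather than the steps used for the screenshots in this post. Note that recent versions of R (4.0 and later) import text as "character" rather than "factor" unless you ask for factors.

```r
# Made-up tab-delimited sample with the same two column names as the EPA file.
raw <- "Model\tCmb.MPG\nACURA ILX\t38\nRAM 3500\tN/A\nTOYOTA Prius\t50\n"

# Without na.strings, "N/A" is text, so the whole column is not numeric.
default_read <- read.delim(text = raw, stringsAsFactors = TRUE)
class(default_read$Cmb.MPG)

# Telling R that "N/A" means missing lets the column be read as numbers.
clean_read <- read.delim(text = raw, na.strings = "N/A")
class(clean_read$Cmb.MPG)

# Numeric comparison now works; rows with a missing value are dropped by subset().
high <- subset(clean_read, Cmb.MPG >= 45, select = c(Model, Cmb.MPG))
high
```

This only takes care of "N/A". Entries like "16/24" would still keep the real column from being numeric, which is why more cleaning is needed.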

To understand data a little better, let us discuss (briefly) 'datatypes' in R. R has a number of important classes of data. You can use the 'class()' function on any object in R to uncover the class type. For example:

> class(1)
[1] "numeric"
> class("char")
[1] "character"
> class(1:10)
[1] "integer"
> class(2303456L)
[1] "integer"
> class(df)
[1] "data.frame"
> class(get.c)
[1] "function"

The class of the data can be changed with the 'as.[class]' function:

> class(EPADelim$City.MPG)
[1] "factor"

> class(as.numeric(EPADelim$City.MPG))

[1] "numeric"
> (as.numeric(EPADelim$City.MPG))
[1] 52 52 52 40 40 38 38 29 29 32....

> class(as.vector(EPADelim$City.MPG))
[1] "character"
> as.vector(EPADelim$City.MPG)[1:10]
[1] "39" "39" "39" "24" "24" "22" "22" "16" "16" "19"
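Notice that the two conversions above disagree (52 versus "39" for the first row). That is because as.numeric() applied directly to a factor returns R's internal integer level codes, not the values you see printed. A minimal sketch with a made-up factor ('mpg' is a hypothetical sample, not the EPA column):

```r
# Made-up factor mixing clean numbers with dirty entries.
mpg <- factor(c("39", "24", "16/24", "N/A", "100"))

# Directly converting a factor gives its internal level codes.
codes <- as.numeric(mpg)
codes

# Converting to character first recovers the printed values.
# Entries that are not numbers become NA (warnings suppressed here on purpose).
values <- suppressWarnings(as.numeric(as.character(mpg)))
values
```

So the usual idiom for turning a factor of numbers into real numbers is as.numeric(as.character(x)).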

So let us try our subset() function once again, converting 'Cmb.MPG' datatype on the fly:

> subset(EPADelim,as.numeric(Cmb.MPG) >= 45,select = c(Model,Cmb.MPG))
Model Cmb.MPG
1 ACURA ILX 38
2 ACURA ILX 38
3 ACURA ILX 38
4 ACURA ILX 28
5 ACURA ILX 28

...

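Hmmm.... that isn't quite right. Vehicles rated 38 and 28 MPG are being returned because as.numeric() on a factor compares R's internal level codes rather than the printed values. This data set has to be cleaned. R has an entire series of functions designed to help you automate such data "reshaping" including (but not limited to):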

strsplit()
sapply()
sub()
gsub()
cut()
cut2()
merge()
melt()
sort()
order()
head()
tail()

I won't discuss these functions in this post. (See the end notes for some tutorial links.) However, if you have some knowledge of a text-processing utility like 'gawk 4.0', you can use the following syntax to see just how much data needs to be 'cleaned' in column 15. For example, a sorted count of all entries in column 15 ('Cmb.MPG') that contain "/" shows:

$ gawk -F"\t" '{print $15}' all_alpha_13.txt | sort -nr | uniq -c | sort -k1 -nr | grep "/"

165 N/A

22 10/14

17 13/17

11 17/23

11 16/22

9 14/19

9 11/15

8 12/16

7 11/14

6 16/24

5 14/20

4 9/12

4 43/100

4 16/21

4 13/18

4 10/13

....
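If you would rather stay inside R, a similar count can be sketched with grep() and table(). The 'cmb' vector below is a made-up sample standing in for the 'Cmb.MPG' column:

```r
# Made-up sample of the 'Cmb.MPG' column as text.
cmb <- c("N/A", "10/14", "38", "N/A", "10/14", "13/17", "N/A", "28")

# Keep only the entries that contain a "/" character.
with_slash <- grep("/", cmb, fixed = TRUE, value = TRUE)

# Count each distinct entry and sort from most to least frequent.
counts <- sort(table(with_slash), decreasing = TRUE)
counts
```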

However, a much easier, but more time-consuming, way to do this is by editing the data in a spreadsheet or R's data editor. The changes you make in R's data editor take place immediately and irrevocably. You can see the approach I take to cleaning up data in the spreadsheet screenshot below. I simply split 'Cmb.MPG' into two numeric columns: 'Cmb.hi.MPG' and 'Cmb.lo.MPG'.
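The same split can also be scripted. This is only a sketch on a made-up sample, not the spreadsheet steps I actually used. It assumes the number before the "/" is the low value and the number after it is the high value, and that an entry without a "/" counts as both.

```r
# Made-up sample of the 'Cmb.MPG' column as text.
cmb <- c("38", "16/24", "N/A", "43/100")

# Mark "N/A" as a real missing value before splitting.
cmb[cmb == "N/A"] <- NA

# Split each entry on "/"; a missing entry stays missing.
parts <- strsplit(cmb, "/", fixed = TRUE)

# First piece is the low value, last piece is the high value.
cmb_lo <- as.numeric(vapply(parts, function(p) p[1], character(1)))
cmb_hi <- as.numeric(vapply(parts, function(p) p[length(p)], character(1)))

split_df <- data.frame(Cmb.MPG = cmb, Cmb.lo.MPG = cmb_lo, Cmb.hi.MPG = cmb_hi)

# Numeric comparison now works on the new columns.
high <- subset(split_df, Cmb.hi.MPG >= 45)
high
```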

Now we try our subset() function again. However, after wading through a pile of 'N/A' entries, we see that vehicles with a triple-digit 'Cmb.hi.MPG' have been left out:

> subset(EPASplit15,as.numeric(Cmb.hi.MPG) > 45 ,select = c(Model,Cmb.hi.MPG))

....

164 RAM 3500 N/A
165 RAM 3500 N/A
166 TESLA Model S 95
173 TESLA Model S 89
174 TOYOTA RAV4 EV 76
175 CODA Coda 73
176 CODA Coda 73
177 TOYOTA Prius Plug-in Hybrid 95
178 TOYOTA Prius Plug-in Hybrid 95
179 TOYOTA Prius 50
180 TOYOTA Prius 50
181 TOYOTA Prius c 50
182 TOYOTA Prius c 50
183 FORD C-MAX Hybrid 47
184 FORD Fusion Hybrid 47
213 CHEVROLET Volt 98
214 CHEVROLET Volt 98
215 CHEVROLET Volt 98

A better command line 'fix' for this is to convert the column once, going through as.character() so that the printed values rather than the factor level codes are converted:

EPASplit15$Cmb.hi.MPG <- as.numeric(as.character(EPASplit15$Cmb.hi.MPG))

As a last resort we call up the data editor for EPASplit15 ('fix(EPASplit15)') and, by clicking on the column headers of 'Cmb.hi.MPG' and 'Cmb.lo.MPG', convert them to numeric columns:

Now our subset() function works as desired:

> subset(EPASplit15,(Cmb.hi.MPG) > 45 ,select = c(Model,Cmb.hi.MPG))

Model Cmb.hi.MPG
166 TESLA Model S 95
173 TESLA Model S 89
174 TOYOTA RAV4 EV 76
175 CODA Coda 73
176 CODA Coda 73
177 TOYOTA Prius Plug-in Hybrid 95
178 TOYOTA Prius Plug-in Hybrid 95
179 TOYOTA Prius 50
180 TOYOTA Prius 50
181 TOYOTA Prius c 50
182 TOYOTA Prius c 50
183 FORD C-MAX Hybrid 47
184 FORD Fusion Hybrid 47
188 FORD C-MAX PHEV 100
189 FORD C-MAX PHEV 100
190 FORD Fusion PHEV 100
191 FORD Fusion PHEV 100
213 CHEVROLET Volt 98
214 CHEVROLET Volt 98
215 CHEVROLET Volt 98
2128 SCION iQ EV 121
2129 SCION iQ EV 121
2147 HONDA Fit 118
2148 HONDA Fit 118
2149 FIAT 500e 116
2150 FIAT 500e 116
2151 NISSAN Leaf 116
2152 NISSAN Leaf 116
2153 MITSUBISHI i-MiEV 112
2154 MITSUBISHI i-MiEV 112
2174 SMART ForTwo Cabriolet 107
2175 SMART ForTwo Cabriolet 107
2176 SMART ForTwo Coupe 107
2177 SMART ForTwo Coupe 107
2178 FORD Focus BEV 105
2179 FORD Focus BEV 105

Update 03/02/2013:

Originally, I missed a developer download page that had cleaner (and more detailed) fuel economy data. The chart below shows us how many high mileage vehicles are now appearing on the market.

# Download:
# http://www.fueleconomy.gov/feg/epadata/vehicles.csv.zip
# Unzip vehicles.csv to your working directory

AllVehicles <- read.csv("vehicles.csv")
AllVehicles$comb08 <- as.numeric(AllVehicles$comb08)

GTR45 <- data.frame(subset(AllVehicles, comb08 > 45, select = c(model,make,year,comb08,comb08U,combA08,combA08U,combE)))

# If necessary install (plyr) package.
# See dataframe sorting discussion at
# http://stackoverflow.com/questions/1296646/how-to-sort-a-dataframe-by-columns-in-r/6871968#6871968
# Sort by year then combined fuel mileage
# for other alternatives see[1,2]

library(plyr)
arrange(GTR45,(year),comb08)
SortYearGTR45 <- arrange(GTR45,(year),comb08)
plot(SortYearGTR45$year,SortYearGTR45$comb08,type="p")

Another method: sort the dataframe with order() and add a statistical smoothing line:

GTR45.MPG.Year.Model.Cmb.MPG <- data.frame(GTR45[order(GTR45$year),c(3,4)])
plot(GTR45.MPG.Year.Model.Cmb.MPG,ylab="Combined MPG")
lines(stats::lowess(GTR45.MPG.Year.Model.Cmb.MPG))
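If you have not downloaded vehicles.csv yet, you can practice the order() approach on a tiny made-up dataframe. The 'vehicles' rows below are invented for illustration; only the column names 'make', 'year' and 'comb08' follow the download.

```r
# Made-up rows using three of the column names from vehicles.csv.
vehicles <- data.frame(
  make   = c("TOYOTA", "HONDA", "CHEVROLET", "FORD"),
  year   = c(2013, 2001, 2013, 2010),
  comb08 = c(50, 53, 98, 46)
)

# order() returns row positions sorted by year, then by combined MPG within a year.
row_order <- order(vehicles$year, vehicles$comb08)

# Using those positions as the row index gives the sorted dataframe.
sorted <- vehicles[row_order, ]
sorted
```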

End Notes:

For more information on Data Cleaning:

Lists and Data Cleaning (Jaffe and Muschelli)
Tidy Data (Hadley Wickham)
Data Munging Basics (Jeffrey Leek)

Roger D. Peng: "Computing for Data Analysis"

2013-02-20T06:17:00.000-08:00

Roger D. Peng (http://www.biostat.jhsph.edu/~rpeng/) is a "Rock 'n' Roll Statistician" (see http://twitter.com/rdpeng) specializing in Biostatistics at the Johns Hopkins Bloomberg School of Public Health. Dr. Peng also teaches R Programming courses at Coursera.org. He has posted the lectures for his last course "Computing for Data Analysis" on Youtube. His lectures are an excellent resource and will be understandable in large part for most of the 5th - 10th graders at Saint Paul's Academy. I recommend you watch at least the individual videos below. Consider watching all of the videos in the playlists "BackGround on R" and "Computing for Data Analysis: Week 1" in preparation for next week's class.

Dr. Peng is an experienced statistical programmer in R. Watching these videos will clear up many questions we had in our second class and help you extend your data analysis skills with R. Also, anything I described poorly or inaccurately will be accurately described in Dr. Peng's lectures.

Individual Lectures

Setting Your Working Directory and Editing R Code:
http://www.youtube.com/watch?v=8xT3hmJQskU
How to Get Help:
http://www.youtube.com/watch?v=ZFaWxxzouCY
Reading/Writing Data I:
http://www.youtube.com/watch?v=aBzAels6jPk&list=PLjTlxb-wKvXNSDfcKPFH2gzHGyjpeCZmJ&index=8
Reading/Writing Data II:
http://www.youtube.com/watch?v=cUUqDWttMws&list=PLjTlxb-wKvXNSDfcKPFH2gzHGyjpeCZmJ&index=9

Playlists

BackGround on R
http://www.youtube.com/watch?v=V2V3T9GkKBY&list=PLjTlxb-wKvXMUop9m0C8G5xLBzhsIDBC7
Computing for Data Analysis: Week 1
http://www.youtube.com/watch?v=8xT3hmJQskU&list=PLjTlxb-wKvXNSDfcKPFH2gzHGyjpeCZmJ
Computing for Data Analysis: Week 2
http://www.youtube.com/watch?v=s_h9ruNwI_0&list=PLjTlxb-wKvXNnjUTX4C8IeIhPBjPkng6B
Computing for Data Analysis Week 3
http://www.youtube.com/watch?v=R2Zh_kPxrmg&list=PLjTlxb-wKvXOzI2h0F2_rYZHIXz8GWBop
Computing for Data Analysis Week 4
http://www.youtube.com/watch?v=HPSrjKt-e8c&list=PLjTlxb-wKvXOdzysAE6qrEBN_aSBC0LZS

A Brief Overview of Programming Languages and Symbolic Logic (Part I)

2013-02-11T11:03:00.001-08:00

What kind of spirit is it, that can support you to keep your interest on C++ Standardization for 20+ years?

Nothing else I could spend my time on could do so much good to so many people.

from an interview with Bjarne Stroustrup (2012), author of The C++ Programming Language

Some would say that the best way to learn a programming language is to simply install your language of choice and start writing code based on the manual, tutorials or help system specific to that language. However, having studied more than a few programming languages in my lifetime, I am going to emphasize another approach, one that digs deeper into the background.

The ability to use digital logic circuits in computation has been with us for a very short period of history. The rise in their use has paralleled the rapid spread of urbanization, growth in population, universities, international trade and market economies. Today, more than sixty years after ENIAC was built for the United States Army Ballistic Research Laboratory, computational logic is the substructure of all economic, engineering, defense, science, and mathematical efforts. Each year our lives are more organized around improvements in computer hardware, computational theory, information theory, networking and user interface design. During these scant last sixty years, at least 2500 computer languages have been developed. A "History of Computer Languages" constitutes a reading list of some of the brightest minds to live in the post World War II era. A brief but important list of popular programming languages can be found here.

For some time now, the disciplines of digital intelligence have been engaged in segregation and specialization. An electrical engineering degree is now something quite different from a computer science degree. Most of us who write and learn computer software today concern ourselves little with how computer hardware processes logic. But there was a time when the two disciplines were indivisible. Some scholars still believe that learning machine architecture (sometimes called 'machine organization and architecture' or MOA) is essential to understanding the binary (or Boolean) logic (1,2) that is the foundation for all digital computation and programming languages. Despite all such discussion, there is still no definitive path for the creation of a competent software engineer. The conventional wisdom generally involves obtaining a computer science degree.

That being said, the Wikipedia list of college dropout billionaires includes Bill Gates (Microsoft founder), Steve Jobs (Apple founder), Mark Zuckerberg (Facebook founder) and Larry Ellison (Oracle founder). Many of us who studied other disciplines in college simply fell into computer administration and software engineering because we found we enjoyed or had a knack for it. Even so, pursuing an electrical engineering or computer science degree would still be the recommended first step on the way to being hired by one of the companies founded by the men described above.

And perhaps the best step in understanding software engineering is to gain an understanding of math, logic, and more specifically mathematical (or symbolic) logic. Many of you in this class (5th - 10th graders at SPA) learn a math curriculum that prepares you to understand many of the principles from which computational logic and computer science are derived. However, chief in importance among all these principles is the simple yet all encompassing conception that the principles of quantitative logic can be represented by symbolic logic or natural language. To this concept, our survival on this Earth as a species for the last few thousand years owes much. Let us examine briefly the history of mathematical (or symbolic) logic.

(To be continued)

Week0: Using spreadsheets

2013-02-04T09:56:00.001-08:00

Spreadsheets are the ubiquitous tool of business, finance, science, academia, and corporate America. They are the most widespread of all data analysis tools and have no peers in that respect. Despite the continual enhancement of their functionality, spreadsheets have some drawbacks. For example:

Both functions and data in cells are subject to undetected statistical errors sometimes due to the slip of a cursor.
Mixing data and code in the same worksheet can create, and has created, data disasters.
Large datasets or data results are impractical when stored in cells.

Despite these drawbacks, it is common to find thousands of spreadsheets in active use at any investment bank or accounting firm. No doubt SPA students will begin using spreadsheets well before they are in high school and continue using them when they enter college and the working world.

As with any data analysis software, you can approach spreadsheet work from a number of different perspectives. Sometimes all you want to do is examine your data to look for relationships. Other times, you will have specific templates you want to apply repeatedly to a continuous flow of data. Often times, spreadsheet graphing capacity is the end result for some presentation or series of slides. Online training for spreadsheets is also ubiquitous. Here are some tutorial links for Open Office Calc:

MSU Tutorial (PDF)

University of Regina Tutorial (PDF)

"Video Tutorials for Open Office"

"Tutorials for Open Office"

Open Office Forum List of Tutorials

Open Office Wiki

The help files for most spreadsheets are extensive. Spreadsheets are a generalized tool, designed for generic data analysis. Your ability to use a spreadsheet well to model your problem set will depend significantly on a specific domain knowledge. That being said, advanced spreadsheet design and development involves high-end statistical, programming and database skillsets. I've uploaded a series of 'screencast' tutorials about using basic functionality in Open Office Calc here:

DataScience:Week0

The spreadsheet for the tutorials can be found here or here. For your science projects, I will try to make myself available for help with your spreadsheets the best that I can.

Week0: CK-12 Resources

2013-01-30T19:52:00.002-08:00

CK-12 is an incredible STEM resource. Log on to it and peruse the subjects you want to learn. The site is designed to offer self-paced learning, achievement tracking, and grade-based content. Perhaps most significant is the ability of the student and educator to create eFlexBooks in .mobi, .epub, and PDF formats. The mathematical and statistical lessons are of very high quality. And this resource is free. Check out these two flexbooks:

This type of resource gives me, as an instructor, an opportunity to discuss how technology increases learning. I highly recommend the use of a portable reader such as a Nook or Kindle or the use of reader software on your smart phone or laptop. Simply put, mathematics is sometimes best learned in small doses at places convenient for the student. To prevent visual fatigue from interfering with comprehension, I prefer a reader that handles the display of Greek letters with optimum clarity in all types of light.

For these articles, the student may find a resource on the use of Greek letters in mathematics helpful. Please see my (upcoming) post on the use of Greek letters in mathematics and their representation in statistical software.

Questions for the Students
For this course on data science we are very interested in

statistical concepts
mathematical concepts that are critical to statistical programming

Mathematical structures important to statistical programming include:

the concept of a function
the properties of arrays, vectors, matrices and other data structures
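To make these two ideas concrete, here is a small sketch in R. The names 'square', 'v' and 'm' are just examples I made up for illustration.

```r
# A function takes an input and returns an output, just like f(x) = x^2 in math.
square <- function(x) x^2

# A vector is an ordered collection of values of the same type.
v <- c(1, 2, 3)

# Applying the function to a vector applies it to every element.
squares <- square(v)
squares

# A matrix arranges values in rows and columns; R fills it column by column.
m <- matrix(1:6, nrow = 2)
m
dim(m)   # number of rows, then number of columns
```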

Read pages 5 - 62 in CK-12-MathAnalysis. Pay particular attention to the generic discussion of functions as a mathematical concept. Read pages 42 - 73 in Basic-Probability-and-Statistics---A-Short-Course.

How do we understand visual data through graphs?
How is our understanding of visual data linked to mathematical concepts that represent data?

Both these works will prepare a student for better understanding statistical software packages like R, Octave, Python, Scilab.

For the ambitious
Read Probability and Statistics - Advanced (Second Edition). If you find your eyes glazing over when reviewing formulas, try this strategy:

Complete a first read of this work that skims over each chapter.
Try to identify the general purpose of the discussion on data analysis in statistics.
Take a break and think some about what you have read.
Read the work carefully from one end to the other over a period of sittings. Each time you read something new, take some time to think about possible uses of the mathematical concept.

Scilab

2013-01-24T11:40:00.000-08:00

Scilab is multi-platform, open source software that offers functionality similar to MATLAB, including user contributed add-ins for statistics and data analysis. The software is widely used in secondary schools and universities for teaching and practicing engineering and mathematics. An extensive help menu is available with the software. There are many third party tutorials available from science and engineering departments across the world. For example:

I have found the product surprisingly well-featured and mature and easy to learn. There appears to be a substantial technical community committed to Scilab. In addition, I find the semantics and syntax of Scilab form a gentle and well-featured introduction to command line computing for upper and high school students.

Questions and Activities for Students

Install Scilab on your home computer.
Walk through at least one of the tutorials listed above.
Do you find Scilab easier or more difficult to use than a spreadsheet for mathematical computation?
What advantages do you think command line scripts have over spreadsheet cell based programming?
Why do you think scientists and engineers would use Scilab or MATLAB instead of a spreadsheet?

Climate Data from BEST

2013-01-20T13:46:00.000-08:00

Berkeley Earth Surface Temperature (BEST) has released a significant study documenting the rise in the Earth's surface temperature from 1753 to 2011. The study includes a detailed discussion of statistical methodology and locations for the MATLAB code and data. MATLAB is commercial software for mathematics, statistics, and presentation used by many academics and scientists. A movie of land temperature anomalies can be found here. The study collates exhaustive amounts of temperature data to draw conclusions about land surface temperature anomalies over time. Some presentations on the complexity of their statistical methodology can be found here and here. Correlating historical climate data to understand evidence of global climate change has proven to be a complex and controversial challenge. The BEST sponsor NOVIM hopes to promote scientific studies without political bias concerning significant world resource issues.

Questions for Students:

(1) What does the BEST statistical methodology suggest about the complexity and nature of the analysis of data from historical records? Do you think the study helps us understand how complete our understanding of statistical reasoning needs to be for very large data sets?

(2) Read the NOVIM website highlighted articles:

How do the sciences of mathematics and statistics help us come to unbiased conclusions?

(3) Read the Wikipedia articles on Scientific Method and List of Biases in Judgement and Decision Making. How important do you think the study of data science is to the future of humankind? How does improving the scientific method by learning how to remove bias from our scientific methodology improve our chance to prosper and survive on this Earth?