Teaching Data Science

Tuesday, January 28, 2014

Data Science Club Description

Title: “The Data Knights”
Subject: “How computer science looks at 'Big Data' and why understanding 'Big Data' is important.”
Instructor: Ryan Ferris (Isabel's Dad!)
Blog: http://www.teachingdatascience.blogspot.com
Requirements: 5th - 8th grade; high schoolers at SPA; a willingness to work with data, statistics, programming languages, and symbolic math. There is no official workload; however, suggested assignments will provide an opportunity for introductions to programming, statistics, mathematics, and database theory. A PC, Mac, or Unix machine at home is important to get the most out of this club. A tablet reader/browser may also be helpful.

Elections Data and Accumulators

I've loaded some code and charts I produced for a piece on my political blog. R programming wizardry will take some time and practice to accumulate. In fact, it is a more modern programming structure, which I refer to as the 'accumulator', that I had to hack up to make this cruft work. I often use an "accumulator" variable while doing data analysis with Powershell. It allows me to continuously stuff similarly typed data into an array, like so:
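The accumulator idea can also be sketched in R rather than Powershell. This is only a minimal illustration, not the code from the political blog piece; the precinct names and vote counts are invented.

```r
# Hypothetical precinct results; names and numbers are invented for illustration.
precincts <- c("P101", "P102", "P103")
votes     <- c(120, 95, 143)

# Start with an empty accumulator and append one row per pass through the loop.
acc <- data.frame()
for (i in seq_along(precincts)) {
  row <- data.frame(precinct = precincts[i], votes = votes[i])
  acc <- rbind(acc, row)   # stuff the new row onto the end of the accumulator
}

acc
total_votes <- sum(acc$votes)
```

Growing a dataframe inside a loop is fine for small jobs like this one. For large data it gets slow, and collecting the pieces in a list and binding them once at the end is usually the better habit.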

Basic Data Analysis and Lattice Graphics Framework

(post under construction)
This post is very generic cruft on basic data analysis designed around combining data with:

cbind()
rbind()
matrix()
as.matrix()
data.frame()
as.data.frame()

and visualizing data with the lattice graphics package.
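A short sketch of how these functions fit together might look like the following. The numbers and column names are invented for illustration, and the lattice package is assumed to be installed (it ships with standard R distributions).

```r
library(lattice)

# Invented numbers, for illustration only.
year   <- 2001:2005
income <- c(10, 12, 13, 15, 18)
assets <- c(50, 55, 53, 60, 66)

m  <- cbind(year, income, assets)   # vectors side by side -> numeric matrix
m  <- rbind(m, c(2006, 20, 70))     # append one more row
df <- as.data.frame(m)              # matrix -> dataframe with named columns

# matrix() reshapes a plain vector; as.matrix() converts a dataframe back.
m2 <- matrix(1:6, nrow = 2)
m3 <- as.matrix(df)

# Plot both series against year in a single lattice panel.
p <- xyplot(income + assets ~ year, data = df, type = "b", auto.key = TRUE)
print(p)
```

A matrix holds a single data type, while a dataframe can mix types column by column, which is why the conversion functions come up so often when combining data.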

Some quick and dirty notes on learning R. I have found that virtual libraries for learning R exist both online and offline. However, I have also found that R is a peculiar and specific language. I would compare its semantics most closely to SQL, but that comparison stops being useful quickly. Ironically, given the power of the R language for quantitative analysis, I have found that the user really needs to get the "feel of R" inside his forearms to become useful and self-confident. Spending time manipulating and re-organizing data is essential at each step of your curriculum in learning R. Functionally, R is a mathematical platform and benefits from domain-specific packages and knowledge. But the R language is also a unique type of engine with limits of its own. R does certain things very well. Other functionality, perhaps more typical of many programming languages, is simply outside the scope of R. There is an art to the successful use of R. There is an important 'R' mentality that only serious practice will instill.

This post follows from my last post on combining data for analysis. I am using BEA, BLS, and Census data to understand 20-year macroeconomic flows. Some examples of how to use the cbind, rbind, matrix, and as.data.frame commands to re-organize this data are here. Below are some functions I have created to help explicate the data set ('dd'). They are slightly more concise/useful than the function 'str(dd)'. The user will recall that I have concatenated data from multiple sets into one dataframe. I have used the prefixes BLS, PI, and NS to designate "Bureau of Labor Statistics", (BEA) Personal Income, and (BEA) Net (residential) Stock.
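To give a flavor of what such a helper can look like, here is a hypothetical one. It is not one of the original functions from the post: the name describe_columns and the stand-in dataframe dd_demo, with its column names and values, are assumptions made up for this sketch.

```r
# Hypothetical helper (not the author's original): one row per column of a dataframe.
describe_columns <- function(df) {
  data.frame(
    column    = names(df),
    class     = sapply(df, function(x) class(x)[1]),
    n_missing = sapply(df, function(x) sum(is.na(x))),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Stand-in for 'dd'; the column names and values are invented.
dd_demo <- data.frame(
  Year         = 1992:1994,
  BLS_Employed = c(118, 120, NA),
  PI_Wages     = c(3.0, 3.1, 3.3)
)

out <- describe_columns(dd_demo)
out
```

Compared with str(), the result is itself a dataframe, so it can be sorted, filtered, or written out like any other data.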

ggplot2

Hadley Wickham on using R and ggplot2:

Monday, March 18, 2013

Combining DataFrames in R Programming

The screencast below discusses combining dataframes from disparate sources in R Programming. Full screen is probably best. The code for the screencast is below. Data files for this screencast can be found here.

This is the code that accompanies the screen cast above.

# data science exercise combining dataframes from different sources
# data science exercise using lattice graphics system

list.files()
list.files(pattern="csv")

# read.csv() already returns a dataframe, so no data.frame() wrapper is needed
USPerInc.1992.2011 <- read.csv("PersonalIncomeDisposition1992-2011.csv")
USResidentialAsset.1992.2011 <- read.csv("Current-Cost_Net_Stock_Residential_Fixed_Assets.csv")
USEmployPop.1992.2011 <- read.csv("BLS_Census.csv")

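# First pass: each assignment overwrites USComb, so only the last selection survives
USComb <- cbind(USPerInc.1992.2011[c(1,6)])
names(USComb)
USComb <- cbind(USResidentialAsset.1992.2011[c(2,3,4,8)])
names(USComb)

# Second pass: start from the income columns, then bind the asset columns alongside
USComb <- USPerInc.1992.2011[c(1,6)]
USComb <- cbind(USComb,USResidentialAsset.1992.2011[c(2,3,4,8)])
matrix(names(USComb))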

# Alternative: bind all three files whole; cbind() pairs rows by position,
# so every file must have the same number of rows in the same order
USComb1 <- read.csv("BLS_Census.csv")
USComb1 <- cbind(USComb1,read.csv("PersonalIncomeDisposition1992-2011.csv"))
USComb1 <- cbind(USComb1,read.csv("Current-Cost_Net_Stock_Residential_Fixed_Assets.csv"))
names(USComb1)

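# Look for column names containing "Year": grep() returns the matches, grepl() a TRUE/FALSE flag per column
grep(pattern="Year",(as.character(names(USComb1))),value=TRUE)
grepl(pattern="Year",(as.character(names(USComb))))
matrix(grepl(pattern="Year",(as.character(names(USComb1)))))

# Drop columns 9 and 20 by position (presumably the repeated Year columns found above)
USComb1 <- USComb1[c(-9,-20)]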

# Tabulate the class and name of every column in each combined dataframe
matrix1 <- matrix(sapply(USComb,class))
matrix1 <- cbind(matrix1,matrix(names(USComb)))
matrix2 <- matrix(sapply(USComb1,class))
matrix2 <- cbind(matrix2,matrix(names(USComb1)))
matrix1
matrix2

dd <- USComb1
str(dd)
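
A side note on the approach above: cbind() pairs rows purely by position, so it only gives the right answer when every file lists the same years in the same order. When that cannot be guaranteed, merge() on a shared key column is a common alternative. The sketch below is not part of the screencast; the two small dataframes and their column names are invented stand-ins.

```r
# Invented stand-ins for two source tables; the rows are deliberately in different orders.
income <- data.frame(Year = c(1992, 1993, 1994), PersonalIncome = c(5.4, 5.6, 5.9))
assets <- data.frame(Year = c(1994, 1992, 1993), NetStock = c(7.1, 6.5, 6.8))

# merge() lines the rows up on the shared key and keeps a single Year column.
combined <- merge(income, assets, by = "Year")
combined
```

With merge() there is no duplicate Year column to hunt down and drop afterwards.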

Monday, March 11, 2013

"Great coders are today's rock stars."

I just had to post this video on learning to code from an all star cast at code.org...

Friday, March 8, 2013

RGui, Rstudio, Notepad++; Creating and Using Functions

The code below is from the screencast published on YouTube above. This screencast is available as a high resolution WMV file. For more data see the Week4 folder.

Basic data analysis: Part I

[Editor's note: Under construction - 03/06/2013.]
There are a number of functions that will be helpful for this example. Please examine them through use of the help system (e.g. 'help(command)'):

read.csv()
head()
names()
as.numeric()
c()
data.frame() or as.data.frame()
sapply()
class()
print()
levels()
droplevels()
subset()
order()
plot()
lines()
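
As a rough sketch of how several of these functions chain together, here is a small self-contained example. The CSV text, the column names, and the values are all invented so that the code runs without any data file.

```r
# Made-up CSV text stands in for a real file so the example runs on its own.
csv_text <- "region,year,sales
east,2011,100
west,2011,80
east,2012,120
west,2012,90
north,2012,40"

d <- read.csv(text = csv_text, stringsAsFactors = TRUE)
head(d)
names(d)
print(sapply(d, class))   # region is a factor; year and sales are integer

levels(d$region)

# Keep two regions; the unused level "north" lingers until droplevels() removes it.
ew <- subset(d, region != "north")
ew <- droplevels(ew)
levels(ew$region)

# Sort each region by year, plot one, and overlay the other with lines().
east <- subset(ew, region == "east")
west <- subset(ew, region == "west")
east <- east[order(east$year), ]
west <- west[order(west$year), ]
plot(east$year, east$sales, type = "b", ylim = c(0, 130),
     xlab = "year", ylab = "sales")
lines(west$year, west$sales, type = "b", lty = 2)
```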

Basic Graphing in R: Combining, Plotting and Smoothing

R graphs from left to right: Price of Imported Oil per Quarter, 1976:2012; Price of Retail Gasoline per Quarter, 1976:2012; Ratio of Retail Gas / Imported Oil per Quarter, 1976:2012. Source: U.S. EIA, "Short-Term Energy Outlook Real and Nominal Prices", February 13, 2012.

The data files, images and R Script for this blog are here. These examples use R 2.14 64 bit for Windows. Because I am neither a statistician nor an energy professional, the results of the following analysis will have to be taken with a "grain of salt". The purpose of this post is to demonstrate basic use of exporting, reformatting, combining, plotting, and smoothing data in R.
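
The combine, plot, and smooth steps can be sketched roughly as follows. The prices here are synthetic numbers generated on the spot, NOT the EIA figures, and lowess() is just one of several smoothers available in base R; the original script may well use a different one.

```r
# Synthetic quarterly prices, NOT the EIA data; for illustration only.
set.seed(1)
quarter <- seq(1976, 2011.75, by = 0.25)
n   <- length(quarter)
oil <- 30 + 10 * sin(seq_len(n) / 10) + rnorm(n, sd = 2)
gas <- 0.5 + 0.04 * oil + rnorm(n, sd = 0.1)

# Combine the series into one dataframe and derive the ratio column.
prices <- data.frame(quarter, oil, gas)
prices$ratio <- prices$gas / prices$oil

# Plot the raw ratio, then overlay a lowess smooth.
plot(prices$quarter, prices$ratio, type = "l", col = "grey50",
     xlab = "Quarter", ylab = "gas / oil")
smooth <- lowess(prices$quarter, prices$ratio, f = 0.2)
lines(smooth, lwd = 2)
```

The f argument sets the fraction of points used for each local fit: smaller values follow the data more closely, larger values give a flatter curve.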

Datatypes, consistent data, reshaping data, dirty data

[Data for this post can be found here]

This post is on one of the nasty prerequisites for all data professionals: "Understanding Dirty Data and How to Clean it Up". The problem is sometimes called "bad data" or, alternatively, the quest for "tidy data". Most 'relational data' uses the row and column format that you are familiar with from a spreadsheet. Ideally, all data would be arranged neatly in such a format. Let us take a look at such data from R's data editor window. You can click on these pictures to see them in the Blogger slide viewer:
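
For a taste of what dirty data looks like in code rather than pictures, here is a small made-up table and one possible cleanup. The city names, the counts, and the "n/a" missing-value marker are all assumptions invented for this sketch; trimws() requires R 3.2 or later.

```r
# Made-up "dirty" table: numbers stored as text, a thousands separator,
# stray spaces, inconsistent capitalization, and a text marker for missing values.
raw <- data.frame(
  city  = c("Seattle", " tacoma", "SEATTLE ", "Spokane"),
  count = c("1,200", "850", "n/a", " 430"),
  stringsAsFactors = FALSE
)
sapply(raw, class)   # both columns are character

clean <- raw
clean$city <- tolower(trimws(clean$city))                      # consistent labels
clean$count[clean$count == "n/a"] <- NA                        # real missing value
clean$count <- as.numeric(gsub(",", "", trimws(clean$count)))  # text -> number
sapply(clean, class)   # count is now numeric
clean
```

Until the count column is numeric, functions such as sum() or plot() will either fail or quietly do the wrong thing, which is why checking the class of every column is a good first step.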