Title: “The Data Knights”
Subject: “How computer science looks at 'Big Data' and why understanding 'Big Data' is important.”
Instructor: Ryan Ferris (Isabel's Dad!)
Blog: http://www.teachingdatascience.blogspot.com
Requirements: 5th - 8th grade; high schoolers at SPA; a willingness to work with data, statistics, programming languages, and symbolic math. There is no official workload; however, suggested assignments will provide introductions to programming, statistics, mathematics, and database theory. An at-home PC, Mac, or Unix machine is important to get the most out of this club. A tablet reader/browser may also be helpful.
Teaching Data Science
Notes on teaching children data science, quite possibly from a 'STEM' perspective. This is a course designed for upper-school or high-school students at Saint Paul's Academy in Bellingham, WA. An associated web page is http://teachingdatascience.com/ .
Wednesday, May 8, 2013
Elections Data and Accumulators
I've loaded some code and charts I produced for a piece on my political blog. R programming wizardry will take some time and practice to accumulate. In fact, it is a programming structure I refer to as the 'accumulator' that I had to hack up to make this cruft work. I often use an "accumulator" variable while doing data analysis with PowerShell. It allows me to continuously stuff similarly typed data into an array as so:
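The original snippet is not preserved here, but the idea translates directly to R. A minimal illustrative sketch (made-up computation, not the blog's original code):

```r
# Minimal sketch of the "accumulator" idea in R (illustrative only):
# start with an empty vector and append each new result to it.
acc <- c()                # empty accumulator
for (i in 1:5) {
  result <- i^2           # stand-in for a per-row computation
  acc <- c(acc, result)   # stuff the new value into the array
}
acc
```

Note that growing a vector element-by-element in a loop is slow in R for large data; preallocating the vector, or using `sapply()`/`vapply()`, is the more idiomatic approach.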
Friday, March 29, 2013
Basic Data Analysis and Lattice Graphics Framework
(post under construction)
This post is very generic cruft on basic data analysis designed around combining data with :
- cbind()
- rbind()
- matrix()
- as.matrix()
- data.frame()
- as.data.frame()
and visualizing data with the lattice graphics package.
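As a taste of the lattice side, here is a minimal sketch on a made-up data frame (the column names and values are stand-ins, not the post's actual economic data):

```r
library(lattice)

# Toy data frame standing in for the combined economic series.
dd <- data.frame(Year   = 1992:2011,
                 Series = rep(c("A", "B"), each = 20),
                 Value  = c(cumsum(rnorm(20)), cumsum(rnorm(20))))

# One panel per series, with points joined into a line.
xyplot(Value ~ Year | Series, data = dd, type = "l")
```

The formula interface (`Value ~ Year | Series`) is the heart of lattice: the `|` splits the data into conditioned panels.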
Some quick and dirty notes on learning R. Virtual libraries exist both online and offline for learning R. However, I have also found that R is a peculiar and specific language. I would compare its semantics most closely to SQL, but that comparison stops being useful quickly. Ironically, given the power of R for quantitative analysis, the user really needs to get the "feel of R" into his forearms to become useful and self-confident. Spending time manipulating and re-organizing data is essential at each step of a curriculum for learning R. Functionally, R is a mathematical platform, and it benefits from domain-specific packages and knowledge. But R is also a unique kind of engine with its own limits. It does certain things very well; other functionality, typical of many programming languages, is simply outside its scope. There is an art to the successful use of R, and an 'R' mentality that only serious practice will instill.
This post follows from my last post on combining data for analysis. I am using BEA, BLS, and Census data to understand 20-year macro-economic flows. Some examples of how to use the cbind, rbind, matrix, and as.data.frame commands to re-organize this data are here. Below are some functions I have created to help explicate the data set ('dd'). They are slightly more concise and useful than the function 'str(dd)'. The reader will recall that I have concatenated data from multiple sets into one data frame. I have used the prefixes BLS, PI, and NS to designate Bureau of Labor Statistics data, (BEA) Personal Income, and (BEA) Net (residential) Stock.
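The author's own helper functions are not reproduced here; the sketch below is a hypothetical stand-in for that kind of compact alternative to `str(dd)`, shown on a toy data frame with the same prefix convention:

```r
# Hypothetical helper: a compact table of each column's name and class,
# slightly terser than str(dd).
col_classes <- function(dd) {
  data.frame(column = names(dd),
             class  = vapply(dd, function(x) class(x)[1], character(1)),
             row.names = NULL)
}

# Hypothetical helper: pull out the columns from one source by prefix.
by_prefix <- function(dd, prefix) dd[grepl(paste0("^", prefix), names(dd))]

# Toy frame using the BLS/PI/NS prefix convention from the post.
dd <- data.frame(BLS.Employ = 1:3, PI.Income = 4:6, NS.Stock = 7:9)
col_classes(dd)
by_prefix(dd, "PI")
```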
Thursday, March 21, 2013
Monday, March 18, 2013
Combining DataFrames in R Programming
The screencast below discusses combining dataframes from disparate sources in R programming. Full screen is probably best. The code for the screencast is below. Data files for this screencast can be found here.
This is the code that accompanies the screen cast above.
# data science exercise: combining dataframes from different sources
# and preparing them for the lattice graphics system
list.files()                          # everything in the working directory
list.files(pattern="csv")             # just the CSV data files
USPerInc.1992.2011 <- read.csv("PersonalIncomeDisposition1992-2011.csv")
USResidentialAsset.1992.2011 <- read.csv("Current-Cost_Net_Stock_Residential_Fixed_Assets.csv")
USEmployPop.1992.2011 <- read.csv("BLS_Census.csv")
# Build a combined frame column by column, checking names at each step
USComb <- USPerInc.1992.2011[c(1,6)]
names(USComb)
USComb <- USResidentialAsset.1992.2011[c(2,3,4,8)]
names(USComb)
USComb <- USPerInc.1992.2011[c(1,6)]
USComb <- cbind(USComb, USResidentialAsset.1992.2011[c(2,3,4,8)])
matrix(names(USComb))                 # column names, one per row
# A second combined frame: bind all three sources
USComb1 <- read.csv("BLS_Census.csv")
USComb1 <- cbind(USComb1, read.csv("PersonalIncomeDisposition1992-2011.csv"))
USComb1 <- cbind(USComb1, read.csv("Current-Cost_Net_Stock_Residential_Fixed_Assets.csv"))
names(USComb1)
# Locate the duplicated "Year" columns...
grep(pattern="Year", names(USComb1), value=TRUE)
grepl(pattern="Year", names(USComb))
matrix(grepl(pattern="Year", names(USComb1)))
USComb1 <- USComb1[c(-9,-20)]         # ...and drop the duplicates
# Side-by-side view of each column's class and name
matrix1 <- matrix(sapply(USComb, class))
matrix1 <- cbind(matrix1, matrix(names(USComb)))
matrix2 <- matrix(sapply(USComb1, class))
matrix2 <- cbind(matrix2, matrix(names(USComb1)))
matrix1
matrix2
dd <- USComb1                         # working data set for later posts
str(dd)
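One caveat worth knowing: `cbind()` assumes the rows of each source already line up (here, by year). When they may not, base R's `merge()` joins on a key column instead. A minimal sketch on made-up frames (not the screencast's data):

```r
# cbind() assumes rows already line up; merge() joins on a key column.
income <- data.frame(Year = 1992:1994, Income = c(100, 105, 111))
stock  <- data.frame(Year = 1993:1995, Stock  = c(50, 52, 55))

merged <- merge(income, stock, by = "Year")   # inner join on Year
merged                                        # only the overlapping years survive
```

`merge(..., all = TRUE)` would instead keep every year from both frames, filling the gaps with `NA`.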
Monday, March 11, 2013
"Great coders are today's rock stars."
I just had to post this video on learning to code, featuring an all-star cast, from code.org...
Friday, March 8, 2013
RGui, Rstudio, Notepad++; Creating and Using Functions
The code below is from the YouTube screencast above. The screencast is also available as a high-resolution WMV file. For more data, see the Week4 folder.
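The screencast's actual code is not reproduced in this archive, so here is a minimal, hypothetical example of creating and using a function in R, in the spirit of the post's title:

```r
# Minimal example of defining and calling a function in R
# (illustrative only; not the screencast's original code).
pct_change <- function(x) {
  # percent change from one element of x to the next
  100 * diff(x) / head(x, -1)
}

income <- c(100, 105, 111)
pct_change(income)
```

A function like this can be written in Notepad++ or the RStudio editor, then loaded into an RGui or RStudio session with `source("myfunctions.R")` (a hypothetical file name).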
Wednesday, March 6, 2013
Basic data analysis: Part I
[Editor's note: Under construction - 03/06/2013.]
There are a number of functions that will be helpful for this example. Please examine them through the help system (e.g. 'help(command)'):
- read.csv()
- head()
- names()
- as.numeric()
- c()
- data.frame() or as.data.frame()
- sapply()
- class()
- print()
- levels()
- droplevels()
- subset()
- order()
- plot()
- lines()
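A short sketch tying several of the functions above together on a small made-up data set (the values are stand-ins, not the post's actual data):

```r
# Toy data: Value arrives as text and the rows are out of order.
df <- data.frame(Year  = c(1994, 1992, 1993),
                 Value = c("30", "10", "20"))

sapply(df, class)                                # inspect each column's class
df$Value <- as.numeric(as.character(df$Value))   # coerce text to numeric
df <- df[order(df$Year), ]                       # sort rows by year

plot(df$Year, df$Value)                          # scatter the points...
lines(df$Year, df$Value)                         # ...then join them
```

The `as.character()` step matters: calling `as.numeric()` directly on a factor returns its internal level codes rather than the printed values.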
Friday, March 1, 2013
Basic Graphing in R: Combining, Plotting and Smoothing
Data for this post can be found here. These examples use R 2.14 (64-bit) for Windows. Because I am neither a statistician nor an energy professional, the results of the following analysis should be taken with a grain of salt. The purpose of this post is to demonstrate basic exporting, reformatting, combining, plotting, and smoothing of data in R.
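As a flavor of the plotting-and-smoothing step, here is a minimal sketch on simulated data (made up for illustration; the post's energy data is not reproduced):

```r
# A noisy upward trend, then a lowess smoother laid over the scatter.
set.seed(1)
year  <- 1992:2011
value <- 50 + 2 * (year - 1992) + rnorm(length(year), sd = 5)

plot(year, value, main = "Noisy series with lowess smoother")
lines(lowess(year, value), col = "red")   # locally weighted smoothing
```

`lowess()` returns smoothed x/y coordinates, which `lines()` then draws over the base `plot()`.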
Monday, February 25, 2013
Datatypes, consistent data, reshaping data, dirty data
[Data for this post can be found here]
This post is on one of the nasty prerequisites for all data professionals: understanding dirty data and how to clean it up. Such data is sometimes called "bad data", and cleaning it is sometimes described as the quest for "tidy data". Most relational data uses the row-and-column format you are familiar with from a spreadsheet. Ideally, all data would be arranged neatly in such a format. Let us take a look at such data in R's data editor window. You can click on these pictures to see them in the Blogger slide viewer:
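A tiny, made-up example of the kind of cleanup involved: numbers stored as text, with thousands separators and a stray "n/a" entry.

```r
# Illustrative dirty data: Pop arrives as text with commas and "n/a".
raw <- data.frame(Region = c("North", "South", "West"),
                  Pop    = c("1,200", "950", "n/a"))

raw$Pop <- as.numeric(gsub(",", "", raw$Pop))  # "n/a" becomes NA (with a warning)
clean <- subset(raw, !is.na(Pop))              # drop rows that failed to parse
clean$Region <- droplevels(factor(clean$Region))
clean
```

The pattern is typical: strip the formatting characters, coerce, then decide what to do with the rows that would not parse.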