Wednesday, May 8, 2013

Elections Data and Accumulators

I've uploaded some code and charts I produced for a piece on my political blog. R programming wizardry will take some time and practice to accumulate. In fact, it is a programming structure I refer to as the "accumulator" that I had to hack up to make this code work. I often use an "accumulator" variable while doing data analysis with PowerShell; it allows me to continuously append similarly typed data onto an array, like so:
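The original snippet isn't reproduced in this excerpt, so here is a minimal sketch of the same accumulator pattern written in R, with made-up precinct names and a stand-in calculation:

# a minimal sketch of the accumulator pattern (hypothetical data)
results <- data.frame()                          # start with an empty accumulator
for (precinct in c("P1", "P2", "P3")) {
  votes <- sample(100:500, 1)                    # stand-in for a real per-precinct calculation
  row   <- data.frame(precinct = precinct, votes = votes)
  results <- rbind(results, row)                 # stuff each new row onto the accumulator
}
results

Growing an object inside a loop like this is convenient for small jobs; for larger data, the usual R advice is to build a list and combine it once with do.call(rbind, ...).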

Friday, March 29, 2013

Basic Data Analysis and Lattice Graphics Framework

(post under construction)
This post is very generic cruft on basic data analysis, designed around combining data with:

  • cbind()
  • rbind()
  • matrix()
  • as.matrix()
  • data.frame()
  • as.data.frame() 

and visualizing data with the lattice graphics package.

Some quick and dirty notes on learning R. I have found that virtual libraries for learning R exist both online and offline. However, I have also found that R is a peculiar and specific language. I would compare its semantics most closely to SQL, but that comparison stops being useful quickly. Ironically, given the quantitative power of the R language, I have found that the user really needs to get the "feel of R" into his forearms to become productive and self-confident. Spending time manipulating and re-organizing data is essential at each step of your curriculum in learning R. Functionally, R is a mathematical platform and benefits from domain-specific packages and knowledge. But the R language is also a unique kind of engine with its own limits. R does certain things very well; other functionality, perhaps typical of many programming languages, simply lies outside R's scope. There is an art to the successful use of R, and an important "R" mentality that only serious practice will instill.

This post follows from my last post on combining data for analysis. I am using BEA, BLS, and Census data to understand twenty-year macro-economic flows. Some examples of how to use the cbind, rbind, matrix, and as.data.frame commands to re-organize this data are here. Below are some functions I have created to help explicate the data set ('dd'); they are slightly more concise and useful than the function 'str(dd)'. The reader will recall that I have concatenated data from multiple sets into one dataframe, using the prefixes BLS, PI, and NS to designate Bureau of Labor Statistics data, (BEA) Personal Income, and (BEA) Net (residential) Stock.
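The helper functions themselves aren't reproduced in this excerpt, but a minimal sketch of the kind of summary I mean, run against the combined data frame 'dd', might look like the following (the function name is purely illustrative):

# a compact alternative to str(dd): one row per column, with class and an example value
describe.dd <- function(df) {
  data.frame(column  = names(df),
             class   = sapply(df, function(x) class(x)[1]),
             example = sapply(df, function(x) as.character(x[1])),
             row.names = NULL)
}

describe.dd(dd)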


Thursday, March 21, 2013

ggplot2

Hadley Wickham on using R and ggplot2:

 

Monday, March 18, 2013

Combining DataFrames in R Programming

The screencast below discusses combining dataframes from disparate sources in R Programming. Full screen is probably best. The code for the screencast is below. Data files for this screencast can be found here.

  

This is the code that accompanies the screencast above.

# data science exercise: combining dataframes from different sources
# data science exercise using the lattice graphics system

list.files()
list.files(pattern="csv")

# read each source file into its own data frame
USPerInc.1992.2011 <- data.frame(read.csv("PersonalIncomeDisposition1992-2011.csv"))
USResidentialAsset.1992.2011 <- data.frame(read.csv("Current-Cost_Net_Stock_Residential_Fixed_Assets.csv"))
USEmployPop.1992.2011 <- data.frame(read.csv("BLS_Census.csv"))

# preview the column names of the subsets we want from each frame
# (note: each assignment below overwrites USComb; this is exploration only)
USComb <- cbind(USPerInc.1992.2011[c(1,6)])
names(USComb)
USComb <- cbind(USResidentialAsset.1992.2011[c(2,3,4,8)])
names(USComb)

# now actually combine: columns 1 and 6 of the personal income data
# plus columns 2, 3, 4 and 8 of the residential asset data
USComb <- cbind(USPerInc.1992.2011[c(1,6)])
USComb <- cbind(USComb,USResidentialAsset.1992.2011[c(2,3,4,8)])
matrix(names(USComb))

# alternative approach: cbind all three source files in full, then prune
USComb1 <- data.frame(read.csv("BLS_Census.csv"))
USComb1 <- cbind(USComb1,(data.frame(read.csv("PersonalIncomeDisposition1992-2011.csv"))))
USComb1 <- cbind(USComb1,(data.frame(read.csv("Current-Cost_Net_Stock_Residential_Fixed_Assets.csv"))))
names(USComb1)

# locate the duplicate "Year" columns that came in with each source file
grep(pattern="Year",(as.character(names(USComb1))),value=TRUE)
grepl(pattern="Year",(as.character(names(USComb))))
matrix(grepl(pattern="Year",(as.character(names(USComb1)))))

# drop the redundant columns identified above (positions 9 and 20)
USComb1 <- cbind(USComb1[c(-9,-20)])
           
# build side-by-side matrices of column classes and column names for a quick audit
matrix1 <- matrix(sapply(USComb,class))
matrix1 <- cbind(matrix1,matrix(names(USComb)))
matrix2 <- matrix(sapply(USComb1,class))
matrix2 <- cbind(matrix2,matrix(names(USComb1)))
matrix1
matrix2

# keep the fully combined frame as 'dd' and inspect its structure
dd <- USComb1
str(dd)

Monday, March 11, 2013

"Great coders are today's rock stars."

I just had to post this video on learning to code from an all-star cast at code.org...

 

Friday, March 8, 2013

RGui, Rstudio, Notepad++; Creating and Using Functions

The code below is from the YouTube-published screencast above. The screencast is also available as a high-resolution WMV file. For more data, see the Week4 folder.
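The screencast's code isn't reproduced in this excerpt; as a stand-in, here is a minimal, hypothetical sketch of creating and using a function in R, in the spirit of the post's title (the function and data are invented):

# define a simple function: percent change from the first element of a vector
pct.change <- function(x) {
  100 * (x - x[1]) / x[1]
}

income <- c(5200, 5450, 5710, 6020)   # made-up yearly figures
pct.change(income)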

Wednesday, March 6, 2013

Basic data analysis: Part I


[Editor's note: Under construction - 03/06/2013.]
There are a number of functions that will be helpful for this example. Please examine them through the help system (e.g., 'help(command)'), and see the short sketch after this list:
  • read.csv()
  • head()
  • names()
  • as.numeric()
  • c()
  • data.frame() or as.data.frame()
  • sapply()
  • class()
  • print()
  • levels()
  • droplevels()
  • subset()
  • order()
  • plot()
  • lines()
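As promised above, here is a minimal sketch that strings several of these functions together; the file name and column names are hypothetical, and the point is only the flow from reading a CSV to subsetting, ordering, and plotting:

# hypothetical file and column names, for illustration only
dd <- read.csv("example_data.csv")
head(dd)
names(dd)
sapply(dd, class)                                 # check the class of each column

dd$Value  <- as.numeric(as.character(dd$Value))   # coerce a factor column to numeric
dd.recent <- subset(dd, Year >= 2000)             # keep only recent rows
dd.recent <- dd.recent[order(dd.recent$Year), ]   # sort by year

plot(dd.recent$Year, dd.recent$Value)             # scatter plot of the ordered data
lines(dd.recent$Year, dd.recent$Value)            # connect the points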

Friday, March 1, 2013

Basic Graphing in R: Combining, Plotting and Smoothing

R graphs, from left to right: Price of Imported Oil per Quarter, 1976-2012; Price of Retail Gasoline per Quarter, 1976-2012; Ratio of Retail Gas / Imported Oil per Quarter, 1976-2012. Source: U.S. EIA, "Short-Term Energy Outlook Real and Nominal Prices," February 13, 2012.
The data files, images, and R script for this post are here. These examples use R 2.14 (64-bit) for Windows. Because I am neither a statistician nor an energy professional, the results of the following analysis should be taken with a grain of salt. The purpose of this post is to demonstrate basic exporting, reformatting, combining, plotting, and smoothing of data in R.
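The full script isn't included in this excerpt, but a minimal sketch of the plotting-and-smoothing step, with hypothetical file and column names, might look like this:

# hypothetical file and column names, for illustration only
prices <- read.csv("oil_gas_quarterly.csv")      # quarterly oil and gasoline prices
ratio  <- prices$RetailGas / prices$ImportedOil  # derived series: retail gas / imported oil

par(mfrow=c(1,3))                                # three panels, left to right
plot(prices$Quarter, prices$ImportedOil, type="l", main="Imported Oil")
plot(prices$Quarter, prices$RetailGas,   type="l", main="Retail Gasoline")
plot(prices$Quarter, ratio, main="Gas / Oil Ratio")
lines(lowess(prices$Quarter, ratio))             # add a lowess smoother to the ratio panel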

Monday, February 25, 2013

Datatypes, consistent data, reshaping data, dirty data

[Data for this post can be found here]

This post is about one of the nasty prerequisites for all data professionals: understanding dirty data and how to clean it up. The problem is sometimes called "bad data," or alternatively, the quest for "tidy data." Most relational data uses the row-and-column format you are familiar with from a spreadsheet, and ideally all data would be arranged neatly in such a format. Let us take a look at such data in R's data editor window (you can click on the pictures to see them in the Blogger slide viewer).
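To make the idea concrete, here is a small, made-up example of dirty data and one way to tidy it in base R; the column names and the problems (commas inside numbers, inconsistent capitalization) are invented for illustration:

# made-up dirty data: numbers stored as text with commas, inconsistent labels
dirty <- data.frame(state  = c("WA", "wa", "OR"),
                    income = c("52,000", "51,500", "48,900"),
                    stringsAsFactors = FALSE)

dirty$state  <- toupper(dirty$state)                     # make the labels consistent
dirty$income <- as.numeric(gsub(",", "", dirty$income))  # strip commas, convert to numeric

sapply(dirty, class)   # confirm the income column is now numeric
dirty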



Wednesday, February 20, 2013

Roger D. Peng: "Computing for Data Analysis"


Roger D. Peng (http://www.biostat.jhsph.edu/~rpeng/) is a "Rock 'n' Roll Statistician" (see http://twitter.com/rdpeng) specializing in biostatistics at the Johns Hopkins Bloomberg School of Public Health. Dr. Peng also teaches R programming courses at Coursera.org. He has posted the lectures for his last course, "Computing for Data Analysis," on YouTube. His lectures are an excellent resource and will be understandable in large part by most of the 5th - 10th graders at Saint Paul's Academy. I recommend you watch at least the individual videos below. Consider watching all of the videos in the playlists "Background on R" and "Computing for Data Analysis: Week 1" in preparation for next week's class.

Dr. Peng is an experienced statistical programmer in R. Watching these videos will clear up many of the questions we had in our second class and help you extend your data analysis skills with R. Also, anything I described poorly or inaccurately will be described accurately in Dr. Peng's lectures.

Monday, February 11, 2013

A Brief Overview of Programming Languages and Symbolic Logic (Part I)


What kind of spirit is it, that can support you to keep your interest on C++ Standardization for 20+ years?
Nothing else I could spend my time on could do so much good to so many people.
from an interview with Bjarne Stroustrup (2012), author of The C++ Programming Language

Some would say that the best way to learn a programming language is simply to install your language of choice and start writing code based on the manual, tutorials, or help system specific to that language. However, having studied more than a few programming languages in my lifetime, I am going to emphasize another, deeper background approach.

The ability to use digital logic circuits in computation has been with us for a very short historical period. The rise in their use has paralleled the rapid spread of urbanization and the growth of population, universities, international trade, and market economies. Today, more than sixty years after the United States Army Ballistic Research Laboratory developed ENIAC, computational logic is the substructure of all economic, engineering, defense, scientific, and mathematical effort. Each year our lives are more organized around, and with, improvements in computer hardware, computational theory, information theory, networking, and user interface design. During these scant sixty years, at least 2,500 computer languages have been developed. A "History of Computer Languages" constitutes a reading list of some of the brightest minds of the post-World War II era. A brief, detailed, and important list of popular programming languages can be found here.

Monday, February 4, 2013

Week0: Using spreadsheets

Spreadsheets are the ubiquitous tool of business, finance, science, academia, and corporate America. As the most widespread of all data analysis tools, spreadsheets have no peers. Despite the continual enhancement of their functionality, however, spreadsheets have some drawbacks. For example:
  • Both functions and data in cells are subject to undetected statistical errors, sometimes due to nothing more than the slip of a cursor.
  • Mixing data and code in the same worksheet can and has created data disasters.
  • Large datasets or data results are impractical when stored in cells.
Despite these drawbacks, it is common to find thousands of spreadsheets in active use at any investment bank or accounting firm. No doubt SPA students will begin using spreadsheets well before they are in high school and will continue using them when they enter college and the working world.

Wednesday, January 30, 2013

Week0: CK-12 Resources

CK-12 is an incredible STEM resource. Log on and peruse the subjects you want to learn. The site is designed to offer self-paced learning, achievement tracking, and grade-based content. Perhaps most significant is the ability of the student and educator to create eFlexBooks in .mobi, .epub, and PDF formats. The mathematical and statistical lessons are of very high quality, and this resource is free. Check out these two flexbooks:
This type of resource gives me, as an instructor, an opportunity to discuss how technology increases learning. I highly recommend the use of a portable reader such as a Nook or Kindle, or reader software on your smartphone or laptop. Simply put, mathematics is sometimes best learned in small doses at places convenient for the student. To prevent visual fatigue from interfering with comprehension, I prefer a reader that displays Greek letters with optimum clarity in all types of light.

For these articles, the student may find a resource on the use of Greek letters in mathematics helpful. Please see my (upcoming) post on the use of Greek letters in mathematics and their representation in statistical software.

Thursday, January 24, 2013

Scilab

Scilab is multi-platform, open-source software that offers functionality similar to MATLAB, including user-contributed add-ins for statistics and data analysis. The software is widely used in secondary schools and universities for teaching and practicing engineering and mathematics. An extensive help menu is included with the software, and many third-party tutorials are available from science and engineering departments across the world. For example:

I have found the product surprisingly well-featured, mature, and easy to learn. There appears to be a substantial technical community committed to Scilab. In addition, I find that the semantics and syntax of Scilab form a gentle, capable introduction to command-line computing for upper-school and high school students.

Sunday, January 20, 2013

Climate Data from BEST

Berkeley Earth Surface Temperature (BEST) has released a significant study documenting the rise in the Earth's surface temperature from 1753 to 2011. The study includes a detailed discussion of statistical methodology and locations for the MATLAB code and data. MATLAB is commercial mathematical, statistical, and presentation software used by many academics and scientists. A movie of land temperature anomalies can be found here. The conclusions of this study rest on the collation of exhaustive amounts of temperature data to draw conclusions about land surface temperature anomalies over time. Some presentations on the complexity of the statistical methodology can be found here and here. Correlating historical climate data to find evidence of global climate change has proven to be a complex and controversial challenge. The BEST sponsor, NOVIM, hopes to promote scientific studies of significant world resource issues without political bias.

Questions for Students:

(1) What does the BEST statistical methodology suggest about the complexity and nature of analyzing data from historical records? Do you think the study helps us understand how complete our understanding of statistical reasoning needs to be for very large data sets?

(2) Read the highlighted articles on the NOVIM website. How do the sciences of mathematics and statistics help us come to unbiased conclusions?

(3) Read the Wikipedia articles on the Scientific Method and the List of Biases in Judgment and Decision Making. How important do you think the study of data science is to the future of humankind? How does improving the scientific method, by learning how to remove bias from our methodology, improve our chances to prosper and survive on this Earth?