tag:blogger.com,1999:blog-28737742235817774242024-03-12T21:53:45.757-07:00Teaching Data ScienceNotes on Teaching Children Data Science quite possibly from a 'STEM' perspective. This is a course designed for upper school or high school students at Saint Paul's Academy in Bellingham, WA. An associated web page is <a href="http://teachingdatascience.com/">http://teachingdatascience.com/</a> . Ryan M. Ferrishttp://www.blogger.com/profile/03122603266808854365noreply@blogger.comBlogger16125tag:blogger.com,1999:blog-2873774223581777424.post-63105697460024034482014-01-28T09:42:00.000-08:002013-02-28T07:49:40.654-08:00Data Science Club Description<b>Title:</b> “The Data Knights”<br />
<b>Subject:</b> “How computer science looks at 'Big Data' and why understanding 'Big Data' is important.”<br />
<b>Instructor:</b> Ryan Ferris (Isabel's Dad!)<br />
<b>Blog:</b> <a href="http://www.teachingdatascience.blogspot.com/">http://www.teachingdatascience.blogspot.com</a><br />
<b>Requirements: </b>5th - 8th grade; high schoolers at SPA; a willingness to work with data, statistics, programming languages, and symbolic math. There is no official workload; however, suggested assignments will provide an opportunity for introductions to programming, statistics, mathematics, and database theory. A home PC, Mac, or Unix machine is important to get the most out of this club. A tablet reader/browser may also be helpful.<br />
<br />
<b></b><br />
<a name='more'></a><b><br /></b><br />
<b>Dates and Time:</b> 3:30 - 5:00 PM on the following Tuesdays in the Computer Lab:<br />
2/5<br />
2/26<br />
3/5<br />
3/19<br />
3/26<br />
4/9<br />
4/23<br />
<br />
<br />
“The Data Knights” Club will take place once a week for 1.5 hours in the Upper School computer lab. A possible format will be 45 minutes of lecture; 45 minutes of lab work. A reading list, blog or website will provide updated links, information, and some suggested assignments. 'Data Science' is rapidly becoming one of the most important fields of computer science. The field of 'data science' is seen as critical to help manage, marketize, analyze, and understand large volumes of data in an increasingly interconnected world. An introduction to 'data science' gives a parent or instructor an important opportunity to talk about how 'real world' computing uses 'big data' and 'data science' in diverse fields. 'Data Science' also gives us an opening to inject algorithms and approaches to understanding data with mathematics and statistics.<br />
<br />
'Data Science' revolves around skillsets in a number of fields including statistics, mathematics, network analysis, database theory, software engineering and modeling. 'Data Science' is being used to understand and explore fields as diverse as energy supplies, the human genome, habitable planets, climate change, financial markets, the propagation of disease, population demographics and many others. It is almost a certainty that the ability to think broadly and flexibly about 'big data' will be an important trait of the next generation of engineers, scientists, technologists and political leaders. It may also be important for all young students to understand the breadth and complexity of a world that may contain 10 billion of us by the end of this century.<br />
<br />
Software (possible list):<br />
<br />
<ul>
<li>R Statistics</li>
<li>PostgreSQL</li>
<li>Python</li>
<li>Scilab</li>
<li>Spreadsheets (Excel, Scalc)</li>
<li>Octave</li>
<li>AWK</li>
<li>Graphic Presentation Software</li>
</ul>
<br />
Notes: I can extend the Tuesday sessions in February and March to help with data analysis on your science project if your sponsor or parent finds that useful. There are only 14 desks in the SPA computer lab.Ryan M. Ferrishttp://www.blogger.com/profile/03122603266808854365noreply@blogger.comtag:blogger.com,1999:blog-2873774223581777424.post-78066389650163443802013-05-08T21:58:00.003-07:002013-05-08T22:00:59.876-07:00Elections Data and AccumulatorsI've loaded some code and charts I produced for a piece on my <a href="http://bellingham-wa-politics-economics.blogspot.com/2013/05/big-data-and-local-elections-part-i.html">political blog</a>. R programming wizardry will take some time and practice to accumulate. In fact, it was a more modern programming structure, which I refer to as the 'accumulator', that I had to hack up to make this cruft work. I often use an <span style="background-color: yellow;">"accumulator" variable </span>while doing <a href="http://horizontal-logic.blogspot.com/2012/10/data-analytics-with-powershell-part-i.html#more">data analysis with PowerShell</a>. It lets me keep appending similarly typed data to an array, like so:<br />
<br />
<a name='more'></a><br />
<br />
<span style="background-color: white; color: #666666; font-family: 'Courier New', Courier, monospace; font-size: 13px; line-height: 18px;">rv -ea 0 a</span><br />
<span style="background-color: white; color: #666666; font-family: 'Courier New', Courier, monospace; font-size: 13px; line-height: 18px;"># create one database stored as a variable (e.g. '$out') by merging all candidate donations. </span><br />
<span style="background-color: white; color: #666666; font-family: 'Courier New', Courier, monospace; font-size: 13px; line-height: 18px;"># Add a 'member' or field (e.g. Candidate Name) to each record</span><br />
<span style="background-color: yellow;"><span style="font-family: 'Courier New', Courier, monospace; font-size: 13px; line-height: 18px;">$out+=$Varlist.name</span><span style="font-family: 'Courier New', Courier, monospace; font-size: 13px; line-height: 18px;"> </span></span><span style="background-color: white; color: #666666; font-family: 'Courier New', Courier, monospace; font-size: 13px; line-height: 18px;">| % {</span><br />
<span style="background-color: white; color: #666666; font-family: 'Courier New', Courier, monospace; font-size: 13px; line-height: 18px;">$a=(ls variable:/$PSItem).value;</span><br />
<span style="background-color: white; color: #666666; font-family: 'Courier New', Courier, monospace; font-size: 13px; line-height: 18px;">$Name=(ls variable:/$PSItem).name;</span><br />
<span style="background-color: white; color: #666666; font-family: 'Courier New', Courier, monospace; font-size: 13px; line-height: 18px;">$a | add-member -force -passthru -NotePropertyName Candidate -NotePropertyValue $PSItem; </span><br />
<span style="background-color: white; color: #666666; font-family: 'Courier New', Courier, monospace; font-size: 13px; line-height: 18px;">}</span><br />
<span style="background-color: white; color: #666666; font-family: 'Courier New', Courier, monospace; font-size: 13px; line-height: 18px;"><br /></span><span style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13px; line-height: 18px;"></span><span style="background-color: white; color: #666666; font-family: 'Courier New', Courier, monospace; font-size: 13px; line-height: 18px;"># An $out record now is </span><br />
<span style="background-color: white; color: #666666; font-family: 'Courier New', Courier, monospace; font-size: 13px; line-height: 18px;"></span><br style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13px; line-height: 18px;" />
<span style="background-color: white; color: #666666; font-family: 'Courier New', Courier, monospace; font-size: 13px; line-height: 18px;">$out[0]</span><br />
<span style="background-color: white; color: #666666; font-family: 'Courier New', Courier, monospace; font-size: 13px; line-height: 18px;"><br /></span><span style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13px; line-height: 18px;"></span><span style="background-color: white; color: #666666; font-family: 'Courier New', Courier, monospace; font-size: 13px; line-height: 18px;">Contributor : ATU</span><br />
<span style="background-color: white; color: #666666; font-family: 'Courier New', Courier, monospace; font-size: 13px; line-height: 18px;">Date : 06/22/12</span><br />
<span style="background-color: white; color: #666666; font-family: 'Courier New', Courier, monospace; font-size: 13px; line-height: 18px;">Amount : 900</span><br />
<span style="background-color: white; color: #666666; font-family: 'Courier New', Courier, monospace; font-size: 13px; line-height: 18px;">P/G : P</span><br />
<span style="background-color: white; color: #666666; font-family: 'Courier New', Courier, monospace; font-size: 13px; line-height: 18px;">City : WASHINGTON</span><br />
<span style="background-color: white; color: #666666; font-family: 'Courier New', Courier, monospace; font-size: 13px; line-height: 18px;">State : DC</span><br />
<span style="background-color: white; color: #666666; font-family: 'Courier New', Courier, monospace; font-size: 13px; line-height: 18px;">Zip : 20016</span><br />
<span style="background-color: white; color: #666666; font-family: 'Courier New', Courier, monospace; font-size: 13px; line-height: 18px;">Employer :</span><br />
<span style="background-color: white; color: #666666; font-family: 'Courier New', Courier, monospace; font-size: 13px; line-height: 18px;">Occupation :</span><br />
<span style="background-color: white; color: #666666; font-family: 'Courier New', Courier, monospace; font-size: 13px; line-height: 18px;">Candidate : NM</span><br />
<br />
I couldn't find anything analogous in R, so I resorted to code like this: it creates a numeric data.frame ('ddf') seeded with a single row of zeros, stuffs it with data (via <i>rbind()</i>), removes the original zeroed row, re-levels (via <i>droplevels()</i>), and finally returns the result.<br />
<br />
<br />
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;"><span class="Apple-tab-span" style="white-space: pre;"> </span>accyear <- function() {</span><br />
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;"><span class="Apple-tab-span" style="white-space: pre;"> </span>#ddf <- data.frame(cbind(Year=NULL,Freq=NULL)))</span><br />
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;"><span class="Apple-tab-span" style="white-space: pre;"> </span><span style="background-color: yellow;">ddf <- data.frame(cbind(Year=0,Freq=0))</span></span><br />
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;"><span class="Apple-tab-span" style="white-space: pre;"> </span>for(i in list) {</span><br />
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;"><span class="Apple-tab-span" style="white-space: pre;"> </span>dd <- (data.frame(cbind(Year=i,(subset(fvl2, Year == i, select= Freq)))))</span><br />
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;"><span class="Apple-tab-span" style="white-space: pre;"> </span>dd <- (data.frame(cbind(Year=(unique(dd$Year)),Freq=(cumsum(dd$Freq)[nrow(dd)]))))</span><br />
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;"><span class="Apple-tab-span" style="white-space: pre;"> </span>ddf <- rbind(ddf,dd)</span><br />
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;"><span class="Apple-tab-span" style="white-space: pre;"> </span>}</span><br />
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;"><span class="Apple-tab-span" style="white-space: pre;"> </span><span style="background-color: yellow;">ddf <- droplevels(data.frame(ddf[-1,],row.names=NULL))</span><span class="Apple-tab-span" style="white-space: pre;"> </span></span><br />
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;"><span class="Apple-tab-span" style="white-space: pre;"> </span>return(ddf)</span><br />
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;"><span class="Apple-tab-span" style="white-space: pre;"> </span>}</span><br />
<br />
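For comparison, here is a minimal sketch of two more idiomatic ways to get the same per-year totals in R. It assumes a small made-up data frame (<i>toy</i>) shaped like 'fvl2', with 'Year' and 'Freq' columns; the names <i>toy</i>, <i>acc_list</i> and <i>acc_aggregate</i> are illustrative and are not part of the original code.<br />

```r
# Toy stand-in for fvl2: several Freq counts per birth Year (made-up values)
toy <- data.frame(Year = c(1950, 1950, 1951, 1952, 1952, 1952),
                  Freq = c(3, 2, 4, 1, 1, 5))

# Option 1: accumulate pieces in a list, then bind them once at the end.
# Starting from an empty list avoids the dummy zero row and the later ddf[-1, ].
acc_list <- function(d) {
  pieces <- list()
  for (y in sort(unique(d$Year))) {
    pieces[[length(pieces) + 1]] <- data.frame(Year = y,
                                               Freq = sum(d$Freq[d$Year == y]))
  }
  do.call(rbind, pieces)
}

# Option 2: let aggregate() do the grouping and summing in one call.
acc_aggregate <- function(d) {
  aggregate(Freq ~ Year, data = d, FUN = sum)
}

by_list <- acc_list(toy)
by_agg  <- acc_aggregate(toy)
print(by_list)
print(by_agg)
```

Both versions return one row per year with the summed count. The list version is probably the closest R analogue of the PowerShell accumulator, while <i>aggregate()</i> skips the explicit loop altogether.<br />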
<span style="background-color: white; color: #666666; font-family: 'Courier New', Courier, monospace; font-size: 13px; line-height: 18px;"><br /></span>
Full code for the charts is below. The data comes from a Whatcom County voter database and a Census age and sex database for Washington counties. The original CSV files were quite large: 5 to 7 million 'observations', or separate data fields. (I am describing <b><i>nrow()</i> * <i>ncol()</i></b> here.) So I pared them down in sequential steps, a process that SQL handles differently but to the same effect. Then I use the <i style="background-color: #f6b26b;">table()</i> and <span style="background-color: #ffd966;"><i>stack()</i> </span><span style="background-color: white;">functions </span>to count the registrations per birth date and to reshape the Census age columns into rows. In my <a href="http://bellingham-wa-politics-economics.blogspot.com/2013/05/big-data-and-local-elections-part-i.html">political blog</a>, I overlaid the <span style="background-color: magenta;"><i>barplot()s</i> </span>with GIMP's transparent layer functionality, but in reality the visualization doesn't quite line up with the data. It is close enough, though, to suggest that a more accurate and interesting approach to correlating the separate information in one graph could be powerful.<br />
<br />
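As a rough illustration of what <i>table()</i> and <i>stack()</i> do, here is a small sketch on made-up data. The data frames <i>voters</i> and <i>wide</i> and their values are assumptions for the example; only the column names 'BirthDate' and 'AGE..._TOT' echo the real files.<br />

```r
# Made-up birth dates in MM/DD/YYYY form, standing in for fvl1$BirthDate
voters <- data.frame(BirthDate = c("01/15/1950", "01/15/1950",
                                   "03/02/1951", "07/30/1951"))

# table() counts how often each distinct value occurs;
# as.data.frame() turns those counts into the columns Var1 and Freq.
counts <- as.data.frame(table(voters$BirthDate))

# Characters 7 to 10 of each date hold the four-digit year.
counts$Year <- as.numeric(substr(as.character(counts$Var1), 7, 10))
print(counts)

# stack() reshapes a one-row "wide" data frame into "long" form:
# one row per original column, with columns 'values' and 'ind'.
wide <- data.frame(AGE1824_TOT = 100, AGE2544_TOT = 250, AGE4564_TOT = 200)
long <- stack(wide)
print(long)
```

The long form is what makes a call like <i>barplot(long$values, names.arg=long$ind)</i> straightforward.<br />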
<span style="font-family: 'Helvetica Neue', Arial, Helvetica, sans-serif; font-size: x-small;">fvl <- read.delim("ferrisvoterlist_20121204.txt")</span><br />
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;">fvl1 <- subset(fvl,select= c(1,3,5,8,15,16,17,18,19,21))</span><br />
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;">as.matrix(sapply(fvl1,class))</span><br />
<div>
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;"><br /></span></div>
<div>
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;"><span style="background-color: #f6b26b;">fvl2 <- as.data.frame(table(fvl1$BirthDate))</span></span></div>
<br />
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;">fvl2$Year <- substr((fvl2$Var1),7,10)</span><br />
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;">fvl2$Year <- as.numeric(fvl2$Year)</span><br />
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;">list <- sort(unique(fvl2$Year))</span><br />
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;"><br /></span>
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;"><span class="Apple-tab-span" style="white-space: pre;"> </span>accyear <- function() {</span><br />
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;"><span class="Apple-tab-span" style="white-space: pre;"> </span>#ddf <- data.frame(cbind(Year=NULL,Freq=NULL)))</span><br />
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;"><span class="Apple-tab-span" style="white-space: pre;"> </span>ddf <- data.frame(cbind(Year=0,Freq=0))</span><br />
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;"><span class="Apple-tab-span" style="white-space: pre;"> </span>for(i in list) {</span><br />
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;"><span class="Apple-tab-span" style="white-space: pre;"> </span>dd <- (data.frame(cbind(Year=i,(subset(fvl2, Year == i, select= Freq)))))</span><br />
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;"><span class="Apple-tab-span" style="white-space: pre;"> </span>dd <- (data.frame(cbind(Year=(unique(dd$Year)),Freq=(cumsum(dd$Freq)[nrow(dd)]))))</span><br />
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;"><span class="Apple-tab-span" style="white-space: pre;"> </span>ddf <- rbind(ddf,dd)</span><br />
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;"><span class="Apple-tab-span" style="white-space: pre;"> </span>}</span><br />
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;"><span class="Apple-tab-span" style="white-space: pre;"> </span>ddf <- droplevels(data.frame(ddf[-1,],row.names=NULL))<span class="Apple-tab-span" style="white-space: pre;"> </span></span><br />
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;"><span class="Apple-tab-span" style="white-space: pre;"> </span>return(ddf)</span><br />
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;"><span class="Apple-tab-span" style="white-space: pre;"> </span>}</span><br />
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;"><br /></span>
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;">ddf <- accyear()</span><br />
<span style="background-color: magenta; font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;">barplot(ddf$Freq,names.arg=ddf$Year,xlab="Voter Birth Year",ylab="Registration Count")</span><br />
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;"><br /></span>
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;"><br /></span>
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;"></span><br />
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;">WA_AGESEX <- read.csv("CC-EST2011-AGESEX-53.csv")</span><br />
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;">as.matrix(grep(pattern="TOT",(as.list(names(WA_AGESEX))),value=TRUE))</span><br />
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;">Whatcom <- subset(WA_AGESEX, CTYNAME == "Whatcom County")</span><br />
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;">Whatcom4 <- subset(Whatcom, YEAR == 4)</span><br />
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;">WhatcomAGE <- (subset(Whatcom4,select=c(AGE1824_TOT,AGE2544_TOT,AGE4564_TOT,AGE65PLUS_TOT,AGE85PLUS_TOT)))</span><br />
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;">WhatcomAGE <- (droplevels(data.frame(WhatcomAGE,row.names=NULL)))</span><br />
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;"><span style="background-color: #ffd966;">WhatcomAGE <- stack(WhatcomAGE)</span></span><br />
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;"># barplot(WhatcomAGE$values,names.arg=WhatcomAGE$ind)</span><br />
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;"><span style="background-color: magenta;">barplot(WhatcomAGE$values[5:1],names.arg=WhatcomAGE$ind[5:1])</span></span><br />
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;"><br /></span>
<br />
<div class="separator" style="clear: both; text-align: left;">
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;"><span style="font-family: 'Times New Roman'; font-size: x-small; text-align: -webkit-auto;">Click to enlarge the graphs:</span></span></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiY7qm5mwVlxHXryNEFBl71R6Vh-kQK0dwGE8-eF5JjRlYDH2klyFo9jfo7b5wVBoVH0kdpVPU7_qps6-12XErTzqbnYuWuxIMtNqvgl-2bm2SnadJDJnS59XIYqkY0b8GO7kdFGOE1o7I/s1600/2013-05-08_1147x858.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="239" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiY7qm5mwVlxHXryNEFBl71R6Vh-kQK0dwGE8-eF5JjRlYDH2klyFo9jfo7b5wVBoVH0kdpVPU7_qps6-12XErTzqbnYuWuxIMtNqvgl-2bm2SnadJDJnS59XIYqkY0b8GO7kdFGOE1o7I/s320/2013-05-08_1147x858.png" width="320" /></a></span></div>
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;"><br /></span>
<br />
<div class="separator" style="clear: both; text-align: center;">
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8KA2SO46je7IAUFN0UsM7-gSliNZGCjAfPoCB_3NvyKSLJJd3wJLSK_4qjK6doJ96T5ZZqjwH0GP6_gxReO1oEk4cli2RE74kpD4Qpw1gGsUj9myX3WZy2V8vLL9sBthEA22PXeaxF1U/s1600/2013-05-08_1147x858.layer.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="239" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8KA2SO46je7IAUFN0UsM7-gSliNZGCjAfPoCB_3NvyKSLJJd3wJLSK_4qjK6doJ96T5ZZqjwH0GP6_gxReO1oEk4cli2RE74kpD4Qpw1gGsUj9myX3WZy2V8vLL9sBthEA22PXeaxF1U/s320/2013-05-08_1147x858.layer.png" width="320" /></a></span></div>
<div class="separator" style="clear: both; text-align: center;">
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7wQDILZUzvbDnxpRj9kN7imYhL4m9010wN8g3xSINSnhiUPLMxFmdpZxANGBGgLJ9mYqivo7LVSZpfu_glCEY6asuN_Mvn8nLNbrjyW_Gc3cdkdzvMdMnZWavjfVN4i71XsNMygJbtb8/s1600/2013-05-08_1147x858.layered.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="239" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7wQDILZUzvbDnxpRj9kN7imYhL4m9010wN8g3xSINSnhiUPLMxFmdpZxANGBGgLJ9mYqivo7LVSZpfu_glCEY6asuN_Mvn8nLNbrjyW_Gc3cdkdzvMdMnZWavjfVN4i71XsNMygJbtb8/s320/2013-05-08_1147x858.layered.png" width="320" /></a></span></div>
<div>
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: x-small;"><br /></span></div>
<br />
<br />
<br />
<br />Ryan M. Ferrishttp://www.blogger.com/profile/03122603266808854365noreply@blogger.com0tag:blogger.com,1999:blog-2873774223581777424.post-81350225575514873912013-03-29T18:04:00.000-07:002013-03-29T20:52:11.334-07:00Basic Data Analysis and Lattice Graphics Framework(<span style="color: red;">post under construction)</span><br />
This post is very generic cruft on basic data analysis, designed around combining data with:<br />
<br />
<ul>
<li>cbind()</li>
<li>rbind()</li>
<li>matrix()</li>
<li>as.matrix()</li>
<li>data.frame()</li>
<li>as.data.frame() </li>
</ul>
<br />
and visualizing data with the<i> lattice </i>graphics package.<br />
<br />
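A minimal sketch of how the combining functions listed above fit together, using made-up vectors (<i>year</i>, <i>emp</i>) rather than the real data:<br />

```r
# Two made-up yearly series
year <- 2001:2003
emp  <- c(10, 12, 11)

# cbind() joins vectors side by side as the columns of a matrix
m <- cbind(Year = year, Emp = emp)

# rbind() appends a new row to that matrix
m <- rbind(m, c(2004, 13))

# as.data.frame() converts the matrix; the column names carry over
df <- as.data.frame(m)

# data.frame() builds a frame directly and can mix column types
df2 <- data.frame(Year = year, Label = c("a", "b", "c"))

# as.matrix() on mixed types coerces every cell to character
m2 <- as.matrix(df2)

# matrix() fills column by column unless byrow = TRUE is given
mat <- matrix(1:6, nrow = 2)

print(df)
print(m2)
print(mat)
```

The main thing to watch is type coercion: a matrix holds a single type, so anything mixed should stay in a data frame.<br />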
<b>Some quick and dirty notes on learning R</b>. I have found that virtual libraries for learning R exist both online and offline. However, I have also found that R is a peculiar and specific language. I would compare its semantics most closely to SQL, but that comparison quickly stops being useful. Ironically, given R's power for quantitative analysis, I have found that the user really needs to get the "feel of R" into his or her forearms to become useful and self-confident. <b>Spending time manipulating and re-organizing data is essential at each step of your curriculum in learning R. </b>Functionally, R is a mathematical platform and benefits from domain-specific packages and knowledge. But the R language is also a unique type of engine with its own programming limits. R does certain things very well. Other functionality, perhaps more typical of many programming languages, is simply outside R's scope. There is an art to the successful use of R. There is an important 'R' mentality that only serious practice will instill.<br />
<br />
<span style="font-family: inherit;">This post follows from my <a href="http://teachingdatascience.blogspot.com/2013/03/combining-dataframes-in-r-programming.html">last post on combining data for analysis</a>. I am using <a href="http://bea.gov/">BEA</a>, <a href="http://bls.gov/">BLS</a>, and <a href="http://census.gov/">Census</a> data to understand 20 year macro-economic flows. </span>Some examples of how to use the cbind, rbind, matrix, and as.data.frame commands to re-organize this data are <a href="http://teachingdatascience.rmfmedia.com/Week5/MoreDataManipulationinR.txt">here</a>. Below are some functions I have created to help explicate the data set ('dd'). They are slightly more concise/useful than the function 'str(dd)'. The reader will recall that I have <a href="http://teachingdatascience.rmfmedia.com/Week5/">concatenated data from multiple sets</a> into one<i> dataframe</i>. I have used prefixes (BLS, PI, NS) to designate "<a href="http://bls.gov/">Bureau of Labor Statistics</a>", (<a href="http://bea.gov/">BEA</a>) Personal Income, and (<a href="http://bea.gov/">BEA</a>) Net (residential) Stock.<br />
<br />
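One payoff of the prefix convention is that a whole source can be pulled out by name. The sketch below assumes a made-up miniature of the combined data frame (<i>dd_toy</i>, with invented values) and a hypothetical helper <i>by_prefix()</i>; neither is part of the original code.<br />

```r
# Made-up miniature of the combined data frame; values are invented
dd_toy <- data.frame(Year = 2001:2003,
                     BLS.Employment = c(100L, 101L, 103L),
                     PI.PersonalInc = c(8.7, 8.9, 9.2),
                     PI.RentalInc   = c(0.2, 0.2, 0.3),
                     NS.Private     = c(10.1, 10.9, 11.8))

# Hypothetical helper: keep Year plus every column whose name
# starts with "<prefix>." (the ^ anchors the match at the start).
by_prefix <- function(d, prefix) {
  keep <- grepl(paste0("^", prefix, "\\."), names(d))
  d[, c("Year", names(d)[keep]), drop = FALSE]
}

pi_cols <- by_prefix(dd_toy, "PI")
print(names(pi_cols))
```

Anchoring the pattern is a design choice: an unanchored "PI" would also match any column that merely contains those letters somewhere in its name.<br />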
<br />
<a name='more'></a><br />
<b><span style="font-family: 'Courier New', Courier, monospace;">getdf <- </span><span style="font-family: 'Courier New', Courier, monospace;">function(x) {as.data.frame(cbind(Class=sapply(x,class)),optional=TRUE)}</span></b><br />
<b><br /></b>
<span style="font-family: 'Courier New', Courier, monospace;"><b>dfclass <- function(x,y) {as.matrix(grep(pattern=x,(as.character(names(y))),value=TRUE))}</b></span><br />
<b><span style="font-family: 'Courier New', Courier, monospace;"><br /></span>
</b><br />
<span style="font-family: Courier New, Courier, monospace;"><b>print_matrix <- function(x) {</b></span><br />
<span style="font-family: Courier New, Courier, monospace;"><b>matrix1 <- matrix(sapply(x,class))</b></span><br />
<span style="font-family: Courier New, Courier, monospace;"><b>matrix1 <- cbind(matrix1,matrix(names(x)))</b></span><br />
<div style="display: inline !important;">
<span style="font-family: Courier New, Courier, monospace;"><b> print(matrix1)</b></span></div>
<b style="font-family: 'Courier New', Courier, monospace;">}</b><br />
<div style="font-family: 'Courier New', Courier, monospace; font-size: small; font-weight: bold;">
<br /></div>
<span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">> </span><b style="font-family: 'Courier New', Courier, monospace; font-size: small;">getdf(dd)</b><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> Class</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">Year integer</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">BLS.Employment integer</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">BLS.CivLabForce integer</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">BLS.Census.Population integer</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">BLS.EmployPopRatio numeric</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">BLS.Census.EMP_CLF numeric</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">BLS.Census.EMP_POP numeric</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">BLS.Census.CLF_POP numeric</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">PI.PersonalInc numeric</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">PI.EmployeeComp numeric</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">PI.ProprietorInc numeric</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">PI.RentalInc numeric</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">PI.AssetReceipts numeric</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">PI.InterestInc numeric</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">PI.DividendInc numeric</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">PI.TransferReceipts numeric</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">PI.GovBenefits numeric</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">PI.OtherTransfer numeric</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">NS.ResidentialFixedAssets numeric</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">NS.Private numeric</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">NS.Corporate numeric</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">NS.Noncorp numeric</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">NS.Sole_prop_partner numeric</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">NS.Nonprofit numeric</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">NS.Households numeric</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">NS.Government numeric</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">NS.Federal numeric</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">NS.StateLocal numeric</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">NS.OwnerOccupied numeric</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">NS.TenantOccupied numeric</span><br />
<br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">><b> dfclass("PI",dd)</b></span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> [,1] </span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> [1,] "PI.PersonalInc" </span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> [2,] "PI.EmployeeComp" </span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> [3,] "PI.ProprietorInc" </span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> [4,] "PI.RentalInc" </span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> [5,] "PI.AssetReceipts" </span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> [6,] "PI.InterestInc" </span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> [7,] "PI.DividendInc" </span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> [8,] "PI.TransferReceipts"</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> [9,] "PI.GovBenefits" </span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[10,] "PI.OtherTransfer"</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"><br /></span>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"></span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">> <b>print_matrix(dd)</b></span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> [,1] [,2] </span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> [1,] "integer" "Year" </span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> [2,] "integer" "BLS.Employment" </span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> [3,] "integer" "BLS.CivLabForce" </span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> [4,] "integer" "BLS.Census.Population" </span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> [5,] "numeric" "BLS.EmployPopRatio" </span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> [6,] "numeric" "BLS.Census.EMP_CLF" </span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> [7,] "numeric" "BLS.Census.EMP_POP" </span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> [8,] "numeric" "BLS.Census.CLF_POP" </span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> [9,] "numeric" "PI.PersonalInc" </span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[10,] "numeric" "PI.EmployeeComp" </span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[11,] "numeric" "PI.ProprietorInc" </span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[12,] "numeric" "PI.RentalInc" </span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">...</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"><br /></span>
The str() function (lowercase; R is case sensitive) also works well to enumerate a <i>dataframe:</i><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"><br /></span>
<br />
<div style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">
> str(dd)</div>
<div style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">
'data.frame': 20 obs. of 30 variables:</div>
<div style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">
$ Year : int 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 ...</div>
<div style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">
$ BLS.Employment : int 118492 120259 123060 124900 126708 129558 131463 133488 136891 136933 ...</div>
<div style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">
$ BLS.CivLabForce : int 128105 129200 131056 132304 133943 136297 137673 139368 142583 143734 ...</div>
<div style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">
$ BLS.Census.Population : int 256894 260255 263436 266557 269667 272912 276115 279295 282385 285309 ...</div>
<div style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">
$ BLS.EmployPopRatio : num 0.615 0.617 0.625 0.629 0.632 0.638 0.641 0.643 0.644 0.637 ...</div>
<div style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">
$ BLS.Census.EMP_CLF : num 0.925 0.931 0.939 0.944 0.946 0.951 0.955 0.958 0.96 0.953 ...</div>
<div style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">
$ BLS.Census.EMP_POP : num 0.925 0.931 0.939 0.944 0.946 0.951 0.955 0.958 0.96 0.953 ...</div>
<div style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">
$ BLS.Census.CLF_POP : num 0.499 0.496 0.497 0.496 0.497 0.499 0.499 0.499 0.505 0.504 ...</div>
<div style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">
$ PI.PersonalInc : num 5347 5568 5875 6201 6592 ...</div>
<div style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">
$ PI.EmployeeComp : num 3647 3791 3981 4179 4388 ...</div>
<div style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">
$ PI.ProprietorInc : num 415 450 485 516 584 ...</div>
<div style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">
$ PI.RentalInc : num 84.6 114.1 142.9 154.6 170.4 ...</div>
<div style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">
$ PI.AssetReceipts : num 910 900 948 1005 1081 ...</div>
<div style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">
$ PI.InterestInc : num 722 698 713 752 784 ...</div>
<div style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">
$ PI.DividendInc : num 188 202 235 253 296 ...</div>
<div style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">
$ PI.TransferReceipts : num 746 791 826 879 924 ...</div>
<div style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">
$ PI.GovBenefits : num 730 777 813 860 901 ...</div>
<div style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">
$ PI.OtherTransfer : num 16.3 14.1 13.3 18.7 22.9 19.4 26 34 42.4 46.8 ...</div>
<div style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">
$ NS.ResidentialFixedAssets: num 6744 7160 7668 8009 8449 ...</div>
<div style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">
$ NS.Private : num 6586 6990 7486 7821 8253 ...</div>
<div style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">
$ NS.Corporate : num 69.6 71.5 74.6 77.4 81.1 ...</div>
<div style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">
$ NS.Noncorp : num 6516 6918 7411 7743 8172 ...</div>
<div style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">
$ NS.Sole_prop_partner : num 617 634 656 681 714 ...</div>
<div style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">
$ NS.Nonprofit : num 110 112 116 117 120 ...</div>
<div style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">
$ NS.Households : num 5788 6172 6639 6945 7337 ...</div>
<div style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">
$ NS.Government : num 159 170 182 188 196 ...</div>
<div style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">
$ NS.Federal : num 52.7 56.4 59.9 61.6 63.8 66 68.7 72.2 75.3 79.1 ...</div>
<div style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">
$ NS.StateLocal : num 106 114 122 126 133 ...</div>
<div style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">
$ NS.OwnerOccupied : num 4918 5267 5694 5975 6333 ...</div>
<div style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">
$ NS.TenantOccupied : num 1801 1866 1945 2005 2087 ...</div>
<div style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">
<br /></div>
<div style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">
<br /></div>
<div>
<span style="font-family: inherit;"><span style="font-size: small;"> Once we have peeked at our data, we can start using graphics packages like <i>lattice</i> to visualize data. I find that visualizing is a critical step in understanding the relationships within data, the specifics of which are not always immediately clear.</span><span style="font-size: small;"> </span>The <i>lattice</i> graphics package (which ships with the standard installation) lets us rather easily plot multiple data sets against the same X axis (e.g. "Year") through concatenation. Also added are a specific Y label, key, and line type (<i>ylab, auto.key, type</i>). I use the <i>lattice</i> package "xyplot" function. All measurements are in <i>billions:</i></span></div>
<div>
<span style="font-family: 'Times New Roman'; font-size: small;"><br /></span></div>
<span style="font-family: Courier New, Courier, monospace;">library(lattice)</span><br />
<span style="font-family: Courier New, Courier, monospace;">xyplot(NS.Households + PI.PersonalInc + PI.GovBenefits ~ Year,ylab="Non Specific",auto.key=TRUE,type="b")</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEicCNl-kml71Ik-Y1HStg7AX4Jx4zGajrx-ONjFhvoH4P3SSX-Eqr7ZnEgwhwt5sXvhNUxAqK28L8PY5FwD2C9WmM5yeG7biZFOCJfwbyVSHpGGi0X00H7kn51ZKWbXrmfRSaPazjjxbIo/s1600/NonSpecific.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="404" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEicCNl-kml71Ik-Y1HStg7AX4Jx4zGajrx-ONjFhvoH4P3SSX-Eqr7ZnEgwhwt5sXvhNUxAqK28L8PY5FwD2C9WmM5yeG7biZFOCJfwbyVSHpGGi0X00H7kn51ZKWbXrmfRSaPazjjxbIo/s640/NonSpecific.jpeg" width="640" /></a></div>
<span style="font-family: inherit;">What we perceive at first glance wouldn't surprise many observers of the American economy over the last twenty years. Nominally, we have seen increasing valuation of housing stock, personal income, and government benefits. 2008 - 2009 brought a "crash" to the housing stock valuation and personal income and a simultaneous increase in government spending. However, even this non specific chart shows the "bubble" in housing stock valuation whose "bursting" resulted in a precipitous decline in personal income. Now let us try something a bit more complex. Below, I scale and create data sets to help me visually understand and weight macro-economic change. Notice that inside parentheses the operators "+" or "-" perform math; otherwise "+" concatenates the data series plotted on the Y axis.</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">xyplot(BLS.Census.Population/100 + PI.GovBenefits + (BLS.Census.CLF_POP * 5000) + (NS.Households - PI.PersonalInc) ~ Year,ylab="Divergence",auto.key=TRUE,type="b")</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiSRTrUtkVq0wlzb3qr9g2WrrS8yNRsgseUBvgVvd6He5ZUCfZWHME8fOOQl8v6b3e3F09F6rIWZPzi3PyIMAMY3VZPaFGh0p0CnLEAAMJCFBSlEH8BZ1g5BvBnMpWD-orJ2VkBYfni6JE/s1600/Divergence.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="403" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiSRTrUtkVq0wlzb3qr9g2WrrS8yNRsgseUBvgVvd6He5ZUCfZWHME8fOOQl8v6b3e3F09F6rIWZPzi3PyIMAMY3VZPaFGh0p0CnLEAAMJCFBSlEH8BZ1g5BvBnMpWD-orJ2VkBYfni6JE/s640/Divergence.jpeg" width="640" /></a></div>
<span style="font-family: inherit;">The orange line represents the increasing difference between household value and personal income we saw in the first chart. Only this time, we can readily see that difference crest at about $5 trillion in 2007 before taking a dive. Total population is originally measured in thousands, so I divide it by 100 to smack it front and center in this graph of billions. <a href="http://bls.gov/">BLS </a>statistics give us the employed labor force divided by the potential working force as a percentage. (This is not one of the vaunted unemployment measures, U-3 or U-6!) I multiply this percentage by 5000 so the relationship between total population and a shrinking work force percentage (over time) is made clear. Government benefits fit into this graph without scaling. Clearly, they had been rising nominally well before Barack Obama took office and government stimulus packages were deployed.</span><br />
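<span style="font-family: inherit;">The scaling and differencing described above can be sketched on a small, invented data set. The <i>toy</i> dataframe and its column names are assumptions made up for illustration; they are not the series charted above. Here each computed term is wrapped in <i>I()</i>, which the <i>lattice</i> documentation describes as the way to have an expression treated as arithmetic rather than as a separate series:</span><br />

```r
library(lattice)

# Invented values for illustration only (not BLS/BEA data)
toy <- data.frame(
  Year    = 2001:2005,
  Housing = c(5000, 5600, 6300, 6100, 5800),
  Income  = c(4000, 4200, 4400, 4500, 4600),
  Share   = c(0.50, 0.50, 0.49, 0.49, 0.48)
)

# Top-level "+" separates the series drawn on the Y axis;
# arithmetic inside I() is evaluated before plotting.
p <- xyplot(I(Housing - Income) + I(Share * 5000) + Income ~ Year,
            data = toy, ylab = "Scaled toy series",
            auto.key = TRUE, type = "b")
print(p)
```

<span style="font-family: inherit;">Passing <i>data = toy</i> keeps the example independent of whatever happens to be attached in the session.</span><br />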
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">From this graph we could deduce more sharply the American economic dilemma: a disjointed housing bubble, a shrinking workforce percentage (e.g. an aging population), a steadily increasing population and an economy dependent upon increasing government benefits. Let us try another approach at visualizing similar data, but before we do, some comments on employment and population percentages may be useful. </span><span style="font-family: inherit;">The total number of employed persons in the United States is always some percentage of the civilian labor force, which in turn is always some percentage of the total population. These numbers are always much different from U-3 or U-6, which are the usual measures of unemployment. Let us take a look at this data for the first and last years in my data set from the <a href="http://bls.gov/">BLS</a> and <a href="http://census.gov/">Census</a>. To do this, I am going to bind three columns as matrices from my <i>dataframe. </i>This code formats indexed <i>dataframe </i>information nicely:</span><br />
<br />
<span style="font-family: Courier New, Courier, monospace;">> cbind(matrix(names(ddBLS)),matrix(ddBLS[1,1:8]), matrix(ddBLS[20,1:8]))</span><br />
<span style="font-family: Courier New, Courier, monospace;"> [,1] [,2] [,3] </span><br />
<span style="font-family: Courier New, Courier, monospace;">[1,] "Year" 1992 2011 </span><br />
<span style="font-family: Courier New, Courier, monospace;">[2,] "BLS.Employment" 118492 139869</span><br />
<span style="font-family: Courier New, Courier, monospace;">[3,] "BLS.CivLabForce" 128105 153617</span><br />
<span style="font-family: Courier New, Courier, monospace;">[4,] "BLS.Census.Population" 256894 312603</span><br />
<span style="font-family: Courier New, Courier, monospace;">[5,] "BLS.EmployPopRatio" 0.615 0.584 </span><br />
<span style="font-family: Courier New, Courier, monospace;">[6,] "BLS.Census.EMP_CLF" 0.925 0.911 </span><br />
<span style="font-family: Courier New, Courier, monospace;">[7,] "BLS.Census.EMP_POP" 0.461 0.447 </span><br />
<span style="font-family: Courier New, Courier, monospace;">[8,] "BLS.Census.CLF_POP" 0.499 0.491 </span><br />
<div>
<br /></div>
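<span style="font-family: inherit;">As a check on how the three ratio columns relate to the counts, the sketch below recomputes them from the 1992 and 2011 values printed above. The <i>bls</i> dataframe and its short column names are assumptions for illustration; the definitions (employment over labor force, employment over population, labor force over population) are inferred from the column names and appear to agree with the printed figures when rounded to three places:</span><br />

```r
# Counts copied from the table above (rows for 1992 and 2011)
bls <- data.frame(
  Year        = c(1992L, 2011L),
  Employment  = c(118492, 139869),
  CivLabForce = c(128105, 153617),
  Population  = c(256894, 312603)
)

# Ratios as inferred from the column names
bls$EMP_CLF <- round(bls$Employment  / bls$CivLabForce, 3)
bls$EMP_POP <- round(bls$Employment  / bls$Population,  3)
bls$CLF_POP <- round(bls$CivLabForce / bls$Population,  3)

# Transpose so each variable is a row, as in the listing above
print(t(bls))
```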
<span style="font-family: inherit;">We can use the lattice package ('xyplot') to look at these percentages:</span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">xyplot(BLS.Census.EMP_CLF + BLS.EmployPopRatio + BLS.Census.CLF_POP + </span><span style="font-family: 'Courier New', Courier, monospace;">BLS.Census.EMP_POP</span><span style="font-family: 'Courier New', Courier, monospace;"> ~ Year,type="b",ylab="BLS Employment %",auto.key=TRUE)</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiqxP26duL5IBUPXoUsIK-srVsjmwoBrvHr_IrnP7cECJsTBGRWsKdjNhuV2wPh0ILQlRS5Ab3dU7mG9ArfKYH0CuddFP5gdBYlBixqFbdFkBzI_3SRrb3Jt6DoMZMX9h5BlB4U1K0Q4LQ/s1600/BLS_Employment.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiqxP26duL5IBUPXoUsIK-srVsjmwoBrvHr_IrnP7cECJsTBGRWsKdjNhuV2wPh0ILQlRS5Ab3dU7mG9ArfKYH0CuddFP5gdBYlBixqFbdFkBzI_3SRrb3Jt6DoMZMX9h5BlB4U1K0Q4LQ/s640/BLS_Employment.jpeg" width="640" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<span style="font-family: inherit;">But maybe we want to look at related data on four separate charts, each with its own Y axis. We can use commands from the base graphics package to do this:</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<br />
<span style="font-family: Courier New, Courier, monospace;">par(mfrow=c(2,2), pch=16)</span><br />
<span style="font-family: Courier New, Courier, monospace;">attach(USEmployPop.1992.2011)</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">plot(Year,BLS.Employment,type="b")</span><br />
<span style="font-family: Courier New, Courier, monospace;">plot(Year,BLS.CivLabForce,type="b")</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">plot(Year,BLS.Census.CLF_POP, type="b")</span><br />
<span style="font-family: Courier New, Courier, monospace;">plot(Year,BLS.Census.EMP_POP,type="b")</span><br />
<span style="font-family: Courier New, Courier, monospace;">detach(USEmployPop.1992.2011)</span><br />
<span style="font-family: Courier New, Courier, monospace;">par(mfrow=c(1,1), pch=1)</span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi0LHN2ut68u3o2uKdPywsvo-j7C60Raw5ckTY9JIkdC7if725u5DAd8D4W-Vj-cHeqwzBi8cy9RcvH208juRbcf2Z8dSTXuPLJW6TOrqQxIwNLdUp4r2MfmPCo0O9WlSUupzr_K86ADbg/s1600/BLS_Employment_001.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi0LHN2ut68u3o2uKdPywsvo-j7C60Raw5ckTY9JIkdC7if725u5DAd8D4W-Vj-cHeqwzBi8cy9RcvH208juRbcf2Z8dSTXuPLJW6TOrqQxIwNLdUp4r2MfmPCo0O9WlSUupzr_K86ADbg/s640/BLS_Employment_001.jpeg" width="640" /></a></div>
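<span style="font-family: inherit;">A variation on the four-panel layout above, sketched with an invented <i>toy</i> dataframe: <i>par()</i> returns the previous settings when it changes them, so they can be saved and restored afterwards, and <i>with()</i> gives access to the columns without <i>attach</i>/<i>detach</i>. The data and column names here are assumptions for illustration only:</span><br />

```r
# Invented data for illustration
toy <- data.frame(Year = 2001:2010,
                  A = (1:10)^2,
                  B = sqrt(1:10),
                  C = 10:1,
                  D = log(1:10))

# Save the old settings while switching to a 2 x 2 grid of solid points
old_par <- par(mfrow = c(2, 2), pch = 16)

with(toy, {
  plot(Year, A, type = "b")
  plot(Year, B, type = "b")
  plot(Year, C, type = "b")
  plot(Year, D, type = "b")
})

# Restore whatever was in effect before
par(old_par)
```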
<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">diff1 <- as.data.frame(cbind(Year,'BLS.Total.Employment(M)'=BLS.Employment/100,NS.ResidentialFixedAssets,PI.PersonalInc,'Diff(NS.RFA - PI.PInc)' = (NS.ResidentialFixedAssets - PI.PersonalInc)))</span><br />
<span style="font-family: Courier New, Courier, monospace;">diff1</span><br />
<br />
<div style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">
<br /></div>
Ryan M. Ferrishttp://www.blogger.com/profile/03122603266808854365noreply@blogger.com0tag:blogger.com,1999:blog-2873774223581777424.post-22283509797984402472013-03-21T22:03:00.005-07:002013-03-29T20:43:49.663-07:00ggplot2Hadley Wickham on using R and ggplot2:<br />
<br />
<iframe allowfullscreen="" frameborder="0" height="360" src="http://www.youtube.com/embed/TaxJwC_MP9Q?feature=player_detailpage" width="640"></iframe>Ryan M. Ferrishttp://www.blogger.com/profile/03122603266808854365noreply@blogger.com0tag:blogger.com,1999:blog-2873774223581777424.post-35312793511388082542013-03-18T17:06:00.000-07:002013-03-24T17:44:59.689-07:00Combining DataFrames in R ProgrammingThe screencast below discusses combining dataframes from disparate sources in R Programming. Full screen is probably best. The code for the screencast is below. Data files for this screencast can be found <a href="http://teachingdatascience.rmfmedia.com/Week5/">here</a>.<br />
<br />
<iframe allowfullscreen="" frameborder="0" height="360" src="http://www.youtube.com/embed/B1hNKeVmx_w?feature=player_embedded" width="640"></iframe>
<br />
<br />
This is the code that accompanies <a href="http://youtu.be/B1hNKeVmx_w">the screen cast above</a>.<br />
<br />
<span style="font-family: Courier New, Courier, monospace;"># data science exercise combining dataframes from different sources</span><br />
<span style="font-family: Courier New, Courier, monospace;"># data science exercise using lattice graphics system</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">list.files()</span><br />
<span style="font-family: Courier New, Courier, monospace;">list.files(pattern="csv")</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">USPerInc.1992.2011 <- data.frame(read.csv("PersonalIncomeDisposition1992-2011.csv"))</span><br />
<span style="font-family: Courier New, Courier, monospace;">USResidentialAsset.1992.2011 <- data.frame(read.csv("Current-Cost_Net_Stock_Residential_Fixed_Assets.csv"))</span><br />
<span style="font-family: Courier New, Courier, monospace;">USEmployPop.1992.2011 <- data.frame(read.csv("BLS_Census.csv"))</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">USComb <- cbind(USPerInc.1992.2011[c(1,6)])</span><br />
<span style="font-family: Courier New, Courier, monospace;">names(USComb)</span><br />
<span style="font-family: Courier New, Courier, monospace;">USComb <- cbind(USResidentialAsset.1992.2011[c(2,3,4,8)])</span><br />
<span style="font-family: Courier New, Courier, monospace;">names(USComb)</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">USComb <- cbind(USPerInc.1992.2011[c(1,6)])</span><br />
<span style="font-family: Courier New, Courier, monospace;">USComb <- cbind(USComb,USResidentialAsset.1992.2011[c(2,3,4,8)])</span><br />
<span style="font-family: Courier New, Courier, monospace;">matrix(names(USComb))</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">USComb1 <- data.frame(read.csv("BLS_Census.csv"))</span><br />
<span style="font-family: Courier New, Courier, monospace;">USComb1 <- cbind(USComb1,(data.frame(read.csv("PersonalIncomeDisposition1992-2011.csv"))))</span><br />
<span style="font-family: Courier New, Courier, monospace;">USComb1 <- cbind(USComb1,(data.frame(read.csv("Current-Cost_Net_Stock_Residential_Fixed_Assets.csv"))))</span><br />
<span style="font-family: Courier New, Courier, monospace;">names(USComb1)</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">grep(pattern="Year",(as.character(names(USComb1))),value=TRUE)</span><br />
<span style="font-family: Courier New, Courier, monospace;">grepl(pattern="Year",(as.character(names(USComb))))</span><br />
<span style="font-family: Courier New, Courier, monospace;">matrix(grepl(pattern="Year",(as.character(names(USComb1)))))</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">USComb1 <- cbind(USComb1[c(-9,-20)])</span><br />
<span style="font-family: Courier New, Courier, monospace;"> </span><br />
<span style="font-family: Courier New, Courier, monospace;">matrix1 <- matrix(sapply(USComb,class))</span><br />
<span style="font-family: Courier New, Courier, monospace;">matrix1 <- cbind(matrix1,matrix(names(USComb)))</span><br />
<span style="font-family: Courier New, Courier, monospace;">matrix2 <- matrix(sapply(USComb1,class))</span><br />
<span style="font-family: Courier New, Courier, monospace;">matrix2 <- cbind(matrix2,matrix(names(USComb1)))</span><br />
<span style="font-family: Courier New, Courier, monospace;">matrix1</span><br />
<span style="font-family: Courier New, Courier, monospace;">matrix2</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: 'Courier New', Courier, monospace;">dd <- USComb1</span><br />
<span style="font-family: Courier New, Courier, monospace;">str(dd)</span><br />
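<br />
<span style="font-family: inherit;">The column-binding steps above can be sketched on two small invented dataframes. The names <i>income</i> and <i>housing</i> and their values are assumptions for illustration. The first half mirrors the listing: bind, locate the repeated "Year" columns with <i>grep</i>, and drop the extras by negative index. The second half shows base R's <i>merge</i>, which joins on the shared key instead and is offered here only as an alternative, not as the approach used in the screencast:</span><br />

```r
# Two invented sources that share a Year column
income  <- data.frame(Year = 2001:2003, Income  = c(10, 12, 15))
housing <- data.frame(Year = 2001:2003, Housing = c(50, 55, 61))

# cbind keeps both Year columns
combined <- cbind(income, housing)
year_cols <- grep("Year", names(combined))

# Drop every Year column except the first
combined <- combined[-year_cols[-1]]
names(combined)

# Alternative: join on the key so rows are matched by Year
merged <- merge(income, housing, by = "Year")
names(merged)
```

<span style="font-family: inherit;">Note that <i>cbind</i> pairs rows by position, so it is only safe when every source lists the same years in the same order; <i>merge</i> matches rows by the key itself.</span><br />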
<br />Ryan M. Ferrishttp://www.blogger.com/profile/03122603266808854365noreply@blogger.com0tag:blogger.com,1999:blog-2873774223581777424.post-76467324159266846732013-03-11T13:58:00.001-07:002013-03-11T14:13:11.145-07:00<h3 style="text-align: center;">
"Great coders are today's rock stars."</h3>
<h3>
I just had to post this video on learning to code from an all star cast at <a href="http://code.org/">code.org</a>...</h3>
<iframe allowfullscreen="" frameborder="0" height="360" src="https://www.youtube.com/embed/nKIu9yen5nc?feature=player_embedded" width="640"></iframe>Ryan M. Ferrishttp://www.blogger.com/profile/03122603266808854365noreply@blogger.com0tag:blogger.com,1999:blog-2873774223581777424.post-85793714334918436662013-03-08T19:20:00.001-08:002013-03-11T13:23:48.739-07:00RGui, Rstudio, Notepad++; Creating and Using Functions<iframe allowfullscreen="" frameborder="0" height="360" src="http://www.youtube.com/embed/gY6ElG2Tc6k?feature=player_detailpage" width="640"></iframe>
The code below is from the <a href="http://www.youtube.com/watch?v=gY6ElG2Tc6k&feature=player_detailpage&list=UUGgeZekA-AHAIbnAKVNBP0w">youtube published screen cast</a> above. This screen cast is available as a high resolution <a href="http://teachingdatascience.rmfmedia.com/Week4/EditingDataR_HiDef.wmv">WMV file</a>. For more data see the <a href="http://teachingdatascience.rmfmedia.com/Week4/">Week4 folder</a>.<br />
<a name='more'></a><br />
<strong># * ** Basic R Programming Function and Plotting Demonstration ** * #</strong><br />
<br />
<strong># create a function to find the square of any number</strong><br />
sqr <- function(a) {a^2}<br />
<strong># works on 'scalar' or iterates over 'vector'</strong><br />
sqr(100)<br />
sqr(1:100)<br />
<strong># use 'sapply' to iterate range as a data.frame for your function 'sqr' and built-in function 'sqrt'</strong><br />
data.frame(sapply((1:100),sqr))<br />
data.frame(sapply((1:100),sqr),(sapply((1:100),(sqrt))))<br />
<strong># plot the data.frame with histogram-like vertical lines (type="h") and custom X,Y labels</strong><br />
plot(data.frame(sapply((1:100),sqr),(sapply((1:100),(sqrt)))),xlab="Square 1:100", ylab="Square root 1:100",type="h")<br />
<br />
<strong># use 'cbind' (column bind) to apply XY labels to the columns</strong><br />
data.frame(cbind(sqr=sapply((1:100),sqr)),sqrt=sapply((1:100),(sqrt)))<br />
<strong># pump the same dataframe to 'dd' and plot 'dd'</strong><br />
dd <- data.frame(cbind(sqr=sapply((1:100),sqr)),sqrt=sapply((1:100),(sqrt)))<br />
plot(dd, type="h")<br />
lines(dd)<br />
<br />
<strong># create separate numeric vectors and combine them into a dataframe</strong><br />
s1 <- sapply((1:100),sqr)<br />
s2 <- sapply((1:100),sqrt)<br />
dd <- data.frame(s1,s2)<br />
plot(dd, type="h")<br />
lines(dd)<br />
<br />
<strong># use terms more relevant to the X and Y labels</strong><br />
Square <- sapply((1:100),sqr)<br />
SquareRoot <- sapply((1:100),sqrt)<br />
dd <- data.frame(Square,SquareRoot)<br />
plot(dd, type="h")<br />
lines(dd)<br />
<br />
<strong># create functions and values for all XY graph quadrants</strong><br />
sqr <- function(a) {a^2}<br />
sqr_neg <- function(a) {-(a^2)}<br />
<br />
Square <- sapply((1:100),sqr)<br />
Square_neg <- -(sapply((1:100),sqr))<br />
SquareRoot <- sapply((1:100),sqrt)<br />
SquareRoot_neg <- -(sapply((1:100),sqrt))<br />
<br />
<strong> # plot a four by four series of charts as a dataframe</strong><br />
dd <- data.frame(Square,SquareRoot,Square_neg,SquareRoot_neg)<br />
plot(dd)<br />
<br />
<strong># quadrant (both XY are positive numbers)</strong><br />
plot(dd$Square,dd$SquareRoot, type="h")<br />
lines(dd$Square,dd$SquareRoot)<br />
<br />
<strong># quadrant </strong><b>(both XY are negative numbers)</b><br />
plot(dd$Square_neg,dd$SquareRoot_neg, type="h")<br />
lines(dd$Square_neg,dd$SquareRoot_neg)<br />
<br />
<strong> # use points instead of lines</strong><br />
plot(dd$Square,dd$SquareRoot, type="h")<br />
points(dd$Square,dd$SquareRoot)<br />
<br />
plot(dd$Square_neg,dd$SquareRoot_neg, type="h")<br />
points(dd$Square_neg,dd$SquareRoot_neg)<br />
<br />
<b># More Plotting and plotting functions</b><br />
plot(Square ~ SquareRoot)<br />
<br />
plot(Square,SquareRoot)<br />
plotfunc <- function(x) {plot((Square ~ SquareRoot),subset = x)}<br />
plotfunc <- function(x) {plot((Square ~ SquareRoot),dd,subset = x)}<br />
plotfuncSquaregtr <- function(x) {plot((Square ~ SquareRoot),subset=Square > x)}<br />
plotfuncSquarelt <- function(x) {plot((Square ~ SquareRoot),subset=Square < x)}<br />
<br />
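<span style="font-family: inherit;">A short sketch of the point made near the top of this listing, that <i>sqr</i> works on a scalar or over a vector. Because the <i>^</i> operator is vectorized, calling the function directly on a vector gives the same result as iterating with <i>sapply</i>. The variable names <i>direct</i> and <i>looped</i> are assumptions for illustration:</span><br />

```r
# Same definition as in the listing above
sqr <- function(a) {a^2}

direct <- sqr(1:5)            # vectorized: one call on the whole vector
looped <- sapply(1:5, sqr)    # explicit iteration, one element at a time

print(direct)
print(isTRUE(all.equal(direct, looped)))
```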
<br />Ryan M. Ferrishttp://www.blogger.com/profile/03122603266808854365noreply@blogger.com0tag:blogger.com,1999:blog-2873774223581777424.post-21427343828241440122013-03-06T21:03:00.001-08:002013-03-13T10:39:28.804-07:00Basic data analysis: Part I<br />
<span style="color: red; font-family: 'Courier New', Courier, monospace;">[Editor's note: Under construction - 03/06/2013</span><span style="color: red; font-family: 'Courier New', Courier, monospace;">.]</span><br />
<span style="font-family: inherit;">There are a number of functions that will be helpful for this example. Please examine them through use of the help system (e.g. 'help(<i>command</i>)'):</span><br />
<ul>
<li><span style="font-family: inherit;">read.csv()</span></li>
<li><span style="font-family: inherit;">head()</span></li>
<li><span style="font-family: inherit;">names()</span></li>
<li><span style="font-family: inherit;">as.numeric()</span></li>
<li><span style="font-family: inherit;">c()</span></li>
<li><span style="font-family: inherit;">data.frame() or as.data.frame()</span></li>
<li><span style="font-family: inherit;">sapply()</span></li>
<li><span style="font-family: inherit;">class()</span></li>
<li><span style="font-family: inherit;">print()</span></li>
<li><span style="font-family: inherit;">levels()</span></li>
<li><span style="font-family: inherit;">droplevels()</span></li>
<li><span style="font-family: inherit;">subset()</span></li>
<li><span style="font-family: inherit;">order()</span></li>
<li><span style="font-family: inherit;">plot()</span></li>
<li><span style="font-family: inherit;">lines()</span></li>
</ul>
<div>
<a name='more'></a><span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
<div>
<span style="font-family: inherit;">Understanding bracket (e.g. '[]') notation [<a href="http://www.ats.ucla.edu/stat/r/modules/subsetting.htm">1</a>] and the <i>for</i> command is also important for this exercise.[<a href="http://www.youtube.com/watch?list=PLjTlxb-wKvXNnjUTX4C8IeIhPBjPkng6B&v=s_h9ruNwI_0&feature=player_detailpage">2</a>] This exercise uses <i>nested interior</i> functions whose results provide arguments for <i>exterior</i> functions. The general form is:</span></div>
<div>
<ul>
<li><i style="font-family: 'Courier New', Courier, monospace;"><b>function1(function2(function3()))</b></i></li>
</ul>
</div>
<div>
<span style="font-family: inherit;">Here, function3 provides arguments for function2 which provides arguments for function1. Sometimes this form appears as:</span></div>
<div>
<ul>
<li><i style="font-family: 'Courier New', Courier, monospace;"><b>function1(function2(function3(argsFUN3),other_argsFUN2), other_argsFUN1)</b></i></li>
</ul>
</div>
<div>
<span style="font-family: inherit;">where additional function arguments are passed, still inside the parentheses specific to each function. The result is usually a datatype determined by the outermost function.[<a href="http://www.youtube.com/watch?list=PLjTlxb-wKvXNnjUTX4C8IeIhPBjPkng6B&feature=player_detailpage&v=KIqlKw2zqEQ">3</a>] </span><br />
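<span style="font-family: inherit;">A minimal sketch of the nested form, using three base R functions chosen only for illustration (<i>sqrt</i>, <i>sort</i>, <i>head</i>). The step-by-step version and the nested version compute the same thing; the nested call is simply evaluated from the inside out:</span><br />

```r
values <- c(16, 4, 9)

# Step by step, innermost first
step1 <- sqrt(values)        # function3: square roots
step2 <- sort(step1)         # function2: ascending order
step3 <- head(step2, 2)      # function1: first two, with its own extra argument

# The same work as one nested call
nested <- head(sort(sqrt(values)), 2)

print(nested)
print(identical(step3, nested))
```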
<span style="font-family: inherit;"><br />
The premise of this exercise is quite straightforward. You are 16 years old. Your parents have offered to purchase you a vehicle. They will pay for the purchase price, taxes, and insurance. However, you must pay for your fuel. You want to examine the <a href="http://www.fueleconomy.gov/feg/ws/index.shtml">EPA's fuel economy database</a> to better understand your options for high mileage vehicles.</span></div>
<span style="font-family: inherit;"><br />
The <i>dataframe </i>class[<a href="http://cran.r-project.org/doc/contrib/Lam-IntroductionToR_LHL.pdf">4</a>] is a row/column data structure that accepts mixed or heterogeneous columnar datatypes or data classes. </span><br />
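<span style="font-family: inherit;">A small sketch of what "mixed columnar datatypes" means in practice. The <i>toy</i> dataframe and its values are invented for illustration; the column name <i>comb08</i> is borrowed from the EPA file used below:</span><br />

```r
# Each column has its own class; every column has the same number of rows
toy <- data.frame(
  make   = c("Honda", "Toyota"),
  year   = c(2012L, 2013L),
  comb08 = c(42.5, 50),
  hybrid = c(FALSE, TRUE),
  stringsAsFactors = FALSE
)

col_classes <- sapply(toy, class)
print(col_classes)
```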
<br />
<span style="font-family: Courier New, Courier, monospace;"># Download http://www.fueleconomy.gov/feg/epadata/vehicles.csv.zip</span><br />
<span style="font-family: Courier New, Courier, monospace;"># unzip vehicles.csv to your working directory</span><br />
<ul>
<li><span style="font-family: 'Courier New', Courier, monospace;">AllVehicles <- read.csv("vehicles.csv")</span></li>
<li><span style="font-family: 'Courier New', Courier, monospace;">AllVehicles$comb08 <- as.numeric(AllVehicles$comb08)</span></li>
<li><span style="font-family: 'Courier New', Courier, monospace;"># You can show the names and the depth and datatypes of the 71 columns and 33,184 rows in the <i>dataframe '</i>AllVehicles' with:</span></li>
<li><span style="font-family: 'Courier New', Courier, monospace;">nrow(AllVehicles); ncol(AllVehicles)</span></li>
<li><span style="font-family: Courier New, Courier, monospace;">length(names(AllVehicles))</span></li>
<li><span style="font-family: 'Courier New', Courier, monospace;">AllVehicles[0,]</span></li>
<li><span style="font-family: 'Courier New', Courier, monospace;">AllVehicles[,0]</span></li>
<li><span style="font-family: 'Courier New', Courier, monospace;">names(AllVehicles)</span></li>
<li><span style="font-family: 'Courier New', Courier, monospace;"># You can show the <i>class </i>(datatype)<i> </i>of each column with:</span></li>
<li><span style="font-family: 'Courier New', Courier, monospace;">sapply(AllVehicles,class)</span></li>
<li><span style="font-family: Courier New, Courier, monospace;">class(AllVehicles$comb08)</span></li>
</ul>
<span style="font-family: inherit;">Extract separate <i>dataframes</i> for those cars whose combined mileage is greater than forty mpg and greater than forty-five mpg.</span><br />
<ul>
<li><span style="font-family: Courier New, Courier, monospace;">GTR40 <- data.frame(subset(AllVehicles, comb08 > 40, select = c(model,make,year,comb08,comb08U,combA08,combA08U,combE)))</span></li>
<li><span style="font-family: Courier New, Courier, monospace;">GTR45 <- data.frame(subset(AllVehicles, comb08 > 45, select = c(model,make,year,comb08,comb08U,combA08,combA08U,combE)))</span></li>
</ul>
<span style="font-family: inherit;">For examination purposes, print to the screen separate lists of the vehicles from model years 2008 through 2013 with fuel economy estimates of over forty and over forty-five mpg. Note the 'row.names=NULL' argument to the <i>data.frame</i> function and the enclosing <i>droplevels</i> function. Together they rebuild each subset as an object separate from the parent <i>dataframe</i>: 'row.names=NULL' renumbers the rows from 1, and <i>droplevels</i> discards factor levels that no longer occur in the subset. </span><span style="font-family: inherit;">The '</span><i style="font-family: inherit;">for' </i><span style="font-family: inherit;">control structure does not require braces ({}) if the loop body is a single command on the same line. The two typical forms are:</span><br />
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace;"><b><i>for (i in (some numerical range,list or function)) do this</i></b></span></blockquote>
<blockquote class="tr_bq">
<b style="font-family: 'Courier New', Courier, monospace;"><i>for (i in (some numerical range,list or function)) {</i></b></blockquote>
<blockquote class="tr_bq">
<b style="font-family: 'Courier New', Courier, monospace;"><i>do this on the next line inside braces</i></b></blockquote>
<blockquote class="tr_bq">
<b>} </b></blockquote>
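The two forms above can be sketched with a short numerical range. The variable name 'total' is only an illustration.

```r
# One-line form: no braces needed for a single command
for (i in 2008:2010) print(i)

# Multi-line form: braces group the commands in the loop body
total <- 0
for (i in 2008:2010) {
  total <- total + i
}
print(total)
```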
<span style="font-family: inherit;">'2008:2013' specifies a numerical range in the source below:</span><br />
<ul>
<li><span style="font-family: Courier New, Courier, monospace;">for (i in (2008:2013)) print(droplevels(data.frame((subset(GTR40,year == i)),row.names=NULL)))</span></li>
<li><span style="font-family: Courier New, Courier, monospace;">for (i in (2008:2013)) print(droplevels(data.frame((subset(GTR45,year == i)),row.names=NULL)))</span></li>
</ul>
<div>
<span style="font-family: inherit;">Now we will sort, extract and plot our data. R is case sensitive. Pay special attention to your typing. <span style="color: red;">[Editor's note: Discussion of the <i>order</i> function and the use of brackets should go here.]</span></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
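As a rough sketch of how the <i>order</i> function and brackets work together, consider the toy dataframe below. The models and mileage figures are invented, and the names 'mpg', 'idx' and 'sorted' are only for illustration.

```r
# Invented toy data
mpg <- data.frame(
  model  = c("A", "B", "C"),
  comb08 = c(46, 41, 50)
)

# order() returns the row positions that would sort the vector
idx <- order(mpg$comb08)
print(idx)

# Brackets take [rows, columns]; leaving the column slot empty keeps all columns
sorted <- mpg[idx, ]

# The old row names travel with the rows;
# data.frame(..., row.names = NULL) renumbers them from 1
sorted <- data.frame(sorted, row.names = NULL)
print(sorted)
```

Note that order() does not sort anything by itself. It only reports positions, and the brackets do the rearranging.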
<span style="font-family: 'Courier New', Courier, monospace;"># Plot 2013 GTR 40 MPG</span><br />
<ul>
<li><span style="font-family: Courier New, Courier, monospace;">for (i in (2013)) GTR40_2013 <-(droplevels(data.frame((subset(GTR40,year == i)),row.names=NULL)))</span></li>
<li><span style="font-family: Courier New, Courier, monospace;"># Sort or order by 'comb08' or Combined MPG and reorder index:</span></li>
<li><span style="font-family: Courier New, Courier, monospace;">GTR40_2013 <- droplevels(data.frame(GTR40_2013[order(GTR40_2013$comb08),],row.names=NULL))</span></li>
<li><span style="font-family: Courier New, Courier, monospace;">#You can show the 'comb08' sorted <i>dataframe </i>with<i> '</i>GTR40_2013':</span></li>
</ul>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhh4U2vKLqrVU_ZgS6yjTK-tNd9sHLuHOYtdnuA62BsngNnWE-G1KTCiAHk7lKQNZuoy9onV5DqGq5G__Z_qdEYazlfGC6AgoP-_srEYzJfhAU3kYwLFIsMz9PVpRuFGDww75OcjiTCzLo/s1600/GTR40_2013.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="352" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhh4U2vKLqrVU_ZgS6yjTK-tNd9sHLuHOYtdnuA62BsngNnWE-G1KTCiAHk7lKQNZuoy9onV5DqGq5G__Z_qdEYazlfGC6AgoP-_srEYzJfhAU3kYwLFIsMz9PVpRuFGDww75OcjiTCzLo/s640/GTR40_2013.JPG" width="640" /></a></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span><span style="font-family: inherit;">The 'plot' and 'lines' functions are part of the default graphics package in R.</span></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<ul>
<li><span style="font-family: Courier New, Courier, monospace;">plot(GTR40_2013$comb08,xlab="GTR40 2013 Car Index",ylab="2013 Combined MPG from EPA 'comb08'",type="h")</span></li>
<li><span style="font-family: Courier New, Courier, monospace;">lines(GTR40_2013$comb08)</span></li>
</ul>
<span style="font-family: 'Courier New', Courier, monospace;"># Plot 2013 GTR 45 MPG</span><br />
<ul>
<li><span style="font-family: Courier New, Courier, monospace;">for (i in (2013)) GTR45_2013 <-(droplevels(data.frame((subset(GTR45,year == i)),row.names=NULL)))</span></li>
<li><span style="font-family: Courier New, Courier, monospace;"># Sort or order by 'comb08' or Combined MPG and reorder index:</span></li>
<li><span style="font-family: Courier New, Courier, monospace;">GTR45_2013 <- droplevels(data.frame(GTR45_2013[order(GTR45_2013$comb08),],row.names=NULL))</span></li>
<li><span style="font-family: Courier New, Courier, monospace;">#You can show the 'comb08' sorted <i>dataframe </i>with<i> '</i>GTR45_2013' :</span></li>
</ul>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjycRZIEBtAi6kMx4fvHZiIf5R91kGgUXVFHjkq0BwrpnvtzxftZaoE0z825fTv3EBNy-uBtrVZPI1-ya_kvt-eSieCn0W7Wj3Q1A7HVjTIPUMDcGqXbiX3qaoEiZFXXzaMzE8QkJURVOo/s1600/GTR45_2013.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjycRZIEBtAi6kMx4fvHZiIf5R91kGgUXVFHjkq0BwrpnvtzxftZaoE0z825fTv3EBNy-uBtrVZPI1-ya_kvt-eSieCn0W7Wj3Q1A7HVjTIPUMDcGqXbiX3qaoEiZFXXzaMzE8QkJURVOo/s640/GTR45_2013.JPG" width="640" /></a></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
<ul>
<li><span style="font-family: Courier New, Courier, monospace;">plot(GTR45_2013$comb08,xlab="GTR45 2013 Car Index",ylab="2013 Combined MPG from EPA 'comb08'",type="h")</span></li>
<li><span style="font-family: Courier New, Courier, monospace;">lines(GTR45_2013$comb08)</span> </li>
</ul>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijUzbPFm6qxSEwfqS6glzqUmJYdsRP1I7z6ylWk4HB5NEx-rzGSh7RK2K41gP_RI7VV1CmeWoPlFvfbOQ0SKrrAGr5mlFXPfXuRtFHuSzcVerTHvTNE91-bgNo-WIChwgoKhXFhCnzjwo/s1600/MPG_Panorama.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijUzbPFm6qxSEwfqS6glzqUmJYdsRP1I7z6ylWk4HB5NEx-rzGSh7RK2K41gP_RI7VV1CmeWoPlFvfbOQ0SKrrAGr5mlFXPfXuRtFHuSzcVerTHvTNE91-bgNo-WIChwgoKhXFhCnzjwo/s640/MPG_Panorama.jpg" width="640" /></a><br />
<b><br /></b>
<b><br /></b>
We can look ahead to some advanced graphic library functions. The 'lattice' graphics library gives another way of looking at this data:<br />
<br />
<span style="font-family: Courier New, Courier, monospace;"># load lattice library</span><br />
<br />
<span style="font-family: Courier New, Courier, monospace;">library(lattice)</span><br />
<span style="font-family: Courier New, Courier, monospace;">histogram(model ~ comb08, data = GTR40_2013)</span><br />
<span style="font-family: Courier New, Courier, monospace;">barchart(model ~ comb08 ,data = GTR40_2013)</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGJyG7Y4K-PYJJcdDDmJXx0mtmWIRtA2o0R5yCtHSwgpkFKS2oYWseJaaupEF9y04hABXDvMuEwnyjUvPYFFEvoOxqAdzKeA2OsD5dqlYEreCayfQEqn-0r_o62fcZ8dGCwZOyupj_xjc/s1600/CombinedLattice.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGJyG7Y4K-PYJJcdDDmJXx0mtmWIRtA2o0R5yCtHSwgpkFKS2oYWseJaaupEF9y04hABXDvMuEwnyjUvPYFFEvoOxqAdzKeA2OsD5dqlYEreCayfQEqn-0r_o62fcZ8dGCwZOyupj_xjc/s640/CombinedLattice.jpg" width="640" /></a></div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<br />
<br />
<b>End Notes for Basic Data Analysis: Part I</b><br />
[1] On subsetting and the use of brackets in R : <a href="http://www.ats.ucla.edu/stat/r/modules/subsetting.htm">http://www.ats.ucla.edu/stat/r/modules/subsetting.htm</a><br />
[2] See Roger D. Peng on <a href="http://www.youtube.com/watch?list=PLjTlxb-wKvXNnjUTX4C8IeIhPBjPkng6B&v=s_h9ruNwI_0&feature=player_detailpage">Control Structures in R </a><br />
[3] See Roger D. Peng on <a href="http://www.youtube.com/watch?list=PLjTlxb-wKvXNnjUTX4C8IeIhPBjPkng6B&feature=player_detailpage&v=KIqlKw2zqEQ">Functions</a><br />
[4] For more on dataframes and R data structures, please see Lam, Longhow: <a href="http://cran.r-project.org/doc/contrib/Lam-IntroductionToR_LHL.pdf">An Introduction to R</a><br />
<br />
<b>Basic Graphing in R: Combining, Plotting and Smoothing</b><br />
Ryan M. Ferris, 2013-03-01 (updated 2013-03-11)<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgo1wzmd2QBM2ZpeN8zOr0E7Cu9OTEjC3NBveA7bxhoRy43EN8EINm-9leNpKu5DhdH5vATZfI1ms1Ak8z_i394WPoJbTPlrFX8wsB1wp9h1YuekYo_pkkZNmVHNBta4bZLOJTzgmPbmmM/s1600/Panorama.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="208" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgo1wzmd2QBM2ZpeN8zOr0E7Cu9OTEjC3NBveA7bxhoRy43EN8EINm-9leNpKu5DhdH5vATZfI1ms1Ak8z_i394WPoJbTPlrFX8wsB1wp9h1YuekYo_pkkZNmVHNBta4bZLOJTzgmPbmmM/s640/Panorama.jpg" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-size: small;">R graphs, from left to right: Price of Imported Oil per Quarter, 1976:2012; Price of Retail Gasoline per Quarter, 1976:2012; Ratio of Retail Gas / Imported Oil per Quarter, 1976:2012. Source: <b>U.S. EIA</b>, "Short-Term Energy Outlook Real and Nominal Prices", February 13, 2012</span></td></tr>
</tbody></table>
The data files, images and R script for this post are <a href="http://teachingdatascience.rmfmedia.com/Week3/">here</a>. These examples use R 2.14 64 bit for Windows. Because I am neither a statistician nor an energy professional, the results of the following analysis will have to be taken with a "grain of salt". The purpose of this post is to demonstrate <i>basic</i> use of exporting, reformatting, combining, plotting, and smoothing data in R.<br />
<br />
<a name='more'></a><br />
<h4>
Finding and Importing Data</h4>
I found <a href="http://www.eia.gov/forecasts/steo/realprices/">historical energy prices</a> for the United States at the <a href="http://www.eia.gov/">Energy Information Administration</a>.[<a href="http://www.eia.gov/forecasts/steo/realprices/real_prices.xls">1</a>] I wanted to understand better the rise in the price of gasoline in the United States and how closely it tracks the historic rise in the market price of crude oil. I used the quarterly worksheets <b>Crude Oil - Q</b> and <b>Gasoline - Q</b> from the EIA spreadsheet "<a href="http://www.eia.gov/forecasts/steo/realprices/real_prices.xls">real_prices.xls</a>"; data current as of February 2013. I found it simplest to export each worksheet in <a href="http://en.wikipedia.org/wiki/Comma-separated_values">CSV</a> ('comma-separated values') format with the dates reformatted as numeric columns. I synchronized both worksheets to cover the same date range and used OpenOffice Calc's <i>left</i> and <i>right</i> functions to split the quarter dates into simple numeric columns, thus sidestepping the issue of date formatting (for now). After importing the data into R with these commands:<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">> QTR_Imp_Oil_Price <- read.csv("ImportedOilPrice_datereformat_simple.csv") </span><br />
<span style="font-family: Courier New, Courier, monospace;">> QTR_Retail_Gas_Price <- read.csv("QuarterRetailGas_datereformat_simple.csv")</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: inherit;">I then had two <i>data frames</i> as below. Since all the columns now contain only numbers, 'read.csv' imports them as <i>integer</i> or <i>numeric</i> columns:</span><br />
<span style="font-family: inherit;"><br /></span>
<br />
<span style="font-family: Courier New, Courier, monospace;">> head(QTR_Imp_Oil_Price)</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Q Year Index84 Nominal Real</span><br />
<span style="font-family: Courier New, Courier, monospace;">1 1 1976 0.5590 13.3500 55.3500</span><br />
<span style="font-family: Courier New, Courier, monospace;">2 2 1976 0.5640 13.4296 55.1742</span><br />
<span style="font-family: Courier New, Courier, monospace;">3 3 1976 0.5730 13.5194 54.6710</span><br />
<span style="font-family: Courier New, Courier, monospace;">4 4 1976 0.5813 13.5948 54.1876</span><br />
<span style="font-family: Courier New, Courier, monospace;">5 5 1977 0.5920 14.3847 56.3033</span><br />
<span style="font-family: Courier New, Courier, monospace;">6 6 1977 0.6023 14.5384 55.9284</span><br />
<span style="font-family: Courier New, Courier, monospace;">...</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">> head(QTR_Retail_Gas_Price)</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Q Year Index84 Nominal Real</span><br />
<span style="font-family: Courier New, Courier, monospace;">1 1 1976 0.5590 0.60 2.49</span><br />
<span style="font-family: Courier New, Courier, monospace;">2 2 1976 0.5640 0.60 2.48</span><br />
<span style="font-family: Courier New, Courier, monospace;">3 3 1976 0.5730 0.63 2.53</span><br />
<span style="font-family: Courier New, Courier, monospace;">4 4 1976 0.5813 0.63 2.50</span><br />
<span style="font-family: Courier New, Courier, monospace;">5 5 1977 0.5920 0.64 2.49</span><br />
<span style="font-family: Courier New, Courier, monospace;">6 6 1977 0.6023 0.66 2.53</span><br />
<span style="font-family: Courier New, Courier, monospace;">...</span><br />
<br />
For this post, we can ignore the "Index84" and "Real" data columns. However, we will create a new <i>dataframe </i>by combining the "Nominal" columns from the two separate <i>dataframes:</i><br />
<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">Oil_Gas_Nominal <- data.frame(QTR_Imp_Oil_Price$Nominal,QTR_Retail_Gas_Price$Nominal)</span><br />
<span style="font-family: Courier New, Courier, monospace;"># copy to a more readable name</span><br />
<span style="font-family: Courier New, Courier, monospace;">Oil_Gas_Nominal_Price <- Oil_Gas_Nominal</span><br />
<br />
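A small sketch of the same column-combining step, using invented toy values in place of the EIA data. The names 'oil', 'gas', 'auto_named' and 'combined' are assumptions made for this illustration. It also shows that naming the arguments gives shorter column names than the automatic ones.

```r
# Invented toy values standing in for the two imported dataframes
oil <- data.frame(Q = 1:3, Nominal = c(13.35, 13.43, 13.52))
gas <- data.frame(Q = 1:3, Nominal = c(0.60, 0.60, 0.63))

# Unnamed arguments: column names are built from the expressions themselves
auto_named <- data.frame(oil$Nominal, gas$Nominal)
print(names(auto_named))

# Named arguments: column names are whatever you choose
combined <- data.frame(Oil = oil$Nominal, Gas = gas$Nominal)
print(combined)
```

Combining columns this way assumes both dataframes have the same number of rows in the same order, which is why the worksheets were synchronized to one date range first.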
<h4>
Plotting in R</h4>
<div>
The commands</div>
<div>
<ul>
<li><span style="font-family: 'Courier New', Courier, monospace;">help(plot)</span></li>
<li><span style="font-family: 'Courier New', Courier, monospace;">methods(plot)</span></li>
<li><span style="font-family: 'Courier New', Courier, monospace;">help(lines)</span></li>
<li><span style="font-family: Courier New, Courier, monospace;">library(help="stats")</span></li>
<li><span style="font-family: Courier New, Courier, monospace;">help(lowess)</span></li>
</ul>
</div>
<div>
help us understand the versatility of <i>plotting in R</i>. In the examples below I am using the <i>plot.ts</i> (i.e. 'plot time series') command. Because the rows of the <i>dataframe</i> are already in time order, one per quarter, separate x and y arguments are not needed. Type <b>Oil_Gas_Nominal_Price[1]</b> at the R console to see why. The <i>lowess</i> function computes a <i>scatterplot smoothing</i> curve, and the <i>lines</i> function draws it on top of the existing graph. The <i>plot.ts</i> function allows for x and y axis labels as well as chart <i>type</i>. Here <i>type="h"</i> specifies histogram-like vertical lines. Here are some examples from the EIA derived <i>dataframes</i>:</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace;">require(stats)</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">plot.ts(<b>Oil_Gas_Nominal_Price[1]</b>, xlab="By Quarter: 1976:2012",type="h")</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">lines(stats::lowess(Oil_Gas_Nominal_Price[1]))</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjCMT4Yn6DQQgSZVRDD1jzQ2w1kQNx9qwG2KlzeqP2h79Je12s3Go_XQ-BQLbtKU2oy8SX7ADW6kY9VayCgYlRJkngjyqOH3V3Wbpnl8vIOpV_Idr21FihQ6h4ylZHirvzX2gmmMRBb6dg/s1600/Imported_Oil_By_Quarter.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjCMT4Yn6DQQgSZVRDD1jzQ2w1kQNx9qwG2KlzeqP2h79Je12s3Go_XQ-BQLbtKU2oy8SX7ADW6kY9VayCgYlRJkngjyqOH3V3Wbpnl8vIOpV_Idr21FihQ6h4ylZHirvzX2gmmMRBb6dg/s400/Imported_Oil_By_Quarter.jpeg" width="400" /></a></div>
<div>
</div>
</div>
<div style="font-family: inherit;">
<br /></div>
<br />
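To see what the smoothing step does on its own, here is a minimal sketch on an invented series (a rising trend plus a regular wiggle). The names 'y' and 'smooth' are only for illustration.

```r
# Invented series: a steady rise with a repeating up/down wiggle
y <- (1:40) / 4 + rep(c(-1, 1), 20)

# lowess() returns a list with components x and y (the smoothed values)
smooth <- stats::lowess(y)

# Draw histogram-like vertical lines, then overlay the smoothed curve
plot(y, type = "h", xlab = "Index", ylab = "Value")
lines(smooth)
```

The smoothed curve follows the overall rise and flattens out the wiggle, which is the same effect seen in the price graphs.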
<div>
<div>
<span style="font-family: Courier New, Courier, monospace;">require(stats)</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">plot.ts(<b>Oil_Gas_Nominal_Price[2]</b>, xlab="By Quarter: 1976:2012", type="h")</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">lines(stats::lowess(Oil_Gas_Nominal_Price[2]))</span></div>
</div>
<div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjIgFPVrXADgVDqG-EsVyL1PZyQnqWEnEK1tIwTrSmw4lv8dqmKYK3jdF1TA-xpXcXuquCyw7ZmvPHBwyNPVcDcY5L97qMA46kUZ-s-XXHA8DK-4neX-soel5rgRPRNeAEvbcNemdV6dXg/s1600/Retail_Gas_By_Quarter.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjIgFPVrXADgVDqG-EsVyL1PZyQnqWEnEK1tIwTrSmw4lv8dqmKYK3jdF1TA-xpXcXuquCyw7ZmvPHBwyNPVcDcY5L97qMA46kUZ-s-XXHA8DK-4neX-soel5rgRPRNeAEvbcNemdV6dXg/s400/Retail_Gas_By_Quarter.jpeg" width="400" /></a></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
<div>
This last plot shows how dividing one <i>dataframe</i> column by another is a <i>vector </i>operation: each row's gas price is divided by the same row's oil price.</div>
<br />
<span style="font-family: Courier New, Courier, monospace;">require(stats)</span><br />
<span style="font-family: Courier New, Courier, monospace;">Ratio_Nominal <- data.frame(Oil_Gas_Nominal[2]/Oil_Gas_Nominal[1])</span><br />
<span style="font-family: Courier New, Courier, monospace;">plot.ts(data.frame(Ratio_Nominal),main="Retail Gas/Imported Oil",xlab="By Quarter: 1976:2012",ylab="Retail Gas/Imported Oil",type="h")</span><br />
<span style="font-family: Courier New, Courier, monospace;">lines(stats::lowess(Ratio_Nominal))</span><br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhBFZdhYZeiW6MnN3NTa-lIomVxJWJJ34zxX0FnzpyFu5pSUCOKQXPM6wKhAPBc4BuwFSF3K-MHTL4btDMdNZmCDW52XKt0aq1WpEbApGbicvMU3g2jxICg22ewDU8yjKmgCiGBcJ3gmIQ/s1600/RetailGas_ImportedOil.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="387" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhBFZdhYZeiW6MnN3NTa-lIomVxJWJJ34zxX0FnzpyFu5pSUCOKQXPM6wKhAPBc4BuwFSF3K-MHTL4btDMdNZmCDW52XKt0aq1WpEbApGbicvMU3g2jxICg22ewDU8yjKmgCiGBcJ3gmIQ/s400/RetailGas_ImportedOil.jpeg" width="400" /></a></div>
<br />
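The element-by-element division can be sketched with a few invented prices. The figures below are chosen so the arithmetic is easy to check by hand and are not EIA data.

```r
# Invented prices, three quarters each
gas <- c(0.60, 0.63, 0.66)
oil <- c(12.00, 12.60, 13.20)

# One division sign divides every pair of elements; no loop is needed
ratio <- gas / oil
print(ratio)
```

Each element of the result is one gas price divided by the oil price in the same position.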
<h4>
<span style="font-family: inherit;">More information on DataFrames in R:</span></h4>
<span style="font-family: inherit;">[1] <a href="http://timhesterberg.home.comcast.net/~timhesterberg/Rpackages/TwoPackages5.pdf">http://timhesterberg.home.comcast.net/~timhesterberg/Rpackages/TwoPackages5.pdf</a></span><br />
<span style="font-family: inherit;">[2] <a href="http://cran.r-project.org/doc/manuals/R-intro.html#Data-frames">http://cran.r-project.org/doc/manuals/R-intro.html#Data-frames</a></span><br />
<span style="font-family: inherit;">[3] <a href="http://cran.r-project.org/web/packages/dataframe/dataframe.pdf">http://cran.r-project.org/web/packages/dataframe/dataframe.pdf</a></span><br />
<span style="font-family: inherit;">[4] <a href="http://www3.nd.edu/~steve/Rcourse/Lecture2v1.pdf">http://www3.nd.edu/~steve/Rcourse/Lecture2v1.pdf</a></span><br />
<span style="font-family: inherit;">[5] <a href="http://www.dummies.com/how-to/content/how-to-create-a-data-frame-from-scratch-in-r.html">http://www.dummies.com/how-to/content/how-to-create-a-data-frame-from-scratch-in-r.html</a></span><br />
<span style="font-family: inherit;">[6] <a href="http://www.rochester.edu/College/gradstudents/bkenkel//data/rcourse_chap03.pdf">http://www.rochester.edu/College/gradstudents/bkenkel//data/rcourse_chap03.pdf</a></span><br />
<span style="font-family: inherit;">[7] <a href="http://rwiki.sciviews.org/doku.php?id=tips:data-frames">http://rwiki.sciviews.org/doku.php?id=tips:data-frames</a></span><br />
<span style="font-family: inherit;">[8] <a href="http://rwiki.sciviews.org/doku.php?id=tips:data-frames:sort">http://rwiki.sciviews.org/doku.php?id=tips:data-frames:sort</a></span><br />
<span style="font-family: inherit;"><br /></span>
<br />
<h4>
<span style="font-family: inherit;">More information on Graphs in R:</span></h4>
<span style="font-family: inherit;">[1] <a href="http://www.harding.edu/fmccown/r/">http://www.harding.edu/fmccown/r/</a></span><br />
<span style="font-family: inherit;">[2] <a href="http://www.statmethods.net/graphs/scatterplot.html">http://www.statmethods.net/graphs/scatterplot.html</a></span><br />
<span style="font-family: inherit;">[3] <a href="http://stat.ethz.ch/R-manual/R-devel/library/graphics/html/plot.html">http://stat.ethz.ch/R-manual/R-devel/library/graphics/html/plot.html</a></span><br />
<span style="font-family: inherit;">[4] <a href="http://www.cyclismo.org/tutorial/R/plotting.html">http://www.cyclismo.org/tutorial/R/plotting.html</a></span><br />
<span style="font-family: inherit;">[5] <a href="http://www.sr.bham.ac.uk/~ajrs/R/r-plot_data.html">http://www.sr.bham.ac.uk/~ajrs/R/r-plot_data.html</a></span><br />
<span style="font-family: inherit;">[6] <a href="http://stackoverflow.com/questions/2564258/plot-2-graphs-in-same-plot-in-r">http://stackoverflow.com/questions/2564258/plot-2-graphs-in-same-plot-in-r</a></span><br />
[7] <a href="http://flowingdata.com/2012/12/17/getting-started-with-charts-in-r/">http://flowingdata.com/2012/12/17/getting-started-with-charts-in-r/</a><br />
<br />
<b>Datatypes, consistent data, reshaping data, dirty data</b><br />
Ryan M. Ferris, 2013-02-25 (updated 2013-03-03)<br />
[Data for this post can be found <a href="http://teachingdatascience.rmfmedia.com/Week2/">here</a>]<br />
<br />
This post is on one of the nasty prerequisites for all data professionals: "Understanding Dirty Data and How to Clean it Up". Dirty data is sometimes called "bad data", and cleaning it up is sometimes described as the quest for "tidy data". Most 'relational data' uses the row and column format you are familiar with from a spreadsheet. Ideally, all data would be arranged neatly in such a format. Let us take a look at such data from R's data editor window. You can click on these pictures to see them in the blogger slide viewer:<br />
<br />
<div>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhDioLNaRBt2VQ5TxDfJbqgQ0NWBmV-uCOlCwY0KK2-YIg4CuzIxquj6-sNvHphPApXBzpl9QrLtQ7kaz0Q5GtCu8tD3LAn8hmlo-RSIHutzV1bjfDYJ4WYQRUHm1uhGlQrlasrEi4DMV8/s1600/EPA2013_002.JPG" imageanchor="1" style="clear: left; display: inline !important; margin-bottom: 1em; margin-right: 1em; text-align: center;"><img border="0" height="238" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhDioLNaRBt2VQ5TxDfJbqgQ0NWBmV-uCOlCwY0KK2-YIg4CuzIxquj6-sNvHphPApXBzpl9QrLtQ7kaz0Q5GtCu8tD3LAn8hmlo-RSIHutzV1bjfDYJ4WYQRUHm1uhGlQrlasrEi4DMV8/s640/EPA2013_002.JPG" width="640" /></a></div>
<div>
<br />
<br />
<a name='more'></a><br /></div>
<div>
So this data is part of the EPA 2013 fuel economy ratings. You can see the steps I took to get this data into the data editor here (click to enlarge):</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEix2ml2AGjOqxtnVVuZXtGpWvQZTgPEFcncvHNfrgu_gX-JCIECpT1L2OHuz_hwW4QFdBVV73NZor3xh8vii_KKjBQd3QP01ezNxE-8k2dLXVR4vK_TK16r496QGMN7MhG-u5LUILfbz5c/s1600/EPA2013_001.JPG" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="380" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEix2ml2AGjOqxtnVVuZXtGpWvQZTgPEFcncvHNfrgu_gX-JCIECpT1L2OHuz_hwW4QFdBVV73NZor3xh8vii_KKjBQd3QP01ezNxE-8k2dLXVR4vK_TK16r496QGMN7MhG-u5LUILfbz5c/s640/EPA2013_001.JPG" width="640" /></a></div>
<div>
All well so far. Now let us try some data analysis! Let us say, for example, that you want a list of all vehicles whose combined MPG is 45 or more. The following syntax should work just fine, if your data were "clean":</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace;">> subset(EPADelim, Cmb.MPG >= 45 , select = c(Model,Cmb.MPG))</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">[1] Model Cmb.MPG</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"><0 rows> (or 0-length row.names)</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">Warning message:</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">In Ops.factor(Cmb.MPG, 45) : >= not meaningful for factors</span></div>
<div>
<br /></div>
</div>
<div>
<span style="font-family: inherit;">However, a warning message is returned. So let us take a look at 'Cmb.MPG'. Right away, we see some problems. Many of the data fields are marked "N/A". By default, R's import functions recognize only the string "NA" as a missing value, so "N/A" is read in as text. More troublesome for those of us who would like to do numeric comparisons with the subset function are entries such as "16/24". Because of such entries the whole column is imported as a "factor", which R cannot compare numerically:</span><br />
<span style="font-family: inherit;"><br /></span></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhoa6_zz6IABg9jWkgPfpUObkuJRonJkMwHsZMHG9bwzzGXcTADf2aTWQ9aWePen03rOUApSrw4LrWlDZC72QtAaxUUBH82vmlaDh6rlqPn8zn1cAMhEt6b8gBPELExNme8aC7KX19aN38/s1600/EPA2013_003.JPG" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="304" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhoa6_zz6IABg9jWkgPfpUObkuJRonJkMwHsZMHG9bwzzGXcTADf2aTWQ9aWePen03rOUApSrw4LrWlDZC72QtAaxUUBH82vmlaDh6rlqPn8zn1cAMhEt6b8gBPELExNme8aC7KX19aN38/s640/EPA2013_003.JPG" width="640" /></a></div>
<div>
<span style="font-family: inherit;"><br /></span></div>
<div>
<span style="font-family: inherit;"><br /></span></div>
<div>
<br />
<br />
To understand data a little better, let us discuss (briefly) 'datatypes' in R. R has a number of important classes of data. You can use the 'class()' function on any object in R to uncover the class type. For example:<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">> class(1)</span><br />
<span style="font-family: Courier New, Courier, monospace;">[1] "numeric"</span><br />
<span style="font-family: Courier New, Courier, monospace;">> class("char")</span><br />
<span style="font-family: Courier New, Courier, monospace;">[1] "character"</span><br />
<span style="font-family: Courier New, Courier, monospace;">> class(1:10)</span><br />
<span style="font-family: Courier New, Courier, monospace;">[1] "integer"</span><br />
<span style="font-family: Courier New, Courier, monospace;">> class(2303456L)</span><br />
<span style="font-family: Courier New, Courier, monospace;">[1] "integer"</span><br />
<span style="font-family: Courier New, Courier, monospace;">> class(df)</span><br />
<span style="font-family: Courier New, Courier, monospace;">[1] "data.frame"</span><br />
<span style="font-family: Courier New, Courier, monospace;">> class(get.c)</span><br />
<span style="font-family: Courier New, Courier, monospace;">[1] "function"</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">The class of the data can be changed with the 'as.[class]' function:</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<br />
<span style="font-family: Courier New, Courier, monospace;">> class(EPADelim$City.MPG)</span><br />
<span style="font-family: Courier New, Courier, monospace;">[1] "factor"</span><br />
<div>
<span style="font-family: Courier New, Courier, monospace;">> class(as.numeric(EPADelim$City.MPG))</span></div>
<span style="font-family: 'Courier New', Courier, monospace;">[1] "numeric"</span><br />
<span style="font-family: Courier New, Courier, monospace;">> (as.numeric(EPADelim$City.MPG))</span><br />
<span style="font-family: Courier New, Courier, monospace;"> [1] 52 52 52 40 40 38 38 29 29 32....</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<br />
<span style="font-family: Courier New, Courier, monospace;">> class(as.vector(EPADelim$City.MPG))</span><br />
<span style="font-family: Courier New, Courier, monospace;">[1] "character"</span><br />
<span style="font-family: Courier New, Courier, monospace;">> as.vector(EPADelim$City.MPG)[1:10]</span><br />
<span style="font-family: Courier New, Courier, monospace;"> [1] "39" "39" "39" "24" "24" "22" "22" "16" "16" "19"</span><br />
<br />
<br />
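The two outputs above differ (52 versus "39") because as.numeric() applied to a factor returns the factor's internal level codes, not the printed values. A minimal sketch with an invented four-value factor follows. The names 'f', 'codes' and 'values' are only for illustration.

```r
# Invented factor mixing clean numbers with a dirty "16/24" entry
f <- factor(c("39", "24", "16/24", "39"))

# Levels are stored sorted as text
print(levels(f))

# as.numeric() on a factor gives the position of each value in levels(f)
codes <- as.numeric(f)
print(codes)

# Converting to character first recovers the printed values;
# entries that are not plain numbers become NA (the warning is suppressed here)
values <- suppressWarnings(as.numeric(as.character(f)))
print(values)
```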
So let us try our subset() function once again, converting 'Cmb.MPG' datatype on the fly:<br />
<br />
<span style="font-family: 'Courier New', Courier, monospace;">> subset(EPADelim,as.numeric(Cmb.MPG) >= 45,select = c(Model,Cmb.MPG))</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Model Cmb.MPG</span><br />
<span style="font-family: Courier New, Courier, monospace;">1 ACURA ILX 38</span><br />
<span style="font-family: Courier New, Courier, monospace;">2 ACURA ILX 38</span><br />
<span style="font-family: Courier New, Courier, monospace;">3 ACURA ILX 38</span><br />
<span style="font-family: Courier New, Courier, monospace;">4 ACURA ILX 28</span><br />
<span style="font-family: Courier New, Courier, monospace;">5 ACURA ILX 28</span><br />
<div>
<span style="font-family: Courier New, Courier, monospace;">...</span></div>
<div>
<br />
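Hmmm... that isn't quite right: the rows returned have a 'Cmb.MPG' well below 45. 'Cmb.MPG' is most likely a factor, like 'City.MPG' above, so as.numeric() returns its internal level codes rather than the printed values. This data set has to be cleaned. R has an entire series of functions designed to help you automate such data cleaning and "reshaping", including (but not limited to):</div>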
<ul>
<li>strsplit()</li>
<li>sapply()</li>
<li>sub()</li>
<li>gsub()</li>
<li>cut()</li>
<li>cut2() (from the Hmisc package)</li>
<li>merge()</li>
<li>melt() (from the reshape2 package)</li>
<li>sort()</li>
<li>order()</li>
<li>head()</li>
<li>tail()</li>
</ul>
<div>
I won't discuss these functions in this post. (See the notes at the bottom for some tutorial links.) However, if you have some knowledge of a text-processing utility like 'gawk 4.0', you can use the following syntax to see just how much data needs to be 'cleaned' in column 15. For example, a sorted count of all values in column 15 ('Cmb.MPG') that contain a "/" shows:</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace;">$ gawk -F"\t" '{print $15}' all_alpha_13.txt | sort -nr | uniq -c | sort -k1 -nr | grep "/" </span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> 165 N/A</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> 22 10/14</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> 17 13/17</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> 11 17/23</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> 11 16/22</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> 9 14/19</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> 9 11/15</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> 8 12/16</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> 7 11/14</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> 6 16/24</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> 5 14/20</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> 4 9/12</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> 4 43/100</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> 4 16/21</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> 4 13/18</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> 4 10/13</span></div>
</div>
<div>
<span style="font-family: Courier New, Courier, monospace;">....</span></div>
<div>
<br /></div>
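The same kind of count can be sketched in R itself. This is only an illustration under assumptions: the vector 'cmb' below is a made-up stand-in for column 15, and with the real file you would first read the column in as text (for example with read.delim() and stringsAsFactors = FALSE).<br />

```r
# Hypothetical stand-in for column 15 ('Cmb.MPG'), held as text
cmb <- c("38", "N/A", "10/14", "28", "10/14", "N/A", "N/A", "13/17")

# Keep only the entries that contain a literal "/"
with.slash <- cmb[grepl("/", cmb, fixed = TRUE)]

# Count each distinct value and list the most frequent first
slash.counts <- sort(table(with.slash), decreasing = TRUE)
slash.counts
```

Here table() plays the role of 'uniq -c' and sort() the role of the final 'sort -nr'.<br />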
<div>
However, a much easier, but more time-consuming, way to do this is to edit the data in a spreadsheet or in R's data editor. The changes you make in R's data editor take effect immediately and irrevocably. You can see the approach I take to cleaning up the data in the spreadsheet screenshot below. I simply split 'Cmb.MPG' into two <i>numeric</i> columns, 'Cmb.hi.MPG' and 'Cmb.lo.MPG':</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhFyGU-n5qCjxcQnAiBPJ0bSfnP06ZLLnPZfuBII41-6-b5GkXM30_WE2W3GnRIL9xswHwdC4QU_fjXOn-zx00K79VoaC3V7kQXXVw2lRJTzKjeBvvC-99xp-kl5HY1ApT3X_8dZiTPSl8/s1600/EPA2013_004.JPG" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="318" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhFyGU-n5qCjxcQnAiBPJ0bSfnP06ZLLnPZfuBII41-6-b5GkXM30_WE2W3GnRIL9xswHwdC4QU_fjXOn-zx00K79VoaC3V7kQXXVw2lRJTzKjeBvvC-99xp-kl5HY1ApT3X_8dZiTPSl8/s640/EPA2013_004.JPG" width="640" /></a></div>
<div>
<br /></div>
<div>
<br /></div>
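For those who would rather script the split than edit by hand, here is a minimal sketch in R. It rests on assumptions of mine: the sample values are made up, single values such as "38" are copied into both new columns, and the larger number of a pair goes into 'Cmb.hi.MPG'. The spreadsheet edit shown above may have handled these cases differently.<br />

```r
# Hypothetical sample of 'Cmb.MPG' values, held as text
cmb <- c("38", "10/14", "N/A", "43/100")

# Treat the literal string "N/A" as a missing value
cmb.clean <- cmb
cmb.clean[cmb.clean == "N/A"] <- NA

# Split each entry on "/" and convert the pieces to numbers
parts <- lapply(strsplit(cmb.clean, "/", fixed = TRUE), as.numeric)

# Smaller piece goes to the 'lo' column, larger piece to the 'hi' column
Cmb.lo.MPG <- sapply(parts, min)
Cmb.hi.MPG <- sapply(parts, max)

data.frame(Cmb.MPG = cmb, Cmb.lo.MPG, Cmb.hi.MPG)
```

Because both new columns are numeric from the start, no later conversion with as.numeric() should be needed.<br />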
Now we try our subset() function again. However, after wading through a pile of 'N/A' rows, we examine the results and see that vehicles with a triple-digit 'Cmb.hi.MPG' have been left out:<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">> subset(EPASplit15,as.numeric(Cmb.hi.MPG) > 45 ,select = c(Model,Cmb.hi.MPG))</span><br />
<div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">....</span></div>
<br />
<span style="font-family: Courier New, Courier, monospace;">164 RAM 3500 N/A</span><br />
<span style="font-family: Courier New, Courier, monospace;">165 RAM 3500 N/A</span><br />
<span style="font-family: Courier New, Courier, monospace;">166 TESLA Model S 95</span><br />
<span style="font-family: Courier New, Courier, monospace;">173 TESLA Model S 89</span><br />
<span style="font-family: Courier New, Courier, monospace;">174 TOYOTA RAV4 EV 76</span><br />
<span style="font-family: Courier New, Courier, monospace;">175 CODA Coda 73</span><br />
<span style="font-family: Courier New, Courier, monospace;">176 CODA Coda 73</span><br />
<span style="font-family: Courier New, Courier, monospace;">177 TOYOTA Prius Plug-in Hybrid 95</span><br />
<span style="font-family: Courier New, Courier, monospace;">178 TOYOTA Prius Plug-in Hybrid 95</span><br />
<span style="font-family: Courier New, Courier, monospace;">179 TOYOTA Prius 50</span><br />
<span style="font-family: Courier New, Courier, monospace;">180 TOYOTA Prius 50</span><br />
<span style="font-family: Courier New, Courier, monospace;">181 TOYOTA Prius c 50</span><br />
<span style="font-family: Courier New, Courier, monospace;">182 TOYOTA Prius c 50</span><br />
<span style="font-family: Courier New, Courier, monospace;">183 FORD C-MAX Hybrid 47</span><br />
<span style="font-family: Courier New, Courier, monospace;">184 FORD Fusion Hybrid 47</span><br />
<span style="font-family: Courier New, Courier, monospace;">213 CHEVROLET Volt 98</span><br />
<span style="font-family: Courier New, Courier, monospace;">214 CHEVROLET Volt 98</span><br />
<span style="font-family: Courier New, Courier, monospace;">215 CHEVROLET Volt 98</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<br />
<span style="font-family: inherit;">A better command line 'fix' for this would be:</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">EPASplit15$Cmb.hi.MPG <- as.numeric(as.character(EPASplit15$Cmb.hi.MPG))</span><br />
<span style="font-family: inherit;">As a last resort, we call up the data editor for EPASplit15 ('fix(EPASplit15)') and, by clicking on the column headers of 'Cmb.hi.MPG' and 'Cmb.lo.MPG', convert them to numeric columns:</span><br />
<span style="font-family: inherit;"><br /></span>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhf2HbMtldPgnKVgNO1COBYg364w9rlYJIDirmIlHLGE2-ToIXlP4805KJMMDdJSBC3KgqegXXC1xkegoNTAgGR8NzT8G_MNV0yXmcP0bouR7zkHW6dgVf4WyzPhHV92zoLBh_Xv54sblg/s1600/EPA2013_005.JPG" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="268" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhf2HbMtldPgnKVgNO1COBYg364w9rlYJIDirmIlHLGE2-ToIXlP4805KJMMDdJSBC3KgqegXXC1xkegoNTAgGR8NzT8G_MNV0yXmcP0bouR7zkHW6dgVf4WyzPhHV92zoLBh_Xv54sblg/s640/EPA2013_005.JPG" width="640" /></a></div>
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">Now our subset() function works as desired:</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: 'Courier New', Courier, monospace;">> subset(EPASplit15,(Cmb.hi.MPG) > 45 ,select = c(Model,Cmb.hi.MPG))</span><br />
<span style="font-family: Courier New, Courier, monospace;"></span><br />
<br />
<span style="font-family: Courier New, Courier, monospace;"> Model Cmb.hi.MPG</span><br />
<span style="font-family: Courier New, Courier, monospace;">166 TESLA Model S 95</span><br />
<span style="font-family: Courier New, Courier, monospace;">173 TESLA Model S 89</span><br />
<span style="font-family: Courier New, Courier, monospace;">174 TOYOTA RAV4 EV 76</span><br />
<span style="font-family: Courier New, Courier, monospace;">175 CODA Coda 73</span><br />
<span style="font-family: Courier New, Courier, monospace;">176 CODA Coda 73</span><br />
<span style="font-family: Courier New, Courier, monospace;">177 TOYOTA Prius Plug-in Hybrid 95</span><br />
<span style="font-family: Courier New, Courier, monospace;">178 TOYOTA Prius Plug-in Hybrid 95</span><br />
<span style="font-family: Courier New, Courier, monospace;">179 TOYOTA Prius 50</span><br />
<span style="font-family: Courier New, Courier, monospace;">180 TOYOTA Prius 50</span><br />
<span style="font-family: Courier New, Courier, monospace;">181 TOYOTA Prius c 50</span><br />
<span style="font-family: Courier New, Courier, monospace;">182 TOYOTA Prius c 50</span><br />
<span style="font-family: Courier New, Courier, monospace;">183 FORD C-MAX Hybrid 47</span><br />
<span style="font-family: Courier New, Courier, monospace;">184 FORD Fusion Hybrid 47</span><br />
<span style="font-family: Courier New, Courier, monospace;">188 FORD C-MAX PHEV 100</span><br />
<span style="font-family: Courier New, Courier, monospace;">189 FORD C-MAX PHEV 100</span><br />
<span style="font-family: Courier New, Courier, monospace;">190 FORD Fusion PHEV 100</span><br />
<span style="font-family: Courier New, Courier, monospace;">191 FORD Fusion PHEV 100</span><br />
<span style="font-family: Courier New, Courier, monospace;">213 CHEVROLET Volt 98</span><br />
<span style="font-family: Courier New, Courier, monospace;">214 CHEVROLET Volt 98</span><br />
<span style="font-family: Courier New, Courier, monospace;">215 CHEVROLET Volt 98</span><br />
<span style="font-family: Courier New, Courier, monospace;">2128 SCION iQ EV 121</span><br />
<span style="font-family: Courier New, Courier, monospace;">2129 SCION iQ EV 121</span><br />
<span style="font-family: Courier New, Courier, monospace;">2147 HONDA Fit 118</span><br />
<span style="font-family: Courier New, Courier, monospace;">2148 HONDA Fit 118</span><br />
<span style="font-family: Courier New, Courier, monospace;">2149 FIAT 500e 116</span><br />
<span style="font-family: Courier New, Courier, monospace;">2150 FIAT 500e 116</span><br />
<span style="font-family: Courier New, Courier, monospace;">2151 NISSAN Leaf 116</span><br />
<span style="font-family: Courier New, Courier, monospace;">2152 NISSAN Leaf 116</span><br />
<span style="font-family: Courier New, Courier, monospace;">2153 MITSUBISHI i-MiEV 112</span><br />
<span style="font-family: Courier New, Courier, monospace;">2154 MITSUBISHI i-MiEV 112</span><br />
<span style="font-family: Courier New, Courier, monospace;">2174 SMART ForTwo Cabriolet 107</span><br />
<span style="font-family: Courier New, Courier, monospace;">2175 SMART ForTwo Cabriolet 107</span><br />
<span style="font-family: Courier New, Courier, monospace;">2176 SMART ForTwo Coupe 107</span><br />
<span style="font-family: Courier New, Courier, monospace;">2177 SMART ForTwo Coupe 107</span><br />
<span style="font-family: Courier New, Courier, monospace;">2178 FORD Focus BEV 105</span><br />
<span style="font-family: Courier New, Courier, monospace;">2179 FORD Focus BEV 105</span><br />
<div>
<br />
Update 03/02/2013:<br />
<br />
Originally, I missed a <a href="http://www.fueleconomy.gov/feg/ws/index.shtml#vehicle">developer download page</a> that had cleaner (and more detailed) fuel economy data. The chart below shows how many high-mileage vehicles are now appearing on the market.<br />
<br />
<br />
<span style="font-family: Courier New, Courier, monospace;"># Download:</span><br />
<span style="font-family: Courier New, Courier, monospace;"># http://www.fueleconomy.gov/feg/epadata/vehicles.csv.zip</span><br />
<span style="font-family: Courier New, Courier, monospace;"># Unzip vehicles.csv to your working directory</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">AllVehicles <- read.csv("vehicles.csv")</span><br />
<span style="font-family: Courier New, Courier, monospace;">AllVehicles$comb08 <- as.numeric(AllVehicles$comb08)</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">GTR45 <- data.frame(subset(AllVehicles, comb08 > 45, select = c(model,make,year,comb08,comb08U,combA08,combA08U,combE)))</span><br />
<br />
<span style="font-family: Courier New, Courier, monospace;"># If necessary, install the plyr package first: install.packages("plyr")</span><br />
<span style="font-family: Courier New, Courier, monospace;"># See dataframe sorting discussion at</span><br />
<span style="font-family: Courier New, Courier, monospace;"># http://stackoverflow.com/questions/1296646/how-to-sort-a-dataframe-by-columns-in-r/6871968#6871968</span><br />
<span style="font-family: Courier New, Courier, monospace;"># Sort by year then combined fuel mileage</span><br />
<span style="font-family: Courier New, Courier, monospace;"># for other alternatives see[<a href="http://rwiki.sciviews.org/doku.php?id=tips%3adata-frames%3asort">1</a>,<a href="http://rwiki.sciviews.org/doku.php?id=tips%3adata-frames%3asort">2</a>]</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">library(plyr)</span><br />
<span style="font-family: Courier New, Courier, monospace;">arrange(GTR45,(year),comb08)</span><br />
<span style="font-family: Courier New, Courier, monospace;">SortYearGTR45 <- arrange(GTR45,(year),comb08)</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">plot(SortYearGTR45$year,SortYearGTR45$comb08,type="p")</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1qpIg8GIOjty0GmJv0wCwHx2QtG_ADbAJ6RiC4RQ6g8hb-YxXMYaeqqu_zJJOEOdw0GX1RPHgf4HHlumU4NH-S1a2T_Mj_bEhm1IRs420FpKRA79Xnt20V6j0jyEipg8K6dk27Yc7q6U/s1600/SortYearGTR45.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1qpIg8GIOjty0GmJv0wCwHx2QtG_ADbAJ6RiC4RQ6g8hb-YxXMYaeqqu_zJJOEOdw0GX1RPHgf4HHlumU4NH-S1a2T_Mj_bEhm1IRs420FpKRA79Xnt20V6j0jyEipg8K6dk27Yc7q6U/s640/SortYearGTR45.jpeg" width="640" /></a></div>
<br />
Another method: sort the data frame with order() and add a statistically smoothed (lowess) trend line:<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">GTR45.MPG.Year.Model.Cmb.MPG <- <b>data.frame(GTR45[order(GTR45$year),c(3,4)])</b></span><br />
<span style="font-family: Courier New, Courier, monospace;">plot(GTR45.MPG.Year.Model.Cmb.MPG,ylab="Combined MPG")</span><br />
<span style="font-family: Courier New, Courier, monospace;">lines(stats::lowess(GTR45.MPG.Year.Model.Cmb.MPG))</span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhnWRUVsVsj8L1NSJmj9uQ4N3HVT_u5GvvTbDoZCLoE5Uf_pVBZz7Lufu375ViOIE-gMxr6IpEW4k1h1mY1qOuMn0rG5CffG2ZkSSRDXdb-_ephVBbKOQgVLejJrKprVZoiIln5nkicC9o/s1600/SortYearGTR45_lowess.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhnWRUVsVsj8L1NSJmj9uQ4N3HVT_u5GvvTbDoZCLoE5Uf_pVBZz7Lufu375ViOIE-gMxr6IpEW4k1h1mY1qOuMn0rG5CffG2ZkSSRDXdb-_ephVBbKOQgVLejJrKprVZoiIln5nkicC9o/s640/SortYearGTR45_lowess.jpeg" width="640" /></a></div>
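The order() call above sorts on a single key (year). As a hedged sketch, base R's order() also accepts several keys, which gives the same year-then-mileage sort as arrange() without loading plyr. The tiny data frame 'cars' below is made up for illustration and only borrows the column names from vehicles.csv.<br />

```r
# Hypothetical miniature of GTR45 with only the two sort columns
cars <- data.frame(year   = c(2013, 2001, 2013, 2001),
                   comb08 = c(98, 46, 50, 53))

# order() returns the row positions that sort by year first,
# then by combined mileage within each year
sorted <- cars[order(cars$year, cars$comb08), ]
sorted
```

Indexing the rows with the result of order() leaves the original data frame untouched and returns a sorted copy.<br />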
<br />
<h4>
End Notes:</h4>
For more information on Data Cleaning:<br />
<br />
<a href="https://dl.dropbox.com/u/7710864/courseraPublic/otherResources/lecture3/index.html#1">Lists and Data Cleaning</a> Jaffe and Muschelli<br />
<a href="http://vita.had.co.nz/papers/tidy-data.pdf">Tidy Data</a> Hadley Wickham<br />
<a href="https://d19vezwu8eufl6.cloudfront.net/dataanalysis/dataMungingBasics.pdf">Data Munging Basics</a> Jeffrey Leek<br />
<br /></div>
</div>
Posted by Ryan M. Ferris (http://www.blogger.com/profile/03122603266808854365)<br />
<br />
<b>Roger D. Peng: "Computing for Data Analysis"</b> (posted 2013-02-20, updated 2013-02-28)<br />
Roger D. Peng (<a href="http://www.biostat.jhsph.edu/~rpeng/">http://www.biostat.jhsph.edu/~rpeng/</a>) is a "Rock 'n' Roll Statistician" (see <a href="http://twitter.com/rdpeng">http://twitter.com/rdpeng</a>) specializing in Biostatistics at the <a href="http://www.jhsph.edu/">Johns Hopkins Bloomberg School of Public Health</a>. Dr. Peng also teaches R Programming courses at <a href="http://coursera.org/">Coursera.org</a>. He has posted the lectures for his last course, <a href="http://www.youtube.com/user/rdpeng">"Computing for Data Analysis", on YouTube</a>. His lectures are an excellent resource and should be largely understandable to most of the 5th - 10th graders at Saint Paul's Academy. I recommend you watch at least the individual videos listed below. Consider watching all of the videos in the playlists "<a href="http://www.youtube.com/watch?v=V2V3T9GkKBY&list=PLjTlxb-wKvXMUop9m0C8G5xLBzhsIDBC7">BackGround on R</a>" and "<a href="http://www.youtube.com/watch?v=8xT3hmJQskU&list=PLjTlxb-wKvXNSDfcKPFH2gzHGyjpeCZmJ">Computing for Data Analysis: Week 1</a>" in preparation for next week's class.<br />
<br />
<a href="http://www.biostat.jhsph.edu/~rpeng/">Dr. Peng</a> is an experienced statistical programmer in R. Watching these videos will clear up many questions we had in our second class and help you extend your data analysis skills with R. Also, anything I described poorly or inaccurately will be accurately described in <a href="http://www.biostat.jhsph.edu/~rpeng/">Dr. Peng's</a> lectures.<br />
<br />
<a name='more'></a><br /><br />
<h3>
Individual Lectures</h3>
<b>Setting Your Working Directory and Editing R Code:</b><br />
<a href="http://www.youtube.com/watch?v=8xT3hmJQskU">http://www.youtube.com/watch?v=8xT3hmJQskU</a><br />
<b>How to Get Help:</b><br />
<a href="http://www.youtube.com/watch?v=ZFaWxxzouCY">http://www.youtube.com/watch?v=ZFaWxxzouCY</a><br />
<b>Reading/Writing Data I:</b><br />
<a href="http://www.youtube.com/watch?v=aBzAels6jPk&list=PLjTlxb-wKvXNSDfcKPFH2gzHGyjpeCZmJ&index=8">http://www.youtube.com/watch?v=aBzAels6jPk&list=PLjTlxb-wKvXNSDfcKPFH2gzHGyjpeCZmJ&index=8</a><br />
<b>Reading/Writing Data II:</b><br />
<a href="http://www.youtube.com/watch?v=cUUqDWttMws&list=PLjTlxb-wKvXNSDfcKPFH2gzHGyjpeCZmJ&index=9">http://www.youtube.com/watch?v=cUUqDWttMws&list=PLjTlxb-wKvXNSDfcKPFH2gzHGyjpeCZmJ&index=9</a><br />
<br />
<h3>
Playlists</h3>
<b>BackGround on R</b><br />
<a href="http://www.youtube.com/watch?v=V2V3T9GkKBY&list=PLjTlxb-wKvXMUop9m0C8G5xLBzhsIDBC7">http://www.youtube.com/watch?v=V2V3T9GkKBY&list=PLjTlxb-wKvXMUop9m0C8G5xLBzhsIDBC7</a><br />
<b>Computing for Data Analysis: Week 1</b><br />
<a href="http://www.youtube.com/watch?v=8xT3hmJQskU&list=PLjTlxb-wKvXNSDfcKPFH2gzHGyjpeCZmJ">http://www.youtube.com/watch?v=8xT3hmJQskU&list=PLjTlxb-wKvXNSDfcKPFH2gzHGyjpeCZmJ</a><br />
<b>Computing for Data Analysis: Week 2</b><br />
<a href="http://www.youtube.com/watch?v=s_h9ruNwI_0&list=PLjTlxb-wKvXNnjUTX4C8IeIhPBjPkng6B">http://www.youtube.com/watch?v=s_h9ruNwI_0&list=PLjTlxb-wKvXNnjUTX4C8IeIhPBjPkng6B</a><br />
<b>Computing for Data Analysis Week 3</b><br />
<a href="http://www.youtube.com/watch?v=R2Zh_kPxrmg&list=PLjTlxb-wKvXOzI2h0F2_rYZHIXz8GWBop">http://www.youtube.com/watch?v=R2Zh_kPxrmg&list=PLjTlxb-wKvXOzI2h0F2_rYZHIXz8GWBop</a><br />
<b>Computing for Data Analysis Week 4</b><br />
<a href="http://www.youtube.com/watch?v=HPSrjKt-e8c&list=PLjTlxb-wKvXOdzysAE6qrEBN_aSBC0LZS">http://www.youtube.com/watch?v=HPSrjKt-e8c&list=PLjTlxb-wKvXOdzysAE6qrEBN_aSBC0LZS</a><br />
<div>
<br /></div>
Posted by Ryan M. Ferris (http://www.blogger.com/profile/03122603266808854365)<br />
<br />
<b>A Brief Overview of Programming Languages and Symbolic Logic (Part I)</b> (posted 2013-02-11, updated 2013-03-09)<br />
<blockquote class="tr_bq">
<span style="background-color: white;">What kind of spirit is it, that can support you to keep your interest on C++ Standardization for 20+ years?</span><br />
<blockquote class="tr_bq">
<span style="background-color: white;"><b>Nothing else I could spend my time on could do so much good to so many people.</b></span><br />
<blockquote>
from an <a href="http://www.aced.dk/1/?p=1438">Interview with Bjarne Stroustrup</a> (2012), author of the C++ programming language </blockquote>
</blockquote>
</blockquote>
<br />
Some would say that the best way to learn a programming language is to simply install your language of choice and start writing code based on the manual, tutorials or help system specific to that language. However, having studied more than a few programming languages in my lifetime, I am going to emphasize a different approach, one that starts with deeper background.<br />
<br />
The ability to use <a href="http://en.wikipedia.org/wiki/Logic_circuits">digital logic circuits</a> in computation has been with us for a very short period of historical time. The rise in their use has paralleled the rapid spread of urbanization and the growth of population, universities, international trade and market economies. Today, more than sixty years after <a href="http://en.wikipedia.org/wiki/ENIAC">ENIAC</a> was developed for the United States Army's Ballistic Research Laboratory, computational logic is the substructure of all economic, engineering, defense, scientific, and mathematical efforts. Each year our lives are organized more around, and organized more by, improvements in computer hardware, computational theory, information theory, networking and user interface design. During these scant sixty years, at least <a href="http://people.ku.edu/~nkinners/LangList/Extras/langlist.htm">2500 computer languages</a> have been developed. A "<a href="http://en.wikipedia.org/wiki/History_of_programming_languages">History of Computer Languages</a>" constitutes a reading list of some of the brightest minds to live in the post-World War II era. A brief but detailed list of popular programming languages can be found <a href="http://www.levenez.com/lang/">here</a>.<br />
<br />
<a name='more'></a><br />
For some time now, the disciplines of digital intelligence have been separating and specializing. An <a href="http://en.wikipedia.org/wiki/Electrical_engineering">electrical engineering</a> degree is now something quite different from a <a href="http://en.wikipedia.org/wiki/Computer_science">computer science</a> degree. Most of us who write and learn computer software today concern ourselves little with how <a href="http://en.wikipedia.org/wiki/Computer_hardware">computer hardware</a> processes logic. But there was a time when the two disciplines were indivisible. Some scholars still believe that learning <i>machine architecture</i> (sometimes called '<i>machine architecture and organization</i>' or <b>MOA</b>) is essential to understanding the <i>binary (or Boolean) logic (<a href="http://en.wikipedia.org/wiki/Logic_gate">1</a>,<a href="http://en.wikipedia.org/wiki/Boolean_logic">2</a>)</i> that is the foundation for all digital computation and <i><a href="http://en.wikipedia.org/wiki/Programming_languages">programming languages</a></i>. Despite all such discussion, there is still no definitive path for the creation of a competent software engineer. The conventional wisdom generally involves obtaining a <a href="http://en.wikipedia.org/wiki/Computer_science">computer science</a> degree.<br />
<br />
That being said, the <a href="http://en.wikipedia.org/wiki/List_of_college_dropout_billionaires">Wikipedia list of college dropout billionaires</a> includes <a href="http://en.wikipedia.org/wiki/Bill_Gates">Bill Gates</a> (Microsoft founder), <a href="http://en.wikipedia.org/wiki/Steve_Jobs">Steve Jobs</a> (Apple founder), <a href="http://en.wikipedia.org/wiki/Mark_Zuckerberg">Mark Zuckerberg</a> (Facebook founder) and <a href="http://en.wikipedia.org/wiki/Lawrence_Ellison">Larry Ellison</a> (Oracle founder). Many of us who studied other disciplines in college simply fell into computer administration and software engineering because we found we enjoyed it or had a knack for it. Even so, pursuing an <a href="http://en.wikipedia.org/wiki/electrical_engineering">electrical engineering</a> or <a href="http://en.wikipedia.org/wiki/Computer_science">computer science</a> degree would still be the recommended first step on the way to being hired by one of the companies founded by the men named above.<br />
<br />
And perhaps the best step in understanding software engineering is to gain an understanding of math, logic, and more specifically <i><a href="http://en.wikipedia.org/wiki/Mathematical_logic">mathematical (or symbolic) logic</a></i>. Many of you in this class (5th - 10th graders at SPA) learn a <a href="http://en.wikipedia.org/wiki/Portal:Mathematics">math curriculum</a> that prepares you to understand many of the principles from which computational logic and computer science are derived. However, chief in importance among all these principles is the simple yet all-encompassing idea that the principles of quantitative logic can be represented by symbolic logic or natural language. To this concept, our survival on this Earth as a species for the last few thousand years owes much. Let us examine briefly the history of <i><a href="http://en.wikipedia.org/wiki/Mathematical_logic">mathematical (or symbolic) logic</a></i>.<br />
<br />
(To be continued)<br />
Posted by Ryan M. Ferris (http://www.blogger.com/profile/03122603266808854365)<br />
<br />
<b>Week0: Using spreadsheets</b> (posted 2013-02-04, updated 2013-02-28)<br />
Spreadsheets are the ubiquitous tool of business, finance, science, academia, and corporate America. As the most widespread of all data analysis tools, spreadsheets have no peers. Despite the continual enhancement of their functionality, however, they have some drawbacks. For example:<br />
<ul>
<li>Both functions and data in cells are subject to undetected errors, sometimes caused by nothing more than a slip of the cursor.</li>
<li>Mixing data and code in the same worksheet can create, and has created, data disasters.</li>
<li>Large datasets or data results are impractical when stored in cells.</li>
</ul>
Despite these drawbacks, it is common to find thousands of spreadsheets in active use at any investment bank or accounting firm. No doubt SPA students will begin using spreadsheets well before they are in high school and continue using them when they enter college and the working world.<br />
<br />
<a name='more'></a><br /><br />
As with any data analysis software, you can approach spreadsheet work from a number of different perspectives. Sometimes all you want to do is examine your data to look for relationships. Other times, you will have specific templates you want to apply repeatedly to a continuous flow of data. Often, a spreadsheet's graphs are the end result, destined for a presentation or a series of slides. Online training for spreadsheets is also ubiquitous. Here are some tutorial links for <a href="http://www.openoffice.org/index1-passthru.html?utm_expid=57643286-7&utm_referrer=http%3A%2F%2Fwww.openoffice.org%2Findex1-passthru.html">Open Office</a> Calc:<br />
<br />
<br />
<ul>
<li><a href="http://edutech.msu.edu/online/OpenOffice%20Tutorials/CalcComplete.pdf"><span style="color: #2288bb;">MSU Tutorial (PDF)</span></a></li>
<li><a href="http://education.uregina.ca/technology/ecmp355/Tutorials/OOCalcTutorial.pdf"><span style="color: #2288bb;">University of Regina Tutorial (PDF)</span></a></li>
<li><a href="http://showmedo.com/videotutorials/openoffice"><span style="color: #2288bb;">"Video Tutorials for Open Office"</span></a></li>
<li><a href="http://www.tutorialsforopenoffice.org/category_index/spreadsheet.html"><span style="color: #2288bb;">"Tutorials for Open Office"</span></a></li>
<li><a href="http://forum.openoffice.org/en/forum/viewforum.php?f=75"><span style="color: #2288bb;">Open Office Forum List of Tutorials</span></a></li>
<li><a href="http://wiki.openoffice.org/wiki/Documentation/Tutorials"><span style="color: #2288bb;">Open Office Wiki</span></a></li>
</ul>
<br />
<br />
The help files for most spreadsheets are extensive. Spreadsheets are a <em>generalized</em> tool, designed for generic data analysis. Your ability to use a spreadsheet well to model your problem set will depend significantly on specific <em>domain</em> knowledge. That being said, advanced spreadsheet design and development involves high-end statistical, programming and database skills. I've uploaded a series of 'screencast' tutorials about using basic functionality in <a href="http://www.openoffice.org/index1-passthru.html?utm_expid=57643286-7&utm_referrer=http%3A%2F%2Fwww.openoffice.org%2Findex1-passthru.html">Open Office</a> Calc here:<br />
<br />
<a href="http://www.youtube.com/playlist?list=PL0xj6yuE69m2mtVDaI3mPyd-HkkVM-Osy&feature=view_all">DataScience:Week0</a><br />
<br />
The spreadsheet for the tutorials can be found <a href="http://teachingdatascience.rmfmedia.com/Week0/BasicSpreadsheet.ods">here</a> or <a href="http://dl.dropbox.com/u/70404843/Week0/BasicSpreadsheet.ods">here</a>. For your science projects, I will make myself available to help with your spreadsheets as best I can.<br />
<br />
<br />
Posted by Ryan M. Ferris (http://www.blogger.com/profile/03122603266808854365)<br />
<br />
<b>Week0: CK-12 Resources</b> (posted 2013-01-30, updated 2013-02-28)<br />
<a href="http://www.ck12.org/">CK-12</a> is an incredible STEM resource. <a href="http://www.ck12.org/dashboard/">Log on</a> to it and peruse the subjects you want to learn. The site is designed to offer self-paced learning, achievement tracking, and grade-based content. Perhaps most significant is the ability of the student and educator to create <a href="http://www.ck12.org/eflex/about">eFlexBooks</a> in .mobi, .epub, and PDF formats. The mathematical and statistical lessons are of very high quality. And this resource is free. Check out these two flexbooks:<br />
<ul>
<li><a href="http://www.ck12.org/book/Basic-Probability-and-Statistics---A-Short-Course/">http://www.ck12.org/book/Basic-Probability-and-Statistics---A-Short-Course/</a></li>
<li><a href="http://www.ck12.org/book/CK-12-Math-Analysis/">http://www.ck12.org/book/CK-12-Math-Analysis/</a></li>
</ul>
This type of resource gives me, as an instructor, an opportunity to discuss how technology increases learning. I highly recommend the use of a portable reader such as a Nook or Kindle, or the use of reader software on your smart phone or laptop. Simply put, mathematics is sometimes best learned in small doses at places convenient for the student. To prevent visual fatigue from interfering with comprehension, I prefer a reader that handles the display of <a href="http://en.wikipedia.org/wiki/Greek_letters_used_in_mathematics,_science,_and_engineering">Greek Letters</a> with optimum clarity in all types of light.<br />
<br />
For these articles, the student may find a resource on the use of <a href="http://en.wikipedia.org/wiki/Greek_letters_used_in_mathematics,_science,_and_engineering">Greek Letters in Mathematics</a> helpful. Please see my (upcoming) post on the use of Greek letters in mathematics and their representation in statistical software.<br />
<br />
<a name='more'></a><br /><br />
<b>Questions for the Students</b><br />
For this course on data science we are very interested in<br />
<ol>
<li>statistical concepts</li>
<li>mathematical concepts that are critical to statistical programming</li>
</ol>
Mathematical structures important to statistical programming include:<br />
<ol>
<li>the concept of a function</li>
<li>the properties of arrays, vectors, matrices and other data structures</li>
</ol>
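As a minimal sketch of these two ideas, the Python example below defines a simple function and applies it to a vector, then stores a small matrix as a list of rows and transposes it. The names <code>f</code>, <code>v</code>, and <code>m</code> and the numbers used are made up for illustration; they do not come from the CK-12 texts.<br />

```python
# A function maps each input to exactly one output.
def f(x):
    """Return 2x + 1, a simple linear function (chosen only as an example)."""
    return 2 * x + 1

# A vector is an ordered list of numbers.
v = [1, 2, 3]

# Applying the function to every element of the vector gives a new vector.
fv = [f(x) for x in v]

# A matrix is a rectangular grid of numbers, stored here as a list of rows.
m = [[1, 2],
     [3, 4]]

# The transpose swaps rows and columns.
mt = [list(column) for column in zip(*m)]

print(fv)  # [3, 5, 7]
print(mt)  # [[1, 3], [2, 4]]
```

Packages such as R, Octave, and Scilab provide vectors and matrices as built-in types, so the same ideas carry over with shorter syntax.<br />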
Read pages 5 - 62 in <a href="http://www.ck12.org/book/CK-12-Math-Analysis/">CK-12-MathAnalysis</a>. Pay particular attention to the generic discussion of functions as a mathematical concept. Read pages 42 - 73 in <a href="http://www.ck12.org/book/Basic-Probability-and-Statistics---A-Short-Course/">Basic-Probability-and-Statistics---A-Short-Course</a>.<br />
<ul>
<li>How do we understand visual data through graphs?</li>
<li>How is our understanding of visual data linked to mathematical concepts that represent data?</li>
</ul>
Both of these works will prepare a student to better understand statistical and numerical computing tools such as <a href="http://www.r-project.org/">R</a>, <a href="http://www.gnu.org/software/octave/">Octave</a>, <a href="http://www.python.org/">Python</a>, and <a href="http://www.scilab.org/">Scilab</a>.<br />
<br />
<b>For the ambitious</b><br />
Read <a href="http://www.ck12.org/book/Probability-and-Statistics---Advanced-%2528Second-Edition%2529/">Probability and Statistics - Advanced (Second Edition)</a>. If you find your eyes glazing over when reviewing formulas, try this strategy:<br />
<ol>
<li>Complete a first read of this work that skims over each chapter.</li>
<li>Try to identify the general purpose of the discussion on data analysis in statistics.</li>
<li>Take a break and think some about what you have read.</li>
<li>Read the work carefully from beginning to end over several sittings. Each time you read something new, take some time to think about possible uses of the mathematical concept.</li>
</ol>
Ryan M. Ferrishttp://www.blogger.com/profile/03122603266808854365noreply@blogger.com0tag:blogger.com,1999:blog-2873774223581777424.post-54835811807080186832013-01-24T11:40:00.000-08:002013-02-28T07:52:52.311-08:00Scilab<a href="https://www.scilab.org/">Scilab</a> is multi-platform, open source software that offers functionality similar to <a href="http://www.mathworks.com/products/matlab/">MATLAB</a>, including user-contributed add-ins for statistics and data analysis. The software is widely used in secondary schools and universities for teaching and practicing engineering and mathematics. An extensive help menu is available with the software. There are many third-party tutorials available from science and engineering departments across the world. For example:<br />
<ul>
<li><a href="http://user.physics.unc.edu/~sheila/scilabtutorial.html">http://user.physics.unc.edu/~sheila/scilabtutorial.html</a></li>
<li><a href="http://hkumath.hku.hk/~nkt/Scilab/IntroToScilab.html">http://hkumath.hku.hk/~nkt/Scilab/IntroToScilab.html</a></li>
<li><a href="http://www.cse.iitb.ac.in/~cs626-449/scilab.pdf">http://www.cse.iitb.ac.in/~cs626-449/scilab.pdf</a> </li>
<li><a href="http://www.comp.dit.ie/bmacnamee/materials/dip/labs/introtoscilab.pdf">http://www.comp.dit.ie/bmacnamee/materials/dip/labs/introtoscilab.pdf</a></li>
<li><a href="http://usingscilab.blogspot.com/">http://usingscilab.blogspot.com/</a></li>
</ul>
<br />
I have found the product surprisingly full-featured, mature, and easy to learn. There appears to be a substantial technical community committed to <a href="https://www.scilab.org/">Scilab</a>. In addition, I find that the semantics and syntax of <a href="http://scilab.org/">Scilab</a> form a gentle introduction to command line computing for upper school and high school students.<br />
<br />
<a name='more'></a><br /><br />
<b>Questions and Activities for Students</b><br />
<ul>
<li>Install <a href="http://scilab.org/">Scilab</a> on your home computer.</li>
<li>Walk through at least one of the tutorials listed above. </li>
<li>Do you find <a href="http://scilab.org/">Scilab</a> easier or more difficult to use than a spreadsheet for mathematical computation?</li>
<li>What advantages do you think command line scripts have over cell-based spreadsheet programming?</li>
<li>Why do you think scientists and engineers would use <a href="http://scilab.org/">Scilab</a> or <a href="http://www.mathworks.com/products/matlab/">MATLAB</a> instead of a spreadsheet?</li>
</ul>
Ryan M. Ferrishttp://www.blogger.com/profile/03122603266808854365noreply@blogger.com0tag:blogger.com,1999:blog-2873774223581777424.post-83761972205222552722013-01-20T13:46:00.000-08:002013-02-28T07:51:33.169-08:00Climate Data from BEST<a href="http://berkeleyearth.org/">Berkeley Earth Surface Temperature</a> (<b><a href="http://berkeleyearth.org/">BEST</a></b>) has released a <a href="http://www.scitechnol.com/GIGS/GIGS-1-101.pdf">significant study</a> documenting the rise in the Earth's surface temperature from 1753 to 2011. The study includes a detailed discussion of statistical methodology and locations for the <a href="http://www.mathworks.com/index.html">MATLAB</a> <a href="http://berkeleyearth.org/our-code/">code and data</a>. <a href="http://www.mathworks.com/products/matlab/index.html">MATLAB</a> is a commercial mathematical, statistical, and presentation software package used by many academics and scientists. A movie of land temperature anomalies can be found <a href="http://berkeleyearth.org/movies/">here</a>. The study collates an extensive body of temperature data to draw conclusions about land surface temperature anomalies over time. Some presentations on the complexity of their statistical methodology can be found <a href="http://berkeleyearth.org/pdf/berkeley-earth-santa-fe-robert-rohde.pdf">here</a> and <a href="http://berkeleyearth.org/pdf/berkeley-earth-santa-fe.pdf">here</a>. Correlating historical climate data to understand evidence of global climate change has proven to be a complex and controversial challenge. The <a href="http://berkeleyearth.org/">BEST</a> sponsor <a href="http://www.novim.org/">NOVIM</a> hopes to promote scientific studies <i>without political bias</i> concerning significant world resource issues. <br />
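<br />
A temperature anomaly is the difference between a measured temperature and a reference average. The Python sketch below shows that arithmetic only. The numbers are made up for illustration and are not BEST data, the function name <code>anomalies</code> is hypothetical, and the choice of the first two values as the reference period is an assumption. The actual BEST methodology is far more involved.<br />

```python
from statistics import mean

def anomalies(temps, baseline):
    """Return each temperature minus the mean of the baseline values."""
    base = mean(baseline)
    return [round(t - base, 2) for t in temps]

# Made-up yearly averages in degrees Celsius, for illustration only (not BEST data).
temps = [8.0, 8.5, 9.0, 9.5]

# Hypothetical reference period: the first two years of the series.
baseline = temps[:2]

print(anomalies(temps, baseline))  # [-0.25, 0.25, 0.75, 1.25]
```

A positive anomaly means a year was warmer than the reference average, and a negative anomaly means it was cooler.<br />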
<br />
<b>Questions for Students:</b><br />
<br />
(1) What does the <a href="http://www.scitechnol.com/GIGS/GIGS-1-101.pdf">BEST statistical methodology</a> suggest about the complexity and nature of analyzing data from historical records? Does the study help show how thorough our statistical reasoning needs to be for very large data sets?<br />
<br />
(2) Read the articles highlighted on the <a href="http://www.novim.org/">NOVIM</a> website:<br />
<ul>
<li><a href="http://www.novim.org/images/pdf/schneideradvocacy050112">Advocacy</a></li>
<li><a href="http://www.novim.org/images/pdf/wsjridley.pdf">Confirmation Bias</a></li>
<li><a href="http://www.novim.org/images/pdf/crichtoncaltechlecture.pdf">Science versus Policy</a></li>
</ul>
<div>
How do the sciences of mathematics and statistics help us come to <i>unbiased</i> conclusions? </div>
<div>
<br /></div>
<div>
(3) Read the <a href="http://en.wikipedia.org/">Wikipedia articles</a> on <a href="http://en.wikipedia.org/wiki/Scientific_method">Scientific Method</a> and <a href="http://en.wikipedia.org/wiki/List_of_cognitive_biases">List of Biases in Judgement and Decision Making</a>. How important do you think the study of data science is to the future of humankind? How does learning to remove <i>bias</i> from our scientific methodology improve our chance to prosper and survive on this Earth?</div>
Ryan M. Ferrishttp://www.blogger.com/profile/03122603266808854365noreply@blogger.com