Monday, February 25, 2013

Datatypes, consistent data, reshaping data, dirty data

[Data for this post can be found here]

This post covers one of the nasty prerequisites for all data professionals: "Understanding Dirty Data and How to Clean It Up". The problem is sometimes called "bad data", and the goal is sometimes described as the quest for "tidy data". Most relational data uses the row-and-column format you are familiar with from spreadsheets. Ideally, all data would be arranged neatly in such a format. Let us take a look at such data in R's data editor window.
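To make "dirty" versus "tidy" concrete, here is a minimal sketch of a cleanup pass. It is written in Python rather than R, and everything in it is an assumption made for illustration: the sample table, the column names, and the helper functions `clean_row` and `clean` are all hypothetical. It shows three common repairs: trimming stray whitespace, making text consistent, and converting a text cell into a number.

```python
import csv
import io

# Hypothetical messy export: stray whitespace, inconsistent case,
# a number stored with a thousands separator, and a missing value.
RAW = """name, grade ,score
 Alice ,5th,"1,200"
bob,5TH,950
Carol,6th,
"""


def clean_row(row):
    """Return one tidy record: trimmed keys, consistent case, typed values."""
    row = {key.strip(): (value or "").strip() for key, value in row.items()}
    score_text = row["score"].replace(",", "")
    return {
        "name": row["name"].title(),
        "grade": row["grade"].lower(),
        "score": int(score_text) if score_text else None,  # None marks a missing value
    }


def clean(raw_text):
    """Read the raw text as a table and clean every row."""
    reader = csv.DictReader(io.StringIO(raw_text))
    return [clean_row(row) for row in reader]


tidy = clean(RAW)
for record in tidy:
    print(record)
```

After this pass every row has the same three columns, each name and grade is spelled the same way, and each score is either a whole number or explicitly marked as missing.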

Wednesday, February 20, 2013

Roger D. Peng: "Computing for Data Analysis"

Roger D. Peng is a "Rock 'n' Roll Statistician" specializing in biostatistics at the Johns Hopkins Bloomberg School of Public Health. Dr. Peng also teaches R programming courses. He has posted the lectures for his last course, "Computing for Data Analysis", on YouTube. His lectures are an excellent resource and will be understandable in large part for most of the 5th-10th graders at Saint Paul's Academy. I recommend you watch at least the individual videos listed below. Consider watching all of the videos in the playlists "BackGround on R" and "Computing for Data Analysis: Week 1" in preparation for next week's class.

Dr. Peng is an experienced statistical programmer in R.  Watching these videos will clear up many questions we had in our second class and help you extend your data analysis skills with R. Also, anything I described poorly or inaccurately will be accurately described in Dr. Peng's lectures.

Monday, February 11, 2013

A Brief Overview of Programming Languages and Symbolic Logic (Part I)

What kind of spirit is it, that can support you to keep your interest on C++ Standardization for 20+ years?
Nothing else I could spend my time on could do so much good to so many people.
From an interview with Bjarne Stroustrup (2012), author of the C++ Programming Language

Some would say that the best way to learn a programming language is to simply install your language of choice and start writing code based on the manual, tutorials or help system specific to that language. However, having studied more than a few programming languages in my lifetime, I am going to emphasize a different approach that starts with deeper background.

The ability to use digital logic circuits in computation has been with us for a very short period of history. The rise in their use has paralleled the rapid spread of urbanization and growth in population, universities, international trade and market economies. Today, more than sixty years after ENIAC was built for the United States Army Ballistic Research Laboratory, computational logic is the substructure of all economic, engineering, defense, science, and mathematical efforts. Each year our lives are organized more around, and with the help of, improvements in computer hardware, computational theory, information theory, networking and user interface design. During these scant sixty years, at least 2500 computer languages have been developed. A "History of Computer Languages" constitutes a reading list of some of the brightest minds to live in the post-World War II era. A brief but detailed list of popular programming languages can be found here.

Monday, February 4, 2013

Week0: Using spreadsheets

Spreadsheets are the ubiquitous tool of business, finance, science, academia, and corporate America. As the most widespread of all data analysis tools, spreadsheets have no peers. Despite the continual enhancement of their functionality, they have some drawbacks. For example:
  • Both functions and data in cells are subject to undetected statistical errors, sometimes due to nothing more than a slip of the cursor.
  • Mixing data and code in the same worksheet can create, and has created, data disasters.
  • Large datasets or data results are impractical when stored in cells.
Despite these drawbacks, it is common to find thousands of spreadsheets in active use at any investment bank or accounting firm. No doubt SPA students will begin using spreadsheets well before they are in high school and continue using them when they enter college and the working world.
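The drawbacks listed above are easier to see next to an alternative. The following is a minimal sketch in Python, not a recommendation of a particular tool: the data lives in plain text that the code only reads, and the calculation lives in a separate function. The sales table, the column names and the function `load_amounts` are hypothetical, invented for this illustration. One cell contains the letter "O" typed in place of a zero, the kind of slip that can go unnoticed inside a worksheet.

```python
import csv
import io

# Hypothetical data, kept as plain text and never edited by the code below.
# The "east" amount contains a letter O instead of a zero.
SALES_CSV = """region,amount
north,120.50
south,98.25
east,14O.00
west,75.00
"""


def load_amounts(csv_text):
    """Parse the amount column, collecting bad cells instead of hiding them."""
    amounts, problems = [], []
    reader = csv.DictReader(io.StringIO(csv_text))
    for line_number, row in enumerate(reader, start=2):  # line 1 is the header
        try:
            amounts.append(float(row["amount"]))
        except ValueError:
            problems.append((line_number, row["amount"]))
    return amounts, problems


amounts, problems = load_amounts(SALES_CSV)
print("total of valid rows:", sum(amounts))
print("cells needing attention:", problems)
```

Because the code never writes to the data, a slip of the cursor cannot silently overwrite a formula, and the bad cell is reported with its line number rather than quietly dropped from the total.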