Teaching Data Science: Basic Data Analysis and Lattice Graphics Framework

(post under construction)
This post is very generic cruft on basic data analysis, designed around combining data with:

cbind()
rbind()
matrix()
as.matrix()
data.frame()
as.data.frame()

and visualizing data with the lattice graphics package.
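As a quick orientation, here is a minimal sketch of how these functions behave. The vectors and values are toy examples invented for illustration; they are not taken from the BEA, BLS, or Census data used later.

```r
# Toy vectors (hypothetical values, for illustration only)
year   <- c(2001, 2002, 2003)
income <- c(10.5, 11.2, 12.0)

m   <- cbind(year, income)         # bind as columns: 3 x 2 numeric matrix
m2  <- rbind(m, c(2004, 12.9))     # bind as rows: appends a fourth row
df  <- as.data.frame(m2)           # matrix -> data frame, column names kept

# data.frame() keeps each column's own type ...
df2 <- data.frame(year, label = c("a", "b", "c"))
# ... while as.matrix() must coerce mixed columns to a single type
mm  <- as.matrix(df2)              # character matrix

mat <- matrix(1:6, nrow = 2)       # matrix() fills column by column

print(df)
print(mm)
print(mat)
```

The point to take away is that a matrix holds a single type, while a data frame holds one type per column; that difference explains most of the surprises when moving between the two.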

Some quick and dirty notes on learning R. Virtual libraries of material for learning R exist both online and offline. However, I have also found that R is a peculiar and specific language. I would compare its semantics most closely to SQL, but that comparison stops being useful quickly. Ironically, given the power of R for quantitative analysis, I have found that the user really needs to get the "feel of R" into his forearms to become useful and self-confident. Spending time manipulating and reorganizing data is essential at each step of your curriculum in learning R. Functionally, R is a mathematical platform that benefits from domain-specific packages and knowledge. But the R language is also a distinctive engine with limits on what it is practical to program. R does certain things very well. Other functionality, more typical of general-purpose programming languages, simply falls outside what R is built for. There is an art to the successful use of R, and an 'R' mentality that only serious practice will instill.

This post follows from my last post on combining data for analysis. I am using BEA, BLS, and Census data to understand 20-year macro-economic flows. Some examples on how to use the cbind, rbind, matrix, and as.data.frame commands to re-organize this data are here. Below are some functions I have created to help explicate the data set ('dd'). They are slightly more concise than the function 'str(dd)'. The reader will recall that I have concatenated data from multiple sets into one dataframe. I have used the prefixes (BLS, PI, NS) to designate "Bureau of Labor Statistics", (BEA) Personal Income, and (BEA) Net (residential) Stock.

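# One-column data frame giving the class of each column of x
getdf <- function(x) {
  as.data.frame(cbind(Class = sapply(x, class)), optional = TRUE)
}

# One-column matrix of the column names of y that match the pattern x
dfclass <- function(x, y) {
  as.matrix(grep(pattern = x, names(y), value = TRUE))
}

# Two-column matrix: column class first, then column name
print_matrix <- function(x) {
  matrix1 <- matrix(sapply(x, class))
  matrix1 <- cbind(matrix1, matrix(names(x)))
  print(matrix1)
}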

> getdf(dd)
Class
Year integer
BLS.Employment integer
BLS.CivLabForce integer
BLS.Census.Population integer
BLS.EmployPopRatio numeric
BLS.Census.EMP_CLF numeric
BLS.Census.EMP_POP numeric
BLS.Census.CLF_POP numeric
PI.PersonalInc numeric
PI.EmployeeComp numeric
PI.ProprietorInc numeric
PI.RentalInc numeric
PI.AssetReceipts numeric
PI.InterestInc numeric
PI.DividendInc numeric
PI.TransferReceipts numeric
PI.GovBenefits numeric
PI.OtherTransfer numeric
NS.ResidentialFixedAssets numeric
NS.Private numeric
NS.Corporate numeric
NS.Noncorp numeric
NS.Sole_prop_partner numeric
NS.Nonprofit numeric
NS.Households numeric
NS.Government numeric
NS.Federal numeric
NS.StateLocal numeric
NS.OwnerOccupied numeric
NS.TenantOccupied numeric

> dfclass("PI",dd)
[,1]
[1,] "PI.PersonalInc"
[2,] "PI.EmployeeComp"
[3,] "PI.ProprietorInc"
[4,] "PI.RentalInc"
[5,] "PI.AssetReceipts"
[6,] "PI.InterestInc"
[7,] "PI.DividendInc"
[8,] "PI.TransferReceipts"
[9,] "PI.GovBenefits"
[10,] "PI.OtherTransfer"

> print_matrix(dd)
[,1] [,2]
[1,] "integer" "Year"
[2,] "integer" "BLS.Employment"
[3,] "integer" "BLS.CivLabForce"
[4,] "integer" "BLS.Census.Population"
[5,] "numeric" "BLS.EmployPopRatio"
[6,] "numeric" "BLS.Census.EMP_CLF"
[7,] "numeric" "BLS.Census.EMP_POP"
[8,] "numeric" "BLS.Census.CLF_POP"
[9,] "numeric" "PI.PersonalInc"
[10,] "numeric" "PI.EmployeeComp"
[11,] "numeric" "PI.ProprietorInc"
[12,] "numeric" "PI.RentalInc"
...
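The helpers above can also be folded into one function. The following is only a sketch of an alternative, not the functions used in this post: the name describe_columns, its prefix argument, and the toy column BLS.CPI are hypothetical. It matches on the start of the name with startsWith(), because a plain grep("PI", ...) would also pick up any column that merely contains "PI".

```r
# Hypothetical helper: name and class of each column, optionally
# restricted to columns whose name starts with a given prefix.
describe_columns <- function(df, prefix = NULL) {
  out <- data.frame(
    Name  = names(df),
    Class = vapply(df, function(col) class(col)[1], character(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
  if (!is.null(prefix)) {
    out <- out[startsWith(out$Name, prefix), , drop = FALSE]
  }
  out
}

# Toy data frame: Year and PI.PersonalInc reuse the first values shown
# in this post; BLS.CPI is an invented column used to show the matching.
toy <- data.frame(
  Year           = 1992:1994,
  PI.PersonalInc = c(5347, 5568, 5875),
  BLS.CPI        = c(1.0, 1.1, 1.2)
)

print(describe_columns(toy))
print(describe_columns(toy, prefix = "PI."))      # only PI.PersonalInc
print(grep("PI", names(toy), value = TRUE))       # also matches BLS.CPI
```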

str() also works well to enumerate a dataframe:

> str(dd)
'data.frame':   20 obs. of  30 variables:
 $ Year                     : int  1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 ...
 $ BLS.Employment           : int  118492 120259 123060 124900 126708 129558 131463 133488 136891 136933 ...
 $ BLS.CivLabForce          : int  128105 129200 131056 132304 133943 136297 137673 139368 142583 143734 ...
 $ BLS.Census.Population    : int  256894 260255 263436 266557 269667 272912 276115 279295 282385 285309 ...
 $ BLS.EmployPopRatio       : num  0.615 0.617 0.625 0.629 0.632 0.638 0.641 0.643 0.644 0.637 ...
 $ BLS.Census.EMP_CLF       : num  0.925 0.931 0.939 0.944 0.946 0.951 0.955 0.958 0.96 0.953 ...
 $ BLS.Census.EMP_POP       : num  0.925 0.931 0.939 0.944 0.946 0.951 0.955 0.958 0.96 0.953 ...
 $ BLS.Census.CLF_POP       : num  0.499 0.496 0.497 0.496 0.497 0.499 0.499 0.499 0.505 0.504 ...
 $ PI.PersonalInc           : num  5347 5568 5875 6201 6592 ...
 $ PI.EmployeeComp          : num  3647 3791 3981 4179 4388 ...
 $ PI.ProprietorInc         : num  415 450 485 516 584 ...
 $ PI.RentalInc             : num  84.6 114.1 142.9 154.6 170.4 ...
 $ PI.AssetReceipts         : num  910 900 948 1005 1081 ...
 $ PI.InterestInc           : num  722 698 713 752 784 ...
 $ PI.DividendInc           : num  188 202 235 253 296 ...
 $ PI.TransferReceipts      : num  746 791 826 879 924 ...
 $ PI.GovBenefits           : num  730 777 813 860 901 ...
 $ PI.OtherTransfer         : num  16.3 14.1 13.3 18.7 22.9 19.4 26 34 42.4 46.8 ...
 $ NS.ResidentialFixedAssets: num  6744 7160 7668 8009 8449 ...
 $ NS.Private               : num  6586 6990 7486 7821 8253 ...
 $ NS.Corporate             : num  69.6 71.5 74.6 77.4 81.1 ...
 $ NS.Noncorp               : num  6516 6918 7411 7743 8172 ...
 $ NS.Sole_prop_partner     : num  617 634 656 681 714 ...
 $ NS.Nonprofit             : num  110 112 116 117 120 ...
 $ NS.Households            : num  5788 6172 6639 6945 7337 ...
 $ NS.Government            : num  159 170 182 188 196 ...
 $ NS.Federal               : num  52.7 56.4 59.9 61.6 63.8 66 68.7 72.2 75.3 79.1 ...
 $ NS.StateLocal            : num  106 114 122 126 133 ...
 $ NS.OwnerOccupied         : num  4918 5267 5694 5975 6333 ...
 $ NS.TenantOccupied        : num  1801 1866 1945 2005 2087 ...

Once we have peeked at our data, we can start using graphics packages like lattice to visualize it. I find that visualizing is a critical step in understanding the relationships within data, the specifics of which are not always immediately clear. The lattice graphics package (which ships with the standard installation) lets us rather easily plot multiple data series against the same X axis (e.g. "Year") by joining them with "+" on the left side of the formula. Also added are a specific Y label, a key, and a line type (ylab, auto.key, type). I use the lattice package's "xyplot" function. All measurements are in billions:

library(lattice)
xyplot(NS.Households + PI.PersonalInc + PI.GovBenefits ~ Year, data=dd, ylab="Non Specific", auto.key=TRUE, type="b")

What we perceive at first glance wouldn't surprise many observers of the American economy over the last twenty years. Nominally, we have seen increasing housing stock valuation, personal income, and government benefits. 2008 - 2009 brought a "crash" in housing stock valuation and personal income, and a simultaneous increase in government benefits. However, even this non-specific chart shows the "bubble" in housing stock valuation whose "bursting" resulted in a precipitous decline in personal income. Now let us try something a bit more complex. Below, I scale and create data sets to help me visually understand and weight macro-economic change. Notice that inside parentheses the operators "+" and "-" perform arithmetic; at the top level of the formula, "+" instead adds another series to the Y axis.

xyplot(BLS.Census.Population/100 + PI.GovBenefits + (BLS.Census.CLF_POP * 5000) + (NS.Households - PI.PersonalInc) ~ Year, data=dd, ylab="Divergence", auto.key=TRUE, type="b")

The orange line represents the increasing difference between household value and personal income we saw in the first chart. Only this time, we can readily see that difference crest at about $5 trillion in 2007 before taking a dive. Total population is recorded in thousands, so I divide it by 100 to place it front and center in this graph of billions. BLS.Census.CLF_POP is the civilian labor force divided by the total population, stored as a fraction. (This is not one of the headline unemployment measures, U-3 or U-6!) I multiply this fraction by 5000 so the relationship between total population and a shrinking workforce share (over time) is made clear. Government benefits fit into this graph without scaling. Clearly, they had been rising nominally well before Barack Obama took office and government stimulus packages were deployed.
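As an aside, the formula behaviour and the scaling trick can be checked on toy data. This is a minimal sketch: the data frame toy and all of its values are invented for illustration, with one column in thousands and two in billions, mirroring the mix of units described above.

```r
library(lattice)

# Hypothetical toy data: population in thousands, the rest in billions
toy <- data.frame(
  Year       = 2001:2005,
  population = c(250000, 252000, 254000, 256000, 258000),
  assets     = c(6000, 6400, 6900, 7300, 7800),
  income     = c(5000, 5200, 5400, 5700, 6000)
)

# The top-level "+" separates two plotted series. The "/" and the
# parenthesised "-" are ordinary arithmetic evaluated within each series.
p <- xyplot(population / 100 + (assets - income) ~ Year,
            data = toy, type = "b", auto.key = TRUE,
            ylab = "Toy divergence")
print(p)   # explicit print is needed when run from a script

# The same arithmetic, done outside the formula for inspection
scaled_pop <- toy$population / 100
gap        <- toy$assets - toy$income
print(cbind(Year = toy$Year, scaled_pop, gap))
```

Dividing by 100 brings the population series into the same numeric range as the gap, which is all the scaling in the real chart is doing; the scaled line no longer has meaningful units, only a readable shape.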

From this graph we could deduce more sharply the American economic dilemma: a disjointed housing bubble, a shrinking workforce percentage (e.g. an aging population), a steadily increasing population, and an economy dependent upon increasing government benefits. Let us try another approach at visualizing similar data, but before we do, some comments on employment and population percentages may be useful. The total number of employed persons in the United States is always some percentage of the civilian labor force, which in turn is always some percentage of the total population. These numbers are always quite different from U-3 or U-6, which are the usual measures of unemployment. Let us take a look at this data for the first and last years in my data set from the BLS and Census. To do this, I am going to bind three columns as matrices from my dataframe. This code formats indexed dataframe information nicely:

> cbind(matrix(names(ddBLS)),matrix(ddBLS[1,1:8]), matrix(ddBLS[20,1:8]))
[,1] [,2] [,3]
[1,] "Year" 1992 2011
[2,] "BLS.Employment" 118492 139869
[3,] "BLS.CivLabForce" 128105 153617
[4,] "BLS.Census.Population" 256894 312603
[5,] "BLS.EmployPopRatio" 0.615 0.584
[6,] "BLS.Census.EMP_CLF" 0.925 0.911
[7,] "BLS.Census.EMP_POP" 0.461 0.447
[8,] "BLS.Census.CLF_POP" 0.499 0.491
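The three computed ratios in the table can be rebuilt from the raw counts, which makes their definitions explicit. The sketch below retypes the 1992 and 2011 counts printed above into a small data frame; the names first_last and comparison are hypothetical, and t() is used as a simpler stand-in for the cbind-of-matrices approach when every column is numeric.

```r
# Raw counts (in thousands) for the first and last years, copied from the table above
first_last <- data.frame(
  Year                  = c(1992, 2011),
  BLS.Employment        = c(118492, 139869),
  BLS.CivLabForce       = c(128105, 153617),
  BLS.Census.Population = c(256894, 312603)
)

# Each ratio is one raw column divided by another
first_last$EMP_CLF <- round(first_last$BLS.Employment  / first_last$BLS.CivLabForce, 3)
first_last$EMP_POP <- round(first_last$BLS.Employment  / first_last$BLS.Census.Population, 3)
first_last$CLF_POP <- round(first_last$BLS.CivLabForce / first_last$BLS.Census.Population, 3)

# Transpose so variables run down the rows and years across the columns
comparison <- t(first_last)
colnames(comparison) <- c("first", "last")
print(comparison)
```

Because every column here is numeric, t() returns a plain numeric matrix; with mixed column types it would coerce everything to character, which is one reason to bind matrices column by column instead.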

We can use the lattice package ('xyplot') to look at these percentages:

xyplot(BLS.Census.EMP_CLF + BLS.EmployPopRatio + BLS.Census.CLF_POP + BLS.Census.EMP_POP ~ Year, data=dd, type="b", ylab="BLS Employment %", auto.key=TRUE)

But maybe we want to look at related data on four separate charts with separate Y axes. We can use commands from the base graphics package to do this:

par(mfrow=c(2,2), pch=16)
attach(USEmployPop.1992.2011)
plot(Year,BLS.Employment,type="b")
plot(Year,BLS.CivLabForce,type="b")
plot(Year,BLS.Census.CLF_POP, type="b")
plot(Year,BLS.Census.EMP_POP,type="b")
detach(USEmployPop.1992.2011)
par(mfrow=c(1,1), pch=1)
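For comparison, lattice can produce a similar four-panel layout without attach() or par(). This is a sketch under assumptions, not the approach used above: the data frame emp is a hypothetical three-row extract built from the first three years of the counts shown earlier, and the CLF_POP column is recomputed from them.

```r
library(lattice)

# Hypothetical extract: first three years of the BLS/Census counts shown earlier
emp <- data.frame(
  Year                  = 1992:1994,
  BLS.Employment        = c(118492, 120259, 123060),
  BLS.CivLabForce       = c(128105, 129200, 131056),
  BLS.Census.Population = c(256894, 260255, 263436)
)
emp$CLF_POP <- emp$BLS.CivLabForce / emp$BLS.Census.Population

# outer = TRUE gives each left-hand-side term its own panel;
# relation = "free" gives each panel its own Y axis
p <- xyplot(BLS.Employment + BLS.CivLabForce + BLS.Census.Population + CLF_POP ~ Year,
            data = emp, outer = TRUE, type = "b", layout = c(2, 2),
            scales = list(y = list(relation = "free")), ylab = "Value")
print(p)
```

The trade-off is that the base graphics version gives full control over each plot call, while the lattice version keeps everything in one formula and leaves the global graphics state untouched.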

diff1 <- as.data.frame(cbind(Year,'BLS.Total.Employment(M)'=BLS.Employment/100,NS.ResidentialFixedAssets,PI.PersonalInc,'Diff(NS.RFA - PI.PInc)' = (NS.ResidentialFixedAssets - PI.PersonalInc)))
diff1

Friday, March 29, 2013
