Teaching Data Science: Elections Data and Accumulators

I've loaded some code and charts I produced for a piece on my political blog. R programming wizardry will take some time and practice to accumulate. In fact, it is a more modern programming structure, one I refer to as the 'accumulator', that I had to hack up to make this cruft work. I often use an "accumulator" variable while doing data analysis with PowerShell. It lets me continuously stuff similarly typed data into an array, like so:

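rv -ea 0 a   # Remove-Variable a; -ea 0 suppresses the error if it does not exist
# Create one database stored as a variable (e.g. '$out') by merging all candidate donations.
# Add a 'member' or field (e.g. Candidate) to each record.
$out += $Varlist.name | % {
    # Fetch the records held in the variable whose name is the current item.
    $a = (ls variable:/$PSItem).value
    # Tag every record with that name and pass it down the pipeline into $out.
    $a | add-member -force -passthru -NotePropertyName Candidate -NotePropertyValue $PSItem
}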

# An $out record now is

$out[0]

Contributor : ATU
Date : 06/22/12
Amount : 900
P/G : P
City : WASHINGTON
State : DC
Zip : 20016
Employer :
Occupation :
Candidate : NM

I couldn't find anything analogous in R, so I resorted to code like the following. It creates a zero-filled numeric data.frame ('ddf'), stuffs it with data (via rbind()), removes the original zeroed row, and re-levels (via droplevels()) before finally returning the result.

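accyear <- function() {
  # ddf <- data.frame(cbind(Year=NULL,Freq=NULL))
  # Seed the accumulator with a zeroed placeholder row (removed below).
  ddf <- data.frame(Year = 0, Freq = 0)
  for (i in list) {
    # All Freq counts recorded for birth year i.
    dd <- data.frame(Year = i, subset(fvl2, Year == i, select = Freq))
    # Collapse them into a single row holding the year's total.
    dd <- data.frame(Year = unique(dd$Year), Freq = sum(dd$Freq))
    ddf <- rbind(ddf, dd)
  }
  # Drop the placeholder row and reset the row names.
  ddf <- droplevels(data.frame(ddf[-1, ], row.names = NULL))
  return(ddf)
}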
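
For comparison, a closer R analog of the PowerShell accumulator might be to collect one small data frame per year in a list and bind them all at the end. The sketch below is not the code used for the charts: 'toy' is made-up data standing in for 'fvl2', and 'accyear_list' is a hypothetical name.

```r
# Toy stand-in for fvl2: several Freq counts per birth year (made-up numbers).
toy <- data.frame(
  Year = c(1950, 1950, 1951, 1952, 1952, 1952),
  Freq = c(3, 2, 4, 1, 1, 5)
)

# Hypothetical helper: build one row per year, accumulate the rows in a list,
# then bind them in a single rbind() call. No placeholder row is needed.
accyear_list <- function(df) {
  years <- sort(unique(df$Year))
  rows <- lapply(years, function(y) {
    data.frame(Year = y, Freq = sum(df$Freq[df$Year == y]))
  })
  do.call(rbind, rows)
}

ddf_toy <- accyear_list(toy)
print(ddf_toy)

# Base R's aggregate() computes the same per-year totals in one call.
agg_toy <- aggregate(Freq ~ Year, data = toy, FUN = sum)
print(agg_toy)
```

Because the list only ever holds finished rows, there is nothing to strip out afterwards, which is the step the zeroed seed row forces on accyear().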

Full code for the charts is below. Data is from a Whatcom County voter database and a Census sex and age database for Washington counties. The original CSV files were quite large: 5 to 7 million 'observations', or separate data fields. (I am describing nrow() * ncol() here.) So I pared them down in sequential steps, a process that is handled differently, to the same effect, in SQL. Then I use table() to count registrations per birth date and stack() to reshape the Census age columns into long form for plotting. On my political blog, I overlaid the barplot()s with GIMP's transparent layer functionality, but in reality the visualization doesn't quite line up with the data. It is close enough, though, to suggest that a more accurate and interesting approach to correlating the separate information in one graph could be powerful.
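
To make the roles of table() and stack() concrete, here is a small sketch on made-up data. The 'voters' and 'wide' data frames, the MM/DD/YYYY date format, and the GROUP_* column names are all assumptions for illustration; none of them come from the real files.

```r
# Made-up birth dates in an assumed MM/DD/YYYY format.
voters <- data.frame(
  BirthDate = c("01/15/1950", "01/15/1950", "03/02/1951",
                "07/04/1951", "12/31/1951"),
  stringsAsFactors = FALSE
)

# table() counts each distinct value; as.data.frame() turns the counts into
# a data frame with columns Var1 (the value) and Freq (how often it occurs).
counts <- as.data.frame(table(voters$BirthDate))

# Characters 7-10 of each date string hold the four-digit year.
counts$Year <- as.numeric(substr(as.character(counts$Var1), 7, 10))
print(counts)

# stack() reshapes a wide, one-row data frame into long form:
# 'values' holds the numbers, 'ind' holds the original column names.
wide <- data.frame(GROUP_A = 10, GROUP_B = 25, GROUP_C = 5)
long <- stack(wide)
print(long)
```

The long form is convenient for barplot(), since 'values' supplies the bar heights and 'ind' supplies the labels.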

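# Load the tab-delimited voter list and keep only the columns of interest.
fvl <- read.delim("ferrisvoterlist_20121204.txt")
fvl1 <- subset(fvl, select = c(1,3,5,8,15,16,17,18,19,21))
# Show the class of each remaining column.
as.matrix(sapply(fvl1, class))

# Count registrations per distinct birth date (columns Var1 and Freq).
fvl2 <- as.data.frame(table(fvl1$BirthDate))

# Characters 7-10 of the birth date are taken as the four-digit year.
fvl2$Year <- substr((fvl2$Var1), 7, 10)
fvl2$Year <- as.numeric(fvl2$Year)
# Sorted vector of distinct birth years (the name shadows base R's list).
list <- sort(unique(fvl2$Year))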

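accyear <- function() {
  # ddf <- data.frame(cbind(Year=NULL,Freq=NULL))
  # Seed the accumulator with a zeroed placeholder row (removed below).
  ddf <- data.frame(Year = 0, Freq = 0)
  for (i in list) {
    # All Freq counts recorded for birth year i.
    dd <- data.frame(Year = i, subset(fvl2, Year == i, select = Freq))
    # Collapse them into a single row holding the year's total.
    dd <- data.frame(Year = unique(dd$Year), Freq = sum(dd$Freq))
    ddf <- rbind(ddf, dd)
  }
  # Drop the placeholder row and reset the row names.
  ddf <- droplevels(data.frame(ddf[-1, ], row.names = NULL))
  return(ddf)
}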

ddf <- accyear()
barplot(ddf$Freq,names.arg=ddf$Year,xlab="Voter Birth Year",ylab="Registration Count")

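# Load the Census age-by-sex estimates for Washington counties.
WA_AGESEX <- read.csv("CC-EST2011-AGESEX-53.csv")
# List the column names that contain "TOT".
as.matrix(grep(pattern="TOT", names(WA_AGESEX), value=TRUE))
# Keep the Whatcom County rows with YEAR code 4.
Whatcom <- subset(WA_AGESEX, CTYNAME == "Whatcom County")
Whatcom4 <- subset(Whatcom, YEAR == 4)
# Select the adult age-group totals and reset the row names.
WhatcomAGE <- subset(Whatcom4, select=c(AGE1824_TOT,AGE2544_TOT,AGE4564_TOT,AGE65PLUS_TOT,AGE85PLUS_TOT))
WhatcomAGE <- droplevels(data.frame(WhatcomAGE, row.names=NULL))
# Reshape to long form: one row per age group (columns values and ind).
WhatcomAGE <- stack(WhatcomAGE)
# barplot(WhatcomAGE$values,names.arg=WhatcomAGE$ind)
# Plot the groups in reverse order, oldest first.
barplot(WhatcomAGE$values[5:1], names.arg=WhatcomAGE$ind[5:1])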
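
One possible way to put both data sets on a common axis, rather than overlaying images, is to bin voter birth years into the same age brackets as the Census columns and draw grouped bars. This is only a sketch under assumptions: every number is invented, the reference year 2012 is assumed from the voter file's name, the four brackets are my simplification, and 'voters_toy' and 'census_toy' are hypothetical stand-ins for 'ddf' and 'WhatcomAGE'.

```r
# Invented registration counts by birth year (stand-in for ddf).
voters_toy <- data.frame(
  Year = c(1930, 1950, 1960, 1975, 1985, 1992),
  Freq = c(40, 120, 150, 130, 90, 60)
)

# Age as of an assumed reference year.
ref_year <- 2012
voters_toy$Age <- ref_year - voters_toy$Year

# Bin ages into left-closed brackets: [18,25), [25,45), [45,65), [65,Inf).
breaks <- c(18, 25, 45, 65, Inf)
labels <- c("18-24", "25-44", "45-64", "65+")
voters_toy$Bracket <- cut(voters_toy$Age, breaks = breaks,
                          labels = labels, right = FALSE)

# Total registrations per bracket, as a named numeric vector.
reg <- sapply(split(voters_toy$Freq, voters_toy$Bracket), sum)

# Invented population totals for the same brackets (stand-in for WhatcomAGE).
census_toy <- c("18-24" = 100, "25-44" = 300, "45-64" = 320, "65+" = 80)

# One grouped barplot: rows are the two series, columns are the age brackets.
both <- rbind(Registered = reg, Population = census_toy[labels])
barplot(both, beside = TRUE, legend.text = TRUE,
        xlab = "Age bracket", ylab = "Count")
```

With both series binned the same way, each bracket's registration count sits directly beside its population count, so the bars line up by construction.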

Wednesday, May 8, 2013