dplyr 0.2

Nice

RStudio Blog

I’m very excited to announce dplyr 0.2. It has three big features:

  • improved piping courtesy of the magrittr package
  • a vastly more useful implementation of do()

  • five new verbs: sample_n(), sample_frac(), summarise_each(), mutate_each() and glimpse().

These features are described in more detail below. To learn more about the 35 new minor improvements and bug fixes, please read the full release notes.

Improved piping

dplyr now imports %>% from the magrittr package by Stefan Milton Bache. I recommend that you use this instead of %.% because it is easier to type (since you can hold down the shift key) and is more flexible. With %>%, you can control which argument on the RHS receives the LHS by using the pronoun . (a single dot). This makes %>% more useful with base R functions because they don’t always take the data frame as the first argument. For…
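As a minimal sketch of the dot pronoun described above, assuming the magrittr package is installed and using the built-in mtcars data set (which is not part of the original post):

```r
library(magrittr)

# lm() takes the formula first and the data second, so the dot tells %>%
# where the left-hand side (mtcars) should be placed.
fit <- mtcars %>% lm(mpg ~ wt, data = .)

# Without a dot, the left-hand side is passed as the first argument.
n_rows <- mtcars %>% nrow()
```

The first call is equivalent to lm(mpg ~ wt, data = mtcars), and the second to nrow(mtcars).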


Posted in Uncategorized | Leave a comment

R Markdown

I was toying with the idea of moving to LaTeX for my word processing needs. The post "Markdown or LaTeX" by Yihui Xie sheds light on the subject.

My challenge is that customers still demand documents to be in Microsoft Word.

A great workaround is R Markdown. The development version of RStudio has excellent R Markdown features. You can choose HTML, PDF, or Word as the output format, and you can switch between them at any time.

Here is a screenshot of the output options when using R Markdown.

[Image: Markdown2]
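The same choice of output format can also be made from code. The sketch below assumes the rmarkdown package and pandoc are available (both ship with recent versions of RStudio); the file name report.Rmd and its contents are made up for illustration.

```r
# Build a tiny R Markdown document so the sketch is self-contained.
fence <- strrep("`", 3)  # the three-backtick chunk delimiter, built indirectly
rmd_lines <- c(
  "---",
  "title: \"Demo\"",
  "---",
  "",
  paste0(fence, "{r}"),
  "summary(cars)",
  fence
)
input <- file.path(tempdir(), "report.Rmd")  # hypothetical file name
writeLines(rmd_lines, input)

# Render to Word only if the required tools are installed.
# Other output formats include "html_document" and "pdf_document"
# (PDF additionally needs a LaTeX installation).
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  out <- rmarkdown::render(input, output_format = "word_document", quiet = TRUE)
}
```

Changing the output_format argument is all it takes to produce a different file type from the same source document.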

Posted in Markdown, R, Uncategorized | Leave a comment

Histograms and Regression Lines as a matter of philosophy

Make sense of life events through histograms. I rather enjoy a histogram and often use histograms to give anecdotes context.

My favorite is the classic example of the panhandler who makes $50,000 a year. Is this case an outlier or is this case the average? I think that this is a good framework to critically think about information people tell us.
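To make the outlier-or-average question concrete, here is a small sketch with simulated data. The income distribution and its parameters are made up purely for illustration; they are not real figures.

```r
set.seed(7)  # arbitrary seed so the simulation is repeatable

# Made-up, right-skewed yearly incomes (NOT real data).
incomes  <- rlnorm(1000, meanlog = log(9000), sdlog = 0.6)
anecdote <- 50000  # the figure from the story

# Draw the histogram and mark where the anecdote falls.
hist(incomes, breaks = 40, main = "Simulated yearly income", xlab = "Income")
abline(v = anecdote, lty = 2)

# Share of simulated cases at or above the anecdote.
share_at_or_above <- mean(incomes >= anecdote)
```

If the dashed line sits far out in the tail, the anecdote describes an outlier rather than the typical case.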

Thanks to the book Thinking, Fast and Slow, the concept of regression to the mean is becoming real to me. Regression to the mean refers to the observation that if a value is extreme on the first measurement, it will tend to be closer to the average on the second measurement; conversely, if a value is extreme on the second measurement, it will tend to have been closer to the average on the first. See the Wikipedia page for more details.
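A short simulation can illustrate the idea. It assumes each measurement is a stable component plus independent noise; the sample size, the distributions, and the 5% cutoff are arbitrary choices made for this sketch.

```r
set.seed(42)  # arbitrary seed so the simulation is repeatable

n <- 10000
stable <- rnorm(n)            # the part that does not change between measurements
first  <- stable + rnorm(n)   # first noisy measurement
second <- stable + rnorm(n)   # second noisy measurement

# Cases whose first measurement was in the top 5%.
extreme <- first > quantile(first, 0.95)

mean_first_extreme  <- mean(first[extreme])
mean_second_extreme <- mean(second[extreme])
```

The cases selected for an extreme first measurement are, on average, less extreme on the second measurement, while still sitting above the overall mean.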

Most of the time my newborn son does not sleep well through the overnight hours. On the weekends he tends to sleep better through the night, and we start to hope that this trend continues into the week. Sunday night comes and he regresses to his mean sleeping pattern. My only hope is the small sample size, since small samples tend to produce extreme values.
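The point about small samples can be checked with a quick simulation. The sleep figures here (a mean of 6 hours with a standard deviation of 2) are hypothetical numbers chosen only for this sketch.

```r
set.seed(1)  # arbitrary seed so the simulation is repeatable

# Average hours slept over a given number of nights (hypothetical distribution).
avg_sleep <- function(n_nights) mean(rnorm(n_nights, mean = 6, sd = 2))

weekend_avgs <- replicate(5000, avg_sleep(2))   # two-night samples
month_avgs   <- replicate(5000, avg_sleep(30))  # thirty-night samples

sd_weekend <- sd(weekend_avgs)
sd_month   <- sd(month_avgs)
```

Averages over two nights vary far more than averages over thirty nights, so an unusually good weekend is not surprising even when nothing has changed.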

Histograms and regression to the mean are tools we can use to help us think critically about the world.

Posted in Uncategorized | Leave a comment

Program Data Management

I imagine most established nonprofit organizations have systems and dedicated staff to monitor the organization’s finances. I imagine there are regular reviews to determine whether the organization is over or under budget for both spending and revenue.

Is the same amount of attention given to program data? Here are five questions about program data management.

1. Is there a review to determine if the organization is collecting the right data?

2. Is there a data agenda? (What would we like to collect, and how?)

3. Is there a more streamlined way to collect program data?

4. Is there an audit schedule to fight against garbage in, garbage out?

5. Is there a mechanism to process the data and clean up the garbage?

Posted in Uncategorized | Leave a comment

R Studio Presentation Feature

I am enjoying the Presentation feature in RStudio. It is easy to use and a great way to present R code and output. I also appreciate that one can present straight from RStudio without needing an internet connection. To give the slides some flavor, dust off your CSS skills.

Posted in Uncategorized | Leave a comment

Late to the party: the str and summary functions in R

I admit I am very late to the party with these functions, but after dealing with enough messy data sets I now consider them required before any analysis. These functions help you understand the underlying structure of your data sets. Below I use a data set of NFL stats to show how easy they are to use and what the output looks like.

str and summary functions

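# Read the data into R. stringsAsFactors = TRUE keeps Team as a factor, as in
# the output below (R 4.0.0 and later read strings as character by default).
Nfl <- read.csv("PtsLM.csv", stringsAsFactors = TRUE)
# Explore your data. The str function tells you the type of each variable,
# which is handy when you can't figure out why a variable is giving you a
# hard time.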
str(Nfl)
## 'data.frame':    64 obs. of  13 variables:
##  $ Year        : num  2012 2011 2012 2011 2012 ...
##  $ Number      : num  1 1 2 2 3 3 4 4 5 5 ...
##  $ Team        : Factor w/ 32 levels "Arizona Cardinals",..: 1 1 2 2 3 3 4 4 5 5 ...
##  $ G           : num  16 16 16 16 16 16 16 16 16 16 ...
##  $ Pts.G       : num  15.6 19.5 26.2 25.1 24.9 23.6 21.5 23.2 22.3 25.4 ...
##  $ TotPts      : num  250 312 419 402 398 378 344 372 357 406 ...
##  $ Yds.G       : num  263 324 369 377 352 ...
##  $ Yds.P       : num  4.1 5.2 5.8 5.6 5.4 5.2 5.6 5.7 5.8 6.2 ...
##  $ X1st.G      : num  15.4 17.9 21.4 21.8 19.6 19.5 18.8 19.6 20.5 21.6 ...
##  $ PassingAvg  : num  5.6 7.2 7.7 7.3 7.1 6.7 6.7 6.7 8 7.9 ...
##  $ PassYds.G   : num  188 223 282 262 234 ...
##  $ RusingAvg   : num  3.4 4.2 3.7 4 4.3 4.3 5 4.9 4.5 5.4 ...
##  $ RushingYds.G: num  75.2 101.6 87.3 114.6 118.8 ...

# The names function provides you with column/variable names.
names(Nfl)
##  [1] "Year"         "Number"       "Team"         "G"           
##  [5] "Pts.G"        "TotPts"       "Yds.G"        "Yds.P"       
##  [9] "X1st.G"       "PassingAvg"   "PassYds.G"    "RusingAvg"   
## [13] "RushingYds.G"
# The summary function will give a nice summary breakdown on all the
# variables in the data frame whether the variable is a numeric variable or
# not.
summary(Nfl)
##       Year          Number                     Team          G     
##  Min.   :2011   Min.   : 1.00   Arizona Cardinals: 2   Min.   :16  
##  1st Qu.:2011   1st Qu.: 8.75   Atlanta Falcons  : 2   1st Qu.:16  
##  Median :2012   Median :16.50   Baltimore Ravens : 2   Median :16  
##  Mean   :2012   Mean   :16.50   Buffalo Bills    : 2   Mean   :16  
##  3rd Qu.:2012   3rd Qu.:24.25   Carolina Panthers: 2   3rd Qu.:16  
##  Max.   :2012   Max.   :32.00   Chicago Bears    : 2   Max.   :16  
##                                 (Other)          :52               
##      Pts.G          TotPts        Yds.G         Yds.P          X1st.G    
##  Min.   :12.1   Min.   :193   Min.   :259   Min.   :4.10   Min.   :15.4  
##  1st Qu.:19.2   1st Qu.:307   1st Qu.:314   1st Qu.:5.00   1st Qu.:17.9  
##  Median :22.8   Median :364   Median :343   Median :5.35   Median :19.3  
##  Mean   :22.5   Mean   :360   Mean   :347   Mean   :5.42   Mean   :19.6  
##  3rd Qu.:24.9   3rd Qu.:399   3rd Qu.:375   3rd Qu.:5.80   3rd Qu.:21.3  
##  Max.   :35.0   Max.   :560   Max.   :467   Max.   :6.70   Max.   :27.8  
##                                                                          
##    PassingAvg     PassYds.G     RusingAvg     RushingYds.G  
##  Min.   :5.40   Min.   :136   Min.   :3.40   Min.   : 75.2  
##  1st Qu.:6.60   1st Qu.:194   1st Qu.:3.90   1st Qu.:100.5  
##  Median :7.05   Median :226   Median :4.20   Median :114.5  
##  Mean   :7.13   Mean   :230   Mean   :4.25   Mean   :116.5  
##  3rd Qu.:7.72   3rd Qu.:256   3rd Qu.:4.50   3rd Qu.:128.5  
##  Max.   :9.30   Max.   :334   Max.   :5.40   Max.   :169.3  
## 
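The summary output above counts rows per team because Team is stored as a factor. Since R 4.0.0, read.csv reads text columns as character by default, so the counts only appear after a conversion. Here is a minimal sketch using a toy data frame with made-up values (not the NFL data):

```r
# Toy data frame with hypothetical values.
teams <- data.frame(
  Team  = c("A", "B", "A", "B"),
  Pts.G = c(21.5, 17.0, 24.0, NA),
  stringsAsFactors = FALSE
)

str(teams)                        # Team is reported as chr
teams$Team <- factor(teams$Team)  # convert so summary() counts each level
summary(teams)                    # Pts.G also reports the number of NA's

team_counts <- table(teams$Team)
n_missing   <- sum(is.na(teams$Pts.G))
```

Running str before and after the conversion shows the type of Team changing from chr to Factor.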
Posted in Data Cleaning, data quality, R, Uncategorized | Leave a comment

Subsetting Data In R



Data management in R can be somewhat challenging. I have always been able to subset observations without a problem, but I struggled with subsetting variables (columns) into a new object. Thanks to UCLA’s Institute for Digital Research and Education, I was able to grasp the concept, which has saved me a lot of time. Another great resource for learning about R is Quick-R.

download.file("http://www.openintro.org/stat/data/mlb11.RData", destfile = "mlb11.RData")
load("mlb11.RData")

bb <- mlb11
names(bb)  # Lists the variable/column names; the bracketed number on each output row is the position of the first name on that row.

##  [1] "team"         "runs"         "at_bats"      "hits"        
##  [5] "homeruns"     "bat_avg"      "strikeouts"   "stolen_bases"
##  [9] "wins"         "new_onbase"   "new_slug"     "new_obs"

# Subset columns by position: #2 runs, #6 bat_avg, #10 new_onbase, #12 new_obs
bb1 <- bb[, c(2, 6, 10, 12)]
names(bb1)  # Check to make sure we did it correctly.

## [1] "runs"       "bat_avg"    "new_onbase" "new_obs"

Now you are able to complete a data analysis on these variables. For example, it is easy to view the correlations between the selected variables.

cor(bb1)

##              runs bat_avg new_onbase new_obs
## runs       1.0000  0.8100     0.9215  0.9669
## bat_avg    0.8100  1.0000     0.8823  0.8671
## new_onbase 0.9215  0.8823     1.0000  0.9373
## new_obs    0.9669  0.8671     0.9373  1.0000
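Columns can also be selected by name rather than by position, which keeps working if the column order changes. The sketch below uses the built-in mtcars data set as a stand-in for the baseball data so that it runs without a download; the choice of columns is arbitrary.

```r
# Select columns by name.
keep <- c("mpg", "hp", "wt")
cars_by_name <- mtcars[, keep]

# The same columns selected by position (1 = mpg, 4 = hp, 6 = wt).
cars_by_pos <- mtcars[, c(1, 4, 6)]
same_result <- identical(cars_by_name, cars_by_pos)

# subset() accepts unquoted column names through its select argument.
cars_subset <- subset(mtcars, select = c(mpg, hp, wt))

# Correlations between the selected columns.
cor_mat <- cor(cars_by_name)
```

All three approaches return a data frame with the same three columns.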

There you go!

Posted in Uncategorized | Leave a comment