Session5

Mandy

Recap

You should know now:

  • the command t.test() and the different kinds of usage
  • that y ~ x in R means y dependent on x (formula syntax)
  • that R can roll dice for you (as one example of random number generation)

Exercises

In the babies data set, the variable age contains the mother's recorded age and dage contains the father's age for each case in the sample. Do a significance test of the null hypothesis of equal ages against the one-sided alternative that the dads are older in the sampled population.

  • install the package UsingR (you may check if the package is already installed)
  • do the respective t-test
  • redo the test assuming equal variances! How do the results differ? Would this assumption be appropriate?

Anova

  • a technique we use when all explanatory variables are categorical (factor)
  • if there is one factor with three or more levels we use one-way ANOVA (with only two levels the equal-variance t-test is usually preferred; it gives exactly the same p-value, since with two levels \(F = t^2\))
  • for more factors there is two-way, three-way ANOVA
  • central idea is to compare two or more means by comparing variances

Example Data

A data frame with 14 observations on 2 variables.

Variable content
ozone: atmospheric ozone concentration
garden: garden id

Observation  1  2  3  4  5  6  7  8  9 10 11 12 13 14
ozone        9  7  6  8  5 11  9 11  9  6 10  8  8 12
garden       a  a  a  b  a  b  b  b  b  a  b  a  a  b

Source: M. Crawley, The R-Book
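
If the data file is not at hand, the 14 observations listed above can also be typed in directly. This is only a minimal sketch: the object name gardens is chosen for convenience, and this version has just the two variables ozone and garden.

```r
# Build the example data by hand from the table above
gardens <- data.frame(
  ozone  = c(9, 7, 6, 8, 5, 11, 9, 11, 9, 6, 10, 8, 8, 12),
  garden = factor(c("a", "a", "a", "b", "a", "b", "b",
                    "b", "b", "a", "b", "a", "a", "b"))
)

str(gardens)            # 14 observations of 2 variables
table(gardens$garden)   # number of observations per garden
```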

Exercise

  • read in the respective data file anova.txt using the read.table() command

Solution

gardens <- read.table("session5dat/gardens2.txt", header = TRUE)
head(gardens)
##   ozone garden index
## 1     6      a    10
## 2     8      a     7
## 3     5      a     1
## 4     9      a     4
## 5     7      a     5
## 6     8      a     6

Total Sum of Squares

  • first we plot the values

  • there is a lot of scatter, indicating that the variance in ozone is large
  • to get a feel for the overall variance we plot the overall mean (8.5) and indicate each of the residuals by a vertical line
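
A plot of this kind could be drawn roughly as follows. This is a sketch with base graphics, with the ozone values typed in from the example data; the name overall_mean and the plot settings are illustrative choices, not taken from the slides.

```r
# Ozone values from the example data, in observation order
ozone <- c(9, 7, 6, 8, 5, 11, 9, 11, 9, 6, 10, 8, 8, 12)

overall_mean <- mean(ozone)

# Scatter of the raw values against observation number
plot(ozone, xlab = "observation", ylab = "ozone", pch = 16)

# Horizontal line at the overall mean
abline(h = overall_mean)

# One vertical line per observation: the residual from the overall mean
idx <- seq_along(ozone)
segments(x0 = idx, y0 = overall_mean, x1 = idx, y1 = ozone)
```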

Total Sum of Squares

  • we refer to this overall variation as the total sum of squares, SSY or TSS

\[SSY = \sum(y-\bar{y})^2\]

  • here we have \[ SSY = 55.5 \]
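
The value of SSY can be checked by applying the formula directly. A small sketch, with the ozone values typed in from the example data:

```r
ozone <- c(9, 7, 6, 8, 5, 11, 9, 11, 9, 6, 10, 8, 8, 12)

# Total sum of squares: squared deviations from the overall mean
SSY <- sum((ozone - mean(ozone))^2)
SSY

# Equivalent: the sample variance times (n - 1)
var(ozone) * (length(ozone) - 1)
```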

Group means

  • now instead of fitting the overall mean, let us fit the individual garden means

garden mean
a 7
b 10
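
One way to obtain these garden means in R is tapply(). A sketch, with the data typed in from the example table; group_means is an illustrative name.

```r
ozone  <- c(9, 7, 6, 8, 5, 11, 9, 11, 9, 6, 10, 8, 8, 12)
garden <- factor(c("a", "a", "a", "b", "a", "b", "b",
                   "b", "b", "a", "b", "a", "a", "b"))

# Mean ozone concentration within each garden
group_means <- tapply(ozone, garden, mean)
group_means
```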

Group means

[Figure: ozone values with the individual garden means; garden A in black, garden B in red]

Group means

  • now we see that the mean ozone concentration is substantially higher in garden B
  • the aim of ANOVA is to determine
    • whether it is significantly higher or
    • whether this kind of difference could come by chance alone

Error Sum of Squares

When the garden means differ substantially, the sum of squares computed around the individual garden means will be much smaller than the sum of squares computed around the overall mean.

  • we define the new sum of squares as the error sum of squares, SSE (error in the sense of residual)

\[ SSE = \sum(y_{garden A}-\bar{y}_{garden A})^2+\sum(y_{garden B}-\bar{y}_{garden B})^2 \]

  • here we have \[ SSE = 24.0 \]
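
The same number follows from the definition. A sketch using ave(), which by default replaces each value with the mean of its group; the variable names are illustrative.

```r
ozone  <- c(9, 7, 6, 8, 5, 11, 9, 11, 9, 6, 10, 8, 8, 12)
garden <- factor(c("a", "a", "a", "b", "a", "b", "b",
                   "b", "b", "a", "b", "a", "a", "b"))

# For every observation, the mean of its own garden
fitted_mean <- ave(ozone, garden)

# Error sum of squares: squared deviations from the garden means
SSE <- sum((ozone - fitted_mean)^2)
SSE
```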

Treatment Sum of Squares

  • the component of the variation that is explained by the difference between the means (i.e. explained by the predictor) is called the treatment sum of squares, SSA
  • analysis of variance is based on the notion that we break down the total sum of squares into useful and informative components

\[ SSY = SSE + SSA \]

where

  • SSA = explained variation
  • SSE = unexplained variation
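
The decomposition can be verified numerically for the example data. In this sketch SSA is computed in two ways: as the difference SSY - SSE, and directly from the garden means (group size times the squared distance of each garden mean from the overall mean). The variable names are illustrative.

```r
ozone  <- c(9, 7, 6, 8, 5, 11, 9, 11, 9, 6, 10, 8, 8, 12)
garden <- factor(c("a", "a", "a", "b", "a", "b", "b",
                   "b", "b", "a", "b", "a", "a", "b"))

SSY <- sum((ozone - mean(ozone))^2)        # total variation
SSE <- sum((ozone - ave(ozone, garden))^2) # unexplained variation

# Explained variation, by subtraction
SSA <- SSY - SSE

# Explained variation, computed directly from the garden means
n_j        <- tapply(ozone, garden, length)
mean_j     <- tapply(ozone, garden, mean)
SSA_direct <- sum(n_j * (mean_j - mean(ozone))^2)

c(SSY = SSY, SSE = SSE, SSA = SSA, SSA_direct = SSA_direct)
```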

Anova Table

Source   Sum of squares   Degrees of freedom   Mean square      F ratio
Garden   \(31.5\)         \(1\)                \(31.5\)         \(15.75\)
Error    \(24.0\)         \(12\)               \(s^2=2.0\)
Total    \(55.5\)         \(13\)
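
The mean squares and the F ratio in the table follow from the sums of squares and their degrees of freedom. A sketch of the arithmetic, with the sums of squares taken from the table; the names k, n, MSA and MSE are illustrative.

```r
SSA <- 31.5   # treatment (garden) sum of squares
SSE <- 24.0   # error sum of squares

k <- 2        # number of gardens (factor levels)
n <- 14       # number of observations

df_garden <- k - 1   # 1
df_error  <- n - k   # 12

MSA <- SSA / df_garden   # mean square for garden
MSE <- SSE / df_error    # error variance s^2

F_ratio <- MSA / MSE
F_ratio
```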

F-Test

  • now we need to test whether an F ratio of 15.75 is large or small
    • we can use a table or a software package
    • here I use R to calculate the upper-tail probability of the F distribution with 1 and 12 degrees of freedom
1 - pf(15.75, 1, 12)
## [1] 0.001864103
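
The same tail probability can be requested directly with the lower.tail argument of pf(), and qf() gives the critical value one would otherwise look up in a table. A sketch; the 5% level is an assumption made here for illustration, not a choice stated in the slides.

```r
F_ratio <- 15.75

# Upper-tail probability, equivalent to 1 - pf(15.75, 1, 12)
p_upper <- pf(F_ratio, df1 = 1, df2 = 12, lower.tail = FALSE)
p_upper

# Critical value of F(1, 12) at an assumed 5% significance level
F_crit <- qf(0.95, df1 = 1, df2 = 12)
F_crit

# The observed ratio exceeds the critical value
F_ratio > F_crit
```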

Anova in R

  • in R we use the lm() command and
  • the formula syntax a ~ b
  • we assign the result to a variable
mm <- lm(ozone ~ garden, data = gardens)
mm
## 
## Call:
## lm(formula = ozone ~ garden, data = gardens)
## 
## Coefficients:
## (Intercept)      gardenb  
##           7            3
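
With R's default treatment contrasts, the intercept is the mean of the first factor level (garden a) and gardenb is the difference between the mean of garden b and the mean of garden a. A sketch that checks this against the garden means, with the data typed in from the example table rather than read from file:

```r
gardens <- data.frame(
  ozone  = c(9, 7, 6, 8, 5, 11, 9, 11, 9, 6, 10, 8, 8, 12),
  garden = factor(c("a", "a", "a", "b", "a", "b", "b",
                    "b", "b", "a", "b", "a", "a", "b"))
)

mm <- lm(ozone ~ garden, data = gardens)
group_means <- tapply(gardens$ozone, gardens$garden, mean)

coef(mm)                              # (Intercept) and gardenb
group_means["a"]                      # equals the intercept
group_means["b"] - group_means["a"]   # equals the gardenb coefficient
```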

Anova in R

summary(mm)
## 
## Call:
## lm(formula = ozone ~ garden, data = gardens)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##     -2     -1      0      1      2 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   7.0000     0.5345  13.096 1.82e-08 ***
## gardenb       3.0000     0.7559   3.969  0.00186 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.414 on 12 degrees of freedom
## Multiple R-squared:  0.5676, Adjusted R-squared:  0.5315 
## F-statistic: 15.75 on 1 and 12 DF,  p-value: 0.001864
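
Two numbers in this summary connect directly to the sums of squares: the multiple R-squared is the explained share SSA / SSY, and the residual standard error is the square root of the error mean square. A sketch of the check, with the data typed in from the example table:

```r
gardens <- data.frame(
  ozone  = c(9, 7, 6, 8, 5, 11, 9, 11, 9, 6, 10, 8, 8, 12),
  garden = factor(c("a", "a", "a", "b", "a", "b", "b",
                    "b", "b", "a", "b", "a", "a", "b"))
)
mm <- lm(ozone ~ garden, data = gardens)

SSY <- sum((gardens$ozone - mean(gardens$ozone))^2)
SSE <- sum(residuals(mm)^2)
SSA <- SSY - SSE

SSA / SSY                      # multiple R-squared
summary(mm)$r.squared

sqrt(SSE / df.residual(mm))    # residual standard error
summary(mm)$sigma
```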

Anova in R

anova(mm)
## Analysis of Variance Table
## 
## Response: ozone
##           Df Sum Sq Mean Sq F value   Pr(>F)   
## garden     1   31.5    31.5   15.75 0.001864 **
## Residuals 12   24.0     2.0                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
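
Because garden has only two levels, the equal-variance t-test gives the same answer: its squared t statistic equals the F value and the p-values agree. A sketch of this check, with the data typed in from the example table; aov() is shown as an alternative interface to the same model.

```r
gardens <- data.frame(
  ozone  = c(9, 7, 6, 8, 5, 11, 9, 11, 9, 6, 10, 8, 8, 12),
  garden = factor(c("a", "a", "a", "b", "a", "b", "b",
                    "b", "b", "a", "b", "a", "a", "b"))
)

mm  <- lm(ozone ~ garden, data = gardens)
tab <- anova(mm)

# Two-sample t-test assuming equal variances
tt <- t.test(ozone ~ garden, data = gardens, var.equal = TRUE)

unname(tt$statistic)^2   # squared t statistic
tab[["F value"]][1]      # F value from the ANOVA table

tt$p.value               # p-value of the t-test
tab[["Pr(>F)"]][1]       # p-value of the F-test

# The same ANOVA table via aov()
summary(aov(ozone ~ garden, data = gardens))
```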

Anova - Visualization

  • there are two traditional ways of plotting the results of ANOVA
    • box-and-whisker plots
    • barplots with error bars
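
Both kinds of plot can be produced with base graphics. This is only a sketch: the error bars here show plus/minus one standard error of each garden mean, which is one common convention, and the axis labels and variable names are illustrative choices.

```r
gardens <- data.frame(
  ozone  = c(9, 7, 6, 8, 5, 11, 9, 11, 9, 6, 10, 8, 8, 12),
  garden = factor(c("a", "a", "a", "b", "a", "b", "b",
                    "b", "b", "a", "b", "a", "a", "b"))
)

# Box-and-whisker plot of ozone by garden
boxplot(ozone ~ garden, data = gardens, xlab = "garden", ylab = "ozone")

# Garden means and their standard errors
means <- tapply(gardens$ozone, gardens$garden, mean)
ses   <- tapply(gardens$ozone, gardens$garden,
                function(v) sd(v) / sqrt(length(v)))

# Barplot of the means; barplot() returns the bar midpoints
mids <- barplot(means, ylim = c(0, max(means + ses) * 1.1),
                xlab = "garden", ylab = "mean ozone")

# Error bars: mean +/- one standard error
arrows(mids, means - ses, mids, means + ses,
       angle = 90, code = 3, length = 0.1)
```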

[Figures: box-and-whisker plots (Boxplot1, bp2); barplots with error bars (BarplotCI, BarplotSE)]