Estimating a proportion from a sample

Here, we show how to obtain a confidence interval for a proportion estimated from a sample.

Let us generate 100 random flips of a biased coin (p = 1/3):

n <- 100
a <- rbinom(n, size=1, prob=1/3)

To obtain descriptive statistics, one can use the functions table or prop.table:

table(a)
## a
##  0  1 
## 61 39
prop.table(table(a))
## a
##    0    1 
## 0.61 0.39
mean(a)
## [1] 0.39

Note that two other functions, ftable and xtabs, also tabulate the contents of discrete variables and offer more flexibility in displaying the output. The latter, xtabs, is particularly powerful for cross-tabulating several variables.
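As a quick illustration, here is xtabs on a small made-up data frame (the variables sex and smoker are invented for this example):

```r
# Hypothetical toy data to illustrate cross-tabulation with xtabs
d <- data.frame(sex    = c('M', 'F', 'M', 'F', 'M'),
                smoker = c('yes', 'no', 'no', 'no', 'yes'))
xtabs(~ sex + smoker, data = d)  # counts for each sex x smoker combination
```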

You can plot the distributions of values using stick plots or barplots.

plot(table(a))

barplot(table(a))

The relevant function to test against the null hypothesis of p=0.5 and to obtain a confidence interval is prop.test:

prop.test(table(a))
## 
##  1-sample proportions test with continuity correction
## 
## data:  table(a), null probability 0.5
## X-squared = 4.41, df = 1, p-value = 0.03573
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.5070114 0.7044326
## sample estimates:
##    p 
## 0.61
prop.test(table(a))$conf.int
## [1] 0.5070114 0.7044326
## attr(,"conf.level")
## [1] 0.95
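Note that prop.test relies on a normal approximation with a continuity correction. For an exact (Clopper-Pearson) interval, binom.test can be used instead; here on freshly simulated flips (with an arbitrary seed for reproducibility), so the numbers will differ slightly from those above:

```r
set.seed(1)  # arbitrary seed, for reproducibility only
a <- rbinom(100, size = 1, prob = 1/3)
binom.test(sum(a), length(a))$conf.int  # exact 95% CI for the proportion of 1s
```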

Comparing two empirical proportions

Suppose we have two binomial samples. To test whether the underlying probability of success is the same in both, we enter the counts in a 2x2 matrix, one row per sample, with successes in the first column and failures in the second, and use prop.test again:

twoprop <- matrix(c(105, 366, 45, 67), nrow=2)
prop.test(twoprop)
## 
##  2-sample test for equality of proportions with continuity
##  correction
## 
## data:  twoprop
## X-squared = 14.226, df = 1, p-value = 0.0001621
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.23061397 -0.05991721
## sample estimates:
##    prop 1    prop 2 
## 0.7000000 0.8452656

See also fisher.test.
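For instance, Fisher's exact test on the same 2x2 table gives a comparable result:

```r
# Same table as above: rows are the two samples, columns are successes/failures
twoprop <- matrix(c(105, 366, 45, 67), nrow = 2)
fisher.test(twoprop)  # exact test; p-value close to the chi-squared result above
```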

Estimating the mean of a population from a sample

Let us now consider the case of a continuous variable.

n <- 100
a <- rnorm(n, mean=100, sd=15)

Exploratory graphics

There are several ways to graphically inspect a continuous variable. Which one is appropriate depends on the number of data points and their distribution.

par(las=1)
stripchart(a, method='jitter', vertical=TRUE)
abline(h=mean(a), lty=2)

If the sample is small, I recommend using a dotchart:

dotchart(a[1:20])

If the sample is large enough, histograms or density plots are a good idea:

hist(a)
rug(a)

plot(density(a))
abline(v=mean(a), lty=2)
rug(a)

boxplot(a)

Descriptive statistics

summary(a)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   64.01   88.79   99.43   99.15  108.80  148.20
mean(a)
## [1] 99.15297
mean(a[abs(a-mean(a)) < 2*sd(a)]) # after deleting data points beyond 2 standard deviations 
## [1] 98.5568
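Note that mean() also has a built-in trim argument that discards a fixed fraction of the most extreme values on each side; it is a more standard alternative to the ad hoc 2-SD filter above (shown here on freshly simulated data, seed chosen arbitrarily):

```r
set.seed(1)  # arbitrary seed, for reproducibility only
a <- rnorm(100, mean = 100, sd = 15)
mean(a, trim = 0.1)  # mean after dropping the lowest and highest 10% of values
```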

Confidence interval of a mean

Assuming normally distributed values:

t.test(a)
## 
##  One Sample t-test
## 
## data:  a
## t = 63.245, df = 99, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##   96.04219 102.26375
## sample estimates:
## mean of x 
##  99.15297
t.test(a)$conf.int
## [1]  96.04219 102.26375
## attr(,"conf.level")
## [1] 0.95
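The interval returned by t.test is just the classical formula mean ± t × SEM, which can be computed by hand to see what is going on (simulated data, arbitrary seed):

```r
set.seed(1)  # arbitrary seed, for reproducibility only
a <- rnorm(100, mean = 100, sd = 15)
sem <- sd(a) / sqrt(length(a))                 # standard error of the mean
ci  <- mean(a) + c(-1, 1) * qt(0.975, df = length(a) - 1) * sem
all.equal(ci, as.numeric(t.test(a)$conf.int))  # TRUE: identical to t.test
```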

A confidence interval based on bootstrap can also be obtained:

require(boot)
## Loading required package: boot
sampmean <- function(x, d) { mean(x[d]) }
boota <- boot(a, sampmean, 1000)
boot.ci(boota)
## Warning in boot.ci(boota): bootstrap variances needed for studentized
## intervals
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = boota)
## 
## Intervals : 
## Level      Normal              Basic         
## 95%   ( 96.14, 102.25 )   ( 96.17, 102.20 )  
## 
## Level     Percentile            BCa          
## 95%   ( 96.10, 102.14 )   ( 96.13, 102.15 )  
## Calculations and Intervals on Original Scale
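The percentile interval can also be computed by hand with replicate and quantile, which makes the bootstrap logic explicit (simulated data, arbitrary seed):

```r
set.seed(1)  # arbitrary seed, for reproducibility only
a <- rnorm(100, mean = 100, sd = 15)
# resample the data with replacement 1000 times and take the mean each time
boots <- replicate(1000, mean(sample(a, replace = TRUE)))
quantile(boots, c(0.025, 0.975))  # 95% percentile bootstrap interval
```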

Comparing two groups (continuous dependent variable)

g1 <- 500 + rnorm(30, sd=40)
g2 <- 520 + rnorm(20, sd=30)
write(g1, 'group1.dat')
write(g2, 'group2.dat')
rm(g1, g2)

Data for this example are in two text files group1.dat and group2.dat.

g1 <- scan('group1.dat')
g2 <- scan('group2.dat')

We arrange them into a data frame with two columns: group (a factor with two levels, Gr1 and Gr2) and y, which contains the values themselves.

tg <- data.frame(group=factor(rep(c('Gr1', 'Gr2'), 
                                  c(length(g1), length(g2)))),
                 y=c(g1, g2))

head(tg)
##   group        y
## 1   Gr1 545.8721
## 2   Gr1 543.7691
## 3   Gr1 557.8433
## 4   Gr1 483.4303
## 5   Gr1 408.2971
## 6   Gr1 537.6479
str(tg)
## 'data.frame':    50 obs. of  2 variables:
##  $ group: Factor w/ 2 levels "Gr1","Gr2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ y    : num  546 544 558 483 408 ...
table(tg$group)
## 
## Gr1 Gr2 
##  30  20

Graphical explorations

hist(tg$y)

boxplot(tg$y ~ tg$group)

require(lattice)
## Loading required package: lattice
## 
## Attaching package: 'lattice'
## The following object is masked from 'package:boot':
## 
##     melanoma
dotchart(tg$y, groups = tg$group)

When the samples are small, stripchart may be the best option:

stripchart(tg$y ~ tg$group, 
           vertical=TRUE,
           pch=1)

If the samples are large enough, you can create density plots:

par(mfrow=c(1, 2))
xsca <- range(tg$y)
for (gr in levels(tg$group))
{
  with(subset(tg, group==gr),
       {
         plot(density(y), xlim=xsca, main=gr, bty='l')
         rug(y, ticksize=0.1)
         })
}

Or violin plots:

require(ggplot2)
## Loading required package: ggplot2
p <-  ggplot(tg, aes(group, y))
p + geom_violin() + geom_jitter(height = 0, width = 0.1)

To obtain the basic descriptive stats, we use tapply, along with a small helper function se for the standard error of the mean (it is not part of base R, so it must be defined first):

se <- function(x) sd(x) / sqrt(length(x))

attach(tg)
signif(tapply(y, group, mean),3)
## Gr1 Gr2 
## 503 519
signif(tapply(y, group, median), 3)
## Gr1 Gr2 
## 507 515
signif(tapply(y, group, sd), 3)
##  Gr1  Gr2 
## 45.2 23.6
signif(tapply(y, group, se), 3)
##  Gr1  Gr2 
## 8.25 5.28
detach(tg)
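The same per-group statistics can be obtained without attach/detach using aggregate. The block below simulates a stand-in for the tg data frame, since the real one above comes from the two files:

```r
set.seed(1)  # simulated stand-in for the group1.dat/group2.dat data
tg <- data.frame(group = factor(rep(c('Gr1', 'Gr2'), c(30, 20))),
                 y = c(500 + rnorm(30, sd = 40), 520 + rnorm(20, sd = 30)))
aggregate(y ~ group, data = tg, FUN = mean)  # one row per group
aggregate(y ~ group, data = tg, FUN = sd)
```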

Inferential statistics

Student t-tests: first assuming equal variances, then relaxing this assumption.

t.test(y ~ group, data=tg, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  y by group
## t = -1.3998, df = 48, p-value = 0.168
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -37.543155   6.724562
## sample estimates:
## mean in group Gr1 mean in group Gr2 
##          503.4598          518.8691
t.test(y ~ group, data=tg)
## 
##  Welch Two Sample t-test
## 
## data:  y by group
## t = -1.5731, df = 45.889, p-value = 0.1226
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -35.128392   4.309799
## sample estimates:
## mean in group Gr1 mean in group Gr2 
##          503.4598          518.8691
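If normality is doubtful, the non-parametric Mann-Whitney/Wilcoxon rank-sum test is the usual alternative (again on simulated data standing in for the two groups):

```r
set.seed(1)  # simulated stand-in for the two groups read from file
tg <- data.frame(group = factor(rep(c('Gr1', 'Gr2'), c(30, 20))),
                 y = c(500 + rnorm(30, sd = 40), 520 + rnorm(20, sd = 30)))
wilcox.test(y ~ group, data = tg)  # rank-based test of a location shift
```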

Somewhat more information can be obtained by fitting linear models.

First with a parametrisation of group (contr.treatment) where the intercept corresponds to the mean of group 1 and the effect estimates the difference between the two groups.

contrasts(tg$group) <- contr.treatment
contrasts(tg$group)
##     2
## Gr1 0
## Gr2 1
summary(lm(y ~ group, data=tg))
## 
## Call:
## lm(formula = y ~ group, data = tg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -95.163 -19.398  -0.215  25.384  84.606 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  503.460      6.962   72.31   <2e-16 ***
## group2        15.409     11.008    1.40    0.168    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38.13 on 48 degrees of freedom
## Multiple R-squared:  0.03922,    Adjusted R-squared:  0.0192 
## F-statistic: 1.959 on 1 and 48 DF,  p-value: 0.168
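One can check numerically that, with this coding, the intercept is the mean of group 1 and the slope is the difference between the two group means (simulated data used as a stand-in for tg):

```r
set.seed(1)  # simulated stand-in for the tg data frame
tg <- data.frame(group = factor(rep(c('Gr1', 'Gr2'), c(30, 20))),
                 y = c(500 + rnorm(30, sd = 40), 520 + rnorm(20, sd = 30)))
fit <- lm(y ~ group, data = tg, contrasts = list(group = contr.treatment))
gmeans <- tapply(tg$y, tg$group, mean)
all.equal(unname(coef(fit)),
          c(gmeans[['Gr1']], gmeans[['Gr2']] - gmeans[['Gr1']]))
# TRUE: intercept = mean(Gr1), slope = mean(Gr2) - mean(Gr1)
```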

Alternatively, one may prefer a parametrisation (contr.sum) where the intercept estimates the global mean and the first parameter is the deviation of group 1 from that global mean.

contrasts(tg$group) <- contr.sum
contrasts(tg$group)
##     [,1]
## Gr1    1
## Gr2   -1
summary(lm(y ~ group, data=tg))
## 
## Call:
## lm(formula = y ~ group, data = tg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -95.163 -19.398  -0.215  25.384  84.606 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  511.164      5.504   92.87   <2e-16 ***
## group1        -7.705      5.504   -1.40    0.168    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38.13 on 48 degrees of freedom
## Multiple R-squared:  0.03922,    Adjusted R-squared:  0.0192 
## F-statistic: 1.959 on 1 and 48 DF,  p-value: 0.168
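Similarly, one can check that with sum coding the intercept is the mean of the two group means (the unweighted grand mean) and the first coefficient is group 1's deviation from it (simulated data used as a stand-in for tg):

```r
set.seed(1)  # simulated stand-in for the tg data frame
tg <- data.frame(group = factor(rep(c('Gr1', 'Gr2'), c(30, 20))),
                 y = c(500 + rnorm(30, sd = 40), 520 + rnorm(20, sd = 30)))
fit <- lm(y ~ group, data = tg, contrasts = list(group = contr.sum))
gmeans <- tapply(tg$y, tg$group, mean)
all.equal(unname(coef(fit)),
          c(mean(gmeans), gmeans[['Gr1']] - mean(gmeans)))
# TRUE: intercept = mean of group means, coefficient = Gr1's deviation from it
```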

Barplot with standard errors

Barplot of the means with their associated standard errors. Note that this is not the standard error of the difference between the group means, which is roughly \(\sqrt{2}\) times larger and, perhaps for this reason, rarely shown in psychology papers (which also seldom report confidence intervals).

attach(tg)
par(mfrow=c(1,1))
means <- tapply(y, group, mean)
ses <- tapply(y, group, se)

ysca <- c(min(means - 3 * ses), max(means + 3 * ses))

mp <- barplot(means, ylim=ysca, xpd=FALSE)
arrows(mp, means-ses, 
       mp, means+ses, 
       code=3, angle=90)