Hi, this is a tutorial designed to accompany an introductory level hands-on workshop. The tutorial (along with the workshop) aims at helping phonetic researchers run some useful statistical analyses in R. We will discuss R code for the following tasks while assuming no prior knowledge of programming.
There are many ways to do the same thing, but here we will discuss only a few of them.
We will be using the following packages: gss, ggplot2, lmerTest, plyr, and some artificially generated data.
(optional but highly recommended) download and install RStudio Desktop: https://www.rstudio.com/products/rstudio/download/#download
working directory: this is the default directory for accessing files. The code below allows you to choose your working directory. Choose the folder where you saved the data files.
setwd(choose.dir())
If you encounter difficulty using the above code, you may alternatively set the working directory in the following format. The content enclosed in the quotation marks should be the directory/path to the folder where you saved the data files:
setwd("C:/Users/v/Desktop/Rworkshop")
Below we use a dataset of 3 columns where
dttest <- read.csv("F1_ttest.csv",fileEncoding="UTF-8-BOM")
dttest
We can run a paired t-test to see if mean F1 differs significantly in different language contexts.
t.test(dttest$Y.ei_C_F1,dttest$Y.ei_E_F1,paired=TRUE)
##
## Paired t-test
##
## data: dttest$Y.ei_C_F1 and dttest$Y.ei_E_F1
## t = 9.5462, df = 5, p-value = 0.0002135
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 46.10818 80.09094
## sample estimates:
## mean of the differences
## 63.09956
t.test() returns useful statistics such as t value, df, and p value.
If you skip the argument ‘paired=TRUE’ in t.test(), R will run independent 2-group t-test for you.
t.test(dttest$Y.ei_C_F1,dttest$Y.ei_E_F1)
##
## Welch Two Sample t-test
##
## data: dttest$Y.ei_C_F1 and dttest$Y.ei_E_F1
## t = 7.521, df = 9.9494, p-value = 2.072e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 44.39313 81.80599
## sample estimates:
## mean of x mean of y
## 449.0327 385.9332
This is often used to check if something is significantly different from chance level, or a given value. The given value to be tested against is specified in the argument ‘mu=’.
Here for demonstration we test whether the mean F1 for /ei/ in Cantonese in our dataset is significantly different from 450Hz.
t.test(dttest$Y.ei_C_F1,mu=450)
##
## One Sample t-test
##
## data: dttest$Y.ei_C_F1
## t = -0.16919, df = 5, p-value = 0.8723
## alternative hypothesis: true mean is not equal to 450
## 95 percent confidence interval:
## 434.3366 463.7289
## sample estimates:
## mean of x
## 449.0327
Below we will work on a different data file, and use SSANOVA to test whether the formant values of vowel ei are significantly different in bilinguals’ Cantonese and English.
dssanova <- read.csv("F1_F2.csv",fileEncoding="UTF-8-BOM")
tail(dssanova)
Our data contains F1 and F2 values of ei produced in Cantonese and English.
The vowel label is listed under the column ‘word’, e.g. ‘ei_C_F2’ means F2 of ei in Cantonese.
Column ‘X’ records the normalized measurement points.
Column ‘Y’ denotes the actual formant values at each corresponding measurement point.
#we use code in this section to generate SSANOVA values, which will later be used for plotting.
library(gss)
# actual comparisons
engF1 <-"ei_E_F1"
canF1 <-"ei_C_F1"
F1 <- subset(dssanova,word %in% c(engF1, canF1))
F1$word <- factor(F1$word, levels = unique(F1$word))
# fit model
ssa.fit2 <-ssanova(Y~word+X+word:X,data=F1)
# get predictions for constructing confidence intervals
X=seq(min(F1$X), max(F1$X), by=0.01)
grid <- expand.grid(X=X, word = c(engF1, canF1))
grid$ssa.fit <- predict(ssa.fit2,newdata = grid,se = T)$fit
grid$ssa.SE <- predict(ssa.fit2,newdata = grid,se = T)$se.fit
We can use the package ‘ggplot2’ to make graphs. We will plot the F1 values (along with the confidence interval, generated by SSANOVA) on the y axis, measurement points (time) on the x axis.
# plotting comparison; setting up plot object:
library(ggplot2)
p01 <- ggplot(grid,aes(x = X, colour = word, group = word)) +
theme_bw() +
geom_line(aes(y = ssa.fit),alpha = 1,colour = "grey20", lty=3) +
geom_ribbon(aes(ymin = ssa.fit-(1.96*ssa.SE), ymax = ssa.fit+(1.96*ssa.SE),fill = word),alpha = 0.5,color="NA") +
xlab("normalized time") +
ylab("Formant (Hz)")+
theme(axis.text = element_text(colour="black", size=15),
text = element_text(size=15),
title = element_text(size=15),
axis.title.x= element_text(vjust=-0.45),
axis.title.y = element_text(vjust=.2),
axis.ticks = element_line(colour="black"),
axis.line = element_line())
p01
~challenge task
Sections above have compared the F1 values in English and Cantonese. Try working on F2 values yourself.
Hint: change the specification for comparisons.
When you’re done, you may check the last section for a reference answer.
In this section we build models to see how the various information we have about the speaker or the linguistic context of the word influence the duration of the vowel.
Suppose we have a corpus of child’s speech and we are interested primarily in the relationship between word frequency and the duration of the vowel. So we decided to look at two specific vowels, schwa and ae. We sampled vowel duration from the speech of 40 children. These children form 4 groups of 10, so that each group of children would have the same first or dominant language, i.e. English, French, mandarin or Japanese. In each children’s speech, we sampled 10 tokens of schwa, and 10 tokens of ae. Because it’s a corpus, we could not control for the age of the children. Overall, they age between 5-10. Also, the words in which we measure the vowels may vary across speakers. For instance, maybe one of the tokens from a speaker comes from “bad”, but none of the tokens from another speaker comes from “bad”. All together there are 48 unique words.
Let’s load the data and call it dlmer.
dlmer<-read.csv("dlmer.csv",header=T)
Take a closer look at the data
head(dlmer)
The dataset consists of 800 rows (10 tokens x 2 vowels x 10 speakers x 4 groups) and 8 columns with the following column names:
Since we are interested in the effect of word frequency on vowel duration, we may plot all 800 data points, and have a linear model of the relation between frequency and duration. There seems to be a positive correlation between frequency and duration: vowels in higher frequency words appear, quite counterintuitively, longer than those in lower frequency words.
We get the same pattern when we separate the data for each vowel…
or when we further seperate the data based on speakers’ L1…
But when we take into consideration the identity of individual speakers, things begin to look different. The correlation is clearly negative.
Although we could build individual models for each speaker, we would also like to know how strong the correlation is in general, i.e. across speakers (imagine a line that represents the grand average of all the lines). This is why we need mixed models.
In other words…
It’s important to take into consideration the individuals/idiosyncracies.
A lme model (linear mixed effects model) allows us to assess the relationship between the variables we are interested in (e.g. frequency and duration) in aggregated data while taking into consideration the idiosyncracies (e.g. speakers).
m0<-lm(duration~frequency,data=dlmer)
library(lmerTest)
m1<-lmer(duration~frequency+(1|speakerID),data=dlmer)
m2<-lmer(duration~frequency+(1+frequency|speakerID),data=dlmer)
Compare the above figures with the one based on raw data:
Clearly, among the three models, m2 best predicts the real data.
The magical element is(1 + frequency | speakerID)
(1|speakerID)
says
speakers can have variable baselines in terms of the duration (i.e. random intercept);
(1+frequency|speakerID)
says, in addition to random intercept,
the effects of frequency on duration could also vary for different speakers (i.e. random slope).
Let’s straighten out some terms.
lmer(duration ~ frequency+ ( 1 + frequency | speakerID ),data=dlmer)
Elements in red are the building blocks of a mixed effects model. From left to right, they are
duration
: dependent variable. left to the tilde.
frequency
: fixed effect(s) (also independent variable). right to the tilde, not in brackets.
(1+frequency|speakerID)
: random effects. right to the tilde, in brackets.
In our toy models above, we included only frequency
, but real data oftentimes contain more than one variable and we would like to know whether some specific variables are significant. In our case here, we would like to know whether variables such as lang
, age
, vowel
would impact the duration of the vowel.
One way to do that is through model comparison.
m2<-lmer(duration~frequency+(1+frequency|speakerID),data=dlmer)
m3<-lmer(duration~frequency+lang+(1+frequency|speakerID),data=dlmer)
anova(m2,m3)
## refitting model(s) with ML (instead of REML)
Notice that in comparison with m2, m3 has one more fixed effect lang
.
Since the p value (0.19) is higher than 0.05 (a conventional threshold for hypothesis rejection), meaning that the addition of lang
as a fixed factor did not improve the model, we conclude that lang
is not a significant predictor of duration.
m3inter<-lmer(duration~frequency*lang+(1+frequency|speakerID),data=dlmer)
anova(m3,m3inter)
## refitting model(s) with ML (instead of REML)
frequency * lang
= frequency
as a fixed factor +lang
as a fixed factor + an interaction between frequency
and lang
.
Another way to do significance testing is to read the results of the models.
m4<-lmer(duration~frequency+vowel+(1+frequency|speakerID),data=dlmer)
summary(m4)
## Linear mixed model fit by REML t-tests use Satterthwaite approximations
## to degrees of freedom [lmerMod]
## Formula: duration ~ frequency + vowel + (1 + frequency | speakerID)
## Data: dlmer
##
## REML criterion at convergence: -8.7
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.8906 -0.5885 0.0376 0.6601 2.5157
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## speakerID (Intercept) 0.29721 0.5452
## frequency 0.03013 0.1736 0.01
## Residual 0.04129 0.2032
## Number of obs: 800, groups: speakerID, 40
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 0.12736 0.08761 38.39802 1.454 0.154
## frequency -0.19510 0.03152 39.13341 -6.190 2.78e-07 ***
## vowelschwa -0.24958 0.01521 736.15015 -16.408 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) frqncy
## frequency 0.018
## vowelschwa -0.092 -0.124
summary() offers a p value based on t distribution.
Here, for instance, the p value for vowelschwa
is <.001, indicating that the duration for vowel schwa is significantly different from that for vowel ae.
~challenge task
Build your own model and test the significance of your fixed effects. Experiment with different random intercepts, random slopes, and interactions.
Baayen (2008 : Ch. 7): Analyzing Linguistic Data
Cunnings & Finlayson (2015): Mixed effects modelling and longitudinal data analysis
Field et al. (2012: Ch. 19): Discovering Statistics Using R
In the SSANOVA section, we have used ggplot2 to generate a graph of formants. In this section we will build simpler graphs like bar charts and line charts, but with a focus on the code itself, in order to understand how ggplot2 works.
Let’s make a bar plot with dssanova.
#take a look at what our data set is like
head(dssanova)
Plots with error bars are usually more informative than those without. To have error bars, we need to prepare the data beforehand. Here we will pot bar charts with +-1 standard deviation as error bars. The code below is a function that calculates the mean and sd.
library(plyr)
#+++++++++++++++++++++++++
# Function to calculate the mean and the standard deviation
# for each group
#+++++++++++++++++++++++++
# data : a data frame
# varname : the name of a column containing the variable
#to be summariezed
# groupnames : vector of column names to be used as
# grouping variables
data_summary <- function(data, varname, groupnames){
require(plyr)
summary_func <- function(x, col){
c(mean = mean(x[[col]], na.rm=TRUE),
sd = sd(x[[col]], na.rm=TRUE))
}
data_sum<-ddply(data, groupnames, .fun=summary_func,
varname)
data_sum <- rename(data_sum, c("mean" = varname))
return(data_sum)
}
Run the function with our dataset.
dssanova.sum<-data_summary(dssanova, varname="Y", groupnames=c("word"))
dssanova.sum
Now we can build a basic bar chart.
#plot
library(ggplot2)
p<-ggplot((aes(x = word, y=Y)), data=dssanova.sum) +
geom_bar(stat="identity")+
geom_errorbar(aes(ymin=Y-sd, ymax=Y+sd))
p
make error bars leaner
p<-ggplot((aes(x = word, y=Y)), data=dssanova.sum) +
geom_bar(stat="identity")+
geom_errorbar(aes(ymin=Y-sd, ymax=Y+sd), width=.4,position=position_dodge(.9))
p
make the background white
p<-p+theme_bw()
p
make more informative axes titles
p<-p+ylab("formant (Hz)")+xlab("language and formant condition")
p
~challenge task
Make a bar plot where bars are clustered by the formant number, i.e. the bars should appear in the order of ei_C_F1, ei_E_F1, ei_C_F2, ei_E_F2; The space between ei_C_F1 and ei_E_F1 (and the space between ei_C_F2 and ei_E_F2) should be smaller than the space between ei_E_F1 and ei_C_F2.
Answer appears at the end of the tutorial.
Bar charts offer some idea of the overall mean, but a line chart showing the trajectory could be more informative.
Get mean and sd at each measurement point for each word condition.
dssanova.sumline<-data_summary(dssanova, varname="Y", groupnames=c("word","X"))
dssanova.sumline
pp<-ggplot((aes(x = X, y=Y,color=word)), data=dssanova.sumline) +
geom_errorbar(aes(ymin=Y-sd, ymax=Y+sd), width=.4,linetype="solid")+
geom_line()+
geom_point()+
ylab("formant (Hz)")+xlab("position")+
theme_bw()
pp
We can also make a plot with 2 panels, one comparing F1 in Cantonese and English, the other one comparing F2.
dssanova.sumline2<-data_summary(dssanova, varname="Y", groupnames=c("X","language","formant"))
pp<-ggplot((aes(x = X, y=Y,color=language)), data=dssanova.sumline2) +
geom_errorbar(aes(ymin=Y-sd, ymax=Y+sd), width=.4,linetype="solid")+
geom_line()+
geom_point()+
ylab("formant (Hz)")+xlab("position")+
theme_bw()+
facet_wrap(~formant,scales="free")
pp
There are many ways to do the same thing. Your code doesn’t have to be the same as the ones below. So long as they work, things are fine.
library(gss)
# this is where we make a change
engF1 <-"ei_E_F2"
canF1 <-"ei_C_F2"
#reuse all the rest of the code, and we'll get a plot for F2
F1 <- subset(dssanova,word %in% c(engF1, canF1))
F1$word <- factor(F1$word, levels = unique(F1$word))
# fit model
ssa.fit2 <-ssanova(Y~word+X+word:X,data=F1)
# get predictions for constructing confidence intervals
X=seq(min(F1$X), max(F1$X), by=0.01)
grid <- expand.grid(X=X, word = c(engF1, canF1))
grid$ssa.fit <- predict(ssa.fit2,newdata = grid,se = T)$fit
grid$ssa.SE <- predict(ssa.fit2,newdata = grid,se = T)$se.fit
ggplot(grid,aes(x = X, colour = word, group = word)) +
theme_bw() +
geom_line(aes(y = ssa.fit),alpha = 1,colour = "grey20", lty=3) +
geom_ribbon(aes(ymin = ssa.fit-(1.96*ssa.SE), ymax = ssa.fit+(1.96*ssa.SE),fill = word),alpha = 0.5,color="NA") +
xlab("normalized time") +
ylab("Formant (Hz)")+
theme(axis.text = element_text(colour="black", size=15),
text = element_text(size=15),
title = element_text(size=15),
axis.title.x= element_text(vjust=-0.45),
axis.title.y = element_text(vjust=.2),
axis.ticks = element_line(colour="black"),
axis.line = element_line())
mchallenge<-lmer(duration~frequency+sex+(1+frequency|speakerID),data=dlmer)
anova(m2,mchallenge)
## refitting model(s) with ML (instead of REML)
mchallenge2<-lmer(duration~frequency*sex+(1+frequency|speakerID),data=dlmer)
anova(mchallenge2,mchallenge)
## refitting model(s) with ML (instead of REML)
frequency * sex
= frequency
as a fixed factor +sex
as a fixed factor + an interaction between frequency
and sex
There is a different between boys and girls’ vowel durations, for the inclusion of sex
significantly improved the model (m2), but the effect of frequency did not vary for different sexes.
dssanova.sum2<-data_summary(dssanova, varname="Y", groupnames=c("formant","language"))
p<-ggplot((aes(x = formant, y=Y,fill=language)), data=dssanova.sum2) +
geom_bar(stat="identity",position=position_dodge())+
geom_errorbar(aes(ymin=Y-sd, ymax=Y+sd), width=.4,position=position_dodge(.9))+
ylab("formant (Hz)")+xlab("formant")+
theme_bw()
p
This tutorial has benefited from inputs from varied sources, in particular, Sang-im Lee-Kim (on SSANOVA) and Albert Lee (on LME) and Josef Fruehwald (style sheet).
All errors are mine.
Congratulations on making it to the end of this tutorial!
You’re now better equipped as a researcher than you used to be.
Off into the wild world of research you go!