We will be working on a dataset called tempo which contains speaking rate and rhythm data for 20 speakers producing different types of texts: a) a story, b) isolated sentences that differ in terms of the phonotactic structure (simple -> complex) and c) excerpts of poems representing various poetic metres (e.g., iambs, trochees). The data was collected specifically for research on speaking rate and speech rhythm.
Before you move on to study PART 2, make sure that you understand the basic terms:
a) data types
- quantitative - numerical (numeric); can be continuous (e.g., duration, pitch, reaction time) or discrete (word frequency, utterance length in syllables or words, word count in a phrase)
- qualitative - categorical, e.g. pitch accent type, stress level, lexical tone category, syntactic category
- ordinal - position of the word in a phrase ( -> initial, medial, final), strength of a foreign accent (e.g., on a 5-level scale), speaking rate (e.g., very slow … very fast)
b) variable: dependent (response), independent (explanatory, predictor)
c) population & sample
d) randomization & bias
e) statistic and parameter
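The difference between a parameter (a property of the whole population) and a statistic (the same quantity computed from a sample) can be illustrated with a small simulation. This is only a sketch: the population below is simulated, and the names population and sample.rates, as well as the chosen mean and standard deviation, are assumptions and do not come from the tempo dataset.

```r
# Hypothetical population: 10,000 simulated speaking rates (arbitrary units)
set.seed(42)
population <- rnorm(10000, mean = 5, sd = 1)

# Parameter: computed on the whole population
mu <- mean(population)

# Random sample of 50 observations (random selection guards against selection bias)
sample.rates <- sample(population, size = 50)

# Statistic: the same quantity computed on the sample only
x.bar <- mean(sample.rates)

c(parameter = mu, statistic = x.bar)
```

The two values will usually be close but not identical; the statistic varies from sample to sample, while the parameter is fixed.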
library(readxl)
tempo.df <- read_excel("tempo.xlsx")
From now on we will use the tempo.df data frame.
You can use the following commands to have a closer look at the structure of your data:
class(), dim(), names(), str(), glimpse(), summary(), slice()
(some of them require loading additional packages, e.g. glimpse() and slice() come from dplyr).
str(tempo.df)
## Classes 'tbl_df', 'tbl' and 'data.frame': 5117 obs. of 6 variables:
## $ speaker: chr "ADPO" "ADPO" "ADPO" "ADPO" ...
## $ ISR : chr "norm" "norm" "norm" "norm" ...
## $ text : chr "story" "story" "story" "story" ...
## $ LSR : num 5.8 5.1 6.3 5.5 4.9 4.7 5.5 4.5 4.6 5.7 ...
## $ nPVI.V : num 40.3 57.8 44.1 44.7 50.1 ...
## $ rPVI.C : num 52.5 40.8 45.5 85.5 53.2 ...
You can use: head(), tail() or from the dplyr package: slice(), sample_n() and sample_frac().
head(tempo.df)
library(dplyr)
sample_n(tempo.df, 10)
any(is.na(tempo.df))
## [1] TRUE
In order to easily identify which variables (columns) contain missing values (labeled as NA) display the summary of selected columns:
summary(tempo.df[,4:6])
## LSR nPVI.V rPVI.C
## Min. : 1.100 Min. : 0.11 Min. : 1.50
## 1st Qu.: 3.200 1st Qu.: 30.74 1st Qu.: 43.40
## Median : 4.500 Median : 38.58 Median : 62.70
## Mean : 4.654 Mean : 40.82 Mean : 73.83
## 3rd Qu.: 6.000 3rd Qu.: 48.47 3rd Qu.: 93.25
## Max. :10.100 Max. :119.66 Max. :469.91
## NA's :4 NA's :6
Alternatively, you can use is.na() or complete.cases() (NOT recommended for large datasets - returns a vector of logical values).
There is no need to remove these values right now, but they will have to be handled somehow at the later stages of the analysis.
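A compact way to locate missing values, and to see why they matter later on, might look like the sketch below. The data frame toy.df and its values are hypothetical; only base R functions are used.

```r
# Hypothetical toy data frame with a few missing values
toy.df <- data.frame(
  LSR    = c(5.8, 5.1, 6.3, 5.5),
  nPVI.V = c(40.3, NA, 44.1, 44.7),
  rPVI.C = c(52.5, 40.8, NA, NA)
)

# Number of missing values per column
na.per.column <- colSums(is.na(toy.df))
na.per.column

# Number of rows with at least one missing value
n.incomplete <- sum(!complete.cases(toy.df))
n.incomplete

# mean() returns NA unless missing values are dropped explicitly
mean(toy.df$nPVI.V)
mean(toy.df$nPVI.V, na.rm = TRUE)
```

Summing the logical values keeps the output short even for large datasets, because only the counts are printed.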
First, convert character columns to factors:
tempo.df$text <- as.factor(tempo.df$text)
tempo.df$ISR <- as.factor(tempo.df$ISR)
tempo.df$speaker <- as.factor(tempo.df$speaker)
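Note that as.factor() sorts the levels alphabetically, which is why ISR appears as fast, norm, slow, vfast, vslow in the summaries further down. If the levels are meant to be read as an ordinal scale, the order can be set explicitly. The sketch below uses a hypothetical vector; the interpretation of the level names as very slow to very fast is an assumption based on the labels.

```r
# Hypothetical vector of intended speaking rates
ISR <- c("norm", "vfast", "slow", "fast", "vslow", "norm")

# Default: levels sorted alphabetically
levels(as.factor(ISR))

# Explicit order from slowest to fastest; ordered = TRUE marks the variable as ordinal
ISR.ord <- factor(ISR,
                  levels = c("vslow", "slow", "norm", "fast", "vfast"),
                  ordered = TRUE)
levels(ISR.ord)

# Comparisons are now meaningful: is "norm" slower than "vfast"?
ISR.ord[1] < ISR.ord[2]
```

With the levels ordered this way, plots and summary tables list the speaking rates from slowest to fastest instead of alphabetically.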
Second, check here to find information about sampling distribution.
Next, use the code below to transform the original data: each value of the dependent variable (i.e., LSR, nPVI.V and rPVI.C) should represent the average calculated for each level determined by the combination of the predictor/independent variables:
short.df <- tempo.df %>% group_by(speaker, ISR, text) %>% summarise(LSR=mean(LSR), nPVI.V=mean(nPVI.V, na.rm = TRUE), rPVI.C=mean(rPVI.C, na.rm = TRUE))
head(short.df)
In tempo.df there are three numerical continuous variables: LSR (laboratory measured speaking rate) and two so-called rhythm metrics that measure durational variability in vocalic and consonantal intervals - nPVI.V and rPVI.C.
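For orientation, pairwise variability indices are commonly defined over a sequence of successive interval durations: the raw index (rPVI) is the mean absolute difference between neighbouring intervals, and the normalised index (nPVI) divides each difference by the mean of the two intervals and multiplies the result by 100. The sketch below follows that common definition. The functions rpvi() and npvi() and the durations are hypothetical, and the values in the dataset may have been computed with a different tool or different settings.

```r
# Raw pairwise variability index: mean absolute difference between neighbours
rpvi <- function(d) {
  mean(abs(diff(d)))
}

# Normalised PVI: each difference divided by the mean of the pair, times 100
npvi <- function(d) {
  pair.means <- (head(d, -1) + tail(d, -1)) / 2
  100 * mean(abs(diff(d)) / pair.means)
}

# Hypothetical vocalic interval durations in milliseconds
durations <- c(100, 200, 100, 150)
rpvi(durations)
npvi(durations)
```

Both indices are 0 for perfectly even durations and grow as neighbouring intervals differ more; the normalisation makes nPVI less sensitive to overall speaking rate.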
mean(short.df$LSR)
mean(short.df$nPVI.V)
mean(short.df$rPVI.C)
median(short.df$LSR)
median(short.df$nPVI.V)
median(short.df$rPVI.C)
The steps for determining the median of a data set can be found here.
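In short, the values are sorted and the middle one is taken; with an even number of values, the two middle ones are averaged. A minimal sketch of these steps, using a hypothetical helper manual.median() and made-up values:

```r
# Manual median: sort, then take the middle value (odd n)
# or the mean of the two middle values (even n)
manual.median <- function(x) {
  x <- sort(x)
  n <- length(x)
  if (n %% 2 == 1) {
    x[(n + 1) / 2]
  } else {
    mean(x[c(n / 2, n / 2 + 1)])
  }
}

odd.rates  <- c(5.8, 5.1, 6.3, 5.5, 4.9)       # hypothetical values
even.rates <- c(5.8, 5.1, 6.3, 5.5, 4.9, 4.7)  # hypothetical values

manual.median(odd.rates)   # compare with median(odd.rates)
manual.median(even.rates)  # compare with median(even.rates)
```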
var(short.df$rPVI.C)
sd(short.df$rPVI.C)
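var() and sd() return the sample variance and the sample standard deviation, i.e. they divide by n - 1 rather than n. The sketch below reproduces both by hand on hypothetical values:

```r
x <- c(52.5, 40.8, 45.5, 85.5, 53.2)  # hypothetical values

# Sample variance: squared deviations from the mean, divided by n - 1
manual.var <- sum((x - mean(x))^2) / (length(x) - 1)

# Standard deviation: square root of the variance, in the units of the data
manual.sd <- sqrt(manual.var)

all.equal(manual.var, var(x))
all.equal(manual.sd, sd(x))
```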
quantile(short.df$LSR, c(.20, .40, .60, .80))
## 20% 40% 60% 80%
## 3.371718 4.689768 5.802857 6.668333
The steps for determining the k-th percentile (where k is any number between one and one hundred) can be found here.
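There are several conventions for computing percentiles. By default, quantile() interpolates linearly between the sorted values (the method documented as type = 7). The sketch below spells out that default; the helper percentile() and the values are hypothetical, and other textbooks may describe a slightly different rule.

```r
# k-th percentile by linear interpolation between order statistics
# (the default method of quantile(), documented there as type = 7)
percentile <- function(x, k) {
  x <- sort(x)
  n <- length(x)
  h <- (n - 1) * k / 100 + 1   # fractional position of the percentile
  lower <- floor(h)
  upper <- ceiling(h)
  x[lower] + (h - lower) * (x[upper] - x[lower])
}

rates <- c(5.8, 5.1, 6.3, 5.5, 4.9, 4.7, 5.5, 4.5, 4.6, 5.7)  # hypothetical values

percentile(rates, 20)
unname(quantile(rates, 0.20))  # should agree with the line above
```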
range(short.df$nPVI.V)
## [1] 24.07062 80.14200
IQR(short.df$nPVI.V)
## [1] 6.341466
summary(short.df)
## speaker ISR text LSR nPVI.V
## ADPO : 15 fast :57 poe :95 Min. :1.847 Min. :24.07
## AGŚW : 15 norm :57 sent :95 1st Qu.:3.608 1st Qu.:37.03
## AGWA : 15 slow :57 story:95 Median :5.284 Median :40.01
## BŁKI : 15 vfast:57 Mean :5.121 Mean :40.51
## DAKO : 15 vslow:57 3rd Qu.:6.444 3rd Qu.:43.38
## JAWY : 15 Max. :9.113 Max. :80.14
## (Other):195
## rPVI.C
## Min. : 34.92
## 1st Qu.: 48.88
## Median : 61.16
## Mean : 67.87
## 3rd Qu.: 84.13
## Max. :155.42
##
In the first step you should tidy your dataset, so that it can be used as an argument in the normality test:
# drop the grouping left over from summarise(), keep normal tempo only, extract LSR as a plain numeric vector
normal.tempo <- short.df %>% ungroup() %>% filter(ISR=='norm') %>% pull(LSR)
hist(normal.tempo)
library(ggplot2)
short.df %>% filter(ISR=='norm') %>% ggplot(aes(x=LSR)) + geom_histogram(breaks=c(4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5), aes(y=after_stat(density), fill=after_stat(count)), color='black') + geom_density(color='magenta') + xlab('LSR (only normal tempo)') + ylab('density') + theme_bw()
It seems that the LSR data at the normal intended speaking rate are symmetric. Moreover, the shape of the data fits a bell-shaped curve, which suggests a normal (Gaussian) distribution. To check this observation we perform the Shapiro-Wilk test, whose null hypothesis is that the data come from a normal distribution:
shapiro.test(normal.tempo)
##
## Shapiro-Wilk normality test
##
## data: normal.tempo
## W = 0.99024, p-value = 0.9265
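A p-value as large as 0.9265 means that the test finds no evidence against normality; it does not prove that the data are normal. The sketch below shows how the result of shapiro.test() can be read programmatically, using a simulated right-skewed sample for contrast. The sample and the 0.05 threshold are assumptions made for illustration.

```r
set.seed(1)
skewed.sample <- rexp(200)          # simulated right-skewed data (hypothetical)

result <- shapiro.test(skewed.sample)
result$statistic                    # the W statistic
result$p.value                      # the p-value

alpha <- 0.05                       # conventional significance level (an assumption)
reject.normality <- result$p.value < alpha
reject.normality
```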
A normal distribution has important implications: if the data are normally distributed, we can apply the EMPIRICAL RULE (or 68-95-99.7 Rule) not only to describe our data, but also to describe the population.
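The rule states that in a normal distribution roughly 68%, 95% and 99.7% of the values lie within one, two and three standard deviations of the mean. It can be checked on simulated data, as in the sketch below; the sample size, mean and standard deviation are arbitrary choices and not taken from the dataset.

```r
set.seed(123)
x <- rnorm(10000, mean = 5.4, sd = 1)   # simulated, hypothetical values

m <- mean(x)
s <- sd(x)

# Proportion of observations within k standard deviations of the mean
within.k.sd <- sapply(1:3, function(k) mean(abs(x - m) <= k * s))
names(within.k.sd) <- c("1 SD", "2 SD", "3 SD")

round(within.k.sd, 3)   # expected to be close to 0.68, 0.95 and 0.997
```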
The LSR data at the slow intended speaking rate, in contrast, are not normally distributed: the histogram is asymmetric and skewed right (i.e., it has a tail going off to the right). Speech delivered at a slow rate also has a lower mean (3.85 vs. 5.42) and a lower standard deviation (0.84 vs. 0.97) than speech at the normal rate. The smaller standard deviation shows up as a greater concentration of the bins around the mean (-> a flatter histogram generally has more variability than a bell-shaped histogram of a similar range).
# keep slow tempo only and extract LSR as a plain numeric vector
slow.tempo <- tempo.df %>% filter(ISR=='slow') %>% pull(LSR)
shapiro.test(slow.tempo)
##
## Shapiro-Wilk normality test
##
## data: slow.tempo
## W = 0.9818, p-value = 1.254e-11
hist(slow.tempo)
tempo.df %>% filter(ISR=='slow') %>% ggplot(aes(x=LSR)) + geom_histogram(breaks=c(1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7), aes(y=after_stat(density), fill=after_stat(count)), color='black') + geom_density(color='magenta') + xlab('LSR (only slow tempo)') + ylab('density') + theme_bw()
short.df %>% ggplot(aes(factor(ISR), LSR)) + geom_boxplot(aes(fill = factor(ISR)))
tempo.cat <- tempo.df %>% select(ISR:LSR)
tempo.cat.2 <- tempo.cat %>% group_by(ISR) %>% summarise(total=n())
summarise(tempo.cat.2, sum(total))
# use the total to calculate percentages; sum(total) avoids hard-coding the 5117 observations
tempo.cat.2 <- tempo.cat.2 %>% mutate(proportion=(total*100)/sum(total)) %>% arrange(desc(proportion))
tempo.cat.2
PIE CHART
A pie chart takes categorical data and breaks them down by class/category showing the proportion of observations that fall into each category.
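A minimal sketch of a pie chart is given below. It uses base R's pie() so that no extra packages are needed, and the counts are hypothetical rather than taken from tempo.cat.2.

```r
# Hypothetical counts of observations per intended speaking rate
counts <- c(vslow = 12, slow = 18, norm = 30, fast = 25, vfast = 15)

# Proportions of the whole
proportions <- prop.table(counts)

# Slice labels with percentages
slice.labels <- paste0(names(counts), " (", round(100 * proportions), "%)")

pie(counts, labels = slice.labels, main = "Observations per ISR level (toy data)")
```

Pie charts work best with a small number of categories; when the slices are of similar size, a bar graph usually makes the differences easier to read.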
BAR GRAPH
A bar graph takes categorical data and breaks them down by class/category showing the number or percentage of observations in each category (-> absolute or relative frequency). The amount of data in each category is represented by bars of different lengths.
tempo.cat %>% arrange(LSR) %>% ggplot(aes(x=ISR)) + geom_bar(alpha=0.6) + theme_bw()
tempo.cat %>% arrange(LSR) %>% ggplot(aes(x=ISR)) + geom_bar(aes(fill=text), position="dodge") + theme_bw()