We will be working on a dataset called tempo which contains speaking rate and rhythm data for 20 speakers producing different types of texts: a) a story, b) isolated sentences that differ in terms of the phonotactic structure (simple -> complex) and c) excerpts of poems representing various poetic metres (e.g., iambs, trochees). The data was collected specifically for research on speaking rate and speech rhythm.
Before you move on to study PART 2, make sure that you understand the basic terms:
a) data types
- quantitative - numerical (numeric); can be continuous (e.g., duration, pitch, reaction time) or discrete (word frequency, utterance length in syllables or words, word count in a phrase)
- qualitative - categorical, e.g. pitch accent type, stress level, lexical tone category, syntactic category
- ordinal - position of the word in a phrase ( -> initial, medial, final), strength of a foreign accent (e.g., on a 5-level scale), speaking rate (e.g., very slow … very fast)
b) variable: dependent (response), independent (explanatory, predictor)
c) population & sample
d) randomization & bias
e) statistic and parameter
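In R, the data types listed under a) map onto different vector classes. The following is a minimal sketch with made-up values; the vectors `duration`, `n.syll`, `accent` and `rate` are hypothetical and are not part of the tempo dataset:

```r
# Hypothetical toy values, for illustration only
duration <- c(0.21, 0.35, 0.18)            # quantitative, continuous
n.syll   <- c(3L, 5L, 2L)                  # quantitative, discrete
accent   <- factor(c("H*", "L*", "H*"))    # qualitative (categorical)
rate     <- factor(c("slow", "fast", "norm"),
                   levels = c("slow", "norm", "fast"),
                   ordered = TRUE)         # ordinal: the levels have an order

class(duration)    # "numeric"
class(n.syll)      # "integer"
class(accent)      # "factor"
class(rate)        # "ordered" "factor"
rate[1] < rate[2]  # TRUE: "slow" < "fast"; only ordered factors can be compared
```

An unordered factor such as `accent` can only be tested for equality, which mirrors the difference between categorical and ordinal data.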

PART 1: Preparation of the data.

Reading Excel files in R.

library(readxl)
tempo.df <- read_excel("tempo.xlsx")

From now on we will use the tempo.df data frame.

Understand data structure

You can use the following commands to have a closer look at the structure of your data:
class(), dim(), names(), str(), glimpse(), summary(), slice()
(some of them require additional packages to be loaded first, e.g. glimpse() and slice() come from dplyr).

str(tempo.df)
## Classes 'tbl_df', 'tbl' and 'data.frame':    5117 obs. of  6 variables:
##  $ speaker: chr  "ADPO" "ADPO" "ADPO" "ADPO" ...
##  $ ISR    : chr  "norm" "norm" "norm" "norm" ...
##  $ text   : chr  "story" "story" "story" "story" ...
##  $ LSR    : num  5.8 5.1 6.3 5.5 4.9 4.7 5.5 4.5 4.6 5.7 ...
##  $ nPVI.V : num  40.3 57.8 44.1 44.7 50.1 ...
##  $ rPVI.C : num  52.5 40.8 45.5 85.5 53.2 ...

Take a look at the data

You can use head() and tail() or, from the dplyr package, slice(), sample_n() and sample_frac().

head(tempo.df)
library(dplyr)
sample_n(tempo.df, 10)

Identify missing values

any(is.na(tempo.df))
## [1] TRUE

To identify which variables (columns) contain missing values (labeled as NA), display the summary of selected columns:

summary(tempo.df[,4:6])
##       LSR             nPVI.V           rPVI.C      
##  Min.   : 1.100   Min.   :  0.11   Min.   :  1.50  
##  1st Qu.: 3.200   1st Qu.: 30.74   1st Qu.: 43.40  
##  Median : 4.500   Median : 38.58   Median : 62.70  
##  Mean   : 4.654   Mean   : 40.82   Mean   : 73.83  
##  3rd Qu.: 6.000   3rd Qu.: 48.47   3rd Qu.: 93.25  
##  Max.   :10.100   Max.   :119.66   Max.   :469.91  
##                   NA's   :4        NA's   :6

Alternatively, you can use is.na() or complete.cases(). Both return logical values (one per cell or one per row, respectively), so printing their raw output is NOT recommended for large datasets.
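The logical output of these functions becomes readable once it is counted rather than printed. A minimal sketch on a hypothetical toy data frame (`toy.df` and its values are invented for illustration):

```r
# Hypothetical toy data frame with a few missing values
toy.df <- data.frame(LSR    = c(5.8, 5.1, 6.3, 5.5),
                     nPVI.V = c(40.3, NA, 44.1, 44.7),
                     rPVI.C = c(52.5, 40.8, NA, NA))

na.per.column <- colSums(is.na(toy.df))    # number of NAs in each column
na.per.column

n.complete <- sum(complete.cases(toy.df))  # number of rows without any NA
n.complete

which(!complete.cases(toy.df))             # row numbers that contain an NA
```

The same calls should work on tempo.df, where the per-column counts would be expected to match the NA's row of the summary above.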

There is no need to remove these values right now, but they will have to be handled somehow at the later stages of the analysis.

Organize the data

First, convert character columns to factors:

tempo.df$text <- as.factor(tempo.df$text)
tempo.df$ISR <- as.factor(tempo.df$ISR)
tempo.df$speaker <- as.factor(tempo.df$speaker)

Second, check here to find information about sampling distribution.
Next, use the code below to transform the original data: each value of the dependent variable (i.e., LSR, nPVI.V and rPVI.C) should represent the average calculated for each level determined by the combination of the predictor/independent variables:

short.df <- tempo.df %>% group_by(speaker, ISR, text) %>% summarise(LSR=mean(LSR), nPVI.V=mean(nPVI.V, na.rm = TRUE), rPVI.C=mean(rPVI.C, na.rm = TRUE))
head(short.df)
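The effect of this averaging step, and of na.rm = TRUE, can be tried out on a small example. The sketch below uses a hypothetical toy data frame and base R's aggregate() as a stand-in for the group_by() / summarise() pipeline, so that it runs without extra packages:

```r
# Hypothetical toy data: two speakers, two tempi, two measurements each
toy.df <- data.frame(
  speaker = rep(c("S1", "S2"), each = 4),
  ISR     = rep(rep(c("norm", "slow"), each = 2), times = 2),
  nPVI.V  = c(40, 50, 30, NA, 44, 46, 35, 37)
)

mean(c(30, NA))                # NA: a single missing value spoils the mean
mean(c(30, NA), na.rm = TRUE)  # 30: the NA is dropped before averaging

# One mean per speaker x ISR combination; the formula interface of
# aggregate() omits rows with NA by default (na.action = na.omit)
toy.short <- aggregate(nPVI.V ~ speaker + ISR, data = toy.df, FUN = mean)
toy.short
```

Eight rows collapse into four, one per combination of the two predictors, which is the same reduction that turns tempo.df into short.df.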

PART 2: Descriptive statistics for quantitative data

In tempo.df there are three continuous numerical variables: LSR (laboratory measured speaking rate) and two so-called rhythm metrics that measure durational variability in vocalic and consonantal intervals: nPVI.V and rPVI.C.

  1. MEAN (AVERAGE)
  1. Of a sample: sample mean
  2. Of a population: population mean
mean(short.df$LSR)
mean(short.df$nPVI.V)
mean(short.df$rPVI.C)
  1. MEDIAN
median(short.df$LSR)
median(short.df$nPVI.V)
median(short.df$rPVI.C)

The steps for determining the median of a data set can be found here.
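As a rough illustration, the usual textbook procedure (sort the values, then take the middle one, or the mean of the two middle ones when n is even) can be written out by hand. The helper `median.by.hand()` and the input values are hypothetical, and the linked steps may be worded differently:

```r
# Median "by hand": sort, then take the middle value
# (or the mean of the two middle values when n is even)
median.by.hand <- function(x) {
  x <- sort(x)                 # sort() also drops NAs
  n <- length(x)
  if (n %% 2 == 1) {
    x[(n + 1) / 2]
  } else {
    mean(x[c(n / 2, n / 2 + 1)])
  }
}

odd.x  <- c(5.8, 5.1, 6.3, 5.5, 4.9)        # hypothetical values, n = 5
even.x <- c(5.8, 5.1, 6.3, 5.5, 4.9, 4.7)   # hypothetical values, n = 6

median.by.hand(odd.x)    # 5.5, the third of five sorted values
median.by.hand(even.x)   # 5.3, the mean of 5.1 and 5.5
```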

  1. VARIANCE
  1. Of a sample: sample variance
  2. Of a population: population variance
var(short.df$rPVI.C)
  1. STANDARD DEVIATION
    It is calculated with the same formula as the variance, followed by one more step: taking the square root of the result. This brings the measure back to the original units of the data.
sd(short.df$rPVI.C)
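The difference between the sample and the population formula, and the link between variance and standard deviation, can be checked numerically. A minimal sketch on a hypothetical sample `x`:

```r
x <- c(5.8, 5.1, 6.3, 5.5, 4.9)   # hypothetical sample
n <- length(x)

sample.var <- sum((x - mean(x))^2) / (n - 1)  # sample variance: divide by n - 1
pop.var    <- sum((x - mean(x))^2) / n        # population variance: divide by n

all.equal(sample.var, var(x))        # TRUE: var() returns the sample variance
all.equal(sqrt(sample.var), sd(x))   # TRUE: sd() is the square root of var()
pop.var < sample.var                 # TRUE: dividing by n gives a smaller value
```

Note that var() and sd() in R always use the n - 1 denominator; there is no argument that switches them to the population formula.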
  1. PERCENTILES and QUARTILES (Q1, Q2, Q3)
quantile(short.df$LSR, c(.20, .40, .60, .80))
##      20%      40%      60%      80% 
## 3.371718 4.689768 5.802857 6.668333

The steps for determining the kth percentile (where k is any number between 1 and 100) can be found here.
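The quartiles are simply the 25th, 50th and 75th percentiles, so they can be obtained with the same function. A minimal sketch on hypothetical values:

```r
x <- c(5.8, 5.1, 6.3, 5.5, 4.9, 4.7, 5.5, 4.5, 4.6, 5.7)  # hypothetical values

quartiles <- quantile(x, probs = c(0.25, 0.50, 0.75))  # Q1, Q2, Q3
quartiles

# Q2 (the 50th percentile) is the median
all.equal(unname(quartiles["50%"]), median(x))   # TRUE

# quantile() offers nine interpolation rules (type = 1, ..., 9);
# type = 7 is the default, and other types can give slightly different values
quantile(x, probs = 0.25, type = 7)
quantile(x, probs = 0.25, type = 6)
```

Because of these different rules, percentiles computed by hand or in other software may differ slightly from R's default output.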

  1. RANGE and INTERQUARTILE RANGE (IQR)
range(short.df$nPVI.V)
## [1] 24.07062 80.14200
IQR(short.df$nPVI.V)
## [1] 6.341466
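range() returns the minimum and the maximum rather than a single number, and IQR() is the distance between the first and the third quartile. A minimal sketch on hypothetical values:

```r
x <- c(24.1, 37.0, 40.0, 43.4, 80.1, 39.2, 41.5, 36.8)  # hypothetical values

r <- range(x)   # c(minimum, maximum)
diff(r)         # the range as a single number: maximum - minimum

q <- quantile(x, c(0.25, 0.75))
unname(q[2] - q[1])   # Q3 - Q1 ...
IQR(x)                # ... which is what IQR() returns
```

The range depends entirely on the two most extreme values, whereas the IQR describes the middle 50% of the data and is therefore much less sensitive to outliers.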
  1. A SIX-NUMBER SUMMARY
summary(short.df)
##     speaker       ISR        text         LSR            nPVI.V     
##  ADPO   : 15   fast :57   poe  :95   Min.   :1.847   Min.   :24.07  
##  AGŚW   : 15   norm :57   sent :95   1st Qu.:3.608   1st Qu.:37.03  
##  AGWA   : 15   slow :57   story:95   Median :5.284   Median :40.01  
##  BŁKI   : 15   vfast:57              Mean   :5.121   Mean   :40.51  
##  DAKO   : 15   vslow:57              3rd Qu.:6.444   3rd Qu.:43.38  
##  JAWY   : 15                         Max.   :9.113   Max.   :80.14  
##  (Other):195                                                        
##      rPVI.C      
##  Min.   : 34.92  
##  1st Qu.: 48.88  
##  Median : 61.16  
##  Mean   : 67.87  
##  3rd Qu.: 84.13  
##  Max.   :155.42  
## 
  1. HISTOGRAMS - analyzing the shape of data

In the first step you should tidy your dataset, so that it can be used as an argument in the normality test:

# ungroup() removes the grouping left over from summarise();
# pull() extracts the LSR column as a plain numeric vector
normal.tempo <- short.df %>% ungroup() %>% filter(ISR == 'norm') %>% pull(LSR)
hist(normal.tempo)

library(ggplot2)
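# the y axis shows density, not counts (the fill colour encodes the counts);
# after_stat() needs ggplot2 >= 3.3.0, older versions use ..density.. and ..count..
short.df %>% filter(ISR == 'norm') %>%
  ggplot(aes(x = LSR)) +
  geom_histogram(breaks = c(4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5), aes(y = after_stat(density), fill = after_stat(count)), color = 'black') +
  geom_density(color = 'magenta') +
  xlab('LSR (only normal tempo)') + ylab('density') + theme_bw()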

It seems that the LSR data at the normal intended speaking rate are symmetric. Moreover, the shape of the data fits a bell-shaped curve, which suggests a normal (Gaussian) distribution. To check this observation we perform the Shapiro-Wilk test:

shapiro.test(normal.tempo)
## 
##  Shapiro-Wilk normality test
## 
## data:  normal.tempo
## W = 0.99024, p-value = 0.9265

The p-value is well above 0.05, so the test gives no evidence against normality. A normal distribution has important implications: if the data are normally distributed, we can apply the EMPIRICAL RULE (or 68-95-99.7 Rule) not only to describe our data, but also to describe the population.
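The percentages in the empirical rule follow directly from the normal distribution function. A minimal sketch; the mean and standard deviation used for the simulated data are hypothetical:

```r
# Share of a normal distribution within k standard deviations of the mean
k <- 1:3
within.k <- pnorm(k) - pnorm(-k)
round(within.k, 4)   # 0.6827 0.9545 0.9973

# The same check on simulated data (hypothetical mean and sd)
set.seed(1)
sim <- rnorm(10000, mean = 5.4, sd = 1)
mean(abs(sim - mean(sim)) <= 2 * sd(sim))   # close to 0.95
```

For data that are approximately normal, about 95% of the observations are therefore expected within two standard deviations of the mean.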

This is not the case for the LSR data at the slow intended speaking rate: the histogram is asymmetric and skewed right (i.e., it has a tail going off to the right), and the Shapiro-Wilk test below rejects normality. Speech delivered at a slow rate also has a lower mean (3.85 vs. 5.42) and a lower standard deviation (0.84 vs. 0.97) than speech at the normal rate. The smaller standard deviation is visible in the greater concentration of the bins around the mean (a flatter histogram generally has more variability than a bell-shaped histogram of a similar range). Note that slow.tempo is taken from the unaggregated tempo.df, so the sample is much larger than normal.tempo.

# pull() extracts the LSR column as a plain numeric vector
slow.tempo <- tempo.df %>% filter(ISR == 'slow') %>% pull(LSR)
shapiro.test(slow.tempo)
## 
##  Shapiro-Wilk normality test
## 
## data:  slow.tempo
## W = 0.9818, p-value = 1.254e-11
hist(slow.tempo)

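# the y axis shows density, not counts (the fill colour encodes the counts);
# after_stat() needs ggplot2 >= 3.3.0, older versions use ..density.. and ..count..
tempo.df %>% filter(ISR == 'slow') %>%
  ggplot(aes(x = LSR)) +
  geom_histogram(breaks = c(1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7), aes(y = after_stat(density), fill = after_stat(count)), color = 'black') +
  geom_density(color = 'magenta') +
  xlab('LSR (only slow tempo)') + ylab('density') + theme_bw()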
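A right skew can also be expressed numerically: the long right tail pulls the mean above the median, and the moment-based skewness coefficient is positive. A minimal sketch on a hypothetical right-skewed sample (the formula below is one common definition of sample skewness; dedicated packages may use a slightly different one):

```r
# Hypothetical right-skewed sample: most values are low, a few are high
x <- c(2.1, 2.3, 2.4, 2.6, 2.7, 2.9, 3.0, 3.4, 4.8, 6.5)

mean(x) > median(x)   # TRUE: the right tail pulls the mean upwards

# Moment-based sample skewness; positive values indicate a right skew
skewness <- mean((x - mean(x))^3) / mean((x - mean(x))^2)^1.5
skewness
```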

  1. BOXPLOTS
short.df %>% ggplot(aes(factor(ISR), LSR)) + geom_boxplot(aes(fill = factor(ISR)))
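By default the boxes are drawn in the alphabetical order of the factor levels (fast, norm, slow, vfast, vslow). Reordering the levels gives a more natural progression from the slowest to the fastest tempo. A minimal sketch on a hypothetical vector of tempo labels, assuming that vslow and vfast stand for "very slow" and "very fast":

```r
# Hypothetical tempo labels, using the level names shown in summary(short.df)
isr <- factor(c("fast", "norm", "slow", "vfast", "vslow", "norm"))
levels(isr)   # alphabetical: "fast" "norm" "slow" "vfast" "vslow"

# Same values, but with the levels listed from slowest to fastest
isr.ordered <- factor(isr, levels = c("vslow", "slow", "norm", "fast", "vfast"))
levels(isr.ordered)   # "vslow" "slow" "norm" "fast" "vfast"
```

Applying the same factor() call to short.df$ISR before plotting should arrange the boxplots in that order.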

PART 3: Descriptive statistics for qualitative data

  1. PERCENTAGES & PROPORTIONS
    Categorical (qualitative) data are often summarized by reporting the percentage or proportion of observations that falls into each category.
tempo.cat <- tempo.df %>% select(ISR:LSR)
tempo.cat.2 <- tempo.cat %>% group_by(ISR) %>% summarise(total=n())
summarise(tempo.cat.2, sum(total))
# express each count as a percentage of all observations
tempo.cat.2 <- tempo.cat.2 %>% mutate(percentage = total * 100 / sum(total)) %>% arrange(desc(percentage))
tempo.cat.2
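Base R offers a compact alternative for the same summary: table() counts the observations per category and prop.table() converts the counts into proportions. A minimal sketch on a hypothetical vector of tempo labels:

```r
# Hypothetical tempo labels
isr <- c("norm", "norm", "slow", "fast", "slow", "norm", "fast", "norm")

counts      <- table(isr)          # absolute frequencies
proportions <- prop.table(counts)  # relative frequencies, summing to 1
percentages <- 100 * proportions

sort(percentages, decreasing = TRUE)
```

A proportion lies between 0 and 1, while a percentage is the same value multiplied by 100.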
  1. PIE CHART
    A pie chart takes categorical data and breaks them down by class/category showing the proportion of observations that fall into each category.
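A pie chart can be drawn with the base R function pie(). The sketch below uses hypothetical counts per tempo category rather than the tempo data:

```r
# Hypothetical counts per tempo category
counts <- c(vslow = 12, slow = 18, norm = 30, fast = 25, vfast = 15)
shares <- round(100 * counts / sum(counts))   # percentage per category

pie(counts,
    labels = paste0(names(counts), " (", shares, "%)"),
    main   = "Share of observations per tempo category")
```

For the tempo data, pie(table(tempo.df$ISR)) should produce the corresponding chart. With many categories of similar size, a bar graph is usually easier to read than a pie chart.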

  2. BAR GRAPH
    A bar graph takes categorical data and breaks them down by class/category, showing the number or percentage of observations in each category (-> absolute or relative frequency). The amount of data in each category is represented by bars of different lengths.

tempo.cat %>% arrange(LSR) %>% ggplot(aes(x=ISR)) + geom_bar(alpha=0.6) + theme_bw()

tempo.cat %>% arrange(LSR) %>% ggplot(aes(x=ISR)) + geom_bar(aes(fill=text), position="dodge") + theme_bw()