We will be working on a dataset called *tempo* which contains speaking rate and rhythm data for 20 speakers producing different types of texts: a) a story, b) isolated sentences that differ in terms of the phonotactic structure (simple -> complex) and c) excerpts of poems representing various poetic metres (e.g., iambs, trochees). The data was collected specifically for research on speaking rate and speech rhythm.

Before you move on to study PART 2, make sure that you understand the basic terms:

a) data types

- *quantitative* - numerical (numeric); can be continuous (e.g., duration, pitch, reaction time) or discrete (word frequency, utterance length in syllables or words, word count in a phrase)

- *qualitative* - categorical, e.g. pitch accent type, stress level, lexical tone category, syntactic category

- *ordinal* - categories with a natural order, e.g. position of the word in a phrase (-> initial, medial, final), strength of a foreign accent (e.g., on a 5-level scale), speaking rate (e.g., very slow … very fast)

b) variable: *dependent* (response), *independent* (explanatory, predictor)

c) population & sample

d) randomization & bias

e) statistic and parameter
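As a rough illustration of terms c) to e), the sketch below simulates a hypothetical population (not the *tempo* data), draws a random sample from it, and compares the sample statistic with the population parameter. All names and numbers are made up for the example.

```r
# Hypothetical population of 100,000 speaking-rate values (simulated, NOT real data)
set.seed(42)
population <- rnorm(100000, mean = 5, sd = 1)

# A parameter describes the whole population
parameter <- mean(population)

# Random sampling gives every unit the same chance of selection,
# which guards against selection bias
random.sample <- sample(population, size = 50)

# A statistic describes the sample and serves as an estimate of the parameter
statistic <- mean(random.sample)

c(parameter = parameter, statistic = statistic)
```

The two values will usually be close but not identical; the difference is sampling error.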

```
library(readxl)
tempo.df <- read_excel("tempo.xlsx")
```

From now on we will use the *tempo.df* data frame.

You can use the following commands to have a closer look at the structure of your data:

**class(), dim(), names(), str(), glimpse(), summary(), slice()**

(some of them require loading additional packages first, e.g. *dplyr* for glimpse() and slice()).
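A minimal sketch of these inspection functions on a small, hypothetical data frame (`toy.df` is invented for the example and only mimics a few *tempo* columns):

```r
# Hypothetical toy data frame (NOT the real tempo data)
toy.df <- data.frame(
  speaker = c("S1", "S1", "S2", "S2"),
  ISR     = c("norm", "slow", "norm", "slow"),
  LSR     = c(5.8, 3.9, 5.1, 3.2),
  stringsAsFactors = FALSE
)

class(toy.df)    # type of the object
dim(toy.df)      # number of rows and columns
names(toy.df)    # column names
str(toy.df)      # compact overview of every column
summary(toy.df)  # per-column summaries

# glimpse() and slice() come from dplyr, so they are only called if it is installed
if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr::glimpse(toy.df)
  dplyr::slice(toy.df, 2:3)  # rows 2 and 3 only
}
```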

`str(tempo.df)`

```
## Classes 'tbl_df', 'tbl' and 'data.frame': 5117 obs. of 6 variables:
## $ speaker: chr "ADPO" "ADPO" "ADPO" "ADPO" ...
## $ ISR : chr "norm" "norm" "norm" "norm" ...
## $ text : chr "story" "story" "story" "story" ...
## $ LSR : num 5.8 5.1 6.3 5.5 4.9 4.7 5.5 4.5 4.6 5.7 ...
## $ nPVI.V : num 40.3 57.8 44.1 44.7 50.1 ...
## $ rPVI.C : num 52.5 40.8 45.5 85.5 53.2 ...
```

To preview some rows you can use **head(), tail()** or, from the *dplyr* package, **slice(), sample_n() and sample_frac().**

`head(tempo.df)`

```
library(dplyr)
sample_n(tempo.df, 10)
```

`any(is.na(tempo.df))`

`## [1] TRUE`

To identify which variables (columns) contain missing values (labeled as **NA**), display the *summary* of the selected columns:

`summary(tempo.df[,4:6])`

```
##       LSR             nPVI.V           rPVI.C
##  Min.   : 1.100   Min.   :  0.11   Min.   :  1.50
##  1st Qu.: 3.200   1st Qu.: 30.74   1st Qu.: 43.40
##  Median : 4.500   Median : 38.58   Median : 62.70
##  Mean   : 4.654   Mean   : 40.82   Mean   : 73.83
##  3rd Qu.: 6.000   3rd Qu.: 48.47   3rd Qu.: 93.25
##  Max.   :10.100   Max.   :119.66   Max.   :469.91
##                   NA's   :4        NA's   :6
```

Alternatively, you can use **is.na()** or **complete.cases()**. Printing their raw output is NOT recommended for large datasets: is.na() returns one logical value per cell and complete.cases() one logical value per row.

There is no need to remove these values right now, but they will have to be handled at later stages of the analysis.
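One way to keep the output of is.na() and complete.cases() readable is to aggregate it. The sketch below uses a hypothetical toy data frame (the NA positions are invented) and counts missing values per column and incomplete rows overall.

```r
# Hypothetical toy data with some missing values (NOT the real tempo data)
toy.df <- data.frame(
  LSR    = c(5.8, 5.1, 6.3, 5.5),
  nPVI.V = c(40.3, NA, 44.1, 44.7),
  rPVI.C = c(52.5, 40.8, NA, NA)
)

# Number of missing values in each column
na.per.column <- colSums(is.na(toy.df))
na.per.column

# Number of rows with at least one missing value
n.incomplete <- sum(!complete.cases(toy.df))
n.incomplete
```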

First, convert character columns to factors:

```
tempo.df$text <- as.factor(tempo.df$text)
tempo.df$ISR <- as.factor(tempo.df$ISR)
tempo.df$speaker <- as.factor(tempo.df$speaker)
```

Second, check here to find information about *sampling distribution*.

Next, use the code below to transform the original data: each value of the dependent variable (i.e., LSR, nPVI.V and rPVI.C) should represent the *average* calculated for each level determined by the combination of the predictor/independent variables:

`short.df <- tempo.df %>% group_by(speaker, ISR, text) %>% summarise(LSR=mean(LSR), nPVI.V=mean(nPVI.V, na.rm = TRUE), rPVI.C=mean(rPVI.C, na.rm = TRUE))`

`head(short.df)`

In tempo.df there are three numerical continuous variables: **LSR** (laboratory measured speaking rate) and the so-called *rhythm metrics* that measure durational variability in vocalic and consonantal intervals: **nPVI.V** and **rPVI.C**.

- MEAN (AVERAGE)

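- Of a sample: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, where $n$ is the sample size

- Of a population: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$, where $N$ is the population size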

```
mean(short.df$LSR)
mean(short.df$nPVI.V)
mean(short.df$rPVI.C)
```

- MEDIAN

```
median(short.df$LSR)
median(short.df$nPVI.V)
median(short.df$rPVI.C)
```

The steps for determining the median of a data set can be found here.
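The usual procedure (sort the values, then take the middle one, or the mean of the two middle ones when the count is even) can be sketched by hand and compared with median(). The vector below reuses the first six LSR values shown in the str() output; the variable names are illustrative only.

```r
x <- c(5.8, 5.1, 6.3, 5.5, 4.9, 4.7)

x.sorted <- sort(x)       # step 1: order the values
n <- length(x.sorted)     # step 2: count them

manual.median <- if (n %% 2 == 1) {
  x.sorted[(n + 1) / 2]                  # odd n: the single middle value
} else {
  mean(x.sorted[c(n / 2, n / 2 + 1)])    # even n: mean of the two middle values
}

manual.median
median(x)  # built-in function, should agree
```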

- VARIANCE

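- Of a sample: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$ (this is what var() computes)

- Of a population: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$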

`var(short.df$rPVI.C)`

- STANDARD DEVIATION

It is calculated using the same formula as *variance*; the one additional step is to take the square root of the result.

`sd(short.df$rPVI.C)`
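A minimal sketch of the relation between variance and standard deviation, computed by hand on the first five rPVI.C values from the str() output. R's var() and sd() use the sample version (division by n - 1); the population version is shown only for comparison.

```r
x <- c(52.5, 40.8, 45.5, 85.5, 53.2)
n <- length(x)

squared.dev <- (x - mean(x))^2            # squared deviations from the mean

sample.var     <- sum(squared.dev) / (n - 1)  # sample variance
population.var <- sum(squared.dev) / n        # population variance (for comparison)
sample.sd      <- sqrt(sample.var)            # standard deviation = square root of variance

all.equal(sample.var, var(x))  # TRUE
all.equal(sample.sd, sd(x))    # TRUE
```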

- PERCENTILES and QUARTILES (Q1, Q2, Q3)

`quantile(short.df$LSR, c(.20, .40, .60, .80))`

```
## 20% 40% 60% 80%
## 3.371718 4.689768 5.802857 6.668333
```

The steps for determining the $k^{th}$ percentile (where *k* is any number between one and one hundred) can be found here.

- RANGE and INTERQUARTILE RANGE (IQR)

`range(short.df$nPVI.V)`

`## [1] 24.07062 80.14200`

`IQR(short.df$nPVI.V)`

`## [1] 6.341466`
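The link between quartiles, the IQR and the range can be checked on a small vector. The sketch below reuses the first ten LSR values from the str() output and relies on quantile() with its default settings.

```r
x <- c(5.8, 5.1, 6.3, 5.5, 4.9, 4.7, 5.5, 4.5, 4.6, 5.7)

# Quartiles are the 25th, 50th and 75th percentiles
q <- quantile(x, c(.25, .50, .75))
q

# IQR = Q3 - Q1
iqr.manual <- unname(q["75%"] - q["25%"])
all.equal(iqr.manual, IQR(x))  # TRUE

# range() returns the minimum and maximum; their difference is the width of the range
range.width <- diff(range(x))
range.width
```

Q2 is the median, so `q["50%"]` and `median(x)` give the same value.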

- A SIX-NUMBER SUMMARY

`summary(short.df)`

```
##     speaker       ISR         text         LSR            nPVI.V
##  ADPO   : 15   fast :57   poe  :95   Min.   :1.847   Min.   :24.07
##  AGŚW   : 15   norm :57   sent :95   1st Qu.:3.608   1st Qu.:37.03
##  AGWA   : 15   slow :57   story:95   Median :5.284   Median :40.01
##  BŁKI   : 15   vfast:57              Mean   :5.121   Mean   :40.51
##  DAKO   : 15   vslow:57              3rd Qu.:6.444   3rd Qu.:43.38
##  JAWY   : 15                         Max.   :9.113   Max.   :80.14
##  (Other):195
##      rPVI.C
##  Min.   : 34.92
##  1st Qu.: 48.88
##  Median : 61.16
##  Mean   : 67.87
##  3rd Qu.: 84.13
##  Max.   :155.42
##
```

- HISTOGRAMS - analyzing the *shape* of data

First, extract the values of interest as a plain numeric vector, so that they can be passed to hist() and to the normality test:

`normal.tempo <- short.df %>% ungroup() %>% filter(ISR=='norm') %>% pull(LSR)`

`hist(normal.tempo)`

```
library(ggplot2)
# y shows density (not raw counts), so that the density curve can be overlaid on the bars
short.df %>% filter(ISR=='norm') %>% ggplot(aes(x=LSR)) + geom_histogram(breaks=seq(4, 7.5, by=0.5), aes(y=after_stat(density), fill=after_stat(count)), color='black') + geom_density(color='magenta') + xlab('LSR (only normal tempo)') + ylab('density') + theme_bw()
```

It seems that the LSR data at the *normal intended* speaking rate are **symmetric**. Moreover, the shape of the data fits a *bell-shaped curve*, which suggests a **normal distribution** (or Gaussian distribution). To check this observation we perform the Shapiro-Wilk test, whose null hypothesis is that the data come from a normal distribution:

`shapiro.test(normal.tempo)`

```
##
## Shapiro-Wilk normality test
##
## data: normal.tempo
## W = 0.99024, p-value = 0.9265
```

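A normal distribution has important implications. The p-value of 0.93 gives no evidence against normality, so we can apply the EMPIRICAL RULE (or 68-95-99.7 rule), under which about 68%, 95% and 99.7% of values lie within one, two and three standard deviations of the mean, not only to describe our data, but also to describe the population.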
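The percentages behind the empirical rule can be recovered from the normal distribution function pnorm(), and checked approximately on simulated data. The simulated vector below is hypothetical (its mean and standard deviation are chosen only to resemble speaking-rate values) and is not the *tempo* data.

```r
# Theoretical coverage within 1, 2 and 3 standard deviations of the mean
coverage <- sapply(1:3, function(k) pnorm(k) - pnorm(-k))
round(coverage, 4)

# Approximate check on simulated, normally distributed values
set.seed(1)
x <- rnorm(10000, mean = 5.4, sd = 1)
empirical <- sapply(1:3, function(k) mean(abs(x - mean(x)) <= k * sd(x)))
round(empirical, 4)
```

The simulated proportions should be close to the theoretical ones, with small differences due to sampling.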

This is not the case for the LSR data at the *slow intended* speaking rate: the histogram is asymmetric and **skewed** right (i.e., it has a tail going off to the right), and the Shapiro-Wilk test below rejects normality. Speech delivered at a slow rate also has a lower mean (3.85 vs. 5.42) and a lower standard deviation (0.84 vs. 0.97) than speech at the normal rate. The lower variability shows in the greater concentration of the bins around the mean (-> a flatter histogram generally has more variability than a bell-shaped histogram of a similar range).

```
slow.tempo <- tempo.df %>% filter(ISR=='slow') %>% pull(LSR)
shapiro.test(slow.tempo)
```

```
##
## Shapiro-Wilk normality test
##
## data: slow.tempo
## W = 0.9818, p-value = 1.254e-11
```

`hist(slow.tempo)`

`tempo.df %>% filter(ISR=='slow') %>% ggplot(aes(x=LSR)) + geom_histogram(breaks=seq(1.5, 7, by=0.5), aes(y=after_stat(density), fill=after_stat(count)), color='black') + geom_density(color='magenta') + xlab('LSR (only slow tempo)') + ylab('density') + theme_bw()`

`short.df %>% ggplot(aes(factor(ISR), LSR)) + geom_boxplot(aes(fill = factor(ISR)))`

- PERCENTAGES & PROPORTIONS

Categorical (qualitative) data are often summarized by reporting the percentage or proportion of observations that falls into each category.

```
tempo.cat <- tempo.df %>% select(ISR:LSR)
tempo.cat.2 <- tempo.cat %>% group_by(ISR) %>% summarise(total=n())
```

```
summarise(tempo.cat.2, sum(total))
#use the result to calculate proportions
```

```
# sum(total) is the number of rows in tempo.df (5117), so the new column holds percentages
tempo.cat.2 <- tempo.cat.2 %>% mutate(percentage=(total*100)/sum(total)) %>% arrange(desc(percentage))
tempo.cat.2
```
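Base R offers a shorter route to the same kind of summary through table() and prop.table(). The sketch below uses a short hypothetical vector of ISR labels rather than the real data.

```r
# Hypothetical ISR labels (NOT the real tempo data)
isr <- c("norm", "norm", "slow", "fast", "slow", "norm", "vfast", "norm")

counts      <- table(isr)            # absolute frequency of each category
proportions <- prop.table(counts)    # relative frequency (sums to 1)
percentages <- round(100 * proportions, 1)

counts
proportions
percentages
```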

PIE CHART

A pie chart takes categorical data and breaks them down by class/category showing the proportion of observations that fall into each category.

BAR GRAPH

A bar graph takes categorical data and breaks them down by class/category showing the number or percentage of observations in each category (-> absolute or relative frequency). The amount of data in each category is represented by bars of different lengths.

`tempo.cat %>% arrange(LSR) %>% ggplot(aes(x=ISR)) + geom_bar(alpha=0.6) + theme_bw()`

`tempo.cat %>% arrange(LSR) %>% ggplot(aes(x=ISR)) + geom_bar(aes(fill=text), position="dodge") + theme_bw()`
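A pie chart of the same kind of data can be sketched along similar lines. The example below uses hypothetical counts rather than the real *tempo* data, and draws the pie as a single stacked bar in polar coordinates with `coord_polar()`, which is one common ggplot2 idiom and not the only way to do it.

```r
# Hypothetical category counts (NOT the real tempo data)
toy.counts <- data.frame(
  ISR   = c("fast", "norm", "slow"),
  total = c(30, 50, 20)
)
toy.counts$percentage <- 100 * toy.counts$total / sum(toy.counts$total)

# The plot is only drawn if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  pie.plot <- ggplot(toy.counts, aes(x = "", y = percentage, fill = ISR)) +
    geom_col(width = 1, color = "black") +  # one stacked bar ...
    coord_polar(theta = "y") +              # ... bent into a circle
    theme_void()
  print(pie.plot)
}
```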