Distributions and probabilities

PART 1: Random variable and the binomial distribution

RANDOM VARIABLE
A random variable is a variable of a specific data type (quantitative or qualitative) whose value changes randomly, i.e. you cannot tell exactly what its next outcome (value) will be. However, the variable may take some values more often than others, and you can use this knowledge to predict its next outcome.

discrete random variable: its values/outcomes are whole numbers, e.g. the number of people (out of a random sample of 100 people) who voted ‘yes’ on some issue

can be finite or countably infinite (e.g. the number of car accidents occurring at a specific intersection within a 10-year period)

continuous random variable: its values can be any number within an interval, e.g. response time in a word identification task

is always uncountably infinite: it can take so many possible values that you cannot list them

PROBABILITY DISTRIBUTION p(x)
The distribution of a random variable X is a listing, graph or function of all of its possible outcomes and how often each outcome (x) or set of outcomes occurs.

A probability distribution is defined as the distribution of the probabilities of all the possible outcomes (x) of a random variable (X). For example, if you roll a die (X), the possible outcomes (x) have equal probabilities, that is P(X=1)=1/6, P(X=2)=1/6, etc.
The mean, standard deviation and variance of a random variable X have the same notation as the respective statistics of the whole population. If probabilities are equal for all possible x (values/outcomes) that a random variable X can take, then the mean is the average of all the possible outcomes (irrespective of the number of trials).
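The die example can be checked numerically. The sketch below assumes a fair six-sided die; the variable names (x, p, mu, sigma2, sigma) are chosen only for illustration. It computes the mean as the probability-weighted sum of the outcomes and shows that, with equal probabilities, this is just the plain average of the outcomes.

```r
# Outcomes of a fair six-sided die and their (equal) probabilities
x <- 1:6
p <- rep(1 / 6, length(x))

# The probabilities of all possible outcomes must sum to 1
total_p <- sum(p)

# Mean as the probability-weighted sum of the outcomes
mu <- sum(x * p)

# Variance and standard deviation of the random variable
sigma2 <- sum((x - mu)^2 * p)
sigma <- sqrt(sigma2)

# With equal probabilities the mean equals the plain average of the outcomes
c(mu = mu, plain_average = mean(x), variance = sigma2, sd = sigma)
```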

BINOMIAL VARIABLE
A binomial random variable is a discrete random variable that counts the successes in a series of trials and is characterized by the following features:

the number of trials (n) is fixed
each trial has only two possible outcomes, e.g. yes/no, success/failure or 1/0
the probability (p) of yes/success/1 is the same for each trial
the trials are independent

The probability distribution of a binomial variable is determined by the number of trials (n) and the probability of success (p).

calculate using the formula: $P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}$, where x is the number of successes
calculate using the binomial table; to find greater-than/less-than probabilities, sum all the probabilities from the table for the values that are greater/less than your pre-determined level (i.e. number of successes, denoted x)
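Both approaches can be reproduced in R. The sketch below uses hypothetical values (n = 10, p = 0.3, x = 4) that are not taken from the notes. It compares the formula with the built-in `dbinom()` and obtains less-than and greater-than probabilities by summing individual probabilities, as you would with the table.

```r
n <- 10    # number of trials (hypothetical)
p <- 0.3   # probability of success on each trial (hypothetical)
x <- 4     # number of successes of interest (hypothetical)

# P(X = x) from the formula: choose(n, x) * p^x * (1 - p)^(n - x)
p_formula <- choose(n, x) * p^x * (1 - p)^(n - x)

# The same probability from R's built-in binomial probability function
p_builtin <- dbinom(x, size = n, prob = p)

# "Less than x": sum the individual probabilities for 0, ..., x - 1
p_less <- sum(dbinom(0:(x - 1), size = n, prob = p))

# "Greater than x": sum the individual probabilities for x + 1, ..., n
p_greater <- sum(dbinom((x + 1):n, size = n, prob = p))

c(formula = p_formula, builtin = p_builtin, less = p_less, greater = p_greater)
```

The cumulative function `pbinom()` returns the same sums directly, e.g. `pbinom(x - 1, n, p)` for the less-than case.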

Statistics

mean: $\mu = np$
variance: $\sigma^2 = np(1-p)$
standard deviation: $\sigma = \sqrt{np(1-p)}$
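As a rough check, the closed-form statistics of a binomial variable (mean n*p, variance n*p*(1-p)) can be compared with values computed directly from its probability distribution. The values n = 10 and p = 0.3 are hypothetical and serve only as an illustration.

```r
n <- 10    # number of trials (hypothetical)
p <- 0.3   # probability of success (hypothetical)

# Closed-form statistics of a binomial random variable
mu <- n * p
sigma2 <- n * p * (1 - p)
sigma <- sqrt(sigma2)

# Cross-check: compute the same statistics from the full distribution
x <- 0:n
px <- dbinom(x, size = n, prob = p)
mu_check <- sum(x * px)
sigma2_check <- sum((x - mu_check)^2 * px)

c(mean = mu, variance = sigma2, sd = sigma)
```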

EXERCISES

You flip a coin 25 times and record the number of heads. What is the binomial random variable (X) in this experiment?
You roll a six-faced die 10 times and record which face comes up each time (X). Why is X not a binomial random variable?
What is the mean of a binomial random variable with n = 18 and p = 0.4?

PART 2: The normal distribution

A continuous random variable has a normal distribution if it is characterized by the following features:

the distribution of all possible values of X is symmetric and can be approximated using a bell-curve
median = mean
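the standard deviation is the distance from the mean to an inflection point (i.e. a point where the bell curve changes from curving downward to curving upward)

Consequently, the normal distribution follows the Empirical Rule (68/95/99.7): about 68% of the values lie within 1 standard deviation of the mean, about 95% within 2 and about 99.7% within 3.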
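The 68/95/99.7 figures can be reproduced with `pnorm()`, which returns left-tail probabilities of a normal distribution. This is a minimal sketch on the standard normal distribution; the names k and within_k are illustrative.

```r
# Number of standard deviations on either side of the mean
k <- 1:3

# Probability of falling within k standard deviations of the mean
within_k <- pnorm(k) - pnorm(-k)

# Expressed as percentages
round(100 * within_k, 1)   # 68.3 95.4 99.7
```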

STANDARD NORMAL (Z-) DISTRIBUTION

a member of the normal distribution family
the standard to which one can refer when describing and analysing any other normal distribution
used to find probabilities for values of a normally distributed variable X
mean = 0, standard deviation = 1
values: z-scores or standard scores
follows the 68/95/99.7 rule

Z-score standardization makes it possible to compare values that come from different distributions and/or samples, e.g. comparing speaking rate measures or syllable durations in speech samples obtained from different speakers.
To transform values of a variable X to z-scores, use the formula $z = \frac{x - \mu}{\sigma}$, where $\mu$ is the mean and $\sigma$ the standard deviation of X.
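A small sketch of the standardization step, z = (x - mean) / sd, applied to two sets of syllable durations. The duration values and the helper `to_z()` are made up for illustration and are not part of the class dataset.

```r
# Hypothetical syllable durations (ms) for two speakers; illustration only
speaker_a <- c(120, 150, 180, 210, 240)
speaker_b <- c(90, 100, 110, 120, 130)

# z = (x - mean) / sd, applied to every value of a vector
to_z <- function(x) (x - mean(x)) / sd(x)

z_a <- to_z(speaker_a)
z_b <- to_z(speaker_b)

# After standardization each set has mean 0 and standard deviation 1,
# so values from the two speakers can be compared on a common scale
round(c(mean_a = mean(z_a), sd_a = sd(z_a), mean_b = mean(z_b), sd_b = sd(z_b)), 10)
```

Base R's `scale()` performs the same transformation and returns a one-column matrix.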

Finding probabilities for Z with the Z-table

The number that you find in the Z-table represents p(Z < z), i.e. the probability that the random variable Z is less than this value; it also corresponds to the proportion of z-values that are smaller than your number.

For any continuous random variable, the probability that X is equal to any single value is 0: unlike for discrete random variables, probability for a continuous random variable is an area (and NOT a single point) under the curve.
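In R, `pnorm()` plays the role of the Z-table: it returns the left-tail probability p(Z < z). The sketch below uses two arbitrary z-values (-1.5 and 1.25) chosen only for illustration, and also shows why a single point carries no probability.

```r
# Left-tail probability p(Z < -1.5), as read from the Z-table
p_left <- pnorm(-1.5)

# Right-tail probability p(Z > 1.25) = 1 - p(Z < 1.25)
p_right <- pnorm(1.25, lower.tail = FALSE)

# Probability of an interval: difference of two left-tail probabilities
p_between <- pnorm(1.25) - pnorm(-1.5)

# A single point has zero width, so the area above it is 0
p_point <- pnorm(1.25) - pnorm(1.25)

round(c(left = p_left, right = p_right, between = p_between, point = p_point), 4)
```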

load the dataset containing syllable durations

library(readxl)
df <- read_xlsx('stats_class_1.xlsx', sheet = 2)
str(df)

## Classes 'tbl_df', 'tbl' and 'data.frame':    2242 obs. of  1 variable:
##  $ duration: num  122 224 118 179 196 148 98 116 240 171 ...

calculate mean and standard deviation

(meanDur <- mean(df$duration))

## [1] 170.4014

(sdDur <- sd(df$duration))

## [1] 73.29609

determine the probability that a syllable duration is less than 100 ms

(z = (100-meanDur)/sdDur) # convert to a z-score

## [1] -0.9605072

# p(X < 100) = p(Z < -0.96); probability read from the Z-table
p = 0.1685

determine the probability that a syllable duration is greater than 300 ms

(z = (300-meanDur)/sdDur)

## [1] 1.768151

# p(X > 300) = p(Z > 1.77) = 1 - p(Z < 1.77); p(Z < 1.77) = 0.9616 from the Z-table
(p = 1 - 0.9616)

## [1] 0.0384

determine the probability that a syllable duration is between 140 and 200 ms

(z = (140-meanDur)/sdDur)

## [1] -0.4147756

(z = (200-meanDur)/sdDur)

## [1] 0.403822

# p(140 < X < 200) = p(-0.41 < Z < 0.40) = p(Z < 0.40) - p(Z < -0.41); both values from the Z-table
(p = 0.6554 - 0.3409)

## [1] 0.3145
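The three table lookups above can be cross-checked with `pnorm()`, which accepts the mean and standard deviation directly. This sketch hard-codes the mean and standard deviation reported above instead of reading the Excel file, so that it runs on its own. The results may differ slightly from the table-based values, because the table rounds z to two decimals.

```r
# Mean and standard deviation reported above for the syllable durations (ms)
meanDur <- 170.4014
sdDur <- 73.29609

# p(X < 100): left-tail probability
p_less_100 <- pnorm(100, mean = meanDur, sd = sdDur)

# p(X > 300): right-tail probability
p_greater_300 <- pnorm(300, mean = meanDur, sd = sdDur, lower.tail = FALSE)

# p(140 < X < 200): difference of two left-tail probabilities
p_between <- pnorm(200, meanDur, sdDur) - pnorm(140, meanDur, sdDur)

round(c(p_less_100, p_greater_300, p_between), 4)
```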

Determining percentile values from probabilities

It is possible to find the value of any percentile given a probability and the mean and standard deviation of X. For example, knowing that the mean speaking rate in a speech corpus was 5.9 syl./sec. with a standard deviation of 1.2 syl./sec., you can find the cutoff of the 10% slowest rates: find a such that p(X < a) = 0.10 (the 10th percentile). To find the cutoff of the 10% fastest rates, first express the top 10% as the 90th percentile and then find a such that p(X < a) = 0.90, i.e. p(X > a) = 1 - 0.90 = 0.10.
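The speaking-rate example can be sketched with `qnorm()`, the inverse of `pnorm()`: it takes a left-tail probability and returns the matching cutoff. The mean (5.9) and standard deviation (1.2) come from the example in the text; the variable names are illustrative.

```r
mu <- 5.9      # mean speaking rate (syl./sec.), from the example
sigma <- 1.2   # standard deviation (syl./sec.), from the example

# Cutoff of the 10% slowest rates: the 10th percentile.
# Step by step: find the z-score first, then convert back to the scale of X
z10 <- qnorm(0.10)
a_slow <- mu + z10 * sigma

# Cutoff of the 10% fastest rates: the 90th percentile, in one step
a_fast <- qnorm(0.90, mean = mu, sd = sigma)

round(c(slowest_10 = a_slow, fastest_10 = a_fast), 2)
```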

Please determine the speaking rate value that marks:

20% of the fastest rates in a data set
15% of the slowest rates in a data set

Normal approximation to the binomial

If the number of trials in a binomial is large enough, that is n * p >= 10 and n * (1 - p) >= 10, it is possible to find binomial probabilities by applying the normal approximation. For example, you randomly select 100 syllables (n = 100) from a corpus and record how many of them had lexical stress (this is your X, the binomial).
Please find the probability that X is A. less than 22, B. less than 40, C. greater than 60, assuming that p = 0.33.

meanX = 100*0.33 # n*p 
sdX = sqrt(meanX*0.67) # sqrt(n*p(1-p))
(z = (22-meanX)/sdX) # convert to a z-score

## [1] -2.339367

# p(X < 22) = p(Z < -2.34); probability read from the Z-table
p = 0.0096
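To get a feeling for the quality of the approximation, the result for case A can be compared with the exact binomial probability from `pbinom()`. This sketch covers case A only; the names p_approx and p_exact are illustrative. Since X is a count, "less than 22" means "at most 21" in the exact calculation.

```r
n <- 100    # number of syllables sampled
p <- 0.33   # assumed probability of lexical stress

# Rule of thumb for using the normal approximation
rule_ok <- (n * p >= 10) && (n * (1 - p) >= 10)

meanX <- n * p
sdX <- sqrt(n * p * (1 - p))

# Normal approximation to p(X < 22)
p_approx <- pnorm(22, mean = meanX, sd = sdX)

# Exact binomial probability: X < 22 is the same as X <= 21 for a count
p_exact <- pbinom(21, size = n, prob = p)

c(rule_ok = rule_ok, approx = p_approx, exact = p_exact)
```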

PART 3: The t-distribution

Characteristic features

applies to a continuous random variable
can be approximated with a bell-shaped curve
like the Z-distribution, its mean = 0, but its standard deviation is larger, which is reflected in a comparatively flatter bell curve with heavier tails
it is used to test claims concerning a population mean when the sample size is small and/or the standard deviation of the population is unknown (assuming a normal or close-to-normal distribution of the population)
is specified by degrees of freedom (df equals the number of observations minus 1); at about df = 29 (n = 30) the shape of the t-distribution closely approximates the shape of the Z-distribution
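How quickly the t-distribution approaches the Z-distribution can be seen by comparing one percentile across several df values. The df values below are arbitrary choices for illustration. The standard deviation formula sqrt(df / (df - 2)) holds for df > 2.

```r
# Degrees of freedom to compare (arbitrary choices)
df <- c(3, 5, 10, 29, 100)

# 97.5th percentile of each t-distribution
t_crit <- qt(0.975, df = df)

# The same percentile of the Z-distribution
z_crit <- qnorm(0.975)

# Standard deviation of a t-distribution (defined for df > 2)
t_sd <- sqrt(df / (df - 2))

data.frame(df = df,
           t_crit = round(t_crit, 3),
           z_crit = round(z_crit, 3),
           t_sd = round(t_sd, 3))
```

As df grows, the t percentile moves toward the Z percentile and the standard deviation moves toward 1.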

Finding probabilities using t-table

The t-table provides greater-than (or right-tail) probabilities for values on different t-distributions (i.e. with df between 1 and 30).

For a t-distribution with df = 10 what is p(t >= 1.81)?
For a t-distribution with df = 15 what is p(t >= 1.34)?
For a t-distribution with df = 27 what is p(t <= -2.05)?
For a t-distribution with df = 9 what is p(t >= 3.25) or p(t <= -3.25)?
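The same kind of lookup can be done with `pt()`. The sketch below deliberately uses values that are not among the exercises (df = 5, t = 2.015), so the exercises can still be solved with the table first.

```r
# Hypothetical example, not one of the exercises
df <- 5
t_val <- 2.015

# Right-tail probability p(t >= t_val), as given in the t-table
p_right <- pt(t_val, df = df, lower.tail = FALSE)

# Left-tail probability p(t <= -t_val); equal to p_right by symmetry
p_left <- pt(-t_val, df = df)

# Two-tailed probability p(t >= t_val or t <= -t_val)
p_two <- p_right + p_left

round(c(right = p_right, left = p_left, two_tailed = p_two), 3)
```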

Finding percentiles

What is the 95th percentile for a t-distribution with df = 10?
What is the 10th percentile for a t-distribution with df = 20?
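Percentiles of a t-distribution are returned by `qt()`, which takes a left-tail probability. The example uses df = 5 and the 90th and 10th percentiles, which are not among the exercise values.

```r
# Hypothetical example, not one of the exercises
df <- 5

# 90th percentile: the value with 90% of the distribution below it
p90 <- qt(0.90, df = df)

# 10th percentile: by symmetry the negative of the 90th percentile
p10 <- qt(0.10, df = df)

round(c(p10 = p10, p90 = p90), 3)
```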

Building confidence intervals with t-values

Confidence intervals are constructed using a pre-defined confidence level and a critical value that is related to this level. If the sample is large and/or the population standard deviation is known, the critical value is determined using the Z-distribution; otherwise the t-distribution is used for this purpose.
a) What is the t-value for a 95% confidence interval for a t-distribution with df = 23?
b) Using the first row (column headings) of the t-table, which column would you use to find the t-value for a 99% confidence interval?
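A minimal sketch of a t-based confidence interval for a mean, assuming the usual form mean ± t* × sd / sqrt(n). As data it uses the ten duration values visible in the `str()` output earlier in these notes; treating them as a sample is only for illustration.

```r
# First ten syllable durations (ms) shown in the str() output
x <- c(122, 224, 118, 179, 196, 148, 98, 116, 240, 171)
n <- length(x)

conf_level <- 0.95
alpha <- 1 - conf_level

# Critical t-value: alpha is split equally between the two tails
t_crit <- qt(1 - alpha / 2, df = n - 1)

# Standard error of the mean, estimated from the sample
se <- sd(x) / sqrt(n)

# Confidence interval: mean plus/minus the margin of error
ci <- mean(x) + c(-1, 1) * t_crit * se
ci
```

The built-in `t.test(x)$conf.int` returns the same interval.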

PART 4: The sampling distribution

It is based on sample averages rather than individual outcomes, i.e. a random variable is composed of values that represent sample means (denoted with $\bar{x}$), e.g. each person rolls a single die 10 times and records the average that she/he obtained for the sample.
Compared to a “standard” random variable (denoted with X) whose distribution is flat (every possible outcome, i.e. the numbers from 1 to 6, has the same probability), the sampling distribution (denoted with $\bar{X}$) has a different shape: while it has the same range of possible values (1-6) and the same center (3.5), values near the center are more likely than extreme ones, and its shape can be approximated with a bell curve.
The mean of the sampling distribution is the same as the mean of the population that the samples come from.
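The die example can be simulated. In this sketch the number of simulated people (10000) and the seed are arbitrary choices. Each simulated person rolls a fair die 10 times and records the average.

```r
set.seed(1)   # arbitrary seed, only for reproducibility

n_rolls <- 10       # rolls per person (sample size)
n_people <- 10000   # number of simulated samples (arbitrary)

# Each replicate: roll a fair die n_rolls times and record the average
sample_means <- replicate(n_people, mean(sample(1:6, n_rolls, replace = TRUE)))

mean(sample_means)    # close to the population mean of 3.5
range(sample_means)   # stays within the possible range 1 to 6

# hist(sample_means)  # uncomment to see the bell-like shape
```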

Standard error

It is the measure of variability in the sample mean and expresses how much your statistic (here: the average/mean) will vary from one sample to another. The standard error is proportional to the population standard deviation and decreases with sample size: $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$.
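The effect of sample size on the standard error (population standard deviation divided by the square root of n) can be sketched with the fair-die example. The sample sizes below are arbitrary; each is four times the previous one, so each standard error is half the previous one.

```r
# Population standard deviation of a fair six-sided die
x <- 1:6
sigma <- sqrt(mean((x - mean(x))^2))

# Standard error of the mean for several sample sizes (arbitrary choices)
n <- c(10, 40, 160)
se <- sigma / sqrt(n)

data.frame(n = n, se = round(se, 3))
```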

Central Limit Theorem

When a random variable X has a normal distribution, the sample means that come from the same population will also be normally distributed. Importantly, when the distribution of X is not normal (or is unknown), the sampling distribution will still be approximately normal IF the sample size is large enough (as a rule of thumb, at least 30).
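A simulation can illustrate the theorem for a clearly non-normal X. The sketch assumes an exponential distribution with rate 1 (mean 1, standard deviation 1), chosen only because it is strongly skewed; the number of samples and the seed are arbitrary.

```r
set.seed(2)   # arbitrary seed, only for reproducibility

n <- 30             # sample size
n_samples <- 5000   # number of simulated samples (arbitrary)

# X is exponential with rate 1: right-skewed, mean 1, standard deviation 1
sample_means <- replicate(n_samples, mean(rexp(n, rate = 1)))

# Center and spread of the simulated sampling distribution
mean_of_means <- mean(sample_means)   # close to the population mean 1
sd_of_means <- sd(sample_means)       # close to 1 / sqrt(30)

# Share of sample means within one standard error of the population mean;
# for a normal distribution this would be about 68%
se <- 1 / sqrt(n)
share_within_1se <- mean(abs(sample_means - 1) < se)

c(mean = mean_of_means, sd = sd_of_means, se = se, within_1se = share_within_1se)
```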

Probabilities for the sample mean

If $\bar{X}$ has a normal or approximately normal distribution (due to the Central Limit Theorem), then finding probabilities for the sample mean ($\bar{x}$) follows the same steps as finding probabilities for X.
To convert a sample mean to a z-score, use the formula $z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$.
Assuming that in the population of Germany the average number of languages spoken by a person is 2.1 with a standard deviation of 0.5, what is the probability that in a random sample of 50 persons the average number of languages spoken will be greater than 1.8?
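The steps can be sketched on a different example, so that the exercise above stays open. The sketch reuses the speaking-rate mean (5.9) and standard deviation (1.2) from Part 2 and assumes a hypothetical sample of 36 speakers with a sample mean of 5.6.

```r
mu <- 5.9      # population mean speaking rate (syl./sec.)
sigma <- 1.2   # population standard deviation (syl./sec.)
n <- 36        # sample size (hypothetical)
x_bar <- 5.6   # sample mean of interest (hypothetical)

# Standard error of the sample mean
se <- sigma / sqrt(n)

# z-score of the sample mean: divide by the standard error, not by sigma
z <- (x_bar - mu) / se

# p(sample mean < 5.6) and p(sample mean > 5.6)
p_less <- pnorm(z)
p_greater <- pnorm(z, lower.tail = FALSE)

round(c(z = z, p_less = p_less, p_greater = p_greater), 4)
```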

Agnieszka Wagner

7 April 2018