A probability distribution describes the probabilities of all the possible outcomes (x) of a random variable (X). For example, if you roll a die (X), the possible outcomes (x) have equal probabilities: P(X=1)=1/6, P(X=2)=1/6, etc.
The mean, standard deviation and variance of a random variable X use the same notation as the corresponding population parameters. If all the possible outcomes x that a random variable X can take are equally probable, then the mean is simply the average of all the possible outcomes (irrespective of the number of trials).
Consequently, the normal distribution follows the Empirical Rule: about 68% of values fall within 1 standard deviation of the mean, 95% within 2, and 99.7% within 3.
Z-score standardization makes it possible to compare values that come from different distributions and/or samples, e.g. comparing speaking rate measures or syllable durations in speech samples obtained from different speakers.
To transform values of a variable X to z-scores use the formula: z = (x − μ) / σ.
The number that you identify in the Z-table represents p(Z < z), i.e. the probability that the random variable Z is less than this value, which also corresponds to the percentage of z-values that are smaller than your number.
The probability that X is equal to any single value is 0 for any continuous random variable, because, unlike in the case of discrete random variables, the probability for a continuous random variable corresponds to an area (and NOT a single point) under the bell curve.
library(readxl)
df <- read_xlsx('stats_class_1.xlsx', sheet = 2)
str(df)
## Classes 'tbl_df', 'tbl' and 'data.frame': 2242 obs. of 1 variable:
## $ duration: num 122 224 118 179 196 148 98 116 240 171 ...
(meanDur <- mean(df$duration))
## [1] 170.4014
(sdDur <- sd(df$duration))
## [1] 73.29609
(z = (100-meanDur)/sdDur) # convert to a z-score
## [1] -0.9605072
# p(X < 100) = p(Z < -0.96);
p = 0.1685
(z = (300-meanDur)/sdDur)
## [1] 1.768151
# p(X > 300) = p(Z > 1.77) = 1 - p(Z < 1.77);
(p = 1 - 0.9616)
## [1] 0.0384
(z = (140-meanDur)/sdDur)
## [1] -0.4147756
(z = (200-meanDur)/sdDur)
## [1] 0.403822
# p(140 < X < 200) = p(-0.41 < Z < 0.40) = p(Z < 0.40) - p(Z < -0.41)
(p = 0.6554 - 0.3409)
## [1] 0.3145
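As a cross-check, the three Z-table lookups above can be done directly with R's pnorm(), which returns p(Z < z), or p(X < x) when a mean and sd are supplied; the small differences from the table-based answers come from rounding z to two decimals.

```r
# pnorm() gives left-tail probabilities directly, no Z-table needed
# (meanDur and sdDur restated here with the values computed above)
meanDur <- 170.4014
sdDur <- 73.29609
pnorm(100, mean = meanDur, sd = sdDur)                   # p(X < 100), ~0.168
1 - pnorm(300, mean = meanDur, sd = sdDur)               # p(X > 300), ~0.039
pnorm(200, meanDur, sdDur) - pnorm(140, meanDur, sdDur)  # p(140 < X < 200), ~0.318
```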
It is possible to estimate the value of any percentile given a probability, mean and standard deviation of X. For example, knowing that the mean speaking rate in a speech corpus was 5.9 syl./sec. with a standard deviation of 1.2 syl./sec., you can find the cutoff of the 10% slowest rates: find a, where p(X < a) = 0.10. If you want to find the cutoff of the 10% fastest rates: first, express the top 10% as the 90th percentile and then find a, where p(X < a) = 0.90 (equivalently, p(X > a) = 0.10).
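Instead of working backwards through the Z-table, the same percentiles can be computed with R's qnorm(); a sketch using the corpus figures from the example above:

```r
# qnorm() returns the value a such that p(X < a) equals the given
# probability (mean 5.9, sd 1.2 syl./sec., as in the example above)
qnorm(0.10, mean = 5.9, sd = 1.2)  # cutoff of the 10% slowest rates, ~4.36
qnorm(0.90, mean = 5.9, sd = 1.2)  # 90th percentile: cutoff of the 10% fastest, ~7.44
```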
Please determine the speaking rate values that mark:
If the number of trials in a binomial is large, that is n * p >= 10 and n * (1 - p) >= 10, binomial probabilities can be found by applying the normal approximation. For example, you randomly select 100 syllables (n = 100) from a corpus and record how many of them have lexical stress (this is your X, the binomial).
Please find the probability that X is A. less than 22, B. less than 40, C. greater than 60, assuming that p = 0.33.
meanX = 100*0.33 # n*p
sdX = sqrt(meanX*0.67) # sqrt(n*p(1-p))
(z = (22-meanX)/sdX) # convert to a z-score
## [1] -2.339367
# p(X < 22) = p(Z < -2.34)
p = 0.0096
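Parts B and C follow the same approximation; as a sketch, pnorm() can replace the Z-table lookup:

```r
# Normal approximation to the binomial, n = 100, p = 0.33 (as above)
meanX <- 100 * 0.33            # n * p
sdX <- sqrt(meanX * 0.67)      # sqrt(n * p * (1 - p))
pnorm((40 - meanX) / sdX)      # B. p(X < 40), ~0.93
1 - pnorm((60 - meanX) / sdX)  # C. p(X > 60), essentially 0
```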
The t-table provides greater-than (i.e. right-tail) probabilities for values from different t-distributions (with df between 1 and 30).
Confidence intervals are constructed using a pre-defined confidence level and a critical value related to that level. If the sample is large and/or the population standard deviation is known, the critical value is taken from the Z-distribution; otherwise the t-distribution is used.
a) What is the t-value for a 95% confidence interval for a t-distribution with df = 23?
b) Using the first row (column headings) of the t-table what column would you use to find the t-value for a 99% confidence interval?
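You can verify your table answers with R's qt(); note that for a 95% confidence interval the critical value leaves 2.5% in each tail, so the cumulative probability passed to qt() is 0.975.

```r
# qt() returns the t-value below which the given probability lies
qt(0.975, df = 23)  # t-value for a 95% CI with df = 23, ~2.07
qt(0.995, df = 23)  # t-value for a 99% CI (0.5% in each tail)
```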
It is based on sample averages rather than individual outcomes, i.e. the random variable is composed of values that represent sample means (denoted with x̄), e.g. each person rolls a single die 10 times and records the average that she/he obtained for her/his sample.
Compared to a “standard” random variable (denoted with X) whose distribution is flat (every possible outcome, i.e. the numbers from 1 to 6, has the same probability), the sampling distribution (denoted with x̄) has a different shape: while it has the same range (1-6) and center (3.5), the sample means cluster around the center, so its shape can be approximated with a bell curve (see here).
The mean of the sampling distribution is the same as the mean of the population that the samples come from.
The standard error is the measure of variability in the sample mean and is used to express how much your statistic (here, the average/mean) will vary from one sample to another. As shown in the formula below, the standard error is proportional to the population standard deviation and decreases with sample size: SE = σ / √n.
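For the dice example above, the standard error can be computed directly; a sketch, deriving the population sd of a single fair die from the definition of variance:

```r
# Population sd of one die roll: sqrt of the mean squared deviation from 3.5
sigma <- sqrt(mean((1:6 - 3.5)^2))  # ~1.71
n <- 10                             # rolls per sample
sigma / sqrt(n)                     # standard error of the sample mean, ~0.54
```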
When a random variable X has a normal distribution, sample means that come from the same population will also be normally distributed. Importantly, when the distribution of X is not normal (or is unknown), the sampling distribution will still approximate the normal distribution IF the sample size is large enough (at least 30).
If x̄ has a normal or approximately normal distribution (due to the assumptions of the Central Limit Theorem), then finding probabilities for the sample mean (x̄) follows the same steps as finding probabilities for X.
To convert a sample mean to a z-score use the formula z = (x̄ − μ) / (σ / √n).
Assuming that in the population of Germany the average number of languages spoken by a person is 2.1 with a standard deviation of 0.5, what is the probability that in a sample of 50 random persons the average number of languages spoken will be greater than 1.8?
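A sketch of the solution in R, following the steps above (convert the sample mean to a z-score using the standard error, then take the right-tail probability):

```r
mu <- 2.1; sigma <- 0.5; n <- 50
se <- sigma / sqrt(n)     # standard error, ~0.071
(z <- (1.8 - mu) / se)    # ~ -4.24
1 - pnorm(z)              # p(x_bar > 1.8), very close to 1
```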