# The Hitchikers Guide to Inferential Statistics

# Inferential vs Descriptive stats.

- Descriptive - Purely used to summarize data, like IQR, Mean. Importantly, they do not draw conclusions beyond the data we already have.
- Inferential - Does allow us to make conclusions beyond the data we have.

## Hypothesis Testing

Inferential statistics is based on the premise that you can’t prove something is true, but you can disprove something by finding an exception.

Decide what you are trying to find evidence for (alternative hypothesis), then set up the opposite as null hypothesis, and find evidence to disprove that.

Suppose you want to show a link between studying and test results.

- \(H_0\): There is no effect on test result from studying
- \(H_1\): There is an effect on test result from studying

Hypotheses are always about the population parameters (like the mean).

So the above should be stated mathematically:

- \(H_0\): \(\mu_0\) = \(\mu_1\)
- \(H_1\): \(\mu_0\) =/= \(\mu_1\)

If \(\mu_0\) and \(\mu_1\) are the means for the “no study” and “study” groups respectively.

General formula for test statistics: \(\frac{(Observed Data) - (What We Expect if the null is true)}{AverageVariation}\). This formula gets adapted for all the specific cases.

## Standardization

(Example from here)

How can we compare things that aren’t the same?

Suppose there are two tests: SAT and ACT.

Both try measure college readiness, but SAT is out of 1600 points and ACT is out of 36. How can you compare scores across these 2 tests?

- Centre both distributions around 0 (subtract the mean) Mean of SAT is 1000, and mean of ACT is 21, so subtract those.
- Scales are still wrong, so divide adjusted score by standard deviation

This adjusted score is called a Z-Score (see Definitions). \(Z = \frac{x - \mu}{\sigma}\).

Note the assumption that both the distribution of SAT and ACT are normally distributed!

## Normal Distribution

(Based on here)

Normal distribution is important not only because a lot of things are normally distributed, like, height, IQ etc… There are a lot of things like debt, blood pressure which are not normally distributed. But here’s the thing - distributions of means are normally distributed, even if populations aren’t. So, if you sample a lot from a population which is not distributed normally, the mean of those samples will be normally distributed!

In order to meaningfully compare whether two means are different, we need to know something about their distribution: the sampling distribution.

This is described formally in the **Central Limit Theorem** (see definitions).

Now, the mean of the distribution of sample means is the same as the population mean, its standard deviation is not.

However, the standard deviation of the distribution of sample means *is* related to the standard deviation of the population. But, it’s also related to the sample size. In the above graphs, see how the probability distributions of the graphs get thinner, as the number of dice goes up? The bigger your sample size, the closer your sample means are to the true population mean. So we need to adjust the original population standard deviation somehow to reflect this:

\(\sigma_{samplingDistribution} = \frac{\sigma_{population}}{\sqrt{n}}\)

The standard deviation of a sampling distribution is called the *standard error*.

Note here that the number of rolls does not affect the shape of the distribution, only how accurately we approximate it. Think about it, the true shape is determined by the number of dice we have. If we have fewer dice, we expect the mean value we get for them to fluctuate more wildly.

Can also use sampling distributions to compare other parameters - proportions, regression coefficients, or standard deviations, which also follow the central limit theorem.

## T-Distribution

They talk about T-dist closer to end of this vid

Continous, unimodal probability distribution that’s useful to represent sampling distributions. It changes its shape based on how much information there is. It’s got thicker tails for smaller sample sizes (as there is less information) - making the distribution less “tight” around the mean - showing that we’re more uncertain. For bigger and bigger sample sizes the distribution gets thinner tails, and eventually becomes the same as the normal distribution. Usually, for tests about means sample sizes >= 30, it’s usually the same as normal. For proportions, need at least 10 instances from each group in the proportion!

## Hypothesis Testing Applied

Cool definition from https://youtu.be/bf3egy7TQ2Q?list=PL8dPuuaLjXtNM_Y-bUAhblSAdWRnmBUcr&t=281:

__Null Hypothesis Significance Testing__ is a form of the reductio ad absurdum argument which tries to discredit an idea by assuming the idea is true, and then showing that if you make that assumption, something contradictory happens.

(Just some examples of reductio ad absurdum outside of stats: The Earth cannot be flat; otherwise, we would find people falling off the edge. There is no smallest positive rational number because, if there were, then it could be divided by two to get a smaller one. It’s like proof by contradiction.)

### P Value

A P-Value answers the question of how rare your data is, by telling you the probability of getting data that’s extreme as the data you observed, if the null hypothesis were true. P-value tells you is how extreme your sample is assuming that the null is true.

P-value = 0.1, then sample is in the top 10% most extreme samples we’d expect to see based on the distribution of sample means.

A 2-sided p-value will tell us the probability of getting a sample mean as extreme (on either side) as the one we’ve seen, under the null hypothesis. A 1 sided will tell us the probability of getting a sample mean as high (or low) as the one we’ve seen, under the null hypothesis.

NB: On Common misinterpretation of the a p-value is that it can telly ou the probability that the null hypothesis is true - IT CAN’T! Ronald Fisher said: “In general, tests of significance are based on hypothetical probabilities calculated from their null hypotheses. They do no generally lead to any probability statements about the real word, but to a rational and well-defines measure of reluctance to the acceptance of the hypotheses they test.” So, getting a p-value of 0.1 doesn’t mean there is a 10% chance the null hypothesis is true.

If P-Value is not lower than alpha, then we fail to reject \(H_0\). Notice that we say fail to rehect, not accept. Null hypothesis testing doesn’t allow us to “accept” or provide evidence that the null is true, instead we’ve only failed to provide evidence that it’s false. Abscense of evidence is not the evidence of abscense. Failint to reject the null doesn’t mean thtere isn’t an effect or relationship, just means we didn’t get enough evidence to say there definitely is one.

### Z - test vs T - test

Z - tests commonly done when population standard deviation is known. T - tests commonly done when population standard deviation is unknown. We use t - distribution because the unknown population standard deviation is now a value we must estimate.

### One Sample Z - Test

In the population the average IQ is 100 with a s.d. of 15. Give a sample of 30 participants a drug which either makes them smarter, stupider or has no effect. The sample of 30 people has a mean of 140. Did the medication affect intelligence?

Note here that the population standard deviation and mean is given. Very importantly, we’re testing a mean, the sample mean. The question we’re basically asking is how probable is it that we saw a sample mean of 140? Now, recall, from the above, sample means are distributed normally. And they will be centered around the population mean. And that the sample standard deviation is related to the population standard deviation. But remember, it’s actually the sample mean standard deviation, which gets thinner as we get more observations.

So, the normal distribution in the one sample Z-test is actually the distribution of sample means, under the null hypothesis. Mkay?

\(Z = \frac{\bar{x} - \mu}{\frac{\sigma}{\sqrt{n}}} = 14.60\)

Let’s take an \(\alpha = 0.05\) (which corresponds to Type I error happening 5% of the time - Type I error is the probability of rejecting Null Hypothesis even though the Null is correct). This is a 2 - tailed test, so rehect if Z less than -1.96 or greater than 1.96 :) HERE WE REJECT NULL HYPOTHESIS.

### One Sample Z- Test for proportions

Survery claims 9/10 docs recommend aspirin. Survery, random sample, of 100 docs. 82 claim they recommend aspirin. At alpha = 0.05, do we believe?

\(H_0\): \(p = 0.9\) \(H_q\): \(p \neq 0.9\)

The test statistic:

\(Z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}} = -2.667\)

\(\hat{p} = 0.82\) \(p_0 = 0.9\) \(n = 100\)

I think this comes from Bernoulli… Mean of Bernoulli is \(p_0 = 0.9\) and variance is \(p_0(1 - p_0) = 0.9 \times 0.1 = 0.09\)

So, really I think you can derive the test for proportions from the initial equation:

\(Z = \frac{\bar{x} - \mu}{\frac{\sigma}{\sqrt{n}}}\)

Except now, the population mean under the Null Hypothesis is \(p_0\), the sample mean is \(0.82\), the population variance, under the Null is \(0.9(1 - 0.1)\), which makes the sample mean’s s.d. \(\sqrt{\frac{p_0(1-p_0)}{n}}\). The thing about the Bernoulli is that if you’ve got the mean, you’ve got the variance.

Anyway, -2.667 < -1.96 so reject \(H_0\).

### One sample t - test

Population, average IQ = 100. New medication, either has negative, positive or no effect on intelligence. Sample of 30 participants have taken med, end up with a mean of 140 and s.d. of 20. Did medication affect intelligence?

\(H_0; \mu = 100\) \(H_1; \mu \neq 100\)

Degrees of freedom= \(n - 1 = 30 - 1 = 29\).

Critical values: -2.0452 & 2.0452.

\(t = \frac{\bar{x} - \mu}{s / \sqrt{n}} = 10.96\), reject \(H_0\).

### T Test - Degrees of Freedom and Effect Size

T-Distribution is like Z-Distribution, but has fatter tails.

Recall, we don’t know the population standard deviation in a T-Test. So, we estimate it using the sample standard deviation. So, we don’t have a perfect normal dist. But with bigger sample sizes, we’re better at estimating standard deviation, so T-Dist changes to look more and more like normal. More info means we’re more accurate. Degrees of Freedom can help us choose that accuracy.

Number of independent pieces of information we have in our data. If have 3 observations, and we know the mean of those 3 observations, and the value of 2 of the 3 observations, the last observations *has to* take on a certain value, for the mean to equal what it does. The last observation is not “Free” to take on any value. By contrast, if you know the mean of 3 observations, and only know the actual value of 1, the last 2 observations are free to take on a whole range of values.

In terms of t-test: When we calc the mean, we use up 1 of those degrees of freedom, or 1 piece of independent info. Amount of info we have, depends on our sample size \(n\). The more data you have, the more independent info that you have - but every time you make a calc like the mean, you’re using up 1 piece of independent information.

### Effect Size and Significance

You can sometimesm find that there is a statistically significant difference in 2 means, which might be what you wanted. But the effect size is so small (compared to random variation), that it may not be worth it. For example, a hair vitamin might cause your hair to grow a statistically significant extra couple of nano-meters per month, but since you hair grows 12 mm per month, and the vitamin is expensive, it may not be *practically significant* for you to buy it.

NB, recall that in your t test statistic, the denominator is scaled by the sample size; so the smaller the sample size, the smaller the scaled standard error: \(\frac{SE}{\sqrt{n}}\), the bigger the t test stat. If the numerator is very small (ie, the effect size if very small), then even if there is an effect, the only way to test it might be to increase the sample size, in order to get your t-test stat to be bigger.

\(t = \frac{\bar{x_1} - \bar{x_2}}{\sqrt{\frac{S^2_{control}}{n_{control}} + \frac{S^2_{treatment}}{n_{treatment}}}}\) - in the 2 sample case. See how a small effect size (numerator) need much \(n_{i}\)s to pick up significance (that is, for t stat to be bigger)?

**So, P values should always be looked at WITH effect size.**

__ We need degrees of freedom to understand why smaller differences between means can be significant if you have a larger sample size! __. The more info you have, the more accurate your estiamtes are.

So, fewer degrees of freedom -> the fatter the tails, and the less likely your are to reject the null. More data though, means more degrees of freedom and thinner tails, so the more confident you are in extreme values, and thus more likely to reject the null when seeing them.

## Confidence Intervals

The mean of sample distribution is the center a CI (see definition below). The way we choose the confidence range is related to the distribution of sample means (depends on standard deviation of sample means.).

The 95% in a 95% confidence interval tells us that if we calculated a confidence interval from 100 different samples, about 95% of them would contain the true population mean. Think about it, if you got 2 different samples of 40 people, both of those would have different sample means and sample standard deviations, and thus different confidence intervals. It is possible the CI we’ve created doesn’t contain the true mean.

## Bayes’ Theorem

\(P(A|B) = \frac{P(B|A)P(A)}{P(B)} = \frac{P(A \cap B)}{P(B)}\)

Updating beliefs.

Suppose your sister tells you she has a friend. At this point, 50-50 chance that it’s male / female. Suppose your sister also tells you that the friend has breast cancer. Then \(P(male | BREAST CANCER) = \frac{P(BREAST CANCER | male)P(male)}{P(BREAST CANCER)}\) Suppose you know \(P(BREAST CANCER | male) and P(BREAST CANCER)\) from gov’t statistics, then you can calc: \(P(male | BREAST CANCER) = \frac{P(0.001)P(0.5)}{P(0.063)} = 0.79%\), so pretty low.

So, in the above, we updated our belief based on new information. Rearrange the formula slightly to see this more clearly: $P(A|B) = P(A) $ \(P(male | BREAST CANCER) = \frac{P(0.001)}{P(0.063)}P(A)\)

Updating beliefs is core to Bayesian stats.

Bayes’ Factor… Say you’re trying to figure out if someone is a star wars fan. Say your *prior* belief is that 60% of people are Star Wars fans. ie, \(P(fan) = 0.6\) Suppose also, that the probabilty of watching a new star wars movie that’s just come out is \(P(movie | fan) = 0.99\), as fans would rush to watch it. But suppose also that the probability of watching the movie given that a person is not a fan is \(P(movie | not fan) = 0.5\). This is because some people are just curious about it, (like me), or dragged by friends / family. So \(\frac{P(movie | fan)}{P(movie | not fan)} = \frac{0.99}{0.5} = 1.98\) is the odds that a person is a fan, if they tell you that they’ve watched the movie. It’s what we’ve learned about they hypothesis from the data (in this case the data is someone telling you they’ve watched the movie). We cam multiply our *prior* with this factor to get our posterior.

We can repeat this process; make the new posterior a prior for updating beliefs. In the example before, let’s say that there are 60% star wars fans and 40% not - this is your prior. So, odds you’re a fan is \(\frac{0.6}{0.4} = 1.5\). But we can update this with the data: \(\frac{0.6}{0.4} \times \frac{0.99}{0.5} = 1.5 \times 1.98 = 2.97\); so the odds have increased.

### Bayesian Hypothesis Testing

Compare the probabilities of the 2 hypotheses: from the above, compare the probabilities of the probability that a person is a fan, given they’d seen the movie. So, compare \(P(fan | seenmovie)\) and \(P(not fan | seenmovie)\).

So look that the ratio: \(\frac{P(fan | seenmovie)}{P(not fan | seenmovie)}\). We look at the ratio of 2 instances of Bayes’ thm; since both the numerator and the denominator are posterior probs.

### Criticism of Bayes’

In Bayesian hypothesis testing (also in just applying Bayes’ thm), the result is influenced by your prior *belief*. If you throw in a different number for your belief, then you get different results.

One of the main uses of statistics is science whis is supposed to be relatively “objective” and not influenced by opinion, yet here’s a method that included beliefs in its calculation.

In some research, the researchers publish the Bayes Factor, for just this reason - it’s free of the prior. You could then use their calculated Bayes factor to update your own prior.

### Bayesian Inference

Using Bayesian methods to analyze questions allows us to “inject” *prior beliefs* into the analysis.

Using Null Hypothesis significance testing, we can ask whether means of groups are likely different, or not (reject, or fail to reject null hypothesis - recall, you can only fail to reject the null hypothesis, that is, you don’t state that the means are the same, just that you haven’t seen enough evidence to show they aren’t). With Bayesian methods, you can assign a probability to the question of whether or not the means are different. It’s not just a binary outcome… (I think…)

Bayesian A/B Testing: https://www.countbayesie.com/blog/2015/4/25/bayesian-ab-testing ^^ In the above he also mentions why to use instead of using t-tests for differences of means. “Classical statistics tells us Significance, but what we’re really after is Magnitude!” - sayeth Count Bayesie.

## Chi Square Tests

Model fits? Compare expectations vs observations.

Discrete. Contingency tables.

Test = \(\frac{Observed - ExpecetedUnderNullHypothesis}{AverageVariation}\)

What’s Average Variation?

If we just take observed minus expected, we always get to 0, so we square the answers before adding.

Chi sq, degrees of freedom - all the counts, minus 1 (same argument as t-test d.o.f). Can find p-val.

Types of Chisq test:

### Goodness of fit

Only has 1 row. Expected frequency must be > 5.

### Test of independence

Tests whether being a member of 1 category is independent of being in another.

### Test of homogeneity

Tests whether it’s likely thatn 2 samples come from the same population.

d.o.f = (r-1)(c-1).

## GLM - General Linear Model

Allow us to create many different models, like the regression model.

GLM overview: Data = Model + Error.

ie, \(y = b + mx\). (error is b; model part is mx)

Error is a deviation in model.

### Linear Regression

Assumption is that data is linear. Regression line is the line that’s as close as possible to all data points at once. Ie, minimizes Sum of squared distances from line.

Residual plot - plot the errors.

F-Test -> quantify how well we think our data fit a distribution, like the null distribution.

Recall, in general, a Test Stat = \(\frac{ObservedData - WhatWeExpectIfTheNullIsTrue}{AverageVariation}\).

F-stat = \(\frac{ObservedModel - ModelIfNullIsTrue}{AverageVariation}\).

Consider total sum of squares: \(\sum(y_i -\bar{y}) ^ 2\) <- total variation in the dataset. We want to know how much of that variation is accounted for in our model, and how much is just error. (good explanation)[https://youtu.be/WWqE7YHR4Jc?list=PL8dPuuaLjXtNM_Y-bUAhblSAdWRnmBUcr&t=424] - but the idea is, that from the data we calculate some slope coefficient, \(\hat{\beta}\), let’s say \(\hat{\beta} = 6.468\). We want to know what the difference would be if we assumed no relationship, ie \(\hat{\beta} = 0\).

\(F-stat = \frac{SSR}{Average Variation}\) - numerator, sum of squares for regression, is the difference between the models using the calculated slope, and a slope of 0. Bottom part is SSE.

**F statistic allows us to directly compare the amount of variation can and cannot explain**.

\(t^2 = F\)

### ANOVA

Test the difference between multiple groups.

Similar to regression, except use a categorical to predict an outcome. Like using degree studied for vs salary.

Note, Anova can tell you that there is a statistically significant difference between the groups, but not where it is. So follow up with multiple t-tests (Bonferroni).

The ANOVA makes an assumption about the world which it tests: that the best prediction for a variable is the mean rating of the group it belongs to!

RA Fisher.

ANOVAS can include more than one grouping variable - *intersectional groups*. Factorial ANOVA. So, for example, say you are wondering if there is a difference in means of car prices as they are grouped by both Manufacturer and Colour… \(PRICE = BASELINE + \beta{MAN} \ times MANUFACTURER + \beta{COL} \times COLOUR + \epsilon\).

# Definitions

__Bayes’ Factor__- Represents the amount of info we’ve learned about our hypotheses from the data.__Central Limit Theorem__- The distribution of sample means for an independent random variable will get closer and closer to a normal distribution as the size of the sample gets bigger and bigger, even if the original population distribution isn’t normal. Think of independet die rolls, like in here. Twenty dice are 20 independent random variables.__Confidence Interval__- An**estimated range of values**that seem reasonable based on what we’ve observed. Its center is still the**sample mean**, but we’ve got some room on either side for out uncertainty.__Degrees of freedom__Number of independent pieces of information we have -__Effect Size__- How big the effect we observed was, compared to random variation__Hypthesis Testing__-__Inference__- the process of drawing conclusions about population parameters based on a sample taken from the population.__P-Value__- probability of obtaining test results at least as extreme as the results actually observed, assuming that the null hypothesis is correct.__Point Estimate__- a single value given as an estimate of a parameter of a population. Like the mean.__Regression Line__- Regression line is the line that’s as close as possible to all data points at once.__Sampling distribution__- the probability distribution of a sample statistic (like the mean, or variance).__Standard Error__- of a statistic (usually an estimate of a parameter - like the mean) is the standard deviation of its sampling distribution. For example, Standard error of the mean is \(\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\)__Statistical Power__- \(1 - \beta\) - chance of detecting an effect if there is one.__Test Statistic__- Allow us to quantify how close things are to our expectations or theories__Type I error__- \(\alpha\) - probability of saying null hypothesis is wrong, when it is correct__Type II error__- \(\beta\) - probability of not saying the null hypothesis is wrong, even though it is wrong. Note, you’re not actually saying it’s true, just that it’s not wrong.__Z-Distrubtion__- Standardized normal distribution__Z-Score__- A Z-score is a numerical measurement that describes a value’s relationship to the mean of a group of values. Z-score is measured in terms of standard deviations from the mean. If a Z-score is 0, it indicates that the data point’s score is identical to the mean score. A Z-score of 1.0 would indicate a value that is one standard deviation from the mean.**Z scores allow us to compare things that are not on the same scale as long as they are normally distributed**.