Statistics for Biology
Published:
Introduction to Statistics
Statistics helps us make sense of data by summarizing it, estimating unknowns, and testing ideas.
- Descriptive statistics: Summarize data
- Inferential statistics: Make predictions or generalizations
- Applied in biology, medicine, engineering, psychology, economics, etc.
Describing Distributions
Measures of Location
- Mean – average value
- Median – middle value
- Mode – most frequent value
These describe the “center” of the data.
Measures of Spread
- Range – difference between max and min
- Quartiles – divide data into four parts
- Percentiles – value below which X% of data falls
- Deviation – difference from the mean
- Variance – average of squared deviations
- Standard deviation – square root of variance
These show how scattered the data is.
Probability Foundations
What is Probability?
- The likelihood of an event occurring
- Expressed between 0 (impossible) and 1 (certain)
Law of Total Probability
If A can happen through exclusive events B and C:
P(A) = P(A|B) × P(B) + P(A|C) × P(C)
Bayesian Thinking
Bayes’ Theorem
Helps update belief in a hypothesis given new data.
P(H|D) = [P(D|H) × P(H)] / P(D)
Example: COVID-19 Testing
Even with a 99% accurate test, if prevalence is low (e.g. 1%), many positives may be false due to the base rate.
Probability Distributions
PMF (Probability Mass Function)
- For discrete variables
- Assigns probabilities to individual outcomes
- Example: Coin toss
PDF (Probability Density Function)
- For continuous variables
- Probability = Area under the curve
- Example: Human height
Normal Distribution
- Symmetrical bell curve
- Defined by mean (μ) and standard deviation (σ)
- Most biological measurements follow this
Sampling: Population vs Sample
- Population – entire group (e.g., all adults in India)
- Sample – subset used to infer about the population
Sampling must be random and unbiased to ensure valid conclusions.
Estimating Parameters
Point Estimate
- A single value (e.g., sample mean) used to estimate a population parameter
Central Limit Theorem
- The distribution of the sample mean becomes normal as sample size increases, even if the original data isn’t normal
Confidence Intervals
Used to show a range where the population value likely falls
- Confidence level – how sure we are (e.g., 95%)
- Margin of error – how much the sample estimate may vary
CI = sample mean ± z × (σ / √n)
Hypothesis Testing
Goal:
Test if a difference/relationship is statistically significant
- Null hypothesis (H₀) – no effect/difference
- Alternative hypothesis (H₁) – there is an effect
Errors:
- Type I (α): False positive (rejecting true H₀)
- Type II (β): False negative (failing to reject false H₀)
P-value
- Probability that observed data is due to chance
- If p ≤ α (e.g. 0.05) → reject H₀
Parametric Tests
| Test | Use Case |
|---|---|
| Z-test | Known variance, large sample size |
| T-test | Unknown variance |
| Paired T-test | Same subjects before/after |
| Two-sample T-test | Compare two independent groups |
| ANOVA | Compare 3+ group means |
| F-test | Compare variances between 2 groups |
ANOVA (Analysis of Variance)
Used to test differences across 3+ groups
Types:
- One-Way ANOVA – one factor
- Two-Way ANOVA – two factors, also tests interaction
Assumptions:
- Independence
- Normality
- Equal variances
Key Terms:
- SSB: Variation between groups
- SSW: Variation within groups
- F-ratio:
F = MSB / MSW
If F is large, likely that group means differ significantly.
Non-Parametric Tests
Used when assumptions of normality or equal variance don’t hold
| Test | Used For |
|---|---|
| Mann-Whitney U | Compare two independent medians |
| Wilcoxon Signed Rank | Paired samples without normality |
| Chi-Square Test | Relationship between categorical variables |
| Kruskal-Wallis | Non-parametric alternative to ANOVA |
Real-Life Examples
- Body Fat Study: Use 2-sample t-test to compare men vs women
- Teaching Effectiveness: F-test to compare variance in student scores
- Diet Study: One-way ANOVA to compare 3 diets’ weight loss
- HIV Drug Trial: Mann-Whitney test for viral load comparison
- Learning Preference vs Gender: Chi-square test for association
Practice Problems
- Normal range: What is the central 95% range for μ = 80, σ² = 144?
- Cholesterol study: Do Asian immigrants differ significantly in cholesterol?
- Normality check: Use a sample of weights to test for normal distribution.
- CLT demonstration: Use different sample sizes to show CLT in action.
✅ Summary Table: Test Selection
| Situation | Test to Use |
|---|---|
| Mean difference, large sample, known σ | Z-test |
| Mean difference, small sample, unknown σ | T-test |
| Difference between 3+ group means | ANOVA |
| Comparison of variances | F-test |
| Categorical variable association | Chi-square test |
| Median comparison (2 groups, non-normal) | Mann-Whitney U Test |
| Repeated measurements on same subjects | Paired T-test / Wilcoxon |
“Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write.” – H.G. Wells
