Introduction
Throughout this book, we'll implement statistical concepts in Python to reinforce understanding and build practical skills. This section guides you through setting up a productive environment for statistical computing.
Why Python for Statistics?
Python has become the lingua franca of data science and machine learning. Its rich ecosystem of scientific libraries (NumPy, SciPy, pandas, matplotlib) makes it ideal for statistical computing, while its readability makes complex algorithms accessible.
Environment Setup
Installing Python
We recommend using Python 3.10 or later. The easiest way to get started is with Anaconda or Miniconda, which include most scientific packages.
# Option 1: Install Miniconda (recommended)
# Download from: https://docs.conda.io/en/latest/miniconda.html

# Option 2: Use pip with Python 3.10+
python --version  # Should show 3.10 or later

Creating a Virtual Environment
# With conda
conda create -n prob-stats python=3.11
conda activate prob-stats

# Or with venv
python -m venv prob-stats-env
source prob-stats-env/bin/activate  # Linux/Mac
# prob-stats-env\Scripts\activate   # Windows

Installing Required Packages
# Core scientific stack
pip install numpy scipy pandas matplotlib seaborn

# Additional useful packages
pip install jupyter jupyterlab
pip install statsmodels scikit-learn
pip install sympy  # For symbolic math

# Or with conda (handles dependencies better)
conda install numpy scipy pandas matplotlib seaborn jupyter statsmodels scikit-learn

| Package | Purpose | Version |
|---|---|---|
| numpy | Numerical arrays and operations | ≥1.24 |
| scipy | Scientific computing, statistics | ≥1.11 |
| pandas | Data manipulation and analysis | ≥2.0 |
| matplotlib | Plotting and visualization | ≥3.7 |
| seaborn | Statistical visualization | ≥0.12 |
| statsmodels | Statistical models and tests | ≥0.14 |
| scikit-learn | Machine learning tools | ≥1.3 |
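After installing, a short script can confirm that the core packages import and report their versions. This is just a sanity-check sketch; the names and minimum versions mirror the table above, and you can extend the dictionary with the remaining packages.

```python
# Sanity check: import each core package and print its version.
# Minimums below come from the table above; extend as needed.
import importlib

minimums = {
    "numpy": "1.24",
    "scipy": "1.11",
    "pandas": "2.0",
    "matplotlib": "3.7",
}

for name, minimum in minimums.items():
    module = importlib.import_module(name)
    print(f"{name} {module.__version__} (need >= {minimum})")
```

If any import fails, re-run the `pip install` or `conda install` commands above inside the activated environment.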
NumPy Essentials
NumPy is the foundation of scientific Python. Understanding its array operations is essential for efficient statistical computing.
import numpy as np

# Array creation
a = np.array([1, 2, 3, 4, 5])
b = np.zeros((3, 4))        # 3x4 array of zeros
c = np.ones((2, 3))         # 2x3 array of ones
d = np.eye(4)               # 4x4 identity matrix
e = np.arange(0, 10, 0.5)   # From 0 up to (but not including) 10, step 0.5
f = np.linspace(0, 1, 100)  # 100 evenly spaced points from 0 to 1

# Random arrays (important for statistics!)
np.random.seed(42)  # For reproducibility
uniform = np.random.rand(1000)            # Uniform [0, 1)
normal = np.random.randn(1000)            # Standard normal
integers = np.random.randint(1, 7, 1000)  # Dice rolls (1-6)

print(f"Array shape: {normal.shape}")
print(f"Array dtype: {normal.dtype}")

Array Operations
import numpy as np

x = np.array([1, 2, 3, 4, 5])

# Element-wise operations
print(x + 10)     # [11 12 13 14 15]
print(x * 2)      # [ 2  4  6  8 10]
print(x ** 2)     # [ 1  4  9 16 25]
print(np.exp(x))  # Exponential of each element
print(np.log(x))  # Natural log of each element

# Aggregations
print(f"Sum: {np.sum(x)}")
print(f"Mean: {np.mean(x)}")
print(f"Std: {np.std(x)}")
print(f"Var: {np.var(x)}")
print(f"Min: {np.min(x)}, Max: {np.max(x)}")

# Broadcasting
A = np.array([[1, 2, 3], [4, 5, 6]])  # Shape (2, 3)
row = np.array([10, 20, 30])          # Shape (3,)
print(A + row)  # Broadcasting adds row to each row of A

Indexing and Slicing
import numpy as np

A = np.arange(12).reshape(3, 4)
print("Original array:")
print(A)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

# Basic indexing
print(f"Element [1,2]: {A[1, 2]}")  # 6
print(f"Row 0: {A[0, :]}")          # [0 1 2 3]
print(f"Column 1: {A[:, 1]}")       # [1 5 9]

# Slicing
print("Subarray [0:2, 1:3]:")
print(A[0:2, 1:3])  # [[1 2], [5 6]]

# Boolean indexing (very useful for filtering data!)
x = np.array([1, -2, 3, -4, 5])
positive = x[x > 0]  # [1 3 5]
print(f"Positive values: {positive}")

Performance Tip
Prefer vectorized array operations over explicit Python loops: NumPy executes them in compiled code, which is typically orders of magnitude faster.
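Vectorized NumPy operations run in compiled code and are typically far faster than explicit Python loops. A quick comparison (exact timings vary by machine):

```python
# Sum of squares of one million values: Python loop vs. one NumPy expression.
import time
import numpy as np

x = np.random.rand(1_000_000)

# Explicit Python loop
start = time.perf_counter()
loop_total = 0.0
for value in x:
    loop_total += value ** 2
loop_time = time.perf_counter() - start

# Single vectorized expression
start = time.perf_counter()
vec_total = np.sum(x ** 2)
vec_time = time.perf_counter() - start

print(f"Loop:       {loop_time:.4f} s")
print(f"Vectorized: {vec_time:.4f} s")
print(f"Same result: {np.isclose(loop_total, vec_total)}")
```

Both approaches compute the same number; the vectorized version simply moves the loop into NumPy's C internals.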
SciPy for Statistics
SciPy's stats module provides probability distributions, statistical tests, and descriptive statistics.
from scipy import stats
import numpy as np

# Working with distributions
normal = stats.norm(loc=0, scale=1)  # Standard normal

# PDF, CDF, and inverse CDF (quantile function)
print(f"PDF at x=0: {normal.pdf(0):.4f}")          # 0.3989
print(f"CDF at x=0: {normal.cdf(0):.4f}")          # 0.5000
print(f"95th percentile: {normal.ppf(0.95):.4f}")  # 1.6449

# Generate random samples
samples = normal.rvs(size=1000)
print(f"Sample mean: {samples.mean():.4f}")
print(f"Sample std: {samples.std():.4f}")

# Other distributions
exponential = stats.expon(scale=1)
poisson = stats.poisson(mu=5)
binomial = stats.binom(n=10, p=0.3)

Descriptive Statistics
from scipy import stats
import numpy as np

np.random.seed(42)
data = np.random.randn(1000)

# Descriptive statistics
description = stats.describe(data)
print(f"N: {description.nobs}")
print(f"Min: {description.minmax[0]:.4f}")
print(f"Max: {description.minmax[1]:.4f}")
print(f"Mean: {description.mean:.4f}")
print(f"Variance: {description.variance:.4f}")
print(f"Skewness: {description.skewness:.4f}")
print(f"Kurtosis: {description.kurtosis:.4f}")

# Percentiles
percentiles = np.percentile(data, [25, 50, 75])
print(f"Quartiles: {percentiles}")

Matplotlib Basics
Visualization is crucial for understanding data and communicating results. Matplotlib is the foundation of Python plotting.
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats  # Needed for the PDF overlay below

# Basic line plot
x = np.linspace(0, 2*np.pi, 100)
y = np.sin(x)

plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.plot(x, y, 'b-', label='sin(x)')
plt.plot(x, np.cos(x), 'r--', label='cos(x)')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Trigonometric Functions')
plt.legend()
plt.grid(True)

# Histogram
plt.subplot(1, 2, 2)
data = np.random.randn(1000)
plt.hist(data, bins=30, density=True, alpha=0.7, label='Sample')

# Overlay the theoretical PDF
x_pdf = np.linspace(-4, 4, 100)
plt.plot(x_pdf, stats.norm.pdf(x_pdf), 'r-', lw=2, label='N(0,1)')
plt.xlabel('Value')
plt.ylabel('Density')
plt.title('Histogram with Normal PDF')
plt.legend()

plt.tight_layout()
plt.savefig('example_plot.png', dpi=150)
plt.show()

Seaborn for Statistical Plots
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Set style
sns.set_style("whitegrid")

# Generate sample data
np.random.seed(42)
x = np.random.randn(200)
y = 0.5 * x + np.random.randn(200) * 0.5

# Distribution plots
fig, axes = plt.subplots(1, 3, figsize=(12, 4))

# KDE plot
sns.kdeplot(x, ax=axes[0], fill=True)
axes[0].set_title('Kernel Density Estimate')

# Scatter plot with regression line
sns.regplot(x=x, y=y, ax=axes[1])
axes[1].set_title('Scatter with Regression')

# Box plot
data = [np.random.randn(100) + i for i in range(3)]
axes[2].boxplot(data)
axes[2].set_title('Box Plot')

plt.tight_layout()
plt.show()

Pandas Introduction
Pandas provides the DataFrame structure for handling tabular data, essential for real-world statistical analysis.
import pandas as pd
import numpy as np

# Create a DataFrame
np.random.seed(42)
df = pd.DataFrame({
    'group': np.random.choice(['A', 'B', 'C'], 100),
    'value1': np.random.randn(100),
    'value2': np.random.exponential(2, 100),
    'value3': np.random.randint(1, 10, 100)
})

# Basic exploration
print(df.head())
print(f"\nShape: {df.shape}")
print(f"\nData types:\n{df.dtypes}")
print(f"\nStatistical summary:\n{df.describe()}")

Data Manipulation
import pandas as pd
import numpy as np

# Recreate the DataFrame from the previous example
np.random.seed(42)
df = pd.DataFrame({
    'group': np.random.choice(['A', 'B', 'C'], 100),
    'value1': np.random.randn(100),
    'value2': np.random.exponential(2, 100),
})

# Filtering
group_a = df[df['group'] == 'A']
print(f"Group A count: {len(group_a)}")

# Group-by operations (essential for statistics!)
group_stats = df.groupby('group').agg({
    'value1': ['mean', 'std', 'count'],
    'value2': ['mean', 'median']
})
print(f"\nGroup statistics:\n{group_stats}")

# Correlation matrix
numeric_cols = df.select_dtypes(include=[np.number])
corr_matrix = numeric_cols.corr()
print(f"\nCorrelation matrix:\n{corr_matrix}")

Jupyter Workflow
Jupyter notebooks provide an interactive environment ideal for exploratory data analysis and learning.
Starting Jupyter
# Start Jupyter Lab (recommended)
jupyter lab

# Or classic notebook
jupyter notebook

# Access at http://localhost:8888

Useful Jupyter Features
- Cell types: Code cells for Python, Markdown cells for documentation
- Magic commands: %timeit for timing, %matplotlib inline for inline plots
- Tab completion: Press Tab for autocomplete suggestions
- Documentation: Use Shift+Tab or ?function_name for help
# Magic commands to put at the top of notebooks
%matplotlib inline
%config InlineBackend.figure_format = 'retina'  # High-res plots

# Timing code
%timeit np.random.randn(1000)

# Get help
?np.random.randn

# Run external script
%run my_script.py

Best Practice
Create a standard imports cell at the top of each notebook:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Settings
np.random.seed(42)  # Reproducibility
plt.rcParams['figure.figsize'] = (10, 6)
sns.set_style('whitegrid')
%matplotlib inline

Summary
You now have a complete Python environment for statistical computing:
- NumPy for efficient numerical array operations
- SciPy.stats for probability distributions and statistical functions
- Matplotlib/Seaborn for visualization
- Pandas for data manipulation and analysis
- Jupyter for interactive exploration
With this foundation in place, we're ready to dive into probability theory. In the next chapter, we'll explore the fundamental concepts of probability: sample spaces, events, and the axioms that govern probability measures.
Practice Exercise
Write a script that:
- Generates 10,000 samples from a normal distribution
- Calculates descriptive statistics (mean, std, skewness)
- Creates a histogram overlaid with the theoretical PDF
- Computes the empirical vs theoretical probability for |X| > 2
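Try the exercise on your own first. One possible solution sketch is shown below; the file name `exercise_plot.png` is arbitrary, and other approaches are equally valid.

```python
# One possible solution sketch for the practice exercise.
import numpy as np
import matplotlib
matplotlib.use("Agg")  # Save the figure without needing a display
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(42)

# 1. Generate 10,000 samples from a standard normal distribution
samples = np.random.randn(10_000)

# 2. Descriptive statistics
print(f"Mean: {samples.mean():.4f}")
print(f"Std: {samples.std():.4f}")
print(f"Skewness: {stats.skew(samples):.4f}")

# 3. Histogram overlaid with the theoretical PDF
x = np.linspace(-4, 4, 200)
plt.hist(samples, bins=50, density=True, alpha=0.7, label="Sample")
plt.plot(x, stats.norm.pdf(x), "r-", lw=2, label="N(0,1)")
plt.xlabel("Value")
plt.ylabel("Density")
plt.legend()
plt.savefig("exercise_plot.png", dpi=150)

# 4. Empirical vs. theoretical probability of |X| > 2
empirical = np.mean(np.abs(samples) > 2)
theoretical = 2 * stats.norm.sf(2)  # sf(x) = 1 - cdf(x)
print(f"Empirical P(|X| > 2):   {empirical:.4f}")
print(f"Theoretical P(|X| > 2): {theoretical:.4f}")
```

The empirical probability should land close to the theoretical value of about 0.0455; the gap shrinks as the sample size grows.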