Introduction
Throughout this book, we'll implement statistical concepts in Python to reinforce understanding and build practical skills. This section guides you through setting up a productive environment for statistical computing.
Why Python for Statistics?
Python has become the lingua franca of data science and machine learning. Its rich ecosystem of scientific libraries (NumPy, SciPy, pandas, matplotlib) makes it ideal for statistical computing, while its readability makes complex algorithms accessible.
Environment Setup
Installing Python
We recommend using Python 3.10 or later. The easiest way to get started is with Anaconda or Miniconda, which include most scientific packages.
# Option 1: Install Miniconda (recommended)
# Download from: https://docs.conda.io/en/latest/miniconda.html

# Option 2: Use pip with Python 3.10+
python --version  # Should show 3.10 or later

Creating a Virtual Environment
# With conda
conda create -n prob-stats python=3.11
conda activate prob-stats

# Or with venv
python -m venv prob-stats-env
source prob-stats-env/bin/activate  # Linux/Mac
# prob-stats-env\Scripts\activate   # Windows

Installing Required Packages
# Core scientific stack
pip install numpy scipy pandas matplotlib seaborn

# Additional useful packages
pip install jupyter jupyterlab
pip install statsmodels scikit-learn
pip install sympy  # For symbolic math

# Or with conda (handles dependencies better)
conda install numpy scipy pandas matplotlib seaborn jupyter statsmodels scikit-learn

| Package | Purpose | Version |
|---|---|---|
| numpy | Numerical arrays and operations | ≥1.24 |
| scipy | Scientific computing, statistics | ≥1.11 |
| pandas | Data manipulation and analysis | ≥2.0 |
| matplotlib | Plotting and visualization | ≥3.7 |
| seaborn | Statistical visualization | ≥0.12 |
| statsmodels | Statistical models and tests | ≥0.14 |
| scikit-learn | Machine learning tools | ≥1.3 |
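After installing, a short script can confirm that the core packages import and report their versions. This is just a sanity-check sketch; the names and minimum versions mirror the table above, and you can extend the dictionary with the remaining packages.

```python
# Sanity check: import each core package and print its version.
# Minimums below come from the table above; extend as needed.
import importlib

minimums = {
    "numpy": "1.24",
    "scipy": "1.11",
    "pandas": "2.0",
    "matplotlib": "3.7",
}

for name, minimum in minimums.items():
    module = importlib.import_module(name)
    print(f"{name} {module.__version__} (need >= {minimum})")
```

If any import fails, re-run the `pip install` or `conda install` commands above inside the activated environment.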
NumPy Essentials
NumPy is the foundation of scientific Python. Understanding its array operations is essential for efficient statistical computing.
import numpy as np

# Array creation
a = np.array([1, 2, 3, 4, 5])
b = np.zeros((3, 4))        # 3x4 array of zeros
c = np.ones((2, 3))         # 2x3 array of ones
d = np.eye(4)               # 4x4 identity matrix
e = np.arange(0, 10, 0.5)   # From 0 up to (but not including) 10, step 0.5
f = np.linspace(0, 1, 100)  # 100 evenly spaced points from 0 to 1

# Random arrays (important for statistics!)
np.random.seed(42)  # For reproducibility
uniform = np.random.rand(1000)            # Uniform [0, 1)
normal = np.random.randn(1000)            # Standard normal
integers = np.random.randint(1, 7, 1000)  # Dice rolls (1-6)

print(f"Array shape: {normal.shape}")
print(f"Array dtype: {normal.dtype}")

Array Operations
import numpy as np

x = np.array([1, 2, 3, 4, 5])

# Element-wise operations
print(x + 10)     # [11 12 13 14 15]
print(x * 2)      # [ 2  4  6  8 10]
print(x ** 2)     # [ 1  4  9 16 25]
print(np.exp(x))  # Exponential of each element
print(np.log(x))  # Natural log of each element

# Aggregations
print(f"Sum: {np.sum(x)}")
print(f"Mean: {np.mean(x)}")
print(f"Std: {np.std(x)}")
print(f"Var: {np.var(x)}")
print(f"Min: {np.min(x)}, Max: {np.max(x)}")

# Broadcasting
A = np.array([[1, 2, 3], [4, 5, 6]])  # Shape (2, 3)
row = np.array([10, 20, 30])          # Shape (3,)
print(A + row)  # Broadcasting adds row to each row of A

Indexing and Slicing
import numpy as np

A = np.arange(12).reshape(3, 4)
print("Original array:")
print(A)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

# Basic indexing
print(f"Element [1,2]: {A[1, 2]}")  # 6
print(f"Row 0: {A[0, :]}")          # [0 1 2 3]
print(f"Column 1: {A[:, 1]}")       # [1 5 9]

# Slicing
print("Subarray [0:2, 1:3]:")
print(A[0:2, 1:3])  # [[1 2], [5 6]]

# Boolean indexing (very useful for filtering data!)
x = np.array([1, -2, 3, -4, 5])
positive = x[x > 0]  # [1 3 5]
print(f"Positive values: {positive}")

Performance Tip
Prefer vectorized array operations over explicit Python loops: NumPy executes them in compiled code, which is typically orders of magnitude faster.
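Vectorized NumPy operations run in compiled code and are typically far faster than explicit Python loops. A quick comparison (exact timings vary by machine):

```python
# Sum of squares of one million values: Python loop vs. one NumPy expression.
import time
import numpy as np

x = np.random.rand(1_000_000)

# Explicit Python loop
start = time.perf_counter()
loop_total = 0.0
for value in x:
    loop_total += value ** 2
loop_time = time.perf_counter() - start

# Single vectorized expression
start = time.perf_counter()
vec_total = np.sum(x ** 2)
vec_time = time.perf_counter() - start

print(f"Loop:       {loop_time:.4f} s")
print(f"Vectorized: {vec_time:.4f} s")
print(f"Same result: {np.isclose(loop_total, vec_total)}")
```

Both approaches compute the same number; the vectorized version simply moves the loop into NumPy's C internals.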
SciPy for Statistics
SciPy's stats module provides probability distributions, statistical tests, and descriptive statistics.
from scipy import stats
import numpy as np

# Working with distributions
normal = stats.norm(loc=0, scale=1)  # Standard normal

# PDF, CDF, and inverse CDF (quantile function)
print(f"PDF at x=0: {normal.pdf(0):.4f}")          # 0.3989
print(f"CDF at x=0: {normal.cdf(0):.4f}")          # 0.5000
print(f"95th percentile: {normal.ppf(0.95):.4f}")  # 1.6449

# Generate random samples
samples = normal.rvs(size=1000)
print(f"Sample mean: {samples.mean():.4f}")
print(f"Sample std: {samples.std():.4f}")

# Other distributions
exponential = stats.expon(scale=1)
poisson = stats.poisson(mu=5)
binomial = stats.binom(n=10, p=0.3)

Descriptive Statistics
from scipy import stats
import numpy as np

np.random.seed(42)
data = np.random.randn(1000)

# Descriptive statistics
description = stats.describe(data)
print(f"N: {description.nobs}")
print(f"Min: {description.minmax[0]:.4f}")
print(f"Max: {description.minmax[1]:.4f}")
print(f"Mean: {description.mean:.4f}")
print(f"Variance: {description.variance:.4f}")
print(f"Skewness: {description.skewness:.4f}")
print(f"Kurtosis: {description.kurtosis:.4f}")

# Percentiles
percentiles = np.percentile(data, [25, 50, 75])
print(f"Quartiles: {percentiles}")

Matplotlib Basics
Visualization is crucial for understanding data and communicating results. Matplotlib is the foundation of Python plotting.
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats  # Needed for the PDF overlay below

# Basic line plot
x = np.linspace(0, 2*np.pi, 100)
y = np.sin(x)

plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.plot(x, y, 'b-', label='sin(x)')
plt.plot(x, np.cos(x), 'r--', label='cos(x)')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Trigonometric Functions')
plt.legend()
plt.grid(True)

# Histogram
plt.subplot(1, 2, 2)
data = np.random.randn(1000)
plt.hist(data, bins=30, density=True, alpha=0.7, label='Sample')

# Overlay the theoretical PDF
x_pdf = np.linspace(-4, 4, 100)
plt.plot(x_pdf, stats.norm.pdf(x_pdf), 'r-', lw=2, label='N(0,1)')
plt.xlabel('Value')
plt.ylabel('Density')
plt.title('Histogram with Normal PDF')
plt.legend()

plt.tight_layout()
plt.savefig('example_plot.png', dpi=150)
plt.show()

Seaborn for Statistical Plots
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Set style
sns.set_style("whitegrid")

# Generate sample data
np.random.seed(42)
x = np.random.randn(200)
y = 0.5 * x + np.random.randn(200) * 0.5

# Distribution plots
fig, axes = plt.subplots(1, 3, figsize=(12, 4))

# KDE plot
sns.kdeplot(x, ax=axes[0], fill=True)
axes[0].set_title('Kernel Density Estimate')

# Scatter plot with regression line
sns.regplot(x=x, y=y, ax=axes[1])
axes[1].set_title('Scatter with Regression')

# Box plot
data = [np.random.randn(100) + i for i in range(3)]
axes[2].boxplot(data)
axes[2].set_title('Box Plot')

plt.tight_layout()
plt.show()

Pandas Introduction
Pandas provides the DataFrame structure for handling tabular data, essential for real-world statistical analysis.
import pandas as pd
import numpy as np

# Create a DataFrame
np.random.seed(42)
df = pd.DataFrame({
    'group': np.random.choice(['A', 'B', 'C'], 100),
    'value1': np.random.randn(100),
    'value2': np.random.exponential(2, 100),
    'value3': np.random.randint(1, 10, 100)
})

# Basic exploration
print(df.head())
print(f"\nShape: {df.shape}")
print(f"\nData types:\n{df.dtypes}")
print(f"\nStatistical summary:\n{df.describe()}")

Data Manipulation
import pandas as pd
import numpy as np

# Recreate the DataFrame from the previous example
np.random.seed(42)
df = pd.DataFrame({
    'group': np.random.choice(['A', 'B', 'C'], 100),
    'value1': np.random.randn(100),
    'value2': np.random.exponential(2, 100),
})

# Filtering
group_a = df[df['group'] == 'A']
print(f"Group A count: {len(group_a)}")

# Group-by operations (essential for statistics!)
group_stats = df.groupby('group').agg({
    'value1': ['mean', 'std', 'count'],
    'value2': ['mean', 'median']
})
print(f"\nGroup statistics:\n{group_stats}")

# Correlation matrix
numeric_cols = df.select_dtypes(include=[np.number])
corr_matrix = numeric_cols.corr()
print(f"\nCorrelation matrix:\n{corr_matrix}")

Jupyter Workflow
Jupyter notebooks provide an interactive environment ideal for exploratory data analysis and learning.
Starting Jupyter
# Start Jupyter Lab (recommended)
jupyter lab

# Or classic notebook
jupyter notebook

# Access at http://localhost:8888

Useful Jupyter Features
- Cell types: Code cells for Python, Markdown cells for documentation
- Magic commands: %timeit for timing, %matplotlib inline for inline plots
- Tab completion: Press Tab for autocomplete suggestions
- Documentation: Use Shift+Tab or ?function_name for help
# Magic commands to put at the top of notebooks
%matplotlib inline
%config InlineBackend.figure_format = 'retina'  # High-res plots

# Timing code
%timeit np.random.randn(1000)

# Get help
?np.random.randn

# Run external script
%run my_script.py

Best Practice
Create a standard imports cell at the top of each notebook:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Settings
np.random.seed(42)  # Reproducibility
plt.rcParams['figure.figsize'] = (10, 6)
sns.set_style('whitegrid')
%matplotlib inline

Summary
You now have a complete Python environment for statistical computing:
- NumPy for efficient numerical array operations
- SciPy.stats for probability distributions and statistical functions
- Matplotlib/Seaborn for visualization
- Pandas for data manipulation and analysis
- Jupyter for interactive exploration
With this foundation in place, we're ready to dive into probability theory. In the next chapter, we'll explore the fundamental concepts of probability: sample spaces, events, and the axioms that govern probability measures.
Practice Exercise
Write a script that:
- Generates 10,000 samples from a normal distribution
- Calculates descriptive statistics (mean, std, skewness)
- Creates a histogram overlaid with the theoretical PDF
- Computes the empirical vs theoretical probability for |X| > 2
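Try the exercise on your own first. One possible solution sketch is shown below; the file name `exercise_plot.png` is arbitrary, and other approaches are equally valid.

```python
# One possible solution sketch for the practice exercise.
import numpy as np
import matplotlib
matplotlib.use("Agg")  # Save the figure without needing a display
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(42)

# 1. Generate 10,000 samples from a standard normal distribution
samples = np.random.randn(10_000)

# 2. Descriptive statistics
print(f"Mean: {samples.mean():.4f}")
print(f"Std: {samples.std():.4f}")
print(f"Skewness: {stats.skew(samples):.4f}")

# 3. Histogram overlaid with the theoretical PDF
x = np.linspace(-4, 4, 200)
plt.hist(samples, bins=50, density=True, alpha=0.7, label="Sample")
plt.plot(x, stats.norm.pdf(x), "r-", lw=2, label="N(0,1)")
plt.xlabel("Value")
plt.ylabel("Density")
plt.legend()
plt.savefig("exercise_plot.png", dpi=150)

# 4. Empirical vs. theoretical probability of |X| > 2
empirical = np.mean(np.abs(samples) > 2)
theoretical = 2 * stats.norm.sf(2)  # sf(x) = 1 - cdf(x)
print(f"Empirical P(|X| > 2):   {empirical:.4f}")
print(f"Theoretical P(|X| > 2): {theoretical:.4f}")
```

The empirical probability should land close to the theoretical value of about 0.0455; the gap shrinks as the sample size grows.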