Boo-AI — Master Artificial Intelligence by Building from Scratch

Introduction

Static malware analysis examines a suspicious file without executing it. By inspecting the file's structure, metadata, embedded strings, and binary patterns, analysts can determine whether a file is malicious in milliseconds rather than the minutes or hours required for dynamic analysis. Machine learning has transformed static analysis from a manual, expert-driven process into an automated classification pipeline capable of processing millions of samples daily.

This section covers the fundamentals of static malware analysis with ML, from understanding the Portable Executable (PE) file format to training high-accuracy classifiers on industry-standard datasets.

Understanding PE File Structure

The Portable Executable (PE) format is the standard executable format on Windows operating systems, making it the most common file type encountered in malware analysis. Understanding PE structure is essential because it provides a wealth of features that distinguish benign from malicious executables.

A PE file consists of a DOS header, the PE signature, a COFF file header, an Optional header (which ends with the data directories), and a section table. The COFF file header records the machine architecture, number of sections, and link timestamp; the Optional header holds the entry point address and image size. Section headers describe code (.text), data (.data), and resource (.rsrc) sections with their virtual sizes, raw sizes, and characteristics flags.
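
To make the header layout concrete, here is a minimal sketch that reads a few fields directly from raw bytes using only the standard library. The function names `parse_pe_headers` and `build_minimal_header` are hypothetical, and the builder produces a synthetic header for illustration only, not a loadable executable. Real-world parsing should rely on a library such as pefile, which handles malformed files.

```python
import struct

def parse_pe_headers(data: bytes) -> dict:
    """Read a few COFF and Optional header fields from raw PE bytes."""
    if data[:2] != b"MZ":
        raise ValueError("missing DOS 'MZ' signature")
    # e_lfanew at offset 0x3C holds the file offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    # The 20-byte COFF file header follows the 4-byte signature.
    machine, num_sections, timestamp, _, _, opt_size, _ = struct.unpack_from(
        "<HHIIIHH", data, pe_offset + 4
    )
    # The Optional header follows: magic at offset 0, entry point at offset 16.
    opt_offset = pe_offset + 24
    (magic,) = struct.unpack_from("<H", data, opt_offset)
    (entry_point,) = struct.unpack_from("<I", data, opt_offset + 16)
    return {
        "machine": machine,
        "num_sections": num_sections,
        "timestamp": timestamp,
        "optional_header_size": opt_size,
        "is_pe32_plus": magic == 0x20B,
        "entry_point": entry_point,
    }

def build_minimal_header(num_sections=3, timestamp=0x5F000000, entry_point=0x1000):
    """Build a synthetic header blob for demonstration (hypothetical helper)."""
    dos = bytearray(0x40)
    dos[:2] = b"MZ"
    struct.pack_into("<I", dos, 0x3C, 0x40)  # PE signature right after DOS header
    coff = struct.pack("<HHIIIHH", 0x8664, num_sections, timestamp, 0, 0, 240, 0x22)
    optional = bytearray(240)
    struct.pack_into("<H", optional, 0, 0x20B)        # PE32+ magic
    struct.pack_into("<I", optional, 16, entry_point)  # AddressOfEntryPoint
    return bytes(dos) + b"PE\x00\x00" + coff + bytes(optional)
```

Calling `parse_pe_headers(build_minimal_header())` returns the values that were packed into the synthetic blob, which shows that the entry point lives in the Optional header rather than the COFF file header.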

Malicious PE files often exhibit structural anomalies: unusual section names, high entropy in code sections (indicating packing or encryption), mismatches between virtual and raw section sizes, and suspicious import tables that reference system manipulation APIs like WriteProcessMemory or VirtualAllocEx.

Section entropy: Packed or encrypted sections show entropy near 8.0 bits per byte, the maximum, which uniformly random data approaches
Import table: Malware imports APIs for process injection, registry manipulation, and network communication
Resource section: Malware often embeds additional payloads in the resource section
Digital signatures: An invalid signature is a strong malware indicator; a missing signature is a weaker one, since many benign files are also unsigned
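
The entropy indicator above is Shannon entropy computed over byte values. The following is a minimal sketch using only the standard library; the 7.0 threshold and the `looks_packed` helper are illustrative assumptions, not values taken from this section or from any particular tool.

```python
import math
from collections import Counter

# Illustrative threshold (an assumption); tune it on your own data.
PACKED_ENTROPY_THRESHOLD = 7.0

def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of data in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # occurrences of each byte value
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def looks_packed(section_bytes: bytes) -> bool:
    """Flag a section whose entropy exceeds the illustrative threshold."""
    return shannon_entropy(section_bytes) > PACKED_ENTROPY_THRESHOLD
```

A buffer in which every byte value appears equally often reaches the maximum of 8.0, while a buffer of a single repeated byte scores 0.0. Compressed or encrypted sections sit near the top of that range, which is why entropy is a common packing signal.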

Feature Extraction Techniques

Transforming a PE file into a feature vector suitable for ML classification requires extracting multiple categories of information. The most effective approaches combine structural features from the PE headers with content-based features derived from the file's binary content.

N-gram analysis extracts sequences of N consecutive bytes from the binary content, creating a frequency distribution that captures characteristic byte patterns. Byte bigrams (N=2) and trigrams (N=3) are commonly used. The API names listed in the import table complement these features by revealing the functionality the executable intends to use.
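
As a rough illustration of byte n-gram extraction, the sketch below counts overlapping n-grams with a sliding window and reports the most frequent ones as relative frequencies. The function names are hypothetical, and a production pipeline would typically restrict the vocabulary or hash the n-grams to keep the feature vector at a fixed size.

```python
from collections import Counter

def byte_ngram_counts(data: bytes, n: int = 2) -> Counter:
    """Count overlapping byte n-grams using a sliding window of width n."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # range() is empty when data is shorter than n, giving an empty Counter.
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

def top_ngram_frequencies(data: bytes, n: int = 2, k: int = 5):
    """Return the k most common n-grams with their relative frequencies."""
    counts = byte_ngram_counts(data, n)
    total = sum(counts.values())
    return [(gram, count / total) for gram, count in counts.most_common(k)]
```

For example, `byte_ngram_counts(b"\x90\x90\x90\xcc", 2)` counts the bigram `b"\x90\x90"` twice and `b"\x90\xcc"` once.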

```python
import pefile

def extract_pe_features(filepath):
    """Extract a small dictionary of static features from a PE file."""
    pe = pefile.PE(filepath)
    try:
        features = {}

        # Header features
        features["num_sections"] = pe.FILE_HEADER.NumberOfSections
        features["timestamp"] = pe.FILE_HEADER.TimeDateStamp
        features["entry_point"] = pe.OPTIONAL_HEADER.AddressOfEntryPoint
        features["image_size"] = pe.OPTIONAL_HEADER.SizeOfImage

        # Section entropy; names are NUL-padded and not always valid UTF-8
        for section in pe.sections:
            name = section.Name.rstrip(b"\x00").decode("utf-8", errors="replace")
            features[f"entropy_{name}"] = section.get_entropy()

        # Import features; default to 0 so every file yields the same keys
        import_entries = getattr(pe, "DIRECTORY_ENTRY_IMPORT", [])
        features["num_imported_dlls"] = len(import_entries)
        features["num_imports"] = sum(len(entry.imports) for entry in import_entries)

        return features
    finally:
        pe.close()  # release the file handle even if parsing fails midway
```

The EMBER Dataset

The Endgame Malware BEnchmark for Research (EMBER) is a widely used open benchmark for training and evaluating PE malware classifiers. Created by Endgame (now part of Elastic), the original EMBER 2017 release contains feature vectors extracted from 1.1 million PE files: 900,000 training samples (300,000 malicious, 300,000 benign, and 300,000 unlabeled) and 200,000 test samples (100,000 malicious and 100,000 benign).

EMBER provides pre-extracted features across eight categories: byte histogram, byte-entropy histogram, string information, general file information, PE header information, section information, import information, and export information. This standardized feature set enables fair comparison between different ML approaches.
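
To give a sense of what the simplest of these categories looks like, here is a minimal sketch of a normalized byte histogram. It is an illustration written for this section, not EMBER's own extraction code, and the function name `byte_histogram` is an assumption.

```python
from typing import List

def byte_histogram(data: bytes, normalize: bool = True) -> List[float]:
    """Return a 256-bin histogram of byte values, optionally normalized."""
    counts = [0] * 256
    for value in data:  # iterating over bytes yields integers 0..255
        counts[value] += 1
    if not normalize or not data:
        return [float(c) for c in counts]
    total = len(data)
    return [c / total for c in counts]
```

Because the output always has 256 entries regardless of file size, it can be concatenated with other fixed-length feature groups to form a single vector per file.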

Why EMBER Matters: Unlike many security datasets, EMBER is large enough to train deep learning models, and it has been re-released with newer samples (the 2018 release followed the original 2017 one). The EMBER 2018 release included timestamps, enabling temporal evaluation that simulates real-world deployment conditions where models must detect malware created after the training cutoff date.
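
Temporal evaluation can be sketched as a split on a cutoff date instead of a random shuffle. The record layout below (dictionaries with a `first_seen` date and a `label`) is a hypothetical stand-in and does not reflect EMBER's actual schema.

```python
from datetime import date

def temporal_split(samples, cutoff):
    """Split samples so that training data strictly precedes the cutoff date."""
    train = [s for s in samples if s["first_seen"] < cutoff]
    test = [s for s in samples if s["first_seen"] >= cutoff]
    return train, test

# Hypothetical records for illustration only.
samples = [
    {"sha256": "a1", "first_seen": date(2018, 3, 1), "label": 1},
    {"sha256": "b2", "first_seen": date(2018, 9, 15), "label": 0},
    {"sha256": "c3", "first_seen": date(2018, 11, 2), "label": 1},
    {"sha256": "d4", "first_seen": date(2018, 12, 20), "label": 0},
]
train, test = temporal_split(samples, cutoff=date(2018, 11, 1))
```

A random split would let the model see samples from the same period it is tested on, which tends to overstate how well it will detect malware that appears after training.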

Tree-Based vs Deep Learning Classifiers

For tabular feature vectors like those from EMBER, gradient-boosted decision trees (LightGBM, XGBoost) are a strong default and often match or outperform deep learning approaches. The baseline LightGBM model published with EMBER reports a ROC AUC of 0.9991 on the benchmark's test set, with training times measured in minutes rather than the hours required for neural networks.
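
Since AUC is the headline metric here, it helps to see what it measures: the probability that a randomly chosen malicious sample receives a higher score than a randomly chosen benign one. The sketch below computes it by direct pairwise comparison in pure Python. It is quadratic in the number of samples and intended for illustration; a library implementation should be used for real evaluation.

```python
def roc_auc(labels, scores):
    """Compute ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: iterable of 0 (benign) or 1 (malicious)
    scores: iterable of model scores, higher meaning more likely malicious
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))
```

With labels `[0, 0, 1, 1]` and scores `[0.1, 0.4, 0.35, 0.8]`, three of the four malicious/benign pairs are ranked correctly, so the AUC is 0.75.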

Deep learning models become advantageous when working with raw binary data rather than pre-extracted features. MalConv, a convolutional neural network that processes raw byte sequences up to about 2 MB in length, can learn discriminative features directly from the binary content without manual feature engineering.
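
A raw-byte model needs every file converted to a fixed-length integer sequence before it reaches the network. The sketch below shows one common way to do that: truncate to a maximum length, shift byte values up by one, and reserve 0 as the padding token. The constant names, the 2,000,000-byte cap, and the padding convention are assumptions for illustration and may differ from any specific MalConv implementation.

```python
MAX_LEN = 2_000_000  # assumed cap, mirroring the roughly 2 MB limit described above
PAD_ID = 0           # assumed padding token; real byte values map to 1..256

def bytes_to_token_ids(data: bytes, max_len: int = MAX_LEN) -> list:
    """Convert raw bytes to a fixed-length list of integer token ids."""
    # Truncate long files, then shift each byte by one so 0 stays free for padding.
    ids = [b + 1 for b in data[:max_len]]
    # Right-pad short files up to max_len.
    ids.extend([PAD_ID] * (max_len - len(ids)))
    return ids
```

For example, `bytes_to_token_ids(b"\x00\xff", max_len=4)` yields `[1, 256, 0, 0]`. The resulting sequence would typically feed an embedding layer with 257 entries, one per byte value plus padding.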

LightGBM/XGBoost: Best for pre-extracted tabular features; fast training, high AUC, interpretable
MalConv: CNN on raw bytes; no feature engineering needed but requires large training sets
EMBER Neural Network: Multi-layer perceptron on EMBER features; competitive with tree methods
Ensemble approaches: Combining tree-based and neural models often yields the best production results
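
One simple way to combine a tree-based model with a neural model is a weighted average of their predicted probabilities. The sketch below assumes both models already output a malicious-probability in [0, 1]; the function names, the equal default weighting, and the 0.5 decision threshold are illustrative choices, not recommendations from this section.

```python
def ensemble_score(tree_score: float, nn_score: float, tree_weight: float = 0.5) -> float:
    """Blend two malicious-probability scores with a weighted average."""
    if not 0.0 <= tree_weight <= 1.0:
        raise ValueError("tree_weight must be between 0 and 1")
    return tree_weight * tree_score + (1.0 - tree_weight) * nn_score

def is_malicious(tree_score: float, nn_score: float,
                 threshold: float = 0.5, tree_weight: float = 0.5) -> bool:
    """Apply a decision threshold to the blended score."""
    return ensemble_score(tree_score, nn_score, tree_weight) >= threshold
```

In practice the weight and threshold would be chosen on a held-out validation set, ideally one that postdates the training data.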