Chapter 4

Dynamic Reasoning with o3

How OpenAI Codex Works

Introduction

Codex is powered by codex-1, which is built on OpenAI's o3 reasoning model. This gives Codex the ability to "think" for extended periods before acting - a crucial capability for complex software engineering tasks.

The Reasoning Difference: Traditional LLMs generate tokens immediately. Reasoning models like o3 can spend compute time thinking before generating output. This "extended thinking" dramatically improves performance on complex tasks.

What is o3?

o3 is OpenAI's advanced reasoning model, evolved from o1. It excels at:

  • Multi-step reasoning: Breaking down complex problems
  • Planning: Developing approaches before executing
  • Self-correction: Catching and fixing its own mistakes
  • Domain expertise: Specialized knowledge application

o3 vs Standard Models

| Aspect                   | Standard Model (GPT-4) | Reasoning Model (o3)          |
|--------------------------|------------------------|-------------------------------|
| Response time            | Immediate              | Variable (seconds to minutes) |
| Token generation         | Sequential             | Think first, then generate    |
| Complex reasoning        | Limited                | Extended                      |
| Error rate on hard tasks | Higher                 | Lower                         |
| Cost per token           | Lower                  | Higher                        |
| Best for                 | Quick responses        | Complex tasks                 |
📝reasoning_comparison.txt
Standard Model (GPT-4):
User: "Refactor this auth module to use JWT"
Model: [Immediately starts generating code]
       [May miss edge cases]
       [Linear approach]

Reasoning Model (o3):
User: "Refactor this auth module to use JWT"
Model: [Thinking...]
       - What auth patterns exist in the codebase?
       - What JWT library should I use?
       - Where are sessions currently stored?
       - What migration path minimizes breakage?
       - What tests need to be updated?
       [Then generates comprehensive solution]
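
In OpenAI's Chat Completions API, the "think first" behavior of o-series models can be tuned with the `reasoning_effort` parameter ("low", "medium", or "high"). As a minimal sketch, a helper could map an estimated task complexity to that setting; the `build_request` function and the complexity labels are illustrative assumptions, not part of the API:

```python
def build_request(task: str, complexity: str) -> dict:
    """Map an estimated task complexity to a reasoning-effort setting."""
    effort = {
        "simple": "low",       # quick fixes: minimal thinking
        "moderate": "medium",
        "complex": "high",     # refactors, new features: think longer
    }[complexity]
    return {
        "model": "o3-mini",            # any o-series reasoning model
        "reasoning_effort": effort,    # how long the model may think
        "messages": [{"role": "user", "content": task}],
    }

# The resulting dict can be passed to client.chat.completions.create(**request).
request = build_request("Refactor this auth module to use JWT", "complex")
```

Higher effort trades latency and cost for more internal reasoning, mirroring the table above.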

Extended Thinking

Extended thinking is the core capability that makes o3 powerful for agentic tasks:

🐍extended_thinking_concept.py
from dataclasses import dataclass


@dataclass
class Response:
    content: str
    thinking_tokens_used: int


class ExtendedThinkingModel:
    """Conceptual model of extended thinking."""

    def generate(
        self,
        prompt: str,
        thinking_budget: int = 10000,  # Thinking tokens
    ) -> Response:
        """Generate with extended thinking."""

        # Phase 1: Internal reasoning (not shown to user)
        thinking_tokens = self.internal_reasoning(
            prompt,
            budget=thinking_budget,
        )

        # Phase 2: Generate final response
        response = self.generate_final(
            prompt,
            thinking_context=thinking_tokens,
        )

        return Response(
            content=response,
            thinking_tokens_used=len(thinking_tokens),
        )

    def internal_reasoning(self, prompt: str, budget: int) -> list[str]:
        """
        Internal reasoning process.
        This is where the model 'thinks' before responding.
        """
        thinking = []

        # Step 1: Understand the problem
        thinking.append(self.analyze_problem(prompt))

        # Step 2: Consider approaches
        thinking.append(self.consider_approaches())

        # Step 3: Evaluate tradeoffs
        thinking.append(self.evaluate_tradeoffs())

        # Step 4: Plan solution
        thinking.append(self.plan_solution())

        # Step 5: Anticipate issues
        thinking.append(self.anticipate_issues())

        return thinking

Thinking in Practice

📝thinking_example.txt
Task: "Add rate limiting to the API"

Internal Thinking (not visible to user):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

1. UNDERSTANDING THE CODEBASE
   - Express.js backend
   - Currently no rate limiting
   - Redis already in use for sessions
   - Multiple API endpoints

2. EVALUATING APPROACHES
   Option A: express-rate-limit
   - Pros: Simple, well-maintained
   - Cons: In-memory by default

   Option B: Custom Redis-based
   - Pros: Distributed, uses existing Redis
   - Cons: More implementation work

   Option C: rate-limiter-flexible
   - Pros: Redis support built-in
   - Cons: Another dependency

3. DECISION
   → Option C with Redis store
   - Leverages existing infrastructure
   - Battle-tested library
   - Clean API

4. IMPLEMENTATION PLAN
   - Install rate-limiter-flexible
   - Configure Redis connection
   - Add middleware to routes
   - Different limits for auth vs general
   - Add rate limit headers

5. POTENTIAL ISSUES
   - Need to handle Redis connection failures
   - Consider user vs IP limiting
   - Need bypass for health checks

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

[Now begins code generation with this context]

Thinking Time = Better Results

The extended thinking time isn't wasted: it's where the model builds a comprehensive understanding of the problem and avoids common pitfalls. Complex tasks benefit significantly from this upfront investment.

Dynamic Compute Allocation

Codex can dynamically allocate compute time based on task complexity:

🐍dynamic_compute.py
from dataclasses import dataclass
from enum import Enum

class TaskComplexity(Enum):
    SIMPLE = "simple"           # Fix typo, add comment
    MODERATE = "moderate"       # Add function, fix bug
    COMPLEX = "complex"         # New feature, refactor
    VERY_COMPLEX = "very_complex"  # Architecture change

@dataclass
class ComputeConfig:
    thinking_tokens: int
    max_iterations: int
    reasoning_depth: str

# Dynamic compute allocation
COMPUTE_CONFIGS = {
    TaskComplexity.SIMPLE: ComputeConfig(
        thinking_tokens=1000,
        max_iterations=5,
        reasoning_depth="shallow",
    ),
    TaskComplexity.MODERATE: ComputeConfig(
        thinking_tokens=5000,
        max_iterations=15,
        reasoning_depth="medium",
    ),
    TaskComplexity.COMPLEX: ComputeConfig(
        thinking_tokens=15000,
        max_iterations=30,
        reasoning_depth="deep",
    ),
    TaskComplexity.VERY_COMPLEX: ComputeConfig(
        thinking_tokens=50000,
        max_iterations=50,
        reasoning_depth="exhaustive",
    ),
}


class DynamicComputeAllocator:
    """Allocate compute based on task complexity."""

    def estimate_complexity(self, task: str, codebase_info: dict) -> TaskComplexity:
        """Estimate task complexity from description and codebase."""

        # Signals for complexity
        complexity_signals = {
            # Simple signals
            "fix typo": -2,
            "add comment": -2,
            "rename": -1,

            # Moderate signals
            "add function": 1,
            "fix bug": 1,
            "add test": 1,

            # Complex signals
            "refactor": 2,
            "new feature": 2,
            "implement": 2,
            "integrate": 2,

            # Very complex signals
            "architecture": 3,
            "migrate": 3,
            "redesign": 3,
            "security": 3,
        }

        score = 0
        task_lower = task.lower()

        for signal, weight in complexity_signals.items():
            if signal in task_lower:
                score += weight

        # Factor in codebase size
        if codebase_info.get("files", 0) > 100:
            score += 1
        if codebase_info.get("files", 0) > 500:
            score += 1

        # Map score to complexity
        if score <= 0:
            return TaskComplexity.SIMPLE
        elif score <= 2:
            return TaskComplexity.MODERATE
        elif score <= 4:
            return TaskComplexity.COMPLEX
        else:
            return TaskComplexity.VERY_COMPLEX

    def allocate(self, task: str, codebase_info: dict) -> ComputeConfig:
        """Allocate compute resources for a task."""
        complexity = self.estimate_complexity(task, codebase_info)
        return COMPUTE_CONFIGS[complexity]

Adaptive Reasoning Depth

🐍adaptive_reasoning.py
class AdaptiveReasoner:
    """Adjust reasoning depth based on task progress."""

    def __init__(self, initial_budget: int):
        self.remaining_budget = initial_budget
        self.depth_level = "normal"

    def should_think_deeper(self, current_result: dict) -> bool:
        """Decide if more thinking is needed."""

        # Signals that more thinking helps (defaults assume a clean result)
        needs_more = any([
            current_result.get("confidence", 1.0) < 0.7,
            current_result.get("error_count", 0) > 0,
            current_result.get("test_failures", 0) > 0,
            current_result.get("ambiguity", 0.0) > 0.5,
        ])

        # Check budget
        can_afford = self.remaining_budget > 1000

        return needs_more and can_afford

    def calculate_iteration_budget(self) -> int:
        """Thinking budget for the next iteration, scaled by current depth."""
        scale = {"normal": 0.2, "deep": 0.4, "exploratory": 0.3}
        return int(self.remaining_budget * scale[self.depth_level])

    def adjust_depth(self, feedback: dict) -> None:
        """Adjust reasoning depth based on feedback."""

        if feedback.get("errors_increasing"):
            # Step back, think more carefully
            self.depth_level = "deep"
            self.remaining_budget += 5000  # Allocate more

        elif feedback.get("making_progress"):
            # Continue current approach
            self.depth_level = "normal"

        elif feedback.get("stuck"):
            # Try different approach
            self.depth_level = "exploratory"

    def execute_with_adaptive_reasoning(
        self,
        task: str,
        model: ExtendedThinkingModel,
    ) -> Result:
        """Execute task with adaptive reasoning depth."""

        result = None
        iteration = 0
        max_iterations = 10

        while iteration < max_iterations and self.remaining_budget > 0:
            # Allocate thinking budget for this iteration
            thinking_budget = self.calculate_iteration_budget()

            # Execute with thinking
            result = model.generate(
                task,
                thinking_budget=thinking_budget,
            )

            # Update remaining budget
            self.remaining_budget -= result.thinking_tokens_used

            # Check if we should continue
            if result.is_complete:
                return result

            # Adjust for next iteration
            self.adjust_depth(result.feedback)
            iteration += 1

        return result

Implementing Similar Patterns

You can implement reasoning-like patterns even without access to o3:

Pattern 1: Explicit Chain-of-Thought

🐍explicit_cot.py
def reason_then_act(task: str, context: str) -> str:
    """Force explicit reasoning before action."""
    # `llm` is assumed to be a configured client with generate(prompt) -> str

    reasoning_prompt = f"""
Task: {task}

Context:
{context}

Before providing a solution, think through the following:

1. UNDERSTANDING
   - What exactly is being asked?
   - What are the constraints?
   - What information do I need?

2. ANALYSIS
   - What are the possible approaches?
   - What are the tradeoffs?
   - What could go wrong?

3. PLAN
   - What approach will I take?
   - What steps are needed?
   - What's the order of operations?

4. VERIFICATION
   - How will I verify the solution works?
   - What edge cases should I consider?

Please think through each section, then provide your solution.
"""

    return llm.generate(reasoning_prompt)

Pattern 2: Multi-Pass Reasoning

🐍multi_pass_reasoning.py
class MultiPassReasoner:
    """Multiple reasoning passes for complex tasks."""

    def __init__(self, llm):
        # Any client exposing generate(prompt) -> str
        self.llm = llm

    def solve(self, task: str) -> Solution:
        # Pass 1: Initial analysis
        analysis = self.analyze(task)

        # Pass 2: Generate solution
        draft = self.generate_solution(task, analysis)

        # Pass 3: Critical review
        review = self.review_solution(task, draft)

        # Pass 4: Refine based on review
        final = self.refine_solution(draft, review)

        return final

    def analyze(self, task: str) -> str:
        return self.llm.generate(f"""
Analyze this task deeply:
{task}

Consider:
1. What are the key requirements?
2. What are potential challenges?
3. What patterns should be used?
4. What are the success criteria?

Analysis:
""")

    def review_solution(self, task: str, draft: str) -> str:
        return self.llm.generate(f"""
Task: {task}

Draft solution:
{draft}

Review critically:
1. Does this solve the task completely?
2. Are there bugs or issues?
3. Are there edge cases not handled?
4. Could this be improved?

Review:
""")

Pattern 3: Debate-Based Reasoning

🐍debate_reasoning.py
class DebateReasoner:
    """Use internal debate for better decisions."""

    def solve_with_debate(self, task: str) -> Solution:
        # Generate multiple approaches
        approaches = self.generate_approaches(task, n=3)

        # Have each approach critique the others
        critiques = []
        for i, approach in enumerate(approaches):
            others = approaches[:i] + approaches[i+1:]
            critique = self.critique(approach, others)
            critiques.append(critique)

        # Synthesize best solution
        return self.synthesize(approaches, critiques)

    def generate_approaches(self, task: str, n: int) -> list[str]:
        approaches = []
        for i in range(n):
            # Show earlier approaches so the model can diverge from them
            previous = "\n\n".join(approaches) or "(none yet)"
            prompt = f"""
Task: {task}

Previous approaches:
{previous}

Generate approach #{i+1}, different from the previous ones.
Consider a unique angle or method.

Approach:
"""
            approaches.append(self.llm.generate(prompt))
        return approaches

    def critique(self, approach: str, others: list[str]) -> str:
        return self.llm.generate(f"""
This approach:
{approach}

Alternative approaches:
{chr(10).join(others)}

Critique this approach:
1. What are its strengths?
2. What are its weaknesses?
3. How does it compare to alternatives?
4. Should it be preferred? Why?

Critique:
""")

Reasoning Adds Value

These patterns add latency but significantly improve quality for complex tasks. Use simpler patterns for simple tasks - don't over-engineer.
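
One way to avoid over-engineering is to route each task to the cheapest pattern that is likely to suffice. A minimal dispatcher sketch; the keyword heuristic and pattern names are assumptions for illustration, not a prescribed taxonomy:

```python
def choose_pattern(task: str) -> str:
    """Pick the cheapest reasoning pattern likely to suffice (heuristic)."""
    task_lower = task.lower()
    # Very complex: debating alternatives is worth the latency
    if any(kw in task_lower for kw in ("architecture", "redesign", "migrate")):
        return "debate"
    # Complex: draft, review, refine
    if any(kw in task_lower for kw in ("refactor", "implement", "new feature")):
        return "multi-pass"
    # Moderate: a single explicit chain-of-thought pass
    if any(kw in task_lower for kw in ("fix bug", "add function", "add test")):
        return "chain-of-thought"
    # Simple tasks: answer directly, no extra passes
    return "direct"
```

In practice you would tune the keywords to your own task mix, much like the complexity signals in `dynamic_compute.py`.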

Summary

Dynamic reasoning with o3:

  1. Extended thinking: Models can think before generating
  2. Better quality: More reasoning = fewer errors on complex tasks
  3. Dynamic allocation: Match compute to task complexity
  4. Adaptive depth: Adjust reasoning based on progress
  5. Implementable patterns: Chain-of-thought, multi-pass, debate

Next: Let's explore AGENTS.md, Codex's equivalent of CLAUDE.md for configuration and context.