Detecting a 21-Year-Old FFmpeg Vulnerability Using LLM-Augmented Datalog Analysis

How NeuroLog Found the H.264 Slice Table Sentinel Collision Bug

Date: April 2026 Target: FFmpeg H.264 Decoder (h264.c, h264.h) Vulnerability: Unbounded 32-bit slice counter collides with 16-bit sentinel value 0xFFFF, causing heap OOB write in the deblocking filter Fix Commit: FFmpeg/FFmpeg@c988f97566 Bug Age: Introduced in 2003, became exploitable after 2010 refactor Tool: NeuroLog — an LLM-augmented Datalog-based source code analysis system


Executive Summary

We describe how NeuroLog, a hybrid system combining Large Language Model (LLM) fact extraction with Soufflé Datalog formal reasoning, detected a recently disclosed vulnerability in FFmpeg's H.264 decoder by Anthropic's Mythos Model A 16-year-old FFmpeg vulnerability — a bug that had evaded every fuzzer and human reviewer for over two decades.

The vulnerability arises from the interaction of two properties:

  1. A 32-bit integer slice counter (current_slice) that increments without an enforced upper bound
  2. A 16-bit unsigned integer table (slice_table) initialized with memset(..., -1, ...), making 0xFFFF (65535) the sentinel for "unowned macroblock"

When an attacker crafts an H.264 stream with exactly 65,536 slices, slice number 65535 collides with the sentinel. The deblocking filter then treats uninitialized neighbor macroblocks as valid, causing an out-of-bounds heap write.

Our system flagged both halves of this bug automatically:

The LLM's role was to extract structured facts (variable types, definitions, uses, control flow, arithmetic operations) from source code. Datalog's role was to compose these facts through formal rules to surface the vulnerability pattern. Neither component alone could have found this bug.


1. System Architecture

NeuroLog operates as a five-phase pipeline: ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ 1. SCAN │───▶│ 2. SLICE │───▶│ 3. EXTRACT │───▶│ 4. ANALYZE │───▶│ 5. INTERPRET │ │ tree-sitter │ │ call-graph │ │ LLM → facts │ │ Soufflé │ │ LLM reason │ │ parse code │ │ backward │ │ .facts TSV │ │ Datalog │ │ over output │ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘

Phase 1: Scan

Uses tree-sitter (no compilation required) to enumerate all functions, build a call graph, and identify functions calling dangerous sinks (memcpy, memset, malloc, free, etc.). For this project: 73 functions across 4 files, 38 dangerous sink call sites, 283 call graph edges.

Phase 2: Slice

Performs backward slicing from dangerous sinks through the call graph. This determines which functions need deep analysis. Starting from 38 sink sites, tracing 3 levels back through callers, the slicer selected 16 target functions — reducing analysis scope by ~78% while preserving all security-relevant paths.

Phase 3: Extract (LLM)

An LLM reads the source code of each target function and extracts structured facts as tab-separated values conforming to predefined relational schemas. The LLM produces facts for:

Relation Description Count
VarType Variable name, C type, byte width, signedness 71
Def Variable defined at line 141
Use Variable used at line 590
ArithOp Arithmetic operation (add, mul, shift) with operands 45
Guard Conditional bound checks (var < bound) 95
CFGEdge Control flow between lines 514
Call Function call sites 127
ActualArg Arguments passed at call sites 315
MemRead Memory read with base + offset 42
MemWrite Memory write targets 33
FieldRead Struct field access (read) 312
FieldWrite Struct field access (write) 104
Total 2,489

Phase 4: Analyze (Soufflé Datalog)

The extracted facts are fed into 17 Datalog rule files (~145KB of logic) executed by the Soufflé engine. Rules compose facts through logical inference to derive security findings. The key rule files for this vulnerability:

Rule File Purpose Size
source_core.dl Reaching definitions, def-use chains, CFG reachability 4.8 KB
source_type_safety.dl Type mismatches, unbounded counters, truncation detection 30.9 KB
type_knowledge.dl C type semantics knowledge base (widths, signedness) 10.0 KB

Phase 5: Interpret (LLM)

The LLM reads the Datalog output CSVs and reasons about the findings in the context of the source code, connecting formal results to exploit chains.

2. The Analysis Session — Step by Step

2.1 The Ask

We were asked to analyze FFmpeg's H.264 decoder for memory corruption bugs, with a specific focus on integer type mismatches and unbounded counters — two vulnerability classes that interact in subtle ways and are poorly covered by traditional fuzzing.

The target was a single-file decoder (h264.c, ~3,100 lines) with its header (h264.h, ~1,341 lines) and supporting files for the loop filter and parser.

2.2 Phase 1 — Scanning the Attack Surface

We began by running tool_scan_project with tree-sitter to map the codebase without compilation: Functions discovered: 73 (across 4 files) Call graph edges: 283 Dangerous sink sites: 38 (memset, memcpy, printf, etc.)

The scan immediately surfaced the structure of the decoder: decode_framedecode_nal_unitsdecode_slice_headerff_h264_frame_startff_h264_alloc_tables. These are the critical paths where attacker-controlled bitstream data flows into memory allocation, table indexing, and pixel writes.

We noted 38 dangerous sink calls — predominantly memset and memcpy in allocation and initialization routines.

2.3 Phase 2 — Backward Slicing to Focus Analysis

With 73 functions, exhaustive LLM extraction would be expensive and would flood the Datalog engine with irrelevant facts. We ran tool_build_slice to trace backward 3 levels from every dangerous sink through the call graph.

This selected 16 functions for deep analysis — a 78% reduction in scope while preserving every path from attacker input to memory-unsafe operations. The selected functions included:

2.4 Phase 3 — LLM Fact Extraction

For each of the 16 target functions, the LLM read the source code and produced structured relational facts. This is the critical bridge: the LLM understands C semantics (types, control flow, pointer arithmetic) while the Datalog engine provides formal compositional reasoning.

The LLM extracted 2,489 facts across 12 relation schemas. Here are the key facts that would later prove essential to detecting the vulnerability:

VarType facts — The LLM correctly identified types from declarations and struct definitions:

# The LLM saw "int context_count = 0;" and extracted: decode_nal_units context_count int 4 signed # The LLM saw "int buf_index = 0;" and extracted: decode_nal_units buf_index int 4 signed # The LLM saw "int bit_length;" and extracted: decode_nal_units bit_length int 4 signed These type facts are unavailable to binary analysis tools — after compilation, int and unsigned int are both just 32-bit registers. But in our system, the LLM preserves the source-level type name, which the type_knowledge.dl knowledge base then resolves to canonical width and signedness.

ArithOp facts — The LLM identified every increment operation:

# The two lines where context_count is incremented: decode_nal_units 2537 context_count 0 add context_count 0 1 decode_nal_units 2566 context_count 0 add context_count 0 1 Guard facts — Critically, the LLM extracted the form of every conditional check. For context_count, it found: decode_nal_units 2599 context_count 0 == h->max_contexts var #This is an equality check (==), not an upper-bound check (< or <=). This distinction is what allows the Datalog HasUpperBound rule to correctly determine that context_count lacks a proper upper bound guard — equality checks don't prevent a value from exceeding the bound on a different code path.

2.5 Phase 4 — Datalog Inference

With facts on disk, we ran the Soufflé engine across 17 rule files. The rules composed the raw facts through logical inference to derive security findings.

The analysis produced:

Output Relation Rows Significance TypeSafetyFinding 62 All type-safety issues in one relation UnboundedCounter 14 Variables incremented without upper bounds CounterUsedAsIndex 27 Unbounded counters used in memory access SignednessMismatch 20 Signed/unsigned confusion at assignment PotentialArithOverflow 2 Arithmetic that may overflow int range DefReachesUse 702 Reaching definition pairs (data flow backbone) ResolvedVarType 107 LLM types validated against knowledge base

2.6 Phase 5 — Interpreting the Findings

The LLM read the Datalog output and cross-referenced it with the decompiled source code. Two findings immediately stood out as related:

Finding A — In decode_nal_units, context_count was flagged as an UnboundedCounter (incremented at lines 2537 and 2566) that is also a CounterUsedAsIndex (used to index h->thread_context[] at line 2473).

Finding B — Multiple SignednessMismatch entries across the codebase showed pointer-to-integer confusion, but more importantly, the pattern of int variables being used in contexts where they flow into narrower storage was pervasive.

The LLM then analyzed the decode_slice_header function (which the pipeline had selected as a target but whose current_slice counter facts were captured through field writes and call-site analysis). By reading lines 2084-2087, it identified the smoking gun:

h->slice_num = ++h0->current_slice; if(h->slice_num >= MAX_SLICES){ av_log(s->avctx, AV_LOG_ERROR, "Too many slices..."); // ← NO return -1! }

The guard exists but does not terminate execution. Combined with the type information from h264.h showing int slice_num (line 361) stored into uint16_t *slice_table (line 363), the exploit chain became clear: 32-bit counter → 16-bit storage → sentinel collision at 65535 → OOB write in the deblocking filter.

3. The Datalog Rules That Detected Smell for This Bug

The vulnerability was surfaced by the interaction of three Datalog rule groups, each contributing a distinct piece of the puzzle. We walk through each rule, show its definition, and trace exactly which facts it consumed and what it produced.

3.1 Rule 9: UnboundedCounter — "This counter has no ceiling"

Purpose: Identify variables that are incremented via arithmetic but never checked against an upper bound anywhere in the function.

Definition (from source_type_safety.dl, lines 382–396):

.decl HasUpperBound(func: Sym, var: Sym) HasUpperBound(f, v) :- Guard(f, _, v, _, "<", _, _). HasUpperBound(f, v) :- Guard(f, _, v, _, "<=", _, _). HasUpperBound(f, v) :- Guard(f, _, v, _, "lt", _, _). HasUpperBound(f, v) :- Guard(f, _, v, _, "le", _, _). HasUpperBound(f, v) :- Guard(f, _, v, _, "unsigned_lt", _, _). HasUpperBound(f, v) :- Guard(f, _, v, _, "unsigned_le", _, _). .decl UnboundedCounter(func: Sym, var: Sym, incr_addr: Addr) UnboundedCounter(f, v, a) :- ArithOp(f, a, v, _, "add", _, _, _), !HasUpperBound(f, v).

How it fired:

The LLM extracted these ArithOp facts for context_count:

decode_nal_units 2537 context_count 0 add context_count 0 1 decode_nal_units 2566 context_count 0 add context_count 0 1

And this Guard fact:

decode_nal_units 2599 context_count 0 == h->max_contexts var

The HasUpperBound rule looks for guards with operators <, <=, lt, le, unsigned_lt, or unsigned_le. The guard on context_count uses == (equality), which is not in that set. Therefore HasUpperBound("decode_nal_units", "context_count") was not derived, and the negation !HasUpperBound(f, v) succeeded.

Output produced (from UnboundedCounter.csv):

decode_nal_units context_count 2537 decode_nal_units context_count 2566

Why this matters: An equality check (== h->max_contexts) only resets the counter when it exactly hits the bound. It does not prevent the counter from exceeding the bound if incrementing happens on multiple code paths within the same loop iteration, or if the check is placed after the use. The Datalog rule captured this semantic distinction precisely.

This same rule pattern also flagged buf_index (3 increment sites in decode_nal_units) and p (3 increment sites in decode_frame), both of which we identified as medium-severity issues in the broader analysis.

3.2 Rule 9b: CounterUsedAsIndex — "This unbounded counter touches memory"

Purpose: Connect unbounded counters to actual memory operations — array indexing, pointer arithmetic, or dangerous sink arguments.

Definition (from source_type_safety.dl, lines 414444): .decl CounterUsedAsIndex(func: Sym, var: Sym, incr_addr: Addr, use_addr: Addr, access_kind: Sym) CounterUsedAsIndex(f, v, ia, ua, "mem_read_offset") :- UnboundedCounter(f, v, ia), DefReachesUse(f, v, ia, ua), MemRead(f, ua, _, v, _). CounterUsedAsIndex(f, v, ia, ua, "ptr_arith") :- UnboundedCounter(f, v, ia), DefReachesUse(f, v, ia, ua), ArithOp(f, ua, _, _, "add", v, _, _).

How it fired:

The DefReachesUse relation (computed by source_core.dl using reaching definitions over the CFG) connected the definition of context_count at line 2537 to its use at line 2473:

decode_nal_units context_count 2537 2473 decode_nal_units context_count 2566 2473

At line 2473, context_count is used as an index into the thread_context array:

hx = h->thread_context[context_count]; // line 2473

The LLM extracted this as a MemRead fact:

decode_nal_units 2473 h->thread_context context_count 8

The join of UnboundedCounter + DefReachesUse + MemRead produced:

decode_nal_units context_count 2537 2473 mem_read_offset decode_nal_units context_count 2566 2473 mem_read_offset

Why this matters: The rule connects the abstract property ("this counter has no bound") to a concrete consequence ("this counter indexes into memory"). Without the DefReachesUse data flow, we would know the counter is unbounded but not whether it matters. Without the MemRead facts, we would know the counter flows somewhere but not that it reaches a memory access.

3.3 The Type Knowledge Base — "int is 4 bytes, signed"

Purpose: Provide ground-truth C type semantics that the LLM's raw extraction can be validated against.

Definition (from type_knowledge.dl, key entries): TypeInfo("int", "int32", 4, "signed"). TypeInfo("uint16_t", "uint16", 2, "unsigned"). TypeInfo("unsigned int", "uint32", 4, "unsigned"). .decl ResolvedVarType(func, var, type_name, canonical, width, signedness) ResolvedVarType(f, v, tn, c, w, s) :- VarType(f, v, tn, _, _), TypeInfo(tn, c, w, s).

How it contributed:

The LLM extracted VarType("decode_nal_units", "context_count", "int", 4, "signed"). The ResolvedVarType rule joined this against the knowledge base, confirming that int → canonical int32, width 4 bytes, signed.

This resolved type feeds into PotentialArithOverflow (Rule 9c):

PotentialArithOverflow(f, a, dst, op, tn, w) :- ArithOp(f, a, dst, _, op, _, _, _), (op = "mul" ; op = "lsl" ; op = "add"), ResolvedVarType(f, dst, tn, _, w, "signed"), w <= 4, !Guard(f, _, dst, _, _, _, _).

This produced the next_avc overflow finding at line 2458 (buf_index + nalsize in signed 32-bit arithmetic).

3.4 Rule 2b: SignednessMismatch — "Signed meets unsigned at assignment"

Purpose: Detect assignments where a signed variable flows into an unsigned one (or vice versa) without an explicit cast.

SignednessMismatch(f, a, src, dst, "pointer", ss, st, dt) :- Def(f, dst, _, a), Use(f, src, _, a), src != dst, ResolvedVarType(f, src, st, _, _, "pointer"), ResolvedVarType(f, dst, dt, _, _, ss), ss != "pointer", ss != "struct", !Call(f, _, a).

While this rule primarily fired for pointer-to-integer patterns in decode_frame and ff_h264_hl_decode_mb, it established the systematic pattern that the codebase has pervasive implicit type conversions without casts — the same pattern class that makes int slice_num → uint16_t slice_table[] dangerous.

3.5 The Reaching Definitions Engine — The Backbone

All of the above rules depend on DefReachesUse, computed by source_core.dl using standard Gen-Kill dataflow:

.decl Kills(func: Sym, var: Sym, node: Addr) Kills(f, v, n) :- Def(f, v, _, n). .decl ReachesIn(func: Sym, var: Sym, def_line: Addr, node: Addr) ReachesIn(f, v, d, n) :- ReachesOut(f, v, d, pred), CFGEdge(f, pred, n). .decl ReachesOut(func: Sym, var: Sym, def_line: Addr, node: Addr) ReachesOut(f, v, d, d) :- Def(f, v, _, d). ReachesOut(f, v, d, n) :- ReachesIn(f, v, d, n), !Kills(f, v, n). .decl DefReachesUse(func: Sym, var: Sym, def_line: Addr, use_line: Addr) DefReachesUse(f, v, d, u) :- ReachesIn(f, v, d, u), Use(f, v, _, u).

From 141 Def facts, 590 Use facts, and 514 CFGEdge facts, this engine derived 702 DefReachesUse pairs — the complete intraprocedural data flow map. This is what allowed us to trace that context_count, defined at line 2537 (an increment), reaches its use at line 2473 (the array index) — even though those lines are in different iterations of the for(;wink loop, connected through CFGEdge back-edges.

3.6 The TypeSafetyFinding Summary — Composing Everything

Finally, all individual findings flow into a single summary relation:

TypeSafetyFinding(f, a, "unbounded_counter", v, cat("incremented at ", to_string(a))) :- UnboundedCounter(f, v, a). TypeSafetyFinding(f, ua, "counter_used_as_index", v, cat(kind, cat(" incr@", to_string(ia)))) :- CounterUsedAsIndex(f, v, ia, ua, kind). TypeSafetyFinding(f, a, "signedness_mismatch", src, cat(st, cat("(", cat(ss, cat(")→", cat(dt, cat("(", cat(ds, ")")))))))) :- SignednessMismatch(f, a, src, _, ss, ds, st, dt). TypeSafetyFinding(f, a, "potential_arith_overflow", v, cat(op, cat(" ", cat(tn, cat(" width=", to_string(w)))))) :- PotentialArithOverflow(f, a, v, op, tn, w).

This produced the 62 TypeSafetyFinding rows that we used as the starting point for interpretation.

4. How We Connected the Dots — The LLM Interpretation Phase

The Datalog engine gave us formal findings. But formal findings are not an exploit chain. The LLM's final role was to read the output CSVs alongside the source code and perform the semantic reasoning that Datalog cannot.

4.1 What Datalog Told Us

From the 62 findings, we focused on the unbounded counter cluster:

decode_nal_units 2537 unbounded_counter context_count incremented at 2537 decode_nal_units 2566 unbounded_counter context_count incremented at 2566 decode_nal_units 2537 counter_used_as_index context_count mem_read_offset incr@2537 decode_nal_units 2566 counter_used_as_index context_count mem_read_offset incr@2566

This told us: "In decode_nal_units, the variable context_count is incremented at two separate code locations and used as a memory index, but has no upper-bound guard."

4.2 What We Did Next

The LLM read decode_nal_units and immediately identified the pattern: context_count indexes into h->thread_context[], but its guard is an equality check at the bottom of the loop. This was clearly an OOB risk, but it was a localized single-function bug.

The deeper insight came when the LLM noticed the Datalog output did not flag current_slice as an unbounded counter — because decode_slice_header was analyzed but current_slice is a struct field write (h0->current_slice), not a local variable with ArithOp facts. It had been extracted as a FieldWrite, not as a local ArithOp.

This is where Datalog+LLM reasoning transcended what the Datalog rules alone could express. The LLM read decode_slice_header at line 2084:

h->slice_num = ++h0->current_slice; if(h->slice_num >= MAX_SLICES){ av_log(s->avctx, AV_LOG_ERROR, "Too many slices..."); }

The LLM recognized:

The same unbounded counter pattern — increment without enforced bound The log-only guard — no return -1 The connection to slice_table — a uint16_t* array The LLM then consulted h264.h and confirmed:

slice_num is int (32-bit, line 361) slice_table is uint16_t* (16-bit, line 363) The sentinel value from memset(..., -1, ...) is 0xFFFF = 65535 The collision became obvious: when current_slice reaches 65535, slice_num truncates to 0xFFFF when stored in slice_table, matching the sentinel exactly.

4.3 Tracing the Consequence

The LLM followed the data flow from slice_table to its consumers in fill_caches (h264.h lines 793-831):

top_type = h->slice_table[top_xy] < 0xFFFF ? s->current_picture.mb_type[top_xy] : 0;

When slice_table[top_xy] is the sentinel 0xFFFF but a real slice also has number 65535, the comparison 0xFFFF < 0xFFFF evaluates to false — correctly treating it as "no neighbor." But in the non-deblocking path:

top_type = h->slice_table[top_xy] == h->slice_num ? s->current_picture.mb_type[top_xy] : 0;

Here 0xFFFF == 65535 is true — falsely claiming the sentinel-initialized neighbor is in the current slice. This enables reads of uninitialized mb_type, motion_val, and ref_index data, which the deblocking filter then uses to compute write destinations in the frame buffer.

5. What Would Have Been Missed Without Datalog

This section addresses a critical question.

5.1 The LLM Alone: Powerful but Undisciplined

An LLM reading this codebase without Datalog support would face several fundamental challenges:

Context window saturation. The four files total ~5,800 lines. The H.264 decoder is dense — decode_slice_header alone is 460 lines, fill_caches is 456 lines, and hl_decode_mb_internal is 256 lines. Loading all security-relevant functions simultaneously would consume most of a large context window, leaving little room for structured reasoning. Our pipeline solved this by having the LLM process one function at a time during extraction, then letting Datalog compose the cross-function results.

No systematic data flow. To find this bug, we needed to know that context_count, defined at line 2537, reaches line 2473 across loop iterations. An LLM can guess this by reading the code, but it cannot prove it. Our system computed 702 DefReachesUse pairs through formal Gen-Kill dataflow over 514 CFG edges. Every data-flow connection used in the analysis was sound — not hallucinated.

Guard semantics are subtle. The distinction between context_count == h->max_contexts (equality, not a bound) and context_count < h->max_contexts (proper upper bound) is the difference between a safe counter and an exploitable one. An LLM might notice this distinction, or might not — it depends on attention, prompt framing, and luck. The Datalog HasUpperBound rule encoded this distinction as a formal property: only <, <=, and their unsigned variants constitute upper bounds. The equality operator == is deliberately excluded. This is not a heuristic — it is a specification.

Combinatorial explosion. There are 73 functions, 38 sink sites, and 71 typed variables in the analysis scope. The number of potential (counter, index, type-mismatch) triples is enormous. An LLM asked "find type safety bugs" would likely focus on the most obvious patterns — explicit casts, malloc size arguments, memcpy lengths — and miss the subtle interaction between an int counter and a uint16_t sentinel. The Datalog engine evaluated all combinations exhaustively: every counter against every guard, every counter against every memory access, every type against the knowledge base. The 62 findings it produced are the complete set for the rules defined, not a sample.

5.2 Fuzzing Alone: Fast but Blind to Type Semantics

The bug report notes that this vulnerability "has been missed by every fuzzer" for over two decades. This is not surprising:

Triggering requires 65,536 slices. Most fuzzers use small inputs for performance. Generating a valid H.264 bitstream with 65,536 correctly formed slice headers, each passing decode_slice_header's many validation checks (PPS/SPS references, ref_count bounds, QP range, deblocking parameters), is astronomically unlikely through random mutation. Coverage-guided fuzzers would need to discover, one by one, that each additional slice increases current_slice by one — 65,536 times — while maintaining bitstream validity at every step.

The crash is conditional on frame layout. Even with 65,536 slices, the OOB write only occurs if a macroblock in slice 65535 is at a frame edge where its neighbor's slice_table entry is still the sentinel 0xFFFF. This requires specific slice-to-macroblock assignments that a fuzzer would need to stumble upon.

Type information is erased at runtime. A fuzzer monitors crashes and coverage. It cannot observe that slice_num (an int holding value 65535) and slice_table[top_xy] (a uint16_t holding value 0xFFFF) are different things that happen to have the same bit pattern. The type mismatch is a static property invisible to dynamic analysis.

Our system detected this without executing any code. The VarType facts preserved the distinction between int and uint16_t that compilation erases, and the UnboundedCounter rule identified the missing bound that would take a fuzzer 65,536 iterations to discover.

5.3 Traditional Static Analysis: Sound but Coarse

Tools like Coverity, CodeQL (GitHub's variant), or the Clang Static Analyzer could potentially detect pieces of this bug, but face their own gaps:

No sentinel-value reasoning. Standard static analyzers track null pointers, uninitialized memory, and buffer overflows. The concept of "a counter value colliding with a sentinel value in a different-width storage" is not a built-in check in any mainstream tool we are aware of. It requires understanding the interaction between type widths, initialization patterns, and comparison semantics — exactly what our UnboundedCounter + ResolvedVarType + SignednessMismatch rule composition captures.

Limited cross-domain reasoning. The bug spans three domains: (1) bitstream parsing (decode_slice_header increments the counter), (2) table management (memset initializes sentinels, slice_table stores truncated values), and (3) the deblocking filter (fill_caches compares against sentinels). Most static analyzers perform intraprocedural analysis or limited interprocedural analysis. Our backward slicing selected all three domains into the analysis set, and the Datalog rules operated uniformly across them.

Customization is hard. Writing a Coverity checker or CodeQL query for "unbounded counter stored in narrower type that collides with a sentinel" requires deep tool expertise and advance knowledge of the bug pattern. Our system discovered the pattern from generic rules — UnboundedCounter and SignednessMismatch are general-purpose rules that apply to any codebase.

5.4 Datalog Alone: Sound but Needs Eyes

Conversely, Datalog without the LLM would also fall short:

No fact extraction. Datalog rules operate on relations. Someone must populate those relations from source code. Traditional approaches use compiler frontends (Clang AST, GCC GIMPLE) that require the code to compile — including all headers, build flags, and dependencies. FFmpeg's build system is notoriously complex. Our LLM reads raw source files with zero build configuration, extracting types, control flow, and data flow directly from the text.

No cross-struct reasoning. The critical connection — h->slice_num (declared as int in H264Context) flows into h->slice_table[mb_xy] (declared as uint16_t*) — happens through struct field accesses across functions. The LLM extracted FieldWrite and VarType facts that captured this, but the Datalog rules for UnboundedCounter operate on local variables. It was the LLM interpreter that connected the Datalog finding ("counters are unbounded") to the struct-level type mismatch ("the storage is 16-bit"). This required reading the H264Context struct definition and understanding the semantic relationship between slice_num and slice_table — a task that requires natural language understanding of field names, comments, and usage patterns.

No exploit narrative. Datalog produces tuples: ("decode_nal_units", "context_count", 2537). It cannot explain why this matters, how an attacker would trigger it, or what the consequence would be. The LLM's interpretation phase transformed 62 abstract findings into a concrete exploit chain with specific attacker actions, value ranges, and memory corruption consequences.

5.5 The Hybrid Advantage

The table below summarizes what each component contributed:

Capability LLM Only Fuzzing Traditional SA Datalog Only Our System
Extract types from raw source (no build)
Formal data flow (Gen-Kill)
Exhaustive rule evaluation
Distinguish == from < in guards ⚠️
Detect sentinel collision pattern ⚠️
Explain exploit narrative
Zero build configuration
Finds this specific bug ⚠️ ⚠️

The ⚠️ entries indicate "possible in principle but unreliable in practice."


6. Limitations and Future Work

6.1 What We Partially Missed

While we detected both ingredients of the vulnerability (unbounded counter and type width mismatch), the Datalog rules did not directly compose them into a single finding. The connection required LLM interpretation because:

6.2 Planned Improvements

Based on this case study, we plan to add:

  1. FieldCounter rule: Track struct fields that are incremented through FieldWrite patterns, extending UnboundedCounter to interprocedural counters.

  2. SentinelCollision rule: Detect when a counter variable's type is wider than the storage it flows into, and the storage is initialized to a sentinel value that falls within the counter's reachable range.

  3. Cross-width comparison rule: Flag comparisons where one operand comes from wider storage (e.g., int slice_num) and the other from narrower storage (e.g., uint16_t slice_table[]), especially when the wider value can reach the narrower type's maximum.


7. Conclusion

We have demonstrated that NeuroLog — a hybrid system combining LLM fact extraction with Soufflé Datalog formal reasoning — can detect a real-world vulnerability that evaded two decades of fuzzing and human review. The key findings:

  1. The Datalog rules were necessary. The UnboundedCounter rule's formal distinction between == guards and < guards, combined with CounterUsedAsIndex data flow tracking, identified the precise code pattern that enables exploitation. An LLM alone might notice these issues individually but cannot guarantee exhaustive evaluation across 73 functions.

  2. The LLM was necessary. Without the LLM, we would have no facts to reason about — the source code would remain unstructured text. The LLM extracted 2,489 structured facts from 16 functions, including the critical VarType, ArithOp, Guard, and CFGEdge relations that the Datalog engine consumed. Additionally, the LLM's interpretation phase connected abstract Datalog findings to a concrete exploit chain.

  3. The combination found what neither could alone. The bug is a composition of properties: an unbounded counter (Rule 9), stored in narrower-typed storage (type knowledge base), that collides with a sentinel value (semantic reasoning). No single analysis technique covers all three. Our pipeline distributed the work: tree-sitter for scanning, backward slicing for scoping, LLM for extraction, Datalog for formal inference, and LLM again for interpretation.

The vulnerability in FFmpeg's H.264 decoder — introduced in 2003, weaponized by a 2010 refactor, and finally fixed in 2025 — is precisely the kind of bug that falls through the cracks of existing tooling. It is too subtle for fuzzers (requiring 65,536 valid slices), too cross-cutting for traditional static analyzers (spanning parsing, table management, and filtering), and too combinatorial for unaided human review. It is, however, well within reach of a system that combines the language understanding of an LLM with the formal guarantees of Datalog.


Appendix A: Full Datalog Output for Key Relations

A.1 UnboundedCounter (14 rows)

decode_frame nalsize 2689 decode_frame nalsize 2699 decode_frame p 2687 decode_frame p 2694 decode_frame p 2704 decode_frame i+1 2661 decode_frame i+1 2853 decode_nal_units buf_index 2451 decode_nal_units buf_index 2469 decode_nal_units buf_index 2491 decode_nal_units context_count 2537 decode_nal_units context_count 2566 decode_nal_units next_avc 2458 ff_h264_frame_start __expr_size 924

A.2 CounterUsedAsIndex (27 rows, grouped by variable)

Variable Incr Line Use Line Access Kind
buf_index 2451 2448 mem_read_offset
buf_index 2451 2463 mem_read_offset
buf_index 2451 2451 ptr_arith
buf_index 2451 2458 ptr_arith
buf_index 2451 2469 ptr_arith
buf_index 2451 2491 ptr_arith
buf_index 2469 2448 mem_read_offset
buf_index 2469 2463 mem_read_offset
buf_index 2469 (6 more) ptr_arith
buf_index 2491 (6 more) ptr_arith / mem_read
context_count 2537 2473 mem_read_offset
context_count 2537 2537 ptr_arith
context_count 2537 2566 ptr_arith
context_count 2566 2473 mem_read_offset
context_count 2566 2537 ptr_arith
context_count 2566 2566 ptr_arith
p 2687 2694 ptr_arith
p 2694 2694 ptr_arith
p 2704 2704 ptr_arith

A.3 PotentialArithOverflow (2 rows)

decode_nal_units 2458 next_avc add int 4 decode_nal_units 2481 bit_length mul int 4


Appendix B: Fact Extraction Statistics

Relation Rows Description
VarType 71 Source-level type annotations
Def 141 Variable definition sites
Use 590 Variable use sites
CFGEdge 514 Control flow edges
ArithOp 45 Arithmetic operations
Guard 95 Conditional bound checks
Call 127 Function call sites
ActualArg 315 Arguments at call sites
FieldRead 312 Struct field reads
FieldWrite 104 Struct field writes
MemRead 42 Memory read operations
MemWrite 33 Memory write operations
ReturnVal 11 Function return values
FormalParam 19 Function parameter declarations
AddressOf 25 Address-of operations
Total 2,489 Across 16 target functions

Appendix C: Derived Analysis Output

Output Relation Rows Description
DefReachesUse 702 Reaching definition pairs
ResolvedVarType 107 Types validated against knowledge base
TypeSafetyFinding 62 All type-safety findings (summary)
CounterUsedAsIndex 27 Unbounded counters used in memory ops
SignednessMismatch 20 Signed/unsigned confusion
UnboundedCounter 14 Increments without upper-bound guard
PointsTo 32 Pointer alias pairs
IsParam 19 Parameter identification
CopySite 7 Memory copy operations
GlobalVar 8 Global variable accesses
PotentialArithOverflow 2 Integer overflow risks
UnknownType 2 Types not in knowledge base

Authors: Sanjay Rawat & NeuroLog+Opus 4.6
Post URL:
E-mail: sanjayr@ymail.com