Date: April 2026
Target: FFmpeg H.264 Decoder (h264.c, h264.h)
Vulnerability: Unbounded 32-bit slice counter collides with 16-bit sentinel value 0xFFFF, causing heap OOB write in the deblocking filter
Fix Commit: FFmpeg/FFmpeg@c988f97566
Bug Age: Introduced in 2003, became exploitable after 2010 refactor
Tool: NeuroLog — an LLM-augmented Datalog-based source code analysis system
We describe how NeuroLog, a hybrid system combining Large Language Model (LLM) fact extraction with Soufflé Datalog formal reasoning, detected a recently disclosed vulnerability in FFmpeg's H.264 decoder by Anthropic's Mythos Model A 16-year-old FFmpeg vulnerability — a bug that had evaded every fuzzer and human reviewer for over two decades.
The vulnerability arises from the interaction of two properties:
current_slice) that increments without an enforced upper boundslice_table) initialized with memset(..., -1, ...), making 0xFFFF (65535) the sentinel for "unowned macroblock"When an attacker crafts an H.264 stream with exactly 65,536 slices, slice number 65535 collides with the sentinel. The deblocking filter then treats uninitialized neighbor macroblocks as valid, causing an out-of-bounds heap write.
Our system flagged both halves of this bug automatically:
current_slice as an increment without an enforced upper bound (the existing guard only logs a warning)int → uint16_t type width mismatch between slice_num and slice_tableThe LLM's role was to extract structured facts (variable types, definitions, uses, control flow, arithmetic operations) from source code. Datalog's role was to compose these facts through formal rules to surface the vulnerability pattern. Neither component alone could have found this bug.
NeuroLog operates as a five-phase pipeline: ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ 1. SCAN │───▶│ 2. SLICE │───▶│ 3. EXTRACT │───▶│ 4. ANALYZE │───▶│ 5. INTERPRET │ │ tree-sitter │ │ call-graph │ │ LLM → facts │ │ Soufflé │ │ LLM reason │ │ parse code │ │ backward │ │ .facts TSV │ │ Datalog │ │ over output │ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
Uses tree-sitter (no compilation required) to enumerate all functions, build a call graph, and identify functions calling dangerous sinks (memcpy, memset, malloc, free, etc.). For this project: 73 functions across 4 files, 38 dangerous sink call sites, 283 call graph edges.
Performs backward slicing from dangerous sinks through the call graph. This determines which functions need deep analysis. Starting from 38 sink sites, tracing 3 levels back through callers, the slicer selected 16 target functions — reducing analysis scope by ~78% while preserving all security-relevant paths.
An LLM reads the source code of each target function and extracts structured facts as tab-separated values conforming to predefined relational schemas. The LLM produces facts for:
| Relation | Description | Count |
|---|---|---|
VarType |
Variable name, C type, byte width, signedness | 71 |
Def |
Variable defined at line | 141 |
Use |
Variable used at line | 590 |
ArithOp |
Arithmetic operation (add, mul, shift) with operands | 45 |
Guard |
Conditional bound checks (var < bound) | 95 |
CFGEdge |
Control flow between lines | 514 |
Call |
Function call sites | 127 |
ActualArg |
Arguments passed at call sites | 315 |
MemRead |
Memory read with base + offset | 42 |
MemWrite |
Memory write targets | 33 |
FieldRead |
Struct field access (read) | 312 |
FieldWrite |
Struct field access (write) | 104 |
| Total | 2,489 |
The extracted facts are fed into 17 Datalog rule files (~145KB of logic) executed by the Soufflé engine. Rules compose facts through logical inference to derive security findings. The key rule files for this vulnerability:
| Rule File | Purpose | Size |
|---|---|---|
source_core.dl |
Reaching definitions, def-use chains, CFG reachability | 4.8 KB |
source_type_safety.dl |
Type mismatches, unbounded counters, truncation detection | 30.9 KB |
type_knowledge.dl |
C type semantics knowledge base (widths, signedness) | 10.0 KB |
The LLM reads the Datalog output CSVs and reasons about the findings in the context of the source code, connecting formal results to exploit chains.
We were asked to analyze FFmpeg's H.264 decoder for memory corruption bugs, with a specific focus on integer type mismatches and unbounded counters — two vulnerability classes that interact in subtle ways and are poorly covered by traditional fuzzing.
The target was a single-file decoder (h264.c, ~3,100 lines) with its header
(h264.h, ~1,341 lines) and supporting files for the loop filter and parser.
We began by running tool_scan_project with tree-sitter to map the codebase
without compilation:
Functions discovered: 73 (across 4 files) Call graph edges: 283 Dangerous sink sites: 38 (memset, memcpy, printf, etc.)
The scan immediately surfaced the structure of the decoder: decode_frame →
decode_nal_units → decode_slice_header → ff_h264_frame_start →
ff_h264_alloc_tables. These are the critical paths where attacker-controlled
bitstream data flows into memory allocation, table indexing, and pixel writes.
We noted 38 dangerous sink calls — predominantly memset and memcpy in
allocation and initialization routines.
With 73 functions, exhaustive LLM extraction would be expensive and would
flood the Datalog engine with irrelevant facts. We ran tool_build_slice to
trace backward 3 levels from every dangerous sink through the call graph.
This selected 16 functions for deep analysis — a 78% reduction in scope while preserving every path from attacker input to memory-unsafe operations. The selected functions included:
decode_slice_header — parses slice headers from bitstreamdecode_nal_units — top-level NAL unit dispatch loopdecode_frame — entry point for frame decodingff_h264_frame_start — frame buffer initializationff_h264_alloc_tables — table allocationdecode_slice — macroblock decode loopff_h264_hl_decode_mb — macroblock rendering dispatchhl_decode_mb_internal — the actual pixel write pathFor each of the 16 target functions, the LLM read the source code and produced structured relational facts. This is the critical bridge: the LLM understands C semantics (types, control flow, pointer arithmetic) while the Datalog engine provides formal compositional reasoning.
The LLM extracted 2,489 facts across 12 relation schemas. Here are the key facts that would later prove essential to detecting the vulnerability:
VarType facts — The LLM correctly identified types from declarations and struct definitions:
# The LLM saw "int context_count = 0;" and extracted:
decode_nal_units context_count int 4 signed
# The LLM saw "int buf_index = 0;" and extracted:
decode_nal_units buf_index int 4 signed
# The LLM saw "int bit_length;" and extracted:
decode_nal_units bit_length int 4 signed
These type facts are unavailable to binary analysis tools — after compilation, int and unsigned int are both just 32-bit registers. But in our system, the LLM preserves the source-level type name, which the type_knowledge.dl knowledge base then resolves to canonical width and signedness.
ArithOp facts — The LLM identified every increment operation:
# The two lines where context_count is incremented:
decode_nal_units 2537 context_count 0 add context_count 0 1
decode_nal_units 2566 context_count 0 add context_count 0 1
Guard facts — Critically, the LLM extracted the form of every conditional check. For context_count, it found:
decode_nal_units 2599 context_count 0 == h->max_contexts var
#This is an equality check (==), not an upper-bound check (< or <=). This distinction is what allows the Datalog HasUpperBound rule to correctly determine that context_count lacks a proper upper bound guard — equality checks don't prevent a value from exceeding the bound on a different code path.
With facts on disk, we ran the Soufflé engine across 17 rule files. The rules composed the raw facts through logical inference to derive security findings.
The analysis produced:
Output Relation Rows Significance
TypeSafetyFinding 62 All type-safety issues in one relation
UnboundedCounter 14 Variables incremented without upper bounds
CounterUsedAsIndex 27 Unbounded counters used in memory access
SignednessMismatch 20 Signed/unsigned confusion at assignment
PotentialArithOverflow 2 Arithmetic that may overflow int range
DefReachesUse 702 Reaching definition pairs (data flow backbone)
ResolvedVarType 107 LLM types validated against knowledge base
The LLM read the Datalog output and cross-referenced it with the decompiled source code. Two findings immediately stood out as related:
Finding A — In decode_nal_units, context_count was flagged as an UnboundedCounter (incremented at lines 2537 and 2566) that is also a CounterUsedAsIndex (used to index h->thread_context[] at line 2473).
Finding B — Multiple SignednessMismatch entries across the codebase showed pointer-to-integer confusion, but more importantly, the pattern of int variables being used in contexts where they flow into narrower storage was pervasive.
The LLM then analyzed the decode_slice_header function (which the pipeline had selected as a target but whose current_slice counter facts were captured through field writes and call-site analysis). By reading lines 2084-2087, it identified the smoking gun:
h->slice_num = ++h0->current_slice;
if(h->slice_num >= MAX_SLICES){
av_log(s->avctx, AV_LOG_ERROR, "Too many slices...");
// ← NO return -1!
}
The guard exists but does not terminate execution. Combined with the type information from h264.h showing int slice_num (line 361) stored into uint16_t *slice_table (line 363), the exploit chain became clear: 32-bit counter → 16-bit storage → sentinel collision at 65535 → OOB write in the deblocking filter.
The vulnerability was surfaced by the interaction of three Datalog rule groups, each contributing a distinct piece of the puzzle. We walk through each rule, show its definition, and trace exactly which facts it consumed and what it produced.
Purpose: Identify variables that are incremented via arithmetic but never checked against an upper bound anywhere in the function.
Definition (from source_type_safety.dl, lines 382–396):
.decl HasUpperBound(func: Sym, var: Sym)
HasUpperBound(f, v) :- Guard(f, _, v, _, "<", _, _).
HasUpperBound(f, v) :- Guard(f, _, v, _, "<=", _, _).
HasUpperBound(f, v) :- Guard(f, _, v, _, "lt", _, _).
HasUpperBound(f, v) :- Guard(f, _, v, _, "le", _, _).
HasUpperBound(f, v) :- Guard(f, _, v, _, "unsigned_lt", _, _).
HasUpperBound(f, v) :- Guard(f, _, v, _, "unsigned_le", _, _).
.decl UnboundedCounter(func: Sym, var: Sym, incr_addr: Addr)
UnboundedCounter(f, v, a) :-
ArithOp(f, a, v, _, "add", _, _, _),
!HasUpperBound(f, v).
How it fired:
The LLM extracted these ArithOp facts for context_count:
decode_nal_units 2537 context_count 0 add context_count 0 1
decode_nal_units 2566 context_count 0 add context_count 0 1
And this Guard fact:
decode_nal_units 2599 context_count 0 == h->max_contexts var
The HasUpperBound rule looks for guards with operators <, <=, lt, le, unsigned_lt, or unsigned_le. The guard on context_count uses == (equality), which is not in that set. Therefore HasUpperBound("decode_nal_units", "context_count") was not derived, and the negation !HasUpperBound(f, v) succeeded.
Output produced (from UnboundedCounter.csv):
decode_nal_units context_count 2537
decode_nal_units context_count 2566
Why this matters: An equality check (== h->max_contexts) only resets the counter when it exactly hits the bound. It does not prevent the counter from exceeding the bound if incrementing happens on multiple code paths within the same loop iteration, or if the check is placed after the use. The Datalog rule captured this semantic distinction precisely.
This same rule pattern also flagged buf_index (3 increment sites in decode_nal_units) and p (3 increment sites in decode_frame), both of which we identified as medium-severity issues in the broader analysis.
Purpose: Connect unbounded counters to actual memory operations — array indexing, pointer arithmetic, or dangerous sink arguments.
Definition (from source_type_safety.dl, lines 414–444):
.decl CounterUsedAsIndex(func: Sym, var: Sym, incr_addr: Addr,
use_addr: Addr, access_kind: Sym)
CounterUsedAsIndex(f, v, ia, ua, "mem_read_offset") :-
UnboundedCounter(f, v, ia),
DefReachesUse(f, v, ia, ua),
MemRead(f, ua, _, v, _).
CounterUsedAsIndex(f, v, ia, ua, "ptr_arith") :-
UnboundedCounter(f, v, ia),
DefReachesUse(f, v, ia, ua),
ArithOp(f, ua, _, _, "add", v, _, _).
How it fired:
The DefReachesUse relation (computed by source_core.dl using reaching definitions over the CFG) connected the definition of context_count at line 2537 to its use at line 2473:
decode_nal_units context_count 2537 2473
decode_nal_units context_count 2566 2473
At line 2473, context_count is used as an index into the thread_context array:
hx = h->thread_context[context_count]; // line 2473
The LLM extracted this as a MemRead fact:
decode_nal_units 2473 h->thread_context context_count 8
The join of UnboundedCounter + DefReachesUse + MemRead produced:
decode_nal_units context_count 2537 2473 mem_read_offset
decode_nal_units context_count 2566 2473 mem_read_offset
Why this matters: The rule connects the abstract property ("this counter has no bound") to a concrete consequence ("this counter indexes into memory"). Without the DefReachesUse data flow, we would know the counter is unbounded but not whether it matters. Without the MemRead facts, we would know the counter flows somewhere but not that it reaches a memory access.
Purpose: Provide ground-truth C type semantics that the LLM's raw extraction can be validated against.
Definition (from type_knowledge.dl, key entries):
TypeInfo("int", "int32", 4, "signed").
TypeInfo("uint16_t", "uint16", 2, "unsigned").
TypeInfo("unsigned int", "uint32", 4, "unsigned").
.decl ResolvedVarType(func, var, type_name, canonical, width, signedness)
ResolvedVarType(f, v, tn, c, w, s) :-
VarType(f, v, tn, _, _),
TypeInfo(tn, c, w, s).
How it contributed:
The LLM extracted VarType("decode_nal_units", "context_count", "int", 4, "signed"). The ResolvedVarType rule joined this against the knowledge base, confirming that int → canonical int32, width 4 bytes, signed.
This resolved type feeds into PotentialArithOverflow (Rule 9c):
PotentialArithOverflow(f, a, dst, op, tn, w) :-
ArithOp(f, a, dst, _, op, _, _, _),
(op = "mul" ; op = "lsl" ; op = "add"),
ResolvedVarType(f, dst, tn, _, w, "signed"),
w <= 4,
!Guard(f, _, dst, _, _, _, _).
This produced the next_avc overflow finding at line 2458 (buf_index + nalsize in signed 32-bit arithmetic).
Purpose: Detect assignments where a signed variable flows into an unsigned one (or vice versa) without an explicit cast.
SignednessMismatch(f, a, src, dst, "pointer", ss, st, dt) :-
Def(f, dst, _, a),
Use(f, src, _, a),
src != dst,
ResolvedVarType(f, src, st, _, _, "pointer"),
ResolvedVarType(f, dst, dt, _, _, ss),
ss != "pointer",
ss != "struct",
!Call(f, _, a).
While this rule primarily fired for pointer-to-integer patterns in decode_frame and ff_h264_hl_decode_mb, it established the systematic pattern that the codebase has pervasive implicit type conversions without casts — the same pattern class that makes int slice_num → uint16_t slice_table[] dangerous.
All of the above rules depend on DefReachesUse, computed by source_core.dl using standard Gen-Kill dataflow:
.decl Kills(func: Sym, var: Sym, node: Addr)
Kills(f, v, n) :- Def(f, v, _, n).
.decl ReachesIn(func: Sym, var: Sym, def_line: Addr, node: Addr)
ReachesIn(f, v, d, n) :-
ReachesOut(f, v, d, pred),
CFGEdge(f, pred, n).
.decl ReachesOut(func: Sym, var: Sym, def_line: Addr, node: Addr)
ReachesOut(f, v, d, d) :- Def(f, v, _, d).
ReachesOut(f, v, d, n) :-
ReachesIn(f, v, d, n),
!Kills(f, v, n).
.decl DefReachesUse(func: Sym, var: Sym, def_line: Addr, use_line: Addr)
DefReachesUse(f, v, d, u) :-
ReachesIn(f, v, d, u),
Use(f, v, _, u).
From 141 Def facts, 590 Use facts, and 514 CFGEdge facts, this engine derived 702 DefReachesUse pairs — the complete intraprocedural data flow map. This is what allowed us to trace that context_count, defined at line 2537 (an increment), reaches its use at line 2473 (the array index) — even though those lines are in different iterations of the for(; loop, connected through CFGEdge back-edges.
Finally, all individual findings flow into a single summary relation:
TypeSafetyFinding(f, a, "unbounded_counter", v,
cat("incremented at ", to_string(a))) :-
UnboundedCounter(f, v, a).
TypeSafetyFinding(f, ua, "counter_used_as_index", v,
cat(kind, cat(" incr@", to_string(ia)))) :-
CounterUsedAsIndex(f, v, ia, ua, kind).
TypeSafetyFinding(f, a, "signedness_mismatch", src,
cat(st, cat("(", cat(ss, cat(")→", cat(dt, cat("(", cat(ds, ")")))))))) :-
SignednessMismatch(f, a, src, _, ss, ds, st, dt).
TypeSafetyFinding(f, a, "potential_arith_overflow", v,
cat(op, cat(" ", cat(tn, cat(" width=", to_string(w)))))) :-
PotentialArithOverflow(f, a, v, op, tn, w).
This produced the 62 TypeSafetyFinding rows that we used as the starting point for interpretation.
The Datalog engine gave us formal findings. But formal findings are not an exploit chain. The LLM's final role was to read the output CSVs alongside the source code and perform the semantic reasoning that Datalog cannot.
From the 62 findings, we focused on the unbounded counter cluster:
decode_nal_units 2537 unbounded_counter context_count incremented at 2537
decode_nal_units 2566 unbounded_counter context_count incremented at 2566
decode_nal_units 2537 counter_used_as_index context_count mem_read_offset incr@2537
decode_nal_units 2566 counter_used_as_index context_count mem_read_offset incr@2566
This told us: "In decode_nal_units, the variable context_count is incremented at two separate code locations and used as a memory index, but has no upper-bound guard."
The LLM read decode_nal_units and immediately identified the pattern: context_count indexes into h->thread_context[], but its guard is an equality check at the bottom of the loop. This was clearly an OOB risk, but it was a localized single-function bug.
The deeper insight came when the LLM noticed the Datalog output did not flag current_slice as an unbounded counter — because decode_slice_header was analyzed but current_slice is a struct field write (h0->current_slice), not a local variable with ArithOp facts. It had been extracted as a FieldWrite, not as a local ArithOp.
This is where Datalog+LLM reasoning transcended what the Datalog rules alone could express. The LLM read decode_slice_header at line 2084:
h->slice_num = ++h0->current_slice;
if(h->slice_num >= MAX_SLICES){
av_log(s->avctx, AV_LOG_ERROR, "Too many slices...");
}
The LLM recognized:
The same unbounded counter pattern — increment without enforced bound
The log-only guard — no return -1
The connection to slice_table — a uint16_t* array
The LLM then consulted h264.h and confirmed:
slice_num is int (32-bit, line 361)
slice_table is uint16_t* (16-bit, line 363)
The sentinel value from memset(..., -1, ...) is 0xFFFF = 65535
The collision became obvious: when current_slice reaches 65535, slice_num truncates to 0xFFFF when stored in slice_table, matching the sentinel exactly.
The LLM followed the data flow from slice_table to its consumers in fill_caches (h264.h lines 793-831):
top_type = h->slice_table[top_xy] < 0xFFFF
? s->current_picture.mb_type[top_xy] : 0;
When slice_table[top_xy] is the sentinel 0xFFFF but a real slice also has number 65535, the comparison 0xFFFF < 0xFFFF evaluates to false — correctly treating it as "no neighbor." But in the non-deblocking path:
top_type = h->slice_table[top_xy] == h->slice_num
? s->current_picture.mb_type[top_xy] : 0;
Here 0xFFFF == 65535 is true — falsely claiming the sentinel-initialized neighbor is in the current slice. This enables reads of uninitialized mb_type, motion_val, and ref_index data, which the deblocking filter then uses to compute write destinations in the frame buffer.
This section addresses a critical question.
An LLM reading this codebase without Datalog support would face several fundamental challenges:
Context window saturation. The four files total ~5,800 lines. The H.264
decoder is dense — decode_slice_header alone is 460 lines, fill_caches is
456 lines, and hl_decode_mb_internal is 256 lines. Loading all security-relevant
functions simultaneously would consume most of a large context window, leaving
little room for structured reasoning. Our pipeline solved this by having the LLM
process one function at a time during extraction, then letting Datalog compose
the cross-function results.
No systematic data flow. To find this bug, we needed to know that
context_count, defined at line 2537, reaches line 2473 across loop iterations.
An LLM can guess this by reading the code, but it cannot prove it. Our
system computed 702 DefReachesUse pairs through formal Gen-Kill dataflow over
514 CFG edges. Every data-flow connection used in the analysis was sound — not
hallucinated.
Guard semantics are subtle. The distinction between context_count == h->max_contexts (equality, not a bound) and context_count < h->max_contexts
(proper upper bound) is the difference between a safe counter and an exploitable
one. An LLM might notice this distinction, or might not — it depends on
attention, prompt framing, and luck. The Datalog HasUpperBound rule encoded
this distinction as a formal property: only <, <=, and their unsigned
variants constitute upper bounds. The equality operator == is deliberately
excluded. This is not a heuristic — it is a specification.
Combinatorial explosion. There are 73 functions, 38 sink sites, and 71 typed
variables in the analysis scope. The number of potential (counter, index,
type-mismatch) triples is enormous. An LLM asked "find type safety bugs" would
likely focus on the most obvious patterns — explicit casts, malloc size
arguments, memcpy lengths — and miss the subtle interaction between an
int counter and a uint16_t sentinel. The Datalog engine evaluated all
combinations exhaustively: every counter against every guard, every counter
against every memory access, every type against the knowledge base. The 62
findings it produced are the complete set for the rules defined, not a
sample.
The bug report notes that this vulnerability "has been missed by every fuzzer" for over two decades. This is not surprising:
Triggering requires 65,536 slices. Most fuzzers use small inputs for
performance. Generating a valid H.264 bitstream with 65,536 correctly formed
slice headers, each passing decode_slice_header's many validation checks
(PPS/SPS references, ref_count bounds, QP range, deblocking parameters), is
astronomically unlikely through random mutation. Coverage-guided fuzzers would
need to discover, one by one, that each additional slice increases current_slice
by one — 65,536 times — while maintaining bitstream validity at every step.
The crash is conditional on frame layout. Even with 65,536 slices, the OOB
write only occurs if a macroblock in slice 65535 is at a frame edge where its
neighbor's slice_table entry is still the sentinel 0xFFFF. This requires
specific slice-to-macroblock assignments that a fuzzer would need to stumble upon.
Type information is erased at runtime. A fuzzer monitors crashes and coverage.
It cannot observe that slice_num (an int holding value 65535) and
slice_table[top_xy] (a uint16_t holding value 0xFFFF) are different things
that happen to have the same bit pattern. The type mismatch is a static property
invisible to dynamic analysis.
Our system detected this without executing any code. The VarType facts
preserved the distinction between int and uint16_t that compilation erases,
and the UnboundedCounter rule identified the missing bound that would take a
fuzzer 65,536 iterations to discover.
Tools like Coverity, CodeQL (GitHub's variant), or the Clang Static Analyzer could potentially detect pieces of this bug, but face their own gaps:
No sentinel-value reasoning. Standard static analyzers track null pointers,
uninitialized memory, and buffer overflows. The concept of "a counter value
colliding with a sentinel value in a different-width storage" is not a built-in
check in any mainstream tool we are aware of. It requires understanding the
interaction between type widths, initialization patterns, and comparison
semantics — exactly what our UnboundedCounter + ResolvedVarType +
SignednessMismatch rule composition captures.
Limited cross-domain reasoning. The bug spans three domains: (1) bitstream
parsing (decode_slice_header increments the counter), (2) table management
(memset initializes sentinels, slice_table stores truncated values), and
(3) the deblocking filter (fill_caches compares against sentinels). Most
static analyzers perform intraprocedural analysis or limited interprocedural
analysis. Our backward slicing selected all three domains into the analysis
set, and the Datalog rules operated uniformly across them.
Customization is hard. Writing a Coverity checker or CodeQL query for
"unbounded counter stored in narrower type that collides with a sentinel" requires
deep tool expertise and advance knowledge of the bug pattern. Our system
discovered the pattern from generic rules — UnboundedCounter and
SignednessMismatch are general-purpose rules that apply to any codebase.
Conversely, Datalog without the LLM would also fall short:
No fact extraction. Datalog rules operate on relations. Someone must populate those relations from source code. Traditional approaches use compiler frontends (Clang AST, GCC GIMPLE) that require the code to compile — including all headers, build flags, and dependencies. FFmpeg's build system is notoriously complex. Our LLM reads raw source files with zero build configuration, extracting types, control flow, and data flow directly from the text.
No cross-struct reasoning. The critical connection — h->slice_num (declared
as int in H264Context) flows into h->slice_table[mb_xy] (declared as
uint16_t*) — happens through struct field accesses across functions. The LLM
extracted FieldWrite and VarType facts that captured this, but the Datalog
rules for UnboundedCounter operate on local variables. It was the LLM
interpreter that connected the Datalog finding ("counters are unbounded") to the
struct-level type mismatch ("the storage is 16-bit"). This required reading the
H264Context struct definition and understanding the semantic relationship
between slice_num and slice_table — a task that requires natural language
understanding of field names, comments, and usage patterns.
No exploit narrative. Datalog produces tuples: ("decode_nal_units", "context_count", 2537). It cannot explain why this matters, how an attacker
would trigger it, or what the consequence would be. The LLM's interpretation
phase transformed 62 abstract findings into a concrete exploit chain with
specific attacker actions, value ranges, and memory corruption consequences.
The table below summarizes what each component contributed:
| Capability | LLM Only | Fuzzing | Traditional SA | Datalog Only | Our System |
|---|---|---|---|---|---|
| Extract types from raw source (no build) | ✅ | ❌ | ❌ | ❌ | ✅ |
| Formal data flow (Gen-Kill) | ❌ | ❌ | ✅ | ✅ | ✅ |
| Exhaustive rule evaluation | ❌ | ❌ | ✅ | ✅ | ✅ |
Distinguish == from < in guards |
⚠️ | ❌ | ✅ | ✅ | ✅ |
| Detect sentinel collision pattern | ❌ | ❌ | ❌ | ⚠️ | ✅ |
| Explain exploit narrative | ✅ | ❌ | ❌ | ❌ | ✅ |
| Zero build configuration | ✅ | ❌ | ❌ | ❌ | ✅ |
| Finds this specific bug | ⚠️ | ❌ | ❌ | ⚠️ | ✅ |
The ⚠️ entries indicate "possible in principle but unreliable in practice."
While we detected both ingredients of the vulnerability (unbounded counter and type width mismatch), the Datalog rules did not directly compose them into a single finding. The connection required LLM interpretation because:
current_slice is a struct field (h0->current_slice), not a local
variable. Our UnboundedCounter rule operates on ArithOp facts for local
variables. Extending it to struct field increments would require FieldWrite-aware
counter tracking.
The sentinel value 0xFFFF from memset(..., -1, ...) is not modeled in
our Datalog rules. Adding a SentinelInit relation and rules to detect
counter-sentinel collisions would make this fully automatic.
The truncation from int to uint16_t happens implicitly through struct
field assignment, not through an explicit cast. Our ImplicitTruncation rule
requires both source and destination to be local variables with VarType facts.
Based on this case study, we plan to add:
FieldCounter rule: Track struct fields that are incremented through
FieldWrite patterns, extending UnboundedCounter to interprocedural
counters.
SentinelCollision rule: Detect when a counter variable's type is wider than the storage it flows into, and the storage is initialized to a sentinel value that falls within the counter's reachable range.
Cross-width comparison rule: Flag comparisons where one operand comes
from wider storage (e.g., int slice_num) and the other from narrower
storage (e.g., uint16_t slice_table[]), especially when the wider value
can reach the narrower type's maximum.
We have demonstrated that NeuroLog — a hybrid system combining LLM fact extraction with Soufflé Datalog formal reasoning — can detect a real-world vulnerability that evaded two decades of fuzzing and human review. The key findings:
The Datalog rules were necessary. The UnboundedCounter rule's formal
distinction between == guards and < guards, combined with
CounterUsedAsIndex data flow tracking, identified the precise code pattern
that enables exploitation. An LLM alone might notice these issues
individually but cannot guarantee exhaustive evaluation across 73 functions.
The LLM was necessary. Without the LLM, we would have no facts to reason
about — the source code would remain unstructured text. The LLM extracted
2,489 structured facts from 16 functions, including the critical VarType,
ArithOp, Guard, and CFGEdge relations that the Datalog engine consumed.
Additionally, the LLM's interpretation phase connected abstract Datalog
findings to a concrete exploit chain.
The combination found what neither could alone. The bug is a composition of properties: an unbounded counter (Rule 9), stored in narrower-typed storage (type knowledge base), that collides with a sentinel value (semantic reasoning). No single analysis technique covers all three. Our pipeline distributed the work: tree-sitter for scanning, backward slicing for scoping, LLM for extraction, Datalog for formal inference, and LLM again for interpretation.
The vulnerability in FFmpeg's H.264 decoder — introduced in 2003, weaponized by a 2010 refactor, and finally fixed in 2025 — is precisely the kind of bug that falls through the cracks of existing tooling. It is too subtle for fuzzers (requiring 65,536 valid slices), too cross-cutting for traditional static analyzers (spanning parsing, table management, and filtering), and too combinatorial for unaided human review. It is, however, well within reach of a system that combines the language understanding of an LLM with the formal guarantees of Datalog.
decode_frame nalsize 2689 decode_frame nalsize 2699 decode_frame p 2687 decode_frame p 2694 decode_frame p 2704 decode_frame i+1 2661 decode_frame i+1 2853 decode_nal_units buf_index 2451 decode_nal_units buf_index 2469 decode_nal_units buf_index 2491 decode_nal_units context_count 2537 decode_nal_units context_count 2566 decode_nal_units next_avc 2458 ff_h264_frame_start __expr_size 924
| Variable | Incr Line | Use Line | Access Kind |
|---|---|---|---|
buf_index |
2451 | 2448 | mem_read_offset |
buf_index |
2451 | 2463 | mem_read_offset |
buf_index |
2451 | 2451 | ptr_arith |
buf_index |
2451 | 2458 | ptr_arith |
buf_index |
2451 | 2469 | ptr_arith |
buf_index |
2451 | 2491 | ptr_arith |
buf_index |
2469 | 2448 | mem_read_offset |
buf_index |
2469 | 2463 | mem_read_offset |
buf_index |
2469 | (6 more) | ptr_arith |
buf_index |
2491 | (6 more) | ptr_arith / mem_read |
context_count |
2537 | 2473 | mem_read_offset |
context_count |
2537 | 2537 | ptr_arith |
context_count |
2537 | 2566 | ptr_arith |
context_count |
2566 | 2473 | mem_read_offset |
context_count |
2566 | 2537 | ptr_arith |
context_count |
2566 | 2566 | ptr_arith |
p |
2687 | 2694 | ptr_arith |
p |
2694 | 2694 | ptr_arith |
p |
2704 | 2704 | ptr_arith |
decode_nal_units 2458 next_avc add int 4 decode_nal_units 2481 bit_length mul int 4
| Relation | Rows | Description |
|---|---|---|
| VarType | 71 | Source-level type annotations |
| Def | 141 | Variable definition sites |
| Use | 590 | Variable use sites |
| CFGEdge | 514 | Control flow edges |
| ArithOp | 45 | Arithmetic operations |
| Guard | 95 | Conditional bound checks |
| Call | 127 | Function call sites |
| ActualArg | 315 | Arguments at call sites |
| FieldRead | 312 | Struct field reads |
| FieldWrite | 104 | Struct field writes |
| MemRead | 42 | Memory read operations |
| MemWrite | 33 | Memory write operations |
| ReturnVal | 11 | Function return values |
| FormalParam | 19 | Function parameter declarations |
| AddressOf | 25 | Address-of operations |
| Total | 2,489 | Across 16 target functions |
| Output Relation | Rows | Description |
|---|---|---|
| DefReachesUse | 702 | Reaching definition pairs |
| ResolvedVarType | 107 | Types validated against knowledge base |
| TypeSafetyFinding | 62 | All type-safety findings (summary) |
| CounterUsedAsIndex | 27 | Unbounded counters used in memory ops |
| SignednessMismatch | 20 | Signed/unsigned confusion |
| UnboundedCounter | 14 | Increments without upper-bound guard |
| PointsTo | 32 | Pointer alias pairs |
| IsParam | 19 | Parameter identification |
| CopySite | 7 | Memory copy operations |
| GlobalVar | 8 | Global variable accesses |
| PotentialArithOverflow | 2 | Integer overflow risks |
| UnknownType | 2 | Types not in knowledge base |