65816-llvm-mos/SESSION_RECOVERY.md
2026-05-27 19:37:26 -05:00

Session Recovery — last updated 2026-05-27

Living recovery doc. Update on every meaningful change. If session is lost, read this top-to-bottom + the memory notes referenced inside, then reread the actual diffs in tree to ground assumptions.

Headline state

  • Smoke: 148/148 green. Demos 9/9 (helloBeep/helloText/helloWindow/orcaFrame/qdProbe/heavyRelocs/frame/reversi/minicad).
  • Active config: ptr32 (p:32:16), full IMG0..IMG15 caller-clobber on JSL, greedy regalloc at -O1+. Inline-threshold lowered to 50 target-wide (was LLVM default 225; was 75 earlier this session).
  • Branch: main.
  • vs Calypsi (2026-05-27) — Layer 2 + recent peepholes:
    • Cycle benches geomean: 0.62× Calypsi. 9 of 10 below 1.0×; only fib trails at 1.06× (recursive overhead, structural). See cycle bench table below.
    • Lua 5.1.5: default config 1.13× Calypsi; with Layer 2 0.93×.
    • CoreMark 1.0: with Layer 2 0.79× Calypsi (we beat by 21%).
  • vs Calypsi static-inst ratio (synthetic bench): sumSquares 0.84× (26 vs 31 — we beat), mul16to32 0.25× (1 vs 4 — we beat), evalAt 1.86× (472 vs 254 — structural floor; ABI overhaul rejected).
  • New code-gen options (2026-05-25) — see docs/USAGE.md "Advanced: pointer-deref code generation":
    • Layer 1 ptr32 deref-fold (always on): LDY offset instead of CLC/ADC carry chain. ~3 instr saved per struct-field access.
    • -mllvm -w65816-dbr-safe-ptrs (Layer 2, opt-in): uses lda (d,S),Y for ptr32 derefs assuming bank-byte == DBR. 5 instr → 1 instr per deref. Lua -20.6%. MISCOMPILES cross-bank pointers — opt in per-TU only when safe.
    • Inline-threshold lowered to 50 (was 225). Lua -23% total, CoreMark matrix.o 1.37× → 0.97× Calypsi. Override with -mllvm -inline-threshold=N.
  • Cycle benches per-call (2026-05-27, Layer 2) — via scripts/benchCyclesPrecise.sh vs scripts/benchCyclesCalypsi.sh:
    Bench         Ours   Calypsi  Ratio
    dotProduct    1534   5712     0.27×
    bsearch       682    2387     0.29×
    sumOfSquares  6820   16368    0.42×
    bubbleSort    11594  17050    0.68×
    strLen        767    1023     0.75×
    djb2Hash      2046   2643     0.77×
    popcount      1194   1534     0.78×
    strcpy        1108   1194     0.93×
    memcmp        682    716      0.95×
    fib           11594  10912    1.06×
    
    Geomean 0.62×. Older HBL-tick numbers (per-iter, 100 iter loops) from benchCycles.sh are still available but lower resolution.
  • Recent session wins (2026-05-27):
    • Y-as-counter for strLen — structural rewrite: drop STX/INX/INC, use Y as offset AND counter. strLen 1279 → 767 cyc (-40%); 0.75× Calypsi (was 1.25×).
    • Stack-rel dead-store elim — companion to DP version with SP tracking across PHA/PHP/PEA/PEI/PER/PLA/PLP/PLX/PLY/PHX/PHY. strcpy 1194 → 1108 (-7%, 0.93× Calypsi, beats by 7%). Refactored as a static helper called from the recursive-call bail too so fib gets it. fib 12106 → 11594 (-4%, 1.06× Calypsi).
    • DP-indirect-Y for iter (follow-on to X-iter peephole): rewrites TXA;STA stack-rel S;INX;…;LDA (S,s),Y to STX_DP D;INX;…;LDA (D),Y. Saves 4 cyc/iter.
    • Dead INC_HI_IF_CARRY elim — when the StackRel ptr-hi slot is never read, elide the carry-bookkeeping for Layer 2 ptr32 loops. Wide impact across strLen/strcpy/djb2Hash/memcmp.
  • Recent session wins (earlier — 2026-05-20):
    • 8 always-on peepholes + extended phase 4 in W65816StackRelToImg (evalAt 498→472, fib -35%, 35 libc fns shrunk)
    • __muldi3 32-bit short-circuit (dmul 1605→1033, -36%)
    • case-(b) ImgCalleeSave bracket hoist enables phase 4 to elide TAY/TYA round-trip in synergy
    • FP cycle benches added (dadd/dmul/ddiv) with per-bench iter count
    • Documented LSR-dp cycle mystery as HBL-counter wrap artifact
    • Game-like benches added: particles (i16 physics), mandelbrot (i32 fp)
    • elideStoreForwarding now reached via early-return bail paths: particles 5005→2253 cyc/iter (-55%). Was being skipped for any function where main IMG promotion bailed (SpAdj invalid, no accesses, or > 16 hot slots).
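The Layer 2 cross-bank hazard noted under the code-gen options can be reproduced on the host. This is a minimal sketch under a simplified address model: the full ptr32 deref takes its bank byte from the pointer, while the Layer 2 form takes it from DBR and ignores the pointer's own bank byte. The names effAddrFull and effAddrDbr are hypothetical and do not exist in the backend.

```c
#include <stdint.h>

/* Full ptr32 deref (model): bank byte comes from the pointer itself. */
uint32_t effAddrFull(uint32_t ptr32, uint16_t y) {
    return ((ptr32 & 0x00FFFFFFu) + y) & 0x00FFFFFFu;
}

/* Layer 2 deref (model): only the low 16 bits of the pointer are used,
 * and the bank byte is supplied by DBR. */
uint32_t effAddrDbr(uint32_t ptr32, uint8_t dbr, uint16_t y) {
    uint32_t base = ((uint32_t)dbr << 16) | (ptr32 & 0xFFFFu);
    return (base + y) & 0x00FFFFFFu;
}
```

In this model the two functions agree whenever the pointer's bank byte equals DBR, and silently diverge otherwise, which is why the flag is only safe per-TU when every dereferenced pointer is known to live in the DBR bank.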

Uncommitted, must keep

git status --short (5 modified, no untracked of consequence):

  1. SESSION_RECOVERY.md — this doc.
  2. scripts/smokeTest.sh — added "omfEmit --stack-size emits a DP/Stack ~Direct segment" check. Validates 3-segment layout (ExpressLoad + code + DP/Stack) when --stack-size is supplied; parses the third segment header against KIND/LENGTH/RESSPC/ALIGN/SEGNUM=3/name="~Direct" expectations.
  3. src/link816/omfEmit.cpp: emitDpStackSeg(length, segNum) plus the --stack-size N CLI flag. Validation: 256 ≤ N ≤ 65536, page-aligned. --stack-size implicitly enables --expressload — the GS/OS Loader's slow path silently rejects multi-seg OMFs (see §D below for the empirical evidence).
  4. src/llvm/lib/Target/W65816/W65816ISelLowering.cpp: LowerShift now inlines i32 SHL/SRL/SRA by N=1..4 instead of routing to __lshrsi3/__ashlsi3/__ashrsi3. See §E.
  5. src/llvm/lib/Target/W65816/W65816StackSlotCleanup.cpp — pre-existing uncommitted change from prior turns; verify against git log before re-staging if recovery is fresh.

Earlier-mentioned files (snprintf.c, W65816InstrInfo.cpp, W65816SjLjFinalize.cpp) have been checkpoint-committed and are no longer in git status.

Already-committed in this session arc

Per git log --oneline -20 these are the recent checkpoint commits; the diffs they contain are real and load-bearing.

The big ones (search by file or grep):

  • JSLpseudo Defs += IMG0..IMG15 in W65816InstrInfo.td. With the wider Defs, regalloc spills IMG-class vregs around calls instead of treating them as preserved.
  • W65816RegisterInfo.cpp eliminateFrameIndex for STAfi: PHA-bracketed for non-A source (IMG/X/Y). The lda dp; sta d,s chain clobbered A; bracket preserves A while shifting offset by +2 between PHA and PLA. Defs=[A] kept on STAfi as safe over-approximation.
  • W65816RegisterInfo.cpp eliminateFrameIndex for LDAfi: if Dst = IMGn, append STA dp so the IMG slot actually receives the loaded value. Previously only loaded into A; downstream COPY $x = $imgN (= ldx $D?) read garbage. This was the smoking gun for dadd(1.5, 2.5) → 0x4010_0000_3000_3000.
  • W65816LowerWide32.cpp fixed-point erase loop. Was single-pass; REG_SEQUENCE got skipped if a not-yet-erased COPY consumer kept it alive at the iteration moment. Removed ~40 dead Wide32 vregs from __adddf3's pre-RA MIR.
  • src/llvm/test/CodeGen/W65816/i64-first-arg-img16.ll relaxed stx 0xd / sta 0xd to 0x{{[cd]}} (regalloc now picks IMG8..15 too).

Fixes landed (full list with rationale)

Each entry: what / why / where / what regression it would cause if reverted.

A. Hash-shell DELETE bug → IMG caller-clobber

Symptom: dbDelete("age") returned 0 ("not found") instead of 1. DELETE never ran; COUNT stayed at 2.

Root cause: dbDelete did stx 0xd0 to save k_high, called hashKey, then pei 0xd0 to push k_high to strcmp. hashKey used $D0 as scratch in its loop body (sta 0xd0 storing the iterator's running-ptr-low). $D0 was clobbered by the time pei 0xd0 ran. JSLpseudo Defs only listed A, X, Y, DPF0 — IMG slots were not modelled as caller-clobber.

Fix: JSLpseudo Defs += [IMG0..IMG15].
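The failure mode can be modelled in plain C. This is a sketch only: imgSlotModel stands in for the $D0 slot, hashKeyModel for a callee that uses the slot as loop scratch, and the value written to the slot is an arbitrary stand-in for the running pointer. All names are hypothetical.

```c
#include <stdint.h>

/* Stand-in for one IMG direct-page slot such as $D0 (model only). */
uint16_t imgSlotModel;

/* Callee that uses the slot as loop scratch, as hashKey did. */
uint16_t hashKeyModel(const char *s) {
    uint16_t h = 5381;
    for (; *s; s++) {
        imgSlotModel = (uint16_t)(unsigned char)*s;  /* scratch write */
        h = (uint16_t)(h * 33u + imgSlotModel);
    }
    return h;
}

/* Old model: the caller assumes the slot survives the call.
 * Returns 1 if the saved value is still intact afterwards. */
int slotSurvivesAssumedPreserved(uint16_t kHigh) {
    imgSlotModel = kHigh;
    (void)hashKeyModel("age");
    return imgSlotModel == kHigh;
}

/* New model: the slot is caller-clobber, so the caller spills
 * around the call and reloads afterwards. */
int slotSurvivesWithSpill(uint16_t kHigh) {
    uint16_t spill = kHigh;      /* stands in for a stack spill slot */
    imgSlotModel = kHigh;
    (void)hashKeyModel("age");
    imgSlotModel = spill;        /* reload after the call */
    return imgSlotModel == kHigh;
}
```

The first caller corresponds to the pre-fix Defs list (slot treated as preserved); the second corresponds to what regalloc does once IMG0..IMG15 are listed as clobbered.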

Cascading fallout (each required its own fix):

A1. copyPhysReg vreg fallback

storeRegToStackSlot's unpaired-Wide32 default branch hit unreachable when called with a vreg source. Basic regalloc's InlineSpiller does this. Fix: short-circuit virtual-reg cases to TargetOpcode::COPY.

A2. LowerWide32 fixed-point erase

Single-pass erase left ~40 dead Wide32 vregs in __adddf3. Pattern:

%X:wide32 = REG_SEQUENCE ...
%Y:wide32 = COPY %X
... uses of %Y rewritten by Pass 3 ...

Single-pass: REG_SEQUENCE skipped (COPY consumer still alive), then COPY erased (now %X dead but loop already passed it). Fix: iterate until no progress.
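The single-pass versus fixed-point difference shows up in a two-instruction toy. This is a sketch, not the LowerWide32 code: ToyInstr, eraseDeadOnce, and eraseDeadToFixedPoint are hypothetical names, and each toy instruction defines one vreg and reads at most one.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    int  def;     /* vreg defined by this instruction */
    int  use;     /* vreg read by this instruction, or -1 for none */
    bool erased;
} ToyInstr;

/* True if any not-yet-erased instruction still reads vreg. */
static bool hasLiveUser(const ToyInstr *ins, size_t n, int vreg) {
    for (size_t i = 0; i < n; i++)
        if (!ins[i].erased && ins[i].use == vreg)
            return true;
    return false;
}

/* One forward sweep in program order; returns how many were erased. */
size_t eraseDeadOnce(ToyInstr *ins, size_t n) {
    size_t erased = 0;
    for (size_t i = 0; i < n; i++) {
        if (ins[i].erased || hasLiveUser(ins, n, ins[i].def))
            continue;
        ins[i].erased = true;
        erased++;
    }
    return erased;
}

/* Repeat the sweep until a pass makes no progress. */
size_t eraseDeadToFixedPoint(ToyInstr *ins, size_t n) {
    size_t total = 0, step;
    while ((step = eraseDeadOnce(ins, n)) != 0)
        total += step;
    return total;
}
```

With instruction 0 as the REG_SEQUENCE (defines %X) and instruction 1 as the COPY (reads %X, defines an unused %Y), one sweep erases only the COPY; the fixed-point loop erases both.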

A3. STAfi PHA-bracket

Without bracket, regalloc could schedule $img0 = COPY $a AFTER a STAfi-with-IMG-source whose internal lda dp clobbered $a, silently storing X's value where A's was expected.

A4. LDAfi-IMG-dest STA dp

The big one. With narrow IMG, regalloc kept Wide16 vregs in IMG slots across calls, never needed $imgN = LDAfi %stack.X. With full IMG, every cross-call spill needed it. The expansion only emitted LDA d,s (load A) — never wrote to the IMG slot. Downstream COPY $x = $imgN (= ldx $D?) read stale prior data. Manifested as dadd(1.5, 2.5) → 0x4010_0000_3000_3000 (mantissa garbage).

Diagnostic that found it: diff post-RA MIR narrow vs full IMG. Pre-RA MIR was identical. Full had 6 $imgN = LDAfi instances; narrow had 0. Narrow used COPY $imgN = $a patterns instead — those work correctly.

A5. FileCheck regex

src/llvm/test/CodeGen/W65816/i64-first-arg-img16.ll expected stx 0xd / sta 0xd. Under full IMG clobber, regalloc picks IMG8..15 ($C0/$C2) for cross-call arg saves. Relaxed to 0x{{[cd]}}.

B. C++ try/catch source-level path

Two bugs blocking real clang++ -fsjlj-exceptions source code:

B1. W65816SjLjFinalize catchtab ordering

runOnFunction erased landingpad insts at line ~245, then built the catchtab at line ~290 via LPadBB->getLandingPadInst(). By that point, landingpads were nullptr. The build loop's if (!LP) continue; skipped every entry. Catchtab ended with just (0,0) sentinel. LSDA was 4 bytes of zeros. findCatch saw ctx->lsda == 0's entry and bailed. Result: any throw aborted.

Fix: capture catch-clause typeinfo Constants into a DenseMap<BasicBlock*, LPadInfo> BEFORE erasing landingpads; the catchtab build loop reads from the saved map.
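The ordering problem and its fix can be sketched without LLVM. This is a toy model under stated assumptions: ToyBlock, ToyLandingPad, ToyCatchEntry, and the four functions are hypothetical, and a plain array indexed by block number replaces the DenseMap.

```c
#include <stddef.h>

typedef struct { const void *typeinfo; } ToyLandingPad;
typedef struct { const ToyLandingPad *lpad; } ToyBlock;
typedef struct { const void *typeinfo; size_t block; } ToyCatchEntry;

/* Fix, step 1: record each block's typeinfo while the landingpad exists. */
void captureTypeinfo(const ToyBlock *blocks, size_t n, const void **saved) {
    for (size_t i = 0; i < n; i++)
        saved[i] = blocks[i].lpad ? blocks[i].lpad->typeinfo : NULL;
}

/* Models the erase step: afterwards no block has a landingpad. */
void eraseLandingPads(ToyBlock *blocks, size_t n) {
    for (size_t i = 0; i < n; i++)
        blocks[i].lpad = NULL;
}

/* Buggy order: consults landingpads that were already erased. */
size_t buildCatchtabFromBlocks(const ToyBlock *blocks, size_t n,
                               ToyCatchEntry *out) {
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        if (!blocks[i].lpad)
            continue;                      /* skips every entry after erase */
        out[k].typeinfo = blocks[i].lpad->typeinfo;
        out[k].block = i;
        k++;
    }
    out[k].typeinfo = NULL;                /* (0,0) sentinel */
    out[k].block = 0;
    return k + 1;
}

/* Fix, step 2: build from the saved array instead. */
size_t buildCatchtabFromSaved(const void **saved, size_t n,
                              ToyCatchEntry *out) {
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        if (!saved[i])
            continue;
        out[k].typeinfo = saved[i];
        out[k].block = i;
        k++;
    }
    out[k].typeinfo = NULL;                /* (0,0) sentinel */
    out[k].block = 0;
    return k + 1;
}
```

After the erase step, the first builder returns a table holding only the sentinel, matching the all-zero LSDA symptom; the second returns one entry per captured landingpad plus the sentinel. The out array must have room for n + 1 entries.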

B2. copyPhysReg IMG-to-IMG PHA-bracket

Comment said "Caller is responsible for ensuring A is dead at this program point (regalloc usually arranges this)." It doesn't, in practice. Regalloc inserted IMG-to-IMG copies between $a = COPY $img10 and STAfi $a, slot. Unbracketed lda src; sta dst clobbered A. The subsequent STAfi spilled garbage. Visible as *p = 42 after __cxa_allocate_exception storing 42 to wrong addr (indirect-long setup got hi-half at lo-slot).

Fix: PHA-bracket. Cost +7 cyc / +2 bytes per IMG-IMG copy (rare).

Verified end-to-end via MAME breakpoints: begin_catch entered with correct ExcHeader, end_catch entered with A=42, doTest returns A=42 from real C++ try { throw 42; } catch (int x) { return x; }.

C. Cleanup wins

  • runtime/src/snprintf.c:106 — removed optnone on emitULong. Smoke green.
  • runtime/src/snprintf.c:303 — removed optnone on snprintf. Smoke green.

D. omfEmit --stack-size — DP/Stack segment for GS/OS Loader

Added emitDpStackSeg (src/link816/omfEmit.cpp). KIND=0x1012 (DP/Stack | PRIVATE), LENGTH=RESSPC=requested-bytes, ALIGN=0x100, BANKSIZE=0, body is a single END opcode. Apps can now request a stack of any page-aligned size from 256B to 64KB (replacing GS/OS Loader's default 4KB allocation).
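The size rule stated above (256 to 65536 bytes, page-aligned, with a 256-byte page per ALIGN=0x100) reduces to a three-part predicate. A minimal sketch; stackSizeIsValid is a hypothetical name, and the actual check in omfEmit.cpp may be written differently.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if n is an acceptable --stack-size value under the stated rule:
 * at least one page, at most 64 KB, and a multiple of 256 bytes. */
bool stackSizeIsValid(uint32_t n) {
    return n >= 256u && n <= 65536u && (n & 0xFFu) == 0;
}
```

Under this rule 4096 and 65536 are accepted, while 255, 300, and 65792 are rejected.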

Loader gotcha (cost ~1 hour to debug): plain (non-ExpressLoad) multi-segment OMFs do NOT launch under real GS/OS 6.0.2 — the Loader's slow path silently rejects the file and our entry point never runs. ExpressLoad-wrapped multi-segment OMFs DO work. Fix: --stack-size now implicitly enables --expressload (the Loader's slow path is empirically broken for our 2-seg layout). The DP/Stack seg is appended AFTER the user code seg as SEGNUM=3; the Loader walks all segments by KIND after the ExpressLoad fast-load step finishes.

Verified: runViaFinder.sh /tmp/test_el_dp.omf --check 0x70=0x42 0x71=0x99 passes under real GS/OS 6.0.2 with --stack-size 4096 --expressload. Verified failure mode: same payload with --stack-size alone (no --expressload) → 0x70=0x00 (program never executed). Documented in feedback_loader_multi_seg_needs_expressload.md.

Smoke updated: 132/132 expects 3 segs (ExpressLoad + code + DP/Stack) when --stack-size is supplied.

E. i32 shift-by-N inlined (was full libcall) — speed win

W65816ISelLowering.cpp LowerShift now inlines i32 SHL/SRL/SRA by N=1..4. Previously every i32 shift went through __lshrsi3/__ashlsi3/__ashrsi3 — ~300+ cyc per call. popcount benchmark: 8320 → 6888 cyc/call, 17% faster. Implementation extracts Wide32 halves via extractWide32Lo/Hi, applies per-step lsr; ror-equivalent SDAG ops with explicit carry propagation ((Hi & 1) << 15 for SRL/SRA's lo-fill, Lo >> 15 for SHL's hi-fill), recombines via buildWide32. N>4 still routes to libcall — the unrolled cost (~5 i16 ops × N) crosses libcall overhead at N≈5.

Documented in feedback_i32_shift_inline.md.
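The carry propagation described above can be checked on the host with 16-bit halves. This is a C model of the arithmetic only, not the SDAG lowering; Halves32 and the three shift functions are hypothetical names.

```c
#include <stdint.h>

typedef struct { uint16_t lo, hi; } Halves32;

static Halves32 splitHalves(uint32_t v) {
    Halves32 w = { (uint16_t)(v & 0xFFFFu), (uint16_t)(v >> 16) };
    return w;
}

static uint32_t joinHalves(Halves32 w) {
    return ((uint32_t)w.hi << 16) | w.lo;
}

/* SHL by n: bit 15 of lo (Lo >> 15) fills bit 0 of hi at each step. */
uint32_t shl32Halves(uint32_t v, unsigned n) {
    Halves32 w = splitHalves(v);
    while (n--) {
        uint16_t carry = (uint16_t)(w.lo >> 15);
        w.hi = (uint16_t)((w.hi << 1) | carry);
        w.lo = (uint16_t)(w.lo << 1);
    }
    return joinHalves(w);
}

/* SRL by n: bit 0 of hi ((Hi & 1) << 15) fills bit 15 of lo at each step. */
uint32_t srl32Halves(uint32_t v, unsigned n) {
    Halves32 w = splitHalves(v);
    while (n--) {
        uint16_t carry = (uint16_t)((w.hi & 1u) << 15);
        w.lo = (uint16_t)((w.lo >> 1) | carry);
        w.hi = (uint16_t)(w.hi >> 1);
    }
    return joinHalves(w);
}

/* SRA by n: same lo-fill as SRL, and hi keeps its sign bit. */
uint32_t sra32Halves(uint32_t v, unsigned n) {
    Halves32 w = splitHalves(v);
    while (n--) {
        uint16_t carry = (uint16_t)((w.hi & 1u) << 15);
        uint16_t sign  = (uint16_t)(w.hi & 0x8000u);
        w.lo = (uint16_t)((w.lo >> 1) | carry);
        w.hi = (uint16_t)((w.hi >> 1) | sign);
    }
    return joinHalves(w);
}
```

For example, srl32Halves(0x80018000u, 3) gives 0x10003000 and sra32Halves(0x80018000u, 3) gives 0xF0003000 in this model. The loops accept any n, but the backend only inlines N=1..4.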

Still-open work areas

Each carries a fair-warning note for whoever picks it up.

1. qsort/bsearch optnone — REMOVED 2026-05-08

Source-restructured qsort: split the inner loop into a __attribute__((noinline)) helper qsortInner (4 args: base, cur, size, cmp). Outer qsort just iterates i = 1..nmemb-1 and calls qsortInner(base, base + i*size, size, cmp). This drops outer qsort's i32-vreg simultaneous-live count below the inline-spill OOM threshold; both halves compile cleanly at -O2 + basic regalloc.
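The shape of the split looks roughly like the following. This is a sketch, not the runtime source: the outer loop and the four-argument helper follow the description above, but the helper body (an insertion step with byte-wise swaps) is an assumption, and all names carry a Sketch suffix to mark them as hypothetical. The noinline attribute is the GCC/Clang spelling.

```c
#include <stddef.h>

typedef int (*CmpFnSketch)(const void *, const void *);

/* Swap two elements of the given size, one byte at a time. */
static void swapBytesSketch(char *a, char *b, size_t size) {
    while (size--) {
        char t = *a;
        *a++ = *b;
        *b++ = t;
    }
}

/* Inner loop kept out of line so its live values do not add to the
 * outer function's register pressure. Body is an assumed insertion step. */
__attribute__((noinline))
static void qsortInnerSketch(char *base, char *cur, size_t size,
                             CmpFnSketch cmp) {
    while (cur > base && cmp(cur - size, cur) > 0) {
        swapBytesSketch(cur - size, cur, size);
        cur -= size;
    }
}

/* Outer loop: i = 1..nmemb-1, one helper call per element. */
void qsortSketch(void *base, size_t nmemb, size_t size, CmpFnSketch cmp) {
    char *b = (char *)base;
    for (size_t i = 1; i < nmemb; i++)
        qsortInnerSketch(b, b + i * size, size, cmp);
}

/* Example comparator for int elements. */
int cmpIntSketch(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}
```

The point of the structure is that the outer function holds only the loop index and the four arguments live across the call, while the comparison and swap state lives entirely inside the helper.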

bsearch optnone was kept-for-symmetry — once removed, it just worked. The IMG-clobber + LDAfi-IMG-store backend fixes from 2026-05-07 had already resolved its underlying pressure issue.

Smoke stays green (now 132/132).

2. gmtime_r optnone

runtime/src/timeExt.c:69. NOT a backend bug — IR-level optimization issue (loop rotation + IndVar simplify mis-evaluating days >= 365L + (__isLeap(...) ? 1 : 0)). Fixing requires deciding which combine pass is wrong and why. Out of scope for backend work.

3. softDouble noinlines

runtime/src/softDouble.c:30 (dpack) and :51 (dclass). Removing dpack noinline broke dadd this session — register pressure for __adddf3/__muldf3/__divdf3. Architectural for the same reason as qsort.

4. Greedy regalloc retry — TRIED, blocked

Tested 2026-05-08. Greedy fails immediately on atoi in libc.c:

LiveRangeEdit.cpp:200: void llvm::LiveRangeEdit::eliminateDeadDef(...):
Assertion `MI->allDefsAreDead() && "Def isn't really dead"' failed.

Same upstream LLVM bug class as the dadd full-IMG attempt: sub-register pair partial defs that the regalloc treats as fully dead. Greedy is genuinely incompatible with the W65816's split-half subreg-pair patterns until the upstream LLVM issue is patched. Reverted to basic regalloc. The memory note feedback_greedy_high_pressure.md already covers this.

5. gmtime_r optnone — TRIED, blocked

Tested 2026-05-08. Hoisting yearLen to a long local (avoiding the double-recompute of 365L + (__isLeap ? 1 : 0)) didn't help; adding volatile to the local also didn't help. IR optimizer is still folding the comparison to compile-time-false. Source-level C restructuring won't dodge it; needs IR-pass-level work to identify which combine pass mis-evaluates and why. optnone stays.
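A host-side reference for the year-peeling comparison is useful when checking whether a target build has folded it away. The sketch below is an assumption about the loop's shape based on the expression quoted above, not a copy of timeExt.c; isLeapSketch and yearFromDaysSketch are hypothetical names, and days is assumed to be non-negative days since 1970-01-01.

```c
/* Gregorian leap-year rule. */
static int isLeapSketch(long year) {
    return (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
}

/* Peel whole years off a day count, starting at 1970.
 * Assumes days >= 0. */
long yearFromDaysSketch(long days) {
    long year = 1970;
    for (;;) {
        long yearLen = 365L + (isLeapSketch(year) ? 1 : 0);
        if (days < yearLen)
            break;              /* the comparison the optimizer mis-folds */
        days -= yearLen;
        year++;
    }
    return year;
}
```

If the comparison were folded to compile-time-false on the target, the loop would never advance and every input would map to 1970, so comparing a few values (for example day 1096, which falls in 1973) against this host reference is a quick regression probe.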

How to verify recovery

cd /home/scott/claude/llvm816
git status                                        # 5 modified files listed above
cd tools/llvm-mos-build && ninja llc clang        # rebuild backend (~5 min)
cd /home/scott/claude/llvm816
cd src/link816 && make && cd ../..                # rebuild link816 + omfEmit
bash runtime/build.sh                             # build runtime
bash scripts/smokeTest.sh                         # should end "all smoke checks passed"
bash scripts/benchCyclesPrecise.sh                # popcount should be ~6888 cyc

Loader smoke (validates DP/Stack seg under real GS/OS 6.0.2):

# Build a simple test program with --stack-size, run via Finder.
tools/omfEmit --input X.bin --map X.map --base 0x1000 --entry __start \
    --output /tmp/t.omf --stack-size 4096 --relocs X.relocs
bash scripts/runViaFinder.sh /tmp/t.omf --check 0x70=0x42 0x71=0x99

If smoke fails, the likely cause is one of the 5 uncommitted files got reverted; check git status and re-apply. If popcount bench regressed past ~7500 cyc, suspect the i32-shift-inline change in W65816ISelLowering.cpp was lost.

Diagnostic tools that worked

For posterity — these are the patterns that paid off this session.

Pre-RA vs post-RA MIR diff

clang -mllvm -stop-before=regallocbasic -S ...     # pre-RA
clang -mllvm -stop-after=virtregrewriter -S ...    # post-RA (post-virtregrewriter)

Diff narrow-IMG vs full-IMG post-RA MIR for the failing function. Pre-RA is identical (same IR), so the diff isolates regalloc-decision divergence. Look at every NEW pattern that appears only in the failing build — $imgN = LDAfi was the smoking gun for dadd.

Pass-by-pass IR/MIR dumps

clang -mllvm -print-after=w65816-lower-wide32 -S ...
clang -mllvm -print-after-all -S ... 2>dump.txt

MAME debugger via xvfb-run

xvfb-run -a mame apple2gs ... -debug -debugger qt -oslog -seconds_to_run N

With autoboot Lua: load .bin into bank 0 (skip $C000..$CFFF I/O), set CPU state, then cpu.debug:bpset(addr, condition, action) with actions like "logerror \"...\\n\",a,x,...; go". logerror with format args goes to stdout under -oslog. Memory reads in expressions: b@(addr), w@(addr). Watchpoints: cpu.debug:wpset(prog_space, "w", addr, len, condition, action).

AVOID: add_machine_pause_notifier + cpu.debug:go() in callback — segfaults from reentrancy. printf in actions stays in debugger console (not -oslog). tracelog also debugger-console only.

Trace methodology (find divergence point)

  1. Set BPs at every JSLpseudo callee in the failing function.
  2. Capture A/X/Y/DPF0 at each return.
  3. Find first divergent return between known-good and failing builds.
  4. The instruction sequence between previous-OK and first-divergent return is where the bug lives.

This pattern found the dadd bug at jsl@0x207f → __lshrsi3(0x8001_8000, 3) in 30 minutes. Recommended.
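Step 3 of the methodology is a linear scan over the captured snapshots. A minimal sketch, assuming each return is logged as four 16-bit values; RetSnapshot and firstDivergentReturn are hypothetical names, and the 16-bit width chosen for DPF0 is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Register state captured at one callee return. */
typedef struct { uint16_t a, x, y, dpf0; } RetSnapshot;

/* Index of the first return where the two builds disagree,
 * or n if all n snapshots match. */
size_t firstDivergentReturn(const RetSnapshot *good, const RetSnapshot *bad,
                            size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (good[i].a != bad[i].a || good[i].x != bad[i].x ||
            good[i].y != bad[i].y || good[i].dpf0 != bad[i].dpf0)
            return i;
    }
    return n;
}
```

If the function returns i, the instructions to inspect are those between return i-1 and return i in the failing build.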

Memory notes referenced

(Filenames under /home/scott/.claude/projects/-home-scott-claude-llvm816/memory/.)

  • feedback_strstr9_long_haystack.md — the hash-shell bug story.
  • feedback_cpp_subset.md — C++ subset, including the SjLj fix.
  • feedback_ptr32_frame_limit.md — was 5 days stale; updated 2026-05-07 to "DONE, 131/131 smoke green".
  • feedback_jslpseudo_caller_save.md, feedback_libcall_img_clobber.md, feedback_img_slot_expansion.md, feedback_greedy_high_pressure.md — related backend topics.
  • feedback_loader_multi_seg_needs_expressload.md: new 2026-05-08. Multi-seg OMFs need ExpressLoad to launch under real Loader.
  • feedback_i32_shift_inline.md: new 2026-05-08. Inline i32 shift-by-N for N=1..4; first quantified bench-vs-self speed win.
  • feedback_speed_over_size.md: new 2026-05-07. Optimization priorities: cycle count over byte count, full stop.

Next session candidates (ranked)

evalAt at 1.86× vs Calypsi is the structural floor for peephole work (see feedback_evalat_structural_gap.md). Further gains need:

  1. i64-by-pointer ABI (rejected this session — diminishing returns). Pass doubles by ptr instead of value: saves ~120 cyc per evalAt call. Requires runtime rewrite, OMF compat checks, every double caller updated. Risk:reward too high for the size of the gain.
  2. __divdf3 / __adddf3 algorithmic improvements. ddiv 1261 cyc could drop via Newton-Raphson reciprocal multiplication (a × (1/b) instead of bit-by-bit long division). Major rewrite, but our __muldi3 short-circuit makes the multiplications cheap now.
  3. Higher-resolution cycle timer. HBL counter is 8-bit and wraps at ~256 ticks; combining scan-line position + frame counter would give per-bench resolution better than ±65 cyc. Would unblock benchmarking sub-loop changes (e.g., the LSR-dp shift form).
  4. More peepholes from the audit. Phase 4 STA_StackRel extension landed but doesn't fire in current libc (frame sizes too large). If callers shrink frames via better SSM, more functions become eligible.
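For candidate 2, the Newton-Raphson reciprocal iteration is x_{n+1} = x_n × (2 − b × x_n), which roughly doubles the number of correct bits per step when the seed x0 satisfies 0 < x0 < 2/b. The sketch below illustrates the numerics with host doubles only; a real __divdf3 would work on integer mantissas and needs its own seed and rounding strategy. recipNewton and divViaRecip are hypothetical names.

```c
/* Approximate 1/b by Newton-Raphson, starting from seed x0.
 * Converges for b > 0 when 0 < x0 < 2/b. */
double recipNewton(double b, double x0, int iters) {
    double x = x0;
    for (int i = 0; i < iters; i++)
        x = x * (2.0 - b * x);   /* x_{n+1} = x_n * (2 - b * x_n) */
    return x;
}

/* a / b computed as a * (1/b): multiplications only, no long division. */
double divViaRecip(double a, double b, double x0, int iters) {
    return a * recipNewton(b, x0, iters);
}
```

Each iteration costs two multiplications and one subtraction, which is why the approach depends on cheap multiplies. The relative error squares at every step, so a seed with error 0.1 (such as 0.3 for b = 3) reaches double precision in about five iterations.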