Session Recovery — last updated 2026-05-20
Living recovery doc. Update on every meaningful change. If session is lost, read this top-to-bottom + the memory notes referenced inside, then reread the actual diffs in tree to ground assumptions.
Headline state
- Smoke: 148/148 green. Demos 9/9 (helloBeep/helloText/helloWindow/orcaFrame/qdProbe/heavyRelocs/frame/reversi/minicad).
- Active config: ptr32 (p:32:16), full IMG0..IMG15 caller-clobber on JSL, greedy regalloc at -O1+.
- Branch: main.
- vs Calypsi static-inst ratio (2026-05-20): sumSquares 0.84× (26 vs 31, we beat), mul16to32 0.25× (1 vs 4, we beat), evalAt 1.86× (472 vs 254, structural floor; ABI overhaul rejected).
- Cycle benches (2026-05-20): popcount 93, strcpy 91, bsearch 127, memcmp 113, fib 97, dotProduct 144, sumOfSquares 126 cyc/iter (100 iters); dadd 1157, ddiv 1261, dmul 1033 cyc/iter (10 iters).
- Recent session wins (2026-05-20):
  - 8 always-on peepholes + extended phase 4 in W65816StackRelToImg (evalAt 498→472, fib -35%, 35 libc fns shrunk)
  - __muldi3 32-bit short-circuit (dmul 1605→1033, -36%)
  - case-(b) ImgCalleeSave bracket hoist enables phase 4 to elide TAY/TYA round-trip in synergy
  - FP cycle benches added (dadd/dmul/ddiv) with per-bench iter count
  - Documented LSR-dp cycle mystery as HBL-counter wrap artifact
Uncommitted, must keep
git status --short (5 modified, no untracked of consequence):
- SESSION_RECOVERY.md: this doc.
- scripts/smokeTest.sh: added "omfEmit --stack-size emits a DP/Stack ~Direct segment" check. Validates 3-segment layout (ExpressLoad + code + DP/Stack) when --stack-size is supplied; parses the third segment header against KIND/LENGTH/RESSPC/ALIGN/SEGNUM=3/name="~Direct" expectations.
- src/link816/omfEmit.cpp: emitDpStackSeg(length, segNum) plus the --stack-size N CLI flag. Validation: 256 ≤ N ≤ 65536, page-aligned. --stack-size implicitly enables --expressload because the GS/OS Loader's slow path silently rejects multi-seg OMFs (see §D below for the empirical evidence).
- src/llvm/lib/Target/W65816/W65816ISelLowering.cpp: LowerShift now inlines i32 SHL/SRL/SRA by N=1..4 instead of routing to __lshrsi3/__ashlsi3/__ashrsi3. See §E.
- src/llvm/lib/Target/W65816/W65816StackSlotCleanup.cpp: pre-existing uncommitted change from prior turns; verify against git log before re-staging if recovery is fresh.
Earlier-mentioned files (snprintf.c, W65816InstrInfo.cpp,
W65816SjLjFinalize.cpp) have been checkpoint-committed and are no
longer in git status.
Already-committed in this session arc
Per git log --oneline -20 these are the recent checkpoint commits;
the diffs they contain are real and load-bearing.
The big ones (search by file or grep):
- JSLpseudo Defs += IMG0..IMG15 in W65816InstrInfo.td. With the wider Defs, regalloc spills IMG-class vregs around calls instead of treating them as preserved.
- W65816RegisterInfo.cpp eliminateFrameIndex for STAfi: PHA-bracketed for non-A source (IMG/X/Y). The lda dp; sta d,s chain clobbered A; bracket preserves A while shifting offset by +2 between PHA and PLA. Defs=[A] kept on STAfi as safe over-approximation.
- W65816RegisterInfo.cpp eliminateFrameIndex for LDAfi: if Dst = IMGn, append STA dp so the IMG slot actually receives the loaded value. Previously only loaded into A; downstream COPY $x = $imgN (= ldx $D?) read garbage. This was the smoking gun for dadd(1.5, 2.5) → 0x4010_0000_3000_3000.
- W65816LowerWide32.cpp fixed-point erase loop. Was single-pass; REG_SEQUENCE got skipped if a not-yet-erased COPY consumer kept it alive at the iteration moment. Removed ~40 dead Wide32 vregs from __adddf3's pre-RA MIR.
- src/llvm/test/CodeGen/W65816/i64-first-arg-img16.ll relaxed stx 0xd / sta 0xd to 0x{{[cd]}} (regalloc now picks IMG8..15 too).
Fixes landed (full list with rationale)
Each entry: what / why / where / what regression it would cause if reverted.
A. Hash-shell DELETE bug → IMG caller-clobber
Symptom: dbDelete("age") returned 0 ("not found") instead of 1.
DELETE never ran; COUNT stayed at 2.
Root cause: dbDelete did stx 0xd0 to save k_high, called hashKey,
then pei 0xd0 to push k_high to strcmp. hashKey used $D0 as scratch
in its loop body (sta 0xd0 storing the iterator's running-ptr-low). $D0
was clobbered by the time pei 0xd0 ran. JSLpseudo Defs only listed
A, X, Y, DPF0 — IMG slots were not modelled as caller-clobber.
Fix: JSLpseudo Defs += [IMG0..IMG15].
Cascading fallout (each required its own fix):
A1. copyPhysReg vreg fallback
storeRegToStackSlot's unpaired-Wide32 default branch hit unreachable
when called with a vreg source. Basic regalloc's InlineSpiller does
this. Fix: short-circuit virtual-reg cases to TargetOpcode::COPY.
A2. LowerWide32 fixed-point erase
Single-pass erase left ~40 dead Wide32 vregs in __adddf3. Pattern:
%X:wide32 = REG_SEQUENCE ...
%Y:wide32 = COPY %X
... uses of %Y rewritten by Pass 3 ...
Single-pass: REG_SEQUENCE skipped (COPY consumer still alive), then COPY erased (now %X dead but loop already passed it). Fix: iterate until no progress.
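The single-pass vs fixed-point difference can be reproduced on a toy model. This is a minimal sketch, not the code in W65816LowerWide32.cpp: ToyInst, eraseDeadOnce and eraseDeadToFixedPoint are hypothetical names, and a plain vector stands in for the MachineInstr list.

```cpp
#include <cassert>
#include <vector>

// Toy stand-in for a machine instruction: one defined vreg, some used vregs.
// (Hypothetical model; the real pass works on LLVM machine instructions.)
struct ToyInst {
    int def;                // vreg id defined by this instruction
    std::vector<int> uses;  // vreg ids read by this instruction
    bool erased;            // true once the instruction has been deleted
};

// True if any non-erased instruction still reads vreg `reg`.
static bool hasLiveUse(const std::vector<ToyInst> &insts, int reg) {
    for (const ToyInst &inst : insts) {
        if (inst.erased) continue;
        for (int use : inst.uses)
            if (use == reg) return true;
    }
    return false;
}

// One forward sweep: erase every instruction whose def has no live use.
// Returns how many instructions this sweep erased.
static int eraseDeadOnce(std::vector<ToyInst> &insts) {
    int erasedCount = 0;
    for (ToyInst &inst : insts) {
        if (inst.erased) continue;
        if (!hasLiveUse(insts, inst.def)) {
            inst.erased = true;
            ++erasedCount;
        }
    }
    return erasedCount;
}

// Fixed-point version: keep sweeping until a sweep makes no progress.
static int eraseDeadToFixedPoint(std::vector<ToyInst> &insts) {
    int total = 0;
    while (int n = eraseDeadOnce(insts)) total += n;
    // Each instruction can be erased at most once.
    assert(total <= static_cast<int>(insts.size()));
    return total;
}

// Number of instructions that survived.
static int liveCount(const std::vector<ToyInst> &insts) {
    int n = 0;
    for (const ToyInst &inst : insts)
        if (!inst.erased) ++n;
    return n;
}
```

With the two-instruction pattern above (def of %X, then a COPY of %X into an unused %Y), one sweep leaves the def of %X behind; the fixed-point loop removes both.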
A3. STAfi PHA-bracket
Without bracket, regalloc could schedule $img0 = COPY $a AFTER a
STAfi-with-IMG-source whose internal lda dp clobbered $a, silently
storing X's value where A's was expected.
A4. LDAfi-IMG-dest STA dp
The big one. With narrow IMG, regalloc kept Wide16 vregs in IMG
slots across calls, never needed $imgN = LDAfi %stack.X. With full
IMG, every cross-call spill needed it. The expansion only emitted
LDA d,s (load A) — never wrote to the IMG slot. Downstream
COPY $x = $imgN (= ldx $D?) read stale prior data. Manifested as
dadd(1.5, 2.5) → 0x4010_0000_3000_3000 (mantissa garbage).
Diagnostic that found it: diff post-RA MIR narrow vs full IMG. Pre-RA
MIR was identical. Full had 6 $imgN = LDAfi instances; narrow had 0.
Narrow used COPY $imgN = $a patterns instead — those work correctly.
A5. FileCheck regex
src/llvm/test/CodeGen/W65816/i64-first-arg-img16.ll expected
stx 0xd / sta 0xd. Under full IMG clobber, regalloc picks IMG8..15
($C0/$C2) for cross-call arg saves. Relaxed to 0x{{[cd]}}.
B. C++ try/catch source-level path
Two bugs blocking real clang++ -fsjlj-exceptions source code:
B1. W65816SjLjFinalize catchtab ordering
runOnFunction erased landingpad insts at line ~245, then built the
catchtab at line ~290 via LPadBB->getLandingPadInst(). By that
point, landingpads were nullptr. The build loop's if (!LP) continue;
skipped every entry. Catchtab ended with just (0,0) sentinel. LSDA
was 4 bytes of zeros. findCatch saw ctx->lsda == 0's entry and
bailed. Result: any throw aborted.
Fix: capture catch-clause typeinfo Constants into a
DenseMap<BasicBlock*, LPadInfo> BEFORE erasing landingpads; the
catchtab build loop reads from the saved map.
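The ordering bug and its fix, reduced to a toy. This is a sketch under assumptions: ToyBlock, buggyFinalize and fixedFinalize are hypothetical names, std::map stands in for the DenseMap, and a string stands in for the typeinfo Constant.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy basic block: may carry a landing pad with one catch-clause type.
struct ToyBlock {
    int id;
    bool hasLandingPad;
    std::string catchType;  // only meaningful while hasLandingPad is true
};

// Catch table as (block id, typeinfo name) pairs.
using CatchTab = std::vector<std::pair<int, std::string>>;

// Models erasing the landingpad instructions from every block.
static void eraseLandingPads(std::vector<ToyBlock> &blocks) {
    for (ToyBlock &b : blocks) {
        b.hasLandingPad = false;
        b.catchType.clear();
    }
}

// Builds the table by querying the blocks directly; skips blocks with no
// landing pad (the toy equivalent of `if (!LP) continue;`).
static CatchTab buildFromBlocks(const std::vector<ToyBlock> &blocks) {
    CatchTab tab;
    for (const ToyBlock &b : blocks) {
        if (!b.hasLandingPad) continue;
        tab.emplace_back(b.id, b.catchType);
    }
    return tab;
}

// Buggy order: erase first, then build. Every entry gets skipped.
static CatchTab buggyFinalize(std::vector<ToyBlock> blocks) {
    eraseLandingPads(blocks);
    return buildFromBlocks(blocks);
}

// Fixed order: capture the catch info, erase, then build from the capture.
static CatchTab fixedFinalize(std::vector<ToyBlock> blocks) {
    std::map<int, std::string> saved;
    for (const ToyBlock &b : blocks)
        if (b.hasLandingPad) saved[b.id] = b.catchType;
    eraseLandingPads(blocks);
    // The blocks no longer know their catch types; only `saved` does.
    assert(buildFromBlocks(blocks).empty());
    CatchTab tab;
    for (const auto &entry : saved)
        tab.emplace_back(entry.first, entry.second);
    return tab;
}
```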
B2. copyPhysReg IMG-to-IMG PHA-bracket
Comment said "Caller is responsible for ensuring A is dead at this
program point (regalloc usually arranges this)." It doesn't, in
practice. Regalloc inserted IMG-to-IMG copies between $a = COPY $img10
and STAfi $a, slot. Unbracketed lda src; sta dst clobbered A.
The subsequent STAfi spilled garbage. Visible as *p = 42 after
__cxa_allocate_exception storing 42 to wrong addr (indirect-long
setup got hi-half at lo-slot).
Fix: PHA-bracket. Cost +7 cyc / +2 bytes per IMG-IMG copy (rare).
Verified end-to-end via MAME breakpoints: begin_catch entered
with correct ExcHeader, end_catch entered with A=42, doTest returns
A=42 from real C++ try { throw 42; } catch (int x) { return x; }.
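Why the bracket matters, on a toy machine. This is an illustrative sketch only: ToyCpu and both copy helpers are hypothetical, each DP address is modeled as holding one 16-bit word, and cycle costs are not modeled.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Toy CPU state: accumulator, direct-page slots, and a software stack.
struct ToyCpu {
    std::uint16_t a = 0;
    std::array<std::uint16_t, 256> dp{};  // one word per DP address (toy)
    std::vector<std::uint16_t> stack;
};

// Unbracketed IMG-to-IMG copy: lda src; sta dst. Clobbers A.
static void copyImgUnbracketed(ToyCpu &cpu, std::uint8_t src, std::uint8_t dst) {
    cpu.a = cpu.dp[src];  // lda src
    cpu.dp[dst] = cpu.a;  // sta dst
}

// Bracketed copy: pha; lda src; sta dst; pla. A survives the copy.
static void copyImgBracketed(ToyCpu &cpu, std::uint8_t src, std::uint8_t dst) {
    cpu.stack.push_back(cpu.a);  // pha
    cpu.a = cpu.dp[src];         // lda src
    cpu.dp[dst] = cpu.a;         // sta dst
    assert(!cpu.stack.empty());
    cpu.a = cpu.stack.back();    // pla
    cpu.stack.pop_back();
}
```

If A holds a value that a later STAfi is about to spill, the unbracketed form spills the copied IMG value instead; the bracketed form leaves A intact and still updates the destination slot.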
C. Cleanup wins
- runtime/src/snprintf.c:106 (emitULong): removed optnone. Smoke green.
- runtime/src/snprintf.c:303 (snprintf): removed optnone. Smoke green.
D. omfEmit --stack-size — DP/Stack segment for GS/OS Loader
Added emitDpStackSeg (src/link816/omfEmit.cpp). KIND=0x1012 (DP/Stack
| PRIVATE), LENGTH=RESSPC=requested-bytes, ALIGN=0x100, BANKSIZE=0, body
is a single END opcode. Apps can now request a stack of any
page-aligned size from 256B to 64KB (replacing GS/OS Loader's default
4KB allocation).
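The --stack-size constraint (256 ≤ N ≤ 65536, page-aligned) as a standalone check. This is a sketch: isValidStackSize and stackPages are hypothetical helpers, and the actual validation code in omfEmit.cpp may be structured differently.

```cpp
#include <cassert>
#include <cstdint>

// Limits taken from the notes: 256 B minimum, 64 KB maximum, 256-byte pages.
static const std::uint32_t kMinStackBytes = 256;
static const std::uint32_t kMaxStackBytes = 65536;
static const std::uint32_t kPageBytes = 0x100;

// True if `bytes` is an acceptable --stack-size value.
static bool isValidStackSize(std::uint32_t bytes) {
    return bytes >= kMinStackBytes && bytes <= kMaxStackBytes &&
           (bytes % kPageBytes) == 0;
}

// Number of 256-byte pages in a (valid) stack request.
static std::uint32_t stackPages(std::uint32_t bytes) {
    assert(isValidStackSize(bytes));
    return bytes / kPageBytes;
}
```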
Loader gotcha (cost ~1 hour to debug): plain (non-ExpressLoad)
multi-segment OMFs do NOT launch under real GS/OS 6.0.2 — the Loader's
slow path silently rejects the file and our entry point never runs.
ExpressLoad-wrapped multi-segment OMFs DO work. Fix: --stack-size now
implicitly enables --expressload (the Loader's slow path is
empirically broken for our 2-seg layout). The DP/Stack seg is appended
AFTER the user code seg as SEGNUM=3; the Loader walks all segments by
KIND after the ExpressLoad fast-load step finishes.
Verified: runViaFinder.sh /tmp/test_el_dp.omf --check 0x70=0x42 0x71=0x99
passes under real GS/OS 6.0.2 with --stack-size 4096 --expressload.
Verified failure mode: same payload with --stack-size alone (no
--expressload) → 0x70=0x00 (program never executed). Documented
in feedback_loader_multi_seg_needs_expressload.md.
Smoke updated: 132/132 expects 3 segs (ExpressLoad + code + DP/Stack)
when --stack-size is supplied.
E. i32 shift-by-N inlined (was full libcall) — speed win
W65816ISelLowering.cpp LowerShift now inlines i32 SHL/SRL/SRA by
N=1..4. Previously every i32 shift went through __lshrsi3/
__ashlsi3/__ashrsi3 — ~300+ cyc per call. popcount benchmark:
8320 → 6888 cyc/call, 17% faster. Implementation extracts
Wide32 halves via extractWide32Lo/Hi, applies per-step
lsr; ror-equivalent SDAG ops with explicit carry propagation
((Hi & 1) << 15 for SRL/SRA's lo-fill, Lo >> 15 for SHL's
hi-fill), recombines via buildWide32. N>4 still routes to libcall
— the unrolled cost (~5 i16 ops × N) crosses libcall overhead at N≈5.
Documented in feedback_i32_shift_inline.md.
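The per-step carry propagation, modeled on host integers. This is a sketch of the arithmetic only, not the SDAG code: Wide32, split, join and the three shift helpers are hypothetical names standing in for extractWide32Lo/Hi and buildWide32.

```cpp
#include <cassert>
#include <cstdint>

using std::uint16_t;
using std::uint32_t;

// A 32-bit value held as two 16-bit halves, as on the 65816.
struct Wide32 {
    uint16_t lo;
    uint16_t hi;
};

static Wide32 split(uint32_t v) {
    return {static_cast<uint16_t>(v & 0xFFFFu), static_cast<uint16_t>(v >> 16)};
}

static uint32_t join(Wide32 w) {
    return (static_cast<uint32_t>(w.hi) << 16) | w.lo;
}

// SHL by n single-bit steps: bit 15 of lo fills bit 0 of hi (Lo >> 15).
static uint32_t shl32(uint32_t v, unsigned n) {
    assert(n >= 1 && n <= 4);
    Wide32 w = split(v);
    for (unsigned i = 0; i < n; ++i) {
        uint16_t carry = static_cast<uint16_t>(w.lo >> 15);
        w.lo = static_cast<uint16_t>(w.lo << 1);
        w.hi = static_cast<uint16_t>((w.hi << 1) | carry);
    }
    return join(w);
}

// SRL by n steps: bit 0 of hi fills bit 15 of lo ((Hi & 1) << 15).
static uint32_t srl32(uint32_t v, unsigned n) {
    assert(n >= 1 && n <= 4);
    Wide32 w = split(v);
    for (unsigned i = 0; i < n; ++i) {
        uint16_t carry = static_cast<uint16_t>((w.hi & 1u) << 15);
        w.hi = static_cast<uint16_t>(w.hi >> 1);
        w.lo = static_cast<uint16_t>((w.lo >> 1) | carry);
    }
    return join(w);
}

// SRA by n steps: same lo-fill as SRL, but hi keeps its sign bit.
static uint32_t sra32(uint32_t v, unsigned n) {
    assert(n >= 1 && n <= 4);
    Wide32 w = split(v);
    for (unsigned i = 0; i < n; ++i) {
        uint16_t carry = static_cast<uint16_t>((w.hi & 1u) << 15);
        uint16_t sign = static_cast<uint16_t>(w.hi & 0x8000u);
        w.hi = static_cast<uint16_t>((w.hi >> 1) | sign);
        w.lo = static_cast<uint16_t>((w.lo >> 1) | carry);
    }
    return join(w);
}
```

For example, srl32(0x80018000, 3) walks the halves 0x8001:0x8000 → 0x4000:0xC000 → 0x2000:0x6000 → 0x1000:0x3000.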
Still-open work areas
Each carries a fair-warning note for whoever picks it up.
1. qsort/bsearch optnone — REMOVED 2026-05-08
Source-restructured qsort: split the inner loop into a
__attribute__((noinline)) helper qsortInner (4 args: base, cur,
size, cmp). Outer qsort just iterates i = 1..nmemb-1 and calls
qsortInner(base, base + i*size, size, cmp). This drops outer
qsort's i32-vreg simultaneous-live count below the inline-spill
OOM threshold; both halves compile cleanly at -O2 + basic regalloc.
bsearch optnone was kept-for-symmetry — once removed, it just
worked. The IMG-clobber + LDAfi-IMG-store backend fixes from
2026-05-07 had already resolved its underlying pressure issue.
Smoke stays green (now 132/132).
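The shape of the split, as a sketch. It assumes the sort is an insertion sort (inferred from the i = 1..nmemb-1 outer loop, not confirmed by the runtime source); toyQsort, toyQsortInner, swapBytes and cmpInt are hypothetical names chosen to avoid clashing with the real qsort.

```cpp
#include <cassert>
#include <cstddef>

typedef int (*CmpFn)(const void *, const void *);

// Swap two elements of `size` bytes each, byte by byte.
static void swapBytes(char *a, char *b, std::size_t size) {
    for (std::size_t i = 0; i < size; ++i) {
        char tmp = a[i];
        a[i] = b[i];
        b[i] = tmp;
    }
}

// Inner loop, kept out of line so its live values do not add to the
// caller's register pressure: sink the element at `cur` toward `base`
// until it is in order with its left neighbor.
__attribute__((noinline))
static void toyQsortInner(char *base, char *cur, std::size_t size, CmpFn cmp) {
    while (cur > base && cmp(cur - size, cur) > 0) {
        swapBytes(cur - size, cur, size);
        cur -= size;
    }
}

// Outer loop: only base, i, nmemb, size and cmp stay live here.
static void toyQsort(void *basePtr, std::size_t nmemb, std::size_t size,
                     CmpFn cmp) {
    assert(size > 0);
    char *base = static_cast<char *>(basePtr);
    for (std::size_t i = 1; i < nmemb; ++i)
        toyQsortInner(base, base + i * size, size, cmp);
}

// Example comparator for int elements.
static int cmpInt(const void *a, const void *b) {
    int x = *static_cast<const int *>(a);
    int y = *static_cast<const int *>(b);
    return (x > y) - (x < y);
}
```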
2. gmtime_r optnone
runtime/src/timeExt.c:69. NOT a backend bug — IR-level optimization
issue (loop rotation + IndVar simplify mis-evaluating
days >= 365L + (__isLeap(...) ? 1 : 0)). Fixing requires deciding
which combine pass is wrong and why. Out of scope for backend work.
3. softDouble noinlines
runtime/src/softDouble.c:30 (dpack) and :51 (dclass). Removing
dpack noinline broke dadd this session — register pressure for
__adddf3/__muldf3/__divdf3. Architectural for the same reason as
qsort.
4. Greedy regalloc retry — TRIED, blocked
Tested 2026-05-08. Greedy fails immediately on atoi in libc.c:
LiveRangeEdit.cpp:200: void llvm::LiveRangeEdit::eliminateDeadDef(...):
Assertion `MI->allDefsAreDead() && "Def isn't really dead"' failed.
Same upstream LLVM bug class as the dadd full-IMG attempt — sub-register
pair partial defs that the regalloc treats as fully dead. Greedy is
genuinely incompatible with the W65816's split-half subreg-pair patterns
until the upstream LLVM issue is patched. Reverted to basic regalloc.
The note feedback_greedy_high_pressure.md already covers this.
5. gmtime_r optnone — TRIED, blocked
Tested 2026-05-08. Hoisting yearLen to a long local (avoiding the
double-recompute of 365L + (__isLeap ? 1 : 0)) didn't help; adding
volatile to the local also didn't help. IR optimizer is still
folding the comparison to compile-time-false. Source-level C
restructuring won't dodge it; needs IR-pass-level work to identify
which combine pass mis-evaluates and why. optnone stays.
How to verify recovery
cd /home/scott/claude/llvm816
git status # 5 modified files listed above
cd tools/llvm-mos-build && ninja llc clang # rebuild backend (~5 min)
cd /home/scott/claude/llvm816
cd src/link816 && make && cd ../.. # rebuild link816 + omfEmit
bash runtime/build.sh # build runtime
bash scripts/smokeTest.sh # should end "all smoke checks passed"
bash scripts/benchCyclesPrecise.sh # popcount should be ~6888 cyc
Loader smoke (validates DP/Stack seg under real GS/OS 6.0.2):
# Build a simple test program with --stack-size, run via Finder.
tools/omfEmit --input X.bin --map X.map --base 0x1000 --entry __start \
--output /tmp/t.omf --stack-size 4096 --relocs X.relocs
bash scripts/runViaFinder.sh /tmp/t.omf --check 0x70=0x42 0x71=0x99
If smoke fails, the likely cause is one of the 5 uncommitted files
got reverted; check git status and re-apply. If popcount bench
regressed past ~7500 cyc, suspect the i32-shift-inline change in
W65816ISelLowering.cpp was lost.
Diagnostic tools that worked
For posterity — these are the patterns that paid off this session.
Pre-RA vs post-RA MIR diff
clang -mllvm -stop-before=regallocbasic -S ... # pre-RA
clang -mllvm -stop-after=virtregrewriter -S ... # post-RA (post-virtregrewriter)
Diff narrow-IMG vs full-IMG post-RA MIR for the failing function.
Pre-RA is identical (same IR), so the diff isolates regalloc-decision
divergence. Look at every NEW pattern that appears only in the failing
build — $imgN = LDAfi was the smoking gun for dadd.
Pass-by-pass IR/MIR dumps
clang -mllvm -print-after=w65816-lower-wide32 -S ...
clang -mllvm -print-after-all -S ... 2>dump.txt
MAME debugger via xvfb-run
xvfb-run -a mame apple2gs ... -debug -debugger qt -oslog -seconds_to_run N
With autoboot Lua: load .bin into bank 0 (skip $C000..$CFFF I/O),
set CPU state, then cpu.debug:bpset(addr, condition, action) with
actions like "logerror \"...\\n\",a,x,...; go". logerror with format
args goes to stdout under -oslog. Memory reads in expressions:
b@(addr), w@(addr). Watchpoints: cpu.debug:wpset(prog_space, "w", addr, len, condition, action).
AVOID: add_machine_pause_notifier + cpu.debug:go() in callback —
segfaults from reentrancy. printf in actions stays in debugger console
(not -oslog). tracelog also debugger-console only.
Trace methodology (find divergence point)
- Set BPs at every JSLpseudo callee in the failing function.
- Capture A/X/Y/DPF0 at each return.
- Find first divergent return between known-good and failing builds.
- The instruction sequence between previous-OK and first-divergent return is where the bug lives.
This pattern found the dadd bug at jsl@0x207f → __lshrsi3(0x8001_8000, 3)
in 30 minutes. Recommended.
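Once both traces are captured, finding the first divergent return is a linear scan. This is a host-side sketch: RegSnapshot and firstDivergence are hypothetical names, and it assumes the logged A/X/Y/DPF0 values have already been parsed into two vectors.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Registers captured at one callee return.
struct RegSnapshot {
    std::uint16_t a, x, y, dpf0;
};

static bool sameRegs(const RegSnapshot &l, const RegSnapshot &r) {
    return l.a == r.a && l.x == r.x && l.y == r.y && l.dpf0 == r.dpf0;
}

// Index of the first return where the failing build differs from the
// known-good build, or -1 if the traces match. If one trace is a strict
// prefix of the other, the first missing index counts as the divergence.
static int firstDivergence(const std::vector<RegSnapshot> &good,
                           const std::vector<RegSnapshot> &bad) {
    assert(!good.empty() && "need at least one known-good snapshot");
    std::size_t n = std::min(good.size(), bad.size());
    for (std::size_t i = 0; i < n; ++i)
        if (!sameRegs(good[i], bad[i])) return static_cast<int>(i);
    return good.size() == bad.size() ? -1 : static_cast<int>(n);
}
```

The bug then lives between the return at index result-1 and the return at index result.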
Memory notes referenced
(Filenames under /home/scott/.claude/projects/-home-scott-claude-llvm816/memory/.)
- feedback_strstr9_long_haystack.md: the hash-shell bug story.
- feedback_cpp_subset.md: C++ subset, including the SjLj fix.
- feedback_ptr32_frame_limit.md: was 5 days stale; updated 2026-05-07 to "DONE, 131/131 smoke green".
- feedback_jslpseudo_caller_save.md, feedback_libcall_img_clobber.md, feedback_img_slot_expansion.md, feedback_greedy_high_pressure.md: related backend topics.
- feedback_loader_multi_seg_needs_expressload.md: new 2026-05-08. Multi-seg OMFs need ExpressLoad to launch under real Loader.
- feedback_i32_shift_inline.md: new 2026-05-08. Inline i32 shift-by-N for N=1..4; first quantified bench-vs-self speed win.
- feedback_speed_over_size.md: new 2026-05-07. Optimization priorities: cycle count over byte count, full stop.
Next session candidates (ranked)
evalAt at 1.86× vs Calypsi is the structural floor for peephole work
(see feedback_evalat_structural_gap.md). Further gains need:
- i64-by-pointer ABI (rejected this session: diminishing returns). Pass doubles by ptr instead of value: saves ~120 cyc per evalAt call. Requires runtime rewrite, OMF compat checks, every double caller updated. The risk-to-reward ratio is too high for the size of the gain.
- __divdf3 / __adddf3 algorithmic improvements. ddiv 1261 cyc could drop via Newton-Raphson reciprocal multiplication (a * (1/b) instead of bit-by-bit long division). Major rewrite, but our __muldi3 short-circuit makes the multiplications cheap now.
- Higher-resolution cycle timer. HBL counter is 8-bit and wraps at ~256 ticks; combining scan-line position + frame counter would give per-bench resolution better than ±65 cyc. Would unblock benchmarking sub-loop changes (e.g., the LSR-dp shift form).
- More peepholes from the audit. Phase 4 STA_StackRel extension landed but doesn't fire in current libc (frame sizes too large). If callers shrink frames via better SSM, more functions become eligible.
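For the __divdf3 candidate above, the Newton-Raphson reciprocal iteration looks like this. It is a sketch on host doubles purely to show the recurrence x ← x·(2 − b·x); a soft-float version would run the same steps on integer mantissas. reciprocalNR and divideViaReciprocal are hypothetical names, and the linear initial estimate 48/17 − (32/17)·b is the standard one for b in [0.5, 1), not something taken from the runtime.

```cpp
#include <cassert>
#include <cmath>

// Approximate 1/b for b in [0.5, 1). The relative error roughly squares
// on every iteration, so a handful of steps is enough.
static double reciprocalNR(double b, int iterations = 4) {
    assert(b >= 0.5 && b < 1.0);
    double x = 48.0 / 17.0 - (32.0 / 17.0) * b;  // initial estimate
    for (int i = 0; i < iterations; ++i)
        x = x * (2.0 - b * x);                   // Newton-Raphson step
    return x;
}

// a / b for positive finite b, computed as a * (1/b).
static double divideViaReciprocal(double a, double b) {
    assert(b > 0.0);
    int exponent = 0;
    double mantissa = std::frexp(b, &exponent);  // b = mantissa * 2^exponent
    return std::ldexp(a * reciprocalNR(mantissa), -exponent);
}
```

Each iteration costs two multiplies and one subtract, which is why the cheaper multiply path matters for this approach.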