Scott Duensing 25a67901a6 More optimizations.

2026-05-27 19:37:26 -05:00

Session Recovery — last updated 2026-05-27

Living recovery doc. Update on every meaningful change. If session is lost, read this top-to-bottom + the memory notes referenced inside, then reread the actual diffs in tree to ground assumptions.

Headline state

Smoke: 148/148 green. Demos 9/9 (helloBeep/helloText/helloWindow/orcaFrame/qdProbe/heavyRelocs/frame/reversi/minicad).
Active config: ptr32 (p:32:16), full IMG0..IMG15 caller-clobber on JSL, greedy regalloc at -O1+. Inline-threshold lowered to 50 target-wide (was LLVM default 225; was 75 earlier this session).
Branch: main.
vs Calypsi (2026-05-27) — Layer 2 + recent peepholes:
- Cycle benches geomean: 0.62× Calypsi. 9 of 10 below 1.0×; only fib trails at 1.06× (recursive overhead, structural). See cycle bench table below.
- Lua 5.1.5: default config 1.13× Calypsi; with Layer 2 0.93×.
- CoreMark 1.0: with Layer 2 0.79× Calypsi (we beat by 21%).
vs Calypsi static-inst ratio (synthetic bench): sumSquares 0.84× (26 vs 31 — we beat), mul16to32 0.25× (1 vs 4 — we beat), evalAt 1.86× (472 vs 254 — structural floor; ABI overhaul rejected).
New code-gen options (2026-05-25) — see docs/USAGE.md "Advanced: pointer-deref code generation":
- Layer 1 ptr32 deref-fold (always on): LDY offset instead of CLC/ADC carry chain. ~3 instr saved per struct-field access.
- -mllvm -w65816-dbr-safe-ptrs (Layer 2, opt-in): uses lda (d,S),Y for ptr32 derefs assuming bank-byte == DBR. 5 instr → 1 instr per deref. Lua -20.6%. MISCOMPILES cross-bank pointers — opt in per-TU only when safe.
- Inline-threshold lowered to 50 (was 225). Lua -23% total, CoreMark matrix.o 1.37× → 0.97× Calypsi. Override with -mllvm -inline-threshold=N.

Cycle benches per-call (2026-05-27, Layer 2) — via scripts/benchCyclesPrecise.sh vs scripts/benchCyclesCalypsi.sh:

Bench         Ours   Calypsi  Ratio
dotProduct    1534   5712     0.27×
bsearch       682    2387     0.29×
sumOfSquares  6820   16368    0.42×
bubbleSort    11594  17050    0.68×
strLen        767    1023     0.75×
djb2Hash      2046   2643     0.77×
popcount      1194   1534     0.78×
strcpy        1108   1194     0.93×
memcmp        682    716      0.95×
fib           11594  10912    1.06×

Geomean 0.62×. Older HBL-tick numbers (per-iter, 100 iter loops) from benchCycles.sh are still available but lower resolution.
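To re-check the geomean after a bench changes, the table can be fed through a few lines of Python. This is a minimal sketch, not one of the repo's scripts: the cycle counts are copied from the table above and the helper names are made up for illustration.

```python
import math

# (bench, our cycles, Calypsi cycles), copied from the table above.
BENCHES = [
    ("dotProduct", 1534, 5712),
    ("bsearch", 682, 2387),
    ("sumOfSquares", 6820, 16368),
    ("bubbleSort", 11594, 17050),
    ("strLen", 767, 1023),
    ("djb2Hash", 2046, 2643),
    ("popcount", 1194, 1534),
    ("strcpy", 1108, 1194),
    ("memcmp", 682, 716),
    ("fib", 11594, 10912),
]


def geomean_ratio(rows):
    """Geometric mean of ours/Calypsi across all rows."""
    logs = [math.log(ours / theirs) for _, ours, theirs in rows]
    return math.exp(sum(logs) / len(logs))


def below_parity(rows):
    """Names of the benches where we take fewer cycles than Calypsi."""
    return [name for name, ours, theirs in rows if ours < theirs]


if __name__ == "__main__":
    print(f"geomean {geomean_ratio(BENCHES):.3f}x, "
          f"{len(below_parity(BENCHES))} of {len(BENCHES)} below 1.0x")
```

The mean is taken over log ratios, so a 2× win and a 2× loss cancel; an arithmetic mean would not have that property.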

Recent session wins (2026-05-27):
- Y-as-counter for strLen — structural rewrite: drop STX/INX/INC, use Y as offset AND counter. strLen 1279 → 767 cyc (-40%); 0.75× Calypsi (was 1.25×).
- Stack-rel dead-store elim — companion to DP version with SP tracking across PHA/PHP/PEA/PEI/PER/PLA/PLP/PLX/PLY/PHX/PHY. strcpy 1194 → 1108 (-7%, 0.93× Calypsi, beats by 7%). Refactored as a static helper called from the recursive-call bail too so fib gets it. fib 12106 → 11594 (-4%, 1.06× Calypsi).
- DP-indirect-Y for iter (follow-on to X-iter peephole): rewrites TXA;STA stack-rel S;INX;…;LDA (S,s),Y to STX_DP D;INX;…;LDA (D),Y. Saves 4 cyc/iter.
- Dead INC_HI_IF_CARRY elim — when the StackRel ptr-hi slot is never read, elide the carry-bookkeeping for Layer 2 ptr32 loops. Wide impact across strLen/strcpy/djb2Hash/memcmp.
Recent session wins (earlier — 2026-05-20):
- 8 always-on peepholes + extended phase 4 in W65816StackRelToImg (evalAt 498→472, fib -35%, 35 libc fns shrunk)
- __muldi3 32-bit short-circuit (dmul 1605→1033, -36%)
- case-(b) ImgCalleeSave bracket hoist enables phase 4 to elide TAY/TYA round-trip in synergy
- FP cycle benches added (dadd/dmul/ddiv) with per-bench iter count
- Documented LSR-dp cycle mystery as HBL-counter wrap artifact
- Game-like benches added: particles (i16 physics), mandelbrot (i32 fp)
- elideStoreForwarding now reached via early-return bail paths: particles 5005→2253 cyc/iter (-55%). Was being skipped for any function where main IMG promotion bailed (SpAdj invalid, no accesses, or > 16 hot slots).
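
The HBL-counter wrap artifact mentioned above is easy to reproduce in a toy model. The sketch below assumes an 8-bit counter read once at the start and once at the end of a measurement; the function names are illustrative and do not come from benchCycles.sh.

```python
COUNTER_BITS = 8                 # assumption: the HBL counter is 8 bits wide
MODULUS = 1 << COUNTER_BITS


def measured_ticks(start, end):
    """Elapsed ticks as a start/end subtraction sees them (mod 256)."""
    return (end - start) % MODULUS


def simulate(true_ticks, start=0):
    """Return what a measurement reports for a run of true_ticks ticks."""
    end = (start + true_ticks) % MODULUS
    return measured_ticks(start, end)
```

Any run longer than one counter period aliases onto a shorter one, so two code variants can appear to swap places even though neither changed.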

Uncommitted, must keep

git status --short (5 modified, no untracked of consequence):

SESSION_RECOVERY.md — this doc.
scripts/smokeTest.sh — added "omfEmit --stack-size emits a DP/Stack ~Direct segment" check. Validates 3-segment layout (ExpressLoad + code + DP/Stack) when --stack-size is supplied; parses the third segment header against KIND/LENGTH/RESSPC/ALIGN/SEGNUM=3/name="~Direct" expectations.
src/link816/omfEmit.cpp — emitDpStackSeg(length, segNum) plus the --stack-size N CLI flag. Validation: 256 ≤ N ≤ 65536, page-aligned. --stack-size implicitly enables --expressload, because the GS/OS Loader's slow path silently rejects multi-seg OMFs (see §D below for the empirical evidence).
src/llvm/lib/Target/W65816/W65816ISelLowering.cpp — LowerShift now inlines i32 SHL/SRL/SRA by N=1..4 instead of routing to __lshrsi3/__ashlsi3/__ashrsi3. See §E.
src/llvm/lib/Target/W65816/W65816StackSlotCleanup.cpp — pre-existing uncommitted change from prior turns; verify against git log before re-staging if recovery is fresh.

Earlier-mentioned files (snprintf.c, W65816InstrInfo.cpp, W65816SjLjFinalize.cpp) have been checkpoint-committed and are no longer in git status.

Already-committed in this session arc

Per git log --oneline -20 these are the recent checkpoint commits; the diffs they contain are real and load-bearing.

The big ones (search by file or grep):

JSLpseudo Defs += IMG0..IMG15 in W65816InstrInfo.td. With the wider Defs, regalloc spills IMG-class vregs around calls instead of treating them as preserved.
W65816RegisterInfo.cpp eliminateFrameIndex for STAfi: PHA-bracketed for non-A source (IMG/X/Y). The lda dp; sta d,s chain clobbered A; bracket preserves A while shifting offset by +2 between PHA and PLA. Defs=[A] kept on STAfi as safe over-approximation.
W65816RegisterInfo.cpp eliminateFrameIndex for LDAfi: if Dst = IMGn, append STA dp so the IMG slot actually receives the loaded value. Previously only loaded into A; downstream COPY $x = $imgN (= ldx $D?) read garbage. This was the smoking gun for dadd(1.5, 2.5) → 0x4010_0000_3000_3000.
W65816LowerWide32.cpp fixed-point erase loop. Was single-pass; REG_SEQUENCE got skipped if a not-yet-erased COPY consumer kept it alive at the iteration moment. Removed ~40 dead Wide32 vregs from __adddf3's pre-RA MIR.
src/llvm/test/CodeGen/W65816/i64-first-arg-img16.ll relaxed stx 0xd / sta 0xd to 0x{{[cd]}} (regalloc now picks IMG8..15 too).

Fixes landed (full list with rationale)

Each entry: what / why / where / what regression it would cause if reverted.

A. Hash-shell DELETE bug → IMG caller-clobber

Symptom: dbDelete("age") returned 0 ("not found") instead of 1. DELETE never ran; COUNT stayed at 2.

Root cause: dbDelete did stx 0xd0 to save k_high, called hashKey, then pei 0xd0 to push k_high to strcmp. hashKey used $D0 as scratch in its loop body (sta 0xd0 storing the iterator's running-ptr-low). $D0 was clobbered by the time pei 0xd0 ran. JSLpseudo Defs only listed A, X, Y, DPF0 — IMG slots were not modelled as caller-clobber.

Fix: JSLpseudo Defs += [IMG0..IMG15].
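
The failure shape can be reproduced in a toy model. The sketch below is not the real dbDelete or hashKey: the hash loop and all values are stand-ins, and only the shared use of slot $D0 mirrors the bug described above.

```python
SCRATCH = 0xD0   # DP slot that both functions touch in the bug report


def hash_key(dp, key):
    """Stand-in for hashKey: parks its running value in the shared DP slot."""
    h = 0
    for ch in key.encode():
        dp[SCRATCH] = ch                      # sta 0xd0 inside the loop body
        h = (h * 33 + dp[SCRATCH]) & 0xFFFF
    return h


def db_delete_buggy(dp, key, k_high):
    """Caller assumes the DP slot survives the call (IMG not in Defs)."""
    dp[SCRATCH] = k_high                      # stx 0xd0
    hash_key(dp, key)
    return dp[SCRATCH]                        # pei 0xd0 pushes whatever is there


def db_delete_fixed(dp, key, k_high):
    """Caller treats the slot as clobbered and spills around the call."""
    saved = k_high                            # spill to the caller's own frame
    hash_key(dp, key)
    dp[SCRATCH] = saved                       # reload after the call
    return dp[SCRATCH]
```

In the buggy variant the value handed to the next call is the last byte the callee parked in the slot, which matches the report that DELETE never found its key.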

Cascading fallout (each required its own fix):

A1. copyPhysReg vreg fallback

storeRegToStackSlot's unpaired-Wide32 default branch hit unreachable when called with a vreg source. Basic regalloc's InlineSpiller does this. Fix: short-circuit virtual-reg cases to TargetOpcode::COPY.

A2. LowerWide32 fixed-point erase

Single-pass erase left ~40 dead Wide32 vregs in __adddf3. Pattern:

%X:wide32 = REG_SEQUENCE ...
%Y:wide32 = COPY %X
... uses of %Y rewritten by Pass 3 ...

Single-pass: REG_SEQUENCE skipped (COPY consumer still alive), then COPY erased (now %X dead but loop already passed it). Fix: iterate until no progress.
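
A minimal Python model of the two erase strategies, assuming each instruction is just a def plus a list of uses. The dict layout and function names are illustrative and are not the pass's real data structures.

```python
def uses_of(insts, reg):
    """Number of remaining instructions that read reg."""
    return sum(reg in inst["uses"] for inst in insts)


def erase_single_pass(insts):
    """One forward sweep: erase an instruction only if its def is dead now."""
    live = list(insts)
    for inst in list(live):
        if uses_of(live, inst["def"]) == 0:
            live.remove(inst)
    return live


def erase_fixed_point(insts):
    """Repeat the sweep until a full pass erases nothing."""
    live = list(insts)
    while True:
        before = len(live)
        live = erase_single_pass(live)
        if len(live) == before:
            return live


# The pattern from above, after Pass 3 has rewritten every use of %Y.
MIR = [
    {"op": "REG_SEQUENCE", "def": "%X", "uses": ["%lo", "%hi"]},
    {"op": "COPY", "def": "%Y", "uses": ["%X"]},
]
```

The single sweep visits REG_SEQUENCE while the COPY still reads %X, so only the COPY goes away; the second sweep of the fixed-point loop then sees %X with no readers.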

A3. STAfi PHA-bracket

Without bracket, regalloc could schedule $img0 = COPY $a AFTER a STAfi-with-IMG-source whose internal lda dp clobbered $a, silently storing X's value where A's was expected.
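
The bracket and its +2 offset shift can be checked on a small word-granular CPU model. This is a sketch under simplifying assumptions: 16-bit A, memory keyed by the address of each word's low byte, and made-up slot numbers; it models only the four instructions involved.

```python
class Cpu:
    """Tiny 65816-like model: 16-bit A, stack-relative stores, DP loads."""

    def __init__(self):
        self.a = 0
        self.sp = 0x01F0
        self.mem = {}    # address of low byte -> 16-bit word
        self.dp = {}     # direct-page slot -> 16-bit word

    def lda_dp(self, slot):
        self.a = self.dp[slot]

    def sta_sr(self, offset):
        self.mem[self.sp + offset] = self.a      # sta offset,s

    def pha(self):
        self.sp -= 2                             # 16-bit push
        self.mem[self.sp + 1] = self.a

    def pla(self):
        self.a = self.mem[self.sp + 1]
        self.sp += 2


def store_img_unbracketed(cpu, img_slot, offset):
    """lda dp; sta d,s: the store works but A is lost."""
    cpu.lda_dp(img_slot)
    cpu.sta_sr(offset)


def store_img_bracketed(cpu, img_slot, offset):
    """pha; lda dp; sta d+2,s; pla: same store, A preserved."""
    cpu.pha()
    cpu.lda_dp(img_slot)
    cpu.sta_sr(offset + 2)   # SP moved down by 2 between PHA and PLA
    cpu.pla()
```

Both variants write the same stack slot; only the bracketed one leaves A and SP as it found them.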

A4. LDAfi-IMG-dest STA dp

The big one. With narrow IMG, regalloc kept Wide16 vregs in IMG slots across calls, never needed $imgN = LDAfi %stack.X. With full IMG, every cross-call spill needed it. The expansion only emitted LDA d,s (load A) — never wrote to the IMG slot. Downstream COPY $x = $imgN (= ldx $D?) read stale prior data. Manifested as dadd(1.5, 2.5) → 0x4010_0000_3000_3000 (mantissa garbage).

Diagnostic that found it: diff post-RA MIR narrow vs full IMG. Pre-RA MIR was identical. Full had 6 $imgN = LDAfi instances; narrow had 0. Narrow used COPY $imgN = $a patterns instead — those work correctly.

A5. FileCheck regex

src/llvm/test/CodeGen/W65816/i64-first-arg-img16.ll expected stx 0xd / sta 0xd. Under full IMG clobber, regalloc picks IMG8..15 ($C0/$C2) for cross-call arg saves. Relaxed to 0x{{[cd]}}.

B. C++ try/catch source-level path

Two bugs blocking real clang++ -fsjlj-exceptions source code:

B1. W65816SjLjFinalize catchtab ordering

runOnFunction erased landingpad insts at line ~245, then built the catchtab at line ~290 via LPadBB->getLandingPadInst(). By that point, landingpads were nullptr. The build loop's if (!LP) continue; skipped every entry. Catchtab ended with just (0,0) sentinel. LSDA was 4 bytes of zeros. findCatch saw ctx->lsda == 0's entry and bailed. Result: any throw aborted.

Fix: capture catch-clause typeinfo Constants into a DenseMap<BasicBlock*, LPadInfo> BEFORE erasing landingpads; the catchtab build loop reads from the saved map.
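
The ordering bug reduces to "read after erase". The Python sketch below models basic blocks as dicts; the block names and the `_ZTIi` typeinfo symbol are illustrative inputs, not data taken from the pass.

```python
SENTINEL = (0, 0)


def make_blocks():
    """One landing-pad block catching int, plus an ordinary block."""
    return [
        {"name": "lpad0", "landingpad": {"typeinfo": "_ZTIi"}},
        {"name": "body", "landingpad": None},
    ]


def finalize_buggy(blocks):
    """Erase landingpads first, then try to build the table from them."""
    for bb in blocks:
        bb["landingpad"] = None
    tab = []
    for bb in blocks:
        lp = bb["landingpad"]
        if lp is None:               # the 'if (!LP) continue;' path
            continue
        tab.append((bb["name"], lp["typeinfo"]))
    return tab + [SENTINEL]


def finalize_fixed(blocks):
    """Capture the typeinfo into a side map before erasing anything."""
    saved = {bb["name"]: bb["landingpad"]["typeinfo"]
             for bb in blocks if bb["landingpad"] is not None}
    for bb in blocks:
        bb["landingpad"] = None
    return list(saved.items()) + [SENTINEL]
```

The buggy order yields a table holding only the sentinel, which is the all-zero LSDA described above.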

B2. copyPhysReg IMG-to-IMG PHA-bracket

Comment said "Caller is responsible for ensuring A is dead at this program point (regalloc usually arranges this)." It doesn't, in practice. Regalloc inserted IMG-to-IMG copies between $a = COPY $img10 and STAfi $a, slot. Unbracketed lda src; sta dst clobbered A. The subsequent STAfi spilled garbage. Visible as *p = 42 after __cxa_allocate_exception storing 42 to wrong addr (indirect-long setup got hi-half at lo-slot).

Fix: PHA-bracket. Cost +7 cyc / +2 bytes per IMG-IMG copy (rare).

Verified end-to-end via MAME breakpoints: begin_catch entered with correct ExcHeader, end_catch entered with A=42, doTest returns A=42 from real C++ try { throw 42; } catch (int x) { return x; }.

C. Cleanup wins

runtime/src/snprintf.c:106 — removed optnone on emitULong. Smoke green.
runtime/src/snprintf.c:303 — removed optnone on snprintf. Smoke green.

D. `omfEmit --stack-size` — DP/Stack segment for GS/OS Loader

Added emitDpStackSeg (src/link816/omfEmit.cpp). KIND=0x1012 (DP/Stack | PRIVATE), LENGTH=RESSPC=requested-bytes, ALIGN=0x100, BANKSIZE=0, body is a single END opcode. Apps can now request a stack of any page-aligned size from 256B to 64KB (replacing GS/OS Loader's default 4KB allocation).
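
The --stack-size validation rule and the header fields listed above can be written down as a small Python model. This is a sketch of the stated constraints only; it does not serialize an OMF header, and the function names are hypothetical rather than taken from omfEmit.cpp.

```python
PAGE = 0x100


def validate_stack_size(n):
    """Accept only page-aligned sizes with 256 <= n <= 65536."""
    if not 256 <= n <= 65536:
        raise ValueError(f"stack size {n} out of range 256..65536")
    if n % PAGE != 0:
        raise ValueError(f"stack size {n} is not page-aligned")
    return n


def dp_stack_header_fields(n, seg_num):
    """Header values as described in this section (field names only)."""
    n = validate_stack_size(n)
    return {
        "KIND": 0x1012,
        "LENGTH": n,
        "RESSPC": n,
        "ALIGN": PAGE,
        "BANKSIZE": 0,
        "SEGNUM": seg_num,
        "NAME": "~Direct",
    }
```

A call such as `dp_stack_header_fields(4096, 3)` corresponds to the `--stack-size 4096` case that the smoke test checks.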

Loader gotcha (cost ~1 hour to debug): plain (non-ExpressLoad) multi-segment OMFs do NOT launch under real GS/OS 6.0.2 — the Loader's slow path silently rejects the file and our entry point never runs. ExpressLoad-wrapped multi-segment OMFs DO work. Fix: --stack-size now implicitly enables --expressload (the Loader's slow path is empirically broken for our 2-seg layout). The DP/Stack seg is appended AFTER the user code seg as SEGNUM=3; the Loader walks all segments by KIND after the ExpressLoad fast-load step finishes.

Verified: runViaFinder.sh /tmp/test_el_dp.omf --check 0x70=0x42 0x71=0x99 passes under real GS/OS 6.0.2 with --stack-size 4096 --expressload. Verified failure mode: same payload with --stack-size alone (no --expressload) → 0x70=0x00 (program never executed). Documented in feedback_loader_multi_seg_needs_expressload.md.

Smoke updated: 132/132 expects 3 segs (ExpressLoad + code + DP/Stack) when --stack-size is supplied.

E. i32 shift-by-N inlined (was full libcall) — speed win

W65816ISelLowering.cpp LowerShift now inlines i32 SHL/SRL/SRA by N=1..4. Previously every i32 shift went through __lshrsi3/__ashlsi3/__ashrsi3 — ~300+ cyc per call. popcount benchmark: 8320 → 6888 cyc/call, 17% faster. Implementation extracts Wide32 halves via extractWide32Lo/Hi, applies per-step lsr; ror-equivalent SDAG ops with explicit carry propagation ((Hi & 1) << 15 for SRL/SRA's lo-fill, Lo >> 15 for SHL's hi-fill), recombines via buildWide32. N>4 still routes to libcall: the unrolled cost (~5 i16 ops × N) crosses libcall overhead at N≈5.
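
The per-step carry propagation can be sanity-checked against plain 32-bit arithmetic. The Python model below is a sketch of the half-word scheme described above, not the SDAG code; the function names are illustrative.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def srl1(lo, hi):
    """Logical right shift by 1: hi's bit 0 fills lo's bit 15."""
    return (lo >> 1) | ((hi & 1) << 15), hi >> 1


def sra1(lo, hi):
    """Arithmetic right shift by 1: same lo-fill, hi keeps its sign bit."""
    return (lo >> 1) | ((hi & 1) << 15), (hi >> 1) | (hi & 0x8000)


def shl1(lo, hi):
    """Left shift by 1: lo's bit 15 fills hi's bit 0."""
    return (lo << 1) & MASK16, ((hi << 1) & MASK16) | (lo >> 15)


def shift_i32(value, n, step):
    """Apply a one-bit step n times to the (lo, hi) halves of value."""
    lo, hi = value & MASK16, (value >> 16) & MASK16
    for _ in range(n):
        lo, hi = step(lo, hi)
    return (hi << 16) | lo


def to_signed(value):
    """Interpret a 32-bit pattern as a signed integer."""
    return value - (1 << 32) if value & 0x80000000 else value
```

Comparing each helper against `>>` and `<<` on whole 32-bit values for N=1..4 gives a quick regression check when the lowering changes.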

Documented in feedback_i32_shift_inline.md.

Still-open work areas

Each carries a fair-warning note for whoever picks it up.

1. qsort/bsearch `optnone` — REMOVED 2026-05-08

Source-restructured qsort: split the inner loop into a __attribute__((noinline)) helper qsortInner (4 args: base, cur, size, cmp). Outer qsort just iterates i = 1..nmemb-1 and calls qsortInner(base, base + i*size, size, cmp). This drops outer qsort's i32-vreg simultaneous-live count below the inline-spill OOM threshold; both halves compile cleanly at -O2 + basic regalloc.

bsearch optnone was kept-for-symmetry — once removed, it just worked. The IMG-clobber + LDAfi-IMG-store backend fixes from 2026-05-07 had already resolved its underlying pressure issue.

Smoke stays green (now 132/132).

2. gmtime_r `optnone`

runtime/src/timeExt.c:69. NOT a backend bug — IR-level optimization issue (loop rotation + IndVar simplify mis-evaluating days >= 365L + (__isLeap(...) ? 1 : 0)). Fixing requires deciding which combine pass is wrong and why. Out of scope for backend work.

3. softDouble noinlines

runtime/src/softDouble.c:30 (dpack) and :51 (dclass). Removing dpack noinline broke dadd this session — register pressure for __adddf3/__muldf3/__divdf3. Architectural for the same reason as qsort.

4. Greedy regalloc retry — TRIED, blocked

Tested 2026-05-08. Greedy fails immediately on atoi in libc.c:

LiveRangeEdit.cpp:200: void llvm::LiveRangeEdit::eliminateDeadDef(...):
Assertion `MI->allDefsAreDead() && "Def isn't really dead"' failed.

Same upstream LLVM bug class as the dadd full-IMG attempt: sub-register pair partial defs that regalloc treats as fully dead. Greedy is genuinely incompatible with the W65816's split-half subreg-pair patterns until the upstream LLVM issue is patched. Reverted to basic regalloc. The memory note feedback_greedy_high_pressure.md already covers this.

5. gmtime_r `optnone` — TRIED, blocked

Tested 2026-05-08. Hoisting yearLen to a long local (avoiding the double-recompute of 365L + (__isLeap ? 1 : 0)) didn't help; adding volatile to the local also didn't help. IR optimizer is still folding the comparison to compile-time-false. Source-level C restructuring won't dodge it; needs IR-pass-level work to identify which combine pass mis-evaluates and why. optnone stays.

How to verify recovery

cd /home/scott/claude/llvm816
git status                                        # 5 modified files listed above
cd tools/llvm-mos-build && ninja llc clang        # rebuild backend (~5 min)
cd /home/scott/claude/llvm816
cd src/link816 && make && cd ../..                # rebuild link816 + omfEmit
bash runtime/build.sh                             # build runtime
bash scripts/smokeTest.sh                         # should end "all smoke checks passed"
bash scripts/benchCyclesPrecise.sh                # popcount should be ~6888 cyc

Loader smoke (validates DP/Stack seg under real GS/OS 6.0.2):

# Build a simple test program with --stack-size, run via Finder.
tools/omfEmit --input X.bin --map X.map --base 0x1000 --entry __start \
    --output /tmp/t.omf --stack-size 4096 --relocs X.relocs
bash scripts/runViaFinder.sh /tmp/t.omf --check 0x70=0x42 0x71=0x99

If smoke fails, the likely cause is one of the 5 uncommitted files got reverted; check git status and re-apply. If popcount bench regressed past ~7500 cyc, suspect the i32-shift-inline change in W65816ISelLowering.cpp was lost.

Diagnostic tools that worked

For posterity — these are the patterns that paid off this session.

Pre-RA vs post-RA MIR diff

clang -mllvm -stop-before=regallocbasic -S ...     # pre-RA
clang -mllvm -stop-after=virtregrewriter -S ...    # post-RA (post-virtregrewriter)

Diff narrow-IMG vs full-IMG post-RA MIR for the failing function. Pre-RA is identical (same IR), so the diff isolates regalloc-decision divergence. Look at every NEW pattern that appears only in the failing build — $imgN = LDAfi was the smoking gun for dadd.
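
Spotting patterns that appear only in the failing build can be partly automated. The sketch below is a hypothetical helper, not a script in the repo; it assumes post-RA MIR lines of the form `$dst = OPCODE ...` and reduces each to a (register class, opcode) shape.

```python
import re
from collections import Counter

# Matches '$reg = OPCODE'; virtual-register defs with a class suffix are ignored.
DEF_RE = re.compile(r"(\$\w+)\s*=\s*(\w+)")


def shape(line):
    """Reduce a MIR line to (dst without trailing digits, opcode), or None."""
    m = DEF_RE.search(line)
    if not m:
        return None
    dst = re.sub(r"\d+$", "", m.group(1))    # $img10 -> $img
    return dst, m.group(2)


def shapes(mir_text):
    """Count every recognised shape in a MIR dump."""
    return Counter(s for s in map(shape, mir_text.splitlines()) if s)


def new_patterns(good_mir, bad_mir):
    """Shapes that occur in the failing dump but never in the good one."""
    good, bad = shapes(good_mir), shapes(bad_mir)
    return {k: n for k, n in bad.items() if k not in good}


# Made-up excerpts, only to show the expected input format.
GOOD = "$a = LDAfi %stack.0, 0\n$img0 = COPY $a\n$x = COPY $img0\n"
BAD = ("$img0 = LDAfi %stack.0, 0\n$x = COPY $img0\n"
       "$img1 = LDAfi %stack.1, 0\n$a = LDAfi %stack.0, 0\n")
```

On the made-up excerpts the only new shape is an LDAfi whose destination is an IMG register, which is the kind of signal that pointed at the dadd bug.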

Pass-by-pass IR/MIR dumps

clang -mllvm -print-after=w65816-lower-wide32 -S ...
clang -mllvm -print-after-all -S ... 2>dump.txt

MAME debugger via xvfb-run

xvfb-run -a mame apple2gs ... -debug -debugger qt -oslog -seconds_to_run N

With autoboot Lua: load .bin into bank 0 (skip $C000..$CFFF I/O), set CPU state, then cpu.debug:bpset(addr, condition, action) with actions like "logerror \"...\\n\",a,x,...; go". logerror with format args goes to stdout under -oslog. Memory reads in expressions: b@(addr), w@(addr). Watchpoints: cpu.debug:wpset(prog_space, "w", addr, len, condition, action).

AVOID: add_machine_pause_notifier + cpu.debug:go() in callback — segfaults from reentrancy. printf in actions stays in debugger console (not -oslog). tracelog also debugger-console only.

Trace methodology (find divergence point)

1. Set BPs at every JSLpseudo callee in the failing function.
2. Capture A/X/Y/DPF0 at each return.
3. Find first divergent return between known-good and failing builds.
4. The instruction sequence between previous-OK and first-divergent return is where the bug lives.

This pattern found the dadd bug at jsl@0x207f → __lshrsi3(0x8001_8000, 3) in 30 minutes. Recommended.
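
Once both builds have produced a log of register snapshots, finding the first divergent return is mechanical. The sketch below assumes each trace is a list of (callee, registers) pairs parsed from the -oslog output; the sample values are made up and the parsing step is left out.

```python
def first_divergence(good, bad):
    """Index and snapshots of the first return where the traces differ."""
    for index, (g, b) in enumerate(zip(good, bad)):
        if g != b:
            return index, g, b
    return None


# Made-up snapshots, only to show the expected trace shape.
GOOD_TRACE = [
    ("__ashlsi3", {"A": 0x0010, "X": 0x0000}),
    ("__lshrsi3", {"A": 0x3000, "X": 0x1000}),
]
BAD_TRACE = [
    ("__ashlsi3", {"A": 0x0010, "X": 0x0000}),
    ("__lshrsi3", {"A": 0x3000, "X": 0x3000}),
]
```

The code to inspect is whatever executes between the return at `index - 1` and the return at `index` in the failing build.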

Memory notes referenced

(Filenames under /home/scott/.claude/projects/-home-scott-claude-llvm816/memory/.)

feedback_strstr9_long_haystack.md — the hash-shell bug story.
feedback_cpp_subset.md — C++ subset, including the SjLj fix.
feedback_ptr32_frame_limit.md — was 5 days stale; updated 2026-05-07 to "DONE, 131/131 smoke green".
feedback_jslpseudo_caller_save.md, feedback_libcall_img_clobber.md, feedback_img_slot_expansion.md, feedback_greedy_high_pressure.md — related backend topics.
feedback_loader_multi_seg_needs_expressload.md — new 2026-05-08. Multi-seg OMFs need ExpressLoad to launch under real Loader.
feedback_i32_shift_inline.md — new 2026-05-08. Inline i32 shift-by-N for N=1..4; first quantified bench-vs-self speed win.
feedback_speed_over_size.md — new 2026-05-07. Optimization priorities: cycle count over byte count, full stop.

Next session candidates (ranked)

evalAt at 1.86× vs Calypsi is the structural floor for peephole work (see feedback_evalat_structural_gap.md). Further gains need:

i64-by-pointer ABI (rejected this session: diminishing returns). Pass doubles by pointer instead of by value; saves ~120 cyc per evalAt call. Requires a runtime rewrite, OMF compat checks, and updating every double caller. The risk is too high for the size of the gain.
__divdf3 / __adddf3 algorithmic improvements. ddiv 1261 cyc could drop via Newton-Raphson reciprocal multiplication (a*1/b instead of bit-by-bit long division). Major rewrite, but our __muldi3 short-circuit makes the multiplications cheap now.
Higher-resolution cycle timer. HBL counter is 8-bit and wraps at ~256 ticks; combining scan-line position + frame counter would give per-bench resolution better than ±65 cyc. Would unblock benchmarking sub-loop changes (e.g., the LSR-dp shift form).
More peepholes from the audit. Phase 4 STA_StackRel extension landed but doesn't fire in current libc (frame sizes too large). If callers shrink frames via better SSM, more functions become eligible.
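
For the __divdf3 candidate, the Newton-Raphson reciprocal idea looks like this in outline. It is a host floating-point sketch of the iteration only, assuming the textbook linear seed for a mantissa in [0.5, 1); it says nothing about rounding, denormals, or how the soft-float version would be structured.

```python
import math


def reciprocal(b, iterations=5):
    """Approximate 1/b with the iteration x <- x * (2 - m * x)."""
    if b == 0:
        raise ZeroDivisionError("reciprocal of zero")
    if b < 0:
        return -reciprocal(-b, iterations)
    m, e = math.frexp(b)               # b = m * 2**e with 0.5 <= m < 1
    x = 48 / 17 - (32 / 17) * m        # linear seed, relative error <= 1/17
    for _ in range(iterations):
        x = x * (2.0 - m * x)          # each step roughly squares the error
    return math.ldexp(x, -e)           # undo the exponent scaling


def divide(a, b):
    """a / b as a * (1/b), the shape suggested for ddiv above."""
    return a * reciprocal(b)
```

Each iteration costs two multiplies and one subtract, so the approach only pays off when multiplication is cheap relative to bit-by-bit division.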
