65816-llvm-mos/SESSION_RECOVERY.md
2026-05-27 19:37:26 -05:00

# Session Recovery — last updated 2026-05-27
Living recovery doc. Update on every meaningful change. If session is lost,
read this top-to-bottom + the memory notes referenced inside, then reread
the actual diffs in tree to ground assumptions.
## Headline state
- **Smoke**: 148/148 green. Demos 9/9 (helloBeep/helloText/helloWindow/
orcaFrame/qdProbe/heavyRelocs/frame/reversi/minicad).
- **Active config**: ptr32 (`p:32:16`), full IMG0..IMG15 caller-clobber
on JSL, greedy regalloc at -O1+. **Inline-threshold lowered to 50
target-wide** (was LLVM default 225; was 75 earlier this session).
- **Branch**: `main`.
- **vs Calypsi (2026-05-27)** — Layer 2 + recent peepholes:
- **Cycle benches geomean**: **0.62× Calypsi**. 9 of 10 below 1.0×;
only `fib` trails at 1.06× (recursive overhead, structural). See
cycle bench table below.
- **Lua 5.1.5**: default config 1.13× Calypsi; with Layer 2 0.93×.
- **CoreMark 1.0**: with Layer 2 0.79× Calypsi (we beat by 21%).
- **vs Calypsi static-inst ratio (synthetic bench)**:
sumSquares **0.84×** (26 vs 31 — we beat),
mul16to32 **0.25×** (1 vs 4 — we beat),
evalAt 1.86× (472 vs 254 — structural floor; ABI overhaul rejected).
- **New code-gen options (2026-05-25)** — see docs/USAGE.md "Advanced:
pointer-deref code generation":
- Layer 1 ptr32 deref-fold (always on): LDY offset instead of
CLC/ADC carry chain. ~3 instr saved per struct-field access.
- `-mllvm -w65816-dbr-safe-ptrs` (Layer 2, opt-in): uses
`lda (d,S),Y` for ptr32 derefs assuming bank-byte == DBR.
5 instr → 1 instr per deref. Lua -20.6%. **MISCOMPILES
cross-bank pointers — opt in per-TU only when safe.**
- Inline-threshold lowered to 50 (was 225). Lua -23% total, CoreMark
matrix.o 1.37× → 0.97× Calypsi. Override with
`-mllvm -inline-threshold=N`.
- **Cycle benches per-call (2026-05-27, Layer 2)** — via
`scripts/benchCyclesPrecise.sh` vs `scripts/benchCyclesCalypsi.sh`:
```
Bench Ours Calypsi Ratio
dotProduct 1534 5712 0.27×
bsearch 682 2387 0.29×
sumOfSquares 6820 16368 0.42×
bubbleSort 11594 17050 0.68×
strLen 767 1023 0.75×
djb2Hash 2046 2643 0.77×
popcount 1194 1534 0.78×
strcpy 1108 1194 0.93×
memcmp 682 716 0.95×
fib 11594 10912 1.06×
```
Geomean **0.62×**. Older HBL-tick numbers (per-iter, 100 iter loops)
from `benchCycles.sh` are still available but lower resolution.
- **Recent session wins (2026-05-27)**:
- **Y-as-counter for strLen** — structural rewrite: drop STX/INX/INC,
use Y as offset AND counter. strLen 1279 → 767 cyc (-40%); 0.75×
Calypsi (was 1.25×).
- **Stack-rel dead-store elim** — companion to DP version with SP
tracking across PHA/PHP/PEA/PEI/PER/PLA/PLP/PLX/PLY/PHX/PHY.
strcpy 1194 → 1108 (-7%, 0.93× Calypsi, beats by 7%). Refactored
as a static helper called from the recursive-call bail too so fib
gets it. fib 12106 → 11594 (-4%, 1.06× Calypsi).
- **DP-indirect-Y for iter** (follow-on to X-iter peephole): rewrites
`TXA;STA stack-rel S;INX;…;LDA (S,s),Y` to `STX_DP D;INX;…;LDA
(D),Y`. Saves 4 cyc/iter.
- **Dead INC_HI_IF_CARRY elim** — when the StackRel ptr-hi slot is
never read, elide the carry-bookkeeping for Layer 2 ptr32 loops.
Wide impact across strLen/strcpy/djb2Hash/memcmp.
- **Recent session wins (earlier — 2026-05-20)**:
- 8 always-on peepholes + extended phase 4 in W65816StackRelToImg
(evalAt 498→472, fib -35%, 35 libc fns shrunk)
- `__muldi3` 32-bit short-circuit (dmul 1605→1033, -36%)
- case-(b) ImgCalleeSave bracket hoist enables phase 4 to elide
TAY/TYA round-trip in synergy
- FP cycle benches added (dadd/dmul/ddiv) with per-bench iter count
- Documented LSR-dp cycle mystery as HBL-counter wrap artifact
- Game-like benches added: particles (i16 physics), mandelbrot (i32 fp)
- **elideStoreForwarding now reached via early-return bail paths**:
particles 5005→2253 cyc/iter (-55%). Was being skipped for any
function where main IMG promotion bailed (SpAdj invalid, no
accesses, or > 16 hot slots).
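
The geomean quoted with the cycle bench table can be re-derived from the raw per-call cycle counts. A minimal sketch, assuming only the ten rows of the table above; the helper name `geomean` is illustrative and not part of the bench scripts. Computed from the unrounded ratios, the result lands between 0.62 and 0.63.

```python
import math

# (bench, our cycles, Calypsi cycles), copied from the per-call table.
BENCHES = [
    ("dotProduct", 1534, 5712),
    ("bsearch", 682, 2387),
    ("sumOfSquares", 6820, 16368),
    ("bubbleSort", 11594, 17050),
    ("strLen", 767, 1023),
    ("djb2Hash", 2046, 2643),
    ("popcount", 1194, 1534),
    ("strcpy", 1108, 1194),
    ("memcmp", 682, 716),
    ("fib", 11594, 10912),
]


def geomean(values):
    """Geometric mean, computed as exp of the mean of the logs."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Ratio below 1.0 means we use fewer cycles than Calypsi.
ratios = {name: ours / theirs for name, ours, theirs in BENCHES}
overall = geomean(ratios.values())
slower = [name for name, r in ratios.items() if r > 1.0]
```

When a bench row changes, updating the tuple here and rerunning is a quick check that the headline number and the "9 of 10 below 1.0×" claim still hold.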
## Uncommitted, must keep
`git status --short` (5 modified, no untracked of consequence):
1. `SESSION_RECOVERY.md` — this doc.
2. `scripts/smokeTest.sh` — added "omfEmit `--stack-size` emits a
DP/Stack `~Direct` segment" check. Validates 3-segment layout
(ExpressLoad + code + DP/Stack) when `--stack-size` is supplied;
parses the third segment header against KIND/LENGTH/RESSPC/ALIGN/
SEGNUM=3/name="~Direct" expectations.
3. `src/link816/omfEmit.cpp` — `emitDpStackSeg(length, segNum)` plus
the `--stack-size N` CLI flag. Validation: 256 ≤ N ≤ 65536, page-
aligned. **`--stack-size` implicitly enables `--expressload`** —
the GS/OS Loader's slow path silently rejects multi-seg OMFs (see
§D below for the empirical evidence).
4. `src/llvm/lib/Target/W65816/W65816ISelLowering.cpp` — `LowerShift`
now inlines i32 SHL/SRL/SRA by N=1..4 instead of routing to
`__ashlsi3`/`__lshrsi3`/`__ashrsi3`. See §E.
5. `src/llvm/lib/Target/W65816/W65816StackSlotCleanup.cpp` — pre-existing
uncommitted change from prior turns; verify against git log before
re-staging if recovery is fresh.
Earlier-mentioned files (snprintf.c, W65816InstrInfo.cpp,
W65816SjLjFinalize.cpp) have been checkpoint-committed and are no
longer in `git status`.
## Already-committed in this session arc
Per `git log --oneline -20` these are the recent checkpoint commits;
the diffs they contain are real and load-bearing.
The big ones (search by file or grep):
- **JSLpseudo Defs += IMG0..IMG15** in `W65816InstrInfo.td`. With the
wider Defs, regalloc spills IMG-class vregs around calls instead of
treating them as preserved.
- **`W65816RegisterInfo.cpp` `eliminateFrameIndex` for `STAfi`**:
PHA-bracketed for non-A source (IMG/X/Y). The `lda dp; sta d,s` chain
clobbered A; bracket preserves A while shifting offset by +2 between
PHA and PLA. Defs=[A] kept on STAfi as safe over-approximation.
- **`W65816RegisterInfo.cpp` `eliminateFrameIndex` for `LDAfi`**:
if `Dst = IMGn`, append `STA dp` so the IMG slot actually receives the
loaded value. Previously only loaded into A; downstream
`COPY $x = $imgN` (= `ldx $D?`) read garbage. **This was the smoking
gun for `dadd(1.5, 2.5) → 0x4010_0000_3000_3000`.**
- **`W65816LowerWide32.cpp`** fixed-point erase loop. Was single-pass;
REG_SEQUENCE got skipped if a not-yet-erased COPY consumer kept it
alive at the iteration moment. Removed ~40 dead Wide32 vregs from
`__adddf3`'s pre-RA MIR.
- **`src/llvm/test/CodeGen/W65816/i64-first-arg-img16.ll`** relaxed
`stx 0xd / sta 0xd` to `0x{{[cd]}}` (regalloc now picks IMG8..15 too).
## Fixes landed (full list with rationale)
Each entry: what / why / where / what regression it would cause if reverted.
### A. Hash-shell DELETE bug → IMG caller-clobber
**Symptom**: `dbDelete("age")` returned 0 ("not found") instead of 1.
DELETE never ran; `COUNT` stayed at 2.
**Root cause**: `dbDelete` did `stx 0xd0` to save k_high, called `hashKey`,
then `pei 0xd0` to push k_high to strcmp. `hashKey` used $D0 as scratch
in its loop body (`sta 0xd0` storing the iterator's running-ptr-low). $D0
was clobbered by the time `pei 0xd0` ran. JSLpseudo Defs only listed
`A, X, Y, DPF0` — IMG slots were not modelled as caller-clobber.
**Fix**: `JSLpseudo Defs += [IMG0..IMG15]`.
**Cascading fallout** (each required its own fix):
#### A1. copyPhysReg vreg fallback
`storeRegToStackSlot`'s unpaired-Wide32 default branch hit `unreachable`
when called with a vreg source. Basic regalloc's InlineSpiller does
this. Fix: short-circuit virtual-reg cases to `TargetOpcode::COPY`.
#### A2. LowerWide32 fixed-point erase
Single-pass erase left ~40 dead Wide32 vregs in `__adddf3`. Pattern:
```
%X:wide32 = REG_SEQUENCE ...
%Y:wide32 = COPY %X
... uses of %Y rewritten by Pass 3 ...
```
Single-pass: REG_SEQUENCE skipped (COPY consumer still alive), then
COPY erased (now %X dead but loop already passed it). Fix: iterate
until no progress.
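
A toy model of why the single pass left dead defs behind. This is a sketch under assumptions, not the code in `W65816LowerWide32.cpp`: the `Inst` record, the `sweep` helper, and the sample program are hypothetical, and "dead" here just means "no remaining instruction reads the result".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Inst:
    op: str
    dst: str                    # "" when the instruction defines nothing
    srcs: tuple = ()
    side_effect: bool = False   # e.g. stores: never erased


def sweep(insts):
    """One forward sweep: erase defs with no reader at the moment they are visited."""
    live = list(insts)
    changed = False
    i = 0
    while i < len(live):
        inst = live[i]
        read = any(inst.dst in other.srcs for other in live if other is not inst)
        if inst.side_effect or read:
            i += 1
        else:
            del live[i]
            changed = True
    return live, changed


def sweep_to_fixed_point(insts):
    """Repeat the sweep until a full pass erases nothing."""
    live, changed = sweep(insts)
    while changed:
        live, changed = sweep(live)
    return live


# %Y's users were already rewritten to read %lo directly, so %Y is dead.
PROGRAM = [
    Inst("LOAD", "%lo"),
    Inst("LOAD", "%hi"),
    Inst("REG_SEQUENCE", "%X", ("%lo", "%hi")),
    Inst("COPY", "%Y", ("%X",)),
    Inst("STORE", "", ("%lo",), side_effect=True),
]
```

A single `sweep(PROGRAM)` erases only the COPY, because `%X` still had a reader when it was visited. `sweep_to_fixed_point(PROGRAM)` then removes `%X` and, one round later, `%hi`.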
#### A3. STAfi PHA-bracket
Without bracket, regalloc could schedule `$img0 = COPY $a` AFTER a
`STAfi`-with-IMG-source whose internal `lda dp` clobbered $a, silently
storing X's value where A's was expected.
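
The bracket and its +2 offset adjustment can be checked on a toy machine. A minimal sketch, assuming a 16-bit accumulator, a direct page at bank-0 address 0, and no other state; the `Cpu` class and helper names are hypothetical and only model the four instructions involved.

```python
class Cpu:
    """Toy model: A, S, and 64 KiB of bank-0 memory (direct page assumed at 0)."""

    def __init__(self, a=0, s=0x01FF):
        self.a = a
        self.s = s
        self.mem = bytearray(0x10000)

    def read16(self, addr):
        return self.mem[addr] | (self.mem[addr + 1] << 8)

    def write16(self, addr, value):
        self.mem[addr] = value & 0xFF
        self.mem[addr + 1] = (value >> 8) & 0xFF

    def pha(self):
        # After a 16-bit push, S is 2 lower and the word sits at S+1..S+2.
        self.s -= 2
        self.write16(self.s + 1, self.a)

    def pla(self):
        self.a = self.read16(self.s + 1)
        self.s += 2

    def lda_dp(self, dp):
        self.a = self.read16(dp)

    def sta_sr(self, offset):
        # Stack-relative store: effective address is S + offset.
        self.write16(self.s + offset, self.a)


def spill_unbracketed(cpu, dp, offset):
    cpu.lda_dp(dp)           # clobbers whatever was live in A
    cpu.sta_sr(offset)


def spill_bracketed(cpu, dp, offset):
    cpu.pha()                # save A; S drops by 2
    cpu.lda_dp(dp)
    cpu.sta_sr(offset + 2)   # same slot, addressed from the lowered S
    cpu.pla()                # restore A and S
```

Both variants store the IMG value into the same stack slot; only the bracketed one returns with A and S unchanged.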
#### A4. LDAfi-IMG-dest STA dp
**The big one.** With narrow IMG, regalloc kept Wide16 vregs in IMG
slots across calls, never needed `$imgN = LDAfi %stack.X`. With full
IMG, every cross-call spill needed it. The expansion only emitted
`LDA d,s` (load A) — never wrote to the IMG slot. Downstream
`COPY $x = $imgN` (= `ldx $D?`) read stale prior data. Manifested as
`dadd(1.5, 2.5) → 0x4010_0000_3000_3000` (mantissa garbage).
**Diagnostic that found it**: diff post-RA MIR narrow vs full IMG. Pre-RA
MIR was identical. Full had 6 `$imgN = LDAfi` instances; narrow had 0.
Narrow used `COPY $imgN = $a` patterns instead — those work correctly.
#### A5. FileCheck regex
`src/llvm/test/CodeGen/W65816/i64-first-arg-img16.ll` expected
`stx 0xd / sta 0xd`. Under full IMG clobber, regalloc picks IMG8..15
($C0/$C2) for cross-call arg saves. Relaxed to `0x{{[cd]}}`.
### B. C++ try/catch source-level path
Two bugs blocking real `clang++ -fsjlj-exceptions` source code:
#### B1. W65816SjLjFinalize catchtab ordering
`runOnFunction` erased landingpad insts at line ~245, then built the
catchtab at line ~290 via `LPadBB->getLandingPadInst()`. By that
point, landingpads were nullptr. The build loop's `if (!LP) continue;`
skipped every entry. The catchtab ended with just the `(0,0)` sentinel, so
the LSDA was 4 bytes of zeros. `findCatch` hit that zero entry at
`ctx->lsda` immediately and bailed. Result: any `throw` aborted.
Fix: capture catch-clause typeinfo Constants into a
`DenseMap<BasicBlock*, LPadInfo>` BEFORE erasing landingpads; the
catchtab build loop reads from the saved map.
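
The ordering bug reduces to "read after erase". A minimal sketch in plain dictionaries, assuming one landing-pad block that catches `int`; the function names and the block name `lpad.int` are hypothetical, and the real pass works on LLVM IR, not dicts.

```python
def erase_landingpads(blocks):
    """Stand-in for erasing the landingpad instructions from each block."""
    for info in blocks.values():
        info["landingpad"] = None


def build_catchtab_erase_first(lpad_blocks):
    """Buggy order: erase, then look the landingpads up again."""
    blocks = {name: dict(info) for name, info in lpad_blocks.items()}
    erase_landingpads(blocks)
    table = []
    for name, info in blocks.items():
        lp = info["landingpad"]
        if lp is None:
            continue                      # mirrors `if (!LP) continue;`
        table.append((name, lp["typeinfo"]))
    table.append((0, 0))                  # sentinel
    return table


def build_catchtab_capture_first(lpad_blocks):
    """Fixed order: save the typeinfo per block, erase, build from the saved map."""
    blocks = {name: dict(info) for name, info in lpad_blocks.items()}
    saved = {
        name: info["landingpad"]["typeinfo"]
        for name, info in blocks.items()
        if info["landingpad"] is not None
    }
    erase_landingpads(blocks)
    table = list(saved.items())
    table.append((0, 0))                  # sentinel
    return table


# "_ZTIi" is the Itanium-mangled typeinfo symbol for int.
BLOCKS = {"lpad.int": {"landingpad": {"typeinfo": "_ZTIi"}}}
```

The erase-first variant yields a table holding only the sentinel, which matches the all-zero LSDA symptom; the capture-first variant keeps the `int` entry.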
#### B2. copyPhysReg IMG-to-IMG PHA-bracket
Comment said "Caller is responsible for ensuring A is dead at this
program point (regalloc usually arranges this)." It doesn't, in
practice. Regalloc inserted IMG-to-IMG copies between `$a = COPY $img10`
and `STAfi $a, slot`. Unbracketed `lda src; sta dst` clobbered A.
The subsequent STAfi spilled garbage. Visible as `*p = 42` after
`__cxa_allocate_exception` storing 42 to wrong addr (indirect-long
setup got hi-half at lo-slot).
Fix: PHA-bracket. Cost +7 cyc / +2 bytes per IMG-IMG copy (rare).
**Verified end-to-end** via MAME breakpoints: `begin_catch` entered
with correct ExcHeader, `end_catch` entered with A=42, doTest returns
A=42 from real C++ `try { throw 42; } catch (int x) { return x; }`.
### C. Cleanup wins
- `runtime/src/snprintf.c:106` — removed `optnone` on `emitULong`. Smoke green.
- `runtime/src/snprintf.c:303` — removed `optnone` on `snprintf`. Smoke green.
### D. `omfEmit --stack-size` — DP/Stack segment for GS/OS Loader
Added `emitDpStackSeg` (`src/link816/omfEmit.cpp`). KIND=0x1012 (DP/Stack
| PRIVATE), LENGTH=RESSPC=requested-bytes, ALIGN=0x100, BANKSIZE=0, body
is a single END opcode. Apps can now request a stack of any
page-aligned size from 256B to 64KB (replacing GS/OS Loader's default
4KB allocation).
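
The size rule and the header values above can be restated as a small check. A sketch only: it mirrors the documented constraint (256 ≤ N ≤ 65536, page-aligned) and the field values listed in this section, but the function names and the dictionary shape are hypothetical and say nothing about the on-disk OMF byte layout.

```python
PAGE = 0x100
MIN_STACK = 256
MAX_STACK = 65536
KIND_DP_STACK_PRIVATE = 0x1012


def validate_stack_size(n):
    """Return n if it is a legal --stack-size value, else raise ValueError."""
    if not MIN_STACK <= n <= MAX_STACK:
        raise ValueError(f"stack size {n} outside {MIN_STACK}..{MAX_STACK}")
    if n % PAGE != 0:
        raise ValueError(f"stack size {n} is not a multiple of {PAGE}")
    return n


def dp_stack_seg_fields(n, seg_num):
    """Logical header values for the DP/Stack segment, as described above."""
    n = validate_stack_size(n)
    return {
        "KIND": KIND_DP_STACK_PRIVATE,
        "LENGTH": n,
        "RESSPC": n,
        "ALIGN": PAGE,
        "BANKSIZE": 0,
        "SEGNUM": seg_num,
        "name": "~Direct",
    }
```

For example, `dp_stack_seg_fields(4096, 3)` describes the segment requested by `--stack-size 4096`, while `validate_stack_size(300)` raises because 300 is not page-aligned.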
**Loader gotcha** (cost ~1 hour to debug): plain (non-ExpressLoad)
multi-segment OMFs do NOT launch under real GS/OS 6.0.2 — the Loader's
slow path silently rejects the file and our entry point never runs.
ExpressLoad-wrapped multi-segment OMFs DO work. Fix: `--stack-size` now
implicitly enables `--expressload` (the Loader's slow path is
empirically broken for our 2-seg layout). The DP/Stack seg is appended
AFTER the user code seg as SEGNUM=3; the Loader walks all segments by
KIND after the ExpressLoad fast-load step finishes.
Verified: `runViaFinder.sh /tmp/test_el_dp.omf --check 0x70=0x42 0x71=0x99`
passes under real GS/OS 6.0.2 with `--stack-size 4096 --expressload`.
Verified failure mode: same payload with `--stack-size` alone (no
`--expressload`) → `0x70=0x00` (program never executed). Documented
in `feedback_loader_multi_seg_needs_expressload.md`.
Smoke updated (132/132 at the time): the check expects 3 segs
(ExpressLoad + code + DP/Stack) when `--stack-size` is supplied.
### E. i32 shift-by-N inlined (was full libcall) — speed win
`W65816ISelLowering.cpp` `LowerShift` now inlines i32 SHL/SRL/SRA by
N=1..4. Previously every i32 shift went through `__ashlsi3`/
`__lshrsi3`/`__ashrsi3` (~300+ cyc per call). popcount benchmark:
**8320 → 6888 cyc/call, 17% faster**. Implementation extracts
`Wide32` halves via `extractWide32Lo/Hi`, applies per-step
`lsr; ror`-equivalent SDAG ops with explicit carry propagation
(`(Hi & 1) << 15` for SRL/SRA's lo-fill, `Lo >> 15` for SHL's
hi-fill), recombines via `buildWide32`. N>4 still routes to libcall
— the unrolled cost (~5 i16 ops × N) crosses libcall overhead at N≈5.
Documented in `feedback_i32_shift_inline.md`.
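
The per-step carry propagation can be modelled on two 16-bit halves and checked against plain 32-bit shifts. A minimal sketch of the arithmetic only, assuming one single-bit step repeated N times; the function names are illustrative and this is not the SelectionDAG code.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def shl1(lo, hi):
    # Hi takes Lo's top bit (Lo >> 15) as its new bit 0.
    return (lo << 1) & MASK16, ((hi << 1) | (lo >> 15)) & MASK16


def srl1(lo, hi):
    # Lo takes Hi's bottom bit ((Hi & 1) << 15) as its new bit 15.
    return (lo >> 1) | ((hi & 1) << 15), hi >> 1


def sra1(lo, hi):
    # Same lo-fill as SRL; Hi keeps its sign bit.
    return (lo >> 1) | ((hi & 1) << 15), (hi >> 1) | (hi & 0x8000)


def shift32(value, n, step):
    """Apply a single-bit step n times to the halves of a 32-bit value."""
    lo, hi = value & MASK16, (value >> 16) & MASK16
    for _ in range(n):
        lo, hi = step(lo, hi)
    return (hi << 16) | lo


def ref_sra(value, n):
    """Reference arithmetic shift: sign-extend to a Python int, shift, re-mask."""
    signed = (value ^ 0x80000000) - 0x80000000
    return (signed >> n) & MASK32
```

Using the operand from the dadd trace, `shift32(0x80018000, 3, srl1)` gives `0x10003000` and `shift32(0x80018000, 3, sra1)` gives `0xF0003000`.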
## Still-open work areas
Each carries a fair-warning note for whoever picks it up.
### 1. qsort/bsearch `optnone` — REMOVED 2026-05-08
Source-restructured `qsort`: split the inner loop into a
`__attribute__((noinline))` helper `qsortInner` (4 args: base, cur,
size, cmp). Outer `qsort` just iterates `i = 1..nmemb-1` and calls
`qsortInner(base, base + i*size, size, cmp)`. This drops outer
qsort's i32-vreg simultaneous-live count below the inline-spill
OOM threshold; both halves compile cleanly at -O2 + basic regalloc.
`bsearch` `optnone` was kept-for-symmetry — once removed, it just
worked. The IMG-clobber + LDAfi-IMG-store backend fixes from
2026-05-07 had already resolved its underlying pressure issue.
Smoke stays green (now 132/132).
### 2. gmtime_r `optnone`
`runtime/src/timeExt.c:69`. NOT a backend bug — IR-level optimization
issue (loop rotation + IndVar simplify mis-evaluating
`days >= 365L + (__isLeap(...) ? 1 : 0)`). Fixing requires deciding
which combine pass is wrong and why. Out of scope for backend work.
### 3. softDouble noinlines
`runtime/src/softDouble.c:30` (`dpack`) and `:51` (`dclass`). Removing
`dpack` noinline broke dadd this session — register pressure for
`__adddf3`/`__muldf3`/`__divdf3`. Architectural for the same reason as
qsort.
### 4. Greedy regalloc retry — TRIED, blocked
Tested 2026-05-08. Greedy fails immediately on `atoi` in libc.c:
```
LiveRangeEdit.cpp:200: void llvm::LiveRangeEdit::eliminateDeadDef(...):
Assertion `MI->allDefsAreDead() && "Def isn't really dead"' failed.
```
Same upstream LLVM bug class as the dadd full-IMG attempt — sub-register
pair partial defs that the regalloc treats as fully dead. Greedy is
genuinely incompatible with the W65816's split-half subreg-pair patterns
until the upstream LLVM issue is patched. Reverted to basic regalloc.
The memory note `feedback_greedy_high_pressure.md` already covers this.
### 5. gmtime_r `optnone` — TRIED, blocked
Tested 2026-05-08. Hoisting `yearLen` to a long local (avoiding the
double-recompute of `365L + (__isLeap ? 1 : 0)`) didn't help; adding
`volatile` to the local also didn't help. IR optimizer is still
folding the comparison to compile-time-false. Source-level C
restructuring won't dodge it; needs IR-pass-level work to identify
which combine pass mis-evaluates and why. optnone stays.
## How to verify recovery
```bash
cd /home/scott/claude/llvm816
git status # 5 modified files listed above
cd tools/llvm-mos-build && ninja llc clang # rebuild backend (~5 min)
cd /home/scott/claude/llvm816
cd src/link816 && make && cd ../.. # rebuild link816 + omfEmit
bash runtime/build.sh # build runtime
bash scripts/smokeTest.sh # should end "all smoke checks passed"
bash scripts/benchCyclesPrecise.sh # popcount should be ~6888 cyc
```
Loader smoke (validates DP/Stack seg under real GS/OS 6.0.2):
```bash
# Build a simple test program with --stack-size, run via Finder.
tools/omfEmit --input X.bin --map X.map --base 0x1000 --entry __start \
--output /tmp/t.omf --stack-size 4096 --relocs X.relocs
bash scripts/runViaFinder.sh /tmp/t.omf --check 0x70=0x42 0x71=0x99
```
If smoke fails, the likely cause is that one of the 5 uncommitted files
got reverted; check `git status` and re-apply. If the popcount bench
regresses past ~7500 cyc, suspect that the i32-shift-inline change in
`W65816ISelLowering.cpp` was lost.
## Diagnostic tools that worked
For posterity — these are the patterns that paid off this session.
### Pre-RA vs post-RA MIR diff
```bash
clang -mllvm -stop-before=regallocbasic -S ... # pre-RA
clang -mllvm -stop-after=virtregrewriter -S ... # post-RA (post-virtregrewriter)
```
Diff narrow-IMG vs full-IMG post-RA MIR for the failing function.
Pre-RA is identical (same IR), so the diff isolates regalloc-decision
divergence. Look at every NEW pattern that appears only in the failing
build — `$imgN = LDAfi` was the smoking gun for dadd.
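
Scanning the two post-RA dumps for patterns unique to the failing build can be scripted. A rough sketch, assuming the dumps are plain text with one instruction per line; the operand syntax in the sample strings is illustrative only and the helper names are hypothetical.

```python
import re

# Matches a reload whose destination is an IMG register, e.g. "$img3 = LDAfi ...".
IMG_RELOAD = re.compile(r"\$img\d+\s*=\s*LDAfi\b")


def count_matches(mir_text, pattern=IMG_RELOAD):
    """Number of lines in a dump that contain the pattern."""
    return sum(1 for line in mir_text.splitlines() if pattern.search(line))


def only_in_failing(good_text, failing_text, pattern=IMG_RELOAD):
    """Matching lines that appear in the failing dump but not in the good one."""
    good = {line.strip() for line in good_text.splitlines()}
    return [
        line.strip()
        for line in failing_text.splitlines()
        if pattern.search(line) and line.strip() not in good
    ]


# Hypothetical excerpts; real dumps come from -stop-after=virtregrewriter.
NARROW = """
  $img3 = COPY $a
  $x = COPY $img3
"""
FULL = """
  $img3 = LDAfi %stack.2
  $x = COPY $img3
"""
```

On real dumps, a count that is zero in the narrow build and non-zero in the full build points at the same kind of divergence that exposed the dadd bug.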
### Pass-by-pass IR/MIR dumps
```bash
clang -mllvm -print-after=w65816-lower-wide32 -S ...
clang -mllvm -print-after-all -S ... 2>dump.txt
```
### MAME debugger via xvfb-run
```bash
xvfb-run -a mame apple2gs ... -debug -debugger qt -oslog -seconds_to_run N
```
With autoboot Lua: load .bin into bank 0 (skip `$C000..$CFFF` I/O),
set CPU state, then `cpu.debug:bpset(addr, condition, action)` with
actions like `"logerror \"...\\n\",a,x,...; go"`. `logerror` with format
args goes to stdout under `-oslog`. Memory reads in expressions:
`b@(addr)`, `w@(addr)`. Watchpoints: `cpu.debug:wpset(prog_space, "w",
addr, len, condition, action)`.
**AVOID**: `add_machine_pause_notifier` + `cpu.debug:go()` in callback —
segfaults from reentrancy. `printf` in actions stays in debugger console
(not -oslog). `tracelog` also debugger-console only.
### Trace methodology (find divergence point)
1. Set BPs at every `JSLpseudo` callee in the failing function.
2. Capture A/X/Y/DPF0 at each return.
3. Find first divergent return between known-good and failing builds.
4. The instruction sequence between previous-OK and first-divergent
return is where the bug lives.
This pattern found the dadd bug at `jsl@0x207f → __lshrsi3(0x8001_8000, 3)`
in 30 minutes. Recommended.
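
Step 3 is mechanical once both runs are logged in the same shape. A minimal sketch, assuming each breakpoint hit was reduced to a `(callee, A, X, Y)` tuple; the record layout, the sample values, and the function name are hypothetical.

```python
from itertools import zip_longest


def first_divergence(good, failing):
    """Return (index, good_record, failing_record) at the first mismatch, else None.

    A shorter trace counts as diverging where it ends (the missing record is None).
    """
    for index, (g, f) in enumerate(zip_longest(good, failing)):
        if g != f:
            return index, g, f
    return None


# Hypothetical register captures at each callee return.
GOOD = [
    ("dclass", 0x0001, 0x0000, 0x0004),
    ("dclass", 0x0001, 0x0000, 0x0004),
    ("__lshrsi3", 0x3000, 0x1000, 0x0003),
    ("dpack", 0x0000, 0x4010, 0x0000),
]
FAILING = [
    ("dclass", 0x0001, 0x0000, 0x0004),
    ("dclass", 0x0001, 0x0000, 0x0004),
    ("__lshrsi3", 0x3000, 0x3000, 0x0003),
    ("dpack", 0x3000, 0x4010, 0x0000),
]
```

Here `first_divergence(GOOD, FAILING)` reports index 2, so the code to inspect is whatever runs between the return at index 1 and the return at index 2.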
## Memory notes referenced
(Filenames under `/home/scott/.claude/projects/-home-scott-claude-llvm816/memory/`.)
- `feedback_strstr9_long_haystack.md` — the hash-shell bug story.
- `feedback_cpp_subset.md` — C++ subset, including the SjLj fix.
- `feedback_ptr32_frame_limit.md` — was 5 days stale; updated 2026-05-07
to "DONE, 131/131 smoke green".
- `feedback_jslpseudo_caller_save.md`, `feedback_libcall_img_clobber.md`,
`feedback_img_slot_expansion.md`, `feedback_greedy_high_pressure.md`:
related backend topics.
- `feedback_loader_multi_seg_needs_expressload.md` (**new 2026-05-08**):
multi-seg OMFs need ExpressLoad to launch under the real Loader.
- `feedback_i32_shift_inline.md` (**new 2026-05-08**): inline i32
shift-by-N for N=1..4; first quantified bench-vs-self speed win.
- `feedback_speed_over_size.md` (**new 2026-05-07**): optimization
priorities are cycle count over byte count, full stop.
## Next session candidates (ranked)
evalAt at 1.86× vs Calypsi is the structural floor for peephole work
(see `feedback_evalat_structural_gap.md`). Further gains need:
1. **i64-by-pointer ABI** (rejected this session — diminishing returns).
Pass doubles by ptr instead of value: saves ~120 cyc per evalAt call.
Requires a runtime rewrite, OMF compat checks, and updating every
double caller. The risk is too high for the size of the gain.
2. **`__divdf3` / `__adddf3` algorithmic improvements**. ddiv 1261 cyc
could drop via Newton-Raphson reciprocal multiplication (computing
a × (1/b) instead of bit-by-bit long division). Major rewrite, but our
`__muldi3` short-circuit makes the multiplications cheap now.
3. **Higher-resolution cycle timer**. HBL counter is 8-bit and wraps
at ~256 ticks; combining scan-line position + frame counter would
give per-bench resolution better than ±65 cyc. Would unblock
benchmarking sub-loop changes (e.g., the LSR-dp shift form).
4. **More peepholes from the audit**. Phase 4 STA_StackRel extension
landed but doesn't fire in current libc (frame sizes too large).
If callers shrink frames via better SSM, more functions become
eligible.
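
The wrap problem in candidate 3 is ordinary modular arithmetic. A minimal sketch, assuming an 8-bit tick counter and a combined counter built from a frame count plus a scan-line position; the 262 lines-per-frame default is an assumption for illustration, and the function names are hypothetical.

```python
def elapsed_8bit(start, end):
    """Elapsed ticks as seen through an 8-bit counter (true value modulo 256)."""
    return (end - start) & 0xFF


def wide_ticks(frame, line, lines_per_frame=262):
    """Monotonic tick count from a frame counter plus scan-line position.

    lines_per_frame=262 is an assumed value, not taken from the bench scripts.
    """
    return frame * lines_per_frame + line


# A run that truly lasts 300 ticks, starting at tick 10.
true_start, true_end = 10, 310
seen = elapsed_8bit(true_start & 0xFF, true_end & 0xFF)
wide = wide_ticks(1, 48) - wide_ticks(0, 10)
```

Through the 8-bit counter the 300-tick run reads as 44 ticks, indistinguishable from a genuinely short run; the combined counter recovers the full 300.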