From 583cee849df062729ff6d60b8e52399846612881 Mon Sep 17 00:00:00 2001 From: Scott Duensing Date: Fri, 8 May 2026 16:19:04 -0500 Subject: [PATCH] Checkpoint --- SESSION_RECOVERY.md | 293 ++++++++++++++++++++++++++++++++++++++++++ runtime/src/qsort.c | 47 ++++--- runtime/src/timeExt.c | 6 + 3 files changed, 329 insertions(+), 17 deletions(-) create mode 100644 SESSION_RECOVERY.md diff --git a/SESSION_RECOVERY.md b/SESSION_RECOVERY.md new file mode 100644 index 0000000..18dd2f7 --- /dev/null +++ b/SESSION_RECOVERY.md @@ -0,0 +1,293 @@ +# Session Recovery — 2026-05-07/08 + +Living recovery doc. Update on every meaningful change. If session is lost, +read this top-to-bottom + the memory notes referenced inside, then reread +the actual diffs in tree to ground assumptions. + +## Headline state + +- **Smoke**: 131/131 green. +- **Active config**: ptr32 (`p:32:16`), full IMG0..IMG15 caller-clobber on JSL, basic regalloc at -O1+. +- **Working tree**: clean except 3 modified files listed below; all are real fixes that haven't been committed yet. +- **Branch**: `main`, ahead of `origin/main` by recent checkpoint commits. + +## Uncommitted, must keep + +These are the in-flight improvements. Rebuild after applying any of them. + +1. `runtime/src/snprintf.c` — removed `__attribute__((optnone))` from + `emitULong` (line 106) and `snprintf` (line 303). Slot-aliasing + workaround that the IMG-clobber + LDAfi-IMG fixes made unnecessary. +2. `src/llvm/lib/Target/W65816/W65816InstrInfo.cpp` + - `copyPhysReg` virtual-register short-circuit: if `SrcReg` or `DestReg` + is virtual, emit a `TargetOpcode::COPY` and return. Basic regalloc's + InlineSpiller calls `storeRegToStackSlot` with vreg sources before + final physreg assignment; without the short-circuit the unpaired- + Wide32 default branch hits the `unreachable`. + - `copyPhysReg` IMG-to-IMG PHA-bracket: was `lda src; sta dst` — + unbracketed clobber of A, regalloc inserted these copies between + `$a = COPY $img10` and use-of-A. PHA/PLA bracket preserves A. +3. `src/llvm/lib/Target/W65816/W65816SjLjFinalize.cpp` — catchtab build + moved BEFORE landingpad erase. Old code did `LPadBB->getLandingPadInst()` + AFTER erasing the insts → returned nullptr → empty LSDA → catch never + matched, abort. Now captures catch-clause typeinfo Constants into a + `DenseMap` BEFORE erase; build loop reads from + the saved map. + +To commit when ready (do NOT amend; create new commits): +```bash +git add runtime/src/snprintf.c \ + src/llvm/lib/Target/W65816/W65816InstrInfo.cpp \ + src/llvm/lib/Target/W65816/W65816SjLjFinalize.cpp +git commit -m "..." # message stub below +``` +Suggested commit message: see "Fixes landed" section below; one commit +per logical change is cleaner. + +## Already-committed in this session arc + +Per `git log --oneline -20` these are the recent checkpoint commits; +the diffs they contain are real and load-bearing. + +The big ones (search by file or grep): +- **JSLpseudo Defs += IMG0..IMG15** in `W65816InstrInfo.td`. With the + wider Defs, regalloc spills IMG-class vregs around calls instead of + treating them as preserved. +- **`W65816RegisterInfo.cpp` `eliminateFrameIndex` for `STAfi`**: + PHA-bracketed for non-A source (IMG/X/Y). The `lda dp; sta d,s` chain + clobbered A; bracket preserves A while shifting offset by +2 between + PHA and PLA. Defs=[A] kept on STAfi as safe over-approximation. +- **`W65816RegisterInfo.cpp` `eliminateFrameIndex` for `LDAfi`**: + if `Dst = IMGn`, append `STA dp` so the IMG slot actually receives the + loaded value. Previously only loaded into A; downstream + `COPY $x = $imgN` (= `ldx $D?`) read garbage. **This was the smoking + gun for `dadd(1.5, 2.5) → 0x4010_0000_3000_3000`.** +- **`W65816LowerWide32.cpp`** fixed-point erase loop. Was single-pass; + REG_SEQUENCE got skipped if a not-yet-erased COPY consumer kept it + alive at the iteration moment. Removed ~40 dead Wide32 vregs from + `__adddf3`'s pre-RA MIR. +- **`src/llvm/test/CodeGen/W65816/i64-first-arg-img16.ll`** relaxed + `stx 0xd / sta 0xd` to `0x{{[cd]}}` (regalloc now picks IMG8..15 too). + +## Fixes landed (full list with rationale) + +Each entry: what / why / where / what regression it would cause if reverted. + +### A. Hash-shell DELETE bug → IMG caller-clobber + +**Symptom**: `dbDelete("age")` returned 0 ("not found") instead of 1. +DELETE never ran; `COUNT` stayed at 2. + +**Root cause**: `dbDelete` did `stx 0xd0` to save k_high, called `hashKey`, +then `pei 0xd0` to push k_high to strcmp. `hashKey` used $D0 as scratch +in its loop body (`sta 0xd0` storing the iterator's running-ptr-low). $D0 +was clobbered by the time `pei 0xd0` ran. JSLpseudo Defs only listed +`A, X, Y, DPF0` — IMG slots were not modelled as caller-clobber. + +**Fix**: `JSLpseudo Defs += [IMG0..IMG15]`. + +**Cascading fallout** (each required its own fix): + +#### A1. copyPhysReg vreg fallback +`storeRegToStackSlot`'s unpaired-Wide32 default branch hit `unreachable` +when called with a vreg source. Basic regalloc's InlineSpiller does +this. Fix: short-circuit virtual-reg cases to `TargetOpcode::COPY`. + +#### A2. LowerWide32 fixed-point erase +Single-pass erase left ~40 dead Wide32 vregs in `__adddf3`. Pattern: +``` +%X:wide32 = REG_SEQUENCE ... +%Y:wide32 = COPY %X +... uses of %Y rewritten by Pass 3 ... +``` +Single-pass: REG_SEQUENCE skipped (COPY consumer still alive), then +COPY erased (now %X dead but loop already passed it). Fix: iterate +until no progress. + +#### A3. STAfi PHA-bracket +Without bracket, regalloc could schedule `$img0 = COPY $a` AFTER a +`STAfi`-with-IMG-source whose internal `lda dp` clobbered $a, silently +storing X's value where A's was expected. + +#### A4. LDAfi-IMG-dest STA dp +**The big one.** With narrow IMG, regalloc kept Wide16 vregs in IMG +slots across calls, never needed `$imgN = LDAfi %stack.X`. With full +IMG, every cross-call spill needed it. The expansion only emitted +`LDA d,s` (load A) — never wrote to the IMG slot. Downstream +`COPY $x = $imgN` (= `ldx $D?`) read stale prior data. Manifested as +`dadd(1.5, 2.5) → 0x4010_0000_3000_3000` (mantissa garbage). + +**Diagnostic that found it**: diff post-RA MIR narrow vs full IMG. Pre-RA +MIR was identical. Full had 6 `$imgN = LDAfi` instances; narrow had 0. +Narrow used `COPY $imgN = $a` patterns instead — those work correctly. + +#### A5. FileCheck regex +`src/llvm/test/CodeGen/W65816/i64-first-arg-img16.ll` expected +`stx 0xd / sta 0xd`. Under full IMG clobber, regalloc picks IMG8..15 +($C0/$C2) for cross-call arg saves. Relaxed to `0x{{[cd]}}`. + +### B. C++ try/catch source-level path + +Two bugs blocking real `clang++ -fsjlj-exceptions` source code: + +#### B1. W65816SjLjFinalize catchtab ordering +`runOnFunction` erased landingpad insts at line ~245, then built the +catchtab at line ~290 via `LPadBB->getLandingPadInst()`. By that +point, landingpads were nullptr. The build loop's `if (!LP) continue;` +skipped every entry. Catchtab ended with just `(0,0)` sentinel. LSDA +was 4 bytes of zeros. `findCatch` saw `ctx->lsda == 0`'s entry and +bailed. Result: any `throw` aborted. + +Fix: capture catch-clause typeinfo Constants into a +`DenseMap` BEFORE erasing landingpads; the +catchtab build loop reads from the saved map. + +#### B2. copyPhysReg IMG-to-IMG PHA-bracket +Comment said "Caller is responsible for ensuring A is dead at this +program point (regalloc usually arranges this)." It doesn't, in +practice. Regalloc inserted IMG-to-IMG copies between `$a = COPY $img10` +and `STAfi $a, slot`. Unbracketed `lda src; sta dst` clobbered A. +The subsequent STAfi spilled garbage. Visible as `*p = 42` after +`__cxa_allocate_exception` storing 42 to wrong addr (indirect-long +setup got hi-half at lo-slot). + +Fix: PHA-bracket. Cost +7 cyc / +2 bytes per IMG-IMG copy (rare). + +**Verified end-to-end** via MAME breakpoints: `begin_catch` entered +with correct ExcHeader, `end_catch` entered with A=42, doTest returns +A=42 from real C++ `try { throw 42; } catch (int x) { return x; }`. + +### C. Cleanup wins + +- `runtime/src/snprintf.c:106` — removed `optnone` on `emitULong`. Smoke green. +- `runtime/src/snprintf.c:303` — removed `optnone` on `snprintf`. Smoke green. + +## Still-open work areas + +Each carries a fair-warning note for whoever picks it up. + +### 1. qsort/bsearch `optnone` — REMOVED 2026-05-08 +Source-restructured `qsort`: split the inner loop into a +`__attribute__((noinline))` helper `qsortInner` (4 args: base, cur, +size, cmp). Outer `qsort` just iterates `i = 1..nmemb-1` and calls +`qsortInner(base, base + i*size, size, cmp)`. This drops outer +qsort's i32-vreg simultaneous-live count below the inline-spill +OOM threshold; both halves compile cleanly at -O2 + basic regalloc. + +`bsearch` `optnone` was kept-for-symmetry — once removed, it just +worked. The IMG-clobber + LDAfi-IMG-store backend fixes from +2026-05-07 had already resolved its underlying pressure issue. + +Smoke 131/131 stays green. + +### 2. gmtime_r `optnone` +`runtime/src/timeExt.c:69`. NOT a backend bug — IR-level optimization +issue (loop rotation + IndVar simplify mis-evaluating +`days >= 365L + (__isLeap(...) ? 1 : 0)`). Fixing requires deciding +which combine pass is wrong and why. Out of scope for backend work. + +### 3. softDouble noinlines +`runtime/src/softDouble.c:30` (`dpack`) and `:51` (`dclass`). Removing +`dpack` noinline broke dadd this session — register pressure for +`__adddf3`/`__muldf3`/`__divdf3`. Architectural for the same reason as +qsort. + +### 4. Greedy regalloc retry — TRIED, blocked +Tested 2026-05-08. Greedy fails immediately on `atoi` in libc.c: +``` +LiveRangeEdit.cpp:200: void llvm::LiveRangeEdit::eliminateDeadDef(...): +Assertion `MI->allDefsAreDead() && "Def isn't really dead"' failed. +``` +Same upstream LLVM bug class as the dadd full-IMG attempt — sub-register +pair partial defs that the regalloc treats as fully dead. Greedy is +genuinely incompatible with the W65816's split-half subreg-pair patterns +until the upstream LLVM issue is patched. Reverted to basic regalloc. +Document `feedback_greedy_high_pressure.md` already covers this. + +### 5. gmtime_r `optnone` — TRIED, blocked +Tested 2026-05-08. Hoisting `yearLen` to a long local (avoiding the +double-recompute of `365L + (__isLeap ? 1 : 0)`) didn't help; adding +`volatile` to the local also didn't help. IR optimizer is still +folding the comparison to compile-time-false. Source-level C +restructuring won't dodge it; needs IR-pass-level work to identify +which combine pass mis-evaluates and why. optnone stays. + +## How to verify recovery + +```bash +cd /home/scott/claude/llvm816 +git status # 3 modified files listed above +cd tools/llvm-mos-build && ninja llc clang # rebuild backend +cd /home/scott/claude/llvm816 +bash runtime/build.sh # build runtime +bash scripts/smokeTest.sh # should print "all smoke checks passed" +``` + +If smoke fails, the most likely cause is one of the three uncommitted +files got reverted; check `git status` and re-apply. + +## Diagnostic tools that worked + +For posterity — these are the patterns that paid off this session. + +### Pre-RA vs post-RA MIR diff +```bash +clang -mllvm -stop-before=regallocbasic -S ... # pre-RA +clang -mllvm -stop-after=virtregrewriter -S ... # post-RA (post-virtregrewriter) +``` +Diff narrow-IMG vs full-IMG post-RA MIR for the failing function. +Pre-RA is identical (same IR), so the diff isolates regalloc-decision +divergence. Look at every NEW pattern that appears only in the failing +build — `$imgN = LDAfi` was the smoking gun for dadd. + +### Pass-by-pass IR/MIR dumps +```bash +clang -mllvm -print-after=w65816-lower-wide32 -S ... +clang -mllvm -print-after-all -S ... 2>dump.txt +``` + +### MAME debugger via xvfb-run +```bash +xvfb-run -a mame apple2gs ... -debug -debugger qt -oslog -seconds_to_run N +``` +With autoboot Lua: load .bin into bank 0 (skip `$C000..$CFFF` I/O), +set CPU state, then `cpu.debug:bpset(addr, condition, action)` with +actions like `"logerror \"...\\n\",a,x,...; go"`. `logerror` with format +args goes to stdout under `-oslog`. Memory reads in expressions: +`b@(addr)`, `w@(addr)`. Watchpoints: `cpu.debug:wpset(prog_space, "w", +addr, len, condition, action)`. + +**AVOID**: `add_machine_pause_notifier` + `cpu.debug:go()` in callback — +segfaults from reentrancy. `printf` in actions stays in debugger console +(not -oslog). `tracelog` also debugger-console only. + +### Trace methodology (find divergence point) +1. Set BPs at every `JSLpseudo` callee in the failing function. +2. Capture A/X/Y/DPF0 at each return. +3. Find first divergent return between known-good and failing builds. +4. The instruction sequence between previous-OK and first-divergent + return is where the bug lives. + +This pattern found the dadd bug at `jsl@0x207f → __lshrsi3(0x8001_8000, 3)` +in 30 minutes. Recommended. + +## Memory notes referenced + +(Filenames under `/home/scott/.claude/projects/-home-scott-claude-llvm816/memory/`.) + +- `feedback_strstr9_long_haystack.md` — the hash-shell bug story. +- `feedback_cpp_subset.md` — C++ subset, including the SjLj fix. +- `feedback_ptr32_frame_limit.md` — was 5 days stale; updated 2026-05-07 + to "DONE, 131/131 smoke green". +- `feedback_jslpseudo_caller_save.md`, `feedback_libcall_img_clobber.md`, + `feedback_img_slot_expansion.md`, `feedback_greedy_high_pressure.md` — + related backend topics. + +## Next session candidates (ranked) + +1. **Commit the uncommitted fixes.** They've earned it. +2. **Greedy regalloc retry.** Cheap experiment, potentially big win. +3. **qsort source restructure.** Clear `optnone` if you're willing to + reshape the algorithm. Source-level work, not backend. +4. **gmtime_r IR investigation.** Find which combine miscompiles + `days >= 365L + (leap?1:0)`. IR-level, not backend. diff --git a/runtime/src/qsort.c b/runtime/src/qsort.c index f5326d7..8e4e8e2 100644 --- a/runtime/src/qsort.c +++ b/runtime/src/qsort.c @@ -23,10 +23,11 @@ static void byteSwap(unsigned char *a, unsigned char *b, size_t size) { } -// optnone under ptr32: greedy regalloc runs out of registers when the -// 32-bit pointer arithmetic puts multiple simultaneously-live Wide32 -// vregs in flight. Fast regalloc spills liberally and gets through. -__attribute__((optnone)) +// Iterative binary search. Previously had `optnone` to dodge a +// regalloc OOM under ptr32 + multiple simultaneously-live Wide32 +// vregs. The IMG-clobber + LDAfi-IMG-store backend fixes (2026-05-07) +// resolved the underlying regalloc pressure issue and bsearch now +// compiles cleanly at -O2. void *bsearch(const void *key, const void *base, size_t nmemb, size_t size, CmpFnT cmp) { const unsigned char *baseP = (const unsigned char *)base; @@ -49,24 +50,36 @@ void *bsearch(const void *key, const void *base, size_t nmemb, } -__attribute__((optnone)) +// Inner loop of qsort's insertion sort: starting from `cur` (the +// element at index i) walk left through `[base, cur)`, swapping +// adjacent pairs while out-of-order. Hoisted out of qsort so the +// outer loop's frame stays small enough for basic regalloc; without +// the split, qsort's i32-pointer pressure under ptr32 produces +// ADCEfi tied-def chains the inline-spiller can't allocate ("ran +// out of registers" failure). +__attribute__((noinline)) +static void qsortInner(unsigned char *base, unsigned char *cur, + size_t size, CmpFnT cmp) { + while (cur > base) { + unsigned char *prev = cur - size; + if (cmp(prev, cur) <= 0) { + return; + } + byteSwap(prev, cur, size); + cur = prev; + } +} + + void qsort(void *base, size_t nmemb, size_t size, CmpFnT cmp) { if (nmemb < 2 || size == 0) { return; } unsigned char *baseP = (unsigned char *)base; - // Insertion sort: for each i in 1..n-1, walk left swapping until - // the slot at j-1 is <= the slot at j. Stable, in-place. + // Insertion sort: for each i in 1..n-1, hand the element at i to + // the inner-loop helper which walks left swapping out-of-order + // pairs. Stable, in-place. for (size_t i = 1; i < nmemb; i++) { - size_t j = i; - while (j > 0) { - unsigned char *cur = baseP + j * size; - unsigned char *prev = cur - size; - if (cmp(prev, cur) <= 0) { - break; - } - byteSwap(prev, cur, size); - j--; - } + qsortInner(baseP, baseP + i * size, size, cmp); } } diff --git a/runtime/src/timeExt.c b/runtime/src/timeExt.c index c78c0c9..414e20a 100644 --- a/runtime/src/timeExt.c +++ b/runtime/src/timeExt.c @@ -66,6 +66,12 @@ struct tm *localtime(const time_t *t) { // optnone preserves the per-statement structure and the loop runs // correctly. Verified end-to-end against 1710484245L → 2024-03-15 // 06:30:45 UTC (Friday, day-of-year 74). +// +// Tried 2026-05-08 (didn't fix): hoist yearLen to a long local; +// hoist with `volatile`; restructure as for(;;) with break. All hit +// the same IR-level bug — IndVar simplify still folds the comparison +// to compile-time-false. The fix needs IR-pass-level work, not C +// restructuring. __attribute__((optnone)) struct tm *gmtime_r(const time_t *t, struct tm *out) { long secs = *t;