Checkpoint

This commit is contained in:
Scott Duensing 2026-05-08 16:19:04 -05:00
parent 99be5acb89
commit 583cee849d
3 changed files with 329 additions and 17 deletions

293
SESSION_RECOVERY.md Normal file
View file

@ -0,0 +1,293 @@
# Session Recovery — 2026-05-07/08
Living recovery doc. Update on every meaningful change. If session is lost,
read this top-to-bottom + the memory notes referenced inside, then reread
the actual diffs in tree to ground assumptions.
## Headline state
- **Smoke**: 131/131 green.
- **Active config**: ptr32 (`p:32:16`), full IMG0..IMG15 caller-clobber on JSL, basic regalloc at -O1+.
- **Working tree**: clean except 3 modified files listed below; all are real fixes that haven't been committed yet.
- **Branch**: `main`, ahead of `origin/main` by recent checkpoint commits.
## Uncommitted, must keep
These are the in-flight improvements. Rebuild after applying any of them.
1. `runtime/src/snprintf.c` — removed `__attribute__((optnone))` from
`emitULong` (line 106) and `snprintf` (line 303). Slot-aliasing
workaround that the IMG-clobber + LDAfi-IMG fixes made unnecessary.
2. `src/llvm/lib/Target/W65816/W65816InstrInfo.cpp`
- `copyPhysReg` virtual-register short-circuit: if `SrcReg` or `DestReg`
is virtual, emit a `TargetOpcode::COPY` and return. Basic regalloc's
InlineSpiller calls `storeRegToStackSlot` with vreg sources before
final physreg assignment; without the short-circuit the unpaired-
Wide32 default branch hits the `unreachable`.
- `copyPhysReg` IMG-to-IMG PHA-bracket: was `lda src; sta dst`
unbracketed clobber of A, regalloc inserted these copies between
`$a = COPY $img10` and use-of-A. PHA/PLA bracket preserves A.
3. `src/llvm/lib/Target/W65816/W65816SjLjFinalize.cpp` — catchtab build
moved BEFORE landingpad erase. Old code did `LPadBB->getLandingPadInst()`
AFTER erasing the insts → returned nullptr → empty LSDA → catch never
matched, abort. Now captures catch-clause typeinfo Constants into a
`DenseMap<BasicBlock*, LPadInfo>` BEFORE erase; build loop reads from
the saved map.
To commit when ready (do NOT amend; create new commits):
```bash
git add runtime/src/snprintf.c \
src/llvm/lib/Target/W65816/W65816InstrInfo.cpp \
src/llvm/lib/Target/W65816/W65816SjLjFinalize.cpp
git commit -m "..." # message stub below
```
Suggested commit message: see "Fixes landed" section below; one commit
per logical change is cleaner.
## Already-committed in this session arc
Per `git log --oneline -20` these are the recent checkpoint commits;
the diffs they contain are real and load-bearing.
The big ones (search by file or grep):
- **JSLpseudo Defs += IMG0..IMG15** in `W65816InstrInfo.td`. With the
wider Defs, regalloc spills IMG-class vregs around calls instead of
treating them as preserved.
- **`W65816RegisterInfo.cpp` `eliminateFrameIndex` for `STAfi`**:
PHA-bracketed for non-A source (IMG/X/Y). The `lda dp; sta d,s` chain
clobbered A; bracket preserves A while shifting offset by +2 between
PHA and PLA. Defs=[A] kept on STAfi as safe over-approximation.
- **`W65816RegisterInfo.cpp` `eliminateFrameIndex` for `LDAfi`**:
if `Dst = IMGn`, append `STA dp` so the IMG slot actually receives the
loaded value. Previously only loaded into A; downstream
`COPY $x = $imgN` (= `ldx $D?`) read garbage. **This was the smoking
gun for `dadd(1.5, 2.5) → 0x4010_0000_3000_3000`.**
- **`W65816LowerWide32.cpp`** fixed-point erase loop. Was single-pass;
REG_SEQUENCE got skipped if a not-yet-erased COPY consumer kept it
alive at the iteration moment. Removed ~40 dead Wide32 vregs from
`__adddf3`'s pre-RA MIR.
- **`src/llvm/test/CodeGen/W65816/i64-first-arg-img16.ll`** relaxed
`stx 0xd / sta 0xd` to `0x{{[cd]}}` (regalloc now picks IMG8..15 too).
## Fixes landed (full list with rationale)
Each entry: what / why / where / what regression it would cause if reverted.
### A. Hash-shell DELETE bug → IMG caller-clobber
**Symptom**: `dbDelete("age")` returned 0 ("not found") instead of 1.
DELETE never ran; `COUNT` stayed at 2.
**Root cause**: `dbDelete` did `stx 0xd0` to save k_high, called `hashKey`,
then `pei 0xd0` to push k_high to strcmp. `hashKey` used $D0 as scratch
in its loop body (`sta 0xd0` storing the iterator's running-ptr-low). $D0
was clobbered by the time `pei 0xd0` ran. JSLpseudo Defs only listed
`A, X, Y, DPF0` — IMG slots were not modelled as caller-clobber.
**Fix**: `JSLpseudo Defs += [IMG0..IMG15]`.
**Cascading fallout** (each required its own fix):
#### A1. copyPhysReg vreg fallback
`storeRegToStackSlot`'s unpaired-Wide32 default branch hit `unreachable`
when called with a vreg source. Basic regalloc's InlineSpiller does
this. Fix: short-circuit virtual-reg cases to `TargetOpcode::COPY`.
#### A2. LowerWide32 fixed-point erase
Single-pass erase left ~40 dead Wide32 vregs in `__adddf3`. Pattern:
```
%X:wide32 = REG_SEQUENCE ...
%Y:wide32 = COPY %X
... uses of %Y rewritten by Pass 3 ...
```
Single-pass: REG_SEQUENCE skipped (COPY consumer still alive), then
COPY erased (now %X dead but loop already passed it). Fix: iterate
until no progress.
#### A3. STAfi PHA-bracket
Without bracket, regalloc could schedule `$img0 = COPY $a` AFTER a
`STAfi`-with-IMG-source whose internal `lda dp` clobbered $a, silently
storing X's value where A's was expected.
#### A4. LDAfi-IMG-dest STA dp
**The big one.** With narrow IMG, regalloc kept Wide16 vregs in IMG
slots across calls, never needed `$imgN = LDAfi %stack.X`. With full
IMG, every cross-call spill needed it. The expansion only emitted
`LDA d,s` (load A) — never wrote to the IMG slot. Downstream
`COPY $x = $imgN` (= `ldx $D?`) read stale prior data. Manifested as
`dadd(1.5, 2.5) → 0x4010_0000_3000_3000` (mantissa garbage).
**Diagnostic that found it**: diff post-RA MIR narrow vs full IMG. Pre-RA
MIR was identical. Full had 6 `$imgN = LDAfi` instances; narrow had 0.
Narrow used `COPY $imgN = $a` patterns instead — those work correctly.
#### A5. FileCheck regex
`src/llvm/test/CodeGen/W65816/i64-first-arg-img16.ll` expected
`stx 0xd / sta 0xd`. Under full IMG clobber, regalloc picks IMG8..15
($C0/$C2) for cross-call arg saves. Relaxed to `0x{{[cd]}}`.
### B. C++ try/catch source-level path
Two bugs blocking real `clang++ -fsjlj-exceptions` source code:
#### B1. W65816SjLjFinalize catchtab ordering
`runOnFunction` erased landingpad insts at line ~245, then built the
catchtab at line ~290 via `LPadBB->getLandingPadInst()`. By that
point, landingpads were nullptr. The build loop's `if (!LP) continue;`
skipped every entry. Catchtab ended with just `(0,0)` sentinel. LSDA
was 4 bytes of zeros. `findCatch` saw `ctx->lsda == 0`'s entry and
bailed. Result: any `throw` aborted.
Fix: capture catch-clause typeinfo Constants into a
`DenseMap<BasicBlock*, LPadInfo>` BEFORE erasing landingpads; the
catchtab build loop reads from the saved map.
#### B2. copyPhysReg IMG-to-IMG PHA-bracket
Comment said "Caller is responsible for ensuring A is dead at this
program point (regalloc usually arranges this)." It doesn't, in
practice. Regalloc inserted IMG-to-IMG copies between `$a = COPY $img10`
and `STAfi $a, slot`. Unbracketed `lda src; sta dst` clobbered A.
The subsequent STAfi spilled garbage. Visible as `*p = 42` after
`__cxa_allocate_exception` storing 42 to wrong addr (indirect-long
setup got hi-half at lo-slot).
Fix: PHA-bracket. Cost +7 cyc / +2 bytes per IMG-IMG copy (rare).
**Verified end-to-end** via MAME breakpoints: `begin_catch` entered
with correct ExcHeader, `end_catch` entered with A=42, doTest returns
A=42 from real C++ `try { throw 42; } catch (int x) { return x; }`.
### C. Cleanup wins
- `runtime/src/snprintf.c:106` — removed `optnone` on `emitULong`. Smoke green.
- `runtime/src/snprintf.c:303` — removed `optnone` on `snprintf`. Smoke green.
## Still-open work areas
Each carries a fair-warning note for whoever picks it up.
### 1. qsort/bsearch `optnone` — REMOVED 2026-05-08
Source-restructured `qsort`: split the inner loop into a
`__attribute__((noinline))` helper `qsortInner` (4 args: base, cur,
size, cmp). Outer `qsort` just iterates `i = 1..nmemb-1` and calls
`qsortInner(base, base + i*size, size, cmp)`. This drops outer
qsort's i32-vreg simultaneous-live count below the inline-spill
OOM threshold; both halves compile cleanly at -O2 + basic regalloc.
`bsearch` `optnone` was kept-for-symmetry — once removed, it just
worked. The IMG-clobber + LDAfi-IMG-store backend fixes from
2026-05-07 had already resolved its underlying pressure issue.
Smoke 131/131 stays green.
### 2. gmtime_r `optnone`
`runtime/src/timeExt.c:69`. NOT a backend bug — IR-level optimization
issue (loop rotation + IndVar simplify mis-evaluating
`days >= 365L + (__isLeap(...) ? 1 : 0)`). Fixing requires deciding
which combine pass is wrong and why. Out of scope for backend work.
### 3. softDouble noinlines
`runtime/src/softDouble.c:30` (`dpack`) and `:51` (`dclass`). Removing
`dpack` noinline broke dadd this session — register pressure for
`__adddf3`/`__muldf3`/`__divdf3`. Architectural for the same reason as
qsort.
### 4. Greedy regalloc retry — TRIED, blocked
Tested 2026-05-08. Greedy fails immediately on `atoi` in libc.c:
```
LiveRangeEdit.cpp:200: void llvm::LiveRangeEdit::eliminateDeadDef(...):
Assertion `MI->allDefsAreDead() && "Def isn't really dead"' failed.
```
Same upstream LLVM bug class as the dadd full-IMG attempt — sub-register
pair partial defs that the regalloc treats as fully dead. Greedy is
genuinely incompatible with the W65816's split-half subreg-pair patterns
until the upstream LLVM issue is patched. Reverted to basic regalloc.
Document `feedback_greedy_high_pressure.md` already covers this.
### 5. gmtime_r `optnone` — TRIED, blocked
Tested 2026-05-08. Hoisting `yearLen` to a long local (avoiding the
double-recompute of `365L + (__isLeap ? 1 : 0)`) didn't help; adding
`volatile` to the local also didn't help. IR optimizer is still
folding the comparison to compile-time-false. Source-level C
restructuring won't dodge it; needs IR-pass-level work to identify
which combine pass mis-evaluates and why. optnone stays.
## How to verify recovery
```bash
cd /home/scott/claude/llvm816
git status # 3 modified files listed above
cd tools/llvm-mos-build && ninja llc clang # rebuild backend
cd /home/scott/claude/llvm816
bash runtime/build.sh # build runtime
bash scripts/smokeTest.sh # should print "all smoke checks passed"
```
If smoke fails, the most likely cause is one of the three uncommitted
files got reverted; check `git status` and re-apply.
## Diagnostic tools that worked
For posterity — these are the patterns that paid off this session.
### Pre-RA vs post-RA MIR diff
```bash
clang -mllvm -stop-before=regallocbasic -S ... # pre-RA
clang -mllvm -stop-after=virtregrewriter -S ... # post-RA (post-virtregrewriter)
```
Diff narrow-IMG vs full-IMG post-RA MIR for the failing function.
Pre-RA is identical (same IR), so the diff isolates regalloc-decision
divergence. Look at every NEW pattern that appears only in the failing
build — `$imgN = LDAfi` was the smoking gun for dadd.
### Pass-by-pass IR/MIR dumps
```bash
clang -mllvm -print-after=w65816-lower-wide32 -S ...
clang -mllvm -print-after-all -S ... 2>dump.txt
```
### MAME debugger via xvfb-run
```bash
xvfb-run -a mame apple2gs ... -debug -debugger qt -oslog -seconds_to_run N
```
With autoboot Lua: load .bin into bank 0 (skip `$C000..$CFFF` I/O),
set CPU state, then `cpu.debug:bpset(addr, condition, action)` with
actions like `"logerror \"...\\n\",a,x,...; go"`. `logerror` with format
args goes to stdout under `-oslog`. Memory reads in expressions:
`b@(addr)`, `w@(addr)`. Watchpoints: `cpu.debug:wpset(prog_space, "w",
addr, len, condition, action)`.
**AVOID**: `add_machine_pause_notifier` + `cpu.debug:go()` in callback —
segfaults from reentrancy. `printf` in actions stays in debugger console
(not -oslog). `tracelog` also debugger-console only.
### Trace methodology (find divergence point)
1. Set BPs at every `JSLpseudo` callee in the failing function.
2. Capture A/X/Y/DPF0 at each return.
3. Find first divergent return between known-good and failing builds.
4. The instruction sequence between previous-OK and first-divergent
return is where the bug lives.
This pattern found the dadd bug at `jsl@0x207f → __lshrsi3(0x8001_8000, 3)`
in 30 minutes. Recommended.
## Memory notes referenced
(Filenames under `/home/scott/.claude/projects/-home-scott-claude-llvm816/memory/`.)
- `feedback_strstr9_long_haystack.md` — the hash-shell bug story.
- `feedback_cpp_subset.md` — C++ subset, including the SjLj fix.
- `feedback_ptr32_frame_limit.md` — was 5 days stale; updated 2026-05-07
to "DONE, 131/131 smoke green".
- `feedback_jslpseudo_caller_save.md`, `feedback_libcall_img_clobber.md`,
`feedback_img_slot_expansion.md`, `feedback_greedy_high_pressure.md`
related backend topics.
## Next session candidates (ranked)
1. **Commit the uncommitted fixes.** They've earned it.
2. **Greedy regalloc retry.** Cheap experiment, potentially big win.
3. **qsort source restructure.** Clear `optnone` if you're willing to
reshape the algorithm. Source-level work, not backend.
4. **gmtime_r IR investigation.** Find which combine miscompiles
`days >= 365L + (leap?1:0)`. IR-level, not backend.

View file

@ -23,10 +23,11 @@ static void byteSwap(unsigned char *a, unsigned char *b, size_t size) {
} }
// optnone under ptr32: greedy regalloc runs out of registers when the // Iterative binary search. Previously had `optnone` to dodge a
// 32-bit pointer arithmetic puts multiple simultaneously-live Wide32 // regalloc OOM under ptr32 + multiple simultaneously-live Wide32
// vregs in flight. Fast regalloc spills liberally and gets through. // vregs. The IMG-clobber + LDAfi-IMG-store backend fixes (2026-05-07)
__attribute__((optnone)) // resolved the underlying regalloc pressure issue and bsearch now
// compiles cleanly at -O2.
void *bsearch(const void *key, const void *base, size_t nmemb, void *bsearch(const void *key, const void *base, size_t nmemb,
size_t size, CmpFnT cmp) { size_t size, CmpFnT cmp) {
const unsigned char *baseP = (const unsigned char *)base; const unsigned char *baseP = (const unsigned char *)base;
@ -49,24 +50,36 @@ void *bsearch(const void *key, const void *base, size_t nmemb,
} }
__attribute__((optnone)) // Inner loop of qsort's insertion sort: starting from `cur` (the
// element at index i) walk left through `[base, cur)`, swapping
// adjacent pairs while out-of-order. Hoisted out of qsort so the
// outer loop's frame stays small enough for basic regalloc; without
// the split, qsort's i32-pointer pressure under ptr32 produces
// ADCEfi tied-def chains the inline-spiller can't allocate ("ran
// out of registers" failure).
__attribute__((noinline))
static void qsortInner(unsigned char *base, unsigned char *cur,
size_t size, CmpFnT cmp) {
while (cur > base) {
unsigned char *prev = cur - size;
if (cmp(prev, cur) <= 0) {
return;
}
byteSwap(prev, cur, size);
cur = prev;
}
}
void qsort(void *base, size_t nmemb, size_t size, CmpFnT cmp) { void qsort(void *base, size_t nmemb, size_t size, CmpFnT cmp) {
if (nmemb < 2 || size == 0) { if (nmemb < 2 || size == 0) {
return; return;
} }
unsigned char *baseP = (unsigned char *)base; unsigned char *baseP = (unsigned char *)base;
// Insertion sort: for each i in 1..n-1, walk left swapping until // Insertion sort: for each i in 1..n-1, hand the element at i to
// the slot at j-1 is <= the slot at j. Stable, in-place. // the inner-loop helper which walks left swapping out-of-order
// pairs. Stable, in-place.
for (size_t i = 1; i < nmemb; i++) { for (size_t i = 1; i < nmemb; i++) {
size_t j = i; qsortInner(baseP, baseP + i * size, size, cmp);
while (j > 0) {
unsigned char *cur = baseP + j * size;
unsigned char *prev = cur - size;
if (cmp(prev, cur) <= 0) {
break;
}
byteSwap(prev, cur, size);
j--;
}
} }
} }

View file

@ -66,6 +66,12 @@ struct tm *localtime(const time_t *t) {
// optnone preserves the per-statement structure and the loop runs // optnone preserves the per-statement structure and the loop runs
// correctly. Verified end-to-end against 1710484245L → 2024-03-15 // correctly. Verified end-to-end against 1710484245L → 2024-03-15
// 06:30:45 UTC (Friday, day-of-year 74). // 06:30:45 UTC (Friday, day-of-year 74).
//
// Tried 2026-05-08 (didn't fix): hoist yearLen to a long local;
// hoist with `volatile`; restructure as for(;;) with break. All hit
// the same IR-level bug — IndVar simplify still folds the comparison
// to compile-time-false. The fix needs IR-pass-level work, not C
// restructuring.
__attribute__((optnone)) __attribute__((optnone))
struct tm *gmtime_r(const time_t *t, struct tm *out) { struct tm *gmtime_r(const time_t *t, struct tm *out) {
long secs = *t; long secs = *t;