Added STATUS.md

2026-04-30 18:49:00 -05:00 · 2026-04-30 18:49:00 -05:00 · 91ac5476a5
commit 91ac5476a5
parent 6d7eae0356
20 changed files with 1192 additions and 107 deletions
--- a/STATUS.md
+++ b/STATUS.md
@@ -0,0 +1,144 @@
 # llvm816 — Current Status
 LLVM/Clang backend for the WDC 65816 (Apple IIgs), forked from
 llvm-mos as a separate `W65816` target.
 ## What works
 End-to-end C-to-binary toolchain that produces 65816 machine code
 which runs correctly under MAME (apple2gs).
 **Language coverage at -O2 (no extra flags):**
 - All scalar arithmetic: i8 / i16 / i32 / i64 add, sub, mul, div, mod
  (signed and unsigned). Carry-chained multi-word ops via ADC/SBC pseudos
  + ASLA16 / shift libcalls.
 - Comparisons and signed/unsigned widening (sext, zext, trunc) for all
  the above sizes.
 - Pointer arithmetic, array indexing, struct field access, struct
  return-by-value (up to 8 bytes — Pair, Vec4, double).
 - Bitfields, switch statements (verified up to ~12 cases + default),
  function pointers, function-pointer tables, indirect calls via
  `__jsl_indir` trampoline.
 - Recursion: factorial, Fibonacci, depth-3 binary-tree
  insert/sum/min/max, simple recursive quicksort.
 - Loops with goto / break / continue, nested loops, state machines.
 - `<stdarg.h>` varargs with int / long / unsigned long long mixed args.
 - Heap: `malloc` / `free` (libc.c first-fit allocator) — linked-list
  reverse with `cons` works.
 - Strings: hand-rolled `strlen`, `strcmp`, `strcpy`, `strchr`, atoi/itoa
  roundtrip.
 - Soft-float (single): all four ops + comparisons, MAME-verified.
 - Soft-double: add, sub, mul, div all return correct bit patterns;
  3-iter Newton sqrt converges. Long-running iterations may hit MAME's
  1-second sim-time budget (test config issue, not a compiler bug).
 - Inline assembly with `"a"`, `"x"`, `"y"` register constraints and
  arbitrary opcode bytes (used for the `pha;plb` bank-switch idiom).
 - C++ minimal: clang++ compiles a class with virtual + non-trivial
  ctor (vtable + RTTI omitted; no exceptions).
 - printf with `%d %x %s %c %p` and width/precision specifiers.
 - `setjmp` / `longjmp` from libgcc.s.
 - Static constructors via crt0's init_array walk.
 **Toolchain:**
 - `clang` / `llc` produce W65816 assembly + ELF object files.
 - `tools/link816` resolves cross-translation-unit refs, lays out
  text/rodata/bss, emits a flat binary the IIgs ROM can load.
 - `tools/omfEmit` produces OMF v2.1 single-segment files (the IIgs's
  native object format) for round-tripping with classic dev tools.
 - `runtime/build.sh` builds crt0, libc, soft-float, soft-double,
  libgcc into linkable objects.
 - `scripts/smokeTest.sh` runs ~80 end-to-end checks (scalar ops,
  control flow, calling conventions, MAME execution, regressions).
  Currently 100% pass.
 **ABI:**
 - arg0 in A; arg1 in X for i32-first-arg signatures; rest pushed RTL
  on the system stack with PHA. Caller deallocates via `tsc;clc;adc
  #N;tcs` or `PLY*N/2`.
 - Return: i8/i16 in A; i32 in A:X; i64 in A:X:Y plus DP[$F0..$F1] for
  the highest 16 bits.
 - Frame is empty-descending (S points to next-free); offsets account
  for the +1 skew vs LLVM's full-descending model.
 ## In flight (build-system level)
 - **DWARF sidecar emission in link816** (#51): The linker should produce
  a separate sidecar file with line-number / variable-location info
  that an IDE or post-mortem dumper can consume. Skeleton not yet
  written; deferred until other correctness work is done.
 ## Known issues / workarounds
 - **Greedy register allocator mis-orders spills** in two patterns
  (#69, #70):
  1. Functions where both `$a` and `$x` are live-in (i64-first-arg
     with a stack-output pointer, e.g. `udivmod(i64, i64, ptr)`).
     The TAX bridging `$x` to A clobbers `$a`'s value before the
     second STA can save it.
  2. Iterative quicksort with `if/else` recursion choice: complex
     live-ranges across two `swap()` calls produce wrong arg values.
  Both reproduce only at `-O1`/`-O2` with greedy. Workaround:
  `-mllvm -regalloc=fast` for the affected translation unit.
  `softDouble.c` already requires this flag for `__muldf3` (build.sh
  applies it automatically).
  The real fix is either a pre-RA pass that pre-spills critical pointer
  arguments to memory or a targeted fix in greedy's spill-ordering
  heuristic. Both are substantial work and are deferred.
 - **(d,s),y / (sr,s),y addressing wraps the bank** when Y holds a
  negative offset, which the CPU treats as a large unsigned 16-bit
  value. Worked around by `W65816NegYIndY`, which rewrites the affected
  ops to `TAX ; LDA/STA $0000,X`, so negative offsets like `arr[i-1]`
  stay correct.
 - **(d,s),y for stack-local pointer dereferences uses DBR**, so
  user code that switches DBR (e.g. `pha;plb` to bank 2 to reach
  IIgs hardware) must not call into a function that takes the
  address of one of its locals — the callee's `*p = v` will write
  to the wrong bank. Documented; no compiler-side mitigation
  beyond the existing DPF0 fake-physreg routing for the i64-return
  high half.
 ## What's still needed for a "ship-ready" toolchain
 - **Greedy regalloc spill-ordering fix** — see above. Removes the
  need for the per-file `-regalloc=fast` workaround on
  `softDouble.c` and unblocks pattern-rich code that currently
  must be compiled at `-O0` for correctness.
 - **Round-to-nearest-even in `__divdf3`** — currently
  truncates toward zero, which differs from gcc by ±1 ULP in
  several test cases. Acceptable today (Newton iterations still
  converge); revisit when an exact-match test suite lands.
 - **DWARF sidecar** (#51) for source-level debugging.
 - **More of the C standard library**: `<math.h>` transcendental
  functions (sin, cos, exp, log, pow), `<string.h>` beyond what's
  hand-coded, `<stdio.h>` file I/O (`fopen`, `fread`, `fwrite`,
  `fseek`).
 - **C++ runtime support**: vtable layout for multiple inheritance,
  RTTI, exceptions (or a documented `-fno-exceptions` requirement).
 - **REP/SEP scheduling pass** (design doc §3.3): the current
  prologue picks one M-mode for the whole function based on
  whether any 8-bit accumulator value is used. A per-region
  scheduler would reduce the SEP/REP wrap overhead on i8 stores.
 - **Toolbox / IIgs system call bindings**: header files declaring
  the Apple IIgs system calls (`SystemTask`, `WaitMouseUp`,
  `DrawString`, …) with the right inline-asm dispatch glue.
 - **Real-world program coverage**: the smoke tests are
  small synthetic checks. A few known-good Apple IIgs C programs (e.g.
  a textfile pager, a small game) compiled and run end-to-end
  would catch issues no synthetic test currently exercises.
 - **Cycle-time / size benchmarks vs Calypsi 5.16**: design doc §1
  says the goal is to "match or exceed" Calypsi. We have neither
  baseline numbers nor a comparison harness yet.
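The bank-wrap issue listed under "Known issues" in STATUS.md can be modelled on a host machine. The sketch below is an illustration only, not code from this commit: `effectiveAddress` and `intendedAddress` are hypothetical helper names, and the model assumes the CPU forms the data address as the 24-bit sum of DBR:base and the unsigned 16-bit Y, so a carry out of bit 15 moves into the bank byte.

```c
#include <stdint.h>

/* Host-side model (an assumption, not the backend's code): the unsigned
 * 16-bit Y is added to the 24-bit address DBR:base, so a carry out of
 * bit 15 lands in the bank byte. */
static uint32_t effectiveAddress(uint8_t dbr, uint16_t base, uint16_t y) {
    uint32_t addr = ((uint32_t)dbr << 16) + base + y;
    return addr & 0xFFFFFFu;            /* 24-bit address space */
}

/* What C semantics expect for arr[i-1]: the offset is signed, so the
 * 16-bit sum wraps and the access stays in the same bank. */
static uint32_t intendedAddress(uint8_t dbr, uint16_t base, int16_t offset) {
    uint16_t low = (uint16_t)(base + (uint16_t)offset);
    return ((uint32_t)dbr << 16) | low;
}
```

With a base of 0x1002 in bank 0 and a byte offset of -2 (Y = 0xFFFE), `intendedAddress` yields 0x001000 while `effectiveAddress` yields 0x011000, one bank too high. Non-negative offsets give the same result in both models, which is why only the negative-Y case needs the rewrite.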
--- a/runtime/build.sh
+++ b/runtime/build.sh
@@ -23,8 +23,10 @@ asm() {
 cc() {
    local c="$1"
    local o="$OUT/$(basename "${c%.c}").o"
    local extra=("${@:2}")
    echo "  CC  $(basename "$c")"
    "$CLANG" -target w65816 -O2 -ffunction-sections \
        "${extra[@]}" \
        -I"$PROJECT_ROOT/runtime/include" \
        -c "$c" -o "$o"
 }
@@ -33,6 +35,9 @@ asm "$SRC/crt0.s"
 asm "$SRC/libgcc.s"
 cc  "$SRC/libc.c"
 cc  "$SRC/softFloat.c"
-cc  "$SRC/softDouble.c"
+# softDouble.c needs -regalloc=fast: __muldf3's 64x64 -> 128 mul +
+# inlined alignment shifts overflows the greedy allocator on the
+# single-A target.
+cc  "$SRC/softDouble.c" -mllvm -regalloc=fast
 echo "runtime built: $(ls -1 "$OUT"/*.o | wc -l) objects"
--- a/runtime/src/libgcc.s
+++ b/runtime/src/libgcc.s
@@ -673,19 +673,30 @@ __divmodsi_setup:
 ; setup; signed variants flip signs around it.
 ; --------------------------------------------------------------------
 __divmoddi4_stash:
+	; Called via JSR from another libgcc helper that was itself
+	; called via JSL.  Stack layout inside this routine:
+	;   slot 1..2  = JSR return address (2 bytes, same-bank)
+	;   slot 3..5  = JSL return address (3 bytes, long)
+	;   slot 6..7  = first 16-bit stack arg (caller's first push)
+	;   slot 8..9  = second
+	;   ... etc.
+	; Earlier code read slots 4, 6, 8, 10, 12, 14 — which lands on
+	; the JSL ret address bytes, treating them as args.  Caught by
+	; `u64mul(0x12, 0x12)` returning the result at $E2 (mid-low)
+	; instead of $E0 (lo) plus 0x678-shaped garbage at $E6.
 	sta	0xe0			; a_lo_lo
 	stx	0xe2			; a_lo_hi
-	lda	0x4, s
-	sta	0xe4			; a_hi_lo
 	lda	0x6, s
-	sta	0xe6			; a_hi_hi
+	sta	0xe4			; a_hi_lo
 	lda	0x8, s
-	sta	0xe8			; b_lo_lo
+	sta	0xe6			; a_hi_hi
 	lda	0xa, s
-	sta	0xea			; b_lo_hi
+	sta	0xe8			; b_lo_lo
 	lda	0xc, s
-	sta	0xec			; b_hi_lo
+	sta	0xea			; b_lo_hi
 	lda	0xe, s
+	sta	0xec			; b_hi_lo
+	lda	0x10, s
 	sta	0xee			; b_hi_hi
 	rts
@@ -805,19 +816,28 @@ __muldi3:
 	; Loop 64 times on a's bits.
 	ldy	#0x40
 .Lmuldi_loop:
-	; Test bit 0 of a (= LSR a; C = old bit 0).
-	lda	0xe0
+	; Right-shift the 64-bit `a` by 1.  $E0=lo..$E6=hi (matches the
+	; stash + __retdi convention).  Must shift HI first (LSR loses
+	; bit 63 of $E6) so each ROR carries the previous half's bit 0
+	; INTO the top of the next-LOWER half — that's the actual
+	; right-shift direction in a $E0=lo layout.  After the chain,
+	; C = orig $E0_b0 = bit 0 of the 64-bit value, which drops out
+	; and is what we want to BCC on.  The earlier code shifted lo
+	; first which ran the shift in the WRONG direction (lo → hi)
+	; and tested $E6_b0 (bit 48) instead of bit 0 — every multiply
+	; involving bits 16+ came back garbage.
+	lda	0xe6
 	lsr	a
-	sta	0xe0
+	sta	0xe6
-	lda	0xe2
-	ror	a
-	sta	0xe2
 	lda	0xe4
 	ror	a
 	sta	0xe4
-	lda	0xe6
+	lda	0xe2
 	ror	a
-	sta	0xe6
+	sta	0xe2
+	lda	0xe0
+	ror	a
+	sta	0xe0
 	bcc	.Lmuldi_noadd
 	; Add b ($E8..$EE) to product ($F2..$F8).
 	clc
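The shift-direction fix in `__muldi3` is easier to check against a host-side model. The sketch below is a hypothetical illustration, not code from the runtime: `shiftRight64` and `shiftLoFirst` are invented names, and the array `w` stands in for the four 16-bit direct-page words, with `w[0]` as $E0 (low) and `w[3]` as $E6 (high).

```c
#include <stdint.h>

/* Model of the hi-first chain: LSR on the high word, then ROR on each
 * lower word.  Returns the bit shifted out of bit 0 (the final carry). */
static unsigned shiftRight64(uint16_t w[4]) {
    unsigned carry = 0;                  /* LSR shifts a zero into bit 15 */
    for (int i = 3; i >= 0; i--) {       /* high word first */
        unsigned out = w[i] & 1u;        /* bit 0 drops into the carry */
        w[i] = (uint16_t)((w[i] >> 1) | (carry << 15));
        carry = out;
    }
    return carry;
}

/* Model of the earlier lo-first order: each word's bit 0 is rotated into
 * the top of the next-HIGHER word, and the final carry is bit 0 of the
 * high word (bit 48 of the value), not bit 0 of the value. */
static unsigned shiftLoFirst(uint16_t w[4]) {
    unsigned carry = 0;
    for (int i = 0; i <= 3; i++) {       /* low word first */
        unsigned out = w[i] & 1u;
        w[i] = (uint16_t)((w[i] >> 1) | (carry << 15));
        carry = out;
    }
    return carry;
}
```

For the value 0x0001000000000000 (only bit 48 set), `shiftRight64` returns carry 0, which is the correct bit 0, while `shiftLoFirst` returns 1. That matches the symptom described in the assembly comment: the old loop tested bit 48 when deciding whether to add.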
--- a/runtime/src/softDouble.c
+++ b/runtime/src/softDouble.c
@@ -111,16 +111,25 @@ u64 __negdf2(u64 a) {
    return a ^ DSIGN_BIT;
 }
-u64 __muldf3(u64 a, u64 b) {
-    u64 sa, sb, ma, mb;
-    s16 ea, eb;
-    u16 ca = dclass(a, &sa, &ea, &ma);
-    u16 cb = dclass(b, &sb, &eb, &mb);
-    u64 sr = sa ^ sb;
-    if (ca == 0 || cb == 0) return sr;
-    // Truncated 64*64 → high-64 product via 32*32 partials.  We only
-    // need the upper bits of the 106-bit product because the mantissas
-    // are 53 bits each.
+// Carry the high 64 bits of a 128-bit product in `hi` and the low 64
+// in `lo`.  Carry bit indicates whether the leading bit was at 105
+// (caller must increment exponent).
+typedef struct {
+    u64 mantissa;
+    u16 carry;
+} MantCarryT;
+
+// 64x64 -> 128-bit product, returned as a packed u64 pair.  Returns
+// the high 64 bits in the high u64 of the .mantissa lane is not
+// possible — instead, we shift in-line and return the aligned mantissa
+// directly.  Splitting keeps register pressure low enough for greedy
+// regalloc on the single-A W65816.
+//
+// Inlinable on purpose: passing a pointer to a stack local across a
+// noinline boundary lowers to `sta (d,s),y` which uses DBR-relative
+// addressing — broken under DBR != 0 (e.g. after a bank switch).
+// Keeping these inline keeps the stores within the caller's frame.
+static inline u64 mulhi64Aligned(u64 ma, u64 mb, u16 *out_carry) {
    u32 alo = (u32)ma;
    u32 ahi = (u32)(ma >> 32);
    u32 blo = (u32)mb;
@@ -131,16 +140,26 @@ u64 __muldf3(u64 a, u64 b) {
    u64 hh = (u64)ahi * (u64)bhi;
    u64 mid = lh + hl + (ll >> 32);
    u64 prod_hi = hh + (mid >> 32);
-    s16 er = ea + eb;
-    while (prod_hi & ~(DMANT_LEAD | DMANT_MASK)) {
-        prod_hi >>= 1;
-        er++;
+    u64 prod_lo = (ll & 0xFFFFFFFFULL) | ((mid & 0xFFFFFFFFULL) << 32);
+    if (prod_hi & (1ULL << 41)) {
+        *out_carry = 1;
+        return (prod_hi << 11) | (prod_lo >> 53);
     }
-    while ((prod_hi & DMANT_LEAD) == 0 && prod_hi != 0) {
-        prod_hi <<= 1;
-        er--;
-    }
-    return dpack(sr, er, prod_hi);
+    *out_carry = 0;
+    return (prod_hi << 12) | (prod_lo >> 52);
+}
+
+u64 __muldf3(u64 a, u64 b) {
+    u64 sa, sb, ma, mb;
+    s16 ea, eb;
+    u16 ca = dclass(a, &sa, &ea, &ma);
+    u16 cb = dclass(b, &sb, &eb, &mb);
+    u64 sr = sa ^ sb;
+    if (ca == 0 || cb == 0) return sr;
+    u16 carry;
+    u64 mr = mulhi64Aligned(ma, mb, &carry);
+    s16 er = ea + eb + (s16)carry;
+    return dpack(sr, er, mr);
 }
 u64 __divdf3(u64 a, u64 b) {
@@ -151,26 +170,29 @@ u64 __divdf3(u64 a, u64 b) {
    u64 sr = sa ^ sb;
    if (ca == 0) return sr;
    if (cb == 0) return sr | DEXP_MASK;  // div-by-zero → inf
-    // Long division: shift a left by 11 to make room for quotient bits.
-    u64 q = 0;
-    u64 r = ma;
-    for (int i = 0; i < 53; i++) {
+    // Long division: handle the leading quotient bit explicitly (since
+    // we need to "consume" the dividend's leading 1 by subtracting),
+    // then generate 52 more fractional bits by shifting r left and
+    // testing.  The previous shift-and-test-only loop over-counted
+    // when r == mb after subtraction (e.g. 2.0/1.0 returned ~4.0).
+    s16 er = ea - eb;
+    // Normalize so the dividend is in [mb, 2*mb).  This ensures the
+    // leading quotient bit will land at position 52 below.
+    if (ma < mb) {
+        ma <<= 1;
+        er--;
+    }
+    // Handle the leading quotient bit explicitly.
+    u64 q = DMANT_LEAD;
+    u64 r = ma - mb;
+    // Compute 52 more fractional bits via standard shift-test-subtract.
+    for (int i = 51; i >= 0; i--) {
         r <<= 1;
-        q <<= 1;
         if (r >= mb) {
             r -= mb;
-            q |= 1;
+            q |= (1ULL << i);
         }
     }
-    s16 er = ea - eb;
-    while (q & ~(DMANT_LEAD | DMANT_MASK)) {
-        q >>= 1;
-        er++;
-    }
-    while ((q & DMANT_LEAD) == 0 && q != 0) {
-        q <<= 1;
-        er--;
-    }
     return dpack(sr, er, q);
 }
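The mantissa step of the reworked `__divdf3` can be exercised on a host in isolation. The following is a minimal sketch under stated assumptions, not the runtime's code: `divMantissa` is a hypothetical name, `MANT_LEAD` is assumed to equal the runtime's `DMANT_LEAD` (bit 52, the implicit leading 1), and both inputs are assumed to be normalized 53-bit mantissas.

```c
#include <stdint.h>

#define MANT_LEAD (1ULL << 52)   /* assumed value of DMANT_LEAD */

/* Divide two normalized 53-bit mantissas (bit 52 set in both) and return
 * the truncated 53-bit quotient mantissa.  *exp is decremented when the
 * dividend has to be doubled to bring it into [mb, 2*mb). */
static uint64_t divMantissa(uint64_t ma, uint64_t mb, int *exp) {
    if (ma < mb) {               /* normalize so the quotient is in [1, 2) */
        ma <<= 1;
        (*exp)--;
    }
    uint64_t q = MANT_LEAD;      /* leading quotient bit is always 1 here */
    uint64_t r = ma - mb;        /* consume it by subtracting once */
    for (int i = 51; i >= 0; i--) {
        r <<= 1;                 /* r < mb < 2^53, so this cannot overflow */
        if (r >= mb) {
            r -= mb;
            q |= (1ULL << i);    /* place the bit directly; q is never shifted */
        }
    }
    return q;
}
```

For 8.0 / 4.0 both mantissas are `MANT_LEAD`, so the remainder is zero from the start, the quotient stays at `MANT_LEAD`, and the exponent difference alone produces 2.0. Because the quotient bits are placed by index, the result is already normalized and needs no post-loop shifting.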
--- a/scripts/smokeTest.sh
+++ b/scripts/smokeTest.sh
@@ -1104,7 +1104,10 @@ int toInt(double x) { return (int)x; }
 double fromInt(int n) { return (double)n; }
 EOF
    "$CLANG" --target=w65816 -O2 -ffunction-sections -c "$cDblFile" -o "$oDblFile"
-    "$CLANG" --target=w65816 -O2 -ffunction-sections \
+    # softDouble.c uses -regalloc=fast because __muldf3's 64x64 -> 128
+    # multiply with the inlined alignment shifts overflows the greedy
+    # allocator's spill heuristics on the single-A target.
+    "$CLANG" --target=w65816 -O2 -ffunction-sections -mllvm -regalloc=fast \
        -c "$PROJECT_ROOT/runtime/src/softDouble.c" -o "$oSdFile"
    "$PROJECT_ROOT/tools/link816" -o "$binDblFile" \
        --text-base 0x8000 --map "$mapDblFile" \
@@ -1281,7 +1284,7 @@ int main(void) {
 }
 EOF
    "$CLANG" --target=w65816 -O2 -ffunction-sections -c "$cDblMame" -o "$oDblMame"
-    "$CLANG" --target=w65816 -O2 -ffunction-sections \
+    "$CLANG" --target=w65816 -O2 -ffunction-sections -mllvm -regalloc=fast \
        -c "$PROJECT_ROOT/runtime/src/softDouble.c" -o "$oSdMame"
    "$PROJECT_ROOT/tools/link816" -o "$binDblMame" \
        --text-base 0x1000 \
@@ -1402,7 +1405,7 @@ EOF
            -c "$PROJECT_ROOT/runtime/src/libc.c" -o "$oLibcF"
        "$CLANG" --target=w65816 -O2 -ffunction-sections \
            -c "$PROJECT_ROOT/runtime/src/softFloat.c" -o "$oSfF"
-        "$CLANG" --target=w65816 -O2 -ffunction-sections \
+        "$CLANG" --target=w65816 -O2 -ffunction-sections -mllvm -regalloc=fast \
            -c "$PROJECT_ROOT/runtime/src/softDouble.c" -o "$oSdF"
        oCrt0F="$(mktemp --suffix=.o)"
        "$PROJECT_ROOT/tools/llvm-mos-build/bin/llvm-mc" -arch=w65816 \
@@ -1708,9 +1711,10 @@ EOF
        fi
        rm -f "$cP2File" "$oP2File" "$binP2File"
-        # Bubble sort with the loop form that compiles correctly
-        # (i=1..n; inner j+1<n-i+1).  The other form `i<n-1; j<n-i-1`
-        # has an outstanding compiler bug (#65); use this canary form.
+        # Canonical bubble sort.  Both this form (`i < n-1; j < n-i-1`)
+        # and the alternate form work after the BranchExpand bridge
+        # fix.  Catches a regression in either BranchExpand or
+        # TiedDefSpill if the conditional flow gets miscompiled.
        log "check: MAME runs bubble sort [4,1,3,2] → [1,2,3,4]"
        cBsFile="$(mktemp --suffix=.c)"
        oBsFile="$(mktemp --suffix=.o)"
@@ -1721,8 +1725,8 @@ __attribute__((noinline)) void switchToBank2(void) {
 }
 unsigned short data[4] = { 4, 1, 3, 2 };
 __attribute__((noinline)) void bubbleSort(unsigned short *arr, unsigned short n) {
-    for (unsigned short i = 1; i < n; i++) {
-        for (unsigned short j = 0; j + 1 < n - i + 1; j++) {
+    for (unsigned short i = 0; i < n - 1; i++) {
+        for (unsigned short j = 0; j < n - i - 1; j++) {
            if (arr[j] > arr[j+1]) {
                unsigned short t = arr[j];
                arr[j] = arr[j+1];
@@ -1752,8 +1756,507 @@ EOF
                  0x025004=0003 0x025006=0004 >/dev/null 2>&1; then
            die "MAME: bubbleSort([4,1,3,2]) != [1,2,3,4]"
        fi
-        rm -f "$cBsFile" "$oBsFile" "$binBsFile" \
-              "$oLibcF" "$oSfF" "$oSdF" "$oCrt0F"
+        rm -f "$cBsFile" "$oBsFile" "$binBsFile"
+
        # printf("ABCDE") returns 5.  Canary for the BranchExpand
        # leftover-BRA-Skip bug: without removing the original BRA
        # after rewriting Bxx to INV_Bxx, the inserted Bridge MBB
        # becomes unreachable and the conditional flow is lost.  Also
        # exercises vprintf's main loop end-to-end (no varargs).
        log "check: MAME runs printf('ABCDE') → 5 (BranchExpand bridge regression)"
        cPfFile="$(mktemp --suffix=.c)"
        oPfFile="$(mktemp --suffix=.o)"
        binPfFile="$(mktemp --suffix=.bin)"
        cat > "$cPfFile" <<'EOF'
 #include <stdio.h>
 __attribute__((noinline)) void switchToBank2(void) {
    __asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
 }
 int main(void) {
    int r = printf("ABCDE");
    switchToBank2();
    *(volatile unsigned short *)0x5000 = (unsigned short)r;
    while (1) {}
 }
 EOF
        "$CLANG" --target=w65816 -O2 -ffunction-sections \
            -I"$PROJECT_ROOT/runtime/include" -c "$cPfFile" -o "$oPfFile"
        "$PROJECT_ROOT/tools/link816" -o "$binPfFile" --text-base 0x1000 \
            "$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oPfFile" \
            >/dev/null 2>&1
        if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" \
                  "$binPfFile" 0x025000 0005 >/dev/null 2>&1; then
            die "MAME: printf('ABCDE') != 5 (BranchExpand bridge regression)"
        fi
        rm -f "$cPfFile" "$oPfFile" "$binPfFile"
        # parse('BCDE') with switch-on-spec — used to fail to link with
        # PCREL8-out-of-range because long unconditional BRA didn't
        # auto-relax to BRL.  W65816BranchExpand now force-promotes
        # long BRA to BRL.
        log "check: MAME runs nested-loop+multiply f(4) → 120 (regalloc + BRA-relax)"
        cFnFile="$(mktemp --suffix=.c)"
        oFnFile="$(mktemp --suffix=.o)"
        binFnFile="$(mktemp --suffix=.bin)"
        cat > "$cFnFile" <<'EOF'
 __attribute__((noinline)) void switchToBank2(void) {
    __asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
 }
 __attribute__((noinline)) unsigned short f(unsigned short n) {
    unsigned short s = 0;
    for (unsigned short i = 0; i < n; i++)
        for (unsigned short j = 0; j < n; j++)
            s += i*n+j;
    return s;
 }
 int main(void) {
    unsigned short r = f(4);
    switchToBank2();
    *(volatile unsigned short *)0x5000 = r;
    while (1) {}
 }
 EOF
        "$CLANG" --target=w65816 -O2 -ffunction-sections -c \
            "$cFnFile" -o "$oFnFile"
        "$PROJECT_ROOT/tools/link816" -o "$binFnFile" --text-base 0x1000 \
            "$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oFnFile" \
            >/dev/null 2>&1
        if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" \
                  "$binFnFile" 0x025000 0078 >/dev/null 2>&1; then
            die "MAME: f(4) != 120 (regalloc + BRA-relax regression)"
        fi
        rm -f "$cFnFile" "$oFnFile" "$binFnFile"
        # u64add through a noinline boundary — exercises the
        # ADJCALLSTACKUP teardown's STA $E0 / LDA $E0 path that
        # preserves Y across the SP-restore.  The earlier PLY*N/2
        # implementation clobbered Y, so any i64 return came back
        # with the last popped arg in Y instead of the sum's mid-high.
        # Recursive u64 factorial — exercises __muldi3 + i64 ABI through
        # a recursive noinline boundary.  20! = 0x21c3_677c_82b4_0000.
        # Used to come back as garbage because __divmoddi4_stash read
        # caller args from slot 4 when it was actually JSR-called from
        # __muldi3 (so slot 4 was the JSL ret address byte, not a_mh).
        # dadd through a noinline boundary — exercises __adddf3 + the
        # full i64-return ABI through a real call.  The earlier soft-
        # double smoke test ran `c = 1.5 + 2.5` inline, which clang
        # constant-folds to a literal 0x4010... bit pattern — never
        # actually executed __adddf3.  This one calls a noinline
        # `dadd` so the libcall and the i64 ABI run end-to-end.
        # printf("%d", n) — used to crash MAME entirely because MachineCSE
        # eliminated the `if (isLong)` re-test of *fmt as a "redundant"
        # CMP (it had matched an earlier identical CMP), and the
        # surviving BNE then read whatever leftover P-flag state happened
        # to be in P from the last spec-dispatch CMP.  Backend now
        # disables MachineCSE entirely.
        log "check: MAME runs printf('%%d %%d', 42, 99) chain (MachineCSE disable)"
        cPdFile="$(mktemp --suffix=.c)"
        oPdFile="$(mktemp --suffix=.o)"
        binPdFile="$(mktemp --suffix=.bin)"
        cat > "$cPdFile" <<'EOF'
 #include <stdio.h>
 __attribute__((noinline)) void switchToBank2(void) {
    __asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
 }
 __attribute__((noinline)) int give42(void) { return 42; }
 int main(void) {
    // vprintf returns the increment count: 1 per format spec, 1 per
    // non-spec char.  "Hi %d ok\n" → H,i,' ',%d,' ',o,k,'\n' = 8.
    int n = printf("Hi %d ok\n", give42());
    switchToBank2();
    *(volatile unsigned short *)0x5000 = (unsigned short)n;
    while (1) {}
 }
 EOF
        "$CLANG" --target=w65816 -O2 -ffunction-sections \
            -I"$PROJECT_ROOT/runtime/include" -c \
            "$cPdFile" -o "$oPdFile"
        "$PROJECT_ROOT/tools/link816" -o "$binPdFile" --text-base 0x1000 \
            "$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oPdFile" \
            >/dev/null 2>&1
        if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" "$binPdFile" 0x025000 0008 \
                  >/dev/null 2>&1; then
            die "MAME: printf('Hi %d ok\\n', 42) != 8 (vprintf isLong / MachineCSE)"
        fi
        rm -f "$cPdFile" "$oPdFile" "$binPdFile"
        log "check: MAME runs noinline dadd(1.5,2.5) → 4.0 (__adddf3 + i64 ABI)"
        cDdFile="$(mktemp --suffix=.c)"
        oDdFile="$(mktemp --suffix=.o)"
        binDdFile="$(mktemp --suffix=.bin)"
        cat > "$cDdFile" <<'EOF'
 __attribute__((noinline)) void switchToBank2(void) {
    __asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
 }
 __attribute__((noinline)) double dadd(double a, double b) { return a + b; }
 int main(void) {
    union { double d; unsigned short w[4]; } u;
    u.d = dadd(1.5, 2.5);
    switchToBank2();
    *(volatile unsigned short *)0x5000 = u.w[0];
    *(volatile unsigned short *)0x5002 = u.w[1];
    *(volatile unsigned short *)0x5004 = u.w[2];
    *(volatile unsigned short *)0x5006 = u.w[3];
    while (1) {}
 }
 EOF
        "$CLANG" --target=w65816 -O2 -ffunction-sections -c \
            "$cDdFile" -o "$oDdFile"
        "$PROJECT_ROOT/tools/link816" -o "$binDdFile" --text-base 0x1000 \
            "$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oDdFile" \
            >/dev/null 2>&1
        if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" "$binDdFile" --check \
                  0x025000=0000 0x025002=0000 0x025004=0000 0x025006=4010 \
                  >/dev/null 2>&1; then
            die "MAME: noinline dadd(1.5,2.5) != 4.0 (i64-ABI through libcall)"
        fi
        rm -f "$cDdFile" "$oDdFile" "$binDdFile"
        log "check: MAME runs fact_u64(20) → 0x21c3677c82b40000 (__muldi3 stash slots)"
        cFkFile="$(mktemp --suffix=.c)"
        oFkFile="$(mktemp --suffix=.o)"
        binFkFile="$(mktemp --suffix=.bin)"
        cat > "$cFkFile" <<'EOF'
 __attribute__((noinline)) void switchToBank2(void) {
    __asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
 }
 __attribute__((noinline)) unsigned long long fact_u64(unsigned int n) {
    if (n <= 1) return 1ULL;
    return (unsigned long long)n * fact_u64(n - 1);
 }
 int main(void) {
    unsigned long long r = fact_u64(20);
    union { unsigned long long u; unsigned short w[4]; } u;
    u.u = r;
    switchToBank2();
    *(volatile unsigned short *)0x5000 = u.w[0];
    *(volatile unsigned short *)0x5002 = u.w[1];
    *(volatile unsigned short *)0x5004 = u.w[2];
    *(volatile unsigned short *)0x5006 = u.w[3];
    while (1) {}
 }
 EOF
        "$CLANG" --target=w65816 -O2 -ffunction-sections -c \
            "$cFkFile" -o "$oFkFile"
        "$PROJECT_ROOT/tools/link816" -o "$binFkFile" --text-base 0x1000 \
            "$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oFkFile" \
            >/dev/null 2>&1
        if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" "$binFkFile" --check \
                  0x025000=0000 0x025002=82b4 0x025004=677c 0x025006=21c3 \
                  >/dev/null 2>&1; then
            die "MAME: fact_u64(20) returned wrong bits (__muldi3 / stash slots)"
        fi
        rm -f "$cFkFile" "$oFkFile" "$binFkFile"
        log "check: MAME runs u64add(0x3FF8...,0x4004...) → 0x7FFC... (call-up Y-preserve)"
        cU64File="$(mktemp --suffix=.c)"
        oU64File="$(mktemp --suffix=.o)"
        binU64File="$(mktemp --suffix=.bin)"
        cat > "$cU64File" <<'EOF'
 __attribute__((noinline)) void switchToBank2(void) {
    __asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
 }
 __attribute__((noinline)) unsigned long long u64add(unsigned long long a, unsigned long long b) {
    return a + b;
 }
 int main(void) {
    unsigned long long c = u64add(0x3FF8000000000000ULL, 0x4004000000000000ULL);
    union { unsigned long long u; unsigned short w[4]; } u;
    u.u = c;
    switchToBank2();
    *(volatile unsigned short *)0x5000 = u.w[0];
    *(volatile unsigned short *)0x5002 = u.w[1];
    *(volatile unsigned short *)0x5004 = u.w[2];
    *(volatile unsigned short *)0x5006 = u.w[3];
    while (1) {}
 }
 EOF
        "$CLANG" --target=w65816 -O2 -ffunction-sections -c \
            "$cU64File" -o "$oU64File"
        "$PROJECT_ROOT/tools/link816" -o "$binU64File" --text-base 0x1000 \
            "$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oU64File" \
            >/dev/null 2>&1
        if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" "$binU64File" --check \
                  0x025000=0000 0x025002=0000 0x025004=0000 0x025006=7ffc \
                  >/dev/null 2>&1; then
            die "MAME: u64add through noinline returned wrong middle halves (call-up Y-clobber)"
        fi
        rm -f "$cU64File" "$oU64File" "$binU64File"
        log "check: MAME runs addOff(p,1) p[0]+=p[1] → 12 (StackSlotCleanup killed-Y respect)"
        cAofFile="$(mktemp --suffix=.c)"
        oAofFile="$(mktemp --suffix=.o)"
        binAofFile="$(mktemp --suffix=.bin)"
        cat > "$cAofFile" <<'EOF'
 __attribute__((noinline)) void switchToBank2(void) {
    __asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
 }
 __attribute__((noinline)) short addOff(short *p, short i) {
    short b = p[i];
    p[i-1] = p[i-1] + b;
    return p[i-1];
 }
 int main(void) {
    short stk[2] = { 5, 7 };
    short r = addOff(stk, 1);
    short s0 = stk[0];
    switchToBank2();
    *(volatile unsigned short *)0x5000 = (unsigned short)r;
    *(volatile unsigned short *)0x5002 = (unsigned short)s0;
    while (1) {}
 }
 EOF
        "$CLANG" --target=w65816 -O2 -ffunction-sections -c \
            "$cAofFile" -o "$oAofFile"
        "$PROJECT_ROOT/tools/link816" -o "$binAofFile" --text-base 0x1000 \
            "$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oAofFile" \
            >/dev/null 2>&1
        if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" \
                  "$binAofFile" --check 0x025000=000c 0x025002=000c \
                  >/dev/null 2>&1; then
            die "MAME: addOff p[i-1]+=p[i] returned wrong store (NegYIndY/X-clobber or LDY-erase)"
        fi
        rm -f "$cAofFile" "$oAofFile" "$binAofFile"
        log "check: MAME runs sqr(10) → 100 (frame-less ADJCALLSTACKUP must emit PLY)"
        cSqrFile="$(mktemp --suffix=.c)"
        oSqrFile="$(mktemp --suffix=.o)"
        binSqrFile="$(mktemp --suffix=.bin)"
        cat > "$cSqrFile" <<'EOF'
 __attribute__((noinline)) void switchToBank2(void) {
    __asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
 }
 __attribute__((noinline)) unsigned short sqr(unsigned short x) { return x * x; }
 int main(void) {
    unsigned short r = sqr(10);
    switchToBank2();
    *(volatile unsigned short *)0x5000 = r;
    while (1) {}
 }
 EOF
        "$CLANG" --target=w65816 -O2 -ffunction-sections -c \
            "$cSqrFile" -o "$oSqrFile"
        "$PROJECT_ROOT/tools/link816" -o "$binSqrFile" --text-base 0x1000 \
            "$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oSqrFile" \
            >/dev/null 2>&1
        if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" \
                  "$binSqrFile" --check 0x025000=0064 >/dev/null 2>&1; then
            die "MAME: sqr(10) crashed or != 100 (ADJCALLSTACKUP not emitting PLY for frame-less)"
        fi
        rm -f "$cSqrFile" "$oSqrFile" "$binSqrFile"
        log "check: MAME runs ddiv(8.0,4.0) → 2.0 (__divdf3 algorithm fix)"
        cDdvFile="$(mktemp --suffix=.c)"
        oDdvFile="$(mktemp --suffix=.o)"
        binDdvFile="$(mktemp --suffix=.bin)"
        cat > "$cDdvFile" <<'EOF'
 __attribute__((noinline)) void switchToBank2(void) {
    __asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
 }
 __attribute__((noinline)) double ddiv(double a, double b) { return a / b; }
 int main(void) {
    union { double d; unsigned short w[4]; } u;
    u.d = ddiv(8.0, 4.0);
    switchToBank2();
    *(volatile unsigned short *)0x5000 = u.w[0];
    *(volatile unsigned short *)0x5002 = u.w[1];
    *(volatile unsigned short *)0x5004 = u.w[2];
    *(volatile unsigned short *)0x5006 = u.w[3];
    while (1) {}
 }
 EOF
        "$CLANG" --target=w65816 -O2 -ffunction-sections -c \
            "$cDdvFile" -o "$oDdvFile"
        "$PROJECT_ROOT/tools/link816" -o "$binDdvFile" --text-base 0x1000 \
            "$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oDdvFile" \
            >/dev/null 2>&1
        if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" \
                  "$binDdvFile" --check 0x025000=0000 0x025002=0000 \
                  0x025004=0000 0x025006=4000 >/dev/null 2>&1; then
            die "MAME: ddiv(8,4) != 2.0 (__divdf3 long-division bug)"
        fi
        rm -f "$cDdvFile" "$oDdvFile" "$binDdvFile"
        log "check: MAME runs Newton-iter loop → high-half ~1.41 (BranchExpand self-loop BRA fix)"
        cSqFile="$(mktemp --suffix=.c)"
        oSqFile="$(mktemp --suffix=.o)"
        binSqFile="$(mktemp --suffix=.bin)"
        # 3-iter Newton-method sqrt with a counted for-loop (the loop-back
        # BRA is a self-loop, which the BranchExpand distance estimator
        # used to report as 0 bytes, so it never promoted to BRL even
        # when the loop body grew well past +/-128 bytes).
        cat > "$cSqFile" <<'EOF'
 __attribute__((noinline)) void switchToBank2(void) {
    __asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
 }
 __attribute__((noinline)) double sqrt3(double x) {
    double g = x * 0.5;
    for (unsigned short i = 0; i < 3; i++)
        g = (g + x / g) * 0.5;
    return g;
 }
 int main(void) {
    union { double d; unsigned short w[4]; } u;
    u.d = sqrt3(2.0);
    switchToBank2();
    // Only the high half is precision-stable (low halves vary slightly
    // due to truncation vs round-to-nearest in __divdf3).  Verify just
    // the high half — that's enough to prove the self-loop BRA was
    // promoted (the link would have failed otherwise) and __divdf3 is
    // converging to the right magnitude.
    *(volatile unsigned short *)0x5006 = u.w[3];
    while (1) {}
 }
 EOF
        "$CLANG" --target=w65816 -O2 -ffunction-sections -c \
            "$cSqFile" -o "$oSqFile"
        "$PROJECT_ROOT/tools/link816" -o "$binSqFile" --text-base 0x1000 \
            "$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oSqFile" \
            >/dev/null 2>&1
        if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" \
                  "$binSqFile" --check 0x025006=3ff6 >/dev/null 2>&1; then
            die "MAME: sqrt3(2.0) high half wrong (self-loop BRA / __divdf3)"
        fi
        rm -f "$cSqFile" "$oSqFile" "$binSqFile"
        log "check: MAME runs -O0 addOne(7) → 8 (lda-overwrite-immediate fix; fast regalloc)"
        cO0File="$(mktemp --suffix=.c)"
        oO0File="$(mktemp --suffix=.o)"
        binO0File="$(mktemp --suffix=.bin)"
        cat > "$cO0File" <<'EOF'
 __attribute__((noinline)) void switchToBank2(void) {
    __asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
 }
 unsigned short addOne(unsigned short a) { return a + 1; }
 int main(void) {
    unsigned short r = addOne(7);
    switchToBank2();
    *(volatile unsigned short *)0x5000 = r;
    while (1) {}
 }
 EOF
        "$CLANG" --target=w65816 -O0 -ffunction-sections -c \
            "$cO0File" -o "$oO0File"
        "$PROJECT_ROOT/tools/link816" -o "$binO0File" --text-base 0x1000 \
            "$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oO0File" \
            >/dev/null 2>&1
        if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" \
                  "$binO0File" --check 0x025000=0008 >/dev/null 2>&1; then
            die "MAME: -O0 addOne(7) != 8 (lda overwrite immediate / regalloc choice)"
        fi
        rm -f "$cO0File" "$oO0File" "$binO0File"
        log "check: MAME runs bubble sort with mySwap helper [4,1,3,2] → [1,2,3,4] (greedy across helper-call)"
        cBshFile="$(mktemp --suffix=.c)"
        oBshFile="$(mktemp --suffix=.o)"
        binBshFile="$(mktemp --suffix=.bin)"
        cat > "$cBshFile" <<'EOF'
 __attribute__((noinline)) void switchToBank2(void) {
    __asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
 }
 unsigned short bsdata[4] = { 4, 1, 3, 2 };
 __attribute__((noinline)) void mySwap(unsigned short *a, unsigned short *b) {
    unsigned short t = *a; *a = *b; *b = t;
 }
 __attribute__((noinline)) void mySort(unsigned short *arr, unsigned short n) {
    for (unsigned short i = 0; i < n - 1; i++)
        for (unsigned short j = 0; j < n - i - 1; j++)
            if (arr[j] > arr[j+1])
                mySwap(&arr[j], &arr[j+1]);
 }
 int main(void) {
    mySort(bsdata, 4);
    unsigned short d0 = bsdata[0], d1 = bsdata[1], d2 = bsdata[2], d3 = bsdata[3];
    switchToBank2();
    *(volatile unsigned short *)0x5000 = d0;
    *(volatile unsigned short *)0x5002 = d1;
    *(volatile unsigned short *)0x5004 = d2;
    *(volatile unsigned short *)0x5006 = d3;
    while (1) {}
 }
 EOF
        "$CLANG" --target=w65816 -O2 -ffunction-sections -c \
            "$cBshFile" -o "$oBshFile"
        "$PROJECT_ROOT/tools/link816" -o "$binBshFile" --text-base 0x1000 \
            "$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oBshFile" \
            >/dev/null 2>&1
        if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" \
                  "$binBshFile" --check 0x025000=0001 0x025002=0002 \
                  0x025004=0003 0x025006=0004 >/dev/null 2>&1; then
            die "MAME: mySort with mySwap helper miscompiled (greedy regalloc across call)"
        fi
        rm -f "$cBshFile" "$oBshFile" "$binBshFile"
        log "check: MAME runs dmul(8.0,2.0) AFTER bank-switch → 16.0 (DPF0 store + __muldf3)"
        cDmFile="$(mktemp --suffix=.c)"
        oDmFile="$(mktemp --suffix=.o)"
        binDmFile="$(mktemp --suffix=.bin)"
        cat > "$cDmFile" <<'EOF'
 __attribute__((noinline)) void switchToBank2(void) {
    __asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
 }
 __attribute__((noinline)) double dmul(double a, double b) { return a * b; }
 int main(void) {
    union { double d; unsigned short w[4]; } u;
    switchToBank2();
    u.d = dmul(8.0, 2.0);
    *(volatile unsigned short *)0x5000 = u.w[0];
    *(volatile unsigned short *)0x5002 = u.w[1];
    *(volatile unsigned short *)0x5004 = u.w[2];
    *(volatile unsigned short *)0x5006 = u.w[3];
    while (1) {}
 }
 EOF
        "$CLANG" --target=w65816 -O2 -ffunction-sections -c \
            "$cDmFile" -o "$oDmFile"
        "$PROJECT_ROOT/tools/link816" -o "$binDmFile" --text-base 0x1000 \
            "$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oDmFile" \
            >/dev/null 2>&1
        if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" "$binDmFile" --check \
                  0x025000=0000 0x025002=0000 0x025004=0000 0x025006=4030 \
                  >/dev/null 2>&1; then
            die "MAME: dmul(8,2) under DBR=2 produced wrong bits (DPF0 store / __muldf3)"
        fi
        rm -f "$cDmFile" "$oDmFile" "$binDmFile"
        log "check: MAME runs dmath = (a+b)*(a-b), 5,3 → 16.0 (chained libcall ABI)"
        cDmaFile="$(mktemp --suffix=.c)"
        oDmaFile="$(mktemp --suffix=.o)"
        binDmaFile="$(mktemp --suffix=.bin)"
        cat > "$cDmaFile" <<'EOF'
 __attribute__((noinline)) void switchToBank2(void) {
    __asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
 }
 __attribute__((noinline)) double dadd(double a, double b) { return a + b; }
 __attribute__((noinline)) double dsub(double a, double b) { return a - b; }
 __attribute__((noinline)) double dmul(double a, double b) { return a * b; }
 __attribute__((noinline)) double dmath(double a, double b) {
    return dmul(dadd(a, b), dsub(a, b));
 }
 int main(void) {
    union { double d; unsigned short w[4]; } u;
    u.d = dmath(5.0, 3.0);
    switchToBank2();
    *(volatile unsigned short *)0x5000 = u.w[0];
    *(volatile unsigned short *)0x5002 = u.w[1];
    *(volatile unsigned short *)0x5004 = u.w[2];
    *(volatile unsigned short *)0x5006 = u.w[3];
    while (1) {}
 }
 EOF
        "$CLANG" --target=w65816 -O2 -ffunction-sections -c \
            "$cDmaFile" -o "$oDmaFile"
        "$PROJECT_ROOT/tools/link816" -o "$binDmaFile" --text-base 0x1000 \
            "$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oDmaFile" \
            >/dev/null 2>&1
        if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" "$binDmaFile" --check \
                  0x025000=0000 0x025002=0000 0x025004=0000 0x025006=4030 \
                  >/dev/null 2>&1; then
            die "MAME: dmath(5,3) returned wrong high half (DP[\$F0] CSE across libcalls)"
        fi
        rm -f "$cDmaFile" "$oDmaFile" "$binDmaFile"
        rm -f "$oLibcF" "$oSfF" "$oSdF" "$oCrt0F"
    else
        warn "MAME or apple2gs ROMs not installed; skipping end-to-end test"
    fi
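The `--check` values in the double tests above follow from the IEEE-754 bit patterns: `u.w[0]` is the least significant 16-bit word on the little-endian 65816, so 2.0 (0x4000000000000000) and 16.0 (0x4030000000000000) leave the only nonzero word at `0x025006`. A minimal host-side sketch for deriving such words, assuming the host `double` is IEEE-754 binary64; `double_word` and `sqrt3_host` are hypothetical helper names, not part of the test script:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Return 16-bit word `idx` (0 = least significant) of a double's bit
   pattern.  Assumption: the host double is IEEE-754 binary64. */
uint16_t double_word(double d, int idx) {
    uint64_t bits;
    assert(idx >= 0 && idx < 4);
    assert(sizeof bits == sizeof d);
    memcpy(&bits, &d, sizeof bits);   /* well-defined type pun */
    return (uint16_t)(bits >> (16 * idx));
}

/* Host-side copy of the test's sqrt3: three Newton steps from x/2. */
double sqrt3_host(double x) {
    double g = x * 0.5;
    for (int i = 0; i < 3; i++)
        g = (g + x / g) * 0.5;
    return g;
}
```

Here `double_word(2.0, 3)` gives `0x4000` and `double_word(16.0, 3)` gives `0x4030`, matching the `ddiv` and `dmul`/`dmath` checks, and the high word of `sqrt3_host(2.0)` is the `0x3ff6` that the `sqrt3` test expects.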
--- a/src/llvm/lib/Target/W65816/W65816ABridgeViaX.cpp
+++ b/src/llvm/lib/Target/W65816/W65816ABridgeViaX.cpp
@ -131,6 +131,7 @@ static bool clobbersImg(const MachineInstr &MI,
 bool W65816ABridgeViaX::runOnMachineFunction(MachineFunction &MF) {
  if (!MF.getRegInfo().getNumVirtRegs()) return false;
  if (MF.getFunction().hasOptNone()) return false;
  MachineRegisterInfo &MRI = MF.getRegInfo();
  const W65816Subtarget &STI = MF.getSubtarget<W65816Subtarget>();
  const W65816InstrInfo *TII = STI.getInstrInfo();
--- a/src/llvm/lib/Target/W65816/W65816AsmPrinter.cpp
+++ b/src/llvm/lib/Target/W65816/W65816AsmPrinter.cpp
@ -83,21 +83,71 @@ void W65816AsmPrinter::emitInstruction(const MachineInstr *MI) {
  switch (MI->getOpcode()) {
  default:
    break;
-  case W65816::ADJCALLSTACKDOWN:
+  case W65816::ADJCALLSTACKDOWN: {
    // DOWN is a no-op in our scheme — the PUSH16 sequence in LowerCall
    // already shifted SP incrementally as args were pushed.  Nothing
    // to emit; PEI may or may not have processed it, either is fine.
    return;
  }
  case W65816::ADJCALLSTACKUP: {
-    // PEI's eliminateCallFramePseudoInstr removes these *only* when the
-    // function has frame work (StackSize > 0 or any FrameIndex use).
-    // Functions that just tail-call into a libcall (e.g. `int toInt(float
-    // x) { return (int)x; }` lowers to a single jsl __fixsfsi) have
-    // neither; PEI skips its call-frame phase and the pseudo survives
-    // to MC.  AsmStreamer renders the pseudo's "# ADJCALLSTACK..."
-    // string as a comment, but MCObjectStreamer asks the encoder to
-    // emit bytes — which fails ("Unsupported instruction MCInst 337").
-    // Dropping it here is correct: when amt is zero (the "no frame"
-    // path) the call sequence is a no-op anyway; when non-zero, PEI
-    // would have replaced it with PLA-loop / TSC-ADC sequence already.
-    // If we ever see a non-zero amount slip through, that's a real
-    // bug — emit nothing and trust the comment-stripped path.
+    // PEI's eliminateCallFramePseudoInstr handles UP whenever the
+    // function has any frame work (StackSize > 0 or any FI use).
+    // Frame-less functions — e.g. `unsigned short sqr(unsigned short
+    // x) { return x*x; }` lowers to PUSH16 + jsl __mulhi3 + RTL with
+    // no locals — get skipped by PEI's call-frame phase, leaving
+    // ADJCALLSTACKUP as a pseudo all the way to here.  Previously we
+    // silently dropped it, which left SP off by N bytes after the
+    // call and corrupted the caller's stack frame (caught by sqr(x)
+    // segfaulting MAME).  Emit the SP fixup ourselves: PLY*N/2 for
+    // small even N, otherwise the TAY/TSC-ADC/TYA bracket.
+    int N = MI->getOperand(0).getImm();
+    if (N == 0) return;
+    // A holds the callee's return value; preserve it.  Walk forward
    // looking for Y uses (an i64-return half); same scan as
    // eliminateCallFramePseudoInstr, minus its X tracking.
    bool YLive = false;
    for (auto J = std::next(MI->getIterator()); J != MI->getParent()->end();
         ++J) {
      if (J->isCall()) break;
      bool yDef = false;
      for (const MachineOperand &MO : J->operands()) {
        if (!MO.isReg()) continue;
        if (MO.getReg() == W65816::Y) {
          if (MO.isUse()) { YLive = true; break; }
          if (MO.isDef()) yDef = true;
        }
      }
      if (YLive || yDef) break;
    }
    if (YLive) {
      // Route through DP $E0 to preserve both A and Y.
      MCInst Sta; Sta.setOpcode(W65816::STA_DP);
      Sta.addOperand(MCOperand::createImm(0xE0));
      EmitToStreamer(*OutStreamer, Sta);
      MCInst Tsc; Tsc.setOpcode(W65816::TSC); EmitToStreamer(*OutStreamer, Tsc);
      MCInst Clc; Clc.setOpcode(W65816::CLC); EmitToStreamer(*OutStreamer, Clc);
      MCInst Adc; Adc.setOpcode(W65816::ADC_Imm16);
      Adc.addOperand(MCOperand::createImm(N));
      EmitToStreamer(*OutStreamer, Adc);
      MCInst Tcs; Tcs.setOpcode(W65816::TCS); EmitToStreamer(*OutStreamer, Tcs);
      MCInst Lda; Lda.setOpcode(W65816::LDA_DP);
      Lda.addOperand(MCOperand::createImm(0xE0));
      EmitToStreamer(*OutStreamer, Lda);
    } else if (N <= 14 && (N % 2) == 0) {
      for (int i = 0; i < N / 2; ++i) {
        MCInst Ply; Ply.setOpcode(W65816::PLY);
        EmitToStreamer(*OutStreamer, Ply);
      }
    } else {
      MCInst Tay; Tay.setOpcode(W65816::TAY); EmitToStreamer(*OutStreamer, Tay);
      MCInst Tsc; Tsc.setOpcode(W65816::TSC); EmitToStreamer(*OutStreamer, Tsc);
      MCInst Clc; Clc.setOpcode(W65816::CLC); EmitToStreamer(*OutStreamer, Clc);
      MCInst Adc; Adc.setOpcode(W65816::ADC_Imm16);
      Adc.addOperand(MCOperand::createImm(N));
      EmitToStreamer(*OutStreamer, Adc);
      MCInst Tcs; Tcs.setOpcode(W65816::TCS); EmitToStreamer(*OutStreamer, Tcs);
      MCInst Tya; Tya.setOpcode(W65816::TYA); EmitToStreamer(*OutStreamer, Tya);
    }
    return;
  }
  case W65816::LDXi16imm: {
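The branch order in the ADJCALLSTACKUP case above can be summarized as a small decision function. This is a sketch of the selection logic only, with hypothetical names (`choose_teardown` and the `teardown` enum); it emits no instructions and is not the AsmPrinter code:

```c
#include <assert.h>

enum teardown {
    TEARDOWN_NONE,         /* N == 0: nothing to release */
    TEARDOWN_VIA_DP,       /* STA $E0 / TSC / CLC / ADC #N / TCS / LDA $E0 */
    TEARDOWN_PLY,          /* N/2 times PLY (clobbers Y, preserves A) */
    TEARDOWN_TAY_BRACKET   /* TAY / TSC / CLC / ADC #N / TCS / TYA */
};

/* Pick the SP-fixup sequence for releasing n pushed bytes after a call.
   y_live is nonzero when Y still carries part of the return value. */
enum teardown choose_teardown(int n, int y_live) {
    assert(n >= 0);
    if (n == 0)
        return TEARDOWN_NONE;
    if (y_live)
        return TEARDOWN_VIA_DP;        /* neither PLY nor TAY may touch Y */
    if (n <= 14 && n % 2 == 0)
        return TEARDOWN_PLY;
    return TEARDOWN_TAY_BRACKET;
}
```

The Y-live test comes first on purpose: both cheaper sequences use Y as scratch, so they are only candidates once Y is known to be dead.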
--- a/src/llvm/lib/Target/W65816/W65816BranchExpand.cpp
+++ b/src/llvm/lib/Target/W65816/W65816BranchExpand.cpp
@ -46,6 +46,7 @@
 #include "llvm/CodeGen/MachineFunctionPass.h"
 #include "llvm/CodeGen/MachineInstr.h"
 #include "llvm/CodeGen/MachineInstrBuilder.h"
 #include "llvm/Support/raw_ostream.h"
 using namespace llvm;
@ -100,7 +101,17 @@ static unsigned estimateDistance(MachineFunction &MF,
                                 const MachineInstr &Br,
                                 MachineBasicBlock *To) {
  const MachineBasicBlock *From = Br.getParent();
-  if (From == To) return 0;
+  // Self-loop branch: target is the start of From, branch is somewhere
  // inside From.  Distance is the bytes from start of From to the
  // branch instruction (i.e., everything before Br in From).
  if (From == To) {
    unsigned Bytes = 0;
    for (const auto &MI : *From) {
      if (&MI == &Br) break;
      Bytes += TII->getInstSizeInBytes(MI);
    }
    return Bytes;
  }
  // Two cases by layout direction:
  //   forward: bytes after Br in From, plus all of MBBs strictly
@ -276,11 +287,30 @@ bool W65816BranchExpand::runOnMachineFunction(MachineFunction &MF) {
  // Step 2: iterate to fixed-point.  Each expansion adds 3 bytes
  // (bridge BRL), which may push another previously-OK branch over
  // the threshold.  Cap at MAX_ITERS to avoid pathological cases.
-  const unsigned EXPAND_DIST_THRESHOLD = 100;  // safe under +/-128
+  const unsigned EXPAND_DIST_THRESHOLD = 90;  // tighter margin under +/-128
  const unsigned MAX_ITERS = 10;
  for (unsigned iter = 0; iter < MAX_ITERS; ++iter) {
    bool Changed = false;
    // Promote long BRA to BRL.  The assembler's BRA→BRL relaxation
    // sometimes fails to fire when the target symbol resolves early
    // in MC layout — the linker then sees a PCREL8 reloc that's out
    // of range.  Force the BRL ourselves when the estimate exceeds
    // the safe threshold; this costs one extra byte if BRA would have
    // fit, but beats a hard link error.
    for (auto &MBB : MF) {
      for (auto &MI : MBB.terminators()) {
        if (MI.getOpcode() != W65816::BRA) continue;
        if (MI.getNumOperands() < 1 || !MI.getOperand(0).isMBB()) continue;
        MachineBasicBlock *Target = MI.getOperand(0).getMBB();
        unsigned Dist = estimateDistance(MF, TII, MI, Target);
        if (Dist > EXPAND_DIST_THRESHOLD) {
          MI.setDesc(TII->get(W65816::BRL));
          Changed = true;
        }
      }
    }
    // Collect candidates.  After step 1, each MBB has at most one
    // conditional terminator, so we walk terminators().
    SmallVector<std::pair<MachineBasicBlock *, MachineInstr *>, 8> Candidates;
@ -337,6 +367,27 @@ bool W65816BranchExpand::runOnMachineFunction(MachineFunction &MF) {
      // fall-through marker after stays after.
      auto insertPt = MBB->getFirstTerminator();
      BuildMI(*MBB, insertPt, DL, TII->get(InvOpc)).addMBB(Skip);
      // After the rewrite, MBB falls through to Bridge (which now sits
      // immediately after MBB in layout).  Any unconditional BRA/BRL
      // already at the end of MBB used to direct the fall-through to
      // Skip — but with Bridge interposed, that BRA would skip past
      // Bridge entirely and Bridge becomes unreachable.  Remove it.
      // (Skip is still reachable via INV_Bxx; Target is reachable via
      // fall-through-to-Bridge then BRL.)  Caught by vprintf crashing
      // because dropDeadConditionalsToBRATarget then dropped the
      // INV_Bxx as redundant with the leftover BRA Skip.
      while (insertPt != MBB->end()) {
        unsigned NextOpc = insertPt->getOpcode();
        if (NextOpc == W65816::BRA || NextOpc == W65816::BRL) {
          if (insertPt->getNumOperands() >= 1 &&
              insertPt->getOperand(0).isMBB() &&
              insertPt->getOperand(0).getMBB() == Skip) {
            insertPt = insertPt->eraseFromParent();
            continue;
          }
        }
        ++insertPt;
      }
      // Bridge: BRL Target.  Always emit the long form rather than
      // relying on the assembler to relax BRA→BRL — the relaxation
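The self-loop case added to `estimateDistance` amounts to summing the encoded sizes of everything that precedes the branch in its own block. A standalone model, assuming instruction sizes are supplied as a plain array (the real pass asks `TII->getInstSizeInBytes`); `self_loop_distance` and `needs_brl` are illustrative names, and 90 is the `EXPAND_DIST_THRESHOLD` value from this commit:

```c
#include <assert.h>
#include <stddef.h>

#define EXPAND_DIST_THRESHOLD 90u

/* Bytes from the start of a block to the branch at index br_idx,
   given the encoded size of each instruction in that block. */
unsigned self_loop_distance(const unsigned *sizes, size_t count,
                            size_t br_idx) {
    unsigned bytes = 0;
    assert(br_idx < count);
    for (size_t i = 0; i < br_idx; i++)
        bytes += sizes[i];          /* everything before the branch */
    return bytes;
}

/* Nonzero when the estimate is past the safe margin for an 8-bit
   relative branch, so the long form should be forced. */
int needs_brl(unsigned dist) {
    return dist > EXPAND_DIST_THRESHOLD;
}
```

With the old `return 0` for self-loops, `needs_brl` could never fire for a loop-back branch, regardless of how large the loop body was.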
--- a/src/llvm/lib/Target/W65816/W65816FrameLowering.cpp
+++ b/src/llvm/lib/Target/W65816/W65816FrameLowering.cpp
@ -162,15 +162,39 @@ void W65816FrameLowering::emitEpilogue(MachineFunction &MF,
  // Insert before the terminator (the return).
  DebugLoc DL = MBBI != MBB.end() ? MBBI->getDebugLoc() : DebugLoc();
  // Detect whether the return live-out includes Y or X — for i64 returns
  // (Outs[0..2] -> A,X,Y), Y holds bits 32-47 and X holds bits 16-31, so
  // any TAY/PLY/TAX in the SP-restore would corrupt the return value.
  // The RTL terminator carries implicit-uses for every live-out return
  // register; scan them to decide which scratch we can use safely.
  bool YLive = false;
  bool XLive = false;
  if (MBBI != MBB.end() && MBBI->isReturn()) {
    for (const MachineOperand &MO : MBBI->operands()) {
      if (!MO.isReg() || !MO.isImplicit() || !MO.isUse()) continue;
      if (MO.getReg() == W65816::Y) YLive = true;
      else if (MO.getReg() == W65816::X) XLive = true;
    }
  }
  // VLA cleanup: restore entry SP from DP $F4 (saved in prologue).
  // This subsumes BOTH the static frame and any dynamic_stackalloc
  // bytes — we can skip the per-byte PLY/PLA loop entirely.  Preserve
-  // A through TAY/TYA since it holds the return value.
+  // A through TAY/TYA since it holds the return value.  For i64
  // returns where Y is also live, route the save through DP $E0
  // ($E0..$EF is libcall scratch — guaranteed dead by epilogue time).
  if (HasVLA) {
    if (YLive) {
      BuildMI(MBB, MBBI, DL, TII.get(W65816::STA_DP)).addImm(0xE0);
      BuildMI(MBB, MBBI, DL, TII.get(W65816::LDA_DP)).addImm(0xF4);
      BuildMI(MBB, MBBI, DL, TII.get(W65816::TCS));
      BuildMI(MBB, MBBI, DL, TII.get(W65816::LDA_DP)).addImm(0xE0);
    } else {
      BuildMI(MBB, MBBI, DL, TII.get(W65816::TAY));
      BuildMI(MBB, MBBI, DL, TII.get(W65816::LDA_DP)).addImm(0xF4);
      BuildMI(MBB, MBBI, DL, TII.get(W65816::TCS));
      BuildMI(MBB, MBBI, DL, TII.get(W65816::TYA));
    }
    return;
  }
@ -182,11 +206,26 @@ void W65816FrameLowering::emitEpilogue(MachineFunction &MF,
  // N/2 PLY (pop into Y, discard); larger frames use
  // TAY/TSC/CLC/ADC #N/TCS/TYA.
  // Mirror the prologue threshold (see comment there).
-  if (StackSize <= 6 && (StackSize % 2) == 0) {
+  if (StackSize <= 6 && (StackSize % 2) == 0 && !YLive) {
    // PLY clobbers Y, which is fine when Y isn't a return reg.
    for (uint64_t i = 0; i < StackSize / 2; ++i)
      BuildMI(MBB, MBBI, DL, TII.get(W65816::PLY));
    return;
  }
  if (YLive) {
    // Y is a return register (i64 / double).  Save A via DP $E0
    // instead of TAY so Y survives.  4 cyc slower than TAY/TYA but
    // correct.  X is allowed to be live too — none of these touch X.
    BuildMI(MBB, MBBI, DL, TII.get(W65816::STA_DP)).addImm(0xE0);
    BuildMI(MBB, MBBI, DL, TII.get(W65816::TSC));
    BuildMI(MBB, MBBI, DL, TII.get(W65816::CLC));
    BuildMI(MBB, MBBI, DL, TII.get(W65816::ADC_Imm16))
        .addImm(StackSize);
    BuildMI(MBB, MBBI, DL, TII.get(W65816::TCS));
    BuildMI(MBB, MBBI, DL, TII.get(W65816::LDA_DP)).addImm(0xE0);
    (void)XLive;
    return;
  }
  BuildMI(MBB, MBBI, DL, TII.get(W65816::TAY));
  BuildMI(MBB, MBBI, DL, TII.get(W65816::TSC));
  BuildMI(MBB, MBBI, DL, TII.get(W65816::CLC));
@ -207,15 +246,56 @@ MachineBasicBlock::iterator W65816FrameLowering::eliminateCallFramePseudoInstr(
  // ADJCALLSTACKUP releases all the pushed bytes after a call.
  //
  // Critical: A holds the callee's return value here, so this MUST NOT
-  // clobber A.  The naive `tsc;clc;adc #N;tcs` does (TSC overwrites A),
-  // which silently corrupts every call's return value.  Same fix as the
-  // epilogue: small N via PLY (clobbers Y, preserves A); larger N via
-  // TAY/.../TYA bracket.
+  // clobber A.  PLY (small-N path) clobbers Y; TAY/.../TYA bracket
+  // (large-N path) also clobbers Y.  Both are fine for i8/i16/i32
+  // returns but DESTROY the return for i64/double (where X and Y hold
+  // mid halves).  Detect i64-return calls by walking forward from
  // ADJCALLSTACKUP for uses of $x/$y; if found, save A via DP $E0
  // (libcall scratch, dead by call-up time) so X and Y survive.
  // Caught by `unsigned long long u64add(a,b)` through a noinline
  // boundary returning Y = b_hi (the last popped) instead of the
  // sum's mid-high.
  if (I->getOpcode() == W65816::ADJCALLSTACKUP) {
    int N = I->getOperand(0).getImm();
    if (N > 0) {
      DebugLoc DL = I->getDebugLoc();
-      if (N <= 14 && (N % 2) == 0) {
+      bool YLive = false;
      bool XLive = false;
      // Walk forward looking for COPY %vreg = $x / $y — LowerCall's
      // pattern for materializing return halves.  JSLpseudo's tablegen
      // declares only `Defs=[A]`, so implicit-defs of X/Y aren't on
      // the call op itself.  We have to read what comes after.
      // Stop at the next call (re-clobbers everything) or at any def
      // of X/Y (cancels their post-call value).
      bool Stopped = false;
      for (auto J = std::next(I); J != MBB.end() && !Stopped; ++J) {
        if (J->isCall()) break;
        for (const MachineOperand &MO : J->operands()) {
          if (!MO.isReg()) continue;
          Register R = MO.getReg();
          if (R == W65816::Y) {
            if (MO.isUse()) YLive = true;
            else if (MO.isDef() && !YLive) Stopped = true;
          } else if (R == W65816::X) {
            if (MO.isUse()) XLive = true;
            else if (MO.isDef() && !XLive) Stopped = true;
          }
        }
        if (YLive && XLive) break;
      }
      if (YLive) {
        // i64 return: PLY would eat Y.  Route through DP $E0.  Costs
        // ~4 cyc more than PLY*N/2 but correctness wins.  X is not
        // touched by any of these insns either way, so XLive doesn't
        // change anything here — track it for symmetry.
        BuildMI(MBB, I, DL, TII.get(W65816::STA_DP)).addImm(0xE0);
        BuildMI(MBB, I, DL, TII.get(W65816::TSC));
        BuildMI(MBB, I, DL, TII.get(W65816::CLC));
        BuildMI(MBB, I, DL, TII.get(W65816::ADC_Imm16)).addImm(N);
        BuildMI(MBB, I, DL, TII.get(W65816::TCS));
        BuildMI(MBB, I, DL, TII.get(W65816::LDA_DP)).addImm(0xE0);
        (void)XLive;
      } else if (N <= 14 && (N % 2) == 0) {
        for (int i = 0; i < N / 2; ++i)
          BuildMI(MBB, I, DL, TII.get(W65816::PLY));
      } else {
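The forward scan used to decide whether Y carries part of the return value can be modeled without LLVM types. A sketch under simplifying assumptions: each instruction is reduced to three flags, only Y is tracked, and the scan is block-local as in the code above; `struct insn` and `y_live_after` are hypothetical names:

```c
#include <assert.h>
#include <stddef.h>

struct insn {
    int is_call;   /* a call re-clobbers everything: stop scanning */
    int uses_y;    /* instruction reads Y */
    int defs_y;    /* instruction writes Y */
};

/* Returns 1 if, scanning forward from index `start`, Y is read before
   it is redefined and before the next call; otherwise 0.  Reaching the
   end of the block counts as "not live" in this model. */
int y_live_after(const struct insn *code, size_t count, size_t start) {
    assert(start <= count);
    for (size_t i = start; i < count; i++) {
        if (code[i].is_call)
            return 0;
        if (code[i].uses_y)     /* a use wins over a def in the same insn */
            return 1;
        if (code[i].defs_y)
            return 0;
    }
    return 0;
}
```

Because the scan stops at the end of the block, a value of Y that is only consumed in a successor block would be reported as dead by this model.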
--- a/src/llvm/lib/Target/W65816/W65816ISelLowering.cpp
+++ b/src/llvm/lib/Target/W65816/W65816ISelLowering.cpp
@ -861,10 +861,17 @@ W65816TargetLowering::LowerCall(TargetLowering::CallLoweringInfo &CLI,
      Glue = V.getValue(2);
      InVals.push_back(V);
    } else {
-      // 4th half: load from DP $F0.
-      SDValue DPAddr = DAG.getConstant(0xF0, DL, MVT::i16);
-      SDValue V = DAG.getLoad(VT, DL, Chain, DPAddr, MachinePointerInfo());
+      // 4th half: read DP[$F0..$F1] via CopyFromReg(DPF0).  DPF0 is a
+      // pseudo-physreg modeled as JSLpseudo's implicit-def, so each
+      // call's CopyFromReg has Glue tied to the corresponding call —
      // the SDAG combiner can't merge them and the scheduler can't
      // reorder them past the next call.  copyPhysReg lowers DPF0 →
      // A as `LDA $F0`.  Without this, plain `getLoad(0xF0)` was
      // being CSE'd / reordered across i64-returning calls, causing
      // `dmath = (a+b)*(a-b)` to return 4 instead of 16.
      SDValue V = DAG.getCopyFromReg(Chain, DL, W65816::DPF0, VT, Glue);
      Chain = V.getValue(1);
      Glue = V.getValue(2);
      InVals.push_back(V);
    }
  }
@ -900,11 +907,17 @@ SDValue W65816TargetLowering::LowerReturn(
  SDValue Glue;
  SmallVector<SDValue, 8> RetOps(1, Chain);
-  // Outs[3] -> store to DP $F0 (only for i64 returns).  Done first so
-  // its computation can use A freely before A holds the low result.
+  // Outs[3] -> DP $F0 via CopyToReg(DPF0).  Using the DPF0 fake physreg
+  // (lowered to `STA $F0` by copyPhysReg) is critical: a generic
  // ISD::STORE with addr=0xF0 lowered to `sta (d,s),y`, an indirect
  // through the DBR, which silently misbehaved when DBR != 0.  STA dp
  // uses D + dp directly and is unaffected by DBR.  Done first so its
  // computation can use A freely before A holds the low result.  Glued
  // to RET_GLUE via the RetOps Register entry below so DCE doesn't
  // strip the COPY.
  if (Outs.size() >= 4) {
-    SDValue DPAddr = DAG.getConstant(0xF0, DL, MVT::i16);
-    Chain = DAG.getStore(Chain, DL, OutVals[3], DPAddr, MachinePointerInfo());
+    Chain = DAG.getCopyToReg(Chain, DL, W65816::DPF0, OutVals[3], Glue);
+    Glue = Chain.getValue(1);
  }
  // Outs[2] -> Y.
  if (Outs.size() >= 3) {
@ -926,6 +939,8 @@ SDValue W65816TargetLowering::LowerReturn(
    RetOps.push_back(DAG.getRegister(W65816::X, Outs[1].VT));
  if (Outs.size() >= 3)
    RetOps.push_back(DAG.getRegister(W65816::Y, Outs[2].VT));
  if (Outs.size() >= 4)
    RetOps.push_back(DAG.getRegister(W65816::DPF0, Outs[3].VT));
  RetOps[0] = Chain;
  if (Glue.getNode())
--- a/src/llvm/lib/Target/W65816/W65816InstrInfo.cpp
+++ b/src/llvm/lib/Target/W65816/W65816InstrInfo.cpp
@ -92,6 +92,44 @@ void W65816InstrInfo::copyPhysReg(MachineBasicBlock &MBB,
    BuildMI(MBB, I, DL, get(W65816::STA_DP)).addImm(dstImg);
    return;
  }
  // X → IMGn / IMGn → X: STX dp / LDX dp.  Avoids the A-bridge that
  // TAX/TXA would impose; critical for i32-first-arg signatures
  // (live-in $a + $x) where bridging X via A clobbers $a's value
  // before it can be saved.  Caught by udivmod and iterative qsort.
  if (dstImg >= 0 && SrcReg == W65816::X) {
    BuildMI(MBB, I, DL, get(W65816::STX_DP)).addImm(dstImg);
    return;
  }
  if (DestReg == W65816::X && srcImg >= 0) {
    BuildMI(MBB, I, DL, get(W65816::LDX_DP)).addImm(srcImg);
    return;
  }
  // Y → IMGn / IMGn → Y: STY dp / LDY dp — symmetric.
  if (dstImg >= 0 && SrcReg == W65816::Y) {
    BuildMI(MBB, I, DL, get(W65816::STY_DP)).addImm(dstImg);
    return;
  }
  if (DestReg == W65816::Y && srcImg >= 0) {
    BuildMI(MBB, I, DL, get(W65816::LDY_DP)).addImm(srcImg);
    return;
  }
  // DPF0 → A: emit `LDA $F0`.  DPF0 is the pseudo-physreg carrier
  // for an i64-returning call's high 16 bits; LowerCall builds a
  // CopyFromReg(DPF0) glued to the call so the SDAG combiner /
  // scheduler can't merge or reorder reads across calls.
  if (DestReg == W65816::A && SrcReg == W65816::DPF0) {
    BuildMI(MBB, I, DL, get(W65816::LDA_DP)).addImm(0xF0);
    return;
  }
  // A → DPF0: emit `STA $F0`.  Used by LowerReturn for the i64 high
  // half; using a true direct-page store is critical because plain
  // ISD::STORE with addr=0xF0 was lowering to `(d,s),y` indirect via
  // DBR — which silently broke under DBR != 0 (e.g. after a bank
  // switch).  STA dp uses D + dp directly, ignoring DBR.
  if (DestReg == W65816::DPF0 && SrcReg == W65816::A) {
    BuildMI(MBB, I, DL, get(W65816::STA_DP)).addImm(0xF0);
    return;
  }
  llvm_unreachable("W65816: cross-class copyPhysReg not yet implemented");
 }
@ -101,8 +139,14 @@ void W65816InstrInfo::storeRegToStackSlot(
    MachineInstr::MIFlag Flags) const {
  // STAfi gets eliminated by W65816RegisterInfo::eliminateFrameIndex into
  // a real STA d,S.  Source is implicit A; emit the pseudo with the FI
-  // and zero offset.
+  // and zero offset.  When regalloc hands us a spill from X or Y, bridge
  // through A (TXA / TYA) — same rationale as loadRegFromStackSlot.
  DebugLoc DL = MI != MBB.end() ? MI->getDebugLoc() : DebugLoc();
  if (SrcReg == W65816::X || SrcReg == W65816::Y) {
    unsigned XferOp = (SrcReg == W65816::X) ? W65816::TXA : W65816::TYA;
    BuildMI(MBB, MI, DL, get(XferOp));
    SrcReg = W65816::A;
  }
  BuildMI(MBB, MI, DL, get(W65816::STAfi))
      .addReg(SrcReg, getKillRegState(isKill))
      .addFrameIndex(FrameIdx)
@ -115,9 +159,30 @@ void W65816InstrInfo::loadRegFromStackSlot(MachineBasicBlock &MBB,
                                           const TargetRegisterClass *RC,
                                           Register VReg, unsigned SubReg,
                                           MachineInstr::MIFlag Flags) const {
-  // Mirror image of storeRegToStackSlot: emit LDAfi, which the frame
-  // index pass turns into LDA d,S.
+  // LDAfi only knows how to put the value in A.  If regalloc asks for
+  // a reload into X or Y, we have to bridge through A: LDA d,S then
  // TAX / TAY.  Without this, the MIR has `$x = LDAfi` but the asm
  // printer emits just `LDA d,S` (which writes A, not X) — a silent
  // miscompile that surfaced as i64 subtract chains using stale X
  // values for the second word (caught by udivmod's `a - q*b` mod
  // computation).
  DebugLoc DL = MI != MBB.end() ? MI->getDebugLoc() : DebugLoc();
  if (DestReg == W65816::A) {
    BuildMI(MBB, MI, DL, get(W65816::LDAfi), DestReg)
        .addFrameIndex(FrameIdx)
        .addImm(0);
    return;
  }
  if (DestReg == W65816::X || DestReg == W65816::Y) {
    // Load via A, then transfer.  A is implicitly clobbered.
    BuildMI(MBB, MI, DL, get(W65816::LDAfi), W65816::A)
        .addFrameIndex(FrameIdx)
        .addImm(0);
    unsigned XferOp = (DestReg == W65816::X) ? W65816::TAX : W65816::TAY;
    BuildMI(MBB, MI, DL, get(XferOp));
    return;
  }
  // Fallback: assume A path (covers Acc16 / Wide16 vregs by class).
  BuildMI(MBB, MI, DL, get(W65816::LDAfi), DestReg)
      .addFrameIndex(FrameIdx)
      .addImm(0);
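The DBR sensitivity described in the `copyPhysReg` comments comes down to how the 65816 forms effective addresses. A simplified model, assuming native mode and ignoring cycle-level details; the function names are illustrative, and D = 0 in the example is an assumption, since the diff does not say what D holds:

```c
#include <assert.h>
#include <stdint.h>

/* Direct page (e.g. STA $F0): always bank 0, offset added to D,
   wrapping at 64 KiB.  DBR plays no part. */
uint32_t ea_direct_page(uint16_t d, uint8_t dp) {
    return (uint16_t)(d + dp);
}

/* Absolute (e.g. STA $5000): the bank byte comes from DBR. */
uint32_t ea_absolute(uint8_t dbr, uint16_t addr) {
    return ((uint32_t)dbr << 16) | addr;
}

/* (sr,S),Y: a 16-bit pointer is fetched from the stack, then the data
   access goes to DBR:pointer + Y as a 24-bit address. */
uint32_t ea_stack_rel_indirect_y(uint8_t dbr, uint16_t ptr, uint16_t y) {
    return ((((uint32_t)dbr << 16) | ptr) + y) & 0xFFFFFFu;
}

/* Worked example from the tests: after switchToBank2 (DBR = 2), a store
   to absolute $5000 lands at $02:5000, while DP $F0 stays in bank 0. */
void bank2_example(void) {
    assert(ea_absolute(0x02, 0x5000) == 0x025000);
    assert(ea_direct_page(0x0000, 0xF0) == 0x0000F0);
    assert(ea_stack_rel_indirect_y(0x02, 0x00F0, 0) == 0x0200F0);
}
```

Under this model, an access through a pointer holding `$00F0` with DBR = 2 goes to `$02:00F0` rather than to the direct-page slot, which is consistent with the failure the comments describe and with the `0x025000` addresses the MAME checks read back.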
--- a/src/llvm/lib/Target/W65816/W65816InstrInfo.td
+++ b/src/llvm/lib/Target/W65816/W65816InstrInfo.td
@ -70,6 +70,7 @@ def W65816pushx : SDNode<"W65816ISD::PUSH_X", SDTNone,
                         [SDNPHasChain, SDNPInGlue, SDNPOutGlue,
                          SDNPSideEffect, SDNPMayStore]>;
 // SELECT_CC: takes (TVal, FVal, CC) plus a glue value carrying the
 // flags from a preceding W65816cmp.  Lowered by EmitInstrWithCustomInserter
 // into a CMP (already in the BB) + Bxx + diamond CFG + PHI.
@ -1356,10 +1357,18 @@ def : Pat<(store
 // function doesn't have to know how it was called to choose its
 // return instruction.  A pseudo bridges the i16 symbol operand
 // to JSL_Long's 24-bit operand class.
 // Defs include DPF0 — every i64-returning libcall clobbers DP[$F0]
 // (it's the carrier for the highest 16 bits of the return).  The
 // LowerCall side reads the post-call DPF0 via CopyFromReg(DPF0)
 // glued to the call so the SDAG combiner / scheduler can't merge
 // or reorder reads across calls.  Without DPF0 in Defs, plain
 // `getLoad(0xF0)` was being CSE'd across calls, leading to
 // `dmath = (a+b)*(a-b)` returning 4 instead of 16.
 let isCall = 1, hasSideEffects = 0, mayLoad = 0, mayStore = 0,
-    Defs = [A] in {
+    Defs = [A, DPF0] in {
 def JSLpseudo : W65816Pseudo<(outs), (ins i16imm:$dst),
                             "# JSLpseudo $dst", []>;
 }
 def : Pat<(W65816call (i16 tglobaladdr:$dst)),  (JSLpseudo tglobaladdr:$dst)>;
 def : Pat<(W65816call (i16 texternalsym:$dst)), (JSLpseudo texternalsym:$dst)>;
--- a/src/llvm/lib/Target/W65816/W65816MachineFunctionInfo.h
+++ b/src/llvm/lib/Target/W65816/W65816MachineFunctionInfo.h
@ -40,6 +40,7 @@ class W65816MachineFunctionInfo : public MachineFunctionInfo {
  /// STA8abs needs an SEP/REP wrap in M=0 to avoid a 2-byte store).
  bool UsesAcc8 = false;
 public:
  W65816MachineFunctionInfo() = default;
--- a/src/llvm/lib/Target/W65816/W65816NegYIndY.cpp
+++ b/src/llvm/lib/Target/W65816/W65816NegYIndY.cpp
@ -89,6 +89,31 @@ bool W65816NegYIndY::runOnMachineFunction(MachineFunction &MF) {
          continue;
        unsigned Disp = MI.getOperand(0).getImm() & 0xFF;
        DebugLoc DL = MI.getDebugLoc();
        // X-liveness check: SpillToX may have stashed a value in X
        // that's used after this rewrite.  If so, save X to DP $E2
        // (libcall scratch; $E0 is reserved for the A-save dance in
        // eliminateCallFramePseudoInstr) and restore after.
        // Walk forward from MI looking for an X use without a prior
        // X def; if found, X is live and we must preserve it.
        bool XLive = false;
        for (auto Scan = std::next(MachineBasicBlock::iterator(&MI));
             Scan != MBB.end(); ++Scan) {
          if (Scan->isDebugInstr()) continue;
          bool xDef = false;
          for (const MachineOperand &MO : Scan->operands()) {
            if (!MO.isReg()) continue;
            if (MO.getReg() == W65816::X) {
              if (MO.isUse()) { XLive = true; break; }
              if (MO.isDef()) xDef = true;
            }
          }
          if (XLive || xDef) break;
        }
        if (XLive) {
          // Save X to DP $E2 (don't use $E0 — that's the A-preserve
          // slot in call-frame teardown and may be live).
          BuildMI(MBB, MI, DL, TII->get(W65816::STX_DP)).addImm(0xE2);
        }
        if (IsLDA) {
          // LDA disp,S ; CLC ; ADC #neg ; TAX ; LDA $0000,X
          BuildMI(MBB, MI, DL, TII->get(W65816::LDA_StackRel))
@ -127,6 +152,10 @@ bool W65816NegYIndY::runOnMachineFunction(MachineFunction &MF) {
              .addImm(0)
              .addReg(W65816::A, RegState::Implicit);
        }
        if (XLive) {
          // Restore X from DP $E2.
          BuildMI(MBB, MI, DL, TII->get(W65816::LDX_DP)).addImm(0xE2);
        }
        // Erase original LDY and the (sr,s),Y op.
        if (LastLDY) { LastLDY->eraseFromParent(); LastLDY = nullptr; }
        MI.eraseFromParent();
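
Review note: the X-liveness check in this hunk is a forward "use before def" scan over the rest of the basic block. The sketch below models that logic outside LLVM. `Instr` and `isXLiveAfter` are hypothetical names invented for illustration, not part of the patch. As in the hunk, an instruction that both reads and writes X counts as a use, and reaching the end of the block without seeing X is treated as "not live", which assumes X is never live across block boundaries.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a machine instruction: it records only whether
// the instruction reads and/or writes the X register.
struct Instr {
  bool usesX;
  bool defsX;
};

// Returns true if X is read before it is overwritten, scanning forward from
// index `start` to the end of the block.
bool isXLiveAfter(const std::vector<Instr> &block, std::size_t start) {
  for (std::size_t i = start; i < block.size(); ++i) {
    if (block[i].usesX) return true;   // a use comes first: X is live
    if (block[i].defsX) return false;  // overwritten first: X is dead
  }
  // End of block reached. This sketch assumes X is not live-out.
  return false;
}
```

If X can be live out of the block on this target, the fall-through case would need to return true (or consult live-out information) to stay conservative.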
--- a/src/llvm/lib/Target/W65816/W65816RegisterInfo.cpp
+++ b/src/llvm/lib/Target/W65816/W65816RegisterInfo.cpp
@ -73,7 +73,30 @@ bool W65816RegisterInfo::eliminateFrameIndex(MachineBasicBlock::iterator II,
  bool NeedsCarryPrefix = false;
  bool IsSub = false;
  switch (Opc) {
-  case W65816::LDAfi: NewOpc = W65816::LDA_StackRel; break;
+  case W65816::LDAfi: {
    // LDAfi targets A.  If the regalloc parked the dest in X or Y
    // (which can happen via Idx16 vreg coalescing), bridge through A
    // by appending a TAX / TAY.
    Register Dst = MI.getOperand(0).getReg();
    int FI = MI.getOperand(FIOperandNum).getIndex();
    int FrameOffset = MFI.getObjectOffset(FI);
    int ImmOffset = MI.getOperand(FIOperandNum + 1).getImm();
    int Offset = FrameOffset + ImmOffset + (int)MFI.getStackSize() + SPAdj;
    if (FrameOffset < 0) Offset += 1;
    if (Offset < 0 || Offset > 0xFF)
      report_fatal_error("W65816: frame offset out of stack-relative range");
    BuildMI(*MI.getParent(), II, MI.getDebugLoc(),
            TII.get(W65816::LDA_StackRel))
        .addImm(Offset)
        .addReg(W65816::A, RegState::ImplicitDefine);
    if (Dst == W65816::X) {
      BuildMI(*MI.getParent(), II, MI.getDebugLoc(), TII.get(W65816::TAX));
    } else if (Dst == W65816::Y) {
      BuildMI(*MI.getParent(), II, MI.getDebugLoc(), TII.get(W65816::TAY));
    }
    MI.eraseFromParent();
    return true;
  }
  case W65816::STAfi: {
    // Wide16-source STAfi: if the source ended up in IMGn (DP-backed),
    // prepend LDA dp so the value reaches A before the actual store.
@ -108,6 +131,12 @@ bool W65816RegisterInfo::eliminateFrameIndex(MachineBasicBlock::iterator II,
      BuildMI(*MI.getParent(), II, MI.getDebugLoc(),
              TII.get(W65816::LDA_DP)).addImm(srcDP);
    }
    // Note: STAfi with X or Y source is NOT supported here — adding a
    // TXA/TYA pre-bracket would clobber A which a downstream STAfi $a
    // may still need (the prologue stashes arg0_lo from A and arg0_ml
    // from X via two adjacent STAfi, and putting A's STA *before* X's
    // is the caller's responsibility).  storeRegToStackSlot already
    // bridges X/Y → A for spills it generates.
    BuildMI(*MI.getParent(), II, MI.getDebugLoc(),
            TII.get(W65816::STA_StackRel))
        .addImm(Offset)
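
Review note: the offset arithmetic in the new `LDAfi` case can be checked in isolation. The sketch below restates it as a plain function. `stackRelOffset` is a hypothetical helper, not something the patch defines, and it returns -1 where the patch calls `report_fatal_error`. The `+1` for negative frame offsets is copied from the hunk as written; the diff does not explain it, so treat the adjustment as an observed rule rather than a derived one.

```cpp
// Computes the one-byte stack-relative displacement for a frame access.
// Returns -1 when the result does not fit in the 0..0xFF operand range.
int stackRelOffset(int frameOffset, int immOffset, int stackSize, int spAdj) {
  int offset = frameOffset + immOffset + stackSize + spAdj;
  // Mirrors the hunk: objects at negative frame offsets get one extra byte.
  if (frameOffset < 0) offset += 1;
  if (offset < 0 || offset > 0xFF) return -1;
  return offset;
}
```

For example, a local at frame offset -4 in a 6-byte frame with no pending call-frame adjustment yields displacement 3, while a 300-byte frame pushes a slot at offset 0 out of range.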
--- a/src/llvm/lib/Target/W65816/W65816RegisterInfo.td
+++ b/src/llvm/lib/Target/W65816/W65816RegisterInfo.td
@ -55,6 +55,15 @@ def IMG5 : W65816Reg<21, "img5">, DwarfRegNum<[21]>;
 def IMG6 : W65816Reg<22, "img6">, DwarfRegNum<[22]>;
 def IMG7 : W65816Reg<23, "img7">, DwarfRegNum<[23]>;
 // DPF0 — pseudo-physreg modeling the i16 storage at DP $F0..$F1.
 // Used as the carrier for the highest 16 bits of an i64/double
 // return.  JSLpseudo Defs DPF0 so the SDAG combiner / scheduler
 // can't merge or reorder reads of it across calls; we plumb the
 // 4th return half via CopyFromReg(DPF0) in LowerCall, which lowers
 // to `LDA $F0` via copyPhysReg.  Never allocated to a vreg —
 // always a transient bridge from DP[$F0] to A.
 def DPF0 : W65816Reg<24, "dpf0">, DwarfRegNum<[24]>;
 //===----------------------------------------------------------------------===//
 //  Register Classes
 //===----------------------------------------------------------------------===//
@ -90,6 +99,13 @@ def Wide16 : RegisterClass<"W65816", [i16], 16,
 def PtrRegs : RegisterClass<"W65816", [i16], 16, (add SP)>;
 // Single-register class for DPF0, the i64-return high-half carrier.
 // Not allocatable — only used as a CopyFromReg source in LowerCall;
 // copyPhysReg lowers DPF0 → A by emitting `LDA $F0`.
 def DPF0Reg : RegisterClass<"W65816", [i16], 16, (add DPF0)> {
  let isAllocatable = 0;
 }
 // Single-register class for the processor status register, used for condition
 // code modeling.  Not currently allocatable.
 def StatusReg : RegisterClass<"W65816", [i8], 8, (add P)> {
--- a/src/llvm/lib/Target/W65816/W65816StackSlotCleanup.cpp
+++ b/src/llvm/lib/Target/W65816/W65816StackSlotCleanup.cpp
@ -1217,6 +1217,13 @@ bool W65816StackSlotCleanup::runOnMachineFunction(MachineFunction &MF) {
        }
        if (MI.isCall()) break;
        if (MI.modifiesRegister(W65816::Y, TRI)) break;
        // killsRegister: an instruction with `implicit killed $y` USES Y
        // and that's the LAST use — Y is dead after.  We must NOT treat
        // a subsequent LDY_Imm16 #N as redundant after a kill, because
        // the held value is conceptually gone.  Caught by `addOff(p,i)
        // { p[i-1] += p[i]; }` where LDY -2 ; LDA_indY (kills Y) ; ... ;
        // LDY -2 ; STA_indY needs the second LDY to reinitialize Y.
        if (MI.killsRegister(W65816::Y, TRI)) break;
        if (MI.isInlineAsm() || MI.isBranch() || MI.isReturn()) break;
        ++It;
      }
--- a/src/llvm/lib/Target/W65816/W65816TargetMachine.cpp
+++ b/src/llvm/lib/Target/W65816/W65816TargetMachine.cpp
@ -14,6 +14,7 @@
 #include "W65816.h"
 #include "W65816MachineFunctionInfo.h"
 #include "TargetInfo/W65816TargetInfo.h"
 #include "llvm/CodeGen/MachineCSE.h"
 #include "llvm/CodeGen/Passes.h"
 #include "llvm/CodeGen/TargetLoweringObjectFileImpl.h"
 #include "llvm/CodeGen/TargetPassConfig.h"
@ -82,16 +83,19 @@ public:
  void addPreRegAlloc() override;
  void addPostRegAlloc() override;
  void addPreEmitPass() override;
  void addMachineSSAOptimization() override;
-  // W65816's only 16-bit ALU register is A.  We use fast regalloc by
+  // W65816's only 16-bit ALU register is A.  Greedy at -O1+ produces
-  // default — always succeeds, ~30-50% bigger code than greedy in
+  // tight code; at -O0 (where optnone disables coalescing/CSE), greedy
-  // pathological cases but correctness is paramount.  Greedy fails
+  // leaves spurious COPY pseudos that lower to STA dp / LDA dp pairs
-  // outright on functions with 4+ simultaneously live i16 vregs (heap
+  // around modify-in-place ops (e.g. INA), miscompiling a + 1.  Use
-  // sift etc.).  TiedDefSpill (pre-RA) handles the tied-def-multi-use
+  // fast regalloc when the target framework signals unoptimized.
-  // hazard for the sub-pattern that's frequent enough to matter.
+  // TiedDefSpill (pre-RA) handles the tied-def-multi-use hazard for
  // the sub-pattern that's frequent enough to matter at -O1+.
  //
-  FunctionPass *createTargetRegisterAllocator(bool /*Optimized*/) override {
+  FunctionPass *createTargetRegisterAllocator(bool Optimized) override {
-    return createGreedyRegisterAllocator();
+    return Optimized ? createGreedyRegisterAllocator()
                     : createFastRegisterAllocator();
  }
 };
@ -101,6 +105,24 @@ TargetPassConfig *W65816TargetMachine::createPassConfig(PassManagerBase &PM) {
  return new W65816PassConfig(*this, PM);
 }
 void W65816PassConfig::addMachineSSAOptimization() {
  // MachineCSE incorrectly eliminates "redundant" CMP instructions when
  // it sees an earlier identical CMP elsewhere in the function — the
  // P (status) flag is considered "available", but on this target P is
  // clobbered by every intervening LDA/STA/ADC, so the surviving Bxx
  // ends up dispatching on stale flags.  We don't model `Uses=[P]` on
  // Bxx because doing so causes regalloc/layout shifts that uncovered
  // a different latent bug in vprintf.  Disabling the pass entirely
  // is the lower-cost workaround until the Bxx-Uses=[P] regression is
  // root-caused.  Caught by `printf("%d", n)` returning 0.
  //
  // Other SSA opts (early-tailduplication, opt-phis, dead-mi-elim,
  // licm, machine-sink, peephole-opt, etc.) still run by chaining
  // through the default impl — we just skip MachineCSE.
  disablePass(&MachineCSELegacyID);
  TargetPassConfig::addMachineSSAOptimization();
 }
 void W65816PassConfig::addPreRegAlloc() {
  addPass(createW65816ABridgeViaX());
  addPass(createW65816TiedDefSpill());
@ -125,7 +147,11 @@ void W65816PassConfig::addPreEmitPass() {
  addPass(createW65816SpillToX());
  // Rewrite negative-Y indirect-Y stack-rel ops.  Must run BEFORE
  // BranchExpand because the rewrite expands one instruction into
-  // several and shifts branch distances.
+  // several and shifts branch distances.  The pass internally checks
  // X-liveness and saves/restores X via DP $E2 when SpillToX has
  // a value parked there; without that check, the rewrite's TAX
  // would clobber spill-bridged values (caught by `addOff(p,i) {
  // p[i-1] += p[i]; }` returning p[i-1] + &p[i-1] instead of
  // p[i-1] + p[i]).
  addPass(createW65816NegYIndY());
  // Branch expansion runs after that so the BRA introduced for long
  // conditional branches gets seen by SepRepCleanup (which can
--- a/src/llvm/lib/Target/W65816/W65816TiedDefSpill.cpp
+++ b/src/llvm/lib/Target/W65816/W65816TiedDefSpill.cpp
@ -118,6 +118,11 @@ bool W65816TiedDefSpill::runOnMachineFunction(MachineFunction &MF) {
  // Only pre-RA: skip if vregs are already gone.
  if (!MF.getRegInfo().getNumVirtRegs())
    return false;
  // At -O0/optnone, the spill+reload pattern this pass introduces
  // doesn't get coalesced and ends up wasting frame space without
  // helping greedy.  Same skip rationale as WidenAcc16.
  if (MF.getFunction().hasOptNone())
    return false;
  MachineRegisterInfo &MRI = MF.getRegInfo();
  const W65816Subtarget &STI = MF.getSubtarget<W65816Subtarget>();
--- a/src/llvm/lib/Target/W65816/W65816WidenAcc16.cpp
+++ b/src/llvm/lib/Target/W65816/W65816WidenAcc16.cpp
@ -119,6 +119,13 @@ static bool allUsesAcceptWide(Register VReg,
 bool W65816WidenAcc16::runOnMachineFunction(MachineFunction &MF) {
  if (!MF.getRegInfo().getNumVirtRegs()) return false;
  // At -O0 / optnone, register coalescing doesn't run, so the COPY we
  // insert to bridge Acc16 → Wide16 doesn't get folded; instead it
  // forces wide16 spills through DP-mapped slots that collide and
  // produce miscompiles around modify-in-place ops (lda dp; inc a;
  // sta dp; lda dp reads pre-inc value).  The promotion is purely a
  // performance optimization, so skip it for optnone functions.
  if (MF.getFunction().hasOptNone()) return false;
  MachineRegisterInfo &MRI = MF.getRegInfo();
  const W65816Subtarget &STI = MF.getSubtarget<W65816Subtarget>();
  const W65816InstrInfo *TII = STI.getInstrInfo();