diff --git a/STATUS.md b/STATUS.md
new file mode 100644
index 0000000..3d65a76
--- /dev/null
+++ b/STATUS.md
@@ -0,0 +1,144 @@
+# llvm816 — Current Status
+
+LLVM/Clang backend for the WDC 65816 (Apple IIgs), forked from
+llvm-mos as a separate `W65816` target.
+
+## What works
+
+End-to-end C-to-binary toolchain that produces 65816 machine code
+which runs correctly under MAME (apple2gs).
+
+**Language coverage at -O2 (no extra flags):**
+
+- All scalar arithmetic: i8 / i16 / i32 / i64 add, sub, mul, div, mod
+  (signed and unsigned). Carry-chained multi-word ops via ADC/SBC pseudos
+  + ASLA16 / shift libcalls.
+- Comparisons and signed/unsigned widening (sext, zext, trunc) for all
+  the above sizes.
+- Pointer arithmetic, array indexing, struct field access, struct
+  return-by-value (up to 8 bytes — Pair, Vec4, double).
+- Bitfields, switch statements (verified up to ~12 cases + default),
+  function pointers, function-pointer tables, indirect calls via
+  `__jsl_indir` trampoline.
+- Recursion: factorial, Fibonacci, depth-3 binary-tree
+  insert/sum/min/max, simple recursive quicksort.
+- Loops with goto / break / continue, nested loops, state machines.
+- `<stdarg.h>` varargs with int / long / unsigned long long mixed args.
+- Heap: `malloc` / `free` (libc.c first-fit allocator) — linked-list
+  reverse with `cons` works.
+- Strings: hand-rolled `strlen`, `strcmp`, `strcpy`, `strchr`, atoi/itoa
+  roundtrip.
+- Soft-float (single): all four ops + comparisons, MAME-verified.
+- Soft-double: add, sub, mul, div all return correct bit patterns;
+  3-iter Newton sqrt converges. Long-running iterations may hit MAME's
+  1-second sim-time budget (test config issue, not a compiler bug).
+- Inline assembly with `"a"`, `"x"`, `"y"` register constraints and
+  arbitrary opcode bytes (used for the `pha;plb` bank-switch idiom).
+- C++ minimal: clang++ compiles a class with virtual + non-trivial
+  ctor (vtable + RTTI omitted; no exceptions).
+- printf with `%d %x %s %c %p` and width/precision specifiers.
+- `setjmp` / `longjmp` from libgcc.s.
+- Static constructors via crt0's init_array walk.
+
+**Toolchain:**
+
+- `clang` / `llc` produce W65816 assembly + ELF object files.
+- `tools/link816` resolves cross-translation-unit refs, lays out
+  text/rodata/bss, emits a flat binary the IIgs ROM can load.
+- `tools/omfEmit` produces OMF v2.1 single-segment files (the IIgs's
+  native object format) for round-tripping with classic dev tools.
+- `runtime/build.sh` builds crt0, libc, soft-float, soft-double,
+  libgcc into linkable objects.
+- `scripts/smokeTest.sh` runs ~80 end-to-end checks (scalar ops,
+  control flow, calling conventions, MAME execution, regressions).
+  Currently 100% pass.
+
+**ABI:**
+
+- arg0 in A; arg1 in X for i32-first-arg signatures; rest pushed RTL
+  on the system stack with PHA. Caller deallocates via `tsc;clc;adc
+  #N;tcs` or `PLY*N/2`.
+- Return: i8/i16 in A; i32 in A:X; i64 in A:X:Y plus DP[$F0..$F1] for
+  the highest 16 bits.
+- Frame is empty-descending (S points to next-free); offsets account
+  for the +1 skew vs LLVM's full-descending model.
+
+## In flight (build-system level)
+
+- **DWARF sidecar emission in link816** (#51): The linker should produce
+  a separate sidecar file with line-number / variable-location info
+  that an IDE or post-mortem dumper can consume. Skeleton not yet
+  written; deferred until other correctness work is done.
+
+## Known issues / workarounds
+
+- **Greedy register allocator mis-orders spills** in two patterns
+  (#69, #70):
+  1. Functions where both `$a` and `$x` are live-in (i64-first-arg
+     with a stack-output pointer, e.g. `udivmod(i64, i64, ptr)`).
+     The TAX bridging `$x` to A clobbers `$a`'s value before the
+     second STA can save it.
+  2. Iterative quicksort with `if/else` recursion choice: complex
+     live-ranges across two `swap()` calls produce wrong arg values.
+
+  Both reproduce only at `-O1`/`-O2` with greedy. Workaround:
+  `-mllvm -regalloc=fast` for the affected translation unit.
+  `softDouble.c` already requires this flag for `__muldf3` (build.sh
+  applies it automatically).
+
+  Real fix is a pre-RA pass that pre-spills critical pointer
+  arguments to memory, or a targeted fix in greedy's spill-ordering
+  heuristic. Material work; deferred.
+
+- **(d,s),y / (sr,s),y addressing wraps the bank** when Y holds a
+  negative offset (a large value as 16-bit unsigned). Worked around by
+  `W65816NegYIndY` rewriting the affected ops to `TAX ; LDA/STA $0000,X`,
+  which stays correct for negative offsets like `arr[i-1]`.
+
+- **(d,s),y for stack-local pointer dereferences uses DBR**, so
+  user code that switches DBR (e.g. `pha;plb` to bank 2 to reach
+  IIgs hardware) must not call into a function that takes the
+  address of one of its locals — the callee's `*p = v` will write
+  to the wrong bank. Documented; no compiler-side mitigation
+  beyond the existing DPF0 fake-physreg routing for the i64-return
+  high half.
+
+## What's still needed for a "ship-ready" toolchain
+
+- **Greedy regalloc spill-ordering fix** — see above. Removes the
+  need for the per-file `-regalloc=fast` workaround on
+  `softDouble.c` and unblocks pattern-rich code that currently
+  must be compiled at `-O0` for correctness.
+
+- **Round-to-nearest-even in `__divdf3`** — currently
+  truncate-toward-zero, which differs from gcc by ±1 ULP in
+  several test cases. Acceptable today (Newton iterations still
+  converge); revisit when an exact-match test suite lands.
+
+- **DWARF sidecar** (#51) for source-level debugging.
+
+- **More of the C standard library**: `<math.h>` transcendental
+  functions (sin, cos, exp, log, pow), `<string.h>` beyond what's
+  hand-coded, `<stdio.h>` file I/O (`fopen`, `fread`, `fwrite`,
+  `fseek`).
+
+- **C++ runtime support**: vtable layout for multiple inheritance,
+  RTTI, exceptions (or a documented `-fno-exceptions` requirement).
+
+- **REP/SEP scheduling pass** (design doc §3.3): the current
+  prologue picks one M-mode for the whole function based on
+  whether any 8-bit accumulator value is used. A per-region
+  scheduler would reduce the SEP/REP wrap overhead on i8 stores.
+
+- **Toolbox / IIgs system call bindings**: header files declaring
+  the Apple IIgs system calls (`SystemTask`, `WaitMouseUp`,
+  `DrawString`, …) with the right inline-asm dispatch glue.
+
+- **Real-world program coverage**: the smoke tests are small
+  synthetic programs. A few known-good Apple IIgs C programs (e.g.
+  a textfile pager, a small game) compiled and run end-to-end
+  would catch issues no synthetic test currently exercises.
+
+- **Cycle-time / size benchmarks vs Calypsi 5.16**: design doc §1
+  says the goal is to "match or exceed" Calypsi. We have neither
+  baseline numbers nor a comparison harness yet.
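Note on the `(d,s),y` bank-wrap item in STATUS.md: a minimal C sketch of the two effective-address computations may make the failure mode easier to see. The function names are hypothetical, and the second function assumes that the `W65816NegYIndY` rewrite forms the 16-bit sum of pointer and offset in A before the `TAX`; the patch does not show the pass itself.

```c
#include <assert.h>
#include <stdint.h>

/* Effective address of `lda (d,s),y` as the 65816 forms it: the 16-bit
   pointer read from the stack is extended with DBR, then Y is added to
   the full 24-bit value, so a carry out of bit 15 moves to the next bank. */
static uint32_t ea_stack_ind_y(uint8_t dbr, uint16_t ptr, uint16_t y) {
    return ((((uint32_t)dbr << 16) | ptr) + y) & 0xFFFFFFu;
}

/* Assumed shape of the rewrite: the 16-bit sum ptr + y is formed in A
   first (wrapping modulo 2^16), moved to X, and used with `$0000,X`. */
static uint32_t ea_zero_abs_x(uint8_t dbr, uint16_t ptr, uint16_t y) {
    uint16_t x = (uint16_t)(ptr + y);
    return (((uint32_t)dbr << 16) + x) & 0xFFFFFFu;
}

/* Usage: a pointer at $2002 in bank 0 with a byte offset of -2. */
void negy_self_check(void) {
    assert(ea_stack_ind_y(0, 0x2002, 0xFFFE) == 0x012000u); /* wrong bank */
    assert(ea_zero_abs_x(0, 0x2002, 0xFFFE) == 0x002000u);  /* intended   */
}
```

For non-negative offsets both forms give the same address, which is why only the negative-Y case needs the rewrite.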
diff --git a/runtime/build.sh b/runtime/build.sh
index dff9a7a..db71310 100755
--- a/runtime/build.sh
+++ b/runtime/build.sh
@@ -23,8 +23,10 @@ asm() {
 cc() {
     local c="$1"
     local o="$OUT/$(basename "${c%.c}").o"
+    local extra=("${@:2}")
     echo "  CC  $(basename "$c")"
     "$CLANG" -target w65816 -O2 -ffunction-sections \
+        "${extra[@]}" \
         -I"$PROJECT_ROOT/runtime/include" \
         -c "$c" -o "$o"
 }
@@ -33,6 +35,9 @@ asm "$SRC/crt0.s"
 asm "$SRC/libgcc.s"
 cc  "$SRC/libc.c"
 cc  "$SRC/softFloat.c"
-cc  "$SRC/softDouble.c"
+# softDouble.c needs -regalloc=fast: __muldf3's 64x64 -> 128 mul plus
+# its inlined alignment shifts overwhelm the greedy allocator's spill
+# heuristics on the single-A target.
+cc  "$SRC/softDouble.c" -mllvm -regalloc=fast
 
 echo "runtime built: $(ls -1 "$OUT"/*.o | wc -l) objects"
diff --git a/runtime/src/libgcc.s b/runtime/src/libgcc.s
index a96977b..f34b22e 100644
--- a/runtime/src/libgcc.s
+++ b/runtime/src/libgcc.s
@@ -673,19 +673,30 @@ __divmodsi_setup:
 ; setup; signed variants flip signs around it.
 ; --------------------------------------------------------------------
 __divmoddi4_stash:
+	; Called via JSR from another libgcc helper that was itself
+	; called via JSL.  Stack layout inside this routine:
+	;   slot 1..2  = JSR return address (2 bytes, same-bank)
+	;   slot 3..5  = JSL return address (3 bytes, long)
+	;   slot 6..7  = first 16-bit stack arg (caller's first push)
+	;   slot 8..9  = second
+	;   ... etc.
+	; Earlier code read slots 4, 6, 8, 10, 12, 14: slot 4 is still
+	; inside the JSL return address, so every arg was read 2 bytes
+	; early.  Caught by `u64mul(0x12, 0x12)` returning the result at
+	; $E2 (mid-low) instead of $E0 (lo), plus 0x678-shaped garbage at $E6.
 	sta	0xe0			; a_lo_lo
 	stx	0xe2			; a_lo_hi
-	lda	0x4, s
-	sta	0xe4			; a_hi_lo
 	lda	0x6, s
-	sta	0xe6			; a_hi_hi
+	sta	0xe4			; a_hi_lo
 	lda	0x8, s
-	sta	0xe8			; b_lo_lo
+	sta	0xe6			; a_hi_hi
 	lda	0xa, s
-	sta	0xea			; b_lo_hi
+	sta	0xe8			; b_lo_lo
 	lda	0xc, s
-	sta	0xec			; b_hi_lo
+	sta	0xea			; b_lo_hi
 	lda	0xe, s
+	sta	0xec			; b_hi_lo
+	lda	0x10, s
 	sta	0xee			; b_hi_hi
 	rts
 
@@ -805,19 +816,28 @@ __muldi3:
 	; Loop 64 times on a's bits.
 	ldy	#0x40
 .Lmuldi_loop:
-	; Test bit 0 of a (= LSR a; C = old bit 0).
-	lda	0xe0
+	; Right-shift the 64-bit `a` by 1.  $E0=lo..$E6=hi (matches the
+	; stash + __retdi convention).  Must shift HI first: LSR on $E6
+	; feeds 0 into bit 63, then each ROR carries the previous word's
+	; bit 0 INTO the top of the next-LOWER word, which is the actual
+	; right-shift direction in a $E0=lo layout.  After the chain,
+	; C = orig $E0_b0 = bit 0 of the 64-bit value, which drops out
+	; and is what we want to BCC on.  The earlier code shifted lo
+	; first, which ran the carry chain in the WRONG direction (lo to
+	; hi) and tested $E6_b0 (bit 48) instead of bit 0, so every
+	; multiply involving bits 16+ came back garbage.
+	lda	0xe6
 	lsr	a
-	sta	0xe0
-	lda	0xe2
-	ror	a
-	sta	0xe2
+	sta	0xe6
 	lda	0xe4
 	ror	a
 	sta	0xe4
-	lda	0xe6
+	lda	0xe2
 	ror	a
-	sta	0xe6
+	sta	0xe2
+	lda	0xe0
+	ror	a
+	sta	0xe0
 	bcc	.Lmuldi_noadd
 	; Add b ($E8..$EE) to product ($F2..$F8).
 	clc
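Note on the `__muldi3` hunk above: the hi-first shift chain can be modelled in C to check the direction of the carry. This is a sketch under stated assumptions, not the routine itself: `shr64_words`, `mul64_model`, and `mul64_self_check` are hypothetical names, and the left shift of `b` after each add is assumed, since that part of the loop is outside the hunk.

```c
#include <assert.h>
#include <stdint.h>

/* a[0] plays the role of $E0 (lo) and a[3] the role of $E6 (hi).
   Shift the 64-bit value right by one, hi word first, and return the
   bit that falls out of the low word (the 65816 carry flag). */
static unsigned shr64_words(uint16_t a[4]) {
    unsigned carry = 0;                    /* LSR feeds 0 into bit 63 */
    for (int i = 3; i >= 0; i--) {
        unsigned out = a[i] & 1u;          /* bit 0 becomes the next carry */
        a[i] = (uint16_t)((a[i] >> 1) | (carry << 15));  /* ROR */
        carry = out;
    }
    return carry;                          /* original bit 0 of the value */
}

/* Shift-and-add multiply over 64 iterations; low 64 bits of the product. */
static uint64_t mul64_model(uint64_t x, uint64_t b) {
    uint16_t a[4];
    for (int i = 0; i < 4; i++)
        a[i] = (uint16_t)(x >> (16 * i));
    uint64_t prod = 0;
    for (int n = 0; n < 64; n++) {
        if (shr64_words(a))                /* BCC skips the add when C == 0 */
            prod += b;
        b <<= 1;                           /* assumed: not shown in the hunk */
    }
    return prod;
}

/* Usage: the case named in the stash comment, 0x12 * 0x12 = 0x144. */
void mul64_self_check(void) {
    assert(mul64_model(0x12, 0x12) == 0x144);
}
```

Running the chain lo-first in this model would test bit 48 on the first iteration, which matches the failure described in the comment.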
diff --git a/runtime/src/softDouble.c b/runtime/src/softDouble.c
index 88af25d..97cc8e5 100644
--- a/runtime/src/softDouble.c
+++ b/runtime/src/softDouble.c
@@ -111,16 +111,25 @@ u64 __negdf2(u64 a) {
     return a ^ DSIGN_BIT;
 }
 
-u64 __muldf3(u64 a, u64 b) {
-    u64 sa, sb, ma, mb;
-    s16 ea, eb;
-    u16 ca = dclass(a, &sa, &ea, &ma);
-    u16 cb = dclass(b, &sb, &eb, &mb);
-    u64 sr = sa ^ sb;
-    if (ca == 0 || cb == 0) return sr;
-    // Truncated 64*64 → high-64 product via 32*32 partials.  We only
-    // need the upper bits of the 106-bit product because the mantissas
-    // are 53 bits each.
+// Aligned product mantissa plus a carry flag.  `carry` is 1 when the
+// leading bit of the 106-bit product was at bit 105 (caller must
+// increment the exponent), 0 when it was at bit 104.
+typedef struct {
+    u64 mantissa;
+    u16 carry;
+} MantCarryT;
+
+// 64x64 -> 128-bit product of two 53-bit mantissas.  The full 128
+// bits cannot be returned in a single u64, so the alignment shift is
+// done in-line and the aligned 53-bit mantissa is returned directly.
+// Splitting this out of __muldf3 keeps register pressure low enough
+// for greedy regalloc on the single-A W65816.
+//
+// Inlinable on purpose: passing a pointer to a stack local across a
+// noinline boundary lowers to `sta (d,s),y` which uses DBR-relative
+// addressing — broken under DBR != 0 (e.g. after a bank switch).
+// Keeping these inline keeps the stores within the caller's frame.
+static inline u64 mulhi64Aligned(u64 ma, u64 mb, u16 *out_carry) {
     u32 alo = (u32)ma;
     u32 ahi = (u32)(ma >> 32);
     u32 blo = (u32)mb;
@@ -131,16 +140,26 @@ u64 __muldf3(u64 a, u64 b) {
     u64 hh = (u64)ahi * (u64)bhi;
     u64 mid = lh + hl + (ll >> 32);
     u64 prod_hi = hh + (mid >> 32);
-    s16 er = ea + eb;
-    while (prod_hi & ~(DMANT_LEAD | DMANT_MASK)) {
-        prod_hi >>= 1;
-        er++;
+    u64 prod_lo = (ll & 0xFFFFFFFFULL) | ((mid & 0xFFFFFFFFULL) << 32);
+    if (prod_hi & (1ULL << 41)) {
+        *out_carry = 1;
+        return (prod_hi << 11) | (prod_lo >> 53);
     }
-    while ((prod_hi & DMANT_LEAD) == 0 && prod_hi != 0) {
-        prod_hi <<= 1;
-        er--;
-    }
-    return dpack(sr, er, prod_hi);
+    *out_carry = 0;
+    return (prod_hi << 12) | (prod_lo >> 52);
+}
+
+u64 __muldf3(u64 a, u64 b) {
+    u64 sa, sb, ma, mb;
+    s16 ea, eb;
+    u16 ca = dclass(a, &sa, &ea, &ma);
+    u16 cb = dclass(b, &sb, &eb, &mb);
+    u64 sr = sa ^ sb;
+    if (ca == 0 || cb == 0) return sr;
+    u16 carry;
+    u64 mr = mulhi64Aligned(ma, mb, &carry);
+    s16 er = ea + eb + (s16)carry;
+    return dpack(sr, er, mr);
 }
 
 u64 __divdf3(u64 a, u64 b) {
@@ -151,26 +170,29 @@ u64 __divdf3(u64 a, u64 b) {
     u64 sr = sa ^ sb;
     if (ca == 0) return sr;
     if (cb == 0) return sr | DEXP_MASK;  // div-by-zero → inf
-    // Long division: shift a left by 11 to make room for quotient bits.
-    u64 q = 0;
-    u64 r = ma;
-    for (int i = 0; i < 53; i++) {
+    // Long division: handle the leading quotient bit explicitly (since
+    // we need to "consume" the dividend's leading 1 by subtracting),
+    // then generate 52 more fractional bits by shifting r left and
+    // testing.  The previous shift-and-test-only loop over-counted
+    // when r == mb after subtraction (e.g. 2.0/1.0 returned ~4.0).
+    s16 er = ea - eb;
+    // Normalize so the dividend is in [mb, 2*mb).  This ensures the
+    // leading quotient bit will land at position 52 below.
+    if (ma < mb) {
+        ma <<= 1;
+        er--;
+    }
+    // Handle the leading quotient bit explicitly.
+    u64 q = DMANT_LEAD;
+    u64 r = ma - mb;
+    // Compute 52 more fractional bits via standard shift-test-subtract.
+    for (int i = 51; i >= 0; i--) {
         r <<= 1;
-        q <<= 1;
         if (r >= mb) {
             r -= mb;
-            q |= 1;
+            q |= (1ULL << i);
         }
     }
-    s16 er = ea - eb;
-    while (q & ~(DMANT_LEAD | DMANT_MASK)) {
-        q >>= 1;
-        er++;
-    }
-    while ((q & DMANT_LEAD) == 0 && q != 0) {
-        q <<= 1;
-        er--;
-    }
     return dpack(sr, er, q);
 }
 
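Note on the `__divdf3` hunk above: the new quotient loop can be exercised on its own with plain mantissas. The sketch below is a hypothetical stand-alone model (`div_mant`, `MANT_LEAD`, and `div_mant_self_check` are illustrative names, and `dclass`/`dpack` are left out); it assumes both inputs are normal 53-bit mantissas with bit 52 set.

```c
#include <assert.h>
#include <stdint.h>

#define MANT_LEAD (1ULL << 52)   /* leading 1 of a 53-bit double mantissa */

/* Returns the truncated 53-bit quotient mantissa of ma / mb.  *er holds
   the exponent difference ea - eb and is decremented when the dividend
   has to be pre-shifted into [mb, 2*mb). */
static uint64_t div_mant(uint64_t ma, uint64_t mb, int *er) {
    assert((ma >> 52) == 1 && (mb >> 52) == 1);
    if (ma < mb) {
        ma <<= 1;
        (*er)--;
    }
    uint64_t q = MANT_LEAD;      /* leading quotient bit is always 1 here */
    uint64_t r = ma - mb;        /* consume the dividend's leading 1 */
    for (int i = 51; i >= 0; i--) {
        r <<= 1;
        if (r >= mb) {
            r -= mb;
            q |= 1ULL << i;
        }
    }
    return q;                    /* truncated toward zero, not rounded */
}

/* Usage: equal mantissas (e.g. 2.0 / 1.0 or 8.0 / 4.0) give exactly 1.0. */
void div_mant_self_check(void) {
    int er = 1;
    assert(div_mant(MANT_LEAD, MANT_LEAD, &er) == MANT_LEAD);
    assert(er == 1);
}
```

With equal mantissas the remainder is zero from the start, so no further quotient bits are set and the exponent difference alone decides the result.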
diff --git a/scripts/smokeTest.sh b/scripts/smokeTest.sh
index 935dd26..b862f9f 100755
--- a/scripts/smokeTest.sh
+++ b/scripts/smokeTest.sh
@@ -1104,7 +1104,10 @@ int toInt(double x) { return (int)x; }
 double fromInt(int n) { return (double)n; }
 EOF
     "$CLANG" --target=w65816 -O2 -ffunction-sections -c "$cDblFile" -o "$oDblFile"
-    "$CLANG" --target=w65816 -O2 -ffunction-sections \
+    # softDouble.c uses -regalloc=fast because __muldf3's 64x64 -> 128
+    # multiply with the inlined alignment shifts overflows the greedy
+    # allocator's spill heuristics on the single-A target.
+    "$CLANG" --target=w65816 -O2 -ffunction-sections -mllvm -regalloc=fast \
         -c "$PROJECT_ROOT/runtime/src/softDouble.c" -o "$oSdFile"
     "$PROJECT_ROOT/tools/link816" -o "$binDblFile" \
         --text-base 0x8000 --map "$mapDblFile" \
@@ -1281,7 +1284,7 @@ int main(void) {
 }
 EOF
     "$CLANG" --target=w65816 -O2 -ffunction-sections -c "$cDblMame" -o "$oDblMame"
-    "$CLANG" --target=w65816 -O2 -ffunction-sections \
+    "$CLANG" --target=w65816 -O2 -ffunction-sections -mllvm -regalloc=fast \
         -c "$PROJECT_ROOT/runtime/src/softDouble.c" -o "$oSdMame"
     "$PROJECT_ROOT/tools/link816" -o "$binDblMame" \
         --text-base 0x1000 \
@@ -1402,7 +1405,7 @@ EOF
             -c "$PROJECT_ROOT/runtime/src/libc.c" -o "$oLibcF"
         "$CLANG" --target=w65816 -O2 -ffunction-sections \
             -c "$PROJECT_ROOT/runtime/src/softFloat.c" -o "$oSfF"
-        "$CLANG" --target=w65816 -O2 -ffunction-sections \
+        "$CLANG" --target=w65816 -O2 -ffunction-sections -mllvm -regalloc=fast \
             -c "$PROJECT_ROOT/runtime/src/softDouble.c" -o "$oSdF"
         oCrt0F="$(mktemp --suffix=.o)"
         "$PROJECT_ROOT/tools/llvm-mos-build/bin/llvm-mc" -arch=w65816 \
@@ -1708,9 +1711,10 @@ EOF
         fi
         rm -f "$cP2File" "$oP2File" "$binP2File"
 
-        # Bubble sort with the loop form that compiles correctly
-        # (i=1..n; inner j+1<n-i+1).  The other form `i<n-1; j<n-i-1`
-        # has an outstanding compiler bug (#65); use this canary form.
+        # Canonical bubble sort.  Both this form (`i < n-1; j < n-i-1`)
+        # and the alternate form work after the BranchExpand bridge
+        # fix.  Catches a regression in either BranchExpand or
+        # TiedDefSpill if the conditional flow gets miscompiled.
         log "check: MAME runs bubble sort [4,1,3,2] → [1,2,3,4]"
         cBsFile="$(mktemp --suffix=.c)"
         oBsFile="$(mktemp --suffix=.o)"
@@ -1721,8 +1725,8 @@ __attribute__((noinline)) void switchToBank2(void) {
 }
 unsigned short data[4] = { 4, 1, 3, 2 };
 __attribute__((noinline)) void bubbleSort(unsigned short *arr, unsigned short n) {
-    for (unsigned short i = 1; i < n; i++) {
-        for (unsigned short j = 0; j + 1 < n - i + 1; j++) {
+    for (unsigned short i = 0; i < n - 1; i++) {
+        for (unsigned short j = 0; j < n - i - 1; j++) {
             if (arr[j] > arr[j+1]) {
                 unsigned short t = arr[j];
                 arr[j] = arr[j+1];
@@ -1752,8 +1756,507 @@ EOF
                   0x025004=0003 0x025006=0004 >/dev/null 2>&1; then
             die "MAME: bubbleSort([4,1,3,2]) != [1,2,3,4]"
         fi
-        rm -f "$cBsFile" "$oBsFile" "$binBsFile" \
-              "$oLibcF" "$oSfF" "$oSdF" "$oCrt0F"
+        rm -f "$cBsFile" "$oBsFile" "$binBsFile"
+
+        # printf("ABCDE") returns 5.  Canary for the BranchExpand
+        # leftover-BRA-Skip bug: without removing the original BRA
+        # after rewriting Bxx to INV_Bxx, the inserted Bridge MBB
+        # becomes unreachable and the conditional flow is lost.  Also
+        # exercises vprintf's main loop end-to-end (no varargs).
+        log "check: MAME runs printf('ABCDE') → 5 (BranchExpand bridge regression)"
+        cPfFile="$(mktemp --suffix=.c)"
+        oPfFile="$(mktemp --suffix=.o)"
+        binPfFile="$(mktemp --suffix=.bin)"
+        cat > "$cPfFile" <<'EOF'
+#include <stdio.h>
+__attribute__((noinline)) void switchToBank2(void) {
+    __asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
+}
+int main(void) {
+    int r = printf("ABCDE");
+    switchToBank2();
+    *(volatile unsigned short *)0x5000 = (unsigned short)r;
+    while (1) {}
+}
+EOF
+        "$CLANG" --target=w65816 -O2 -ffunction-sections \
+            -I"$PROJECT_ROOT/runtime/include" -c "$cPfFile" -o "$oPfFile"
+        "$PROJECT_ROOT/tools/link816" -o "$binPfFile" --text-base 0x1000 \
+            "$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oPfFile" \
+            >/dev/null 2>&1
+        if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" \
+                  "$binPfFile" 0x025000 0005 >/dev/null 2>&1; then
+            die "MAME: printf('ABCDE') != 5 (BranchExpand bridge regression)"
+        fi
+        rm -f "$cPfFile" "$oPfFile" "$binPfFile"
+
+        # Nested loop + multiply, f(4) = 120.  parse('BCDE') with
+        # switch-on-spec used to fail to link with PCREL8-out-of-range
+        # because long unconditional BRA didn't auto-relax to BRL;
+        # W65816BranchExpand now force-promotes long BRA to BRL.
+        log "check: MAME runs nested-loop+multiply f(4) → 120 (regalloc + BRA-relax)"
+        cFnFile="$(mktemp --suffix=.c)"
+        oFnFile="$(mktemp --suffix=.o)"
+        binFnFile="$(mktemp --suffix=.bin)"
+        cat > "$cFnFile" <<'EOF'
+__attribute__((noinline)) void switchToBank2(void) {
+    __asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
+}
+__attribute__((noinline)) unsigned short f(unsigned short n) {
+    unsigned short s = 0;
+    for (unsigned short i = 0; i < n; i++)
+        for (unsigned short j = 0; j < n; j++)
+            s += i*n+j;
+    return s;
+}
+int main(void) {
+    unsigned short r = f(4);
+    switchToBank2();
+    *(volatile unsigned short *)0x5000 = r;
+    while (1) {}
+}
+EOF
+        "$CLANG" --target=w65816 -O2 -ffunction-sections -c \
+            "$cFnFile" -o "$oFnFile"
+        "$PROJECT_ROOT/tools/link816" -o "$binFnFile" --text-base 0x1000 \
+            "$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oFnFile" \
+            >/dev/null 2>&1
+        if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" \
+                  "$binFnFile" 0x025000 0078 >/dev/null 2>&1; then
+            die "MAME: f(4) != 120 (regalloc + BRA-relax regression)"
+        fi
+        rm -f "$cFnFile" "$oFnFile" "$binFnFile"
+
+        # (4th check below) u64add through a noinline boundary: exercises
+        # the ADJCALLSTACKUP teardown's STA $E0 / LDA $E0 path that
+        # preserves Y across the SP-restore.  The earlier PLY*N/2
+        # implementation clobbered Y, so any i64 return came back
+        # with the last popped arg in Y instead of the sum's mid-high.
+        # (3rd check below) Recursive u64 factorial: exercises __muldi3 +
+        # i64 ABI through a recursive noinline boundary.  20! =
+        # 0x21c3_677c_82b4_0000.  Used to come back as garbage because
+        # __divmoddi4_stash read caller args from slot 4 when it was
+        # JSR-called from __muldi3 (slot 4 was a JSL ret address byte, not a_mh).
+        # (2nd check below) dadd through a noinline boundary: exercises
+        # __adddf3 + the full i64-return ABI through a real call.  The
+        # earlier soft-double smoke test ran `c = 1.5 + 2.5` inline, which
+        # clang constant-folds to a literal 0x4010... bit pattern, so it
+        # never actually executed __adddf3.  This one calls a noinline
+        # `dadd` so the libcall and the i64 ABI run end-to-end.
+        # (Next check) printf("%d", n): used to crash MAME entirely because
+        # MachineCSE eliminated the `if (isLong)` re-test of *fmt as a
+        # "redundant" CMP (it had matched an earlier identical CMP), and
+        # the surviving BNE then read whatever leftover P-flag state
+        # happened to be in P from the last spec-dispatch CMP.  Backend
+        # now disables MachineCSE entirely.
+        log "check: MAME runs printf('Hi %%d ok' + newline, 42) → 8 (MachineCSE disable)"
+        cPdFile="$(mktemp --suffix=.c)"
+        oPdFile="$(mktemp --suffix=.o)"
+        binPdFile="$(mktemp --suffix=.bin)"
+        cat > "$cPdFile" <<'EOF'
+#include <stdio.h>
+__attribute__((noinline)) void switchToBank2(void) {
+    __asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
+}
+__attribute__((noinline)) int give42(void) { return 42; }
+int main(void) {
+    // vprintf returns the increment count: 1 per format spec, 1 per
+    // non-spec char.  "Hi %d ok\n" → H,i,' ',%d,' ',o,k,'\n' = 8.
+    int n = printf("Hi %d ok\n", give42());
+    switchToBank2();
+    *(volatile unsigned short *)0x5000 = (unsigned short)n;
+    while (1) {}
+}
+EOF
+        "$CLANG" --target=w65816 -O2 -ffunction-sections \
+            -I"$PROJECT_ROOT/runtime/include" -c \
+            "$cPdFile" -o "$oPdFile"
+        "$PROJECT_ROOT/tools/link816" -o "$binPdFile" --text-base 0x1000 \
+            "$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oPdFile" \
+            >/dev/null 2>&1
+        if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" "$binPdFile" 0x025000 0008 \
+                  >/dev/null 2>&1; then
+            die "MAME: printf('Hi %d ok\\n', 42) != 8 (vprintf isLong / MachineCSE)"
+        fi
+        rm -f "$cPdFile" "$oPdFile" "$binPdFile"
+
+        log "check: MAME runs noinline dadd(1.5,2.5) → 4.0 (__adddf3 + i64 ABI)"
+        cDdFile="$(mktemp --suffix=.c)"
+        oDdFile="$(mktemp --suffix=.o)"
+        binDdFile="$(mktemp --suffix=.bin)"
+        cat > "$cDdFile" <<'EOF'
+__attribute__((noinline)) void switchToBank2(void) {
+    __asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
+}
+__attribute__((noinline)) double dadd(double a, double b) { return a + b; }
+int main(void) {
+    union { double d; unsigned short w[4]; } u;
+    u.d = dadd(1.5, 2.5);
+    switchToBank2();
+    *(volatile unsigned short *)0x5000 = u.w[0];
+    *(volatile unsigned short *)0x5002 = u.w[1];
+    *(volatile unsigned short *)0x5004 = u.w[2];
+    *(volatile unsigned short *)0x5006 = u.w[3];
+    while (1) {}
+}
+EOF
+        "$CLANG" --target=w65816 -O2 -ffunction-sections -c \
+            "$cDdFile" -o "$oDdFile"
+        "$PROJECT_ROOT/tools/link816" -o "$binDdFile" --text-base 0x1000 \
+            "$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oDdFile" \
+            >/dev/null 2>&1
+        if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" "$binDdFile" --check \
+                  0x025000=0000 0x025002=0000 0x025004=0000 0x025006=4010 \
+                  >/dev/null 2>&1; then
+            die "MAME: noinline dadd(1.5,2.5) != 4.0 (i64-ABI through libcall)"
+        fi
+        rm -f "$cDdFile" "$oDdFile" "$binDdFile"
+
+        log "check: MAME runs fact_u64(20) → 0x21c3677c82b40000 (__muldi3 stash slots)"
+        cFkFile="$(mktemp --suffix=.c)"
+        oFkFile="$(mktemp --suffix=.o)"
+        binFkFile="$(mktemp --suffix=.bin)"
+        cat > "$cFkFile" <<'EOF'
+__attribute__((noinline)) void switchToBank2(void) {
+    __asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
+}
+__attribute__((noinline)) unsigned long long fact_u64(unsigned int n) {
+    if (n <= 1) return 1ULL;
+    return (unsigned long long)n * fact_u64(n - 1);
+}
+int main(void) {
+    unsigned long long r = fact_u64(20);
+    union { unsigned long long u; unsigned short w[4]; } u;
+    u.u = r;
+    switchToBank2();
+    *(volatile unsigned short *)0x5000 = u.w[0];
+    *(volatile unsigned short *)0x5002 = u.w[1];
+    *(volatile unsigned short *)0x5004 = u.w[2];
+    *(volatile unsigned short *)0x5006 = u.w[3];
+    while (1) {}
+}
+EOF
+        "$CLANG" --target=w65816 -O2 -ffunction-sections -c \
+            "$cFkFile" -o "$oFkFile"
+        "$PROJECT_ROOT/tools/link816" -o "$binFkFile" --text-base 0x1000 \
+            "$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oFkFile" \
+            >/dev/null 2>&1
+        if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" "$binFkFile" --check \
+                  0x025000=0000 0x025002=82b4 0x025004=677c 0x025006=21c3 \
+                  >/dev/null 2>&1; then
+            die "MAME: fact_u64(20) returned wrong bits (__muldi3 / stash slots)"
+        fi
+        rm -f "$cFkFile" "$oFkFile" "$binFkFile"
+
+        log "check: MAME runs u64add(0x3FF8...,0x4004...) → 0x7FFC... (call-up Y-preserve)"
+        cU64File="$(mktemp --suffix=.c)"
+        oU64File="$(mktemp --suffix=.o)"
+        binU64File="$(mktemp --suffix=.bin)"
+        cat > "$cU64File" <<'EOF'
+__attribute__((noinline)) void switchToBank2(void) {
+    __asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
+}
+__attribute__((noinline)) unsigned long long u64add(unsigned long long a, unsigned long long b) {
+    return a + b;
+}
+int main(void) {
+    unsigned long long c = u64add(0x3FF8000000000000ULL, 0x4004000000000000ULL);
+    union { unsigned long long u; unsigned short w[4]; } u;
+    u.u = c;
+    switchToBank2();
+    *(volatile unsigned short *)0x5000 = u.w[0];
+    *(volatile unsigned short *)0x5002 = u.w[1];
+    *(volatile unsigned short *)0x5004 = u.w[2];
+    *(volatile unsigned short *)0x5006 = u.w[3];
+    while (1) {}
+}
+EOF
+        "$CLANG" --target=w65816 -O2 -ffunction-sections -c \
+            "$cU64File" -o "$oU64File"
+        "$PROJECT_ROOT/tools/link816" -o "$binU64File" --text-base 0x1000 \
+            "$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oU64File" \
+            >/dev/null 2>&1
+        if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" "$binU64File" --check \
+                  0x025000=0000 0x025002=0000 0x025004=0000 0x025006=7ffc \
+                  >/dev/null 2>&1; then
+            die "MAME: u64add through noinline returned wrong middle halves (call-up Y-clobber)"
+        fi
+        rm -f "$cU64File" "$oU64File" "$binU64File"
+
+        log "check: MAME runs addOff(p,1) p[0]+=p[1] → 12 (StackSlotCleanup killed-Y respect)"
+        cAofFile="$(mktemp --suffix=.c)"
+        oAofFile="$(mktemp --suffix=.o)"
+        binAofFile="$(mktemp --suffix=.bin)"
+        cat > "$cAofFile" <<'EOF'
+__attribute__((noinline)) void switchToBank2(void) {
+    __asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
+}
+__attribute__((noinline)) short addOff(short *p, short i) {
+    short b = p[i];
+    p[i-1] = p[i-1] + b;
+    return p[i-1];
+}
+int main(void) {
+    short stk[2] = { 5, 7 };
+    short r = addOff(stk, 1);
+    short s0 = stk[0];
+    switchToBank2();
+    *(volatile unsigned short *)0x5000 = (unsigned short)r;
+    *(volatile unsigned short *)0x5002 = (unsigned short)s0;
+    while (1) {}
+}
+EOF
+        "$CLANG" --target=w65816 -O2 -ffunction-sections -c \
+            "$cAofFile" -o "$oAofFile"
+        "$PROJECT_ROOT/tools/link816" -o "$binAofFile" --text-base 0x1000 \
+            "$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oAofFile" \
+            >/dev/null 2>&1
+        if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" \
+                  "$binAofFile" --check 0x025000=000c 0x025002=000c \
+                  >/dev/null 2>&1; then
+            die "MAME: addOff p[i-1]+=p[i] returned wrong store (NegYIndY/X-clobber or LDY-erase)"
+        fi
+        rm -f "$cAofFile" "$oAofFile" "$binAofFile"
+
+        log "check: MAME runs sqr(10) → 100 (frame-less ADJCALLSTACKUP must emit PLY)"
+        cSqrFile="$(mktemp --suffix=.c)"
+        oSqrFile="$(mktemp --suffix=.o)"
+        binSqrFile="$(mktemp --suffix=.bin)"
+        cat > "$cSqrFile" <<'EOF'
+__attribute__((noinline)) void switchToBank2(void) {
+    __asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
+}
+__attribute__((noinline)) unsigned short sqr(unsigned short x) { return x * x; }
+int main(void) {
+    unsigned short r = sqr(10);
+    switchToBank2();
+    *(volatile unsigned short *)0x5000 = r;
+    while (1) {}
+}
+EOF
+        "$CLANG" --target=w65816 -O2 -ffunction-sections -c \
+            "$cSqrFile" -o "$oSqrFile"
+        "$PROJECT_ROOT/tools/link816" -o "$binSqrFile" --text-base 0x1000 \
+            "$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oSqrFile" \
+            >/dev/null 2>&1
+        if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" \
+                  "$binSqrFile" --check 0x025000=0064 >/dev/null 2>&1; then
+            die "MAME: sqr(10) crashed or != 100 (ADJCALLSTACKUP not emitting PLY for frame-less)"
+        fi
+        rm -f "$cSqrFile" "$oSqrFile" "$binSqrFile"
+
+        log "check: MAME runs ddiv(8.0,4.0) → 2.0 (__divdf3 algorithm fix)"
+        cDdvFile="$(mktemp --suffix=.c)"
+        oDdvFile="$(mktemp --suffix=.o)"
+        binDdvFile="$(mktemp --suffix=.bin)"
+        cat > "$cDdvFile" <<'EOF'
+__attribute__((noinline)) void switchToBank2(void) {
+    __asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
+}
+__attribute__((noinline)) double ddiv(double a, double b) { return a / b; }
+int main(void) {
+    union { double d; unsigned short w[4]; } u;
+    u.d = ddiv(8.0, 4.0);
+    switchToBank2();
+    *(volatile unsigned short *)0x5000 = u.w[0];
+    *(volatile unsigned short *)0x5002 = u.w[1];
+    *(volatile unsigned short *)0x5004 = u.w[2];
+    *(volatile unsigned short *)0x5006 = u.w[3];
+    while (1) {}
+}
+EOF
+        "$CLANG" --target=w65816 -O2 -ffunction-sections -c \
+            "$cDdvFile" -o "$oDdvFile"
+        "$PROJECT_ROOT/tools/link816" -o "$binDdvFile" --text-base 0x1000 \
+            "$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oDdvFile" \
+            >/dev/null 2>&1
+        if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" \
+                  "$binDdvFile" --check 0x025000=0000 0x025002=0000 \
+                  0x025004=0000 0x025006=4000 >/dev/null 2>&1; then
+            die "MAME: ddiv(8,4) != 2.0 (__divdf3 long-division bug)"
+        fi
+        rm -f "$cDdvFile" "$oDdvFile" "$binDdvFile"
+
+        log "check: MAME runs Newton-iter loop → high-half ~1.41 (BranchExpand self-loop BRA fix)"
+        cSqFile="$(mktemp --suffix=.c)"
+        oSqFile="$(mktemp --suffix=.o)"
+        binSqFile="$(mktemp --suffix=.bin)"
+        # 3-iter Newton-method sqrt with a counted for-loop (the loop-back
+        # BRA is a self-loop, which the BranchExpand distance estimator
+        # used to report as 0 bytes, so it never promoted to BRL even
+        # when the loop body grew well past +/-128 bytes).
+        cat > "$cSqFile" <<'EOF'
+__attribute__((noinline)) void switchToBank2(void) {
+    __asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
+}
+__attribute__((noinline)) double sqrt3(double x) {
+    double g = x * 0.5;
+    for (unsigned short i = 0; i < 3; i++)
+        g = (g + x / g) * 0.5;
+    return g;
+}
+int main(void) {
+    union { double d; unsigned short w[4]; } u;
+    u.d = sqrt3(2.0);
+    switchToBank2();
+    // Only the high half is precision-stable (low halves vary slightly
+    // due to truncation vs round-to-nearest in __divdf3).  Verify just
+    // the high half — that's enough to prove the self-loop BRA was
+    // promoted (the link would have failed otherwise) and __divdf3 is
+    // converging to the right magnitude.
+    *(volatile unsigned short *)0x5006 = u.w[3];
+    while (1) {}
+}
+EOF
+        "$CLANG" --target=w65816 -O2 -ffunction-sections -c \
+            "$cSqFile" -o "$oSqFile"
+        "$PROJECT_ROOT/tools/link816" -o "$binSqFile" --text-base 0x1000 \
+            "$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oSqFile" \
+            >/dev/null 2>&1
+        if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" \
+                  "$binSqFile" --check 0x025006=3ff6 >/dev/null 2>&1; then
+            die "MAME: sqrt3(2.0) high half wrong (self-loop BRA / __divdf3)"
+        fi
+        rm -f "$cSqFile" "$oSqFile" "$binSqFile"
+
+        log "check: MAME runs -O0 addOne(7) → 8 (lda-overwrite-immediate fix; fast regalloc)"
+        cO0File="$(mktemp --suffix=.c)"
+        oO0File="$(mktemp --suffix=.o)"
+        binO0File="$(mktemp --suffix=.bin)"
+        cat > "$cO0File" <<'EOF'
+__attribute__((noinline)) void switchToBank2(void) {
+    __asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
+}
+unsigned short addOne(unsigned short a) { return a + 1; }
+int main(void) {
+    unsigned short r = addOne(7);
+    switchToBank2();
+    *(volatile unsigned short *)0x5000 = r;
+    while (1) {}
+}
+EOF
+        "$CLANG" --target=w65816 -O0 -ffunction-sections -c \
+            "$cO0File" -o "$oO0File"
+        "$PROJECT_ROOT/tools/link816" -o "$binO0File" --text-base 0x1000 \
+            "$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oO0File" \
+            >/dev/null 2>&1
+        if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" \
+                  "$binO0File" --check 0x025000=0008 >/dev/null 2>&1; then
+            die "MAME: -O0 addOne(7) != 8 (lda overwrite immediate / regalloc choice)"
+        fi
+        rm -f "$cO0File" "$oO0File" "$binO0File"
+
+        log "check: MAME runs bubble sort with mySwap helper [4,1,3,2] → [1,2,3,4] (greedy across helper-call)"
+        cBshFile="$(mktemp --suffix=.c)"
+        oBshFile="$(mktemp --suffix=.o)"
+        binBshFile="$(mktemp --suffix=.bin)"
+        cat > "$cBshFile" <<'EOF'
+__attribute__((noinline)) void switchToBank2(void) {
+    __asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
+}
+unsigned short bsdata[4] = { 4, 1, 3, 2 };
+__attribute__((noinline)) void mySwap(unsigned short *a, unsigned short *b) {
+    unsigned short t = *a; *a = *b; *b = t;
+}
+__attribute__((noinline)) void mySort(unsigned short *arr, unsigned short n) {
+    for (unsigned short i = 0; i < n - 1; i++)
+        for (unsigned short j = 0; j < n - i - 1; j++)
+            if (arr[j] > arr[j+1])
+                mySwap(&arr[j], &arr[j+1]);
+}
+int main(void) {
+    mySort(bsdata, 4);
+    unsigned short d0 = bsdata[0], d1 = bsdata[1], d2 = bsdata[2], d3 = bsdata[3];
+    switchToBank2();
+    *(volatile unsigned short *)0x5000 = d0;
+    *(volatile unsigned short *)0x5002 = d1;
+    *(volatile unsigned short *)0x5004 = d2;
+    *(volatile unsigned short *)0x5006 = d3;
+    while (1) {}
+}
+EOF
+        "$CLANG" --target=w65816 -O2 -ffunction-sections -c \
+            "$cBshFile" -o "$oBshFile"
+        "$PROJECT_ROOT/tools/link816" -o "$binBshFile" --text-base 0x1000 \
+            "$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oBshFile" \
+            >/dev/null 2>&1
+        if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" \
+                  "$binBshFile" --check 0x025000=0001 0x025002=0002 \
+                  0x025004=0003 0x025006=0004 >/dev/null 2>&1; then
+            die "MAME: mySort with mySwap helper miscompiled (greedy regalloc across call)"
+        fi
+        rm -f "$cBshFile" "$oBshFile" "$binBshFile"
+
+        log "check: MAME runs dmul(8.0,2.0) AFTER bank-switch → 16.0 (DPF0 store + __muldf3)"
+        cDmFile="$(mktemp --suffix=.c)"
+        oDmFile="$(mktemp --suffix=.o)"
+        binDmFile="$(mktemp --suffix=.bin)"
+        cat > "$cDmFile" <<'EOF'
+__attribute__((noinline)) void switchToBank2(void) {
+    __asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
+}
+__attribute__((noinline)) double dmul(double a, double b) { return a * b; }
+int main(void) {
+    union { double d; unsigned short w[4]; } u;
+    switchToBank2();
+    u.d = dmul(8.0, 2.0);
+    *(volatile unsigned short *)0x5000 = u.w[0];
+    *(volatile unsigned short *)0x5002 = u.w[1];
+    *(volatile unsigned short *)0x5004 = u.w[2];
+    *(volatile unsigned short *)0x5006 = u.w[3];
+    while (1) {}
+}
+EOF
+        "$CLANG" --target=w65816 -O2 -ffunction-sections -c \
+            "$cDmFile" -o "$oDmFile"
+        "$PROJECT_ROOT/tools/link816" -o "$binDmFile" --text-base 0x1000 \
+            "$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oDmFile" \
+            >/dev/null 2>&1
+        if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" "$binDmFile" --check \
+                  0x025000=0000 0x025002=0000 0x025004=0000 0x025006=4030 \
+                  >/dev/null 2>&1; then
+            die "MAME: dmul(8,2) under DBR=2 produced wrong bits (DPF0 store / __muldf3)"
+        fi
+        rm -f "$cDmFile" "$oDmFile" "$binDmFile"
+
+        log "check: MAME runs dmath = (a+b)*(a-b), 5,3 → 16.0 (chained libcall ABI)"
+        cDmaFile="$(mktemp --suffix=.c)"
+        oDmaFile="$(mktemp --suffix=.o)"
+        binDmaFile="$(mktemp --suffix=.bin)"
+        cat > "$cDmaFile" <<'EOF'
+__attribute__((noinline)) void switchToBank2(void) {
+    __asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
+}
+__attribute__((noinline)) double dadd(double a, double b) { return a + b; }
+__attribute__((noinline)) double dsub(double a, double b) { return a - b; }
+__attribute__((noinline)) double dmul(double a, double b) { return a * b; }
+__attribute__((noinline)) double dmath(double a, double b) {
+    return dmul(dadd(a, b), dsub(a, b));
+}
+int main(void) {
+    union { double d; unsigned short w[4]; } u;
+    u.d = dmath(5.0, 3.0);
+    switchToBank2();
+    *(volatile unsigned short *)0x5000 = u.w[0];
+    *(volatile unsigned short *)0x5002 = u.w[1];
+    *(volatile unsigned short *)0x5004 = u.w[2];
+    *(volatile unsigned short *)0x5006 = u.w[3];
+    while (1) {}
+}
+EOF
+        "$CLANG" --target=w65816 -O2 -ffunction-sections -c \
+            "$cDmaFile" -o "$oDmaFile"
+        "$PROJECT_ROOT/tools/link816" -o "$binDmaFile" --text-base 0x1000 \
+            "$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oDmaFile" \
+            >/dev/null 2>&1
+        if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" "$binDmaFile" --check \
+                  0x025000=0000 0x025002=0000 0x025004=0000 0x025006=4030 \
+                  >/dev/null 2>&1; then
+            die "MAME: dmath(5,3) returned wrong high half (DP[\$F0] CSE across libcalls)"
+        fi
+        rm -f "$cDmaFile" "$oDmaFile" "$binDmaFile"
+
+        rm -f "$oLibcF" "$oSfF" "$oSdF" "$oCrt0F"
     else
         warn "MAME or apple2gs ROMs not installed; skipping end-to-end test"
     fi
diff --git a/src/llvm/lib/Target/W65816/W65816ABridgeViaX.cpp b/src/llvm/lib/Target/W65816/W65816ABridgeViaX.cpp
index 17c6dcf..562af1d 100644
--- a/src/llvm/lib/Target/W65816/W65816ABridgeViaX.cpp
+++ b/src/llvm/lib/Target/W65816/W65816ABridgeViaX.cpp
@@ -131,6 +131,7 @@ static bool clobbersImg(const MachineInstr &MI,
 
 bool W65816ABridgeViaX::runOnMachineFunction(MachineFunction &MF) {
   if (!MF.getRegInfo().getNumVirtRegs()) return false;
+  if (MF.getFunction().hasOptNone()) return false;
   MachineRegisterInfo &MRI = MF.getRegInfo();
   const W65816Subtarget &STI = MF.getSubtarget<W65816Subtarget>();
   const W65816InstrInfo *TII = STI.getInstrInfo();
diff --git a/src/llvm/lib/Target/W65816/W65816AsmPrinter.cpp b/src/llvm/lib/Target/W65816/W65816AsmPrinter.cpp
index 7ba68b3..043f08b 100644
--- a/src/llvm/lib/Target/W65816/W65816AsmPrinter.cpp
+++ b/src/llvm/lib/Target/W65816/W65816AsmPrinter.cpp
@@ -83,21 +83,71 @@ void W65816AsmPrinter::emitInstruction(const MachineInstr *MI) {
   switch (MI->getOpcode()) {
   default:
     break;
-  case W65816::ADJCALLSTACKDOWN:
+  case W65816::ADJCALLSTACKDOWN: {
+    // DOWN is a no-op in our scheme — the PUSH16 sequence in LowerCall
+    // already shifted SP incrementally as args were pushed.  Nothing
+    // to emit; PEI may or may not have processed it, either is fine.
+    return;
+  }
   case W65816::ADJCALLSTACKUP: {
-    // PEI's eliminateCallFramePseudoInstr removes these *only* when the
-    // function has frame work (StackSize > 0 or any FrameIndex use).
-    // Functions that just tail-call into a libcall (e.g. `int toInt(float
-    // x) { return (int)x; }` lowers to a single jsl __fixsfsi) have
-    // neither; PEI skips its call-frame phase and the pseudo survives
-    // to MC.  AsmStreamer renders the pseudo's "# ADJCALLSTACK..."
-    // string as a comment, but MCObjectStreamer asks the encoder to
-    // emit bytes — which fails ("Unsupported instruction MCInst 337").
-    // Dropping it here is correct: when amt is zero (the "no frame"
-    // path) the call sequence is a no-op anyway; when non-zero, PEI
-    // would have replaced it with PLA-loop / TSC-ADC sequence already.
-    // If we ever see a non-zero amount slip through, that's a real
-    // bug — emit nothing and trust the comment-stripped path.
+    // PEI's eliminateCallFramePseudoInstr handles UP whenever the
+    // function has any frame work (StackSize > 0 or any FI use).
+    // Frame-less functions — e.g. `unsigned short sqr(unsigned short
+    // x) { return x*x; }` lowers to PUSH16 + jsl __mulhi3 + RTL with
+    // no locals — get skipped by PEI's call-frame phase, leaving
+    // ADJCALLSTACKUP as a pseudo all the way to here.  Previously we
+    // silently dropped it, which left SP off by N bytes after the
+    // call and corrupted the caller's stack frame (caught by sqr(x)
+    // crashing under MAME).  Emit the SP fixup ourselves: PLY*N/2 for
+    // small even N, otherwise the TAY/TSC-ADC/TYA bracket.
+    int N = MI->getOperand(0).getImm();
+    if (N == 0) return;
+    // A holds the callee's return value; preserve it.  Walk forward
+    // looking for Y uses (an i64-return half); same idea as
+    // eliminateCallFramePseudoInstr, but only Y matters here.
+    bool YLive = false;
+    for (auto J = std::next(MI->getIterator()); J != MI->getParent()->end();
+         ++J) {
+      if (J->isCall()) break;
+      bool yDef = false;
+      for (const MachineOperand &MO : J->operands()) {
+        if (!MO.isReg()) continue;
+        if (MO.getReg() == W65816::Y) {
+          if (MO.isUse()) { YLive = true; break; }
+          if (MO.isDef()) yDef = true;
+        }
+      }
+      if (YLive || yDef) break;
+    }
+    if (YLive) {
+      // Route through DP $E0 to preserve both A and Y.
+      MCInst Sta; Sta.setOpcode(W65816::STA_DP);
+      Sta.addOperand(MCOperand::createImm(0xE0));
+      EmitToStreamer(*OutStreamer, Sta);
+      MCInst Tsc; Tsc.setOpcode(W65816::TSC); EmitToStreamer(*OutStreamer, Tsc);
+      MCInst Clc; Clc.setOpcode(W65816::CLC); EmitToStreamer(*OutStreamer, Clc);
+      MCInst Adc; Adc.setOpcode(W65816::ADC_Imm16);
+      Adc.addOperand(MCOperand::createImm(N));
+      EmitToStreamer(*OutStreamer, Adc);
+      MCInst Tcs; Tcs.setOpcode(W65816::TCS); EmitToStreamer(*OutStreamer, Tcs);
+      MCInst Lda; Lda.setOpcode(W65816::LDA_DP);
+      Lda.addOperand(MCOperand::createImm(0xE0));
+      EmitToStreamer(*OutStreamer, Lda);
+    } else if (N <= 14 && (N % 2) == 0) {
+      for (int i = 0; i < N / 2; ++i) {
+        MCInst Ply; Ply.setOpcode(W65816::PLY);
+        EmitToStreamer(*OutStreamer, Ply);
+      }
+    } else {
+      MCInst Tay; Tay.setOpcode(W65816::TAY); EmitToStreamer(*OutStreamer, Tay);
+      MCInst Tsc; Tsc.setOpcode(W65816::TSC); EmitToStreamer(*OutStreamer, Tsc);
+      MCInst Clc; Clc.setOpcode(W65816::CLC); EmitToStreamer(*OutStreamer, Clc);
+      MCInst Adc; Adc.setOpcode(W65816::ADC_Imm16);
+      Adc.addOperand(MCOperand::createImm(N));
+      EmitToStreamer(*OutStreamer, Adc);
+      MCInst Tcs; Tcs.setOpcode(W65816::TCS); EmitToStreamer(*OutStreamer, Tcs);
+      MCInst Tya; Tya.setOpcode(W65816::TYA); EmitToStreamer(*OutStreamer, Tya);
+    }
     return;
   }
   case W65816::LDXi16imm: {
diff --git a/src/llvm/lib/Target/W65816/W65816BranchExpand.cpp b/src/llvm/lib/Target/W65816/W65816BranchExpand.cpp
index 3c69b9d..7fc390b 100644
--- a/src/llvm/lib/Target/W65816/W65816BranchExpand.cpp
+++ b/src/llvm/lib/Target/W65816/W65816BranchExpand.cpp
@@ -46,6 +46,7 @@
 #include "llvm/CodeGen/MachineFunctionPass.h"
 #include "llvm/CodeGen/MachineInstr.h"
 #include "llvm/CodeGen/MachineInstrBuilder.h"
+#include "llvm/Support/raw_ostream.h"
 
 using namespace llvm;
 
@@ -100,7 +101,17 @@ static unsigned estimateDistance(MachineFunction &MF,
                                  const MachineInstr &Br,
                                  MachineBasicBlock *To) {
   const MachineBasicBlock *From = Br.getParent();
-  if (From == To) return 0;
+  // Self-loop branch: target is the start of From, branch is somewhere
+  // inside From.  Distance is the bytes from start of From to the
+  // branch instruction (i.e., everything before Br in From).
+  if (From == To) {
+    unsigned Bytes = 0;
+    for (const auto &MI : *From) {
+      if (&MI == &Br) break;
+      Bytes += TII->getInstSizeInBytes(MI);
+    }
+    return Bytes;
+  }
 
   // Two cases by layout direction:
   //   forward: bytes after Br in From, plus all of MBBs strictly
@@ -276,11 +287,30 @@ bool W65816BranchExpand::runOnMachineFunction(MachineFunction &MF) {
   // Step 2: iterate to fixed-point.  Each expansion adds 3 bytes
   // (bridge BRA), which may push another previously-OK branch over
   // the threshold.  Cap at MAX_ITERS to avoid pathological cases.
-  const unsigned EXPAND_DIST_THRESHOLD = 100;  // safe under +/-128
+  const unsigned EXPAND_DIST_THRESHOLD = 90;  // tighter margin under +/-128
   const unsigned MAX_ITERS = 10;
   for (unsigned iter = 0; iter < MAX_ITERS; ++iter) {
     bool Changed = false;
 
+    // Promote long BRA to BRL.  The assembler's BRA→BRL relaxation
+    // sometimes fails to fire when the target symbol resolves early
+    // in MC layout — the linker then sees a PCREL8 reloc that's out
+    // of range.  Force the BRL ourselves when the estimate exceeds
+    // the safe threshold; costs one extra byte if BRA would have fit, but
+    // beats a hard link error.
+    for (auto &MBB : MF) {
+      for (auto &MI : MBB.terminators()) {
+        if (MI.getOpcode() != W65816::BRA) continue;
+        if (MI.getNumOperands() < 1 || !MI.getOperand(0).isMBB()) continue;
+        MachineBasicBlock *Target = MI.getOperand(0).getMBB();
+        unsigned Dist = estimateDistance(MF, TII, MI, Target);
+        if (Dist > EXPAND_DIST_THRESHOLD) {
+          MI.setDesc(TII->get(W65816::BRL));
+          Changed = true;
+        }
+      }
+    }
+
     // Collect candidates.  After step 1, each MBB has at most one
     // conditional terminator, so we walk terminators().
     SmallVector<std::pair<MachineBasicBlock *, MachineInstr *>, 8> Candidates;
@@ -337,6 +367,27 @@ bool W65816BranchExpand::runOnMachineFunction(MachineFunction &MF) {
       // fall-through marker after stays after.
       auto insertPt = MBB->getFirstTerminator();
       BuildMI(*MBB, insertPt, DL, TII->get(InvOpc)).addMBB(Skip);
+      // After the rewrite, MBB falls through to Bridge (which now sits
+      // immediately after MBB in layout).  Any unconditional BRA/BRL
+      // already at the end of MBB used to direct the fall-through to
+      // Skip — but with Bridge interposed, that BRA would skip past
+      // Bridge entirely and Bridge becomes unreachable.  Remove it.
+      // (Skip is still reachable via INV_Bxx; Target is reachable via
+      // fall-through-to-Bridge then BRL.)  Caught by vprintf crashing
+      // because dropDeadConditionalsToBRATarget then dropped the
+      // INV_Bxx as redundant with the leftover BRA Skip.
+      while (insertPt != MBB->end()) {
+        unsigned NextOpc = insertPt->getOpcode();
+        if (NextOpc == W65816::BRA || NextOpc == W65816::BRL) {
+          if (insertPt->getNumOperands() >= 1 &&
+              insertPt->getOperand(0).isMBB() &&
+              insertPt->getOperand(0).getMBB() == Skip) {
+            insertPt = insertPt->eraseFromParent();
+            continue;
+          }
+        }
+        ++insertPt;
+      }
 
       // Bridge: BRL Target.  Always emit the long form rather than
       // relying on the assembler to relax BRA→BRL — the relaxation
diff --git a/src/llvm/lib/Target/W65816/W65816FrameLowering.cpp b/src/llvm/lib/Target/W65816/W65816FrameLowering.cpp
index 8a2df0b..4f5f6f6 100644
--- a/src/llvm/lib/Target/W65816/W65816FrameLowering.cpp
+++ b/src/llvm/lib/Target/W65816/W65816FrameLowering.cpp
@@ -162,15 +162,39 @@ void W65816FrameLowering::emitEpilogue(MachineFunction &MF,
   // Insert before the terminator (the return).
   DebugLoc DL = MBBI != MBB.end() ? MBBI->getDebugLoc() : DebugLoc();
 
+  // Detect whether the return live-out includes Y or X — for i64 returns
+  // (Outs[0..2] -> A,X,Y), Y holds bits 32-47 and X holds bits 16-31, so
+  // any TAY/PLY/TAX in the SP-restore would corrupt the return value.
+  // The RTL terminator carries implicit-uses for every live-out return
+  // register; scan them to decide which scratch we can use safely.
+  bool YLive = false;
+  bool XLive = false;
+  if (MBBI != MBB.end() && MBBI->isReturn()) {
+    for (const MachineOperand &MO : MBBI->operands()) {
+      if (!MO.isReg() || !MO.isImplicit() || !MO.isUse()) continue;
+      if (MO.getReg() == W65816::Y) YLive = true;
+      else if (MO.getReg() == W65816::X) XLive = true;
+    }
+  }
+
   // VLA cleanup: restore entry SP from DP $F4 (saved in prologue).
   // This subsumes BOTH the static frame and any dynamic_stackalloc
   // bytes — we can skip the per-byte PLY/PLA loop entirely.  Preserve
-  // A through TAY/TYA since it holds the return value.
+  // A through TAY/TYA since it holds the return value.  For i64
+  // returns where Y is also live, route the save through DP $E0
+  // ($E0..$EF is libcall scratch — guaranteed dead by epilogue time).
   if (HasVLA) {
-    BuildMI(MBB, MBBI, DL, TII.get(W65816::TAY));
-    BuildMI(MBB, MBBI, DL, TII.get(W65816::LDA_DP)).addImm(0xF4);
-    BuildMI(MBB, MBBI, DL, TII.get(W65816::TCS));
-    BuildMI(MBB, MBBI, DL, TII.get(W65816::TYA));
+    if (YLive) {
+      BuildMI(MBB, MBBI, DL, TII.get(W65816::STA_DP)).addImm(0xE0);
+      BuildMI(MBB, MBBI, DL, TII.get(W65816::LDA_DP)).addImm(0xF4);
+      BuildMI(MBB, MBBI, DL, TII.get(W65816::TCS));
+      BuildMI(MBB, MBBI, DL, TII.get(W65816::LDA_DP)).addImm(0xE0);
+    } else {
+      BuildMI(MBB, MBBI, DL, TII.get(W65816::TAY));
+      BuildMI(MBB, MBBI, DL, TII.get(W65816::LDA_DP)).addImm(0xF4);
+      BuildMI(MBB, MBBI, DL, TII.get(W65816::TCS));
+      BuildMI(MBB, MBBI, DL, TII.get(W65816::TYA));
+    }
     return;
   }
 
@@ -182,11 +206,26 @@ void W65816FrameLowering::emitEpilogue(MachineFunction &MF,
   // N/2 PLY (pop into Y, discard); larger frames use
   // TAY/TSC/CLC/ADC #N/TCS/TYA.
   // Mirror the prologue threshold (see comment there).
-  if (StackSize <= 6 && (StackSize % 2) == 0) {
+  if (StackSize <= 6 && (StackSize % 2) == 0 && !YLive) {
+    // PLY clobbers Y, which is fine when Y isn't a return reg.
     for (uint64_t i = 0; i < StackSize / 2; ++i)
       BuildMI(MBB, MBBI, DL, TII.get(W65816::PLY));
     return;
   }
+  if (YLive) {
+    // Y is a return register (i64 / double).  Save A via DP $E0
+    // instead of TAY so Y survives.  4 cyc slower than TAY/TYA but
+    // correct.  X is allowed to be live too — none of these touch X.
+    BuildMI(MBB, MBBI, DL, TII.get(W65816::STA_DP)).addImm(0xE0);
+    BuildMI(MBB, MBBI, DL, TII.get(W65816::TSC));
+    BuildMI(MBB, MBBI, DL, TII.get(W65816::CLC));
+    BuildMI(MBB, MBBI, DL, TII.get(W65816::ADC_Imm16))
+        .addImm(StackSize);
+    BuildMI(MBB, MBBI, DL, TII.get(W65816::TCS));
+    BuildMI(MBB, MBBI, DL, TII.get(W65816::LDA_DP)).addImm(0xE0);
+    (void)XLive;
+    return;
+  }
   BuildMI(MBB, MBBI, DL, TII.get(W65816::TAY));
   BuildMI(MBB, MBBI, DL, TII.get(W65816::TSC));
   BuildMI(MBB, MBBI, DL, TII.get(W65816::CLC));
@@ -207,15 +246,56 @@ MachineBasicBlock::iterator W65816FrameLowering::eliminateCallFramePseudoInstr(
   // ADJCALLSTACKUP releases all the pushed bytes after a call.
   //
   // Critical: A holds the callee's return value here, so this MUST NOT
-  // clobber A.  The naive `tsc;clc;adc #N;tcs` does (TSC overwrites A),
-  // which silently corrupts every call's return value.  Same fix as the
-  // epilogue: small N via PLY (clobbers Y, preserves A); larger N via
-  // TAY/.../TYA bracket.
+  // clobber A.  PLY (small-N path) clobbers Y; TAY/.../TYA bracket
+  // (large-N path) also clobbers Y.  Both are fine for i8/i16/i32
+  // returns but DESTROY the return for i64/double (where X and Y hold
+  // mid halves).  Detect i64-return calls by walking forward from the
+  // ADJCALLSTACKUP looking for uses of $x/$y; if Y is live, save A via
+  // DP $E0 (libcall scratch, dead by call-up time) so X and Y survive.
+  // Caught by `unsigned long long u64add(a,b)` through a noinline
+  // boundary returning Y = b_hi (the last popped) instead of the
+  // sum's mid-high.
   if (I->getOpcode() == W65816::ADJCALLSTACKUP) {
     int N = I->getOperand(0).getImm();
     if (N > 0) {
       DebugLoc DL = I->getDebugLoc();
-      if (N <= 14 && (N % 2) == 0) {
+      bool YLive = false;
+      bool XLive = false;
+      // Walk forward looking for COPY %vreg = $x / $y — LowerCall's
+      // pattern for materializing return halves.  JSLpseudo's tablegen
+      // declares only `Defs=[A]`, so implicit-defs of X/Y aren't on
+      // the call op itself.  We have to read what comes after.
+      // Stop at the next call (re-clobbers everything) or at any def
+      // of X/Y (cancels their post-call value).
+      bool Stopped = false;
+      for (auto J = std::next(I); J != MBB.end() && !Stopped; ++J) {
+        if (J->isCall()) break;
+        for (const MachineOperand &MO : J->operands()) {
+          if (!MO.isReg()) continue;
+          Register R = MO.getReg();
+          if (R == W65816::Y) {
+            if (MO.isUse()) YLive = true;
+            else if (MO.isDef() && !YLive) Stopped = true;
+          } else if (R == W65816::X) {
+            if (MO.isUse()) XLive = true;
+            else if (MO.isDef() && !XLive) Stopped = true;
+          }
+        }
+        if (YLive && XLive) break;
+      }
+      if (YLive) {
+        // i64 return: PLY would eat Y.  Route through DP $E0.  Costs
+        // ~4 cyc more than PLY*N/2 but correctness wins.  X is not
+        // touched by any of these insns either way, so XLive doesn't
+        // change anything here — track it for symmetry.
+        BuildMI(MBB, I, DL, TII.get(W65816::STA_DP)).addImm(0xE0);
+        BuildMI(MBB, I, DL, TII.get(W65816::TSC));
+        BuildMI(MBB, I, DL, TII.get(W65816::CLC));
+        BuildMI(MBB, I, DL, TII.get(W65816::ADC_Imm16)).addImm(N);
+        BuildMI(MBB, I, DL, TII.get(W65816::TCS));
+        BuildMI(MBB, I, DL, TII.get(W65816::LDA_DP)).addImm(0xE0);
+        (void)XLive;
+      } else if (N <= 14 && (N % 2) == 0) {
         for (int i = 0; i < N / 2; ++i)
           BuildMI(MBB, I, DL, TII.get(W65816::PLY));
       } else {
diff --git a/src/llvm/lib/Target/W65816/W65816ISelLowering.cpp b/src/llvm/lib/Target/W65816/W65816ISelLowering.cpp
index 1d0865e..bf398d8 100644
--- a/src/llvm/lib/Target/W65816/W65816ISelLowering.cpp
+++ b/src/llvm/lib/Target/W65816/W65816ISelLowering.cpp
@@ -861,10 +861,17 @@ W65816TargetLowering::LowerCall(TargetLowering::CallLoweringInfo &CLI,
       Glue = V.getValue(2);
       InVals.push_back(V);
     } else {
-      // 4th half: load from DP $F0.
-      SDValue DPAddr = DAG.getConstant(0xF0, DL, MVT::i16);
-      SDValue V = DAG.getLoad(VT, DL, Chain, DPAddr, MachinePointerInfo());
+      // 4th half: read DP[$F0..$F1] via CopyFromReg(DPF0).  DPF0 is a
+      // pseudo-physreg modeled as JSLpseudo's implicit-def, so each
+      // call's CopyFromReg has Glue tied to the corresponding call —
+      // the SDAG combiner can't merge them and the scheduler can't
+      // reorder them past the next call.  copyPhysReg lowers DPF0 →
+      // A as `LDA $F0`.  Without this, plain `getLoad(0xF0)` was
+      // being CSE'd / reordered across i64-returning calls, causing
+      // `dmath = (a+b)*(a-b)` to return 4 instead of 16.
+      SDValue V = DAG.getCopyFromReg(Chain, DL, W65816::DPF0, VT, Glue);
       Chain = V.getValue(1);
+      Glue = V.getValue(2);
       InVals.push_back(V);
     }
   }
@@ -900,11 +907,17 @@ SDValue W65816TargetLowering::LowerReturn(
   SDValue Glue;
   SmallVector<SDValue, 8> RetOps(1, Chain);
 
-  // Outs[3] -> store to DP $F0 (only for i64 returns).  Done first so
-  // its computation can use A freely before A holds the low result.
+  // Outs[3] -> DP $F0 via CopyToReg(DPF0).  Using the DPF0 fake physreg
+  // (lowered to `STA $F0` by copyPhysReg) is critical: a generic
+  // ISD::STORE with addr=0xF0 lowered to `sta (d,s),y`, an indirect
+  // through the DBR, which silently misbehaved when DBR != 0.  STA dp
+  // uses D + dp directly and is unaffected by DBR.  Done first so its
+  // computation can use A freely before A holds the low result.  Glued
+  // to RET_GLUE via the RetOps Register entry below so DCE doesn't
+  // strip the COPY.
   if (Outs.size() >= 4) {
-    SDValue DPAddr = DAG.getConstant(0xF0, DL, MVT::i16);
-    Chain = DAG.getStore(Chain, DL, OutVals[3], DPAddr, MachinePointerInfo());
+    Chain = DAG.getCopyToReg(Chain, DL, W65816::DPF0, OutVals[3], Glue);
+    Glue = Chain.getValue(1);
   }
   // Outs[2] -> Y.
   if (Outs.size() >= 3) {
@@ -926,6 +939,8 @@ SDValue W65816TargetLowering::LowerReturn(
     RetOps.push_back(DAG.getRegister(W65816::X, Outs[1].VT));
   if (Outs.size() >= 3)
     RetOps.push_back(DAG.getRegister(W65816::Y, Outs[2].VT));
+  if (Outs.size() >= 4)
+    RetOps.push_back(DAG.getRegister(W65816::DPF0, Outs[3].VT));
 
   RetOps[0] = Chain;
   if (Glue.getNode())
diff --git a/src/llvm/lib/Target/W65816/W65816InstrInfo.cpp b/src/llvm/lib/Target/W65816/W65816InstrInfo.cpp
index 702d8ad..81226fa 100644
--- a/src/llvm/lib/Target/W65816/W65816InstrInfo.cpp
+++ b/src/llvm/lib/Target/W65816/W65816InstrInfo.cpp
@@ -92,6 +92,44 @@ void W65816InstrInfo::copyPhysReg(MachineBasicBlock &MBB,
     BuildMI(MBB, I, DL, get(W65816::STA_DP)).addImm(dstImg);
     return;
   }
+  // X → IMGn / IMGn → X: STX dp / LDX dp.  Avoids the A-bridge that
+  // TAX/TXA would impose; critical for i32-first-arg signatures
+  // (live-in $a + $x) where bridging X via A clobbers $a's value
+  // before it can be saved.  Caught by udivmod and iterative qsort.
+  if (dstImg >= 0 && SrcReg == W65816::X) {
+    BuildMI(MBB, I, DL, get(W65816::STX_DP)).addImm(dstImg);
+    return;
+  }
+  if (DestReg == W65816::X && srcImg >= 0) {
+    BuildMI(MBB, I, DL, get(W65816::LDX_DP)).addImm(srcImg);
+    return;
+  }
+  // Y → IMGn / IMGn → Y: STY dp / LDY dp — symmetric.
+  if (dstImg >= 0 && SrcReg == W65816::Y) {
+    BuildMI(MBB, I, DL, get(W65816::STY_DP)).addImm(dstImg);
+    return;
+  }
+  if (DestReg == W65816::Y && srcImg >= 0) {
+    BuildMI(MBB, I, DL, get(W65816::LDY_DP)).addImm(srcImg);
+    return;
+  }
+  // DPF0 → A: emit `LDA $F0`.  DPF0 is the pseudo-physreg carrier
+  // for an i64-returning call's high 16 bits; LowerCall builds a
+  // CopyFromReg(DPF0) glued to the call so the SDAG combiner /
+  // scheduler can't merge or reorder reads across calls.
+  if (DestReg == W65816::A && SrcReg == W65816::DPF0) {
+    BuildMI(MBB, I, DL, get(W65816::LDA_DP)).addImm(0xF0);
+    return;
+  }
+  // A → DPF0: emit `STA $F0`.  Used by LowerReturn for the i64 high
+  // half; using a true direct-page store is critical because plain
+  // ISD::STORE with addr=0xF0 was lowering to `(d,s),y` indirect via
+  // DBR — which silently broke under DBR != 0 (e.g. after a bank
+  // switch).  STA dp uses D + dp directly, ignoring DBR.
+  if (DestReg == W65816::DPF0 && SrcReg == W65816::A) {
+    BuildMI(MBB, I, DL, get(W65816::STA_DP)).addImm(0xF0);
+    return;
+  }
   llvm_unreachable("W65816: cross-class copyPhysReg not yet implemented");
 }
 
@@ -101,8 +139,14 @@ void W65816InstrInfo::storeRegToStackSlot(
     MachineInstr::MIFlag Flags) const {
   // STAfi gets eliminated by W65816RegisterInfo::eliminateFrameIndex into
   // a real STA d,S.  Source is implicit A; emit the pseudo with the FI
-  // and zero offset.
+  // and zero offset.  When regalloc hands us a spill from X or Y, bridge
+  // through A (TXA / TYA) — same rationale as loadRegFromStackSlot.
   DebugLoc DL = MI != MBB.end() ? MI->getDebugLoc() : DebugLoc();
+  if (SrcReg == W65816::X || SrcReg == W65816::Y) {
+    unsigned XferOp = (SrcReg == W65816::X) ? W65816::TXA : W65816::TYA;
+    BuildMI(MBB, MI, DL, get(XferOp));
+    SrcReg = W65816::A;
+  }
   BuildMI(MBB, MI, DL, get(W65816::STAfi))
       .addReg(SrcReg, getKillRegState(isKill))
       .addFrameIndex(FrameIdx)
@@ -115,9 +159,30 @@ void W65816InstrInfo::loadRegFromStackSlot(MachineBasicBlock &MBB,
                                            const TargetRegisterClass *RC,
                                            Register VReg, unsigned SubReg,
                                            MachineInstr::MIFlag Flags) const {
-  // Mirror image of storeRegToStackSlot: emit LDAfi, which the frame
-  // index pass turns into LDA d,S.
+  // LDAfi only knows how to put the value in A.  If regalloc asks for
+  // a reload into X or Y, we have to bridge through A: LDA d,S then
+  // TAX / TAY.  Without this, the MIR has `$x = LDAfi` but the asm
+  // printer emits just `LDA d,S` (which writes A, not X) — a silent
+  // miscompile that surfaced as i64 subtract chains using stale X
+  // values for the second word (caught by udivmod's `a - q*b` mod
+  // computation).
   DebugLoc DL = MI != MBB.end() ? MI->getDebugLoc() : DebugLoc();
+  if (DestReg == W65816::A) {
+    BuildMI(MBB, MI, DL, get(W65816::LDAfi), DestReg)
+        .addFrameIndex(FrameIdx)
+        .addImm(0);
+    return;
+  }
+  if (DestReg == W65816::X || DestReg == W65816::Y) {
+    // Load via A, then transfer.  A is implicitly clobbered.
+    BuildMI(MBB, MI, DL, get(W65816::LDAfi), W65816::A)
+        .addFrameIndex(FrameIdx)
+        .addImm(0);
+    unsigned XferOp = (DestReg == W65816::X) ? W65816::TAX : W65816::TAY;
+    BuildMI(MBB, MI, DL, get(XferOp));
+    return;
+  }
+  // Fallback: assume A path (covers Acc16 / Wide16 vregs by class).
   BuildMI(MBB, MI, DL, get(W65816::LDAfi), DestReg)
       .addFrameIndex(FrameIdx)
       .addImm(0);
diff --git a/src/llvm/lib/Target/W65816/W65816InstrInfo.td b/src/llvm/lib/Target/W65816/W65816InstrInfo.td
index 01518df..641664b 100644
--- a/src/llvm/lib/Target/W65816/W65816InstrInfo.td
+++ b/src/llvm/lib/Target/W65816/W65816InstrInfo.td
@@ -70,6 +70,7 @@ def W65816pushx : SDNode<"W65816ISD::PUSH_X", SDTNone,
                          [SDNPHasChain, SDNPInGlue, SDNPOutGlue,
                           SDNPSideEffect, SDNPMayStore]>;
 
+
 // SELECT_CC: takes (TVal, FVal, CC) plus a glue value carrying the
 // flags from a preceding W65816cmp.  Lowered by EmitInstrWithCustomInserter
 // into a CMP (already in the BB) + Bxx + diamond CFG + PHI.
@@ -1356,10 +1357,18 @@ def : Pat<(store
 // function doesn't have to know how it was called to choose its
 // return instruction.  A pseudo bridges the i16 symbol operand
 // to JSL_Long's 24-bit operand class.
+// Defs include DPF0 — every i64-returning libcall clobbers DP[$F0]
+// (it's the carrier for the highest 16 bits of the return).  The
+// LowerCall side captures the pre-call DPF0 via CopyFromReg(DPF0)
+// glued to the call so the SDAG combiner / scheduler can't merge
+// or reorder reads across calls.  Without DPF0 in Defs, plain
+// `getLoad(0xF0)` was being CSE'd across calls, leading to
+// `dmath = (a+b)*(a-b)` returning 4 instead of 16.
 let isCall = 1, hasSideEffects = 0, mayLoad = 0, mayStore = 0,
-    Defs = [A] in {
+    Defs = [A, DPF0] in {
 def JSLpseudo : W65816Pseudo<(outs), (ins i16imm:$dst),
                              "# JSLpseudo $dst", []>;
 }
+
 def : Pat<(W65816call (i16 tglobaladdr:$dst)),  (JSLpseudo tglobaladdr:$dst)>;
 def : Pat<(W65816call (i16 texternalsym:$dst)), (JSLpseudo texternalsym:$dst)>;
diff --git a/src/llvm/lib/Target/W65816/W65816MachineFunctionInfo.h b/src/llvm/lib/Target/W65816/W65816MachineFunctionInfo.h
index 88c02b2..f6a4d78 100644
--- a/src/llvm/lib/Target/W65816/W65816MachineFunctionInfo.h
+++ b/src/llvm/lib/Target/W65816/W65816MachineFunctionInfo.h
@@ -40,6 +40,7 @@ class W65816MachineFunctionInfo : public MachineFunctionInfo {
   /// STA8abs needs an SEP/REP wrap in M=0 to avoid a 2-byte store).
   bool UsesAcc8 = false;
 
+
 public:
   W65816MachineFunctionInfo() = default;
 
diff --git a/src/llvm/lib/Target/W65816/W65816NegYIndY.cpp b/src/llvm/lib/Target/W65816/W65816NegYIndY.cpp
index e6f3a7f..dd7fc82 100644
--- a/src/llvm/lib/Target/W65816/W65816NegYIndY.cpp
+++ b/src/llvm/lib/Target/W65816/W65816NegYIndY.cpp
@@ -89,6 +89,31 @@ bool W65816NegYIndY::runOnMachineFunction(MachineFunction &MF) {
           continue;
         unsigned Disp = MI.getOperand(0).getImm() & 0xFF;
         DebugLoc DL = MI.getDebugLoc();
+        // X-liveness check: SpillToX may have stashed a value in X
+        // that's used after this rewrite.  If so, save X to DP $E2
+        // (libcall scratch high half — $E0 is reserved for the A-save
+        // dance in eliminateCallFramePseudoInstr) and restore after.
+        // Walk forward from MI looking for an X use without a prior
+        // X def; if found, X is live and we must preserve it.
+        bool XLive = false;
+        for (auto Scan = std::next(MachineBasicBlock::iterator(&MI));
+             Scan != MBB.end(); ++Scan) {
+          if (Scan->isDebugInstr()) continue;
+          bool xDef = false;
+          for (const MachineOperand &MO : Scan->operands()) {
+            if (!MO.isReg()) continue;
+            if (MO.getReg() == W65816::X) {
+              if (MO.isUse()) { XLive = true; break; }
+              if (MO.isDef()) xDef = true;
+            }
+          }
+          if (XLive || xDef) break;
+        }
+        if (XLive) {
+          // Save X to DP $E2 (don't use $E0 — that's the A-preserve
+          // slot in call-frame teardown and may be live).
+          BuildMI(MBB, MI, DL, TII->get(W65816::STX_DP)).addImm(0xE2);
+        }
         if (IsLDA) {
           // LDA disp,S ; CLC ; ADC #neg ; TAX ; LDA $0000,X
           BuildMI(MBB, MI, DL, TII->get(W65816::LDA_StackRel))
@@ -127,6 +152,10 @@ bool W65816NegYIndY::runOnMachineFunction(MachineFunction &MF) {
               .addImm(0)
               .addReg(W65816::A, RegState::Implicit);
         }
+        if (XLive) {
+          // Restore X from DP $E2.
+          BuildMI(MBB, MI, DL, TII->get(W65816::LDX_DP)).addImm(0xE2);
+        }
         // Erase original LDY and the (sr,s),Y op.
         if (LastLDY) { LastLDY->eraseFromParent(); LastLDY = nullptr; }
         MI.eraseFromParent();
diff --git a/src/llvm/lib/Target/W65816/W65816RegisterInfo.cpp b/src/llvm/lib/Target/W65816/W65816RegisterInfo.cpp
index aa1752b..7d5715b 100644
--- a/src/llvm/lib/Target/W65816/W65816RegisterInfo.cpp
+++ b/src/llvm/lib/Target/W65816/W65816RegisterInfo.cpp
@@ -73,7 +73,30 @@ bool W65816RegisterInfo::eliminateFrameIndex(MachineBasicBlock::iterator II,
   bool NeedsCarryPrefix = false;
   bool IsSub = false;
   switch (Opc) {
-  case W65816::LDAfi: NewOpc = W65816::LDA_StackRel; break;
+  case W65816::LDAfi: {
+    // LDAfi targets A.  If the regalloc parked the dest in X or Y
+    // (which can happen via Idx16 vreg coalescing), bridge through A
+    // by appending a TAX / TAY.
+    Register Dst = MI.getOperand(0).getReg();
+    int FI = MI.getOperand(FIOperandNum).getIndex();
+    int FrameOffset = MFI.getObjectOffset(FI);
+    int ImmOffset = MI.getOperand(FIOperandNum + 1).getImm();
+    int Offset = FrameOffset + ImmOffset + (int)MFI.getStackSize() + SPAdj;
+    if (FrameOffset < 0) Offset += 1;
+    if (Offset < 0 || Offset > 0xFF)
+      report_fatal_error("W65816: frame offset out of stack-relative range");
+    BuildMI(*MI.getParent(), II, MI.getDebugLoc(),
+            TII.get(W65816::LDA_StackRel))
+        .addImm(Offset)
+        .addReg(W65816::A, RegState::ImplicitDefine);
+    if (Dst == W65816::X) {
+      BuildMI(*MI.getParent(), II, MI.getDebugLoc(), TII.get(W65816::TAX));
+    } else if (Dst == W65816::Y) {
+      BuildMI(*MI.getParent(), II, MI.getDebugLoc(), TII.get(W65816::TAY));
+    }
+    MI.eraseFromParent();
+    return true;
+  }
   case W65816::STAfi: {
     // Wide16-source STAfi: if the source ended up in IMGn (DP-backed),
     // prepend LDA dp so the value reaches A before the actual store.
@@ -108,6 +131,12 @@ bool W65816RegisterInfo::eliminateFrameIndex(MachineBasicBlock::iterator II,
       BuildMI(*MI.getParent(), II, MI.getDebugLoc(),
               TII.get(W65816::LDA_DP)).addImm(srcDP);
     }
+    // Note: STAfi with X or Y source is NOT supported here — adding a
+    // TXA/TYA pre-bracket would clobber A which a downstream STAfi $a
+    // may still need (the prologue stashes arg0_lo from A and arg0_ml
+    // from X via two adjacent STAfi, and putting A's STA *before* X's
+    // is the caller's responsibility).  storeRegToStackSlot already
+    // bridges X/Y → A for spills it generates.
     BuildMI(*MI.getParent(), II, MI.getDebugLoc(),
             TII.get(W65816::STA_StackRel))
         .addImm(Offset)
diff --git a/src/llvm/lib/Target/W65816/W65816RegisterInfo.td b/src/llvm/lib/Target/W65816/W65816RegisterInfo.td
index d703239..574cefe 100644
--- a/src/llvm/lib/Target/W65816/W65816RegisterInfo.td
+++ b/src/llvm/lib/Target/W65816/W65816RegisterInfo.td
@@ -55,6 +55,15 @@ def IMG5 : W65816Reg<21, "img5">, DwarfRegNum<[21]>;
 def IMG6 : W65816Reg<22, "img6">, DwarfRegNum<[22]>;
 def IMG7 : W65816Reg<23, "img7">, DwarfRegNum<[23]>;
 
+// DPF0 — pseudo-physreg modeling the i16 storage at DP $F0..$F1.
+// Used as the carrier for the highest 16 bits of an i64/double
+// return.  JSLpseudo Defs DPF0 so the SDAG combiner / scheduler
+// can't merge or reorder reads of it across calls; we plumb the
+// 4th return half via CopyFromReg(DPF0) in LowerCall, which lowers
+// to `LDA $F0` via copyPhysReg.  Never allocated to a vreg —
+// always a transient bridge from DP[$F0] to A.
+def DPF0 : W65816Reg<24, "dpf0">, DwarfRegNum<[24]>;
+
 //===----------------------------------------------------------------------===//
 //  Register Classes
 //===----------------------------------------------------------------------===//
@@ -90,6 +99,13 @@ def Wide16 : RegisterClass<"W65816", [i16], 16,
 
 def PtrRegs : RegisterClass<"W65816", [i16], 16, (add SP)>;
 
+// Single-register class for DPF0, the i64-return high-half carrier.
+// Not allocatable — only used as a CopyFromReg source in LowerCall;
+// copyPhysReg lowers DPF0 → A by emitting `LDA $F0`.
+def DPF0Reg : RegisterClass<"W65816", [i16], 16, (add DPF0)> {
+  let isAllocatable = 0;
+}
+
 // Single-register class for the processor status register, used for condition
 // code modeling.  Not currently allocatable.
 def StatusReg : RegisterClass<"W65816", [i8], 8, (add P)> {
diff --git a/src/llvm/lib/Target/W65816/W65816StackSlotCleanup.cpp b/src/llvm/lib/Target/W65816/W65816StackSlotCleanup.cpp
index 11ebd30..a7966db 100644
--- a/src/llvm/lib/Target/W65816/W65816StackSlotCleanup.cpp
+++ b/src/llvm/lib/Target/W65816/W65816StackSlotCleanup.cpp
@@ -1217,6 +1217,13 @@ bool W65816StackSlotCleanup::runOnMachineFunction(MachineFunction &MF) {
         }
         if (MI.isCall()) break;
         if (MI.modifiesRegister(W65816::Y, TRI)) break;
+        // killsRegister: an instruction with `implicit killed $y` USES Y
+        // and that's the LAST use — Y is dead after.  We must NOT treat
+        // a subsequent LDY_Imm16 #N as redundant after a kill, because
+        // the held value is conceptually gone.  Caught by `addOff(p,i)
+        // { p[i-1] += p[i]; }` where LDY -2 ; LDA_indY (kills Y) ; ... ;
+        // LDY -2 ; STA_indY needs the second LDY to reinitialize Y.
+        if (MI.killsRegister(W65816::Y, TRI)) break;
         if (MI.isInlineAsm() || MI.isBranch() || MI.isReturn()) break;
         ++It;
       }
diff --git a/src/llvm/lib/Target/W65816/W65816TargetMachine.cpp b/src/llvm/lib/Target/W65816/W65816TargetMachine.cpp
index e86633b..6ca79fa 100644
--- a/src/llvm/lib/Target/W65816/W65816TargetMachine.cpp
+++ b/src/llvm/lib/Target/W65816/W65816TargetMachine.cpp
@@ -14,6 +14,7 @@
 #include "W65816.h"
 #include "W65816MachineFunctionInfo.h"
 #include "TargetInfo/W65816TargetInfo.h"
+#include "llvm/CodeGen/MachineCSE.h"
 #include "llvm/CodeGen/Passes.h"
 #include "llvm/CodeGen/TargetLoweringObjectFileImpl.h"
 #include "llvm/CodeGen/TargetPassConfig.h"
@@ -82,16 +83,19 @@ public:
   void addPreRegAlloc() override;
   void addPostRegAlloc() override;
   void addPreEmitPass() override;
+  void addMachineSSAOptimization() override;
 
-  // W65816's only 16-bit ALU register is A.  We use fast regalloc by
-  // default — always succeeds, ~30-50% bigger code than greedy in
-  // pathological cases but correctness is paramount.  Greedy fails
-  // outright on functions with 4+ simultaneously live i16 vregs (heap
-  // sift etc.).  TiedDefSpill (pre-RA) handles the tied-def-multi-use
-  // hazard for the sub-pattern that's frequent enough to matter.
+  // W65816's only 16-bit ALU register is A.  Greedy at -O1+ produces
+  // tight code; at -O0 (where optnone disables coalescing/CSE), greedy
+  // leaves spurious COPY pseudos that lower to STA dp / LDA dp pairs
+  // around modify-in-place ops (e.g. INA), miscompiling a + 1.  Use
+  // fast regalloc when the target framework signals unoptimized.
+  // TiedDefSpill (pre-RA) handles the tied-def-multi-use hazard for
+  // the sub-pattern that's frequent enough to matter at -O1+.
   //
-  FunctionPass *createTargetRegisterAllocator(bool /*Optimized*/) override {
-    return createGreedyRegisterAllocator();
+  FunctionPass *createTargetRegisterAllocator(bool Optimized) override {
+    return Optimized ? createGreedyRegisterAllocator()
+                     : createFastRegisterAllocator();
   }
 };
 
@@ -101,6 +105,24 @@ TargetPassConfig *W65816TargetMachine::createPassConfig(PassManagerBase &PM) {
   return new W65816PassConfig(*this, PM);
 }
 
+void W65816PassConfig::addMachineSSAOptimization() {
+  // MachineCSE incorrectly eliminates "redundant" CMP instructions when
+  // it sees an earlier identical CMP elsewhere in the function — the
+  // P (status) flag is considered "available", but on this target P is
+  // clobbered by every intervening LDA/STA/ADC, so the surviving Bxx
+  // ends up dispatching on stale flags.  We don't model `Uses=[P]` on
+  // Bxx because doing so causes regalloc/layout shifts that uncovered
+  // a different latent bug in vprintf.  Disabling the pass entirely
+  // is the lower-cost workaround until the Bxx-Uses=[P] regression is
+  // root-caused.  Caught by `printf("%d", n)` returning 0.
+  //
+  // Other SSA opts (early-tailduplication, opt-phis, dead-mi-elim,
+  // licm, machine-sink, peephole-opt, etc.) still run by chaining
+  // through the default impl — we just skip MachineCSE.
+  disablePass(&MachineCSELegacyID);
+  TargetPassConfig::addMachineSSAOptimization();
+}
+
 void W65816PassConfig::addPreRegAlloc() {
   addPass(createW65816ABridgeViaX());
   addPass(createW65816TiedDefSpill());
@@ -125,7 +147,11 @@ void W65816PassConfig::addPreEmitPass() {
   addPass(createW65816SpillToX());
   // Rewrite negative-Y indirect-Y stack-rel ops.  Must run BEFORE
   // BranchExpand because the rewrite expands one instruction into
-  // several and shifts branch distances.
+  // several and shifts branch distances.  The pass internally checks
+  // X-liveness and saves/restores X via DP $E2 when SpillToX has
+  // a value parked there; without that check, the rewrite's TAX
+  // would clobber spill-bridged values (caught by `addOff(p,i) {
+  // p[i-1] += p[i]; }` returning p[i-1] + &p[i-1], not p[i-1] + p[i]).
   addPass(createW65816NegYIndY());
   // Branch expansion runs after that so the BRA introduced for long
   // conditional branches gets seen by SepRepCleanup (which can
diff --git a/src/llvm/lib/Target/W65816/W65816TiedDefSpill.cpp b/src/llvm/lib/Target/W65816/W65816TiedDefSpill.cpp
index 00d4ccb..ca63345 100644
--- a/src/llvm/lib/Target/W65816/W65816TiedDefSpill.cpp
+++ b/src/llvm/lib/Target/W65816/W65816TiedDefSpill.cpp
@@ -118,6 +118,11 @@ bool W65816TiedDefSpill::runOnMachineFunction(MachineFunction &MF) {
   // Only pre-RA: skip if vregs are already gone.
   if (!MF.getRegInfo().getNumVirtRegs())
     return false;
+  // At -O0/optnone, the spill+reload pattern this pass introduces
+  // doesn't get coalesced and ends up wasting frame space without
+  // helping greedy.  Same skip rationale as WidenAcc16.
+  if (MF.getFunction().hasOptNone())
+    return false;
 
   MachineRegisterInfo &MRI = MF.getRegInfo();
   const W65816Subtarget &STI = MF.getSubtarget<W65816Subtarget>();
diff --git a/src/llvm/lib/Target/W65816/W65816WidenAcc16.cpp b/src/llvm/lib/Target/W65816/W65816WidenAcc16.cpp
index 9e3fdce..5e11bb6 100644
--- a/src/llvm/lib/Target/W65816/W65816WidenAcc16.cpp
+++ b/src/llvm/lib/Target/W65816/W65816WidenAcc16.cpp
@@ -119,6 +119,13 @@ static bool allUsesAcceptWide(Register VReg,
 
 bool W65816WidenAcc16::runOnMachineFunction(MachineFunction &MF) {
   if (!MF.getRegInfo().getNumVirtRegs()) return false;
+  // At -O0 / optnone, register coalescing doesn't run, so the COPY we
+  // insert to bridge Acc16 → Wide16 doesn't get folded; instead it
+  // forces Wide16 spills through DP-mapped slots that collide and
+  // produce miscompiles around modify-in-place ops (lda dp; inc a;
+  // sta dp; lda dp reads pre-inc value).  The promotion is purely a
+  // performance optimization, so skip it for optnone functions.
+  if (MF.getFunction().hasOptNone()) return false;
   MachineRegisterInfo &MRI = MF.getRegInfo();
   const W65816Subtarget &STI = MF.getSubtarget<W65816Subtarget>();
   const W65816InstrInfo *TII = STI.getInstrInfo();