65816-llvm-mos/STATUS.md
2026-05-25 21:00:32 -05:00


llvm816 — Current Status

LLVM/Clang backend for the WDC 65816 (Apple IIgs), forked from llvm-mos as a separate W65816 target.

What works

End-to-end C-to-binary toolchain that produces 65816 machine code which runs correctly under MAME (apple2gs).

Language coverage at -O2 (no extra flags):

  • All scalar arithmetic: i8 / i16 / i32 / i64 add, sub, mul, div, mod (signed and unsigned). Carry-chained multi-word ops via ADC/SBC pseudos + ASLA16 / shift libcalls.
  • Comparisons and signed/unsigned widening (sext, zext, trunc) for all the above sizes. Signed compare near INT_MIN handled via EOR-with-sign-bit transform.
  • Pointer arithmetic, array indexing, struct field access, struct return-by-value (up to 8 bytes — Pair, Vec4, double).
  • Pointer dereference (*p) lowers via LDAptr / STAptr / STBptr to [$E0],Y indirect-LONG with the bank byte at $E2 forced to 0 — DBR-independent, so pha;plb bank-switched callers don't corrupt data through callee local-pointer writes. Const-int pointers (*(volatile uint16 *)0x5000 = v MMIO idiom) lower to STAabs (DBR-relative) so bank-2 writes still work.
  • Bitfields, switch statements (verified up to ~12 cases + default), function pointers, function-pointer tables, indirect calls via __jsl_indir trampoline.
  • Recursion: factorial, Fibonacci, depth-3 binary-tree insert/sum/min/max, simple recursive quicksort.
  • Loops with goto / break / continue, nested loops, state machines.
  • <stdarg.h> varargs with int / long / unsigned long long mixed args.
  • Heap: malloc / free (libc.c first-fit allocator) — linked-list reverse with cons works; free-list coalesce verified.
  • Strings: hand-rolled strlen, strcmp, strcpy, strchr, atoi/itoa roundtrip.
  • Soft-float (single): all four ops + comparisons, MAME-verified.
  • Soft-double: add, sub, mul, div all return correct bit patterns bit-for-bit against gcc with round-to-nearest-even rounding; 3-iter Newton sqrt converges. Compiles at -O2 throughout. Long-running iterations may hit MAME's 1-second sim-time budget (test config issue, not a compiler bug).
  • Inline assembly with "a", "x", "y" register constraints and arbitrary opcode bytes (used for the pha;plb bank-switch idiom).
  • C++ minimal: clang++ compiles a class with virtual + non-trivial ctor (vtable + RTTI omitted; no exceptions).
  • printf with %d %x %s %c %p and width/precision specifiers.
  • sprintf / snprintf / vsprintf / vsnprintf with the same format coverage as printf (%d %u %x %ld %lu %s %c %f %p %% + width). C99 truncation semantics for snprintf. %.Nf produces the correct fractional digits with round-half-up.
  • scanf family: sscanf / vsscanf parse a C string; fscanf / vfscanf bridge to vsscanf via a per-call line buffer (caps at 255 bytes / line; a longer line silently truncates). scanf reads from stdin which always returns EOF on this target — the surface compiles but isn't useful without a stdin source. Format directives: %d %i %u %x %X %o %s %c %ld %lu %lx %li %lo %%.
  • qsort + bsearch over arbitrary element size with a user cmp callback.
  • Standard string/stdlib glue: strcat, strncat, strpbrk, strspn, strcspn, atol, llabs (kept in their own translation unit so vprintf's branch layout doesn't shift).
  • <math.h>: fabs, floor, ceil, fmod, copysign, sqrt, pow, sin, cos, tan, exp, log, atan, atan2, asin, acos, sinh, cosh, tanh (and float variants). Bit-twiddling for fabs/floor/ceil/copysign; Newton iteration for sqrt; range-reduction + Taylor for sin/cos/exp/log/atan; identities for asin/acos/atan2/sinh/cosh/tanh. Accuracy is in the ~1e-6 range: good enough for typical numeric work, far short of glibc-quality. These are slow (each call is dozens to hundreds of soft-double libcalls), so pre-compute or cache when possible.
  • setjmp / longjmp from libgcc.s.
  • Static constructors via crt0's init_array walk.
  • <stdio.h> file I/O with two backends:
    • mfs: mfsRegister(path, buf, size, cap, writable) stages a memory buffer as a named file. Used by smoke tests that don't have a real disk. Fully validated end-to-end.
    • GS/OS: fopen falls through to gsosOpen for any path not in the mfs table. Routes through the GS/OS class-1 dispatcher via wrappers in runtime/src/iigsGsos.s (Open/Read/Write/Close/SetMark/GetMark/SetEOF/GetEOF). The full stdio surface (fread/fwrite/fseek/ftell/fclose/fgetc/fputc/fputs/fgets/ungetc/feof/ferror/clearerr/rewind/fprintf/vfprintf) dispatches on backend. link816 honors weak symbols so programs that don't use the GS/OS backend don't have to link iigsGsos.o.
    • Validation status: code path compiles, links, and runs under runViaFinder.sh --data injection. fopen + gsosOpen hangs when invoked under real GS/OS 6.0.2 (JSL $E100A8 doesn't return); root cause not yet diagnosed. Stub-dispatcher GS/OS smoke (the existing one) validates the wrapper contract independently. An XFAIL'd end-to-end smoke is in scripts/smokeTest.sh gated behind GSOS_FILE_SMOKE=1 for use after the dispatcher path is fixed. runViaFinder.sh --data /PATH=local_file is the automated-injection mechanism for runtime-test data files.
    • stdin/stdout/stderr route through putchar as before.
  • <wchar.h>: wcslen / wcscmp / wcsncmp / wcscpy / wcsncpy / wcscat / wcschr / wcsrchr; mbtowc / wctomb / mbstowcs / wcstombs / mblen with the trivial 1:1 byte<->wide mapping (Latin-1). wchar_t is 16-bit on this target. Extended set: wmemcpy / wmemmove / wmemset / wmemcmp / wmemchr; wcstol / wcstoul / wcstoll / wcstoull / wcstod / wcstof; swprintf / vswprintf; wcsftime. All delegate to the byte equivalents under the Latin-1 model.
  • <signal.h>: in-process signal table. signal() registers a handler; raise() invokes it. Default actions: SIGABRT calls abort(), SIGINT/SIGTERM call exit(128+sig), others ignored.
  • <locale.h>: setlocale always returns "C"; localeconv returns a fixed C-locale lconv struct.
  • <fenv.h>: rounding mode + exception flag word tracked but no-op (softFloat / softDouble are fixed RNE; exceptions never raised). Surface compiles cleanly for portable code.
  • <tgmath.h>: C11 type-generic math via _Generic; selects sqrtf vs sqrt etc. based on argument type.
  • <stdatomic.h>: C11 atomic surface, all ops lower to plain ops (single-core uniprocessor — no real synchronization needed). _Atomic T is treated as plain T.
  • <threads.h>: stubs. thrd_create returns thrd_error; mutex/cond ops are no-ops; call_once and tss_* work since they're degenerate on a single-core target.
  • aligned_alloc / posix_memalign / aligned_free: wrap malloc with an over-allocation + pointer-stash trick. Match C11 contract — aligned_alloc(N, M) returns N-aligned, free with aligned_free.
  • <iso646.h>: alternative operator spellings (and, or, not, etc.) — C95 compat header.
  • <stdalign.h>: aliases _Alignas / _Alignof to alignas / alignof.
  • <stdnoreturn.h>: aliases _Noreturn to noreturn.
  • <uchar.h>: char16_t / char32_t typedefs + mbrtoc16 / c16rtomb / mbrtoc32 / c32rtomb conversion helpers. In our Latin-1 model these are 1:1 byte copies (no UTF-8 decode).
  • <wctype.h>: wide-char classification + case folding. Delegates to <ctype.h> for code-points 0..255; anything outside Latin-1 returns false / unchanged.
  • <complex.h>: C99 complex-number surface — clang built-in _Complex lowers to soft-double under the hood. Macros complex / _Complex_I / I / CMPLX / CMPLXF / CMPLXL plus inline creal / cimag / conj / cproj / cabs / carg and their f / l variants. Transcendental complex routines (csin/ccos/cexp/etc.) intentionally not provided — they would each need a polynomial-expansion implementation with limited IIgs value.
  • <assert.h>: adds C11 static_assert as a macro alias for the _Static_assert keyword.
  • <errno.h>: full C standard error codes (EDOM, ERANGE, EILSEQ) plus common POSIX codes (EPERM..EPIPE, ENAMETOOLONG, ENOSYS, ENOTEMPTY, ELOOP). strerror maps every defined code to a human-readable string.
  • <stdio.h>: adds C standard buffer-control surface (setvbuf / setbuf as no-ops, _IOFBF / _IOLBF / _IONBF / BUFSIZ); fgetpos / fsetpos wrap ftell / fseek; remove routes through mfsUnregister; rename / tmpfile / tmpnam are stubs.
  • C++ subset: classes, single inheritance, multiple inheritance (Drawable+Movable through one Sprite), virtual base diamond (A and B virtually derive Base; Diamond inherits from both with one shared Base subobject), virtual functions, polymorphism via base-class pointer arrays, virtual dtors, this-pointer adjustment for non-leftmost bases, vbase offset tables. RTTI / dynamic_cast works (downcast, MI cross-cast, virtual-base sibling cast) via a minimal libcxxabi shim (runtime/src/libcxxabi.c) that provides __dynamic_cast + the three typeinfo class vtables (__class_type_info, __si_class_type_info, __vmi_class_type_info) + sized operator delete + __cxa_pure_virtual.
  • C++ exceptions via clang++ -fsjlj-exceptions: throw, catch, catch-by-value, multiple catch handlers, exception destruction. W65816SjLjFinalize IR pass inserts the call-site dispatch and per-function catch table; runtime/src/libcxxabiSjlj.c provides the Itanium SJLJ surface (_Unwind_SjLj_*, __cxa_throw, __cxa_begin_catch, etc.) plus a no-op personality.
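
The comparisons item above mentions an EOR-with-sign-bit transform for signed compares near INT_MIN. A minimal sketch of that idea in portable C, assuming the usual formulation (flip bit 15 of both operands, then compare as unsigned). The function name is hypothetical, and this illustrates the arithmetic only, not the backend's actual lowering code:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: signed 16-bit "a < b" computed with an unsigned
 * compare. XOR with 0x8000 maps INT16_MIN..INT16_MAX onto 0..0xFFFF while
 * preserving order, so no subtraction (and therefore no overflow near
 * INT16_MIN) is involved. */
bool slt16_via_sign_flip(int16_t a, int16_t b)
{
    uint16_t ua = (uint16_t)((uint16_t)a ^ 0x8000u);
    uint16_t ub = (uint16_t)((uint16_t)b ^ 0x8000u);
    return ua < ub;
}
```

A subtract-and-test-sign compare would overflow for inputs such as INT16_MIN versus a positive value; the flipped form needs only the unsigned ordering.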

Toolchain:

  • clang / llc produce W65816 assembly + ELF object files.

  • tools/link816 resolves cross-translation-unit refs, lays out text/rodata/bss, emits a flat binary the IIgs ROM can load. Auto-relocates bss above text+rodata when the default --bss-base 0x2000 would overlap text, and skips past the IIgs IO window ($C000-$CFFF) if needed. --gc-sections (default ON) drops unreachable functions: a minimal program with full runtime linked shrinks from ~43KB to ~1.5KB.

  • link816 --segment-cap N packs .text greedily into multiple bank-aligned segments, capped at N bytes per segment. Segment 1 stays at --text-base in bank 0 (alongside rodata + bss + init); segments 2..M start at --segment-bank-base (default $040000) in successive banks. --manifest path.json writes a JSON file listing each segment's image, base, and entry offset. Cross-bank JSL (IMM24 reloc) just works — patched at link time with the full 24-bit address. Cross-bank IMM16 is permitted (uses DBR for bank — caller pins DBR to data's bank); cross-bank PCREL is rejected with a clear diagnostic. scripts/runMultiSeg.sh is a mini in-Lua loader for MAME that reads the manifest, places each segment's bytes, and runs from segment 1's entry — used by smoke to verify cross-bank JSL end-to-end (helper3 chain across 3 bank-aligned segments).

  • tools/omfEmit produces OMF v2.1 files in three modes:
    • (a) single-segment: --input flat.bin --map flat.map --base ADDR --entry SYM, KIND=0x0000 (CODE, dynamic), ORG=0 (loader picks bank).
    • (b) multi-segment: --manifest path.json reads link816's manifest and emits one OMF segment per entry with KIND=0x8800 (STATIC|ABSBANK|CODE) + ORG=segment-base, asking the GS/OS Loader to place each at its declared bank-aligned address. All intra-segment relocations were already patched by the linker, so no INTERSEG/RELOC opcodes are needed for v1 static placement.
    • (c) --stack-size N (auto-enables --expressload) appends a ~Direct DP/Stack segment (KIND=0x1012) of N bytes so apps can request a custom DP+stack allocation from GS/OS instead of the Loader's 4KB default. Validated end-to-end via runViaFinder.sh under real GS/OS 6.0.2; the slow Loader path silently rejects multi-segment OMFs, so --stack-size is gated behind ExpressLoad emission.

  • link816 --debug-out FILE writes a DWARF sidecar with text/rodata/bss/init_array relocations applied to every .debug_* section, so .debug_addr / .debug_line PC values are final-image addresses.

  • runtime/build.sh builds crt0, libc, soft-float, soft-double, libgcc into linkable objects.

  • scripts/smokeTest.sh runs 148 end-to-end checks at -O2: scalar ops, control flow, calling conventions, MAME execution regressions, link816 bss-base safety + weak-symbol resolution + heap_end-vs-heap_start sanity, iigs/toolbox.h compile + link, iigs/gsos.h compile + link, standalone runtime headers, AsmPrinter peepholes (STZ / PEA / PEI: single-STA, shared-LDA-multi-STA, DPF0-forwarding), malloc/free coalesce ordering, plus real-world coverage: Conway's Game of Life blinker (2D loop + neighbour bounds), binary search tree (recursive struct + malloc), function-pointer dispatch table (indirect JSL via __jsl_indir), memory-backed file I/O (mfsRegister + fopen/fread/fwrite/fseek/fprintf), C++ polymorphism (single inheritance), C++ multiple inheritance (Drawable+Movable), C++ virtual base diamond, C++ dynamic_cast (SI + MI cross-cast + virtual-base sibling cast through libcxxabi shim), SJLJ exception runtime end-to-end (libcxxabiSjlj.c throw/catch round-trip via setjmp/longjmp + catch-table walk), C++ -fsjlj-exceptions compile + link (the C++ frontend → backend path is execution-verified manually but skipped from MAME smoke due to a MAME-side flakiness; see "What's next"), GS/OS wrapper round-trip via stub dispatcher pre-loaded at $E100A8 (validates PHA + PEA 0 + JSL + post-call SP-fixup contract end-to-end), wchar / signal core APIs, hex dumper writing through fprintf, JSON tokenizer state machine, hash-table command shell (parser + dispatch + chained collisions over fprintf-to-mfs), scripts/bench.sh size-vs-Calypsi harness. 100% pass.

  • scripts/benchCycles.sh measures per-iteration cycle counts via MAME's emulated HBL counter. 13 benchmarks under benchmarks/ (8 int micro + 3 soft-FP + 2 "game-like": particles, mandelbrot). Current numbers (2026-05-20): bsearch 127, crc32 <65, dotProduct 144, fib 97, memcmp 113, popcount 93, strcpy 91, sumOfSquares 126 cyc/iter (100 iters); dadd 1157, ddiv 1261, dmul 1033 cyc/iter (10 iters); particles 2253 cyc/iter (3 iters — 32-particle physics tick); mandelbrot 11570 cyc/iter (1 iter — 4×4 fixed-point tile, max 8 Mandelbrot iters). Speed is the optimization priority, not size.

  • compare/ holds three side-by-side C tests with our asm and Calypsi's listing for static-size comparison: sumSquares/evalAt/mul16to32. bash compare/regen.sh recompiles each under both clang --target=w65816 -O2 -S and cc65816 --speed -O 2 --64bit-doubles and prints an ours/Calypsi instruction-count ratio. Current ratios (2026-05-20): sumSquares 0.84× (26 inst — we beat Calypsi's 31), evalAt 1.86× (472 inst), mul16to32 0.25× (1 inst — we beat Calypsi's 4). See compare/README.md.
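
The --segment-cap item above describes greedy packing of .text into capped segments, with segment 1 at --text-base and later segments in successive banks from --segment-bank-base. A rough sketch of that placement arithmetic, assuming plain next-fit packing in input order; link816's real policy may differ, and the function names and the 0x1000 text base used in the note below are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* Assign each function (given by its size in bytes, in input order) to a
 * 0-based segment index so that no segment exceeds `cap` bytes.
 * Returns the number of segments used, or 0 if the input is empty or a
 * single function is larger than `cap`. */
size_t pack_segments(const uint32_t *sizes, size_t n, uint32_t cap,
                     size_t *seg_of)
{
    size_t seg = 0;
    uint32_t used = 0;
    for (size_t i = 0; i < n; i++) {
        if (sizes[i] > cap)
            return 0;                /* cannot fit in any segment */
        if (sizes[i] > cap - used) { /* current segment is full */
            seg++;
            used = 0;
        }
        seg_of[i] = seg;
        used += sizes[i];
    }
    return n ? seg + 1 : 0;
}

/* Base address of 0-based segment `seg`: the first segment sits at
 * text_base in bank 0; later ones start at bank_base and advance one
 * 64 KiB bank per segment. */
uint32_t segment_base(size_t seg, uint32_t text_base, uint32_t bank_base)
{
    if (seg == 0)
        return text_base;
    return bank_base + (uint32_t)(seg - 1) * 0x10000u;
}
```

With sizes {300, 300, 300, 500} and a cap of 700, this yields three segments; with a text base of 0x1000 and the default bank base $040000 they would sit at 0x1000, $040000 and $050000.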

Backend register allocation:

  • Greedy regalloc as default at -O1+; fast at -O0/optnone. Greedy was previously blocked by an upstream LLVM LiveRangeEdit::eliminateDeadDef assertion firing on KILL pseudos with non-dead implicit-def $a. Fix landed in tools/llvm-mos/llvm/lib/CodeGen/InlineSpiller.cpp: when InlineSpiller converts a redundant STAfi to a KILL pseudo, mark BOTH explicit and implicit defs dead (the original loop only iterated MI.defs() = explicit-only, leaving the inherited implicit-def $a live). Bench impact: popcount 19.4%, strcpy 18.9%, memcmp 8.6%, bsearch 9.2%.

  • Pre-RA passes: WidenAcc16 (Acc16→Wide16 promotion, lets greedy spread i16 pressure across A and 16 IMG slots); TiedDefSpill (handles tied-def-multi-use hazard); ABridgeViaX (bridges via X/Y when free).

  • Post-RA passes: SpillToX (STA/LDA pairs → TAX/TXA bridges when X dead); StackSlotCleanup (deletes redundant adjacent spills); NegYIndY (rewrites negative-Y indirect-Y stack-rel ops to avoid the 24-bit-add bank-cross).

  • Pre-emit: BranchExpand (long Bxx → INV_Bxx skip; BRA target); SepRepCleanup (coalesces adjacent SEP/REP toggles, plus a cross-mode-neutral coalesce that drops REP/SEP pairs sandwiching X-flag-only ops, branches, transfers — saves 4B / 12cyc per collapse). AsmPrinter LDAi8imm peephole walks past mode-neutral MIs to fuse the closing REP into a following SEP.

  • Imaginary registers IMG0..IMG15 backed by DP $C0..$CE + $D0..$DE — gives greedy 17 effective i16 carriers (A + 16 IMG) before stack spills kick in.
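
The BranchExpand item above rewrites an out-of-range Bxx into an inverted branch that skips over a jump to the real target. The range test behind that decision can be sketched as follows. This assumes only the architectural fact that 65816 conditional branches take a signed 8-bit displacement measured from the next instruction; the enum and function names are hypothetical and the pass's real bookkeeping is not shown:

```c
#include <stdbool.h>
#include <stdint.h>

/* True when a displacement fits the signed 8-bit field of a Bxx. */
bool fits_short_branch(int32_t disp)
{
    return disp >= -128 && disp <= 127;
}

enum BranchForm {
    BRANCH_SHORT,   /* Bxx target */
    BRANCH_EXPANDED /* INV_Bxx skip ; BRA target ; skip: */
};

/* Pick a form for a conditional branch located at branch_addr.
 * A Bxx is 2 bytes, so the displacement is relative to branch_addr + 2. */
enum BranchForm choose_branch_form(uint32_t branch_addr, uint32_t target)
{
    int32_t disp = (int32_t)target - (int32_t)(branch_addr + 2);
    return fits_short_branch(disp) ? BRANCH_SHORT : BRANCH_EXPANDED;
}
```

A real pass would have to iterate, because expanding one branch grows the code and can push neighbouring branches out of range.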

ABI:

  • arg0 in A; arg1 in X for i32-first-arg signatures; rest pushed RTL on the system stack with PHA. Caller deallocates via tsc;clc;adc #N;tcs or PLY*N/2.
  • Return: i8/i16 in A; i32 in A:X; i64 in A:X:Y plus DP[$F0..$F1] for the highest 16 bits.
  • Frame is empty-descending (S points to next-free); offsets account for the +1 skew vs LLVM's full-descending model.
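
To make the i64 return convention above concrete, the sketch below splits a 64-bit value into the four 16-bit pieces the ABI list names. The word order is an assumption: the text only pins down that DP[$F0..$F1] carries the highest 16 bits, so A is taken here as the lowest word with X and Y ascending. The struct and function names are hypothetical:

```c
#include <stdint.h>

/* Where each 16-bit piece of an i64 return value would travel. */
struct RetI64 {
    uint16_t a;     /* A register */
    uint16_t x;     /* X register */
    uint16_t y;     /* Y register */
    uint16_t dp_f0; /* direct-page bytes $F0..$F1 */
};

/* ASSUMPTION: A = bits 0..15, X = bits 16..31, Y = bits 32..47.
 * Only the placement of bits 48..63 in $F0..$F1 comes from the text. */
struct RetI64 split_i64_return(uint64_t v)
{
    struct RetI64 r;
    r.a     = (uint16_t)(v & 0xFFFFu);
    r.x     = (uint16_t)((v >> 16) & 0xFFFFu);
    r.y     = (uint16_t)((v >> 32) & 0xFFFFu);
    r.dp_f0 = (uint16_t)((v >> 48) & 0xFFFFu);
    return r;
}
```

Under that assumption, 0x1122334455667788 would come back as A=0x7788, X=0x5566, Y=0x3344 and $F0..$F1=0x1122.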

IIgs toolbox:

  • iigs/toolbox.h — autogenerated wrappers for all ~1300 IIgs toolbox routines across 35 tool sets (Tool Locator, Memory Manager, Misc Tools, QuickDraw II / Aux, Event Manager, Sound Manager, Apple Desktop Bus, SANE, Integer Math, Text Tools, Window Manager, Menu Manager, Control Manager, LineEdit, Dialog Manager, Scrap Manager, Standard File, Note Synth/Sequencer, Font Manager, List Manager, ACE, Resource Manager, MIDI, Video Overlay, TextEdit, Media Control, Print Manager, Scheduler, Desk Manager, …). Names match Apple's IIgs Toolbox Reference exactly (TLStartUp, MMStartUp, NewWindow, SysBeep, …). 417 simple wrappers (zero/single-arg, i16-or-void return) inline in the header; 890 multi-arg ones live in runtime/src/iigsToolbox.s. Generated by scripts/genToolbox.py from ORCA-C's ORCACDefs/ (re-runnable when ORCA-C updates).

What's next

Work is now optimization-focused; the toolchain is feature-complete for the common-case C / minimal-C++ workload. Priority is speed (cycle counts), not size.

Recently landed (2026-05-25):

  • Layer 1 ptr32 deref-fold (always on): a constant offset on a ptr32 deref folds into the Y register of the [dp],Y access instead of a CLC/ADC carry-chain pre-add. Consecutive-deref CSE also shares the $E0/$E2 staging across s->a, s->b, ... accesses with the same base. Saves ~3 instructions per struct-field access. See feedback_ptr32_deref_fold_layer1_landed.md.

  • Layer 2 ptr32 deref via (d,S),Y (opt-in): -mllvm -w65816-dbr-safe-ptrs switches ptr32 derefs to the one-instruction lda (d,S),Y (opcode 0xB3) at the cost of reading only 16 bits of pointer. Bank byte is implicit DBR. Correct only for code that touches memory inside DBR's bank, which is typical for malloc/globals/BSS-only programs (Lua, Picol). Lua 5.1.5 shrinks 20.6%, dropping our total from 1.45× to 1.15× Calypsi. Default off; per-TU opt-in. See feedback_ptr32_layer2_landed.md and docs/USAGE.md for the safety rules.

  • Inline-threshold lowered target-wide to 50 (was LLVM default 225). LLVM's default is tuned for desktop ISAs where call overhead is high relative to inlined-body byte cost. On W65816, jsl is cheap (4 bytes / ~8 cycles) but inlined ptr32 derefs are expensive even with Layer 2 — the tradeoff inverts. At 225, Lua's index2adr (41 callers in lapi.c) and CoreMark's matrix_test helpers got copied everywhere. At 50, neither does, and the cycle benchmark suite is unchanged. With Layer 2 + threshold=50, total Lua is 0.93× Calypsi and total CoreMark is 0.79× Calypsi (we beat by 21%). Override per-TU with -mllvm -inline-threshold=N. See feedback_lapi_inline_threshold.md and feedback_coremark_matrix_test_regression.md.

  • CoreMark 1.0 ported (tests/coremark/). EEMBC's standard embedded benchmark, ~2K LOC. Exercises linked-list traversal, matrix multiply, formal state machine, CRC — patterns Lua doesn't hit. Build requires --layer2 to fit a single bank (otherwise crosses the IO window at 0xC000). See tests/coremark/README.md and feedback_coremark_landed.md.
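
The Layer 1 deref-fold item above targets code shaped like the following: several field reads off one base pointer, each at a small constant offset. This is a hypothetical source-level example (struct and function names are made up); the comments restate the text's description of the lowering rather than observed compiler output:

```c
#include <stddef.h>
#include <stdint.h>

struct Vec3 {
    int16_t x; /* offset 0 */
    int16_t y; /* offset 2 */
    int16_t z; /* offset 4 */
};

/* Three derefs through the same base pointer. Per the Layer 1
 * description, the constant field offsets (0, 2, 4) would be carried in
 * Y for the [dp],Y access, and the pointer staging at $E0/$E2 would be
 * set up once and shared instead of being redone per field. */
int16_t sum_fields(const struct Vec3 *v)
{
    return (int16_t)(v->x + v->y + v->z);
}
```

The offsets that would land in Y are just offsetof(struct Vec3, y) and offsetof(struct Vec3, z), which is what makes the fold a pure constant substitution.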

Speed wins queued, ranked by expected impact:

  • ptr32 pointer-increment overhead (partially addressed). The i32 += 1 post-PEI peephole (W65816I32IncFold) detects the 6-instruction LDA/ADCi16imm 1/STA/LDA/ADCEi16imm 0/STA pattern and rewrites to LDA/INA/STA/INC_HI_IF_CARRY (with private-label BNE expansion in AsmPrinter). Saves ~13 cyc per increment on the no-carry common path. memcmp 1330 → 1194 (10.2%), strcpy 3325 → 3154 (5.1%). Now also tolerates intervening TAX/TXA pseudo-saves in the matcher (regalloc inserts them around STAfi's conservative Defs=[A]); LSR-introduced i32 PHIs like lsr.iv9 += 1 now match. LSR's *p++ → base+offset rewrite remains unaddressed; tried -disable-lsr and isLSRCostLess override, both regressed dotProduct.

  • W65816StackSlotMerge — value-equivalent stack slot coalesce (2026-05-13). Pre-emit pass that merges PHI src/dst stack-slot pairs which LLVM's StackSlotColoring can't see (they're simultaneously live but hold the same value). Detects the canonical loop-body LDA X ; STA Y PHI-copy in a self-looped MBB, verifies value equivalence via bidirectional twin-pairing (Case 1: same A in same MBB / Case 2: PHI-copy reload pattern / Case 3: matching LDA #const init in different MBBs), and renames slot X→Y function-wide. Runs AFTER SepRepCleanup so the PHI copies are out of their PHP/PLP wraps and offsets are stable. A-define detection is opcode-based, not operand-based — LDA_DP / LDA_Abs / LDA_Long etc. omit the implicit-def $a annotation in tablegen but semantically write A; the semanticallyDefsA helper falls back to an opcode whitelist. sumSquares static: 65 → 61 inst (1.97x — under 2x Calypsi for the first time). sumOfSquares cyc/call: 18755 → 17391 (7.3%). strcpy: 2558 → 2387 (6.7%). See W65816StackSlotMerge.cpp.

  • LSR-widened i32 IV narrowing (W65816NarrowI32Mul Phase 2, 2026-05-13). After rewriting mul i32 X, Y to a __umulhisi3 call, scan for i32 PHIs whose only uses are (a) the truncs the rewrite emitted and (b) a single self-feeding add %P, const. When SCEV bounds the PHI to u16, build an i16 PHI + i16 add in place, replace truncs, and erase the i32 chain. Care needed to break the PN ↔ Incr use-cycle before erasing. sumSquares frame: 14B → 12B; loop-internal i++ shrinks from 7→3 inst.

  • PHI-hoist accepts LDA_Imm16 / LDAi16imm (2026-05-13). Init blocks contain lda #const ; sta slot,s pairs wrapped in PHP/PLP around the pre-loop CMP — same shape as a PHI-copy wrap but with an immediate load instead of a memory load. Matcher extended to accept both the MC opcode (LDA_Imm16) and the surviving pseudo (LDAi16imm), with an added $a-live-out guard: if any successor MBB has $a in its live-in set, bail — the LDA's A-value is a fall-through register-PHI consumed by the successor's first STA, and hoisting clobbers it. Caught by sumTable where lda #0 ; sta 0x9,s (wrap+trailing) ALSO supplied A=0 to bb.2's sta 0x1,s.

  • 16x16→32 multiply via __umulhisi3 + W65816NarrowI32Mul IR pass (2026-05-13). Added __umulhisi3 (unsigned 16x16→32) to runtime/src/libgcc.s. New IR pass in addISelPrepare walks mul i32 X, Y and uses IR-level computeKnownBits plus a SCEV unsigned-range fallback (getUnsignedRange().getActiveBits() <= 16) to detect operands with provably-zero high 16 bits — fires on the canonical loop-internal (u32)i*i pattern after LSR widens the i16 IV to i32. Rewrites to a call to __umulhisi3. sumOfSquares 20801 → 19096 cyc/call by itself (-8.2% from baseline).

  • Dead TAX/TXA peephole (2026-05-13). STAfi's conservative Defs=[A] (for the IMG-source PHA-bracketed expansion path) causes regalloc to insert spurious TAX/TXA save/restore brackets even when STAfi's source is A directly. W65816SepRepCleanup now elides TXA/TYA whose next non-debug inst defines $a, and TAX/TAY whose target reg is dead before its next redefinition. Cross-MBB liveness via Succ->isLiveIn(...); bails on return-terminated MBBs (RTL doesn't model the i32-return convention). Tracks pRedef so TAX ; CLC ; ADC chains don't bail on ADC's $p-read (CLC freshens the carry flag).

  • i32 += i32 store-bypass (2026-05-13). Regalloc materializes the call-result A:X i32 pair into spill slots before the add, then reloads, emitting a 10-instruction STA-TXA-STA-LDA-CLC-ADC-STA-LDA-ADC-STA sequence. W65816SepRepCleanup matcher rewrites to 6-instruction CLC-ADC-STA-TXA-ADC-STA (TXA preserves carry; hi-half consumes it directly from X). Saves 4 inst / ~13 cyc per call-result-add site. sumOfSquares 20460 → 19096 (-6.7%).

  • PHI-copy hoist out of PHP/PLP wrap (2026-05-13). W65816SepRepCleanup detects the back-edge CMP ; PHP ; (LDA/STA pairs) ; PLP ; (trailing STA) ; Bxx ; BRA loop pattern and hoists the LDA/STA pairs + trailing above the CMP's $a-producer chain, dropping PHP/PLP. Two safety guards: (1) bump undo — in-wrap stack-rel offsets were pre-adjusted by +1 (PHP decrements S; W65816StackSlotCleanup's wrap pass compensates inside the wrap), so the hoist subtracts 1 from each LDA_StackRel / STA_StackRel offset; trailing STAs (already outside the wrap) are untouched. (2) pair-count check — require #LDAs(Block) == #STAs(Block) + #STAs(Trailing); an extra LDA is a memory-to-register PHI value live-out at the back-edge (consumed by the loop top's first STA), and hoisting would clobber A. Saves 2 inst / 8 cyc per occurrence. sumOfSquares 19096 → 18755 (-1.8%), popcount 3683 → 3478 (-5.6%).

  • More peephole / libcall opportunities. __mulsi3 just gained early-exit when the multiplier shifts to 0; dotProduct dropped 4007→2472 (38.3%), sumOfSquares 40920→23870 (41.6%). Next candidates: shift-by-N inlining for shifts 5+ that currently go through __ashlsi3; a u32 += zext i16 SDAG combine to skip the hi-half carry chain when one operand has known-zero high 16 bits.

  • W65816StackRelToImg peephole pipeline (2026-05-20). Eight always-on peepholes plus an extended phase 4 in the pre-emit StackRelToImg pass: (1) elidePhaBracket with case-a single-store bracket + case-b ImgCalleeSave multi-store with STA-hoist + case-c STA_DP-only multi-pair + forward-walk liveness through conditional branches; (2) elideCallResultSaveSPReload drops STA/LDA $E0 round-trip in ADJCALLSTACKUP's Y-live i64-return path; (3) elideDeadStaCarry drops first STA in i32-carry STA/ADCE/STA pattern; (4) elideRedundantLdaAfterPha; (4b) elidePlaPhaPair collapses consecutive PLA;PHA; (5) elideStoreForwarding (gated to bail path + end-of-pass to avoid IMG-slot reallocation cascade). Phase 4 extended to walk past STX_DP/STY_DP between TYA and STA_DP with safety check (post-STA op must redefine A) and to handle STA_StackRel destination with offset compensation. Result: evalAt 498→472 inst (1.96×→1.86× vs Calypsi), fib -35% cyc/iter (149→97), popcount -11% (104→93), 35 libc functions get TAY/TYA bracket elided. Case (b) hoists the body's first STA before the ImgCalleeSave bracket, enabling the existing phase 4 to remove PEI's TAY/TYA round-trip in a synergistic chain.

  • __muldi3 32-bit short-circuit (2026-05-20). When a's high 32 bits ($E4/$E6) are zero, use a 32-iter shift-and-add loop instead of 64 iters. Fires on every mulhi64Aligned call from softDouble.c (4× per __muldf3), which always passes zero-extended u32 operands. Result: dmul 1605→1033 cyc/iter (-36%). Single-side check (just a) is correct since b's high half being non-zero doesn't affect correctness: iters 32-63 would just shift b without adding.

  • Lua 5.1.5 compiles cleanly (2026-05-20). Reference C implementation (17K lines, 24 source files) builds + links into a multi-segment binary. Loads in MAME. Lives under tests/lua/. Three large functions (luaV_execute, symbexec, auxsort) hit greedy regalloc's complexity budget and need -mllvm -regalloc=basic (still at -O2 — basic-regalloc -O2 is ~3.5× smaller than fast-regalloc -O0). Largest "real-world C" test in the project.
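
Two items above (the __mulsi3 early-exit and the __muldi3 short-circuit) rely on the same property of shift-and-add multiplication: once the remaining multiplier bits are all zero, further iterations add nothing. A C sketch of the early-exit form, with a hypothetical function name and an iteration counter added purely for illustration; the real routines are hand-written assembly in the runtime:

```c
#include <stddef.h>
#include <stdint.h>

/* Shift-and-add 32x32 -> 32 multiply. The loop stops as soon as the
 * multiplier `b` has been shifted down to zero, so small multipliers
 * finish in a handful of iterations instead of a fixed 32.
 * If `iters` is non-NULL it receives the number of loop iterations. */
uint32_t mul32_shift_add(uint32_t a, uint32_t b, unsigned *iters)
{
    uint32_t acc = 0;
    unsigned n = 0;
    while (b != 0) {
        if (b & 1u)
            acc += a; /* add the current partial product */
        a <<= 1;
        b >>= 1;
        n++;
    }
    if (iters != NULL)
        *iters = n;
    return acc;
}
```

For example, multiplying by 6 takes 3 iterations and multiplying by 0 takes none, which is the kind of saving the dotProduct and sumOfSquares numbers reflect when multipliers are small.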

Open limitations:

  • Multi-bank BSS — full support up to 4 banks (256KB). link816 splits BSS into up to 4 contiguous segments at link time; each segment fits within a single bank. Linker emits __bss_seg{0..3}_lo16 / _bank / _size symbols. crt0 walks the table, setting DBR per segment. Per-segment size capped at 0xFF00 so the 16-bit cpx #__bss_segN_size loop comparison doesn't wrap to 0 on a full-bank segment (a single full bank is split into a 0xFF00-byte primary + 0x100-byte tail in the same bank). Smoke validates BSS spanning bank 3 + bank 4 (100KB) is zeroed end-to-end. Note: program access to non-DBR bank globals still requires DBR management — the compiler emits DBR-relative absolute for global accesses, so accessing BSS in bank N needs the program to set DBR=N or use sta long via inline asm.

  • C++ exceptions absent from CI smoke. The SJLJ runtime round-trip is in smoke; the full clang++ → backend → MAME execution path runs reliably interactively but is excluded from automated smoke due to MAME-side I/O flakiness.

  • GS/OS validation uses a stub dispatcher. The wrapper contract (PHA + PEA 0 + LDX + JSL $E100A8 + post-call SP fixup) is verified end-to-end in MAME against a stub (scripts/runInMameWithGsosStub.sh). Validation against a real bootable GS/OS volume is left out of CI as it needs a smartport hard-disk image and live Tool Locator init.

  • VLAs work end-to-end (2026-05-09). Backend Custom-lowers ISD::DYNAMIC_STACKALLOC for both i16 and i32 result types. Loop patterns now produce correct results: sum_n(3)→6 verified in MAME smoke. Fix: in VLA functions PEI expands STAfi/STA8fi/STAfi_indY to a 4-MC sequence ending in LDY $F8 which clobbers N/Z; the StackSlotCleanup PHP/PLP wrap pass treats those pseudos as flag-corrupting so PLP wraps the entire expansion. expandFarFI uses STY $F8/LDY $F8 to a DP scratch slot rather than PHY/PLY (PHY/PLY between PHP/PLP would pollute the saved P).

  • dpack and dclass now both inline (2026-05-10). dpack uses a volatile-output array rewrite to defeat the backend stack-slot coalesce bug that previously caused dadd(1.5, 2.5) → 0x4010_4010_0000_0000. dclass's pointer-arg stores lower to STBptr/STAptr (indirect-long, DBR-independent) and inline cleanly. All softDouble routines compile at -O2.

  • IMG8..IMG15 callee-save via W65816ImgCalleeSave (2026-05-13). New post-RA, pre-PEI pass detects use of IMG8..IMG15 ($C0..$CE) in a function and emits prologue save + epilogue restore so those slots behave as callee-saved AT THE ASM LEVEL — without going through LLVM's CSR mechanism (which would shift regalloc decisions and break unrelated tests). Save shape per used slot: PHA; LDA $C?; STAfi A,slot,2; PLA; restore mirrors it. The +2 ImmOffset compensates for PHA's SP shift so the lowered sta d,s lands on the same byte that subsequent normal-SP reads see. Cost: ~16 cycles + 6 bytes per used slot, applied only to functions that actually use those slots (most don't). Fixed picol expr 1+2 == 4 (now 3) and a class of recursive double-fn miscompiles with compound || conditions — see feedback_picol_expr_compound_or.md. Smoke green including a new orBug regression test guarding the fix.
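
The multi-bank BSS item above says each segment stays inside one bank and is capped at 0xFF00 bytes, so a full bank becomes a 0xFF00-byte primary plus a 0x100-byte tail. One way to read that rule is sketched below. This is an assumption-laden illustration, not link816's code: the names are hypothetical, and since the text does not spell out how the segment limit interacts with the tail split, the capacity is left to the caller:

```c
#include <stdint.h>

#define BSS_SEG_CAP 0xFF00u /* per-segment size cap from the text */

struct BssSeg {
    uint16_t lo16; /* start offset within the bank */
    uint8_t bank;  /* bank byte (what DBR would be set to) */
    uint16_t size; /* bytes to zero, at most BSS_SEG_CAP */
};

/* Split [start, start + size) into segments that never cross a bank
 * boundary and never exceed BSS_SEG_CAP bytes. Returns the number of
 * segments written, or -1 if more than max_segs would be needed. */
int split_bss(uint32_t start, uint32_t size, struct BssSeg *out,
              int max_segs)
{
    int n = 0;
    while (size != 0) {
        if (n == max_segs)
            return -1;
        uint32_t lo16 = start & 0xFFFFu;
        uint32_t room = 0x10000u - lo16; /* bytes left in this bank */
        uint32_t chunk = size;
        if (chunk > room)
            chunk = room;
        if (chunk > BSS_SEG_CAP)
            chunk = BSS_SEG_CAP;
        out[n].lo16 = (uint16_t)lo16;
        out[n].bank = (uint8_t)(start >> 16);
        out[n].size = (uint16_t)chunk;
        n++;
        start += chunk;
        size -= chunk;
    }
    return n;
}
```

A full bank starting at $030000 comes out as two segments in bank 3 (0xFF00 bytes at offset 0, then 0x100 bytes at offset 0xFF00), matching the primary-plus-tail split described above.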