65816-llvm-mos/STATUS.md

# llvm816 — Current Status

LLVM/Clang backend for the WDC 65816 (Apple IIgs), forked from
llvm-mos as a separate `W65816` target.

## What works

End-to-end C-to-binary toolchain that produces 65816 machine code
which runs correctly under MAME (apple2gs).

**Language coverage at -O2 (no extra flags):**

- All scalar arithmetic: i8 / i16 / i32 / i64 add, sub, mul, div, mod
  (signed and unsigned). Carry-chained multi-word ops via ADC/SBC pseudos
  + ASLA16 / shift libcalls.
- Comparisons and signed/unsigned width conversions (sext, zext, trunc)
  for all the above sizes.  Signed compare near INT_MIN handled via an
  EOR-with-sign-bit transform.
- Pointer arithmetic, array indexing, struct field access, struct
  return-by-value (up to 8 bytes — Pair, Vec4, double).
- Pointer dereference (`*p`) lowers via `LDAptr / STAptr / STBptr`
  to `[$E0],Y` indirect-LONG with the bank byte at `$E2` forced to 0
  — DBR-independent, so `pha;plb` bank-switched callers don't corrupt
  data through callee local-pointer writes.  Const-int pointers
  (`*(volatile uint16 *)0x5000 = v` MMIO idiom) lower to `STAabs`
  (DBR-relative) so bank-2 writes still work.
- Bitfields, switch statements (verified up to ~12 cases + default),
  function pointers, function-pointer tables, indirect calls via
  `__jsl_indir` trampoline.
- Recursion: factorial, Fibonacci, depth-3 binary-tree
  insert/sum/min/max, simple recursive quicksort.
- Loops with goto / break / continue, nested loops, state machines.
- `<stdarg.h>` varargs with int / long / unsigned long long mixed args.
- Heap: `malloc` / `free` (libc.c first-fit allocator) — linked-list
  reverse with `cons` works; free-list coalesce verified.
- Strings: hand-rolled `strlen`, `strcmp`, `strcpy`, `strchr`, atoi/itoa
  roundtrip.
- Soft-float (single): all four ops + comparisons, MAME-verified.
- Soft-double: add, sub, mul, div all match gcc bit-for-bit with
  round-to-nearest-even rounding; 3-iter Newton sqrt converges.
  Compiles at -O2 throughout.  Long-running iterations may hit MAME's
  1-second sim-time budget (test config issue, not a compiler bug).
- Inline assembly with `"a"`, `"x"`, `"y"` register constraints and
  arbitrary opcode bytes (used for the `pha;plb` bank-switch idiom).
- C++ minimal: clang++ compiles a class with virtual + non-trivial
  ctor (vtable + RTTI omitted; no exceptions).
- printf with `%d %x %s %c %p %a %A` and width/precision specifiers.
- sprintf / snprintf / vsprintf / vsnprintf with the same format
  coverage as printf (`%d %u %x %o %ld %lu %s %c %f %F %e %E %g %G
  %a %A %p %n %%` + flags `- + (space) # 0` + width + precision +
  length modifiers `hh h l ll j z t`).  C99 truncation semantics
  for snprintf.  `%.Nf` produces the correct fractional digits with
  round-half-up.  Hex-float `%a` / `%A` decodes IEEE-754 bits via
  4 u16 words (no i64 shifts), emits `0x1.{hex}p{signed-dec}` with
  glibc-style trailing-zero stripping when precision is unspecified;
  subnormals canonicalize as `0x0.{hex}p-1022`.  Inf/NaN parity
  across all FP conversions (`%f %F %g %G %e %E %a %A`).
- scanf family: `sscanf` / `vsscanf` parse a C string; `fscanf` /
  `vfscanf` bridge to vsscanf via a per-call line buffer (caps at
  255 bytes / line; a longer line silently truncates).  `scanf`
  reads from stdin which always returns EOF on this target — the
  surface compiles but isn't useful without a stdin source.
  Format directives: `%d %i %u %x %X %o %s %c %ld %lu %lx %li %lo %%`.
- qsort + bsearch over arbitrary element size with a user `cmp`
  callback.
- Standard string/stdlib glue: strcat, strncat, strpbrk, strspn,
  strcspn, atol, llabs (kept in their own translation unit so
  vprintf's branch layout doesn't shift).
- `<math.h>`: fabs, floor, ceil, fmod, copysign, sqrt, pow,
  sin, cos, tan, exp, log, atan, atan2, asin, acos, sinh, cosh,
  tanh (and float variants).  Bit-twiddling for fabs/floor/ceil/
  copysign; Newton iteration for sqrt; range-reduction + Taylor
  for sin/cos/exp/log/atan; identities for asin/acos/atan2/sinh/
  cosh/tanh.  Accuracy is in the ~1e-6 range — good enough for
  typical numeric work, far short of glibc-quality.  These are
  slow (each call is dozens to hundreds of soft-double libcalls)
  — pre-compute or cache when possible.
- `setjmp` / `longjmp` from libgcc.s.
- Static constructors via crt0's init_array walk.
- `<stdio.h>` file I/O with two backends:
  - **mfs** — `mfsRegister(path, buf, size, cap, writable)` stages a
    memory buffer as a named file.  Used by smoke tests that don't
    have a real disk.  Fully validated end-to-end.
  - **GS/OS** — `fopen` falls through to `gsosOpen` for any path not
    in the mfs table.  Routes through the GS/OS class-1 dispatcher
    via wrappers in `runtime/src/iigsGsos.s` (Open/Read/Write/Close/
    SetMark/GetMark/SetEOF/GetEOF).  The full stdio surface
    (`fread/fwrite/fseek/ftell/fclose/fgetc/fputc/fputs/fgets/ungetc/
    feof/ferror/clearerr/rewind/fprintf/vfprintf`) dispatches on
    backend.  link816 honors weak symbols so programs that don't use
    the GS/OS backend don't have to link `iigsGsos.o`.
  - **Validation status:** code path compiles, links, and runs under
    `runViaFinder.sh --data` injection.  `fopen` + `gsosOpen` hangs
    when invoked under real GS/OS 6.0.2 (JSL $E100A8 doesn't return);
    root cause not yet diagnosed.  Stub-dispatcher GS/OS smoke (the
    existing one) validates the wrapper contract independently.  An
    XFAIL'd end-to-end smoke is in `scripts/smokeTest.sh` gated
    behind `GSOS_FILE_SMOKE=1` for use after the dispatcher path is
    fixed.  `runViaFinder.sh --data /PATH=local_file` is the
    automated-injection mechanism for runtime-test data files.
  - stdin/stdout/stderr route through `putchar` as before.
- `<wchar.h>`: wcslen / wcscmp / wcsncmp / wcscpy / wcsncpy /
  wcscat / wcschr / wcsrchr; mbtowc / wctomb / mbstowcs /
  wcstombs / mblen with the trivial 1:1 byte<->wide mapping
  (Latin-1).  wchar_t is 16-bit on this target.  Extended set:
  wmemcpy / wmemmove / wmemset / wmemcmp / wmemchr;
  wcstol / wcstoul / wcstoll / wcstoull / wcstod / wcstof;
  swprintf / vswprintf; wcsftime.  All delegate to the byte
  equivalents under the Latin-1 model.
- `<signal.h>`: in-process signal table.  signal() registers a
  handler; raise() invokes it.  Default actions: SIGABRT calls
  abort(), SIGINT/SIGTERM call exit(128+sig), others ignored.
- `<locale.h>`: setlocale always returns "C"; localeconv returns
  a fixed C-locale lconv struct.
- `<fenv.h>`: rounding mode + exception flag word tracked but
  no-op (softFloat / softDouble are fixed RNE; exceptions never
  raised).  Surface compiles cleanly for portable code.
- `<tgmath.h>`: C11 type-generic math via `_Generic`; selects
  `sqrtf` vs `sqrt` etc. based on argument type.
- `<stdatomic.h>`: C11 atomic surface, all ops lower to plain
  ops (single-core uniprocessor — no real synchronization
  needed).  `_Atomic T` is treated as plain `T`.
- `<threads.h>`: stubs.  `thrd_create` returns `thrd_error`;
  mutex/cond ops are no-ops; `call_once` and `tss_*` work since
  they're degenerate on a single-core target.
- `aligned_alloc` / `posix_memalign` / `aligned_free`: wrap
  malloc with an over-allocation + pointer-stash trick.  Match
  C11 contract — `aligned_alloc(N, M)` returns N-aligned, free
  with `aligned_free`.
- `<iso646.h>`: alternative operator spellings (`and`, `or`,
  `not`, etc.) — C95 compat header.
- `<stdalign.h>`: aliases `_Alignas` / `_Alignof` to `alignas` /
  `alignof`.
- `<stdnoreturn.h>`: aliases `_Noreturn` to `noreturn`.
- `<uchar.h>`: `char16_t` / `char32_t` typedefs + `mbrtoc16` /
  `c16rtomb` / `mbrtoc32` / `c32rtomb` conversion helpers.  In
  our Latin-1 model these are 1:1 byte copies (no UTF-8 decode).
- `<wctype.h>`: wide-char classification + case folding.
  Delegates to `<ctype.h>` for code-points 0..255; anything
  outside Latin-1 returns false / unchanged.
- `<complex.h>`: C99 complex-number surface — clang built-in
  `_Complex` lowers to soft-double under the hood.  Macros
  `complex` / `_Complex_I` / `I` / `CMPLX` / `CMPLXF` / `CMPLXL`
  plus inline `creal` / `cimag` / `conj` / `cproj` / `cabs` /
  `carg` and their `f` / `l` variants.  Transcendental complex
  routines (csin/ccos/cexp/etc.) intentionally not provided —
  they would each need a polynomial-expansion implementation
  with limited IIgs value.
- `<iigs/sound.h>`: thin convenience wrappers around the SoundManager
  toolset (`iigsBeep`, `iigsLoadDocSample`, `iigsPlayDocSample`,
  `iigsSoundStop`, `iigsSoundWait`, plus `iigsSoundProbeInit` /
  `iigsSoundProbeShutdown` for CLI-style demos that don't want the
  full `startdesk()` tool chain).  As of Phase 1.6 (2026-06-01) the
  `IigsSoundParmT` layout matches ORCA's authoritative
  `SoundParamBlock` exactly (18 bytes); the prior 6-byte struct was
  silently wrong.  Channel/genNum is now `FFStartSound`'s arg0, not a
  struct field.  Phase 2.4 (2026-06-01) landed `iigsLoadDocSample`
  (wraps `WriteRamBlock` for caller-RAM-to-DOC-RAM staging); see
  `demos/helloSample.c` for an end-to-end sine-wave probe.
- `<iigs/eventLoop.h>`: callback-based TaskMaster event loop
  (`iigsEventLoop(callbacks)` + `iigsEventLoopQuit()`).  Dispatches
  close-box clicks, menu picks, key events, mouse clicks, idle.
  Saves the typical 30-line dispatch switch every desktop app
  otherwise carries.
- `<iigs/resource.h>`: typed-C facade over the IIgs Resource
  Manager — `resourceProbeInit()`, `iigsLoadResource(type, id)`,
  `iigsGetResourceSize(type, id)`, `resourceRuntimeEnabled()`.
  **Phase 3.4 STUB-ONLY landing:** the toolset surface compiles
  and links cleanly into any demo, but all three runtime entry
  points return `RES_ERR_BLOCKED` today because the live path
  (MMStartUp + TLStartUp + ResourceStartUp +
  OpenResourceFile-on-own-pathname) reaches the same blocking code
  as `fopen` on GS/OS 6.0.2.  That is Phase 1.1 of the gap-closure
  plan, still open.  Flip `IIGS_RESOURCE_RUNTIME_ENABLED=1` after
  Phase 1.1 lands and the existing typed wrappers route through to
  the real toolbox.
  - **Bundler:** `tools/rsrcBundle/rsrcBundle.py` reads a flat dir
    of `TYPECODE_ID.bin` files (e.g. `8014_0001.bin` = rText id 1),
    builds `rResourceMap` + `rIndex` per Apple IIgs Toolbox Reference
    Vol 3, stitches with the OMF data fork, emits an AppleSingle
    blob (Phase 0.7 decision) plus an optional `--sidecar`
    `_ResourceFork.bin` for cadius ingestion (cadius v1.4.6's
    AppleSingle parser drops resource_fork entries; the sidecar is
    what `ADDFILE` actually picks up).
  - **Inspector:** `tools/rsrcBundle/dumpFork.py` decodes the
    rResourceMap header + rIndex table for diff/debug.  Supports
    both raw forks and AppleSingle blobs (`--applesingle`).
  - **Integration:** `demos/build.sh` runs `rsrcBundle` as a
    post-step when `demos/<name>.rsrc/` exists; output goes to
    `demos/<name>.apl` + `demos/<name>.apl_ResourceFork.bin`.
  - **Demo:** `demos/rsrcProbe.c` exercises the stub surface end
    to end + verifies the bundler post-step under MAME (markers at
    `$70..$73`).
- `<assert.h>`: adds C11 `static_assert` as a macro alias for
  the `_Static_assert` keyword.
- `<errno.h>`: full C standard error codes (EDOM, ERANGE,
  EILSEQ) plus common POSIX codes (EPERM..EPIPE, ENAMETOOLONG,
  ENOSYS, ENOTEMPTY, ELOOP).  `strerror` maps every defined
  code to a human-readable string.
- `<stdio.h>`: adds C standard buffer-control surface
  (`setvbuf` / `setbuf` as no-ops, `_IOFBF` / `_IOLBF` / `_IONBF`
  / `BUFSIZ`); `fgetpos` / `fsetpos` wrap `ftell` / `fseek`;
  `remove` routes through `mfsUnregister`; `rename` / `tmpfile`
  / `tmpnam` are stubs.
- C++ subset: classes, single inheritance, multiple inheritance
  (Drawable+Movable through one Sprite), virtual base diamond
  (A and B virtually derive Base; Diamond inherits from both
  with one shared Base subobject), virtual functions,
  polymorphism via base-class pointer arrays, virtual dtors,
  this-pointer adjustment for non-leftmost bases, vbase offset
  tables.  RTTI / `dynamic_cast` works (downcast, MI cross-cast,
  virtual-base sibling cast) via a minimal libcxxabi shim
  (`runtime/src/libcxxabi.c`) that provides `__dynamic_cast` +
  the three typeinfo class vtables (`__class_type_info`,
  `__si_class_type_info`, `__vmi_class_type_info`) + sized
  `operator delete` + `__cxa_pure_virtual`.
- C++ ABI: `operator new` / `operator new[]` / `operator delete[]`
  (sized + unsized variants, both `j` and `m` size-type manglings),
  `__cxa_atexit` + `__run_cxa_atexit` (each crt0 walks the registered
  dtor table in LIFO order after `main()` returns, so global and
  static-local non-trivial dtors actually run before halt/QUIT),
  `__cxa_guard_acquire` / `__cxa_guard_release` / `__cxa_guard_abort`
  (Meyers-singleton gates), `__dso_handle` (single-DSO cookie).
  All in `runtime/src/libcxxabi.c`; libgcc.s carries a weak no-op
  `__run_cxa_atexit` so pure-C programs that don't link libcxxabi
  still resolve.  Global ctors with non-trivial bodies are picked up
  by `crt0Gno` / `crt0Gsos` / `crt0` walking `.init_array` via the
  bank-0-anchored `__indirTarget` slot at DP `$B8`.
- C++ containers via **vendored ETL** (MIT, header-only at
  `runtime/include/c++/etl/`): `etl::vector<T,N>`, `etl::string<N>`,
  `etl::map<K,V,N>`, `etl::optional`, `etl::array`, etc.  Fixed-capacity
  (no malloc).  Target profile at `runtime/include/c++/etl_profile.h`
  defines `ETL_NO_STL` and disables exceptions, atomics, `std::ostream`.
  `demos/etlProbe.cpp` is the worked example (vector + string in 6.3 KB
  text).
- C++ exceptions via `clang++ -fsjlj-exceptions`: throw, catch,
  catch-by-value, multiple catch handlers, exception destruction.
  `W65816SjLjFinalize` IR pass inserts the call-site dispatch and
  per-function catch table; `runtime/src/libcxxabiSjlj.c` provides
  the Itanium SJLJ surface (`_Unwind_SjLj_*`, `__cxa_throw`,
  `__cxa_begin_catch`, etc.) plus a no-op personality.
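
The signed-compare entry in the list above mentions an EOR-with-sign-bit transform. The following is a minimal C sketch of that idea for i16, assuming the backend uses the standard "flip the sign bit on both operands, then compare unsigned" trick. The helper name `signedLessViaEor` is hypothetical and is not part of the runtime.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: signed a < b computed with an unsigned compare.
 * XOR-ing both operands with 0x8000 maps INT16_MIN..INT16_MAX onto
 * 0x0000..0xFFFF in the same order, so the unsigned compare gives the
 * signed answer without the overflow hazard of computing a - b when one
 * operand is near INT16_MIN. */
bool signedLessViaEor(int16_t a, int16_t b) {
    uint16_t ua = (uint16_t)((uint16_t)a ^ 0x8000u);
    uint16_t ub = (uint16_t)((uint16_t)b ^ 0x8000u);
    return ua < ub;
}
```

A subtract-and-test-sign compare would get `INT16_MIN < 1` wrong because the difference overflows; the biased form has no such corner case.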
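
The `aligned_alloc` / `aligned_free` entry in the list above describes an over-allocation + pointer-stash trick. The sketch below shows the general technique in portable C; it is not the project's implementation, and the names `sketchAlignedAlloc` / `sketchAlignedFree` are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch: over-allocate, round the address up to the
 * requested alignment, and stash the raw malloc pointer in the bytes
 * just below the address handed to the caller. */
void *sketchAlignedAlloc(size_t align, size_t size) {
    if (align == 0 || (align & (align - 1)) != 0)
        return NULL;                       /* power-of-two alignments only */
    /* Worst case: align-1 bytes of padding plus room for the stash. */
    unsigned char *raw = malloc(size + (align - 1) + sizeof(void *));
    if (raw == NULL)
        return NULL;
    uintptr_t start = (uintptr_t)raw + sizeof(void *);
    uintptr_t aligned = (start + (align - 1)) & ~(uintptr_t)(align - 1);
    unsigned char *user = raw + (aligned - (uintptr_t)raw);
    /* memcpy avoids a misaligned pointer store for small alignments. */
    memcpy(user - sizeof(void *), &raw, sizeof raw);
    return user;
}

/* Recover the stashed raw pointer and release the whole allocation. */
void sketchAlignedFree(void *p) {
    if (p == NULL)
        return;
    void *raw;
    memcpy(&raw, (unsigned char *)p - sizeof raw, sizeof raw);
    free(raw);
}
```

Because the pointer returned to the caller is not the one `malloc` produced, it has to be released through the matching free routine, which is why the list entry pairs `aligned_alloc` with `aligned_free`.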

**Toolchain:**

- `clang` / `llc` produce W65816 assembly + ELF object files.
- `tools/link816` resolves cross-translation-unit refs, lays out
  text/rodata/bss, emits a flat binary the IIgs ROM can load.
  Auto-relocates bss above text+rodata when the default
  `--bss-base 0x2000` would overlap text, and skips past the
  IIgs IO window ($C000-$CFFF) if needed.  `--gc-sections`
  (default ON) drops unreachable functions: a minimal program
  with full runtime linked shrinks from ~43KB to ~1.5KB.
- `link816 --segment-cap N` packs `.text` greedily into multiple
  bank-aligned segments, capped at N bytes per segment.  Segment 1
  stays at `--text-base` in bank 0 (alongside rodata + bss + init);
  segments 2..M start at `--segment-bank-base` (default $040000)
  in successive banks.  `--manifest path.json` writes a JSON file
  listing each segment's image, base, and entry offset.
  Cross-bank `JSL` (IMM24 reloc) just works — patched at link
  time with the full 24-bit address.  Cross-bank IMM16 is
  permitted (uses DBR for bank — caller pins DBR to data's bank);
  cross-bank PCREL is rejected with a clear diagnostic.
  `scripts/runMultiSeg.sh` is a mini in-Lua loader for MAME that
  reads the manifest, places each segment's bytes, and runs from
  segment 1's entry — used by smoke to verify cross-bank JSL
  end-to-end (helper3 chain across 3 bank-aligned segments).
- `tools/omfEmit` produces OMF v2.1 files in three modes:
  - **(a) single-segment:**
    `--input flat.bin --map flat.map --base ADDR --entry SYM`,
    KIND=0x0000 (CODE, dynamic), ORG=0 (loader picks bank).
  - **(b) multi-segment:** `--manifest path.json` reads link816's
    manifest and emits one OMF segment per entry with KIND=0x8800
    (STATIC|ABSBANK|CODE) + ORG=segment-base, asking the GS/OS Loader
    to place each at its declared bank-aligned address.  All
    intra-segment relocations were already patched by the linker, so
    no INTERSEG/RELOC opcodes are needed for v1 static placement.
  - **(c) custom DP/stack:** `--stack-size N` (auto-enables
    `--expressload`) appends a `~Direct` DP/Stack segment
    (KIND=0x1012) of N bytes so apps can request a custom DP+stack
    allocation from GS/OS instead of the Loader's 4KB default.
    Validated end-to-end via `runViaFinder.sh` under real GS/OS
    6.0.2.  The slow Loader path silently rejects multi-segment
    OMFs, so `--stack-size` is gated behind ExpressLoad emission.
- `link816 --debug-out FILE` writes a DWARF sidecar with
  text/rodata/bss/init_array relocations applied to every `.debug_*`
  section, so `.debug_addr` / `.debug_line` PC values are
  final-image addresses.
- `runtime/build.sh` builds crt0, libc, soft-float, soft-double,
  libgcc into linkable objects.
- `scripts/smokeTest.sh` runs 148 end-to-end checks at -O2.  100% pass.
  - **Core checks:**
    - scalar ops, control flow, calling conventions, MAME execution
      regressions
    - link816 bss-base safety + weak-symbol resolution +
      heap_end-vs-heap_start sanity
    - iigs/toolbox.h compile + link, iigs/gsos.h compile + link,
      standalone runtime headers
    - AsmPrinter peepholes (STZ / PEA / PEI: single-STA,
      shared-LDA-multi-STA, DPF0-forwarding)
    - malloc/free coalesce ordering
  - **Real-world coverage:**
    - Conway's Game of Life blinker (2D loop + neighbour bounds)
    - binary search tree (recursive struct + malloc)
    - function-pointer dispatch table (indirect JSL via `__jsl_indir`)
    - memory-backed file I/O (mfsRegister +
      fopen/fread/fwrite/fseek/fprintf)
    - C++ polymorphism (single inheritance)
    - C++ multiple inheritance (Drawable+Movable)
    - C++ virtual base diamond
    - C++ dynamic_cast (SI + MI cross-cast + virtual-base sibling cast
      through libcxxabi shim)
    - SJLJ exception runtime end-to-end (libcxxabiSjlj.c throw/catch
      round-trip via setjmp/longjmp + catch-table walk)
    - C++ -fsjlj-exceptions compile + link (the C++ frontend → backend
      path is execution-verified manually but skipped from MAME smoke
      due to a MAME-side flakiness; see "What's next")
    - GS/OS wrapper round-trip via stub dispatcher pre-loaded at
      $E100A8 (validates PHA + PEA 0 + JSL + post-call SP-fixup
      contract end-to-end)
    - wchar / signal core APIs
    - hex dumper writing through fprintf
    - JSON tokenizer state machine
    - hash-table command shell (parser + dispatch + chained collisions
      over fprintf-to-mfs)
    - scripts/bench.sh size-vs-Calypsi harness

- `scripts/benchCyclesPrecise.sh` measures per-call cycle counts via
  MAME's `emu.time()` between A1A1/A2A2 markers.  Runs vs commercial
  Calypsi 5.16 (`scripts/benchCyclesCalypsi.sh`) for an
  apples-to-apples speed comparison.  Current numbers (2026-05-27,
  Layer 2):

  | Bench        | Ours (cyc/call) | Calypsi (cyc/call) | Ratio (ours/Calypsi) |
  |--------------|----------------:|-------------------:|---------------------:|
  | dotProduct   | 1534  | 5712    | 0.27×  |
  | bsearch      | 682   | 2387    | 0.29×  |
  | sumOfSquares | 6820  | 16368   | 0.42×  |
  | bubbleSort   | 11594 | 17050   | 0.68×  |
  | strLen       | 767   | 1023    | 0.75×  |
  | djb2Hash     | 2046  | 2643    | 0.77×  |
  | popcount     | 1194  | 1534    | 0.78×  |
  | strcpy       | 1108  | 1194    | 0.93×  |
  | memcmp       | 682   | 716     | 0.95×  |
  | fib          | 11594 | 10912   | 1.06×  |

  **Geomean: 0.62× Calypsi.**  9 of 10 below 1.0×; only fib trails
  (recursive call overhead, structural).  Speed is the optimization
  priority, not size.

- `compare/` holds three side-by-side C tests with our asm and
  Calypsi's listing for static-size comparison:
  `sumSquares`/`evalAt`/`mul16to32`.  `bash compare/regen.sh`
  recompiles each under both `clang --target=w65816 -O2 -S` and
  `cc65816 --speed -O 2 --64bit-doubles` and prints an
  ours/Calypsi instruction-count ratio.  See `compare/README.md`.
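
The `--segment-cap N` entry in the list above describes greedy packing of `.text` into bank-aligned segments. The sketch below models one plausible reading of that: next-fit packing in input order, with segment 1 at the text base and later segments in successive banks from the segment bank base. The `Placement` struct and `packSegments` function are hypothetical, and link816's actual policy and manifest fields may differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical placement record for one chunk of code. */
typedef struct {
    unsigned segment;   /* 1-based segment index */
    uint32_t address;   /* 24-bit load address of the chunk */
} Placement;

/* Next-fit greedy packing: keep filling the current segment until the
 * next chunk would exceed `cap`, then open a new segment in the next
 * bank.  Returns the number of segments opened, or 0 if any chunk is
 * larger than the cap. */
unsigned packSegments(const uint32_t *sizes, size_t count, uint32_t cap,
                      uint32_t textBase, uint32_t segmentBankBase,
                      Placement *out) {
    unsigned segment = 1;
    uint32_t base = textBase;   /* segment 1 stays at the text base */
    uint32_t used = 0;
    for (size_t i = 0; i < count; i++) {
        if (sizes[i] > cap)
            return 0;
        if (used + sizes[i] > cap) {
            segment++;
            /* Segments 2..M sit in successive 64KB banks. */
            base = segmentBankBase + ((uint32_t)(segment - 2) << 16);
            used = 0;
        }
        out[i].segment = segment;
        out[i].address = base + used;
        used += sizes[i];
    }
    return segment;
}
```

With chunk sizes `0x6000, 0x5000, 0x4000, 0x9000` and a cap of `0xC000`, this model puts the first two chunks in segment 1 and one chunk each in segments 2 and 3, at `$040000` and `$050000` for the default bank base.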
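
The geomean quoted with the cycle table above can be re-derived from the table's own columns. The sketch below does so in plain C; the function name `geomeanRatio` is hypothetical, and the n-th root is taken by bisection only to keep the example free of a libm dependency.

```c
#include <stddef.h>

/* Geometric mean of ours[i] / theirs[i] for i in 0..n-1 (n must be > 0).
 * Computed as the n-th root of the product of the ratios; the root is
 * found by bisection on [0, max(1, product)]. */
double geomeanRatio(const unsigned *ours, const unsigned *theirs, size_t n) {
    double product = 1.0;
    for (size_t i = 0; i < n; i++)
        product *= (double)ours[i] / (double)theirs[i];

    double lo = 0.0;
    double hi = product > 1.0 ? product : 1.0;
    for (int iter = 0; iter < 200; iter++) {
        double mid = 0.5 * (lo + hi);
        double power = 1.0;
        for (size_t i = 0; i < n; i++)
            power *= mid;               /* mid raised to the n-th power */
        if (power < product)
            lo = mid;
        else
            hi = mid;
    }
    return 0.5 * (lo + hi);
}
```

Fed the ten rows of the table, this evaluates to roughly 0.626, close to the reported 0.62×; the small difference may come from how the summary figure was rounded.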

**Backend register allocation:**

- Greedy regalloc as default at -O1+; fast at -O0/optnone.  Greedy
  was previously blocked by an upstream LLVM
  `LiveRangeEdit::eliminateDeadDef` assertion firing on KILL pseudos
  with non-dead implicit-def $a.  Fix landed in
  `tools/llvm-mos/llvm/lib/CodeGen/InlineSpiller.cpp`: when
  InlineSpiller converts a redundant STAfi to a KILL pseudo, mark BOTH
  explicit and implicit defs dead (the original loop only iterated
  `MI.defs()` = explicit-only, leaving the inherited implicit-def $a
  live).  Bench impact: popcount −19.4%, strcpy −18.9%, memcmp −8.6%,
  bsearch −9.2%.

- Pre-RA passes: `WidenAcc16` (Acc16→Wide16 promotion, lets
  greedy spread i16 pressure across A and 16 IMG slots);
  `TiedDefSpill` (handles tied-def-multi-use hazard);
  `ABridgeViaX` (bridges via X/Y when free).
- Post-RA passes: `SpillToX` (STA/LDA pairs → TAX/TXA bridges
  when X dead); `StackSlotCleanup` (deletes redundant adjacent
  spills); `NegYIndY` (rewrites negative-Y indirect-Y stack-rel
  ops to avoid the 24-bit-add bank-cross).
- Pre-emit: `BranchExpand` (long Bxx → INV_Bxx skip; BRA target);
  `SepRepCleanup` (coalesces adjacent SEP/REP toggles, plus a
  cross-mode-neutral coalesce that drops REP/SEP pairs sandwiching
  X-flag-only ops, branches, transfers — saves 4B / 12cyc per
  collapse).  AsmPrinter LDAi8imm peephole walks past mode-neutral
  MIs to fuse the closing REP into a following SEP.
- Imaginary registers IMG0..IMG15 backed by DP $C0..$CE +
  $D0..$DE — gives greedy 17 effective i16 carriers (A + 16 IMG)
  before stack spills kick in.
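
The `SepRepCleanup` entry in the list above coalesces adjacent SEP/REP toggles. The sketch below models only the status-register arithmetic behind such a fold: any run of SEP (set bits) and REP (clear bits) operations has the same net effect on P as at most one REP followed by one SEP. The types and function names are hypothetical, and the real pass works on machine instructions with additional legality checks.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum { OP_SEP, OP_REP } ToggleKind;
typedef struct { ToggleKind kind; uint8_t mask; } Toggle;

/* Reference model: apply each toggle to the status byte in order. */
uint8_t applyToggles(uint8_t p, const Toggle *ops, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (ops[i].kind == OP_SEP)
            p |= ops[i].mask;              /* SEP sets the masked bits */
        else
            p &= (uint8_t)~ops[i].mask;    /* REP clears the masked bits */
    }
    return p;
}

/* Fold a run of toggles into at most two: the last write to each bit
 * wins, so track a net set mask and a net clear mask.
 * Caveat: this models P only.  On real hardware, setting the X flag
 * zeroes the high bytes of X and Y, so a fold across such a toggle needs
 * its own safety check. */
size_t coalesceToggles(const Toggle *ops, size_t n, Toggle out[2]) {
    uint8_t setMask = 0, clearMask = 0;
    for (size_t i = 0; i < n; i++) {
        if (ops[i].kind == OP_SEP) {
            setMask |= ops[i].mask;
            clearMask &= (uint8_t)~ops[i].mask;
        } else {
            clearMask |= ops[i].mask;
            setMask &= (uint8_t)~ops[i].mask;
        }
    }
    size_t count = 0;
    if (clearMask) { out[count].kind = OP_REP; out[count].mask = clearMask; count++; }
    if (setMask)   { out[count].kind = OP_SEP; out[count].mask = setMask;   count++; }
    return count;
}
```

For example, `SEP #$20 ; REP #$20 ; SEP #$30` folds to a single `SEP #$30` under this model.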

**ABI:**

- arg0 in A; arg1 in X for i32-first-arg signatures; rest pushed RTL
  on the system stack with PHA.  Caller deallocates via `tsc;clc;adc
  #N;tcs` or `PLY*N/2`.
- Return: i8/i16 in A; i32 in A:X; i64 in A:X:Y plus DP[$F0..$F1] for
  the highest 16 bits.
- Frame is empty-descending (S points to next-free); offsets account
  for the +1 skew vs LLVM's full-descending model.
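
The return convention above spreads an i64 across A, X, Y and a direct-page word. The sketch below models that split in C. The list only states that DP `$F0..$F1` holds the highest 16 bits; the ordering assumed here (A lowest, then X, then Y) is an assumption, and the `Ret64` type and helper names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical model of the four 16-bit carriers of an i64 return. */
typedef struct {
    uint16_t a;      /* bits 0..15  (assumed) */
    uint16_t x;      /* bits 16..31 (assumed) */
    uint16_t y;      /* bits 32..47 (assumed) */
    uint16_t dpF0;   /* bits 48..63, the word at DP $F0..$F1 */
} Ret64;

/* Split a 64-bit value into the four carriers. */
Ret64 splitRet64(uint64_t v) {
    Ret64 r;
    r.a    = (uint16_t)(v & 0xFFFFu);
    r.x    = (uint16_t)((v >> 16) & 0xFFFFu);
    r.y    = (uint16_t)((v >> 32) & 0xFFFFu);
    r.dpF0 = (uint16_t)((v >> 48) & 0xFFFFu);
    return r;
}

/* Reassemble the value on the caller side. */
uint64_t joinRet64(Ret64 r) {
    return (uint64_t)r.a
         | ((uint64_t)r.x << 16)
         | ((uint64_t)r.y << 32)
         | ((uint64_t)r.dpF0 << 48);
}
```

An i32 return is the same picture truncated to the first two fields.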

**IIgs toolbox:**

- `iigs/toolbox.h` — autogenerated wrappers for all ~1300 IIgs
  toolbox routines across 35 tool sets (Tool Locator, Memory
  Manager, Misc Tools, QuickDraw II / Aux, Event Manager,
  Sound Manager, Apple Desktop Bus, SANE, Integer Math, Text
  Tools, Window Manager, Menu Manager, Control Manager,
  LineEdit, Dialog Manager, Scrap Manager, Standard File,
  Note Synth/Sequencer, Font Manager, List Manager, ACE,
  Resource Manager, MIDI, Video Overlay, TextEdit, Media
  Control, Print Manager, Scheduler, Desk Manager, …).  Names
  match Apple's IIgs Toolbox Reference exactly (TLStartUp,
  MMStartUp, NewWindow, SysBeep, …).  417 simple wrappers
  (zero/single-arg, i16-or-void return) inline in the header;
  890 multi-arg ones live in `runtime/src/iigsToolbox.s`.
  Generated by `scripts/genToolbox.py` from ORCA-C's
  `ORCACDefs/` (re-runnable when ORCA-C updates).

## What's next

Work is now optimization-focused; the toolchain is feature-complete
for the common-case C / minimal-C++ workload.  Priority is speed
(cycle counts), not size.

**Recently landed (2026-05-25):**

- **Layer 1 ptr32 deref-fold (always on)** — Constant offset on a
  ptr32 deref folds into the `[dp],Y` Y register instead of a CLC/ADC
  carry-chain pre-add.  Plus consecutive-deref CSE that shares the
  `$E0/$E2` staging across `s->a`, `s->b`, ... accesses with the same
  base.  Always on; saves ~3 instructions per struct-field access.
  See `feedback_ptr32_deref_fold_layer1_landed.md`.

- **Layer 2 ptr32 deref via `(d,S),Y` (opt-in)** —
  `-mllvm -w65816-dbr-safe-ptrs` switches ptr32 derefs to the
  one-instruction `lda (d,S),Y` (opcode 0xB3) at the cost of reading
  only 16 bits of pointer.  Bank byte is implicit DBR.  Correct only
  for code that touches memory inside DBR's bank — typical for
  malloc/globals/BSS-only programs (Lua, Picol).  Lua 5.1.5 shrinks
  20.6%, dropping our total from 1.45× to 1.15× Calypsi.  Default
  off; per-TU opt-in.  See `feedback_ptr32_layer2_landed.md` and
  `docs/USAGE.md` for the safety rules.

- **Inline-threshold lowered target-wide to 50** (was LLVM default
  225).  LLVM's default is tuned for desktop ISAs where call overhead
  is high relative to inlined-body byte cost.  On W65816, `jsl` is
  cheap (4 bytes / ~8 cycles) but inlined ptr32 derefs are expensive
  even with Layer 2 — the tradeoff inverts.  At 225, Lua's
  `index2adr` (41 callers in lapi.c) and CoreMark's `matrix_test`
  helpers got copied everywhere.  At 50, neither does, and the cycle
  benchmark suite is unchanged.  With Layer 2 + threshold=50, total
  Lua is **0.93× Calypsi** and total CoreMark is **0.73× Calypsi
  (we beat by 27%)** — the latter improved from 0.79× after the
  STA_DP / LDA_DP form pickup landed (W65816AsmPrinter now emits
  the 2-byte DP form for constant addresses in `$00..$FF`, saving
  ~1 byte per applicable op).  Override per-TU with
  `-mllvm -inline-threshold=N`.  See
  `feedback_lapi_inline_threshold.md` and
  `feedback_coremark_matrix_test_regression.md`.

- **CoreMark 1.0 ported** (`tests/coremark/`).  EEMBC's standard
  embedded benchmark, ~2K LOC.  Exercises linked-list traversal,
  matrix multiply, formal state machine, CRC — patterns Lua doesn't
  hit.  Build requires `--layer2` to fit a single bank
  (otherwise crosses the IO window at 0xC000).  See
  `tests/coremark/README.md` and `feedback_coremark_landed.md`.
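
The Layer 2 entry above trades the pointer's bank byte for an implicit DBR. The sketch below models why that is only safe inside DBR's bank, assuming a flat 24-bit address model and ignoring the Y offset; the function names are hypothetical and none of this is compiler code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Full-pointer deref: the pointer's own bank byte (bits 16..23) is used. */
uint32_t effectiveAddrFull(uint32_t ptr) {
    return ptr & 0xFFFFFFu;
}

/* Layer 2 style deref: only the low 16 bits of the pointer are read and
 * the bank comes from the data bank register. */
uint32_t effectiveAddrLayer2(uint32_t ptr, uint8_t dbr) {
    return ((uint32_t)dbr << 16) | (ptr & 0xFFFFu);
}

/* The two agree exactly when the pointer's bank byte equals DBR. */
bool layer2IsSafe(uint32_t ptr, uint8_t dbr) {
    return ((ptr >> 16) & 0xFFu) == dbr;
}
```

A pointer such as `$03:1234` dereferenced with DBR = `$02` silently lands on `$02:1234` under the 16-bit form, which is the failure mode the per-TU opt-in is meant to keep contained.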

**Speed wins queued, ranked by expected impact:**

- **ptr32 pointer-increment overhead** (partially addressed).  The
  `i32 += 1` post-PEI peephole (`W65816I32IncFold`) detects the
  6-instruction LDA/ADCi16imm 1/STA/LDA/ADCEi16imm 0/STA pattern and
  rewrites to LDA/INA/STA/INC_HI_IF_CARRY (with private-label BNE
  expansion in AsmPrinter).  Saves ~13 cyc per increment on the
  no-carry common path.  memcmp 1330 → 1194 (−10.2%), strcpy 3325 →
  3154 (−5.1%).  Now also tolerates intervening TAX/TXA pseudo-saves
  in the matcher (regalloc inserts them around STAfi's conservative
  `Defs=[A]`); LSR-introduced i32 PHIs like `lsr.iv9 += 1` now match.
  LSR's `*p++ → base+offset` rewrite remains unaddressed; tried
  `-disable-lsr` and `isLSRCostLess` override, both regressed
  dotProduct.

- **W65816StackSlotMerge — value-equivalent stack slot coalesce**
  (2026-05-13).  Pre-emit pass that merges PHI src/dst stack-slot
  pairs which LLVM's StackSlotColoring can't see (they're
  simultaneously live but hold the same value).  Detects the
  canonical loop-body `LDA X ; STA Y` PHI-copy in a self-looped
  MBB, verifies value equivalence via bidirectional twin-pairing
  (Case 1: same A in same MBB / Case 2: PHI-copy reload pattern /
  Case 3: matching `LDA #const` init in different MBBs), and
  renames slot X→Y function-wide.  Runs AFTER SepRepCleanup so the
  PHI copies are out of their PHP/PLP wraps and offsets are stable.
  **A-define detection is opcode-based, not operand-based** —
  LDA_DP / LDA_Abs / LDA_Long etc. omit the `implicit-def $a`
  annotation in tablegen but semantically write A; the
  `semanticallyDefsA` helper falls back to an opcode whitelist.
  sumSquares static: 65 → 61 inst (1.97×, under 2× Calypsi for
  the first time).  sumOfSquares cyc/call: 18755 → 17391
  (**−7.3%**).  strcpy: 2558 → 2387 (−6.7%).  See
  `W65816StackSlotMerge.cpp`.

- **LSR-widened i32 IV narrowing** (`W65816NarrowI32Mul` Phase 2,
  2026-05-13).  After rewriting `mul i32 X, Y` to a `__umulhisi3`
  call, scan for i32 PHIs whose only uses are (a) the truncs the
  rewrite emitted and (b) a single self-feeding `add %P, const`.
  When SCEV bounds the PHI to u16, build an i16 PHI + i16 add in
  place, replace truncs, and erase the i32 chain.  Care needed
  to break the PN ↔ Incr use-cycle before erasing.  sumSquares
  frame: 14B → 12B; loop-internal `i++` shrinks from 7→3 inst.

- **PHI-hoist accepts LDA_Imm16 / LDAi16imm** (2026-05-13).
  Init blocks contain `lda #const ; sta slot,s` pairs wrapped in
  PHP/PLP around the pre-loop CMP — same shape as a PHI-copy
  wrap but with an immediate load instead of a memory load.
  Matcher extended to accept both the MC opcode (`LDA_Imm16`) and
  the surviving pseudo (`LDAi16imm`), with an added **$a-live-out
  guard**: if any successor MBB has $a in its live-in set, bail —
  the LDA's A-value is a fall-through register-PHI consumed by
  the successor's first STA, and hoisting clobbers it.  Caught
  by `sumTable` where `lda #0 ; sta 0x9,s` (wrap+trailing) ALSO
  supplied A=0 to `bb.2`'s `sta 0x1,s`.

- **16x16→32 multiply via `__umulhisi3` + `W65816NarrowI32Mul` IR
  pass** (2026-05-13).  Added `__umulhisi3` (unsigned 16x16→32) to
  `runtime/src/libgcc.s`.  New IR pass in `addISelPrepare` walks
  `mul i32 X, Y` and uses IR-level `computeKnownBits` plus a SCEV
  unsigned-range fallback (`getUnsignedRange().getActiveBits() <=
  16`) to detect operands with provably-zero high 16 bits — fires
  on the canonical loop-internal `(u32)i*i` pattern after LSR
  widens the i16 IV to i32.  Rewrites to a call to `__umulhisi3`.
  sumOfSquares 20801 → 19096 cyc/call by itself (-8.2% from
  baseline).

- **Dead TAX/TXA peephole** (2026-05-13).  STAfi's conservative
  `Defs=[A]` (for the IMG-source PHA-bracketed expansion path)
  causes regalloc to insert spurious TAX/TXA save/restore brackets
  even when STAfi's source is A directly.  `W65816SepRepCleanup`
  now elides TXA/TYA whose next non-debug inst defines `$a`, and
  TAX/TAY whose target reg is dead before its next redefinition.
  Cross-MBB liveness via `Succ->isLiveIn(...)`; bails on
  return-terminated MBBs (RTL doesn't model the i32-return
  convention).  Tracks `pRedef` so `TAX ; CLC ; ADC` chains
  don't bail on ADC's $p-read (CLC freshens the carry flag).

- **i32 += i32 store-bypass** (2026-05-13).  Regalloc materializes
  the call-result `A:X` i32 pair into spill slots before the add,
  then reloads, emitting a 10-instruction
  `STA-TXA-STA-LDA-CLC-ADC-STA-LDA-ADC-STA` sequence.
  `W65816SepRepCleanup` matcher rewrites to 6-instruction
  `CLC-ADC-STA-TXA-ADC-STA` (TXA preserves carry; hi-half consumes it
  directly from X).  Saves 4 inst / ~13 cyc per call-result-add site.
  sumOfSquares 20460 → 19096 (-6.7%).

- **PHI-copy hoist out of PHP/PLP wrap** (2026-05-13).
  `W65816SepRepCleanup` detects the back-edge `CMP ; PHP ; (LDA/STA
  pairs) ; PLP ; (trailing STA) ; Bxx ; BRA loop` pattern and
  hoists the LDA/STA pairs + trailing above the CMP's $a-producer
  chain, dropping PHP/PLP.  Two safety guards: (1) **bump undo** —
  in-wrap stack-rel offsets were pre-adjusted by +1 (PHP decrements
  S; `W65816StackSlotCleanup`'s wrap pass compensates inside the
  wrap), so the hoist subtracts 1 from each `LDA_StackRel` /
  `STA_StackRel` offset; trailing STAs (already outside the wrap)
  are untouched.  (2) **pair-count check** — require
  `#LDAs(Block) == #STAs(Block) + #STAs(Trailing)`; an extra LDA
  is a memory-to-register PHI value live-out at the back-edge
  (consumed by the loop top's first STA), and hoisting would
  clobber A.  Saves 2 inst / 8 cyc per occurrence.  sumOfSquares
  19096 → 18755 (-1.8%), popcount 3683 → 3478 (-5.6%).

- **More peephole / libcall opportunities.**  __mulsi3 just gained
  early-exit when the multiplier shifts to 0; dotProduct dropped
  4007→2472 (−38.3%), sumOfSquares 40920→23870 (−41.6%).  Next
  candidates: shift-by-N inlining for shifts 5+ that currently
  go through __ashlsi3; a `u32 += zext i16` SDAG combine to skip
  the hi-half carry chain when one operand has known-zero high
  16 bits.

- **W65816StackRelToImg peephole pipeline** (2026-05-20).  Eight
  always-on peepholes plus an extended phase 4 in the pre-emit
  StackRelToImg pass: (1) `elidePhaBracket` with case-a single-store
  bracket + case-b ImgCalleeSave multi-store with STA-hoist +
  case-c STA_DP-only multi-pair + forward-walk liveness through
  conditional branches; (2) `elideCallResultSaveSPReload` drops
  STA/LDA $E0 round-trip in ADJCALLSTACKUP's Y-live i64-return
  path; (3) `elideDeadStaCarry` drops first STA in i32-carry
  STA/ADCE/STA pattern; (4) `elideRedundantLdaAfterPha`; (4b)
  `elidePlaPhaPair` collapses consecutive PLA;PHA; (5)
  `elideStoreForwarding` (gated to bail path + end-of-pass to
  avoid IMG-slot reallocation cascade).  Phase 4 extended to walk
  past STX_DP/STY_DP between TYA and STA_DP with safety check
  (post-STA op must redefine A) and to handle STA_StackRel
  destination with offset compensation.  Result: evalAt 498→472
  inst (1.96×→1.86× vs Calypsi), fib -35% cyc/iter (149→97),
  popcount -11% (104→93), 35 libc functions get TAY/TYA bracket
  elided.  Case (b) hoists the body's first STA before the
  ImgCalleeSave bracket, which lets the existing phase 4 remove
  PEI's TAY/TYA round-trip.

- **`__muldi3` 32-bit short-circuit** (2026-05-20).  When `a`'s high
  32 bits ($E4/$E6) are zero, use a 32-iter shift-and-add loop
  instead of 64 iters.  Fires on every `mulhi64Aligned` call from
  softDouble.c (4× per `__muldf3`), which always passes
  zero-extended u32 operands.  Result: **dmul 1605→1033 cyc/iter
  (-36%)**.  Checking only `a` is sufficient: the loop adds a
  shifted copy of `b` for each set bit of `a`, so when bits 32-63
  of `a` are zero, iters 32-63 would only shift `b` without adding,
  whatever `b`'s high half holds.
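
A minimal C sketch of the short-circuit, assuming the loop scans `a` bit by bit and adds a shifted copy of `b`.  The function name `mul64_short_circuit` is hypothetical and the real `__muldi3` is not reproduced here; the point is only that the iteration count depends on `a` alone.

```c
#include <stdint.h>

/* Hypothetical illustration of the 32-bit short-circuit in a 64-bit
 * shift-and-add multiply.  Not the real __muldi3. */
uint64_t mul64_short_circuit(uint64_t a, uint64_t b)
{
    /* If a's high 32 bits are zero, bits 32..63 of a can never
     * trigger an add, so half the iterations are skipped. */
    int iters = ((a >> 32) == 0) ? 32 : 64;
    uint64_t result = 0;

    for (int i = 0; i < iters; i++) {
        if (a & 1u)
            result += b;   /* add the shifted copy of b */
        a >>= 1;           /* consume one bit of a */
        b <<= 1;           /* b keeps shifting regardless of its high half */
    }
    return result;         /* product modulo 2^64 */
}
```

The check is deliberately one-sided: a non-zero high half in `b` changes the values being added, not the number of iterations needed.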

- **Lua 5.1.5 compiles cleanly** (2026-05-20).  Reference C
  implementation (17K lines, 24 source files) builds + links into a
  multi-segment binary.  Loads in MAME.  Lives under `tests/lua/`.
  Three large functions (luaV_execute, symbexec, auxsort) hit
  greedy regalloc's complexity budget and need `-mllvm
  -regalloc=basic` (still at -O2 — basic-regalloc -O2 is ~3.5×
  smaller than fast-regalloc -O0).  Largest "real-world C" test
  in the project.

**Open limitations:**

- **Multi-bank BSS** — full support up to 4 banks (256KB).  link816
  splits BSS into up to 4 contiguous segments at link time; each
  segment fits within a single bank.  Linker emits
  `__bss_seg{0..3}_lo16 / _bank / _size` symbols.  crt0 walks the
  table, setting DBR per segment.  Per-segment size is capped at
  0xFF00: a full-bank size of 0x10000 truncates to 0 in 16 bits, so
  the `cpx #__bss_segN_size` loop comparison would wrap.  A single
  full bank is therefore split into a 0xFF00-byte primary plus a
  0x100-byte tail in the same bank.  Smoke validates that BSS
  spanning bank 3 + bank 4 (100KB) is zeroed end-to-end.  Note:
  program access to non-DBR
  bank globals still requires DBR management — the compiler emits
  DBR-relative absolute for global accesses, so accessing BSS in
  bank N needs the program to set DBR=N or use `sta long` via
  inline asm.
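
The splitting rule can be sketched in C as follows.  This is an illustrative assumption, not link816's code: `BssSeg`, `split_bss`, and `SEG_CAP` are hypothetical names, and the real linker may choose segment boundaries differently.  The sketch only demonstrates the two constraints from the text, namely that a segment never crosses a bank boundary and never exceeds 0xFF00 bytes.

```c
#include <stdint.h>

#define SEG_CAP 0xFF00u   /* per-segment size cap described in the text */

/* Hypothetical record mirroring the _lo16 / _bank / _size symbols. */
typedef struct {
    uint16_t lo16;   /* start offset within the bank */
    uint8_t  bank;   /* bank byte of the segment */
    uint16_t size;   /* byte count, always fits in 16 bits */
} BssSeg;

/* Split [start, start + size) into bank-local segments.
 * Returns the number of segments written, or -1 if more than
 * max_segs would be needed. */
int split_bss(uint32_t start, uint32_t size, BssSeg *out, int max_segs)
{
    int n = 0;

    while (size > 0) {
        uint32_t room  = 0x10000u - (start & 0xFFFFu); /* bytes left in bank */
        uint32_t chunk = size < room ? size : room;

        if (chunk > SEG_CAP)
            chunk = SEG_CAP;       /* avoids a 0x10000 size wrapping to 0 */
        if (n == max_segs)
            return -1;

        out[n].lo16 = (uint16_t)(start & 0xFFFFu);
        out[n].bank = (uint8_t)(start >> 16);
        out[n].size = (uint16_t)chunk;
        n++;

        start += chunk;
        size  -= chunk;
    }
    return n;
}
```

Under these assumptions a full bank starting at `0x030000` yields two segments in bank 3: 0xFF00 bytes at offset 0x0000 and 0x100 bytes at offset 0xFF00.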

- **C++ exceptions absent from CI smoke.**  The SJLJ runtime
  round-trip is in smoke; the full clang++ → backend → MAME
  execution path runs reliably interactively but is excluded
  from automated smoke due to MAME-side I/O flakiness.

- **GS/OS validation uses a stub dispatcher.**  The wrapper
  contract (PHA + PEA 0 + LDX + JSL $E100A8 + post-call SP
  fixup) is verified end-to-end in MAME against a stub
  (`scripts/runInMameWithGsosStub.sh`).  Validation against a
  real bootable GS/OS volume is left out of CI because it needs a
  SmartPort hard-disk image and live Tool Locator init.

- **VLAs work end-to-end** (2026-05-09).  Backend Custom-lowers
  `ISD::DYNAMIC_STACKALLOC` for both i16 and i32 result types.
  Loop patterns now produce correct results: `sum_n(3)→6`
  verified in MAME smoke.  Fix: in VLA functions PEI expands
  STAfi/STA8fi/STAfi_indY to a 4-MC sequence ending in `LDY $F8`
  which clobbers N/Z; the StackSlotCleanup PHP/PLP wrap pass
  treats those pseudos as flag-corrupting so PLP wraps the entire
  expansion.  `expandFarFI` uses `STY $F8`/`LDY $F8` to a DP
  scratch slot rather than PHY/PLY (PHY/PLY between PHP/PLP would
  pollute the saved P).
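
For orientation, a VLA loop of the kind described might look like the sketch below.  Only the name `sum_n` and the result `sum_n(3)→6` come from the text; the body is an assumed reconstruction, not the project's smoke-test source.

```c
/* Assumed shape of a sum_n-style VLA test: fill a run-time-sized
 * array with 1..n, then sum it. */
int sum_n(int n)
{
    if (n <= 0)
        return 0;               /* a zero-length VLA is undefined behaviour */

    int buf[n];                 /* VLA: lowered via DYNAMIC_STACKALLOC */

    for (int i = 0; i < n; i++)
        buf[i] = i + 1;         /* stores into the dynamically sized frame */

    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += buf[i];          /* loads back in a second loop */

    return sum;
}
```

Both loops compare and branch around frame accesses, which is the pattern where a flag-clobbering store expansion would produce wrong results.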

- **dpack and dclass now both inline** (2026-05-10).  dpack uses
  a volatile-output array rewrite to defeat the backend stack-slot
  coalesce bug that previously caused dadd(1.5, 2.5) →
  0x4010_4010_0000_0000.  dclass's pointer-arg stores lower to
  STBptr/STAptr (indirect-long, DBR-independent) and inline
  cleanly.  All softDouble routines compile at -O2.
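
The corrupted `dadd` result is easier to read next to the expected bit pattern.  The host-side helper below is a hypothetical checking aid (the name `double_bits` is an assumption, and it presumes the host `double` is IEEE-754 binary64); it is not part of softDouble.c.

```c
#include <stdint.h>
#include <string.h>

/* Return the raw bit pattern of a double.  Assumes the host uses
 * IEEE-754 binary64, which is what the soft-double code targets. */
uint64_t double_bits(double d)
{
    uint64_t u;
    memcpy(&u, &d, sizeof u);   /* type-pun without aliasing violations */
    return u;
}
```

On such a host, `double_bits(1.5 + 2.5)` is `0x4010000000000000` (4.0), so the buggy `0x4010_4010_0000_0000` shows the top 16-bit word duplicated into the next word.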

- **IMG8..IMG15 callee-save via W65816ImgCalleeSave** (2026-05-13).
  New post-RA, pre-PEI pass detects use of IMG8..IMG15 ($C0..$CE)
  in a function and emits prologue save + epilogue restore so those
  slots behave as callee-saved AT THE ASM LEVEL — without going
  through LLVM's CSR mechanism (which would shift regalloc decisions
  and break unrelated tests).  Save shape per used slot: `PHA; LDA
  $C?; STAfi A,slot,2; PLA`; restore mirrors it.  The `+2` ImmOffset
  compensates for PHA's SP shift so the lowered `sta d,s` lands on
  the same byte that subsequent normal-SP reads see.  Cost: ~16
  cycles + 6 bytes per used slot, applied only to functions that
  actually use those slots (most don't).  Fixed picol `expr 1+2`
  evaluating to `4` (now `3`) and a class of recursive double-fn
  miscompiles with compound `||` conditions; see
  `feedback_picol_expr_compound_or.md`.  Smoke is green, including a
  new orBug regression test guarding the fix.