# llvm816: Current Status

LLVM/Clang backend for the WDC 65816 (Apple IIgs), forked from
llvm-mos as a separate `W65816` target.

## What works

End-to-end C-to-binary toolchain that produces 65816 machine code
which runs correctly under MAME (apple2gs).

**Language coverage at -O2 (no extra flags):**

- All scalar arithmetic: i8 / i16 / i32 / i64 add, sub, mul, div, mod
  (signed and unsigned). Carry-chained multi-word ops via ADC/SBC
  pseudos + ASLA16 / shift libcalls.
- Comparisons and signed/unsigned widening (sext, zext, trunc) for
  all the above sizes. Signed compare near INT_MIN handled via
  EOR-with-sign-bit transform.
- Pointer arithmetic, array indexing, struct field access, struct
  return-by-value (up to 8 bytes: Pair, Vec4, double).
- Pointer dereference (`*p`) lowers via `LDAptr / STAptr / STBptr`
  to `[$E0],Y` indirect-LONG with the bank byte at `$E2` forced to 0.
  This is DBR-independent, so `pha;plb` bank-switched callers don't
  corrupt data through callee local-pointer writes. Const-int
  pointers (`*(volatile uint16 *)0x5000 = v` MMIO idiom) lower to
  `STAabs` (DBR-relative) so bank-2 writes still work.
- Bitfields, switch statements (verified up to ~12 cases + default),
  function pointers, function-pointer tables, indirect calls via
  `__jsl_indir` trampoline.
- Recursion: factorial, Fibonacci, depth-3 binary-tree
  insert/sum/min/max, simple recursive quicksort.
- Loops with goto / break / continue, nested loops, state machines.
- `<stdarg.h>` varargs with int / long / unsigned long long mixed
  args.
- Heap: `malloc` / `free` (libc.c first-fit allocator); linked-list
  reverse with `cons` works; free-list coalesce verified.
- Strings: hand-rolled `strlen`, `strcmp`, `strcpy`, `strchr`,
  atoi/itoa roundtrip.
- Soft-float (single): all four ops + comparisons, MAME-verified.
- Soft-double: add, sub, mul, div all return correct bit patterns
  bit-for-bit against gcc with round-to-nearest-even rounding; 3-iter
  Newton sqrt converges. Compiles at -O2 throughout.
  Long-running iterations may hit MAME's 1-second sim-time budget
  (test config issue, not a compiler bug).
- Inline assembly with `"a"`, `"x"`, `"y"` register constraints and
  arbitrary opcode bytes (used for the `pha;plb` bank-switch idiom).
- C++ minimal: clang++ compiles a class with virtual + non-trivial
  ctor (vtable + RTTI omitted; no exceptions).
- printf with `%d %x %s %c %p` and width/precision specifiers.
- sprintf / snprintf / vsprintf / vsnprintf with the same format
  coverage as printf (`%d %u %x %ld %lu %s %c %f %p %%` + width).
  C99 truncation semantics for snprintf. `%.Nf` produces the correct
  fractional digits with round-half-up.
- qsort + bsearch over arbitrary element size with a user `cmp`
  callback.
- Standard string/stdlib glue: strcat, strncat, strpbrk, strspn,
  strcspn, atol, llabs (kept in their own translation unit so
  vprintf's branch layout doesn't shift).
- `<math.h>`: fabs, floor, ceil, fmod, copysign, sqrt, pow, sin, cos,
  tan, exp, log, atan, atan2, asin, acos, sinh, cosh, tanh (and float
  variants). Bit-twiddling for fabs/floor/ceil/copysign; Newton
  iteration for sqrt; range-reduction + Taylor for
  sin/cos/exp/log/atan; identities for
  asin/acos/atan2/sinh/cosh/tanh. Accuracy is in the ~1e-6 range:
  good enough for typical numeric work, far short of glibc-quality.
  These are slow (each call is dozens to hundreds of soft-double
  libcalls), so pre-compute or cache when possible.
- `setjmp` / `longjmp` from libgcc.s.
- Static constructors via crt0's init_array walk.
- `<stdio.h>` file I/O against an in-memory FS:
  `mfsRegister(path, buf, size, cap, writable)` stages a buffer as a
  named file;
  `fopen`/`fread`/`fwrite`/`fseek`/`ftell`/`fclose`/`fgetc`/`fgets`/`ungetc`/`fprintf`
  operate on it via a per-FILE (kind, buf, size, cap, pos, eof, err,
  unget) record. stdin/stdout/stderr route through `putchar` as
  before.
- `<wchar.h>`: wcslen / wcscmp / wcsncmp / wcscpy / wcsncpy / wcscat /
  wcschr / wcsrchr; mbtowc / wctomb / mbstowcs / wcstombs / mblen
  with the trivial 1:1 byte<->wide mapping (Latin-1). wchar_t is
  16-bit on this target.
- `<signal.h>`: in-process signal table. signal() registers a
  handler; raise() invokes it. Default actions: SIGABRT calls
  abort(), SIGINT/SIGTERM call exit(128+sig), others ignored.
- `<locale.h>`: setlocale always returns "C"; localeconv returns a
  fixed C-locale lconv struct.
- C++ subset: classes, single inheritance, multiple inheritance
  (Drawable+Movable through one Sprite), virtual base diamond (A and
  B virtually derive Base; Diamond inherits from both with one shared
  Base subobject), virtual functions, polymorphism via base-class
  pointer arrays, virtual dtors, this-pointer adjustment for
  non-leftmost bases, vbase offset tables. RTTI / `dynamic_cast`
  works (downcast, MI cross-cast, virtual-base sibling cast) via a
  minimal libcxxabi shim (`runtime/src/libcxxabi.c`) that provides
  `__dynamic_cast` + the three typeinfo class vtables
  (`__class_type_info`, `__si_class_type_info`,
  `__vmi_class_type_info`) + sized `operator delete` +
  `__cxa_pure_virtual`. Compile with `clang++ -fno-exceptions` (RTTI
  can stay on; exceptions remain out of scope, see "Yet to come").

**Toolchain:**

- `clang` / `llc` produce W65816 assembly + ELF object files.
- `tools/link816` resolves cross-translation-unit refs, lays out
  text/rodata/bss, emits a flat binary the IIgs ROM can load.
  Auto-relocates bss above text+rodata when the default
  `--bss-base 0x2000` would overlap text, and skips past the IIgs IO
  window ($C000-$CFFF) if needed. `--gc-sections` (default ON) drops
  unreachable functions: a minimal program with full runtime linked
  shrinks from ~43KB to ~1.5KB.
- `tools/omfEmit` produces OMF v2.1 single-segment files (the IIgs's
  native object format) for round-tripping with classic dev tools.
- `link816 --debug-out FILE` writes a DWARF sidecar with
  text/rodata/bss/init_array relocations applied to every `.debug_*`
  section, so `.debug_addr` / `.debug_line` PC values are final-image
  addresses.
- `runtime/build.sh` builds crt0, libc, soft-float, soft-double,
  libgcc into linkable objects.
- `scripts/smokeTest.sh` runs 122 end-to-end checks at -O2: scalar
  ops, control flow, calling conventions, MAME execution regressions,
  link816 bss-base safety + weak-symbol resolution +
  heap_end-vs-heap_start sanity, iigs/toolbox.h compile + link,
  iigs/gsos.h compile + link, standalone runtime headers, AsmPrinter
  peepholes (STZ / PEA / PEI: single-STA, shared-LDA-multi-STA,
  DPF0-forwarding), malloc/free coalesce ordering, plus real-world
  coverage: Conway's Game of Life blinker (2D loop + neighbour
  bounds), binary search tree (recursive struct + malloc),
  function-pointer dispatch table (indirect JSL via `__jsl_indir`),
  memory-backed file I/O (mfsRegister +
  fopen/fread/fwrite/fseek/fprintf), C++ polymorphism (single
  inheritance), C++ multiple inheritance (Drawable+Movable), C++
  virtual base diamond, C++ dynamic_cast (SI + MI cross-cast +
  virtual-base sibling cast through libcxxabi shim), GS/OS wrapper
  round-trip via stub dispatcher pre-loaded at $E100A8 (validates
  PHA + PEA 0 + JSL + post-call SP-fixup contract end-to-end),
  wchar / signal core APIs, hex dumper writing through fprintf, JSON
  tokenizer state machine, hash-table command shell (parser +
  dispatch + chained collisions over fprintf-to-mfs),
  scripts/bench.sh size-vs-Calypsi harness. 100% pass.
- `scripts/bench.sh` compiles a microbenchmark suite with both clang
  (this toolchain) and Calypsi cc65816, comparing emitted
  text-section size.
  Current ratio: ~1.9x (down from 2.2x once the W65816 target started
  overriding `replexitval` to "never" by default in
  `LLVMInitializeW65816Target`; SCEV's closed-form rewrite was
  promoting i16 induction expressions to i64 and hitting `__muldi3`,
  which on a 16-bit target is dramatically bigger than the loop it
  replaces). sumOfSquares went 335B → 128B, a 2.6x shrink with no
  other benchmark affected. Eight benchmarks shipped under
  `benchmarks/`. Remaining gap is structural: Calypsi uses `(sr,s),Y`
  for stack-relative pointer indirection where we route through DP
  $E0 indirect-long for bank safety.

**Backend register allocation:**

- Basic regalloc as default at -O1+; fast at -O0/optnone. We use
  basic instead of greedy because greedy fails ("ran out of registers
  during register allocation") on functions with many cross-call
  Acc16 vregs (the `ok |= bit; helper(); ok |= bit;` pattern across
  many if-blocks). Basic handles those cleanly with negligible
  code-size overhead vs greedy on the bench suite (~0.6%).
- Pre-RA passes: `WidenAcc16` (Acc16→Wide16 promotion, lets greedy
  spread i16 pressure across A and 16 IMG slots); `TiedDefSpill`
  (handles tied-def-multi-use hazard); `ABridgeViaX` (bridges via X/Y
  when free).
- Post-RA passes: `SpillToX` (STA/LDA pairs → TAX/TXA bridges when X
  dead); `StackSlotCleanup` (deletes redundant adjacent spills);
  `NegYIndY` (rewrites negative-Y indirect-Y stack-rel ops to avoid
  the 24-bit-add bank-cross).
- Pre-emit: `BranchExpand` (long Bxx → INV_Bxx skip; BRA target);
  `SepRepCleanup` (coalesces adjacent SEP/REP toggles, plus a
  cross-mode-neutral coalesce that drops REP/SEP pairs sandwiching
  X-flag-only ops, branches, transfers; saves 4B / 12cyc per
  collapse). AsmPrinter LDAi8imm peephole walks past mode-neutral MIs
  to fuse the closing REP into a following SEP.
- Imaginary registers IMG0..IMG15 backed by DP $C0..$CE + $D0..$DE,
  giving greedy 17 effective i16 carriers (A + 16 IMG) before stack
  spills kick in.
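
The cross-call pattern named in the register-allocation notes above can be sketched in plain C. This is only an illustration of the source shape, not code from the test suite: `runChecks`, `helper`, and `gCalls` are hypothetical names, and the comment about A is an inference from the ABI notes in this document (arg0 and return value both live in A).

```c
#include <stdint.h>

/* Hypothetical stand-in for any out-of-line call. In a real program it
   would live in another translation unit so it cannot be inlined. */
static uint16_t gCalls = 0;
static void helper(void) { gCalls++; }

/* 'ok' is updated before and after every helper() call, so its value
   is live across each call. Assumption: since A carries arguments and
   return values, 'ok' cannot simply stay in A and needs another home
   (X/Y, an IMG slot, or a stack slot) for the duration of the call. */
uint16_t runChecks(uint16_t a, uint16_t b, uint16_t c) {
    uint16_t ok = 0;
    if (a == 1) { ok |= 0x0001; helper(); ok |= 0x0002; }
    if (b == 2) { ok |= 0x0004; helper(); ok |= 0x0008; }
    if (c == 3) { ok |= 0x0010; helper(); ok |= 0x0020; }
    return ok;
}
```

Each additional if-block adds another live range that spans a call, which is the shape the notes above describe as exhausting greedy.
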

**ABI:**

- arg0 in A; arg1 in X for i32-first-arg signatures; rest pushed RTL
  on the system stack with PHA. Caller deallocates via
  `tsc;clc;adc #N;tcs` or `PLY*N/2`.
- Return: i8/i16 in A; i32 in A:X; i64 in A:X:Y plus DP[$F0..$F1] for
  the highest 16 bits.
- Frame is empty-descending (S points to next-free); offsets account
  for the +1 skew vs LLVM's full-descending model.

**IIgs toolbox:**

- `iigs/toolbox.h`: autogenerated wrappers for all ~1300 IIgs toolbox
  routines across 35 tool sets (Tool Locator, Memory Manager, Misc
  Tools, QuickDraw II / Aux, Event Manager, Sound Manager, Apple
  Desktop Bus, SANE, Integer Math, Text Tools, Window Manager, Menu
  Manager, Control Manager, LineEdit, Dialog Manager, Scrap Manager,
  Standard File, Note Synth/Sequencer, Font Manager, List Manager,
  ACE, Resource Manager, MIDI, Video Overlay, TextEdit, Media
  Control, Print Manager, Scheduler, Desk Manager, …). Names match
  Apple's IIgs Toolbox Reference exactly (TLStartUp, MMStartUp,
  NewWindow, SysBeep, …). 417 simple wrappers (zero/single-arg,
  i16-or-void return) inline in the header; 890 multi-arg ones live
  in `runtime/src/iigsToolbox.s`. Generated by
  `scripts/genToolbox.py` from ORCA-C's `ORCACDefs/` (re-runnable
  when ORCA-C updates).

## In flight

(Nothing currently. The four previous in-flight items all landed:
basic-regalloc-by-default replaced greedy and resolved the
long-arg-chain failure; `time()` reads ReadTimeHex when the program
has called `iigsToolboxInit()` and `clock()` reads the VBL counter
via 24-bit absolute load; the (sr,s),Y bank-wrap addressing is no
longer emitted by any inserter and the `W65816NegYIndY` workaround is
disabled; LC ceiling extended from $E000 to $10000 since crt0's
`lda $C083` read-twice enables RAM through $FFFF, gaining 8KB of
bank-0 space.)

## Yet to come

- **C++ exceptions**: `dynamic_cast` works (via libcxxabi shim, see
  "What works"); `throw`/`try`/`catch` does not.
  Implementing exceptions needs the full Itanium unwind ABI:
  `__cxa_throw`, `__cxa_allocate_exception`,
  `_Unwind_RaiseException`, a personality routine, and DWARF
  `.eh_frame` data the unwinder consumes to restore registers
  per-frame. The 65816's lack of any existing unwinder makes this a
  real project; defer until someone needs exception-based code on the
  IIgs.
- **GS/OS validated against a real ProDOS volume**: the wrapper
  contract (PHA + PEA 0 + LDX + JSL $E100A8 + post-call SP fixup) is
  verified end-to-end in MAME against a stub dispatcher
  (`scripts/runInMameWithGsosStub.sh`). Validating against an actual
  GS/OS-loaded volume needs a bootable system disk image attached as
  a MAME smartport hard disk and Tool Locator init, which is out of
  scope for an automated CI smoke test.
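
Since `throw`/`try`/`catch` is unavailable while `setjmp` / `longjmp` is listed under "What works", code that needs a non-local error exit could fall back on the latter. The following is a minimal sketch in standard C, not something taken from the llvm816 runtime or test suite: `gErrorJmp`, `failWith`, `parseDigit`, and `parseDigitGuarded` are all hypothetical names.

```c
#include <setjmp.h>

/* Hypothetical names throughout; not part of the llvm816 runtime. */
static jmp_buf gErrorJmp;

/* Stand-in for "throw": unwinds to the matching setjmp with a
   nonzero error code. */
static void failWith(int code) {
    longjmp(gErrorJmp, code);
}

/* Returns the value of a decimal digit, or bails out with code 1. */
static int parseDigit(char c) {
    if (c < '0' || c > '9') {
        failWith(1);
    }
    return c - '0';
}

/* Stand-in for "try/catch": returns the digit on success, -1 for the
   known error code, -2 for any other code. */
int parseDigitGuarded(char c) {
    switch (setjmp(gErrorJmp)) {
    case 0:
        return parseDigit(c);   /* "try" path */
    case 1:
        return -1;              /* "catch" path for code 1 */
    default:
        return -2;              /* any other error code */
    }
}
```

Unlike real exceptions, `longjmp` does not run C++ destructors for the frames it skips, so this approach only suits code whose intermediate frames hold no objects needing cleanup.
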