# llvm816: Current Status

LLVM/Clang backend for the WDC 65816 (Apple IIgs), forked from llvm-mos as a separate `W65816` target.

## What works

End-to-end C-to-binary toolchain that produces 65816 machine code which runs correctly under MAME (apple2gs).

**Language coverage at -O2 (no extra flags):**

- All scalar arithmetic: i8 / i16 / i32 / i64 add, sub, mul, div, mod (signed and unsigned). Carry-chained multi-word ops via ADC/SBC pseudos + ASLA16 / shift libcalls.
- Comparisons and width conversions (sext, zext, trunc) for all the above sizes.
- Pointer arithmetic, array indexing, struct field access, struct return-by-value (up to 8 bytes: Pair, Vec4, double).
- Bitfields, switch statements (verified up to ~12 cases + default), function pointers, function-pointer tables, indirect calls via the `__jsl_indir` trampoline.
- Recursion: factorial, Fibonacci, depth-3 binary-tree insert/sum/min/max, simple recursive quicksort.
- Loops with goto / break / continue, nested loops, state machines.
- `<stdarg.h>` varargs with int / long / unsigned long long mixed args.
- Heap: `malloc` / `free` (libc.c first-fit allocator); linked-list reverse with `cons` works.
- Strings: hand-rolled `strlen`, `strcmp`, `strcpy`, `strchr`, atoi/itoa roundtrip.
- Soft-float (single): all four ops + comparisons, MAME-verified.
- Soft-double: add, sub, mul, div all match gcc bit-for-bit under round-to-nearest-even; 3-iter Newton sqrt converges. Long-running iterations may hit MAME's 1-second sim-time budget (test config issue, not a compiler bug).
- Inline assembly with `"a"`, `"x"`, `"y"` register constraints and arbitrary opcode bytes (used for the `pha;plb` bank-switch idiom).
- C++ minimal: clang++ compiles a class with virtual + non-trivial ctor (vtable + RTTI omitted; no exceptions).
- printf with `%d %x %s %c %p` and width/precision specifiers.
- sprintf / snprintf / vsprintf / vsnprintf with the same format coverage as printf (`%d %u %x %ld %lu %s %c %f %p %%` + width). C99 truncation semantics for snprintf. `%.Nf` produces the correct fractional digits with round-half-up.
- qsort + bsearch over arbitrary element size with a user `cmp` callback (insertion-sort variant, which sidesteps the greedy regalloc bug in the recursive iterative-qsort form).
- Standard string/stdlib glue: strcat, strncat, strpbrk, strspn, strcspn, atol, llabs (kept in their own translation unit so vprintf's branch layout doesn't shift).
- `<math.h>`: fabs, floor, ceil, fmod, copysign, sqrt, pow, sin, cos, exp, log, atan, atan2, asin, acos, sinh, cosh, tanh (and float variants). Bit-twiddling for fabs/floor/ceil/copysign; Newton iteration for sqrt; range-reduction + Taylor for sin/cos/exp/log/atan; identities for asin/acos/atan2/sinh/cosh/tanh. Accuracy is around 1e-6: good enough for typical numeric work, far short of glibc quality. These are slow (each call is dozens to hundreds of soft-double libcalls), so pre-compute or cache results when possible.
- `setjmp` / `longjmp` from libgcc.s.
- Static constructors via crt0's init_array walk.

**Toolchain:**

- `clang` / `llc` produce W65816 assembly + ELF object files.
- `tools/link816` resolves cross-translation-unit refs, lays out text/rodata/bss, and emits a flat binary the IIgs ROM can load. It auto-relocates bss above text+rodata when the default `--bss-base 0x2000` would overlap text, and skips past the IIgs IO window ($C000-$CFFF) if needed.
- `tools/omfEmit` produces OMF v2.1 single-segment files (the IIgs's native object format) for round-tripping with classic dev tools.
- `runtime/build.sh` builds crt0, libc, soft-float, soft-double, libgcc into linkable objects.
- `scripts/smokeTest.sh` runs 92 end-to-end checks (scalar ops, control flow, calling conventions, MAME execution, regressions, link816 bss-base safety, iigs/toolbox.h compile-check). All of them currently pass at -O2.
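The bss placement rule that `tools/link816` applies can be illustrated with a small host-side sketch. This is not the linker's actual code: the function name `pickBssBase`, its parameters, and the half-open range convention are assumptions made for illustration. Only the `0x2000` default and the $C000-$CFFF window come from the description above.

```c
#include <assert.h>
#include <stdint.h>

/* Only the 0x2000 default and the $C000-$CFFF window come from the text;
   every name in this sketch is hypothetical. */
#define DEFAULT_BSS_BASE 0x2000u
#define IO_WINDOW_START  0xC000u
#define IO_WINDOW_END    0xD000u /* one past $CFFF */

/* Returns a bss base that avoids text+rodata, given as the half-open
   range [imageStart, imageEnd), and that avoids the IO window.
   Simplification: an image that itself reaches into the IO window is
   not diagnosed here. */
static uint32_t pickBssBase(uint32_t imageStart, uint32_t imageEnd,
                            uint32_t bssSize)
{
    assert(imageStart <= imageEnd);
    uint32_t base = DEFAULT_BSS_BASE;

    /* The default base overlaps text+rodata: move bss directly above it. */
    if (base < imageEnd && base + bssSize > imageStart) {
        base = imageEnd;
    }
    /* The bss range would touch the IO window: restart just past it. */
    if (base < IO_WINDOW_END && base + bssSize > IO_WINDOW_START) {
        base = IO_WINDOW_END;
    }
    return base;
}
```

With an image occupying `[0x1000, 0xBF80)` and 0x100 bytes of bss, the sketch first moves bss to `0xBF80`, sees that the range would reach into $C000, and settles on `0xD000`.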
**ABI:**

- arg0 in A; arg1 in X for i32-first-arg signatures; rest pushed right to left on the system stack with PHA. Caller deallocates via `tsc;clc;adc #N;tcs` or `PLY*N/2`.
- Return: i8/i16 in A; i32 in A:X; i64 in A:X:Y plus DP[$F0..$F1] for the highest 16 bits.
- Frame is empty-descending (S points to next-free); offsets account for the +1 skew vs LLVM's full-descending model.

## In flight

Two open bugs tracked:

1. **#107: strtok / qsort -O1+ miscompile. RESOLVED.** Three independent issues across the backend, runtime, and linker; all fixed.

   **Fix 1 (W65816StackSlotCleanup cross-MBB):** Pass -4 / Pass -4c collapsed `LDA fs.X; STA stk.Y; ... LDA_indY stk.Y` patterns with only an MBB-local safety check, missing cross-MBB readers of stk.Y. Greedy regalloc had spilled an in-place INA result back to stk.Y; eliminating the bb.3 init store left the bb.10 reload reading garbage. Function-wide cross-MBB check added.

   **Fix 2 (W65816SepRepCleanup LDAi8imm hoist):** Pre-pass that relocates LDAi8imm BEFORE byte-store SEP/REP wraps. LDAi8imm expands at AsmPrinter to its own SEP+LDA8+REP that toggles M; the post-RA scheduler was moving it INSIDE an STBptr wrap, so the LDAi8imm's REP fired BEFORE the byte STA. The STA then ran in M=16, writing 2 bytes of zero and clobbering the next byte. The hoist puts the toggle in the outer M=16 zone, leaving the byte STA in M=8.

   **Fix 3 (link816 bss-base safety + strtok_r noinline):** With the backend fixes, -O2 strtok grew large enough that the strtok() wrapper inlining (~290 extra bytes) pushed the binary's text+rodata past 0xC000 (IIgs IO window). Reads of string literals or stdio handles in that range hit IO registers and corrupted execution. Two complementary fixes: `__attribute__((noinline))` on `strtok_r` so the wrapper doesn't duplicate it (-O2 strtok.o now 1564B, was 2156B); link816 auto-relocates bss above text+rodata when the default `--bss-base 0x2000` would overlap, and skips past the IO window if needed.
strtok.c now compiles at -O2 with everything else. Smoke #84 (4-call strtok continuation) and #92 (recursive parser) both pass. Workaround comments in build.sh / smokeTest.sh removed. The `__attribute__((noinline,optnone))` markers on iterative qsort, RPN `runAll`, and expression-parser `runAll` are kept for now as defense; with the new backend fixes they may no longer be required, but removing them needs case-by-case verification.

The W65816 backend assembler now supports all common indirect addressing modes (`(dp)`, `(dp),Y`, `(dp,X)`, `(d,s),Y`, `[dp]`, `[dp],Y`, and `JMP (abs)`). All `.byte` opcode hacks in the runtime have been removed in favour of the mnemonics. The disassembler decodes them too.

Runtime now exposes a nearly complete C99 subset: sprintf/snprintf with correct %.Nf precision, qsort/bsearch, the full string.h family (strcat/strncat/strpbrk/strspn/strcspn/strtok/strtok_r), math.h with the thirteen common transcendentals (sqrt/pow/sin/cos/exp/log/atan/atan2/asin/acos/sinh/cosh/tanh), atol/llabs/atexit/exit/abort, and a smoke test that exercises malloc + struct pointers + strcmp/strcpy via a working hash table end-to-end in MAME.

`strtok` / `strtok_r` live in their own TU at `-O2` (with `__attribute__((noinline))` on `strtok_r` so the strtok() wrapper doesn't duplicate it). Multi-call strtok over "a,b,,c" works end-to-end in smoke. Latent backend issue: at certain rodata layouts, -O2 strtok_r's BB0_7 inner CMP loop miscompiles due to LICM/sink interaction; the current smoke layout passes, but adding bytes upstream (e.g. growing softDouble.o) can shift delim into a failing address. A surgical workaround, `-mllvm -disable-machine-sink` on strtok.c, is documented; it is not currently applied because smoke is green.

A small **RPN calculator** test (smoke #87) chains strtok, atol, push/pop over a static stack, snprintf "%ld", and strcmp to verify the end-to-end composition under a realistic-ish workload: adds, subs, muls, divs, and 3-deep operand stacks all work.
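The composition that the RPN calculator test exercises looks roughly like the following. This is a minimal sketch, not the source of smoke #87: the names `rpnEval`, `push`, `pop`, and `STACK_MAX` are assumptions, and only the set of library calls (strtok, atol, strcmp, snprintf with `%ld`) is taken from the description above.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define STACK_MAX 16 /* hypothetical depth; the text only needs 3 */

static long stack[STACK_MAX];
static int depth;

static void push(long v) { assert(depth < STACK_MAX); stack[depth++] = v; }
static long pop(void) { assert(depth > 0); return stack[--depth]; }

/* Evaluates a space-separated RPN expression. strtok modifies expr in
   place, so the caller must pass a writable buffer. The decimal result
   is written to out. Division by zero is not handled in this sketch. */
static void rpnEval(char *expr, char *out, size_t outSize)
{
    depth = 0;
    for (char *tok = strtok(expr, " "); tok != NULL; tok = strtok(NULL, " ")) {
        int isAdd = strcmp(tok, "+") == 0;
        int isSub = strcmp(tok, "-") == 0;
        int isMul = strcmp(tok, "*") == 0;
        int isDiv = strcmp(tok, "/") == 0;

        if (isAdd || isSub || isMul || isDiv) {
            long rhs = pop(); /* the right operand was pushed last */
            long lhs = pop();
            if (isAdd)      push(lhs + rhs);
            else if (isSub) push(lhs - rhs);
            else if (isMul) push(lhs * rhs);
            else            push(lhs / rhs);
        } else {
            push(atol(tok)); /* anything else is treated as a number */
        }
    }
    snprintf(out, outSize, "%ld", pop());
}
```

Calling `rpnEval` on a writable copy of `"2 3 4 * +"` reaches a 3-deep operand stack before the multiply and leaves `"14"` in the output buffer.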
**setjmp / longjmp** (smoke #88) now work end-to-end: setjmp saves SP / 24-bit ret addr / DP, longjmp restores them and returns the val argument as setjmp's "second return". Required two fixes: (a) the W65816 assembler had no instruction definition for `(dp)` / `(dp), y` / `(dp, x)` indirect addressing modes, so the mnemonic forms silently fell through to absolute-,Y opcodes; fixed in `src/llvm/lib/Target/W65816/W65816InstrFormats.td` + `W65816InstrInfo.td` + `AsmParser/W65816AsmParser.cpp` (the runtime .byte hacks have been replaced with mnemonics); (b) added `__attribute__((returns_twice))` to the setjmp declaration so the optimizer doesn't constant-fold post-setjmp env reads to 0.

**CRC32** (smoke #89) verifies the standard "123456789" → 0xCBF43926 end-to-end; it exercises uint32_t shifts, XORs, char-by-char loops.

**Brainfuck interpreter** (smoke #90) executes a small bf program and verifies the output bytes; it exercises loop bracket matching, pointer math (data pointer), branching on cell value.

**Recursive-descent expression parser** (smoke #92) evaluates "3+4", "2*3+4", "2+3*4", "(3+4)*5", "100/4-5*2+1" with proper operator precedence and parentheses; it exercises mutual recursion, char-by-char tokenization, and integer arithmetic in concert.

The **DWARF sidecar** (`link816 --debug-out FILE`) now applies text/rodata/bss/init_array relocations to every `.debug_*` section before writing it. PC values in `.debug_addr` and `.debug_line` end up as final-image addresses, so a consumer can map back to source lines without re-running the linker. Intra-debug references (e.g. `.debug_info` -> `.debug_str` offsets) are intentionally left object-local: sections are concatenated, not recompacted, and each slice carries an `; OBJ ... SEC ... SIZE ...` header so a multi-TU consumer can scope intra-debug offsets per-slice. The smoke test verifies the address of a known function appears in the patched sidecar bytes.
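For reference, the CRC32 check from smoke #89 can be reproduced with a bitwise implementation along the following lines. This is a sketch assuming the common reflected CRC-32 (polynomial 0xEDB88320, initial value and final XOR of 0xFFFFFFFF), which is the variant whose check value for "123456789" is 0xCBF43926. The function name `crc32Bitwise` is hypothetical and the smoke test's own code may be table-driven or structured differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise reflected CRC-32: one byte at a time, eight shift/XOR steps
   per byte. Uses only uint32_t shifts and XORs plus a byte loop. */
static uint32_t crc32Bitwise(const unsigned char *data, size_t len)
{
    assert(data != NULL || len == 0);
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 1u) {
                crc = (crc >> 1) ^ 0xEDB88320u;
            } else {
                crc >>= 1;
            }
        }
    }
    return crc ^ 0xFFFFFFFFu;
}
```

On a 16-bit target each `uint32_t` shift and XOR in the inner loop becomes a multi-word operation, which is presumably why this workload is a useful backend check.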
## Known issues / workarounds

- **#70 FIXED**: greedy regalloc + W65816StackSlotCleanup Pass -2 was deleting an entry-side store to a slot that the loop body read. Pass -2 collapses `LDAfi slotA; STAfi slotB; LDAfi slotC; OPfi slotB` into `LDAfi slotC; OPfi slotA` (eliminating a memory-to-memory copy through A), but didn't check whether slotB had other refs in the function. In iterative qsort, slotB happened to be the spill home for `hi`; the Pass -2 transform deleted the only initialiser, leaving the loop body's `lda , s` reading garbage. Fix: function-wide `slotHasOtherRefs` safety check before erasing the spill. `softDouble.c` still uses `-mllvm -regalloc=fast` for `__muldf3`'s 64×64→128 multiply (different greedy bug: register-pressure-driven, not spill-deletion-driven).
- **(d,s),y / (sr,s),y addressing wraps the bank** when Y holds a negative offset, because the CPU adds Y as an unsigned 16-bit value. Worked around by `W65816NegYIndY` rewriting the affected ops to `TAX ; LDA/STA $0000,X`. Stays correct for negative offsets like `arr[i-1]`.
- **(d,s),y for stack-local pointer dereferences uses DBR**, so user code that switches DBR (e.g. `pha;plb` to bank 2 to reach IIgs hardware) must not call into a function that takes the address of one of its locals; the callee's `*p = v` will write to the wrong bank. Documented; no compiler-side mitigation beyond the existing DPF0 fake-physreg routing for the i64-return high half.
- **strtok -O2 layout-sensitive miscompile FIXED**: modelling `Uses=[P]` on the conditional branches (BEQ/BNE/BCS/BCC/BMI/BPL/BVS/BVC) made MachineCSE see the dependency between an earlier CMP and the consuming Bxx, eliminating an entire class of layout-sensitive flag-corruption bugs. Verified by sweeping `--rodata-base` from text-end to text-end+300 in 13 increments; every layout returns the correct strtok result. As a follow-on, MachineCSE has been re-enabled (was previously disabled in `W65816TargetMachine::addMachineSSAOptimization` as a workaround for the same root cause).
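The bank-wrap issue in the `(d,s),y` bullet comes down to address arithmetic, which the following host-side sketch works through. It uses a simplified model of the CPU: the 16-bit pointer fetched from the stack is extended with DBR to 24 bits and Y is added as an unsigned 16-bit value. Both function names are hypothetical and the model ignores everything except the address computation.

```c
#include <stdint.h>

/* Simplified model of a `(d,s),y` access: DBR:ptr forms a 24-bit base
   and Y is added unsigned, so the sum can carry into the next bank. */
static uint32_t indirectIndexedAddress(uint8_t dbr, uint16_t ptr, uint16_t y)
{
    uint32_t base = ((uint32_t)dbr << 16) | ptr;
    return (base + y) & 0xFFFFFFu;
}

/* Model of the rewritten form: ptr + Y is summed in 16 bits first
   (wrapping inside the bank), then used as X in `LDA $0000,X`. */
static uint32_t rewrittenAddress(uint8_t dbr, uint16_t ptr, uint16_t y)
{
    uint16_t sum = (uint16_t)(ptr + y);
    return ((uint32_t)dbr << 16) | sum;
}
```

For a pointer of `0x3000` in bank 0 and an offset of -2 (Y = `0xFFFE`), the first model yields `0x012FFE`, an address in bank 1, while the rewritten form yields the intended `0x002FFE`. For non-negative offsets that stay inside the bank the two agree.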
## What's still needed for a "ship-ready" toolchain

- **softDouble.c -O1 hold-out**: `__muldf3`'s 64×64→128 multiply with inlined alignment shifts overflows the greedy register allocator at -O2 ("ran out of registers during register allocation"). Builds correctly at -O1 (replaces the previous -O2 + -mllvm -regalloc=fast workaround; -O1 is smaller and doesn't require the non-default flag).
- **More of the C standard library**: real `<stdio.h>` file I/O (`fopen`, `fread`, `fwrite`, `fseek` are currently stubs returning success/zero), which would need a memory-backed FS or a MAME hook; `` / `` if any real-world code needs them.
- **C++ runtime support**: vtable layout for multiple inheritance, RTTI, exceptions (or a documented `-fno-exceptions` requirement).
- **REP/SEP scheduling pass** (design doc §3.3): the current prologue picks one M-mode for the whole function based on whether any 8-bit accumulator value is used. A per-region scheduler would reduce the SEP/REP wrap overhead on i8 stores.
- **Toolbox / IIgs system call bindings**: header files declaring the Apple IIgs system calls (`SystemTask`, `WaitMouseUp`, `DrawString`, …) with the right inline-asm dispatch glue.
- **Real-world program coverage**: the smoke tests are microbenchmarks. A few known-good Apple IIgs C programs (e.g. a textfile pager, a small game) compiled and run end-to-end would catch issues no synthetic test currently exercises.
- **Cycle-time / size benchmarks vs Calypsi 5.16**: design doc §1 says the goal is to "match or exceed" Calypsi. We have neither baseline numbers nor a comparison harness yet.
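As one possible shape for the memory-backed file I/O mentioned in the standard-library item, the sketch below models a read-only stream over a byte buffer. Nothing here exists in the runtime today: `MemFile`, `memOpen`, `memRead`, and `memSeekSet` are hypothetical names, and the semantics are deliberately simpler than real `fread` / `fseek` (no partial items, absolute seeks only).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical read-only stream over an in-memory buffer. */
typedef struct {
    const unsigned char *data;
    size_t size;
    size_t pos;
} MemFile;

static void memOpen(MemFile *f, const void *data, size_t size)
{
    assert(f != NULL && (data != NULL || size == 0));
    f->data = (const unsigned char *)data;
    f->size = size;
    f->pos = 0;
}

/* fread-like: copies up to count whole items and returns how many were
   copied. Unlike real fread, a trailing partial item is left unread. */
static size_t memRead(void *dst, size_t itemSize, size_t count, MemFile *f)
{
    if (itemSize == 0 || count == 0) {
        return 0;
    }
    size_t avail = (f->size - f->pos) / itemSize;
    size_t n = count < avail ? count : avail;
    memcpy(dst, f->data + f->pos, n * itemSize);
    f->pos += n * itemSize;
    return n;
}

/* fseek(SEEK_SET)-like: returns 0 on success, -1 if offset is past the end. */
static int memSeekSet(MemFile *f, size_t offset)
{
    if (offset > f->size) {
        return -1;
    }
    f->pos = offset;
    return 0;
}
```

A real implementation would also need write support, the other seek origins, and a mapping from `fopen` path names to buffers; a MAME hook would replace the buffer with host-side access.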