65816-llvm-mos/STATUS.md
Scott Duensing d6c9fc8252 Checkpoint.
2026-04-30 20:48:41 -05:00

llvm816 — Current Status

LLVM/Clang backend for the WDC 65816 (Apple IIgs), forked from llvm-mos as a separate W65816 target.

What works

End-to-end C-to-binary toolchain that produces 65816 machine code which runs correctly under MAME (apple2gs).

Language coverage at -O2 (no extra flags):

  • All scalar arithmetic: i8 / i16 / i32 / i64 add, sub, mul, div, mod (signed and unsigned). Carry-chained multi-word ops via ADC/SBC pseudos + ASLA16 / shift libcalls.
  • Comparisons and signed/unsigned widening (sext, zext, trunc) for all the above sizes.
  • Pointer arithmetic, array indexing, struct field access, struct return-by-value (up to 8 bytes — Pair, Vec4, double).
  • Bitfields, switch statements (verified up to ~12 cases + default), function pointers, function-pointer tables, indirect calls via __jsl_indir trampoline.
  • Recursion: factorial, Fibonacci, depth-3 binary-tree insert/sum/min/max, simple recursive quicksort.
  • Loops with goto / break / continue, nested loops, state machines.
  • <stdarg.h> varargs with int / long / unsigned long long mixed args.
  • Heap: malloc / free (libc.c first-fit allocator) — linked-list reverse with cons works.
  • Strings: hand-rolled strlen, strcmp, strcpy, strchr, atoi/itoa roundtrip.
  • Soft-float (single): all four ops + comparisons, MAME-verified.
  • Soft-double: add, sub, mul, div match gcc bit-for-bit under round-to-nearest-even rounding; 3-iteration Newton sqrt converges. Long-running iterations may hit MAME's 1-second sim-time budget (a test-configuration issue, not a compiler bug).
  • Inline assembly with "a", "x", "y" register constraints and arbitrary opcode bytes (used for the pha;plb bank-switch idiom).
  • C++ minimal: clang++ compiles a class with virtual + non-trivial ctor (vtable + RTTI omitted; no exceptions).
  • printf with %d %x %s %c %p and width/precision specifiers.
  • sprintf / snprintf / vsprintf / vsnprintf with the same format coverage as printf (%d %u %x %ld %lu %s %c %f %p %% + width). C99 truncation semantics for snprintf.
  • qsort + bsearch over arbitrary element size with a user cmp callback (insertion-sort variant, which sidesteps the greedy regalloc bug that affects the iterative quicksort form).
  • Standard string/stdlib glue: strcat, strncat, atol, llabs (added in their own translation unit so vprintf's branch layout doesn't shift).
  • <math.h> basics: fabs, floor, ceil, fmod, copysign (and the float variants). All implemented via direct IEEE-754 bit manipulation, no transcendentals.
  • setjmp / longjmp from libgcc.s.
  • Static constructors via crt0's init_array walk.
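The "direct IEEE-754 bit manipulation" approach mentioned for the <math.h> basics can be illustrated with a minimal sketch of the float variants of fabs and copysign. This is not the project's libc source: the sketch* names and the memcpy-based bit casts are assumptions made for illustration, and the code only assumes that float is IEEE-754 binary32.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>
#include <string.h>   /* memcpy */

/* Assumption: float is IEEE-754 binary32 (1 sign, 8 exponent, 23 mantissa bits). */
static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits wide");

#define SKETCH_F32_SIGN 0x80000000u

/* Reinterpret a float's bits without violating strict aliasing. */
static uint32_t sketchBitsOf(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Inverse of sketchBitsOf. */
static float sketchFloatOf(uint32_t u) {
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* |x|: clear the sign bit; exponent and mantissa are untouched. */
float sketchFabsf(float x) {
    return sketchFloatOf(sketchBitsOf(x) & ~SKETCH_F32_SIGN);
}

/* Magnitude of x combined with the sign bit of y. */
float sketchCopysignf(float x, float y) {
    uint32_t mag  = sketchBitsOf(x) & ~SKETCH_F32_SIGN;
    uint32_t sign = sketchBitsOf(y) &  SKETCH_F32_SIGN;
    return sketchFloatOf(mag | sign);
}
```

Neither function performs floating-point arithmetic, so a sketch like this would not need to call into the soft-float routines at all.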

Toolchain:

  • clang / llc produce W65816 assembly + ELF object files.
  • tools/link816 resolves cross-translation-unit refs, lays out text/rodata/bss, emits a flat binary the IIgs ROM can load.
  • tools/omfEmit produces OMF v2.1 single-segment files (the IIgs's native object format) for round-tripping with classic dev tools.
  • runtime/build.sh builds crt0, libc, soft-float, soft-double, libgcc into linkable objects.
  • scripts/smokeTest.sh runs ~80 end-to-end checks (scalar ops, control flow, calling conventions, MAME execution, regressions). Currently 100% pass.

ABI:

  • arg0 in A; arg1 in X for i32-first-arg signatures; rest pushed RTL on the system stack with PHA. Caller deallocates via tsc;clc;adc #N;tcs or PLY*N/2.
  • Return: i8/i16 in A; i32 in A:X; i64 in A:X:Y plus DP[$F0..$F1] for the highest 16 bits.
  • Frame is empty-descending (S points to next-free); offsets account for the +1 skew vs LLVM's full-descending model.
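As a host-side illustration of the i64 return convention above, the sketch below splits a 64-bit value into four 16-bit slots and joins them again. The struct and function names are hypothetical. The word order (A lowest, then X, then Y) is an assumption; the text above only states that DP[$F0..$F1] carries the highest 16 bits.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>

/* Hypothetical host-side model of the four 16-bit return slots. */
typedef struct {
    uint16_t a;     /* A register: assumed bits 0..15   */
    uint16_t x;     /* X register: assumed bits 16..31  */
    uint16_t y;     /* Y register: assumed bits 32..47  */
    uint16_t dpF0;  /* DP[$F0..$F1]: bits 48..63        */
} SketchRet64;

static_assert(sizeof(SketchRet64) == 8, "four 16-bit slots, no padding");

/* Split a 64-bit return value into the four slots. */
SketchRet64 sketchSplitRet64(uint64_t v) {
    SketchRet64 r;
    r.a    = (uint16_t)(v);
    r.x    = (uint16_t)(v >> 16);
    r.y    = (uint16_t)(v >> 32);
    r.dpF0 = (uint16_t)(v >> 48);
    return r;
}

/* Reassemble the value the way a caller would after the call returns. */
uint64_t sketchJoinRet64(SketchRet64 r) {
    return (uint64_t)r.a
         | ((uint64_t)r.x    << 16)
         | ((uint64_t)r.y    << 32)
         | ((uint64_t)r.dpF0 << 48);
}
```

A model like this can be handy when checking MAME register dumps by hand against an expected 64-bit result.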

In flight

Nothing currently in flight. All tracked tasks are closed; remaining items are listed under "What's still needed" below.

The runtime grew sprintf/snprintf, qsort/bsearch, math.h basics, and the small string/stdlib gaps (strcat, strncat, atol, llabs). The sprintf/snprintf work was the most invasive: it tripped three independent W65816 backend miscompiles (struct-pointer mis-addressing, fmt-as-arg1 loop-local uninit, buf+0xFFFE lowered to dec a) plus a fourth codegen bug in countdown loops. Each workaround is documented in the file banner so that a future cleanup pass doesn't undo it.

The DWARF sidecar (link816 --debug-out FILE) now applies text/rodata/bss/init_array relocations to every .debug_* section before writing it. PC values in .debug_addr and .debug_line end up as final-image addresses, so a consumer can map back to source lines without re-running the linker. Intra-debug references (e.g. .debug_info -> .debug_str offsets) are intentionally left object-local — sections are concatenated, not recompacted, and each slice carries an ; OBJ ... SEC ... SIZE ... header so a multi-TU consumer can scope intra-debug offsets per-slice. The smoke test verifies the address of a known function appears in the patched sidecar bytes.

Known issues / workarounds

  • Greedy register allocator mis-orders spills in iterative quicksort with if/else recursion choice (#70). Live-range tracking for hi is wrong across the inner loop and post-loop swap call, producing miscompiled code. Reproduces only at -O1/-O2 with greedy. Workarounds (any one):

    • __attribute__((noinline,optnone)) on the affected function — routes through fast regalloc per-function. Verified in smoke test; recommended for new code that hits this.
    • -mllvm -regalloc=fast for the whole translation unit. softDouble.c already uses this for __muldf3 (build.sh applies it automatically).
    • Rewrite the loop with explicit recursion guards instead of the iterative tail-elim form.

    Real fix needs deeper greedy work; deferred behind the per-function attribute since it covers the practical cases.

  • (d,s),y / (sr,s),y addressing wraps the bank when Y holds a negative offset, because Y is applied as a 16-bit unsigned value. Worked around by W65816NegYIndY, which rewrites the affected ops to TAX ; LDA/STA $0000,X. This stays correct for negative offsets like arr[i-1].

  • (d,s),y for stack-local pointer dereferences uses DBR, so user code that switches DBR (e.g. pha;plb to bank 2 to reach IIgs hardware) must not call into a function that takes the address of one of its locals — the callee's *p = v will write to the wrong bank. Documented; no compiler-side mitigation beyond the existing DPF0 fake-physreg routing for the i64-return high half.
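The first workaround for the greedy-regalloc issue above (the per-function attribute) can be sketched as follows. This is an illustrative iterative quicksort with an if/else recursion choice, not the project's test case, and it is not confirmed to trigger the bug. The SKETCH_NO_GREEDY macro and all sketch* names are hypothetical; the macro expands to nothing on non-Clang compilers because optnone is a Clang attribute.

```c
#include <assert.h>
#include <stddef.h>

/* Apply the workaround only where the attribute is understood. */
#if defined(__clang__)
#define SKETCH_NO_GREEDY __attribute__((noinline, optnone))
#else
#define SKETCH_NO_GREEDY
#endif

static void sketchSwap(int *a, int *b) {
    int t = *a;
    *a = *b;
    *b = t;
}

/* Lomuto partition over v[lo..hi]; returns the final pivot index. */
static ptrdiff_t sketchPartition(int *v, ptrdiff_t lo, ptrdiff_t hi) {
    int pivot = v[hi];
    ptrdiff_t i = lo;
    for (ptrdiff_t j = lo; j < hi; j++) {
        if (v[j] < pivot) {
            sketchSwap(&v[i], &v[j]);
            i++;
        }
    }
    sketchSwap(&v[i], &v[hi]);
    return i;
}

/* Recurse into the smaller half and loop on the larger one. */
SKETCH_NO_GREEDY
void sketchQuickSort(int *v, ptrdiff_t lo, ptrdiff_t hi) {
    assert(v != NULL);
    while (lo < hi) {
        ptrdiff_t p = sketchPartition(v, lo, hi);
        if (p - lo < hi - p) {
            sketchQuickSort(v, lo, p - 1);
            lo = p + 1;
        } else {
            sketchQuickSort(v, p + 1, hi);
            hi = p - 1;
        }
    }
}
```

Only the function carrying the attribute loses optimization; the helpers and the rest of the translation unit still compile at the requested -O level.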

What's still needed for a "ship-ready" toolchain

  • Greedy regalloc spill-ordering fix — see above. Removes the need for the per-file -regalloc=fast workaround on softDouble.c and unblocks pattern-rich code that currently must be compiled at -O0 for correctness.

  • More of the C standard library: <math.h> transcendentals (sin, cos, exp, log, pow), real <stdio.h> file I/O (fopen, fread, fwrite, fseek are currently stubs returning success/zero), and full snprintf %f fractional precision. Today only the integer part of %f is reliable; as with libc's writeDouble, the soft-double (long)(frac * mul) step loses precision.

  • C++ runtime support: vtable layout for multiple inheritance, RTTI, exceptions (or a documented -fno-exceptions requirement).

  • REP/SEP scheduling pass (design doc §3.3): the current prologue picks one M-mode for the whole function based on whether any 8-bit accumulator value is used. A per-region scheduler would reduce the SEP/REP wrap overhead on i8 stores.

  • Toolbox / IIgs system call bindings: header files declaring the Apple IIgs system calls (SystemTask, WaitMouseUp, DrawString, …) with the right inline-asm dispatch glue.

  • Real-world program coverage: the smoke tests are microbenchmarks. A few known-good Apple IIgs C programs (e.g. a textfile pager, a small game) compiled and run end-to-end would catch issues no synthetic test currently exercises.

  • Cycle-time / size benchmarks vs Calypsi 5.16: design doc §1 says the goal is to "match or exceed" Calypsi. We have neither baseline numbers nor a comparison harness yet.