65816-llvm-mos/STATUS.md
Scott Duensing d6a34075a5 Checkpoint
2026-05-01 20:24:30 -05:00

llvm816 — Current Status

LLVM/Clang backend for the WDC 65816 (Apple IIgs), forked from llvm-mos as a separate W65816 target.

What works

End-to-end C-to-binary toolchain that produces 65816 machine code which runs correctly under MAME (apple2gs).

Language coverage at -O2 (no extra flags):

  • All scalar arithmetic: i8 / i16 / i32 / i64 add, sub, mul, div, mod (signed and unsigned). Carry-chained multi-word ops via ADC/SBC pseudos + ASLA16 / shift libcalls.
  • Comparisons and signed/unsigned widening (sext, zext, trunc) for all the above sizes.
  • Pointer arithmetic, array indexing, struct field access, struct return-by-value (up to 8 bytes — Pair, Vec4, double).
  • Bitfields, switch statements (verified up to ~12 cases + default), function pointers, function-pointer tables, indirect calls via __jsl_indir trampoline.
  • Recursion: factorial, Fibonacci, depth-3 binary-tree insert/sum/min/max, simple recursive quicksort.
  • Loops with goto / break / continue, nested loops, state machines.
  • <stdarg.h> varargs with int / long / unsigned long long mixed args.
  • Heap: malloc / free (libc.c first-fit allocator) — linked-list reverse with cons works.
  • Strings: hand-rolled strlen, strcmp, strcpy, strchr, atoi/itoa roundtrip.
  • Soft-float (single): all four ops + comparisons, MAME-verified.
  • Soft-double: add, sub, mul, div all match gcc's results bit-for-bit under round-to-nearest-even rounding; a 3-iteration Newton sqrt converges. Long-running iterations may hit MAME's 1-second sim-time budget (test config issue, not a compiler bug).
  • Inline assembly with "a", "x", "y" register constraints and arbitrary opcode bytes (used for the pha;plb bank-switch idiom).
  • C++ minimal: clang++ compiles a class with virtual + non-trivial ctor (vtable + RTTI omitted; no exceptions).
  • printf with %d %x %s %c %p and width/precision specifiers.
  • sprintf / snprintf / vsprintf / vsnprintf with the same format coverage as printf (%d %u %x %ld %lu %s %c %f %p %% + width). C99 truncation semantics for snprintf. %.Nf produces the correct fractional digits with round-half-up.
  • qsort + bsearch over arbitrary element size with a user cmp callback (insertion-sort variant — sidesteps the greedy regalloc bug in the recursive iterative-qsort form).
  • Standard string/stdlib glue: strcat, strncat, strpbrk, strspn, strcspn, atol, llabs (kept in their own translation unit so vprintf's branch layout doesn't shift).
  • <math.h>: fabs, floor, ceil, fmod, copysign, sqrt, pow, sin, cos, exp, log, atan, atan2, asin, acos, sinh, cosh, tanh (and float variants). Bit-twiddling for fabs/floor/ceil/copysign; Newton iteration for sqrt; range-reduction + Taylor for sin/cos/exp/log/atan; identities for asin/acos/atan2/sinh/cosh/tanh. Accuracy is in the ~1e-6 range: good enough for typical numeric work, far short of glibc quality. These are slow (each call is dozens to hundreds of soft-double libcalls), so pre-compute or cache when possible.
  • setjmp / longjmp from libgcc.s.
  • Static constructors via crt0's init_array walk.
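
The C99 truncation semantics noted for snprintf above can be illustrated with a small host-side sketch. The helper name format_long_checked is hypothetical (it is not part of the runtime); it relies only on the standard guarantee that snprintf returns the length the full output would have had.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper (not from the runtime): formats a long into buf
 * and reports whether the text fit.
 * Returns 1 if it fit, 0 if it was truncated, -1 on an encoding error. */
static int format_long_checked(char *buf, size_t cap, long value)
{
    /* C99: snprintf returns the number of characters the complete
     * output needs, excluding the terminating NUL, even when the
     * buffer is too small and the stored text is cut short. */
    int needed = snprintf(buf, cap, "%ld", value);
    if (needed < 0)
        return -1;
    return (size_t)needed < cap ? 1 : 0;
}
```

With a 4-byte buffer and the value 12345, the buffer ends up holding "123" plus the terminator and the helper reports truncation, since snprintf returns 5.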

Toolchain:

  • clang / llc produce W65816 assembly + ELF object files.
  • tools/link816 resolves cross-translation-unit refs, lays out text/rodata/bss, emits a flat binary the IIgs ROM can load. Auto-relocates bss above text+rodata when the default --bss-base 0x2000 would overlap text, and skips past the IIgs IO window ($C000-$CFFF) if needed.
  • tools/omfEmit produces OMF v2.1 single-segment files (the IIgs's native object format) for round-tripping with classic dev tools.
  • runtime/build.sh builds crt0, libc, soft-float, soft-double, libgcc into linkable objects.
  • scripts/smokeTest.sh runs 99 end-to-end checks (scalar ops, control flow, calling conventions, MAME execution, regressions, link816 bss-base safety, iigs/toolbox.h compile-check, standalone runtime headers, AsmPrinter peepholes for STZ / PEA / PEI — single-STA, shared-LDA-multi-STA, and DPF0-forwarding cases). Currently 100% pass at -O2 throughout.
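
The bss placement rules described for link816 can be sketched roughly as follows. This is an assumption-laden model, not link816's actual code: the function name place_bss, its parameters, and the order of the checks are hypothetical; only the constants (default base 0x2000, IO window $C000-$CFFF, LC1 ceiling $E000) come from this document.

```c
#include <stdint.h>

#define DEFAULT_BSS_BASE 0x2000u  /* default --bss-base */
#define IO_WINDOW_START  0xC000u  /* IIgs IO window: $C000-$CFFF */
#define IO_WINDOW_END    0xD000u  /* one past $CFFF */
#define LC1_CEILING      0xE000u  /* bss must not extend past this */

/* Hypothetical model of the placement decision.
 * image_start/image_end: half-open range occupied by text+rodata.
 * Returns the chosen bss base, or -1 if bss cannot fit below the ceiling. */
static long place_bss(uint32_t image_start, uint32_t image_end,
                      uint32_t bss_size)
{
    uint32_t base = DEFAULT_BSS_BASE;

    /* Default base collides with text+rodata: move bss above the image. */
    if (base < image_end && base + bss_size > image_start)
        base = image_end;

    /* bss would touch the IO window: start just past it instead. */
    if (base < IO_WINDOW_END && base + bss_size > IO_WINDOW_START)
        base = IO_WINDOW_END;

    /* Hard failure rather than silent corruption. */
    if (base + bss_size > LC1_CEILING)
        return -1;
    return (long)base;
}
```

For example, an image ending at $BFC0 with $100 bytes of bss is moved first to $BFC0, then to $D000 because it would otherwise straddle $C000.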

ABI:

  • arg0 in A; arg1 in X for i32-first-arg signatures; rest pushed RTL on the system stack with PHA. Caller deallocates via tsc;clc;adc #N;tcs or PLY*N/2.
  • Return: i8/i16 in A; i32 in A:X; i64 in A:X:Y plus DP[$F0..$F1] for the highest 16 bits.
  • Frame is empty-descending (S points to next-free); offsets account for the +1 skew vs LLVM's full-descending model.

In flight

Two open bugs tracked:

  1. #107 — strtok / qsort -O1+ miscompile — RESOLVED. Three independent issues across the backend, runtime, and linker; all fixed.

    Fix 1 (W65816StackSlotCleanup cross-MBB): Pass -4 / Pass -4c collapsed LDA fs.X; STA stk.Y; ... LDA_indY stk.Y patterns with only an MBB-local safety check, missing cross-MBB readers of stk.Y. Greedy regalloc had spilled an in-place INA result back to stk.Y; eliminating the bb.3 init store left the bb.10 reload reading garbage. Function-wide cross-MBB check added.

    Fix 2 (W65816SepRepCleanup LDAi8imm hoist): Pre-pass that relocates LDAi8imm BEFORE byte-store SEP/REP wraps. LDAi8imm expands at AsmPrinter to its own SEP+LDA8+REP that toggles M; the post-RA scheduler was moving it INSIDE an STBptr wrap, so the LDAi8imm's REP fired BEFORE the byte STA. The STA then ran in M=16, writing 2 bytes of zero and clobbering the next byte. Hoist puts the toggle in the outer M=16 zone, leaving the byte STA in M=8.

    Fix 3 (link816 bss-base safety + strtok_r noinline): With the backend fixes, -O2 strtok grew large enough that the strtok() wrapper inlining (~290 extra bytes) pushed the binary's text+rodata past 0xC000 (IIgs IO window). Reads of string literals or stdio handles in that range hit IO registers and corrupted execution. Two complementary fixes: __attribute__((noinline)) on strtok_r so the wrapper doesn't duplicate it (-O2 strtok.o now 1564B, was 2156B); link816 auto-relocates bss above text+rodata when default --bss-base 0x2000 would overlap, and skips past the IO window if needed.

    strtok.c now compiles at -O2 with everything else. Smoke #84 (4-call strtok continuation) and #92 (recursive parser) both pass. Workaround comments in build.sh / smokeTest.sh removed.

    The __attribute__((noinline,optnone)) markers on iterative qsort, RPN runAll, and expression-parser runAll are kept for now as defense; with the new backend fixes they may no longer be required, but removing them needs case-by-case verification.

The W65816 backend assembler now supports all common indirect addressing modes ((dp), (dp),Y, (dp,X), (d,s),Y, [dp], [dp],Y, and JMP (abs)). All .byte opcode hacks in the runtime have been removed in favour of the mnemonics. The disassembler decodes them too.

Runtime now exposes a nearly complete C99 subset: sprintf/snprintf with correct %.Nf precision, qsort/bsearch, the full string.h family (strcat/strncat/strpbrk/strspn/strcspn/strtok/strtok_r), math.h with the thirteen common math routines (sqrt/pow/sin/cos/exp/log/atan/atan2/asin/acos/sinh/cosh/tanh), atol/llabs/atexit/exit/abort, and a smoke test that exercises malloc + struct pointers + strcmp/strcpy via a working hash table end-to-end in MAME.

strtok / strtok_r live in their own TU at -O2 (with __attribute__((noinline)) on strtok_r so the strtok() wrapper doesn't duplicate it). Multi-call strtok over "a,b,,c" works end-to-end in smoke. The layout-sensitive miscompile that previously haunted strtok_r's inner CMP loop has been fixed by modelling Uses=[P] on the conditional branches (the LICM/sink interaction that elided "redundant" CMPs no longer fires); no surgical workaround flags needed.
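
A minimal sketch of the multi-call strtok pattern, using only standard C. The helper name split_commas is hypothetical and is not taken from the smoke test.

```c
#include <string.h>

/* Hypothetical helper: splits a mutable string on commas and stores up
 * to max token pointers in out. Returns the number of tokens found.
 * strtok writes NULs into text, so text must not be a string literal. */
static int split_commas(char *text, char **out, int max)
{
    int count = 0;
    /* First call passes the string; continuation calls pass NULL so
     * strtok resumes from its saved position. */
    for (char *tok = strtok(text, ",");
         tok != NULL && count < max;
         tok = strtok(NULL, ","))
        out[count++] = tok;
    return count;
}
```

Because strtok treats a run of delimiters as one separator, the input "a,b,,c" yields three tokens (a, b, c); the empty field is skipped rather than returned.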

A small RPN calculator test (smoke #87) chains strtok, atol, push/pop over a static stack, snprintf "%ld", and strcmp to verify the end-to-end composition under a realistic-ish workload — adds, subs, muls, divs, and 3-deep operand stacks all work.

setjmp / longjmp (smoke #88) now work end-to-end: setjmp saves SP / 24-bit ret addr / DP, longjmp restores them and returns the val argument as setjmp's "second return". Required two fixes: (a) the W65816 assembler had no instruction definition for (dp) / (dp), y / (dp, x) indirect addressing modes, so the mnemonic forms silently fell through to absolute-,Y opcodes — fixed in src/llvm/lib/Target/W65816/W65816InstrFormats.td + W65816InstrInfo.td + AsmParser/W65816AsmParser.cpp (the runtime .byte hacks have been replaced with mnemonics); (b) added __attribute__((returns_twice)) to the setjmp declaration so the optimizer doesn't constant-fold post-setjmp env reads to 0.
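
The "second return" behaviour can be sketched in portable C as below. The names (recover_point, fail_with, run_guarded, the error codes) are hypothetical and not taken from smoke #88. setjmp is used as the whole controlling expression of a switch, one of the contexts the C standard explicitly permits.

```c
#include <setjmp.h>

enum { ERR_PARSE = 1, ERR_RANGE = 2 };  /* hypothetical error codes */

static jmp_buf recover_point;

/* Unwinds to the most recent setjmp(recover_point) call. */
static void fail_with(int code)
{
    longjmp(recover_point, code);
}

/* Returns 0 if the guarded work finished, the error code if it bailed
 * out via longjmp, or -1 for a code this sketch does not know. */
static int run_guarded(int code_to_raise)
{
    switch (setjmp(recover_point)) {
    case 0:                      /* direct return: do the work */
        if (code_to_raise != 0)
            fail_with(code_to_raise);
        return 0;
    case ERR_PARSE:              /* second return, val == ERR_PARSE */
        return ERR_PARSE;
    case ERR_RANGE:              /* second return, val == ERR_RANGE */
        return ERR_RANGE;
    default:
        return -1;
    }
}
```

Note that longjmp with a val of 0 makes setjmp return 1, so 0 is always reserved for the direct return.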

CRC32 (smoke #89) verifies the standard "123456789" → 0xCBF43926 end-to-end — exercises uint32_t shifts, XORs, char-by-char loops.
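
For reference, a bit-at-a-time CRC-32 that reproduces that check value might look like this. It is a sketch of the standard reflected algorithm (polynomial 0xEDB88320), not necessarily the code used in smoke #89; the function name crc32_bitwise is hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

/* Bitwise CRC-32 (reflected, polynomial 0xEDB88320, initial value and
 * final XOR both 0xFFFFFFFF). Slow but table-free. */
static uint32_t crc32_bitwise(const unsigned char *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            /* Shift right one bit; fold in the polynomial when the
             * bit shifted out was set. */
            if (crc & 1u)
                crc = (crc >> 1) ^ 0xEDB88320u;
            else
                crc >>= 1;
        }
    }
    return crc ^ 0xFFFFFFFFu;
}
```

The table-free form keeps rodata small at the cost of eight shift/XOR steps per input byte, which is also what makes it a useful stress test for 32-bit shifts on a 16-bit accumulator.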

Brainfuck interpreter (smoke #90) executes a small bf program and verifies the output bytes — exercises loop bracket matching, pointer math (data pointer), branching on cell value.
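
A compact interpreter of the kind described could be sketched as follows. The function name bf_run, the tape size, and the error convention are assumptions; input (the ',' command) is left out for brevity.

```c
#include <string.h>

#define BF_TAPE_SIZE 256  /* assumed tape length for this sketch */

/* Runs prog and stores bytes emitted by '.' into out (at most out_cap).
 * Returns the number of bytes stored, or -1 on unbalanced brackets or
 * when the data pointer leaves the tape. ',' is ignored. */
static int bf_run(const char *prog, unsigned char *out, int out_cap)
{
    unsigned char tape[BF_TAPE_SIZE];
    int ptr = 0, produced = 0;
    memset(tape, 0, sizeof tape);

    for (int pc = 0; prog[pc] != '\0'; pc++) {
        switch (prog[pc]) {
        case '>': if (++ptr >= BF_TAPE_SIZE) return -1; break;
        case '<': if (--ptr < 0) return -1; break;
        case '+': tape[ptr]++; break;
        case '-': tape[ptr]--; break;
        case '.':
            if (produced < out_cap)
                out[produced++] = tape[ptr];
            break;
        case '[':
            if (tape[ptr] == 0) {          /* skip forward to matching ']' */
                int depth = 1;
                while (depth > 0) {
                    pc++;
                    if (prog[pc] == '\0') return -1;
                    if (prog[pc] == '[') depth++;
                    else if (prog[pc] == ']') depth--;
                }
            }
            break;
        case ']':
            if (tape[ptr] != 0) {          /* jump back to matching '[' */
                int depth = 1;
                while (depth > 0) {
                    if (--pc < 0) return -1;
                    if (prog[pc] == ']') depth++;
                    else if (prog[pc] == '[') depth--;
                }
            }
            break;
        default:
            break;                          /* any other byte is a comment */
        }
    }
    return produced;
}
```

Bracket matching is done by scanning at run time rather than with a precomputed jump table, which keeps the sketch short at the cost of rescanning on every loop iteration.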

Recursive-descent expression parser (smoke #92) evaluates "3+4", "2*3+4", "2+3*4", "(3+4)*5", "100/4-5*2+1" with proper operator precedence and parentheses, exercising mutual recursion, char-by-char tokenization, and integer arithmetic in concert.
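
The structure such a parser typically takes is sketched below: one function per precedence level, with the parenthesis case recursing back to the top. All names (cursor, parse_factor, parse_term, parse_expr, evaluate) are hypothetical, and error handling is deliberately minimal.

```c
#include <ctype.h>

static const char *cursor;          /* current read position */
static long parse_expr(void);       /* forward declaration for recursion */

/* factor := number | '(' expr ')' */
static long parse_factor(void)
{
    if (*cursor == '(') {
        cursor++;
        long inner = parse_expr();
        if (*cursor == ')')
            cursor++;
        return inner;
    }
    long value = 0;
    while (isdigit((unsigned char)*cursor))
        value = value * 10 + (*cursor++ - '0');
    return value;
}

/* term := factor (('*' | '/') factor)* */
static long parse_term(void)
{
    long value = parse_factor();
    while (*cursor == '*' || *cursor == '/') {
        char op = *cursor++;
        long rhs = parse_factor();
        if (op == '*')
            value *= rhs;
        else if (rhs != 0)          /* sketch: divide-by-zero is skipped */
            value /= rhs;
    }
    return value;
}

/* expr := term (('+' | '-') term)* */
static long parse_expr(void)
{
    long value = parse_term();
    while (*cursor == '+' || *cursor == '-') {
        char op = *cursor++;
        long rhs = parse_term();
        value = (op == '+') ? value + rhs : value - rhs;
    }
    return value;
}

static long evaluate(const char *text)
{
    cursor = text;
    return parse_expr();
}
```

Precedence falls out of the call structure: parse_expr only sees whole terms, so multiplication and division bind tighter than addition and subtraction without any explicit precedence table.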

The DWARF sidecar (link816 --debug-out FILE) now applies text/rodata/bss/init_array relocations to every .debug_* section before writing it. PC values in .debug_addr and .debug_line end up as final-image addresses, so a consumer can map back to source lines without re-running the linker. Intra-debug references (e.g. .debug_info -> .debug_str offsets) are intentionally left object-local — sections are concatenated, not recompacted, and each slice carries an ; OBJ ... SEC ... SIZE ... header so a multi-TU consumer can scope intra-debug offsets per-slice. The smoke test verifies the address of a known function appears in the patched sidecar bytes.

Known issues / workarounds

  • (d,s),y / (sr,s),y addressing lands in the wrong bank when Y holds a negative offset, because the CPU adds Y as a 16-bit unsigned value. Worked around by W65816NegYIndY rewriting the affected ops to TAX ; LDA/STA $0000,X. Stays correct for negative offsets like arr[i-1].

  • (d,s),y for stack-local pointer dereferences uses DBR, so user code that switches DBR (e.g. pha;plb to bank 2 to reach IIgs hardware) must not call into a function that takes the address of one of its locals — the callee's *p = v will write to the wrong bank. Documented; no compiler-side mitigation beyond the existing DPF0 fake-physreg routing for the i64-return high half. Workaround: inline pointer-arg helpers so the writes stay in the caller's frame using stack-rel direct stores. The W65816 only has three DBR-independent addressing modes (abs_long, abs_long,X, [dp],Y) — none cheap to retrofit into the current pointer-deref lowering (+5 bytes minimum per access). Real fix needs PHB/PLB at noinline-pointer-callee entry/exit.

Recently fixed

  • #70: iterative qsort -O2 miscompile. W65816StackSlotCleanup Pass -2 was deleting a store to a slot the loop body read. Function-wide slotHasOtherRefs safety check added (Pass -1 and Pass -2c hardened with the same pattern). Iterative qsort at plain -O2 + greedy now compiles correctly; the optnone workaround in smoke #70 was removed.

  • strtok -O2 layout-sensitive miscompile: modelling Uses=[P] on the conditional branches (BEQ/BNE/BCS/BCC/BMI/BPL/BVS/BVC) made MachineCSE / scheduler / LICM / sink see the CMP→Bxx flag dependency. An entire class of layout-sensitive flag-corruption bugs went away; verified by sweeping --rodata-base from text-end to text-end+300 in 13 increments, with every layout returning the correct strtok result. As a follow-on, MachineCSE has been re-enabled (was previously disabled in W65816TargetMachine::addMachineSSAOptimization as a workaround for the same root cause).

  • link816 silently produced 4.3GB binaries when --rodata-base was set inside the text region. Now dies with a clear error: --rodata-base 0xX overlaps text 0xY+N (must start at or after 0xZ).

  • link816 BSS-relocate landed in IIgs Language Card area — when text+rodata grew past $C000, link816 placed BSS at $D000 (the LC1 area), where IIgs-by-default maps ROM (writes drop silently, reads return ROM bytes). Globals never initialised; caught by the expression-parser smoke (#92) when adding rand / strnlen / etc. pushed the runtime past that threshold. Two-part fix: crt0 now enables LC1 RAM via the standard lda $C083 read-twice trick at startup, and link816 hard-fails (rather than silently corrupt) if BSS would exceed the LC1 ceiling ($E000) — past that you'd need crt0 to also enable LC2 / shadow RAM, which we haven't wired up.

  • STZ peephole multi-STA latent miscompile — AsmPrinter's LDA #0; STA $g -> STZ $g peephole eliminated the LDA but only consumed the FIRST STA. When SDAG-CSE shared one LDA #0 across multiple STAs (g16=0; g32=0; is one IR shape), trailing STAs read whatever was in A on entry — silently corrupting any global where A wasn't 0 at function entry. Smoke happened to pass because A was 0 by luck in every covered path. Fixed by gating the peephole on the consuming STA killing A (regalloc only sets killed on the last reader); smoke #98 added to lock the multi-STA case.

  • PEI AsmPrinter peephole — new: LDA $dp; PHA -> PEI $dp saves 1 byte and avoids touching A. Fires on the copyPhysReg(A=DPF0); PUSH16 pattern (i64-libcall return-value forwarding into the next call's stacked args), which appears in every chained soft-double / soft-int64 expression. Saves 68 bytes across the runtime (-64 in math.o alone). Same next-instruction-modifies-A safety check as the PEA peephole. Smoke #99 added.

  • PEA peephole opcode-allowlist replaced with modifiesRegister — the next-after-PUSH16 check that gates the PEA peephole was a hand-curated list of opcodes that obviously redefine A; switched to MachineInstr::modifiesRegister(A, TRI) which also catches implicit-defs (e.g. JSL clobbering A as part of the call ABI). Saves a few bytes and is more robust.

  • libgcc.s lda #0; sta $XX -> stz $XX — 7 sites converted in libgcc.s after STZ landed in the assembler. Saves 28 bytes; also removes two PHA/PLA save-restore wraps around the LDA #0 (STZ doesn't touch A, so the wraps are unnecessary).
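
The gating rule behind the STZ peephole fix above can be modelled with a toy instruction list. This is only an illustration in plain C: the real peephole lives in the AsmPrinter and operates on LLVM machine instructions, and every name here (Opcode, Insn, kills_a, fold_stz) is hypothetical.

```c
/* Toy instruction model: just enough to show the gating rule. */
typedef enum { OP_LDA_ZERO, OP_STA, OP_STZ, OP_OTHER } Opcode;

typedef struct {
    Opcode op;
    int addr;     /* store target; unused for OP_LDA_ZERO / OP_OTHER */
    int kills_a;  /* nonzero if this instruction is the last reader of A */
} Insn;

/* Rewrites "LDA #0; STA addr" into "STZ addr" in place, but only when
 * the STA is the last reader of A. If later STAs still need the zero
 * in A, dropping the LDA would leave them storing stale data.
 * Returns the new instruction count. */
static int fold_stz(Insn *code, int count)
{
    int out = 0;
    for (int i = 0; i < count; i++) {
        int foldable = code[i].op == OP_LDA_ZERO
                    && i + 1 < count
                    && code[i + 1].op == OP_STA
                    && code[i + 1].kills_a;
        if (foldable) {
            code[out].op = OP_STZ;
            code[out].addr = code[i + 1].addr;
            code[out].kills_a = 0;
            out++;
            i++;                    /* consume the STA as well */
        } else {
            code[out++] = code[i];
        }
    }
    return out;
}
```

In this model a shared LDA #0 feeding two STAs is left alone, because the first STA does not kill A; only the single-STA shape collapses to STZ.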

What's still needed for a "ship-ready" toolchain

  • softDouble.c -O1 hold-out: __muldf3's u64 lifetime pressure overflows the greedy register allocator at -O2 ("ran out of registers during register allocation"). Builds correctly at -O1. Investigated: marking dpack noinline reduces pressure but isn't enough; making dclass noinline would unblock -O2 (verified) but the (d,s),y-uses-DBR bug then corrupts dclass's pointer-arg writes when a caller has switched DBR (caught by smoke's dmul-after-bank-switch test). Real fix is gated on the broader DBR-pointer-deref limitation listed above.

  • More of the C standard library: real <stdio.h> file I/O (fopen, fread, fwrite, fseek are currently stubs returning success/zero) — would need a memory-backed FS or a MAME hook. <locale.h> / <signal.h> are stubbed (compile and return safe defaults); <wchar.h> / <time.h> mostly absent.

  • C++ runtime support: vtable layout for multiple inheritance, RTTI, exceptions (or a documented -fno-exceptions requirement).

  • REP/SEP scheduling pass (design doc §3.3): the current prologue picks one M-mode for the whole function based on whether any 8-bit accumulator value is used. A per-region scheduler would reduce the SEP/REP wrap overhead on i8 stores.

  • Toolbox / IIgs system call bindings: header files declaring the Apple IIgs system calls (SystemTask, WaitMouseUp, DrawString, …) with the right inline-asm dispatch glue.

  • Real-world program coverage: the smoke tests are microbenchmarks. A few known-good Apple IIgs C programs (e.g. a textfile pager, a small game) compiled and run end-to-end would catch issues no synthetic test currently exercises.

  • Cycle-time / size benchmarks vs Calypsi 5.16: design doc §1 says the goal is to "match or exceed" Calypsi. We have neither baseline numbers nor a comparison harness yet.