65816-llvm-mos/STATUS.md
Scott Duensing f80a49dc1e Checkpoint
2026-04-30 19:16:32 -05:00

# llvm816 — Current Status
LLVM/Clang backend for the WDC 65816 (Apple IIgs), forked from
llvm-mos as a separate `W65816` target.
## What works
End-to-end C-to-binary toolchain that produces 65816 machine code
which runs correctly under MAME (apple2gs).

**Language coverage at -O2 (no extra flags):**
- All scalar arithmetic: i8 / i16 / i32 / i64 add, sub, mul, div, mod
(signed and unsigned). Carry-chained multi-word ops via ADC/SBC pseudos
+ ASLA16 / shift libcalls.
- Comparisons and signed/unsigned widening (sext, zext, trunc) for all
the above sizes.
- Pointer arithmetic, array indexing, struct field access, struct
return-by-value (up to 8 bytes — Pair, Vec4, double).
- Bitfields, switch statements (verified up to ~12 cases + default),
function pointers, function-pointer tables, indirect calls via
`__jsl_indir` trampoline.
- Recursion: factorial, Fibonacci, depth-3 binary-tree
insert/sum/min/max, simple recursive quicksort.
- Loops with goto / break / continue, nested loops, state machines.
- `<stdarg.h>` varargs with int / long / unsigned long long mixed args.
- Heap: `malloc` / `free` (libc.c first-fit allocator) — linked-list
reverse with `cons` works.
- Strings: hand-rolled `strlen`, `strcmp`, `strcpy`, `strchr`, atoi/itoa
roundtrip.
- Soft-float (single): all four ops + comparisons, MAME-verified.
- Soft-double: add, sub, mul, div all return correct bit patterns;
3-iter Newton sqrt converges. Long-running iterations may hit MAME's
1-second sim-time budget (test config issue, not a compiler bug).
- Inline assembly with `"a"`, `"x"`, `"y"` register constraints and
arbitrary opcode bytes (used for the `pha;plb` bank-switch idiom).
- C++ minimal: clang++ compiles a class with virtual methods and a
non-trivial constructor (vtable + RTTI omitted; no exceptions).
- printf with `%d %x %s %c %p` and width/precision specifiers.
- `setjmp` / `longjmp` from libgcc.s.
- Static constructors via crt0's init_array walk.
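The heap item above (linked-list reverse with `cons`) can be exercised by a test shaped roughly like the following. This is a sketch, not the project's actual smoke test: `Node`, `reverse`, `freeList`, and `reverseDemo` are hypothetical names, and only `cons`, `malloc`, and `free` appear in the status text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical list node; the real test's layout is not documented here. */
typedef struct Node {
    int          value;
    struct Node *next;
} Node;

/* Prepend value to the list. Returns the new head, or NULL if malloc fails.
 * (The sketch does not try to recover from allocation failure.) */
Node *cons(int value, Node *next) {
    Node *node = malloc(sizeof *node);
    if (node != NULL) {
        node->value = value;
        node->next  = next;
    }
    return node;
}

/* Reverse the list in place and return the new head. */
Node *reverse(Node *head) {
    Node *prev = NULL;
    while (head != NULL) {
        Node *next = head->next;
        head->next = prev;
        prev       = head;
        head       = next;
    }
    return prev;
}

/* Release every node back to the allocator. */
void freeList(Node *head) {
    while (head != NULL) {
        Node *next = head->next;
        free(head);
        head = next;
    }
}

/* Build 1 -> 2 -> 3, reverse it, and pack the visited values into one
 * decimal number so a single comparison checks both order and content. */
int reverseDemo(void) {
    Node *list = reverse(cons(1, cons(2, cons(3, NULL))));
    int packed = 0;
    for (Node *p = list; p != NULL; p = p->next) {
        assert(p->value >= 0 && p->value <= 9);
        packed = packed * 10 + p->value;
    }
    freeList(list);
    return packed;
}
```

`reverseDemo()` returns 321 when allocation, pointer stores, and the loop all behave; the value is small enough to fit in a 16-bit integer.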

**Toolchain:**
- `clang` / `llc` produce W65816 assembly + ELF object files.
- `tools/link816` resolves cross-translation-unit refs, lays out
text/rodata/bss, emits a flat binary the IIgs ROM can load.
- `tools/omfEmit` produces OMF v2.1 single-segment files (the IIgs's
native object format) for round-tripping with classic dev tools.
- `runtime/build.sh` builds crt0, libc, soft-float, soft-double,
libgcc into linkable objects.
- `scripts/smokeTest.sh` runs ~80 end-to-end checks (scalar ops,
control flow, calling conventions, MAME execution, regressions).
All checks currently pass.

**ABI:**
- arg0 in A; arg1 in X for i32-first-arg signatures; rest pushed
right-to-left on the system stack with PHA. The caller deallocates
N bytes of arguments via `tsc;clc;adc #N;tcs` or N/2 `PLY` instructions.
- Return: i8/i16 in A; i32 in A:X; i64 in A:X:Y plus DP[$F0..$F1] for
the highest 16 bits.
- Frame is empty-descending (S points to next-free); offsets account
for the +1 skew vs LLVM's full-descending model.
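The +1 skew in the frame item above can be seen in a small host-side model of the stack. This is an illustrative sketch only, assuming 16-bit pushes; `Cpu`, `pushWord`, `readStackRel`, `toStackRel`, and `stackDemo` are hypothetical names and not part of the backend.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: 64 KiB of bank-0 memory plus the stack pointer S. */
typedef struct {
    uint8_t  mem[0x10000];
    uint16_t s;
} Cpu;

/* Model of PHA with a 16-bit accumulator: the high byte is stored at S,
 * then the low byte, and S is decremented after each store. Afterwards S
 * points at the next free byte (empty-descending). */
void pushWord(Cpu *cpu, uint16_t value) {
    cpu->mem[cpu->s--] = (uint8_t)(value >> 8);
    cpu->mem[cpu->s--] = (uint8_t)(value & 0xFF);
}

/* Model of a 16-bit stack-relative read such as LDA offset,S. */
uint16_t readStackRel(const Cpu *cpu, uint8_t offset) {
    uint16_t addr = (uint16_t)(cpu->s + offset);
    uint16_t lo   = cpu->mem[addr];
    uint16_t hi   = cpu->mem[(uint16_t)(addr + 1)];
    return (uint16_t)(lo | (hi << 8));
}

/* A full-descending model would find the newest item at SP+0. Because S
 * points one byte below it here, every offset is shifted by one. */
uint8_t toStackRel(uint8_t fullDescendingOffset) {
    assert(fullDescendingOffset < 0xFF);
    return (uint8_t)(fullDescendingOffset + 1);
}

/* Push two words, then read back the newest one via the skewed offset. */
uint16_t stackDemo(void) {
    static Cpu cpu;
    cpu.s = 0x01FF;
    pushWord(&cpu, 0x1234);
    pushWord(&cpu, 0xBEEF);
    return readStackRel(&cpu, toStackRel(0));
}
```

In this model the most recently pushed word is read at offset 1 from S, and the word pushed before it at offset 3.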
## In flight (build-system level)
- **DWARF sidecar — minimal version landed** (#51): `link816 --debug-out
FILE` collects every `.debug_*` section from the input objects and
writes them to a sidecar with section headers. Addresses are still
object-file-local (no relocation processing). A consumer that wants
source-mapped final-image addresses must apply the relocations itself
against the text/rodata bases, or work with offsets scoped to each
object. Future
work: apply text/rodata relocations to `.debug_info` / `.debug_line`
so addresses match the final image, and emit a TOC the consumer
can index by source file or function.
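Until relocations are applied, a sidecar consumer has to rebase object-local addresses itself. The sketch below shows only the final arithmetic step, under the assumption that the consumer already knows where the linker placed each object's text and rodata. `ObjectPlacement`, `SectionKind`, and `toImageAddress` are hypothetical names; the sidecar format does not currently provide this information, and parsing the relocation records is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Which section an object-local address refers to (hypothetical). */
typedef enum { SECTION_TEXT, SECTION_RODATA } SectionKind;

/* Final-image base addresses chosen by the linker for one input object
 * (hypothetical; the consumer must obtain these from somewhere else). */
typedef struct {
    uint32_t textBase;
    uint32_t rodataBase;
} ObjectPlacement;

/* Rebase an object-local offset to a final-image address. The result is
 * masked to 24 bits, the width of a 65816 address. */
uint32_t toImageAddress(const ObjectPlacement *obj, SectionKind kind,
                        uint32_t localOffset) {
    assert(obj != NULL);
    uint32_t base = (kind == SECTION_TEXT) ? obj->textBase : obj->rodataBase;
    return (base + localOffset) & 0xFFFFFFu;
}
```

For an object whose text was placed at `0x008000`, a line-table entry at local offset `0x12` would map to `0x008012` in the final image.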
## Known issues / workarounds
- **Greedy register allocator mis-orders spills** in iterative
quicksort with `if/else` recursion choice (#70). Complex live
ranges across two `swap()` calls produce wrong pointer args.
Reproduces only at `-O1`/`-O2` with greedy. Workaround:
`-mllvm -regalloc=fast` for the affected translation unit, or
rewrite the qsort with explicit recursion guards instead of the
iterative tail-elim form. `softDouble.c` already uses this
flag for `__muldf3` (build.sh applies it automatically). Real
fix is either a pre-RA pass that explicitly spills loop-carried
pointer args or a targeted greedy heuristic patch.
- **(d,s),y / (sr,s),y addressing wraps the bank** when Y holds a
negative offset, which the CPU treats as a large 16-bit unsigned
value. Worked around by `W65816NegYIndY`, which rewrites the
affected ops to `TAX ; LDA/STA $0000,X`; the rewritten form stays
correct for negative offsets like `arr[i-1]`.
- **(d,s),y for stack-local pointer dereferences uses DBR**, so
user code that switches DBR (e.g. `pha;plb` to bank 2 to reach
IIgs hardware) must not call into a function that takes the
address of one of its locals — the callee's `*p = v` will write
to the wrong bank. Documented; no compiler-side mitigation
beyond the existing DPF0 fake-physreg routing for the i64-return
high half.
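The negative-Y item above comes down to address arithmetic, which a host-side model can make concrete. The sketch assumes the indirect-indexed modes add Y to the 24-bit `DBR:pointer` address, so a carry out of bit 15 lands in the next bank, and that the rewrite forms the 16-bit sum before the `TAX`. Both function names are hypothetical, and this is not the code of the `W65816NegYIndY` pass.

```c
#include <assert.h>
#include <stdint.h>

/* Effective address of an indirect-indexed access as modelled here:
 * DBR supplies the bank, and Y is added across the full 24 bits. */
uint32_t effectiveIndirectY(uint8_t dbr, uint16_t pointer, uint16_t y) {
    uint32_t addr = ((uint32_t)dbr << 16) + pointer + y;
    return addr & 0xFFFFFFu;
}

/* Effective address after the rewrite: pointer + Y is reduced to 16 bits
 * first (this is the value assumed to be moved into X), and $0000,X can
 * never carry out of the bank because the base is zero. */
uint32_t effectiveRewritten(uint8_t dbr, uint16_t pointer, uint16_t y) {
    uint16_t x = (uint16_t)(pointer + y);
    return ((uint32_t)dbr << 16) + x;
}
```

With `pointer = 0x2000` and `y = 0xFFFE` (that is, -2), the first function yields `0x011FFE` in bank 1, while the rewritten form yields the intended `0x001FFE`. For non-negative offsets that stay inside the bank the two agree.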
## What's still needed for a "ship-ready" toolchain
- **Greedy regalloc spill-ordering fix** — see above. Removes the
need for the per-file `-regalloc=fast` workaround on
`softDouble.c` and unblocks pattern-rich code that currently
must be compiled at `-O0` for correctness.
- **Round-to-nearest-even in `__divdf3`** — currently
truncates toward zero, which differs from gcc's result by ±1 ULP in
several test cases. Acceptable today (Newton iterations still
converge); revisit when an exact-match test suite lands.
- **DWARF sidecar with relocations applied** — current (#51) version
is raw section pass-through; addresses are object-file-local. A
real source-level debugger needs the linker to apply text/rodata
relocations to `.debug_info` / `.debug_line` first.
- **More of the C standard library**: `<math.h>` transcendental
functions (sin, cos, exp, log, pow), `<string.h>` beyond what's
hand-coded, `<stdio.h>` file I/O (`fopen`, `fread`, `fwrite`,
`fseek`).
- **C++ runtime support**: vtable layout for multiple inheritance,
RTTI, exceptions (or a documented `-fno-exceptions` requirement).
- **REP/SEP scheduling pass** (design doc §3.3): the current
prologue picks one M-mode for the whole function based on
whether any 8-bit accumulator value is used. A per-region
scheduler would reduce the SEP/REP wrap overhead on i8 stores.
- **Toolbox / IIgs system call bindings**: header files declaring
the Apple IIgs system calls (`SystemTask`, `WaitMouseUp`,
`DrawString`, …) with the right inline-asm dispatch glue.
- **Real-world program coverage**: the smoke tests are small
synthetic checks, not real programs. A few known-good Apple IIgs C
programs (e.g. a textfile pager, a small game) compiled and run
end-to-end would catch issues no synthetic test currently exercises.
- **Cycle-time / size benchmarks vs Calypsi 5.16**: design doc §1
says the goal is to "match or exceed" Calypsi. We have neither
baseline numbers nor a comparison harness yet.
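For the `__divdf3` rounding item above, the difference between truncation and round-to-nearest-even can be sketched on a bare significand that carries a few extra low-order bits. This is a generic illustration, not the runtime's code: `truncateBits` and `roundNearestEven` are hypothetical helpers, and a real implementation would also have to handle a carry out of the significand into the exponent.

```c
#include <assert.h>
#include <stdint.h>

/* Discard the low extraBits bits: what truncation toward zero does to
 * the magnitude of a result. */
uint64_t truncateBits(uint64_t sig, unsigned extraBits) {
    assert(extraBits >= 1 && extraBits < 64);
    return sig >> extraBits;
}

/* Discard the low extraBits bits, rounding to nearest and breaking exact
 * ties toward the value whose lowest kept bit is zero. */
uint64_t roundNearestEven(uint64_t sig, unsigned extraBits) {
    assert(extraBits >= 1 && extraBits < 64);
    uint64_t kept    = sig >> extraBits;
    uint64_t dropped = sig & ((UINT64_C(1) << extraBits) - 1);
    uint64_t half    = UINT64_C(1) << (extraBits - 1);
    if (dropped > half || (dropped == half && (kept & 1) != 0)) {
        kept++;  /* may overflow the significand width; caller renormalizes */
    }
    return kept;
}
```

With two extra bits, `0xB` (binary 10.11) truncates to 2 but rounds to 3, which is the kind of 1 ULP gap described above. The ties `0xA` (10.10) and `0xE` (11.10) round to 2 and 4 respectively, both even.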