65816-llvm-mos/STATUS.md
Scott Duensing 18ef7e1fa6 Checkpoint
2026-05-01 17:22:55 -05:00

# llvm816 — Current Status
LLVM/Clang backend for the WDC 65816 (Apple IIgs), forked from
llvm-mos as a separate `W65816` target.
## What works
End-to-end C-to-binary toolchain that produces 65816 machine code
which runs correctly under MAME (apple2gs).
**Language coverage at -O2 (no extra flags):**
- All scalar arithmetic: i8 / i16 / i32 / i64 add, sub, mul, div, mod
(signed and unsigned). Carry-chained multi-word ops via ADC/SBC pseudos
+ ASLA16 / shift libcalls.
- Comparisons and signed/unsigned widening (sext, zext, trunc) for all
the above sizes.
- Pointer arithmetic, array indexing, struct field access, struct
return-by-value (up to 8 bytes — Pair, Vec4, double).
- Bitfields, switch statements (verified up to ~12 cases + default),
function pointers, function-pointer tables, indirect calls via
`__jsl_indir` trampoline.
- Recursion: factorial, Fibonacci, depth-3 binary-tree
insert/sum/min/max, simple recursive quicksort.
- Loops with goto / break / continue, nested loops, state machines.
- `<stdarg.h>` varargs with int / long / unsigned long long mixed args.
- Heap: `malloc` / `free` (libc.c first-fit allocator) — linked-list
reverse with `cons` works.
- Strings: hand-rolled `strlen`, `strcmp`, `strcpy`, `strchr`, atoi/itoa
roundtrip.
- Soft-float (single): all four ops + comparisons, MAME-verified.
- Soft-double: add, sub, mul, div all return correct bit patterns
bit-for-bit against gcc with round-to-nearest-even rounding;
3-iter Newton sqrt converges. Long-running iterations may hit MAME's
1-second sim-time budget (test config issue, not a compiler bug).
- Inline assembly with `"a"`, `"x"`, `"y"` register constraints and
arbitrary opcode bytes (used for the `pha;plb` bank-switch idiom).
- C++ minimal: clang++ compiles a class with virtual + non-trivial
ctor (vtable + RTTI omitted; no exceptions).
- printf with `%d %x %s %c %p` and width/precision specifiers.
- sprintf / snprintf / vsprintf / vsnprintf with the same format
coverage as printf (`%d %u %x %ld %lu %s %c %f %p %%` + width).
C99 truncation semantics for snprintf. `%.Nf` produces the
correct fractional digits with round-half-up.
- qsort + bsearch over arbitrary element size with a user `cmp`
callback (insertion-sort variant — sidesteps the greedy regalloc
bug in the recursive iterative-qsort form).
- Standard string/stdlib glue: strcat, strncat, strpbrk, strspn,
strcspn, atol, llabs (kept in their own translation unit so
vprintf's branch layout doesn't shift).
- `<math.h>`: fabs, floor, ceil, fmod, copysign, sqrt, pow,
sin, cos, exp, log, atan, atan2, asin, acos, sinh, cosh, tanh
(and float variants). Bit-twiddling for fabs/floor/ceil/copysign;
Newton iteration for sqrt; range-reduction + Taylor for sin/cos/
exp/log/atan; identities for asin/acos/atan2/sinh/cosh/tanh.
Accuracy is in the ~1e-6 range — good enough for typical numeric
work, far short of glibc-quality. These are slow (each call is
dozens to hundreds of soft-double libcalls) — pre-compute or
cache when possible.
- `setjmp` / `longjmp` from libgcc.s.
- Static constructors via crt0's init_array walk.
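
The C99 truncation semantics noted for snprintf above can be checked with a few lines. This is a minimal sketch against the standard C library contract, not code from this runtime; `formatTruncated` is a hypothetical helper name introduced only for illustration.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper (not part of the runtime): format `value` into
   `buf` and report whether the output had to be cut short.
   Per C99, snprintf returns the length the full output would have had
   (excluding the NUL) and always NUL-terminates when size > 0. */
static int formatTruncated(char *buf, size_t size, long value, int *needed)
{
    int n = snprintf(buf, size, "%ld", value);
    if (needed)
        *needed = n;
    /* Nonzero means an encoding error or a truncated result. */
    return n < 0 || (size_t)n >= size;
}
```

With a 4-byte buffer and the value 12345, the buffer ends up holding `"123"` while the reported length is 5, which is how a caller detects truncation.
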
**Toolchain:**
- `clang` / `llc` produce W65816 assembly + ELF object files.
- `tools/link816` resolves cross-translation-unit refs, lays out
text/rodata/bss, emits a flat binary the IIgs ROM can load.
Auto-relocates bss above text+rodata when the default
`--bss-base 0x2000` would overlap text, and skips past the
IIgs IO window ($C000-$CFFF) if needed.
- `tools/omfEmit` produces OMF v2.1 single-segment files (the IIgs's
native object format) for round-tripping with classic dev tools.
- `runtime/build.sh` builds crt0, libc, soft-float, soft-double,
libgcc into linkable objects.
- `scripts/smokeTest.sh` runs 92 end-to-end checks (scalar ops,
control flow, calling conventions, MAME execution, regressions,
link816 bss-base safety, iigs/toolbox.h compile-check).
Currently 100% pass at -O2 throughout.
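
The bss auto-relocation described for `tools/link816` can be pictured as two range-overlap checks. The sketch below is an assumption about the shape of that rule, not link816's actual code: `pickBssBase`, `overlaps`, and the parameter names are hypothetical, and only the `0x2000` default and the $C000-$CFFF window come from the text above.

```c
#include <stdint.h>

#define IO_START 0xC000u  /* start of the IIgs IO window ($C000) */
#define IO_END   0xD000u  /* first address past $CFFF */

/* Half-open range overlap test: [a0,a1) against [b0,b1). */
static int overlaps(uint32_t a0, uint32_t a1, uint32_t b0, uint32_t b1)
{
    return a0 < b1 && b0 < a1;
}

/* Hypothetical placement rule (assumed, not taken from link816):
   keep the default base unless bss would overlap the text+rodata
   image, then step past the IO window if bss would land in it. */
static uint32_t pickBssBase(uint32_t defaultBase, uint32_t bssSize,
                            uint32_t imageStart, uint32_t imageEnd)
{
    uint32_t base = defaultBase;
    if (overlaps(base, base + bssSize, imageStart, imageEnd))
        base = imageEnd;          /* relocate above text+rodata */
    if (overlaps(base, base + bssSize, IO_START, IO_END))
        base = IO_END;            /* skip past $C000-$CFFF */
    return base;
}
```

For example, under these assumptions an image spanning `0x1000..0xBF00` with a `0x400`-byte bss would first move bss to `0xBF00`, then push it to `0xD000` because it would otherwise straddle the IO window.
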
**ABI:**
- arg0 in A; arg1 in X for i32-first-arg signatures; rest pushed RTL
on the system stack with PHA. Caller deallocates via `tsc;clc;adc
#N;tcs` or `PLY*N/2`.
- Return: i8/i16 in A; i32 in A:X; i64 in A:X:Y plus DP[$F0..$F1] for
the highest 16 bits.
- Frame is empty-descending (S points to next-free); offsets account
for the +1 skew vs LLVM's full-descending model.
## In flight
One tracked bug, since resolved, followed by recent progress notes:
1. **#107 — strtok / qsort -O1+ miscompile — RESOLVED.** Three
independent issues across the backend, runtime, and linker;
all fixed.
**Fix 1 (W65816StackSlotCleanup cross-MBB):** Pass -4 /
Pass -4c collapsed `LDA fs.X; STA stk.Y; ... LDA_indY stk.Y`
patterns with only an MBB-local safety check, missing cross-MBB
readers of stk.Y. Greedy regalloc had spilled an in-place INA
result back to stk.Y; eliminating the bb.3 init store left the
bb.10 reload reading garbage. Function-wide cross-MBB check
added.
**Fix 2 (W65816SepRepCleanup LDAi8imm hoist):** Pre-pass that
relocates LDAi8imm BEFORE byte-store SEP/REP wraps. LDAi8imm
expands at AsmPrinter to its own SEP+LDA8+REP that toggles M;
the post-RA scheduler was moving it INSIDE an STBptr wrap, so
the LDAi8imm's REP fired BEFORE the byte STA. The STA then
ran in M=16, writing 2 bytes of zero and clobbering the next
byte. Hoist puts the toggle in the outer M=16 zone, leaving
the byte STA in M=8.
**Fix 3 (link816 bss-base safety + strtok_r noinline):** With
the backend fixes, -O2 strtok grew large enough that the
strtok() wrapper inlining (~290 extra bytes) pushed the
binary's text+rodata past 0xC000 (IIgs IO window). Reads of
string literals or stdio handles in that range hit IO
registers and corrupted execution. Two complementary fixes:
`__attribute__((noinline))` on `strtok_r` so the wrapper
doesn't duplicate it (-O2 strtok.o now 1564B, was 2156B);
link816 auto-relocates bss above text+rodata when default
`--bss-base 0x2000` would overlap, and skips past the IO
window if needed.
strtok.c now compiles at -O2 with everything else. Smoke
#84 (4-call strtok continuation) and #92 (recursive parser)
both pass. Workaround comments in build.sh / smokeTest.sh
removed.
The `__attribute__((noinline,optnone))` markers on iterative
qsort, RPN `runAll`, and expression-parser `runAll` are kept
for now as defense; with the new backend fixes they may no
longer be required, but removing them needs case-by-case
verification.
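
The failure mode behind Fix 2 comes down to how wide a store is when the M flag is in its 16-bit state. The following is a small host-side model, not backend code; `storeA` is a hypothetical name, and the memory array stands in for the 65816 address space.

```c
#include <stdint.h>

/* Model of STA to `addr`. With M=8 (m16 == 0) one byte is written.
   With M=16 the 65816 writes the low byte at addr and the high byte
   at addr+1, which is what clobbers the neighbouring byte when a
   byte store runs in the wrong mode. */
static void storeA(uint8_t *mem, uint16_t addr, uint16_t a, int m16)
{
    mem[addr] = (uint8_t)(a & 0xFF);
    if (m16)
        mem[(uint16_t)(addr + 1)] = (uint8_t)(a >> 8);
}
```

Storing zero to a byte at offset 1 in M=8 leaves offset 2 untouched; the same store in M=16 also zeroes offset 2, matching the "2 bytes of zero" symptom described above.
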
The W65816 backend assembler now supports all common indirect
addressing modes (`(dp)`, `(dp),Y`, `(dp,X)`, `(d,s),Y`,
`[dp]`, `[dp],Y`, and `JMP (abs)`). All `.byte` opcode hacks in
the runtime have been removed in favour of the mnemonics. The
disassembler decodes them too.
Runtime now exposes a near-complete C99 subset: sprintf/snprintf
with correct %.Nf precision, qsort/bsearch,
the full string.h family (strcat/strncat/strpbrk/strspn/strcspn/
strtok/strtok_r), math.h with thirteen common functions
(sqrt/pow/sin/cos/exp/log/atan/atan2/asin/acos/sinh/cosh/tanh),
atol/llabs/atexit/exit/abort, and a smoke test that exercises
malloc + struct pointers + strcmp/strcpy via a working hash table
end-to-end in MAME.
`strtok` / `strtok_r` live in their own TU at `-O2` (with
`__attribute__((noinline))` on `strtok_r` so the strtok() wrapper
doesn't duplicate it). Multi-call strtok over "a,b,,c" works
end-to-end in smoke. A layout-sensitive miscompile was observed
here earlier: at certain rodata layouts, -O2 strtok_r's BB0_7
inner CMP loop miscompiled (attributed at the time to a LICM/sink
interaction), and adding bytes upstream (e.g. growing
softDouble.o) could shift delim into a failing address. The
surgical workaround `-mllvm -disable-machine-sink` on strtok.c is
documented but not applied; this appears to be the same
layout-sensitive miscompile recorded as FIXED under Known issues
(the `Uses=[P]` entry), and smoke is green.
A small **RPN calculator** test (smoke #87) chains strtok, atol,
push/pop over a static stack, snprintf "%ld", and strcmp to verify
the end-to-end composition under a realistic-ish workload — adds,
subs, muls, divs, and 3-deep operand stacks all work.
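
For readers who want to see the composition that the RPN test exercises, here is a minimal sketch using the same library calls (strtok, atol, snprintf "%ld", strcmp). It is not the smoke test's source; `rpnEval` and `RPN_STACK_MAX` are hypothetical names, and the error handling is an assumption.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define RPN_STACK_MAX 8   /* hypothetical fixed stack depth */

/* Evaluate a space-separated RPN expression. strtok modifies `expr`
   in place. On success writes the result as "%ld" into `out` and
   returns 0; returns -1 on stack underflow/overflow, division by
   zero, or leftover operands. */
static int rpnEval(char *expr, char *out, size_t outSize)
{
    long stack[RPN_STACK_MAX];
    int sp = 0;

    for (char *tok = strtok(expr, " "); tok; tok = strtok(NULL, " ")) {
        int isOp = strcmp(tok, "+") == 0 || strcmp(tok, "-") == 0 ||
                   strcmp(tok, "*") == 0 || strcmp(tok, "/") == 0;
        if (!isOp) {
            if (sp == RPN_STACK_MAX)
                return -1;
            stack[sp++] = atol(tok);      /* push operand */
            continue;
        }
        if (sp < 2)
            return -1;
        long rhs = stack[--sp];           /* pop right, then left */
        long lhs = stack[--sp];
        switch (tok[0]) {
        case '+': lhs += rhs; break;
        case '-': lhs -= rhs; break;
        case '*': lhs *= rhs; break;
        default:                          /* '/' */
            if (rhs == 0)
                return -1;
            lhs /= rhs;
            break;
        }
        stack[sp++] = lhs;                /* push result */
    }
    if (sp != 1)
        return -1;
    snprintf(out, outSize, "%ld", stack[0]);
    return 0;
}
```

An input such as `"1 2 3 * +"` holds three operands on the stack before the first operator, which is the 3-deep case mentioned above.
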
**setjmp / longjmp** (smoke #88) now work end-to-end: setjmp saves
SP / 24-bit ret addr / DP, longjmp restores them and returns the
val argument as setjmp's "second return". Required two fixes:
(a) the W65816 assembler had no instruction definition for
`(dp)` / `(dp), y` / `(dp, x)` indirect addressing modes, so the
mnemonic forms silently fell through to absolute-,Y opcodes —
fixed in `src/llvm/lib/Target/W65816/W65816InstrFormats.td` +
`W65816InstrInfo.td` + `AsmParser/W65816AsmParser.cpp` (the runtime
.byte hacks have been replaced with mnemonics); (b) added
`__attribute__((returns_twice))` to the setjmp declaration so the
optimizer doesn't constant-fold post-setjmp env reads to 0.
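
The "second return" behaviour can be seen in a few lines of portable C. This sketch uses only the standard `<setjmp.h>` interface; `runGuarded`, `descend`, and `ERR_DEPTH` are hypothetical names, and it makes no claim about how smoke #88 itself is written.

```c
#include <setjmp.h>

enum { ERR_DEPTH = 7 };   /* hypothetical error code passed to longjmp */

static jmp_buf env;

/* Bails out through longjmp when the requested depth is too large. */
static int descend(int depth)
{
    if (depth > 3)
        longjmp(env, ERR_DEPTH);
    return depth;
}

/* setjmp is used as the controlling expression of a switch, one of
   the contexts the C standard permits. It returns 0 on the direct
   call and the longjmp value on the second return. */
static int runGuarded(int depth)
{
    switch (setjmp(env)) {
    case 0:
        return descend(depth);    /* direct return: do the work */
    case ERR_DEPTH:
        return -ERR_DEPTH;        /* second return, value from longjmp */
    default:
        return -1;
    }
}
```

Calling `runGuarded(2)` takes only the direct path, while `runGuarded(5)` unwinds out of `descend` and re-enters the switch with the value supplied to `longjmp`.
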
**CRC32** (smoke #89) verifies the standard "123456789" → 0xCBF43926
end-to-end — exercises uint32_t shifts, XORs, char-by-char loops.
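
A bitwise CRC-32 of the kind this check value belongs to can be sketched as follows. The reflected polynomial `0xEDB88320` with all-ones initial value and final inversion is the standard CRC-32 parameter set that yields 0xCBF43926 for "123456789"; whether the smoke test uses a bitwise or table-driven loop is not stated, so the bitwise form and the name `crc32Bitwise` are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected, polynomial 0xEDB88320, init and final
   XOR of 0xFFFFFFFF). Processes one input byte at a time, shifting
   out eight bits per byte. */
static uint32_t crc32Bitwise(const char *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint8_t)data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
    }
    return crc ^ 0xFFFFFFFFu;
}
```

The inner loop is all 32-bit shifts and XORs, which is why this workload is a useful probe of multi-word arithmetic on a 16-bit accumulator.
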
**Brainfuck interpreter** (smoke #90) executes a small bf program
and verifies the output bytes — exercises loop bracket matching,
pointer math (data pointer), branching on cell value.
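
A compact interpreter of the sort this test describes might look like the sketch below. It is not the smoke test's code: `bfRun`, `BF_TAPE`, the tape size, and the error conventions are hypothetical, and the `,` input command is deliberately not modelled.

```c
#include <stddef.h>
#include <string.h>

#define BF_TAPE 256   /* hypothetical tape size */

/* Run `prog`, collecting '.' output into `out` (NUL-terminated on
   success, at most outSize-1 bytes). Returns the number of bytes
   written, or -1 on unbalanced brackets or when the data pointer
   leaves the tape. */
static int bfRun(const char *prog, char *out, size_t outSize)
{
    unsigned char tape[BF_TAPE];
    size_t dp = 0, n = 0;

    if (outSize == 0)
        return -1;
    memset(tape, 0, sizeof tape);

    for (size_t pc = 0; prog[pc] != '\0'; pc++) {
        switch (prog[pc]) {
        case '>': if (++dp >= BF_TAPE) return -1; break;
        case '<': if (dp == 0) return -1; dp--; break;
        case '+': tape[dp]++; break;
        case '-': tape[dp]--; break;
        case '.': if (n + 1 < outSize) out[n++] = (char)tape[dp]; break;
        case '[':
            if (tape[dp] == 0) {          /* skip forward to matching ']' */
                int depth = 1;
                while (depth > 0) {
                    pc++;
                    if (prog[pc] == '\0') return -1;
                    if (prog[pc] == '[') depth++;
                    else if (prog[pc] == ']') depth--;
                }
            }
            break;
        case ']':
            if (tape[dp] != 0) {          /* jump back to matching '[' */
                int depth = 1;
                while (depth > 0) {
                    if (pc == 0) return -1;
                    pc--;
                    if (prog[pc] == ']') depth++;
                    else if (prog[pc] == '[') depth--;
                }
            }
            break;
        default:
            break;                        /* other characters are comments */
        }
    }
    out[n] = '\0';
    return (int)n;
}
```

The program `++++++++[>++++++++<-]>+.` multiplies 8 by 8 in a loop, adds one, and emits the byte 65 (`A`).
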
**Recursive-descent expression parser** (smoke #92) evaluates
"3+4", "2*3+4", "2+3*4", "(3+4)*5", "100/4-5*2+1" with proper
operator precedence and parentheses — exercises mutual recursion,
char-by-char tokenization, and integer arithmetic in concert.
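
The precedence handling comes from the shape of the grammar: expressions are sums of terms, terms are products of factors, and a factor is a number or a parenthesised expression. The sketch below shows that structure; it is not the smoke test's source, and `evalExpr`, `parseExpr`, `parseTerm`, `parseFactor`, and `cursor` are hypothetical names. Whitespace and malformed input are not handled.

```c
#include <ctype.h>

static const char *cursor;        /* current position in the input */

static long parseExpr(void);      /* forward declaration: mutual recursion */

/* factor := number | '(' expr ')' */
static long parseFactor(void)
{
    long v = 0;
    if (*cursor == '(') {
        cursor++;
        v = parseExpr();
        if (*cursor == ')')
            cursor++;
        return v;
    }
    while (isdigit((unsigned char)*cursor))
        v = v * 10 + (*cursor++ - '0');
    return v;
}

/* term := factor (('*' | '/') factor)* */
static long parseTerm(void)
{
    long v = parseFactor();
    while (*cursor == '*' || *cursor == '/') {
        char op = *cursor++;
        long rhs = parseFactor();
        if (op == '*')
            v *= rhs;
        else if (rhs != 0)        /* division by zero leaves v unchanged */
            v /= rhs;
    }
    return v;
}

/* expr := term (('+' | '-') term)* */
static long parseExpr(void)
{
    long v = parseTerm();
    while (*cursor == '+' || *cursor == '-') {
        char op = *cursor++;
        long rhs = parseTerm();
        v = (op == '+') ? v + rhs : v - rhs;
    }
    return v;
}

static long evalExpr(const char *s)
{
    cursor = s;
    return parseExpr();
}
```

Because `parseTerm` consumes `*` and `/` before control returns to `parseExpr`, "2+3*4" evaluates to 14 rather than 20, and "100/4-5*2+1" evaluates to 16.
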
The **DWARF sidecar** (`link816 --debug-out FILE`) now applies
text/rodata/bss/init_array relocations to every `.debug_*` section
before writing it. PC values in `.debug_addr` and `.debug_line` end
up as final-image addresses, so a consumer can map back to source
lines without re-running the linker. Intra-debug references (e.g.
`.debug_info` -> `.debug_str` offsets) are intentionally left
object-local — sections are concatenated, not recompacted, and each
slice carries an `; OBJ ... SEC ... SIZE ...` header so a multi-TU
consumer can scope intra-debug offsets per-slice. The smoke test
verifies the address of a known function appears in the patched
sidecar bytes.
## Known issues / workarounds
- **#70 FIXED**: greedy regalloc + W65816StackSlotCleanup Pass -2
was deleting an entry-side store to a slot that the loop body
read. Pass -2 collapses `LDAfi slotA; STAfi slotB; LDAfi slotC;
OPfi slotB` into `LDAfi slotC; OPfi slotA` (memory-to-memory copy
through A elimination), but didn't check whether slotB had other
refs in the function. In iterative qsort, slotB happened to be
the spill home for `hi` — the Pass -2 transform deleted the only
initialiser, leaving the loop body's `lda <hi-slot>, s` reading
garbage. Fix: function-wide `slotHasOtherRefs` safety check
before erasing the spill. `softDouble.c` remains a separate
hold-out for `__muldf3`'s 64×64→128 multiply (a different greedy
bug: register-pressure-driven, not spill-deletion-driven); it is
now built at -O1 in place of the earlier
`-mllvm -regalloc=fast` workaround (see the ship-ready list below).
- **(d,s),y / (sr,s),y addressing wraps the bank** when Y holds a
negative offset: Y is added as a 16-bit unsigned value, so the
effective address carries past the end of the data bank. Worked
around by `W65816NegYIndY` rewriting the affected ops to
`TAX ; LDA/STA $0000,X`, which stays correct for negative offsets
like `arr[i-1]`.
- **(d,s),y for stack-local pointer dereferences uses DBR**, so
user code that switches DBR (e.g. `pha;plb` to bank 2 to reach
IIgs hardware) must not call into a function that takes the
address of one of its locals — the callee's `*p = v` will write
to the wrong bank. Documented; no compiler-side mitigation
beyond the existing DPF0 fake-physreg routing for the i64-return
high half.
- **strtok -O2 layout-sensitive miscompile FIXED** — modelling
`Uses=[P]` on the conditional branches (BEQ/BNE/BCS/BCC/BMI/BPL/
BVS/BVC) made MachineCSE see the dependency between an earlier
CMP and the consuming Bxx, eliminating an entire class of
layout-sensitive flag-corruption bugs. Verified by sweeping
`--rodata-base` from text-end to text-end+300 in 13 increments
— every layout returns the correct strtok result.
As a follow-on, MachineCSE has been re-enabled (was previously
disabled in `W65816TargetMachine::addMachineSSAOptimization` as
a workaround for the same root cause).
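
The bank-wrap entry above is easiest to follow as address arithmetic. The model below compares the two effective-address computations on the host; it is a sketch, not backend code. `eaStackIndirectY` and `eaAbsX` are hypothetical names, and it assumes the rewritten sequence forms pointer + Y in the 16-bit accumulator before the `TAX`.

```c
#include <stdint.h>

/* (d,s),Y: the 16-bit pointer is extended with DBR to 24 bits, then
   the 16-bit Y is added as an unsigned value, so the sum can carry
   into the next bank. */
static uint32_t eaStackIndirectY(uint8_t dbr, uint16_t ptr, uint16_t y)
{
    return ((((uint32_t)dbr << 16) | ptr) + y) & 0xFFFFFFu;
}

/* Rewritten form (assumed): X = ptr + Y computed in 16 bits, so it
   wraps within the bank, followed by absolute,X with base $0000. */
static uint32_t eaAbsX(uint8_t dbr, uint16_t ptr, uint16_t y)
{
    uint16_t x = (uint16_t)(ptr + y);
    return ((uint32_t)dbr << 16) | x;
}
```

With a pointer of `0x1000` and an offset of -2 (Y = `0xFFFE`), the first form yields `0x010FFE` in the next bank, while the second yields the intended `0x000FFE`.
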
## What's still needed for a "ship-ready" toolchain
- **softDouble.c -O1 hold-out** — `__muldf3`'s 64×64→128 multiply
with inlined alignment shifts overflows the greedy register
allocator at -O2 ("ran out of registers during register
allocation"). Builds correctly at -O1 (replaces the previous
-O2 + -mllvm -regalloc=fast workaround; -O1 is smaller and
doesn't require the non-default flag).
- **More of the C standard library**: real `<stdio.h>` file I/O
(`fopen`, `fread`, `fwrite`, `fseek` are currently stubs
returning success/zero) — would need a memory-backed FS or a
MAME hook; `<locale.h>` / `<wchar.h>` if any real-world code
needs them.
- **C++ runtime support**: vtable layout for multiple inheritance,
RTTI, exceptions (or a documented `-fno-exceptions` requirement).
- **REP/SEP scheduling pass** (design doc §3.3): the current
prologue picks one M-mode for the whole function based on
whether any 8-bit accumulator value is used. A per-region
scheduler would reduce the SEP/REP wrap overhead on i8 stores.
- **Toolbox / IIgs system call bindings**: header files declaring
the Apple IIgs system calls (`SystemTask`, `WaitMouseUp`,
`DrawString`, …) with the right inline-asm dispatch glue.
- **Real-world program coverage**: the smoke tests are
microbenchmarks. A few known-good Apple IIgs C programs (e.g.
a textfile pager, a small game) compiled and run end-to-end
would catch issues no synthetic test currently exercises.
- **Cycle-time / size benchmarks vs Calypsi 5.16**: design doc §1
says the goal is to "match or exceed" Calypsi. We have neither
baseline numbers nor a comparison harness yet.