65816-llvm-mos/STATUS.md
Scott Duensing d6a34075a5 Checkpoint
2026-05-01 20:24:30 -05:00

# llvm816 — Current Status
LLVM/Clang backend for the WDC 65816 (Apple IIgs), forked from
llvm-mos as a separate `W65816` target.
## What works
End-to-end C-to-binary toolchain that produces 65816 machine code
which runs correctly under MAME (apple2gs).
**Language coverage at -O2 (no extra flags):**
- All scalar arithmetic: i8 / i16 / i32 / i64 add, sub, mul, div, mod
(signed and unsigned). Carry-chained multi-word ops via ADC/SBC pseudos
plus ASLA16 / shift libcalls.
- Comparisons and signed/unsigned widening (sext, zext, trunc) for all
the above sizes.
- Pointer arithmetic, array indexing, struct field access, struct
return-by-value (up to 8 bytes — Pair, Vec4, double).
- Bitfields, switch statements (verified up to ~12 cases + default),
function pointers, function-pointer tables, indirect calls via
`__jsl_indir` trampoline.
- Recursion: factorial, Fibonacci, depth-3 binary-tree
insert/sum/min/max, simple recursive quicksort.
- Loops with goto / break / continue, nested loops, state machines.
- `<stdarg.h>` varargs with int / long / unsigned long long mixed args.
- Heap: `malloc` / `free` (libc.c first-fit allocator) — linked-list
reverse with `cons` works.
- Strings: hand-rolled `strlen`, `strcmp`, `strcpy`, `strchr`, atoi/itoa
roundtrip.
- Soft-float (single): all four ops + comparisons, MAME-verified.
- Soft-double: add, sub, mul, div all match gcc bit-for-bit under
round-to-nearest-even; a 3-iteration Newton sqrt converges.
Long-running iterations may hit MAME's 1-second sim-time budget
(a test-configuration issue, not a compiler bug).
- Inline assembly with `"a"`, `"x"`, `"y"` register constraints and
arbitrary opcode bytes (used for the `pha;plb` bank-switch idiom).
- C++ minimal: clang++ compiles a class with virtual + non-trivial
ctor (vtable + RTTI omitted; no exceptions).
- printf with `%d %x %s %c %p` and width/precision specifiers.
- sprintf / snprintf / vsprintf / vsnprintf with the same format
coverage as printf (`%d %u %x %ld %lu %s %c %f %p %%` + width).
C99 truncation semantics for snprintf. `%.Nf` produces the
correct fractional digits with round-half-up.
- qsort + bsearch over arbitrary element size with a user `cmp`
callback (insertion-sort variant — sidesteps the greedy regalloc
bug in the recursive iterative-qsort form).
- Standard string/stdlib glue: strcat, strncat, strpbrk, strspn,
strcspn, atol, llabs (kept in their own translation unit so
vprintf's branch layout doesn't shift).
- `<math.h>`: fabs, floor, ceil, fmod, copysign, sqrt, pow,
sin, cos, exp, log, atan, atan2, asin, acos, sinh, cosh, tanh
(and float variants). Bit-twiddling for fabs/floor/ceil/copysign;
Newton iteration for sqrt; range-reduction + Taylor for sin/cos/
exp/log/atan; identities for asin/acos/atan2/sinh/cosh/tanh.
Accuracy is in the ~1e-6 range — good enough for typical numeric
work, far short of glibc-quality. These are slow (each call is
dozens to hundreds of soft-double libcalls) — pre-compute or
cache when possible.
- `setjmp` / `longjmp` from libgcc.s.
- Static constructors via crt0's init_array walk.
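
The snprintf truncation semantics listed above can be illustrated with a minimal sketch written against the standard `<stdio.h>` interface, not against this runtime's source. The helper names `format_long` and `format_fixed2` are hypothetical. The point is that the return value is the length the full output would have needed, while the buffer always holds a NUL-terminated prefix.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: formats a long into a caller-supplied buffer.
   Per C99, snprintf returns the number of characters the complete
   output would have needed (excluding the NUL), even when truncated. */
int format_long(char *dst, size_t cap, long value)
{
    return snprintf(dst, cap, "%ld", value);
}

/* Hypothetical helper: formats a double with two fractional digits.
   The sample value used with it avoids exact half-way cases, so the
   result does not depend on the rounding rule. */
int format_fixed2(char *dst, size_t cap, double value)
{
    return snprintf(dst, cap, "%.2f", value);
}
```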
**Toolchain:**
- `clang` / `llc` produce W65816 assembly + ELF object files.
- `tools/link816` resolves cross-translation-unit refs, lays out
text/rodata/bss, emits a flat binary the IIgs ROM can load.
Auto-relocates bss above text+rodata when the default
`--bss-base 0x2000` would overlap text, and skips past the
IIgs IO window ($C000-$CFFF) if needed.
- `tools/omfEmit` produces OMF v2.1 single-segment files (the IIgs's
native object format) for round-tripping with classic dev tools.
- `runtime/build.sh` builds crt0, libc, soft-float, soft-double,
libgcc into linkable objects.
- `scripts/smokeTest.sh` runs 99 end-to-end checks (scalar ops,
control flow, calling conventions, MAME execution, regressions,
link816 bss-base safety, iigs/toolbox.h compile-check, standalone
runtime headers, AsmPrinter peepholes for STZ / PEA / PEI —
single-STA, shared-LDA-multi-STA, and DPF0-forwarding cases).
Currently 100% pass at -O2 throughout.
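
The link816 BSS placement behaviour described above can be sketched roughly as follows. This is an illustration under assumptions, not link816's actual code: the function name `place_bss`, its parameters, and the zero-on-failure convention are hypothetical. The $C000-$CFFF IO window and the $E000 ceiling are the figures quoted elsewhere in this file.

```c
#include <stdint.h>

#define IO_START 0xC000u   /* start of the IIgs IO window */
#define IO_END   0xD000u   /* first address past $CFFF */
#define LC1_CEIL 0xE000u   /* LC1 ceiling quoted in this file */

/* Hypothetical sketch: pick a BSS base for an image occupying
   [image_start, image_end). Returns the base, or 0 when BSS would
   run past the LC1 ceiling (where the real linker hard-fails). */
uint32_t place_bss(uint32_t image_start, uint32_t image_end,
                   uint32_t bss_size, uint32_t default_base)
{
    uint32_t base = default_base;

    /* Relocate above text+rodata if the default base overlaps it. */
    if (base < image_end && base + bss_size > image_start)
        base = image_end;

    /* Skip past the IO window if any BSS byte would land inside it. */
    if (base < IO_END && base + bss_size > IO_START)
        base = IO_END;

    /* Refuse layouts that exceed the ceiling instead of corrupting. */
    if (base + bss_size > LC1_CEIL)
        return 0;

    return base;
}
```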
**ABI:**
- arg0 in A; arg1 in X for i32-first-arg signatures; rest pushed
right-to-left on the system stack with PHA. Caller deallocates via
`tsc;clc;adc #N;tcs` or `PLY*N/2`.
- Return: i8/i16 in A; i32 in A:X; i64 in A:X:Y plus DP[$F0..$F1] for
the highest 16 bits.
- Frame is empty-descending (S points to next-free); offsets account
for the +1 skew vs LLVM's full-descending model.
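
The i64 return convention can be pictured as a split into four 16-bit pieces. The sketch below assumes A carries the lowest 16 bits and that X and Y ascend from there; only the placement of the highest 16 bits in DP[$F0..$F1] comes from this file. The struct and function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical model of where each 16-bit slice of an i64 return
   value lives. The A/X/Y ordering is an assumption. */
struct Ret64 {
    uint16_t a;     /* bits  0..15 (assumed) */
    uint16_t x;     /* bits 16..31 (assumed) */
    uint16_t y;     /* bits 32..47 (assumed) */
    uint16_t dpf0;  /* bits 48..63, DP[$F0..$F1] per this file */
};

struct Ret64 split_i64(uint64_t v)
{
    struct Ret64 r;
    r.a    = (uint16_t)(v & 0xFFFFu);
    r.x    = (uint16_t)((v >> 16) & 0xFFFFu);
    r.y    = (uint16_t)((v >> 32) & 0xFFFFu);
    r.dpf0 = (uint16_t)((v >> 48) & 0xFFFFu);
    return r;
}

uint64_t join_i64(struct Ret64 r)
{
    return (uint64_t)r.a
         | ((uint64_t)r.x << 16)
         | ((uint64_t)r.y << 32)
         | ((uint64_t)r.dpf0 << 48);
}
```

A caller-side view would reassemble the value with `join_i64` after the call returns.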
## In flight
One tracked bug, now resolved:
1. **#107 — strtok / qsort -O1+ miscompile — RESOLVED.** Three
independent issues across the backend, runtime, and linker;
all fixed.

**Fix 1 (W65816StackSlotCleanup cross-MBB):** Pass -4 /
Pass -4c collapsed `LDA fs.X; STA stk.Y; ... LDA_indY stk.Y`
patterns with only an MBB-local safety check, missing cross-MBB
readers of stk.Y. Greedy regalloc had spilled an in-place INA
result back to stk.Y; eliminating the bb.3 init store left the
bb.10 reload reading garbage. Function-wide cross-MBB check
added.

**Fix 2 (W65816SepRepCleanup LDAi8imm hoist):** Pre-pass that
relocates LDAi8imm BEFORE byte-store SEP/REP wraps. LDAi8imm
expands at AsmPrinter to its own SEP+LDA8+REP that toggles M;
the post-RA scheduler was moving it INSIDE an STBptr wrap, so
the LDAi8imm's REP fired BEFORE the byte STA. The STA then
ran in M=16, writing 2 bytes of zero and clobbering the next
byte. Hoist puts the toggle in the outer M=16 zone, leaving
the byte STA in M=8.

**Fix 3 (link816 bss-base safety + strtok_r noinline):** With
the backend fixes, -O2 strtok grew large enough that the
strtok() wrapper inlining (~290 extra bytes) pushed the
binary's text+rodata past 0xC000 (IIgs IO window). Reads of
string literals or stdio handles in that range hit IO
registers and corrupted execution. Two complementary fixes:
`__attribute__((noinline))` on `strtok_r` so the wrapper
doesn't duplicate it (-O2 strtok.o now 1564B, was 2156B);
link816 auto-relocates bss above text+rodata when default
`--bss-base 0x2000` would overlap, and skips past the IO
window if needed.

strtok.c now compiles at -O2 with everything else. Smoke
#84 (4-call strtok continuation) and #92 (recursive parser)
both pass. Workaround comments in build.sh / smokeTest.sh
removed.

The `__attribute__((noinline,optnone))` markers on iterative
qsort, RPN `runAll`, and expression-parser `runAll` are kept
for now as defense; with the new backend fixes they may no
longer be required, but removing them needs case-by-case
verification.

The W65816 backend assembler now supports all common indirect
addressing modes (`(dp)`, `(dp),Y`, `(dp,X)`, `(d,s),Y`,
`[dp]`, `[dp],Y`, and `JMP (abs)`). All `.byte` opcode hacks in
the runtime have been removed in favour of the mnemonics. The
disassembler decodes them too.

The runtime now exposes a fairly broad C99 library subset:
sprintf/snprintf with correct %.Nf precision, qsort/bsearch,
the string.h family (strcat/strncat/strpbrk/strspn/strcspn/
strtok/strtok_r), math.h with thirteen common functions
(sqrt/pow/sin/cos/exp/log/atan/atan2/asin/acos/sinh/cosh/tanh),
atol/llabs/atexit/exit/abort, and a smoke test that exercises
malloc + struct pointers + strcmp/strcpy via a working hash table
end-to-end in MAME.

`strtok` / `strtok_r` live in their own TU at `-O2` (with
`__attribute__((noinline))` on `strtok_r` so the strtok() wrapper
doesn't duplicate it). Multi-call strtok over "a,b,,c" works
end-to-end in smoke. The layout-sensitive miscompile that
previously haunted strtok_r's inner CMP loop has been fixed by
modelling `Uses=[P]` on the conditional branches (the LICM/sink
interaction that elided "redundant" CMPs no longer fires); no
surgical workaround flags needed.

A small **RPN calculator** test (smoke #87) chains strtok, atol,
push/pop over a static stack, snprintf "%ld", and strcmp to verify
the end-to-end composition under a realistic-ish workload — adds,
subs, muls, divs, and 3-deep operand stacks all work.
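
The composition that the RPN test exercises can be sketched as below. This is not the smoke test's source: the names `rpn_eval` and `rpn_format`, the stack depth, and the error convention are hypothetical. It only shows how strtok, atol, a static stack, strcmp, and snprintf "%ld" fit together.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define RPN_STACK_MAX 16

static long rpn_stack[RPN_STACK_MAX];
static int  rpn_top;

static int rpn_push(long v)
{
    if (rpn_top >= RPN_STACK_MAX) return -1;
    rpn_stack[rpn_top++] = v;
    return 0;
}

static int rpn_pop(long *out)
{
    if (rpn_top <= 0) return -1;
    *out = rpn_stack[--rpn_top];
    return 0;
}

/* Evaluates a space-separated RPN expression. strtok modifies `line`
   in place. Returns 0 and stores the result, or -1 on any error. */
int rpn_eval(char *line, long *result)
{
    rpn_top = 0;
    for (char *tok = strtok(line, " "); tok != NULL; tok = strtok(NULL, " ")) {
        if (strcmp(tok, "+") == 0 || strcmp(tok, "-") == 0 ||
            strcmp(tok, "*") == 0 || strcmp(tok, "/") == 0) {
            long rhs, lhs, v;
            if (rpn_pop(&rhs) != 0 || rpn_pop(&lhs) != 0) return -1;
            switch (tok[0]) {
            case '+': v = lhs + rhs; break;
            case '-': v = lhs - rhs; break;
            case '*': v = lhs * rhs; break;
            default:                      /* '/' */
                if (rhs == 0) return -1;
                v = lhs / rhs;
                break;
            }
            if (rpn_push(v) != 0) return -1;
        } else if (rpn_push(atol(tok)) != 0) {
            return -1;
        }
    }
    if (rpn_top != 1) return -1;          /* leftover or missing operands */
    *result = rpn_stack[0];
    return 0;
}

/* Evaluates `line` and prints the result with "%ld" into `out`.
   Returns the snprintf length, or -1 if evaluation failed. */
int rpn_format(char *line, char *out, size_t cap)
{
    long v;
    if (rpn_eval(line, &v) != 0) return -1;
    return snprintf(out, cap, "%ld", v);
}
```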
**setjmp / longjmp** (smoke #88) now work end-to-end: setjmp saves
SP / 24-bit ret addr / DP, longjmp restores them and returns the
val argument as setjmp's "second return". Required two fixes:
(a) the W65816 assembler had no instruction definition for
`(dp)` / `(dp), y` / `(dp, x)` indirect addressing modes, so the
mnemonic forms silently fell through to absolute,Y opcodes;
fixed in `src/llvm/lib/Target/W65816/W65816InstrFormats.td` +
`W65816InstrInfo.td` + `AsmParser/W65816AsmParser.cpp` (the runtime
.byte hacks have been replaced with mnemonics); (b) added
`__attribute__((returns_twice))` to the setjmp declaration so the
optimizer doesn't constant-fold post-setjmp env reads to 0.
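
The "second return" behaviour can be shown with a small standard-C sketch. It uses only `<setjmp.h>` as specified by the C standard; the function names and error codes are hypothetical and are not taken from smoke #88. The `setjmp` call sits directly in a `switch` controlling expression, which is one of the contexts the standard permits.

```c
#include <setjmp.h>

enum { ERR_NONE = 0, ERR_PARSE = 1, ERR_RANGE = 2 };

static jmp_buf recover_point;

/* Unwinds to the most recent setjmp(recover_point); never returns. */
static void fail_with(int code)
{
    longjmp(recover_point, code);
}

/* Returns ERR_NONE when nothing fails. Otherwise setjmp "returns"
   a second time with the value handed to longjmp. */
int run_guarded(int fail_code)
{
    switch (setjmp(recover_point)) {
    case 0:                         /* first, direct return */
        if (fail_code != 0)
            fail_with(fail_code);
        return ERR_NONE;
    case ERR_PARSE:
        return ERR_PARSE;
    case ERR_RANGE:
        return ERR_RANGE;
    default:                        /* any other longjmp value */
        return -1;
    }
}
```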
**CRC32** (smoke #89) verifies the standard "123456789" → 0xCBF43926
end-to-end — exercises uint32_t shifts, XORs, char-by-char loops.
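
For reference, a bitwise CRC-32 that produces this check value looks roughly like the sketch below. It is not the smoke test's source; the function name is hypothetical. It uses the common reflected polynomial 0xEDB88320 with an initial value and final XOR of 0xFFFFFFFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise (table-free) CRC-32, reflected form. One shift and an
   optional XOR per input bit, so it leans on 32-bit shifts and XORs. */
uint32_t crc32_bitwise(const void *data, size_t len)
{
    const uint8_t *p = (const uint8_t *)data;
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 1u)
                crc = (crc >> 1) ^ 0xEDB88320u;
            else
                crc >>= 1;
        }
    }
    return crc ^ 0xFFFFFFFFu;
}
```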
**Brainfuck interpreter** (smoke #90) executes a small bf program
and verifies the output bytes — exercises loop bracket matching,
pointer math (data pointer), branching on cell value.
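
A minimal interpreter of the kind this test describes might look like the sketch below. It is an illustration, not the smoke test's code: `bf_run`, the 256-cell tape, and the decision to ignore `,` (input) are assumptions. Bracket matching is done by scanning with a depth counter.

```c
#include <stddef.h>
#include <string.h>

#define BF_TAPE_SIZE 256

/* Runs `prog` with no input and stores output bytes in `out` (up to
   `cap`). Returns the number of bytes stored, or -1 on unbalanced
   brackets or a data pointer that leaves the tape. */
int bf_run(const char *prog, unsigned char *out, size_t cap)
{
    unsigned char tape[BF_TAPE_SIZE];
    size_t dp = 0, n = 0;
    size_t len = strlen(prog);

    memset(tape, 0, sizeof tape);

    for (size_t pc = 0; pc < len; pc++) {
        switch (prog[pc]) {
        case '>': if (++dp >= BF_TAPE_SIZE) return -1; break;
        case '<': if (dp == 0) return -1; dp--; break;
        case '+': tape[dp]++; break;
        case '-': tape[dp]--; break;
        case '.': if (n < cap) out[n++] = tape[dp]; break;
        case '[':
            if (tape[dp] == 0) {          /* skip forward to matching ']' */
                int depth = 1;
                while (depth > 0) {
                    if (++pc >= len) return -1;
                    if (prog[pc] == '[') depth++;
                    else if (prog[pc] == ']') depth--;
                }
            }
            break;
        case ']':
            if (tape[dp] != 0) {          /* jump back to matching '[' */
                int depth = 1;
                while (depth > 0) {
                    if (pc == 0) return -1;
                    pc--;
                    if (prog[pc] == ']') depth++;
                    else if (prog[pc] == '[') depth--;
                }
            }
            break;
        default:                          /* ',' and comments are ignored */
            break;
        }
    }
    return (int)n;
}
```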
**Recursive-descent expression parser** (smoke #92) evaluates
"3+4", "2*3+4", "2+3*4", "(3+4)*5", "100/4-5*2+1" with proper
operator precedence and parentheses — exercises mutual recursion,
char-by-char tokenization, and integer arithmetic in concert.
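
The precedence handling can be sketched with the classic expr / term / factor split. This is a minimal illustration, not the smoke test's source: the function names are hypothetical, whitespace is not handled, and division by zero simply leaves the left operand unchanged.

```c
#include <ctype.h>

static const char *cursor;          /* next unread character */

static long parse_expr(void);

/* factor := number | '(' expr ')' */
static long parse_factor(void)
{
    long v = 0;
    if (*cursor == '(') {
        cursor++;                   /* consume '(' */
        v = parse_expr();           /* mutual recursion */
        if (*cursor == ')')
            cursor++;               /* consume ')' */
        return v;
    }
    while (isdigit((unsigned char)*cursor))
        v = v * 10 + (*cursor++ - '0');
    return v;
}

/* term := factor (('*' | '/') factor)* */
static long parse_term(void)
{
    long v = parse_factor();
    while (*cursor == '*' || *cursor == '/') {
        char op = *cursor++;
        long rhs = parse_factor();
        if (op == '*')
            v *= rhs;
        else if (rhs != 0)
            v /= rhs;
    }
    return v;
}

/* expr := term (('+' | '-') term)* */
static long parse_expr(void)
{
    long v = parse_term();
    while (*cursor == '+' || *cursor == '-') {
        char op = *cursor++;
        long rhs = parse_term();
        v = (op == '+') ? v + rhs : v - rhs;
    }
    return v;
}

long eval_expression(const char *text)
{
    cursor = text;
    return parse_expr();
}
```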
The **DWARF sidecar** (`link816 --debug-out FILE`) now applies
text/rodata/bss/init_array relocations to every `.debug_*` section
before writing it. PC values in `.debug_addr` and `.debug_line` end
up as final-image addresses, so a consumer can map back to source
lines without re-running the linker. Intra-debug references (e.g.
`.debug_info` -> `.debug_str` offsets) are intentionally left
object-local — sections are concatenated, not recompacted, and each
slice carries an `; OBJ ... SEC ... SIZE ...` header so a multi-TU
consumer can scope intra-debug offsets per-slice. The smoke test
verifies the address of a known function appears in the patched
sidecar bytes.
## Known issues / workarounds
- **(d,s),y / (sr,s),y addressing wraps the bank** when Y holds a
negative offset, which the CPU reads as a large 16-bit unsigned
value. Worked around by `W65816NegYIndY`, which rewrites the affected
ops to `TAX ; LDA/STA $0000,X`, so negative offsets like `arr[i-1]`
stay correct.
- **(d,s),y for stack-local pointer dereferences uses DBR**, so
user code that switches DBR (e.g. `pha;plb` to bank 2 to reach
IIgs hardware) must not call into a function that takes the
address of one of its locals — the callee's `*p = v` will write
to the wrong bank. Documented; no compiler-side mitigation
beyond the existing DPF0 fake-physreg routing for the i64-return
high half. Workaround: inline pointer-arg helpers so the writes
stay in the caller's frame using stack-rel direct stores. The
W65816 has only a few DBR-independent addressing modes that can
reach arbitrary memory (abs_long, abs_long,X, [dp], [dp],Y), none
cheap to retrofit into the current pointer-deref lowering
(+5 bytes minimum per access).
A real fix needs PHB/PLB at noinline-pointer-callee entry/exit.
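
The first issue in the list above comes down to address arithmetic, which the sketch below models. It assumes the effective address of an indirect-indexed access is the 24-bit sum of DBR:pointer and Y, and that the workaround adds pointer and offset in 16 bits before an absolute,X access from base $0000. The function names are hypothetical and this is a model, not the backend's code.

```c
#include <stdint.h>

/* Assumed model of (d,s),y: Y is added as an unsigned 16-bit value
   to the 24-bit address DBR:ptr, so a carry spills into the bank. */
uint32_t ea_indirect_indexed_y(uint8_t dbr, uint16_t ptr, uint16_t y)
{
    uint32_t base = ((uint32_t)dbr << 16) | ptr;
    return (base + y) & 0xFFFFFFu;
}

/* Assumed model of the workaround: form ptr+offset in 16 bits
   (wrapping inside the 64 KiB bank), then access $0000,X. */
uint32_t ea_sum_then_abs_x(uint8_t dbr, uint16_t ptr, uint16_t offset)
{
    uint16_t x = (uint16_t)(ptr + offset);
    return (((uint32_t)dbr << 16) + x) & 0xFFFFFFu;
}
```

With `ptr = 0x1000` and an offset of -2 (0xFFFE), the first model lands in the next bank while the second stays at 0x0FFE in the original bank.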
## Recently fixed
- **#70 — iterative qsort -O2 miscompile** — `W65816StackSlotCleanup`
Pass -2 was deleting a store to a slot the loop body read.
Function-wide `slotHasOtherRefs` safety check added (Pass -1 and
Pass -2c hardened with the same pattern). Iterative qsort at
plain -O2 + greedy now compiles correctly; the `optnone` workaround
in smoke #70 was removed.
- **strtok -O2 layout-sensitive miscompile** — modelling `Uses=[P]`
on the conditional branches (BEQ/BNE/BCS/BCC/BMI/BPL/BVS/BVC) made
MachineCSE / scheduler / LICM / sink see the CMP→Bxx flag
dependency. An entire class of layout-sensitive flag-corruption
bugs went away; verified by sweeping `--rodata-base` from text-end
to text-end+300 in 13 increments — every layout returns the correct
strtok result. As a follow-on, MachineCSE has been re-enabled
(was previously disabled in
`W65816TargetMachine::addMachineSSAOptimization` as a workaround
for the same root cause).
- **link816 silently produced 4.3GB binaries** when `--rodata-base`
was set inside the text region. Now dies with a clear error:
`--rodata-base 0xX overlaps text 0xY+N (must start at or after 0xZ)`.
- **link816 BSS-relocate landed in IIgs Language Card area** —
when text+rodata grew past $C000, link816 placed BSS at $D000
(the LC1 area), where IIgs-by-default maps ROM (writes drop
silently, reads return ROM bytes). Globals never initialised;
caught by the expression-parser smoke (#92) when adding rand /
strnlen / etc. pushed the runtime past that threshold. Two-part
fix: crt0 now enables LC1 RAM via the standard `lda $C083`
read-twice trick at startup, and link816 hard-fails (rather
than silently corrupt) if BSS would exceed the LC1 ceiling
($E000) — past that you'd need crt0 to also enable LC2 / shadow
RAM, which we haven't wired up.
- **STZ peephole multi-STA latent miscompile** — AsmPrinter's
`LDA #0; STA $g` -> `STZ $g` peephole eliminated the LDA but
only consumed the FIRST `STA`. When SDAG-CSE shared one
`LDA #0` across multiple `STA`s (`g16=0; g32=0;` is one IR
shape), trailing `STA`s read whatever was in A on entry —
silently corrupting any global where A wasn't 0 at function
entry. Smoke happened to pass because A was 0 by luck in
every covered path. Fixed by gating the peephole on the
consuming `STA` killing A (regalloc only sets `killed` on the
last reader); smoke #98 added to lock the multi-STA case.
- **PEI AsmPrinter peephole** — new: `LDA $dp; PHA` -> `PEI $dp`
saves 1 byte and avoids touching A. Fires on the
`copyPhysReg(A=DPF0); PUSH16` pattern (i64-libcall return-value
forwarding into the next call's stacked args), which appears
in every chained soft-double / soft-int64 expression. Saves
68 bytes across the runtime (-64 in math.o alone). Same
next-instruction-modifies-A safety check as the PEA peephole.
Smoke #99 added.
- **PEA peephole opcode-allowlist replaced with `modifiesRegister`** —
the next-after-PUSH16 check that gates the PEA peephole was a
hand-curated list of opcodes that obviously redefine A; switched
to `MachineInstr::modifiesRegister(A, TRI)` which also catches
implicit-defs (e.g. JSL clobbering A as part of the call ABI).
Saves a few bytes and is more robust.
- **libgcc.s `lda #0; sta $XX` -> `stz $XX`** — 7 sites converted
in libgcc.s after STZ landed in the assembler. Saves 28 bytes;
also removes two PHA/PLA save-restore wraps around the LDA #0
(STZ doesn't touch A, so the wraps are unnecessary).
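
The STZ multi-STA fix in the list above can be modelled with a toy instruction stream. The sketch below is plain C, not LLVM code: the `Insn` struct, the `kills_a` flag standing in for the `killed` operand flag, and both functions are hypothetical. It shows why the rewrite is only safe when the `STA` following `LDA #0` is the last reader of A.

```c
#include <stddef.h>
#include <stdint.h>

enum Op { OP_LDA_IMM, OP_STA, OP_STZ };

struct Insn {
    enum Op  op;
    uint16_t operand;   /* immediate for LDA, slot index for STA/STZ */
    int      kills_a;   /* STA only: nonzero if it is A's last reader */
};

/* Rewrites `LDA #0; STA slot` to `STZ slot` in place, but only when
   the STA kills A. Returns the new instruction count. */
size_t stz_peephole(struct Insn *code, size_t n)
{
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        if (code[i].op == OP_LDA_IMM && code[i].operand == 0 &&
            i + 1 < n && code[i + 1].op == OP_STA && code[i + 1].kills_a) {
            struct Insn stz = { OP_STZ, code[i + 1].operand, 0 };
            code[out++] = stz;
            i++;                    /* the STA has been consumed */
            continue;
        }
        code[out++] = code[i];
    }
    return out;
}

/* Tiny interpreter: A starts as `a`, stores go to `mem`. */
void run_insns(const struct Insn *code, size_t n, uint16_t a, uint16_t *mem)
{
    for (size_t i = 0; i < n; i++) {
        switch (code[i].op) {
        case OP_LDA_IMM: a = code[i].operand;        break;
        case OP_STA:     mem[code[i].operand] = a;   break;
        case OP_STZ:     mem[code[i].operand] = 0;   break;
        }
    }
}
```

With a shared `LDA #0` feeding two stores, the first `STA` does not kill A, so the sequence is left alone and both slots still receive zero regardless of A's value on entry.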
## What's still needed for a "ship-ready" toolchain
- **softDouble.c -O1 hold-out** — `__muldf3`'s u64 lifetime pressure
overflows the greedy register allocator at -O2 ("ran out of
registers during register allocation"). Builds correctly at
-O1. Investigated: marking dpack noinline reduces pressure but
isn't enough; making dclass noinline would unblock -O2 (verified)
but the (d,s),y-uses-DBR bug then corrupts dclass's pointer-arg
writes when a caller has switched DBR (caught by smoke's
dmul-after-bank-switch test). Real fix is gated on the broader
DBR-pointer-deref limitation listed above.
- **More of the C standard library**: real `<stdio.h>` file I/O
(`fopen`, `fread`, `fwrite`, `fseek` are currently stubs
returning success/zero) — would need a memory-backed FS or a
MAME hook. `<locale.h>` / `<signal.h>` are stubbed (compile and
return safe defaults); `<wchar.h>` / `<time.h>` mostly absent.
- **C++ runtime support**: vtable layout for multiple inheritance,
RTTI, exceptions (or a documented `-fno-exceptions` requirement).
- **REP/SEP scheduling pass** (design doc §3.3): the current
prologue picks one M-mode for the whole function based on
whether any 8-bit accumulator value is used. A per-region
scheduler would reduce the SEP/REP wrap overhead on i8 stores.
- **Toolbox / IIgs system call bindings**: header files declaring
the Apple IIgs system calls (`SystemTask`, `WaitMouseUp`,
`DrawString`, …) with the right inline-asm dispatch glue.
- **Real-world program coverage**: the smoke tests are
microbenchmarks. A few known-good Apple IIgs C programs (e.g.
a textfile pager, a small game) compiled and run end-to-end
would catch issues no synthetic test currently exercises.
- **Cycle-time / size benchmarks vs Calypsi 5.16**: design doc §1
says the goal is to "match or exceed" Calypsi. We have neither
baseline numbers nor a comparison harness yet.