Scott Duensing 2026-05-14 11:23:00 -05:00
parent 42f0d16d07
commit 6bff7bea3f
18 changed files with 2100 additions and 115 deletions

README.md (new file, +102 lines)

@ -0,0 +1,102 @@
# llvm816
LLVM/Clang C compiler for the WDC 65816 / Apple IIgs.
Compiles C (and a minimal subset of C++) to native 65816 machine code,
links to a relocatable OMF binary, and runs under MAME's apple2gs.
Speed-tuned: matches or beats hand-written 65816 assembly on the
tight loops in benchmarks like sumOfSquares, popcount, and strcpy.
## What you get
- **`clang --target=w65816`** — full C99 + parts of C11, optimized at
`-O2` by default. Soft-float and soft-double included.
- **C standard library subset**: `stdio.h`, `stdlib.h`, `string.h`,
`math.h`, `time.h`, `setjmp.h`, etc. See
[`runtime/include/`](runtime/include/) for the complete list.
- **`link816`** — relocating linker producing GS/OS-loadable OMF
binaries (single- or multi-segment).
- **MAME integration scripts** — compile, link, and run a program
under MAME's apple2gs with one command.
- **Apple IIgs Toolbox bindings**: `<iigs/toolbox_full.h>` exposes
~1300 toolbox routines from 35 tool sets.
## Quick start
After installation (see [docs/INSTALL.md](docs/INSTALL.md)):
```bash
# Compile a C file
cat > hello.c <<'EOF'
__attribute__((noinline)) void switchToBank2(void) {
__asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
}
int main(void) {
unsigned short x = 0;
for (int i = 1; i <= 10; i++) x += i; // x = 55
switchToBank2();
*(volatile unsigned short *)0x5000 = x;
while (1) {}
}
EOF
# Build + run under MAME (writes 0x0037 to $025000, MAME displays it)
./tools/llvm-mos-build/bin/clang --target=w65816 -O2 -c hello.c -o hello.o
./tools/link816 -o hello.bin --text-base 0x1000 \
runtime/crt0.o runtime/libc.o runtime/libgcc.o hello.o
bash scripts/runInMame.sh hello.bin --check 0x025000=0037
```
See [docs/USAGE.md](docs/USAGE.md) for a full walkthrough including
multi-segment builds and the Apple IIgs Toolbox.
## Project layout
```
runtime/ C standard library + crt0 startup
src/ sources (C and .s)
include/ headers
*.o built object files
src/ our LLVM/Clang sources (W65816 target backend)
clang/ clang patches
llvm/ LLVM patches + W65816 target
link816/ relocating linker
patches/ patches against vanilla llvm-mos
scripts/ install scripts, MAME runners, benchmarks
tools/ installed compilers, MAME, ROMs, Calypsi (reference)
benchmarks/ cycle-count and instruction-count benchmarks
compare/ side-by-side asm vs Calypsi
docs/ documentation: INSTALL.md, USAGE.md, design notes
```
## Status
Stable enough to build real programs. Current quality vs commercial
Calypsi 5.16 (lower is better):
| Benchmark | Our cyc/call | Calypsi cyc/call (approx) |
|---|---|---|
| sumOfSquares(50) | 16709 | ~16000 |
| popcount(0x12345678) | 2864 | ~2500 |
| memcmp(eq, 5) | 989 | ~700 |
| bsearch(arr, 8, 5) | 767 | ~600 |
Static-size for the canonical `sumSquares` benchmark: 27 inst (ours)
vs 31 inst (Calypsi), a ratio of **0.87×**.
See [STATUS.md](STATUS.md) for full language and runtime feature
coverage, and [LLVM_65816_DESIGN.md](LLVM_65816_DESIGN.md) for
backend internals.
## Documentation
- [docs/INSTALL.md](docs/INSTALL.md) — system requirements and install
steps
- [docs/USAGE.md](docs/USAGE.md) — compile, link, run, debug
- [STATUS.md](STATUS.md) — current language/runtime support matrix
- [LLVM_65816_DESIGN.md](LLVM_65816_DESIGN.md) — backend design notes
## License
Apache 2.0 (matching the LLVM project's license). See
`tools/llvm-mos/LICENSE.TXT` after install.


@ -247,8 +247,8 @@ which runs correctly under MAME (apple2gs).
- `scripts/benchCyclesPrecise.sh` measures per-call cycle counts
via MAME's emulated time counter. Eight benchmarks under
`benchmarks/`. Current numbers (after W65816StackSlotMerge):
popcount 3376, bsearch 852, memcmp 1091, strcpy 2387,
dotProduct 2302, fib(10) 12617, sumOfSquares 17391. Speed is
popcount 2864, bsearch 767, memcmp 989, strcpy 2216,
dotProduct 2131, fib(10) 12617, sumOfSquares 16709. Speed is
the optimization priority, not size.
- `compare/` holds three side-by-side C tests with our asm and
@ -257,10 +257,10 @@ which runs correctly under MAME (apple2gs).
recompiles each under both `clang --target=w65816 -O2 -S` and
`cc65816 --speed -O 2 --64bit-doubles` and prints an
ours/Calypsi instruction-count ratio. Current ratios (post
W65816StackSlotMerge Phase 5/6 + extracted Phase 6/6a per-MBB
peepholes + Pass 1c PHP-wrap CMP elim for SP-rel functions):
sumSquares 1.81x (56 inst), evalAt 2.10x (534 inst), mul16to32
2.25x (9 inst). See `compare/README.md`.
StackRelToImg 9-phase pipeline including saturating-max preheader
elimination): sumSquares **0.87×** (27 inst — we beat Calypsi's
31), evalAt 2.10× (534 inst), mul16to32 **1.50×** (6 inst).
See `compare/README.md`.
**Backend register allocation:**


@ -1,7 +1,7 @@
###############################################################################
# #
# Calypsi ISO C compiler for 65816 version 5.16 #
# 13/May/2026 20:52:21 #
# 14/May/2026 11:06:07 #
# Command line: --speed -O 2 --64bit-doubles evalAt.c -o #
# /tmp/evalAt.calypsi.elf --list-file evalAt.calypsi.lst #
# #


@ -1,7 +1,7 @@
###############################################################################
# #
# Calypsi ISO C compiler for 65816 version 5.16 #
# 13/May/2026 20:52:21 #
# 14/May/2026 11:06:07 #
# Command line: --speed -O 2 --64bit-doubles mul16to32.c -o #
# /tmp/mul16to32.calypsi.elf --list-file #
# mul16to32.calypsi.lst #


@ -6,12 +6,9 @@ mul16to32: ; @mul16to32
; %bb.0: ; %entry
rep #0x30
pha
pha
lda 0x8, s
lda 0x6, s
jsl __umulhisi3
ply
sta 0x1, s
ply
rtl
.Lfunc_end0:
.size mul16to32, .Lfunc_end0-mul16to32


@ -1,7 +1,7 @@
###############################################################################
# #
# Calypsi ISO C compiler for 65816 version 5.16 #
# 13/May/2026 20:52:21 #
# 14/May/2026 11:06:07 #
# Command line: --speed -O 2 --64bit-doubles sumSquares.c -o #
# /tmp/sumSquares.calypsi.elf --list-file #
# sumSquares.calypsi.lst #


@ -5,67 +5,38 @@
sumSquares: ; @sumSquares
; %bb.0: ; %entry
rep #0x30
tay
tsc
sec
sbc #0xc
tcs
tya
sta 0x5, s
lda #0x0
sta 0x3, s
sta 0x1, s
lda 0x5, s
bne .LBB0_1
sta 0xd0
stz 0xd6
stz 0xd4
lda 0xd0
bne .LBB0_3
; %bb.6: ; %entry
brl .LBB0_5
.LBB0_1: ; %for.body.preheader
lda 0x5, s
inc a
sta 0x5, s
cmp #0x3
bcs .LBB0_3
; %bb.1: ; %for.body.preheader
; %bb.2: ; %for.body.preheader
lda #0x2
sta 0x5, s
.LBB0_3: ; %for.body.preheader
lda #0x1
sta 0x7, s
lda 0x5, s
dec a
sta 0x5, s
lda #0x0
sta 0x1, s
sta 0xd2
.LBB0_4: ; %for.body
; =>This Inner Loop Header: Depth=1
lda 0x7, s
lda 0xd2
pha
jsl __umulhisi3
ply
clc
adc 0x3, s
sta 0x3, s
adc 0xd6
sta 0xd6
txa
adc 0x1, s
sta 0x1, s
lda 0x7, s
inc a
sta 0x7, s
lda 0x5, s
dec a
sta 0x5, s
adc 0xd4
sta 0xd4
inc 0xd2
dec 0xd0
beq .LBB0_5
bra .LBB0_4
.LBB0_5: ; %for.cond.cleanup
lda 0x1, s
lda 0xd4
tax
lda 0x3, s
tay
tsc
clc
adc #0xc
tcs
tya
lda 0xd6
rtl
.Lfunc_end0:
.size sumSquares, .Lfunc_end0-sumSquares

docs/INSTALL.md (new file, +168 lines)

@ -0,0 +1,168 @@
# Installing llvm816
The project installs everything into `tools/` under the repo root, so
the tree is self-contained and deletable without affecting your system.
## System requirements
- **Ubuntu 22.04 or 24.04** (or any Debian-based distro with apt).
Other Linuxes work if you can install the packages listed below
by hand.
- **Disk:** ~10 GB free (LLVM build artifacts dominate).
- **RAM:** 8 GB minimum, 16 GB recommended for the `--build-llvm`
flag. The setup script's default skips the LLVM build and
downloads a prebuilt toolchain instead — much faster, ~500 MB.
- **Build time:** ~5 minutes for the default (prebuilt) path; 30-60
minutes for `--build-llvm` (full LLVM source build).
## One-command install
```bash
git clone <this-repo-url> llvm816
cd llvm816
./setup.sh
```
`setup.sh` installs:
1. **System apt packages** — build-essential, cmake, ninja, clang, lld,
python3, MAME, etc. See [`scripts/installDeps.sh`](../scripts/installDeps.sh)
for the full list. *Requires sudo.*
2. **llvm-mos** — source tree clone at `tools/llvm-mos/` and a prebuilt
SDK at `tools/llvm-mos-sdk/`. With `--build-llvm` it also runs
cmake/ninja to build a usable W65816-aware clang at
`tools/llvm-mos-build/bin/clang`.
3. **Apple IIgs MAME** — installs MAME via apt and downloads the
apple2gs ROMs to `tools/mame/roms/`.
4. **Calypsi 5.16** — reference 65816 C compiler, installed to
`tools/calypsi/`. Used by the `compare/` benchmarks to measure
our codegen quality against a commercial baseline.
5. **ORCA/C**: the Byte Works 65816 C compiler for the Apple IIgs (header reference
for the IIgs Toolbox bindings).
After `setup.sh` finishes (the `clang` path below exists only after a `--build-llvm` build):
```bash
ls tools/llvm-mos-build/bin/clang # our compiler
ls tools/link816 # our linker
mame -version # MAME (installed via apt)
```
## Step-by-step (if `setup.sh` fails)
You can run each install script in isolation:
```bash
scripts/installDeps.sh # apt packages
scripts/installLlvmMos.sh # llvm-mos clone + prebuilt SDK
scripts/installLlvmMos.sh --build # also build the source (slow)
scripts/installMame.sh # MAME + apple2gs ROMs
scripts/installCalypsi.sh # reference compiler (optional)
scripts/installOrcaC.sh # reference compiler (optional)
```
If you only want to build C programs (no benchmarks, no comparison
to Calypsi), `installCalypsi.sh` and `installOrcaC.sh` are
optional.
## Building the W65816 backend from source
The default install pulls a prebuilt LLVM SDK. To build our
W65816-aware clang from source:
```bash
./setup.sh --build-llvm
```
Or, after a non-`--build-llvm` install:
```bash
scripts/applyBackend.sh # symlink our W65816 sources into llvm-mos clone
cmake --build tools/llvm-mos-build --target llc clang
```
The build takes 30-60 minutes on a modern laptop. Subsequent
incremental builds after editing W65816 backend code are ~30
seconds.
## Verifying the install
```bash
# Compile + disassemble a small C function
scripts/cDemo.sh
# Build the runtime library (libc, libgcc, etc.)
bash runtime/build.sh
# Run the smoke test suite (~150 checks, takes ~3 minutes)
bash scripts/smokeTest.sh
```
A successful smoke test ends with:
```
[llvm816] all smoke checks passed
```
## Updating
```bash
git pull
scripts/applyBackend.sh # re-symlink our sources into the LLVM tree
cmake --build tools/llvm-mos-build --target llc clang
bash runtime/build.sh
```
If you want a fully clean rebuild:
```bash
rm -rf tools/llvm-mos-build
./setup.sh --build-llvm
```
## Uninstalling
The toolchain is fully contained under `tools/`. To uninstall:
```bash
rm -rf llvm816/
sudo apt-get remove mame mame-tools # if you want MAME gone too
```
Apart from the apt packages, the setup script installs nothing
outside the repo: it doesn't touch `/usr/local` or `~/.mame`.
## Troubleshooting
**`cmake: command not found`** — run `scripts/installDeps.sh`. The
apt packages aren't installed yet.
**`ROMs not found`** — the apple2gs ROM download from archive.org
occasionally fails. Re-run `scripts/installMame.sh`. The script
is idempotent; it skips ROMs already downloaded.
**`clang: error: unable to find target 'w65816'`** — the prebuilt
SDK's clang doesn't know about our W65816 target. You need the
source-built clang:
```bash
scripts/installLlvmMos.sh --build
# Or, more granular:
scripts/applyBackend.sh
cmake --build tools/llvm-mos-build --target clang
```
The W65816 target lives in *our* fork at `tools/llvm-mos-build/bin/clang`,
not in the prebuilt SDK.
**MAME can't find ROMs at runtime** — make sure `mame` is launched
with `-rompath tools/mame/roms`. The provided
[`scripts/runInMame.sh`](../scripts/runInMame.sh) does this
automatically.
**`linkage error: missing __umulhisi3`** — link `runtime/libgcc.o`
into your binary. See [USAGE.md](USAGE.md#linking).
**MAME pops up a window I don't want** — the `runInMame.sh`
wrapper now runs headless (`-video none` + `SDL_VIDEODRIVER=dummy`).
If you're invoking MAME directly, add those flags.

docs/USAGE.md (new file, +391 lines)

@ -0,0 +1,391 @@
# Using llvm816
This document covers compiling a C program, linking it into an
Apple IIgs binary, and running it under MAME. It assumes you've
followed [INSTALL.md](INSTALL.md) and have a working
`tools/llvm-mos-build/bin/clang`.
## Quick reference
```bash
CLANG=tools/llvm-mos-build/bin/clang
LINK=tools/link816
RUNTIME=runtime
# 1. Compile C to object
$CLANG --target=w65816 -O2 -I$RUNTIME/include -c hello.c -o hello.o
# 2. Link to a raw binary (loadable at $00:1000)
$LINK -o hello.bin --text-base 0x1000 \
$RUNTIME/crt0.o $RUNTIME/libc.o $RUNTIME/libgcc.o hello.o
# 3. Run under MAME
bash scripts/runInMame.sh hello.bin --check 0x025000=????
```
## Compiling C
The compiler is invoked just like a normal clang, with
`--target=w65816`:
```bash
clang --target=w65816 -O2 -c source.c -o source.o
```
**Recommended flags:**
| Flag | Meaning |
|---|---|
| `--target=w65816` | Selects the W65816 backend (required) |
| `-O2` | Default optimization level. `-O0` and `-O1` work but produce ~3-5× larger code |
| `-ffunction-sections` | Put each function in its own section. Lets the linker drop unreferenced functions |
| `-I runtime/include` | Find `<stdio.h>` etc. |
| `-c` | Compile only — produce `.o`, don't link |
**What works at `-O2`:**
- All C99 scalars: `int8_t` through `int64_t`, signed and unsigned,
all arithmetic operators
- Soft `float` and `double` (full IEEE-754 with round-to-nearest-even)
- Pointers, arrays, structs, unions, bitfields
- All control flow: `if`, `for`, `while`, `goto`, `switch`,
recursion
- `<stdarg.h>` varargs
- `<setjmp.h>` setjmp/longjmp (SJLJ, no DWARF unwinder)
- Inline `__asm__` with `"a"`, `"x"`, `"y"` register constraints
- C++ subset: classes, single+multiple inheritance, virtual functions,
RTTI, `dynamic_cast`. **No exceptions** (DWARF unwinder not
implemented).
See [STATUS.md](../STATUS.md) for the full feature matrix.
## Linking
The linker is `tools/link816`. It produces either a raw binary
suitable for direct execution (loaded into a fixed address) or an
OMF binary suitable for GS/OS Loader.
### Raw binary
```bash
link816 -o output.bin --text-base 0x1000 crt0.o libc.o libgcc.o yourprog.o
```
- `--text-base 0x1000` — physical address where code is loaded.
`0x1000` is the conventional starting address; the first 4 KB
of bank 0 (`$00:0000-$00:0FFF`) is reserved for the stack and
zero-page.
- `crt0.o` — the C runtime startup. Sets DBR, calls `main`, halts.
Always link first.
- `libc.o`: `printf`, `malloc`, `strlen`, etc.
- `libgcc.o` — compiler-helper routines (`__mulhi3`, `__umulhisi3`,
`__divhi3`, `__ashlhi3`, etc.). Required by most non-trivial
programs.
### Additional runtime libraries
| Library | What you get |
|---|---|
| `runtime/libc.o` | Core C library — printf, malloc, strlen, etc. |
| `runtime/libgcc.o` | Compiler helpers — multiply, divide, shift |
| `runtime/snprintf.o` | `sprintf` / `snprintf` / `vsnprintf` |
| `runtime/sscanf.o` | `sscanf` / `vsscanf` / `fscanf` |
| `runtime/softDouble.o` | IEEE 754 double-precision math |
| `runtime/softFloat.o` | IEEE 754 single-precision math |
| `runtime/math.o` | `fabs`, `floor`, `sqrt`, `sin`, `cos`, etc. |
| `runtime/qsort.o` | `qsort` / `bsearch` |
| `runtime/strtol.o` | `strtol` / `strtoul` / `atoi` / `atol` |
| `runtime/strtok.o` | `strtok` / `strtok_r` |
| `runtime/extras.o` | `strcat`, `strncat`, `llabs`, `rand`/`srand` |
| `runtime/timeExt.o` | `time` / `gmtime` / `mktime` |
| `runtime/iigsToolbox.o` | Apple IIgs Toolbox call wrappers |
| `runtime/iigsGsos.o` | GS/OS call wrappers |
Link only what you use — the linker drops unreferenced symbols.
Build them all once with:
```bash
bash runtime/build.sh
```
### Multi-segment OMF (for GS/OS Loader)
For programs that need >60 KB of code (the usable bank-0 limit
after subtracting the stack, zero-page, and I/O window), build a
multi-segment OMF that GS/OS Loader can place across banks:
```bash
link816 -o myprog.bin --omf --manifest my.manifest \
--expressload \
crt0Gsos.o ... yourprog.o
```
See [`docs/multiSegmentPlan.md`](multiSegmentPlan.md) for details
and [`scripts/runMultiSeg.sh`](../scripts/runMultiSeg.sh) for a
working example.
## Running under MAME
The supplied [`scripts/runInMame.sh`](../scripts/runInMame.sh)
launches MAME's `apple2gs` with the right ROM path, loads your
binary at `$00:1000`, runs for a few seconds, and reads back a
memory cell.
```bash
bash scripts/runInMame.sh prog.bin # just run for 5s
bash scripts/runInMame.sh prog.bin --check 0x025000=00ff
bash scripts/runInMame.sh prog.bin 0x025000 0x025002 # dump these addrs
```
The `--check ADDR=VALUE` form returns exit 0 if `ADDR` contains
`VALUE` after the run, exit 1 otherwise. Use `0x????` to dump
the value without checking.
MAME is invoked headless by default (no window) via
`-video none` + `SDL_VIDEODRIVER=dummy`. This works on
servers/CI runners.
### The bank-switch idiom
Bank 0 (`$00:0000-$00:FFFF`) has the I/O window at `$C000-$CFFF`
that interferes with normal data access. The convention is to
switch the data bank register (DBR) to bank 2 (`$02:0000`) before
doing any data work:
```c
__attribute__((noinline)) void switchToBank2(void) {
__asm__ volatile (
"sep #0x20\n" // 8-bit accumulator
".byte 0xa9,0x02\n" // lda #2 (force as bytes — llvm-mc bug)
"pha\n"
"plb\n" // DBR = 2
"rep #0x20\n" // back to 16-bit
);
}
```
After `switchToBank2()`, your data lives at `$02:0000` upward.
The `runInMame.sh` `--check 0x025000=...` address is `$02:5000`
— accessible via a normal store in bank 2.
## Examples
### Hello, integer
```c
__attribute__((noinline)) void switchToBank2(void) {
__asm__ volatile (
"sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n"
);
}
int main(void) {
int x = 42;
switchToBank2();
*(volatile int *)0x5000 = x;
while (1) {}
}
```
Build & run:
```bash
clang --target=w65816 -O2 -c hello.c -o hello.o
link816 -o hello.bin --text-base 0x1000 \
runtime/crt0.o runtime/libc.o runtime/libgcc.o hello.o
bash scripts/runInMame.sh hello.bin --check 0x025000=002a # 0x2a = 42
```
### Recursion + printing
```c
#include <stdio.h>
#include <stdlib.h>
unsigned long fib(unsigned n) {
if (n < 2) return n;
return fib(n-1) + fib(n-2);
}
__attribute__((noinline)) void switchToBank2(void) {
__asm__ volatile (
"sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n"
);
}
int main(void) {
char buf[32];
int len = snprintf(buf, sizeof buf, "fib(10) = %lu", fib(10));
switchToBank2();
// Copy buf to $025000 so we can read it after the run
for (int i = 0; i <= len; i++)
((volatile char *)0x5000)[i] = buf[i];
while (1) {}
}
```
Build (note: need snprintf.o for `snprintf`):
```bash
clang --target=w65816 -O2 -I runtime/include -c fib.c -o fib.o
link816 -o fib.bin --text-base 0x1000 \
runtime/crt0.o runtime/libc.o runtime/libgcc.o \
runtime/snprintf.o runtime/softDouble.o runtime/sscanf.o fib.o
```
### Apple IIgs Toolbox
```c
#include <iigs/toolbox_full.h>
int main(void) {
DrawString("\pHello, World");
while (1) {}
}
```
Build:
```bash
clang --target=w65816 -O2 -I runtime/include -c hello_gs.c -o hello_gs.o
link816 -o hello_gs.bin --text-base 0x1000 \
runtime/crt0Gsos.o runtime/iigsToolbox.o runtime/iigsGsos.o \
runtime/libgcc.o hello_gs.o
```
Use `crt0Gsos.o` (not `crt0.o`) for programs that call into the
toolbox — it sets up the IIgs runtime environment.
## Inline assembly
The W65816 backend supports `__asm__` with operand constraints
`"a"`, `"x"`, `"y"`:
```c
unsigned short addOne(unsigned short x) {
unsigned short r;
__asm__("inc a" : "=a"(r) : "a"(x));
return r;
}
```
Multi-instruction asm and raw bytes both work:
```c
__asm__ volatile (
"sep #0x20\n"
".byte 0x68\n" // pla
"rep #0x20\n"
);
```
The `.byte` form (as in `.byte 0xa9,0x02` for `lda #2` in the
bank-switch idiom) is sometimes needed to work around llvm-mc
encoding gaps: the assembler doesn't yet support every 65816
addressing mode literally. The pattern works for any opcode whose
mnemonic doesn't yet parse.
## Tools reference
| Tool | Location | Purpose |
|---|---|---|
| `clang` | `tools/llvm-mos-build/bin/clang` | C/C++ compiler |
| `llvm-mc` | `tools/llvm-mos-build/bin/llvm-mc` | Assembler |
| `llvm-objdump` | `tools/llvm-mos-build/bin/llvm-objdump` | Disassembler |
| `llc` | `tools/llvm-mos-build/bin/llc` | Standalone codegen (`.ll` → `.s`) |
| `link816` | `tools/link816` | Our relocating linker |
| `omfEmit` | `tools/omfEmit` | Emit OMF v2.1 binary from `link816` output |
| `mame` | `apt` (system-wide) | Apple IIgs emulator |
## Debugging
### Look at the asm
```bash
clang --target=w65816 -O2 -S -o prog.s prog.c
```
### Look at the MIR after each pass
```bash
clang --target=w65816 -O2 -mllvm -print-after-all -S prog.c 2>&1 | less
```
Useful pass names to filter on:
| Pass name | What it does |
|---|---|
| `w65816-isel` | SDAG → MachineInstr selection |
| `w65816-widen-acc16` | Promote Acc16 vregs to Wide16 (regalloc help) |
| `w65816-stack-slot-cleanup` | Remove redundant spill/reload |
| `w65816-stackrel-to-img` | Promote hot stack slots to DP IMG slots |
| `w65816-stack-slot-merge` | Collapse PHI src/dst slot pairs |
| `w65816-branch-expand` | Long-distance Bxx → INV_Bxx skip;BRA |
### Single-pass filter
```bash
clang --target=w65816 -O2 -mllvm -print-after=w65816-isel \
-mllvm -filter-print-funcs=myfunc -S prog.c 2>&1 | less
```
## Cycle-count benchmarks
Eight microbenchmarks live under [`benchmarks/`](../benchmarks/).
Each runs N iterations of the bench function and reports a
per-call cycle count via MAME's `emu.time()`:
```bash
bash scripts/benchCyclesPrecise.sh
```
Output:
```
| Benchmark | Per-call cycles (clang) |
|-----------|------------------------:|
| bsearch | 767 cyc/call |
| dotProduct | 2131 cyc/call |
| fib | 12617 cyc/call |
| memcmp | 989 cyc/call |
| popcount | 2864 cyc/call |
| strcpy | 2216 cyc/call |
| sumOfSquares | 16709 cyc/call |
```
The [`compare/`](../compare/) directory has side-by-side `.s`
files vs Calypsi 5.16 for sumSquares, evalAt, and mul16to32.
Rerun with:
```bash
bash compare/regen.sh
```
## Known limitations
- **C++ exceptions** are not implemented. `try`/`catch` compiles but
doesn't unwind. `-fsjlj-exceptions` works for limited SJLJ-style
throwing.
- **`stdin`** always returns EOF. `scanf` compiles but isn't useful.
Use `sscanf` on a buffer instead.
- **File I/O** through `fopen` etc. requires a backing implementation.
The default `mfs` backing (memory-file-system) lets you simulate
files via `mfsRegister()` — useful for tests, not for real disk
I/O. GS/OS file I/O works via `runtime/iigsGsos.o` if you link
against the GS/OS runtime.
- **`fork`/`exec`** — not applicable on a 65816, no support.
- **Code generation gotcha:** very large frames (>200 bytes) trigger
FP-relative addressing. Most programs fit under that limit. See
the `frame-rel` discussion in
[LLVM_65816_DESIGN.md](../LLVM_65816_DESIGN.md).
## Where to go next
- **Building real GS/OS apps:** see
[`docs/multiSegmentPlan.md`](multiSegmentPlan.md) and the
`runViaFinder.sh` script for booting through real GS/OS 6.0.2 in
MAME.
- **Backend internals (you're hacking on the compiler):**
[LLVM_65816_DESIGN.md](../LLVM_65816_DESIGN.md).
- **Smoke tests:** `scripts/smokeTest.sh` runs ~150 end-to-end checks.
Read it for examples of every feature in action.


@ -331,9 +331,11 @@ EOF
cat "$sCmpFile" >&2
die "setcc gt test missing: bcc/bcs (carry-based unsigned branch)"
fi
if ! grep -qE '^\s*cmp\s+0x[0-9a-f]+,\s*s\s*$' "$sCmpFile"; then
# Accept either stack-relative cmp or DP-form cmp (W65816StackRelToImg
# may promote the comparand to a DP slot when arg b is the hot slot).
if ! grep -qE '^\s*cmp\s+0x[0-9a-f]+(,\s*s)?\s*$' "$sCmpFile"; then
cat "$sCmpFile" >&2
die "setcc gt test missing: cmp <off>,s (stack-relative compare to arg b)"
die "setcc gt test missing: cmp <off>,s or cmp <dp> (compare to arg b)"
fi
fi
@ -373,13 +375,13 @@ int max3(int a, int b, int c) {
}
EOF
"$CLANG" --target=w65816 -O2 -S "$cFile3" -o "$sChainFile"
# Expect cmp against a stack-relative slot - the signature of the
# two-Acc16 CMP_RR custom inserter. (Earlier this test also
# required an `sta d,s` spill, but greedy regalloc + WidenAcc16
# avoids that spill entirely on this pattern.)
if ! grep -qE 'cmp 0x[0-9a-f]+, s' "$sChainFile"; then
# Expect cmp against a stack-relative slot OR a DP slot - the
# signature of the two-Acc16 CMP_RR custom inserter. Earlier this
# required only stack-rel; W65816StackRelToImg may promote the
# comparand to a DP slot for hot offsets.
if ! grep -qE 'cmp 0x[0-9a-f]+(, s|$)' "$sChainFile"; then
cat "$sChainFile" >&2
die "two-Acc16 (max3) didn't cmp via stack-relative"
die "two-Acc16 (max3) didn't cmp via stack-relative or DP"
fi
fi


@ -39,6 +39,7 @@ add_llvm_target(W65816CodeGen
W65816ImgCalleeSave.cpp
W65816NarrowI32Mul.cpp
W65816PromoteFiToImg.cpp
W65816StackRelToImg.cpp
W65816StackSlotMerge.cpp
W65816TargetMachine.cpp
W65816AsmPrinter.cpp


@ -143,6 +143,12 @@ FunctionPass *createW65816PromoteFiToImg();
// copy. See W65816StackSlotMerge.cpp.
FunctionPass *createW65816StackSlotMerge();
// Pre-emit pass: rewrite top-N stack-rel slot offsets to IMG0..IMG7
// DP slots ($D0..$DE). Caller-save semantics — function must only
// call IMG-safe libgcc helpers (verified to not touch $D0..$DE).
// See W65816StackRelToImg.cpp.
FunctionPass *createW65816StackRelToImg();
// Pre-RA pass that lowers Wide32 register pairs into pairs of i16
// vregs. Without this, greedy/basic regalloc can't fit the pair-
// pressure of i64-via-2-i32-via-Wide32 traffic in i64-heavy
@ -184,6 +190,7 @@ void initializeW65816ImgCalleeSavePass(PassRegistry &);
void initializeW65816NarrowI32MulPass(PassRegistry &);
void initializeW65816PromoteFiToImgPass(PassRegistry &);
void initializeW65816StackSlotMergePass(PassRegistry &);
void initializeW65816StackRelToImgPass(PassRegistry &);
} // namespace llvm


@ -485,7 +485,14 @@ void W65816AsmPrinter::emitInstruction(const MachineInstr *MI) {
if (It2 != MI->getParent()->end()) {
const TargetRegisterInfo *TRI =
MI->getParent()->getParent()->getSubtarget().getRegisterInfo();
if (It2->modifiesRegister(W65816::A, TRI))
// PEI doesn't load A, so the LDA's value-set is needed if
// the next instruction READS A. JSL has implicit-def $a
// (caller-save) AND implicit-use $a (when A is an arg) —
// modifiesRegister returns true for both, but readsRegister
// is what tells us if A's value is consumed. Drop the LDA
// ONLY when the next op modifies A WITHOUT reading it.
if (It2->modifiesRegister(W65816::A, TRI) &&
!It2->readsRegister(W65816::A, TRI))
ADead = true;
}
if (ADead) {


@ -188,10 +188,6 @@ bool W65816ImgCalleeSave::runOnMachineFunction(MachineFunction &MF) {
// other spill slots — but the STAfi/LDAfi we emit reference this slot
// by FrameIndex, and the only writes to this FI are our save/restore
// pair, so coloring can't break the round-trip.
//
// (The picol-expr bug came from a SHARED slot with two DIFFERENT
// vregs writing to it; here we have one FI per IMG and a single
// write/read pair per function, so coloring can't trip on this.)
MachineFrameInfo &MFI = MF.getFrameInfo();
int FrameSlots[8];
for (int i = 0; i < 8; ++i) {


@ -52,8 +52,11 @@
#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/MachineFunctionPass.h"
#include "llvm/CodeGen/MachineInstrBuilder.h"
#include "llvm/CodeGen/MachineLoopInfo.h"
#include "llvm/CodeGen/MachineRegisterInfo.h"
#include "llvm/InitializePasses.h"
#include "llvm/Support/Debug.h"
#include "llvm/Support/Format.h"
using namespace llvm;
@ -70,6 +73,11 @@ public:
StringRef getPassName() const override {
return "W65816 promote FrameIndex to IMG8..15 DP slot";
}
void getAnalysisUsage(AnalysisUsage &AU) const override {
AU.addRequired<MachineLoopInfoWrapperPass>();
AU.setPreservesCFG();
MachineFunctionPass::getAnalysisUsage(AU);
}
bool runOnMachineFunction(MachineFunction &MF) override;
};
@ -79,8 +87,11 @@ public:
char W65816PromoteFiToImg::ID = 0;
INITIALIZE_PASS(W65816PromoteFiToImg, DEBUG_TYPE,
"W65816 promote FI to IMG", false, false)
INITIALIZE_PASS_BEGIN(W65816PromoteFiToImg, DEBUG_TYPE,
"W65816 promote FI to IMG", false, false)
INITIALIZE_PASS_DEPENDENCY(MachineLoopInfoWrapperPass)
INITIALIZE_PASS_END(W65816PromoteFiToImg, DEBUG_TYPE,
"W65816 promote FI to IMG", false, false)
FunctionPass *llvm::createW65816PromoteFiToImg() {
@ -131,19 +142,20 @@ static uint8_t dpAddrForImg(unsigned ImgIdx) {
bool W65816PromoteFiToImg::runOnMachineFunction(MachineFunction &MF) {
// DISABLED: pass produces verifier errors ("Using an undefined physical
// register") on the kill-flag bookkeeping when an STAfi with `killed $a`
// is rewritten to STA_DP — the next i16-imm ADC/ADCE sees $a as dead.
// Also, for the FUNCTIONS where it would land (no-call, high-traffic
// slots), measured static + dynamic savings were modest and didn't
// justify the bookkeeping complexity. Re-enable after:
// - tightening kill-flag preservation: only carry kill if the same
// operand will be the last user in the new MI (which depends on
// post-rewrite scheduling — needs careful liveness re-analysis).
// - paired-PHI promotion: when fi#A is a PHI-input and fi#B is the
// matching PHI-output, map them to the SAME IMG slot so the
// PHI move collapses to a no-op (where most of the dynamic win
// would come from).
// DISABLED again 2026-05-13 (3rd-attempt write-up). Two new findings:
// 1. With kMaxPromote=2 and IMG0..7 (caller-save, skip ImgCalleeSave),
// sumSquares regressed 56 → 72 inst because the FIs picked by
// access-count (fi#2, fi#3) are intermediate spill temps, not
// the i32-accumulator's halves (which are different FIs). The
// loop body ends up using BOTH IMG and stack slots for related
// values.
// 2. To pick the RIGHT FIs (those corresponding to PHI-cycled
// values like the i32 accumulator), we need either:
// (a) IR-level analysis BEFORE FI assignment, or
// (b) Post-RA dataflow analysis to identify "long-lived" FIs
// (active across the loop back-edge with no def/use boundary).
// This is the next blocker. Disabled until either (a) or (b) is
// implemented.
return false;
if (skipFunction(MF.getFunction())) return false;
const W65816Subtarget &STI = MF.getSubtarget<W65816Subtarget>();
@ -151,49 +163,59 @@ bool W65816PromoteFiToImg::runOnMachineFunction(MachineFunction &MF) {
MachineFrameInfo &MFI = MF.getFrameInfo();
// 1. Walk all instructions, count FI accesses for promotable opcodes.
// Weight by loop depth: an access inside a depth-N loop counts as
// 10^min(N, 3) to model the dynamic execution count (an inner-loop
// slot gets executed many times per outer call).
MachineLoopInfo &MLI =
getAnalysis<MachineLoopInfoWrapperPass>().getLI();
DenseMap<int, unsigned> AccessCount;
DenseMap<int, SmallVector<MachineInstr *, 8>> AccessSites;
for (MachineBasicBlock &MBB : MF) {
unsigned LoopDepth = MLI.getLoopDepth(&MBB);
unsigned Weight = 1;
for (unsigned i = 0; i < LoopDepth && i < 3; ++i) Weight *= 10;
for (MachineInstr &MI : MBB) {
int FiIdx = getFiOperandIdx(MI.getOpcode());
if (FiIdx < 0) continue;
const MachineOperand &MO = MI.getOperand(FiIdx);
if (!MO.isFI()) continue;
int FI = MO.getIndex();
// Require: 2-byte size, fixed (not variable), offset operand == 0.
// The offset operand sits right after the FI operand.
if (MFI.isVariableSizedObjectIndex(FI)) continue;
if (MFI.getObjectSize(FI) != 2) continue;
// Fixed (negative-index) slots are arg slots — leave them alone.
// Promotion would break LowerFormalArguments's expected layout.
if (FI < 0) continue;
const MachineOperand &OffMO = MI.getOperand(FiIdx + 1);
if (!OffMO.isImm() || OffMO.getImm() != 0) continue;
AccessCount[FI]++;
AccessCount[FI] += Weight;
AccessSites[FI].push_back(&MI);
}
}
if (AccessCount.empty()) return false;
// 2. Determine which IMG8..15 slots are already in use.
// 2. Determine which IMG0..7 slots are already in use (caller-save).
// Use caller-save IMG0..7 instead of callee-save IMG8..15: this lets
// us skip ImgCalleeSave entirely (no prologue/epilogue overhead).
// The trade-off: any call inside the function may clobber IMG0..7, so
// a function whose callees might clobber them skips promotion. This
// restricts wins to leaf functions and callers of IMG-safe helpers.
BitVector UsedImg(8, false);
for (MachineBasicBlock &MBB : MF) {
for (MachineInstr &MI : MBB) {
// Skip CALL instructions — their `implicit-def dead $img0..7`
// operand list marks every IMG slot used, but that's just the
// caller-save annotation, not actual value-bearing usage.
if (MI.isCall()) continue;
for (const MachineOperand &MO : MI.operands()) {
if (!MO.isReg() || !MO.getReg().isPhysical()) continue;
Register R = MO.getReg();
// IMG8..15 are not numerically contiguous with each other in
// the W65816 register enum (subreg-pair regs sit between
// IMG indices). Spell them out explicitly.
unsigned ImgIdx = 16; // "not an IMG8..15"
if (R == W65816::IMG8) ImgIdx = 0;
else if (R == W65816::IMG9) ImgIdx = 1;
else if (R == W65816::IMG10) ImgIdx = 2;
else if (R == W65816::IMG11) ImgIdx = 3;
else if (R == W65816::IMG12) ImgIdx = 4;
else if (R == W65816::IMG13) ImgIdx = 5;
else if (R == W65816::IMG14) ImgIdx = 6;
else if (R == W65816::IMG15) ImgIdx = 7;
unsigned ImgIdx = 16;
if (R == W65816::IMG0) ImgIdx = 0;
else if (R == W65816::IMG1) ImgIdx = 1;
else if (R == W65816::IMG2) ImgIdx = 2;
else if (R == W65816::IMG3) ImgIdx = 3;
else if (R == W65816::IMG4) ImgIdx = 4;
else if (R == W65816::IMG5) ImgIdx = 5;
else if (R == W65816::IMG6) ImgIdx = 6;
else if (R == W65816::IMG7) ImgIdx = 7;
if (ImgIdx < 8) UsedImg.set(ImgIdx);
}
}
@@ -215,20 +237,80 @@ bool W65816PromoteFiToImg::runOnMachineFunction(MachineFunction &MF) {
// save/restore cost compounds with recursion / call frequency
// in ways the static access count can't capture).
bool HasCalls = false;
bool IsRecursive = false;
StringRef SelfName = MF.getName();
for (MachineBasicBlock &MBB : MF) {
for (MachineInstr &MI : MBB) {
if (MI.isCall()) { HasCalls = true; break; }
if (MI.isCall()) {
HasCalls = true;
// Check for self-call (recursive).
for (const MachineOperand &MO : MI.operands()) {
if (MO.isGlobal() && MO.getGlobal()->getName() == SelfName)
IsRecursive = true;
else if (MO.isSymbol() && SelfName == MO.getSymbolName())
IsRecursive = true;
}
}
}
if (HasCalls) break;
}
const unsigned kAccessThreshold = HasCalls ? 999999u : 5u;
// Recursive functions: skip — recursion makes per-call overhead
// compound (each level of recursion pays the save/restore).
if (IsRecursive) return false;
// Caller-save IMG0..7 strategy: any call may clobber them, so a
// promoted slot is only safe if its lifetime doesn't cross a call.
// For now, promote only in functions whose calls (if any) all target
// whitelisted libgcc helpers. This still catches simple loops like
// sumSquares, which calls __umulhisi3: that helper lives in libgcc.s
// and doesn't touch IMG0..7, so libgcc multiplies count as IMG-safe.
//
// Whitelist of libgcc functions known to not touch IMG0..7.
auto isImgSafeLibcall = [](const MachineInstr &MI) -> bool {
if (!MI.isCall()) return false;
for (const MachineOperand &MO : MI.operands()) {
StringRef Name;
if (MO.isGlobal()) Name = MO.getGlobal()->getName();
else if (MO.isSymbol()) Name = MO.getSymbolName();
else continue;
// libgcc.s multiply/divide/shift helpers — verified to only use
// $E0..$E9 internally, no IMG0..7 touch.
if (Name == "__umulhisi3" || Name == "__mulhi3" ||
Name == "__mulsi3" || Name == "__udivhi3" ||
Name == "__umodhi3" || Name == "__divhi3" ||
Name == "__modhi3" || Name == "__udivsi3" ||
Name == "__umodsi3" || Name == "__divsi3" ||
Name == "__modsi3" || Name == "__ashlhi3" ||
Name == "__lshrhi3" || Name == "__ashrhi3" ||
Name == "__ashlsi3" || Name == "__lshrsi3" ||
Name == "__ashrsi3")
return true;
return false;
}
return false;
};
bool AllCallsImgSafe = true;
for (MachineBasicBlock &MBB : MF) {
for (MachineInstr &MI : MBB) {
if (MI.isCall() && !isImgSafeLibcall(MI)) {
AllCallsImgSafe = false;
break;
}
}
if (!AllCallsImgSafe) break;
}
if (HasCalls && !AllCallsImgSafe) return false;
// Threshold: each promoted access saves about 1 cycle, and caller-save
// IMG0..7 need no save/restore, so any access count > 0 is a win. Keep
// a small threshold anyway to avoid promoting marginal slots.
const unsigned kAccessThreshold = 5u;
const unsigned kMaxPromote = 2u;
DenseMap<int, unsigned> FiToImgIdx;
unsigned NextFreeImg = 0;
for (int FI : Ordered) {
if (AccessCount[FI] < kAccessThreshold) break;
if (FiToImgIdx.size() >= kMaxPromote) break;
while (NextFreeImg < 8 && UsedImg.test(NextFreeImg)) ++NextFreeImg;
if (NextFreeImg >= 8) break;
FiToImgIdx[FI] = NextFreeImg + 8; // Map to IMG8..15
FiToImgIdx[FI] = NextFreeImg; // Map to IMG0..7 (caller-save)
++NextFreeImg;
}
if (FiToImgIdx.empty()) return false;
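
The weighting and selection steps in the hunk above can be sketched outside LLVM. This is a minimal standalone model, not the pass itself: `Access`, `depthWeight`, and `pickSlots` are hypothetical names, and the hottest-first ordering is an assumption, since the code that builds `Ordered` is not part of this diff.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical stand-in for one promotable frame-index access.
struct Access {
  int slot;            // frame index of the 2-byte stack slot
  unsigned loopDepth;  // loop nesting depth of the enclosing block
};

// Weight = 10^min(depth, 3), matching the capped multiply loop in the pass.
unsigned depthWeight(unsigned loopDepth) {
  unsigned weight = 1;
  for (unsigned i = 0; i < loopDepth && i < 3; ++i) weight *= 10;
  return weight;
}

// Returns slot -> IMG index for the hottest slots.
// Assumption: candidates are visited hottest-first; the real pass builds
// its `Ordered` list in code this diff does not show.
std::map<int, unsigned> pickSlots(const std::vector<Access> &accesses,
                                  const std::vector<bool> &usedImg,
                                  unsigned threshold, unsigned maxPromote) {
  // Accumulate the depth-weighted access count per slot.
  std::map<int, unsigned> weighted;
  for (const Access &a : accesses) weighted[a.slot] += depthWeight(a.loopDepth);

  // Order candidates by descending weighted count.
  std::vector<std::pair<int, unsigned>> ordered(weighted.begin(), weighted.end());
  std::stable_sort(ordered.begin(), ordered.end(),
                   [](const auto &l, const auto &r) { return l.second > r.second; });

  // Greedily hand out free IMG indices, skipping ones already in use.
  std::map<int, unsigned> slotToImg;
  std::size_t nextFree = 0;
  for (const auto &[slot, count] : ordered) {
    if (count < threshold) break;
    if (slotToImg.size() >= maxPromote) break;
    while (nextFree < usedImg.size() && usedImg[nextFree]) ++nextFree;
    if (nextFree >= usedImg.size()) break;
    slotToImg[slot] = static_cast<unsigned>(nextFree);
    ++nextFree;
  }
  return slotToImg;
}
```

Because of the depth weighting, a single access at loop depth 2 (weight 100) outranks dozens of straight-line accesses, which is why the threshold of 5 mostly filters out slots touched only a few times outside any loop.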

File diff suppressed because it is too large


@@ -599,20 +599,31 @@ bool W65816StackSlotMerge::runOnMachineFunction(MachineFunction &MF) {
}
return 0;
};
// Collect `LDA #K ; STA_StackRel Y` pairs, grouped by Y.
// Collect `LDA #K ; STA_StackRel Y` pairs, grouped by Y. Also
// handles consolidated `LDA #K ; STA Y1 ; STA Y2 ; ...` where the
// LDA is shared (Phase 6 collapsing): A stays at K across STAs.
DenseMap<int64_t, SmallVector<std::pair<MachineInstr *, int64_t>, 4>>
ConstStas;
for (MachineBasicBlock &MBB : MF) {
for (auto It = MBB.begin(); It != MBB.end(); ++It) {
if (!isLdaImm(*It)) continue;
int64_t K = immValue(*It);
// Walk forward through STA_StackRel ops; collect each as an
// init of K (A is preserved across STA). Stop on anything
// that modifies A.
auto NextIt = std::next(It);
while (NextIt != MBB.end() && NextIt->isDebugInstr()) ++NextIt;
if (NextIt == MBB.end()) continue;
if (NextIt->getOpcode() != W65816::STA_StackRel) continue;
int64_t Y;
if (!srAccess(*NextIt, Y)) continue;
ConstStas[Y].push_back({&*NextIt, K});
while (NextIt != MBB.end()) {
if (NextIt->isDebugInstr()) { ++NextIt; continue; }
if (NextIt->getOpcode() == W65816::STA_StackRel) {
int64_t Y;
if (srAccess(*NextIt, Y)) {
ConstStas[Y].push_back({&*NextIt, K});
}
++NextIt;
continue;
}
break; // any other op — stop (might change A or flags)
}
}
}
// For each slot Y with at least two const-init STAs, check for
@@ -692,6 +703,7 @@ bool W65816StackSlotMerge::runOnMachineFunction(MachineFunction &MF) {
// flag-use (unsafe).
MachineBasicBlock *MBB = DominatedSta->getParent();
bool flagsSafeP5 = false;
bool reachedMBBEnd = false;
for (auto Fwd = std::next(DominatedSta->getIterator());
Fwd != MBB->end(); ++Fwd) {
if (Fwd->isDebugInstr()) continue;
@@ -701,6 +713,33 @@ bool W65816StackSlotMerge::runOnMachineFunction(MachineFunction &MF) {
}
if (clobbersFlagsP(*Fwd)) { flagsSafeP5 = true; break; }
}
// If we walked off the end of MBB, recurse one level into
// successors. The fall-through code is in a successor MBB
// (e.g., bb.3's preheader -> bb.4's loop body which starts
// with an LDA, a flag-clobberer). Require ALL successors
// to clobber flags before any flag-use.
if (!flagsSafeP5) {
// Did the loop exit via fall-through (no break)?
// Check by walking the same loop again, simpler check.
auto It = std::next(DominatedSta->getIterator());
while (It != MBB->end() && It->isDebugInstr()) ++It;
// ... too brittle to track via prev loop; just recurse for
// every case where flagsSafeP5 is false. Conservative.
bool allSuccClobber = !MBB->succ_empty();
for (MachineBasicBlock *Succ : MBB->successors()) {
bool succClobbers = false;
for (auto SIt = Succ->begin(); SIt != Succ->end(); ++SIt) {
if (SIt->isDebugInstr()) continue;
if (usesFlagsP(*SIt)) break;
if (clobbersFlagsP(*SIt)) { succClobbers = true; break; }
if (SIt->isTerminator() && !SIt->isConditionalBranch()) {
succClobbers = true; break;
}
}
if (!succClobbers) { allSuccClobber = false; break; }
}
if (allSuccClobber) flagsSafeP5 = true;
}
if (!flagsSafeP5) continue;
// Erase DominatedSta and its preceding LDA #K.
auto Prev = DominatedSta->getIterator();
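
The shared-LDA collection added earlier in this file (`LDA #K ; STA Y1 ; STA Y2 ; ...`) can be modeled on a plain instruction list. This is a rough sketch under simplifying assumptions: `Op`, `Inst`, and `collectConstStores` are hypothetical names, and every `StaStackRel` is assumed to decode to a valid slot offset, whereas the pass filters through `srAccess`.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical, simplified instruction model (not the LLVM MachineInstr API).
enum class Op { LdaImm, StaStackRel, Debug, Other };

struct Inst {
  Op op;
  std::int64_t value;  // immediate for LdaImm, slot offset for StaStackRel
};

// slot offset -> list of (instruction index, constant stored there)
using ConstStores =
    std::map<std::int64_t, std::vector<std::pair<std::size_t, std::int64_t>>>;

ConstStores collectConstStores(const std::vector<Inst> &block) {
  ConstStores out;
  for (std::size_t i = 0; i < block.size(); ++i) {
    if (block[i].op != Op::LdaImm) continue;
    const std::int64_t k = block[i].value;
    // A still holds k across consecutive stores, so each STA in the run
    // is recorded as an init of k. Debug instructions are skipped.
    for (std::size_t j = i + 1; j < block.size(); ++j) {
      if (block[j].op == Op::Debug) continue;
      if (block[j].op != Op::StaStackRel) break;  // may change A or flags
      out[block[j].value].push_back({j, k});
    }
  }
  return out;
}
```

A store that follows any non-STA instruction is deliberately not attributed to the earlier LDA, since the accumulator may no longer hold K at that point.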


@@ -58,6 +58,7 @@ LLVMInitializeW65816Target() {
initializeW65816NarrowI32MulPass(PR);
initializeW65816PromoteFiToImgPass(PR);
initializeW65816StackSlotMergePass(PR);
initializeW65816StackRelToImgPass(PR);
// Default IndVarSimplify's exit-value rewriter to "never". The
// closed-form replacement frequently widens an i16 induction var
@@ -279,6 +280,7 @@ void W65816PassConfig::addPreEmitPass() {
// collapses when X and Y are renamed to the same slot). See
// W65816StackSlotMerge.cpp.
addPass(createW65816StackSlotMerge());
addPass(createW65816StackRelToImg());
}
MachineFunctionInfo *W65816TargetMachine::createMachineFunctionInfo(