Using llvm816
This document covers compiling a C program, linking it into an Apple IIgs binary, and running it under MAME. It assumes you've followed INSTALL.md and the install completed successfully.
If you've never used clang or a similar C compiler before, start with Quick orientation — it explains the moving parts. If you already know what clang is, jump to Your first program.
Quick orientation
What is clang?
Clang is a C / C++ compiler — the program that turns your .c source
file into machine code an actual CPU can execute. It's part of the
LLVM project, is the default C compiler on macOS, and is available as a
package on most modern Linux distributions. If you've used gcc before, clang takes nearly
the same command-line flags.
A normal install of clang produces code for the machine it's running on
— x86-64 if you're on a typical Linux PC. Clang has a cross-compiler
mode: pass --target=<arch> to make it emit code for a different
CPU. The W65816 (the Apple IIgs CPU) is one of the architectures we've
added to a fork of clang that ships with this project.
What gets installed where
After ./setup.sh completes, the project tree under your llvm816/
checkout looks roughly like this:
llvm816/ ← repo root; everything is contained here
├── docs/ ← this directory
├── runtime/ ← C standard library + startup code
│ ├── build.sh ← script that builds the runtime .o files
│ ├── include/ ← header files (<stdio.h>, etc.)
│ │ ├── stdio.h
│ │ ├── string.h
│ │ ├── ...
│ │ └── iigs/ ← Apple IIgs-specific headers
│ │ ├── toolbox.h ← ~1300 toolbox routine wrappers
│ │ ├── gsos.h
│ │ └── desktop.h
│ ├── src/ ← sources for the runtime (.c and .s)
│ └── *.o ← compiled runtime objects (after build)
├── scripts/ ← driver scripts
│ ├── runInMame.sh ← run a binary in MAME and check memory
│ ├── benchCycles.sh ← cycle-count benchmarks
│ └── smokeTest.sh ← ~150 end-to-end correctness checks
├── src/ ← OUR backend source (you compile from here)
├── tools/ ← installed tools (~7 GB total)
│ ├── llvm-mos/ ← LLVM source tree (~5 GB)
│ ├── llvm-mos-build/ ← built artifacts (~1.4 GB)
│ │ └── bin/
│ │ ├── clang ← THE COMPILER YOU USE
│ │ ├── clang++ ← same, for C++
│ │ ├── llc ← standalone IR → asm converter
│ │ ├── llvm-mc ← standalone assembler
│ │ ├── llvm-objdump ← disassembler
│ │ └── ...
│ ├── llvm-mos-sdk/ ← prebuilt llvm-mos SDK (~400 MB, mostly unused)
│ ├── link816 ← OUR LINKER (single binary, ~120 KB)
│ ├── omfEmit ← turns flat binary → Apple IIgs OMF v2.1
│ ├── mame/ ← Apple IIgs ROMs for MAME
│ ├── gsos/ ← GS/OS 6.0.2 / 6.0.4 disk images
│ ├── calypsi/ ← reference compiler for comparison (~580 MB)
│ └── orca-c/ ← reference compiler (header sources)
├── demos/ ← example IIgs programs
├── benchmarks/ ← cycle-count benchmarks
├── compare/ ← side-by-side ours-vs-Calypsi assembly
└── setup.sh ← one-shot installer
The two files you'll use most often:
| File | Purpose |
|---|---|
| tools/llvm-mos-build/bin/clang | The compiler. Pass --target=w65816 to make it emit Apple IIgs code |
| tools/link816 | The linker. Takes .o files and produces a flat binary the IIgs can load |
Nothing is installed into /usr/local, /opt, or anywhere else on
your system — the entire toolchain lives under your llvm816/ checkout.
To uninstall, delete the directory.
What about the system's /usr/bin/clang?
If your distribution provides a clang (most do), that's a different
clang for your machine's CPU. It does not know about the W65816
target. When following this document, always use the full path
./tools/llvm-mos-build/bin/clang (or set an alias / $PATH — see
Setting up your environment).
What the build process produces
When you compile a C file for the IIgs, the flow looks like this:
hello.c
│
│ clang --target=w65816 (cross-compile to 65816 machine code)
▼
hello.o (relocatable ELF object file)
│
│ + crt0.o + libc.o + libgcc.o (runtime libraries you link in)
│
│ link816 (our relocating linker)
▼
hello.bin (flat binary, loadable at $00:1000)
│
│ optionally: omfEmit hello.bin → hello.omf (for GS/OS Loader)
│
│ scripts/runInMame.sh hello.bin
▼
runs in MAME's emulated Apple IIgs
Three stages:
- Compile: clang turns .c into .o
- Link: link816 combines .o files + runtime libraries into a binary
- Run: MAME boots an emulated IIgs and executes the binary
Setting up your environment
To save typing, you can either edit your $PATH or use absolute paths.
The rest of this document uses absolute paths so the examples work
without any setup, but in practice you'll want shortcuts.
Option A: edit $PATH (recommended)
Add this to ~/.bashrc (or ~/.zshrc) so our tools are on your path:
export LLVM816_ROOT=$HOME/path/to/llvm816
export PATH="$LLVM816_ROOT/tools/llvm-mos-build/bin:$LLVM816_ROOT/tools:$PATH"
Then source ~/.bashrc (or restart your shell). After that you can
just type clang --target=w65816 ... without the path prefix.
Careful: putting tools/llvm-mos-build/bin first on $PATH means all clang invocations in that shell go to our build, not the system clang. Ours still works for your machine's native target too (it's a multi-arch clang), but if you also need your distro's version, prefer Option B.
Option B: shell aliases
In ~/.bashrc:
LLVM816_ROOT=$HOME/path/to/llvm816
alias w65clang="$LLVM816_ROOT/tools/llvm-mos-build/bin/clang --target=w65816 -I $LLVM816_ROOT/runtime/include"
alias link816="$LLVM816_ROOT/tools/link816"
Then:
w65clang -O2 -c hello.c -o hello.o
link816 -o hello.bin --text-base 0x1000 ...
Option C: nothing — just use full paths
Every example in this document spells out the full path, so this works too. Verbose, but unambiguous.
Your first program
Let's compile, link, and run a tiny program. Open a terminal in your
llvm816/ checkout directory.
1. Write the source
Create hello.c:
// hello.c — the smallest meaningful Apple IIgs program.
//
// Writes a value to bank-2 RAM at $02:5000, then halts. The MAME
// harness reads that memory cell to verify the result.
int main(void) {
int x = 6 * 7;
// Write directly to the 24-bit absolute address $02:5000. With
// ptr32 mode (our default), constant pointers to >16-bit addresses
// lower to `sta long $025000` — no bank-switching needed.
*(volatile int *)0x025000 = x;
while (1) {} // halt; the harness reads memory + exits
return 0;
}
2. Compile to a .o file
./tools/llvm-mos-build/bin/clang \
--target=w65816 \
-O2 \
-I runtime/include \
-c hello.c \
-o hello.o
What each flag does:
| Flag | Meaning |
|---|---|
| --target=w65816 | Required. Tells clang to emit W65816 machine code instead of the host CPU's code. |
| -O2 | Optimization level. -O2 is recommended; -O0 works but produces 3-5× larger code. |
| -I runtime/include | Look for <stdio.h> etc. in our runtime headers. |
| -c | Compile only: produce a .o, don't link. |
| -o hello.o | Write the object to hello.o. |
If the command succeeds, you'll have a hello.o next to your hello.c.
You can inspect it:
./tools/llvm-mos-build/bin/llvm-objdump --triple=w65816 -d hello.o | head -40
3. Link to a flat binary
./tools/link816 \
-o hello.bin \
--text-base 0x1000 \
runtime/crt0.o \
runtime/libc.o \
runtime/libgcc.o \
hello.o
Each argument:
| Argument | Why |
|---|---|
| -o hello.bin | Output file. |
| --text-base 0x1000 | Where the code goes in memory. 0x1000 is conventional (first 4 KB of bank 0 is reserved for stack + direct page). |
| runtime/crt0.o | Must come first. The C runtime startup: sets up the stack, calls main, halts cleanly on return. |
| runtime/libc.o | Core C library (printf, malloc, strlen, etc.). |
| runtime/libgcc.o | Compiler-provided helpers for things the 65816 can't do natively (16×16 multiply, 32-bit divide, etc.). Required for almost every program. |
| hello.o | Your code. |
link816 will print something like:
linked: text=[0x1000+128] rodata=[0x1080+0] bss=[0x1100+8] -> hello.bin
That tells you the code is 128 bytes, no read-only data, 8 bytes of BSS.
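The bracketed pairs in that line read as [base+size], so the arithmetic can be checked by hand. The sketch below is only an illustration: the helper names are hypothetical, and the 256-byte BSS alignment is an inference from this one sample line, not documented link816 behavior.

```c
#include <stdint.h>

/* End address (exclusive) of a region printed as [base+size]. */
uint32_t regionEnd(uint32_t base, uint32_t size) {
    return base + size;
}

/* Round addr up to the next multiple of align (align must be a power of two). */
uint32_t alignUp(uint32_t addr, uint32_t align) {
    return (addr + align - 1u) & ~(align - 1u);
}
```

With the sample values, regionEnd(0x1000, 128) is 0x1080, which is where rodata starts, and alignUp(0x1080, 0x100) is 0x1100, which matches the reported BSS base.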
4. Run it in MAME
bash scripts/runInMame.sh hello.bin --check 0x025000=002a
0x002a is hexadecimal for 42 (= 6 × 7), and 0x025000 is the
24-bit address bank $02 + offset $5000 — where your program wrote
x. The script boots MAME's emulated Apple IIgs, loads your binary
at $00:1000, runs for 5 seconds, reads memory at $02:5000, and
compares to the expected value.
A pass looks like:
MAME-LOADED bytes=128
MAME-READ addr=0x025000 val=0x002a
[llvm816] MAME OK: 1 reads matched
If you get MAME mismatch, your program wrote a different value (or
no value). Most common cause for a new project is writing to a
bank-0 address like *(volatile int *)0x5000 = x; (a plain $5000)
instead of a 24-bit address like *(volatile int *)0x025000 = x;
($02:5000). The verification harness reads bank 2; writes to bank 0
go to a different RAM cell and the comparison fails.
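One way to avoid the bank-0 mistake is to build the address from its two parts instead of typing the literal. This is a minimal sketch; farAddress and FAR_ADDR are hypothetical names for illustration and are not part of the runtime headers.

```c
#include <stdint.h>

/* Combine a bank number and an in-bank offset into a 24-bit address. */
uint32_t farAddress(uint8_t bank, uint16_t offset) {
    return ((uint32_t)bank << 16) | offset;
}

/* Same computation as a constant expression, usable in a pointer cast. */
#define FAR_ADDR(bank, offset) (((uint32_t)(bank) << 16) | (uint16_t)(offset))
```

farAddress(0x02, 0x5000) yields 0x025000, the address the harness checks, while a bank of 0x00 yields the plain 0x5000 that causes the mismatch described above.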
Compiling C — full reference
The compiler is invoked just like a normal clang, with one extra flag:
./tools/llvm-mos-build/bin/clang --target=w65816 -O2 -c source.c -o source.o
Recommended flags
| Flag | Meaning |
|---|---|
| --target=w65816 | Selects the W65816 backend (required). |
| -O2 | Default optimization. -O0 and -O1 work but produce ~3-5× larger code. -O3 is the same as -O2 for our backend. |
| -ffunction-sections | Put each function in its own section. Lets the linker drop unreferenced functions (smaller binaries). |
| -I runtime/include | Find <stdio.h>, <stdlib.h>, <iigs/toolbox.h> etc. |
| -c | Compile only: produce .o, don't link. Without this, clang tries to invoke the host linker, which doesn't understand 65816 objects. |
| -g | Emit DWARF debug info. Useful with link816 --debug-out. |
| -S | Emit assembly (.s) instead of an object file. Useful for inspecting codegen. |
What works at -O2
- All C99 scalars: int8_t through int64_t, signed and unsigned, all arithmetic operators
- Soft float and double (full IEEE-754 with round-to-nearest-even)
- Pointers, arrays, structs, unions, bitfields
- All control flow: if, for, while, goto, switch, recursion
- <stdarg.h> varargs
- <setjmp.h> setjmp/longjmp (SJLJ, no DWARF unwinder)
- Inline __asm__ with "a", "x", "y" register constraints
- C++ subset: classes, single+multiple inheritance, virtual functions, RTTI, dynamic_cast. No exceptions (DWARF unwinder not implemented; SJLJ exceptions work via -fsjlj-exceptions).
See STATUS.md for the full feature matrix.
Linking — full reference
link816 produces a flat binary suitable for direct execution (loaded
into a fixed address) or, with --omf, an OMF binary that the GS/OS
Loader can load and relocate.
Raw binary (fixed-address load)
./tools/link816 -o output.bin --text-base 0x1000 \
runtime/crt0.o runtime/libc.o runtime/libgcc.o yourprog.o
- --text-base 0x1000: where code is loaded. $1000 is conventional; the first 4 KB of bank 0 ($00:0000-$00:0FFF) is reserved for the stack and direct page.
- --bss-base 0x020000: where uninitialized data (BSS) goes. By default the linker places BSS immediately after rodata; supplying a different bank is useful when your text + data exceeds a single bank's free space.
- --map output.map: writes a human-readable map file showing every symbol's address. Useful for debugging.
- --no-gc-sections: keep all functions, even unreferenced ones. By default --gc-sections is on, so link816 drops unused code, shrinking binaries dramatically (a minimal program with full runtime linked goes from ~43 KB to ~1.5 KB).
Runtime libraries
Each runtime library is built once by runtime/build.sh and lives as
a .o in runtime/. Link only what you use — --gc-sections drops
the rest.
| Library | When you need it |
|---|---|
| runtime/crt0.o | Always. C runtime startup. |
| runtime/crt0Gsos.o | Instead of crt0.o for programs launched by the GS/OS Loader. |
| runtime/libc.o | printf, malloc, strlen, the usual. Almost always. |
| runtime/libgcc.o | Compiler helpers: multiply, divide, shift. Almost always. |
| runtime/snprintf.o | If you use sprintf / snprintf / vsnprintf. |
| runtime/sscanf.o | If you use sscanf / vsscanf / fscanf. |
| runtime/softDouble.o | If you use double-precision arithmetic anywhere. |
| runtime/softFloat.o | If you use float-precision arithmetic. |
| runtime/math.o | fabs, floor, sqrt, sin, cos, pow, etc. |
| runtime/qsort.o | qsort / bsearch. |
| runtime/strtol.o | strtol / strtoul / atoi / atol. |
| runtime/strtok.o | strtok / strtok_r. |
| runtime/extras.o | strcat, strncat, llabs, rand/srand. |
| runtime/timeExt.o | time / gmtime / mktime. |
| runtime/iigsToolbox.o | Apple IIgs Toolbox call wrappers. |
| runtime/iigsGsos.o | GS/OS class-1 call wrappers (file I/O, etc.). |
| runtime/desktop.o | startdesk() helper used by demos that need a Window Manager environment. |
| runtime/libcxxabi.o | C++ ABI runtime (vtable RTTI, dynamic_cast). |
| runtime/libcxxabiSjlj.o | C++ SJLJ-exception support (paired with -fsjlj-exceptions). |
To (re)build the runtime:
bash runtime/build.sh
Multi-segment OMF (for GS/OS Loader)
For programs >60 KB (the usable bank-0 limit after the stack, direct page, and I/O window are subtracted), build a multi-segment OMF that the GS/OS Loader places across banks:
./tools/link816 -o myprog.bin \
--text-base 0x1000 \
--segment-cap 0xB000 \
--segment-bank-base 0x040000 \
--manifest myprog.manifest.json \
runtime/crt0Gsos.o ... yourprog.o
./tools/omfEmit --manifest myprog.manifest.json --expressload -o myprog.omf
See docs/multiSegmentPlan.md for details and
scripts/runMultiSeg.sh for a working
example.
Running under MAME
scripts/runInMame.sh launches MAME's
apple2gs driver, loads your binary at $00:1000, runs for a few
seconds, and reads a memory cell:
bash scripts/runInMame.sh prog.bin # just run for ~5 s
bash scripts/runInMame.sh prog.bin --check 0x025000=002a # verify a value
bash scripts/runInMame.sh prog.bin 0x025000 0x025002 # dump these addresses
- --check ADDR=VALUE returns exit 0 if memory matches, exit 1 if not. Used by smoke and CI.
- The bare-address form dumps the value without comparing.
The runner is headless by default (-video none + SDL_VIDEODRIVER=dummy)
so it runs in a terminal-only environment. Useful environment
variables:
| Variable | Default | Purpose |
|---|---|---|
| MAME_CHECK_FRAME | 300 | Frame at which to read the check address (300 ≈ 5 s at 60 Hz). |
| MAME_SECS | 6 | How long to let MAME run before forcibly exiting. |
| MAME_TIMEOUT | 30 | Wall-clock timeout for the whole MAME invocation. |
| MAME_RAMSIZE | unset | Override the emulated RAM size (e.g. 8M). |
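The frame and seconds settings are related by the 60 Hz frame rate given in the table. The sketch below works through that arithmetic. The helper names are hypothetical, and the idea that the check frame has to fall before the MAME_SECS cutoff is an assumption drawn from the descriptions above, not something the script is confirmed to enforce.

```c
#include <stdint.h>

/* Whole seconds elapsed when the given frame is reached. */
uint32_t frameToSeconds(uint32_t frame, uint32_t frameRateHz) {
    return frame / frameRateHz;
}

/* Nonzero if the check frame is reached strictly before the run is cut off. */
int checkHappensBeforeExit(uint32_t checkFrame, uint32_t frameRateHz,
                           uint32_t runSeconds) {
    return checkFrame < runSeconds * frameRateHz;
}
```

With the defaults, frame 300 at 60 Hz is second 5, which is inside the 6-second run. Under the same assumption, raising MAME_CHECK_FRAME to 600 would also call for a larger MAME_SECS.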
Writing to non-bank-0 RAM
The 65816 has two registers that select which bank a memory access goes to:
- PBR (Program Bank Register): selects the bank for instruction fetches. Set by jsl long_addr and rtl.
- DBR (Data Bank Register): selects the bank for 16-bit absolute data accesses like lda $5000.
When the IIgs boots, DBR defaults to $00. Bank $00 contains the
I/O window at $C000-$CFFF, the language card area, and the stack —
not a great place for general data.
With ptr32 mode (the default — pointers are 32 bits / 24-bit addresses), constant pointers to non-bank-0 addresses lower automatically to long (24-bit absolute) instructions that ignore DBR:
*(volatile int *)0x025000 = 42; // → sta long $025000 (DBR-independent)
*(volatile char *)0xE10068 = 1; // → sta long $E10068 (vert position reg)
unsigned char v = *(volatile unsigned char *)0xE0C025; // I/O register read ($C025, keyboard modifiers)
For typical programs — writing a result to a verification address,
poking IIgs hardware registers, accessing the SHR framebuffer at
$E1:2000 — you just dereference the absolute pointer and the
compiler does the right thing. DBR doesn't matter.
Legacy: the switchToBank2() idiom
You may see older code (pre-ptr32 migration) using a switchToBank2()
helper that pokes DBR to $02 so that subsequent 16-bit-absolute
stores like *(volatile X*)0x5000 = v land in bank 2:
__attribute__((noinline)) void switchToBank2(void) {
__asm__ volatile (
"sep #0x20\n" // 8-bit A
".byte 0xa9,0x02\n" // lda #2 (hand-encoded)
"pha\n" // push A
"plb\n" // pop into DBR
"rep #0x20\n" // back to 16-bit A
);
}
// then:
switchToBank2();
*(volatile int *)0x5000 = x;
This still works but is no longer needed for new code. Prefer the
direct 24-bit pointer form (*(volatile int *)0x025000 = x;) — it's
clearer, requires no inline asm, and produces fewer instructions
because the bank byte is encoded inline.
There's still one case where it's useful: if you have a large amount
of data work in a single bank and want every store to be 3 bytes
(sta $5000,X etc.) instead of 4 bytes (sta long $025000,X). In
that case, set DBR once with the helper above and use 16-bit-absolute
addresses afterward. Otherwise, the direct form is simpler.
What never needs bank-switching
- Local variables on the stack: stack-relative accesses bypass DBR.
- Direct-page accesses: lda $D0 always reads from bank $00 ($00:00D0 when the direct-page register is 0).
- [dp],Y indirect-long pointers: they carry their own bank byte.
- Function calls: jsl carries a full 24-bit destination and sets PBR itself.
- Pointers in ptr32 mode: every C pointer is 32 bits, so dereferencing any pointer (even one to bank 0) generates DBR-independent code.
Worked examples
Recursion + printing
// fib.c
#include <stdio.h>
#include <stdlib.h>
unsigned long fib(unsigned n) {
if (n < 2) return n;
return fib(n-1) + fib(n-2);
}
int main(void) {
char buf[32];
int len = snprintf(buf, sizeof buf, "fib(10) = %lu", fib(10));
// Copy the formatted string into bank-2 RAM at $02:5000 so the
// MAME harness can read it back. Each store goes through a 24-bit
// long-address write — no bank-switching needed.
for (int i = 0; i <= len; i++)
((volatile char *)0x025000)[i] = buf[i];
while (1) {}
}
Build (snprintf needs soft-double + sscanf to link cleanly):
./tools/llvm-mos-build/bin/clang --target=w65816 -O2 \
-I runtime/include -c fib.c -o fib.o
./tools/link816 -o fib.bin --text-base 0x1000 \
runtime/crt0.o runtime/libc.o runtime/libgcc.o \
runtime/snprintf.o runtime/softDouble.o runtime/sscanf.o \
fib.o
bash scripts/runInMame.sh fib.bin --check 0x025000=0066 # 'f' (start of "fib")
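Because the formatting logic is plain C, the expected buffer contents can be confirmed on the host before involving MAME. The sketch below reuses fib from fib.c. formatFib is a hypothetical wrapper added for illustration; it produces the same text as fib.c when n is 10.

```c
#include <stddef.h>
#include <stdio.h>

/* Same recursive definition as fib.c. */
unsigned long fib(unsigned n) {
    if (n < 2) return n;
    return fib(n - 1) + fib(n - 2);
}

/* Format the message into buf; returns snprintf's result (length without NUL). */
int formatFib(char *buf, size_t size, unsigned n) {
    return snprintf(buf, size, "fib(%u) = %lu", n, fib(n));
}
```

For n = 10 the buffer holds "fib(10) = 55" (12 characters), and its first byte is 0x66, the 'f' that the --check line looks for.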
Apple IIgs Toolbox
// hello_gs.c
#include <iigs/toolbox.h>
int main(void) {
SysBeep();
while (1) {}
}
Build (note crt0Gsos.o instead of crt0.o — sets up the toolbox
environment):
./tools/llvm-mos-build/bin/clang --target=w65816 -O2 \
-I runtime/include -c hello_gs.c -o hello_gs.o
./tools/link816 -o hello_gs.bin --text-base 0x1000 \
runtime/crt0Gsos.o runtime/iigsToolbox.o runtime/iigsGsos.o \
runtime/libgcc.o hello_gs.o
Programs that call the toolbox usually run under real GS/OS rather than
in the headless harness. See demos/launch.sh and demos/build.sh
for a working pipeline.
Advanced: pointer-deref code generation
The W65816 backend treats every pointer as 32-bit (p:32:16 datalayout
— sizeof(void *) == 4 from the C compiler's perspective). The high
two bytes carry the bank byte plus a pad byte; the low two carry the
in-bank offset. This lets a single C pointer reach any byte in the
IIgs's 24-bit address space.
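The layout described above can be made concrete by pulling a pointer value apart. This is a sketch with hypothetical helper names. It assumes the bank byte sits in bits 16..23 and the pad byte in bits 24..31, which is consistent with constants such as 0x025000 used elsewhere in this document.

```c
#include <stdint.h>

/* In-bank offset: low 16 bits of the pointer value. */
uint16_t pointerOffset(uint32_t ptr) {
    return (uint16_t)ptr;
}

/* Bank byte: bits 16..23. */
uint8_t pointerBank(uint32_t ptr) {
    return (uint8_t)(ptr >> 16);
}

/* Pad byte: bits 24..31, unused by the 24-bit address space. */
uint8_t pointerPad(uint32_t ptr) {
    return (uint8_t)(ptr >> 24);
}
```

For the SHR framebuffer address 0x00E12000 this gives bank 0xE1, offset 0x2000, and a zero pad byte.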
A pointer dereference has to read up to 24 bits of address to know
which bank to touch. The CPU's [dp],Y (indirect-long-Y, opcode
0xB7) reads a 24-bit pointer from a direct-page slot and uses it as
the effective address — three bytes wide, bank byte explicit. This
is the safe default path and it works regardless of where the
target memory lives.
There are two optimizations layered on top of the default path. One is always on and safe. The other is opt-in via a flag and needs care.
Layer 1: constant-offset peeling (default on, always safe)
When you write s->c for a struct field at offset 4, the natural
code is "compute s + 4, then deref". Layer 1 recognizes that
[dp],Y already has a Y register that's added to the 24-bit pointer
on the deref — so instead of computing s + 4 first, the backend
stages the base pointer at $E0..$E2 and loads Y = #4 for the
deref. Saves three instructions per struct-field access (the
clc; adc #4; ...; adc #0 carry chain).
A consecutive-access CSE peephole shares the $E0/$E2 staging
between adjacent derefs of the same base, so s->a + s->b + s->c + s->d stages once and emits four ldy #K; lda [$E0],Y pairs.
There's nothing to enable or disable. On its own this was roughly a 1%
code-size win across Lua. It's always on because it's structurally
equivalent to the un-optimized code: the same 24-bit deref, just
with the offset folded into Y instead of pre-added to the pointer.
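The kind of source that benefits looks like the sketch below. struct Sample and sumFields are hypothetical names used only for illustration. The block shows the constant field offsets that would end up in Y; it does not show the generated assembly, which depends on the backend.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: three 16-bit fields, so c sits at byte offset 4. */
struct Sample {
    uint16_t a;
    uint16_t b;
    uint16_t c;
};

/* Three derefs of the same base pointer at constant offsets 0, 2 and 4. */
uint16_t sumFields(const struct Sample *s) {
    return (uint16_t)(s->a + s->b + s->c);
}
```

Going by the description above, a function shaped like sumFields would be expected to stage s once and then issue one ldy #K; lda [$E0],Y pair per field, with K taken from offsetof.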
Layer 2: -mllvm -w65816-dbr-safe-ptrs (opt-in, unsafe if misused)
The default [dp],Y deref needs three bytes of staging at $E0..$E2
because it reads a 24-bit pointer. Calypsi uses lda (d,S),Y
(opcode 0xB3, stack-rel-indirect-Y) for the same effect in ONE
instruction — but that opcode reads only 16 bits of pointer.
The bank byte is implicit DBR.
When you pass -mllvm -w65816-dbr-safe-ptrs, our backend uses the
same one-instruction path: it spills only the low 16 bits of the
pointer to a stack slot, sets Y to the offset, and emits
lda (slot,S),Y (or sta (slot,S),Y). Bank byte = whatever DBR
holds at runtime.
Per-deref cost drops from ~5 instructions to 1. Lua 5.1.5 shrinks by 20.6% with the flag on.
This is correct only when every pointer dereferenced in the TU points to memory inside DBR's current bank. Some examples:
| Pointer | Bank? | Safe with the flag? |
|---|---|---|
| malloc() result | DBR's bank (crt0 sets DBR to load bank; malloc allocates from BSS heap there) | Yes |
| Global variable address | DBR's bank (linker puts globals in the load segment) | Yes |
| &local_array[i] in a stack frame | Bank 0 (stack is always bank 0) | Yes IF DBR is 0 (typical) |
| Pointer returned by GS/OS Loader | The Loader's bank (might differ from yours) | No, would miscompile |
| Pointer cast from a 0x010000+addr integer literal in bank 1 | Bank 1 | No if DBR is not bank 1 |
| &ROMVECTORS[0] from iigs/-style headers | Various IIgs system banks | No in general |
For Lua, Picol, plain C programs that allocate via malloc and
operate on globals, this flag is safe. For GS/OS demos that interact
with Loader-returned segments or system memory, it would miscompile.
Default is off. Opt in per-TU:
clang --target=w65816 -O2 -mllvm -w65816-dbr-safe-ptrs -c hot.c -o hot.o
If you set the flag and your code does dereference cross-bank pointers, the symptom is silent wrong-address reads — typically a read from the same in-bank offset but in DBR's bank instead of the intended one. No abort, no diagnostic.
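The failure mode can be modelled with plain address arithmetic. This sketch is a host-side model of the two addressing modes as described above, not code from the backend, and the function names are hypothetical.

```c
#include <stdint.h>

/* Default path, [dp],Y: the full 24-bit pointer plus Y. */
uint32_t effectiveAddrLong(uint32_t ptr, uint16_t y) {
    return ((ptr & 0xFFFFFFu) + y) & 0xFFFFFFu;
}

/* Flag path, (slot,S),Y: DBR supplies the bank and only the low
   16 bits of the pointer are used. */
uint32_t effectiveAddrDbr(uint8_t dbr, uint32_t ptr, uint16_t y) {
    uint32_t base = ((uint32_t)dbr << 16) | (ptr & 0xFFFFu);
    return (base + y) & 0xFFFFFFu;
}
```

For a pointer to $02:5000 with Y = 4, both paths agree on $02:5004 when DBR is $02. With DBR at $00 the flag path lands on $00:5004 instead, which is the silent wrong-address read described above.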
Mixing safely: the flag is per-TU. You can compile your hot
struct-heavy code with the flag and your bank-aware code without.
The two .o files link cleanly together. Per-function or
per-parameter control isn't supported yet.
When the slot offset overflows 8 bits
lda (d,S),Y has an 8-bit d field — max slot offset 255 from SP.
If the function's frame is large enough that the spill slot exceeds
that, LLVM's prologue/epilogue insertion pass (PEI, not the 65816 pei
instruction) emits a fallback sequence that long-indirects the slot via
[$F6],Y (the function's frame-pointer), then stages at $E0..$E2
and derefs via [$E0],Y. This is ~8 instructions — worse than the
plain [dp],Y path the flag was meant to replace. Functions that
hit this need usesDpFP=true (set automatically for large frames);
otherwise PEI emits a fatal error. In practice you'll only see this
on functions with hundreds of local variables.
Inline-threshold tuning (default lowered to 50)
LLVM's default inline-cost threshold is 225, tuned for desktop CPUs
where call overhead is high relative to the size of the inlined body.
On W65816 a jsl long:foo is just 4 bytes / ~8 cycles, but every
inlined pointer dereference expands to multiple instructions even
with Layer 2. Aggressive inlining bloats code without commensurate
cycle wins.
The W65816 backend lowers the default to 50. Calibration:
| Threshold | Lua size | CoreMark size | Cycle benches |
|---|---|---|---|
| 225 (LLVM stock) | 1.47× Calypsi | (not measured) | baseline |
| 75 | 1.16× | 0.87× | identical |
| 50 (current) | 1.13× | 0.79× | identical |
| 25 | 1.11× | 0.79× | identical |
At 225, Lua's index2adr (a multi-branch helper called 41 times in
lapi.c) was inlined into every API entry, adding ~2 KB per file —
and CoreMark's matrix_test was 17× Calypsi because the inliner
copied 5 nested-loop helpers into it. At 50, both regressions vanish
and the cycle benchmarks are unchanged.
To override (e.g. on size-sensitive ROMs or speed-critical loops):
# Force aggressive inlining (back to LLVM default)
clang --target=w65816 -O2 -mllvm -inline-threshold=225 -c file.c -o file.o
# Force MORE conservative inlining
clang --target=w65816 -O2 -mllvm -inline-threshold=10 -c file.c -o file.o
A function marked __attribute__((always_inline)) is always inlined
regardless of threshold. A function marked __attribute__((noinline))
is never inlined. Use these to override the global threshold for
specific cases.
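A short sketch of the two attributes in use follows. The function names are hypothetical. The attributes themselves are standard clang/GCC extensions, so the block also compiles with a host compiler.

```c
#include <stdint.h>

/* Tiny helper: inlined at every call site regardless of -inline-threshold. */
static inline __attribute__((always_inline)) uint16_t clampToByte(uint16_t v) {
    return v > 255u ? 255u : v;
}

/* Larger helper: kept as one out-of-line copy no matter how often it is called. */
__attribute__((noinline)) uint16_t sumClamped(const uint16_t *values,
                                              uint16_t count) {
    uint16_t total = 0;
    for (uint16_t i = 0; i < count; i++) {
        total = (uint16_t)(total + clampToByte(values[i]));
    }
    return total;
}
```

Marking the small helper always_inline and the loop noinline pins both decisions in the source, so they hold whichever -inline-threshold value a given TU is built with.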
Summary: which options to use when
| Goal | Compile flag |
|---|---|
| Smallest, safest binary (default) | clang --target=w65816 -O2 ... — Layer 1 is on, Layer 2 is off, threshold=50 |
| Smallest binary for code that touches only same-bank memory | Add -mllvm -w65816-dbr-safe-ptrs |
| Fastest possible code (size be damned) | Add -mllvm -inline-threshold=500 |
| Reproduce LLVM's stock inlining behavior | Add -mllvm -inline-threshold=225 |
| Maximum safety review of inlining decisions | Mark hot helpers __attribute__((noinline)) explicitly |
Inline assembly
The W65816 backend supports __asm__ with operand constraints
"a", "x", "y":
unsigned short addOne(unsigned short x) {
unsigned short r;
__asm__("inc a" : "=a"(r) : "a"(x));
return r;
}
Multi-instruction asm and raw bytes both work:
__asm__ volatile (
"sep #0x20\n"
".byte 0x68\n" // pla
"rep #0x20\n"
);
The .byte form is needed when llvm-mc can't yet parse an opcode
literally (some 65816 addressing modes still have gaps in the
assembler). Hand-encoding is a stopgap; report opcodes that need it.
Tools reference
| Tool | Location | Purpose |
|---|---|---|
| clang | tools/llvm-mos-build/bin/clang | C / C++ compiler |
| clang++ | tools/llvm-mos-build/bin/clang++ | C++ driver |
| llc | tools/llvm-mos-build/bin/llc | Standalone codegen (.ll → .s) |
| llvm-mc | tools/llvm-mos-build/bin/llvm-mc | Assembler |
| llvm-objdump | tools/llvm-mos-build/bin/llvm-objdump | Disassembler |
| link816 | tools/link816 | Our relocating linker |
| omfEmit | tools/omfEmit | Emit OMF v2.1 binary from link816 output |
| mame | system apt install | Apple IIgs emulator |
Debugging
Look at the asm
./tools/llvm-mos-build/bin/clang --target=w65816 -O2 -S -o prog.s prog.c
cat prog.s
Look at the MIR after each backend pass
./tools/llvm-mos-build/bin/clang --target=w65816 -O2 \
-mllvm -print-after-all -S prog.c 2>&1 | less
Useful pass names to filter on:
| Pass name | What it does |
|---|---|
| w65816-isel | SDAG → MachineInstr selection |
| w65816-widen-acc16 | Promote Acc16 vregs to Wide16 (regalloc help) |
| w65816-stack-slot-cleanup | Remove redundant spill/reload |
| w65816-stackrel-to-img | Promote hot stack slots to DP IMG slots |
| w65816-stack-slot-merge | Collapse PHI src/dst slot pairs |
| w65816-branch-expand | Long-distance Bxx → INV_Bxx skip; BRA |
Single-pass filter
./tools/llvm-mos-build/bin/clang --target=w65816 -O2 \
-mllvm -print-after=w65816-isel \
-mllvm -filter-print-funcs=myfunc \
-S prog.c 2>&1 | less
Disassemble an object file
./tools/llvm-mos-build/bin/llvm-objdump --triple=w65816 -d hello.o
Cycle-count benchmarks
13 microbenchmarks live under benchmarks/ — eight
integer/string micro-benches, three soft-double FP benches (dadd,
dmul, ddiv), and two "game-like" workloads: particles (32-particle
physics tick with i16 bounce/wall collision) and mandelbrot (4×4
fixed-point Mandelbrot tile exercising i32 multiply and conditional
control flow).
bash scripts/benchCycles.sh
Output (2026-05-21):
| Benchmark | Per-iteration cycles |
|-----------|---------------------:|
| bsearch | 127 cyc/iter (100 iters) |
| crc32 | <65 (under timer resolution) |
| dadd | 1157 cyc/iter (10 iters) |
| ddiv | 1261 cyc/iter (10 iters) |
| dmul | 1033 cyc/iter (10 iters) |
| dotProduct | 144 cyc/iter (100 iters) |
| fib | 97 cyc/iter (100 iters) |
| mandelbrot | 11570 cyc/iter (1 iter, GRID=4 MAX_ITER=8) |
| memcmp | 113 cyc/iter (100 iters) |
| particles | 2253 cyc/iter (3 iters, N=32) |
| popcount | 93 cyc/iter (100 iters) |
| strcpy | 91 cyc/iter (100 iters) |
| sumOfSquares | 126 cyc/iter (100 iters) |
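To put the cycle counts in wall-clock terms, they can be divided by the CPU clock. The sketch below assumes the nominal 2.8 MHz fast-mode clock of the Apple IIgs. That figure and the helper name are assumptions for illustration; the effective speed on real hardware is somewhat lower, and the benchmark script may not run at this clock.

```c
#include <stdint.h>

/* Nominal Apple IIgs fast-mode clock in Hz (an assumption, see above). */
#define IIGS_FAST_HZ 2800000u

/* Convert a cycle count to whole microseconds at the given clock rate. */
uint32_t cyclesToMicros(uint32_t cycles, uint32_t clockHz) {
    return (uint32_t)(((uint64_t)cycles * 1000000u) / clockHz);
}
```

Under that assumption, the 11570-cycle mandelbrot iteration comes to about 4.1 ms and the 97-cycle fib iteration to about 34 µs.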
The legacy scripts/benchCyclesPrecise.sh (per-call cycle count via
emu.time()) is still available but slower to run.
The compare/ directory has side-by-side .s files vs
Calypsi 5.16 for sumSquares, evalAt, and mul16to32. Rerun with:
bash compare/regen.sh
Known limitations
- C++ exceptions are not implemented for DWARF unwinding. try/catch compiles but doesn't unwind. -fsjlj-exceptions works for limited SJLJ-style throwing.
- stdin always returns EOF. scanf compiles but isn't useful. Use sscanf on a buffer instead.
- File I/O through fopen requires a backing implementation. The default mfs backing (memory-file-system) lets you simulate files via mfsRegister(), which is useful for tests, not for real disk I/O. GS/OS file I/O works via runtime/iigsGsos.o if you link against the GS/OS runtime.
- fork/exec: not applicable on a 65816, no support.
- Code generation gotcha: very large stack frames (>200 bytes) trigger FP-relative addressing. Most programs fit under that limit. See the frame-rel discussion in LLVM_65816_DESIGN.md.
- Three Lua functions (luaV_execute, symbexec, auxsort) hit the greedy register allocator's complexity budget. Workaround: compile those TUs with -mllvm -regalloc=basic. Documented in tests/lua/README.md.
Where to go next
- Building real GS/OS apps: see docs/multiSegmentPlan.md and the demos/launch.sh script for booting through real GS/OS 6.0.2 in MAME. The 9 demos under demos/ are reasonable starting points.
- Backend internals (you're hacking on the compiler): LLVM_65816_DESIGN.md.
- Smoke tests: scripts/smokeTest.sh runs ~150 end-to-end checks. Read it for examples of every feature in action.
- Cycle-bench a Lua port or other real-world C: see tests/lua/README.md for the recipe (vendoring + per-file regalloc tuning + libc stubs).