Using llvm816
This document covers compiling a C program, linking it into an Apple IIgs binary, and running it under MAME. It assumes you've followed INSTALL.md and the install completed successfully.
If you've never used clang or a similar C compiler before, start with Quick orientation — it explains the moving parts. If you already know what clang is, jump to Your first program.
Quick orientation
What is clang?
Clang is a C / C++ compiler — the program that turns your .c source
file into machine code an actual CPU can execute. It's part of the
LLVM project and is the default C compiler on macOS; most modern
Linux distributions default to gcc but offer clang as a package. If
you've used gcc before, clang takes nearly the same command-line flags.
A normal install of clang produces code for the machine it's running on
— x86-64 if you're on a typical Linux PC. Clang has a cross-compiler
mode: pass --target=<arch> to make it emit code for a different
CPU. The W65816 (the Apple IIgs CPU) is one of the architectures we've
added to a fork of clang that ships with this project.
What gets installed where
After ./setup.sh completes, the project tree under your llvm816/
checkout looks roughly like this:
llvm816/ ← repo root; everything is contained here
├── docs/ ← this directory
├── runtime/ ← C standard library + startup code
│ ├── build.sh ← script that builds the runtime .o files
│ ├── include/ ← header files (<stdio.h>, etc.)
│ │ ├── stdio.h
│ │ ├── string.h
│ │ ├── ...
│ │ └── iigs/ ← Apple IIgs-specific headers
│ │ ├── toolbox.h ← ~1300 toolbox routine wrappers
│ │ ├── gsos.h
│ │ └── desktop.h
│ ├── src/ ← sources for the runtime (.c and .s)
│ └── *.o ← compiled runtime objects (after build)
├── scripts/ ← driver scripts
│ ├── runInMame.sh ← run a binary in MAME and check memory
│ ├── benchCycles.sh ← cycle-count benchmarks
│ └── smokeTest.sh ← ~150 end-to-end correctness checks
├── src/ ← OUR backend source (you compile from here)
├── tools/ ← installed tools (~7 GB total)
│ ├── llvm-mos/ ← LLVM source tree (~5 GB)
│ ├── llvm-mos-build/ ← built artifacts (~1.4 GB)
│ │ └── bin/
│ │ ├── clang ← THE COMPILER YOU USE
│ │ ├── clang++ ← same, for C++
│ │ ├── llc ← standalone IR → asm converter
│ │ ├── llvm-mc ← standalone assembler
│ │ ├── llvm-objdump ← disassembler
│ │ └── ...
│ ├── llvm-mos-sdk/ ← prebuilt llvm-mos SDK (~400 MB, mostly unused)
│ ├── link816 ← OUR LINKER (single binary, ~120 KB)
│ ├── omfEmit ← turns flat binary → Apple IIgs OMF v2.1
│ ├── mame/ ← Apple IIgs ROMs for MAME
│ ├── gsos/ ← GS/OS 6.0.2 / 6.0.4 disk images
│ ├── calypsi/ ← reference compiler for comparison (~580 MB)
│ └── orca-c/ ← reference compiler (header sources)
├── demos/ ← example IIgs programs
├── benchmarks/ ← cycle-count benchmarks
├── compare/ ← side-by-side ours-vs-Calypsi assembly
└── setup.sh ← one-shot installer
The two files you'll use most often:
| File | Purpose |
|---|---|
| tools/llvm-mos-build/bin/clang | The compiler. Pass --target=w65816 to make it emit Apple IIgs code |
| tools/link816 | The linker. Takes .o files and produces a flat binary the IIgs can load |
Nothing is installed into /usr/local, /opt, or anywhere else on
your system — the entire toolchain lives under your llvm816/ checkout.
To uninstall, delete the directory.
What about the system's /usr/bin/clang?
If your distribution provides a clang (most do), that's a different
clang for your machine's CPU. It does not know about the W65816
target. When following this document, always use the full path
./tools/llvm-mos-build/bin/clang (or set an alias / $PATH — see
Setting up your environment).
What the build process produces
When you compile a C file for the IIgs, the flow looks like this:
hello.c
│
│ clang --target=w65816 (cross-compile to 65816 machine code)
▼
hello.o (relocatable ELF object file)
│
│ + crt0.o + libc.o + libgcc.o (runtime libraries you link in)
│
│ link816 (our relocating linker)
▼
hello.bin (flat binary, loadable at $00:1000)
│
│ optionally: omfEmit hello.bin → hello.omf (for GS/OS Loader)
│
│ scripts/runInMame.sh hello.bin
▼
runs in MAME's emulated Apple IIgs
Three stages:
- Compile: clang turns .c into .o
- Link: link816 combines .o files + runtime libraries into a binary
- Run: MAME boots an emulated IIgs and executes the binary
Setting up your environment
To save typing, you can either edit your $PATH or use absolute paths.
The rest of this document uses absolute paths so the examples work
without any setup, but in practice you'll want shortcuts.
Option A: edit $PATH (recommended)
Add this to ~/.bashrc (or ~/.zshrc) so our tools are on your path:
export LLVM816_ROOT=$HOME/path/to/llvm816
export PATH="$LLVM816_ROOT/tools/llvm-mos-build/bin:$LLVM816_ROOT/tools:$PATH"
Then source ~/.bashrc (or restart your shell). After that you can
just type clang --target=w65816 ... without the path prefix.
Careful: putting tools/llvm-mos-build/bin first on $PATH means all clang invocations in that shell go to our build, not the system clang. Ours still works for your machine's native target too (it's a multi-arch clang), but if you also need your distro's version, prefer Option B.
Option B: shell aliases
In ~/.bashrc:
LLVM816_ROOT=$HOME/path/to/llvm816
alias w65clang="$LLVM816_ROOT/tools/llvm-mos-build/bin/clang --target=w65816 -I $LLVM816_ROOT/runtime/include"
alias link816="$LLVM816_ROOT/tools/link816"
Then:
w65clang -O2 -c hello.c -o hello.o
link816 -o hello.bin --text-base 0x1000 ...
Option C: nothing — just use full paths
Every example in this document spells out the full path, so this works too. Verbose, but unambiguous.
Your first program
Let's compile, link, and run a tiny program. Open a terminal in your
llvm816/ checkout directory.
1. Write the source
Create hello.c:
// hello.c — the smallest meaningful Apple IIgs program.
//
// Writes a value to bank-2 RAM at $02:5000, then halts. The MAME
// harness reads that memory cell to verify the result.
int main(void) {
int x = 6 * 7;
// Write directly to the 24-bit absolute address $02:5000. With
// ptr32 mode (our default), constant pointers to >16-bit addresses
// lower to `sta long $025000` — no bank-switching needed.
*(volatile int *)0x025000 = x;
while (1) {} // halt; the harness reads memory + exits
return 0;
}
2. Compile to a .o file
./tools/llvm-mos-build/bin/clang \
--target=w65816 \
-O2 \
-I runtime/include \
-c hello.c \
-o hello.o
What each flag does:
| Flag | Meaning |
|---|---|
| --target=w65816 | Required. Tells clang to emit W65816 machine code instead of the host CPU's code. |
| -O2 | Optimization level. -O2 is recommended; -O0 works but produces 3-5× larger code. |
| -I runtime/include | Look for <stdio.h> etc. in our runtime headers. |
| -c | Compile only: produce a .o, don't link. |
| -o hello.o | Write the object to hello.o. |
If the command succeeds, you'll have a hello.o next to your hello.c.
You can inspect it:
./tools/llvm-mos-build/bin/llvm-objdump --triple=w65816 -d hello.o | head -40
3. Link to a flat binary
./tools/link816 \
-o hello.bin \
--text-base 0x1000 \
runtime/crt0.o \
runtime/libc.o \
runtime/libgcc.o \
hello.o
Each argument:
| Argument | Why |
|---|---|
| -o hello.bin | Output file. |
| --text-base 0x1000 | Where the code goes in memory. 0x1000 is conventional (first 4 KB of bank 0 is reserved for stack + zero page). |
| runtime/crt0.o | Must come first. The C runtime startup: sets up the stack, calls main, halts cleanly on return. |
| runtime/libc.o | Core C library (printf, malloc, strlen, etc.). |
| runtime/libgcc.o | Compiler-provided helpers for things the 65816 can't do natively (16×16 multiply, 32-bit divide, etc.). Required for almost every program. |
| hello.o | Your code. |
link816 will print something like:
linked: text=[0x1000+128] rodata=[0x1080+0] bss=[0x1100+8] -> hello.bin
That tells you the code is 128 bytes, no read-only data, 8 bytes of BSS.
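If you want to script a sanity check on that summary line, a small host-side C sketch could parse it. The line format below is copied from the sample above, which is only described as "something like" the real output, so the format string and the names Region, parse_link_summary and region_end are assumptions made for illustration.

```c
#include <stdio.h>

/* Base address and byte length of one region from the summary line. */
typedef struct {
    unsigned long base;
    unsigned long size;
} Region;

/* Hypothetical helper: parses a line shaped like the sample output,
 *   linked: text=[0x1000+128] rodata=[0x1080+0] bss=[0x1100+8] -> hello.bin
 * Returns 1 on success, 0 if the line does not match that shape. */
int parse_link_summary(const char *line, Region *text, Region *rodata, Region *bss)
{
    int n = sscanf(line,
                   "linked: text=[%lx+%lu] rodata=[%lx+%lu] bss=[%lx+%lu]",
                   &text->base, &text->size,
                   &rodata->base, &rodata->size,
                   &bss->base, &bss->size);
    return n == 6;
}

/* First address past the end of a region. */
unsigned long region_end(const Region *r)
{
    return r->base + r->size;
}
```

For the sample line, region_end on the text region gives 0x1080, which is where the sample's rodata region begins.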
4. Run it in MAME
bash scripts/runInMame.sh hello.bin --check 0x025000=002a
0x002a is hexadecimal for 42 (= 6 × 7), and 0x025000 is the
24-bit address bank $02 + offset $5000 — where your program wrote
x. The script boots MAME's emulated Apple IIgs, loads your binary
at $00:1000, runs for 5 seconds, reads memory at $02:5000, and
compares to the expected value.
A pass looks like:
MAME-LOADED bytes=128
MAME-READ addr=0x025000 val=0x002a
[llvm816] MAME OK: 1 reads matched
If you get MAME mismatch, your program wrote a different value (or
no value). Most common cause for a new project is writing to a
bank-0 address like *(volatile int *)0x5000 = x; (a plain $5000)
instead of a 24-bit address like *(volatile int *)0x025000 = x;
($02:5000). The verification harness reads bank 2; writes to bank 0
go to a different RAM cell and the comparison fails.
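The address arithmetic behind that mistake can be sketched in a few lines of host-side C. The helper names bank_of, offset_of and same_cell are made up for this illustration; they only model how a 24-bit address splits into a bank byte and a 16-bit offset.

```c
#include <stdint.h>

/* Bank byte (bits 16..23) of a 24-bit IIgs address. */
uint8_t bank_of(uint32_t addr)
{
    return (uint8_t)((addr >> 16) & 0xFFu);
}

/* In-bank offset (bits 0..15) of a 24-bit IIgs address. */
uint16_t offset_of(uint32_t addr)
{
    return (uint16_t)(addr & 0xFFFFu);
}

/* 1 when two addresses name the same RAM cell, 0 otherwise. */
int same_cell(uint32_t a, uint32_t b)
{
    return bank_of(a) == bank_of(b) && offset_of(a) == offset_of(b);
}
```

Both 0x5000 and 0x025000 have offset $5000, but their banks are $00 and $02, so same_cell reports them as different cells.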
Compiling C — full reference
The compiler is invoked just like a normal clang, with one extra flag:
./tools/llvm-mos-build/bin/clang --target=w65816 -O2 -c source.c -o source.o
Recommended flags
| Flag | Meaning |
|---|---|
| --target=w65816 | Selects the W65816 backend (required). |
| -O2 | Default optimization. -O0 and -O1 work but produce ~3-5× larger code. -O3 is the same as -O2 for our backend. |
| -ffunction-sections | Put each function in its own section. Lets the linker drop unreferenced functions (smaller binaries). |
| -I runtime/include | Find <stdio.h>, <stdlib.h>, <iigs/toolbox.h> etc. |
| -c | Compile only: produce .o, don't link. Without this, clang tries to invoke the host linker, which doesn't understand 65816 objects. |
| -g | Emit DWARF debug info. Useful with link816 --debug-out. |
| -S | Emit assembly (.s) instead of an object file. Useful for inspecting codegen. |
What works at -O2
- All C99 scalars: int8_t through int64_t, signed and unsigned, all arithmetic operators
- Soft float and double (full IEEE-754 with round-to-nearest-even)
- Pointers, arrays, structs, unions, bitfields
- All control flow: if, for, while, goto, switch, recursion
- <stdarg.h> varargs
- <setjmp.h> setjmp/longjmp (SJLJ, no DWARF unwinder)
- Inline __asm__ with "a", "x", "y" register constraints
- C++ subset: classes, single + multiple inheritance, virtual base diamonds, RTTI, dynamic_cast, new/delete/new[]/delete[], global ctors via .init_array (walked by the crt0), Meyers singletons (gated by __cxa_guard_acquire/release), and global + static-local dtors actually run at exit time: each crt0 calls __run_cxa_atexit after main() returns to walk the registered table LIFO. SJLJ exceptions via clang++ -fsjlj-exceptions (no DWARF unwinder).
- printf/snprintf family: full C99 conversion + flag + width + precision + length surface: %d %i %u %x %X %o %c %s %p %f %F %e %E %g %G %n %%, flags - + space # 0, width and precision via decimal or *, length modifiers hh h l ll j z t. Hex-float %a/%A is the only intentional gap (niche).
- IIgs desktop helpers: <iigs/desktop.h> (startdesk/enddesk), <iigs/sound.h> (SysBeep + FFStartSound wrappers), <iigs/eventLoop.h> (callback-based TaskMaster dispatch: close, menu, key, mouse, idle). See demos/cxxProbe.cpp / the smoke helpers test for usage.
- Source-level debugger (post-mortem): build with clang -g and link with link816 --debug-out FOO.dwarf --map FOO.map, then resolve a runtime PC to source with scripts/pc2line.py --sidecar FOO.dwarf --map FOO.map 0xADDR. Output: PC=0x123A FILE=foo.c LINE=42 FUNC=add. See scripts/mameDebug.sh for a wrapper that takes --break FUNC / --break FILE:LINE and runs under MAME.
- C++ containers via vendored ETL (Embedded Template Library): #include "etl/vector.h", #include "etl/string.h", #include "etl/map.h", #include "etl/optional.h", #include "etl/delegate.h", etc. See the C++ shell commands section below for usage.
See STATUS.md for the full feature matrix.
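One way to spot-check the printf/snprintf surface listed above is a small self-checking function. This is a sketch: the function name check_printf_surface is invented here, and the expected strings are what C99 specifies for each conversion rather than output captured from the W65816 runtime. On the target, the %f case would presumably need the soft-double support described under Runtime libraries.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 when the formatted text equals the expected text. */
static int same(const char *got, const char *want)
{
    return strcmp(got, want) == 0;
}

/* Formats a handful of values using flags, width, precision and length
 * modifiers, and returns 1 only if every result matches C99 behavior. */
int check_printf_surface(void)
{
    char buf[32];
    int ok = 1;

    snprintf(buf, sizeof buf, "%05d", 42);            /* 0 flag + width */
    ok &= same(buf, "00042");

    snprintf(buf, sizeof buf, "%-4s|", "ab");         /* - flag pads on the right */
    ok &= same(buf, "ab  |");

    snprintf(buf, sizeof buf, "%#x", 255u);           /* # flag adds the 0x prefix */
    ok &= same(buf, "0xff");

    snprintf(buf, sizeof buf, "%+d", 7);              /* + flag forces a sign */
    ok &= same(buf, "+7");

    snprintf(buf, sizeof buf, "%.*f", 2, 3.14159);    /* precision passed via * */
    ok &= same(buf, "3.14");

    snprintf(buf, sizeof buf, "%hhu", 300);           /* hh truncates to unsigned char */
    ok &= same(buf, "44");

    snprintf(buf, sizeof buf, "%lld", 1234567890123LL); /* ll length modifier */
    ok &= same(buf, "1234567890123");

    snprintf(buf, sizeof buf, "%zu", (size_t)10);     /* z length modifier */
    ok &= same(buf, "10");

    return ok;
}
```

In a target build, the return value could be written to a verification address and checked by the harness in the same way as the first program's result.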
Linking — full reference
link816 produces a flat binary suitable for direct execution (loaded
into a fixed address) or, with --omf, an OMF binary that the GS/OS
Loader can load and relocate.
Raw binary (fixed-address load)
./tools/link816 -o output.bin --text-base 0x1000 \
runtime/crt0.o runtime/libc.o runtime/libgcc.o yourprog.o
- --text-base 0x1000: Where code is loaded. $1000 is conventional; the first 4 KB of bank 0 ($00:0000-$00:0FFF) is reserved for the stack and direct page.
- --bss-base 0x020000: Where uninitialized data (BSS) goes. By default the linker places BSS immediately after rodata; supplying a different bank is useful when your text + data exceeds a single bank's free space.
- --map output.map: Writes a human-readable map file showing every symbol's address. Useful for debugging.
- --no-gc-sections: Keep all functions, even unreferenced ones. By default --gc-sections is on, so link816 drops unused code, shrinking binaries dramatically (a minimal program with full runtime linked goes from ~43 KB to ~1.5 KB).
Runtime libraries
Each runtime library is built once by runtime/build.sh and lives as
a .o in runtime/. Link only what you use — --gc-sections drops
the rest.
| Library | When you need it |
|---|---|
| runtime/crt0.o | Always. C runtime startup. |
| runtime/crt0Gsos.o | Instead of crt0.o for programs launched by the GS/OS Loader. |
| runtime/libc.o | printf, malloc, strlen, the usual. Almost always. |
| runtime/libgcc.o | Compiler helpers: multiply, divide, shift. Almost always. |
| runtime/snprintf.o | If you use sprintf / snprintf / vsnprintf. |
| runtime/sscanf.o | If you use sscanf / vsscanf / fscanf. |
| runtime/softDouble.o | If you use double-precision arithmetic anywhere. |
| runtime/softFloat.o | If you use float-precision arithmetic. |
| runtime/math.o | fabs, floor, sqrt, sin, cos, pow, etc. |
| runtime/qsort.o | qsort / bsearch. |
| runtime/strtol.o | strtol / strtoul / atoi / atol. |
| runtime/strtok.o | strtok / strtok_r. |
| runtime/extras.o | strcat, strncat, llabs, rand/srand. |
| runtime/timeExt.o | time / gmtime / mktime. |
| runtime/iigsToolbox.o | Apple IIgs Toolbox call wrappers. |
| runtime/iigsGsos.o | GS/OS class-1 call wrappers (file I/O, etc.). |
| runtime/desktop.o | startdesk() helper used by demos that need a Window Manager environment. |
| runtime/libcxxabi.o | C++ ABI runtime (vtable RTTI, dynamic_cast). |
| runtime/libcxxabiSjlj.o | C++ SJLJ-exception support (paired with -fsjlj-exceptions). |
To (re)build the runtime:
bash runtime/build.sh
Multi-segment OMF (for GS/OS Loader)
For programs >60 KB (the usable bank-0 limit after the stack, zero page, and I/O window are subtracted), build a multi-segment OMF that GS/OS Loader places across banks:
./tools/link816 -o myprog.bin \
--text-base 0x1000 \
--segment-cap 0xB000 \
--segment-bank-base 0x040000 \
--manifest myprog.manifest.json \
runtime/crt0Gsos.o ... yourprog.o
./tools/omfEmit --manifest myprog.manifest.json --expressload -o myprog.omf
See docs/multiSegmentPlan.md for details and
scripts/runMultiSeg.sh for a working
example.
Running under MAME
scripts/runInMame.sh launches MAME's
apple2gs driver, loads your binary at $00:1000, runs for a few
seconds, and reads a memory cell:
bash scripts/runInMame.sh prog.bin # just run for ~5 s
bash scripts/runInMame.sh prog.bin --check 0x025000=002a # verify a value
bash scripts/runInMame.sh prog.bin 0x025000 0x025002 # dump these addresses
- --check ADDR=VALUE returns exit 0 if memory matches, exit 1 if not. Used by smoke and CI.
- The bare-address form dumps the value without comparing.
The runner is headless by default (-video none + SDL_VIDEODRIVER=dummy)
so it runs in a terminal-only environment. Useful environment
variables:
| Variable | Default | Purpose |
|---|---|---|
| MAME_CHECK_FRAME | 300 | Frame at which to read the check address (300 ≈ 5 s at 60 Hz). |
| MAME_SECS | 6 | How long to let MAME run before forcibly exiting. |
| MAME_TIMEOUT | 30 | Wall-clock timeout for the whole MAME invocation. |
| MAME_RAMSIZE | unset | Override the emulated RAM size (e.g. 8M). |
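The frame and second values in the table are related by the 60 Hz frame rate it mentions. A minimal sketch of that conversion follows; the helper names are invented for illustration and are not part of the runner.

```c
/* Frame rate the table above assumes for the emulated IIgs. */
#define FRAMES_PER_SECOND 60u

/* Frame number to use as MAME_CHECK_FRAME for a delay in whole seconds. */
unsigned check_frame_for_seconds(unsigned seconds)
{
    return seconds * FRAMES_PER_SECOND;
}

/* Whole seconds needed to reach a given frame, rounded up. */
unsigned seconds_for_frame(unsigned frame)
{
    return (frame + FRAMES_PER_SECOND - 1u) / FRAMES_PER_SECOND;
}
```

With the defaults, frame 300 corresponds to 5 seconds, which is below the 6 second MAME_SECS limit. If you raise MAME_CHECK_FRAME, it seems reasonable to raise MAME_SECS so it stays above the value seconds_for_frame returns, though the runner's exact behavior here is an assumption.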
Writing to non-bank-0 RAM
The 65816 has two registers that select which bank a memory access goes to:
- PBR (Program Bank Register): selects the bank for instruction fetches. Set by jsl long_addr and rtl.
- DBR (Data Bank Register): selects the bank for 16-bit absolute data accesses like lda $5000.
When the IIgs boots, DBR defaults to $00. Bank $00 contains the
I/O window at $C000-$CFFF, the language card area, and the stack —
not a great place for general data.
With ptr32 mode (the default — pointers are 32 bits / 24-bit addresses), constant pointers to non-bank-0 addresses lower automatically to long (24-bit absolute) instructions that ignore DBR:
*(volatile int *)0x025000 = 42; // → sta long $025000 (DBR-independent)
*(volatile char *)0xE10068 = 1; // → sta long $E10068 (vert position reg)
unsigned char v = *(volatile char *)0xE0C025; // ROM read
For typical programs — writing a result to a verification address,
poking IIgs hardware registers, accessing the SHR framebuffer at
$E1:2000 — you just dereference the absolute pointer and the
compiler does the right thing. DBR doesn't matter.
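If you poke absolute addresses often, a few macros can keep the bank and offset readable. This is a sketch under assumptions: IIGS_ADDR, MEM8, MEM16 and store_result are names invented here, <stdint.h> is assumed to be available in the runtime headers, and the expectation that the macro folds to the same constant pointer (and so the same long store) as a literal address has not been verified against the backend.

```c
#include <stdint.h>

/* Builds a 24-bit address from a bank byte and a 16-bit offset. */
#define IIGS_ADDR(bank, offset) \
    ((((uint32_t)(bank) & 0xFFu) << 16) | ((uint32_t)(offset) & 0xFFFFu))

/* Hypothetical accessors: view an absolute address as a volatile cell.
 * Only meaningful on the target; on a host PC they would touch an
 * arbitrary host address. */
#define MEM8(addr)  (*(volatile uint8_t  *)(uintptr_t)(addr))
#define MEM16(addr) (*(volatile uint16_t *)(uintptr_t)(addr))

/* Stores a 16-bit result at the verification address $02:5000. */
void store_result(uint16_t value)
{
    MEM16(IIGS_ADDR(0x02, 0x5000)) = value;
}
```

IIGS_ADDR(0x02, 0x5000) evaluates to 0x025000 and IIGS_ADDR(0xE1, 0x2000) to 0xE12000, the SHR framebuffer address mentioned above.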
Legacy: the switchToBank2() idiom
You may see older code (pre-ptr32 migration) using a switchToBank2()
helper that pokes DBR to $02 so that subsequent 16-bit-absolute
stores like *(volatile X*)0x5000 = v land in bank 2:
__attribute__((noinline)) void switchToBank2(void) {
__asm__ volatile (
"sep #0x20\n" // 8-bit A
".byte 0xa9,0x02\n" // lda #2 (hand-encoded)
"pha\n" // push A
"plb\n" // pop into DBR
"rep #0x20\n" // back to 16-bit A
);
}
// then:
switchToBank2();
*(volatile int *)0x5000 = x;
This still works but is no longer needed for new code. Prefer the
direct 24-bit pointer form (*(volatile int *)0x025000 = x;) — it's
clearer, requires no inline asm, and produces fewer instructions
because the bank byte is encoded inline.
There's still one case where it's useful: if you have a large amount
of data work in a single bank and want every store to be 3 bytes
(sta $5000,X etc.) instead of 4 bytes (sta long $025000,X). In
that case, set DBR once with the helper above and use 16-bit-absolute
addresses afterward. Otherwise, the direct form is simpler.
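The size trade-off can be estimated with simple arithmetic. The sketch below assumes the helper is called once, that its body encodes to 8 bytes (sep, lda #imm8, pha, plb, rep at 2 + 2 + 1 + 1 + 2), that it returns with a 1-byte instruction, and that the call is a 4-byte jsl. It does not count restoring DBR afterward, and net_bytes_saved is a name invented for this illustration.

```c
/* Encoded sizes in bytes; the two store sizes come from the text above. */
enum {
    STORE_LONG_BYTES = 4,               /* sta long $025000,X */
    STORE_ABS_BYTES  = 3,               /* sta $5000,X        */
    /* Assumed one-time cost: helper body + return + one jsl. */
    SWITCH_OVERHEAD_BYTES = 8 + 1 + 4
};

/* Net code bytes saved by switching DBR once and then emitting `stores`
 * absolute stores. A negative result means the long form was smaller. */
int net_bytes_saved(int stores)
{
    return stores * (STORE_LONG_BYTES - STORE_ABS_BYTES) - SWITCH_OVERHEAD_BYTES;
}
```

Under these assumptions the break-even point is 13 stores: fewer than that and the direct 24-bit form is smaller as well as simpler.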
What never needs bank-switching
- Local variables on the stack: stack-relative accesses bypass DBR.
- Direct-page accesses: lda $D0 always reads $00:00D0.
- [dp],Y indirect-long pointers: they carry their own bank byte.
- Function calls: jsl uses PBR + a long destination.
- Pointers in ptr32 mode: every C pointer is 32 bits, so deref'ing any pointer (even one to bank 0) generates DBR-independent code.
Running under GNO/ME
The MAME path above runs your program bare-metal. GNO/ME 2.0.6 is a
Unix-like multitasking environment that runs on top of real GS/OS, and
a llvm816-compiled C (or C++) program can run as a native GNO shell
command — with console stdio, argv, and FILE* file I/O — booted
through GS/OS 6.0.4 in MAME. This is a sibling to the MAME path: a
different way to run the same C, inside a real OS.
This is verified headless and end-to-end. Three steps take you from C source to a running command.
1. Build the base GNO disk (once)
bash tools/gno/buildDisk.sh # -> tools/gno/gnobase.po
This assembles the GNO/ME userland into an 800 KB ProDOS volume. Re-run it only when the GNO archive set changes.
One-time prerequisites. buildDisk.sh needs nulib2 (a system
package: sudo apt-get install nulib2) and tools/cadius/cadius (run
bash scripts/installCadius.sh if it is missing), plus the GNO/ME 2.0.6
.shk archives under tools/gno/dist/. The runner in step 3 also needs
the GS/OS 6.0.4 system disk at tools/gsos/6.0.4 - System.Disk.po and
the same IIgs ROMs the MAME path uses. None of these are installed by
setup.sh today — see INSTALL.md for the full list. You
also need the GNO runtime objects, which bash runtime/build.sh builds
automatically.
2. Compile a C program into a GNO OMF
bash demos/buildGno.sh gnoHello # demos/gnoHello.c -> demos/gnoHello.omf
buildGno.sh takes a single basename (required); it reads
demos/<name>.c and emits demos/<name>.omf (plus .o/.bin/.map/
.reloc sidecars). Bundled examples: gnoHello, gnoCat, gnoFile,
gnoFmt, gnoStdin.
It links the GNO crt0 and runtime, then runs omfEmit --expressload --relocs ... --stack-size 0x4000. Override the DP/Stack size with the
GNO_STACK_SIZE environment variable if needed (default 0x4000).
3. Boot, log in, run, and check
bash scripts/runInGno.sh demos/gnoHello.omf --check 0x025000=C0DE
The runner boots GS/OS 6.0.4 + GNO in headless MAME, logs in as root,
runs your command, then probes memory. gnoHello writes 0xC0DE to
$02:5000 as a harness marker, so a successful run prints:
[llvm816] GNO check OK: 0x025000 = 0xc0de
--check takes ADDR=VALUE pairs (multiple allowed after one
--check). The address uses 0x form (0x025000); the expected
value is bare hex with no prefix (C0DE, not 0xC0DE). The runner
prints the matched value lowercased. Add --snapshots to capture a PNG
of each boot/login/run stage to /tmp/gnosnaps.
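The ADDR=VALUE syntax is easy to get wrong, so here is a sketch of a validator for a single pair. It mirrors the rules as described above (0x-prefixed address, bare-hex value); it is not taken from runInGno.sh, and parse_check_pair is a hypothetical name.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical parser for one ADDR=VALUE pair such as "0x025000=C0DE".
 * Returns 1 and fills *addr / *value on success, 0 on a malformed pair. */
int parse_check_pair(const char *pair, unsigned long *addr, unsigned long *value)
{
    const char *eq = strchr(pair, '=');
    char *end;

    /* Need a non-empty address, an '=', and a non-empty value. */
    if (eq == NULL || eq == pair || eq[1] == '\0')
        return 0;

    /* The address must be written with a 0x prefix. */
    if (pair[0] != '0' || (pair[1] != 'x' && pair[1] != 'X'))
        return 0;
    *addr = strtoul(pair, &end, 16);
    if (end != eq)
        return 0;

    /* The value is bare hex: reject a 0x prefix. */
    if (eq[1] == '0' && (eq[2] == 'x' || eq[2] == 'X'))
        return 0;
    *value = strtoul(eq + 1, &end, 16);
    return *end == '\0';
}
```

"0x025000=C0DE" parses to address 0x025000 and value 0xC0DE, while "0x025000=0xC0DE" and "025000=C0DE" are rejected.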
Things you must know
- The OMF command basename must be ProDOS-legal: no hyphen. Name it testgno, not test-gno, or the command never launches.
- stdio needs libcGno linked. buildGno.sh does this for C. Without it the program runs but prints nothing (the console hooks fall through to a dead sink).
- Console file descriptors follow GNO's convention: stdin=1, stdout=2, stderr=3 (a documented deviation from POSIX 0/1/2).
- Commands that do GS/OS file I/O need the --stack-size DP/Stack OMF segment that buildGno.sh passes (0x4000); the 4 KB default crashes.
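For the first point, a build script could check a command name before producing the OMF. The list above only calls out the hyphen; the sketch below applies the general ProDOS naming rule as commonly documented (1 to 15 characters, a leading letter, then letters, digits or periods), and is_prodos_legal is a name invented here.

```c
#include <ctype.h>
#include <string.h>

/* Returns 1 if name follows the ProDOS file-name rule: 1..15 characters,
 * first character a letter, the rest letters, digits or periods. */
int is_prodos_legal(const char *name)
{
    size_t len = strlen(name);
    size_t i;

    if (len < 1 || len > 15)
        return 0;
    if (!isalpha((unsigned char)name[0]))
        return 0;
    for (i = 1; i < len; i++) {
        unsigned char c = (unsigned char)name[i];
        if (!isalnum(c) && c != '.')
            return 0;
    }
    return 1;
}
```

is_prodos_legal("testgno") succeeds and is_prodos_legal("test-gno") fails, matching the example in the list.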
C++ shell commands
demos/buildGno.sh <name> auto-detects .c vs .cpp and switches to
clang++ -fno-exceptions -fno-rtti for the latter, linking
runtime/libcxxabi.o + libcxxabiSjlj.o so the C++ ABI hooks
(operator new/delete, __cxa_guard_*, __cxa_atexit +
__run_cxa_atexit, RTTI typeinfo, dynamic_cast, SJLJ exception
runtime) resolve. Link-time GC strips whatever isn't used, so a
pure-C .c program pays nothing extra for the additional .o files on
the link line.
Global / static-local dtors run at exit. Each crt0 calls
__run_cxa_atexit after main() returns and before halt/QUIT — the
registered dtor table is walked in LIFO order, so destructors for
file-scope objects and static T x; locals actually execute.
demos/cxxProbe.cpp is the worked example.
ETL containers — the vendored Embedded Template Library at
runtime/include/c++/etl/ provides fixed-capacity STL-style containers
with no malloc and no exceptions. buildGno.sh adds
-I runtime/include/c++ to the compile line, so:
#include "etl/vector.h"
#include "etl/string.h"
#include "etl/map.h"
#include "etl/optional.h"
#include "etl/delegate.h"
static int doubler(int x) { return x * 2; }
int main(void) {
etl::vector<int, 8> v;
for (int i = 1; i <= 5; i++) v.push_back(i);
etl::string<32> s("Hello, ");
s += "ETL";
etl::map<int, int, 8> m;
m[1] = 100;
etl::optional<int> opt = 42;
// etl::delegate is the std::function-equivalent (type-erased callable).
// etl::function is for binding object methods, NOT general callables.
etl::delegate<int(int)> fn = etl::delegate<int(int)>::create<doubler>();
return fn(s.size()); // 20
}
The capacity N in etl::vector<T, N> (and etl::string<N>,
etl::map<K,V,N>, …) is a template parameter, so storage is
in-struct (no heap, no allocator). Pick N like you'd pick the size
of a C array. Same trade-off — too small overflows, too large wastes
BSS. Overflow today silently corrupts past the storage array (no
exceptions, default ETL_ASSERT is a no-op); install a callback via
etl::error_handler::set_callback(...) at startup if you want a halt
on overflow.
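The C-array comparison can be made concrete. The sketch below is a plain C analogue of a fixed-capacity container, not ETL's API: IntVec, int_vec_init and int_vec_push are invented names. It shows the in-struct storage and the kind of capacity check that would stop an append from writing past the array.

```c
#include <stddef.h>

#define INT_VEC_CAPACITY 8

/* C analogue of a fixed-capacity container: the storage array lives
 * inside the struct, so no heap or allocator is involved. */
typedef struct {
    int    data[INT_VEC_CAPACITY];
    size_t size;
} IntVec;

/* Resets the container to empty. */
void int_vec_init(IntVec *v)
{
    v->size = 0;
}

/* Appends a value. Returns 1 on success, or 0 when the container is
 * full, instead of writing past the end of the storage array. */
int int_vec_push(IntVec *v, int value)
{
    if (v->size >= INT_VEC_CAPACITY)
        return 0;
    v->data[v->size++] = value;
    return 1;
}
```

Eight pushes succeed and the ninth returns 0; sizeof(IntVec) stays fixed however many elements are in use, which is the BSS cost of choosing a large capacity.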
The target profile at runtime/include/c++/etl_profile.h sets
ETL_NO_STL, no atomics, no exceptions, no std::ostream — do not
override it in user code. Full container list at
etlcpp.com. demos/etlProbe.cpp exercises
vector + string + map + optional + delegate end-to-end (20 KB total).
For hand-driven builds without buildGno.sh, link libcGno before
libc so its strong console hooks win. See the gno target in
stuff/baztest/Makefile for a worked
recipe.
For the full picture — disk layout, the inline GS/OS QUIT convention,
the double-run/QUIT trap, argv handover, FILE* round-trips, and the
runInGno.sh environment hooks (GNO_STDIN, GNO_ADDFILE,
GNO_RUNCMD, GNO_POLL_FRAMES) — see
tools/gno/README.md.
Worked examples
Recursion + printing
// fib.c
#include <stdio.h>
#include <stdlib.h>
unsigned long fib(unsigned n) {
if (n < 2) return n;
return fib(n-1) + fib(n-2);
}
int main(void) {
char buf[32];
int len = snprintf(buf, sizeof buf, "fib(10) = %lu", fib(10));
// Copy the formatted string into bank-2 RAM at $02:5000 so the
// MAME harness can read it back. Each store goes through a 24-bit
// long-address write — no bank-switching needed.
for (int i = 0; i <= len; i++)
((volatile char *)0x025000)[i] = buf[i];
while (1) {}
}
Build (snprintf needs soft-double + sscanf to link cleanly):
./tools/llvm-mos-build/bin/clang --target=w65816 -O2 \
-I runtime/include -c fib.c -o fib.o
./tools/link816 -o fib.bin --text-base 0x1000 \
runtime/crt0.o runtime/libc.o runtime/libgcc.o \
runtime/snprintf.o runtime/softDouble.o runtime/sscanf.o \
fib.o
bash scripts/runInMame.sh fib.bin --check 0x025000=0066 # 'f' (start of "fib")
Apple IIgs Toolbox
// hello_gs.c
#include <iigs/toolbox.h>
int main(void) {
SysBeep();
while (1) {}
}
Build (note crt0Gsos.o instead of crt0.o — sets up the toolbox
environment):
./tools/llvm-mos-build/bin/clang --target=w65816 -O2 \
-I runtime/include -c hello_gs.c -o hello_gs.o
./tools/link816 -o hello_gs.bin --text-base 0x1000 \
runtime/crt0Gsos.o runtime/iigsToolbox.o runtime/iigsGsos.o \
runtime/libgcc.o hello_gs.o
Programs that call the toolbox usually run under real GS/OS rather than
in the headless harness. See demos/launch.sh and demos/build.sh
for a working pipeline.
Advanced: pointer-deref code generation
The W65816 backend treats every pointer as 32-bit (p:32:16 datalayout
— sizeof(void *) == 4 from the C compiler's perspective). The high
two bytes carry the bank byte plus a pad byte; the low two carry the
in-bank offset. This lets a single C pointer reach any byte in the
IIgs's 24-bit address space.
A pointer dereference has to read up to 24 bits of address to know
which bank to touch. The CPU's [dp],Y (indirect-long-Y, opcode
0xB7) reads a 24-bit pointer from a direct-page slot and uses it as
the effective address — three bytes wide, bank byte explicit. This
is the safe default path and it works regardless of where the
target memory lives.
There are two optimizations layered on top of the default path. One is always on and safe. The other is opt-in via a flag and needs care.
Layer 1: constant-offset peeling (default on, always safe)
When you write s->c for a struct field at offset 4, the natural
code is "compute s + 4, then deref". Layer 1 recognizes that
[dp],Y already has a Y register that's added to the 24-bit pointer
on the deref — so instead of computing s + 4 first, the backend
stages the base pointer at $E0..$E2 and loads Y = #4 for the
deref. Saves three instructions per struct-field access (the
clc; adc #4; ...; adc #0 carry chain).
A consecutive-access CSE peephole shares the $E0/$E2 staging
between adjacent derefs of the same base, so s->a + s->b + s->c + s->d stages once and emits four ldy #K; lda [$E0],Y pairs.
There's nothing to enable or disable. On its own this was a roughly
1% code-size reduction across the Lua build. It's always on because
it's structurally equivalent to the un-optimized code: the same 24-bit
deref, just with the offset folded into Y instead of pre-added to the
pointer.
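The equivalence can be modeled on the host. In this sketch, addr_preadded follows the carry-chain lowering and addr_peeled follows the folded-into-Y lowering; both names and the Sample struct are invented for illustration, and the model assumes the indexed add carries into the bank byte, as the equivalence described above implies.

```c
#include <stddef.h>
#include <stdint.h>

/* Example struct: with 16-bit fields, c sits at byte offset 4. */
typedef struct {
    uint16_t a;
    uint16_t b;
    uint16_t c;
    uint16_t d;
} Sample;

/* Models the naive lowering: add the field offset to the pointer with a
 * carry chain, then dereference with Y = 0. */
uint32_t addr_preadded(uint32_t base, uint16_t field_offset)
{
    uint32_t low  = (base & 0xFFFFu) + field_offset;       /* clc; adc #offset   */
    uint32_t bank = ((base >> 16) & 0xFFu) + (low >> 16);  /* adc #0 on the bank */
    return ((bank & 0xFFu) << 16) | (low & 0xFFFFu);
}

/* Models the peeled lowering: keep the base pointer unchanged and let
 * the indexed addressing mode add Y during the dereference. */
uint32_t addr_peeled(uint32_t base, uint16_t y)
{
    return (base + y) & 0xFFFFFFu;
}
```

For a Sample at $02:5000, both functions give 0x025004 for field c, and they also agree when the add crosses a bank boundary (base 0x02FFFE, offset 4).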
Layer 2: -mllvm -w65816-dbr-safe-ptrs (opt-in, unsafe if misused)
The default [dp],Y deref needs three bytes of staging at $E0..$E2
because it reads a 24-bit pointer. Calypsi uses lda (d,S),Y
(opcode 0xB3, stack-rel-indirect-Y) for the same effect in ONE
instruction — but that opcode reads only 16 bits of pointer.
The bank byte is implicit DBR.
When you pass -mllvm -w65816-dbr-safe-ptrs, our backend uses the
same one-instruction path: it spills only the low 16 bits of the
pointer to a stack slot, sets Y to the offset, and emits
lda (slot,S),Y (or sta (slot,S),Y). Bank byte = whatever DBR
holds at runtime.
Per-deref cost drops from ~5 instructions to 1. Lua 5.1.5 shrinks by 20.6% with the flag on.
This is correct only when every pointer dereferenced in the TU points to memory inside DBR's current bank. Some examples:
| Pointer | Bank? | Safe with the flag? |
|---|---|---|
| malloc() result | DBR's bank (crt0 sets DBR to load bank; malloc allocates from BSS heap there) | Yes |
| Global variable address | DBR's bank (linker puts globals in the load segment) | Yes |
| &local_array[i] in a stack frame | Bank 0 (stack is always bank 0) | Yes IF DBR is 0 (typical) |
| Pointer returned by GS/OS Loader | The Loader's bank (might differ from yours) | No: would miscompile |
| Pointer cast from a 0x010000+addr integer literal in bank 1 | Bank 1 | No if DBR is not bank 1 |
| &ROMVECTORS[0] from iigs/-style headers | Various IIgs system banks | No in general |
For Lua, Picol, plain C programs that allocate via malloc and
operate on globals, this flag is safe. For GS/OS demos that interact
with Loader-returned segments or system memory, it would miscompile.
Default is off. Opt in per-TU:
clang --target=w65816 -O2 -mllvm -w65816-dbr-safe-ptrs -c hot.c -o hot.o
If you set the flag and your code does dereference cross-bank pointers, the symptom is silent wrong-address reads — typically a read from the same in-bank offset but in DBR's bank instead of the intended one. No abort, no diagnostic.
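That failure mode can be modeled in a few lines. The sketch below ignores the Y offset (treats it as 0) and uses invented names; it only illustrates why a pointer outside DBR's bank resolves to the wrong cell when the bank byte is dropped.

```c
#include <stdint.h>

/* Address the program meant to touch: the full 24-bit pointer. */
uint32_t intended_addr(uint32_t ptr)
{
    return ptr & 0xFFFFFFu;
}

/* Address touched with the flag on: only the low 16 bits of the pointer
 * are used, and the bank comes from DBR. */
uint32_t dbr_relative_addr(uint32_t ptr, uint8_t dbr)
{
    return ((uint32_t)dbr << 16) | (ptr & 0xFFFFu);
}

/* 1 when dereferencing ptr under the flag reaches the intended cell. */
int deref_is_safe(uint32_t ptr, uint8_t dbr)
{
    return intended_addr(ptr) == dbr_relative_addr(ptr, dbr);
}
```

With DBR = $00, a pointer to $02:5000 resolves to $00:5000: the same offset in the wrong bank. With DBR = $02 the two addresses coincide and the dereference is safe.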
Mixing safely: the flag is per-TU. You can compile your hot
struct-heavy code with the flag and your bank-aware code without.
The two .o files link cleanly together. Per-function or
per-parameter control isn't supported yet.
When the slot offset overflows 8 bits
lda (d,S),Y has an 8-bit d field — max slot offset 255 from SP.
If the function's frame is large enough that the spill slot exceeds
that, PEI emits a fallback sequence that long-indirects the slot via
[$F6],Y (the function's frame-pointer), then stages at $E0..$E2
and derefs via [$E0],Y. This is ~8 instructions — worse than the
plain [dp],Y path the flag was meant to replace. Functions that
hit this need usesDpFP=true (set automatically for large frames);
otherwise PEI emits a fatal error. In practice you'll only see this
on functions with hundreds of local variables.
Inline-threshold tuning (default lowered to 50)
LLVM's default inline-cost threshold is 225, tuned for desktop CPUs
where call overhead is high relative to the size of the inlined body.
On W65816 a jsl long:foo is just 4 bytes / ~8 cycles, but every
inlined pointer dereference expands to multiple instructions even
with Layer 2. Aggressive inlining bloats code without commensurate
cycle wins.
The W65816 backend lowers the default to 50. Calibration:
| Threshold | Lua size | CoreMark size | Cycle benches |
|---|---|---|---|
| 225 (LLVM stock) | 1.47× Calypsi | (not measured) | baseline |
| 75 | 1.16× | 0.87× | identical |
| 50 (current) | 1.13× | 0.79× | identical |
| 25 | 1.11× | 0.79× | identical |
At 225, Lua's index2adr (a multi-branch helper called 41 times in
lapi.c) was inlined into every API entry, adding ~2 KB per file —
and CoreMark's matrix_test was 17× Calypsi because the inliner
copied 5 nested-loop helpers into it. At 50, both regressions vanish
and the cycle benchmarks are unchanged.
To override (e.g. on size-sensitive ROMs or speed-critical loops):
# Force aggressive inlining (back to LLVM default)
clang --target=w65816 -O2 -mllvm -inline-threshold=225 -c file.c -o file.o
# Force MORE conservative inlining
clang --target=w65816 -O2 -mllvm -inline-threshold=10 -c file.c -o file.o
A function marked __attribute__((always_inline)) is always inlined
regardless of threshold. A function marked __attribute__((noinline))
is never inlined. Use these to override the global threshold for
specific cases.
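As a small illustration of the two attributes, the sketch below marks one helper always_inline and another noinline. The function names are made up for this example.

```c
/* Always inlined, whatever -inline-threshold is set to. */
static inline __attribute__((always_inline)) int square(int x)
{
    return x * x;
}

/* Never inlined: each caller pays one call and the body exists once. */
__attribute__((noinline)) int clamp_to_byte(int x)
{
    if (x < 0)
        return 0;
    if (x > 255)
        return 255;
    return x;
}

/* Uses both helpers; only square() is folded into this function. */
int squared_and_clamped(int x)
{
    return clamp_to_byte(square(x));
}
```

Marking a multi-branch helper noinline in this way is the per-function alternative to lowering the global threshold.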
Summary: which options to use when
| Goal | Compile flag |
|---|---|
| Smallest, safest binary (default) | clang --target=w65816 -O2 ... — Layer 1 is on, Layer 2 is off, threshold=50 |
| Smallest binary for code that touches only same-bank memory | Add -mllvm -w65816-dbr-safe-ptrs |
| Fastest possible code (size be damned) | Add -mllvm -inline-threshold=500 |
| Reproduce LLVM's stock inlining behavior | Add -mllvm -inline-threshold=225 |
| Manual control over individual inlining decisions | Mark hot helpers __attribute__((noinline)) explicitly |
Inline assembly
The W65816 backend supports __asm__ with operand constraints
"a", "x", "y":
unsigned short addOne(unsigned short x) {
unsigned short r;
__asm__("inc a" : "=a"(r) : "a"(x));
return r;
}
Multi-instruction asm and raw bytes both work:
__asm__ volatile (
"sep #0x20\n"
".byte 0x68\n" // pla
"rep #0x20\n"
);
The .byte form is needed when llvm-mc can't yet parse an opcode
literally (some 65816 addressing modes still have gaps in the
assembler). Hand-encoding is a stopgap; report opcodes that need it.
Tools reference
| Tool | Location | Purpose |
|---|---|---|
| clang | tools/llvm-mos-build/bin/clang | C / C++ compiler |
| clang++ | tools/llvm-mos-build/bin/clang++ | C++ driver |
| llc | tools/llvm-mos-build/bin/llc | Standalone codegen (.ll → .s) |
| llvm-mc | tools/llvm-mos-build/bin/llvm-mc | Assembler |
| llvm-objdump | tools/llvm-mos-build/bin/llvm-objdump | Disassembler |
| link816 | tools/link816 | Our relocating linker |
| omfEmit | tools/omfEmit | Emit OMF v2.1 binary from link816 output |
| mame | system apt install | Apple IIgs emulator |
Debugging
Look at the asm
./tools/llvm-mos-build/bin/clang --target=w65816 -O2 -S -o prog.s prog.c
cat prog.s
Look at the MIR after each backend pass
./tools/llvm-mos-build/bin/clang --target=w65816 -O2 \
-mllvm -print-after-all -S prog.c 2>&1 | less
Useful pass names to filter on:
| Pass name | What it does |
|---|---|
| w65816-isel | SDAG → MachineInstr selection |
| w65816-widen-acc16 | Promote Acc16 vregs to Wide16 (regalloc help) |
| w65816-stack-slot-cleanup | Remove redundant spill/reload |
| w65816-stackrel-to-img | Promote hot stack slots to DP IMG slots |
| w65816-stack-slot-merge | Collapse PHI src/dst slot pairs |
| w65816-branch-expand | Long-distance Bxx → INV_Bxx skip; BRA |
Single-pass filter
./tools/llvm-mos-build/bin/clang --target=w65816 -O2 \
-mllvm -print-after=w65816-isel \
-mllvm -filter-print-funcs=myfunc \
-S prog.c 2>&1 | less
Disassemble an object file
./tools/llvm-mos-build/bin/llvm-objdump --triple=w65816 -d hello.o
ELF e_machine value
W65816 .o files use EM_W65816 = 0xFF16 in the ELF header.
The value sits in the 0xFF00-0xFFFF range reserved by the ELF spec for
vendor-private / experimental targets, so no official e_machine registration is required.
The 16 suffix is a mnemonic for "65816". (The natural choice, 65816
itself = 0x10118, does not fit the 16-bit Elf32_Half e_machine
field.)
Why this matters:
- llvm-dwarfdump, readelf, and other generic ELF consumers used to warn on every invocation because the file claimed EM_NONE (= no machine). Setting a real EM_ value silences the warning while still preventing a host-architecture .o from being accidentally linked.
- link816 validates e_machine and rejects anything that isn't EM_W65816 (with EM_NONE still accepted for backwards compatibility with any pre-Phase-1.13 object files lingering in a build tree).
- The relocation numbers R_W65816_* are unique under EM_W65816, so they're free to stay at the small stable integers 1-8 (see src/llvm/lib/Target/W65816/MCTargetDesc/W65816ELFObjectWriter.cpp).
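The acceptance rule described above can be sketched as a small check on the first bytes of an object file. This is an illustration only, not the code in link816.cpp: the function name machineIsAcceptable is hypothetical, and the sketch assumes a little-endian ELF32 header, where e_machine is the 16-bit field at byte offset 18.

```c
#include <stddef.h>
#include <stdint.h>

#define EM_NONE   0x0000u
#define EM_W65816 0xFF16u

/* 65816 itself does not fit a 16-bit e_machine field, hence 0xFF16. */
_Static_assert(65816 > 0xFFFF, "65816 (0x10118) overflows Elf32_Half");
_Static_assert(EM_W65816 <= 0xFFFF, "EM_W65816 must fit Elf32_Half");

/* Returns 1 if hdr looks like an ELF header whose e_machine is EM_W65816
   (or the legacy EM_NONE), 0 otherwise.
   Assumption: little-endian ELF32 layout, e_machine at offset 18. */
int machineIsAcceptable(const uint8_t *hdr, size_t len) {
    if (len < 20) {
        return 0;                       /* too short to hold e_machine */
    }
    if (hdr[0] != 0x7F || hdr[1] != 'E' || hdr[2] != 'L' || hdr[3] != 'F') {
        return 0;                       /* not an ELF file at all */
    }
    uint16_t machine = (uint16_t)(hdr[18] | (hdr[19] << 8));
    return machine == EM_W65816 || machine == EM_NONE;
}
```

A host x86-64 object (e_machine 0x003E) fails this check, which is the accidental-link protection the list above describes.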
Touchpoints if you ever need to change the value:
| File | What it does |
|---|---|
| tools/llvm-mos/llvm/include/llvm/BinaryFormat/ELF.h | Defines EM_W65816 enumerator |
| src/llvm/lib/Target/W65816/MCTargetDesc/W65816ELFObjectWriter.cpp | Passes value to MCELFObjectTargetWriter |
| src/link816/link816.cpp | Validates value on input |
Cycle-count benchmarks
Microbenchmarks live under benchmarks/ — integer/
string micro-benches plus soft-double FP benches.
W65816_CC_EXTRA="-mllvm -w65816-dbr-safe-ptrs" bash scripts/benchCyclesPrecise.sh
This measures per-call cycle counts via MAME's emu.time() between
markers — apples-to-apples vs the matching
scripts/benchCyclesCalypsi.sh runner (commercial Calypsi 5.16).
Current ratios (2026-05-27, Layer 2):
| Benchmark | Ours | Calypsi | Ratio |
|--------------|------:|--------:|------:|
| dotProduct | 1534 | 5712 | 0.27× |
| bsearch | 682 | 2387 | 0.29× |
| sumOfSquares | 6820 | 16368 | 0.42× |
| bubbleSort | 11594 | 17050 | 0.68× |
| strLen | 767 | 1023 | 0.75× |
| djb2Hash | 2046 | 2643 | 0.77× |
| popcount | 1194 | 1534 | 0.78× |
| strcpy | 1108 | 1194 | 0.93× |
| memcmp | 682 | 716 | 0.95× |
| fib | 11594 | 10912 | 1.06× |
Geomean: 0.62× Calypsi. 9 of 10 below 1.0×. The Layer 2 flag
(-mllvm -w65816-dbr-safe-ptrs) enables stack-rel-indirect-Y ptr32
derefs — required for parity since Calypsi's pointer ABI assumes
DBR matches the pointer's bank.
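For readers who want to recompute the summary line from their own runs, here is a minimal host-side sketch of the arithmetic. It is not part of the benchmark scripts, and the function names are hypothetical. The geometric mean is the n-th root of the product of the per-benchmark ratios; the sketch finds that root by bisection so it needs no libm.

```c
#include <stddef.h>

/* Product of ours[i] / theirs[i] over all rows. Assumes theirs[i] != 0. */
static double ratioProduct(const unsigned *ours, const unsigned *theirs,
                           size_t n) {
    double product = 1.0;
    for (size_t i = 0; i < n; i++) {
        product *= (double)ours[i] / (double)theirs[i];
    }
    return product;
}

/* n-th root of a positive value by bisection. Assumes n > 0. */
static double nthRoot(double value, size_t n) {
    double lo = 0.0;
    double hi = value > 1.0 ? value : 1.0;  /* the root never exceeds this */
    for (int iter = 0; iter < 200; iter++) {
        double mid = (lo + hi) / 2.0;
        double power = 1.0;
        for (size_t i = 0; i < n; i++) {
            power *= mid;
        }
        if (power < value) {
            lo = mid;
        } else {
            hi = mid;
        }
    }
    return (lo + hi) / 2.0;
}

/* Geometric mean of the cycle-count ratios (ours / theirs). */
double geomeanRatio(const unsigned *ours, const unsigned *theirs, size_t n) {
    return nthRoot(ratioProduct(ours, theirs, n), n);
}

/* Number of benchmarks where our cycle count beats theirs. */
size_t countBelowOne(const unsigned *ours, const unsigned *theirs, size_t n) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (ours[i] < theirs[i]) {
            count++;
        }
    }
    return count;
}
```

Pass the Ours and Calypsi columns as the two arrays. A geometric mean is used rather than an arithmetic one so that no single large ratio dominates the summary.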
The HBL-tick-based scripts/benchCycles.sh script is still available
but has lower resolution. Prefer the Precise runner above.
The compare/ directory has side-by-side .s files vs
Calypsi 5.16 for sumSquares, evalAt, and mul16to32. Rerun with:
bash compare/regen.sh
UndefinedBehaviorSanitizer (UBSan, minimal runtime)
The W65816 target ships a hand-rolled minimal UBSan runtime
(runtime/ubsan.o). No driver-side magic: pass the flags and link
the runtime object explicitly.
# Compile with UBSan-min instrumentation.
./tools/llvm-mos-build/bin/clang --target=w65816 -O2 \
-fsanitize=undefined -fsanitize-minimal-runtime \
-ffunction-sections -I runtime/include \
-c prog.c -o prog.o
# Link, including runtime/ubsan.o so the 25 __ubsan_handle_*_minimal
# symbols clang emits calls to resolve cleanly. libgcc.o is needed
# whenever you exercise i16 div / i32 multiply / shift-by-N.
./tools/link816 -o prog.bin --text-base 0x1000 --bss-base 0xA000 \
runtime/crt0.o prog.o runtime/ubsan.o runtime/libgcc.o
What's covered (25 of the 25 handlers upstream's minimal runtime emits):
type-mismatch shift-out-of-bounds invalid-objc-cast
alignment-assumption out-of-bounds function-type-mismatch
add-overflow local-out-of-bounds implicit-conversion
sub-overflow builtin-unreachable (*) nonnull-arg
mul-overflow missing-return (*) nonnull-return
negate-overflow vla-bound-not-positive nullability-arg
divrem-overflow float-cast-overflow nullability-return
load-invalid-value pointer-overflow
invalid-builtin cfi-check-fail
(*) recovering-only — no _abort variant emitted upstream.
When a UB site fires, the runtime calls a per-kind handler that:
- Looks up the caller PC in a 20-entry dedup table (single-threaded, no atomics).
- If first-seen, emits one line via the existing __putByteErr hook (GNO fd 3 / stderr) in the format ubsan: <kind> by 0x<8-hex>\n.
- The recover variant returns; the _abort variant calls __builtin_trap() which lowers to BRK_pseudo + sentinel 0xBE @ $70 - tight-loop spin.
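The dedup-then-report behavior can be sketched on the host as follows. This is an illustration under stated assumptions, not the code in the runtime: the names firstSeen and ubsanReport are hypothetical, the line is written into a caller-supplied buffer instead of going through __putByteErr, the hex digits are assumed to be lowercase, and a full table is assumed to keep reporting rather than suppress.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define DEDUP_SLOTS 20

static uint32_t seenPcs[DEDUP_SLOTS];   /* caller PCs already reported */
static unsigned seenCount;

/* Returns 1 the first time a PC is seen, 0 on repeats.
   Single-threaded by design, so a plain linear scan is enough. */
static int firstSeen(uint32_t pc) {
    for (unsigned i = 0; i < seenCount; i++) {
        if (seenPcs[i] == pc) {
            return 0;
        }
    }
    if (seenCount < DEDUP_SLOTS) {
        seenPcs[seenCount++] = pc;
    }
    return 1;   /* assumption: once the table is full, still report */
}

/* Formats "ubsan: <kind> by 0x<8-hex>\n" into out for a first-seen PC.
   Returns the line length, or 0 if this PC was already reported. */
int ubsanReport(const char *kind, uint32_t pc, char *out, size_t outSize) {
    if (!firstSeen(pc)) {
        return 0;
    }
    return snprintf(out, outSize, "ubsan: %s by 0x%08lx\n",
                    kind, (unsigned long)pc);
}
```

Keying the table on the caller PC means a UB site inside a loop produces one line rather than one per iteration.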
ASan is out of scope — the 8:1 shadow-memory model would need ~2 MB of shadow for the 16 MB 65816 address space, while most IIgs programs run in one or two banks.
End-to-end smoke probe:
bash tests/ubsan/runUbsanProbe.sh
Exercises add-overflow + shift-out-of-bounds + divide-by-zero,
verifies each handler fires and execution recovers past the UB site
(sentinels at $025000..$025006). Wired into scripts/smokeTest.sh
as the Phase 6.2 stage; skip it with SMOKE_SKIP_UBSAN=1.
The probe deliberately overrides three handlers with strong defs that
record their firing in a state byte rather than printing — that lets
the test verify the call edge without pulling libc.o (and the
attached snprintf.o) into a smoke probe that doesn't need console
I/O. A diagnostic-format smoke (asserting on the ubsan: ...\n line)
is a follow-up under the cxxsmoke GNO MAME harness.
Known limitations
- C++ exceptions are not implemented for DWARF unwinding. try/catch compiles but doesn't unwind. -fsjlj-exceptions works for limited SJLJ-style throwing.
- stdin always returns EOF. scanf compiles but isn't useful. Use sscanf on a buffer instead.
- File I/O through fopen requires a backing implementation. The default mfs backing (memory-file-system) lets you simulate files via mfsRegister(), which is useful for tests but not for real disk I/O. GS/OS file I/O works via runtime/iigsGsos.o if you link against the GS/OS runtime.
- fork / exec: not applicable on a 65816, no support.
- Code generation gotcha: very large stack frames (>200 bytes) trigger FP-relative addressing. Most programs fit under that limit. See the frame-rel discussion in LLVM_65816_DESIGN.md.
- Three Lua functions (luaV_execute, symbexec, auxsort) hit the greedy register allocator's complexity budget. Workaround: compile those TUs with -mllvm -regalloc=basic. Documented in tests/lua/README.md.
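For the stdin limitation above, a minimal sketch of the sscanf-on-a-buffer workaround looks like this. The function name parseSize and the "WIDTHxHEIGHT" input format are hypothetical, chosen only for illustration; where the text in the buffer comes from is up to the program.

```c
#include <stdio.h>

/* Parses a string such as "320x200" that is already in memory.
   Returns 1 if both numbers were read, 0 otherwise. */
int parseSize(const char *text, int *width, int *height) {
    /* sscanf returns the number of fields converted, so 2 means success. */
    return sscanf(text, "%dx%d", width, height) == 2;
}
```

Checking the sscanf return value matters here, because on malformed input the output variables are left unmodified.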
Where to go next
- Building real GS/OS apps: see docs/multiSegmentPlan.md and the demos/launch.sh script for booting through real GS/OS 6.0.2 in MAME. The 9 demos under demos/ are reasonable starting points.
- Running as a GNO/ME shell command: see Running under GNO/ME above, tools/gno/README.md, and the demos/gno*.c examples.
- Backend internals (you're hacking on the compiler): LLVM_65816_DESIGN.md.
- Smoke tests: scripts/smokeTest.sh runs ~150 end-to-end checks. Read it for examples of every feature in action.
- Cycle-bench a Lua port or other real-world C: see tests/lua/README.md for the recipe (vendoring + per-file regalloc tuning + libc stubs).