65816-llvm-mos/docs/USAGE.md
Scott Duensing da095402ec Updated
2026-06-02 23:17:57 -05:00


# Using llvm816

This document covers compiling a C program, linking it into an Apple
IIgs binary, and running it under MAME. It assumes you've followed
[INSTALL.md](INSTALL.md) and the install completed successfully.

If you've never used **clang** or a similar C compiler before, start
with [Quick orientation](#quick-orientation), which explains the moving
parts. If you already know what clang is, jump to
[Your first program](#your-first-program).

---
## Quick orientation
### What is clang?
Clang is a C / C++ compiler — the program that turns your `.c` source
file into machine code an actual CPU can execute. It's part of the
LLVM project and is the default C compiler on macOS and on most modern
Linux distributions. If you've used `gcc` before, clang takes nearly
the same command-line flags.
A normal install of clang produces code for the machine it's running on
— x86-64 if you're on a typical Linux PC. Clang has a **cross-compiler
mode**: pass `--target=<arch>` to make it emit code for a *different*
CPU. The W65816 (the Apple IIgs CPU) is one of the architectures we've
added to a fork of clang that ships with this project.
### What gets installed where
After `./setup.sh` completes, the project tree under your `llvm816/`
checkout looks roughly like this:
```
llvm816/ ← repo root; everything is contained here
├── docs/ ← this directory
├── runtime/ ← C standard library + startup code
│ ├── build.sh ← script that builds the runtime .o files
│ ├── include/ ← header files (<stdio.h>, etc.)
│ │ ├── stdio.h
│ │ ├── string.h
│ │ ├── ...
│ │ └── iigs/ ← Apple IIgs-specific headers
│ │ ├── toolbox.h ← ~1300 toolbox routine wrappers
│ │ ├── gsos.h
│ │ └── desktop.h
│ ├── src/ ← sources for the runtime (.c and .s)
│ └── *.o ← compiled runtime objects (after build)
├── scripts/ ← driver scripts
│ ├── runInMame.sh ← run a binary in MAME and check memory
│ ├── benchCycles.sh ← cycle-count benchmarks
│ └── smokeTest.sh ← ~150 end-to-end correctness checks
├── src/ ← OUR backend source (you compile from here)
├── tools/ ← installed tools (~7 GB total)
│ ├── llvm-mos/ ← LLVM source tree (~5 GB)
│ ├── llvm-mos-build/ ← built artifacts (~1.4 GB)
│ │ └── bin/
│ │ ├── clang ← THE COMPILER YOU USE
│ │ ├── clang++ ← same, for C++
│ │ ├── llc ← standalone IR → asm converter
│ │ ├── llvm-mc ← standalone assembler
│ │ ├── llvm-objdump ← disassembler
│ │ └── ...
│ ├── llvm-mos-sdk/ ← prebuilt llvm-mos SDK (~400 MB, mostly unused)
│ ├── link816 ← OUR LINKER (single binary, ~120 KB)
│ ├── omfEmit ← turns flat binary → Apple IIgs OMF v2.1
│ ├── mame/ ← Apple IIgs ROMs for MAME
│ ├── gsos/ ← GS/OS 6.0.2 / 6.0.4 disk images
│ ├── calypsi/ ← reference compiler for comparison (~580 MB)
│ └── orca-c/ ← reference compiler (header sources)
├── demos/ ← example IIgs programs
├── benchmarks/ ← cycle-count benchmarks
├── compare/ ← side-by-side ours-vs-Calypsi assembly
└── setup.sh ← one-shot installer
```
The two files you'll use most often:
| File | Purpose |
|---|---|
| **`tools/llvm-mos-build/bin/clang`** | The compiler. Pass `--target=w65816` to make it emit Apple IIgs code |
| **`tools/link816`** | The linker. Takes `.o` files and produces a flat binary the IIgs can load |
Nothing is installed into `/usr/local`, `/opt`, or anywhere else on
your system — the entire toolchain lives under your `llvm816/` checkout.
To uninstall, delete the directory.
### What about the system's `/usr/bin/clang`?
If your distribution provides a clang (most do), that's a *different*
clang for *your machine's* CPU. It does **not** know about the W65816
target. When following this document, always use the full path
`./tools/llvm-mos-build/bin/clang` (or set an alias / `$PATH` — see
[Setting up your environment](#setting-up-your-environment)).
### What the build process produces
When you compile a C file for the IIgs, the flow looks like this:
```
hello.c
│ clang --target=w65816 (cross-compile to 65816 machine code)
hello.o (relocatable ELF object file)
│ + crt0.o + libc.o + libgcc.o (runtime libraries you link in)
│ link816 (our relocating linker)
hello.bin (flat binary, loadable at $00:1000)
│ optionally: omfEmit hello.bin → hello.omf (for GS/OS Loader)
│ scripts/runInMame.sh hello.bin
runs in MAME's emulated Apple IIgs
```
Three stages:
1. **Compile**: clang turns `.c` into `.o`
2. **Link**: `link816` combines `.o` files + runtime libraries into a binary
3. **Run**: MAME boots an emulated IIgs and executes the binary
---
## Setting up your environment
To save typing, you can put the tools on your `$PATH` or define shell
aliases. The rest of this document spells out each tool's path from the
repo root so the examples work without any setup, but in practice you'll
want shortcuts.
### Option A: edit `$PATH` (recommended)
Add this to `~/.bashrc` (or `~/.zshrc`) so our tools are on your path:
```bash
export LLVM816_ROOT=$HOME/path/to/llvm816
export PATH="$LLVM816_ROOT/tools/llvm-mos-build/bin:$LLVM816_ROOT/tools:$PATH"
```
Then `source ~/.bashrc` (or restart your shell). After that you can
just type `clang --target=w65816 ...` without the path prefix.
> **Careful:** putting `tools/llvm-mos-build/bin` first on `$PATH` means
> *all* `clang` invocations in that shell go to our build, not the
> system clang. Ours still works for your machine's native target
> too (it's a multi-arch clang), but if you also need your distro's
> version, prefer Option B.
### Option B: shell aliases
In `~/.bashrc`:
```bash
LLVM816_ROOT=$HOME/path/to/llvm816
alias w65clang="$LLVM816_ROOT/tools/llvm-mos-build/bin/clang --target=w65816 -I $LLVM816_ROOT/runtime/include"
alias link816="$LLVM816_ROOT/tools/link816"
```
Then:
```bash
w65clang -O2 -c hello.c -o hello.o
link816 -o hello.bin --text-base 0x1000 ...
```
### Option C: nothing — just use full paths
Every example in this document spells out the full path, so this works
too. Verbose, but unambiguous.

---
## Your first program
Let's compile, link, and run a tiny program. Open a terminal in your
`llvm816/` checkout directory.
### 1. Write the source
Create `hello.c`:
```c
// hello.c — the smallest meaningful Apple IIgs program.
//
// Writes a value to bank-2 RAM at $02:5000, then halts. The MAME
// harness reads that memory cell to verify the result.
int main(void) {
    int x = 6 * 7;
    // Write directly to the 24-bit absolute address $02:5000. With
    // ptr32 mode (our default), constant pointers to >16-bit addresses
    // lower to `sta long $025000`, so no bank-switching is needed.
    *(volatile int *)0x025000 = x;
    while (1) {}  // halt; the harness reads memory + exits
    return 0;
}
```
### 2. Compile to a `.o` file
```bash
./tools/llvm-mos-build/bin/clang \
    --target=w65816 \
    -O2 \
    -I runtime/include \
    -c hello.c \
    -o hello.o
```
What each flag does:
| Flag | Meaning |
|---|---|
| `--target=w65816` | **Required.** Tells clang to emit W65816 machine code instead of the host CPU's code. |
| `-O2` | Optimization level. `-O2` is recommended; `-O0` works but produces 3-5× larger code. |
| `-I runtime/include` | Look for `<stdio.h>` etc. in our runtime headers. |
| `-c` | Compile only — produce a `.o`, don't link. |
| `-o hello.o` | Write the object to `hello.o`. |
If the command succeeds, you'll have a `hello.o` next to your `hello.c`.
You can inspect it:
```bash
./tools/llvm-mos-build/bin/llvm-objdump --triple=w65816 -d hello.o | head -40
```
### 3. Link to a flat binary
```bash
./tools/link816 \
    -o hello.bin \
    --text-base 0x1000 \
    runtime/crt0.o \
    runtime/libc.o \
    runtime/libgcc.o \
    hello.o
```
Each argument:
| Argument | Why |
|---|---|
| `-o hello.bin` | Output file. |
| `--text-base 0x1000` | Where the code goes in memory. `0x1000` is conventional (first 4 KB of bank 0 is reserved for stack + zero page). |
| `runtime/crt0.o` | **Must come first.** The C runtime startup — sets up the stack, calls `main`, halts cleanly on return. |
| `runtime/libc.o` | Core C library (`printf`, `malloc`, `strlen`, etc.). |
| `runtime/libgcc.o` | Compiler-provided helpers for things the 65816 can't do natively (16×16 multiply, 32-bit divide, etc.). Required for almost every program. |
| `hello.o` | Your code. |
`link816` will print something like:
```
linked: text=[0x1000+128] rodata=[0x1080+0] bss=[0x1100+8] -> hello.bin
```
That tells you the code is 128 bytes, no read-only data, 8 bytes of BSS.
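If you want to read that summary line from a script or test harness, the arithmetic is just `base + size` per region. The following is a minimal host-side sketch, not part of the toolchain: `Region`, `LinkSummary`, `parseLinkSummary`, and `regionEnd` are hypothetical names, and the line format is assumed from the sample above (the real `link816` output may differ in detail).

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical types for illustration; not part of link816. */
typedef struct {
    unsigned base;  /* start address of the region */
    unsigned size;  /* length in bytes */
} Region;

typedef struct {
    Region text, rodata, bss;
} LinkSummary;

/* Parse a line shaped like the sample above. Returns 1 on success.
 * The format string is an assumption taken from that one sample. */
int parseLinkSummary(const char *line, LinkSummary *out) {
    assert(line != NULL && out != NULL);
    int n = sscanf(line, "linked: text=[%x+%u] rodata=[%x+%u] bss=[%x+%u]",
                   &out->text.base, &out->text.size,
                   &out->rodata.base, &out->rodata.size,
                   &out->bss.base, &out->bss.size);
    return n == 6;
}

/* First address past the end of a region. */
unsigned regionEnd(Region r) {
    return r.base + r.size;
}
```

For the sample line, `regionEnd` of the text region is `0x1080`, which is exactly where rodata starts.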
### 4. Run it in MAME
```bash
bash scripts/runInMame.sh hello.bin --check 0x025000=002a
```
`0x002a` is hexadecimal for 42 (= 6 × 7), and `0x025000` is the
24-bit address `bank $02 + offset $5000` — where your program wrote
`x`. The script boots MAME's emulated Apple IIgs, loads your binary
at `$00:1000`, runs for 5 seconds, reads memory at `$02:5000`, and
compares to the expected value.
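The bank/offset split is plain bit arithmetic: the bank is bits 16..23 and the offset is the low 16 bits. Here is a small host-side sketch of that split; `farBank`, `farOffset`, and `farAddr` are hypothetical helper names for illustration, not functions from the runtime.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers for illustration; not part of the llvm816 runtime. */

/* Bank byte: bits 16..23 of a 24-bit address. */
static inline uint8_t farBank(uint32_t addr) {
    assert(addr <= 0xFFFFFFu);  /* the 65816 has 24 address bits */
    return (uint8_t)(addr >> 16);
}

/* In-bank offset: the low 16 bits. */
static inline uint16_t farOffset(uint32_t addr) {
    assert(addr <= 0xFFFFFFu);
    return (uint16_t)(addr & 0xFFFFu);
}

/* Rebuild the 24-bit address from bank and offset. */
static inline uint32_t farAddr(uint8_t bank, uint16_t offset) {
    return ((uint32_t)bank << 16) | offset;
}
```

With these, `farBank(0x025000)` is `0x02` and `farOffset(0x025000)` is `0x5000`, while a plain `0x5000` has bank `0x00`.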
A pass looks like:
```
MAME-LOADED bytes=128
MAME-READ addr=0x025000 val=0x002a
[llvm816] MAME OK: 1 reads matched
```
If you get `MAME mismatch`, your program wrote a different value (or
no value). Most common cause for a new project is writing to a
bank-0 address like `*(volatile int *)0x5000 = x;` (a plain `$5000`)
instead of a 24-bit address like `*(volatile int *)0x025000 = x;`
(`$02:5000`). The verification harness reads bank 2; writes to bank 0
go to a different RAM cell and the comparison fails.

---
## Compiling C — full reference
The compiler is invoked just like a normal clang, with one extra flag:
```bash
./tools/llvm-mos-build/bin/clang --target=w65816 -O2 -c source.c -o source.o
```
### Recommended flags
| Flag | Meaning |
|---|---|
| `--target=w65816` | Selects the W65816 backend (required). |
| `-O2` | Default optimization. `-O0` and `-O1` work but produce ~3-5× larger code. `-O3` is the same as `-O2` for our backend. |
| `-ffunction-sections` | Put each function in its own section. Lets the linker drop unreferenced functions (smaller binaries). |
| `-I runtime/include` | Find `<stdio.h>`, `<stdlib.h>`, `<iigs/toolbox.h>` etc. |
| `-c` | Compile only — produce `.o`, don't link. Without this, clang tries to invoke the host linker, which doesn't understand 65816 objects. |
| `-g` | Emit DWARF debug info. Useful with `link816 --debug-out`. |
| `-S` | Emit assembly (`.s`) instead of an object file. Useful for inspecting codegen. |
### What works at `-O2`
- All C99 scalars: `int8_t` through `int64_t`, signed and unsigned,
all arithmetic operators
- Soft `float` and `double` (full IEEE-754 with round-to-nearest-even)
- Pointers, arrays, structs, unions, bitfields
- All control flow: `if`, `for`, `while`, `goto`, `switch`, recursion
- `<stdarg.h>` varargs
- `<setjmp.h>` setjmp/longjmp (SJLJ, no DWARF unwinder)
- Inline `__asm__` with `"a"`, `"x"`, `"y"` register constraints
- C++ subset: classes, single + multiple inheritance, virtual base
diamonds, RTTI, `dynamic_cast`, `new` / `delete` / `new[]` / `delete[]`,
global ctors via `.init_array` (walked by the crt0), Meyers singletons
(gated by `__cxa_guard_acquire`/`release`), and **global + static-local
dtors actually run** at exit time — each crt0 calls
`__run_cxa_atexit` after `main()` returns to walk the registered
table LIFO. SJLJ exceptions via `clang++ -fsjlj-exceptions` (no
DWARF unwinder).
- `printf` / `snprintf` family: full C99 conversion + flag + width +
precision + length surface — `%d %i %u %x %X %o %c %s %p %f %F %e
%E %g %G %n %%`, flags `- + space # 0`, width and precision via
decimal or `*`, length modifiers `hh h l ll j z t`. Hex-float
`%a` / `%A` is the only intentional gap (niche).
- IIgs desktop helpers: `<iigs/desktop.h>` (startdesk/enddesk),
`<iigs/sound.h>` (SysBeep + FFStartSound wrappers),
`<iigs/eventLoop.h>` (callback-based TaskMaster dispatch — close,
menu, key, mouse, idle). See `demos/cxxProbe.cpp` / the smoke
helpers test for usage.
- Source-level debugger (post-mortem): build with `clang -g` and link
with `link816 --debug-out FOO.dwarf --map FOO.map`, then resolve a
runtime PC to source with `scripts/pc2line.py --sidecar FOO.dwarf
--map FOO.map 0xADDR`. Output: `PC=0x123A FILE=foo.c LINE=42
FUNC=add`. See `scripts/mameDebug.sh` for a wrapper that takes
`--break FUNC` / `--break FILE:LINE` and runs under MAME.
- C++ containers via vendored **ETL** (Embedded Template Library) —
`#include "etl/vector.h"`, `#include "etl/string.h"`, `#include "etl/map.h"`,
`#include "etl/optional.h"`, `#include "etl/delegate.h"`, etc. See
the `C++ shell commands` section below for usage.
See [STATUS.md](../STATUS.md) for the full feature matrix.
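To make the `printf` surface concrete, here is a short sketch that exercises a few of the flags, widths, and precisions listed above. It uses only standard C99 `snprintf` behavior, so it should behave the same on a host compiler; the function names `formatDemo` and `formatDemoMatches` are made up for this example, and the values are kept small enough to fit a 16-bit `int`.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical demo function. Formats a few values with width, flags,
 * and precision. Returns the character count snprintf reports
 * (excluding the terminating NUL). */
int formatDemo(char *buf, size_t cap) {
    assert(buf != NULL && cap > 0);
    return snprintf(buf, cap, "%5d|%-5d|%05d|%x|%#X|%.3s",
                    42,         /* right-aligned in 5 columns   */
                    42,         /* left-aligned in 5 columns    */
                    42,         /* zero-padded to 5 digits      */
                    255u,       /* lowercase hex                */
                    255u,       /* uppercase hex with 0X prefix */
                    "llvm816"); /* at most 3 characters         */
}

/* Convenience check: does the formatted text equal `expected`? */
int formatDemoMatches(const char *expected) {
    char buf[64];
    formatDemo(buf, sizeof buf);
    return strcmp(buf, expected) == 0;
}
```

The formatted text is `   42|42   |00042|ff|0XFF|llv`, which is 29 characters long.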
---
## Linking — full reference
`link816` produces a flat binary suitable for direct execution (loaded
into a fixed address) or, with `--omf`, an OMF binary that the GS/OS
Loader can load and relocate.
### Raw binary (fixed-address load)
```bash
./tools/link816 -o output.bin --text-base 0x1000 \
    runtime/crt0.o runtime/libc.o runtime/libgcc.o yourprog.o
```
- `--text-base 0x1000` — Where code is loaded. `$1000` is conventional;
the first 4 KB of bank 0 (`$00:0000`-`$00:0FFF`) is reserved for the
stack and direct page.
- `--bss-base 0x020000` — Where uninitialized data (BSS) goes. By
default the linker places BSS immediately after rodata; supplying a
different bank is useful when your text + data exceeds a single
bank's free space.
- `--map output.map` — Writes a human-readable map file showing every
symbol's address. Useful for debugging.
- `--no-gc-sections` — Keep all functions, even unreferenced ones.
`--gc-sections` is on by default and drops unused code, which shrinks
binaries dramatically (a minimal program with the full runtime linked
goes from ~43 KB to ~1.5 KB).
### Runtime libraries
Each runtime library is built once by `runtime/build.sh` and lives as
a `.o` in `runtime/`. Link only what you use — `--gc-sections` drops
the rest.
| Library | When you need it |
|---|---|
| `runtime/crt0.o` | **Always.** C runtime startup. |
| `runtime/crt0Gsos.o` | Instead of `crt0.o` for programs launched by the GS/OS Loader. |
| `runtime/libc.o` | `printf`, `malloc`, `strlen`, the usual. Almost always. |
| `runtime/libgcc.o` | Compiler helpers — multiply, divide, shift. Almost always. |
| `runtime/snprintf.o` | If you use `sprintf` / `snprintf` / `vsnprintf`. |
| `runtime/sscanf.o` | If you use `sscanf` / `vsscanf` / `fscanf`. |
| `runtime/softDouble.o` | If you use `double`-precision arithmetic anywhere. |
| `runtime/softFloat.o` | If you use `float`-precision arithmetic. |
| `runtime/math.o` | `fabs`, `floor`, `sqrt`, `sin`, `cos`, `pow`, etc. |
| `runtime/qsort.o` | `qsort` / `bsearch`. |
| `runtime/strtol.o` | `strtol` / `strtoul` / `atoi` / `atol`. |
| `runtime/strtok.o` | `strtok` / `strtok_r`. |
| `runtime/extras.o` | `strcat`, `strncat`, `llabs`, `rand`/`srand`. |
| `runtime/timeExt.o` | `time` / `gmtime` / `mktime`. |
| `runtime/iigsToolbox.o` | Apple IIgs Toolbox call wrappers. |
| `runtime/iigsGsos.o` | GS/OS class-1 call wrappers (file I/O, etc.). |
| `runtime/desktop.o` | `startdesk()` helper used by demos that need a Window Manager environment. |
| `runtime/libcxxabi.o` | C++ ABI runtime (vtable RTTI, `dynamic_cast`). |
| `runtime/libcxxabiSjlj.o` | C++ SJLJ-exception support (paired with `-fsjlj-exceptions`). |
To (re)build the runtime:
```bash
bash runtime/build.sh
```
### Multi-segment OMF (for GS/OS Loader)
For programs >60 KB (the usable bank-0 limit after the stack, zero
page, and I/O window are subtracted), build a multi-segment OMF that
GS/OS Loader places across banks:
```bash
./tools/link816 -o myprog.bin \
    --text-base 0x1000 \
    --segment-cap 0xB000 \
    --segment-bank-base 0x040000 \
    --manifest myprog.manifest.json \
    runtime/crt0Gsos.o ... yourprog.o
./tools/omfEmit --manifest myprog.manifest.json --expressload -o myprog.omf
```
See [`docs/multiSegmentPlan.md`](multiSegmentPlan.md) for details and
[`scripts/runMultiSeg.sh`](../scripts/runMultiSeg.sh) for a working
example.

---
## Running under MAME
[`scripts/runInMame.sh`](../scripts/runInMame.sh) launches MAME's
`apple2gs` driver, loads your binary at `$00:1000`, runs for a few
seconds, and reads a memory cell:
```bash
bash scripts/runInMame.sh prog.bin # just run for ~5 s
bash scripts/runInMame.sh prog.bin --check 0x025000=002a # verify a value
bash scripts/runInMame.sh prog.bin 0x025000 0x025002 # dump these addresses
```
- `--check ADDR=VALUE` exits with status 0 if memory matches and 1 if not.
The smoke tests and CI rely on this.
- The bare-address form dumps the value without comparing.
The runner is headless by default (`-video none` + `SDL_VIDEODRIVER=dummy`)
so it runs in a terminal-only environment. Useful environment
variables:
| Variable | Default | Purpose |
|---|---|---|
| `MAME_CHECK_FRAME` | `300` | Frame at which to read the check address (300 ≈ 5 s at 60 Hz). |
| `MAME_SECS` | `6` | How long to let MAME run before forcibly exiting. |
| `MAME_TIMEOUT` | `30` | Wall-clock timeout for the whole MAME invocation. |
| `MAME_RAMSIZE` | unset | Override the emulated RAM size (e.g. `8M`). |
### Writing to non-bank-0 RAM
The 65816 has two registers that select which bank a memory access
goes to:
- **PBR** (Program Bank Register) — selects the bank for instruction
fetches. Set by `jsl long_addr` and `rtl`.
- **DBR** (Data Bank Register) — selects the bank for 16-bit absolute
data accesses like `lda $5000`.
When the IIgs boots, DBR defaults to `$00`. Bank `$00` contains the
I/O window at `$C000-$CFFF`, the language card area, and the stack —
not a great place for general data.
**With ptr32 mode** (the default — pointers are 32 bits / 24-bit
addresses), constant pointers to non-bank-0 addresses lower
automatically to long (24-bit absolute) instructions that *ignore DBR*:
```c
*(volatile int *)0x025000 = 42; // → sta long $025000 (DBR-independent)
*(volatile char *)0xE10068 = 1; // → sta long $E10068 (vert position reg)
unsigned char v = *(volatile unsigned char *)0xE0C025; // byte read from the $E0:C025 I/O register
```
For typical programs — writing a result to a verification address,
poking IIgs hardware registers, accessing the SHR framebuffer at
`$E1:2000` — you just dereference the absolute pointer and the
compiler does the right thing. **DBR doesn't matter.**
### Legacy: the `switchToBank2()` idiom
You may see older code (pre-ptr32 migration) using a `switchToBank2()`
helper that pokes DBR to `$02` so that subsequent 16-bit-absolute
stores like `*(volatile X*)0x5000 = v` land in bank 2:
```c
__attribute__((noinline)) void switchToBank2(void) {
    __asm__ volatile (
        "sep #0x20\n"          // 8-bit A
        ".byte 0xa9,0x02\n"    // lda #2 (hand-encoded)
        "pha\n"                // push A
        "plb\n"                // pop into DBR
        "rep #0x20\n"          // back to 16-bit A
    );
}
// then, at the call site:
switchToBank2();
*(volatile int *)0x5000 = x;
```
This still works but is **no longer needed** for new code. Prefer the
direct 24-bit pointer form (`*(volatile int *)0x025000 = x;`) — it's
clearer, requires no inline asm, and produces fewer instructions
because the bank byte is encoded inline.
There's still one case where it's useful: if you have a *large amount*
of data work in a single bank and want every store to be 3 bytes
(`sta $5000,X` etc.) instead of 4 bytes (`sta long $025000,X`). In
that case, set DBR once with the helper above and use 16-bit-absolute
addresses afterward. Otherwise, the direct form is simpler.
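A rough way to judge that trade-off is to count bytes: each store saves one byte (3 instead of 4), and switching DBR has a one-time cost. The sketch below is only that arithmetic; `dbrSwitchNetSavings` is a hypothetical name, and the setup cost is left as a parameter because it depends on how the helper is emitted. As one data point, the helper's inline-asm body above encodes to 8 bytes (`sep` 2, `lda #` 2, `pha` 1, `plb` 1, `rep` 2); adding a 4-byte `jsl` and a 1-byte `rtl` for 13 total is an assumption, and restoring DBR afterward would cost more.

```c
#include <assert.h>

/* Store sizes quoted in the text: `sta long` vs 16-bit absolute. */
enum { LONG_STORE_BYTES = 4, ABS_STORE_BYTES = 3 };

/* Hypothetical helper: net code-size change in bytes from switching DBR
 * once and then using 16-bit absolute stores. Positive means the switch
 * pays for itself. `setupBytes` is the assumed one-time cost. */
long dbrSwitchNetSavings(unsigned long stores, unsigned long setupBytes) {
    assert(LONG_STORE_BYTES > ABS_STORE_BYTES);  /* sanity: 1 byte saved per store */
    long perStore = LONG_STORE_BYTES - ABS_STORE_BYTES;
    return (long)stores * perStore - (long)setupBytes;
}
```

Under the assumed 13-byte setup cost, 100 stores come out 87 bytes ahead, while 5 stores come out 8 bytes behind, which matches the advice to reserve the idiom for large amounts of same-bank data work.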
### What never needs bank-switching
- **Local variables on the stack** — stack-relative accesses bypass DBR.
- **Direct-page accesses** — `lda $D0` always reads `$00:00D0`.
- **`[dp],Y` indirect-long pointers** — they carry their own bank byte.
- **Function calls** — `jsl` uses PBR + a long destination.
- **Pointers in ptr32 mode** — every C pointer is 32 bits, so deref'ing
any pointer (even one to bank 0) generates DBR-independent code.
---
## Running under GNO/ME
The MAME path above runs your program bare-metal. GNO/ME 2.0.6 is a
Unix-like multitasking environment that runs *on top of* real GS/OS, and
an llvm816-compiled C (or C++) program can run as a native GNO **shell
command** — with console stdio, `argv`, and `FILE*` file I/O — booted
through GS/OS 6.0.4 in MAME. This is a sibling to the MAME path: a
different way to run the same C, inside a real OS.
This is verified headless and end-to-end. Three steps take you from C
source to a running command.
### 1. Build the base GNO disk (once)
```bash
bash tools/gno/buildDisk.sh # -> tools/gno/gnobase.po
```
This assembles the GNO/ME userland into an 800 KB ProDOS volume. Re-run
it only when the GNO archive set changes.
**One-time prerequisites.** `buildDisk.sh` needs `nulib2` (a system
package: `sudo apt-get install nulib2`) and `tools/cadius/cadius` (run
`bash scripts/installCadius.sh` if it is missing), plus the GNO/ME 2.0.6
`.shk` archives under `tools/gno/dist/`. The runner in step 3 also needs
the GS/OS 6.0.4 system disk at `tools/gsos/6.0.4 - System.Disk.po` and
the same IIgs ROMs the MAME path uses. None of these are installed by
`setup.sh` today — see [INSTALL.md](INSTALL.md) for the full list. You
also need the GNO runtime objects, which `bash runtime/build.sh` builds
automatically.
### 2. Compile a C program into a GNO OMF
```bash
bash demos/buildGno.sh gnoHello # demos/gnoHello.c -> demos/gnoHello.omf
```
`buildGno.sh` takes a single basename (required); it reads
`demos/<name>.c` and emits `demos/<name>.omf` (plus `.o`/`.bin`/`.map`/
`.reloc` sidecars). Bundled examples: `gnoHello`, `gnoCat`, `gnoFile`,
`gnoFmt`, `gnoStdin`.
It links the GNO crt0 and runtime, then runs `omfEmit --expressload
--relocs ... --stack-size 0x4000`. Override the DP/Stack size with the
`GNO_STACK_SIZE` environment variable if needed (default `0x4000`).
### 3. Boot, log in, run, and check
```bash
bash scripts/runInGno.sh demos/gnoHello.omf --check 0x025000=C0DE
```
The runner boots GS/OS 6.0.4 + GNO in headless MAME, logs in as `root`,
runs your command, then probes memory. `gnoHello` writes `0xC0DE` to
`$02:5000` as a harness marker, so a successful run prints:
```
[llvm816] GNO check OK: 0x025000 = 0xc0de
```
`--check` takes `ADDR=VALUE` pairs (multiple allowed after one
`--check`). **The address uses `0x` form (`0x025000`); the expected
value is bare hex with no prefix (`C0DE`, not `0xC0DE`).** The runner
prints the matched value lowercased. Add `--snapshots` to capture a PNG
of each boot/login/run stage to `/tmp/gnosnaps`.
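The `ADDR=VALUE` rule is easy to get wrong, so here is a small host-side sketch of a parser that enforces it: the address must start with `0x`, the value must be bare hex. This is an illustration only; `CheckPair` and `parseCheckPair` are hypothetical names, and the real runner is a shell script whose parsing may differ.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical type for illustration; not taken from runInGno.sh. */
typedef struct {
    uint32_t addr;   /* 24-bit address, written as 0x... */
    uint32_t value;  /* expected value, written as bare hex */
} CheckPair;

/* Parse one "ADDR=VALUE" pair. Returns 1 on success, 0 if malformed. */
int parseCheckPair(const char *arg, CheckPair *out) {
    assert(arg != NULL && out != NULL);

    /* The address must use the 0x form. */
    if (strncmp(arg, "0x", 2) != 0) return 0;
    if (!isxdigit((unsigned char)arg[2])) return 0;
    char *end;
    unsigned long addr = strtoul(arg + 2, &end, 16);
    if (*end != '=' || addr > 0xFFFFFFul) return 0;

    /* The value must be bare hex: reject a 0x prefix explicitly,
     * because strtoul with base 16 would otherwise accept it. */
    const char *val = end + 1;
    if (!isxdigit((unsigned char)val[0])) return 0;
    if (val[0] == '0' && (val[1] == 'x' || val[1] == 'X')) return 0;
    unsigned long value = strtoul(val, &end, 16);
    if (*end != '\0') return 0;

    out->addr = (uint32_t)addr;
    out->value = (uint32_t)value;
    return 1;
}
```

With this sketch, `0x025000=C0DE` parses to address `0x025000` and value `0xC0DE`, while `0x025000=0xC0DE` and `025000=C0DE` are both rejected.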
### Things you must know
- **The OMF command basename must be ProDOS-legal — no hyphen.** Name
it `testgno`, not `test-gno`, or the command never launches.
- **stdio needs `libcGno` linked.** `buildGno.sh` does this for C.
Without it the program runs but prints nothing (the console hooks fall
through to a dead sink).
- Console file descriptors follow GNO's convention: **stdin=1,
stdout=2, stderr=3** (a documented deviation from POSIX 0/1/2).
- Commands that do GS/OS file I/O need the `--stack-size` DP/Stack OMF
segment that `buildGno.sh` passes (`0x4000`); the 4 KB default crashes.
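If you generate command names from a build script, a quick legality check up front avoids the silent "command never launches" failure. The sketch below checks the classic ProDOS naming rules as I understand them (1 to 15 characters, first character a letter, the rest letters, digits, or periods); only the no-hyphen rule comes from this document, and `isProdosLegalName` is a hypothetical helper, not something `buildGno.sh` provides.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

enum { PRODOS_NAME_MAX = 15 };  /* classic ProDOS filename length limit */

/* Hypothetical helper. Returns 1 if `name` looks like a legal ProDOS
 * filename: 1..15 characters, starting with a letter, followed only by
 * letters, digits, or periods. */
int isProdosLegalName(const char *name) {
    assert(name != NULL);
    size_t len = strlen(name);
    if (len == 0 || len > PRODOS_NAME_MAX) return 0;
    if (!isalpha((unsigned char)name[0])) return 0;
    for (size_t i = 1; i < len; i++) {
        unsigned char c = (unsigned char)name[i];
        if (!isalnum(c) && c != '.') return 0;  /* rejects '-', '_', spaces */
    }
    return 1;
}
```

Here `testgno` and `gnoHello` pass, while `test-gno` fails on the hyphen.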
### C++ shell commands
`demos/buildGno.sh <name>` auto-detects `.c` vs `.cpp` and switches to
`clang++ -fno-exceptions -fno-rtti` for the latter, linking
`runtime/libcxxabi.o` + `libcxxabiSjlj.o` so the C++ ABI hooks
(`operator new`/`delete`, `__cxa_guard_*`, `__cxa_atexit` +
`__run_cxa_atexit`, RTTI typeinfo, `dynamic_cast`, SJLJ exception
runtime) resolve. Link-time GC strips whatever isn't used, so a
pure-C `.c` program pays nothing extra for the additional `.o`s on
the link line.
**Global / static-local dtors run at exit.** Each crt0 calls
`__run_cxa_atexit` after `main()` returns and before halt/QUIT — the
registered dtor table is walked in LIFO order, so destructors for
file-scope objects and `static T x;` locals actually execute.
`demos/cxxProbe.cpp` is the worked example.
**ETL containers** — the vendored Embedded Template Library at
`runtime/include/c++/etl/` provides fixed-capacity STL-style containers
with no `malloc` and no exceptions. `buildGno.sh` adds
`-I runtime/include/c++` to the compile line, so:
```cpp
#include "etl/vector.h"
#include "etl/string.h"
#include "etl/map.h"
#include "etl/optional.h"
#include "etl/delegate.h"
static int doubler(int x) { return x * 2; }

int main(void) {
    etl::vector<int, 8> v;
    for (int i = 1; i <= 5; i++) v.push_back(i);
    etl::string<32> s("Hello, ");
    s += "ETL";
    etl::map<int, int, 8> m;
    m[1] = 100;
    etl::optional<int> opt = 42;
    // etl::delegate is the std::function-equivalent (type-erased callable).
    // etl::function is for binding object methods, NOT general callables.
    etl::delegate<int(int)> fn = etl::delegate<int(int)>::create<doubler>();
    return fn(s.size());  // 20: "Hello, ETL" is 10 characters, doubled
}
```
The capacity `N` in `etl::vector<T, N>` (and `etl::string<N>`,
`etl::map<K,V,N>`, …) is a **template parameter**, so storage is
in-struct (no heap, no allocator). Pick `N` like you'd pick the size
of a C array. Same trade-off — too small overflows, too large wastes
BSS. Overflow today silently corrupts past the storage array (no
exceptions, default `ETL_ASSERT` is a no-op); install a callback via
`etl::error_handler::set_callback(...)` at startup if you want a halt
on overflow.
The target profile at `runtime/include/c++/etl_profile.h` sets
`ETL_NO_STL`, no atomics, no exceptions, no `std::ostream` — **do not
override it** in user code. Full container list at
[etlcpp.com](https://www.etlcpp.com/). `demos/etlProbe.cpp` exercises
vector + string + map + optional + delegate end-to-end (20 KB total).
For hand-driven builds without `buildGno.sh`, link `libcGno` *before*
`libc` so its strong console hooks win. See the `gno` target in
[`stuff/baztest/Makefile`](../stuff/baztest/Makefile) for a worked
recipe.
For the full picture — disk layout, the inline GS/OS QUIT convention,
the double-run/QUIT trap, `argv` handover, `FILE*` round-trips, and the
`runInGno.sh` environment hooks (`GNO_STDIN`, `GNO_ADDFILE`,
`GNO_RUNCMD`, `GNO_POLL_FRAMES`) — see
[`tools/gno/README.md`](../tools/gno/README.md).

---
## Worked examples
### Recursion + printing
```c
// fib.c
#include <stdio.h>
#include <stdlib.h>
unsigned long fib(unsigned n) {
    if (n < 2) return n;
    return fib(n - 1) + fib(n - 2);
}

int main(void) {
    char buf[32];
    int len = snprintf(buf, sizeof buf, "fib(10) = %lu", fib(10));
    // Copy the formatted string (including its NUL terminator) into
    // bank-2 RAM at $02:5000 so the MAME harness can read it back.
    // Each store is a 24-bit long-address write; no bank-switching needed.
    for (int i = 0; i <= len; i++)
        ((volatile char *)0x025000)[i] = buf[i];
    while (1) {}
}
```
Build (snprintf needs soft-double + sscanf to link cleanly):
```bash
./tools/llvm-mos-build/bin/clang --target=w65816 -O2 \
    -I runtime/include -c fib.c -o fib.o
./tools/link816 -o fib.bin --text-base 0x1000 \
    runtime/crt0.o runtime/libc.o runtime/libgcc.o \
    runtime/snprintf.o runtime/softDouble.o runtime/sscanf.o \
    fib.o
bash scripts/runInMame.sh fib.bin --check 0x025000=0066 # 'f' (start of "fib")
```
### Apple IIgs Toolbox
```c
// hello_gs.c
#include <iigs/toolbox.h>
int main(void) {
    SysBeep();
    while (1) {}
}
```
Build (note `crt0Gsos.o` instead of `crt0.o` — sets up the toolbox
environment):
```bash
./tools/llvm-mos-build/bin/clang --target=w65816 -O2 \
    -I runtime/include -c hello_gs.c -o hello_gs.o
./tools/link816 -o hello_gs.bin --text-base 0x1000 \
    runtime/crt0Gsos.o runtime/iigsToolbox.o runtime/iigsGsos.o \
    runtime/libgcc.o hello_gs.o
```
Programs that call the toolbox usually run under real GS/OS rather than
in the headless harness. See `demos/launch.sh` and `demos/build.sh`
for a working pipeline.

---
## Advanced: pointer-deref code generation
The W65816 backend treats every pointer as 32-bit (`p:32:16` in the datalayout,
so `sizeof(void *) == 4` from the C compiler's perspective). The high
two bytes carry the bank byte plus a pad byte; the low two carry the
in-bank offset. This lets a single C pointer reach any byte in the
IIgs's 24-bit address space.
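As a picture of that layout in memory, the following host-side sketch packs a 24-bit address into the 4-byte image and back. The byte order follows from the 65816 being little-endian (offset low, offset high, bank, pad); writing the pad byte as zero is an assumption of this sketch, and `packPtr32` / `unpackPtr32` are hypothetical names, not backend functions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper. Stores a 24-bit address as the 4-byte pointer
 * image: offset low, offset high, bank, pad. The pad byte is written
 * as 0 here, which is an assumption. */
void packPtr32(uint32_t addr24, uint8_t out[4]) {
    assert(addr24 <= 0xFFFFFFu);
    out[0] = (uint8_t)(addr24 & 0xFFu);          /* offset, low byte  */
    out[1] = (uint8_t)((addr24 >> 8) & 0xFFu);   /* offset, high byte */
    out[2] = (uint8_t)((addr24 >> 16) & 0xFFu);  /* bank byte         */
    out[3] = 0;                                  /* pad byte          */
}

/* Recover the 24-bit address; the pad byte is ignored. */
uint32_t unpackPtr32(const uint8_t in[4]) {
    return (uint32_t)in[0] | ((uint32_t)in[1] << 8) | ((uint32_t)in[2] << 16);
}
```

For `$02:5000` the image is the bytes `00 50 02 00`, and only the first three matter when the pointer is dereferenced.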
A pointer dereference has to read up to 24 bits of address to know
which bank to touch. The CPU's `[dp],Y` (indirect-long-Y, opcode
0xB7) reads a 24-bit pointer from a direct-page slot and uses it as
the effective address — three bytes wide, bank byte explicit. This
is the **safe default** path and it works regardless of where the
target memory lives.
There are two optimizations layered on top of the default path. One
is **always on** and safe. The other is **opt-in via a flag** and
needs care.
### Layer 1: constant-offset peeling (default on, always safe)
When you write `s->c` for a struct field at offset `4`, the natural
code is "compute `s + 4`, then deref". Layer 1 recognizes that
`[dp],Y` already has a Y register that's added to the 24-bit pointer
on the deref — so instead of computing `s + 4` first, the backend
stages the **base pointer** at `$E0..$E2` and loads `Y = #4` for the
deref. Saves three instructions per struct-field access (the
`clc; adc #4; ...; adc #0` carry chain).
A consecutive-access CSE peephole shares the `$E0/$E2` staging
between adjacent derefs of the same base, so `s->a + s->b + s->c +
s->d` stages once and emits four `ldy #K; lda [$E0],Y` pairs.
There's nothing to enable or disable. This was a `+1%` Lua-wide
size win on its own. It's always-on because it's structurally
equivalent to the un-optimized code — the same 24-bit deref, just
with the offset folded into Y instead of pre-added to the pointer.
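To see why folding the offset into Y changes nothing semantically, here is a tiny host-side model of the idea. It is a sketch, not the backend's code: `struct Quad`, `loadLongIndirectY`, and `sumQuad` are hypothetical names, the byte array stands in for target memory, and the fields are 16-bit so that `c` lands at offset 4 as in the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Four 16-bit fields, so `c` sits at byte offset 4, as in the text. */
struct Quad {
    uint16_t a, b, c, d;
};

/* Model of `ldy #y; lda [$E0],Y` with a 16-bit accumulator: read the
 * little-endian word at base + y. `base` stands in for the pointer
 * staged at $E0..$E2. */
static uint16_t loadLongIndirectY(const uint8_t *base, uint16_t y) {
    return (uint16_t)(base[y] | (base[y + 1] << 8));
}

/* Model of s->a + s->b + s->c + s->d: the base is staged once and only
 * the Y offset changes between the four loads. */
uint16_t sumQuad(const uint8_t *base) {
    assert(base != NULL);
    uint16_t sum = 0;
    sum = (uint16_t)(sum + loadLongIndirectY(base, offsetof(struct Quad, a)));
    sum = (uint16_t)(sum + loadLongIndirectY(base, offsetof(struct Quad, b)));
    sum = (uint16_t)(sum + loadLongIndirectY(base, offsetof(struct Quad, c)));
    sum = (uint16_t)(sum + loadLongIndirectY(base, offsetof(struct Quad, d)));
    return sum;
}
```

The address reached is `base + offset` either way; the only difference is whether the addition happens in a separate carry chain or inside the addressing mode.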
### Layer 2: `-mllvm -w65816-dbr-safe-ptrs` (opt-in, unsafe if misused)
The default `[dp],Y` deref needs three bytes of staging at `$E0..$E2`
because it reads a 24-bit pointer. Calypsi uses `lda (d,S),Y`
(opcode 0xB3, stack-rel-indirect-Y) for the same effect in ONE
instruction — but that opcode reads only **16 bits** of pointer.
The bank byte is implicit DBR.
When you pass `-mllvm -w65816-dbr-safe-ptrs`, our backend uses the
same one-instruction path: it spills only the low 16 bits of the
pointer to a stack slot, sets Y to the offset, and emits
`lda (slot,S),Y` (or `sta (slot,S),Y`). Bank byte = whatever DBR
holds at runtime.
Per-deref cost drops from ~5 instructions to 1. Lua 5.1.5 shrinks
by 20.6% with the flag on.
**This is correct only when every pointer dereferenced in the TU
points to memory inside DBR's current bank.** Some examples:
| Pointer | Bank? | Safe with the flag? |
|---|---|---|
| `malloc()` result | DBR's bank (crt0 sets DBR to load bank; malloc allocates from BSS heap there) | Yes |
| Global variable address | DBR's bank (linker puts globals in the load segment) | Yes |
| `&local_array[i]` in a stack frame | Bank 0 (stack is always bank 0) | Yes IF DBR is 0 (typical) |
| Pointer returned by GS/OS Loader | The Loader's bank (might differ from yours) | **No** — would miscompile |
| Pointer cast from a `0x010000+addr` integer literal in bank 1 | Bank 1 | **No** if DBR is not bank 1 |
| `&ROMVECTORS[0]` from `iigs/`-style headers | Various IIgs system banks | **No** in general |
For Lua, Picol, plain C programs that allocate via `malloc` and
operate on globals, this flag is safe. For GS/OS demos that interact
with Loader-returned segments or system memory, it would miscompile.
Default is **off**. Opt in per-TU:
```bash
clang --target=w65816 -O2 -mllvm -w65816-dbr-safe-ptrs -c hot.c -o hot.o
```
If you set the flag and your code does dereference cross-bank
pointers, the symptom is silent wrong-address reads — typically a
read from the same in-bank offset but in DBR's bank instead of the
intended one. No abort, no diagnostic.
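The failure mode can be modeled in a few lines: the default path takes all 24 address bits from the pointer, while the flag path takes only the low 16 and borrows the bank from DBR. The sketch below is a host-side model with hypothetical function names; it ignores the Y index for brevity.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model functions; not part of the backend. */

/* Default path, `[dp],Y`: all 24 address bits come from the pointer. */
uint32_t effectiveAddrLong(uint32_t ptr) {
    return ptr & 0xFFFFFFu;
}

/* Flag path, `(slot,S),Y`: only the low 16 bits come from the pointer;
 * the bank byte is whatever DBR holds at runtime. */
uint32_t effectiveAddrDbr(uint32_t ptr, uint8_t dbr) {
    return ((uint32_t)dbr << 16) | (ptr & 0xFFFFu);
}

/* 1 when both paths reach the same byte, i.e. the flag is safe here. */
int isDbrSafe(uint32_t ptr, uint8_t dbr) {
    return effectiveAddrLong(ptr) == effectiveAddrDbr(ptr, dbr);
}

/* The failure mode from the text: a bank-1 pointer read with DBR = 0. */
void demoCrossBankRead(void) {
    assert(effectiveAddrDbr(0x015000u, 0x00) == 0x005000u);  /* same offset, wrong bank */
    assert(!isDbrSafe(0x015000u, 0x00));
    assert(isDbrSafe(0x025000u, 0x02));                      /* pointer already in DBR's bank */
}
```

A pointer to `$01:5000` dereferenced with DBR = `$00` quietly reads `$00:5000` instead, which is the silent wrong-address read described above.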
**Mixing safely:** the flag is per-TU. You can compile your hot
struct-heavy code with the flag and your bank-aware code without.
The two `.o` files link cleanly together. Per-function or
per-parameter control isn't supported yet.
#### When the slot offset overflows 8 bits
`lda (d,S),Y` has an 8-bit `d` field — max slot offset 255 from SP.
If the function's frame is large enough that the spill slot exceeds
that, PEI emits a fallback sequence that long-indirects the slot via
`[$F6],Y` (the function's frame-pointer), then stages at `$E0..$E2`
and derefs via `[$E0],Y`. This is ~8 instructions — worse than the
plain `[dp],Y` path the flag was meant to replace. Functions that
hit this need `usesDpFP=true` (set automatically for large frames);
otherwise PEI emits a fatal error. In practice you'll only see this
on functions with hundreds of local variables.
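As a rough mental model of the cutoff (not the backend's actual code), the choice between the two sequences depends only on whether the slot offset fits in 8 bits. The enum and function names below are hypothetical.

```c
/* Hypothetical labels for the two deref sequences described above. */
enum DerefPath {
    DEREF_STACK_REL_INDIRECT, /* lda (d,S),Y: d fits in 8 bits       */
    DEREF_FP_FALLBACK         /* [$F6],Y, stage at $E0..$E2, [$E0],Y */
};

/* Pick the sequence for a spill slot at the given offset from SP. */
static enum DerefPath pickDerefPath(unsigned slotOffsetFromSp) {
    return slotOffsetFromSp <= 0xFFu ? DEREF_STACK_REL_INDIRECT
                                     : DEREF_FP_FALLBACK;
}
```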
### Inline-threshold tuning (default lowered to 50)
LLVM's default inline-cost threshold is 225, tuned for desktop CPUs
where call overhead is high relative to the size of the inlined body.
On W65816 a `jsl long:foo` is just 4 bytes / ~8 cycles, but every
inlined pointer dereference expands to multiple instructions even
with Layer 2. Aggressive inlining bloats code without commensurate
cycle wins.
The W65816 backend lowers the default to **50**. Calibration:
| Threshold | Lua size | CoreMark size | Cycle benches |
|----------:|---------:|--------------:|--------------|
| 225 (LLVM stock) | 1.47× Calypsi | (not measured) | baseline |
| 75 | 1.16× | 0.87× | identical |
| **50 (current)** | **1.13×** | **0.79×** | identical |
| 25 | 1.11× | 0.79× | identical |
At 225, Lua's `index2adr` (a multi-branch helper called 41 times in
`lapi.c`) was inlined into every API entry, adding ~2 KB per file —
and CoreMark's `matrix_test` was 17× Calypsi because the inliner
copied 5 nested-loop helpers into it. At 50, both regressions vanish
and the cycle benchmarks are unchanged.
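A back-of-the-envelope model shows why many call sites dominate the size cost. It is a deliberate simplification: it counts only the 4-byte `jsl` and ignores argument setup and any later optimization of the inlined copy. `inlineGrowthBytes` is a hypothetical name.

```c
/* Size of a `jsl` call instruction, per the text above. */
#define JSL_BYTES 4

/* Approximate growth from inlining a helper at every call site:
 * each site trades one 4-byte jsl for a full copy of the body. */
static long inlineGrowthBytes(long callSites, long bodyBytes) {
    return callSites * (bodyBytes - JSL_BYTES);
}
```

For illustration only, a helper body of 54 bytes (a made-up figure) inlined at 41 call sites gives `inlineGrowthBytes(41, 54)`, which is 2050 bytes, about the size of the ~2 KB regression described above.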
To override (e.g. on size-sensitive ROMs or speed-critical loops):
```bash
# Force aggressive inlining (back to LLVM default)
clang --target=w65816 -O2 -mllvm -inline-threshold=225 -c file.c -o file.o
# Force MORE conservative inlining
clang --target=w65816 -O2 -mllvm -inline-threshold=10 -c file.c -o file.o
```
A function marked `__attribute__((always_inline))` is always inlined
regardless of threshold. A function marked `__attribute__((noinline))`
is never inlined. Use these to override the global threshold for
specific cases.
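A minimal sketch of both attributes in one file. The function names are hypothetical examples; only the attributes themselves come from the text above.

```c
#include <stdint.h>

/* Tiny helper: expanded at every call site, even at a low threshold. */
static inline __attribute__((always_inline))
uint16_t swapBytes(uint16_t v) {
    return (uint16_t)((v << 8) | (v >> 8));
}

/* Larger helper: kept out of line, even at a high threshold. */
__attribute__((noinline))
uint16_t sumBytes(const uint8_t *p, uint16_t n) {
    uint16_t sum = 0;
    while (n--) {
        sum = (uint16_t)(sum + *p++);
    }
    return sum;
}

/* Caller: swapBytes is inlined here, sumBytes remains a real call. */
uint16_t swappedSum(const uint8_t *p, uint16_t n) {
    return swapBytes(sumBytes(p, n));
}
```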
### Summary: which options to use when
| Goal | Compile flag |
|---|---|
| Smallest, safest binary (default) | `clang --target=w65816 -O2 ...` — Layer 1 is on, Layer 2 is off, threshold=50 |
| Smallest binary for code that touches only same-bank memory | Add `-mllvm -w65816-dbr-safe-ptrs` |
| Fastest possible code (size be damned) | Add `-mllvm -inline-threshold=500` |
| Reproduce LLVM's stock inlining behavior | Add `-mllvm -inline-threshold=225` |
| Pin specific inlining decisions regardless of threshold | Mark hot helpers `__attribute__((noinline))` explicitly |
---
## Inline assembly
The W65816 backend supports `__asm__` with operand constraints
`"a"`, `"x"`, `"y"`:
```c
unsigned short addOne(unsigned short x) {
unsigned short r;
__asm__("inc a" : "=a"(r) : "a"(x));
return r;
}
```
Multi-instruction asm and raw bytes both work:
```c
__asm__ volatile (
"sep #0x20\n"
".byte 0x68\n" // pla
"rep #0x20\n"
);
```
The `.byte` form is needed when llvm-mc can't yet parse an opcode
literally (some 65816 addressing modes still have gaps in the
assembler). Hand-encoding is a stopgap; report opcodes that need it.
---
## Tools reference
| Tool | Location | Purpose |
|---|---|---|
| `clang` | `tools/llvm-mos-build/bin/clang` | C / C++ compiler |
| `clang++` | `tools/llvm-mos-build/bin/clang++` | C++ driver |
| `llc` | `tools/llvm-mos-build/bin/llc` | Standalone codegen (`.ll` → `.s`) |
| `llvm-mc` | `tools/llvm-mos-build/bin/llvm-mc` | Assembler |
| `llvm-objdump` | `tools/llvm-mos-build/bin/llvm-objdump` | Disassembler |
| `link816` | `tools/link816` | Our relocating linker |
| `omfEmit` | `tools/omfEmit` | Emit OMF v2.1 binary from `link816` output |
| `mame` | system `apt` install | Apple IIgs emulator |
---
## Debugging
### Look at the asm
```bash
./tools/llvm-mos-build/bin/clang --target=w65816 -O2 -S -o prog.s prog.c
cat prog.s
```
### Look at the MIR after each backend pass
```bash
./tools/llvm-mos-build/bin/clang --target=w65816 -O2 \
-mllvm -print-after-all -S prog.c 2>&1 | less
```
Useful pass names to filter on:
| Pass name | What it does |
|---|---|
| `w65816-isel` | SDAG → MachineInstr selection |
| `w65816-widen-acc16` | Promote Acc16 vregs to Wide16 (regalloc help) |
| `w65816-stack-slot-cleanup` | Remove redundant spill/reload |
| `w65816-stackrel-to-img` | Promote hot stack slots to DP IMG slots |
| `w65816-stack-slot-merge` | Collapse PHI src/dst slot pairs |
| `w65816-branch-expand` | Long-distance Bxx → INV_Bxx skip; BRA |
### Single-pass filter
```bash
./tools/llvm-mos-build/bin/clang --target=w65816 -O2 \
-mllvm -print-after=w65816-isel \
-mllvm -filter-print-funcs=myfunc \
-S prog.c 2>&1 | less
```
### Disassemble an object file
```bash
./tools/llvm-mos-build/bin/llvm-objdump --triple=w65816 -d hello.o
```
### ELF `e_machine` value
W65816 `.o` files use **`EM_W65816 = 0xFF16`** in the ELF header.
The value sits in the `0xFF00`-`0xFFFF` range reserved by the ELF spec for
vendor-private / experimental targets, so no official `e_machine`
registration is required.
The `16` suffix is a mnemonic for "65816". (The natural choice, `65816`
itself = `0x10118`, does not fit the 16-bit `Elf32_Half` `e_machine`
field.)
Why this matters:
- `llvm-dwarfdump`, `readelf`, and other generic ELF consumers used to
warn on every invocation because the file claimed `EM_NONE` (= no
machine). Setting a real `EM_` value silences the warning while still
preventing a host-architecture `.o` from being accidentally linked.
- `link816` validates `e_machine` and rejects anything that isn't
`EM_W65816` (with `EM_NONE` still accepted for backwards compatibility
with any pre-Phase-1.13 object files lingering in a build tree).
- The relocation numbers `R_W65816_*` are unique under `EM_W65816`, so
they're free to stay at the small stable integers `1`-`8` (see
`src/llvm/lib/Target/W65816/MCTargetDesc/W65816ELFObjectWriter.cpp`).
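The check described above can be sketched in a few lines of host-side C. This is not `link816`'s code: `readEMachine` and `machineAccepted` are hypothetical names, and the sketch assumes a standard ELF header layout where `e_machine` occupies bytes 18..19.

```c
#include <stddef.h>
#include <stdint.h>

#define EM_NONE   0x0000u
#define EM_W65816 0xFF16u /* value from the text above */

/* 65816 decimal needs 17 bits, so it cannot be stored in e_machine. */
_Static_assert(65816 > UINT16_MAX, "65816 does not fit in Elf32_Half");

/* Hypothetical helper: read e_machine from the start of an ELF file.
 * Returns -1 if the buffer is too short or lacks the ELF magic. */
static long readEMachine(const uint8_t *hdr, size_t len) {
    if (len < 20 || hdr[0] != 0x7F || hdr[1] != 'E' ||
        hdr[2] != 'L' || hdr[3] != 'F') {
        return -1;
    }
    /* hdr[5] is EI_DATA: 1 = little-endian, 2 = big-endian. */
    if (hdr[5] == 2) {
        return ((long)hdr[18] << 8) | hdr[19];
    }
    return ((long)hdr[19] << 8) | hdr[18];
}

/* Acceptance policy described above: EM_W65816, plus EM_NONE for
 * legacy object files. */
static int machineAccepted(long eMachine) {
    return eMachine == (long)EM_W65816 || eMachine == (long)EM_NONE;
}
```

A host object file (for example x86-64, `e_machine` = `0x3E`) fails `machineAccepted`, which is the accidental-link case the dedicated value guards against.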
Touchpoints if you ever need to change the value:
| File | What it does |
|---|---|
| `tools/llvm-mos/llvm/include/llvm/BinaryFormat/ELF.h` | Defines `EM_W65816` enumerator |
| `src/llvm/lib/Target/W65816/MCTargetDesc/W65816ELFObjectWriter.cpp` | Passes value to `MCELFObjectTargetWriter` |
| `src/link816/link816.cpp` | Validates value on input |
---
## Cycle-count benchmarks
Microbenchmarks live under [`benchmarks/`](../benchmarks/) — integer/
string micro-benches plus soft-double FP benches.
```bash
W65816_CC_EXTRA="-mllvm -w65816-dbr-safe-ptrs" bash scripts/benchCyclesPrecise.sh
```
This measures per-call cycle counts via MAME's `emu.time()` between
markers — apples-to-apples vs the matching
`scripts/benchCyclesCalypsi.sh` runner (commercial Calypsi 5.16).
Current ratios (2026-05-27, Layer 2):
```
| Benchmark | Ours | Calypsi | Ratio |
|--------------|------:|--------:|------:|
| dotProduct | 1534 | 5712 | 0.27× |
| bsearch | 682 | 2387 | 0.29× |
| sumOfSquares | 6820 | 16368 | 0.42× |
| bubbleSort | 11594 | 17050 | 0.68× |
| strLen | 767 | 1023 | 0.75× |
| djb2Hash | 2046 | 2643 | 0.77× |
| popcount | 1194 | 1534 | 0.78× |
| strcpy | 1108 | 1194 | 0.93× |
| memcmp | 682 | 716 | 0.95× |
| fib | 11594 | 10912 | 1.06× |
```
**Geomean: 0.63× Calypsi.** 9 of 10 below 1.0×. The Layer 2 flag
(`-mllvm -w65816-dbr-safe-ptrs`) enables stack-rel-indirect-Y ptr32
derefs — required for parity since Calypsi's pointer ABI assumes
DBR matches the pointer's bank.
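The summary figures can be recomputed from the table. The sketch below copies the cycle counts verbatim and computes the geometric mean of the ours/Calypsi ratios. It avoids `<math.h>` by taking the n-th root with Newton's method, which is a choice made for this example rather than anything the bench scripts do. All function names are hypothetical.

```c
#include <stddef.h>

/* Cycle counts copied from the table above: {ours, Calypsi}. */
static const double kCycles[][2] = {
    {1534, 5712},   {682, 2387},  {6820, 16368}, {11594, 17050},
    {767, 1023},    {2046, 2643}, {1194, 1534},  {1108, 1194},
    {682, 716},     {11594, 10912},
};
#define BENCH_COUNT (sizeof kCycles / sizeof kCycles[0])

/* x raised to a small non-negative integer power. */
static double ipow(double x, unsigned n) {
    double r = 1.0;
    while (n--) r *= x;
    return r;
}

/* n-th root of p (0 < p <= 1) by Newton's method, starting at 1.0. */
static double nthRoot(double p, unsigned n) {
    double x = 1.0;
    for (int i = 0; i < 100; i++) {
        x -= (ipow(x, n) - p) / (n * ipow(x, n - 1));
    }
    return x;
}

/* Geometric mean of the ours/Calypsi cycle ratios. */
static double geomeanRatio(void) {
    double product = 1.0;
    for (size_t i = 0; i < BENCH_COUNT; i++) {
        product *= kCycles[i][0] / kCycles[i][1];
    }
    return nthRoot(product, (unsigned)BENCH_COUNT);
}

/* Number of benchmarks where we use fewer cycles than Calypsi. */
static unsigned countBelowParity(void) {
    unsigned count = 0;
    for (size_t i = 0; i < BENCH_COUNT; i++) {
        if (kCycles[i][0] < kCycles[i][1]) count++;
    }
    return count;
}
```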
The `scripts/benchCycles.sh` (HBL-tick-based) script is still around
but lower-resolution. Prefer the `Precise` runner above.
The [`compare/`](../compare/) directory has side-by-side `.s` files vs
Calypsi 5.16 for sumSquares, evalAt, and mul16to32. Rerun with:
```bash
bash compare/regen.sh
```
---
## UndefinedBehaviorSanitizer (UBSan, minimal runtime)
The W65816 target ships a hand-rolled minimal UBSan runtime
(`runtime/ubsan.o`). No driver-side magic: pass the flags and link
the runtime object explicitly.
```bash
# Compile with UBSan-min instrumentation.
./tools/llvm-mos-build/bin/clang --target=w65816 -O2 \
-fsanitize=undefined -fsanitize-minimal-runtime \
-ffunction-sections -I runtime/include \
-c prog.c -o prog.o
# Link, including runtime/ubsan.o so the 25 __ubsan_handle_*_minimal
# symbols clang emits calls to resolve cleanly. libgcc.o is needed
# whenever you exercise i16 div / i32 multiply / shift-by-N.
./tools/link816 -o prog.bin --text-base 0x1000 --bss-base 0xA000 \
runtime/crt0.o prog.o runtime/ubsan.o runtime/libgcc.o
```
What's covered (25 of the 25 handlers upstream's minimal runtime
emits):
```
type-mismatch shift-out-of-bounds invalid-objc-cast
alignment-assumption out-of-bounds function-type-mismatch
add-overflow local-out-of-bounds implicit-conversion
sub-overflow builtin-unreachable (*) nonnull-arg
mul-overflow missing-return (*) nonnull-return
negate-overflow vla-bound-not-positive nullability-arg
divrem-overflow float-cast-overflow nullability-return
load-invalid-value pointer-overflow
invalid-builtin cfi-check-fail
```
(*) recovering-only — no `_abort` variant emitted upstream.
When a UB site fires, the runtime calls a per-kind handler that:
1. Looks up the caller PC in a 20-entry dedup table (single-threaded,
no atomics).
2. If first-seen, emits one line via the existing `__putByteErr` hook
(GNO fd 3 / stderr) in the format `ubsan: <kind> by 0x<8-hex>\n`.
3. The recover variant returns; the `_abort` variant calls
`__builtin_trap()` which lowers to `BRK_pseudo` + sentinel `0xBE @ $70`
+ tight-loop spin.
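Steps 1 and 2 can be sketched on the host as follows. This is not the code in `runtime/ubsan.o`: the names are hypothetical, `snprintf` stands in for byte-wise output through `__putByteErr`, lowercase hex is a guess, and the behavior once the table is full (stay silent) is an assumption.

```c
#include <stdint.h>
#include <stdio.h>

#define MAX_CALLER_PCS 20 /* table size from step 1 */

static uint32_t seenPcs[MAX_CALLER_PCS];
static unsigned seenCount;

/* Step 1: returns 1 the first time a caller PC is seen, else 0.
 * Assumption: once the table is full, new PCs are not reported. */
static int isFirstReport(uint32_t callerPc) {
    for (unsigned i = 0; i < seenCount; i++) {
        if (seenPcs[i] == callerPc) return 0;
    }
    if (seenCount == MAX_CALLER_PCS) return 0;
    seenPcs[seenCount++] = callerPc;
    return 1;
}

/* Step 2: format one diagnostic line into buf. Returns the number
 * of characters written, or 0 if this site was already reported. */
static int reportOnce(char *buf, size_t len, const char *kind,
                      uint32_t callerPc) {
    if (!isFirstReport(callerPc)) return 0;
    return snprintf(buf, len, "ubsan: %s by 0x%08lx\n", kind,
                    (unsigned long)callerPc);
}
```

A linear scan over 20 entries is cheap enough here, and a single-threaded target needs no locking around the table.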
**ASan is out of scope.** The 8:1 shadow-memory model would need
~2 MB of shadow (32 full 64 KB banks) for the 16 MB 65816 address
space, while most IIgs programs run in one or two banks.
End-to-end smoke probe:
```bash
bash tests/ubsan/runUbsanProbe.sh
```
The probe exercises add-overflow, shift-out-of-bounds, and
divide-by-zero, and verifies that each handler fires and that execution
recovers past the UB site (sentinels at `$025000..$025006`). It is wired
into `scripts/smokeTest.sh` as the Phase 6.2 stage; set
`SMOKE_SKIP_UBSAN=1` to skip it.
The probe deliberately overrides three handlers with strong defs that
record their firing in a state byte rather than printing — that lets
the test verify the *call edge* without pulling `libc.o` (and the
attached `snprintf.o`) into a smoke probe that doesn't need console
I/O. A diagnostic-format smoke (asserting on the `ubsan: ...\n` line)
is a follow-up under the `cxxsmoke` GNO MAME harness.
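The override technique looks roughly like this. It is a sketch rather than the probe's source: it assumes the handlers take no arguments (as in upstream's classic minimal runtime), and the state variable, bit assignments, and `allHandlersFired` are made up for illustration.

```c
/* One bit per handler that has fired (bit assignments are made up). */
static volatile unsigned char ubsanFired;

/* Strong definitions: these replace the runtime's handlers at link
 * time, so a UB site records a bit instead of printing a line. */
void __ubsan_handle_add_overflow_minimal(void)        { ubsanFired |= 0x01; }
void __ubsan_handle_shift_out_of_bounds_minimal(void) { ubsanFired |= 0x02; }
void __ubsan_handle_divrem_overflow_minimal(void)     { ubsanFired |= 0x04; }

/* Nonzero once all three handlers have been reached. */
int allHandlersFired(void) {
    return ubsanFired == 0x07;
}
```

Because the handlers only set bits, the test needs no formatted-output code at all, which is the point of the override.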
---
## Known limitations
- **C++ exceptions:** DWARF-based unwinding is not implemented.
`try` / `catch` compiles, but a `throw` doesn't unwind.
`-fsjlj-exceptions` works for limited SJLJ-style throwing.
- **`stdin`** always returns EOF. `scanf` compiles but isn't useful.
Use `sscanf` on a buffer instead.
- **File I/O** through `fopen` requires a backing implementation. The
default `mfs` backing (memory-file-system) lets you simulate files
via `mfsRegister()` — useful for tests, not for real disk I/O. GS/OS
file I/O works via `runtime/iigsGsos.o` if you link against the GS/OS
runtime.
- **`fork`/`exec`** — not applicable on a 65816, no support.
- **Code generation gotcha:** very large stack frames (>200 bytes)
trigger FP-relative addressing. Most programs fit under that limit.
See the `frame-rel` discussion in
[LLVM_65816_DESIGN.md](../LLVM_65816_DESIGN.md).
- **Three Lua functions** (`luaV_execute`, `symbexec`, `auxsort`) hit
the greedy register allocator's complexity budget. Workaround:
compile those TUs with `-mllvm -regalloc=basic`. Documented in
[`tests/lua/README.md`](../tests/lua/README.md).
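For the `stdin` limitation above, parsing from an in-memory buffer with `sscanf` looks like this. `parseSize` and its input format are hypothetical, and the sketch assumes the runtime's `sscanf` handles `%d` the way a hosted C library does.

```c
#include <stdio.h>

/* Parse "<width> <height>" from an in-memory line.
 * Returns 1 on success, 0 if the line is malformed. */
static int parseSize(const char *line, int *width, int *height) {
    return sscanf(line, "%d %d", width, height) == 2;
}
```

Checking the return value against the expected field count catches short or malformed input, which matters more when there is no interactive user to retry.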
---
## Where to go next
- **Building real GS/OS apps:** see
[`docs/multiSegmentPlan.md`](multiSegmentPlan.md) and the
`demos/launch.sh` script for booting through real GS/OS 6.0.2 in
MAME. The 9 demos under `demos/` are reasonable starting points.
- **Running as a GNO/ME shell command:** see [Running under
GNO/ME](#running-under-gnome) above, [`tools/gno/README.md`](../tools/gno/README.md),
and the `demos/gno*.c` examples.
- **Backend internals (you're hacking on the compiler):**
[LLVM_65816_DESIGN.md](../LLVM_65816_DESIGN.md).
- **Smoke tests:** `scripts/smokeTest.sh` runs ~150 end-to-end checks.
Read it for examples of every feature in action.
- **Cycle-bench a Lua port or other real-world C:** see
[`tests/lua/README.md`](../tests/lua/README.md) for the recipe
(vendoring + per-file regalloc tuning + libc stubs).