# Using llvm816

This document covers compiling a C program, linking it into an Apple
IIgs binary, and running it under MAME. It assumes you've followed
[INSTALL.md](INSTALL.md) and the install completed successfully.

If you've never used **clang** or a similar C compiler before, start
with [Quick orientation](#quick-orientation) — it explains the moving
parts. If you already know what clang is, jump to
[Your first program](#your-first-program).

---

## Quick orientation

### What is clang?

Clang is a C / C++ compiler — the program that turns your `.c` source
file into machine code an actual CPU can execute. It's part of the
LLVM project, is the default C compiler on macOS, and is packaged by
most modern Linux distributions. If you've used `gcc` before, clang
takes nearly the same command-line flags.

A normal install of clang produces code for the machine it's running on
— x86-64 if you're on a typical Linux PC. Clang has a **cross-compiler
mode**: pass `--target=<arch>` to make it emit code for a *different*
CPU. The W65816 (the Apple IIgs CPU) is one of the architectures we've
added to a fork of clang that ships with this project.

### What gets installed where

After `./setup.sh` completes, the project tree under your `llvm816/`
checkout looks roughly like this:

```
llvm816/                        ← repo root; everything is contained here
├── docs/                       ← this directory
├── runtime/                    ← C standard library + startup code
│   ├── build.sh                ← script that builds the runtime .o files
│   ├── include/                ← header files (<stdio.h>, etc.)
│   │   ├── stdio.h
│   │   ├── string.h
│   │   ├── ...
│   │   └── iigs/               ← Apple IIgs-specific headers
│   │       ├── toolbox.h       ← ~1300 toolbox routine wrappers
│   │       ├── gsos.h
│   │       └── desktop.h
│   ├── src/                    ← sources for the runtime (.c and .s)
│   └── *.o                     ← compiled runtime objects (after build)
├── scripts/                    ← driver scripts
│   ├── runInMame.sh            ← run a binary in MAME and check memory
│   ├── benchCycles.sh          ← cycle-count benchmarks
│   └── smokeTest.sh            ← ~150 end-to-end correctness checks
├── src/                        ← OUR backend source (you compile from here)
├── tools/                      ← installed tools (~7 GB total)
│   ├── llvm-mos/               ← LLVM source tree (~5 GB)
│   ├── llvm-mos-build/         ← built artifacts (~1.4 GB)
│   │   └── bin/
│   │       ├── clang           ← THE COMPILER YOU USE
│   │       ├── clang++         ← same, for C++
│   │       ├── llc             ← standalone IR → asm converter
│   │       ├── llvm-mc         ← standalone assembler
│   │       ├── llvm-objdump    ← disassembler
│   │       └── ...
│   ├── llvm-mos-sdk/           ← prebuilt llvm-mos SDK (~400 MB, mostly unused)
│   ├── link816                 ← OUR LINKER (single binary, ~120 KB)
│   ├── omfEmit                 ← turns flat binary → Apple IIgs OMF v2.1
│   ├── mame/                   ← Apple IIgs ROMs for MAME
│   ├── gsos/                   ← GS/OS 6.0.2 / 6.0.4 disk images
│   ├── calypsi/                ← reference compiler for comparison (~580 MB)
│   └── orca-c/                 ← reference compiler (header sources)
├── demos/                      ← example IIgs programs
├── benchmarks/                 ← cycle-count benchmarks
├── compare/                    ← side-by-side ours-vs-Calypsi assembly
└── setup.sh                    ← one-shot installer
```

The two files you'll use most often:

| File | Purpose |
|---|---|
| **`tools/llvm-mos-build/bin/clang`** | The compiler. Pass `--target=w65816` to make it emit Apple IIgs code |
| **`tools/link816`** | The linker. Takes `.o` files and produces a flat binary the IIgs can load |

Nothing is installed into `/usr/local`, `/opt`, or anywhere else on
your system — the entire toolchain lives under your `llvm816/` checkout.
To uninstall, delete the directory.

### What about the system's `/usr/bin/clang`?

If your distribution provides a clang (most do), that's a *different*
clang for *your machine's* CPU. It does **not** know about the W65816
target. When following this document, always use the full path
`./tools/llvm-mos-build/bin/clang` (or set an alias / `$PATH` — see
[Setting up your environment](#setting-up-your-environment)).

### What the build process produces

When you compile a C file for the IIgs, the flow looks like this:

```
hello.c
   │
   │  clang --target=w65816          (cross-compile to 65816 machine code)
   ▼
hello.o                              (relocatable ELF object file)
   │
   │  + crt0.o + libc.o + libgcc.o   (runtime libraries you link in)
   │
   │  link816                        (our relocating linker)
   ▼
hello.bin                            (flat binary, loadable at $00:1000)
   │
   │  optionally: omfEmit hello.bin → hello.omf   (for GS/OS Loader)
   │
   │  scripts/runInMame.sh hello.bin
   ▼
runs in MAME's emulated Apple IIgs
```

Three stages:

1. **Compile** — clang turns `.c` into `.o`
2. **Link** — `link816` combines `.o` files + runtime libraries into a binary
3. **Run** — MAME boots an emulated IIgs and executes the binary

---

## Setting up your environment

To save typing, you can put the tools on your `$PATH` or define shell
aliases. Most examples in this document spell out each tool's path
relative to the checkout root (`./tools/...`) so they work without any
setup, but in practice you'll want shortcuts.

### Option A: edit `$PATH` (recommended)

Add this to `~/.bashrc` (or `~/.zshrc`) so our tools are on your path:

```bash
export LLVM816_ROOT="$HOME/path/to/llvm816"
export PATH="$LLVM816_ROOT/tools/llvm-mos-build/bin:$LLVM816_ROOT/tools:$PATH"
```

Then `source ~/.bashrc` (or restart your shell). After that you can
just type `clang --target=w65816 ...` without the path prefix.

> **Careful:** putting `tools/llvm-mos-build/bin` first on `$PATH` means
> *all* `clang` invocations in that shell go to our build, not the
> system clang. Ours still works for your machine's native target
> too (it's a multi-arch clang), but if you also need your distro's
> version, prefer Option B.

### Option B: shell aliases

In `~/.bashrc`:

```bash
LLVM816_ROOT="$HOME/path/to/llvm816"
alias w65clang="$LLVM816_ROOT/tools/llvm-mos-build/bin/clang --target=w65816 -I $LLVM816_ROOT/runtime/include"
alias link816="$LLVM816_ROOT/tools/link816"
```

Then:

```bash
w65clang -O2 -c hello.c -o hello.o
link816 -o hello.bin --text-base 0x1000 ...
```

### Option C: nothing — just use full paths

Most examples in this document spell out the full path from the
checkout root, so this works too. Verbose, but unambiguous.

---

## Your first program

Let's compile, link, and run a tiny program. Open a terminal in your
`llvm816/` checkout directory.

### 1. Write the source

Create `hello.c`:

```c
// hello.c — the smallest meaningful Apple IIgs program.
//
// Writes a value to bank-2 RAM at $02:5000, then halts. The MAME
// harness reads that memory cell to verify the result.

int main(void) {
    int x = 6 * 7;
    // Write directly to the 24-bit absolute address $02:5000. With
    // ptr32 mode (our default), constant pointers to >16-bit addresses
    // lower to `sta long $025000` — no bank-switching needed.
    *(volatile int *)0x025000 = x;
    while (1) {}   // halt; the harness reads memory + exits
    return 0;
}
```

### 2. Compile to a `.o` file

```bash
./tools/llvm-mos-build/bin/clang \
    --target=w65816 \
    -O2 \
    -I runtime/include \
    -c hello.c \
    -o hello.o
```

What each flag does:

| Flag | Meaning |
|---|---|
| `--target=w65816` | **Required.** Tells clang to emit W65816 machine code instead of the host CPU's code. |
| `-O2` | Optimization level. `-O2` is recommended; `-O0` works but produces 3-5× larger code. |
| `-I runtime/include` | Look for `<stdio.h>` etc. in our runtime headers. |
| `-c` | Compile only — produce a `.o`, don't link. |
| `-o hello.o` | Write the object to `hello.o`. |

If the command succeeds, you'll have a `hello.o` next to your `hello.c`.
You can inspect it:

```bash
./tools/llvm-mos-build/bin/llvm-objdump --triple=w65816 -d hello.o | head -40
```

### 3. Link to a flat binary

```bash
./tools/link816 \
    -o hello.bin \
    --text-base 0x1000 \
    runtime/crt0.o \
    runtime/libc.o \
    runtime/libgcc.o \
    hello.o
```

Each argument:

| Argument | Why |
|---|---|
| `-o hello.bin` | Output file. |
| `--text-base 0x1000` | Where the code goes in memory. `0x1000` is conventional (first 4 KB of bank 0 is reserved for stack + zero page). |
| `runtime/crt0.o` | **Must come first.** The C runtime startup — sets up the stack, calls `main`, halts cleanly on return. |
| `runtime/libc.o` | Core C library (`printf`, `malloc`, `strlen`, etc.). |
| `runtime/libgcc.o` | Compiler-provided helpers for things the 65816 can't do natively (16×16 multiply, 32-bit divide, etc.). Required for almost every program. |
| `hello.o` | Your code. |

`link816` will print something like:

```
linked: text=[0x1000+128] rodata=[0x1080+0] bss=[0x1100+8] -> hello.bin
```

That tells you the code is 128 bytes, no read-only data, 8 bytes of BSS.

### 4. Run it in MAME

```bash
bash scripts/runInMame.sh hello.bin --check 0x025000=002a
```

`0x002a` is hexadecimal for 42 (= 6 × 7), and `0x025000` is the
24-bit address `bank $02 + offset $5000` — where your program wrote
`x`. The script boots MAME's emulated Apple IIgs, loads your binary
at `$00:1000`, runs for 5 seconds, reads memory at `$02:5000`, and
compares to the expected value.

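The bank/offset split is plain bit arithmetic. The host-side sketch below (it runs on your development machine, not on the IIgs) shows how a 24-bit address breaks into its two parts; `addrBank` and `addrOffset` are hypothetical helper names, not part of the runtime. Feeding it `0x025000` gives bank `$02`, offset `$5000`, while a plain `0x5000` gives the same offset in bank `$00`.

```c
#include <assert.h>
#include <stdint.h>

/* Bank byte: bits 16..23 of a 24-bit address. */
static uint8_t addrBank(uint32_t addr) {
    assert(addr <= 0xFFFFFFu);   /* the 65816 address space is 24 bits wide */
    return (uint8_t)(addr >> 16);
}

/* In-bank offset: the low 16 bits of the address. */
static uint16_t addrOffset(uint32_t addr) {
    assert(addr <= 0xFFFFFFu);
    return (uint16_t)(addr & 0xFFFFu);
}
```
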
A pass looks like:

```
MAME-LOADED bytes=128
MAME-READ addr=0x025000 val=0x002a
[llvm816] MAME OK: 1 reads matched
```

If you get `MAME mismatch`, your program wrote a different value (or
no value). Most common cause for a new project is writing to a
bank-0 address like `*(volatile int *)0x5000 = x;` (a plain `$5000`)
instead of a 24-bit address like `*(volatile int *)0x025000 = x;`
(`$02:5000`). The verification harness reads bank 2; writes to bank 0
go to a different RAM cell and the comparison fails.

---

## Compiling C — full reference

The compiler is invoked just like a normal clang, with one extra flag:

```bash
./tools/llvm-mos-build/bin/clang --target=w65816 -O2 -c source.c -o source.o
```

### Recommended flags

| Flag | Meaning |
|---|---|
| `--target=w65816` | Selects the W65816 backend (required). |
| `-O2` | Default optimization. `-O0` and `-O1` work but produce ~3-5× larger code. `-O3` is the same as `-O2` for our backend. |
| `-ffunction-sections` | Put each function in its own section. Lets the linker drop unreferenced functions (smaller binaries). |
| `-I runtime/include` | Find `<stdio.h>`, `<stdlib.h>`, `<iigs/toolbox.h>` etc. |
| `-c` | Compile only — produce `.o`, don't link. Without this, clang tries to invoke the host linker, which doesn't understand 65816 objects. |
| `-g` | Emit DWARF debug info. Useful with `link816 --debug-out`. |
| `-S` | Emit assembly (`.s`) instead of an object file. Useful for inspecting codegen. |

### What works at `-O2`

- All C99 scalars: `int8_t` through `int64_t`, signed and unsigned,
  all arithmetic operators
- Soft `float` and `double` (full IEEE-754 with round-to-nearest-even)
- Pointers, arrays, structs, unions, bitfields
- All control flow: `if`, `for`, `while`, `goto`, `switch`, recursion
- `<stdarg.h>` varargs
- `<setjmp.h>` setjmp/longjmp (SJLJ, no DWARF unwinder)
- Inline `__asm__` with `"a"`, `"x"`, `"y"` register constraints
- C++ subset: classes, single+multiple inheritance, virtual functions,
  RTTI, `dynamic_cast`. **No DWARF-unwinder exceptions** (the unwinder
  is not implemented); SJLJ exceptions work via `-fsjlj-exceptions`.

See [STATUS.md](../STATUS.md) for the full feature matrix.

---

## Linking — full reference

`link816` produces a flat binary suitable for direct execution (loaded
into a fixed address) or, with `--omf`, an OMF binary that the GS/OS
Loader can load and relocate.

### Raw binary (fixed-address load)

```bash
./tools/link816 -o output.bin --text-base 0x1000 \
    runtime/crt0.o runtime/libc.o runtime/libgcc.o yourprog.o
```

- `--text-base 0x1000` — Where code is loaded. `$1000` is conventional;
  the first 4 KB of bank 0 (`$00:0000`-`$00:0FFF`) is reserved for the
  stack and direct page.
- `--bss-base 0x020000` — Where uninitialized data (BSS) goes. By
  default the linker places BSS immediately after rodata; supplying a
  different bank is useful when your text + data exceeds a single
  bank's free space.
- `--map output.map` — Writes a human-readable map file showing every
  symbol's address. Useful for debugging.
- `--no-gc-sections` — Keep all functions, even unreferenced ones.
  Section garbage collection (`--gc-sections`) is on by default and
  drops unused code, shrinking binaries dramatically (a minimal program
  with the full runtime linked goes from ~43 KB to ~1.5 KB).

### Runtime libraries

Each runtime library is built once by `runtime/build.sh` and lives as
a `.o` in `runtime/`. Link the libraries your program calls into;
`--gc-sections` drops whatever code in them goes unreferenced.

| Library | When you need it |
|---|---|
| `runtime/crt0.o` | **Always**, unless you use `crt0Gsos.o` instead. C runtime startup. |
| `runtime/crt0Gsos.o` | Instead of `crt0.o` for programs launched by the GS/OS Loader. |
| `runtime/libc.o` | `printf`, `malloc`, `strlen`, the usual. Almost always. |
| `runtime/libgcc.o` | Compiler helpers — multiply, divide, shift. Almost always. |
| `runtime/snprintf.o` | If you use `sprintf` / `snprintf` / `vsnprintf`. |
| `runtime/sscanf.o` | If you use `sscanf` / `vsscanf` / `fscanf`. |
| `runtime/softDouble.o` | If you use `double`-precision arithmetic anywhere. |
| `runtime/softFloat.o` | If you use `float`-precision arithmetic. |
| `runtime/math.o` | `fabs`, `floor`, `sqrt`, `sin`, `cos`, `pow`, etc. |
| `runtime/qsort.o` | `qsort` / `bsearch`. |
| `runtime/strtol.o` | `strtol` / `strtoul` / `atoi` / `atol`. |
| `runtime/strtok.o` | `strtok` / `strtok_r`. |
| `runtime/extras.o` | `strcat`, `strncat`, `llabs`, `rand`/`srand`. |
| `runtime/timeExt.o` | `time` / `gmtime` / `mktime`. |
| `runtime/iigsToolbox.o` | Apple IIgs Toolbox call wrappers. |
| `runtime/iigsGsos.o` | GS/OS class-1 call wrappers (file I/O, etc.). |
| `runtime/desktop.o` | `startdesk()` helper used by demos that need a Window Manager environment. |
| `runtime/libcxxabi.o` | C++ ABI runtime (vtable RTTI, `dynamic_cast`). |
| `runtime/libcxxabiSjlj.o` | C++ SJLJ-exception support (paired with `-fsjlj-exceptions`). |

To (re)build the runtime:

```bash
bash runtime/build.sh
```

### Multi-segment OMF (for GS/OS Loader)

For programs >60 KB (the usable bank-0 limit after the stack, zero
page, and I/O window are subtracted), build a multi-segment OMF that
GS/OS Loader places across banks:

```bash
./tools/link816 -o myprog.bin \
    --text-base 0x1000 \
    --segment-cap 0xB000 \
    --segment-bank-base 0x040000 \
    --manifest myprog.manifest.json \
    runtime/crt0Gsos.o ... yourprog.o
./tools/omfEmit --manifest myprog.manifest.json --expressload -o myprog.omf
```

See [`docs/multiSegmentPlan.md`](multiSegmentPlan.md) for details and
[`scripts/runMultiSeg.sh`](../scripts/runMultiSeg.sh) for a working
example.

---

## Running under MAME

[`scripts/runInMame.sh`](../scripts/runInMame.sh) launches MAME's
`apple2gs` driver, loads your binary at `$00:1000`, runs for a few
seconds, and reads a memory cell:

```bash
bash scripts/runInMame.sh prog.bin                             # just run for ~5 s
bash scripts/runInMame.sh prog.bin --check 0x025000=002a       # verify a value
bash scripts/runInMame.sh prog.bin 0x025000 0x025002           # dump these addresses
```

- `--check ADDR=VALUE` returns exit 0 if memory matches, exit 1 if not.
  Used by the smoke tests and CI.
- The bare-address form dumps the value without comparing.

The runner is headless by default (`-video none` + `SDL_VIDEODRIVER=dummy`)
so it runs in a terminal-only environment. Useful environment
variables:

| Variable | Default | Purpose |
|---|---|---|
| `MAME_CHECK_FRAME` | `300` | Frame at which to read the check address (300 ≈ 5 s at 60 Hz). |
| `MAME_SECS` | `6` | How long to let MAME run before forcibly exiting. |
| `MAME_TIMEOUT` | `30` | Wall-clock timeout for the whole MAME invocation. |
| `MAME_RAMSIZE` | unset | Override the emulated RAM size (e.g. `8M`). |

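The relation between `MAME_CHECK_FRAME` and emulated seconds is simple arithmetic. A host-side sketch, assuming the 60 Hz frame rate quoted in the table (the helper names are hypothetical and exist only for this illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 60 frames per emulated second, as stated in the table. */
enum { FRAMES_PER_SECOND = 60 };

/* Emulated seconds elapsed when the given frame is reached. */
static uint32_t framesToSeconds(uint32_t frames) {
    return frames / FRAMES_PER_SECOND;
}

/* Frame number to use in MAME_CHECK_FRAME for a desired delay. */
static uint32_t secondsToFrames(uint32_t seconds) {
    assert(seconds <= UINT32_MAX / FRAMES_PER_SECOND);   /* guard against overflow */
    return seconds * FRAMES_PER_SECOND;
}
```

For instance, a program that needs 10 emulated seconds before its result is ready would want `MAME_CHECK_FRAME=600`, and presumably a `MAME_SECS` above 10 so MAME is still running when the read happens.
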
### Writing to non-bank-0 RAM

The 65816 has two registers that select which bank a memory access
goes to:

- **PBR** (Program Bank Register) — selects the bank for instruction
  fetches. Set by `jsl long_addr` and `rtl`.
- **DBR** (Data Bank Register) — selects the bank for 16-bit absolute
  data accesses like `lda $5000`.

When the IIgs boots, DBR defaults to `$00`. Bank `$00` contains the
I/O window at `$C000-$CFFF`, the language card area, and the stack —
not a great place for general data.

**With ptr32 mode** (the default — pointers are 32 bits / 24-bit
addresses), constant pointers to non-bank-0 addresses lower
automatically to long (24-bit absolute) instructions that *ignore DBR*:

```c
*(volatile int *)0x025000 = 42;                 // → sta long $025000 (DBR-independent)
*(volatile char *)0xE10068 = 1;                 // → sta long $E10068 (vert position reg)
unsigned char v = *(volatile char *)0xE0C025;   // ROM read
```

For typical programs — writing a result to a verification address,
poking IIgs hardware registers, accessing the SHR framebuffer at
`$E1:2000` — you just dereference the absolute pointer and the
compiler does the right thing. **DBR doesn't matter.**

### Legacy: the `switchToBank2()` idiom

You may see older code (pre-ptr32 migration) using a `switchToBank2()`
helper that pokes DBR to `$02` so that subsequent 16-bit-absolute
stores like `*(volatile X*)0x5000 = v` land in bank 2:

```c
__attribute__((noinline)) void switchToBank2(void) {
    __asm__ volatile (
        "sep #0x20\n"          // 8-bit A
        ".byte 0xa9,0x02\n"    // lda #2 (hand-encoded)
        "pha\n"                // push A
        "plb\n"                // pop into DBR
        "rep #0x20\n"          // back to 16-bit A
    );
}
// then:
switchToBank2();
*(volatile int *)0x5000 = x;
```

This still works but is **no longer needed** for new code. Prefer the
direct 24-bit pointer form (`*(volatile int *)0x025000 = x;`) — it's
clearer, requires no inline asm, and produces fewer instructions
because the bank byte is encoded inline.

There's still one case where it's useful: if you have a *large amount*
of data work in a single bank and want every store to be 3 bytes
(`sta $5000,X` etc.) instead of 4 bytes (`sta long $025000,X`). In
that case, set DBR once with the helper above and use 16-bit-absolute
addresses afterward. Otherwise, the direct form is simpler.

### What never needs bank-switching

- **Local variables on the stack** — stack-relative accesses bypass DBR.
- **Direct-page accesses** — `lda $D0` always reads `$00:00D0`.
- **`[dp],Y` indirect-long pointers** — they carry their own bank byte.
- **Function calls** — `jsl` uses PBR + a long destination.
- **Pointers in ptr32 mode** — every C pointer is 32 bits, so deref'ing
  any pointer (even one to bank 0) generates DBR-independent code.

---

## Worked examples

### Recursion + printing

```c
// fib.c
#include <stdio.h>
#include <stdlib.h>

unsigned long fib(unsigned n) {
    if (n < 2) return n;
    return fib(n-1) + fib(n-2);
}

int main(void) {
    char buf[32];
    int len = snprintf(buf, sizeof buf, "fib(10) = %lu", fib(10));
    // Copy the formatted string into bank-2 RAM at $02:5000 so the
    // MAME harness can read it back. Each store goes through a 24-bit
    // long-address write — no bank-switching needed.
    for (int i = 0; i <= len; i++)
        ((volatile char *)0x025000)[i] = buf[i];
    while (1) {}
}
```

Build (snprintf needs soft-double + sscanf to link cleanly):

```bash
./tools/llvm-mos-build/bin/clang --target=w65816 -O2 \
    -I runtime/include -c fib.c -o fib.o

./tools/link816 -o fib.bin --text-base 0x1000 \
    runtime/crt0.o runtime/libc.o runtime/libgcc.o \
    runtime/snprintf.o runtime/softDouble.o runtime/sscanf.o \
    fib.o

bash scripts/runInMame.sh fib.bin --check 0x025000=0066   # 'f' (start of "fib")
```

### Apple IIgs Toolbox

```c
// hello_gs.c
#include <iigs/toolbox.h>

int main(void) {
    SysBeep();
    while (1) {}
}
```

Build (note `crt0Gsos.o` instead of `crt0.o` — sets up the toolbox
environment):

```bash
./tools/llvm-mos-build/bin/clang --target=w65816 -O2 \
    -I runtime/include -c hello_gs.c -o hello_gs.o

./tools/link816 -o hello_gs.bin --text-base 0x1000 \
    runtime/crt0Gsos.o runtime/iigsToolbox.o runtime/iigsGsos.o \
    runtime/libgcc.o hello_gs.o
```

Programs that call the toolbox usually run under real GS/OS rather than
in the headless harness. See `demos/launch.sh` and `demos/build.sh`
for a working pipeline.

---

## Advanced: pointer-deref code generation

The W65816 backend treats every pointer as 32-bit (`p:32:16` datalayout
— `sizeof(void *) == 4` from the C compiler's perspective). The high
two bytes carry the bank byte plus a pad byte; the low two carry the
in-bank offset. This lets a single C pointer reach any byte in the
IIgs's 24-bit address space.

A pointer dereference has to read up to 24 bits of address to know
which bank to touch. The CPU's `[dp],Y` (indirect-long-Y, opcode
0xB7) reads a 24-bit pointer from a direct-page slot and uses it as
the effective address — three bytes wide, bank byte explicit. This
is the **safe default** path and it works regardless of where the
target memory lives.

There are two optimizations layered on top of the default path. One
is **always on** and safe. The other is **opt-in via a flag** and
needs care.

### Layer 1: constant-offset peeling (default on, always safe)

When you write `s->c` for a struct field at offset `4`, the natural
code is "compute `s + 4`, then deref". Layer 1 recognizes that
`[dp],Y` already has a Y register that's added to the 24-bit pointer
on the deref — so instead of computing `s + 4` first, the backend
stages the **base pointer** at `$E0..$E2` and loads `Y = #4` for the
deref. Saves three instructions per struct-field access (the
`clc; adc #4; ...; adc #0` carry chain).

A consecutive-access CSE peephole shares the `$E0/$E2` staging
between adjacent derefs of the same base, so `s->a + s->b + s->c +
s->d` stages once and emits four `ldy #K; lda [$E0],Y` pairs.

There's nothing to enable or disable. This was a `+1%` Lua-wide
size win on its own. It's always-on because it's structurally
equivalent to the un-optimized code — the same 24-bit deref, just
with the offset folded into Y instead of pre-added to the pointer.

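The equivalence can be checked with a small host-side model of the address arithmetic. This is only a sketch: `indirectLongY` models how `[dp],Y` forms its effective address (24-bit base plus Y), and `struct Sample` and all function names are made up for illustration, not taken from the backend.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical struct; with 16-bit fields, `c` sits at offset 4. */
struct Sample {
    int16_t a;
    int16_t b;
    int16_t c;
    int16_t d;
};

/* Model of `[dp],Y`: 24-bit base pointer plus the Y index. */
static uint32_t indirectLongY(uint32_t base24, uint16_t y) {
    assert(base24 <= 0xFFFFFFu);
    return (base24 + y) & 0xFFFFFFu;
}

/* Unoptimized form: add the field offset to the pointer, then deref with Y = 0. */
static uint32_t derefPreAdded(uint32_t base24, uint16_t fieldOffset) {
    return indirectLongY((base24 + fieldOffset) & 0xFFFFFFu, 0);
}

/* Layer 1 form: stage the base pointer unchanged and put the offset in Y. */
static uint32_t derefPeeled(uint32_t base24, uint16_t fieldOffset) {
    return indirectLongY(base24, fieldOffset);
}
```

Both functions reach the same byte for any base and offset; only the place where the addition happens differs, which is why the peeled form needs no safety conditions.
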
### Layer 2: `-mllvm -w65816-dbr-safe-ptrs` (opt-in, unsafe if misused)

The default `[dp],Y` deref needs three bytes of staging at `$E0..$E2`
because it reads a 24-bit pointer. Calypsi uses `lda (d,S),Y`
(opcode 0xB3, stack-rel-indirect-Y) for the same effect in ONE
instruction — but that opcode reads only **16 bits** of pointer.
The bank byte is implicit DBR.

When you pass `-mllvm -w65816-dbr-safe-ptrs`, our backend uses the
same one-instruction path: it spills only the low 16 bits of the
pointer to a stack slot, sets Y to the offset, and emits
`lda (slot,S),Y` (or `sta (slot,S),Y`). Bank byte = whatever DBR
holds at runtime.

Per-deref cost drops from ~5 instructions to 1. Lua 5.1.5 shrinks
by 20.6% with the flag on.

**This is correct only when every pointer dereferenced in the TU
points to memory inside DBR's current bank.** Some examples:

| Pointer | Bank? | Safe with the flag? |
|---|---|---|
| `malloc()` result | DBR's bank (crt0 sets DBR to load bank; malloc allocates from BSS heap there) | Yes |
| Global variable address | DBR's bank (linker puts globals in the load segment) | Yes |
| `&local_array[i]` in a stack frame | Bank 0 (stack is always bank 0) | Yes IF DBR is 0 (typical) |
| Pointer returned by GS/OS Loader | The Loader's bank (might differ from yours) | **No** — would miscompile |
| Pointer cast from a `0x010000+addr` integer literal in bank 1 | Bank 1 | **No** if DBR is not bank 1 |
| `&ROMVECTORS[0]` from `iigs/`-style headers | Various IIgs system banks | **No** in general |

For Lua, Picol, plain C programs that allocate via `malloc` and
operate on globals, this flag is safe. For GS/OS demos that interact
with Loader-returned segments or system memory, it would miscompile.

Default is **off**. Opt in per-TU:

```bash
clang --target=w65816 -O2 -mllvm -w65816-dbr-safe-ptrs -c hot.c -o hot.o
```

If you set the flag and your code does dereference cross-bank
pointers, the symptom is silent wrong-address reads — typically a
read from the same in-bank offset but in DBR's bank instead of the
intended one. No abort, no diagnostic.

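The failure mode can be modelled on the host. In this sketch (all function names are hypothetical, and it models only the address arithmetic, not the real instructions), the default path keeps all 24 bits of the pointer, while the flagged path keeps the low 16 bits and takes the bank from DBR. A bank-1 pointer dereferenced with DBR = 0 ends up at the same offset in bank 0.

```c
#include <assert.h>
#include <stdint.h>

/* Default path (`[dp],Y`): the pointer's own bank byte is used. */
static uint32_t effectiveLong(uint32_t ptr, uint16_t y) {
    return ((ptr & 0xFFFFFFu) + y) & 0xFFFFFFu;
}

/* Flagged path (`(d,S),Y`): only the low 16 bits survive; DBR supplies the bank. */
static uint32_t effectiveDbr(uint32_t ptr, uint16_t y, uint8_t dbr) {
    uint32_t base = ((uint32_t)dbr << 16) | (ptr & 0xFFFFu);
    return (base + y) & 0xFFFFFFu;
}

/* Nonzero when the flagged path touches the same byte as the default path. */
static int sameCell(uint32_t ptr, uint16_t y, uint8_t dbr) {
    return effectiveLong(ptr, y) == effectiveDbr(ptr, y, dbr);
}

/* Worked cases; call this from a host-side test. */
static void dbrHazardExamples(void) {
    assert(sameCell(0x005000u, 0, 0x00));                    /* pointer in DBR's bank: identical */
    assert(!sameCell(0x015000u, 0, 0x00));                   /* bank-1 pointer, DBR = 0: wrong cell */
    assert(effectiveDbr(0x015000u, 0, 0x00) == 0x005000u);   /* lands in bank 0 instead */
}
```
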
**Mixing safely:** the flag is per-TU. You can compile your hot
struct-heavy code with the flag and your bank-aware code without.
The two `.o` files link cleanly together. Per-function or
per-parameter control isn't supported yet.

#### When the slot offset overflows 8 bits

`lda (d,S),Y` has an 8-bit `d` field — max slot offset 255 from SP.
If the function's frame is large enough that the spill slot exceeds
that, PEI emits a fallback sequence that long-indirects the slot via
`[$F6],Y` (the function's frame-pointer), then stages at `$E0..$E2`
and derefs via `[$E0],Y`. This is ~8 instructions — worse than the
plain `[dp],Y` path the flag was meant to replace. Functions that
hit this need `usesDpFP=true` (set automatically for large frames);
otherwise PEI emits a fatal error. In practice you'll only see this
on functions with hundreds of local variables.

### Inline-threshold tuning (default lowered to 50)

LLVM's default inline-cost threshold is 225, tuned for desktop CPUs
where call overhead is high relative to the size of the inlined body.
On W65816 a `jsl long:foo` is just 4 bytes / ~8 cycles, but every
inlined pointer dereference expands to multiple instructions even
with Layer 2. Aggressive inlining bloats code without commensurate
cycle wins.

The W65816 backend lowers the default to **50**. Calibration:

| Threshold | Lua size | CoreMark size | Cycle benches |
|----------:|---------:|--------------:|--------------|
| 225 (LLVM stock) | 1.47× Calypsi | (not measured) | baseline |
| 75 | 1.16× | 0.87× | identical |
| **50 (current)** | **1.13×** | **0.79×** | identical |
| 25 | 1.11× | 0.79× | identical |

At 225, Lua's `index2adr` (a multi-branch helper called 41 times in
`lapi.c`) was inlined into every API entry, adding ~2 KB per file —
and CoreMark's `matrix_test` was 17× Calypsi because the inliner
copied 5 nested-loop helpers into it. At 50, both regressions vanish
and the cycle benchmarks are unchanged.

To override (e.g. on size-sensitive ROMs or speed-critical loops):

```bash
# Force aggressive inlining (back to LLVM default)
clang --target=w65816 -O2 -mllvm -inline-threshold=225 -c file.c -o file.o

# Force MORE conservative inlining
clang --target=w65816 -O2 -mllvm -inline-threshold=10 -c file.c -o file.o
```

A function marked `__attribute__((always_inline))` is always inlined
regardless of threshold. A function marked `__attribute__((noinline))`
is never inlined. Use these to override the global threshold for
specific cases.

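As a sketch, the two attributes look like this in source. The functions are invented for illustration; the attribute spellings are the standard GCC/Clang ones, so the snippet also compiles with a host compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Small, hot helper: inlined at every call site whatever the threshold is. */
static inline __attribute__((always_inline)) uint8_t clampToByte(int16_t v) {
    if (v < 0) return 0;
    if (v > 255) return 255;
    return (uint8_t)v;
}

/* Larger routine: kept as a single out-of-line copy, reached by a call. */
static __attribute__((noinline)) uint16_t sumClamped(const int16_t *values, uint16_t count) {
    assert(values != NULL || count == 0);
    uint16_t total = 0;
    for (uint16_t i = 0; i < count; i++)
        total = (uint16_t)(total + clampToByte(values[i]));
    return total;
}
```

Here `clampToByte` is folded into the loop body of `sumClamped`, while `sumClamped` itself stays a real function no matter how many call sites it has.
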
### Summary: which options to use when

| Goal | Compile flag |
|---|---|
| Smallest, safest binary (default) | `clang --target=w65816 -O2 ...` — Layer 1 is on, Layer 2 is off, threshold=50 |
| Smallest binary for code that touches only same-bank memory | Add `-mllvm -w65816-dbr-safe-ptrs` |
| Fastest possible code (size be damned) | Add `-mllvm -inline-threshold=500` |
| Reproduce LLVM's stock inlining behavior | Add `-mllvm -inline-threshold=225` |
| Maximum safety review of inlining decisions | Mark hot helpers `__attribute__((noinline))` explicitly |

---

## Inline assembly

The W65816 backend supports `__asm__` with operand constraints
`"a"`, `"x"`, `"y"`:

```c
unsigned short addOne(unsigned short x) {
    unsigned short r;
    __asm__("inc a" : "=a"(r) : "a"(x));
    return r;
}
```

Multi-instruction asm and raw bytes both work:

```c
__asm__ volatile (
    "sep #0x20\n"
    ".byte 0x68\n"   // pla
    "rep #0x20\n"
);
```

The `.byte` form is needed when llvm-mc can't yet parse an opcode
literally (some 65816 addressing modes still have gaps in the
assembler). Hand-encoding is a stopgap; report opcodes that need it.

---

## Tools reference

| Tool | Location | Purpose |
|---|---|---|
| `clang` | `tools/llvm-mos-build/bin/clang` | C / C++ compiler |
| `clang++` | `tools/llvm-mos-build/bin/clang++` | C++ driver |
| `llc` | `tools/llvm-mos-build/bin/llc` | Standalone codegen (`.ll` → `.s`) |
| `llvm-mc` | `tools/llvm-mos-build/bin/llvm-mc` | Assembler |
| `llvm-objdump` | `tools/llvm-mos-build/bin/llvm-objdump` | Disassembler |
| `link816` | `tools/link816` | Our relocating linker |
| `omfEmit` | `tools/omfEmit` | Emit OMF v2.1 binary from `link816` output |
| `mame` | system `apt` install | Apple IIgs emulator |

---

## Debugging
|
||
|
||
### Look at the asm
|
||
|
||
```bash
|
||
./tools/llvm-mos-build/bin/clang --target=w65816 -O2 -S -o prog.s prog.c
|
||
cat prog.s
|
||
```
|
||
|
||
### Look at the MIR after each backend pass
|
||
|
||
```bash
|
||
./tools/llvm-mos-build/bin/clang --target=w65816 -O2 \
|
||
-mllvm -print-after-all -S prog.c 2>&1 | less
|
||
```
|
||
|
||
Useful pass names to filter on:

| Pass name | What it does |
|---|---|
| `w65816-isel` | SDAG → MachineInstr selection |
| `w65816-widen-acc16` | Promote Acc16 vregs to Wide16 (regalloc help) |
| `w65816-stack-slot-cleanup` | Remove redundant spill/reload |
| `w65816-stackrel-to-img` | Promote hot stack slots to DP IMG slots |
| `w65816-stack-slot-merge` | Collapse PHI src/dst slot pairs |
| `w65816-branch-expand` | Long-distance Bxx → INV_Bxx skip; BRA |

### Single-pass filter

```bash
./tools/llvm-mos-build/bin/clang --target=w65816 -O2 \
    -mllvm -print-after=w65816-isel \
    -mllvm -filter-print-funcs=myfunc \
    -S prog.c 2>&1 | less
```

### Disassemble an object file

```bash
./tools/llvm-mos-build/bin/llvm-objdump --triple=w65816 -d hello.o
```

---

## Cycle-count benchmarks

Thirteen microbenchmarks live under [`benchmarks/`](../benchmarks/):
eight integer/string micro-benches, three soft-double FP benches (`dadd`,
`dmul`, `ddiv`), and two "game-like" workloads, `particles` (32-particle
physics tick with i16 bounce/wall collision) and `mandelbrot` (4×4
fixed-point Mandelbrot tile exercising i32 multiply and conditional
control flow).

```bash
bash scripts/benchCycles.sh
```

Output (2026-05-21):

```
| Benchmark | Per-iteration cycles |
|-----------|---------------------:|
| bsearch | 127 cyc/iter (100 iters) |
| crc32 | <65 (under timer resolution) |
| dadd | 1157 cyc/iter (10 iters) |
| ddiv | 1261 cyc/iter (10 iters) |
| dmul | 1033 cyc/iter (10 iters) |
| dotProduct | 144 cyc/iter (100 iters) |
| fib | 97 cyc/iter (100 iters) |
| mandelbrot | 11570 cyc/iter (1 iter, GRID=4 MAX_ITER=8) |
| memcmp | 113 cyc/iter (100 iters) |
| particles | 2253 cyc/iter (3 iters, N=32) |
| popcount | 93 cyc/iter (100 iters) |
| strcpy | 91 cyc/iter (100 iters) |
| sumOfSquares | 126 cyc/iter (100 iters) |
```

The legacy `scripts/benchCyclesPrecise.sh` (per-call cycle count via
`emu.time()`) is still available but slower to run.

The [`compare/`](../compare/) directory has side-by-side `.s` files vs
Calypsi 5.16 for sumSquares, evalAt, and mul16to32. Rerun with:

```bash
bash compare/regen.sh
```

---

## Known limitations

- **C++ exceptions** via DWARF unwinding are not implemented.
  `try` / `catch` compiles but doesn't unwind. `-fsjlj-exceptions`
  works for limited SJLJ-style throwing.
- **`stdin`** always returns EOF. `scanf` compiles but isn't useful.
  Use `sscanf` on a buffer instead.
- **File I/O** through `fopen` requires a backing implementation. The
  default `mfs` backing (memory-file-system) lets you simulate files
  via `mfsRegister()`, which is useful for tests but not for real disk
  I/O. GS/OS file I/O works via `runtime/iigsGsos.o` if you link against
  the GS/OS runtime.
- **`fork`/`exec`** are not supported; they don't apply on a 65816.
- **Code generation gotcha:** very large stack frames (>200 bytes)
  trigger FP-relative addressing. Most programs fit under that limit.
  See the `frame-rel` discussion in
  [LLVM_65816_DESIGN.md](../LLVM_65816_DESIGN.md).
- **Three Lua functions** (`luaV_execute`, `symbexec`, `auxsort`) hit
  the greedy register allocator's complexity budget. Workaround:
  compile those TUs with `-mllvm -regalloc=basic`. Documented in
  [`tests/lua/README.md`](../tests/lua/README.md).

---

## Where to go next

- **Building real GS/OS apps:** see
  [`docs/multiSegmentPlan.md`](multiSegmentPlan.md) and the
  `demos/launch.sh` script for booting through real GS/OS 6.0.2 in
  MAME. The 9 demos under `demos/` are reasonable starting points.
- **Backend internals (you're hacking on the compiler):**
  [LLVM_65816_DESIGN.md](../LLVM_65816_DESIGN.md).
- **Smoke tests:** `scripts/smokeTest.sh` runs ~150 end-to-end checks.
  Read it for examples of every feature in action.
- **Cycle-bench a Lua port or other real-world C:** see
  [`tests/lua/README.md`](../tests/lua/README.md) for the recipe
  (vendoring + per-file regalloc tuning + libc stubs).