# LLVM 65816 Backend for Apple IIgs

## Project Design and Handoff Document

---

## 1. Project Goal

Build an optimized Clang/LLVM C compiler backend targeting the WDC 65816
processor, specifically for Apple IIgs development. The backend must produce
genuinely optimized output: not just correct code, but tight, cycle-efficient
65816 assembly that takes full advantage of the architecture.

ORCA/C is the existing open source option and its code generation quality is
poor. Calypsi is the commercial alternative with good output quality but is
closed source. This project aims to produce an open source backend that matches
or exceeds Calypsi's output quality.

---

## 2. Background and Context

### 2.1 The 65816 Architecture

The WDC W65C816S is a 16-bit microprocessor used in the Apple IIgs (and SNES).
Key characteristics relevant to code generation:

- **Dynamic register width:** The A (accumulator), X, and Y registers can be
  either 8-bit or 16-bit depending on the M and X bits in the processor status
  register. Mode switching is done via REP (Reset bits) and SEP (Set bits)
  instructions. REP #$20 sets 16-bit accumulator mode; SEP #$20 sets 8-bit.
  REP #$10 sets 16-bit index registers; SEP #$10 sets 8-bit.

- **Direct page:** A relocatable 256-byte window in bank 0 RAM, similar to
  zero page on the 6502 but moveable. The Direct Page Register (DP) holds the
  base address. Direct page addressing saves one byte per instruction and one
  cycle over absolute addressing (the cycle saving holds when the low byte of
  DP is zero; otherwise each direct page access costs one extra cycle).
  Highly valuable for hot variables.

- **24-bit address space:** Addresses are bank:offset. The Data Bank Register
  (DBR) implicitly provides the bank for 16-bit absolute data accesses. The
  Program Bank Register (PBR) provides the bank for code. Long (24-bit)
  addressing is available for cross-bank access.

- **Stack:** 16-bit stack pointer in bank 0. Stack-relative addressing is
  available but expensive.

- **Native vs. emulation mode:** At reset the CPU starts in emulation mode
  (behaves like 65C02). Native mode (16-bit capable) is entered by clearing
  the emulation bit. All IIgs native code runs in native mode.

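As a minimal sketch of the width rules above (not backend code), the effect of REP and SEP on the M and X status bits can be modeled as plain bit operations. The names `StatusT`, `applyRep`, `applySep`, `accIs16Bit`, and `indexIs16Bit` are illustrative assumptions; only the masks `$20` and `$10` come from the text.

```cpp
#include <cstdint>

// Hypothetical model of the processor status register P in native mode.
using StatusT = std::uint8_t;

constexpr StatusT mFlag = 0x20; // M bit: 1 = 8-bit accumulator
constexpr StatusT xFlag = 0x10; // X bit: 1 = 8-bit index registers

// REP clears every status bit that is set in the mask.
constexpr StatusT applyRep(StatusT p, StatusT mask) {
  return static_cast<StatusT>(p & ~mask);
}

// SEP sets every status bit that is set in the mask.
constexpr StatusT applySep(StatusT p, StatusT mask) {
  return static_cast<StatusT>(p | mask);
}

constexpr bool accIs16Bit(StatusT p) { return (p & mFlag) == 0; }
constexpr bool indexIs16Bit(StatusT p) { return (p & xFlag) == 0; }
```

For example, `applyRep(0x30, 0x20)` leaves the X bit set, which corresponds to a 16-bit accumulator with 8-bit index registers. A mask of `0x30` switches both widths with one instruction.
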
### 2.2 Why This Is Hard for LLVM

LLVM's register allocator assumes fixed-width registers. The 65816's dynamic
register widths are a fundamental mismatch. Key problems:

1. **REP/SEP scheduling:** Optimal code minimizes mode switches since each
   REP/SEP costs 3 cycles. The backend must decide register width per
   region of code, coalescing regions that use the same width to minimize
   transitions. This is a global dataflow problem.

2. **Direct page as a register file:** LLVM has no model for a relocatable
   zero page. Hot variables should be allocated to direct page offsets, but
   GS/OS (the IIgs operating system) reserves certain direct page locations
   for its own use. The allocator must know which regions are safe.

3. **Bank register management:** DBR affects all 16-bit absolute data accesses.
   The backend must ensure DBR is set correctly for cross-bank data access.

4. **24-bit address space in 32-bit ELF:** LLVM's ELF format uses 32-bit
   addresses. The llvm-mos project has already defined an ELF extension for
   MOS processors that handles this correctly (ELFCLASS32 with unused high
   bits zeroed).

5. **GS/OS calling conventions:** Apple IIgs toolbox calls use a stack-based
   parameter protocol different from any standard calling convention LLVM
   knows. Custom lowering is required for toolbox calls.

### 2.3 Existing Work

- **llvm-mos:** An open source LLVM fork providing first-class support for
  MOS Technology 65xx processors. The 6502 backend is production quality.
  The 65816 assembler (llvm-mc) has partial support: direct page assumed
  at 0x0000, 24-bit long addressing supported, 16-bit immediates supported.
  The **compiler backend** (C to 65816) is incomplete. Issues #32 and #321
  in the llvm-mos GitHub track this work. The core blocker is REP/SEP
  register width management.

- **jeremysrand/llvm-65816:** A separate LLVM backend attempt specifically
  targeting Apple IIgs, aiming to output Merlin 32-compatible assembly.
  The repo explicitly states "Don't even try to use it yet" and appears
  stalled.

- **WorldsApartDevTeam/65816-c:** Claims C11 compliance for 65C816. Only
  2 stars, 28 commits, no releases. Very early stage.

- **Calypsi:** Commercial closed-source compiler, version 5.16 released
  April 15, 2026. Best available 65816 C compiler. Free for hobby use.
  Use its output as a quality benchmark.

- **ORCA/C:** Open source, runs on-machine or via emulator. Mature but
  generates poor code. The baseline we need to beat significantly.

### 2.4 llvm-mos as the Foundation

This project is built on top of llvm-mos, not vanilla LLVM:

- llvm-mos already has the MOS ELF specification implemented
- It has the 6502 backend as a reference for 65xx-family conventions
- It has the assembler-level 65816 support to build on
- It has the SDK infrastructure for platform-specific runtime support
- GitHub: https://github.com/llvm-mos/llvm-mos

### 2.5 Architectural Decision: Separate W65816 Target

llvm-mos already defines `FeatureW65816` as a subtarget feature of MOS, and
tracks the missing codegen support in issue #321. We could have added 16-bit
register handling to the existing MOS target as a subtarget feature.

**We did not.** This project is a **separate W65816 target**, maintained as
our own fork of llvm-mos. Reasons:

- Clean register model. MOS's register classes assume 8-bit hardware; aliasing
  A/X/Y across 8 and 16-bit widths inside MOS would require invasive changes
  to an already-shipping target. A fresh target can define
  `Acc8`/`Acc16`/`Idx8`/`Idx16` classes directly and let the REP/SEP pass
  operate at the MIR level.
- Independent evolution. This codebase can move at its own pace without
  coordinating against llvm-mos's 6502 stability guarantees.
- No upstream burden. We are not attempting to land this in llvm-mos; this is
  not a subtarget-feature extension to MOS, it is a sibling target.

The MOS target remains available in the same source tree, unchanged. We
borrow infrastructure (ELF writer, MC layer patterns) but implement our own
registers, instructions, scheduling, and lowering.

---

## 3. Architecture

### 3.1 LLVM Backend Structure

A standard LLVM backend consists of:

```
Clang frontend
      │
      ▼
LLVM IR (architecture-independent)
      │
      ▼
Target-specific lowering (SelectionDAG or GlobalISel)
      │
      ▼
Machine IR (MIR)
      │
      ▼
Register allocation
      │
      ▼
Instruction scheduling
      │
      ▼
Code emission (assembly or object file)
```

The backend adds:
- **TableGen definitions:** Register file, instruction encodings, calling
  conventions, operand types, instruction selection patterns
- **Target machine:** Configuration, subtarget features, data layout
- **Instruction selector:** Lowers LLVM IR operations to 65816 instructions
- **Register allocator customization:** Handles the special register width
  and direct page constraints
- **Pass pipeline:** Custom optimization passes for REP/SEP scheduling and
  direct page allocation

### 3.2 Register File Definition

```tablegen
let Namespace = "W65816" in {
  // Processor status register bits. The assembler names are placeholders;
  // they are kept distinct from the index register X so that two registers
  // do not share the name "x".
  def MBit : Register<"mbit">; // accumulator width (0=16-bit, 1=8-bit)
  def XBit : Register<"xbit">; // index width (0=16-bit, 1=8-bit)

  // Main registers
  def A   : Register<"a">;   // accumulator (8 or 16-bit)
  def X   : Register<"x">;   // index X (8 or 16-bit)
  def Y   : Register<"y">;   // index Y (8 or 16-bit)
  def SP  : Register<"sp">;  // stack pointer (16-bit)
  def DP  : Register<"dp">;  // direct page register (16-bit)
  def DBR : Register<"dbr">; // data bank register (8-bit)
  def PBR : Register<"pbr">; // program bank register (8-bit)
  def PC  : Register<"pc">;  // program counter (16-bit)
  def P   : Register<"p">;   // processor status
}

// Register classes: the same physical registers viewed at each width
def Acc8  : RegisterClass<"W65816", [i8],  8,  (add A)>;
def Acc16 : RegisterClass<"W65816", [i16], 16, (add A)>;
def Idx8  : RegisterClass<"W65816", [i8],  8,  (add X, Y)>;
def Idx16 : RegisterClass<"W65816", [i16], 16, (add X, Y)>;
```

### 3.3 REP/SEP Scheduling Strategy

The core algorithmic challenge. Proposed approach:

**Phase 1 (width inference):** For each basic block, determine the "natural"
width of each operation: loads/stores of i8 values prefer 8-bit mode, i16
values prefer 16-bit mode.

**Phase 2 (width coalescing):** Run a dataflow analysis across the CFG. Assign
each basic block a preferred width that minimizes the total number of mode
switches on all edges. This is a minimum cut / graph partitioning problem.
A greedy approach works well in practice.

**Phase 3 (transition insertion):** At points where the width changes, insert
REP or SEP pseudo-instructions. These are later lowered to real REP/SEP.

**Phase 4 (peephole cleanup):** Eliminate redundant REP/SEP pairs. If a
block ends with SEP #$20 and its only successor starts with REP #$20, the
SEP is dead and can be removed. If the mode was already 16-bit on every path
into that pair, the REP is redundant as well.

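To make Phases 1 through 3 concrete, here is a minimal sketch restricted to straight-line code; a real pass would work on MIR across the whole CFG. `OpT` and `insertWidthSwitches` are hypothetical names, and `RegWidthT` follows the naming convention in section 9. Operations that do not care about the accumulator width inherit the current mode, which is the straight-line special case of coalescing.

```cpp
#include <string>
#include <vector>

// Width an operation needs from the accumulator; Any means it does not care.
enum class RegWidthT { Any, Bits8, Bits16 };

struct OpT {
  std::string text; // assembly text, for illustration only
  RegWidthT width;  // natural width as inferred in Phase 1
};

// Walks a straight-line sequence and emits REP/SEP #$20 only where the
// required accumulator width differs from the current one. Passing
// RegWidthT::Any as entryWidth means the mode on entry is unknown, so the
// first width-sensitive operation always gets a switch.
std::vector<std::string> insertWidthSwitches(const std::vector<OpT> &ops,
                                             RegWidthT entryWidth) {
  std::vector<std::string> out;
  RegWidthT current = entryWidth;
  for (const OpT &op : ops) {
    if (op.width != RegWidthT::Any && op.width != current) {
      out.push_back(op.width == RegWidthT::Bits16 ? "rep #$20" : "sep #$20");
      current = op.width;
    }
    out.push_back(op.text);
  }
  return out;
}
```

With a 16-bit entry mode, a sequence of two 16-bit operations followed by two 8-bit stores gets a single `sep #$20` before the first store, rather than one switch per operation.
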
### 3.4 Direct Page Allocation

A custom LLVM pass runs after register allocation:

1. Identify the GS/OS-safe direct page region. GS/OS uses $00-$9F for
   toolbox and system use. The safe region for user code is $A0-$FF
   (96 bytes). This may need to be configurable per project.

2. Score all local variables and spill slots by access frequency (using
   LLVM's block frequency analysis).

3. Greedily allocate the highest-frequency variables to direct page
   offsets, packing them tightly to fit within the 96-byte budget.

4. Rewrite all accesses to direct-page-allocated variables to use DP
   addressing mode instead of stack-relative or absolute addressing.

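Steps 2 and 3 can be sketched as a sort followed by a linear pack. This is an illustration under assumptions, not the pass itself: `DpCandidateT` and `allocateDirectPage` are hypothetical names, the frequencies are supplied by the caller rather than read from LLVM's block frequency analysis, and the default range is the $A0-$FF region quoted above (still listed as an open question in section 8).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

struct DpCandidateT {
  std::string name;        // variable or spill slot, for illustration
  unsigned sizeBytes;      // storage size in bytes
  std::uint64_t frequency; // estimated access count
};

struct DpSlotT {
  std::string name;
  unsigned offset; // direct page offset assigned to the variable
};

// Greedy allocation: highest frequency first, packed upward from firstSafe.
// Candidates that do not fit in the remaining budget are skipped and keep
// their original stack or absolute location.
std::vector<DpSlotT> allocateDirectPage(std::vector<DpCandidateT> candidates,
                                        unsigned firstSafe = 0xA0,
                                        unsigned lastSafe = 0xFF) {
  assert(firstSafe <= lastSafe && lastSafe <= 0xFF);
  std::stable_sort(candidates.begin(), candidates.end(),
                   [](const DpCandidateT &a, const DpCandidateT &b) {
                     return a.frequency > b.frequency;
                   });
  std::vector<DpSlotT> slots;
  unsigned next = firstSafe;
  for (const DpCandidateT &c : candidates) {
    if (c.sizeBytes == 0 || next + c.sizeBytes > lastSafe + 1)
      continue; // does not fit; leave it where it is
    slots.push_back({c.name, next});
    next += c.sizeBytes;
  }
  return slots;
}
```

Ranking by raw frequency is the simplest choice. Ranking by frequency per byte would favor small hot variables over large warm ones and may pack the 96-byte budget better; which works best is something to measure.
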
### 3.5 Calling Convention

Standard function calls:
- Parameters passed on stack (push right-to-left, caller cleans)
- Return value in A (8-bit) or A (16-bit) or A:X (32-bit)
- Callee saves: DP, DBR, direct page contents if modified
- Stack frame: return address (3 bytes when the call uses JSL), then locals

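The frame layout above determines the stack-relative displacement of each parameter. The sketch below works that out under explicit assumptions: the call uses JSL, the callee has pushed nothing except `localBytes` of locals (so no saved DP or DBR sits between locals and return address), and the function name `paramOffsets` is hypothetical. On the 65816, S points at the next free byte, so the most recently pushed byte is at displacement 1.

```cpp
#include <cassert>
#include <vector>

// Returns the displacement d (as in "lda d,s") of each parameter inside the
// callee. Parameters are pushed right to left, so the first parameter was
// pushed last and sits at the lowest address, directly above the 3-byte
// return address left by JSL.
std::vector<unsigned> paramOffsets(const std::vector<unsigned> &paramSizes,
                                   unsigned localBytes) {
  const unsigned returnAddressBytes = 3;
  std::vector<unsigned> offsets;
  unsigned next = 1 + localBytes + returnAddressBytes;
  for (unsigned size : paramSizes) {
    // Stack-relative displacements are a single byte.
    assert(size > 0 && next + size - 1 <= 0xFF);
    offsets.push_back(next);
    next += size;
  }
  return offsets;
}
```

With no locals, three parameters of 2, 2, and 4 bytes land at displacements 4, 6, and 8. Reserving 6 bytes of locals shifts each of them up by 6.
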
GS/OS toolbox calls:
- Parameters pushed as a parameter block
- Tool call via JSL $E10000 (tool dispatcher), with the tool call number in X
- Custom call lowering needed: mark with a special ISD node and lower
  to JSL with the correct tool number encoding

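The tool number encoding packs two bytes into the value loaded into X before the JSL: the function number in the high byte and the tool set number in the low byte. This comes from the Apple IIgs toolbox convention rather than from anything defined in this document, and the helper name `toolCallNumber` is hypothetical.

```cpp
#include <cstdint>

// Packs a tool set number and a function number into the 16-bit value that
// is loaded into X before "jsl $E10000".
constexpr std::uint16_t toolCallNumber(std::uint8_t toolSet,
                                       std::uint8_t function) {
  return static_cast<std::uint16_t>((function << 8) | toolSet);
}
```

For example, function $02 of tool set $04 (QuickDraw II) encodes as $0204, so the lowered call would be `ldx #$0204` followed by `jsl $E10000`.
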
### 3.6 Memory Model

The Apple IIgs memory map relevant to code generation:

```
Bank 00:     $0000-$01FF  Zero page + stack (system reserved)
             $0200-$9FFF  User RAM (conventional)
             $A000-$BFFF  User RAM or LC bank
             $C000-$CFFF  I/O space
             $D000-$FFFF  ROM / language card
Bank 01:     $0000-$FFFF  Additional RAM
Bank E0:     $0000-$FFFF  Mega II (Apple //e on a chip) I/O space
Bank E1:     $0000-$FFFF  Video/reserved
Bank FC-FF:               ROM banks
```

The default code model places code in bank 0 (or bank 1 for large programs)
and data in bank 0. Long (24-bit) pointers are needed for cross-bank access.

---

## 4. Key Optimizations

### 4.1 Register Width Selection (highest impact)

Using the 16-bit accumulator for word operations instead of two 8-bit
operations roughly halves the instruction count for arithmetic and cuts cycle
counts substantially: a 16-bit memory access costs one cycle more than the
8-bit form, not twice as much. The REP/SEP scheduling described above is the
primary mechanism.

### 4.2 Direct Page Promotion (second highest impact)

Direct page addressing saves 1 byte and 1 cycle per instruction versus
absolute addressing. For variables accessed in tight loops, this compounds
significantly. The allocation pass described above handles this.

### 4.3 Addressing Mode Selection

The instruction selector must choose the most efficient addressing mode:
- Direct page (8-bit offset): preferred for hot variables
- Absolute (16-bit address): for global/static data in same bank
- Long (24-bit address): for cross-bank data access
- Stack-relative: avoid where possible, expensive on 65816
- Indexed: X or Y register indexed variants of the above

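The preference order in the list can be written as a small decision function. This is a sketch with hypothetical names (`AddrModeT`, `OperandT`, `selectAddrMode`, `encodedSize`); indexed variants are left out, and a real selector would be driven by instruction selection patterns rather than a hand-written function.

```cpp
#include <cstdint>

enum class AddrModeT { DirectPage, Absolute, Long, StackRelative };

// Hypothetical description of where an operand lives.
struct OperandT {
  bool inDirectPage; // promoted by the direct page allocator
  bool onStack;      // lives in the current stack frame
  std::uint8_t bank; // bank of the operand (ignored for DP and stack operands)
};

// Picks the cheapest mode that can reach the operand, given the bank
// currently held in DBR.
AddrModeT selectAddrMode(const OperandT &op, std::uint8_t dataBank) {
  if (op.inDirectPage)
    return AddrModeT::DirectPage;
  if (op.onStack)
    return AddrModeT::StackRelative;
  if (op.bank == dataBank)
    return AddrModeT::Absolute;
  return AddrModeT::Long;
}

// Encoded size in bytes of a non-immediate instruction using the mode:
// one opcode byte plus the operand bytes.
unsigned encodedSize(AddrModeT mode) {
  switch (mode) {
  case AddrModeT::DirectPage:
  case AddrModeT::StackRelative:
    return 2;
  case AddrModeT::Absolute:
    return 3;
  case AddrModeT::Long:
    return 4;
  }
  return 0; // not reached for valid enumerators
}
```

The size table shows why the order matters for code density: each step down the list from direct page to absolute to long adds one byte per instruction.
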
### 4.4 MVN/MVP for Block Moves

The 65816 has block move instructions (MVN = block move negative, MVP = block
move positive) that move memory efficiently. These map to LLVM's memcpy/memmove
intrinsics. The backend should recognize and emit these for small to medium
copies rather than emitting a loop. Each instruction costs 7 cycles per byte
moved, so the size range where it beats an unrolled load/store sequence should
be measured rather than assumed.

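Lowering memmove to a block move means choosing a direction and setting up three registers: A holds the byte count minus one, X the source offset, and Y the destination offset. MVN increments X and Y as it copies, while MVP decrements them and therefore starts from the last byte of each block. The sketch below assumes both blocks lie in one bank without wrapping; `BlockMoveT` and `planBlockMove` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

struct BlockMoveT {
  bool useMvp;     // true: MVP (decrementing), false: MVN (incrementing)
  std::uint16_t a; // byte count minus one
  std::uint16_t x; // source offset the instruction starts from
  std::uint16_t y; // destination offset the instruction starts from
};

// Plans a memmove-style copy of length bytes (1..65536) from src to dst.
BlockMoveT planBlockMove(std::uint16_t src, std::uint16_t dst,
                         std::uint32_t length) {
  assert(length >= 1 && length <= 0x10000);
  const std::uint16_t countMinusOne = static_cast<std::uint16_t>(length - 1);
  // Copying upward (MVN) is safe unless the destination starts inside the
  // source block. In that case copy downward from the last byte (MVP).
  const bool destInsideSource =
      dst > src && static_cast<std::uint32_t>(dst - src) < length;
  if (destInsideSource) {
    return {true, countMinusOne,
            static_cast<std::uint16_t>(src + countMinusOne),
            static_cast<std::uint16_t>(dst + countMinusOne)};
  }
  return {false, countMinusOne, src, dst};
}
```

A plain memcpy, where the blocks are known not to overlap, can always take the MVN path. Both instructions leave DBR set to the destination bank, so the lowering also has to restore DBR if the surrounding code depends on it.
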
### 4.5 Stack Frame Minimization

Minimize spills to the stack, since they use expensive stack-relative
addressing. Prefer direct page allocation for spill slots (via the DP
allocator above).

---

## 5. Testing Strategy

### 5.1 MAME Lua Interface for Automated Testing

MAME's Lua scripting interface allows automated correctness testing against
an emulated IIgs.

Test harness workflow:
1. Compile a C test case with the backend
2. Load the binary into MAME's IIgs emulation
3. A Lua script sets breakpoints, runs the code, checks register and
   memory state against expected values
4. Report pass/fail

Example Lua test script structure:
```lua
-- Sketch only: assumes MAME was started with -debug, so that cpu.debug and
-- manager.machine.debugger are available.
local cpu = manager.machine.devices[":maincpu"]
local debugger = manager.machine.debugger
local testEnd = 0x010200 -- breakpoint at end of test function

-- bpset takes optional condition and action strings (debugger expressions
-- and commands), not Lua callbacks, so stop at the breakpoint and poll.
cpu.debug:bpset(testEnd)
debugger.execution_state = "run"

emu.register_periodic(function()
    if debugger.execution_state ~= "stop" then
        return
    end

    local a = cpu.state["A"].value
    local p = cpu.state["P"].value
    local mBit = (p >> 5) & 1 -- M bit: 0=16-bit, 1=8-bit

    -- validate result and register width state
    if a ~= 0x1234 then
        print(string.format("FAIL: A=0x%04X expected 0x1234", a))
    elseif mBit ~= 0 then
        print("FAIL: expected 16-bit accumulator mode")
    else
        print("PASS")
    end
    manager.machine:exit()
end)
```

REP/SEP tracing, to validate mode switch sequences:
```lua
-- Log the instruction at the current PC if it is REP or SEP, to validate
-- width scheduling. Call this once per executed instruction, for example
-- while single-stepping through the debugger. MAME watchpoints cover reads
-- and writes only, so there is no execute-type watchpoint to hook.
local function logModeSwitch(cpu)
    local mem = cpu.spaces["program"]
    -- Assumption: the core exposes the program bank as "PB" next to a
    -- 16-bit "PC"; combine them into the full 24-bit address.
    local pc = (cpu.state["PB"].value << 16) | (cpu.state["PC"].value & 0xFFFF)
    local opcode = mem:read_u8(pc)
    if opcode == 0xC2 then -- REP
        print(string.format("REP #$%02X at $%06X", mem:read_u8(pc + 1), pc))
    elseif opcode == 0xE2 then -- SEP
        print(string.format("SEP #$%02X at $%06X", mem:read_u8(pc + 1), pc))
    end
end
```

### 5.2 Calypsi Output Comparison

Compile the same C code with Calypsi and with our backend. Diff the assembly
output. Cases where our output is significantly worse (more instructions,
more REP/SEP transitions, missing direct page usage) are optimization bugs.

### 5.3 Test Categories

- Basic arithmetic (8-bit and 16-bit, mixed width)
- Pointer operations (direct page, absolute, long)
- Function calls (stack frame setup/teardown)
- GS/OS toolbox calls
- Struct/array access
- Mixed 8/16-bit operations (the REP/SEP stress test)
- memcpy/memset (MVN/MVP emission)
- Interrupt handlers (bank register save/restore)

---

## 6. GS/OS Integration

### 6.1 Runtime Library

A minimal runtime library is needed:
- `crt0.s`: startup code, sets native mode, initializes DP and DBR,
  sets up stack, calls `main()`
- `libc`: subset of C standard library functions adapted for GS/OS
- GS/OS system call wrappers (file I/O, memory management, etc.)

### 6.2 ProDOS 16 / GS/OS Output Format

The linker must produce output in a format the IIgs can load:
- OMF (Object Module Format), the native Apple IIgs executable format
- Alternatively, a raw binary image for simple programs
- The llvm-mos SDK's infrastructure for platform-specific output formats
  can be adapted here

---

## 7. Implementation Order

1. **Fork llvm-mos** and create the W65816 target directory structure
2. **TableGen: registers and instruction set.** Describe all 65816 registers,
   addressing modes, and instructions
3. **Target machine descriptor.** Data layout, calling convention basics,
   feature flags (native mode, emulation mode)
4. **Instruction selector.** Basic patterns for arithmetic, load/store,
   branches; initially target a fixed 16-bit accumulator mode (no REP/SEP)
5. **Calling convention lowering.** Stack-based parameter passing, return
   values
6. **Basic test suite.** Compile simple C functions, verify correctness in
   MAME via Lua harness
7. **REP/SEP scheduling pass.** The core optimization; width inference,
   coalescing, transition insertion
8. **Direct page allocator.** Promote hot variables to DP addressing
9. **GS/OS toolbox call lowering.** Custom ISD nodes for JSL-based calls
10. **Runtime library.** crt0, minimal libc, GS/OS wrappers
11. **OMF output.** Linker support for IIgs executable format
12. **Optimization tuning.** Compare against Calypsi, close gaps

---

## 8. Open Questions

1. **GS/OS direct page reservation:** Exactly which direct page bytes does
   GS/OS reserve vs. what is safe for user code? Need to verify against
   the GS/OS reference documentation.

2. **Bank model:** Should the default code model assume everything fits in
   bank 0/1, or should we support a large memory model with 24-bit pointers
   by default? Small model is simpler; large model is more powerful.

3. **Interrupt handler ABI:** What registers must an IIgs interrupt handler
   save/restore? DBR and DP especially, since they may be in an unknown state
   when an interrupt fires.

4. **ORCA/C compatibility:** Do we try to be ABI-compatible with ORCA/C to
   allow linking against existing IIgs libraries? Or define a clean new ABI?
   ORCA/C compatibility would be valuable but constraining.

5. **REP/SEP in leaf functions:** For small leaf functions that only use one
   width, we can omit mode switches entirely if the caller guarantees a
   known state. Is a "width contract" calling convention attribute worth
   implementing?

6. **Cycle counting in MAME:** MAME's IIgs emulation may not be fully
   cycle-accurate. For performance regression testing, do we need real
   hardware or a more cycle-accurate emulator?

---

## 9. Coding Conventions

- `camelCase` for functions and variables
- `PascalCase` with `T` suffix for types (e.g., `RegWidthT`, `DpSlotT`)
- `//` single-line comments throughout
- Target language: C++ (LLVM requirement)
- Follow llvm-mos coding style for backend code
- Follow LLVM coding standards for everything else

---

## 10. References

- llvm-mos repository: https://github.com/llvm-mos/llvm-mos
- llvm-mos SDK: https://github.com/llvm-mos/llvm-mos-sdk
- llvm-mos ELF spec: https://llvm-mos.org/wiki/ELF_specification
- llvm-mos 65816 issue #32: https://github.com/llvm-mos/llvm-mos/issues/32
- llvm-mos 65816 issue #321: https://github.com/llvm-mos/llvm-mos/issues/321
- WDC 65816 data sheet: available from westerndesigncenter.com
- Apple IIgs Hardware Reference: archive.org
- Apple IIgs GS/OS Reference: archive.org
- Calypsi compiler (quality benchmark): https://www.calypsi.cc/
- ORCA/C (open source reference): https://github.com/byteworksinc/ORCA-C
- jeremysrand/llvm-65816 (prior attempt): https://github.com/jeremysrand/llvm-65816