65816-llvm-mos/SESSION_STATE.md
2026-04-28 16:49:41 -05:00

# Session Resume — llvm816 project
Drop this into a new Claude Code session and say "read SESSION_STATE.md and
continue where we left off." Pairs with `LLVM_65816_DESIGN.md` (the design
doc — read that second).
---
## 1. Project in one sentence
Build an open-source LLVM/Clang backend for the WDC 65816 (Apple IIgs) that
matches or exceeds Calypsi's output quality, forked from llvm-mos but
maintained as our own separate W65816 target. User is Scott; expert C dev,
doesn't want hand-holding on LLVM or 65816 basics.
## 2. Where we are in the plan
Design doc section 7 lists a 12-step implementation order. We are at:
- [x] **Setup toolchain** (prior session)
- [x] **Architectural decision: separate W65816 target** (design doc §2.5)
- [x] **Repo-layout decision:** `src/` holds our authored files, `patches/`
holds modifications to upstream llvm-mos files, `tools/llvm-mos/` is
ephemeral and gitignored. `scripts/applyBackend.sh` stitches src +
patches into the clone.
- [x] **Step 1 — scaffold W65816 target directory.** 41 files under
`src/llvm/lib/Target/W65816/` + 2 files under
`src/clang/lib/Basic/Targets/`. 4 upstream patches under `patches/`.
- [x] **Step 2 — verify the skeleton fully compiles and links.** All 8
tablegen generators run clean, three static libs
(LLVMW65816Info/Desc/CodeGen) build, llc links with the target
registered, zero warnings in the W65816-local build.
`./bin/llc -march=w65816 -filetype=null /dev/null` → exit 0.
- [x] **Step 2a — real MC-layer instructions.** `W65816InstrInfo.td` now
holds ~90 real 65816 opcodes (LDA/STA/LDX/LDY/STX/STY across
immediate/DP/abs/DPX/AbsX/AbsY/long where applicable; ADC/SBC/CMP/
AND/ORA/EOR/BIT; INC/DEC/ASL/LSR/ROL/ROR; all transfers; stack
push/pull; REP/SEP/CLC/SEC/XCE/XBA; branches; JMP/JML/JSR/JSL;
RTS/RTL/RTI; MVN/MVP). Instructions whose size depends on M or X
bits exist as `_Imm8`/`_Imm16` pairs carrying the appropriate
TSFlag bits (MLow/MHigh/XLow/XHigh) for the future REP/SEP pass.
- [x] **Step 2b — wire MCCodeEmitter.** Tablegen `-gen-emitter` runs
cleanly; `W65816MCCodeEmitter.cpp` calls the tablegen-provided
`getBinaryCodeForInstr` and emits Size bytes little-endian.
- [x] **Step 2c — symbolic fixups.** Each operand class (imm8/imm16/
addrDP/addrAbs/addrLong/pcrel8/pcrel16) has its own
`EncoderMethod` that emits a `W65816::fixup_*` at the correct
byte offset for expression operands. `W65816AsmBackend::applyFixup`
patches the data bytes little-endian for resolved fixups and
defers to `maybeAddReloc` for unresolved ones.
`W65816ELFObjectWriter::getRelocType` returns placeholder
relocation numbers 1-5 (swap for canonical R_W65816_* names once
the ELF `EM_` is decided — §7 item 1).
- [x] **Step 2d — patch 0005 eliminates data-layout warning.** The
data-layout string for `Triple::w65816` lives in
`llvm/lib/TargetParser/TargetDataLayout.cpp`; `W65816TargetMachine`
now calls `TT.computeDataLayout()`. Zero warnings in the W65816
build.
- [x] **Step 2e — AsmParser scaffold.** 441-line
`AsmParser/W65816AsmParser.cpp` ported from MSP430, stripped of
register-operand handling (65816 has no MC register operands),
with width-narrowing predicates on each operand class so the
matcher picks the narrowest instruction variant the value fits
(e.g. `sta $10` → STA_DP, `sta $1000` → STA_Abs, `sta $10000` →
STA_Long). `#` is emitted as a literal token to match the AsmString
tokenisation. Block-move (MVN/MVP) uses `addrDP` for both bank
bytes so `mvn $01, $02` parses.
- [x] **Step 2f — operand bit-field wiring.** Every `Inst*` class in
`W65816InstrFormats.td` now assigns named bitfields into
`Inst{N-8}` (e.g. `let Inst{15-8} = imm;`). Without this
tablegen emits an encoder that writes the opcode but leaves the
operand bytes as zero — we had that bug for an iteration.
- [x] **Step 2g — smoke-test script.** `scripts/smokeTest.sh` checks
llc registration, empty-module codegen, and llvm-mc encoding of
a representative instruction mix. Run with `--build` to rebuild
first.
- [x] **Step 2h — end-to-end ELF object.** `llvm-mc -filetype=obj`
produces a valid ELF with relocations at correct byte offsets.
Relocations are placeholder numbers 1-5 (§7 item 1 — decide
EM_/R_* mapping).
- [x] **Step 2i — Disassembler.** 190-line
`Disassembler/W65816Disassembler.cpp` tries 1/2/3/4-byte decode
tables in ascending size order. Custom decoder callbacks for
imm/addr/pcrel operands wrap raw bits into MCOperands. Mode-
ambiguous opcodes (LDA/LDX/LDY/ADC/SBC/CMP/AND/ORA/EOR/BIT/CPX/
CPY immediate forms) are parked in separate
DecoderTableW65816{MHigh,XHigh}16 tables and the scaffold only
reads the default tables — so those opcodes always disassemble
as 3-byte 16-bit-immediate forms until a mode-aware decoder
lands (alongside REP/SEP).
- [x] **Step 2j — register operands in the AsmParser.** Key fix found
via round-trip test: tablegen treats `a`, `x`, `y` in AsmStrings
(e.g. `"inc a"`, `"lda\t$addr, x"`) as references to the real
register records, so the matcher expects register operands, not
literal tokens. AsmParser now produces `k_Reg` operands for
these identifiers. Verified: `inc a` → 0x1A, `lda $1000, x` →
0xBD,0x00,0x10, full ELF round-trip passes.
- [x] **Step 2k — smoke test covers disassembly.** The smoke test
now feeds raw bytes through `llvm-mc --disassemble` and checks
for expected mnemonics, so encoder/decoder asymmetries surface
immediately.
- [x] **Step 3a — first DAG patterns.** Type-as-mode model (approved).
`LDAi16imm` pseudo for i16 constants; `RTL` for retglue;
`emitPrologue` emits canonical `REP #$30`. Mode-dependent
`_Imm8` variants are `isCodeGenOnly` so the asm matcher never
picks them.
- [x] **Step 3c — single-arg function calls.** `LowerFormalArguments`
receives arg 0 in A; `LowerCall` passes arg 0 in A and emits the call
through a JSL pseudo that bridges the i16 symbol operand to the MC
`JSL_Long`'s 24-bit operand class. Result is back in A.
Multi-arg call lowering still wants a `PUSHA` SDNode + SP unwind
sequence — caller side currently fatals on >1 args.
- [x] **Step 3d — multi-arg via stack (callee side).**
`LowerFormalArguments` now reads arg 1+ from stack via
FrameIndex + load. `eliminateFrameIndex` translates LDAfi /
STAfi / ADCfi / SBCfi / ANDfi / ORAfi / EORfi / CMPfi pseudos
to their `LDA d,S` etc. counterparts with the offset baked in.
Stack-relative MC instructions are in place; AsmParser
recognises the `,s` suffix. Callee-side fully working: a
`define i16 @sum3(i16 %a, i16 %b, i16 %c)` compiles to
`clc; adc 4,s; clc; adc 6,s; rtl`.
- [x] **Step 3e — frame-index spill plumbing.** `storeRegToStackSlot`
and `loadRegFromStackSlot` emit STAfi / LDAfi pseudos so the
register allocator can spill Acc16 values when needed.
- [x] **Step 3f — multiplications via shifts.** Multiply by power-of-2
constants inherits the `shl` patterns (1/2/3/4 bits unrolled to
`asl a` sequences). Multiply by arbitrary constants and
runtime values fail at ISel pending library functions.
- [x] **Step 3h — clang front end builds.** Real C → 65816 machine
code via the full `clang -target w65816 -c` pipeline. Bumped
clang's `IntAlign`/`LongAlign`/`PointerAlign`/`SuitableAlign`
from 8 to 16; also overrode `allowsMisalignedMemoryAccesses` to
return true. `scripts/cDemo.sh` shows the full front-end
pipeline on a built-in 7-function demo. Additional patterns:
`INC_Abs`/`DEC_Abs` for `*p = *p + 1`; `ASRA16` (PHA;ASL;PLA;ROR
sequence) for signed shift-right by 1.
- [x] **Step 3i — frame reservation + epilogue.** `emitPrologue`
now emits `TSC; SEC; SBC #N; TCS` to reserve N bytes for locals
and spills, then `emitEpilogue` reverses with `TSC; CLC; ADC #N;
TCS` before the RTL. `eliminateFrameIndex` translates
FrameIndex operands into stack-relative offsets via
`disp = FrameOffset + StackSize`. `hasFPImpl` returns false
(no native FP — direct page would be the logical home). This
unblocks `clang -O0 -c` for pure-arithmetic functions (each
arg gets spilled to its own stack slot). Stack-relative
addressing modes for ADC/SBC/AND/ORA/EOR/CMP let the codegen
fold loads from frame indices into the carry-arithmetic ops.
- [x] **Step 3g — basic i8 codegen.** Acc8 patterns now cover:
`LDAi8imm` (constants), `INA_PSEUDO8` / `DEA_PSEUDO8` (inc/dec),
`ADCi8imm` / `SBCi8imm` (add/sub immediate), `ANDi8imm` /
`ORAi8imm` / `EORi8imm` (bitwise immediate), `LDA8abs` /
`STA8abs` (load/store via global), `ASLA8` / `LSRA8` (1-bit
shifts), `CMPi8imm` (compare against immediate, with BR_CC i8
lowering). Frame lowering scans the function IR for any i8
type usage (return, args, instruction values, operands) and
picks `REP #$10; SEP #$20` prologue when found, else
`REP #$30`. AsmPrinter masks i8 immediates to 8 bits before
printing so `i8 -16` shows `0xf0` rather than `0xfff0`.
Limitations: i8 mode is per-function only — mixed-mode
functions get the i8 prologue (8-bit A) and i16 ops fail.
Asm round-trip for i8 still loses M-mode info (the parser
can't disambiguate `lda #imm` between Imm8 and Imm16); use
`-filetype=obj` directly from llc to get the right encoding.
- [x] **Step 3b — globals, loads, stores, arithmetic, branches,
bitwise.** `LowerOperation` custom-lowers `GlobalAddress` and
`ExternalSymbol` to `W65816Wrapper(target...)`. Pseudo +
AsmPrinter-expansion family covers:
- `LDAi16imm`, `LDAabs`, `STAabs` (load/store/materialise via
Wrapper of global)
- `ADCi16imm`, `ADCabs`, `SBCi16imm`, `SBCabs` (add/sub with the
required CLC/SEC carry prefix)
- `ANDi16imm`, `ORAi16imm`, `EORi16imm` and their `*abs`
memory-fold variants
- `CMPi16imm`, `CMPabs` plus `W65816ISD::CMP` / `W65816ISD::BR_CC`
SDNodes; `LowerBR_CC` swaps constant-on-LHS forms and rewrites
SETULE/SETUGT/SETLE/SETGT to SETULT/SETUGE/SETLT/SETGE+1 so
the canonicalised DAG hits our patterns; condition-code map
covers BEQ/BNE/BCS/BCC plus signed BMI/BPL.
- `BRA` for unconditional `br`.
- `INA_PSEUDO` / `DEA_PSEUDO` for `add x, ±1` → `inc a` / `dec a`
- `ASLA16` / `LSRA16` for `shl x, 1` and `lshr x, 1` → `asl a` /
`lsr a`
- `NEGA16` for `0 - x` → `eor #$ffff; inc a`
- `(xor x, -1)` → `eor #$ffff` (bitwise NOT)
- Zero-extending byte load: `lda addr; and #$ff`
The end-to-end pipeline can now compile and assemble functions
that read/write globals, do arithmetic on them, and branch
conditionally — all with optimal-looking 65816 idioms (e.g.
`lda x ; clc ; adc y` for `*x + *y`).
- [ ] **Step 3i — open codegen gaps:**
1. **Multi-arg call lowering** (caller side). Callee side works;
caller still bails on >1 arg. Needs PUSHA SDNode + SP-unwind
in ADJCALLSTACKUP.
2. **Frame-reserved scratch space.** Prologue doesn't reserve
stack space for locals/spills, so any alloca'd value or
allocator-spilled value lands at a negative SP offset and
eliminateFrameIndex bails. Blocks: -O0 compilation of
functions with parameters; loops with PHIs that need to
compare two computed values; two-Acc16 binary ops in
general. Fix: emit `TSC; CLC; ADC #-N; TCS` (or PHA-loop)
in emitPrologue and the inverse in emitEpilogue, where N
is the function's frame size.
3. **Mixed-mode i8/i16.** Per-function mode only — the prologue
picks one mode; the other type's ops fail. REP/SEP scheduling
pass needed.
4. **Signed `(a - b)` overflow handling.** BMI/BPL based signed
comparisons are correct only when the subtraction can't
overflow; pathological values give wrong results.
5. **`sub imm, var`** and **`mul var, var`** (or non-power-of-2
constants). Need libcall support.
6. **SETCC and SELECT_CC i16.** Boolean conversions like
`(int)(cond != 0)` and `(cond) ? a : b` aren't selectable.
Custom lowering needed.
7. **Library functions.** `__mulhi3`, etc. — no runtime yet.
- [ ] **Step 4 — real frame lowering, calling convention, REP/SEP
scheduling pass.** The prologue `REP #$30` is unconditional;
the REP/SEP pass will remove it when redundant.
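As an illustration of the width-narrowing matcher described under Step 2e, here is a minimal Python sketch (a model, not the AsmParser code). The variant names come from the notes above and the opcode bytes are the standard 65816 encodings for STA; `encode_sta` and the table layout are hypothetical names for this example only.

```python
# Model of "pick the narrowest instruction variant the value fits".
# (variant name, opcode byte, operand size in bytes), narrowest first.
STA_VARIANTS = [
    ("STA_DP", 0x85, 1),    # sta $10
    ("STA_Abs", 0x8D, 2),   # sta $1000
    ("STA_Long", 0x8F, 3),  # sta $10000
]


def encode_sta(addr):
    """Return (variant, bytes) for the narrowest STA form that fits addr."""
    if addr < 0:
        raise ValueError("address must be non-negative")
    for name, opcode, size in STA_VARIANTS:
        if addr < (1 << (8 * size)):
            # Opcode first, then the operand bytes least-significant first.
            return name, bytes([opcode]) + addr.to_bytes(size, "little")
    raise ValueError("address does not fit in 24 bits")
```

For example, `encode_sta(0x1000)` selects `STA_Abs` and yields the bytes `8d 00 10`, the same operand byte order as the `lda $1000, x` round-trip noted in Step 2j.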
### Where we actually got to (current state, 2026-04-27)
The "open codegen gaps" list above is mostly resolved. Status of its
seven sub-items:
1. **Multi-arg call lowering (caller side)** — done. `LowerCall`
pushes args 1..N-1 right-to-left via `W65816ISD::PUSH`,
`ADJCALLSTACKUP` unwinds with `tsc;clc;adc #N;tcs`.
2. **Frame-reserved scratch space** — done. `emitPrologue` /
`emitEpilogue` use `tsc;sec;sbc #N;tcs` and the inverse.
3. **Mixed-mode i8/i16** — partial. Per-function mode based on IR
scan; full REP/SEP scheduling pass still TODO (Step 4).
4. **Signed `(a - b)` overflow in compares** — handled for i8/i16
via the signed-CC promote-to-i16 path. Still has the BMI/BPL
correctness caveat at INT16_MIN/MAX boundaries.
5. **`mul var, var` and friends** — done via libcalls; runtime stubs
live in `runtime/src/libgcc.s` (__mulhi3, __mulsi3, __ashlhi3,
__ashrhi3, __lshrhi3, __ashlsi3, __ashrsi3, __lshrsi3, __udivhi3,
__divhi3, __umodhi3, __modhi3, __udivsi3, __divsi3, __umodsi3,
__modsi3).
6. **SETCC and SELECT_CC i16** — done via custom inserter and the
`W65816cmp + W65816selectcc` SDNode pair.
7. **Library functions** — done; see #5 above.
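To make the caveat in item 4 concrete, the sketch below models the flags of a 16-bit subtraction and compares an N-only test (what BMI/BPL see) with the usual N xor V rule. This is an assumption about one possible fix, not what the backend does today. Note that CMP leaves V untouched on the 65816, so an N xor V test would need `SEC; SBC` rather than CMP. All function names are hypothetical.

```python
def sub16_flags(a, b):
    """Return (N, V) after a 16-bit SEC; SBC computing a - b."""
    a &= 0xFFFF
    b &= 0xFFFF
    diff = (a - b) & 0xFFFF
    n = bool(diff & 0x8000)
    # Signed overflow: operand signs differ and the result sign differs from a.
    v = bool((a ^ b) & (a ^ diff) & 0x8000)
    return n, v


def to_signed(x):
    """Interpret a 16-bit pattern as a signed value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def less_than_n_only(a, b):
    """Signed a < b judged from N alone (the BMI/BPL approach)."""
    return sub16_flags(a, b)[0]


def less_than_n_xor_v(a, b):
    """Signed a < b judged from N xor V (correct for all inputs)."""
    n, v = sub16_flags(a, b)
    return n != v
```

With `a = 0x8000` (INT16_MIN) and `b = 1` the subtraction overflows: the N-only test reports "not less", while N xor V reports "less", which matches the signed interpretation.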
### i32 (long) support — landed (2026-04-26..28)
- Type legalization splits i32 into two i16 halves.
- ABI: i32 first-arg lives in A:X (lo:hi), matching the return
ABI; subsequent i32 args go on stack 2 bytes per half.
`RetCC_W65816` assigns `[A, X]` for two i16 returns so
`__mulsi3` / `__divsi3` libcall returns work.
- ADD/SUB use the native ADC carry chain by marking ISD::ADDC/ADDE/SUBC/
SUBE Legal: `ADCi16imm` etc. mark `Defs = [P]` and pattern-match
`addc`; new `ADCEi16imm` / `ADCEabs` / `ADCEfi` (and SBC/E
variants) mark `Uses = [P], Defs = [P]` for `adde`/`sube`.
`ADDE_RR` / `SUBE_RR` have the inserter equivalent for two-Acc16
chains (e.g. fib32's loop). Net: an i32 add went from ~25 insns
(manual UADDO + SETCC + add-of-bool) to ~17 incl. prologue/epilogue,
with the core 8 being the optimal `clc;adc;sta;lda;adc;tax;lda;rtl`.
- NEGC16 / NEGE16 lower `(subc/sube 0, x)` for i32 negate via the
ADD chain (`EOR #$FFFF; CLC; ADC #1` lo, `EOR #$FFFF; ADC #0` hi).
- MUL/DIV/MOD/SHL/SHR/USHR all libcalled;
`preferredShiftLegalizationStrategy` returns `LowerToLibcall` for i32
to keep LLVM from emitting SHL_PARTS we'd have no pattern for.
- `BuildSDIVPow2` / `BuildSREMPow2` overrides return SDValue() to
block the magic-constant pow2 expansion that emits unsupported
BUILD_VECTOR.
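The carry-chain lowering above can be checked with a small arithmetic model. This is a sketch of the semantics only (two i16 halves linked by the carry flag), not generated code; `adc16`, `add32` and `neg32` are hypothetical helper names.

```python
MASK16 = 0xFFFF


def adc16(a, b, carry_in):
    """Model a 16-bit ADC: return (result, carry_out)."""
    total = (a & MASK16) + (b & MASK16) + carry_in
    return total & MASK16, total >> 16


def add32(a, b):
    """i32 add as two i16 halves: clc; adc lo ... adc hi."""
    a &= 0xFFFFFFFF
    b &= 0xFFFFFFFF
    lo, carry = adc16(a & MASK16, b & MASK16, 0)   # addc (CLC first)
    hi, _ = adc16(a >> 16, b >> 16, carry)         # adde (uses carry)
    return (hi << 16) | lo


def neg32(x):
    """i32 negate via the ADD chain used by NEGC16 / NEGE16."""
    x &= 0xFFFFFFFF
    lo, carry = adc16((x & MASK16) ^ MASK16, 1, 0)  # eor #$ffff; clc; adc #1
    hi, _ = adc16((x >> 16) ^ MASK16, 0, carry)     # eor #$ffff; adc #0
    return (hi << 16) | lo
```

The interesting case is a carry out of the low half: `add32(0x0001FFFF, 1)` gives `0x00020000`, and `neg32(0x00010000)` gives `0xFFFF0000` because the low-half `adc #1` wraps to zero and carries into the high half.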
### Other recent work
- `i1` `sext_inreg` lowered as `(sub 0, (and x, 1))`.
- `i8` `sext_inreg` and `sextload-i8` go through the existing
branchless `((x & 0xFF) ^ 0x80) - 0x80` sequence (SEXTLOAD i8 set
to Expand, sext_inreg pattern added).
- `extloadi8` from an `Acc16` register pointer maps to `LDAptr` (16-
bit load; consumer ignores high byte).
- Bare `ISD::FrameIndex` selected as `ADDframe (FI, 0)` for
alloca'd-array address-of; `eliminateFrameIndex` expands ADDframe
into `tsc;clc;adc #disp` (LEA equivalent).
- **Indirect calls** (function pointers): `LowerCall` redirects
through `__jsl_indir` in `runtime/src/libgcc.s` — caller stores
the dynamic target to global `__indirTarget` then JSLs the
trampoline, which does `JMP (__indirTarget)`. Target's RTL pops
the original JSL frame and returns directly to the caller.
Single-bank only (JMP indirect is bank-local).
- **Code-quality cleanup pass** (`W65816StackSlotCleanup`,
addPostRegAlloc):
- Removes redundant `LDAfi slot` after `STAfi reg, slot` when the
LDA's destination matches and nothing in between clobbers
either reg or slot. Catches the regalloc spill+reload cycle
around COPY $a → vreg.
- Removes dead `STAfi reg, slot` when a subsequent `STAfi`
overwrites the same slot before any read, OR when the function
returns without reading the slot (catches result-spill-before-
return that the libcall return ABI makes redundant).
- Combined with `isReMaterializable` on LDAfi from fixed FIs, the
i32 add went from 17 → 11 instructions.
- **i32 shift-by-1 inline** (task #59). The type-legalizer's
SHL_PARTS / SRL_PARTS expansion of `i32 << 1` / `>> 1` emits a
`(srl x, 15)` or `(shl x, 15)` for the carry-cross-halves slot.
Previously routed through __lshrhi3 / __ashlhi3 libcalls. Added
SRL15A pseudo (`ASL A; LDA #0; ROL A`, 3 instructions) and SHL15A
(`LSR A; LDA #0; ROR A`). i32 shl-by-1 went 33 → 26 insns;
shr-by-1 29 → 23.
- **i16 shift-by-8 inline** (task #60). Same idea for `(srl x, 8)`
and `(shl x, 8)` — used by i32 shift-by-8 type-legalization.
XBA swaps the two bytes of A in 16-bit M; AND clears the half
we don't want. 4 bytes per shift. i32 shl/shr-by-8 went
39/35 → 27/24 insns.
- **PUSH16X for direct X-push** (task #61). When LowerCall sees
an outgoing arg whose SDValue is `CopyFromReg` of a vreg that's
live-in from $x (i.e. the i32-first-arg-in-A:X hi half), emit
`phx` directly instead of `txa; pha` (which also requires
spilling $a to preserve it). mul32 went 19 → 13 insns.
- **Dead frame-slot trimming** (task #62). Extended
`W65816StackSlotCleanup` to scan MIR for unreferenced (post-cleanup) local
frame indices and zero-size them so PrologueEpilogue trims the
prologue PHA/TSC reservation. Combined with the spill cleanup,
shrinks frames in many functions by 2-4 bytes (one fewer
PHA + PLY pair).
- **i32 first-arg in A:X (task #50)**. When the first original
argument is i32 (LowerFormalArguments / LowerCall detect via
`Outs[0..1].OrigArgIndex == 0` on i16 halves), pass it lo:hi in
A:X — matching the i32 return ABI. Saves one stack slot per
i32 arg. Required updating libgcc.s helpers (`__mulsi3`,
`__udivsi3`, `__umodsi3`, `__divsi3`, `__modsi3`, `__ashlsi3`,
`__lshrsi3`, `__ashrsi3`, `__divmodsi_setup`) to read arg0_hi
from X (and shifted arg1 offsets).
- **Implicit Defs/Uses on stack-rel MC instructions**: was a
pre-existing latent bug — `eliminateFrameIndex` strips the
implicit A/P def/use info when it converts ADCfi/STAfi/etc. to
the MC form (ADC d,S, STA d,S etc.). Machine Copy Propagation
then sees stale dataflow and elides necessary TAX/TXA copies.
Fixed by re-attaching `RegState::Implicit` operands on each
expanded MC instruction in `W65816RegisterInfo::eliminateFrameIndex`.
Without this, the i32-A:X ABI miscompiles return values
(TAX gets elided, X retains arg0_hi instead of result_hi).
The fix also benefits the existing single-A path; before it,
certain Machine Copy Propagation choices were unsafe but
happened not to trigger. Now they're also safe.
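Several of the bit-level idioms listed above are easy to sanity-check with a 16-bit model. The sketch below mirrors the branchless i8 sign extension, the SRL15A / SHL15A carry shuffles, and the XBA-based shift-by-8. The AND masks for the XBA case are an assumption (the notes only say AND clears the unwanted half), and all function names are hypothetical.

```python
def sext8(x):
    """Branchless i8 -> i16 sign extension: ((x & 0xFF) ^ 0x80) - 0x80."""
    return (((x & 0xFF) ^ 0x80) - 0x80) & 0xFFFF


def srl15(x):
    """(srl x, 15) as ASL A; LDA #0; ROL A."""
    carry = (x >> 15) & 1          # ASL A moves bit 15 into carry
    a = 0                          # LDA #0 (does not touch carry)
    return ((a << 1) | carry) & 0xFFFF  # ROL A rotates carry into bit 0


def shl15(x):
    """(shl x, 15) as LSR A; LDA #0; ROR A."""
    carry = x & 1                  # LSR A moves bit 0 into carry
    a = 0                          # LDA #0
    return ((a >> 1) | (carry << 15)) & 0xFFFF  # ROR A: carry into bit 15


def xba(x):
    """XBA swaps the two bytes of the 16-bit accumulator."""
    return ((x << 8) | (x >> 8)) & 0xFFFF


def srl8(x):
    return xba(x & 0xFFFF) & 0x00FF   # assumed mask: keep the low byte


def shl8(x):
    return xba(x & 0xFFFF) & 0xFF00   # assumed mask: keep the high byte
```

Checking all 65536 inputs against plain shifts is cheap, which makes this kind of model a convenient cross-check before adding a new inline pseudo.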
### Currently still pending
- **REP/SEP scheduling pass** (Step 4) — per-function mode only;
mixed-mode functions don't work.
- **Vararg functions** — `LowerFormalArguments` reports a fatal
error.
- **i32 comparison** — uses SETCC+ADD-of-bool instead of a CMP+SBC
chain (analogous to the ADC chain we landed for add/sub).
- **Regalloc** (#56) — heapify-style functions with 4+ live i16
values run out of A.
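For the pending i32 comparison item, the CMP+SBC chain can be modelled the same way as the ADC chain. The sketch below covers the unsigned case only and follows the standard 65xx multi-word compare idiom (compare the low halves, then subtract the high halves with borrow). It is an assumption about the eventual lowering, not a description of code that exists; `uge32_chain` is a hypothetical name.

```python
def uge32_chain(a, b):
    """Return the final carry of CMP lo / SBC hi: True iff a >= b (unsigned)."""
    a &= 0xFFFFFFFF
    b &= 0xFFFFFFFF
    a_lo, a_hi = a & 0xFFFF, a >> 16
    b_lo, b_hi = b & 0xFFFF, b >> 16
    # lda a_lo ; cmp b_lo  -> carry set means "no borrow" (a_lo >= b_lo)
    carry = a_lo >= b_lo
    borrow = 0 if carry else 1
    # lda a_hi ; sbc b_hi  -> subtract with the borrow from the low half
    carry = (a_hi - b_hi - borrow) >= 0
    return carry
```

After the chain, BCS / BCC would branch on unsigned `>=` / `<`. For example, `uge32_chain(0x00010000, 0x0000FFFF)` is true even though the low-half compare borrows.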
### Smoke-test coverage (31 checks as of 2026-04-28)
`scripts/smokeTest.sh` covers: target registration, llvm-mc encode/
disassemble, end-to-end IR→ELF, multi-pattern function, single-arg
call, 3-arg stack reads, pure-i8 SEP prologue, multi-branch SETCC,
SELECT_CC, two-Acc16 spill, libcall emission (__mulhi3/__ashlhi3),
pointer load/store, runtime/build.sh, real-world program,
libcall-symbol coverage, signed/eq i8 compare, -O2 tiny C, i32 add
end-to-end, i32 carry-chain shape (1 clc + 2 adc + 0 bcc), i32
A:X first-arg ABI (1 txa), 32-bit fib loop (ADDE_RR inserter),
__mulsi3 libcall, alloca'd-array LEA, signed-byte strcmp
(sextload + sext_inreg + extload-via-ptr), indirect call via
__jsl_indir trampoline, i32 shift-by-1 inline (no hi3 libcall).
## 3. What is installed and where
All under `/home/scott/claude/llvm816/tools/`:
| Tool | Path | Notes |
|---|---|---|
| llvm-mos source | `tools/llvm-mos/` | shallow clone. Backend files are symlinked in from `src/`; patches applied on top. Reset cleanly via `scripts/updateLlvmMos.sh`. |
| llvm-mos build dir | `tools/llvm-mos-build/` | cmake-generated, ephemeral |
| llvm-mos-sdk | `tools/llvm-mos-sdk/` | prebuilt toolchain |
| MAME 0.264 | `/usr/games/mame` (apt) | supports `-console` (Lua) |
| Apple IIgs ROMs | `tools/mame/roms/apple2gs.zip`, `apple2gsr1.zip` | from archive.org |
| Calypsi 5.16 | `tools/calypsi/` | extracted .deb |
| ORCA/C source | `tools/orca-c/` | reference only |
`./setup.sh --verify-only` passed all checks as of the prior session.
## 4. Repo layout (current)
```
llvm816/ # git repo, branch main
├── LLVM_65816_DESIGN.md # tracked
├── SESSION_STATE.md # this file
├── setup.sh # tracked
├── scripts/ # tracked
│ ├── common.sh
│ ├── installDeps.sh installCalypsi.sh installOrcaC.sh
│ ├── installLlvmMos.sh # non-destructive (see §8)
│ ├── installMame.sh verify.sh
│ ├── applyBackend.sh # src/ + patches/ -> tools/llvm-mos/
│ └── updateLlvmMos.sh # reset clone, re-apply backend
├── src/ # authored files, tracked
│ ├── llvm/lib/Target/W65816/ # 41 files
│ │ ├── MCTargetDesc/ (10 files)
│ │ ├── TargetInfo/ (3 files)
│ │ └── (28 top-level files)
│ └── clang/lib/Basic/Targets/
│ ├── W65816.h
│ └── W65816.cpp
├── patches/ # unified diffs, tracked
│ ├── 0001-triple-add-w65816-arch.patch
│ ├── 0002-triple-cpp-add-w65816-cases.patch
│ ├── 0003-clang-basic-dispatch-w65816.patch
│ └── 0004-cmake-add-w65816-experimental.patch
├── tools/ # gitignored, ephemeral
└── .gitignore # excludes tools/, .cache/
```
## 5. Key architectural decisions
### 5.1 Separate target, not MOS subtarget feature
llvm-mos has `FeatureW65816` declared in `MOSDevices.td`, but its codegen
is unimplemented (issue #321). We are NOT extending MOS. Reasons:
- We cannot upstream an AI-assisted backend to llvm-mos anyway.
- Clean register model: `Acc8`/`Acc16`/`Idx8`/`Idx16` as separate classes.
- Independent evolution.
Recorded in design doc §2.5.
### 5.2 Symlinks + patches, not a fork
`applyBackend.sh` symlinks every file under `src/` into the corresponding
path under `tools/llvm-mos/`, then applies each `patches/*.patch` with
`git apply`. Idempotent: skips already-current symlinks and
already-applied patches (detected via `git apply --reverse --check`).
`updateLlvmMos.sh` is the ONLY script allowed to destructively reset the
clone. It reverses all patches, removes our symlinks, `git reset --hard
FETCH_HEAD`, then re-runs `applyBackend.sh`.
`installLlvmMos.sh` refuses to touch the clone if it is dirty or off
main — this is deliberate to protect applied patches.
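The idempotence check described above (treat a patch as already applied when `git apply --reverse --check` succeeds) can be sketched as follows. This is a Python model of the decision logic, not the contents of `applyBackend.sh`; `git_check` and `classify_patch` are hypothetical names, and the `check` parameter exists only so the logic can be exercised without a real clone.

```python
import subprocess


def git_check(repo, patch, reverse):
    """Return True if `git apply --check` (optionally --reverse) succeeds."""
    cmd = ["git", "-C", repo, "apply", "--check"]
    if reverse:
        cmd.append("--reverse")
    cmd.append(patch)
    return subprocess.run(cmd, capture_output=True).returncode == 0


def classify_patch(repo, patch, check=git_check):
    """Decide what an idempotent apply step should do with one patch."""
    if check(repo, patch, True):
        return "already-applied"   # reverse applies cleanly, so skip it
    if check(repo, patch, False):
        return "apply"             # forward applies cleanly, so apply now
    return "conflict"              # neither direction is clean: stop and report
```

Checking the reverse direction first is what makes a re-run a no-op: a patch that is already in the tree never reaches the forward check.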
## 6. Concrete next actions (in order)
### 6.1 Function arguments
`LowerFormalArguments` and `LowerCall` still fatal-error. Without
arguments, every function we test has to use globals as inputs. The
plan: pass i8/i16 args via the stack (push right-to-left, caller
cleans), with the first 1-2 args optionally going in A or X for
register-passing. Calypsi output is the reference for ABI choices.
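A small stack model helps when reasoning about the offsets this convention produces. The sketch below assumes i16 arguments, arg 0 in A, the remaining args pushed right-to-left, and a 3-byte JSL return address, which is consistent with the `adc 4,s` / `adc 6,s` offsets quoted for `sum3` under Step 3d. The class, the function names and the return address value are hypothetical.

```python
class Stack:
    """Tiny model of the 65816 stack: S points at the next free byte."""

    def __init__(self, sp=0x01FF):
        self.mem = {}
        self.sp = sp

    def push8(self, value):
        self.mem[self.sp] = value & 0xFF
        self.sp -= 1

    def push16(self, value):
        """PHA with a 16-bit accumulator: high byte is pushed first."""
        self.push8(value >> 8)
        self.push8(value)

    def read16(self, disp):
        """LDA disp,s: little-endian 16-bit read at S + disp."""
        return self.mem[self.sp + disp] | (self.mem[self.sp + disp + 1] << 8)


def jsl(stack, args, ret_addr=0x012345):
    """Caller side: push args 1..N-1 right-to-left, then JSL."""
    for arg in reversed(args[1:]):
        stack.push16(arg)
    stack.push8(ret_addr >> 16)    # JSL pushes the bank byte...
    stack.push16(ret_addr)         # ...then the 16-bit return PC
    return args[0]                 # arg 0 travels in A


def sum3(a, stack):
    """Mirrors `clc; adc 4,s; clc; adc 6,s; rtl`."""
    return (a + stack.read16(4) + stack.read16(6)) & 0xFFFF


def rtl_and_unwind(stack, nargs):
    stack.sp += 3                  # RTL pops the 24-bit return address
    stack.sp += 2 * (nargs - 1)    # caller cleans: tsc; clc; adc #N; tcs
```

Calling `jsl(stack, [10, 20, 30])` leaves 20 at `4,s` and 30 at `6,s`, and `rtl_and_unwind(stack, 3)` restores the stack pointer to its starting value.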
### 6.2 i8 codegen
Currently every function gets `REP #$30` (16-bit mode). For i8 ops
we need either:
- A scan-and-prepend approach: if the function has any i8 op, emit
`SEP #$20` after the REP for whichever mode dominates, plus
toggle pseudos around the off-mode regions.
- Or commit to widening all i8 to i16 pre-ISel (simpler, but uses 2x
the cycles for byte-heavy code).
This is the natural lead-in to the REP/SEP scheduling pass (§6.4).
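A minimal sketch of the per-function variant of the scan-and-prepend idea, following the behaviour recorded under Step 3g: any i8 use selects `REP #$10; SEP #$20`, otherwise `REP #$30`, and i8 immediates are masked to 8 bits when printed. The function names and the set-of-type-names input are assumptions made for the example.

```python
def pick_prologue(types_used):
    """Choose the mode-setting prologue from the IR types a function uses."""
    if "i8" in types_used:
        # 16-bit index registers (clear X), 8-bit accumulator (set M).
        return ["rep #$10", "sep #$20"]
    # Clear both M and X: 16-bit accumulator and index registers.
    return ["rep #$30"]


def print_imm(value, bits):
    """Format an immediate after masking it to the operand width."""
    mask = (1 << bits) - 1
    return f"{value & mask:#x}"
```

With this model `print_imm(-16, 8)` gives `0xf0`, whereas printing the same value at 16 bits gives `0xfff0`, which is the discrepancy the AsmPrinter masking avoids.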
### 6.3 Frame indices, stack locals
Add `eliminateFrameIndex` and frame-pointer pseudos so we can spill
to the stack. Today `W65816RegisterInfo::eliminateFrameIndex` is
`llvm_unreachable`. Stack accesses on 65816 are `,s` and `(,s),y`
indirect — needs new operand classes.
### 6.4 REP/SEP scheduling pass
The core algorithmic work. TSFlag bits on every mode-dependent
instruction are already in place; the pass walks MIR, dataflows the
required mode per region, and inserts/removes REP/SEP transitions to
minimise total mode switches. Design doc §3.3.
### 6.2 Wire frame lowering + calling convention (real)
`W65816FrameLowering.cpp` is still `llvm_unreachable`. The simplest
working version: establish a frame based on the native 16-bit SP, with
locals accessed through stack-relative indirect indexed-by-Y addressing
(`(d,s),y`). Calypsi output for a trivial function is a good model.
`W65816CallingConv.td` covers i8/i16 return in A but nothing for
arguments. Start with stack-based (push right-to-left, caller cleans)
per design doc §3.5.
### 6.3 Disassembler mode-aware decoding (deferred)
The scaffold disassembler always decodes LDA/LDX/LDY/ADC/SBC/CMP/AND/
ORA/EOR/BIT/CPX/CPY immediate forms as 3-byte 16-bit-immediate
variants. A real decoder should track the M/X bits across the stream
(consuming REP/SEP, XCE transitions) and choose between
`DecoderTableW65816` (default) and `DecoderTableW65816{MHigh,XHigh}16`
per instruction. Naturally pairs with the REP/SEP codegen pass since
both need the same M/X tracking model.
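The M/X tracking described above can be prototyped in a few lines. The sketch below handles only a handful of opcodes (LDA/LDX/LDY immediate, REP, SEP, NOP, RTL, using their standard 65816 encodings) and ignores XCE, so it is a model of the idea rather than a decoder design. The table names and `disassemble` are hypothetical.

```python
M_FLAG, X_FLAG = 0x20, 0x10

# opcode -> (mnemonic, P bit that selects the immediate width)
IMM_OPS = {0xA9: ("lda", M_FLAG), 0xA2: ("ldx", X_FLAG), 0xA0: ("ldy", X_FLAG)}
FIXED_OPS = {0xEA: "nop", 0x6B: "rtl"}


def disassemble(code, p=0x00):
    """Decode bytes while tracking the M/X bits of P across REP/SEP.

    p=0 means M and X are clear, i.e. 16-bit accumulator and index.
    """
    out, i = [], 0
    while i < len(code):
        op = code[i]
        if op in FIXED_OPS:
            out.append(FIXED_OPS[op])
            i += 1
        elif op == 0xC2:                       # REP clears the named P bits
            p &= ~code[i + 1] & 0xFF
            out.append(f"rep #{code[i + 1]:#04x}")
            i += 2
        elif op == 0xE2:                       # SEP sets the named P bits
            p |= code[i + 1]
            out.append(f"sep #{code[i + 1]:#04x}")
            i += 2
        elif op in IMM_OPS:
            name, flag = IMM_OPS[op]
            size = 1 if p & flag else 2        # flag set means 8-bit immediate
            imm = int.from_bytes(code[i + 1:i + 1 + size], "little")
            out.append(f"{name} #{imm:#x}")
            i += 1 + size
        else:
            raise ValueError(f"opcode {op:#04x} not modelled")
    return out
```

Starting in 16-bit mode, the byte sequence `ea a9 34 12 6b` decodes as `nop`, `lda #0x1234`, `rtl`; prefixing an `e2 20` (SEP #$20) makes the following `a9` consume a single immediate byte instead.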
### 6.4 REP/SEP scheduling pass
The core algorithmic work (design doc §3.3). Every real instruction
now carries TSFlag bits indicating which M/X mode it requires. The
pass reads those, does the width-inference / coalescing / transition
insertion dataflow, and emits REP/SEP instructions at block
boundaries. Plan to spend multiple sessions.
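As a starting point for thinking about transition insertion, here is a deliberately simple single-block sketch: it tracks only the M bit, walks a straight-line instruction list, and inserts `rep #$20` / `sep #$20` lazily, so mode-independent instructions never force a switch. The real pass described above needs cross-block dataflow and X-bit handling; the tuple format and `insert_mode_switches` are hypothetical.

```python
def insert_mode_switches(insns, entry_wide=True):
    """Insert M-bit transitions into a straight-line instruction list.

    insns is a list of (text, need) pairs where need is "wide" (16-bit A),
    "narrow" (8-bit A), or None for mode-independent instructions.
    """
    out = []
    wide = entry_wide
    for text, need in insns:
        if need == "wide" and not wide:
            out.append("rep #$20")   # clear M: 16-bit accumulator
            wide = True
        elif need == "narrow" and wide:
            out.append("sep #$20")   # set M: 8-bit accumulator
            wide = False
        out.append(text)
    return out
```

Because switches are emitted only when a mode-dependent instruction actually needs a different width, a run such as narrow, independent, narrow produces a single `sep #$20` at its start rather than one transition per instruction.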
### 6.5 Tidy-ups (can happen in any order)
- Decide ELF `EM_` value (§7 item 1). Currently `EM_NONE`, with
placeholder relocation numbers 1-5 in `W65816ELFObjectWriter`.
Swap for canonical `R_W65816_*` names once chosen.
- Replace ASCII-art mnemonics (`inc a`, `dec a`, `asl a`, etc.) with
proper InstAliases so both `INA` and `INC A` assemble to the same
opcode. Requires the AsmParser (landed in Step 2e).
## 7. Open design questions flagged by the scaffold
1. **ELF `EM_` machine number.** `W65816ELFObjectWriter.cpp` uses
`ELF::EM_NONE` as a placeholder. llvm-mos uses `EM_MOS = 0x1966` for
the 6502 family. Decide: share `EM_MOS`, or pick a new value?
2. **Data layout string** is hardcoded in
`W65816TargetMachine.cpp` rather than routing through
`Triple::computeDataLayout()`. That is OK for now — when we're
ready to consolidate, add a case in `TargetDataLayout.cpp` and
switch to `TT.computeDataLayout()`.
3. **i32 return convention** — does i32 return in A:X or via a hidden
pointer? Currently `W65816CallingConv.td` only handles i8/i16. Design
doc §3.5 says "A:X for 32-bit" but this isn't modelled yet.
4. **Register aliasing for mode-dependent widths.** `Acc8` and `Acc16`
both contain physical register `A`. LLVM's allocator will not cope
with this correctly. The REP/SEP management pass (§3.3) is required.
Flagged per the design doc.
5. **Open questions from design doc §8** (GS/OS DP reservation, bank
memory model, interrupt ABI, ORCA/C ABI compat, width-contract
attribute, MAME cycle accuracy) — still unresolved. Punt until after
we have a working instruction set.
## 8. Gotchas + hard-won knowledge
- **`installLlvmMos.sh` is non-destructive now.** It refuses to reset
the clone if it is dirty or off main. Use `scripts/updateLlvmMos.sh`
to refresh (the only script allowed to reset).
- **MAME `-console` flag** is listed by `-showusage`, NOT `-help`.
- **`log()` in `common.sh` writes to stderr.** Don't change it.
- **llvm-mos has `FeatureW65816` but not working codegen** (issue #321).
- **`RemapAllTargetPseudoPointerOperands<PtrRegs>` is required** in
`W65816.td` or tablegen fails with 8 "missing target override for
pseudoinstruction using PointerLikeRegClass" errors. Don't remove it.
- **`Triple::w65816` placement in Triple.h:** inserted right after
`mos,` to keep the 65xx family clustered. See patch 0001.
- **Added to `LLVM_ALL_EXPERIMENTAL_TARGETS`** in `llvm/CMakeLists.txt`
so `-DLLVM_EXPERIMENTAL_TARGETS_TO_BUILD=all` picks up W65816. Not
strictly required — passing the name explicitly also works.
- **Operand `OperandType` field wants LLVM's enum spelling**, not
shortened. Use `OPERAND_IMMEDIATE`, `OPERAND_MEMORY`, `OPERAND_PCREL`
(see `llvm/include/llvm/MC/MCInstrDesc.h` MCOI::OperandType).
`OPERAND_IMM` / `OPERAND_IMM8` / `OPERAND_IMM16` are NOT valid.
- **`PrintMethod` signature for PC-rel operands takes `Address`.**
Tablegen generates `printPCRel8(MI, Address, OpNo, O)` — 4 args, not
3. Non-PC-rel PrintMethods use the 3-arg form `(MI, OpNo, O)`.
- **Several `.cpp` files needed explicit `#include`s beyond what MSP430
ships with** because the tablegen-generated `.inc` references full
types: `W65816RegisterInfo.cpp` needs `W65816Subtarget.h` and
`W65816FrameLowering.h` (for `GET_REGINFO_TARGET_DESC`);
`W65816InstPrinter.cpp` needs `llvm/MC/MCAsmInfo.h` (for
`MAI.printExpr`).
- **Marker classes can't override mayLoad/mayStore via `let`.**
TableGen's multi-inheritance doesn't let unrelated sibling classes
touch fields from the base `Instruction`. Use `let isReturn = 1, ...
in { ... }` blocks at def sites instead (idiomatic LLVM style).
- **Data layout is hardcoded** in `W65816TargetMachine.cpp` rather than
computed from `TT.computeDataLayout()`, because `TargetDataLayout.cpp`
doesn't have a case for `w65816` yet. This produces one `-Wswitch`
warning in the llvm-mos build. §6.5 notes adding a 5th patch to
silence it.
## 9. Disk space recovery
If space is tight before resume:
```
# safe to delete — regenerable from setup.sh + applyBackend.sh:
rm -rf /home/scott/claude/llvm816/tools/
rm -rf /home/scott/claude/llvm816/.cache/
```
Regenerate with `./setup.sh` then `./scripts/applyBackend.sh`.
The `tools/llvm-mos-build/` directory alone is ~2 GB after a full
configure+tablegen. A full ninja build will be much more.
## 10. Quick verification commands for resume
```
# Verify the scaffold is in place:
ls src/llvm/lib/Target/W65816/ | wc -l # expect ~20 top-level files
ls patches/ # expect 4 .patch files
# Verify apply is clean:
./scripts/applyBackend.sh # expect 0 new, 44 current symlinks; 0 new, 4 applied patches
# Verify cmake configures:
cmake -S tools/llvm-mos/llvm -B tools/llvm-mos-build -G Ninja \
-DCMAKE_BUILD_TYPE=Release \
-DLLVM_TARGETS_TO_BUILD="" \
-DLLVM_EXPERIMENTAL_TARGETS_TO_BUILD="MOS;W65816" \
-DLLVM_ENABLE_PROJECTS="clang" \
-DLLVM_INCLUDE_TESTS=OFF -DLLVM_INCLUDE_EXAMPLES=OFF \
-DLLVM_INCLUDE_BENCHMARKS=OFF
# Verify full build + llc registration (slow first time, cached after):
( cd tools/llvm-mos-build && ninja LLVMW65816Info LLVMW65816Desc LLVMW65816CodeGen llc )
./tools/llvm-mos-build/bin/llc --version | grep w65816
./tools/llvm-mos-build/bin/llc -march=w65816 -filetype=null /dev/null ; echo $?
# Expect: grep matches; llc exits 0.
```
## 11. Files changed this session (not yet committed by user)
```
scripts/applyBackend.sh # idempotent src+patches apply
scripts/updateLlvmMos.sh # safe reset+reapply
scripts/installLlvmMos.sh # no longer destructively resets
scripts/smokeTest.sh # regression smoke test
src/llvm/lib/Target/W65816/ # full MC layer + first codegen:
# CodeGen scaffolds (~40 files)
# AsmParser/ (2 files)
# Disassembler/ (2 files)
# MCTargetDesc/ (11 files)
# TargetInfo/ (3 files)
# ~90 real instruction defs
# ~25 codegen pseudos +
# AsmPrinter expansion
src/clang/lib/Basic/Targets/W65816.{h,cpp}
patches/0001..0005.patch # upstream llvm-mos mods
SESSION_STATE.md # this file
```
The tools/ tree is all ephemeral (gitignored).
### What now works end-to-end
Try it yourself:
```
./scripts/cDemo.sh # built-in demo
./scripts/cDemo.sh path/to/your.c
```
Sample output for the built-in demo (real C → real 65816):
```
get_counter: lda counter ; rtl
set_counter: sta counter ; rtl
sum_with_target: clc ; adc target ; rtl
doubler: asl a ; rtl
half: lsr a ; rtl
reset: lda #0 ; sta counter ; rtl
answer: lda #42 ; rtl
```
### Detail: command-line invocations
```
# Round-trip asm -> bytes -> asm:
echo ' lda #0x1234' | ./bin/llvm-mc -arch=w65816 -show-encoding
# -> lda #0x1234 ; encoding: [0xa9,0x34,0x12]
echo '0xea 0xa9 0x34 0x12 0x6b' | ./bin/llvm-mc --disassemble --triple=w65816
# -> nop ; lda #0x1234 ; rtl
# Full asm -> ELF -> disasm:
./bin/llvm-mc -arch=w65816 -filetype=obj foo.s -o foo.o
./bin/llvm-objdump --triple=w65816 -d foo.o
# Real codegen. This .ll compiles cleanly:
@x = global i16 0
@y = global i16 0
define i16 @fib_step() {
%a = load i16, ptr @x
%b = load i16, ptr @y
%s = add i16 %a, %b
store i16 %a, ptr @y
store i16 %s, ptr @x
ret i16 %s
}
# llc emits idiomatic 65816:
# rep #0x30
# lda x; clc; adc y ; A = a + b
# sta x ; x = a + b
# ...
```
### What doesn't work yet
- **Multi-arg calls** (caller side). Callee accepts stack-passed
args; the matching push side is unimplemented. Functions with
more than one arg can be defined and compile correctly, but
cannot be called from another function.
- **Two-Acc16 cmp.** Loops with PHIs that need to compare two
computed values fail at ISel — only one A.
- **i8 ops** (always 16-bit mode for now).
- **Signed overflow** in CMP-based branches: BMI/BPL test the N flag
of the subtraction, which is incorrect when the subtract overflows.
- **`mul var, var`** (or by non-power-of-2 constants). Needs library
functions (`__mulhi3` etc.).
- **`sub imm, var`** (only `sub var, imm` works).
See §6.1-§6.4 for the next steps.