Initial check in. Lots of work yet to do.
This commit is contained in:
parent
873eab4922
commit
55c1ae1c3e
24 changed files with 3776 additions and 124 deletions
4
.gitignore
vendored
4
.gitignore
vendored
|
|
@ -5,6 +5,10 @@ tools/
|
|||
# Claude Code tool state
|
||||
.claude/
|
||||
|
||||
# Runtime build artifacts: regenerable via runtime/build.sh from
|
||||
# runtime/src/*.s. The source files (.s, build.sh) are tracked.
|
||||
runtime/*.o
|
||||
|
||||
# Editor / OS
|
||||
*.swp
|
||||
*.swo
|
||||
|
|
|
|||
146
SESSION_STATE.md
146
SESSION_STATE.md
|
|
@ -219,6 +219,152 @@ Design doc section 7 lists a 12-step implementation order. We are at:
|
|||
scheduling pass.** The prologue `REP #$30` is unconditional;
|
||||
the REP/SEP pass will remove it when redundant.
|
||||
|
||||
### Where we actually got to (current state, 2026-04-27)
|
||||
|
||||
The "open codegen gaps" list above is mostly resolved. Status of the
|
||||
seven sub-items at line 192:
|
||||
|
||||
1. **Multi-arg call lowering (caller side)** — done. `LowerCall`
|
||||
pushes args 1..N-1 right-to-left via `W65816ISD::PUSH`,
|
||||
`ADJCALLSTACKUP` unwinds with `tsc;clc;adc #N;tcs`.
|
||||
2. **Frame-reserved scratch space** — done. `emitPrologue` /
|
||||
`emitEpilogue` use `tsc;sec;sbc #N;tcs` and the inverse.
|
||||
3. **Mixed-mode i8/i16** — partial. Per-function mode based on IR
|
||||
scan; full REP/SEP scheduling pass still TODO (Step 4).
|
||||
4. **Signed `(a - b)` overflow in compares** — handled for i8/i16
|
||||
via the signed-CC promote-to-i16 path. Still has the BMI/BPL
|
||||
correctness caveat at INT16_MIN/MAX boundaries.
|
||||
5. **`mul var, var` and friends** — done via libcalls; runtime stubs
|
||||
live in `runtime/src/libgcc.s` (__mulhi3, __mulsi3, __ashlhi3,
|
||||
__ashrhi3, __lshrhi3, __ashlsi3, __ashrsi3, __lshrsi3, __udivhi3,
|
||||
__divhi3, __umodhi3, __modhi3, __udivsi3, __divsi3, __umodsi3,
|
||||
__modsi3).
|
||||
6. **SETCC and SELECT_CC i16** — done via custom inserter and the
|
||||
`W65816cmp + W65816selectcc` SDNode pair.
|
||||
7. **Library functions** — done; see #5 above.
|
||||
|
||||
### i32 (long) support — landed (2026-04-26..28)
|
||||
|
||||
- Type legalization splits i32 into two i16 halves.
|
||||
- ABI: i32 first-arg lives in A:X (lo:hi), matching the return
|
||||
ABI; subsequent i32 args go on stack 2 bytes per half.
|
||||
`RetCC_W65816` assigns `[A, X]` for two i16 returns so
|
||||
`__mulsi3` / `__divsi3` libcall returns work.
|
||||
- ADD/SUB use the native ADC carry chain via ISD::ADDC/ADDE/SUBC/
|
||||
SUBE Legal: `ADCi16imm` etc. mark `Defs = [P]` and pattern-match
|
||||
`addc`; new `ADCEi16imm` / `ADCEabs` / `ADCEfi` (and SBC/E
|
||||
variants) mark `Uses = [P], Defs = [P]` for `adde`/`sube`.
|
||||
`ADDE_RR` / `SUBE_RR` have the inserter equivalent for two-Acc16
|
||||
chains (e.g. fib32's loop). Net: an i32 add went from ~25 insns
|
||||
(manual UADDO + SETCC + add-of-bool) to ~17 incl. prologue/epilogue,
|
||||
with the core 8 being the optimal `clc;adc;sta;lda;adc;tax;lda;rtl`.
|
||||
- NEGC16 / NEGE16 lower `(subc/sube 0, x)` for i32 negate via the
|
||||
ADD chain (`EOR #$FFFF; CLC; ADC #1` lo, `EOR #$FFFF; ADC #0` hi).
|
||||
- MUL/DIV/MOD/SHL/SHR/USHR all libcalled; preferredShiftLegalization
|
||||
Strategy returns `LowerToLibcall` for i32 to keep LLVM from emitting
|
||||
SHL_PARTS we'd have no pattern for.
|
||||
- `BuildSDIVPow2` / `BuildSREMPow2` overrides return SDValue() to
|
||||
block the magic-constant pow2 expansion that emits unsupported
|
||||
BUILD_VECTOR.
|
||||
|
||||
### Other recent work
|
||||
|
||||
- `i1` `sext_inreg` lowered as `(sub 0, (and x, 1))`.
|
||||
- `i8` `sext_inreg` and `sextload-i8` go through the existing
|
||||
branchless `((x & 0xFF) ^ 0x80) - 0x80` sequence (SEXTLOAD i8 set
|
||||
to Expand, sext_inreg pattern added).
|
||||
- `extloadi8` from an `Acc16` register pointer maps to `LDAptr` (16-
|
||||
bit load; consumer ignores high byte).
|
||||
- Bare `ISD::FrameIndex` selected as `ADDframe (FI, 0)` for
|
||||
alloca'd-array address-of; `eliminateFrameIndex` expands ADDframe
|
||||
into `tsc;clc;adc #disp` (LEA equivalent).
|
||||
- **Indirect calls** (function pointers): `LowerCall` redirects
|
||||
through `__jsl_indir` in `runtime/src/libgcc.s` — caller stores
|
||||
the dynamic target to global `__indirTarget` then JSLs the
|
||||
trampoline, which does `JMP (__indirTarget)`. Target's RTL pops
|
||||
the original JSL frame and returns directly to the caller.
|
||||
Single-bank only (JMP indirect is bank-local).
|
||||
- **Code-quality cleanup pass** (`W65816StackSlotCleanup`,
|
||||
addPostRegAlloc):
|
||||
- Removes redundant `LDAfi slot` after `STAfi reg, slot` when the
|
||||
LDA's destination matches and nothing in between clobbers
|
||||
either reg or slot. Catches the regalloc spill+reload cycle
|
||||
around COPY $a → vreg.
|
||||
- Removes dead `STAfi reg, slot` when a subsequent `STAfi`
|
||||
overwrites the same slot before any read, OR when the function
|
||||
returns without reading the slot (catches result-spill-before-
|
||||
return that the libcall return ABI makes redundant).
|
||||
- Combined with `isReMaterializable` on LDAfi from fixed FIs, the
|
||||
i32 add went from 17 → 11 instructions.
|
||||
- **i32 shift-by-1 inline** (task #59). The type-legalizer's
|
||||
SHL_PARTS / SRL_PARTS expansion of `i32 << 1` / `>> 1` emits a
|
||||
`(srl x, 15)` or `(shl x, 15)` for the carry-cross-halves slot.
|
||||
Previously routed through __lshrhi3 / __ashlhi3 libcalls. Added
|
||||
SRL15A pseudo (`ASL A; LDA #0; ROL A`, 3 bytes) and SHL15A
|
||||
(`LSR A; LDA #0; ROR A`). i32 shl-by-1 went 33 → 26 insns;
|
||||
shr-by-1 29 → 23.
|
||||
- **i16 shift-by-8 inline** (task #60). Same idea for `(srl x, 8)`
|
||||
and `(shl x, 8)` — used by i32 shift-by-8 type-legalization.
|
||||
XBA swaps the two bytes of A in 16-bit M; AND clears the half
|
||||
we don't want. 4 bytes per shift. i32 shl/shr-by-8 went
|
||||
39/35 → 27/24 insns.
|
||||
- **PUSH16X for direct X-push** (task #61). When LowerCall sees
|
||||
an outgoing arg whose SDValue is `CopyFromReg` of a vreg that's
|
||||
live-in from $x (i.e. the i32-first-arg-in-A:X hi half), emit
|
||||
`phx` directly instead of `txa; pha` (which also requires
|
||||
spilling $a to preserve it). mul32 went 19 → 13 insns.
|
||||
- **Dead frame-slot trimming** (task #62). Extended W65816Stack
|
||||
SlotCleanup to scan MIR for unreferenced (post-cleanup) local
|
||||
frame indices and zero-size them so PrologueEpilogue trims the
|
||||
prologue PHA/TSC reservation. Combined with the spill cleanup,
|
||||
shrinks frames in many functions by 2-4 bytes (one fewer
|
||||
PHA + PLY pair).
|
||||
- **i32 first-arg in A:X (task #50)**. When the first original
|
||||
argument is i32 (LowerFormalArguments / LowerCall detect via
|
||||
`Outs[0..1].OrigArgIndex == 0` on i16 halves), pass it lo:hi in
|
||||
A:X — matching the i32 return ABI. Saves one stack slot per
|
||||
i32 arg. Required updating libgcc.s helpers (`__mulsi3`,
|
||||
`__udivsi3`, `__umodsi3`, `__divsi3`, `__modsi3`, `__ashlsi3`,
|
||||
`__lshrsi3`, `__ashrsi3`, `__divmodsi_setup`) to read arg0_hi
|
||||
from X (and shifted arg1 offsets).
|
||||
- **Implicit Defs/Uses on stack-rel MC instructions**: was a
|
||||
pre-existing latent bug — `eliminateFrameIndex` strips the
|
||||
implicit A/P def/use info when it converts ADCfi/STAfi/etc. to
|
||||
the MC form (ADC d,S, STA d,S etc.). Machine Copy Propagation
|
||||
then sees stale dataflow and elides necessary TAX/TXA copies.
|
||||
Fixed by re-attaching `RegState::Implicit` operands on each
|
||||
expanded MC instruction in W65816RegisterInfo::eliminateFrame
|
||||
Index. Without this, the i32-A:X ABI miscompiles return values
|
||||
(TAX gets elided, X retains arg0_hi instead of result_hi).
|
||||
The fix also benefits the existing single-A path; before it,
|
||||
certain Machine Copy Propagation choices were unsafe but
|
||||
happened not to trigger. Now they're also safe.
|
||||
|
||||
### Currently still pending
|
||||
|
||||
- **REP/SEP scheduling pass** (Step 4) — per-function mode only;
|
||||
mixed-mode functions don't work.
|
||||
- **Vararg functions** — `LowerFormalArguments` reports a fatal
|
||||
error.
|
||||
- **i32 comparison** — uses SETCC+ADD-of-bool instead of a CMP+SBC
|
||||
chain (analogous to the ADC chain we landed for add/sub).
|
||||
- **Regalloc** (#56) — heapify-style functions with 4+ live i16
|
||||
values run out of A.
|
||||
|
||||
### Smoke-test coverage (31 checks as of 2026-04-28)
|
||||
|
||||
`scripts/smokeTest.sh` covers: target registration, llvm-mc encode/
|
||||
disassemble, end-to-end IR→ELF, multi-pattern function, single-arg
|
||||
call, 3-arg stack reads, pure-i8 SEP prologue, multi-branch SETCC,
|
||||
SELECT_CC, two-Acc16 spill, libcall emission (__mulhi3/__ashlhi3),
|
||||
pointer load/store, runtime/build.sh, real-world program,
|
||||
libcall-symbol coverage, signed/eq i8 compare, -O2 tiny C, i32 add
|
||||
end-to-end, i32 carry-chain shape (1 clc + 2 adc + 0 bcc), i32
|
||||
A:X first-arg ABI (1 txa), 32-bit fib loop (ADDE_RR inserter),
|
||||
__mulsi3 libcall, alloca'd-array LEA, signed-byte strcmp
|
||||
(sextload + sext_inreg + extload-via-ptr), indirect call via
|
||||
__jsl_indir trampoline, i32 shift-by-1 inline (no hi3 libcall).
|
||||
|
||||
## 3. What is installed and where
|
||||
|
||||
All under `/home/scott/claude/llvm816/tools/`:
|
||||
|
|
|
|||
20
patches/0006-runtime-libcalls-w65816.patch
Normal file
20
patches/0006-runtime-libcalls-w65816.patch
Normal file
|
|
@ -0,0 +1,20 @@
|
|||
diff --git a/llvm/include/llvm/IR/RuntimeLibcalls.td b/llvm/include/llvm/IR/RuntimeLibcalls.td
|
||||
index 0000000..0000000 100644
|
||||
--- a/llvm/include/llvm/IR/RuntimeLibcalls.td
|
||||
+++ b/llvm/include/llvm/IR/RuntimeLibcalls.td
|
||||
@@ -3620,6 +3620,15 @@ def MOSSystemLibrary
|
||||
__memset,
|
||||
abort)>;
|
||||
|
||||
+// W65816 (WDC 65816) - integer libcalls only. Multiply, divide, modulo
|
||||
+// and shifts go through the standard compiler-rt names (__mulhi3,
|
||||
+// __divhi3 etc.). No floating point yet.
|
||||
+def isW65816 : RuntimeLibcallPredicate<"TT.getArch() == Triple::w65816">;
|
||||
+
|
||||
+def W65816SystemLibrary
|
||||
+ : SystemRuntimeLibrary<isW65816,
|
||||
+ (add DefaultRuntimeLibcallImpls)>;
|
||||
+
|
||||
//===----------------------------------------------------------------------===//
|
||||
// Legacy Default Runtime Libcalls
|
||||
//===----------------------------------------------------------------------===//
|
||||
18
runtime/build.sh
Executable file
18
runtime/build.sh
Executable file
|
|
@ -0,0 +1,18 @@
|
|||
#!/usr/bin/env bash
|
||||
# Assemble the W65816 runtime library to runtime/libgcc.o.
|
||||
# Run after editing runtime/src/*.s.
|
||||
|
||||
set -euo pipefail
|
||||
PROJECT_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
|
||||
LLVM_MC="$PROJECT_ROOT/tools/llvm-mos-build/bin/llvm-mc"
|
||||
|
||||
[ -x "$LLVM_MC" ] || {
|
||||
echo "llvm-mc not found at $LLVM_MC" >&2
|
||||
exit 1
|
||||
}
|
||||
|
||||
"$LLVM_MC" -arch=w65816 -filetype=obj \
|
||||
"$PROJECT_ROOT/runtime/src/libgcc.s" \
|
||||
-o "$PROJECT_ROOT/runtime/libgcc.o"
|
||||
|
||||
echo "built runtime/libgcc.o"
|
||||
640
runtime/src/libgcc.s
Normal file
640
runtime/src/libgcc.s
Normal file
|
|
@ -0,0 +1,640 @@
|
|||
; Minimal libgcc-equivalent runtime for the W65816 / Apple IIgs.
|
||||
; Provides the helpers that the LLVM backend lowers integer multiply,
|
||||
; shift, divide, and modulo operations to. Implementations are
|
||||
; correct-but-unoptimised; they exist to unblock end-to-end testing,
|
||||
; not to compete with hand-tuned 65816 math libraries.
|
||||
;
|
||||
; Calling convention (matches W65816ISelLowering::LowerCall):
|
||||
; - Arg 0 in A (16-bit M).
|
||||
; - Arg 1 pushed via PHA before the JSL. Reads as (4,S) inside the
|
||||
; callee (3-byte JSL return address sits at 1..3,S).
|
||||
; - Return value in A. Caller releases pushed args.
|
||||
; - Routines run in 16-bit M, 16-bit X (REP #$30 by convention).
|
||||
;
|
||||
; Direct-page scratch lives at DP+$E0..DP+$EF (16 bytes). Programs
|
||||
; that use this runtime must keep DP=0 or remap accordingly.
|
||||
;
|
||||
; Assembled with: tools/llvm-mos-build/bin/llvm-mc -arch=w65816 \
|
||||
; -filetype=obj
|
||||
; runtime/src/libgcc.s
|
||||
; -o runtime/libgcc.o
|
||||
|
||||
.text
|
||||
|
||||
; --------------------------------------------------------------------
|
||||
; Indirect-call trampoline. An indirect call (function pointer) stores
|
||||
; the target's 16-bit address to __indirTarget before JSL'ing here.
|
||||
; This routine does a JMP indirect through that variable: control
|
||||
; transfers to the target with the original caller's JSL frame still
|
||||
; on the stack, so target's RTL returns to the original caller (one
|
||||
; frame, no double-RTL).
|
||||
;
|
||||
; Caller emit sequence in W65816ISelLowering::LowerCall:
|
||||
; sta __indirTarget ; store ptr (must precede any A clobber for args)
|
||||
; ... arg pushes ...
|
||||
; jsl __jsl_indir
|
||||
;
|
||||
; Single-bank only (the IIgs convention assumes code in bank 0/1
|
||||
; via JSL — JMP indirect is bank-local).
|
||||
; --------------------------------------------------------------------
|
||||
.globl __indirTarget
|
||||
.bss
|
||||
__indirTarget:
|
||||
.zero 2
|
||||
|
||||
.text
|
||||
.globl __jsl_indir
|
||||
__jsl_indir:
|
||||
; Hand-encoded JMP (__indirTarget): 6C is "jmp (a)" — the assembler
|
||||
; doesn't yet parse the `(abs)` syntax, so emit the bytes directly
|
||||
; with a 16-bit relocation against the variable. Effective transfer:
|
||||
; PC <- mem[__indirTarget].
|
||||
.byte 0x6C
|
||||
.word __indirTarget
|
||||
|
||||
; --------------------------------------------------------------------
|
||||
; __mulhi3 — 16-bit multiply. A * (4,S) -> A.
|
||||
; Signed and unsigned share an implementation: only the low 16 bits of
|
||||
; the product are returned, which is identical for both. Uses
|
||||
; shift-and-add over the multiplier bits.
|
||||
; --------------------------------------------------------------------
|
||||
.globl __mulhi3
|
||||
__mulhi3:
|
||||
sta 0xe0 ; multiplier
|
||||
lda 0x4, s
|
||||
sta 0xe2 ; multiplicand
|
||||
lda #0x0
|
||||
sta 0xe4 ; running product
|
||||
.Lmul_loop:
|
||||
lda 0xe0
|
||||
beq .Lmul_done
|
||||
lsr a
|
||||
sta 0xe0
|
||||
bcc .Lmul_skip
|
||||
lda 0xe4
|
||||
clc
|
||||
adc 0xe2
|
||||
sta 0xe4
|
||||
.Lmul_skip:
|
||||
asl 0xe2
|
||||
bra .Lmul_loop
|
||||
.Lmul_done:
|
||||
lda 0xe4
|
||||
rtl
|
||||
|
||||
; --------------------------------------------------------------------
|
||||
; __ashlhi3 — A << (4,S) -> A. Shift count is i16 but only the low 4
|
||||
; bits are meaningful (counts >=16 are undefined behaviour in C).
|
||||
; --------------------------------------------------------------------
|
||||
.globl __ashlhi3
|
||||
__ashlhi3:
|
||||
pha ; save value on stack so we can free A
|
||||
lda 0x6, s ; arg 1 sits at 6,s now (PHA shifted by 2)
|
||||
tax
|
||||
pla ; restore value
|
||||
.Lashl_loop:
|
||||
cpx #0x0
|
||||
beq .Lashl_done
|
||||
asl a
|
||||
dex
|
||||
bra .Lashl_loop
|
||||
.Lashl_done:
|
||||
rtl
|
||||
|
||||
; --------------------------------------------------------------------
|
||||
; __lshrhi3 — A logical >> (4,S) -> A. Same shape as __ashlhi3 with
|
||||
; LSR instead of ASL.
|
||||
; --------------------------------------------------------------------
|
||||
.globl __lshrhi3
|
||||
__lshrhi3:
|
||||
pha
|
||||
lda 0x6, s
|
||||
tax
|
||||
pla
|
||||
.Llshr_loop:
|
||||
cpx #0x0
|
||||
beq .Llshr_done
|
||||
lsr a
|
||||
dex
|
||||
bra .Llshr_loop
|
||||
.Llshr_done:
|
||||
rtl
|
||||
|
||||
; --------------------------------------------------------------------
|
||||
; __ashrhi3 — A arithmetic >> (4,S) -> A. Sign bit is preserved by
|
||||
; copying it into carry before each ROR via CMP #$8000 (which sets
|
||||
; carry exactly when the sign bit is set on a 16-bit unsigned compare).
|
||||
; --------------------------------------------------------------------
|
||||
.globl __ashrhi3
|
||||
__ashrhi3:
|
||||
pha
|
||||
lda 0x6, s
|
||||
tax
|
||||
pla
|
||||
.Lashr_loop:
|
||||
cpx #0x0
|
||||
beq .Lashr_done
|
||||
cmp #0x8000
|
||||
ror a
|
||||
dex
|
||||
bra .Lashr_loop
|
||||
.Lashr_done:
|
||||
rtl
|
||||
|
||||
; --------------------------------------------------------------------
|
||||
; __udivhi3 — A unsigned / (4,S) -> A.
|
||||
; Restoring shift-subtract division. Common helper; __umodhi3 reuses
|
||||
; the algorithm and returns the remainder instead.
|
||||
; Scratch: $e6 = numerator, $e8 = denominator,
|
||||
; $ea = quotient, $ec = remainder.
|
||||
; --------------------------------------------------------------------
|
||||
.globl __udivhi3
|
||||
__udivhi3:
|
||||
; Public entry: A=dividend, (4,S)=divisor. Set up scratch and
|
||||
; call the same JSR-based core used by signed divide.
|
||||
sta 0xe6
|
||||
lda 0x4, s
|
||||
sta 0xe8
|
||||
jsr __udivmod_core
|
||||
lda 0xea
|
||||
rtl
|
||||
|
||||
.globl __umodhi3
|
||||
__umodhi3:
|
||||
sta 0xe6
|
||||
lda 0x4, s
|
||||
sta 0xe8
|
||||
jsr __udivmod_core
|
||||
lda 0xec
|
||||
rtl
|
||||
|
||||
; --------------------------------------------------------------------
|
||||
; __divhi3 / __modhi3 — signed 16-bit divide and modulo. Strategy:
|
||||
; - Stash sign of dividend in $ee bit 0 (used by modulo).
|
||||
; - Stash result sign of quotient (sign(a) XOR sign(b)) in $ee bit 1
|
||||
; (used by divide).
|
||||
; - Take absolute values, run the unsigned core, then negate the
|
||||
; appropriate result if its sign bit is set.
|
||||
; C99: quotient truncates toward zero; remainder takes the sign of the
|
||||
; dividend.
|
||||
; --------------------------------------------------------------------
|
||||
.globl __divhi3
|
||||
__divhi3:
|
||||
jsr __divmod_setup
|
||||
jsr __udivmod_core
|
||||
; Quotient is in $ea. Negate if bit 1 of $ee is set.
|
||||
lda 0xea
|
||||
pha
|
||||
lda 0xee
|
||||
and #0x2
|
||||
beq .Ldiv_pos
|
||||
pla
|
||||
eor #0xffff
|
||||
clc
|
||||
adc #0x1
|
||||
rtl
|
||||
.Ldiv_pos:
|
||||
pla
|
||||
rtl
|
||||
|
||||
.globl __modhi3
|
||||
__modhi3:
|
||||
jsr __divmod_setup
|
||||
jsr __udivmod_core
|
||||
; Remainder is in $ec. Negate if bit 0 of $ee is set (dividend
|
||||
; was negative).
|
||||
lda 0xec
|
||||
pha
|
||||
lda 0xee
|
||||
and #0x1
|
||||
beq .Lmod_pos
|
||||
pla
|
||||
eor #0xffff
|
||||
clc
|
||||
adc #0x1
|
||||
rtl
|
||||
.Lmod_pos:
|
||||
pla
|
||||
rtl
|
||||
|
||||
; --------------------------------------------------------------------
|
||||
; __divmod_setup — common prologue for __divhi3/__modhi3. Reads
|
||||
; A=dividend and (4,S)=divisor (the public-entry stack frame is intact
|
||||
; because we used JSR not JSL, so (4,S) still points to the user's
|
||||
; pushed arg1 relative to the original JSL). Computes |a| -> $e6,
|
||||
; |b| -> $e8, and sign tracker -> $ee:
|
||||
; bit 0 = 1 if dividend was negative (modulo result sign)
|
||||
; bit 1 = 1 if dividend XOR divisor signs differ (quotient sign)
|
||||
; Uses JSR/RTS, same bank.
|
||||
; --------------------------------------------------------------------
|
||||
__divmod_setup:
|
||||
; Sign tracker. We don't have STZ in our instruction set yet, so
|
||||
; clear via PHA/LDA #0/STA/PLA to avoid trashing A.
|
||||
pha
|
||||
lda #0x0
|
||||
sta 0xee
|
||||
pla
|
||||
; Dividend sign + abs value.
|
||||
cmp #0x8000
|
||||
bcc .Lset_a_pos
|
||||
; Negative: set bits 0 and 1 (dividend sign, result sign so far).
|
||||
pha
|
||||
lda 0xee
|
||||
ora #0x3
|
||||
sta 0xee
|
||||
pla
|
||||
eor #0xffff
|
||||
clc
|
||||
adc #0x1
|
||||
.Lset_a_pos:
|
||||
sta 0xe6
|
||||
; Divisor sign + abs value. After our JSR (pushed 2 bytes of
|
||||
; near-return), the user's arg1 has shifted up by 2 from (4,S)
|
||||
; to (6,S).
|
||||
lda 0x6, s
|
||||
cmp #0x8000
|
||||
bcc .Lset_b_pos
|
||||
; Negative: flip bit 1 of $ee (XOR with sign of dividend).
|
||||
pha
|
||||
lda 0xee
|
||||
eor #0x2
|
||||
sta 0xee
|
||||
pla
|
||||
eor #0xffff
|
||||
clc
|
||||
adc #0x1
|
||||
.Lset_b_pos:
|
||||
sta 0xe8
|
||||
rts
|
||||
|
||||
; --------------------------------------------------------------------
|
||||
; __udivmod_core — internal restoring divide. Inputs at $e6/$e8,
|
||||
; outputs quotient at $ea, remainder at $ec. JSR/RTS local helper.
|
||||
; --------------------------------------------------------------------
|
||||
__udivmod_core:
|
||||
lda #0x0
|
||||
sta 0xea
|
||||
sta 0xec
|
||||
ldx #0x10
|
||||
.Lcore_loop:
|
||||
asl 0xe6
|
||||
rol 0xec
|
||||
asl 0xea
|
||||
lda 0xec
|
||||
cmp 0xe8
|
||||
bcc .Lcore_skip
|
||||
sec
|
||||
sbc 0xe8
|
||||
sta 0xec
|
||||
inc 0xea
|
||||
.Lcore_skip:
|
||||
dex
|
||||
bne .Lcore_loop
|
||||
rts
|
||||
|
||||
; ====================================================================
|
||||
; 32-bit (long / si) helpers.
|
||||
;
|
||||
; ABI for these is the natural extension of the i16 libcalls:
|
||||
; - arg0_lo in A
|
||||
; - arg0_hi at (4,s)
|
||||
; - arg1_lo at (6,s) (or shift count, for the shift helpers)
|
||||
; - arg1_hi at (8,s)
|
||||
; - return: result_lo in A, result_hi in X
|
||||
;
|
||||
; All are correct-but-unoptimised; goal is unblocking end-to-end builds,
|
||||
; not winning a 65816 codegolf.
|
||||
;
|
||||
; Direct-page scratch for these:
|
||||
; $e0..$e3 = a (lo, hi) [renamed from $e0/$e2 for the i16 ones]
|
||||
; $e4..$e7 = b (lo, hi)
|
||||
; $e8..$eb = result / quotient (lo, hi)
|
||||
; $ec..$ef = remainder (lo, hi)
|
||||
; ====================================================================
|
||||
|
||||
; --------------------------------------------------------------------
|
||||
; __mulsi3 — 32-bit multiply. Shift-and-add over 32 bits of the
|
||||
; multiplier. Result = (a * b) mod 2^32.
|
||||
;
|
||||
; ABI: A = a_lo, X = a_hi (the i32-first-arg in A:X convention),
|
||||
; (4,s) = b_lo, (6,s) = b_hi. Result returned in A:X (lo:hi).
|
||||
; --------------------------------------------------------------------
|
||||
.globl __mulsi3
|
||||
__mulsi3:
|
||||
; Stash a (multiplier) into $e0/$e2.
|
||||
sta 0xe0
|
||||
stx 0xe2
|
||||
; Stash b (multiplicand) into $e4/$e6.
|
||||
lda 0x4, s
|
||||
sta 0xe4
|
||||
lda 0x6, s
|
||||
sta 0xe6
|
||||
; Clear running product at $e8/$ea.
|
||||
lda #0x0
|
||||
sta 0xe8
|
||||
sta 0xea
|
||||
; Loop 32 times: examine LSB of multiplier, conditionally add
|
||||
; multiplicand to product, then shift multiplier right and
|
||||
; multiplicand left. Use Y as a 16-bit counter (X mode = 16).
|
||||
ldy #0x20
|
||||
.Lmulsi_loop:
|
||||
; Test bit 0 of multiplier (lo word).
|
||||
lda 0xe0
|
||||
lsr a
|
||||
sta 0xe0
|
||||
bcc .Lmulsi_noadd
|
||||
; Add multiplicand to product (32-bit).
|
||||
clc
|
||||
lda 0xe8
|
||||
adc 0xe4
|
||||
sta 0xe8
|
||||
lda 0xea
|
||||
adc 0xe6
|
||||
sta 0xea
|
||||
.Lmulsi_noadd:
|
||||
; Shift multiplier right (32-bit, hi-into-lo) — we already shifted
|
||||
; the lo half above, but the bit shifted out went to carry. We
|
||||
; need to also bring the lo bit of the hi half into bit 15 of lo,
|
||||
; and shift hi right. Simpler: do a full 32-bit shift right
|
||||
; before the LSR. Restructure:
|
||||
;
|
||||
; Shift multiplicand left (32-bit, carry chain).
|
||||
asl 0xe4
|
||||
rol 0xe6
|
||||
; Bring multiplier hi into multiplier lo's high bit. Multiplier
|
||||
; has been shifted lo>>1 already; we need to also put hi's lo bit
|
||||
; into lo's hi bit and shift hi right.
|
||||
lsr 0xe2
|
||||
bcc .Lmulsi_no_borrow
|
||||
; Carry from hi >> 1 needs to land in bit 15 of lo. ORA #$8000.
|
||||
lda 0xe0
|
||||
ora #0x8000
|
||||
sta 0xe0
|
||||
.Lmulsi_no_borrow:
|
||||
dey
|
||||
bne .Lmulsi_loop
|
||||
; Result is in $e8 (lo) / $ea (hi).
|
||||
ldx 0xea
|
||||
lda 0xe8
|
||||
rtl
|
||||
|
||||
; --------------------------------------------------------------------
|
||||
; __ashlsi3 — (A:X) << (4,s) -> A:X. Shift count is i16 in low byte;
|
||||
; counts >= 32 are UB in C. Uses a per-bit loop (cheap on 65816 — one
|
||||
; ASL + ROL per bit).
|
||||
;
|
||||
; ABI: A = a_lo, X = a_hi (i32-first-arg in A:X), (4,s) = count.
|
||||
; --------------------------------------------------------------------
|
||||
.globl __ashlsi3
|
||||
__ashlsi3:
|
||||
sta 0xe0 ; lo
|
||||
stx 0xe2 ; hi
|
||||
lda 0x4, s
|
||||
tay ; count -> Y
|
||||
.Lashlsi_loop:
|
||||
cpy #0x0
|
||||
beq .Lashlsi_done
|
||||
asl 0xe0
|
||||
rol 0xe2
|
||||
dey
|
||||
bra .Lashlsi_loop
|
||||
.Lashlsi_done:
|
||||
ldx 0xe2
|
||||
lda 0xe0
|
||||
rtl
|
||||
|
||||
; --------------------------------------------------------------------
|
||||
; __lshrsi3 — logical >> shift. LSR hi, ROR lo: hi gets a 0, lo gets
|
||||
; hi's old bit 0. Per-bit loop.
|
||||
; --------------------------------------------------------------------
|
||||
.globl __lshrsi3
|
||||
__lshrsi3:
|
||||
sta 0xe0
|
||||
stx 0xe2
|
||||
lda 0x4, s
|
||||
tay
|
||||
.Llshrsi_loop:
|
||||
cpy #0x0
|
||||
beq .Llshrsi_done
|
||||
lsr 0xe2
|
||||
ror 0xe0
|
||||
dey
|
||||
bra .Llshrsi_loop
|
||||
.Llshrsi_done:
|
||||
ldx 0xe2
|
||||
lda 0xe0
|
||||
rtl
|
||||
|
||||
; --------------------------------------------------------------------
|
||||
; __ashrsi3 — arithmetic >> shift. Sign bit must be preserved on each
|
||||
; iteration: copy bit 15 of hi into carry (via CMP #$8000), then ROR
|
||||
; hi, ROR lo. Per-bit loop.
|
||||
; --------------------------------------------------------------------
|
||||
.globl __ashrsi3
|
||||
__ashrsi3:
|
||||
sta 0xe0
|
||||
stx 0xe2
|
||||
lda 0x4, s
|
||||
tay
|
||||
.Lashrsi_loop:
|
||||
cpy #0x0
|
||||
beq .Lashrsi_done
|
||||
; CMP #$8000 sets C iff the unsigned value >= 0x8000, i.e. bit 15
|
||||
; is set — exactly the sign bit.
|
||||
lda 0xe2
|
||||
cmp #0x8000
|
||||
ror 0xe2
|
||||
ror 0xe0
|
||||
dey
|
||||
bra .Lashrsi_loop
|
||||
.Lashrsi_done:
|
||||
ldx 0xe2
|
||||
lda 0xe0
|
||||
rtl
|
||||
|
||||
; --------------------------------------------------------------------
|
||||
; __udivmodsi_core — internal 32-bit unsigned divide. Inputs in
|
||||
; $e0/$e2 (numerator) and $e4/$e6 (denominator); outputs quotient in
|
||||
; $e8/$ea and remainder in $ec/$ee. 32-iteration restoring divide.
|
||||
; JSR/RTS local helper.
|
||||
; --------------------------------------------------------------------
|
||||
__udivmodsi_core:
|
||||
lda #0x0
|
||||
sta 0xe8
|
||||
sta 0xea
|
||||
sta 0xec
|
||||
sta 0xee
|
||||
ldy #0x20
|
||||
.Lcoresi_loop:
|
||||
; Shift numerator left through remainder.
|
||||
asl 0xe0
|
||||
rol 0xe2
|
||||
rol 0xec
|
||||
rol 0xee
|
||||
; Shift quotient left.
|
||||
asl 0xe8
|
||||
rol 0xea
|
||||
; Compare remainder to denominator (32-bit).
|
||||
lda 0xee
|
||||
cmp 0xe6
|
||||
bcc .Lcoresi_skip
|
||||
bne .Lcoresi_take
|
||||
lda 0xec
|
||||
cmp 0xe4
|
||||
bcc .Lcoresi_skip
|
||||
.Lcoresi_take:
|
||||
; Remainder >= denominator: subtract and set quotient bit 0.
|
||||
sec
|
||||
lda 0xec
|
||||
sbc 0xe4
|
||||
sta 0xec
|
||||
lda 0xee
|
||||
sbc 0xe6
|
||||
sta 0xee
|
||||
inc 0xe8
|
||||
.Lcoresi_skip:
|
||||
dey
|
||||
bne .Lcoresi_loop
|
||||
rts
|
||||
|
||||
; --------------------------------------------------------------------
|
||||
; __udivsi3 — unsigned 32/32 -> 32 divide.
|
||||
; --------------------------------------------------------------------
|
||||
.globl __udivsi3
|
||||
__udivsi3:
|
||||
; ABI: A = a_lo, X = a_hi, (4,s) = b_lo, (6,s) = b_hi.
|
||||
sta 0xe0
|
||||
stx 0xe2
|
||||
lda 0x4, s
|
||||
sta 0xe4
|
||||
lda 0x6, s
|
||||
sta 0xe6
|
||||
jsr __udivmodsi_core
|
||||
ldx 0xea
|
||||
lda 0xe8
|
||||
rtl
|
||||
|
||||
; --------------------------------------------------------------------
|
||||
; __umodsi3 — unsigned 32/32 -> 32 modulo.
|
||||
; --------------------------------------------------------------------
|
||||
.globl __umodsi3
|
||||
__umodsi3:
|
||||
sta 0xe0
|
||||
stx 0xe2
|
||||
lda 0x4, s
|
||||
sta 0xe4
|
||||
lda 0x6, s
|
||||
sta 0xe6
|
||||
jsr __udivmodsi_core
|
||||
ldx 0xee
|
||||
lda 0xec
|
||||
rtl
|
||||
|
||||
; --------------------------------------------------------------------
|
||||
; __divsi3 / __modsi3 — signed 32-bit divide / modulo. Strategy mirrors
|
||||
; the i16 helpers: stash signs, take abs, run unsigned core, negate
|
||||
; result(s) as needed. Sign tracker bits in $f0:
|
||||
; bit 0 = dividend was negative (modulo result sign)
|
||||
; bit 1 = quotient sign (sign(a) XOR sign(b))
|
||||
; --------------------------------------------------------------------
|
||||
.globl __divsi3
|
||||
__divsi3:
|
||||
jsr __divmodsi_setup
|
||||
jsr __udivmodsi_core
|
||||
; Quotient at $e8/$ea. Negate if bit 1 of $f0 is set.
|
||||
lda 0xf0
|
||||
and #0x2
|
||||
beq .Ldivsi_pos
|
||||
; 32-bit two's complement of quotient.
|
||||
lda 0xe8
|
||||
eor #0xffff
|
||||
clc
|
||||
adc #0x1
|
||||
sta 0xe8
|
||||
lda 0xea
|
||||
eor #0xffff
|
||||
adc #0x0
|
||||
sta 0xea
|
||||
.Ldivsi_pos:
|
||||
ldx 0xea
|
||||
lda 0xe8
|
||||
rtl
|
||||
|
||||
.globl __modsi3
|
||||
__modsi3:
|
||||
jsr __divmodsi_setup
|
||||
jsr __udivmodsi_core
|
||||
; Remainder at $ec/$ee. Negate if bit 0 of $f0 set (dividend
|
||||
; was negative — C99 remainder takes dividend's sign).
|
||||
lda 0xf0
|
||||
and #0x1
|
||||
beq .Lmodsi_pos
|
||||
lda 0xec
|
||||
eor #0xffff
|
||||
clc
|
||||
adc #0x1
|
||||
sta 0xec
|
||||
lda 0xee
|
||||
eor #0xffff
|
||||
adc #0x0
|
||||
sta 0xee
|
||||
.Lmodsi_pos:
|
||||
ldx 0xee
|
||||
lda 0xec
|
||||
rtl
|
||||
|
||||
; --------------------------------------------------------------------
|
||||
; __divmodsi_setup — common prologue for __divsi3 / __modsi3.
|
||||
; Reads A=a_lo, X=a_hi (i32-first-arg ABI), (4,s)=b_lo, (6,s)=b_hi.
|
||||
; Writes |a| to $e0/$e2, |b| to $e4/$e6, sign bits to $f0. JSR/RTS.
|
||||
; After JSR's 2-byte ret push, callee-relative offsets are (6,s)=b_lo,
|
||||
; (8,s)=b_hi.
|
||||
; --------------------------------------------------------------------
|
||||
__divmodsi_setup:
|
||||
; Clear sign tracker.
|
||||
pha
|
||||
lda #0x0
|
||||
sta 0xf0
|
||||
pla
|
||||
; |a|: A=a_lo, X=a_hi. Save them first (we need a_hi for sign test).
|
||||
sta 0xe0 ; tentative a_lo (may negate below)
|
||||
stx 0xe2 ; tentative a_hi
|
||||
cpx #0x8000
|
||||
bcc .Lsetsi_a_pos
|
||||
; a is negative. Set sign tracker bits 0+1 and negate.
|
||||
lda 0xf0
|
||||
ora #0x3
|
||||
sta 0xf0
|
||||
; 32-bit negate: invert + 1.
|
||||
lda 0xe0
|
||||
eor #0xffff
|
||||
clc
|
||||
adc #0x1
|
||||
sta 0xe0
|
||||
lda 0xe2
|
||||
eor #0xffff
|
||||
adc #0x0
|
||||
sta 0xe2
|
||||
.Lsetsi_a_pos:
|
||||
; |b|. Args shifted by 2 (the JSR ret push).
|
||||
lda 0x6, s
|
||||
sta 0xe4
|
||||
lda 0x8, s
|
||||
sta 0xe6
|
||||
cmp #0x8000
|
||||
bcc .Lsetsi_b_pos
|
||||
; b is negative. Flip bit 1 of $f0.
|
||||
lda 0xf0
|
||||
eor #0x2
|
||||
sta 0xf0
|
||||
lda 0xe4
|
||||
eor #0xffff
|
||||
clc
|
||||
adc #0x1
|
||||
sta 0xe4
|
||||
lda 0xe6
|
||||
eor #0xffff
|
||||
adc #0x0
|
||||
sta 0xe6
|
||||
.Lsetsi_b_pos:
|
||||
rts
|
||||
33
scripts/safeCC.sh
Executable file
33
scripts/safeCC.sh
Executable file
|
|
@ -0,0 +1,33 @@
|
|||
#!/usr/bin/env bash
|
||||
# Wrapper for ad-hoc invocations of the W65816 cross-compiler toolchain.
|
||||
# Applies the same memory/CPU caps as smokeTest.sh so a runaway backend
|
||||
# bug (infinite combine, runaway inserter) can't OOM-kill the whole tmux
|
||||
# scope and take Claude Code down with it.
|
||||
#
|
||||
# Usage:
|
||||
# scripts/safeCC.sh clang --target=w65816 -O2 -S foo.c -o foo.s
|
||||
# scripts/safeCC.sh llc -march=w65816 foo.ll -o foo.s
|
||||
#
|
||||
# The first arg is resolved against tools/llvm-mos-build/bin/ if it isn't
|
||||
# already an absolute or relative path containing a slash.
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
ulimit -v $((4 * 1024 * 1024)) # 4 GB virtual memory
|
||||
ulimit -t 90 # 90 CPU-seconds
|
||||
|
||||
if [ $# -lt 1 ]; then
|
||||
printf 'usage: %s <tool> [args...]\n' "$0" >&2
|
||||
exit 2
|
||||
fi
|
||||
|
||||
PROJECT_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
|
||||
BIN_DIR="$PROJECT_ROOT/tools/llvm-mos-build/bin"
|
||||
|
||||
tool="$1"
|
||||
shift
|
||||
|
||||
case "$tool" in
|
||||
/*|./*|*/*) exec "$tool" "$@" ;;
|
||||
*) exec "$BIN_DIR/$tool" "$@" ;;
|
||||
esac
|
||||
|
|
@ -11,6 +11,18 @@
|
|||
set -euo pipefail
|
||||
source "$(dirname "$0")/common.sh"
|
||||
|
||||
# Resource caps for child compilers. A bug in the W65816 backend can send
|
||||
# clang/llc into a runaway combine/inserter loop that allocates tens of GB
|
||||
# of RAM. When that happens the kernel OOM-killer takes down the entire
|
||||
# tmux scope (bash, the compiler, and the parent Claude Code session with
|
||||
# it). Bounding virtual memory and CPU time here turns "OOM kills the
|
||||
# terminal" into "compiler dies with SIGSEGV / SIGXCPU and we get a clean
|
||||
# error." Numbers are well above what a healthy compile of these tiny
|
||||
# test inputs needs (~200 MB / a few seconds), so legitimate work is
|
||||
# unaffected.
|
||||
ulimit -v $((4 * 1024 * 1024)) # 4 GB virtual memory ceiling
|
||||
ulimit -t 90 # 90 CPU-seconds per process
|
||||
|
||||
BUILD_DIR="$TOOLS_DIR/llvm-mos-build"
|
||||
LLC="$BUILD_DIR/bin/llc"
|
||||
LLVM_MC="$BUILD_DIR/bin/llvm-mc"
|
||||
|
|
@ -249,7 +261,344 @@ EOF
|
|||
done
|
||||
fi
|
||||
|
||||
# 11. Real C through clang. Uses the clang front-end if it has been
|
||||
# 11a. SETCC via clang: a > b returns 0/1. Exercises the multi-branch
|
||||
# CC path (BEQ + BPL diamond, since SETGT can't be a single Bxx).
|
||||
CLANG="$BUILD_DIR/bin/clang"
|
||||
if [ -x "$CLANG" ]; then
|
||||
log "check: clang compiles a > b via multi-branch SETCC"
|
||||
cFile="$(mktemp --suffix=.c)"
|
||||
sCmpFile="$(mktemp --suffix=.s)"
|
||||
trap 'rm -f "$irFile" "$sFile" "$irCallFile" "$sCallFile" "$irMaFile" "$sMaFile" "$irI8File" "$sI8File" "$cFile" "$sCmpFile"' EXIT
|
||||
cat > "$cFile" <<'EOF'
|
||||
int gt(int a, int b) { return a > b; }
|
||||
EOF
|
||||
"$CLANG" --target=w65816 -O2 -S "$cFile" -o "$sCmpFile"
|
||||
# Expect a CMP, then BEQ + BPL forming the multi-branch diamond.
|
||||
for expect in "cmp 0x4, s" "lda #0x1" "beq" "bpl" "lda #0x0"; do
|
||||
if ! grep -qF "$expect" "$sCmpFile"; then
|
||||
warn "setcc gt test missing: $expect"
|
||||
cat "$sCmpFile" >&2
|
||||
die "setcc gt test failed"
|
||||
fi
|
||||
done
|
||||
fi
|
||||
|
||||
# 11b. SELECT via clang: c ? a : b returns one of two constants.
|
||||
if [ -x "$CLANG" ]; then
|
||||
log "check: clang compiles c ? 100 : 200 via SELECT_CC"
|
||||
cFile2="$(mktemp --suffix=.c)"
|
||||
sSelFile="$(mktemp --suffix=.s)"
|
||||
trap 'rm -f "$irFile" "$sFile" "$irCallFile" "$sCallFile" "$irMaFile" "$sMaFile" "$irI8File" "$sI8File" "$cFile" "$sCmpFile" "$cFile2" "$sSelFile"' EXIT
|
||||
cat > "$cFile2" <<'EOF'
|
||||
int sel(int c) { return c ? 100 : 200; }
|
||||
EOF
|
||||
"$CLANG" --target=w65816 -O2 -S "$cFile2" -o "$sSelFile"
|
||||
for expect in "cmp #0x0" "lda #0xc8" "beq" "lda #0x64"; do
|
||||
if ! grep -qF "$expect" "$sSelFile"; then
|
||||
warn "select test missing: $expect"
|
||||
cat "$sSelFile" >&2
|
||||
die "select test failed"
|
||||
fi
|
||||
done
|
||||
fi
|
||||
|
||||
# 11c. Two-Acc16 op via clang: a - b where both are non-foldable Acc16.
|
||||
# Caller-side b lives in memory (FI), so this matches via SBCfi without
|
||||
# the spill — but a + b + c chains through a true two-Acc16 add.
|
||||
if [ -x "$CLANG" ]; then
|
||||
log "check: clang compiles two-Acc16 ops via spill (chained add)"
|
||||
cFile3="$(mktemp --suffix=.c)"
|
||||
sChainFile="$(mktemp --suffix=.s)"
|
||||
trap 'rm -f "$irFile" "$sFile" "$irCallFile" "$sCallFile" "$irMaFile" "$sMaFile" "$irI8File" "$sI8File" "$cFile" "$sCmpFile" "$cFile2" "$sSelFile" "$cFile3" "$sChainFile"' EXIT
|
||||
cat > "$cFile3" <<'EOF'
|
||||
// max3 forces two-Acc16: outer SELECT_CC compares one Acc16 PHI value
|
||||
// to another Acc16 PHI value (m vs c, both computed values).
|
||||
int max3(int a, int b, int c) {
|
||||
int m = a > b ? a : b;
|
||||
return m > c ? m : c;
|
||||
}
|
||||
EOF
|
||||
"$CLANG" --target=w65816 -O2 -S "$cFile3" -o "$sChainFile"
|
||||
# Expect at least one sta-spill paired with cmp to a stack-relative
|
||||
# slot - the signature of the two-Acc16 CMP_RR custom inserter.
|
||||
if ! grep -qE 'sta 0x[0-9a-f]+, s' "$sChainFile" \
|
||||
|| ! grep -qE 'cmp 0x[0-9a-f]+, s' "$sChainFile"; then
|
||||
cat "$sChainFile" >&2
|
||||
die "two-Acc16 (max3) didn't spill+cmp via stack-relative"
|
||||
fi
|
||||
fi
|
||||
|
||||
# 11d. Multiply via libcall.
|
||||
if [ -x "$CLANG" ]; then
|
||||
log "check: clang emits __mulhi3 libcall for i16 multiply"
|
||||
cFile4="$(mktemp --suffix=.c)"
|
||||
sMulFile="$(mktemp --suffix=.s)"
|
||||
trap 'rm -f "$irFile" "$sFile" "$irCallFile" "$sCallFile" "$irMaFile" "$sMaFile" "$irI8File" "$sI8File" "$cFile" "$sCmpFile" "$cFile2" "$sSelFile" "$cFile3" "$sChainFile" "$cFile4" "$sMulFile"' EXIT
|
||||
cat > "$cFile4" <<'EOF'
|
||||
int mul(int a, int b) { return a * b; }
|
||||
EOF
|
||||
"$CLANG" --target=w65816 -O2 -S "$cFile4" -o "$sMulFile"
|
||||
if ! grep -qF "jsl __mulhi3" "$sMulFile"; then
|
||||
cat "$sMulFile" >&2
|
||||
die "expected jsl __mulhi3"
|
||||
fi
|
||||
fi
|
||||
|
||||
# 11e. Variable shift via libcall.
|
||||
if [ -x "$CLANG" ]; then
|
||||
log "check: clang emits __ashlhi3 libcall for variable i16 shift"
|
||||
cFile5="$(mktemp --suffix=.c)"
|
||||
sShfFile="$(mktemp --suffix=.s)"
|
||||
trap 'rm -f "$irFile" "$sFile" "$irCallFile" "$sCallFile" "$irMaFile" "$sMaFile" "$irI8File" "$sI8File" "$cFile" "$sCmpFile" "$cFile2" "$sSelFile" "$cFile3" "$sChainFile" "$cFile4" "$sMulFile" "$cFile5" "$sShfFile"' EXIT
|
||||
cat > "$cFile5" <<'EOF'
|
||||
int shf(int x, int n) { return x << n; }
|
||||
EOF
|
||||
"$CLANG" --target=w65816 -O2 -S "$cFile5" -o "$sShfFile"
|
||||
if ! grep -qF "jsl __ashlhi3" "$sShfFile"; then
|
||||
cat "$sShfFile" >&2
|
||||
die "expected jsl __ashlhi3"
|
||||
fi
|
||||
fi
|
||||
|
||||
# 11f. Pointer deref: *p loads via stack-relative-indirect-Y.
|
||||
if [ -x "$CLANG" ]; then
|
||||
log "check: clang compiles *p via LDA (slot,s),y"
|
||||
cFile6="$(mktemp --suffix=.c)"
|
||||
sPtrFile="$(mktemp --suffix=.s)"
|
||||
trap 'rm -f "$irFile" "$sFile" "$irCallFile" "$sCallFile" "$irMaFile" "$sMaFile" "$irI8File" "$sI8File" "$cFile" "$sCmpFile" "$cFile2" "$sSelFile" "$cFile3" "$sChainFile" "$cFile4" "$sMulFile" "$cFile5" "$sShfFile" "$cFile6" "$sPtrFile"' EXIT
|
||||
cat > "$cFile6" <<'EOF'
|
||||
int load_ptr(const int *p) { return *p; }
|
||||
void store_ptr(int *p, int v) { *p = v; }
|
||||
EOF
|
||||
"$CLANG" --target=w65816 -O2 -S "$cFile6" -o "$sPtrFile"
|
||||
for expect in "ldy #0x0" "lda (0x" "sta (0x"; do
|
||||
if ! grep -qF "$expect" "$sPtrFile"; then
|
||||
warn "ptr-deref test missing: $expect"
|
||||
cat "$sPtrFile" >&2
|
||||
die "ptr-deref test failed"
|
||||
fi
|
||||
done
|
||||
fi
|
||||
|
||||
# 11g. i8 store via pointer: *p = v wraps the STA in SEP/REP so only
|
||||
# 1 byte is written. Both load_byte and store_byte must compile.
|
||||
if [ -x "$CLANG" ]; then
|
||||
log "check: clang compiles *p = v with SEP/REP-wrapped STA (i8 store)"
|
||||
cFile7="$(mktemp --suffix=.c)"
|
||||
sBptrFile="$(mktemp --suffix=.s)"
|
||||
trap 'rm -f "$irFile" "$sFile" "$irCallFile" "$sCallFile" "$irMaFile" "$sMaFile" "$irI8File" "$sI8File" "$cFile" "$sCmpFile" "$cFile2" "$sSelFile" "$cFile3" "$sChainFile" "$cFile4" "$sMulFile" "$cFile5" "$sShfFile" "$cFile6" "$sPtrFile" "$cFile7" "$sBptrFile"' EXIT
|
||||
cat > "$cFile7" <<'EOF'
|
||||
unsigned char loadb(const unsigned char *p) { return *p; }
|
||||
void storeb(unsigned char *p, unsigned char v) { *p = v; }
|
||||
unsigned char incb(unsigned char *p) { return ++*p; }
|
||||
EOF
|
||||
"$CLANG" --target=w65816 -O2 -S "$cFile7" -o "$sBptrFile"
|
||||
# storeb body should contain SEP #$20 ... STA (slot,s),y ... REP #$20.
|
||||
if ! grep -qF "sep #0x20" "$sBptrFile" \
|
||||
|| ! grep -qF "rep #0x20" "$sBptrFile" \
|
||||
|| ! grep -qE 'sta \(0x[0-9a-f]+, s\), y' "$sBptrFile"; then
|
||||
cat "$sBptrFile" >&2
|
||||
die "i8 ptr-store test missing SEP/STA/REP sequence"
|
||||
fi
|
||||
# All three functions must produce labels.
|
||||
for sym in loadb storeb incb; do
|
||||
if ! grep -qE "^${sym}:" "$sBptrFile"; then
|
||||
cat "$sBptrFile" >&2
|
||||
die "i8 ptr test: missing function ${sym}"
|
||||
fi
|
||||
done
|
||||
# Correctness check: storeb's prologue must NOT clobber A. A holds
|
||||
# the pointer arg on entry; the first body op must spill A intact.
|
||||
# The fixed prologue uses N/2 PHAs (small N) or TAY/TSC/.../TYA
|
||||
# (large N). Either way, the first non-prologue op should be a
|
||||
# `sta NN,s` that captures arg0=p. If we see TSC anywhere in the
|
||||
# prologue WITHOUT a TAY before it, that's the broken form (A
|
||||
# clobbered by TSC, then the spill stores garbage SP value as if
|
||||
# it were the pointer).
|
||||
storeb_body="$(sed -n '/^storeb:/,/^\.Lfunc_end/p' "$sBptrFile")"
|
||||
if printf '%s\n' "$storeb_body" | grep -qE '^ tsc$' \
|
||||
&& ! printf '%s\n' "$storeb_body" | grep -qE '^ tay$'; then
|
||||
cat "$sBptrFile" >&2
|
||||
die "storeb prologue uses bare TSC without TAY — A (the pointer arg) gets clobbered before being spilled. Byte store writes to the wrong address. Use PHA-based prologue or TAY/TSC/.../TYA bracket."
|
||||
fi
|
||||
# Also: there must be at least one `sta NN,s` in the body (the spill
|
||||
# of the pointer arg).
|
||||
if ! printf '%s\n' "$storeb_body" | grep -qE '^ sta 0x[0-9a-f]+, s$'; then
|
||||
cat "$sBptrFile" >&2
|
||||
die "storeb missing pointer-arg spill (sta NN,s)"
|
||||
fi
|
||||
fi
|
||||
|
||||
# 11h. i8 global access stays in 8-bit M (no over-read). bump_gb must
|
||||
# get the SEP #$20 prologue and emit a single-byte lda/inc/sta sequence.
|
||||
if [ -x "$CLANG" ]; then
|
||||
log "check: clang keeps pure-i8 global access in 8-bit M (no wide-read regression)"
|
||||
cFile8="$(mktemp --suffix=.c)"
|
||||
sGbFile="$(mktemp --suffix=.s)"
|
||||
trap 'rm -f "$irFile" "$sFile" "$irCallFile" "$sCallFile" "$irMaFile" "$sMaFile" "$irI8File" "$sI8File" "$cFile" "$sCmpFile" "$cFile2" "$sSelFile" "$cFile3" "$sChainFile" "$cFile4" "$sMulFile" "$cFile5" "$sShfFile" "$cFile6" "$sPtrFile" "$cFile7" "$sBptrFile" "$cFile8" "$sGbFile"' EXIT
|
||||
cat > "$cFile8" <<'EOF'
|
||||
unsigned char gb;
|
||||
void bump_gb(void) { gb++; }
|
||||
EOF
|
||||
"$CLANG" --target=w65816 -O2 -S "$cFile8" -o "$sGbFile"
|
||||
# Must use 8-bit M prologue (sep #$20), not the 16-bit one.
|
||||
if ! grep -qF "sep #0x20" "$sGbFile"; then
|
||||
cat "$sGbFile" >&2
|
||||
die "bump_gb test: expected sep #\$20 prologue (got 16-bit M)"
|
||||
fi
|
||||
fi
|
||||
|
||||
# 11j. Runtime library assembles and exports all expected libcalls.
|
||||
# This is the destination of every __mulhi3/__ashlhi3/etc. that clang
|
||||
# emits — without it, generated code links to nothing.
|
||||
RUNTIME_SH="$PROJECT_ROOT/runtime/build.sh"
|
||||
RUNTIME_OBJ="$PROJECT_ROOT/runtime/libgcc.o"
|
||||
if [ -x "$RUNTIME_SH" ]; then
|
||||
log "check: runtime/build.sh assembles libgcc.o with all libcall symbols"
|
||||
"$RUNTIME_SH" >/dev/null
|
||||
if [ ! -f "$RUNTIME_OBJ" ]; then
|
||||
die "runtime/build.sh did not produce libgcc.o"
|
||||
fi
|
||||
syms="$("$BUILD_DIR/bin/llvm-objdump" -t "$RUNTIME_OBJ" 2>&1 | awk '{print $NF}')"
|
||||
for need in __mulhi3 __ashlhi3 __ashrhi3 __lshrhi3 __divhi3 __udivhi3 __modhi3 __umodhi3; do
|
||||
if ! printf '%s\n' "$syms" | grep -qx "$need"; then
|
||||
printf '%s\n' "$syms" >&2
|
||||
die "runtime missing symbol: $need"
|
||||
fi
|
||||
done
|
||||
fi
|
||||
|
||||
# 11m. Real-world surface area: a non-trivial program that exercises
|
||||
# struct-field deref, char* iteration, multiply, shift, and a bit-twiddle
|
||||
# function. Validates the backend compiles a realistic C input
|
||||
# end-to-end without crashing. Doesn't assert specific asm; just
|
||||
# success and that the function bodies are non-empty.
|
||||
if [ -x "$CLANG" ]; then
|
||||
log "check: clang compiles a real-world multi-function program"
|
||||
cFile12="$(mktemp --suffix=.c)"
|
||||
sBigFile="$(mktemp --suffix=.s)"
|
||||
trap 'rm -f "$irFile" "$sFile" "$irCallFile" "$sCallFile" "$irMaFile" "$sMaFile" "$irI8File" "$sI8File" "$cFile" "$sCmpFile" "$cFile2" "$sSelFile" "$cFile3" "$sChainFile" "$cFile4" "$sMulFile" "$cFile5" "$sShfFile" "$cFile6" "$sPtrFile" "$cFile7" "$sBptrFile" "$cFile8" "$sGbFile" "$cFile9" "$sEqbFile" "$cFile10" "$sSgnFile" "$cFile11" "$sCallsFile" "$cFile12" "$sBigFile"' EXIT
|
||||
cat > "$cFile12" <<'EOF'
|
||||
typedef unsigned char u8;
|
||||
typedef unsigned int u16;
|
||||
struct Node { u16 data; struct Node *next; };
|
||||
u16 list_sum(const struct Node *h) {
|
||||
u16 s=0; while(h){ s+=h->data; h=h->next; } return s;
|
||||
}
|
||||
int strcmp_test(const char *a, const char *b) {
|
||||
while (*a && *a == *b) { a++; b++; }
|
||||
return (unsigned char)*a - (unsigned char)*b;
|
||||
}
|
||||
u16 fnv16(const u8 *p, u16 n) {
|
||||
u16 h=0x811C; for (u16 i=0;i<n;i++){ h^=p[i]; h=h*0x101; } return h;
|
||||
}
|
||||
u16 ctz16(u16 x) {
|
||||
if (!x) return 16;
|
||||
u16 n=0;
|
||||
if (!(x & 0xFF)) { n+=8; x>>=8; }
|
||||
if (!(x & 0x0F)) { n+=4; x>>=4; }
|
||||
if (!(x & 0x03)) { n+=2; x>>=2; }
|
||||
if (!(x & 0x01)) n+=1;
|
||||
return n;
|
||||
}
|
||||
EOF
|
||||
"$CLANG" --target=w65816 -O2 -S "$cFile12" -o "$sBigFile"
|
||||
for sym in list_sum strcmp_test fnv16 ctz16; do
|
||||
if ! grep -qE "^${sym}:" "$sBigFile"; then
|
||||
cat "$sBigFile" >&2
|
||||
die "real-world test missing function: $sym"
|
||||
fi
|
||||
done
|
||||
fi
|
||||
|
||||
# 11l. Linkage contract: every libcall clang generates from arithmetic
|
||||
# ops must match a symbol provided by runtime/libgcc.o. We can't run a
|
||||
# real link yet (no w65816-aware linker), but we can verify the symbol
|
||||
# names line up — drift here would be a silent runtime crash.
|
||||
if [ -x "$CLANG" ] && [ -f "$RUNTIME_OBJ" ]; then
|
||||
log "check: every libcall clang emits has a matching definition in libgcc.o"
|
||||
cFile11="$(mktemp --suffix=.c)"
|
||||
sCallsFile="$(mktemp --suffix=.s)"
|
||||
trap 'rm -f "$irFile" "$sFile" "$irCallFile" "$sCallFile" "$irMaFile" "$sMaFile" "$irI8File" "$sI8File" "$cFile" "$sCmpFile" "$cFile2" "$sSelFile" "$cFile3" "$sChainFile" "$cFile4" "$sMulFile" "$cFile5" "$sShfFile" "$cFile6" "$sPtrFile" "$cFile7" "$sBptrFile" "$cFile8" "$sGbFile" "$cFile9" "$sEqbFile" "$cFile10" "$sSgnFile" "$cFile11" "$sCallsFile"' EXIT
|
||||
cat > "$cFile11" <<'EOF'
|
||||
int m1(int a, int b) { return a * b; }
|
||||
unsigned int m2(unsigned int a, unsigned int b) { return a * b; }
|
||||
int s1(int x, int n) { return x << n; }
|
||||
unsigned int s2(unsigned int x, int n) { return x >> n; }
|
||||
int s3(int x, int n) { return x >> n; }
|
||||
int d1(int a, int b) { return a / b; }
|
||||
unsigned int d2(unsigned int a, unsigned int b) { return a / b; }
|
||||
int r1(int a, int b) { return a % b; }
|
||||
unsigned int r2(unsigned int a, unsigned int b) { return a % b; }
|
||||
long m3(long a, long b) { return a * b; }
|
||||
unsigned long m4(unsigned long a, unsigned long b) { return a * b; }
|
||||
long s4(long x, int n) { return x << n; }
|
||||
long s5(long x, int n) { return x >> n; }
|
||||
unsigned long s6(unsigned long x, int n) { return x >> n; }
|
||||
long d3(long a, long b) { return a / b; }
|
||||
unsigned long d4(unsigned long a, unsigned long b) { return a / b; }
|
||||
long r3(long a, long b) { return a % b; }
|
||||
unsigned long r4(unsigned long a, unsigned long b) { return a % b; }
|
||||
EOF
|
||||
"$CLANG" --target=w65816 -O2 -S "$cFile11" -o "$sCallsFile"
|
||||
runtime_syms="$("$BUILD_DIR/bin/llvm-objdump" -t "$RUNTIME_OBJ" 2>&1 | awk '$2 == "g" {print $NF}')"
|
||||
emitted="$(grep -oE 'jsl __[a-z0-9]+' "$sCallsFile" | awk '{print $2}' | sort -u)"
|
||||
for sym in $emitted; do
|
||||
if ! printf '%s\n' "$runtime_syms" | grep -qx "$sym"; then
|
||||
warn "clang emitted libcall $sym but runtime/libgcc.o has no such symbol"
|
||||
printf 'runtime exports:\n%s\n' "$runtime_syms" >&2
|
||||
printf 'clang emitted:\n%s\n' "$emitted" >&2
|
||||
die "libcall name drift: $sym missing from runtime"
|
||||
fi
|
||||
done
|
||||
fi
|
||||
|
||||
# 11k. signed i8 compare: forces 16-bit M prologue (instrLowersToWide)
|
||||
# because the SEXT lowering needs i16 ops. Verifies both that the
|
||||
# code compiles AND that the prologue is REP #$30 (not the 8-bit M
|
||||
# fast path, which would silently corrupt the SEXT mask).
|
||||
if [ -x "$CLANG" ]; then
|
||||
log "check: signed i8 compare gets 16-bit M prologue + emits cmp"
|
||||
cFile10="$(mktemp --suffix=.c)"
|
||||
sSgnFile="$(mktemp --suffix=.s)"
|
||||
trap 'rm -f "$irFile" "$sFile" "$irCallFile" "$sCallFile" "$irMaFile" "$sMaFile" "$irI8File" "$sI8File" "$cFile" "$sCmpFile" "$cFile2" "$sSelFile" "$cFile3" "$sChainFile" "$cFile4" "$sMulFile" "$cFile5" "$sShfFile" "$cFile6" "$sPtrFile" "$cFile7" "$sBptrFile" "$cFile8" "$sGbFile" "$cFile9" "$sEqbFile" "$cFile10" "$sSgnFile"' EXIT
|
||||
cat > "$cFile10" <<'EOF'
|
||||
signed char sgnlt(signed char a, signed char b) { return a < b; }
|
||||
EOF
|
||||
"$CLANG" --target=w65816 -O2 -S "$cFile10" -o "$sSgnFile"
|
||||
# Must use 16-bit M (rep #$30), not the 8-bit fast path.
|
||||
if ! grep -qF "rep #0x30" "$sSgnFile"; then
|
||||
cat "$sSgnFile" >&2
|
||||
die "sgnlt: expected rep #\$30 prologue (i8 signed cmp needs 16-bit M)"
|
||||
fi
|
||||
# Must NOT contain the 8-bit prologue, which would mean we never
|
||||
# transitioned (the SEXT injection's ora #\$ff00 would silently
|
||||
# truncate to ora #\$00 in 8-bit M).
|
||||
if grep -qF "rep #0x10" "$sSgnFile" && ! grep -qF "rep #0x30" "$sSgnFile"; then
|
||||
cat "$sSgnFile" >&2
|
||||
die "sgnlt: only saw 8-bit M prologue, SEXT high-byte mask would be dropped"
|
||||
fi
|
||||
fi
|
||||
|
||||
# 11i. i8 equality compare on two stack args (eqbyte): exercises i8
|
||||
# SETCC promotion through Lower*CC.
|
||||
if [ -x "$CLANG" ]; then
|
||||
log "check: clang lowers i8 == i8 via promoted i16 cmp"
|
||||
cFile9="$(mktemp --suffix=.c)"
|
||||
sEqbFile="$(mktemp --suffix=.s)"
|
||||
trap 'rm -f "$irFile" "$sFile" "$irCallFile" "$sCallFile" "$irMaFile" "$sMaFile" "$irI8File" "$sI8File" "$cFile" "$sCmpFile" "$cFile2" "$sSelFile" "$cFile3" "$sChainFile" "$cFile4" "$sMulFile" "$cFile5" "$sShfFile" "$cFile6" "$sPtrFile" "$cFile7" "$sBptrFile" "$cFile8" "$sGbFile" "$cFile9" "$sEqbFile"' EXIT
|
||||
cat > "$cFile9" <<'EOF'
|
||||
unsigned char eqbyte(unsigned char a, unsigned char b) { return a == b; }
|
||||
EOF
|
||||
"$CLANG" --target=w65816 -O2 -S "$cFile9" -o "$sEqbFile"
|
||||
# Must produce a cmp + beq (the eq diamond).
|
||||
if ! grep -qE 'cmp ' "$sEqbFile" || ! grep -qF "beq" "$sEqbFile"; then
|
||||
cat "$sEqbFile" >&2
|
||||
die "eqbyte test: expected cmp + beq sequence"
|
||||
fi
|
||||
fi
|
||||
|
||||
# 12. Real C through clang. Uses the clang front-end if it has been
|
||||
# built; skipped otherwise (clang takes 15-30 minutes to build the
|
||||
# first time; afterwards rebuilds are fast).
|
||||
CLANG="$BUILD_DIR/bin/clang"
|
||||
|
|
@ -270,6 +619,222 @@ EOF
|
|||
die "clang end-to-end test failed"
|
||||
fi
|
||||
done
|
||||
|
||||
# 13. i32 (long) compile path. Type legalization splits i32 into
|
||||
# two i16 halves; the high half flows through the (add FrameIndex,
|
||||
# 2) shape, which previously crashed ISel with "Cannot select
|
||||
# FrameIndex<-2>". SelectFrameIndex now folds (add FI, const) so
|
||||
# the split loads land on a stack-relative addressing mode.
|
||||
# Return ABI: low->A, high->X (TAX in the epilogue).
|
||||
# Also asserts the native ADC carry chain (CLC + ADC + ADC) is in
|
||||
# place — task #49 replaced the bloated SETCC-based carry detect
|
||||
# (lda;cmp;bcc;lda) with a direct ADDC/ADDE-pattern lowering that
|
||||
# uses the C flag in P as a Glue-modeled physreg.
|
||||
log "check: clang compiles a long add (i32 split + A:X return)"
|
||||
cI32File="$(mktemp --suffix=.c)"
|
||||
oI32File="$(mktemp --suffix=.o)"
|
||||
trap 'rm -f "$irFile" "$sFile" "$irCallFile" "$sCallFile" "$irMaFile" "$sMaFile" "$irI8File" "$sI8File" "$cFile" "$oFile2" "$cI32File" "$oI32File"' EXIT
|
||||
cat > "$cI32File" <<'EOF'
|
||||
long add32(long a, long b) { return a + b; }
|
||||
EOF
|
||||
"$CLANG" --target=w65816 -O2 -c "$cI32File" -o "$oI32File"
|
||||
disasmI32="$("$OBJDUMP" --triple=w65816 -d "$oI32File" 2>&1)"
|
||||
# TAX confirms the high-half-into-X part of the return ABI fired.
|
||||
# Without it, both halves would pile into A and one would be lost.
|
||||
# Exactly one CLC and exactly two ADCs prove the native carry chain
|
||||
# is wired (one CLC for lo, ADC lo, ADC hi-with-carry); a regression
|
||||
# to the SETCC path would show two CLCs and a bcc/cmp.
|
||||
for expect in "tax" "rtl" "clc" "adc"; do
|
||||
if ! printf '%s\n' "$disasmI32" | grep -qF "$expect"; then
|
||||
warn "i32 add test missing: $expect"
|
||||
printf '%s\n' "$disasmI32" >&2
|
||||
die "i32 add end-to-end test failed"
|
||||
fi
|
||||
done
|
||||
nClc="$(printf '%s\n' "$disasmI32" | grep -cE '\bclc\b' || true)"
|
||||
nAdc="$(printf '%s\n' "$disasmI32" | grep -cE '\badc\b' || true)"
|
||||
nBcc="$(printf '%s\n' "$disasmI32" | grep -cE '\bbcc\b' || true)"
|
||||
if [ "$nClc" != "1" ] || [ "$nAdc" != "2" ] || [ "$nBcc" != "0" ]; then
|
||||
warn "i32 add carry-chain shape wrong (clc=$nClc adc=$nAdc bcc=$nBcc, want 1/2/0)"
|
||||
printf '%s\n' "$disasmI32" >&2
|
||||
die "i32 add carry-chain regression"
|
||||
fi
|
||||
# Lock the post-StackSlotCleanup instruction count: should be ~11 for
|
||||
# add32 (rep + pha + clc + adc + sta + txa + adc + tax + lda + ply + rtl
|
||||
# — i32-first-arg in A:X means arg0_hi loads as TXA, no LDAfi). If
|
||||
# this regresses meaningfully (say >14) the cleanup pass, the
|
||||
# rematerialization flag, or the A:X first-arg ABI has been broken.
|
||||
nInsns="$(printf '%s\n' "$disasmI32" | grep -cE '^[0-9a-f]+:' || true)"
|
||||
if [ "$nInsns" -gt 14 ]; then
|
||||
warn "i32 add bloat (got $nInsns insns, want <=14 — was 25 pre-cleanup, 11 post)"
|
||||
printf '%s\n' "$disasmI32" >&2
|
||||
die "i32 add code-quality regression"
|
||||
fi
|
||||
# The A:X arg0 ABI moves arg0_hi out of the stack slot, so the
|
||||
# asm should contain TXA (X→A for the hi-half ADC tied input)
|
||||
# exactly once. A regression to "load arg0_hi from stack" would
|
||||
# remove the TXA and add an extra LDA.
|
||||
nTxa="$(printf '%s\n' "$disasmI32" | grep -cE '\btxa\b' || true)"
|
||||
if [ "$nTxa" != "1" ]; then
|
||||
warn "i32 add: expected exactly 1 txa (i32-first-arg-in-A:X path); got $nTxa"
|
||||
printf '%s\n' "$disasmI32" >&2
|
||||
die "i32 add A:X first-arg ABI regression"
|
||||
fi
|
||||
|
||||
# i32 carry chain on two-Acc16 (no foldable load): exercises the
|
||||
# ADD_RR + ADDE_RR custom-inserter path. fib32 has live a/b values
|
||||
# the inserter must spill to a fresh slot; pre-fix this crashed at
|
||||
# ISel with "Cannot select: adde reg, reg".
|
||||
log "check: clang compiles a 32-bit fib loop (ADDE_RR inserter path)"
|
||||
cFibFile="$(mktemp --suffix=.c)"
|
||||
sFibFile="$(mktemp --suffix=.s)"
|
||||
trap 'rm -f "$irFile" "$sFile" "$irCallFile" "$sCallFile" "$irMaFile" "$sMaFile" "$irI8File" "$sI8File" "$cFile" "$oFile2" "$cI32File" "$oI32File" "$cFibFile" "$sFibFile"' EXIT
|
||||
cat > "$cFibFile" <<'EOF'
|
||||
unsigned long fib32(unsigned long n) {
|
||||
unsigned long a = 0, b = 1, t;
|
||||
while (n > 0) { t = a + b; a = b; b = t; n--; }
|
||||
return a;
|
||||
}
|
||||
EOF
|
||||
if ! "$CLANG" --target=w65816 -O2 -S "$cFibFile" -o "$sFibFile" 2>&1 >/dev/null; then
|
||||
die "i32 fib (ADDE_RR inserter) failed to compile"
|
||||
fi
|
||||
if ! grep -qE '\bclc\b' "$sFibFile" || ! grep -qE '\badc\b' "$sFibFile"; then
|
||||
warn "i32 fib output missing clc/adc"
|
||||
die "i32 fib carry-chain regression"
|
||||
fi
|
||||
|
||||
# i32 multiply via __mulsi3 libcall: tests the multi-i16-return path
|
||||
# (RetCC_W65816 assigning A then X for 2 i16 returns) plus the i32
|
||||
# arg push side. Pre-fix this hit "multi-return calls not yet
|
||||
# supported (Ins.size=4)" when LowerCallTo split the i32 return.
|
||||
log "check: clang compiles a long multiply via __mulsi3 libcall"
|
||||
cMulFile="$(mktemp --suffix=.c)"
|
||||
sMulFile="$(mktemp --suffix=.s)"
|
||||
trap 'rm -f "$irFile" "$sFile" "$irCallFile" "$sCallFile" "$irMaFile" "$sMaFile" "$irI8File" "$sI8File" "$cFile" "$oFile2" "$cI32File" "$oI32File" "$cFibFile" "$sFibFile" "$cMulFile" "$sMulFile"' EXIT
|
||||
cat > "$cMulFile" <<'EOF'
|
||||
unsigned long mul32(unsigned long a, unsigned long b) { return a * b; }
|
||||
EOF
|
||||
if ! "$CLANG" --target=w65816 -O2 -S "$cMulFile" -o "$sMulFile" 2>&1 >/dev/null; then
|
||||
die "i32 mul via __mulsi3 failed to compile"
|
||||
fi
|
||||
if ! grep -q '__mulsi3' "$sMulFile"; then
|
||||
die "i32 mul did not emit __mulsi3 libcall"
|
||||
fi
|
||||
|
||||
# i32 shift-by-1 (SHL/SRL): the type-legalizer's SHL_PARTS / SRL_PARTS
|
||||
# expansion needs `(srl x, 15)` or `(shl x, 15)` for the carry-cross-
|
||||
# halves slot. Without inline patterns those fall to __lshrhi3 /
|
||||
# __ashlhi3 libcalls (~10 byte overhead per shift). SRL15A and
|
||||
# SHL15A pseudos handle them inline (`ASL/LSR; LDA #0; ROL/ROR`,
|
||||
# 3 bytes). Verify the shift-by-1 output doesn't contain a hi3
|
||||
# libcall.
|
||||
log "check: clang i32 shift-by-1 stays inline (no __lshrhi3 / __ashlhi3 libcall)"
|
||||
cSh1File="$(mktemp --suffix=.c)"
|
||||
sSh1File="$(mktemp --suffix=.s)"
|
||||
trap 'rm -f "$irFile" "$sFile" "$irCallFile" "$sCallFile" "$irMaFile" "$sMaFile" "$irI8File" "$sI8File" "$cFile" "$oFile2" "$cI32File" "$oI32File" "$cFibFile" "$sFibFile" "$cMulFile" "$sMulFile" "$cSh1File" "$sSh1File"' EXIT
|
||||
cat > "$cSh1File" <<'EOF'
|
||||
unsigned long shl1(unsigned long a) { return a << 1; }
|
||||
unsigned long shr1(unsigned long a) { return a >> 1; }
|
||||
EOF
|
||||
if ! "$CLANG" --target=w65816 -O2 -S "$cSh1File" -o "$sSh1File" 2>&1 >/dev/null; then
|
||||
die "i32 shift-by-1 failed to compile"
|
||||
fi
|
||||
if grep -qE '__lshrhi3|__ashlhi3' "$sSh1File"; then
|
||||
warn "i32 shift-by-1 still calling i16 shift libcall — SRL15A/SHL15A pattern not firing"
|
||||
die "i32 shift-by-1 regression"
|
||||
fi
|
||||
|
||||
# Varargs (<stdarg.h>): LowerFormalArguments creates a fixed FI
|
||||
# for the first vararg slot when IsVarArg; LowerVASTART stores
|
||||
# its address to the va_list pointer. VAARG/VACOPY/VAEND use
|
||||
# default LLVM expansions. Pre-fix this hit
|
||||
# "vararg functions not yet supported" fatal error.
|
||||
log "check: clang compiles a vararg function (<stdarg.h>)"
|
||||
cVaFile="$(mktemp --suffix=.c)"
|
||||
sVaFile="$(mktemp --suffix=.s)"
|
||||
trap 'rm -f "$irFile" "$sFile" "$irCallFile" "$sCallFile" "$irMaFile" "$sMaFile" "$irI8File" "$sI8File" "$cFile" "$oFile2" "$cI32File" "$oI32File" "$cFibFile" "$sFibFile" "$cMulFile" "$sMulFile" "$cSh1File" "$sSh1File" "$cVaFile" "$sVaFile"' EXIT
|
||||
cat > "$cVaFile" <<'EOF'
|
||||
#include <stdarg.h>
|
||||
int sumArgs(int n, ...) {
|
||||
va_list args;
|
||||
va_start(args, n);
|
||||
int sum = 0;
|
||||
for (int i = 0; i < n; i++) sum += va_arg(args, int);
|
||||
va_end(args);
|
||||
return sum;
|
||||
}
|
||||
EOF
|
||||
if ! "$CLANG" --target=w65816 -O2 -S "$cVaFile" -o "$sVaFile" 2>&1 >/dev/null; then
|
||||
die "vararg function failed to compile"
|
||||
fi
|
||||
|
||||
# Stack-array LEA: `char arr[16]; arr[i] = ...` needs the address
|
||||
# of an alloca'd object as an i16 value. Pre-fix this hit "Cannot
|
||||
# select: FrameIndex<0>" because addr_fi only matches in load/store
|
||||
# contexts. W65816DAGToDAGISel::Select now lowers a bare
|
||||
# ISD::FrameIndex to ADDframe (FI, 0); eliminateFrameIndex expands
|
||||
# ADDframe into TSC + CLC + ADC #disp.
|
||||
log "check: clang takes the address of a stack-allocated array"
|
||||
cAllocaFile="$(mktemp --suffix=.c)"
|
||||
sAllocaFile="$(mktemp --suffix=.s)"
|
||||
trap 'rm -f "$irFile" "$sFile" "$irCallFile" "$sCallFile" "$irMaFile" "$sMaFile" "$irI8File" "$sI8File" "$cFile" "$oFile2" "$cI32File" "$oI32File" "$cFibFile" "$sFibFile" "$cMulFile" "$sMulFile" "$cAllocaFile" "$sAllocaFile"' EXIT
|
||||
cat > "$cAllocaFile" <<'EOF'
|
||||
void writeBytes(char *out, char v) {
|
||||
char tmp[8];
|
||||
for (int i = 0; i < 8; i++) tmp[i] = v + i;
|
||||
for (int i = 0; i < 8; i++) out[i] = tmp[i];
|
||||
}
|
||||
EOF
|
||||
if ! "$CLANG" --target=w65816 -O2 -S "$cAllocaFile" -o "$sAllocaFile" 2>&1 >/dev/null; then
|
||||
die "alloca'd-array address failed to compile"
|
||||
fi
|
||||
# The TSC; CLC; ADC #disp triple is the LEA expansion of ADDframe;
|
||||
# at least one occurrence proves the pseudo wired through.
|
||||
if ! grep -qE '^\s*tsc' "$sAllocaFile"; then
|
||||
die "alloca'd-array LEA missing TSC (ADDframe expansion broken)"
|
||||
fi
|
||||
|
||||
# signed-byte arithmetic (`(int)(*p) - (int)(*q)` style — strcmp).
|
||||
# Exercises three formerly-missing patterns: SEXTLOAD i16 from i8
|
||||
# (we Expand it to (sext (load))), sext_inreg i16 from i8 (the
|
||||
# `((x & 0xFF) ^ 0x80) - 0x80` tablegen Pat), and extloadi8 from
|
||||
# an Acc16 register pointer (LDAptr / "high byte don't care").
|
||||
log "check: clang compiles a signed-byte strcmp (sextload + sext_inreg + extload-via-ptr)"
|
||||
cStrFile="$(mktemp --suffix=.c)"
|
||||
sStrFile="$(mktemp --suffix=.s)"
|
||||
trap 'rm -f "$irFile" "$sFile" "$irCallFile" "$sCallFile" "$irMaFile" "$sMaFile" "$irI8File" "$sI8File" "$cFile" "$oFile2" "$cI32File" "$oI32File" "$cFibFile" "$sFibFile" "$cMulFile" "$sMulFile" "$cAllocaFile" "$sAllocaFile" "$cStrFile" "$sStrFile"' EXIT
|
||||
cat > "$cStrFile" <<'EOF'
|
||||
int strcmp32(const char *a, const char *b) {
|
||||
while (*a && *a == *b) { a++; b++; }
|
||||
return (int)(*a) - (int)(*b);
|
||||
}
|
||||
EOF
|
||||
if ! "$CLANG" --target=w65816 -O2 -S "$cStrFile" -o "$sStrFile" 2>&1 >/dev/null; then
|
||||
die "signed-byte strcmp failed to compile"
|
||||
fi
|
||||
|
||||
# Indirect calls (function pointers). Lowered via the runtime
|
||||
# trampoline at runtime/src/libgcc.s::__jsl_indir, which does
|
||||
# JMP (__indirTarget) — caller stores target to __indirTarget then
|
||||
# JSL __jsl_indir. Pre-fix, LowerCall reported a fatal error.
|
||||
log "check: clang compiles an indirect call (via __jsl_indir trampoline)"
|
||||
cIndFile="$(mktemp --suffix=.c)"
|
||||
sIndFile="$(mktemp --suffix=.s)"
|
||||
trap 'rm -f "$irFile" "$sFile" "$irCallFile" "$sCallFile" "$irMaFile" "$sMaFile" "$irI8File" "$sI8File" "$cFile" "$oFile2" "$cI32File" "$oI32File" "$cFibFile" "$sFibFile" "$cMulFile" "$sMulFile" "$cAllocaFile" "$sAllocaFile" "$cStrFile" "$sStrFile" "$cIndFile" "$sIndFile"' EXIT
|
||||
cat > "$cIndFile" <<'EOF'
|
||||
typedef int (*BinOp)(int, int);
|
||||
int doOp(BinOp op, int x, int y) { return op(x, y); }
|
||||
EOF
|
||||
if ! "$CLANG" --target=w65816 -O2 -S "$cIndFile" -o "$sIndFile" 2>&1 >/dev/null; then
|
||||
die "indirect call failed to compile"
|
||||
fi
|
||||
if ! grep -q '__indirTarget' "$sIndFile"; then
|
||||
die "indirect call missing __indirTarget store"
|
||||
fi
|
||||
if ! grep -q '__jsl_indir' "$sIndFile"; then
|
||||
die "indirect call missing JSL to __jsl_indir trampoline"
|
||||
fi
|
||||
fi
|
||||
|
||||
log "all smoke checks passed"
|
||||
|
|
|
|||
|
|
@ -200,7 +200,13 @@ public:
|
|||
}
|
||||
|
||||
bool isPCRel8() const {
|
||||
return Kind == k_Addr && isConstant(Addr) && constFitsUnsigned(Addr, 8);
|
||||
// Branch targets are typically symbols (resolved by the assembler /
|
||||
// linker into the final 8-bit signed offset). Accept any address
|
||||
// expression — constant in-range, or symbolic. Constants outside
|
||||
// 8 bits are rejected so they fall through to PCRel16 / longer
|
||||
// forms instead of silently overflowing.
|
||||
return Kind == k_Addr &&
|
||||
(!isConstant(Addr) || constFitsUnsigned(Addr, 8));
|
||||
}
|
||||
bool isPCRel16() const {
|
||||
return Kind == k_Addr &&
|
||||
|
|
|
|||
|
|
@ -24,6 +24,7 @@ add_llvm_target(W65816CodeGen
|
|||
W65816RegisterInfo.cpp
|
||||
W65816SelectionDAGInfo.cpp
|
||||
W65816Subtarget.cpp
|
||||
W65816StackSlotCleanup.cpp
|
||||
W65816TargetMachine.cpp
|
||||
W65816AsmPrinter.cpp
|
||||
W65816MCInstLower.cpp
|
||||
|
|
|
|||
|
|
@ -66,6 +66,22 @@ public:
|
|||
return;
|
||||
}
|
||||
|
||||
// PCRel8 (Bxx / BRA) takes a signed 8-bit offset. If the resolved
|
||||
// displacement won't fit, the encoded byte is meaningless — the
|
||||
// branch would land somewhere unintended. Diagnose explicitly
|
||||
// instead of silently truncating.
|
||||
if (Fixup.getKind() == W65816::fixup_8_pcrel) {
|
||||
int64_t Signed = static_cast<int64_t>(Value);
|
||||
if (Signed < -128 || Signed > 127) {
|
||||
getContext().reportError(
|
||||
Fixup.getLoc(),
|
||||
"branch target out of range for 8-bit PC-relative branch "
|
||||
"(offset " + Twine(Signed) + " bytes); use a long branch (BRL) "
|
||||
"or restructure the code");
|
||||
return; // don't patch — leave zero, error already issued
|
||||
}
|
||||
}
|
||||
|
||||
// Little-endian patch.
|
||||
for (unsigned i = 0; i < Width; ++i) {
|
||||
Data[Offset + i] = static_cast<uint8_t>((Value >> (8 * i)) & 0xff);
|
||||
|
|
|
|||
|
|
@ -20,15 +20,26 @@
|
|||
namespace W65816CC {
|
||||
// 65816 branch condition codes. Encoded as i8 immediate operands in
|
||||
// the BR_CC SDNode and tablegen patterns.
|
||||
//
|
||||
// 0..7 map to single Bxx instructions. 8..11 are pseudo codes that
|
||||
// expand to a two-branch sequence — needed for SETGT/SETLE/SETUGT/
|
||||
// SETULE when the operand we'd swap to LHS is a load (no
|
||||
// pattern-match for load on LHS without spilling A). Only used in
|
||||
// SELECT_CC16's custom inserter; never reaches a single Bxx.
|
||||
enum CondCode {
|
||||
COND_EQ = 0, // BEQ
|
||||
COND_NE = 1, // BNE
|
||||
COND_HS = 2, // BCS (unsigned >=)
|
||||
COND_LO = 3, // BCC (unsigned <)
|
||||
COND_MI = 4, // BMI (negative)
|
||||
COND_PL = 5, // BPL (non-negative)
|
||||
COND_MI = 4, // BMI (negative, signed <)
|
||||
COND_PL = 5, // BPL (non-negative, signed >=)
|
||||
COND_VS = 6, // BVS (overflow)
|
||||
COND_VC = 7, // BVC (no overflow)
|
||||
// Multi-branch pseudo codes (handled by SELECT_CC16 inserter):
|
||||
COND_GT_MB = 8, // signed > : take if (PL && NE)
|
||||
COND_LE_MB = 9, // signed <= : take if (MI || EQ)
|
||||
COND_HI_MB = 10, // unsigned > : take if (HS && NE)
|
||||
COND_LS_MB = 11, // unsigned <=: take if (LO || EQ)
|
||||
COND_INVALID = -1
|
||||
};
|
||||
} // namespace W65816CC
|
||||
|
|
@ -42,8 +53,15 @@ class PassRegistry;
|
|||
FunctionPass *createW65816ISelDag(W65816TargetMachine &TM,
|
||||
CodeGenOptLevel OptLevel);
|
||||
|
||||
// Post-RA cleanup: removes redundant STAfi+LDAfi same-slot pairs that
|
||||
// the greedy allocator emits when materialising a COPY $a -> vreg as
|
||||
// a spill/reload cycle, even though A still holds the value. See
|
||||
// W65816StackSlotCleanup.cpp.
|
||||
FunctionPass *createW65816StackSlotCleanup();
|
||||
|
||||
void initializeW65816AsmPrinterPass(PassRegistry &);
|
||||
void initializeW65816DAGToDAGISelLegacyPass(PassRegistry &);
|
||||
void initializeW65816StackSlotCleanupPass(PassRegistry &);
|
||||
|
||||
} // namespace llvm
|
||||
|
||||
|
|
|
|||
|
|
@ -82,6 +82,13 @@ void W65816AsmPrinter::emitInstruction(const MachineInstr *MI) {
|
|||
switch (MI->getOpcode()) {
|
||||
default:
|
||||
break;
|
||||
case W65816::LDXi16imm: {
|
||||
MCInst Ldx;
|
||||
Ldx.setOpcode(W65816::LDX_Imm16);
|
||||
Ldx.addOperand(lowerOperand(MI->getOperand(1), MCInstLowering));
|
||||
EmitToStreamer(*OutStreamer, Ldx);
|
||||
return;
|
||||
}
|
||||
case W65816::LDAi16imm: {
|
||||
MCInst Lda;
|
||||
Lda.setOpcode(W65816::LDA_Imm16);
|
||||
|
|
@ -126,6 +133,18 @@ void W65816AsmPrinter::emitInstruction(const MachineInstr *MI) {
|
|||
EmitToStreamer(*OutStreamer, Op);
|
||||
return;
|
||||
}
|
||||
case W65816::ADCEi16imm:
|
||||
case W65816::SBCEi16imm: {
|
||||
// Chained ADC/SBC: no CLC/SEC prefix — the carry/borrow from the
|
||||
// previous addc/adde/subc/sube is already in P. See ADCi16imm
|
||||
// comment in W65816InstrInfo.td.
|
||||
bool IsSub = MI->getOpcode() == W65816::SBCEi16imm;
|
||||
MCInst Op;
|
||||
Op.setOpcode(IsSub ? W65816::SBC_Imm16 : W65816::ADC_Imm16);
|
||||
Op.addOperand(lowerOperand(MI->getOperand(2), MCInstLowering));
|
||||
EmitToStreamer(*OutStreamer, Op);
|
||||
return;
|
||||
}
|
||||
case W65816::ADCi8imm:
|
||||
case W65816::SBCi8imm: {
|
||||
bool IsSub = MI->getOpcode() == W65816::SBCi8imm;
|
||||
|
|
@ -185,6 +204,16 @@ void W65816AsmPrinter::emitInstruction(const MachineInstr *MI) {
|
|||
EmitToStreamer(*OutStreamer, Op);
|
||||
return;
|
||||
}
|
||||
case W65816::ADCEabs:
|
||||
case W65816::SBCEabs: {
|
||||
// Chained variant — no CLC/SEC prefix.
|
||||
bool IsSub = MI->getOpcode() == W65816::SBCEabs;
|
||||
MCInst Op;
|
||||
Op.setOpcode(IsSub ? W65816::SBC_Abs : W65816::ADC_Abs);
|
||||
Op.addOperand(lowerOperand(MI->getOperand(2), MCInstLowering));
|
||||
EmitToStreamer(*OutStreamer, Op);
|
||||
return;
|
||||
}
|
||||
case W65816::CMPi16imm: {
|
||||
// CMPi16imm has (outs), (ins Acc16:$lhs, i16imm:$rhs); MC needs only
|
||||
// the immediate.
|
||||
|
|
@ -248,6 +277,18 @@ void W65816AsmPrinter::emitInstruction(const MachineInstr *MI) {
|
|||
EmitToStreamer(*OutStreamer, Jsl);
|
||||
return;
|
||||
}
|
||||
case W65816::PUSH16: {
|
||||
MCInst Pha;
|
||||
Pha.setOpcode(W65816::PHA);
|
||||
EmitToStreamer(*OutStreamer, Pha);
|
||||
return;
|
||||
}
|
||||
case W65816::PUSH16X: {
|
||||
MCInst Phx;
|
||||
Phx.setOpcode(W65816::PHX);
|
||||
EmitToStreamer(*OutStreamer, Phx);
|
||||
return;
|
||||
}
|
||||
case W65816::ASLA16: {
|
||||
MCInst Asl;
|
||||
Asl.setOpcode(W65816::ASL_A);
|
||||
|
|
@ -275,6 +316,12 @@ void W65816AsmPrinter::emitInstruction(const MachineInstr *MI) {
|
|||
MCInst ror; ror.setOpcode(W65816::ROR_A); EmitToStreamer(*OutStreamer, ror);
|
||||
return;
|
||||
}
|
||||
case W65816::XBA16: {
|
||||
MCInst Xba;
|
||||
Xba.setOpcode(W65816::XBA);
|
||||
EmitToStreamer(*OutStreamer, Xba);
|
||||
return;
|
||||
}
|
||||
case W65816::INA_PSEUDO: {
|
||||
MCInst In;
|
||||
In.setOpcode(W65816::INA);
|
||||
|
|
@ -305,6 +352,112 @@ void W65816AsmPrinter::emitInstruction(const MachineInstr *MI) {
|
|||
EmitToStreamer(*OutStreamer, Inc);
|
||||
return;
|
||||
}
|
||||
case W65816::NEGC16: {
|
||||
// (subc 0, x) — lo half of multi-precision negate.
|
||||
// EOR #$FFFF; CLC; ADC #1. C-out = 1 iff result = 0 (i.e. x was 0),
|
||||
// matching SBC's "no borrow" convention.
|
||||
MCInst Eor;
|
||||
Eor.setOpcode(W65816::EOR_Imm16);
|
||||
Eor.addOperand(MCOperand::createImm(0xFFFF));
|
||||
EmitToStreamer(*OutStreamer, Eor);
|
||||
MCInst Clc;
|
||||
Clc.setOpcode(W65816::CLC);
|
||||
EmitToStreamer(*OutStreamer, Clc);
|
||||
MCInst Adc;
|
||||
Adc.setOpcode(W65816::ADC_Imm16);
|
||||
Adc.addOperand(MCOperand::createImm(1));
|
||||
EmitToStreamer(*OutStreamer, Adc);
|
||||
return;
|
||||
}
|
||||
case W65816::SRL15A: {
|
||||
// ASL A; LDA #0; ROL A — extract bit 15 to bit 0.
|
||||
MCInst Asl;
|
||||
Asl.setOpcode(W65816::ASL_A);
|
||||
EmitToStreamer(*OutStreamer, Asl);
|
||||
MCInst Lda;
|
||||
Lda.setOpcode(W65816::LDA_Imm16);
|
||||
Lda.addOperand(MCOperand::createImm(0));
|
||||
EmitToStreamer(*OutStreamer, Lda);
|
||||
MCInst Rol;
|
||||
Rol.setOpcode(W65816::ROL_A);
|
||||
EmitToStreamer(*OutStreamer, Rol);
|
||||
return;
|
||||
}
|
||||
case W65816::SHL15A: {
|
||||
// LSR A; LDA #0; ROR A — move bit 0 to bit 15.
|
||||
MCInst Lsr;
|
||||
Lsr.setOpcode(W65816::LSR_A);
|
||||
EmitToStreamer(*OutStreamer, Lsr);
|
||||
MCInst Lda;
|
||||
Lda.setOpcode(W65816::LDA_Imm16);
|
||||
Lda.addOperand(MCOperand::createImm(0));
|
||||
EmitToStreamer(*OutStreamer, Lda);
|
||||
MCInst Ror;
|
||||
Ror.setOpcode(W65816::ROR_A);
|
||||
EmitToStreamer(*OutStreamer, Ror);
|
||||
return;
|
||||
}
|
||||
case W65816::SRL8A: {
|
||||
// XBA; AND #$00FF — high byte to low byte, zero high.
|
||||
MCInst Xba;
|
||||
Xba.setOpcode(W65816::XBA);
|
||||
EmitToStreamer(*OutStreamer, Xba);
|
||||
MCInst And;
|
||||
And.setOpcode(W65816::AND_Imm16);
|
||||
And.addOperand(MCOperand::createImm(0x00FF));
|
||||
EmitToStreamer(*OutStreamer, And);
|
||||
return;
|
||||
}
|
||||
case W65816::SHL8A: {
|
||||
// XBA; AND #$FF00 — low byte to high byte, zero low.
|
||||
MCInst Xba;
|
||||
Xba.setOpcode(W65816::XBA);
|
||||
EmitToStreamer(*OutStreamer, Xba);
|
||||
MCInst And;
|
||||
And.setOpcode(W65816::AND_Imm16);
|
||||
And.addOperand(MCOperand::createImm(0xFF00));
|
||||
EmitToStreamer(*OutStreamer, And);
|
||||
return;
|
||||
}
|
||||
case W65816::SRA15A: {
|
||||
// ASL A; LDA #0; ADC #-1; EOR #-1 — sign-fill from bit 15.
|
||||
// ASL: C = bit 15 of input (the sign).
|
||||
// LDA #0: A = 0, C unchanged.
|
||||
// ADC #-1: A = 0 + (-1) + C = -1 + C. If C=1 (neg): A = 0; if
|
||||
// C=0 (pos): A = -1. Inverted from what we want.
|
||||
// EOR #-1: flip bits — A = -1 (neg) or 0 (pos), correct.
|
||||
MCInst Asl;
|
||||
Asl.setOpcode(W65816::ASL_A);
|
||||
EmitToStreamer(*OutStreamer, Asl);
|
||||
MCInst Lda;
|
||||
Lda.setOpcode(W65816::LDA_Imm16);
|
||||
Lda.addOperand(MCOperand::createImm(0));
|
||||
EmitToStreamer(*OutStreamer, Lda);
|
||||
MCInst Adc;
|
||||
Adc.setOpcode(W65816::ADC_Imm16);
|
||||
Adc.addOperand(MCOperand::createImm(0xFFFF));
|
||||
EmitToStreamer(*OutStreamer, Adc);
|
||||
MCInst Eor;
|
||||
Eor.setOpcode(W65816::EOR_Imm16);
|
||||
Eor.addOperand(MCOperand::createImm(0xFFFF));
|
||||
EmitToStreamer(*OutStreamer, Eor);
|
||||
return;
|
||||
}
|
||||
case W65816::NEGE16: {
|
||||
// (sube 0, x) — hi half of multi-precision negate.
|
||||
// EOR #$FFFF; ADC #0. Carry-in from the previous subc/sube is
|
||||
// already in P; ADC #0 propagates it as ~x + C, which matches
|
||||
// 0 - x - !C in two's complement.
|
||||
MCInst Eor;
|
||||
Eor.setOpcode(W65816::EOR_Imm16);
|
||||
Eor.addOperand(MCOperand::createImm(0xFFFF));
|
||||
EmitToStreamer(*OutStreamer, Eor);
|
||||
MCInst Adc;
|
||||
Adc.setOpcode(W65816::ADC_Imm16);
|
||||
Adc.addOperand(MCOperand::createImm(0));
|
||||
EmitToStreamer(*OutStreamer, Adc);
|
||||
return;
|
||||
}
|
||||
}
|
||||
|
||||
MCInst TmpInst;
|
||||
|
|
|
|||
|
|
@ -18,8 +18,10 @@
|
|||
def RetCC_W65816 : CallingConv<[
|
||||
// i8 values are returned in the 8-bit accumulator.
|
||||
CCIfType<[i8], CCAssignToReg<[A]>>,
|
||||
// i16 values are returned in the 16-bit accumulator (same physical reg).
|
||||
CCIfType<[i16], CCAssignToReg<[A]>>
|
||||
// i16 values are returned in A; for a split i32 (legalizer produces
|
||||
// two i16 returns), the second slot lands in X. LowerReturn /
|
||||
// LowerCall hardcode the same A,X order — keep them in sync.
|
||||
CCIfType<[i16], CCAssignToReg<[A, X]>>
|
||||
]>;
|
||||
|
||||
//===----------------------------------------------------------------------===//
|
||||
|
|
|
|||
|
|
@ -19,11 +19,52 @@
|
|||
#include "llvm/CodeGen/MachineFunction.h"
|
||||
#include "llvm/CodeGen/MachineInstrBuilder.h"
|
||||
#include "llvm/CodeGen/MachineRegisterInfo.h"
|
||||
#include "llvm/IR/Constants.h"
|
||||
#include "llvm/IR/Function.h"
|
||||
#include "llvm/IR/GlobalValue.h"
|
||||
#include "llvm/IR/InstrTypes.h"
|
||||
#include "llvm/IR/Instructions.h"
|
||||
#include "llvm/Support/ErrorHandling.h"
|
||||
|
||||
using namespace llvm;
|
||||
|
||||
// "Wide" = needs to live in a 16-bit register at some point during the
|
||||
// function body. i8 and i1 are fine in 8-bit M. Pointer operands that
|
||||
// are constant addresses (globals, externs) are fine too — they're
|
||||
// immediate operands of LDA/STA, not values held in A. A non-constant
|
||||
// pointer (function arg, computed value) does need to sit in A as 16
|
||||
// bits for stack-relative-indirect addressing.
|
||||
static bool isWideTyForMode(Type *T, const llvm::Value *V) {
|
||||
if (!T || T->isVoidTy()) return false;
|
||||
if (T->isIntegerTy(8) || T->isIntegerTy(1)) return false;
|
||||
if (T->isPointerTy() && V && (isa<GlobalValue>(V) || isa<Constant>(V)))
|
||||
return false;
|
||||
return true;
|
||||
}
|
||||
|
||||
// Some IR ops, even when their visible types are all i8, lower to
|
||||
// sequences that need 16-bit M during execution: signed compares (via
|
||||
// SEXT to i16 + cmp), variable shifts (libcall via i16-promoted args),
|
||||
// constant shifts > 4 (also routed through i16 via LowerShift), and
|
||||
// any sext of an i8 (synthesized as a SELECT_CC with i16 mask ops).
|
||||
// Detect those here so the prologue picks 16-bit M up front.
|
||||
static bool instrLowersToWide(const Instruction &I) {
|
||||
if (auto *Cmp = dyn_cast<ICmpInst>(&I)) {
|
||||
if (Cmp->isSigned() &&
|
||||
Cmp->getOperand(0)->getType()->isIntegerTy(8))
|
||||
return true;
|
||||
}
|
||||
if (isa<SExtInst>(&I) &&
|
||||
I.getOperand(0)->getType()->isIntegerTy(8))
|
||||
return true;
|
||||
unsigned Op = I.getOpcode();
|
||||
if ((Op == Instruction::Shl || Op == Instruction::LShr ||
|
||||
Op == Instruction::AShr) &&
|
||||
I.getType()->isIntegerTy(8))
|
||||
return true;
|
||||
return false;
|
||||
}
|
||||
|
||||
W65816FrameLowering::W65816FrameLowering(const W65816Subtarget &STI)
|
||||
: TargetFrameLowering(TargetFrameLowering::StackGrowsDown, Align(1), 0,
|
||||
Align(1)) {}
|
||||
|
|
@ -54,39 +95,33 @@ void W65816FrameLowering::emitPrologue(MachineFunction &MF,
|
|||
MachineBasicBlock::iterator MBBI = MBB.begin();
|
||||
DebugLoc DL;
|
||||
|
||||
// Heuristic: scan the function body for any value with i8 type.
|
||||
// Captures both signature types and internal i8 ops (e.g. a void
|
||||
// function that loads / stores bytes). An eventual full
|
||||
// mode-dependence analysis (the REP/SEP pass) will replace this.
|
||||
bool UsesAcc8 = false;
|
||||
// Heuristic: choose 8-bit M (REP #$10 + SEP #$20) only for "pure-i8"
|
||||
// functions — those whose signature and body use no type wider than
|
||||
// i8 (no i16 ops, no pointers). Any wider type forces 16-bit M
|
||||
// (REP #$30) since pointer dereferences and stack-relative addressing
|
||||
// need M=1 to load/store 16 bits at a time. In 16-bit M functions,
|
||||
// individual i8 ops are wrapped with SEP/REP at the pseudo level.
|
||||
// A future REP/SEP scheduling pass (design doc 3.3) will replace
|
||||
// this whole-function decision with a per-region one.
|
||||
const Function &F = MF.getFunction();
|
||||
auto isI8 = [](Type *T) { return T && T->isIntegerTy(8); };
|
||||
if (isI8(F.getReturnType()))
|
||||
UsesAcc8 = true;
|
||||
bool HasWide = isWideTyForMode(F.getReturnType(), nullptr);
|
||||
for (const Argument &Arg : F.args()) {
|
||||
if (isI8(Arg.getType())) {
|
||||
UsesAcc8 = true;
|
||||
break;
|
||||
}
|
||||
if (isWideTyForMode(Arg.getType(), &Arg)) { HasWide = true; break; }
|
||||
}
|
||||
if (!UsesAcc8) {
|
||||
if (!HasWide) {
|
||||
for (const BasicBlock &BB : F) {
|
||||
if (UsesAcc8) break;
|
||||
if (HasWide) break;
|
||||
for (const Instruction &I : BB) {
|
||||
if (isI8(I.getType())) {
|
||||
UsesAcc8 = true;
|
||||
break;
|
||||
}
|
||||
if (isWideTyForMode(I.getType(), &I)) { HasWide = true; break; }
|
||||
if (instrLowersToWide(I)) { HasWide = true; break; }
|
||||
for (const Value *Op : I.operands()) {
|
||||
if (isI8(Op->getType())) {
|
||||
UsesAcc8 = true;
|
||||
break;
|
||||
}
|
||||
if (isWideTyForMode(Op->getType(), Op)) { HasWide = true; break; }
|
||||
}
|
||||
if (UsesAcc8) break;
|
||||
if (HasWide) break;
|
||||
}
|
||||
}
|
||||
}
|
||||
bool UsesAcc8 = !HasWide;
|
||||
(void)MRI;
|
||||
|
||||
if (UsesAcc8) {
|
||||
|
|
@ -96,17 +131,47 @@ void W65816FrameLowering::emitPrologue(MachineFunction &MF,
|
|||
BuildMI(MBB, MBBI, DL, TII.get(W65816::REP)).addImm(0x30);
|
||||
}
|
||||
|
||||
// Reserve stack space for locals/spills if any. Sequence is
|
||||
// `TSC ; SEC ; SBC #N ; TCS` to subtract N from S in 16-bit mode.
|
||||
// Skipped for i8 functions for now since the stack adjustment uses
|
||||
// the 16-bit accumulator (would need a save/restore around it).
|
||||
// Reserve stack space for locals/spills.
|
||||
//
|
||||
// Critical: arg0 lives in A on entry, so the prologue MUST NOT
|
||||
// clobber A. The naive `TSC; SEC; SBC #N; TCS` sequence destroys A
|
||||
// (TSC overwrites A with SP) — used to silently corrupt arg0 in
|
||||
// every function with a stack frame, until this fix.
|
||||
//
|
||||
// Strategy (16-bit M):
|
||||
// - Small frames (N <= 14 bytes): use N/2 `PHA` instructions. PHA
|
||||
// pushes A's value (whatever it is — including arg0) and only
|
||||
// decrements S. A is not modified. N/2 bytes of code per call.
|
||||
// Side-effect: the bytes pushed contain copies of arg0; the body's
|
||||
// regalloc-inserted spills may overwrite them, which is fine.
|
||||
// - Larger frames: TAY/TSC/.../TYA — 8 bytes total, preserves A
|
||||
// through Y as a temporary. Y is caller-saved by our (loose) ABI.
|
||||
//
|
||||
// Strategy (8-bit M): PHA in 8-bit M pushes 1 byte, so N PHAs for
|
||||
// N bytes. Without this, spills land on top of the return address
|
||||
// and corrupt it (was a latent silent crash for 8-bit M functions
|
||||
// that needed any spilling).
|
||||
uint64_t StackSize = MF.getFrameInfo().getStackSize();
|
||||
if (StackSize > 0 && !UsesAcc8) {
|
||||
BuildMI(MBB, MBBI, DL, TII.get(W65816::TSC));
|
||||
BuildMI(MBB, MBBI, DL, TII.get(W65816::SEC));
|
||||
BuildMI(MBB, MBBI, DL, TII.get(W65816::SBC_Imm16))
|
||||
.addImm(StackSize);
|
||||
BuildMI(MBB, MBBI, DL, TII.get(W65816::TCS));
|
||||
if (StackSize > 0) {
|
||||
if (UsesAcc8) {
|
||||
// 8-bit M: 1 PHA per byte. Preserves A.
|
||||
for (uint64_t i = 0; i < StackSize; ++i)
|
||||
BuildMI(MBB, MBBI, DL, TII.get(W65816::PHA));
|
||||
} else if (StackSize <= 14 && (StackSize % 2) == 0) {
|
||||
// 16-bit M, small frame: N/2 PHAs. Preserves A.
|
||||
for (uint64_t i = 0; i < StackSize / 2; ++i)
|
||||
BuildMI(MBB, MBBI, DL, TII.get(W65816::PHA));
|
||||
} else {
|
||||
// 16-bit M, larger frame: TAY/TSC/.../TYA bracket. Preserves A
|
||||
// via Y as a temp.
|
||||
BuildMI(MBB, MBBI, DL, TII.get(W65816::TAY));
|
||||
BuildMI(MBB, MBBI, DL, TII.get(W65816::TSC));
|
||||
BuildMI(MBB, MBBI, DL, TII.get(W65816::SEC));
|
||||
BuildMI(MBB, MBBI, DL, TII.get(W65816::SBC_Imm16))
|
||||
.addImm(StackSize);
|
||||
BuildMI(MBB, MBBI, DL, TII.get(W65816::TCS));
|
||||
BuildMI(MBB, MBBI, DL, TII.get(W65816::TYA));
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
|
|
@ -124,25 +189,90 @@ void W65816FrameLowering::emitEpilogue(MachineFunction &MF,
|
|||
// Insert before the terminator (the return).
|
||||
DebugLoc DL = MBBI != MBB.end() ? MBBI->getDebugLoc() : DebugLoc();
|
||||
|
||||
// Mirror the prologue's pure-i8 detection: skip the 16-bit stack
|
||||
// adjustment only if the function ran in 8-bit M (no wide types
|
||||
// anywhere).
|
||||
const Function &F = MF.getFunction();
|
||||
bool UsesAcc8 = F.getReturnType()->isIntegerTy(8);
|
||||
if (!UsesAcc8) {
|
||||
bool HasWide = isWideTyForMode(F.getReturnType(), nullptr);
|
||||
if (!HasWide) {
|
||||
for (const Argument &Arg : F.args()) {
|
||||
if (Arg.getType()->isIntegerTy(8)) { UsesAcc8 = true; break; }
|
||||
if (isWideTyForMode(Arg.getType(), &Arg)) { HasWide = true; break; }
|
||||
}
|
||||
}
|
||||
if (UsesAcc8) return; // Cannot 16-bit math while in 8-bit mode.
|
||||
if (!HasWide) {
|
||||
for (const BasicBlock &BB : F) {
|
||||
if (HasWide) break;
|
||||
for (const Instruction &I : BB) {
|
||||
if (isWideTyForMode(I.getType(), &I)) { HasWide = true; break; }
|
||||
if (instrLowersToWide(I)) { HasWide = true; break; }
|
||||
for (const Value *Op : I.operands()) {
|
||||
if (isWideTyForMode(Op->getType(), Op)) { HasWide = true; break; }
|
||||
}
|
||||
if (HasWide) break;
|
||||
}
|
||||
}
|
||||
}
|
||||
// 8-bit M epilogue. Save A in Y(low) via TAY, pop N bytes via N
|
||||
// PLAs (each pops 1 byte in 8-bit M), restore A via TYA. Y is
|
||||
// caller-saved by our ABI so we can use it freely. Total cost:
|
||||
// N + 2 bytes per epilogue.
|
||||
if (!HasWide) {
|
||||
BuildMI(MBB, MBBI, DL, TII.get(W65816::TAY)); // save A in Y
|
||||
for (uint64_t i = 0; i < StackSize; ++i)
|
||||
BuildMI(MBB, MBBI, DL, TII.get(W65816::PLA)); // pop frame bytes
|
||||
BuildMI(MBB, MBBI, DL, TII.get(W65816::TYA)); // restore A from Y
|
||||
return;
|
||||
}
|
||||
|
||||
// 16-bit M epilogue. Mirror the prologue: A holds the return value
|
||||
// at this point and MUST be preserved. Small frames release via
|
||||
// N/2 PLY (pop into Y, discard); larger frames use
|
||||
// TAY/TSC/CLC/ADC #N/TCS/TYA.
|
||||
if (StackSize <= 14 && (StackSize % 2) == 0) {
|
||||
for (uint64_t i = 0; i < StackSize / 2; ++i)
|
||||
BuildMI(MBB, MBBI, DL, TII.get(W65816::PLY));
|
||||
return;
|
||||
}
|
||||
BuildMI(MBB, MBBI, DL, TII.get(W65816::TAY));
|
||||
BuildMI(MBB, MBBI, DL, TII.get(W65816::TSC));
|
||||
BuildMI(MBB, MBBI, DL, TII.get(W65816::CLC));
|
||||
BuildMI(MBB, MBBI, DL, TII.get(W65816::ADC_Imm16))
|
||||
.addImm(StackSize);
|
||||
BuildMI(MBB, MBBI, DL, TII.get(W65816::TCS));
|
||||
BuildMI(MBB, MBBI, DL, TII.get(W65816::TYA));
|
||||
}
|
||||
|
||||
MachineBasicBlock::iterator W65816FrameLowering::eliminateCallFramePseudoInstr(
|
||||
MachineFunction &MF, MachineBasicBlock &MBB,
|
||||
MachineBasicBlock::iterator I) const {
|
||||
// Drop ADJCALLSTACKDOWN/UP with no replacement for now.
|
||||
const W65816Subtarget &STI = MF.getSubtarget<W65816Subtarget>();
|
||||
const W65816InstrInfo &TII = *STI.getInstrInfo();
|
||||
|
||||
// ADJCALLSTACKDOWN does nothing — we push args via PUSH16/PHA which
|
||||
// implicitly decrements SP, so no separate adjustment is needed.
|
||||
// ADJCALLSTACKUP releases all the pushed bytes after a call.
|
||||
//
|
||||
// Critical: A holds the callee's return value here, so this MUST NOT
|
||||
// clobber A. The naive `tsc;clc;adc #N;tcs` does (TSC overwrites A),
|
||||
// which silently corrupts every call's return value. Same fix as the
|
||||
// epilogue: small N via PLY (clobbers Y, preserves A); larger N via
|
||||
// TAY/.../TYA bracket.
|
||||
if (I->getOpcode() == W65816::ADJCALLSTACKUP) {
|
||||
int N = I->getOperand(0).getImm();
|
||||
if (N > 0) {
|
||||
DebugLoc DL = I->getDebugLoc();
|
||||
if (N <= 14 && (N % 2) == 0) {
|
||||
for (int i = 0; i < N / 2; ++i)
|
||||
BuildMI(MBB, I, DL, TII.get(W65816::PLY));
|
||||
} else {
|
||||
BuildMI(MBB, I, DL, TII.get(W65816::TAY));
|
||||
BuildMI(MBB, I, DL, TII.get(W65816::TSC));
|
||||
BuildMI(MBB, I, DL, TII.get(W65816::CLC));
|
||||
BuildMI(MBB, I, DL, TII.get(W65816::ADC_Imm16)).addImm(N);
|
||||
BuildMI(MBB, I, DL, TII.get(W65816::TCS));
|
||||
BuildMI(MBB, I, DL, TII.get(W65816::TYA));
|
||||
}
|
||||
}
|
||||
}
|
||||
return MBB.erase(I);
|
||||
}
|
||||
|
|
|
|||
|
|
@ -71,17 +71,52 @@ void W65816DAGToDAGISel::Select(SDNode *Node) {
|
|||
return;
|
||||
}
|
||||
|
||||
// Defer to the auto-generated selector for everything else. Custom
|
||||
// selection paths (frame-index, wrapper, etc.) will land here later.
|
||||
// Custom selection: bare FrameIndex SDValue used as an i16 pointer
|
||||
// value (e.g. `&arr[0]` for a stack-allocated array). The
|
||||
// auto-generated selector has no pattern for `(i16 frameindex)`
|
||||
// because tablegen doesn't expose FrameIndex as a leaf type — so
|
||||
// ISel fails with "Cannot select: FrameIndex" before ever reaching
|
||||
// a load/store-context fold. Convert it to ADDframe (FI, 0); the
|
||||
// frame-index elimination pass turns ADDframe into TSC + CLC + ADC
|
||||
// #(offset+stackSize), producing SP+offset in A.
|
||||
if (Node->getOpcode() == ISD::FrameIndex) {
|
||||
SDLoc DL(Node);
|
||||
int FI = cast<FrameIndexSDNode>(Node)->getIndex();
|
||||
SDValue TFI = CurDAG->getTargetFrameIndex(FI, MVT::i16);
|
||||
SDValue Zero = CurDAG->getTargetConstant(0, DL, MVT::i16);
|
||||
CurDAG->SelectNodeTo(Node, W65816::ADDframe, MVT::i16, TFI, Zero);
|
||||
return;
|
||||
}
|
||||
|
||||
// Defer to the auto-generated selector for everything else.
|
||||
SelectCode(Node);
|
||||
}
|
||||
|
||||
bool W65816DAGToDAGISel::SelectFrameIndex(SDValue N, SDValue &Base,
|
||||
SDValue &Offset) {
|
||||
// Bare FrameIndex: offset 0.
|
||||
if (auto *FIN = dyn_cast<FrameIndexSDNode>(N)) {
|
||||
Base = CurDAG->getTargetFrameIndex(FIN->getIndex(), MVT::i16);
|
||||
Offset = CurDAG->getTargetConstant(0, SDLoc(N), MVT::i16);
|
||||
return true;
|
||||
}
|
||||
// (add FrameIndex, const): fold the const into the memfi offset.
|
||||
// Type legalization emits this shape when splitting a multi-byte
|
||||
// load/store at a stack slot into multiple smaller loads (e.g. an
|
||||
// i32 spill becomes two i16 loads, with the high load at FI+2).
|
||||
// Without this, the bare FrameIndex inside the add is left as an
|
||||
// unmatched i16 leaf and ISel reports "Cannot select FrameIndex".
|
||||
if (N.getOpcode() == ISD::ADD) {
|
||||
SDValue LHS = N.getOperand(0);
|
||||
SDValue RHS = N.getOperand(1);
|
||||
if (auto *FIN = dyn_cast<FrameIndexSDNode>(LHS)) {
|
||||
if (auto *CN = dyn_cast<ConstantSDNode>(RHS)) {
|
||||
Base = CurDAG->getTargetFrameIndex(FIN->getIndex(), MVT::i16);
|
||||
Offset = CurDAG->getTargetConstant(CN->getSExtValue(),
|
||||
SDLoc(N), MVT::i16);
|
||||
return true;
|
||||
}
|
||||
}
|
||||
}
|
||||
return false;
|
||||
}
|
||||
|
|
|
|||
File diff suppressed because it is too large
Load diff
|
|
@ -46,6 +46,10 @@ public:
|
|||
|
||||
SDValue LowerOperation(SDValue Op, SelectionDAG &DAG) const override;
|
||||
|
||||
MachineBasicBlock *
|
||||
EmitInstrWithCustomInserter(MachineInstr &MI,
|
||||
MachineBasicBlock *MBB) const override;
|
||||
|
||||
// The 65816 has no alignment requirement on memory access — any
|
||||
// address is fine. Telling LLVM this lets it emit single 16-bit
|
||||
// loads/stores even when the IR alignment is 1, instead of
|
||||
|
|
@ -59,10 +63,47 @@ public:
|
|||
return true;
|
||||
}
|
||||
|
||||
// Disable LLVM's magic-constant expansion of sdiv/srem by power-of-2.
|
||||
// The default expansion generates BUILD_VECTOR (used as a "splat shifter"
|
||||
// intermediate) which we can't lower; without an override, every sdiv/srem
|
||||
// by a pow2 constant crashes ISel. Returning the original node leaves it
|
||||
// intact for the libcall lowering path (SDIV/SREM are LibCall in our
|
||||
// ctor — see setOperationAction calls above).
|
||||
SDValue BuildSDIVPow2(SDNode *N, const APInt &Divisor,
|
||||
SelectionDAG &DAG,
|
||||
SmallVectorImpl<SDNode *> &Created) const override {
|
||||
return SDValue(N, 0);
|
||||
}
|
||||
SDValue BuildSREMPow2(SDNode *N, const APInt &Divisor,
|
||||
SelectionDAG &DAG,
|
||||
SmallVectorImpl<SDNode *> &Created) const override {
|
||||
return SDValue(N, 0);
|
||||
}
|
||||
|
||||
SDValue PerformDAGCombine(SDNode *N, DAGCombinerInfo &DCI) const override;
|
||||
|
||||
// Force i32 / i64 shifts through a libcall (__ashlsi3 / __lshrsi3 /
|
||||
// __ashrsi3) instead of LLVM's default ExpandToParts strategy, which
|
||||
// emits an SHL_PARTS node we have no pattern for. ExpandToParts also
|
||||
// produces a long select-based sequence; the libcall is both smaller
|
||||
// and matches our existing libcall-based approach for i16 mul/div.
|
||||
ShiftLegalizationStrategy
|
||||
preferredShiftLegalizationStrategy(SelectionDAG &DAG, SDNode *N,
|
||||
unsigned ExpansionFactor) const override {
|
||||
if (N->getValueType(0).getSizeInBits() > 16)
|
||||
return ShiftLegalizationStrategy::LowerToLibcall;
|
||||
return TargetLowering::preferredShiftLegalizationStrategy(DAG, N,
|
||||
ExpansionFactor);
|
||||
}
|
||||
|
||||
private:
|
||||
SDValue LowerGlobalAddress(SDValue Op, SelectionDAG &DAG) const;
|
||||
SDValue LowerExternalSymbol(SDValue Op, SelectionDAG &DAG) const;
|
||||
SDValue LowerBR_CC(SDValue Op, SelectionDAG &DAG) const;
|
||||
SDValue LowerSETCC(SDValue Op, SelectionDAG &DAG) const;
|
||||
SDValue LowerSELECT_CC(SDValue Op, SelectionDAG &DAG) const;
|
||||
SDValue LowerSignExtend(SDValue Op, SelectionDAG &DAG) const;
|
||||
SDValue LowerShift(SDValue Op, SelectionDAG &DAG) const;
|
||||
};
|
||||
|
||||
} // namespace llvm
|
||||
|
|
|
|||
|
|
@ -258,6 +258,23 @@ class InstStackRel<bits<8> op, string mnem>
|
|||
let Inst{15-8} = off;
|
||||
}
|
||||
|
||||
// Stack-relative indirect indexed-Y: `LDA (off,S),Y`. Reads the 16-bit
|
||||
// pointer stored at S+off, adds Y, then loads from that address. Used
|
||||
// to dereference pointers spilled to a stack scratch slot — the only
|
||||
// way the 65816 can deref a pointer not already in zero page.
|
||||
// isCodeGenOnly because the asm-parser doesn't accept `(d,S),Y` syntax
|
||||
// today; codegen builds these MIs directly.
|
||||
class InstStackRelIndY<bits<8> op, string mnem>
|
||||
: W65816Inst<(outs), (ins addrDP:$off),
|
||||
!strconcat(mnem, "\t($off, s), y")> {
|
||||
let Size = 2;
|
||||
bits<8> off;
|
||||
bits<16> Inst;
|
||||
let Inst{7-0} = op;
|
||||
let Inst{15-8} = off;
|
||||
let isCodeGenOnly = 1;
|
||||
}
|
||||
|
||||
class InstPCRel8<bits<8> op, string mnem>
|
||||
: W65816Inst<(outs), (ins pcrel8:$dest), !strconcat(mnem, "\t$dest")> {
|
||||
let Size = 2;
|
||||
|
|
|
|||
|
|
@ -14,6 +14,7 @@
|
|||
#include "W65816InstrInfo.h"
|
||||
#include "W65816.h"
|
||||
#include "W65816Subtarget.h"
|
||||
#include "llvm/CodeGen/MachineFrameInfo.h"
|
||||
#include "llvm/CodeGen/MachineInstrBuilder.h"
|
||||
#include "llvm/Support/ErrorHandling.h"
|
||||
|
||||
|
|
@ -34,13 +35,28 @@ void W65816InstrInfo::copyPhysReg(MachineBasicBlock &MBB,
|
|||
const DebugLoc &DL, Register DestReg,
|
||||
Register SrcReg, bool KillSrc,
|
||||
bool RenamableDest, bool RenamableSrc) const {
|
||||
// The only Acc16 register is A; copies between A and itself are no-ops.
|
||||
// Cross-class copies (e.g. A → X) need TAX/TXA pairs which we don't
|
||||
// need yet — bail loudly so we notice when the time comes.
|
||||
if (DestReg == SrcReg)
|
||||
return;
|
||||
if (DestReg == W65816::A && SrcReg == W65816::A)
|
||||
// A → X / X → A via TAX / TXA. Used by i32 return ABI (lo in A, hi
|
||||
// in X) and by callers reading split-i32 results. Both instructions
|
||||
// are 16-bit when M=0/X=0; that matches our default mode.
|
||||
if (DestReg == W65816::X && SrcReg == W65816::A) {
|
||||
BuildMI(MBB, I, DL, get(W65816::TAX));
|
||||
return;
|
||||
}
|
||||
if (DestReg == W65816::A && SrcReg == W65816::X) {
|
||||
BuildMI(MBB, I, DL, get(W65816::TXA));
|
||||
return;
|
||||
}
|
||||
// A → Y / Y → A via TAY / TYA. Same M/X width caveat.
|
||||
if (DestReg == W65816::Y && SrcReg == W65816::A) {
|
||||
BuildMI(MBB, I, DL, get(W65816::TAY));
|
||||
return;
|
||||
}
|
||||
if (DestReg == W65816::A && SrcReg == W65816::Y) {
|
||||
BuildMI(MBB, I, DL, get(W65816::TYA));
|
||||
return;
|
||||
}
|
||||
llvm_unreachable("W65816: cross-class copyPhysReg not yet implemented");
|
||||
}
|
||||
|
||||
|
|
@ -71,3 +87,50 @@ void W65816InstrInfo::loadRegFromStackSlot(MachineBasicBlock &MBB,
|
|||
.addFrameIndex(FrameIdx)
|
||||
.addImm(0);
|
||||
}
|
||||
|
||||
Register W65816InstrInfo::isLoadFromStackSlot(const MachineInstr &MI,
|
||||
int &FrameIndex) const {
|
||||
if (MI.getOpcode() != W65816::LDAfi)
|
||||
return 0;
|
||||
// memfi packs (FrameIndex, offset). Treat only offset==0 as a true
|
||||
// stack-slot load — non-zero offset means we're addressing within
|
||||
// the slot (e.g. the high half of an i32 spill), which the generic
|
||||
// peephole/CSE machinery doesn't model.
|
||||
if (MI.getNumOperands() < 3 || !MI.getOperand(1).isFI() ||
|
||||
!MI.getOperand(2).isImm() || MI.getOperand(2).getImm() != 0)
|
||||
return 0;
|
||||
FrameIndex = MI.getOperand(1).getIndex();
|
||||
return MI.getOperand(0).getReg();
|
||||
}
|
||||
|
||||
Register W65816InstrInfo::isStoreToStackSlot(const MachineInstr &MI,
|
||||
int &FrameIndex) const {
|
||||
if (MI.getOpcode() != W65816::STAfi)
|
||||
return 0;
|
||||
// STAfi: (ins Acc16:$src, memfi:$addr) — op0 is src reg, op1 is
|
||||
// FrameIndex, op2 is offset.
|
||||
if (MI.getNumOperands() < 3 || !MI.getOperand(1).isFI() ||
|
||||
!MI.getOperand(2).isImm() || MI.getOperand(2).getImm() != 0)
|
||||
return 0;
|
||||
FrameIndex = MI.getOperand(1).getIndex();
|
||||
return MI.getOperand(0).getReg();
|
||||
}
|
||||
|
||||
bool W65816InstrInfo::isReMaterializableImpl(const MachineInstr &MI) const {
|
||||
// Only LDAfi is gated on this hook. We declare it
|
||||
// isReMaterializable=1 in tablegen so the framework will *consider*
|
||||
// re-emitting it instead of spilling, then call back here to confirm.
|
||||
// The instruction is safely rematerializable iff it loads from a
|
||||
// *fixed* (immutable) frame index — i.e. an arg slot. Loads from a
|
||||
// regular spill slot read a computed value that may not be available
|
||||
// at the rematerialization point.
|
||||
if (MI.getOpcode() != W65816::LDAfi)
|
||||
return TargetInstrInfo::isReMaterializableImpl(MI);
|
||||
|
||||
// Operand 1 is the FrameIndex (operand 0 is the def).
|
||||
const MachineOperand &FIOp = MI.getOperand(1);
|
||||
if (!FIOp.isFI())
|
||||
return false;
|
||||
const MachineFrameInfo &MFI = MI.getMF()->getFrameInfo();
|
||||
return MFI.isFixedObjectIndex(FIOp.getIndex());
|
||||
}
|
||||
|
|
|
|||
|
|
@ -46,6 +46,29 @@ public:
|
|||
int FrameIdx, const TargetRegisterClass *RC, Register VReg,
|
||||
unsigned SubReg = 0,
|
||||
MachineInstr::MIFlag Flags = MachineInstr::NoFlags) const override;
|
||||
|
||||
// Override the default rematerializability check to recognise LDAfi
|
||||
// from a *fixed* (immutable) frame index — i.e. an arg slot — as
|
||||
// trivially rematerializable. Without this, the greedy allocator
|
||||
// spills arg loads to a fresh local slot the moment A is needed for
|
||||
// anything else, then reloads from the local slot at every use.
|
||||
// With it, the allocator just re-emits `LDA arg_slot,S` at each use
|
||||
// and the `STA local; LDA local; LDA local` cluster collapses to a
|
||||
// single `LDA arg_slot,S`. Spill-slot LDAfi (regular FI) is *not*
|
||||
// rematerializable — that loads a computed value.
|
||||
bool isReMaterializableImpl(const MachineInstr &MI) const override;
|
||||
|
||||
// Tell the framework which pseudos are direct stack-slot loads/stores.
|
||||
// MachineCSE, machine-licm, and peephole-opt use these hooks to elide
|
||||
// redundant store/load pairs and to hoist invariants. Without them,
|
||||
// patterns like `STAfi A, slot; LDAfi slot, A` (introduced by the
|
||||
// greedy allocator's COPY-of-physreg expansion) survive into final
|
||||
// asm as `sta x,s; lda x,s` no-op pairs.
|
||||
Register isLoadFromStackSlot(const MachineInstr &MI,
|
||||
int &FrameIndex) const override;
|
||||
Register isStoreToStackSlot(const MachineInstr &MI,
|
||||
int &FrameIndex) const override;
|
||||
|
||||
};
|
||||
|
||||
} // namespace llvm
|
||||
|
|
|
|||
|
|
@ -54,6 +54,31 @@ def W65816cmp : SDNode<"W65816ISD::CMP", SDT_W65816Cmp, [SDNPOutGlue]>;
|
|||
def W65816brcc : SDNode<"W65816ISD::BR_CC", SDT_W65816BrCC,
|
||||
[SDNPHasChain, SDNPInGlue]>;
|
||||
|
||||
// Push A onto the stack. Used by LowerCall to pass extra args.
|
||||
// Takes Chain + Glue (with A pre-loaded via CopyToReg), produces
|
||||
// Chain + Glue. Has a side effect (SP changes) and stores to
|
||||
// memory. In 16-bit M mode, pushes 2 bytes and decrements SP by 2;
|
||||
// the call's ADJCALLSTACKUP pseudo unwinds those bytes via
|
||||
// tsc;clc;adc #N;tcs after the JSL returns.
|
||||
def W65816push : SDNode<"W65816ISD::PUSH", SDTNone,
|
||||
[SDNPHasChain, SDNPInGlue, SDNPOutGlue,
|
||||
SDNPSideEffect, SDNPMayStore]>;
|
||||
|
||||
// Push X onto the stack. Same shape as W65816push but the value to
|
||||
// push is glued from CopyToReg(X) instead of CopyToReg(A).
|
||||
def W65816pushx : SDNode<"W65816ISD::PUSH_X", SDTNone,
|
||||
[SDNPHasChain, SDNPInGlue, SDNPOutGlue,
|
||||
SDNPSideEffect, SDNPMayStore]>;
|
||||
|
||||
// SELECT_CC: takes (TVal, FVal, CC) plus a glue value carrying the
|
||||
// flags from a preceding W65816cmp. Lowered by EmitInstrWithCustomInserter
|
||||
// into a CMP (already in the BB) + Bxx + diamond CFG + PHI.
|
||||
def SDT_W65816SelectCC : SDTypeProfile<1, 3, [SDTCisSameAs<0, 1>,
|
||||
SDTCisSameAs<0, 2>,
|
||||
SDTCisVT<3, i8>]>;
|
||||
def W65816selectcc : SDNode<"W65816ISD::SELECT_CC", SDT_W65816SelectCC,
|
||||
[SDNPInGlue]>;
|
||||
|
||||
//===----------------------------------------------------------------------===//
|
||||
// Pseudo Instructions
|
||||
//===----------------------------------------------------------------------===//
|
||||
|
|
@ -71,14 +96,51 @@ def ADJCALLSTACKUP : W65816Pseudo<(outs),
|
|||
timm:$amt2)]>;
|
||||
}
|
||||
|
||||
let isReMaterializable = 1 in
|
||||
def ADDframe : W65816Pseudo<(outs PtrRegs:$dst),
|
||||
// LEA-equivalent: compute the address (SP + frame_offset + offset) of a
|
||||
// stack slot and place it in A. Selected from a bare ISD::FrameIndex
|
||||
// SDValue in W65816DAGToDAGISel::Select; expanded by eliminateFrameIndex
|
||||
// into TSC + CLC + ADC #disp. Output is Acc16 because the address ends
|
||||
// up in A; PtrRegs (which only contains SP) is the wrong class.
|
||||
let isReMaterializable = 1, hasSideEffects = 0,
|
||||
mayLoad = 0, mayStore = 0 in
|
||||
def ADDframe : W65816Pseudo<(outs Acc16:$dst),
|
||||
(ins i16imm:$base, i16imm:$offset),
|
||||
"# ADDframe PSEUDO", []>;
|
||||
|
||||
// The retglue node lowers directly to RTL (see Returns section below).
|
||||
// No separate RET pseudo — the real MC instruction handles the pattern.
|
||||
|
||||
// Push A onto the stack. Expanded in AsmPrinter to MC `PHA`. Used by
|
||||
// LowerCall to pass extra args; the matching `tsc;clc;adc #N;tcs` SP
|
||||
// unwind happens in eliminateCallFramePseudoInstr for ADJCALLSTACKUP.
|
||||
let Defs = [SP], Uses = [A, SP], mayStore = 1, hasSideEffects = 1 in {
|
||||
def PUSH16 : W65816Pseudo<(outs), (ins), "# PUSH16",
|
||||
[(W65816push)]>;
|
||||
}
|
||||
// Push X onto the stack. Used by LowerCall when an outgoing arg's
|
||||
// SDValue is already in X (e.g. forwarding the i32-first-arg-in-A:X
|
||||
// hi half). Saves a TXA+spill round-trip. Expansion: PHX.
|
||||
let Defs = [SP], Uses = [X, SP], mayStore = 1, hasSideEffects = 1 in {
|
||||
def PUSH16X : W65816Pseudo<(outs), (ins), "# PUSH16X",
|
||||
[(W65816pushx)]>;
|
||||
}
|
||||
|
||||
// SELECT_CC16: implements (set Acc16:$dst, (W65816selectcc tval, fval, cc))
|
||||
// where the CMP that produced the flags has already been emitted (its
|
||||
// glue is implicit via the P register). EmitInstrWithCustomInserter
|
||||
// expands this into a Bxx + 2 BBs + PHI. Marked usesCustomInserter so
|
||||
// the codegen invokes our hook; Uses=[P] so MachineSched keeps the CMP
|
||||
// adjacent.
|
||||
let usesCustomInserter = 1, Uses = [P], hasSideEffects = 1 in {
|
||||
def SELECT_CC16 : W65816Pseudo<(outs Acc16:$dst),
|
||||
(ins Acc16:$tval, Acc16:$fval, i8imm:$cc),
|
||||
"# SELECT_CC16 $dst, $tval, $fval, $cc",
|
||||
[(set Acc16:$dst,
|
||||
(W65816selectcc Acc16:$tval,
|
||||
Acc16:$fval,
|
||||
timm:$cc))]>;
|
||||
}
|
||||
|
||||
//===----------------------------------------------------------------------===//
|
||||
// Codegen pseudos that expand to MC instructions in the AsmPrinter.
|
||||
//
|
||||
|
|
@ -94,6 +156,15 @@ let isAsCheapAsAMove = 1, isReMaterializable = 1,
|
|||
def LDAi16imm : W65816Pseudo<(outs Acc16:$dst), (ins i16imm:$imm),
|
||||
"# LDAi16imm $dst, $imm",
|
||||
[(set Acc16:$dst, (i16 imm:$imm))]>;
|
||||
// Materialise an i16 constant directly in X (Idx16). Useful when the
|
||||
// constant's only consumer is `CopyToReg($x)` — saves an LDA+TAX
|
||||
// round-trip (and the A-clobber that round-trip implies). Common for
|
||||
// the high half of `(zext i16 to i32)` returns, where hi=const-zero.
|
||||
let isReMaterializable = 1, isAsCheapAsAMove = 1, hasSideEffects = 0,
|
||||
mayLoad = 0, mayStore = 0 in
|
||||
def LDXi16imm : W65816Pseudo<(outs Idx16:$dst), (ins i16imm:$imm),
|
||||
"# LDXi16imm $dst, $imm",
|
||||
[(set Idx16:$dst, (i16 imm:$imm))]>;
|
||||
def LDAi8imm : W65816Pseudo<(outs Acc8:$dst), (ins i8imm:$imm),
|
||||
"# LDAi8imm $dst, $imm",
|
||||
[(set Acc8:$dst, (i8 imm:$imm))]>;
|
||||
|
|
@ -177,8 +248,13 @@ def : Pat<(store Acc16:$src, (W65816Wrapper texternalsym:$s)),
|
|||
// source and dest to A — there is only one Acc16 register so this is
|
||||
// implicit, but stating it lets the register allocator coalesce
|
||||
// without needing a COPY.
|
||||
//
|
||||
// Defs = [P] models the C-flag side-effect. Required so tablegen can
|
||||
// connect this instruction to the SDNode `addc` / `subc` (SDNPOutGlue),
|
||||
// which is what the type legalizer emits as the lo half of a multi-
|
||||
// precision add/sub when ADDC/SUBC is Legal (see W65816ISelLowering ctor).
|
||||
let Constraints = "$src = $dst",
|
||||
hasSideEffects = 0, mayLoad = 0, mayStore = 0 in {
|
||||
hasSideEffects = 0, mayLoad = 0, mayStore = 0, Defs = [P] in {
|
||||
def ADCi16imm : W65816Pseudo<(outs Acc16:$dst),
|
||||
(ins Acc16:$src, i16imm:$imm),
|
||||
"# ADCi16imm $dst, $src, $imm",
|
||||
|
|
@ -191,10 +267,19 @@ def SBCi16imm : W65816Pseudo<(outs Acc16:$dst),
|
|||
(sub Acc16:$src, imm:$imm))]>;
|
||||
}
|
||||
|
||||
// addc/subc: same as add/sub on this target (CLC then ADC, SEC then SBC),
|
||||
// but the SDNode produces a Glue carrying the post-op carry into a
|
||||
// subsequent adde/sube. Tablegen wires the Glue to the P register
|
||||
// because the instruction has Defs = [P].
|
||||
def : Pat<(addc Acc16:$src, imm:$imm),
|
||||
(ADCi16imm Acc16:$src, imm:$imm)>;
|
||||
def : Pat<(subc Acc16:$src, imm:$imm),
|
||||
(SBCi16imm Acc16:$src, imm:$imm)>;
|
||||
|
||||
// ADC/SBC from a 16-bit absolute address. Folds a load on the
|
||||
// right-hand side of an add/sub into the carry-arithmetic op.
|
||||
let Constraints = "$src = $dst",
|
||||
hasSideEffects = 0, mayLoad = 1, mayStore = 0 in {
|
||||
hasSideEffects = 0, mayLoad = 1, mayStore = 0, Defs = [P] in {
|
||||
def ADCabs : W65816Pseudo<(outs Acc16:$dst),
|
||||
(ins Acc16:$src, i32imm:$addr),
|
||||
"# ADCabs $dst, $src, $addr", []>;
|
||||
|
|
@ -214,6 +299,61 @@ def : Pat<(sub Acc16:$src,
|
|||
def : Pat<(sub Acc16:$src,
|
||||
(i16 (load (W65816Wrapper texternalsym:$s)))),
|
||||
(SBCabs Acc16:$src, texternalsym:$s)>;
|
||||
def : Pat<(addc Acc16:$src,
|
||||
(i16 (load (W65816Wrapper tglobaladdr:$g)))),
|
||||
(ADCabs Acc16:$src, tglobaladdr:$g)>;
|
||||
def : Pat<(addc Acc16:$src,
|
||||
(i16 (load (W65816Wrapper texternalsym:$s)))),
|
||||
(ADCabs Acc16:$src, texternalsym:$s)>;
|
||||
def : Pat<(subc Acc16:$src,
|
||||
(i16 (load (W65816Wrapper tglobaladdr:$g)))),
|
||||
(SBCabs Acc16:$src, tglobaladdr:$g)>;
|
||||
def : Pat<(subc Acc16:$src,
|
||||
(i16 (load (W65816Wrapper texternalsym:$s)))),
|
||||
(SBCabs Acc16:$src, texternalsym:$s)>;
|
||||
|
||||
// adde/sube: the chained ADC/SBC for the hi half of a multi-precision
|
||||
// add/sub. Reads the C flag from the previous addc/adde (Uses = [P]),
|
||||
// produces a fresh carry/borrow (Defs = [P]). AsmPrinter expansion
|
||||
// emits a bare ADC/SBC with no preceding CLC/SEC; eliminateFrameIndex
|
||||
// for ADCEfi/SBCEfi skips the carry-prefix step that the standalone
|
||||
// ADCfi/SBCfi rely on.
|
||||
let Constraints = "$src = $dst",
|
||||
hasSideEffects = 0, mayLoad = 0, mayStore = 0,
|
||||
Uses = [P], Defs = [P] in {
|
||||
def ADCEi16imm : W65816Pseudo<(outs Acc16:$dst),
|
||||
(ins Acc16:$src, i16imm:$imm),
|
||||
"# ADCEi16imm $dst, $src, $imm",
|
||||
[(set Acc16:$dst,
|
||||
(adde Acc16:$src, imm:$imm))]>;
|
||||
def SBCEi16imm : W65816Pseudo<(outs Acc16:$dst),
|
||||
(ins Acc16:$src, i16imm:$imm),
|
||||
"# SBCEi16imm $dst, $src, $imm",
|
||||
[(set Acc16:$dst,
|
||||
(sube Acc16:$src, imm:$imm))]>;
|
||||
}
|
||||
let Constraints = "$src = $dst",
|
||||
hasSideEffects = 0, mayLoad = 1, mayStore = 0,
|
||||
Uses = [P], Defs = [P] in {
|
||||
def ADCEabs : W65816Pseudo<(outs Acc16:$dst),
|
||||
(ins Acc16:$src, i32imm:$addr),
|
||||
"# ADCEabs $dst, $src, $addr", []>;
|
||||
def SBCEabs : W65816Pseudo<(outs Acc16:$dst),
|
||||
(ins Acc16:$src, i32imm:$addr),
|
||||
"# SBCEabs $dst, $src, $addr", []>;
|
||||
}
|
||||
def : Pat<(adde Acc16:$src,
|
||||
(i16 (load (W65816Wrapper tglobaladdr:$g)))),
|
||||
(ADCEabs Acc16:$src, tglobaladdr:$g)>;
|
||||
def : Pat<(adde Acc16:$src,
|
||||
(i16 (load (W65816Wrapper texternalsym:$s)))),
|
||||
(ADCEabs Acc16:$src, texternalsym:$s)>;
|
||||
def : Pat<(sube Acc16:$src,
|
||||
(i16 (load (W65816Wrapper tglobaladdr:$g)))),
|
||||
(SBCEabs Acc16:$src, tglobaladdr:$g)>;
|
||||
def : Pat<(sube Acc16:$src,
|
||||
(i16 (load (W65816Wrapper texternalsym:$s)))),
|
||||
(SBCEabs Acc16:$src, texternalsym:$s)>;
|
||||
|
||||
// (add Acc16, Acc16) — same value added to itself, equivalent to a 1-bit
|
||||
// left shift. Pattern needs a tied input so the result lands in A.
|
||||
|
|
@ -293,6 +433,27 @@ def NEGA16 : W65816Pseudo<(outs Acc16:$dst), (ins Acc16:$src),
|
|||
[(set Acc16:$dst, (sub (i16 0), Acc16:$src))]>;
|
||||
}
|
||||
|
||||
// Multi-precision negation: lo + hi halves of `-x` where x is i32.
|
||||
// LLVM splits `0 - x` into `(subc 0, x_lo)` and `(sube 0, x_hi)`.
|
||||
// We implement both via the ADD chain `~x + carry` since INC doesn't
|
||||
// touch C; the bit pattern of C from `~x + 1` matches what `subc 0, x`
|
||||
// would set (C=1 iff x was 0, i.e. no borrow).
|
||||
// NEGC16 matches subc → "EOR #$FFFF; CLC; ADC #1" (5 bytes)
|
||||
// NEGE16 matches sube → "EOR #$FFFF; ADC #0" (4 bytes, uses C-in)
|
||||
let Constraints = "$src = $dst",
|
||||
hasSideEffects = 0, mayLoad = 0, mayStore = 0, Defs = [P] in {
|
||||
def NEGC16 : W65816Pseudo<(outs Acc16:$dst), (ins Acc16:$src),
|
||||
"# NEGC16 $dst, $src",
|
||||
[(set Acc16:$dst, (subc (i16 0), Acc16:$src))]>;
|
||||
}
|
||||
let Constraints = "$src = $dst",
|
||||
hasSideEffects = 0, mayLoad = 0, mayStore = 0,
|
||||
Uses = [P], Defs = [P] in {
|
||||
def NEGE16 : W65816Pseudo<(outs Acc16:$dst), (ins Acc16:$src),
|
||||
"# NEGE16 $dst, $src",
|
||||
[(set Acc16:$dst, (sube (i16 0), Acc16:$src))]>;
|
||||
}
|
||||
|
||||
// Bitwise NOT pattern moved below EORi16imm definition.
|
||||
|
||||
// 16-bit bitwise ops: AND / OR / XOR against an immediate or memory
|
||||
|
|
@ -340,6 +501,71 @@ def : Pat<(xor Acc16:$src, (i16 (load (W65816Wrapper tglobaladdr:$g)))),
|
|||
def : Pat<(xor Acc16:$src, (i16 -1)),
|
||||
(EORi16imm Acc16:$src, 0xFFFF)>;
|
||||
|
||||
// (srl x, 15): extract bit 15 to bit 0 (yields 0 or 1). The
|
||||
// type-legalizer's SHL_PARTS expansion of `i32 << 1` needs this for
|
||||
// the high-half "carry from low" slot, and routing it through the
|
||||
// __lshrhi3 libcall costs ~10 bytes per i32 shift-by-1. Inline as
|
||||
// `ASL A; LDA #0; ROL A` (3 bytes): ASL puts bit 15 into C and
|
||||
// trashes A; LDA #0 doesn't touch C; ROL A folds C into bit 0.
|
||||
//
|
||||
// (shl x, 15): move bit 0 to bit 15 (yields 0 or 0x8000). Used by
|
||||
// SRL_PARTS / SRA_PARTS expansion of `i32 >> 1` for the low-half
|
||||
// "carry from hi" slot. Mirror sequence: `LSR A; LDA #0; ROR A`.
|
||||
let Constraints = "$src = $dst",
|
||||
hasSideEffects = 0, mayLoad = 0, mayStore = 0, Defs = [P] in {
|
||||
def SRL15A : W65816Pseudo<(outs Acc16:$dst), (ins Acc16:$src),
|
||||
"# SRL15A $dst, $src",
|
||||
[(set Acc16:$dst, (srl Acc16:$src, (i16 15)))]>;
|
||||
def SHL15A : W65816Pseudo<(outs Acc16:$dst), (ins Acc16:$src),
|
||||
"# SHL15A $dst, $src",
|
||||
[(set Acc16:$dst, (shl Acc16:$src, (i16 15)))]>;
|
||||
}
|
||||
// (srl x, 8): high byte to low byte, zero high byte. XBA swaps the
|
||||
// two bytes of A (in 16-bit M); AND #$00FF clears the new high byte.
|
||||
// 4 bytes total — much shorter than the __lshrhi3 libcall path. Used
|
||||
// by i32 shift-by-8 SHL_PARTS expansion for the cross-half slot.
|
||||
//
|
||||
// (shl x, 8): low byte to high byte, zero low byte. Mirror.
|
||||
let Constraints = "$src = $dst",
|
||||
hasSideEffects = 0, mayLoad = 0, mayStore = 0 in {
|
||||
def SRL8A : W65816Pseudo<(outs Acc16:$dst), (ins Acc16:$src),
|
||||
"# SRL8A $dst, $src",
|
||||
[(set Acc16:$dst, (srl Acc16:$src, (i16 8)))]>;
|
||||
def SHL8A : W65816Pseudo<(outs Acc16:$dst), (ins Acc16:$src),
|
||||
"# SHL8A $dst, $src",
|
||||
[(set Acc16:$dst, (shl Acc16:$src, (i16 8)))]>;
|
||||
}
|
||||
// (sra x, 15): sign-fill — yields $0000 if x is non-negative, $FFFF
|
||||
// if negative. Used by i32 sext-from-i16 type-legalization for the
|
||||
// hi half (avoids the __ashrhi3 libcall path). Sequence:
|
||||
// `ASL A; LDA #0; SBC #0; EOR #-1` (when our SBCi16imm uses SEC + SBC,
|
||||
// LDA #0; SBC #0 produces $FFFF if C=0, $0000 if C=1; EOR #-1 flips).
|
||||
// Actually simpler since SBC sets carry differently: see AsmPrinter
|
||||
// expansion for the exact 5-byte sequence.
|
||||
let Constraints = "$src = $dst",
|
||||
hasSideEffects = 0, mayLoad = 0, mayStore = 0, Defs = [P] in {
|
||||
def SRA15A : W65816Pseudo<(outs Acc16:$dst), (ins Acc16:$src),
|
||||
"# SRA15A $dst, $src",
|
||||
[(set Acc16:$dst, (sra Acc16:$src, (i16 15)))]>;
|
||||
}
|
||||
|
||||
// sext_inreg from i1: broadcast bit 0 to all bits. LLVM emits this
|
||||
// for `(c & 1) ? -1 : 0` patterns (e.g. CRC inner loops). The result
|
||||
// is `-(x & 1)` — 0 if bit 0 was clear, 0xFFFF if set. Mask to bit
|
||||
// 0 then two's-complement-negate. Three pseudos = ~7 bytes.
|
||||
def : Pat<(sext_inreg Acc16:$src, i1),
|
||||
(NEGA16 (ANDi16imm Acc16:$src, 1))>;
|
||||
|
||||
// sext_inreg from i8: branchless `((x & 0xFF) ^ 0x80) - 0x80` trick
|
||||
// (same sequence LowerSignExtend uses for ISD::SIGN_EXTEND i8->i16).
|
||||
// LLVM emits this when expanding a sextload-i16-from-i8 (we set
|
||||
// SEXTLOAD i8 to Expand in the lowering ctor) and for explicit
|
||||
// `(int)(signed char)` casts.
|
||||
def : Pat<(sext_inreg Acc16:$src, i8),
|
||||
(SBCi16imm (EORi16imm
|
||||
(ANDi16imm Acc16:$src, 0x00FF), 0x0080),
|
||||
0x0080)>;
|
||||
|
||||
// Frame-index loads/stores: take a FrameIndex + offset (packed into a
|
||||
// single MIOperandInfo) and expand (in eliminateFrameIndex) into an
|
||||
// LDA / STA d,S with the offset baked in. Used by LowerFormalArguments
|
||||
|
|
@ -350,7 +576,12 @@ def memfi : Operand<i16> {
|
|||
let PrintMethod = "printFrameMem";
|
||||
}
|
||||
|
||||
let mayLoad = 1, hasSideEffects = 0, mayStore = 0 in {
|
||||
// LDAfi is rematerializable when the FI is a fixed (immutable) arg
|
||||
// slot — see W65816InstrInfo::isReMaterializableImpl. Without this,
|
||||
// greedy regalloc spills every arg load to a fresh local slot then
|
||||
// reloads from there, ballooning every i32-arg function by 4-6 insns.
|
||||
let mayLoad = 1, hasSideEffects = 0, mayStore = 0,
|
||||
isReMaterializable = 1 in {
|
||||
def LDAfi : W65816Pseudo<(outs Acc16:$dst), (ins memfi:$addr),
|
||||
"# LDAfi $dst, $addr", []>;
|
||||
}
|
||||
|
|
@ -369,14 +600,37 @@ def : Pat<(i16 (load addr_fi:$addr)),
|
|||
def : Pat<(store Acc16:$src, addr_fi:$addr),
|
||||
(STAfi Acc16:$src, addr_fi:$addr)>;
|
||||
|
||||
// i8 access to a FrameIndex slot. The slots holding i8 values are
|
||||
// allocated as 2 bytes (CC_W65816 promotes i8 args to i16; spills also
|
||||
// align), so reading 2 bytes is safe even for an i8 value — we just
|
||||
// narrow to Acc8. Extending loads mask the high byte (zext) or leave
|
||||
// it (anyext). Truncating store writes the full i16 (overwrites the
|
||||
// 2-byte slot's high byte with whatever sits in A's high byte; safe
|
||||
// since the slot holds an i8 and no other consumer reads that high
|
||||
// byte).
|
||||
def : Pat<(i8 (load addr_fi:$addr)),
|
||||
(COPY_TO_REGCLASS (LDAfi addr_fi:$addr), Acc8)>;
|
||||
def : Pat<(i16 (zextloadi8 addr_fi:$addr)),
|
||||
(ANDi16imm (LDAfi addr_fi:$addr), 0xFF)>;
|
||||
def : Pat<(i16 (extloadi8 addr_fi:$addr)),
|
||||
(LDAfi addr_fi:$addr)>;
|
||||
def : Pat<(store Acc8:$src, addr_fi:$addr),
|
||||
(STAfi (COPY_TO_REGCLASS Acc8:$src, Acc16), addr_fi:$addr)>;
|
||||
def : Pat<(truncstorei8 Acc16:$src, addr_fi:$addr),
|
||||
(STAfi Acc16:$src, addr_fi:$addr)>;
|
||||
|
||||
// Frame-index folding into ADC / SBC / AND / ORA / EOR / CMP. Same
|
||||
// shape as the *abs variants but the second operand is a stack slot.
|
||||
// ADCfi/SBCfi mark P as Def so they can match `addc`/`subc` (the lo
|
||||
// half of a multi-precision split — see ADCi16imm comment above).
|
||||
let Constraints = "$src = $dst",
|
||||
hasSideEffects = 0, mayLoad = 1, mayStore = 0 in {
|
||||
let Defs = [P] in {
|
||||
def ADCfi : W65816Pseudo<(outs Acc16:$dst), (ins Acc16:$src, memfi:$addr),
|
||||
"# ADCfi $dst, $src, $addr", []>;
|
||||
def SBCfi : W65816Pseudo<(outs Acc16:$dst), (ins Acc16:$src, memfi:$addr),
|
||||
"# SBCfi $dst, $src, $addr", []>;
|
||||
}
|
||||
def ANDfi : W65816Pseudo<(outs Acc16:$dst), (ins Acc16:$src, memfi:$addr),
|
||||
"# ANDfi $dst, $src, $addr", []>;
|
||||
def ORAfi : W65816Pseudo<(outs Acc16:$dst), (ins Acc16:$src, memfi:$addr),
|
||||
|
|
@ -384,6 +638,16 @@ def ORAfi : W65816Pseudo<(outs Acc16:$dst), (ins Acc16:$src, memfi:$addr),
|
|||
def EORfi : W65816Pseudo<(outs Acc16:$dst), (ins Acc16:$src, memfi:$addr),
|
||||
"# EORfi $dst, $src, $addr", []>;
|
||||
}
|
||||
// ADCEfi / SBCEfi: chained ADC/SBC, hi half of a multi-precision split.
|
||||
// Read carry from previous addc/adde/subc/sube via Uses = [P].
|
||||
let Constraints = "$src = $dst",
|
||||
hasSideEffects = 0, mayLoad = 1, mayStore = 0,
|
||||
Uses = [P], Defs = [P] in {
|
||||
def ADCEfi : W65816Pseudo<(outs Acc16:$dst), (ins Acc16:$src, memfi:$addr),
|
||||
"# ADCEfi $dst, $src, $addr", []>;
|
||||
def SBCEfi : W65816Pseudo<(outs Acc16:$dst), (ins Acc16:$src, memfi:$addr),
|
||||
"# SBCEfi $dst, $src, $addr", []>;
|
||||
}
|
||||
let hasSideEffects = 0, mayLoad = 1, mayStore = 0, Defs = [P] in {
|
||||
def CMPfi : W65816Pseudo<(outs), (ins Acc16:$lhs, memfi:$addr),
|
||||
"# CMPfi $lhs, $addr", []>;
|
||||
|
|
@ -392,6 +656,14 @@ def : Pat<(add Acc16:$src, (i16 (load addr_fi:$addr))),
|
|||
(ADCfi Acc16:$src, addr_fi:$addr)>;
|
||||
def : Pat<(sub Acc16:$src, (i16 (load addr_fi:$addr))),
|
||||
(SBCfi Acc16:$src, addr_fi:$addr)>;
|
||||
def : Pat<(addc Acc16:$src, (i16 (load addr_fi:$addr))),
|
||||
(ADCfi Acc16:$src, addr_fi:$addr)>;
|
||||
def : Pat<(subc Acc16:$src, (i16 (load addr_fi:$addr))),
|
||||
(SBCfi Acc16:$src, addr_fi:$addr)>;
|
||||
def : Pat<(adde Acc16:$src, (i16 (load addr_fi:$addr))),
|
||||
(ADCEfi Acc16:$src, addr_fi:$addr)>;
|
||||
def : Pat<(sube Acc16:$src, (i16 (load addr_fi:$addr))),
|
||||
(SBCEfi Acc16:$src, addr_fi:$addr)>;
|
||||
def : Pat<(and Acc16:$src, (i16 (load addr_fi:$addr))),
|
||||
(ANDfi Acc16:$src, addr_fi:$addr)>;
|
||||
def : Pat<(or Acc16:$src, (i16 (load addr_fi:$addr))),
|
||||
|
|
@ -433,11 +705,217 @@ def : Pat<(W65816cmp Acc16:$lhs,
|
|||
(i16 (load (W65816Wrapper texternalsym:$s)))),
|
||||
(CMPabs Acc16:$lhs, texternalsym:$s)>;
|
||||
|
||||
// Two-Acc16 ops: deferred — needs proper frame setup so the register
|
||||
// allocator can spill one operand to a local stack slot. Without
|
||||
// reserved frame space, the spill goes to a negative SP offset and
|
||||
// eliminateFrameIndex bails. See SESSION_STATE §6 for the
|
||||
// dependency chain.
|
||||
// 16-bit byte swap: XBA exchanges A.high and A.low. Pattern matches
|
||||
// the (bswap Acc16) SDNode emitted by clang for byte-reverse loops.
|
||||
let Constraints = "$src = $dst",
|
||||
hasSideEffects = 0, mayLoad = 0, mayStore = 0 in {
|
||||
def XBA16 : W65816Pseudo<(outs Acc16:$dst), (ins Acc16:$src),
|
||||
"# XBA16 $dst, $src",
|
||||
[(set Acc16:$dst, (bswap Acc16:$src))]>;
|
||||
}
|
||||
|
||||
// Two-Acc16 binary ops. We have only one A register, so when both
|
||||
// operands are computed values (neither a foldable load/imm/global) we
|
||||
// must spill one to a stack slot. Each pseudo's custom inserter
|
||||
// allocates a fresh slot and emits a STAfi+OPfi sequence; the
|
||||
// register allocator handles the surrounding spills/reloads.
|
||||
// hasSideEffects=1 tells the validator the pseudo may load/store
|
||||
// without requiring a matching SDNode pattern (the stores are added
|
||||
// by the inserter, not visible in the DAG pattern).
|
||||
//
|
||||
// Defs = [P] on ADD_RR/SUB_RR matches the C-flag side-effect of the
|
||||
// underlying ADC/SBC, letting these pseudos serve `addc`/`subc` (the
|
||||
// lo half of an i32 split) as well as plain `add`/`sub`.
|
||||
let usesCustomInserter = 1, hasSideEffects = 1 in {
|
||||
let Defs = [P] in {
|
||||
def ADD_RR : W65816Pseudo<(outs Acc16:$dst),
|
||||
(ins Acc16:$src1, Acc16:$src2),
|
||||
"# ADD_RR $dst, $src1, $src2",
|
||||
[(set Acc16:$dst,
|
||||
(add Acc16:$src1, Acc16:$src2))]>;
|
||||
def SUB_RR : W65816Pseudo<(outs Acc16:$dst),
|
||||
(ins Acc16:$src1, Acc16:$src2),
|
||||
"# SUB_RR $dst, $src1, $src2",
|
||||
[(set Acc16:$dst,
|
||||
(sub Acc16:$src1, Acc16:$src2))]>;
|
||||
}
|
||||
def AND_RR : W65816Pseudo<(outs Acc16:$dst),
|
||||
(ins Acc16:$src1, Acc16:$src2),
|
||||
"# AND_RR $dst, $src1, $src2",
|
||||
[(set Acc16:$dst,
|
||||
(and Acc16:$src1, Acc16:$src2))]>;
|
||||
def ORA_RR : W65816Pseudo<(outs Acc16:$dst),
|
||||
(ins Acc16:$src1, Acc16:$src2),
|
||||
"# ORA_RR $dst, $src1, $src2",
|
||||
[(set Acc16:$dst,
|
||||
(or Acc16:$src1, Acc16:$src2))]>;
|
||||
def EOR_RR : W65816Pseudo<(outs Acc16:$dst),
|
||||
(ins Acc16:$src1, Acc16:$src2),
|
||||
"# EOR_RR $dst, $src1, $src2",
|
||||
[(set Acc16:$dst,
|
||||
(xor Acc16:$src1, Acc16:$src2))]>;
|
||||
}
|
||||
def : Pat<(addc Acc16:$src1, Acc16:$src2),
|
||||
(ADD_RR Acc16:$src1, Acc16:$src2)>;
|
||||
def : Pat<(subc Acc16:$src1, Acc16:$src2),
|
||||
(SUB_RR Acc16:$src1, Acc16:$src2)>;
|
||||
|
||||
// Chained-carry two-Acc16 add/sub for the hi half of i32 splits.
|
||||
// Inserter mirrors ADD_RR (STAfi spill + ADCEfi load-fold) but emits
|
||||
// the carry-chain pseudo so the previous addc/adde's C flag is
|
||||
// consumed instead of overwritten by a CLC. Uses+Defs = [P]
|
||||
// reflects the carry chain through the SDNode.
|
||||
let usesCustomInserter = 1, hasSideEffects = 1,
|
||||
Uses = [P], Defs = [P] in {
|
||||
def ADDE_RR : W65816Pseudo<(outs Acc16:$dst),
|
||||
(ins Acc16:$src1, Acc16:$src2),
|
||||
"# ADDE_RR $dst, $src1, $src2",
|
||||
[(set Acc16:$dst,
|
||||
(adde Acc16:$src1, Acc16:$src2))]>;
|
||||
def SUBE_RR : W65816Pseudo<(outs Acc16:$dst),
|
||||
(ins Acc16:$src1, Acc16:$src2),
|
||||
"# SUBE_RR $dst, $src1, $src2",
|
||||
[(set Acc16:$dst,
|
||||
(sube Acc16:$src1, Acc16:$src2))]>;
|
||||
}
|
||||
let usesCustomInserter = 1, hasSideEffects = 1, Defs = [P] in {
|
||||
def CMP_RR : W65816Pseudo<(outs), (ins Acc16:$lhs, Acc16:$rhs),
|
||||
"# CMP_RR $lhs, $rhs",
|
||||
[(W65816cmp Acc16:$lhs, Acc16:$rhs)]>;
|
||||
}
|
||||
|
||||
// Pointer dereference. The 65816 can't deref a register pointer
|
||||
// directly — the indirect addressing modes all read the pointer from
|
||||
// memory (DP or stack). These pseudos spill the Acc16 pointer to a
|
||||
// fresh stack slot, set Y=0, and emit LDA/STA (slot,S),Y. Y gets
|
||||
// clobbered as a side effect. hasSideEffects=1 covers the spill
|
||||
// store the inserter adds, in addition to the deref.
|
||||
let usesCustomInserter = 1, hasSideEffects = 1, mayLoad = 1,
|
||||
Defs = [Y] in {
|
||||
def LDAptr : W65816Pseudo<(outs Acc16:$dst), (ins Acc16:$ptr),
|
||||
"# LDAptr $dst, $ptr",
|
||||
[(set Acc16:$dst, (load Acc16:$ptr))]>;
|
||||
}
|
||||
let usesCustomInserter = 1, hasSideEffects = 1, mayStore = 1,
|
||||
Defs = [Y] in {
|
||||
def STAptr : W65816Pseudo<(outs), (ins Acc16:$val, Acc16:$ptr),
|
||||
"# STAptr $val, $ptr",
|
||||
[(store Acc16:$val, Acc16:$ptr)]>;
|
||||
}
|
||||
|
||||
// i8 zero-extending pointer load: do a 16-bit LDA (slot,s),y and mask
|
||||
// the high byte. Reads one byte past the source — fine for byte-array
|
||||
// iteration where the buffer is at least 2 bytes long. A future
|
||||
// SEP/REP-aware mode pass could switch to a true 8-bit LDA.
|
||||
def : Pat<(i16 (zextloadi8 Acc16:$ptr)),
|
||||
(ANDi16imm (LDAptr Acc16:$ptr), 0xFF)>;
|
||||
// Anyext byte load via pointer: consumer doesn't care about the high
|
||||
// byte, so just LDA (16-bit). Same 1-byte-past-buffer caveat as
|
||||
// zextloadi8.
|
||||
def : Pat<(i16 (extloadi8 Acc16:$ptr)),
|
||||
(LDAptr Acc16:$ptr)>;
|
||||
// And the equivalent for absolute addresses (byte loads via global ptr).
|
||||
// (Already covered for Wrapper(global) above; this catches the case
|
||||
// where the ptr is materialised as a value.)
|
||||
|
||||
// Intermediate pseudos used by the LDAptr/STAptr inserters. Each takes
|
||||
// a memfi describing the slot containing the pointer; eliminateFrameIndex
|
||||
// resolves it to LDA_StackRelIndY / STA_StackRelIndY with the right d-byte.
|
||||
// Y must hold 0 at the issue point (the inserter emits LDY #0 first).
|
||||
let mayLoad = 1, hasSideEffects = 0, mayStore = 0, Uses = [Y] in {
|
||||
def LDAfi_indY : W65816Pseudo<(outs Acc16:$dst), (ins memfi:$addr),
|
||||
"# LDAfi_indY $dst, $addr", []>;
|
||||
}
|
||||
let mayStore = 1, hasSideEffects = 0, mayLoad = 0, Uses = [Y] in {
|
||||
def STAfi_indY : W65816Pseudo<(outs), (ins Acc16:$src, memfi:$addr),
|
||||
"# STAfi_indY $src, $addr", []>;
|
||||
}
|
||||
|
||||
// i8 truncating store via Acc16 pointer. Same shape as STAptr but
|
||||
// custom inserter wraps the actual STA in SEP/REP so the M-bit is 8
|
||||
// across the store and only one byte is written. Without the wrap the
|
||||
// 16-bit STA would clobber the byte at ptr+1. Two patterns: the
|
||||
// natural truncstorei8 from an i16 value (common with arg promotion),
|
||||
// and a true i8 store (Acc8) that arises from i8-typed IR.
|
||||
let usesCustomInserter = 1, hasSideEffects = 1, mayStore = 1,
|
||||
Defs = [Y] in {
|
||||
def STBptr : W65816Pseudo<(outs), (ins Acc16:$val, Acc16:$ptr),
|
||||
"# STBptr $val, $ptr",
|
||||
[(truncstorei8 Acc16:$val, Acc16:$ptr)]>;
|
||||
}
|
||||
|
||||
// Pointer access with constant offset. `(load (add ptr, $off))` and
|
||||
// `(store val, (add ptr, $off))` come up for struct field access and
|
||||
// array indexing with small constant offsets. Without these patterns,
|
||||
// the offset becomes an explicit ADC #imm that has to spill A and
|
||||
// recompute the pointer per access. With them, we just load Y with
|
||||
// the offset in the inserter (Y is 16-bit so any i16 constant fits).
|
||||
let usesCustomInserter = 1, hasSideEffects = 1, mayLoad = 1,
|
||||
Defs = [Y] in {
|
||||
def LDAptrOff : W65816Pseudo<(outs Acc16:$dst),
|
||||
(ins Acc16:$ptr, i16imm:$off),
|
||||
"# LDAptrOff $dst, $ptr, $off", []>;
|
||||
}
|
||||
let usesCustomInserter = 1, hasSideEffects = 1, mayStore = 1,
|
||||
Defs = [Y] in {
|
||||
def STAptrOff : W65816Pseudo<(outs),
|
||||
(ins Acc16:$val, Acc16:$ptr, i16imm:$off),
|
||||
"# STAptrOff $val, $ptr, $off", []>;
|
||||
def STBptrOff : W65816Pseudo<(outs),
|
||||
(ins Acc16:$val, Acc16:$ptr, i16imm:$off),
|
||||
"# STBptrOff $val, $ptr, $off", []>;
|
||||
}
|
||||
def : Pat<(i16 (load (add Acc16:$ptr, (i16 imm:$off)))),
|
||||
(LDAptrOff Acc16:$ptr, imm:$off)>;
|
||||
def : Pat<(store Acc16:$val, (add Acc16:$ptr, (i16 imm:$off))),
|
||||
(STAptrOff Acc16:$val, Acc16:$ptr, imm:$off)>;
|
||||
def : Pat<(truncstorei8 Acc16:$val, (add Acc16:$ptr, (i16 imm:$off))),
|
||||
(STBptrOff Acc16:$val, Acc16:$ptr, imm:$off)>;
|
||||
def : Pat<(store Acc8:$val, (add Acc16:$ptr, (i16 imm:$off))),
|
||||
(STBptrOff (COPY_TO_REGCLASS Acc8:$val, Acc16),
|
||||
Acc16:$ptr, imm:$off)>;
|
||||
def : Pat<(store Acc8:$val, Acc16:$ptr),
|
||||
(STBptr (COPY_TO_REGCLASS Acc8:$val, Acc16), Acc16:$ptr)>;
|
||||
|
||||
// i8 load via Acc16 pointer producing a true i8 (Acc8) result. Reuses
|
||||
// the existing zextloadi8 16-bit-LDA-and-mask path: load 2 bytes, mask
|
||||
// the high byte, then narrow to Acc8. COPY_TO_REGCLASS to Acc8 is a
|
||||
// no-op at MC level (same physical A). Reads one byte past the source;
|
||||
// fine for char-array iteration where the buffer is at least 2 bytes.
|
||||
def : Pat<(i8 (load Acc16:$ptr)),
|
||||
(COPY_TO_REGCLASS (ANDi16imm (LDAptr Acc16:$ptr), 0xFF), Acc8)>;
|
||||
|
||||
// Acc8-to-Acc16 type conversions. Both Acc8 and Acc16 alias physical A,
|
||||
// so COPY_TO_REGCLASS is a no-op at MC level. ZEXT additionally masks
|
||||
// the high byte (which holds B from before any prior SEP). ANYEXT
|
||||
// leaves the high byte untouched since the consumer doesn't care.
|
||||
def : Pat<(i16 (anyext Acc8:$src)),
|
||||
(COPY_TO_REGCLASS Acc8:$src, Acc16)>;
|
||||
def : Pat<(i16 (zext Acc8:$src)),
|
||||
(ANDi16imm (COPY_TO_REGCLASS Acc8:$src, Acc16), 0xFF)>;
|
||||
def : Pat<(i8 (trunc Acc16:$src)),
|
||||
(COPY_TO_REGCLASS Acc16:$src, Acc8)>;
|
||||
|
||||
// Acc8 reg-reg arithmetic and bitwise ops, expanded through the Acc16
|
||||
// _RR pseudos. Cheap to do because Acc8 and Acc16 alias the same
|
||||
// physical A — COPY_TO_REGCLASS is a no-op. Only the low byte
|
||||
// matters; the high byte gets unrelated bits but is discarded by the
|
||||
// final narrow-back to Acc8. This lets an i8 expression that wasn't
|
||||
// promoted by legalization (e.g. an i8 XOR feeding only an i8 store)
|
||||
// reuse the spill-and-OPfi inserter without needing dedicated Acc8
|
||||
// pseudos.
|
||||
multiclass Acc8RR<SDNode op, Instruction ri> {
|
||||
def : Pat<(i8 (op Acc8:$a, Acc8:$b)),
|
||||
(COPY_TO_REGCLASS
|
||||
(ri (COPY_TO_REGCLASS Acc8:$a, Acc16),
|
||||
(COPY_TO_REGCLASS Acc8:$b, Acc16)),
|
||||
Acc8)>;
|
||||
}
|
||||
defm : Acc8RR<add, ADD_RR>;
|
||||
defm : Acc8RR<sub, SUB_RR>;
|
||||
defm : Acc8RR<and, AND_RR>;
|
||||
defm : Acc8RR<or, ORA_RR>;
|
||||
defm : Acc8RR<xor, EOR_RR>;
|
||||
|
||||
// (memory inc/dec patterns moved below INC_Abs/DEC_Abs defs.)
|
||||
|
||||
|
|
@ -728,6 +1206,11 @@ def AND_StackRel : InstStackRel<0x23, "and">;
|
|||
def ORA_StackRel : InstStackRel<0x03, "ora">;
|
||||
def EOR_StackRel : InstStackRel<0x43, "eor">;
|
||||
|
||||
//---------------------------------------------------------------- Stack-ind-Y
|
||||
// Stack-relative indirect indexed-Y: deref a pointer spilled at S+off.
|
||||
def LDA_StackRelIndY : InstStackRelIndY<0xB3, "lda">;
|
||||
def STA_StackRelIndY : InstStackRelIndY<0x93, "sta">;
|
||||
|
||||
//===----------------------------------------------------------------------===//
|
||||
// Branch patterns (placed after the Bxx defs).
|
||||
//
|
||||
|
|
|
|||
|
|
@ -77,10 +77,46 @@ bool W65816RegisterInfo::eliminateFrameIndex(MachineBasicBlock::iterator II,
|
|||
case W65816::STAfi: NewOpc = W65816::STA_StackRel; break;
|
||||
case W65816::ADCfi: NewOpc = W65816::ADC_StackRel; NeedsCarryPrefix = true; break;
|
||||
case W65816::SBCfi: NewOpc = W65816::SBC_StackRel; NeedsCarryPrefix = true; IsSub = true; break;
|
||||
// ADCEfi / SBCEfi are the chained-carry variants used as the hi half of a
|
||||
// multi-precision split. No CLC/SEC prefix — they read the carry left
|
||||
// in P by the previous addc/adde/subc/sube.
|
||||
case W65816::ADCEfi: NewOpc = W65816::ADC_StackRel; break;
|
||||
case W65816::SBCEfi: NewOpc = W65816::SBC_StackRel; break;
|
||||
case W65816::ANDfi: NewOpc = W65816::AND_StackRel; break;
|
||||
case W65816::ORAfi: NewOpc = W65816::ORA_StackRel; break;
|
||||
case W65816::EORfi: NewOpc = W65816::EOR_StackRel; break;
|
||||
case W65816::CMPfi: NewOpc = W65816::CMP_StackRel; break;
|
||||
case W65816::LDAfi_indY: NewOpc = W65816::LDA_StackRelIndY; break;
|
||||
case W65816::STAfi_indY: NewOpc = W65816::STA_StackRelIndY; break;
|
||||
case W65816::ADDframe: {
|
||||
// LEA-equivalent: emit "TSC; CLC; ADC #disp" so A holds SP + disp,
|
||||
// i.e. the address of the stack slot. TSC has no carry side-effect
|
||||
// (it just transfers SP into A), so the CLC + ADC is needed for a
|
||||
// clean unsigned add. Disp uses the same FrameOffset+ImmOffset+
|
||||
// StackSize formula as the load/store cases.
|
||||
int FI = MI.getOperand(FIOperandNum).getIndex();
|
||||
int FrameOffset = MFI.getObjectOffset(FI);
|
||||
int ImmOffset = MI.getOperand(FIOperandNum + 1).getImm();
|
||||
int Disp = FrameOffset + ImmOffset + (int)MFI.getStackSize();
|
||||
if (Disp < 0 || Disp > 0xFFFF)
|
||||
report_fatal_error("W65816: frame offset out of i16 LEA range");
|
||||
// TSC: A = SP (implicit def of A, use of SP).
|
||||
BuildMI(*MI.getParent(), II, MI.getDebugLoc(), TII.get(W65816::TSC))
|
||||
.addReg(W65816::A, RegState::ImplicitDefine)
|
||||
.addReg(W65816::SP, RegState::Implicit);
|
||||
// CLC: clears C. Models as P-def, P-use (preserves N/V/Z).
|
||||
BuildMI(*MI.getParent(), II, MI.getDebugLoc(), TII.get(W65816::CLC))
|
||||
.addReg(W65816::P, RegState::ImplicitDefine);
|
||||
// ADC #imm: reads A and P, writes A and P.
|
||||
BuildMI(*MI.getParent(), II, MI.getDebugLoc(), TII.get(W65816::ADC_Imm16))
|
||||
.addImm(Disp)
|
||||
.addReg(W65816::A, RegState::Implicit)
|
||||
.addReg(W65816::A, RegState::ImplicitDefine)
|
||||
.addReg(W65816::P, RegState::Implicit)
|
||||
.addReg(W65816::P, RegState::ImplicitDefine);
|
||||
MI.eraseFromParent();
|
||||
return true;
|
||||
}
|
||||
default:
|
||||
llvm_unreachable("W65816: unhandled instruction in eliminateFrameIndex");
|
||||
}
|
||||
|
|
@ -108,8 +144,49 @@ bool W65816RegisterInfo::eliminateFrameIndex(MachineBasicBlock::iterator II,
|
|||
BuildMI(*MI.getParent(), II, MI.getDebugLoc(),
|
||||
TII.get(IsSub ? W65816::SEC : W65816::CLC));
|
||||
}
|
||||
BuildMI(*MI.getParent(), II, MI.getDebugLoc(), TII.get(NewOpc))
|
||||
.addImm(Offset);
|
||||
// The MC instructions (LDA_StackRel, STA_StackRel, ADC_StackRel,
|
||||
// ADC_Imm16, etc.) don't have explicit Defs/Uses on the accumulator
|
||||
// because that's an implicit hardware semantic of every 65816
|
||||
// arithmetic/load/store. Without an explicit Def/Use, post-RA
|
||||
// passes (Machine Copy Propagation in particular) miss that an ADC
|
||||
// d,S between a TXA and a TAX redefines $a, and elide the TAX as
|
||||
// "redundant" — corrupting the return value. Add the implicit
|
||||
// operands here so dataflow tracking is correct. Match the
|
||||
// original pseudo's read/write semantics: LDA defs A only; STA uses
|
||||
// A only; ADC/SBC/AND/ORA/EOR/CMP read A and write A (CMP only
|
||||
// sets flags, but it still uses A — modelling it as Use is
|
||||
// sufficient since it doesn't change A).
|
||||
auto Builder = BuildMI(*MI.getParent(), II, MI.getDebugLoc(),
|
||||
TII.get(NewOpc)).addImm(Offset);
|
||||
switch (NewOpc) {
|
||||
case W65816::LDA_StackRel:
|
||||
case W65816::LDA_StackRelIndY:
|
||||
Builder.addReg(W65816::A, RegState::ImplicitDefine);
|
||||
break;
|
||||
case W65816::STA_StackRel:
|
||||
case W65816::STA_StackRelIndY:
|
||||
Builder.addReg(W65816::A, RegState::Implicit);
|
||||
break;
|
||||
case W65816::ADC_StackRel:
|
||||
case W65816::SBC_StackRel:
|
||||
Builder.addReg(W65816::A, RegState::Implicit)
|
||||
.addReg(W65816::A, RegState::ImplicitDefine)
|
||||
.addReg(W65816::P, RegState::Implicit)
|
||||
.addReg(W65816::P, RegState::ImplicitDefine);
|
||||
break;
|
||||
case W65816::AND_StackRel:
|
||||
case W65816::ORA_StackRel:
|
||||
case W65816::EOR_StackRel:
|
||||
Builder.addReg(W65816::A, RegState::Implicit)
|
||||
.addReg(W65816::A, RegState::ImplicitDefine);
|
||||
break;
|
||||
case W65816::CMP_StackRel:
|
||||
Builder.addReg(W65816::A, RegState::Implicit)
|
||||
.addReg(W65816::P, RegState::ImplicitDefine);
|
||||
break;
|
||||
default:
|
||||
break;
|
||||
}
|
||||
MI.eraseFromParent();
|
||||
return true;
|
||||
}
|
||||
|
|
|
|||
355
src/llvm/lib/Target/W65816/W65816StackSlotCleanup.cpp
Normal file
355
src/llvm/lib/Target/W65816/W65816StackSlotCleanup.cpp
Normal file
|
|
@ -0,0 +1,355 @@
|
|||
//===-- W65816StackSlotCleanup.cpp - Remove redundant spill/reload pairs --===//
|
||||
//
|
||||
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
|
||||
// See https://llvm.org/LICENSE.txt for license information.
|
||||
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
|
||||
//
|
||||
//===----------------------------------------------------------------------===//
|
||||
//
|
||||
// Post-RA cleanup that erases redundant STAfi+LDAfi pairs to the same
|
||||
// stack slot when no instruction in between writes A or that slot.
|
||||
//
|
||||
// The greedy register allocator routinely emits this pattern when
|
||||
// materialising a COPY of $a into a vreg that gets allocated back to
|
||||
// $a — the spill+reload cycle is a no-op since A already holds the
|
||||
// stored value. The standard MachineLateInstrsCleanup pass only
|
||||
// detects identical instructions; it doesn't recognise that
|
||||
// `LDAfi slot` after `STAfi $a, slot` is a no-op. We do the
|
||||
// simple per-block scan here.
|
||||
//
|
||||
// Conservative: only matches adjacent STAfi+LDAfi pairs (no scan for
|
||||
// instructions in between). In practice the greedy-allocator-emitted
|
||||
// pattern is always adjacent or near-adjacent, and the scheduler keeps
|
||||
// it that way because the LDAfi feeds the next instruction. If
|
||||
// future codegen breaks this assumption, generalise to a longer scan
|
||||
// with explicit clobber tracking.
|
||||
//
|
||||
//===----------------------------------------------------------------------===//
|
||||
|
||||
#include "W65816.h"
|
||||
#include "W65816InstrInfo.h"
|
||||
#include "W65816Subtarget.h"
|
||||
#include "llvm/ADT/BitVector.h"
|
||||
#include "llvm/CodeGen/MachineFrameInfo.h"
|
||||
#include "llvm/CodeGen/MachineFunction.h"
|
||||
#include "llvm/CodeGen/MachineFunctionPass.h"
|
||||
#include "llvm/CodeGen/MachineInstr.h"
|
||||
#include "llvm/CodeGen/MachineRegisterInfo.h"
|
||||
#include "llvm/CodeGen/TargetRegisterInfo.h"
|
||||
|
||||
using namespace llvm;
|
||||
|
||||
#define DEBUG_TYPE "w65816-stack-slot-cleanup"
|
||||
|
||||
namespace {
|
||||
|
||||
class W65816StackSlotCleanup : public MachineFunctionPass {
|
||||
public:
|
||||
static char ID;
|
||||
|
||||
W65816StackSlotCleanup() : MachineFunctionPass(ID) {}
|
||||
|
||||
StringRef getPassName() const override {
|
||||
return "W65816 redundant stack-slot spill/reload elimination";
|
||||
}
|
||||
|
||||
bool runOnMachineFunction(MachineFunction &MF) override;
|
||||
};
|
||||
|
||||
} // namespace
|
||||
|
||||
char W65816StackSlotCleanup::ID = 0;
|
||||
|
||||
INITIALIZE_PASS(W65816StackSlotCleanup, DEBUG_TYPE,
|
||||
"W65816 redundant stack-slot spill/reload elimination",
|
||||
false, false)
|
||||
|
||||
FunctionPass *llvm::createW65816StackSlotCleanup() {
|
||||
return new W65816StackSlotCleanup();
|
||||
}
|
||||
|
||||
// Returns true if MI references frame index FI as one of its operands.
|
||||
// Used to bail dead-store removal when an intervening instruction
|
||||
// reads or writes the slot.
|
||||
static bool referencesFrameIndex(const MachineInstr &MI, int FI) {
|
||||
for (const MachineOperand &MO : MI.operands())
|
||||
if (MO.isFI() && MO.getIndex() == FI)
|
||||
return true;
|
||||
return false;
|
||||
}
|
||||
|
||||
// Match `STAfi reg1, FI, 0; ... ; STAfi reg2, FI, 0` (kill via overwrite)
|
||||
// or `STAfi reg, FI, 0; ... ; <return> (no read in between)` (dead store
|
||||
// at function exit). Both mean the first STAfi is dead. Conservative:
|
||||
// bails on anything that references the slot, calls, inline asm. The
|
||||
// slot must be a *local* (non-fixed) FrameIndex — args live across the
|
||||
// function so we can't kill stores to fixed slots.
|
||||
static bool tryEliminateDeadStore(MachineBasicBlock &MBB,
|
||||
MachineInstr &StaMI) {
|
||||
if (StaMI.getOpcode() != W65816::STAfi)
|
||||
return false;
|
||||
if (StaMI.getNumOperands() < 3 ||
|
||||
!StaMI.getOperand(1).isFI() ||
|
||||
!StaMI.getOperand(2).isImm() || StaMI.getOperand(2).getImm() != 0)
|
||||
return false;
|
||||
int StoredFI = StaMI.getOperand(1).getIndex();
|
||||
|
||||
// Don't try to kill a store to a fixed (arg) slot — those are
|
||||
// observable to the caller. Locals/spills are fair game.
|
||||
const MachineFunction *MF = StaMI.getMF();
|
||||
if (MF->getFrameInfo().isFixedObjectIndex(StoredFI))
|
||||
return false;
|
||||
|
||||
auto It = std::next(StaMI.getIterator());
|
||||
while (It != MBB.end()) {
|
||||
MachineInstr &MI = *It;
|
||||
if (MI.isDebugInstr()) {
|
||||
++It;
|
||||
continue;
|
||||
}
|
||||
// A subsequent STAfi to the same slot, offset 0, kills our store.
|
||||
if (MI.getOpcode() == W65816::STAfi &&
|
||||
MI.getNumOperands() >= 3 &&
|
||||
MI.getOperand(1).isFI() &&
|
||||
MI.getOperand(1).getIndex() == StoredFI &&
|
||||
MI.getOperand(2).isImm() && MI.getOperand(2).getImm() == 0) {
|
||||
// Found the killing store. Erase the first.
|
||||
StaMI.eraseFromParent();
|
||||
return true;
|
||||
}
|
||||
// A return that doesn't read the slot kills the store too — the
|
||||
// local goes out of scope at function exit.
|
||||
if (MI.isReturn() && !referencesFrameIndex(MI, StoredFI)) {
|
||||
StaMI.eraseFromParent();
|
||||
return true;
|
||||
}
|
||||
// Anything else that touches the slot (load, ADC d,S, etc.) means
|
||||
// the first store IS observed — bail.
|
||||
if (referencesFrameIndex(MI, StoredFI))
|
||||
return false;
|
||||
// Inline asm / branches: too tricky. Calls are OK to walk past —
|
||||
// local (non-fixed) slots are addressed at offsets the callee
|
||||
// can't reach (callee's S has been shifted down by JSL's
|
||||
// 3-byte return frame and any of its own pha/tsc adjustments,
|
||||
// so its `(4,s)` reads land above our locals). We've already
|
||||
// bailed on fixed slots above, so reaching here means the slot
|
||||
// is local and call-safe.
|
||||
if (MI.isInlineAsm() || MI.isBranch())
|
||||
return false;
|
||||
++It;
|
||||
}
|
||||
// Walked off the end of the BB without seeing a return/use. Bail
|
||||
// (could fall through to a successor that reads the slot).
|
||||
return false;
|
||||
}
|
||||
|
||||
// Match `STAfi reg, FI, 0; ... ; LDAfi destReg, FI, 0` when reg == destReg
|
||||
// and nothing in between clobbers reg or the slot. Erase the LDAfi.
|
||||
static bool tryEliminateLoadAfterStore(MachineBasicBlock &MBB,
|
||||
MachineInstr &StaMI,
|
||||
const TargetRegisterInfo *TRI) {
|
||||
if (StaMI.getOpcode() != W65816::STAfi)
|
||||
return false;
|
||||
if (StaMI.getNumOperands() < 3 ||
|
||||
!StaMI.getOperand(0).isReg() ||
|
||||
!StaMI.getOperand(1).isFI() ||
|
||||
!StaMI.getOperand(2).isImm() || StaMI.getOperand(2).getImm() != 0)
|
||||
return false;
|
||||
Register StoredReg = StaMI.getOperand(0).getReg();
|
||||
int StoredFI = StaMI.getOperand(1).getIndex();
|
||||
|
||||
// Walk forward looking for the matching LDAfi. Bail on any
|
||||
// instruction that could clobber StoredReg or write the slot.
|
||||
auto It = std::next(StaMI.getIterator());
|
||||
while (It != MBB.end()) {
|
||||
MachineInstr &MI = *It;
|
||||
if (MI.isDebugInstr()) {
|
||||
++It;
|
||||
continue;
|
||||
}
|
||||
if (MI.getOpcode() == W65816::LDAfi &&
|
||||
MI.getNumOperands() >= 3 &&
|
||||
MI.getOperand(1).isFI() &&
|
||||
MI.getOperand(1).getIndex() == StoredFI &&
|
||||
MI.getOperand(2).isImm() && MI.getOperand(2).getImm() == 0 &&
|
||||
MI.getOperand(0).isReg() &&
|
||||
MI.getOperand(0).getReg() == StoredReg) {
|
||||
MI.eraseFromParent();
|
||||
return true;
|
||||
}
|
||||
// Calls clobber A — be safe.
|
||||
if (MI.isCall())
|
||||
return false;
|
||||
// Any other instruction that defines StoredReg or stores to the
|
||||
// slot invalidates the redundancy — bail.
|
||||
if (MI.modifiesRegister(StoredReg, TRI))
|
||||
return false;
|
||||
if (MI.getOpcode() == W65816::STAfi &&
|
||||
MI.getNumOperands() >= 2 && MI.getOperand(1).isFI() &&
|
||||
MI.getOperand(1).getIndex() == StoredFI)
|
||||
return false;
|
||||
++It;
|
||||
}
|
||||
return false;
|
||||
}
|
||||
|
||||
bool W65816StackSlotCleanup::runOnMachineFunction(MachineFunction &MF) {
|
||||
const TargetRegisterInfo *TRI = MF.getSubtarget().getRegisterInfo();
|
||||
bool Changed = false;
|
||||
|
||||
// Pass 0: rewrite `LDAi16imm $a, imm` immediately followed by
|
||||
// `COPY $x = $a` (with no intervening A clobber) into
|
||||
// `LDXi16imm $x, imm`. Run BEFORE the spill/reload cleanups so
|
||||
// the disappearing A clobber unblocks subsequent STAfi+LDAfi
|
||||
// pair removal.
|
||||
for (MachineBasicBlock &MBB : MF) {
|
||||
SmallVector<MachineInstr *, 4> Worklist;
|
||||
for (MachineInstr &MI : MBB)
|
||||
if (MI.getOpcode() == W65816::LDAi16imm)
|
||||
Worklist.push_back(&MI);
|
||||
for (MachineInstr *Lda : Worklist) {
|
||||
if (Lda->getNumOperands() < 2 || !Lda->getOperand(0).isReg() ||
|
||||
Lda->getOperand(0).getReg() != W65816::A)
|
||||
continue;
|
||||
auto It = std::next(Lda->getIterator());
|
||||
while (It != MBB.end() && It->isDebugInstr())
|
||||
++It;
|
||||
if (It == MBB.end())
|
||||
continue;
|
||||
MachineInstr &Next = *It;
|
||||
if (!Next.isCopy())
|
||||
continue;
|
||||
Register DstReg = Next.getOperand(0).getReg();
|
||||
Register SrcReg = Next.getOperand(1).getReg();
|
||||
if (DstReg != W65816::X || SrcReg != W65816::A)
|
||||
continue;
|
||||
const MachineOperand &ImmMO = Lda->getOperand(1);
|
||||
const TargetInstrInfo *TII = MF.getSubtarget().getInstrInfo();
|
||||
MachineInstrBuilder Mib =
|
||||
BuildMI(MBB, Lda->getIterator(), Lda->getDebugLoc(),
|
||||
TII->get(W65816::LDXi16imm), W65816::X);
|
||||
if (ImmMO.isImm())
|
||||
Mib.addImm(ImmMO.getImm());
|
||||
else
|
||||
Mib.add(ImmMO);
|
||||
Lda->eraseFromParent();
|
||||
Next.eraseFromParent();
|
||||
Changed = true;
|
||||
}
|
||||
}
|
||||
|
||||
// Pass 1: redundant LDAfi after STAfi (load-after-same-store with
|
||||
// matching register). Two-pass over Stores worklist to avoid
|
||||
// iterator invalidation when we erase the LDAfi mid-walk.
|
||||
for (MachineBasicBlock &MBB : MF) {
|
||||
SmallVector<MachineInstr *, 8> Stores;
|
||||
for (MachineInstr &MI : MBB)
|
||||
if (MI.getOpcode() == W65816::STAfi)
|
||||
Stores.push_back(&MI);
|
||||
for (MachineInstr *StaMI : Stores)
|
||||
if (tryEliminateLoadAfterStore(MBB, *StaMI, TRI))
|
||||
Changed = true;
|
||||
}
|
||||
|
||||
// Pass 2: dead stores (STAfi to slot followed by another STAfi to
|
||||
// the same slot with no intervening read). This catches the
|
||||
// arg0_lo "preserve" spill that the regalloc emits even though the
|
||||
// value is consumed by the very next instruction.
|
||||
for (MachineBasicBlock &MBB : MF) {
|
||||
SmallVector<MachineInstr *, 8> Stores;
|
||||
for (MachineInstr &MI : MBB)
|
||||
if (MI.getOpcode() == W65816::STAfi)
|
||||
Stores.push_back(&MI);
|
||||
for (MachineInstr *StaMI : Stores)
|
||||
if (tryEliminateDeadStore(MBB, *StaMI))
|
||||
Changed = true;
|
||||
}
|
||||
|
||||
// Pass 2.5: deleted (logic moved to Pass 0 above).
|
||||
// `COPY $x = $a` (with no intervening A use/def) into
|
||||
// `LDXi16imm $x, imm`, removing the A clobber. Without this, the
|
||||
// regalloc materialises i16 constants via Acc16 (LDAi16imm) even
|
||||
// when the only consumer is CopyToReg($x), forcing a TAX round-trip
|
||||
// and (often) a spill+reload of A's previous value. Common case:
|
||||
// the high half of `(zext i16 to i32)` returns, where hi = 0.
|
||||
for (MachineBasicBlock &MBB : MF) {
|
||||
SmallVector<MachineInstr *, 4> Worklist;
|
||||
for (MachineInstr &MI : MBB)
|
||||
if (MI.getOpcode() == W65816::LDAi16imm)
|
||||
Worklist.push_back(&MI);
|
||||
for (MachineInstr *Lda : Worklist) {
|
||||
// The LDA's def must be $a (post-RA) and the next instruction
|
||||
// must be a COPY $x = $a.
|
||||
if (Lda->getNumOperands() < 2 || !Lda->getOperand(0).isReg() ||
|
||||
Lda->getOperand(0).getReg() != W65816::A)
|
||||
continue;
|
||||
auto It = std::next(Lda->getIterator());
|
||||
// Skip debug instructions.
|
||||
while (It != MBB.end() && It->isDebugInstr())
|
||||
++It;
|
||||
if (It == MBB.end())
|
||||
continue;
|
||||
MachineInstr &Next = *It;
|
||||
if (!Next.isCopy())
|
||||
continue;
|
||||
Register DstReg = Next.getOperand(0).getReg();
|
||||
Register SrcReg = Next.getOperand(1).getReg();
|
||||
if (DstReg != W65816::X || SrcReg != W65816::A)
|
||||
continue;
|
||||
// Replace LDAi16imm with LDXi16imm and erase the COPY.
|
||||
const MachineOperand &ImmMO = Lda->getOperand(1);
|
||||
const TargetInstrInfo *TII = MF.getSubtarget().getInstrInfo();
|
||||
MachineInstrBuilder Mib =
|
||||
BuildMI(MBB, Lda->getIterator(), Lda->getDebugLoc(),
|
||||
TII->get(W65816::LDXi16imm), W65816::X);
|
||||
if (ImmMO.isImm())
|
||||
Mib.addImm(ImmMO.getImm());
|
||||
else
|
||||
Mib.add(ImmMO);
|
||||
Lda->eraseFromParent();
|
||||
Next.eraseFromParent();
|
||||
Changed = true;
|
||||
}
|
||||
}
|
||||
|
||||
// Pass 3: zero-size unused local frame objects so the
|
||||
// PrologueEpilogue pass shrinks the prologue PHAs / TSC reservation.
|
||||
// Walk the MIR collecting which FIs are still referenced; any
|
||||
// *non-fixed* (local) FI with no remaining reference is dead. We
|
||||
// can't safely remove it (RemoveStackObject can shift indexes); we
|
||||
// just zero-size it via setObjectSize, which is enough for the
|
||||
// frame layout pass to skip it.
|
||||
MachineFrameInfo &MFI = MF.getFrameInfo();
|
||||
if (MFI.getNumObjects() > 0) {
|
||||
BitVector Used(MFI.getObjectIndexEnd() - MFI.getObjectIndexBegin());
|
||||
auto Mark = [&](int FI) {
|
||||
int Idx = FI - MFI.getObjectIndexBegin();
|
||||
if (Idx >= 0 && Idx < (int)Used.size())
|
||||
Used.set(Idx);
|
||||
};
|
||||
for (MachineBasicBlock &MBB : MF)
|
||||
for (MachineInstr &MI : MBB)
|
||||
for (MachineOperand &MO : MI.operands())
|
||||
if (MO.isFI())
|
||||
Mark(MO.getIndex());
|
||||
for (int FI = MFI.getObjectIndexBegin();
|
||||
FI < MFI.getObjectIndexEnd(); ++FI) {
|
||||
// Skip fixed (arg) slots — those are "owned" by the caller.
|
||||
if (MFI.isFixedObjectIndex(FI))
|
||||
continue;
|
||||
int Idx = FI - MFI.getObjectIndexBegin();
|
||||
if (Idx < 0 || Idx >= (int)Used.size() || Used.test(Idx))
|
||||
continue;
|
||||
// Already zero-sized? Skip.
|
||||
if (MFI.getObjectSize(FI) == 0)
|
||||
continue;
|
||||
// Don't touch dead-stripped objects either.
|
||||
if (MFI.isDeadObjectIndex(FI))
|
||||
continue;
|
||||
MFI.setObjectSize(FI, 0);
|
||||
Changed = true;
|
||||
}
|
||||
}
|
||||
|
||||
return Changed;
|
||||
}
|
||||
|
|
@ -39,6 +39,7 @@ LLVMInitializeW65816Target() {
|
|||
PassRegistry &PR = *PassRegistry::getPassRegistry();
|
||||
initializeW65816AsmPrinterPass(PR);
|
||||
initializeW65816DAGToDAGISelLegacyPass(PR);
|
||||
initializeW65816StackSlotCleanupPass(PR);
|
||||
}
|
||||
|
||||
static Reloc::Model getEffectiveRelocModel(std::optional<Reloc::Model> RM) {
|
||||
|
|
@ -74,6 +75,7 @@ public:
|
|||
}
|
||||
|
||||
bool addInstSelector() override;
|
||||
void addPostRegAlloc() override;
|
||||
};
|
||||
|
||||
} // namespace
|
||||
|
|
@ -82,6 +84,10 @@ TargetPassConfig *W65816TargetMachine::createPassConfig(PassManagerBase &PM) {
|
|||
return new W65816PassConfig(*this, PM);
|
||||
}
|
||||
|
||||
void W65816PassConfig::addPostRegAlloc() {
|
||||
addPass(createW65816StackSlotCleanup());
|
||||
}
|
||||
|
||||
MachineFunctionInfo *W65816TargetMachine::createMachineFunctionInfo(
|
||||
BumpPtrAllocator &Allocator, const Function &F,
|
||||
const TargetSubtargetInfo *STI) const {
|
||||
|
|
|
|||
Loading…
Add table
Reference in a new issue