# Session Recovery — last updated 2026-05-25

Living recovery doc. Update on every meaningful change. If session is lost,
read this top-to-bottom + the memory notes referenced inside, then reread
the actual diffs in tree to ground assumptions.

## Headline state

- **Smoke**: 148/148 green. Demos 9/9 (helloBeep/helloText/helloWindow/
  orcaFrame/qdProbe/heavyRelocs/frame/reversi/minicad).
- **Active config**: ptr32 (`p:32:16`), full IMG0..IMG15 caller-clobber
  on JSL, greedy regalloc at -O1+. **Inline-threshold lowered to 50
  target-wide** (was LLVM default 225; was 75 earlier this session).
- **Branch**: `main`.
- **vs Calypsi (2026-05-25)**:
  - **Lua 5.1.5**: default config 1.13× Calypsi; with Layer 2 0.93× (we
    beat by 7%).
  - **CoreMark 1.0**: with Layer 2 **0.79× Calypsi (we beat by 21%)**.
- **vs Calypsi static-inst ratio (synthetic bench)**:
  sumSquares **0.84×** (26 vs 31 — we beat),
  mul16to32 **0.25×** (1 vs 4 — we beat),
  evalAt 1.86× (472 vs 254 — structural floor; ABI overhaul rejected).
- **New code-gen options (2026-05-25)** — see docs/USAGE.md "Advanced:
  pointer-deref code generation":
  - Layer 1 ptr32 deref-fold (always on): LDY offset instead of
    CLC/ADC carry chain. ~3 instr saved per struct-field access.
  - `-mllvm -w65816-dbr-safe-ptrs` (Layer 2, opt-in): uses
    `lda (d,S),Y` for ptr32 derefs assuming bank-byte == DBR.
    5 instr → 1 instr per deref. Lua -20.6%. **MISCOMPILES
    cross-bank pointers — opt in per-TU only when safe.**
  - Inline-threshold lowered to 50 (was 225). Lua -23% total, CoreMark
    matrix.o 1.37× → 0.97× Calypsi. Override with
    `-mllvm -inline-threshold=N`.
- **Cycle benches (2026-05-20)**:
  popcount 93, strcpy 91, bsearch 127, memcmp 113, fib 97,
  dotProduct 144, sumOfSquares 126 cyc/iter (100 iters);
  dadd 1157, ddiv 1261, dmul 1033 cyc/iter (10 iters);
  particles 2253 cyc/iter (3 iters), mandelbrot 11570 cyc/iter (1 iter).
- **Recent session wins (2026-05-20)**:
  - 8 always-on peepholes + extended phase 4 in W65816StackRelToImg
    (evalAt 498→472, fib -35%, 35 libc fns shrunk)
  - `__muldi3` 32-bit short-circuit (dmul 1605→1033, -36%)
  - case-(b) ImgCalleeSave bracket hoist enables phase 4 to elide
    TAY/TYA round-trip in synergy
  - FP cycle benches added (dadd/dmul/ddiv) with per-bench iter count
  - Documented LSR-dp cycle mystery as HBL-counter wrap artifact
  - Game-like benches added: particles (i16 physics), mandelbrot (i32 fp)
  - **elideStoreForwarding now reached via early-return bail paths**:
    particles 5005→2253 cyc/iter (-55%). Was being skipped for any
    function where main IMG promotion bailed (SpAdj invalid, no
    accesses, or > 16 hot slots).

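The Layer 2 safety condition above can be written as a one-line predicate. This is a minimal sketch, assuming a ptr32 value carries its bank in bits 16..23 and that the top byte is unused; `bankOf` and `isDbrSafe` are hypothetical names for illustration, not functions in the backend.

```cpp
#include <cassert>
#include <cstdint>

// Bank byte of a ptr32 value: bits 16..23 (assumption: top byte unused).
inline uint8_t bankOf(uint32_t ptr) {
    return static_cast<uint8_t>((ptr >> 16) & 0xFFu);
}

// True when a deref through `ptr` still reaches the intended byte if the
// access is made relative to the data bank register instead of the
// pointer's own bank byte.
inline bool isDbrSafe(uint32_t ptr, uint8_t dbr) {
    return bankOf(ptr) == dbr;
}
```

A pointer for which `isDbrSafe` returns false is the cross-bank case that `-w65816-dbr-safe-ptrs` miscompiles.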
## Uncommitted, must keep

`git status --short` (5 modified, no untracked of consequence):

1. `SESSION_RECOVERY.md` — this doc.
2. `scripts/smokeTest.sh` — added "omfEmit `--stack-size` emits a
   DP/Stack `~Direct` segment" check. Validates 3-segment layout
   (ExpressLoad + code + DP/Stack) when `--stack-size` is supplied;
   parses the third segment header against
   KIND/LENGTH/RESSPC/ALIGN/SEGNUM=3/name="~Direct" expectations.
3. `src/link816/omfEmit.cpp` — `emitDpStackSeg(length, segNum)` plus
   the `--stack-size N` CLI flag. Validation: 256 ≤ N ≤ 65536,
   page-aligned. **`--stack-size` implicitly enables `--expressload`** —
   the GS/OS Loader's slow path silently rejects multi-seg OMFs (see
   §D below for the empirical evidence).
4. `src/llvm/lib/Target/W65816/W65816ISelLowering.cpp` — `LowerShift`
   now inlines i32 SHL/SRL/SRA by N=1..4 instead of routing to
   `__ashlsi3`/`__lshrsi3`/`__ashrsi3`. See §E.
5. `src/llvm/lib/Target/W65816/W65816StackSlotCleanup.cpp` — pre-existing
   uncommitted change from prior turns; verify against git log before
   re-staging if recovery is fresh.

Earlier-mentioned files (snprintf.c, W65816InstrInfo.cpp,
W65816SjLjFinalize.cpp) have been checkpoint-committed and are no
longer in `git status`.

## Already-committed in this session arc

Per `git log --oneline -20` these are the recent checkpoint commits;
the diffs they contain are real and load-bearing.

The big ones (search by file or grep):
- **JSLpseudo Defs += IMG0..IMG15** in `W65816InstrInfo.td`. With the
  wider Defs, regalloc spills IMG-class vregs around calls instead of
  treating them as preserved.
- **`W65816RegisterInfo.cpp` `eliminateFrameIndex` for `STAfi`**:
  PHA-bracketed for non-A source (IMG/X/Y). The `lda dp; sta d,s` chain
  clobbered A; bracket preserves A while shifting offset by +2 between
  PHA and PLA. Defs=[A] kept on STAfi as safe over-approximation.
- **`W65816RegisterInfo.cpp` `eliminateFrameIndex` for `LDAfi`**:
  if `Dst = IMGn`, append `STA dp` so the IMG slot actually receives the
  loaded value. Previously only loaded into A; downstream
  `COPY $x = $imgN` (= `ldx $D?`) read garbage. **This was the smoking
  gun for `dadd(1.5, 2.5) → 0x4010_0000_3000_3000`.**
- **`W65816LowerWide32.cpp`** fixed-point erase loop. Was single-pass;
  REG_SEQUENCE got skipped if a not-yet-erased COPY consumer kept it
  alive at the iteration moment. Removed ~40 dead Wide32 vregs from
  `__adddf3`'s pre-RA MIR.
- **`src/llvm/test/CodeGen/W65816/i64-first-arg-img16.ll`** relaxed
  `stx 0xd / sta 0xd` to `0x{{[cd]}}` (regalloc now picks IMG8..15 too).

## Fixes landed (full list with rationale)

Each entry: what / why / where / what regression it would cause if reverted.

### A. Hash-shell DELETE bug → IMG caller-clobber

**Symptom**: `dbDelete("age")` returned 0 ("not found") instead of 1.
DELETE never ran; `COUNT` stayed at 2.

**Root cause**: `dbDelete` did `stx 0xd0` to save k_high, called `hashKey`,
then `pei 0xd0` to push k_high to strcmp. `hashKey` used $D0 as scratch
in its loop body (`sta 0xd0` storing the iterator's running-ptr-low). $D0
was clobbered by the time `pei 0xd0` ran. JSLpseudo Defs only listed
`A, X, Y, DPF0` — IMG slots were not modelled as caller-clobber.

**Fix**: `JSLpseudo Defs += [IMG0..IMG15]`.

**Cascading fallout** (each required its own fix):

#### A1. copyPhysReg vreg fallback
`storeRegToStackSlot`'s unpaired-Wide32 default branch hit `unreachable`
when called with a vreg source. Basic regalloc's InlineSpiller does
this. Fix: short-circuit virtual-reg cases to `TargetOpcode::COPY`.

#### A2. LowerWide32 fixed-point erase
Single-pass erase left ~40 dead Wide32 vregs in `__adddf3`. Pattern:
```
%X:wide32 = REG_SEQUENCE ...
%Y:wide32 = COPY %X
... uses of %Y rewritten by Pass 3 ...
```
Single-pass: REG_SEQUENCE skipped (COPY consumer still alive), then
COPY erased (now %X dead but loop already passed it). Fix: iterate
until no progress.

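The single-pass versus fixed-point behaviour can be reproduced on a toy model. This is a sketch under stated assumptions: `ToyInst`, `eraseDeadOnce`, and `eraseDeadToFixedPoint` are hypothetical stand-ins, not the types or functions in `W65816LowerWide32.cpp`, and "dead" here simply means "no remaining instruction uses the defined vreg".

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-in for a machine instruction: one defined vreg, some used vregs.
struct ToyInst {
    std::string opcode;
    int def;
    std::vector<int> uses;
};

// True if any remaining instruction still reads `vreg`.
static bool isUsed(const std::vector<ToyInst>& insts, int vreg) {
    for (const ToyInst& inst : insts)
        for (int use : inst.uses)
            if (use == vreg) return true;
    return false;
}

// One forward sweep: erase every instruction whose def has no remaining use.
// Returns how many instructions were erased.
static std::size_t eraseDeadOnce(std::vector<ToyInst>& insts) {
    std::size_t erased = 0;
    for (std::size_t i = 0; i < insts.size();) {
        if (!isUsed(insts, insts[i].def)) {
            insts.erase(insts.begin() + static_cast<std::ptrdiff_t>(i));
            ++erased;
        } else {
            ++i;
        }
    }
    return erased;
}

// Repeat the sweep until a pass erases nothing.
static std::size_t eraseDeadToFixedPoint(std::vector<ToyInst>& insts) {
    std::size_t total = 0;
    while (std::size_t n = eraseDeadOnce(insts)) total += n;
    return total;
}
```

With the `REG_SEQUENCE` placed before its `COPY` consumer, one sweep removes only the `COPY`; the repeated sweep removes both.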
#### A3. STAfi PHA-bracket
Without bracket, regalloc could schedule `$img0 = COPY $a` AFTER a
`STAfi`-with-IMG-source whose internal `lda dp` clobbered $a, silently
storing X's value where A's was expected.

#### A4. LDAfi-IMG-dest STA dp
**The big one.** With narrow IMG, regalloc kept Wide16 vregs in IMG
slots across calls, never needed `$imgN = LDAfi %stack.X`. With full
IMG, every cross-call spill needed it. The expansion only emitted
`LDA d,s` (load A) — never wrote to the IMG slot. Downstream
`COPY $x = $imgN` (= `ldx $D?`) read stale prior data. Manifested as
`dadd(1.5, 2.5) → 0x4010_0000_3000_3000` (mantissa garbage).

**Diagnostic that found it**: diff post-RA MIR narrow vs full IMG. Pre-RA
MIR was identical. Full had 6 `$imgN = LDAfi` instances; narrow had 0.
Narrow used `COPY $imgN = $a` patterns instead — those work correctly.

#### A5. FileCheck regex
`src/llvm/test/CodeGen/W65816/i64-first-arg-img16.ll` expected
`stx 0xd / sta 0xd`. Under full IMG clobber, regalloc picks IMG8..15
($C0/$C2) for cross-call arg saves. Relaxed to `0x{{[cd]}}`.

### B. C++ try/catch source-level path

Two bugs blocking real `clang++ -fsjlj-exceptions` source code:

#### B1. W65816SjLjFinalize catchtab ordering
`runOnFunction` erased landingpad insts at line ~245, then built the
catchtab at line ~290 via `LPadBB->getLandingPadInst()`. By that
point, landingpads were nullptr. The build loop's `if (!LP) continue;`
skipped every entry. Catchtab ended with just the `(0,0)` sentinel. LSDA
was 4 bytes of zeros. `findCatch` read the all-zero entry at `ctx->lsda`
and bailed. Result: any `throw` aborted.

Fix: capture catch-clause typeinfo Constants into a
`DenseMap<BasicBlock*, LPadInfo>` BEFORE erasing landingpads; the
catchtab build loop reads from the saved map.

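The ordering bug and its fix can be shown on a toy model that uses only the standard library. This is a sketch, not the pass itself: `ToyBlock`, `buildAfterErase`, and `buildFromSnapshot` are hypothetical names, a `std::map` stands in for the `DenseMap`, and a string stands in for the typeinfo Constant.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy basic block: may carry a landing pad with one catch-clause type name.
struct ToyBlock {
    int id;
    bool hasLandingPad;
    std::string typeInfo;
};

using CatchTab = std::vector<std::pair<int, std::string>>;

// Buggy order: erase landing pads first, then try to read them.
// Every block is skipped, so the table comes out empty.
static CatchTab buildAfterErase(std::vector<ToyBlock> blocks) {
    for (ToyBlock& b : blocks) b.hasLandingPad = false;  // the erase step
    CatchTab tab;
    for (const ToyBlock& b : blocks) {
        if (!b.hasLandingPad) continue;  // mirrors `if (!LP) continue;`
        tab.emplace_back(b.id, b.typeInfo);
    }
    return tab;
}

// Fixed order: snapshot the catch info, erase, then build from the snapshot.
static CatchTab buildFromSnapshot(std::vector<ToyBlock> blocks) {
    std::map<int, std::string> saved;
    for (const ToyBlock& b : blocks)
        if (b.hasLandingPad) saved.emplace(b.id, b.typeInfo);
    for (ToyBlock& b : blocks) b.hasLandingPad = false;  // the erase step
    CatchTab tab;
    for (const ToyBlock& b : blocks) {
        auto it = saved.find(b.id);
        if (it != saved.end()) tab.emplace_back(b.id, it->second);
    }
    return tab;
}
```

The design point is that the snapshot decouples table construction from the lifetime of the erased instructions.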
#### B2. copyPhysReg IMG-to-IMG PHA-bracket
Comment said "Caller is responsible for ensuring A is dead at this
program point (regalloc usually arranges this)." It doesn't, in
practice. Regalloc inserted IMG-to-IMG copies between `$a = COPY $img10`
and `STAfi $a, slot`. Unbracketed `lda src; sta dst` clobbered A.
The subsequent STAfi spilled garbage. Visible as `*p = 42` after
`__cxa_allocate_exception` storing 42 to wrong addr (indirect-long
setup got hi-half at lo-slot).

Fix: PHA-bracket. Cost +7 cyc / +2 bytes per IMG-IMG copy (rare).

**Verified end-to-end** via MAME breakpoints: `begin_catch` entered
with correct ExcHeader, `end_catch` entered with A=42, doTest returns
A=42 from real C++ `try { throw 42; } catch (int x) { return x; }`.

### C. Cleanup wins

- `runtime/src/snprintf.c:106` — removed `optnone` on `emitULong`. Smoke green.
- `runtime/src/snprintf.c:303` — removed `optnone` on `snprintf`. Smoke green.

### D. `omfEmit --stack-size` — DP/Stack segment for GS/OS Loader

Added `emitDpStackSeg` (`src/link816/omfEmit.cpp`). KIND=0x1012 (DP/Stack
| PRIVATE), LENGTH=RESSPC=requested-bytes, ALIGN=0x100, BANKSIZE=0, body
is a single END opcode. Apps can now request a stack of any
page-aligned size from 256B to 64KB (replacing GS/OS Loader's default
4KB allocation).

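The size constraint (256 ≤ N ≤ 65536, page-aligned) and the header values listed above can be captured in a few lines. This is a sketch only: `isValidStackSize`, `DpStackSegFields`, and `describeDpStackSeg` are hypothetical names, and the struct is a plain summary of the documented values, not the on-disk OMF header layout.

```cpp
#include <cassert>
#include <cstdint>

// Checks the documented --stack-size constraint: 256..65536 bytes,
// and a whole number of 256-byte pages.
inline bool isValidStackSize(uint32_t n) {
    return n >= 256u && n <= 65536u && (n % 256u) == 0u;
}

// Summary of the header values named in this section (not a byte layout).
struct DpStackSegFields {
    uint16_t kind;      // 0x1012: DP/Stack | PRIVATE
    uint32_t length;    // requested bytes
    uint32_t resspc;    // same as length
    uint32_t align;     // 0x100: page alignment
    uint32_t bankSize;  // 0
};

inline DpStackSegFields describeDpStackSeg(uint32_t n) {
    assert(isValidStackSize(n));
    return DpStackSegFields{0x1012u, n, n, 0x100u, 0u};
}
```

For example, `describeDpStackSeg(4096)` yields LENGTH and RESSPC of 4096, matching the `--stack-size 4096` invocation used in the Loader smoke.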
**Loader gotcha** (cost ~1 hour to debug): plain (non-ExpressLoad)
multi-segment OMFs do NOT launch under real GS/OS 6.0.2 — the Loader's
slow path silently rejects the file and our entry point never runs.
ExpressLoad-wrapped multi-segment OMFs DO work. Fix: `--stack-size` now
implicitly enables `--expressload` (the Loader's slow path is
empirically broken for our 2-seg layout). The DP/Stack seg is appended
AFTER the user code seg as SEGNUM=3; the Loader walks all segments by
KIND after the ExpressLoad fast-load step finishes.

Verified: `runViaFinder.sh /tmp/test_el_dp.omf --check 0x70=0x42 0x71=0x99`
passes under real GS/OS 6.0.2 with `--stack-size 4096 --expressload`.
Verified failure mode: same payload with `--stack-size` alone (no
`--expressload`) → `0x70=0x00` (program never executed). Documented
in `feedback_loader_multi_seg_needs_expressload.md`.

Smoke updated: 132/132 expects 3 segs (ExpressLoad + code + DP/Stack)
when `--stack-size` is supplied.

### E. i32 shift-by-N inlined (was full libcall) — speed win

`W65816ISelLowering.cpp` `LowerShift` now inlines i32 SHL/SRL/SRA by
N=1..4. Previously every i32 shift went through `__ashlsi3`/
`__lshrsi3`/`__ashrsi3` — ~300+ cyc per call. popcount benchmark:
**8320 → 6888 cyc/call, 17% faster**. Implementation extracts
`Wide32` halves via `extractWide32Lo/Hi`, applies per-step
`lsr; ror`-equivalent SDAG ops with explicit carry propagation
(`(Hi & 1) << 15` for SRL/SRA's lo-fill, `Lo >> 15` for SHL's
hi-fill), recombines via `buildWide32`. N>4 still routes to libcall
— the unrolled cost (~5 i16 ops × N) crosses libcall overhead at N≈5.

Documented in `feedback_i32_shift_inline.md`.

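The carry propagation between halves can be checked in plain C++. This is a minimal sketch of the arithmetic only, assuming one-bit steps repeated N times; `Halves`, `split32`, `join32`, `shl32`, `srl32`, and `sra32` are hypothetical names and do not correspond to the SelectionDAG helpers in `LowerShift`.

```cpp
#include <cassert>
#include <cstdint>

// A 32-bit value held as two 16-bit halves.
struct Halves {
    uint16_t lo;
    uint16_t hi;
};

inline Halves split32(uint32_t v) {
    return Halves{static_cast<uint16_t>(v & 0xFFFFu),
                  static_cast<uint16_t>(v >> 16)};
}

inline uint32_t join32(Halves h) {
    return (static_cast<uint32_t>(h.hi) << 16) | h.lo;
}

// Logical shift right: each step moves hi's bit 0 into lo's bit 15.
inline Halves srl32(Halves h, unsigned n) {
    assert(n >= 1 && n <= 4);
    for (unsigned i = 0; i < n; ++i) {
        uint16_t fill = static_cast<uint16_t>((h.hi & 1u) << 15);
        h.lo = static_cast<uint16_t>((h.lo >> 1) | fill);
        h.hi = static_cast<uint16_t>(h.hi >> 1);
    }
    return h;
}

// Arithmetic shift right: same lo-fill, but hi keeps its sign bit.
inline Halves sra32(Halves h, unsigned n) {
    assert(n >= 1 && n <= 4);
    for (unsigned i = 0; i < n; ++i) {
        uint16_t fill = static_cast<uint16_t>((h.hi & 1u) << 15);
        uint16_t sign = static_cast<uint16_t>(h.hi & 0x8000u);
        h.lo = static_cast<uint16_t>((h.lo >> 1) | fill);
        h.hi = static_cast<uint16_t>((h.hi >> 1) | sign);
    }
    return h;
}

// Shift left: each step moves lo's bit 15 into hi's bit 0.
inline Halves shl32(Halves h, unsigned n) {
    assert(n >= 1 && n <= 4);
    for (unsigned i = 0; i < n; ++i) {
        uint16_t fill = static_cast<uint16_t>(h.lo >> 15);
        h.hi = static_cast<uint16_t>((h.hi << 1) | fill);
        h.lo = static_cast<uint16_t>(h.lo << 1);
    }
    return h;
}
```

Comparing `join32(srl32(split32(v), n))` against `v >> n` for a handful of values is a cheap host-side check before looking at generated code.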
## Still-open work areas

Each carries a fair-warning note for whoever picks it up.

### 1. qsort/bsearch `optnone` — REMOVED 2026-05-08
Source-restructured `qsort`: split the inner loop into a
`__attribute__((noinline))` helper `qsortInner` (4 args: base, cur,
size, cmp). Outer `qsort` just iterates `i = 1..nmemb-1` and calls
`qsortInner(base, base + i*size, size, cmp)`. This drops outer
qsort's i32-vreg simultaneous-live count below the inline-spill
OOM threshold; both halves compile cleanly at -O2 + basic regalloc.

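The outer/inner split can be sketched as follows. The shape of the calls (outer loop over `i = 1..nmemb-1`, inner helper taking `base` and `cur`) comes from the description above; that the algorithm is an insertion sort is an assumption, and `qsortSketch`, `qsortInnerSketch`, `swapBytes`, and `cmpIntSketch` are hypothetical names rather than the runtime's code.

```cpp
#include <cassert>
#include <cstddef>

using CmpFn = int (*)(const void*, const void*);

// Swaps two `size`-byte elements one byte at a time.
static void swapBytes(char* a, char* b, std::size_t size) {
    for (std::size_t i = 0; i < size; ++i) {
        char t = a[i];
        a[i] = b[i];
        b[i] = t;
    }
}

// Inner helper, kept out of line so its live values do not add to the
// outer loop's register pressure. Sinks the element at `cur` toward `base`.
__attribute__((noinline))
static void qsortInnerSketch(char* base, char* cur, std::size_t size, CmpFn cmp) {
    while (cur > base && cmp(cur - size, cur) > 0) {
        swapBytes(cur - size, cur, size);
        cur -= size;
    }
}

// Outer loop: only the index, base pointer, size, and comparator stay live.
void qsortSketch(void* basePtr, std::size_t nmemb, std::size_t size, CmpFn cmp) {
    assert(size > 0);
    char* base = static_cast<char*>(basePtr);
    for (std::size_t i = 1; i < nmemb; ++i)
        qsortInnerSketch(base, base + i * size, size, cmp);
}

// Example comparator for int elements.
static int cmpIntSketch(const void* l, const void* r) {
    int a = *static_cast<const int*>(l);
    int b = *static_cast<const int*>(r);
    return (a > b) - (a < b);
}
```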
`bsearch` `optnone` was kept-for-symmetry — once removed, it just
worked. The IMG-clobber + LDAfi-IMG-store backend fixes from
2026-05-07 had already resolved its underlying pressure issue.

Smoke stays green (now 132/132).

### 2. gmtime_r `optnone`
`runtime/src/timeExt.c:69`. NOT a backend bug — IR-level optimization
issue (loop rotation + IndVar simplify mis-evaluating
`days >= 365L + (__isLeap(...) ? 1 : 0)`). Fixing requires deciding
which combine pass is wrong and why. Out of scope for backend work.

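A host-side reference for the year-peeling comparison is useful as an oracle when bisecting passes. This is a sketch under assumptions: only the comparison `days >= 365L + (isLeap ? 1 : 0)` is taken from the note above; the 1970 epoch, the loop shape, and the names `isLeapSketch`, `YearDay`, and `splitDays` are illustrative and may not match `timeExt.c`.

```cpp
#include <cassert>

// Gregorian leap-year rule.
inline bool isLeapSketch(long year) {
    return (year % 4 == 0 && year % 100 != 0) || (year % 400 == 0);
}

struct YearDay {
    long year;
    long dayOfYear;  // 0-based
};

// Peels whole years off a day count measured from 1 January 1970 (assumed).
inline YearDay splitDays(long days) {
    assert(days >= 0);
    long year = 1970;
    for (;;) {
        long yearLen = 365L + (isLeapSketch(year) ? 1 : 0);
        if (days < yearLen) break;  // comparison must be evaluated at run time
        days -= yearLen;
        ++year;
    }
    return YearDay{year, days};
}
```

If the comparison were folded to false at compile time, as described for the optimized build, every input would come back with the starting year and the full day count.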
### 3. softDouble noinlines
`runtime/src/softDouble.c:30` (`dpack`) and `:51` (`dclass`). Removing
`dpack` noinline broke dadd this session — register pressure for
`__adddf3`/`__muldf3`/`__divdf3`. Architectural for the same reason as
qsort.

### 4. Greedy regalloc retry — TRIED, blocked
Tested 2026-05-08. Greedy fails immediately on `atoi` in libc.c:
```
LiveRangeEdit.cpp:200: void llvm::LiveRangeEdit::eliminateDeadDef(...):
Assertion `MI->allDefsAreDead() && "Def isn't really dead"' failed.
```
Same upstream LLVM bug class as the dadd full-IMG attempt — sub-register
pair partial defs that the regalloc treats as fully dead. Greedy is
genuinely incompatible with the W65816's split-half subreg-pair patterns
until the upstream LLVM issue is patched. Reverted to basic regalloc.
`feedback_greedy_high_pressure.md` already covers this.

### 5. gmtime_r `optnone` — TRIED, blocked
Tested 2026-05-08. Hoisting `yearLen` to a long local (avoiding the
double-recompute of `365L + (__isLeap ? 1 : 0)`) didn't help; adding
`volatile` to the local also didn't help. IR optimizer is still
folding the comparison to compile-time-false. Source-level C
restructuring won't dodge it; needs IR-pass-level work to identify
which combine pass mis-evaluates and why. optnone stays.

## How to verify recovery

```bash
cd /home/scott/claude/llvm816
git status                                # 5 modified files listed above
cd tools/llvm-mos-build && ninja llc clang   # rebuild backend (~5 min)
cd /home/scott/claude/llvm816
cd src/link816 && make && cd ../..        # rebuild link816 + omfEmit
bash runtime/build.sh                     # build runtime
bash scripts/smokeTest.sh                 # should end "all smoke checks passed"
bash scripts/benchCyclesPrecise.sh        # popcount should be ~6888 cyc
```

Loader smoke (validates DP/Stack seg under real GS/OS 6.0.2):
```bash
# Build a simple test program with --stack-size, run via Finder.
tools/omfEmit --input X.bin --map X.map --base 0x1000 --entry __start \
    --output /tmp/t.omf --stack-size 4096 --relocs X.relocs
bash scripts/runViaFinder.sh /tmp/t.omf --check 0x70=0x42 0x71=0x99
```

If smoke fails, the likely cause is one of the 5 uncommitted files
got reverted; check `git status` and re-apply. If popcount bench
regressed past ~7500 cyc, suspect the i32-shift-inline change in
`W65816ISelLowering.cpp` was lost.

## Diagnostic tools that worked

For posterity — these are the patterns that paid off this session.

### Pre-RA vs post-RA MIR diff
```bash
clang -mllvm -stop-before=regallocbasic -S ...    # pre-RA
clang -mllvm -stop-after=virtregrewriter -S ...   # post-RA (post-virtregrewriter)
```
Diff narrow-IMG vs full-IMG post-RA MIR for the failing function.
Pre-RA is identical (same IR), so the diff isolates regalloc-decision
divergence. Look at every NEW pattern that appears only in the failing
build — `$imgN = LDAfi` was the smoking gun for dadd.

### Pass-by-pass IR/MIR dumps
```bash
clang -mllvm -print-after=w65816-lower-wide32 -S ...
clang -mllvm -print-after-all -S ... 2>dump.txt
```

### MAME debugger via xvfb-run
```bash
xvfb-run -a mame apple2gs ... -debug -debugger qt -oslog -seconds_to_run N
```
With autoboot Lua: load .bin into bank 0 (skip `$C000..$CFFF` I/O),
set CPU state, then `cpu.debug:bpset(addr, condition, action)` with
actions like `"logerror \"...\\n\",a,x,...; go"`. `logerror` with format
args goes to stdout under `-oslog`. Memory reads in expressions:
`b@(addr)`, `w@(addr)`. Watchpoints:
`cpu.debug:wpset(prog_space, "w", addr, len, condition, action)`.

**AVOID**: `add_machine_pause_notifier` + `cpu.debug:go()` in callback —
segfaults from reentrancy. `printf` in actions stays in debugger console
(not -oslog). `tracelog` also debugger-console only.

### Trace methodology (find divergence point)
1. Set BPs at every `JSLpseudo` callee in the failing function.
2. Capture A/X/Y/DPF0 at each return.
3. Find first divergent return between known-good and failing builds.
4. The instruction sequence between previous-OK and first-divergent
   return is where the bug lives.

This pattern found the dadd bug at `jsl@0x207f → __lshrsi3(0x8001_8000, 3)`
in 30 minutes. Recommended.

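Step 3 of the trace methodology is mechanical once both traces are captured. A minimal sketch, assuming each captured return is reduced to a callee address plus A/X/Y (DPF0 is left out for brevity); `ReturnSnapshot`, `sameSnapshot`, and `firstDivergence` are hypothetical names, and parsing the `-oslog` output into snapshots is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Register state captured when a callee returns.
struct ReturnSnapshot {
    uint32_t callee;  // address of the called routine
    uint16_t a;
    uint16_t x;
    uint16_t y;
};

inline bool sameSnapshot(const ReturnSnapshot& l, const ReturnSnapshot& r) {
    return l.callee == r.callee && l.a == r.a && l.x == r.x && l.y == r.y;
}

// Index of the first return that differs between the two traces, or -1 if
// they match. A length mismatch after a matching prefix counts as a
// divergence at the shorter trace's length.
inline long firstDivergence(const std::vector<ReturnSnapshot>& good,
                            const std::vector<ReturnSnapshot>& bad) {
    std::size_t n = good.size() < bad.size() ? good.size() : bad.size();
    for (std::size_t i = 0; i < n; ++i)
        if (!sameSnapshot(good[i], bad[i])) return static_cast<long>(i);
    if (good.size() != bad.size()) return static_cast<long>(n);
    return -1;
}
```

The code between return `k-1` and return `k`, where `k` is the reported index, is the region to single-step.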
## Memory notes referenced

(Filenames under `/home/scott/.claude/projects/-home-scott-claude-llvm816/memory/`.)

- `feedback_strstr9_long_haystack.md` — the hash-shell bug story.
- `feedback_cpp_subset.md` — C++ subset, including the SjLj fix.
- `feedback_ptr32_frame_limit.md` — was 5 days stale; updated 2026-05-07
  to "DONE, 131/131 smoke green".
- `feedback_jslpseudo_caller_save.md`, `feedback_libcall_img_clobber.md`,
  `feedback_img_slot_expansion.md`, `feedback_greedy_high_pressure.md` —
  related backend topics.
- `feedback_loader_multi_seg_needs_expressload.md` — **new 2026-05-08**.
  Multi-seg OMFs need ExpressLoad to launch under real Loader.
- `feedback_i32_shift_inline.md` — **new 2026-05-08**. Inline i32
  shift-by-N for N=1..4; first quantified bench-vs-self speed win.
- `feedback_speed_over_size.md` — **new 2026-05-07**. Optimization
  priorities: cycle count over byte count, full stop.

## Next session candidates (ranked)

evalAt at 1.86× vs Calypsi is the structural floor for peephole work
(see `feedback_evalat_structural_gap.md`). Further gains need:

1. **i64-by-pointer ABI** (rejected this session — diminishing returns).
   Pass doubles by ptr instead of value: saves ~120 cyc per evalAt call.
   Requires runtime rewrite, OMF compat checks, every double caller
   updated. Risk:reward too high for the size of the gain.
2. **`__divdf3` / `__adddf3` algorithmic improvements**. ddiv 1261 cyc
   could drop via Newton-Raphson reciprocal multiplication (`a * (1/b)`
   instead of bit-by-bit long division). Major rewrite, but our
   `__muldi3` short-circuit makes the multiplications cheap now.
3. **Higher-resolution cycle timer**. HBL counter is 8-bit and wraps
   at ~256 ticks; combining scan-line position + frame counter would
   give per-bench resolution better than ±65 cyc. Would unblock
   benchmarking sub-loop changes (e.g., the LSR-dp shift form).
4. **More peepholes from the audit**. Phase 4 STA_StackRel extension
   landed but doesn't fire in current libc (frame sizes too large).
   If callers shrink frames via better SSM, more functions become
   eligible.
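
The Newton-Raphson idea in candidate 2 can be prototyped on a host before touching the runtime. This is a sketch only: it uses hardware `double` arithmetic to show the iteration `x ← x·(2 − m·x)`, whereas a soft-float `__divdf3` would run the same recurrence on integer mantissas and add a final rounding correction (not shown). `reciprocalNR` and `divideNR` are hypothetical names, and the four-iteration count assumes the linear starting estimate used here.

```cpp
#include <cassert>
#include <cmath>

// Approximates 1/b by Newton-Raphson on the mantissa of b.
inline double reciprocalNR(double b) {
    assert(b != 0.0 && std::isfinite(b));
    const double sign = b < 0.0 ? -1.0 : 1.0;
    int exp2 = 0;
    // |b| = m * 2^exp2 with m in [0.5, 1).
    const double m = std::frexp(std::fabs(b), &exp2);
    // Linear starting estimate of 1/m; relative error at most 1/17.
    double x = 48.0 / 17.0 - (32.0 / 17.0) * m;
    // Each step roughly squares the relative error.
    for (int i = 0; i < 4; ++i) x = x * (2.0 - m * x);
    // Undo the scaling: 1/b = (1/m) * 2^-exp2.
    return sign * std::ldexp(x, -exp2);
}

// Division as multiplication by the approximated reciprocal.
inline double divideNR(double a, double b) {
    return a * reciprocalNR(b);
}
```

Each iteration costs two multiplications and one subtraction, which is why the multiplication cost matters for whether this beats long division.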