65816-llvm-mos/docs/GAP_CLOSURE_PLAN.md
Scott Duensing da095402ec Updated
2026-06-02 23:17:57 -05:00

llvm816 gap-closure: comprehensive step plan

This is the full, ordered, dependency-aware plan for closing the 18 feature gaps identified in the 2026-05-30 audit, rolled together with every prerequisite the adversarial reviewers found. The earlier "master plan" stopped at milestones + per-item criticism; this document is the actionable step list.

Total effort (reviewer-adjusted): roughly 700-1000 hours across all 18 items plus the Phase 0 + Phase 1 prerequisites the reviewers added. Original briefs sum to ~280h; reviewers added ~420h of hidden work most planners missed.

Default audience: expert C/C++ developer porting code to the Apple IIgs, with a secondary path for retrocomputing tinkerers who want source-level debugging.

The plan is organized in six phases, with hard dependency arrows. Each step is a concrete, individually-shippable piece of work; nothing is "TBD" at the step level.


Phase 0 - Architectural decisions (DECIDED 2026-05-31)

| # | Topic | Decision |
|---|---|---|
| 0.1 | EH model | SJLJ + _Unwind_RaiseException-over-SJLJ stub |
| 0.2 | LTO model | ThinLTO |
| 0.3 | Sanitizer scope | UBSan-min + coverage only (no ASan) |
| 0.4 | localvars split | Full -O0 + -O2/IMG in one shot (override of recommendation) |
| 0.5 | cxxstdlib split | cxxchrono first, then cxxstream+format+path |
| 0.6 | Sprite harness | Standalone first, desktop-coupled follow-up |
| 0.7 | Resource fork delivery | AppleSingle blob |
| 0.8 | clangdwarffix approach | BPF-style MAI flag spike first, escalate to new relocs only if needed |
| 0.9 | IigsSoundParmT | Fix as breaking change (existing demos likely already broken) |
| 0.10 | rename cross-dir | Implement copy+delete fallback now (override of recommendation) |

Impact of overrides:

  • 0.4 (full localvars in one landing): Phase 3.2 and 3.3 collapse into a single ~55h item. Risk profile is higher (multi-stage delivery is now one-stage), but with Phase 1.5 DBG_VALUE audit landed first the foundation is solid. Expect 3-5 additional clang DWARF bugs to surface during -O2 / IMG work; budget contingency.
  • 0.10 (rename copy+delete fallback): Adds ~6h to Phase 2.3. Real work is the error-recovery path (partial-copy state, partial-delete state, source vanished mid-operation). Use the existing GS/OS class-1 calls (Open/Read/Write/Create/Close/Destroy) to compose the fallback; no new toolbox wrappers required.

Why these (rationale)

0.1 Pick exception-handling model (sanitizers, unwinder, cxxstdlib all depend on this)

Three options:

  • A. Keep SJLJ as default. Ship _Unwind_RaiseException-over-SJLJ stub for third-party C++ libraries. ~20h. Loses no functionality. Strongly recommended.
  • B. Build a real DWARF CFI unwinder (libunwind port + backend CFI emission + per-MIR-pass CFI annotation + __jsl_indir hand-CFI). ~260h floor. Real throw across non-instrumented frames.
  • C. Make EH model a Subtarget feature with two MCAsmInfo subclasses, ship both. ~20h extra plumbing on top of (A) or (B).

Recommendation: A. Reviewer for unwinder called (B) a multi-week structural change and a foot-in-the-door for a long string of follow-on bugs.

0.2 Pick LTO model

  • A. ThinLTO. Preserves per-TU codegen attachments so Lua's per-file -mllvm -regalloc=basic keeps working. Summary-based inlining decisions are less prone to the over-inlining the project has already fought (feedback_lapi_inline_threshold.md, feedback_coremark_matrix_test_regression.md).
  • B. Full LTO with whole-module merge. Simpler tool, harder integration. Per-TU regalloc becomes unrepresentable.

Recommendation: A. Reviewer called the original "full LTO is simpler" choice exactly backwards for this codebase.

0.3 Pick sanitizer scope

  • A. UBSan-minimal + coverage only. Achievable. ~22h.
  • B. UBSan + ASan + coverage. ASan's 8:1 shadow-memory model does not fit the 65816 (full 16MB → 2MB shadow; programs run in 1-2 banks). Reviewer called the brief wrong on architectural grounds.

Recommendation: A. Document ASan as out-of-scope.
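The sizing argument in option B can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the 24-bit address space, 64 KiB banks, and the 8:1 ratio come from the text above, and `typical_program_banks` is an illustrative value taken from the "1-2 banks" remark.

```python
# Sizing sketch for an 8:1 shadow map over the full 65816 address space.
ADDRESS_BITS = 24            # 65816 addresses are 24 bits wide
BANK_SIZE = 64 * 1024        # one bank = 64 KiB
SHADOW_SCALE = 8             # ASan: 8 application bytes per shadow byte

address_space = 1 << ADDRESS_BITS              # 16 MiB
shadow_bytes = address_space // SHADOW_SCALE   # 2 MiB
shadow_banks = shadow_bytes // BANK_SIZE       # banks consumed by shadow alone

typical_program_banks = 2    # illustrative: "programs run in 1-2 banks"
shadow_to_program_ratio = shadow_banks // typical_program_banks

print(f"shadow = {shadow_bytes // (1024 * 1024)} MiB = {shadow_banks} banks, "
      f"{shadow_to_program_ratio}x a {typical_program_banks}-bank program")
```

A full-space shadow map would occupy many times the memory of the program it is checking, which is the architectural mismatch the reviewer pointed at.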

0.4 Pick localvars split

  • A. Ship -O0 stack-resident locals only (20-30h). Faster delivery, narrower payload.
  • B. Ship -O0 + -O2/IMG-resident + location-list crossing PCs + inlined subroutines (50-80h).

Recommendation: A first, then (B) as a follow-up. Reviewer's split point is the natural project boundary.

0.5 Pick cxxstdlib split

  • A. Ship cxxchrono only (etl::chrono + libc time hooks). 3-4h.
  • B. Ship cxxstream+format+path (string_stream + format + iigs::path + <cmath> shim + smoke). 12-15h.
  • C. Both, but as separate landings.

Recommendation: C, with A landing first. Reviewer noted mixing them hides the format/<cmath> rabbit hole.

0.6 Pick sprite-engine harness

  • A. Standalone. sprite.c does its own $C029 + SCB + palette init. Use runInMame.sh bare-metal harness.
  • B. Desktop-coupled. Relies on paintDesktopBackdrop, runs through GS/OS Finder launch (runViaFinder.sh).
  • C. Both, ship A first.

Recommendation: C.

0.7 Pick resource-fork delivery shape

  • A. AppleSingle blob (one file = data fork + resource fork; cadius auto-detects). Cleaner.
  • B. _ResourceFork.bin sidecar (cadius Prodos_Add.c:386 supports it). Cleaner separation.

Recommendation: A. Verify via a 1-hour disposable spike before writing the bundler.

0.8 Pick clangdwarffix approach

  • A. BPF-style one-liner: MAI->setDwarfUsesRelocationsAcrossSections(false). Reviewer cites BPFTargetMachine.cpp:87 as the precedent. If it works the whole 14h plan collapses to ~1h.
  • B. New R_W65816_DATA32 + R_W65816_PCREL32 relocs through the full pipeline (MC → ELF writer → link816 → pc2line.py).

Recommendation: Spike A first (10 minutes). Fall back to B only if A introduces new failures. Reviewer documented (B) is a strict superset of work covered by (A).

0.9 Decide whether to fix IigsSoundParmT as a breaking change

The current in-tree struct (6 bytes) does not match ORCA's authoritative SoundParamBlock (18 bytes). iigsPlayDocSample is almost certainly silently broken today. Either:

  • A. Fix the struct as a breaking change. Existing demos relying on it will need migration. Reviewer believes none actually work today, so the "breakage" is theoretical.
  • B. Add new corrected API (iigsPlaySoundV2?), leave broken one.

Recommendation: A. Item docram cannot be delivered honestly otherwise.

0.10 Decide rename() / cross-directory policy

GS/OS ChangePath ($2004) is rename-in-place-only. POSIX rename(old, new) across directories is impossible without an explicit copy-then-delete path.

  • A. Reject cross-dir new-paths up front with EINVAL.
  • B. Implement copy-then-delete fallback. ~6h, error-recovery is the hard part.
  • C. Accept divergence; document loudly.

Recommendation: A initially, with (B) deferred until a real user complains.


Phase 1 - Foundational prerequisites (everything later depends on these)

These are NOT in the original 18 items but were surfaced as preconditions by multiple reviewers. They unblock the actual feature work.

1.1 GS/OS fopen hang investigation [BLOCKS: resourcemgr runtime, tmpfile real path, cxxstdlib::filesystem, posixfile real I/O]

JSL $E100A8 doesn't return under real GS/OS 6.0.2 (per STATUS.md).

  • Reproduce under MAME with the -debug -debugger qt -oslog bpset workflow documented in SESSION_RECOVERY.md.
  • Bisect: ABI mismatch, stack-shape mismatch, missing tool init, DP/SP layout.
  • If unsolvable in a 4-8h budget: document the limitation and route all GS/OS-dependent items through the stub linker (iigsGsosStub.s) with the honest-failure sentinel from step 1.2.

Effort: 8-16h investigation. If fix is found, +4-8h to land. If unsolvable, project moves on with stub-only paths for affected items.

1.2 Stub-mode sentinel for honest iigsGsosStub.s [BLOCKS: tmpfile, posixfile, cursor]

iigsGsosStub.s currently makes every GS/OS call succeed silently. __gsosAvailable() returns TRUE in stub mode, so newly-added wrappers (gsosDestroy, gsosChangePath, prefix/dir/info) will fall through the catch-all stub and appear to succeed while doing nothing.

Two options to fix:

  • Add explicit error stubs for each new wrapper (returns -1 / sets errno).
  • Add a __gsosIsRealImpl() sentinel that distinguishes "real GS/OS linked" from "universal-success stub linked".

Recommendation: Sentinel. ~3h. One source of truth.

1.3 FK_Data_4R_W65816_*32 reloc fix [BLOCKS: clangdwarffix → debugger → localvars → profiler]

Independent of Phase 0.8 choice — once the BPF-style spike either passes or fails, this is the actual step.

Reviewer surfaced multiple landmines the original plan missed:

  • ELFObjectWriter::recordRelocation (MC/ELFObjectWriter.cpp:1329-1349) converts in-section diffs to PC-relative. Need BOTH R_W65816_DATA32 AND R_W65816_PCREL32.
  • link816.cpp:1275 has a hardcoded r.offset + 3 > sec.size width check inside writeDebugSidecar that must become reloc-type-driven.
  • Reloc-type emission must land BEFORE the MC change starts emitting the new types or every intermediate -g build dies on unknown reloc.
  • A ninja clean && ninja is required (TableGen-emitted enum dependencies do not play well with incremental builds).

Steps:

  1. (a) Spike setDwarfUsesRelocationsAcrossSections(false) in W65816MCAsmInfo. Rebuild clang, xxd .debug_line of a -g hello.c, verify non-zero unit_length / header_length. If green: skip to step 1.3.h.
  2. (b) Add R_W65816_DATA32 + R_W65816_PCREL32 to W65816FixupKinds.h / W65816AsmBackend.cpp.
  3. (c) Extend W65816ELFObjectWriter::getRelocType to dispatch FK_Data_4 by IsPCRel.
  4. (d) Add 4-byte reloc handlers to link816.cpp::applyReloc (DATA32 = write sectionBase + addend; PCREL32 = write target - patchAddr + addend).
  5. (e) Generalize the r.offset + 3 > sec.size width check to use a small switch on reloc type.
  6. (f) Land link816 + AsmBackend in one commit (so intermediate builds don't die). Then land the MC switch that starts emitting the new types.
  7. (g) Update pc2line.py to use the now-correct unit_length / header_length, keeping the tolerant zero-fallback for older artifacts.
  8. (h) Audit emitAbsoluteSymbolDiff / emitDwarfUnitLength / makeEndMinusStartExpr callers; verify .debug_frame, .debug_loclists, .eh_frame also work.
  9. (i) Drop "llvm-dwarfdump consumes without warnings" from the ships-as criteria: EM_NONE will still warn. File EM_ assignment as a separate gap item.

Effort: 1h (best case, MAI flag works) or 14-20h (full reloc path). Risk HIGH on rebase pain.
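Steps (d) and (e) can be modelled in a few lines. The sketch below is a Python model of the logic, not link816's actual code: the DATA32 and PCREL32 formulas are the ones stated in step (d), while `ADDR24` is a hypothetical stand-in name for the existing 3-byte relocation and its formula is assumed.

```python
# Model of a reloc-type-driven width check plus the two 4-byte handlers.
# "ADDR24" is a hypothetical name for the existing 3-byte reloc type.
RELOC_WIDTH = {"ADDR24": 3, "DATA32": 4, "PCREL32": 4}

def apply_reloc(sec: bytearray, kind: str, offset: int, *,
                section_base: int = 0, target: int = 0,
                patch_addr: int = 0, addend: int = 0) -> None:
    width = RELOC_WIDTH[kind]
    # Generalized form of the hardcoded `r.offset + 3 > sec.size` check.
    if offset + width > len(sec):
        raise ValueError(f"{kind} at {offset:#x} overruns section")
    if kind == "PCREL32":
        value = target - patch_addr + addend
    else:
        # DATA32 per step (d); assumed to hold for ADDR24 as well.
        value = section_base + addend
    mask = (1 << (8 * width)) - 1          # wrap negatives to two's complement
    sec[offset:offset + width] = (value & mask).to_bytes(width, "little")

section = bytearray(8)
apply_reloc(section, "DATA32", 0, section_base=0x012000, addend=0x34)
apply_reloc(section, "PCREL32", 4, target=0x100, patch_addr=0x104)
print(section.hex())
```

Keeping the width in a single table means a new reloc type cannot be added without also declaring how many bytes it patches, which is the failure mode the hardcoded `+ 3` check invites.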

1.4 Backend prerequisites bundle [BLOCKS: sanitizers, lto, unwinder]

Three small backend changes that unblock multiple items:

  • (a) setOperationAction(ISD::RETURNADDR, MVT::i32, Expand) in W65816ISelLowering.cpp. Today any code calling __builtin_return_address(0) ICEs clang (since pointers are i32 but RETURNADDR is registered for i16 Expand only). Required by UBSan's caller-pc dedup AND by user code. ~30 min.
  • (b) setOperationAction(ISD::TRAP, MVT::Other, Custom) + lower to BRK_pseudo that writes a sentinel to $70 before halting. Required by -fsanitize-trap=undefined. ~2h.
  • (c) Minimal W65816TTI (TargetTransformInfo) returning 4× generic cost for i32 ops and 20× for soft-float libcalls. Required by LTO inliner so it doesn't over-inline based on generic cost defaults. ~6h.

Effort: ~9h total. Land as three separate commits.

1.5 DBG_VALUE preservation audit across custom MIR passes [BLOCKS: -O2 localvars]

Custom MIR passes (W65816StackRelToImg, W65816StackSlotMerge, W65816SepRepCleanup, W65816LowerWide32, W65816ImgCalleeSave, W65816SpillToX, W65816TiedDefSpill) only use getDebugLoc() for source-line info. None call MachineInstr::transferDbgValues() when slots move/coalesce or when stack slots get promoted to IMG slots.

For each pass: grep for slot/register replacement. Each call site that substitutes one operand for another must propagate DBG_VALUE.

Effort: 8-15h. Without this, -O2 locals are vapor regardless of how good the DWARF parser is.

1.6 IigsSoundParmT correction [BLOCKS: docram]

Phase 0.9 decided this is a breaking change. Steps:

  1. Replace 6-byte struct in runtime/include/iigs/sound.h with ORCA's 18-byte layout (Pointer waveStart (4B) / Word waveSize pages (2B) / Word freqOffset (2B) / Word docBuffer (2B) / Word bufferSize (2B) / Pointer nextWavePtr (4B) / Word volSetting (2B)).
  2. Rewrite iigsPlayDocSample to populate the corrected struct. Move channel out of the struct into FFStartSound's arg0.
  3. Audit existing callsite at smokeTest.sh:1147 and migrate.
  4. Update the claims at README.md:144-147 and in STATUS.md that DOC-RAM staging is not wrapped; those lines are about to be wrong.
  5. Verify under real GS/OS or a known-good MAME version (silence vs. audio is the validation gate).

Effort: 4-6h.
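The 18-byte figure can be cross-checked against the field list in step 1. The sketch below assumes little-endian packing with no padding; the field names and sizes are the ones listed above, and the sample values are purely illustrative.

```python
import struct

# Field order and sizes as listed in step 1 (Pointer = 4 bytes, Word = 2 bytes).
SOUND_PARM_FORMAT = "<IHHHHIH"   # little-endian, no padding (assumed)
SOUND_PARM_FIELDS = ("waveStart", "waveSize", "freqOffset", "docBuffer",
                     "bufferSize", "nextWavePtr", "volSetting")

def pack_sound_parm(**fields: int) -> bytes:
    """Pack a parameter block in the field order above."""
    return struct.pack(SOUND_PARM_FORMAT,
                       *(fields[name] for name in SOUND_PARM_FIELDS))

block_size = struct.calcsize(SOUND_PARM_FORMAT)
# Illustrative values only; they are not taken from any real demo.
sample = pack_sound_parm(waveStart=0x021000, waveSize=1, freqOffset=0x0200,
                         docBuffer=0, bufferSize=0, nextWavePtr=0,
                         volSetting=0xFF)
print(block_size, sample.hex())
```

The same size check could be expressed as a static assertion in sound.h so the struct cannot silently drift back to the wrong size.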

1.7 Build/harness prerequisites bundle

  • (a) runInMame.sh --check-u8 <addr>=<val> for byte-level SHR pixel checks. Required by sprites. ~1h.
  • (b) runViaFinder.sh --data /PATH=file injection. Required by any GS/OS demo with file I/O (tmpfile, posixfile GS/OS path, eventually cxxstdlib::filesystem). ~1h.
  • (c) buildGno-launched MAME smoke harness. Currently smoke runs C++ via inline cpp HEREDOCs at build-time only; the cxxsmoke, cxxstdlib, and cursor smoke checks need actual MAME-launched OMF execution. Mirror tests/lua/ pattern. ~4h.
  • (d) Fix softDouble.o.bak in runtime/ (15 KB, stale, dated May 1). Required before buildsystem can do file(GLOB) over runtime/*.o. Either delete the .bak or generate the imports manifest from runtime/build.sh. ~30 min.
  • (e) Generate W65816RuntimeImports.cmake from runtime/build.sh (or have build.sh emit a manifest). Single source of truth for the runtime .o list. ~2h.

Effort: ~9h.

1.8 srand seeding + ReadTimeHex toolbox call [BLOCKS: tmpfile uniqueness, posixfile mkstemp]

extras.c:124 seeds rand() to constant 1. mkstemp's claimed uniqueness guarantee is a lie without time-based seeding.

  • Expose ReadTimeHex ($0D03) in iigsToolbox.s (currently absent).
  • Add srand hook in crt0Gsos.s + crt0Gno.s that reads time and seeds.

Effort: ~2-3h.

1.9 <cmath> C++ shim [BLOCKS: cxxstdlib (format / chrono with FP)]

clang++ on llvm-mos has no system C++ stdlib; #include <cmath> fails. ETL's format.h #includes <cmath> when ETL_USING_FORMAT_FLOATING_POINT=1.

  • Create runtime/include/c++/cmath that pulls <math.h> (already extern-C-wrapped) and exports std:: aliases for the libc functions.
  • Optionally add <cstdlib>, <cstddef> shims following the same pattern.
  • Decide ETL_USING_FORMAT_FLOATING_POINT default policy in etl_profile.h: recommend OFF by default with --layer2 opt-in for FP format builds.

Effort: ~3h.

1.10 PATH_MAX and friends in limits.h [BLOCKS: posixfile]

PATH_MAX is not defined anywhere. Add to runtime/include/limits.h with a comment tying it to GSString.length being u16 and the practical NUL-terminated-path-fits-in-256-bytes rule.

Effort: ~30 min.

1.11 Weak-extern survival policy for LTO [BLOCKS: lto] [DONE]

libc.c declares dozens of __attribute__((weak)) extern GS/OS calls (gsosOpen/Read/Write/...). Under LTO, the inliner may decide a weak-extern is undefined and propagate that as constant 0 / NULL through callers, then DCE the surrounding code.

  • Marked all weak-extern decls in libc.c with __attribute__((weak, retain, used)): the GS/OS dispatchers (gsosOpen/Read/Write/Close/GetEOF/SetEOF/SetMark/GetMark/Create), __gsosIsRealImpl, __putByte, __getByte, __putByteErr, __heap_start, __heap_end. used keeps the compiler from dropping references; retain survives linker GC; both are no-ops in non-LTO builds.
  • libcxxabi.c::abiRunCxaAtexit (__run_cxa_atexit) annotated with __attribute__((retain, used)) — its only callers live in crt0*.s (jsl __run_cxa_atexit), which is invisible to LTO's IR view, so without the attributes LTO would strip the body and crt0 would JSL into the weak no-op fallback in libgcc.s and C++ global dtors would never run.
  • Definitions in libcGno.c left unannotated: link-pull-in from libc.c's weak-externs already keeps them alive; the LTO hazard is on the declaration side, not the definition side (the linker pulls libcGno.o in to resolve the libc.c weak-externs regardless of LTO).
  • 145 smoke checks pass.

1.12 LTO × Layer 2 silent-miscompile gate [BLOCKS: lto]

-mllvm -w65816-dbr-safe-ptrs is per-TU. Mixing TUs built with and without it in one LTO set produces silently wrong code.

Build the gate FIRST, before any LTO codegen work:

  • Embed Layer 2 flag in IR as a module-level attribute on every TU.
  • In the LTO driver pre-pass, hard-fail if attributes disagree.

Effort: ~3h.
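The pre-pass check itself is small. The sketch below models it in Python under the assumption that the driver has already extracted one boolean per TU from the module-level attribute; how that attribute is named and read is not specified here, so the input dictionary is hypothetical.

```python
def check_layer2_consistency(tu_flags: dict) -> bool:
    """Return the shared Layer 2 setting, or raise if the TUs disagree.

    `tu_flags` maps a TU name to the Layer 2 flag recorded in its IR.
    The extraction step is out of scope for this sketch.
    """
    enabled = sorted(name for name, flag in tu_flags.items() if flag)
    disabled = sorted(name for name, flag in tu_flags.items() if not flag)
    if enabled and disabled:
        raise ValueError(
            "Layer 2 flag mismatch in LTO set: "
            f"on={enabled} off={disabled}")
    return bool(enabled)

uniform = check_layer2_consistency({"lapi.o": True, "lvm.o": True})
try:
    check_layer2_consistency({"lapi.o": True, "main.o": False})
    mixed_rejected = False
except ValueError as err:
    mixed_rejected = True
    print(err)
```

Listing both groups in the error message tells the user which TUs to rebuild, which matters more than the failure itself.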

1.13 ELF EM_ assignment [BLOCKS: clangdwarffix, llvm-dwarfdump tooling]

llvm-dwarfdump warns persistently because EM_NONE is set on output. Assign a real (vendor-private if needed) EM_ value.

Effort: ~2h.

Phase 1 total: ~60-90h.


Phase 2 - M1 quick wins (parallel, no DWARF dependency)

These items have no cross-dependencies and can run concurrently once Phase 1 lands the build-harness prerequisites.

2.1 clangdwarffix (continued from Phase 1.3)

Phase 1.3 covered the reloc plumbing. Remaining work:

  • Update smoke checks at smokeTest.sh:5347 (encodes 3-byte width — the new 4-byte LE address starts with the same 3 LE bytes, so green, but fragile).
  • Add pc2line.py cleanup to drop the zero-length fallback.
  • Update docs (USAGE.md, STATUS.md) to drop the "llvm-dwarfdump warns" caveat — depends on Phase 1.13.

Effort: ~3h after Phase 1.3 + 1.13.
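Once the zero-length fallback is dropped, a zero field in the .debug_line header becomes a hard error. A minimal sketch of that check follows, assuming 32-bit DWARF and little-endian encoding; `read_line_header` is a hypothetical helper name, not an existing pc2line.py function.

```python
import struct

def read_line_header(data: bytes, off: int = 0):
    """Return (unit_length, version, header_length) for one .debug_line unit.

    Assumes 32-bit DWARF (no 0xffffffff escape) and little-endian fields.
    """
    unit_length, version = struct.unpack_from("<IH", data, off)
    pos = off + 6
    if version >= 5:
        pos += 2   # DWARF 5 inserts address_size and segment_selector_size
    (header_length,) = struct.unpack_from("<I", data, pos)
    if unit_length == 0 or header_length == 0:
        raise ValueError("zero unit_length/header_length: relocs not applied?")
    return unit_length, version, header_length

# Synthetic headers for illustration; not taken from a real build.
v5_header = struct.pack("<IHBBI", 0x40, 5, 4, 0, 0x20)
v4_header = struct.pack("<IHI", 0x30, 4, 0x18)
v5_fields = read_line_header(v5_header)
v4_fields = read_line_header(v4_header)
print(v5_fields, v4_fields)
```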

2.2 hexfloat (%a / %A printf)

  • Decide subnormal canonical form (recommend 0x0.{mantissa}p-1022).
  • Decide trailing-zero stripping policy (recommend glibc-style: strip when precision unspecified).
  • Implement emitHexFloat in runtime/src/snprintf.c with local width/leftAlign/zeroPad arithmetic (do NOT reuse emitNumber's monolithic numeric body — only use it for the exponent).
  • Use 4 u16 words instead of u64 shifts to dodge i64-codegen surprises (>>52 and 12-bit mask paths).
  • Bring %f/%g/%e to Inf/NaN parity OR document the asymmetry (don't half-do it).
  • Add a new smoke probe block (don't extend the existing 0x7f bitmap — used by two checks at smokeTest.sh:2407 and :2581).
  • Update STATUS.md:48-52 (printf conversion table) and snprintf.c banner at lines 21-23.

Effort: 6-8h.
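The recommended policies can be prototyped before touching snprintf.c. The following is a Python model, not the C implementation: it splits the double into four 16-bit words as suggested above, uses the recommended subnormal form, and strips trailing zeros. Explicit precision and the width/leftAlign/zeroPad handling are left out.

```python
import struct

def emit_hex_float(x: float, upper: bool = False) -> str:
    """Model of %a / %A with unspecified precision."""
    # Four little-endian u16 words; w3 holds sign, exponent, top 4 mantissa bits.
    w0, w1, w2, w3 = struct.unpack("<4H", struct.pack("<d", x))
    sign = "-" if w3 & 0x8000 else ""
    exp = (w3 >> 4) & 0x7FF
    frac = f"{w3 & 0xF:x}{w2:04x}{w1:04x}{w0:04x}"   # 52 bits = 13 hex digits
    frac_is_zero = frac.strip("0") == ""
    if exp == 0x7FF:
        body = "inf" if frac_is_zero else "nan"
    elif exp == 0 and frac_is_zero:
        body = "0x0p+0"
    else:
        # Subnormals use a leading 0 and the fixed exponent -1022.
        lead, e = ("0", -1022) if exp == 0 else ("1", exp - 1023)
        digits = frac.rstrip("0")                    # glibc-style stripping
        body = f"0x{lead}" + (f".{digits}" if digits else "") + f"p{e:+d}"
    out = sign + body
    return out.upper() if upper else out

for value in (1.0, 1.5, -2.0, 0.1, 5e-324, float("inf")):
    print(emit_hex_float(value))
```

Only word-sized shifts and masks appear in the model, so the C version should not need any 64-bit shift helpers.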

2.3 tmpfile / tmpnam / rename

Following Phase 0.10's decision (copy+delete fallback for cross-dir rename):

  • Per-FILE owned name buffer (extend FILE struct or use parallel tmpNames[MFS_MAX_FILES][L_tmpnam] table). Update __mfs[] initializer.
  • Add gsosDestroy ($2002 pCount=1) and gsosChangePath ($2004) wrappers in iigsGsos.s + iigsGsosStub.s (real stub semantics from Phase 1.2).
  • Promote remove() from mfs-only to mfs-then-GS/OS-Destroy.
  • Promote tmpfile() from stub: generate unique name via tmpnam, open O_CREAT|O_EXCL, set the auto-delete-on-close flag in the FILE.
  • Promote tmpnam() from stub: read time via Phase 1.8 srand seed, format /RAM5/T{16-hex-chars}.TMP or similar.
  • Promote rename() from stub:
    • Fast path: if new-path is in the same directory, route to ChangePath.
    • Cross-dir copy+delete fallback: Open source RDONLY, Create destination, chunked Read/Write loop (8KB buffer), Close both, Destroy source. Error recovery: if Write fails mid-loop, Destroy destination + return -1. If final Destroy of source fails, leave dest in place + return -1 with errno set + emit a debug log line (destructive partial-state, but the data is preserved). Source-vanished-mid-op is rare under GS/OS (no concurrent process); leave as best-effort.
    • Use GSString256 stack scratch (already present at __gsosPathBuf in libc.c).
  • Update mfs-path detection to auto-detect the / vs : separator.
  • Smoke tests:
    • create + write + close + remove + verify destroyed.
    • rename within same dir (ChangePath path).
    • rename across dirs (copy+delete fallback) — write 10KB file, verify contents byte-identical post-rename, verify source gone.

Effort: 16-18h (was 10-12h; +6h for copy+delete fallback per Phase 0.10).
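The error-recovery order of the cross-dir fallback is easier to review as code. Below is a Python model that uses ordinary host file operations in place of the GS/OS Open/Create/Read/Write/Close/Destroy calls. It assumes destination creation fails when the file already exists, and it returns -1 without modelling errno or the debug log line.

```python
import os
import tempfile

CHUNK = 8 * 1024   # 8KB copy buffer, as in the plan

def rename_copy_delete(src: str, dst: str) -> int:
    """Model of the copy+delete fallback; returns 0 on success, -1 on failure."""
    try:
        fin = open(src, "rb")
    except OSError:
        return -1
    try:
        fout = open(dst, "xb", buffering=0)   # assumed: Create fails if dst exists
    except OSError:
        fin.close()
        return -1
    copied_ok = True
    try:
        while True:
            chunk = fin.read(CHUNK)
            if not chunk:
                break
            fout.write(chunk)
    except OSError:
        copied_ok = False
    finally:
        fin.close()
        fout.close()
    if not copied_ok:
        try:
            os.remove(dst)        # partial copy: destroy destination
        except OSError:
            pass
        return -1
    try:
        os.remove(src)            # final Destroy of the source
    except OSError:
        return -1                 # destination kept, so the data is preserved
    return 0

with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "a"))
    os.mkdir(os.path.join(root, "b"))
    src_path = os.path.join(root, "a", "DATA.BIN")
    dst_path = os.path.join(root, "b", "DATA.BIN")
    payload = bytes(range(256)) * 40          # 10KB, as in the smoke test
    with open(src_path, "wb") as f:
        f.write(payload)
    status = rename_copy_delete(src_path, dst_path)
    with open(dst_path, "rb") as f:
        copied = f.read()
    source_gone = not os.path.exists(src_path)
print(status, len(copied), source_gone)
```

The two failure paths are deliberately asymmetric: a failed copy removes the destination, while a failed source delete keeps it.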

2.4 docram (DOC-RAM sample upload)

Phase 1.6 already corrected IigsSoundParmT. Remaining work:

  • Add iigsLoadDocSample(const int8_t *wave, uint16_t size, uint16_t docOffset) wrapper around WriteRamBlock toolbox call.
  • Update iigsPlaySoundV2 / iigsPlayDocSample to consume corrected struct.
  • Add demos/helloSample.c standalone demo.
  • Wire runtime/src/sound.c into demos/build.sh (currently missing).
  • Add standalone MMStartUp+SoundStartUp helper to iigs/sound.h (since startdesk() is too heavy for a CLI-style sample probe).
  • Smoke test: WriteRamBlock returns cleanly + a marker store fires.

Effort: 6-8h after Phase 1.6.

2.5 cursor helpers

  • Add IigsCursorT typedef to runtime/include/iigs/toolbox.h.
  • Add runtime/src/cursor.c with iigsCursorPushArrow, iigsCursorPushBusy, iigsCursorPop, iigsCursorRegister(region, cursor) (via TaskMaster wmTaskMask cursor auto-track, NOT a custom idle hook).
  • Save-stack stores a COPY of the CursorRecord (not the pointer — toolset memory can move).
  • Hard-error or asserted-no-op before startdesk() (InitCursor invariant).
  • Decide: drop embedded cursor blobs from scope (just wrappers + Wait/IBeam ROM shapes via GetCursorAdr($800c)) OR hand-code 4 cursor blobs and budget ~3-4h for mask/hotspot debugging.
  • Recommend: drop embedded blobs; expose SetIigsCursor(const IigsCursorT*) + iigsCursorBusy()/iigsCursorArrow().
  • Update runtime/build.sh (use __attribute__((section(...))) per cursor blob if embedded; OR use -fdata-sections target-wide and re-verify smoke).
  • Smoke: $70-marker MAME region-transition probe.

Effort: 14-18h.

2.6 buildsystem (CMake + Make integration)

  • Decide on TYPE enumeration: flat | flatMultiSeg | gsos | gno (four values, not three — reviewer caught this).
  • Build CMAKE_C_LINK_EXECUTABLE override that fully bypasses CMake's link-line generator (link816 takes no -L/-l/-Wl/response files).
  • Generate W65816RuntimeImports.cmake from Phase 1.7.e (single source of truth).
  • Per-source-file CFLAGS override: set_source_files_properties(... PROPERTIES W65816_LAYER2 ON W65816_REGALLOC basic).
  • Wrap all four runner harnesses (runInMame.sh, runMultiSeg.sh, runViaFinder.sh, runInGno.sh) under add_w65816_mame_test().
  • Hand-build the link line in exact order (libcGno.o BEFORE libc.o for weak override).
  • ProDOS filetype/aux: pass --filetype to link816, emit .meta sidecar, ctest wrapper reads .meta to construct cadius #XX0000 suffix.
  • Guard at CMake configure time: TYPE=gno + SEGMENT_CAP is an error (omfEmit rejects this combo at omfEmit.cpp:723-724).
  • C++ auto-link of libcxxabi.o + libcxxabiSjlj.o AFTER libc.o: read SOURCES extensions, branch in CMake function body (genex can't reorder).
  • Make template: scope explicitly to single-binary single-mode flat hello-world ONLY. Document the gap.
  • Smoke integration under ulimit -t 90 (seconds): cold-cache CMake configure can take 30+s; ensure graceful skip when command -v cmake fails.
  • Optional: GENERATE_DEBUG keyword + ctest hookup for pc2line.py (depends on Phase 1.3).

Effort: 55h. HIGH risk on link-line override.
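The ordering rules for the runtime objects can be pinned down in a small model before they are written as a CMake function. This sketch is a Python model only: it covers the runtime tail of the link line, assumes libcGno.o is linked only for TYPE=gno, and uses a hypothetical list of C++ extensions.

```python
import os

VALID_TYPES = ("flat", "flatMultiSeg", "gsos", "gno")
CXX_EXTENSIONS = {".cpp", ".cc", ".cxx"}   # hypothetical extension list

def runtime_link_tail(sources, target_type, segment_cap=None):
    """Return the ordered runtime objects appended to the link line."""
    if target_type not in VALID_TYPES:
        raise ValueError(f"unknown TYPE {target_type!r}")
    if target_type == "gno" and segment_cap is not None:
        raise ValueError("TYPE=gno + SEGMENT_CAP is rejected at configure time")
    objects = []
    if target_type == "gno":
        objects.append("libcGno.o")        # BEFORE libc.o for weak override
    objects.append("libc.o")
    if any(os.path.splitext(s)[1] in CXX_EXTENSIONS for s in sources):
        objects += ["libcxxabi.o", "libcxxabiSjlj.o"]   # AFTER libc.o
    return objects

gno_cxx = runtime_link_tail(["main.cpp", "util.c"], "gno")
flat_c = runtime_link_tail(["hello.c"], "flat")
print(gno_cxx, flat_c)
```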

2.7 cxxsmoke (modern C++ smoke coverage)

  • Pre-spike: run each candidate snippet as a one-off demo through buildGno.sh + runInGno.sh BEFORE writing smoke checks. 30-min sanity gate.
  • Decide demo placement: create tests/cxxSmoke/ mirroring tests/coremark/ / tests/lua/ pattern, NOT in demos/ (where buildGno.sh auto-discovery would build them as GNO commands).
  • Add -include etl_profile.h to smoke compile line OR replace etl::tuple structured-binding check with a user struct that has tuple_size / tuple_element specializations defined in the heredoc.
  • Five checks: range-for, generic lambda + capture-by-reference of i32 local (the i32 path is where most recent fixes have lived — most likely to regress), variadic templates, structured bindings, fold expressions.
  • Each check: a buildGno-style probe with $70 marker on success.
  • Smoke harness from Phase 1.7.c launches each under MAME and verifies marker.
  • If any check fails: stop work, XFAIL the test with TODO note, book a separate codegen-fix PR.

Effort: 10-12h (clean run). Best case 4h, worst case multi-day if a codegen bug surfaces.

Phase 2 total: ~110-130h.


Phase 3 - M2 source-level debugging end-to-end

3.1 debugger (interactive GDB-style front-end)

Reviewer's critical findings:

  • cpu.debug:bpset(addr) 1-arg form CRASHES MAME. Use bpset(pc, '', 'logerror "BP-HIT PC=%X A=%X X=%X Y=%X S=%X DBR=%X\n",pc,a,x,y,s,db; go').
  • SESSION_RECOVERY.md:362-385 already documents the working -debug -debugger qt -oslog workflow. Reuse, do not reinvent.
  • Reentrancy SEGFAULT: add_machine_pause_notifier + cpu.debug:go() from a callback. Design must NOT call go() from Lua resume command callbacks.
  • MAME under -debug starts with execution_state = 'stop'. Harness must explicitly call dbg.execution_state = 'run'.
  • Drop bt from initial scope OR downgrade to best-effort single-frame parent only. Real multi-frame bt requires either DW_AT_frame_base in .debug_info or a per-function frame-size sidecar from link816 (new work item, not budgeted).
  • Add finish/return command (run-until-current-frame-RTL/RTS) — easier than step-over JSL and the natural escape from accidental step-into.

Steps:

  1. Add demos/build.sh --debug mode (adds -g to clang, --debug-out/--map to link816, _dbg output naming).
  2. Add demos/buildGno.sh --debug mode equivalent.
  3. Build Python front-end consuming -oslog stream (one-way pipe). Use machine.debugger.command(string) to inject debugger console commands at runtime for set-bp / step / continue.
  4. Pre-spike: confirm bpset(pc, '', '') form, verify bank-aware bp matching (24-bit PB:PC vs 16-bit PC), confirm execution_state behavior after pre-run bpset. 2h spike.
  5. Implement commands: b FUNC | FILE:LINE, c, s (step-instr), n (step-over: temp-bp at jsl_pc+4), finish, p &GLOBAL (map lookup only — p VAR deferred to localvars).
  6. Update SESSION_RECOVERY.md (not a new doc — keep one source of truth) to reference the new workflow.
  7. Add --trace mode that sets bp at main, captures one BP-HIT via -oslog, asserts pc2line.py resolves it. Default-on smoke, no DEBUGGER_E2E=1 gate.
  8. Gate interactive (dbg) prompt portion behind DEBUGGER_E2E=1 only.

Effort: 24-30h.
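The front-end's input side reduces to recognizing BP-HIT lines in the -oslog stream. The sketch below parses the logerror format string quoted in the findings above and computes the step-over breakpoint address. Whether the logged PC includes the program bank is one of the pre-spike questions, so the 24-bit wrap is an assumption.

```python
import re

# Matches the logerror format from the reviewer's bpset example.
BP_HIT = re.compile(
    r"BP-HIT PC=([0-9A-F]+) A=([0-9A-F]+) X=([0-9A-F]+) "
    r"Y=([0-9A-F]+) S=([0-9A-F]+) DBR=([0-9A-F]+)",
    re.IGNORECASE)
REGISTER_NAMES = ("pc", "a", "x", "y", "s", "dbr")

def parse_bp_hit(line: str):
    """Return a register dict for a BP-HIT log line, or None for other lines."""
    match = BP_HIT.search(line)
    if match is None:
        return None
    return {name: int(text, 16)
            for name, text in zip(REGISTER_NAMES, match.groups())}

def step_over_breakpoint(jsl_pc: int) -> int:
    """Temp breakpoint for `n`: the instruction after a 4-byte JSL."""
    return (jsl_pc + 4) & 0xFFFFFF     # assumed 24-bit PB:PC address

hit = parse_bp_hit("BP-HIT PC=1234 A=7 X=0 Y=FF S=1FE DBR=2")
noise = parse_bp_hit("some unrelated log output")
next_bp = step_over_breakpoint(0x011234)
print(hit, noise, hex(next_bp))
```

Using `search` instead of `match` tolerates any prefix the log stream may add in front of the message.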

3.2 localvars (-O0 + -O2/IMG + location-lists + inlined subroutines, per Phase 0.4)

Depends on Phase 1.3 (DWARF reloc fix) + Phase 1.5 (DBG_VALUE preservation).

Per Phase 0.4 decision: full surface in one landing.

Steps:

  1. Verify llvm-dwarfdump can parse a -g .o after Phase 1.3. Hard precondition.
  2. Validate +1 stack skew convention with deliberate probe (int x=0xABCD; int y=0x1234; int z=0x5678; read fbreg offsets from memdump, verify alignment). Add as smoke check.
  3. Extend pc2line.py into a full DIE walker for .debug_info + .debug_abbrev + .debug_addr + .debug_str + .debug_str_offsets.
  4. Implement a DW_OP evaluator for: DW_OP_fbreg, DW_OP_addr, DW_OP_constN, DW_OP_reg0..7, DW_OP_breg0..7, DW_OP_call_frame_cfa.
  5. Add --locals 0xPC mode that reads from a MAME memdump (snapshot or -oslog register dump).
  6. Wire p VAR in debugger (3.1) to call pc2line.py --locals.
  7. -O2 / IMG-resident locals: rewrite DW_OP_regN refs to IMG slot indices (IMG0..IMG15) into DW_OP_breg<DP_base>+offset form. LLVM emits the fictitious-register form; pc2line maps it to actual DP $C0..$DE locations.
  8. Location lists: parse .debug_loclists (DWARF 5) for PC-range-keyed location expressions. Resolve to the correct entry for the queried PC.
  9. Inlined subroutines: DW_TAG_inlined_subroutine descent. Multiple-DIE-per-PC handling. Show inlined frame stack at the queried PC.
  10. Smoke checks (covering -O0 AND -O2 paths):
    • add(3, 4) -O0: locals print a=3 b=4 c=7.
    • popcount(0xF0F0) -O2 with IMG-resident vars: locals resolve correctly.
    • Multi-CU program (Lua-scale): locals from any CU resolve.
    • Inlined-helper case: stack shows the inlined frame.
  11. Expect 3-5 additional clang DWARF bugs to surface as -O2 / IMG / loclists work probes .debug_info deeper. Each is its own upstream-or-local-patch decision; budget contingency in this phase.

Effort: 50-75h (combined slice). Risk: HIGH (Phase 0.4 override accepts this). Mitigation: land Phase 1.5 DBG_VALUE audit FIRST.
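Steps 4 and 7 can be prototyped together. The sketch below is a minimal evaluator for the opcodes listed in step 4, using the standard DWARF opcode numbers. It assumes 4-byte DW_OP_addr operands, and `img_slot_address` is a hypothetical helper that assumes 2-byte IMG slots starting at DP $C0 (inferred from the $C0..$DE range).

```python
DW_OP_addr, DW_OP_fbreg, DW_OP_call_frame_cfa = 0x03, 0x91, 0x9C
DW_OP_reg0, DW_OP_breg0 = 0x50, 0x70
# opcode -> (operand size, signed) for DW_OP_const1u .. DW_OP_const4s
CONST_OPS = {0x08: (1, False), 0x09: (1, True), 0x0A: (2, False),
             0x0B: (2, True), 0x0C: (4, False), 0x0D: (4, True)}

def sleb128(buf: bytes, i: int):
    """Decode a signed LEB128 value; return (value, next_index)."""
    result = shift = 0
    while True:
        byte = buf[i]
        i += 1
        result |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            if byte & 0x40:
                result -= 1 << shift     # sign-extend
            return result, i

def eval_location(expr: bytes, *, frame_base=0, cfa=0, regs=None, addr_size=4):
    """Return ("addr", value) or ("reg", n) for a simple location expression."""
    regs = regs or {}
    stack, i = [], 0
    while i < len(expr):
        op = expr[i]
        i += 1
        if op == DW_OP_addr:
            stack.append(int.from_bytes(expr[i:i + addr_size], "little"))
            i += addr_size
        elif op in CONST_OPS:
            size, signed = CONST_OPS[op]
            stack.append(int.from_bytes(expr[i:i + size], "little", signed=signed))
            i += size
        elif DW_OP_reg0 <= op <= DW_OP_reg0 + 7:
            return ("reg", op - DW_OP_reg0)
        elif DW_OP_breg0 <= op <= DW_OP_breg0 + 7:
            offset, i = sleb128(expr, i)
            stack.append(regs[op - DW_OP_breg0] + offset)
        elif op == DW_OP_fbreg:
            offset, i = sleb128(expr, i)
            stack.append(frame_base + offset)
        elif op == DW_OP_call_frame_cfa:
            stack.append(cfa)
        else:
            raise NotImplementedError(f"DW_OP {op:#x}")
    return ("addr", stack[-1])

def img_slot_address(slot: int, dp_base: int = 0) -> int:
    """Hypothetical IMG mapping: IMG0..IMG15 -> DP $C0..$DE, 2 bytes per slot."""
    return dp_base + 0xC0 + 2 * slot

local_addr = eval_location(bytes([DW_OP_fbreg, 0x7C]), frame_base=0x1F0)
global_addr = eval_location(bytes([DW_OP_addr, 0x00, 0x20, 0x01, 0x00]))
in_register = eval_location(bytes([DW_OP_reg0 + 3]))
print(local_addr, global_addr, in_register, hex(img_slot_address(15)))
```

Returning a tagged tuple keeps the "value lives in a register" case distinct from a memory address, which is where the IMG rewrite in step 7 would hook in.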

3.3 posixfile (POSIX file helpers)

Depends on Phase 0.10 (cross-dir policy), Phase 1.7.b (--data injection), Phase 1.8 (srand), Phase 1.10 (PATH_MAX).

Steps:

  1. Add 3 new GS/OS class-1 wrappers to iigsGsos.s:
    • Get_Prefix ($200A) for realpath
    • Get_File_Info ($2006) for dirname/basename semantics
    • Get_Dir_Entry ($201C) for glob/directory iteration
  2. Add corresponding parm-block typedefs to runtime/include/iigs/gsos.h.
  3. Add stub-mode counterparts to iigsGsosStub.s (using Phase 1.2 sentinel).
  4. Pre-spike: write demos/gsosProbeDirEntry.c exercising directory open + Get_Dir_Entry iteration. Run under runInGno.sh + GSOS_FILE_SMOKE=1 BEFORE committing to glob's API. ~2h.
  5. Implement realpath (uses prefix resolution + Get_File_Info).
  6. Implement dirname / basename with auto-detect / vs : separator.
  7. Implement fnmatch with FULL bracket-set support ([A-Z]*, [!a-z]) — MANDATORY per reviewer, not optional.
  8. Implement glob using directory iteration + fnmatch.
  9. Implement mkstemp using Phase 1.8 srand seed. Template-must-be-writable invariant (refuse non-writable template, document rodata-write risk in header).
  10. Smoke check each: 6 helpers × ~20 min.
  11. Document GNO/POSIX-VFS limitation: realpath/glob route through GS/OS class-1 on both bare-metal-with-GS/OS and GNO. GNO chdir-via-K* not honored.

Effort: 18-26h.
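Because bracket-set support is mandatory for step 7, it helps to fix the matching rules in an executable model first. The sketch below is a Python model of the algorithm (`*`, `?`, `[A-Z]`, `[!a-z]`), not the C implementation. It is case-sensitive and treats an unterminated `[` as a literal; both choices are assumptions.

```python
def match_set(pat: str, i: int, ch: str):
    """Match `ch` against the bracket set at pat[i] == '['.

    Returns (matched, index_after_set), or None if the set is unterminated.
    """
    j = i + 1
    negate = j < len(pat) and pat[j] == "!"
    if negate:
        j += 1
    first, hit = True, False
    while j < len(pat) and (pat[j] != "]" or first):
        first = False            # a leading ']' is a literal member
        if j + 2 < len(pat) and pat[j + 1] == "-" and pat[j + 2] != "]":
            hit |= pat[j] <= ch <= pat[j + 2]      # range such as A-Z
            j += 3
        else:
            hit |= ch == pat[j]
            j += 1
    if j >= len(pat):
        return None
    return (hit != negate, j + 1)

def fnmatch_model(pat: str, s: str) -> bool:
    def go(p: int, k: int) -> bool:
        while p < len(pat):
            c = pat[p]
            if c == "*":
                while p < len(pat) and pat[p] == "*":
                    p += 1
                if p == len(pat):
                    return True
                return any(go(p, t) for t in range(k, len(s) + 1))
            if k >= len(s):
                return False
            if c == "[":
                result = match_set(pat, p, s[k])
                if result is not None:
                    matched, p = result
                    if not matched:
                        return False
                    k += 1
                    continue
                if s[k] != "[":          # unterminated set: literal '['
                    return False
            elif c != "?" and c != s[k]:
                return False
            p += 1
            k += 1
        return k == len(s)
    return go(0, 0)

print(fnmatch_model("[A-Z]*", "HELLO.C"), fnmatch_model("[!a-z]", "q"))
```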

3.4 resourcemgr (deferred or stub-only per Phase 1.1 outcome)

If Phase 1.1 resolves GS/OS fopen hang, proceed. Otherwise: stub-only landing documented as such.

Steps (full version):

  1. Decide bundler input format: TYPECODE_ID.bin per reviewer recommendation (16-bit type + 16-bit ID encoded in filename like 8005_0001.bin).
  2. Verify AppleSingle round-trip with disposable 1-hour cadius spike before writing full bundler.
  3. Install or build ORCA's rez as hard dependency for layout cross-checking.
  4. Write tools/rsrcBundle/rsrcBundle.py:
    • Read TYPECODE_ID.bin files
    • Build rResourceMap + rIndex
    • Stitch with OMF data fork
    • Emit AppleSingle
  5. Write tools/rsrcBundle/dumpFork.py for diffing against rez output.
  6. Implement resourceProbeInit() in runtime/src/resource.c (MMStartUp + TLStartUp + ResourceStartUp + OpenResourceFile-on-own-pathname).
  7. Build typed-C façade: LoadResource, GetResourceSize, HLock semantics (handle relocation via Memory Manager).
  8. Add ResourceShutDown hook via __cxa_atexit.
  9. Build demos/rsrcProbe.c with marker discipline (write $025000=0x99 + while(1); runViaFinder LAUNCHES only, no keypress automation).
  10. Add --rsrc <applesingle> mode to runViaFinder.sh.
  11. Update demos/build.sh to call rsrcBundle as post-step when .rsrc/ dir present.
  12. WriteResource + UpdateResourceFile DEFERRED to a separate item (persistent write needs disk-extract-and-diff verification).

Effort: 40-50h.
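The container side of step 4 is small enough to sketch before the spike. The code below parses the TYPECODE_ID.bin naming scheme and wraps two forks in an AppleSingle version 2 header (big-endian; entry ID 1 is the data fork, 2 is the resource fork). It assumes both filename fields are hexadecimal, leaves out building rResourceMap/rIndex, and leaves open whether cadius also needs a ProDOS file-info entry, which the step 2 spike should settle.

```python
import struct

APPLESINGLE_MAGIC = 0x00051600
APPLESINGLE_VERSION = 0x00020000
ENTRY_DATA_FORK, ENTRY_RESOURCE_FORK = 1, 2
HEADER_FORMAT = ">II16xH"        # magic, version, 16 filler bytes, entry count
ENTRY_FORMAT = ">III"            # entry ID, offset, length

def parse_resource_name(filename: str):
    """'8005_0001.bin' -> (0x8005, 0x0001); both fields assumed hex."""
    stem, ext = filename.rsplit(".", 1)
    if ext != "bin":
        raise ValueError(f"unexpected extension in {filename!r}")
    type_hex, id_hex = stem.split("_")
    return int(type_hex, 16), int(id_hex, 16)

def build_applesingle(data_fork: bytes, resource_fork: bytes) -> bytes:
    entries = [(ENTRY_DATA_FORK, data_fork),
               (ENTRY_RESOURCE_FORK, resource_fork)]
    offset = struct.calcsize(HEADER_FORMAT) + len(entries) * struct.calcsize(ENTRY_FORMAT)
    out = bytearray(struct.pack(HEADER_FORMAT, APPLESINGLE_MAGIC,
                                APPLESINGLE_VERSION, len(entries)))
    for entry_id, payload in entries:
        out += struct.pack(ENTRY_FORMAT, entry_id, offset, len(payload))
        offset += len(payload)
    for _, payload in entries:
        out += payload
    return bytes(out)

def split_applesingle(blob: bytes) -> dict:
    """Inverse of build_applesingle, used for round-trip checks."""
    magic, _version, count = struct.unpack_from(HEADER_FORMAT, blob, 0)
    if magic != APPLESINGLE_MAGIC:
        raise ValueError("not an AppleSingle file")
    forks, pos = {}, struct.calcsize(HEADER_FORMAT)
    for _ in range(count):
        entry_id, offset, length = struct.unpack_from(ENTRY_FORMAT, blob, pos)
        forks[entry_id] = blob[offset:offset + length]
        pos += struct.calcsize(ENTRY_FORMAT)
    return forks

resource_key = parse_resource_name("8005_0001.bin")
blob = build_applesingle(b"OMF-DATA", b"RSRC")
forks = split_applesingle(blob)
print(resource_key, len(blob), forks)
```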

Phase 3 total: ~120-180h.


Phase 4 - M3 IIgs application authoring kit (parallel with Phase 3)

4.1 menubuilder

Steps:

  1. Pre-verify: does DrawMenuBar actually still hang post-InitCursor-landing? Drop paintMenuBarTitles fallback if not. 30-min check.
  2. Side-by-side dump struct offsets of NewWindowParm vs ORCA's window.h. 30-min ABI check.
  3. Reconcile WmTaskRec (used in all 5 demos) with IigsEventT (used in eventLoop.h). Either align field offsets or document why both exist.
  4. Build menu mini-format assembler in runtime/src/uiBuilder.c:
    • Handles >> (menu start), >>@ (Apple menu), \X (icon), \N### (numeric ID), *Xx (cmd-key), -- (item prefix), --- (divider), D (disabled), V (visible/check), * (separator), .\r (terminator).
    • Round-trip test against Menu Mgr's parser. 6h.
  5. Window builder + control wrappers (cButton/cCheckBox/cEditLine/cScrollBar using abstract 32-bit proc constants — NOT bank-E1 ROM addresses). 4h.
  6. Add cmdId→itemID lookup table to IigsMenuT. Document dispatch contract.
  7. Extend IigsEventCallbacksT with onCmd (menu-pick dispatcher).
  8. Migrate ALL FIVE affected demos (frame.c, orcaFrame.c, minicad.c, reversi.c, helloWindow.c). 6h.
  9. Either include AlertTemplate/ItemTemplate wrapper (uiBuilderAlert) in scope OR carve out a separate alertbuilder item. Recommend in-scope.
  10. Smoke check: install menu with one item, simulate keystroke via scripted MAME input, verify onCmd fires by setting $70=0x99. 4h.
  11. Re-baseline OMF sizes; verify cRELOC budget headroom.
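
For step 4, the assembler's job is essentially string concatenation over the tokens listed above. A rough sketch follows, assuming the token order "title, \N id, cmd-key, disabled flag"; the helper names are hypothetical, and the exact ordering the Menu Manager accepts should be confirmed by the round-trip test rather than taken from this example.

```python
def menu_item(title, item_id, cmd_key=None, disabled=False):
    """Hypothetical helper: one '--' item line with a numeric ID."""
    line = "--" + title + "\\N" + str(item_id)
    if cmd_key:
        # *Xx form: upper- and lower-case variants of the command key
        line += "*" + cmd_key.upper() + cmd_key.lower()
    if disabled:
        line += "D"
    return line + "\r"


def build_menu(title, menu_id, items):
    """Hypothetical helper: '>>' title line, item lines, '.' terminator."""
    return ">>" + title + "\\N" + str(menu_id) + "\r" + "".join(items) + ".\r"


file_menu = build_menu(" File ", 2, [
    menu_item("Close", 255, disabled=True),
    menu_item("Quit", 256, cmd_key="q"),
])
```

Keeping the builder as pure string assembly makes the round-trip test cheap: the output can be compared byte-for-byte against a hand-written menu string before it is ever handed to the toolbox.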

Effort: 25-30h.

4.2 sprites (320 mode, standalone per Phase 0.6)

Steps:

  1. Standalone init: sprite.c does its own $C029 NEWVIDEO bit 7 + SCB ($E1:9D00) + palette ($E1:9E00). 2h.
  2. SHR-safe heap policy:
    • Document $C035 shadow register interaction.
    • Sprite save buffers MUST live above $A000 OR in a bank != 0 (since bank-0 $2000..$9FFF mirrors to $E1:2000..$E1:9FFF).
    • Add iigsSpriteAttachBuffer(void *buf, size_t size) so caller controls placement.
    • Document this in iigs/sprite.h and STATUS.md.
  3. Software sprite engine:
    • 16×16 fixed sprite shape, 4bpp packed.
    • Background save/restore.
    • Transparent blit (mask).
    • Sprite list (Begin/Add/RenderAll/EraseAll).
  4. Integration with eventLoop's TaskMaster frame cadence.
  5. Demo (demos/spriteProbe.c):
    • Init SHR.
    • Place 8 sprites.
    • One frame of update.
    • Verify via runInMame.sh --check-u8 (from Phase 1.7.a) at known SHR offsets.
  6. Cycle benchmarks in tests/sprites/: "blit one 16×16 sprite in <2000 cyc", "erase + redraw 8 sprites in <16000 cyc / 1 frame".
  7. 640 mode DEFERRED to follow-up item (Phase 0.6 decision).
  8. pha;plb DBR-to-$E1 optimization in inner loop: only if blit doesn't call any libgcc helper while DBR is contaminated. Audit before enabling.
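
The save/blit/erase cycle in step 3 can be modeled on the host before any 65816 code is written. The sketch below is an assumption-laden model, not the engine: it treats the frame buffer as a flat 32000-byte array (160 bytes per scanline in 320 mode), restricts sprites to byte-aligned X positions, and uses a mask convention where set bits keep the background. All names are hypothetical.

```python
SHR_STRIDE = 160       # bytes per scanline in 320 mode (4bpp, two pixels per byte)
SHR_LINES = 200
SPRITE_ROW_BYTES = 8   # 16 pixels at 4bpp
SPRITE_ROWS = 16


def draw_sprite(fb, x_byte, y, pixels, mask):
    """Blit one 16x16 sprite; return the background bytes it covered."""
    saved = bytearray(SPRITE_ROW_BYTES * SPRITE_ROWS)
    for row in range(SPRITE_ROWS):
        base = (y + row) * SHR_STRIDE + x_byte
        for col in range(SPRITE_ROW_BYTES):
            i = row * SPRITE_ROW_BYTES + col
            background = fb[base + col]
            saved[i] = background
            # Mask bits set: keep background. Mask bits clear: take sprite pixel.
            fb[base + col] = (background & mask[i]) | (pixels[i] & ~mask[i] & 0xFF)
    return saved


def erase_sprite(fb, x_byte, y, saved):
    """Restore the background captured by draw_sprite."""
    for row in range(SPRITE_ROWS):
        base = (y + row) * SHR_STRIDE + x_byte
        start = row * SPRITE_ROW_BYTES
        fb[base:base + SPRITE_ROW_BYTES] = saved[start:start + SPRITE_ROW_BYTES]
```

A host-side model like this gives the spriteProbe demo a reference image: the bytes it predicts at a given offset are the values the --check-u8 assertions in step 5 would look for.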

Effort: 22-28h.

Phase 4 total: ~50-60h.


Phase 5 - M4 production-grade C++ toolchain

Per Phase 0.1/0.2, this is materially smaller than the original brief.

5.1 unwinder: _Unwind_RaiseException-over-SJLJ stub (Phase 0.1 option A)

Not a real DWARF unwinder. Provides the Itanium surface third-party C++ libraries expect.

  • runtime/src/libunwindStub.c: _Unwind_RaiseException, _Unwind_Resume, _Unwind_GetIP, _Unwind_GetCFA routed to existing SJLJ jmpbuf.
  • Smoke: probe that throws + catches via the stub.
  • Document: "third-party libcxx-using code links; throw across non-instrumented frames terminates."

Effort: ~20h.

5.2 lto (ThinLTO per Phase 0.2)

Depends on Phase 1.4.c (TTI), Phase 1.11 (weak-extern survival), Phase 1.12 (Layer 2 gate).

Steps:

  1. Pre-spike (30 min): build llvm-link + llvm-dis, run ThinLTO over 3 small TUs (extras.c + strtok.c + libcGno.c) with --mtriple=w65816 -inline-threshold=50, link with asm objects, run helloBeep. Validates the pipeline.
  2. Add llvm-link, llvm-as, llvm-dis to installLlvmMos.sh ninja targets. Extend existence-check at lines 75-78.
  3. Build scripts/ltoLink.sh that:
    • Reads bitcode + native asm objects
    • Runs llvm-link on bitcode
    • Runs opt -O2 --mtriple=w65816 -inline-threshold=50 (explicitly set; opt does NOT invoke TargetPassConfig so the TM-init hook for inline-threshold doesn't fire).
    • Runs llc -filetype=obj
    • Hands resulting .o to link816.
  4. Verify GlobalDCE doesn't strip .init_array boundary symbols. Mark with llvm.used if needed.
  5. Document: per-file -mllvm -regalloc=basic for Lua's lvm.c / ldebug.c / ltablib.c is preserved by ThinLTO's per-TU codegen attachment.
  6. CoreMark + Lua LTO smoke: success criterion "produces a working binary at parity size or better."
  7. Document LTO × Layer 2 hard-fail behavior (Phase 1.12).
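
The three-stage driver in step 3 can be sketched as a function that only builds the command lines, which keeps the ordering testable without an LLVM install. This is an illustration, not ltoLink.sh: the function name and temp-file naming are hypothetical, and spelling the -O2 pipeline as -passes=default<O2> is an assumption about the opt version in use.

```python
def lto_commands(bitcode_files, out_obj, inline_threshold=50, triple="w65816"):
    """Hypothetical helper: argv lists for llvm-link -> opt -> llc, in order."""
    merged = out_obj + ".merged.bc"
    optimized = out_obj + ".opt.bc"
    return [
        # 1. Merge all bitcode inputs into one module.
        ["llvm-link", *bitcode_files, "-o", merged],
        # 2. IR-level optimization; the inline threshold is passed explicitly
        #    because opt does not pick up the target machine's default.
        ["opt", f"-mtriple={triple}", "-passes=default<O2>",
         f"-inline-threshold={inline_threshold}", merged, "-o", optimized],
        # 3. Code generation to a native object for link816.
        ["llc", f"-mtriple={triple}", "-filetype=obj", optimized, "-o", out_obj],
    ]


cmds = lto_commands(["extras.bc", "strtok.bc", "libcGno.bc"], "lto.o")
```

Each list can be handed to subprocess.run(cmd, check=True) in sequence; a non-zero exit from any stage then aborts before link816 sees a stale object.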

Effort: 30-40h.

Status (2026-06-02 PARTIAL - NoTTI-Lite mode):

  • scripts/ltoLink.sh LANDED. Driver: llvm-link merges bitcode, opt -passes='w65816-layer2-gate' enforces Phase 1.12 (refuses on mismatch), opt --mtriple=w65816 -passes='default' -inline-threshold=50 runs IR-level optimization with the W65816-appropriate inline threshold, llc -filetype=obj produces the final native object. Flags: -o, --keep-temps, --layer2 (caller-asserts), --inline-threshold N (override), --emit-ll (debug).
  • installLlvmMos.sh now builds llvm-link / llvm-as / llvm-dis / opt as part of the toolchain ninja targets and gates the existence check on all four. Phase 5.2 step 2.
  • W65816TTI (W65816TargetTransformInfo.h + override in W65816TargetMachine) WIRED but kMildCostModelEnabled = false. The Phase 1.4c bsearch hang (smoke #77) RE-SURFACED when qsort.c was recompiled under TTI-active multipliers (2x i32, 5x float) — meeting the "if bsearch smoke fails, ship NoTTI-Lite" criterion in the spec. The TTI plumbing ships present-but-bypassed so flipping kMildCostModelEnabled to true is the only change needed to enable full Phase 5.2 cost-driven inlining once the underlying i32 termination-compare codegen bug is fixed.
  • Layer 2 LTO hard-fail behavior (Phase 1.12) is documented in W65816Layer2Gate.cpp header comment + ltoLink.sh step 2 comment. The gate has been end-to-end-verified: mixed Layer 2 + non-Layer 2 bitcode IS rejected at LTO time with a deterministic LLVM ERROR: W65816 Layer 2 LTO gate: Layer 2 mode disagreement.
  • Per-TU codegen attachment (-mllvm -regalloc=basic for Lua's lvm.c / ldebug.c / ltablib.c) is preserved by ThinLTO's per-function attribute mechanism — those flags translate to function-level attributes that survive bitcode merge. No code change needed.
  • Size parity probe: demos/ltoProbe.c + demos/ltoProbeHelper.c through ltoLink.sh produces 37781-byte GNO OMF vs 37785 bytes for non-LTO (parity-or-better met). Runs cleanly under MAME + GNO with the harness marker hit.
  • All 162 smoke checks green after Phase 5.2 land + TTI bring-up.

Deferred to a future phase:

  • Enabling the 2x i32 / 5x float TTI multipliers. Requires fixing the i32 termination-compare codegen bug that the original Phase 1.4c attempt surfaced (smoke #77 bsearch hang). Reproducer: kMildCostModelEnabled = true + rebuild runtime + run smoke.
  • CoreMark / Lua LTO smoke probes (the spec's step 6). CoreMark's bank-budget pressure under aggressive inlining is exactly what TTI was meant to address; without TTI active, ThinLTO of CoreMark is expected to bloat past Layer 2's single-bank budget. Re-attempt after the TTI re-enable lands.

5.3 cxxchrono (Phase 0.5 split — chrono only)

  • Add etl_get_steady_clock + etl_get_high_resolution_clock + etl_get_system_clock C-side hooks in runtime/src/libc.c.
  • Verify ETL chrono milliseconds rep is i32 or i64 with static_assert. Set ETL_CHRONO_*_CLOCK_DURATION in etl_profile.h to force i32 if i64.
  • Add prototype to runtime/include/time.h.
  • Smoke: chrono::steady_clock::now() returns monotonically increasing millisecond values.
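
Forcing the rep to i32 trades range for code size, and the range is easy to quantify. The arithmetic below is a host-side sanity check only; the helper name is hypothetical.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000


def max_span_days(rep_bits):
    """Largest positive millisecond count a signed rep can hold, in days."""
    return (2 ** (rep_bits - 1) - 1) / MS_PER_DAY


i32_days = max_span_days(32)
i64_days = max_span_days(64)
```

A signed 32-bit millisecond count covers a little under 25 days, which is the practical ceiling on any duration or uptime measured through the i32 rep.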

Effort: 3-4h.

5.4 cxxstream+format+path (Phase 0.5 split — the rest)

Depends on Phase 1.9 (<cmath> shim).

Steps:

  1. Set ETL_USING_FORMAT_FLOATING_POINT=0 default in etl_profile.h. FP-format build is a separate --layer2 target.
  2. Define runtime/include/c++/iigs/path.h with ProDOS-aware path operations (64-char component / 8-component / : separator limits validated).
  3. etl::string_stream + printf("%s", ss.str().c_str()) is the cout replacement. Drop the iigs/console.h cout-shim idea — adds surface area without value.
  4. Add runtime/include/c++/cstdlib, <cstddef> shims.
  5. 1-hour etl::format size spike before committing: measure format_to(buf, "{}", 42) vs etlProbe size. If >10KB delta for one int format, document and downgrade scope.
  6. Smoke: cxxStdlibProbe demo through buildGno+MAME via Phase 1.7.c harness.
  7. Document std::iostream, std::regex, std::filesystem, std::format (the full versions, not ETL substitutes) as explicit out-of-scope with reasons (size, locale dependencies, GS/OS fopen).
  8. Set explicit per-component size budgets up front (regex link budget, filesystem code budget). Skip with documentation if exceeded.
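
The limits in step 2 reduce to two checks per path: component length and depth. The sketch below models them on the host; the function names and the choice to accept both ':' and '/' as separators are assumptions for illustration, not the contents of iigs/path.h.

```python
import re

MAX_COMPONENT_CHARS = 64
MAX_DEPTH = 8


def path_split(path):
    """Split on ':' or '/', dropping empty components."""
    return [part for part in re.split(r"[:/]", path) if part]


def path_is_valid(path):
    """True when the path has 1..MAX_DEPTH components, each within the length cap."""
    parts = path_split(path)
    if not 0 < len(parts) <= MAX_DEPTH:
        return False
    return all(len(part) <= MAX_COMPONENT_CHARS for part in parts)


def path_join(base, leaf, sep=":"):
    """Join two fragments with one separator; reject results that break the limits."""
    joined = base.rstrip(":/") + sep + leaf.lstrip(":/")
    if not path_is_valid(joined):
        raise ValueError("path exceeds component or depth limit: " + joined)
    return joined
```

Validating in the join rather than at open time means an over-long path fails at the call site that built it, not deep inside a GS/OS call.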

Status (2026-06-02 LANDED):

  • ETL_USING_FORMAT_FLOATING_POINT=0 default confirmed in runtime/include/c++/etl_profile.h (via the ETL_FORMAT_NO_FLOATING_POINT gate); FP-format is a -UETL_FORMAT_NO_FLOATING_POINT opt-in.
  • runtime/include/c++/iigs/path.h provides pathNormalize / pathJoin / pathSplit with 64-char component + 8-depth + :-or-/ separator validation. Header-only, no link footprint when unreferenced.
  • runtime/include/c++/sstream aliases etl::string_stream as std::stringstream so portable code that names std::stringstream resolves to the ETL fixed-capacity surface. Cout-replacement idiom documented in iigs/path.h header preamble and in the <sstream> shim itself: etl::string_stream ss(buf); ss << ...; printf("%s", ss.str().c_str());
  • <cstdlib> / <cstddef> / <cmath> shims already exist (Phase 1.9).
  • chrono::milliseconds rep is i32 on the W65816 by way of the ETL_CHRONO_*_CLOCK_DURATION overrides; cxxStreamProbe carries a static_assert(sizeof(etl::chrono::steady_clock::duration::rep) == 4) that fails compile if the override regresses.
  • etl::format size spike (step 5): a 1-line format_to(buf, "{}", 42) added ~82 KB to the binary over the no-format flavor. Hard downgrade per the step-5 rule (>10 KB threshold). etl::format is the layer2-opt-in surface, NOT default; gated by -DCXX_STREAM_PROBE_WITH_FORMAT=1 in the demo.
  • demos/cxxStreamProbe.cpp exercises stream<<int + path join/normalize/split + chrono i32 contract + format sentinel. Bin 19199 bytes (well under bank-0 budget). Smoke check 9/9 green under GS/OS 6.0.4 + GNO in MAME.
  • Smoke-check entry added to scripts/smokeTest.sh after the cxxChronoProbe check (~6422 area).

Explicit out-of-scope (step 7), documented here for future reviewers:

  • std::iostream (full): locale-aware num_put/num_get machinery, ctype tables, and per-stream sentry construction cost ~15-25 KB even for a single cout << int. Replacement: etl::string_stream + printf("%s", ss.str().c_str()). Aliased as std::stringstream in <sstream> for code-portability.
  • std::regex: full NFA + DFA construction is a ~30-40 KB code budget on the W65816 even with a single-character-class regex. No locale surface available either. Replacement: caller-supplied scanner or hand-rolled state machine. Documented out-of-scope.
  • std::filesystem: directory-iterator + canonical-path resolution + permission-bit handling rely on POSIX surface the GS/OS FST does not provide (no lstat, no realpath, no permission bits beyond ProDOS access byte). Replacement: iigs::path::* + the existing libc opendir/readdir/stat surface in runtime/include/dirent.h and runtime/include/sys/stat.h. Documented out-of-scope.
  • std::format (the C++20 surface): the ETL surrogate (etl::format_to) measured at +82 KB for one int; the C++20 std:: surface would be larger again (full charconv float-to-text, locale hooks). Documented out-of-scope; the layer2-opt-in etl::format is the replacement.

Effort: 12-15h.

Phase 5 total: ~65-90h (vs original brief's 120-220h — Phase 0 decisions collapse the unwinder cost dramatically).


Phase 6 - M5 observability

6.1 profiler (function-attribution under MAME)

Depends on Phase 1.3 (DWARF reloc fix) + Phase 3.2 (pc2line DIE walker).

Steps:

  1. Pre-spike (2-3h): minimum-viable PC sampler as one-off script. Validate emu.register_periodic fires with usable density. Run against three representative shapes: short hot bench (strLen), libcall-dominated bench (popcount), multi-seg (Lua). If <30 samples or >50% misattribution, pivot to -debug mode + cpu.debug.bpset-with-counter (additional 6h).
  2. Switch attribution model to "sample count + hits-percent" (NOT emu.time() weighting — sample sparsity makes cycle% dishonest).
  3. Have link816 emit ALL local symbols (not just globalSyms) to a separate map file, gated by --map-locals. Required for meaningful libgcc / libc attribution. 1-2h link816 edit.
  4. CLOCK_HZ as CLI arg (slow-mode default 1023000; --fast-mode for GS/OS demos).
  5. Add --sample mode to runInMameCycles.sh (and runMultiSeg.sh). Do NOT fork into a separate runInMameProfile.sh — keep single-sourced.
  6. Smoke: assert ≤10% samples in '?' (unattributed) + assert dominant bucket matches expectation.
  7. Defer --line mode to a follow-up.
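
The "sample count + hits-percent" model in step 2 and the '?' bucket in step 6 can be sketched as a post-processing pass over the sampled PCs and the link816 map. This is a minimal host-side illustration: the function name, the (start_address, name) symbol shape, and the image_end bound are assumptions, and without per-symbol sizes a PC is simply charged to the nearest symbol at or below it.

```python
import bisect
from collections import Counter


def attribute_samples(samples, symbols, image_end):
    """Map each sampled PC to a symbol; return {name: (hits, percent)}."""
    symbols = sorted(symbols)                   # (start_address, name) pairs
    starts = [start for start, _ in symbols]
    hits = Counter()
    for pc in samples:
        index = bisect.bisect_right(starts, pc) - 1
        if index < 0 or pc >= image_end:
            hits["?"] += 1                      # outside the mapped image
        else:
            hits[symbols[index][1]] += 1
    total = sum(hits.values())
    if total == 0:
        return {}
    return {name: (count, 100.0 * count / total) for name, count in hits.items()}


report = attribute_samples(
    samples=[0x1005, 0x1101, 0x1150, 0x0800, 0x1300],
    symbols=[(0x1000, "main"), (0x1100, "strLen")],
    image_end=0x1200,
)
```

The smoke assertion from step 6 then becomes a one-line check on report.get("?"), and the dominant-bucket check is a max() over the hit counts.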

Effort: 14-20h.

6.2 sanitizers (UBSan-minimal + coverage per Phase 0.3)

Depends on Phase 1.4.a (RETURNADDR i32) + Phase 1.4.b (TRAP→BRK).

Steps:

  1. Document ASan as out-of-scope. STATUS.md + USAGE.md.
  2. Driver toolchain decision: Option (a) skip driver-side changes; users pass -fsanitize=undefined -fsanitize-minimal-runtime manually plus link runtime/ubsan.o. RECOMMENDED — 10h effort. Option (b) is +6h.
  3. Hand-roll runtime/src/ubsan.c based on ubsan_minimal_handlers.cpp:
    • Macro-substitute __builtin_return_address (Phase 1.4.a makes it work but at unknown cost; use Phase 1.4.b BRK trap PC for caller-pc dedup).
    • caller_pcs dedup table OR stub it out.
    • All 24 HANDLER pairs (recover + abort) + 2 RECOVER-only.
  4. Route ubsan messages via __putByteErr (stderr, fd 3 in GNO).
  5. Compile ubsan.c with -fno-sanitize=undefined (recursive ubsan footgun). Update runtime/build.sh.
  6. Add tests/ubsan/ mirroring tests/coremark/ pattern: build.sh, ubsanProbe.c, manifest.
  7. Probe scope: signed-overflow (add/sub/mul) + shift + divide. Three checks verified via $025000 sentinels.
  8. Document object-size cost honestly: empirically a 9-line indexed-read function expands from 12 to 682 lines when instrumented. Three intentionally-triggering ops may not fit in a single bank.
  9. Coverage: -fprofile-instr-generate -fcoverage-mapping smoke check that verifies counters write to expected .profraw shape.
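
The caller_pcs dedup table in step 3 exists so that a check firing inside a loop reports once instead of flooding stderr. Its logic can be modeled in a few lines; the class name is hypothetical, the capacity of 20 is an assumption, and the silent drop once the table is full is a simplification of whatever the C runtime ends up doing.

```python
class CallerPcTable:
    """Fixed-capacity set of caller PCs that have already produced a report."""

    def __init__(self, capacity=20):
        self.capacity = capacity
        self.seen = []

    def should_report(self, caller_pc):
        """Return True only the first time a given PC is seen, while space remains."""
        if caller_pc in self.seen:
            return False            # already reported from this call site
        if len(self.seen) >= self.capacity:
            return False            # table full: drop further reports
        self.seen.append(caller_pc)
        return True


table = CallerPcTable(capacity=2)
results = [table.should_report(pc) for pc in (0x1234, 0x1234, 0x2000, 0x3000)]
```

A linear scan over a small fixed array is the natural C translation, since it needs no allocation and the table stays in the tens of entries.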

Effort: 22-28h.

Phase 6 total: ~36-48h.


Critical-path summary

The dependency arrows that gate everything else:

Phase 1.3 (DWARF reloc fix)
 ├─→ 2.1 clangdwarffix completion
 │    └─→ 3.1 debugger
 │         └─→ 3.2 localvars (full -O0 + -O2/IMG slice per Phase 0.4)
 │              └─→ 6.1 profiler
 └─→ 1.5 DBG_VALUE audit (must land before 3.2)

Phase 1.1 (GS/OS fopen hang)
 ├─→ 3.3 posixfile real I/O
 ├─→ 3.4 resourcemgr (or defer to stub)
 └─→ 5.4 cxxstream+format+path::filesystem (or document gap)

Phase 1.4 (backend prereqs)
 ├─→ 5.2 lto (1.4.c TTI)
 └─→ 6.2 sanitizers (1.4.a RETURNADDR, 1.4.b TRAP→BRK)

Phase 1.6 (IigsSoundParmT fix)
 └─→ 2.4 docram

Phase 1.11 + 1.12 (LTO weak-extern + Layer 2 gate)
 └─→ 5.2 lto
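
The arrows above can also be checked mechanically. The sketch below transcribes them into a dictionary (each item mapped to the items it waits on) and asks the standard-library graphlib module for a valid landing order; the dictionary is a hand transcription of this summary only, so per-item "Depends on" lines elsewhere in the plan are not included.

```python
from graphlib import TopologicalSorter

# item -> set of items that must land first (transcribed from the arrows above)
DEPENDS_ON = {
    "2.1": {"1.3"},
    "3.1": {"2.1"},
    "3.2": {"3.1", "1.5"},
    "6.1": {"3.2"},
    "1.5": {"1.3"},
    "3.3": {"1.1"},
    "3.4": {"1.1"},
    "5.4": {"1.1"},
    "5.2": {"1.4", "1.11", "1.12"},
    "6.2": {"1.4"},
    "2.4": {"1.6"},
}

# static_order() raises graphlib.CycleError if the arrows ever form a cycle.
landing_order = list(TopologicalSorter(DEPENDS_ON).static_order())


def lands_before(first, second):
    """True when `first` appears before `second` in the computed order."""
    return landing_order.index(first) < landing_order.index(second)
```

Running this whenever an arrow is added or removed catches an accidental cycle before it reaches the schedule.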

| Week | Phase | Items |
|------|-------|-------|
| 1 | Phase 0 (DONE) + 1.1 spike + 1.3.a spike | GS/OS fopen + MAI flag spikes |
| 2 | Phase 1.1-1.6 | Foundational prerequisites |
| 3 | Phase 1.7-1.13 | Build/harness + LTO gates |
| 4-5 | Phase 2 (parallel) | M1 quick wins: clangdwarffix, hexfloat, tmpfile (+copy/delete fallback), docram, cursor, buildsystem, cxxsmoke |
| 6-7 | Phase 3.1 + Phase 4 (parallel) | debugger; menubuilder + sprites |
| 8-9 | Phase 3.2 | localvars full slice (-O0 + -O2/IMG + loclists + inlined) |
| 10 | Phase 3.3-3.4 | posixfile; resourcemgr (or stub-only landing) |
| 11 | Phase 5 | unwinder-stub + ThinLTO + cxxchrono + cxxstream/format/path |
| 12 | Phase 6 | profiler + sanitizers (UBSan-min + coverage) |

Total: 12 weeks of focused work for ~750-950h with Phase 0 decisions locked.

Phase 0.4 override (full localvars in one shot) adds ~10-15h vs the split approach; Phase 0.10 override (rename copy+delete) adds ~6h. Both are absorbed in the per-phase budgets above.


Risks I'm worried about (final list)

  1. FK_Data_4 truncation discovery cascade. The reviewer for localvars found the IMM24 truncation bug while planning DWARF work. The bug is fixed in Phase 1.3, but it's almost certainly the FIRST of several clang DWARF bugs for this target. Budget contingency in Phase 3.2-3.3.
  2. cxxsmoke surfaces silent codegen regressions. Every prior C++ probe this project has run (cxxProbe, etlProbe) has surfaced at least one backend bug. Phase 2.7 will likely do the same. Budget contingency.
  3. GS/OS fopen hang is unsolvable in budget. If Phase 1.1 doesn't yield a fix within 8-16h, multiple downstream items (resourcemgr, cxxstdlib::filesystem, tmpfile real path, posixfile real I/O) ship stub-only with documented limitations. This is acceptable but worth confirming up front.
  4. Layer-2-aware LTO miscompile. Phase 1.12 gate must be built FIRST. If skipped, the resulting binaries are silently wrong in the most performance-sensitive code path.
  5. menubuilder cRELOC budget pressure. reversi.omf already at 40.5KB; adding uiBuilder.c may push some demos past the cRELOC threshold. Re-baseline post-migration.
  6. unwinder scope creep. Phase 0.1 must be a hard decision. Going from (A) stub to (B) real DWARF mid-work would derail the schedule.
  7. MEMORY.md truncation. The index is already past the 200-line load limit. Before starting any item, grep for feedback_*<item-substring>*.md in the memory dir to surface anything the loaded portion doesn't show.
  8. sprites SHR shadow scribble. Phase 4.2.2 heap-vs-shadow policy is load-bearing. Without explicit handling, sprite save buffers will land in the visible display window and corrupt user pixels.

How to use this document

  • Start at Phase 0. Make each decision EXPLICITLY before any Phase 1 work.
  • Phase 1 is FOUNDATIONAL. Skip nothing. Items in later phases will fail silently if any Phase 1 prerequisite is missing.
  • For any item touching DWARF: Phase 1.3 MUST be green first.
  • For any item that does GS/OS file I/O: Phase 1.1 MUST be investigated.
  • Reviewer-adjusted hours are working estimates; brief hours are systematically low across the board.
  • The Critical-path summary is the dependency graph — respect it.