# Multi-segment OMF support: plan

## Why

Single-segment cap: ~60KB usable in bank 0 after the IO window
($C000-$CFFF), the stack at $0FFF, and crt0 / runtime overhead. Real
IIgs applications need 100s of KB across multiple banks. GS/OS Loader is
designed for this: load each segment into its chosen bank, fix up
inter-segment references at load time, jump to the entry segment.

## Today

- `link816` produces a flat binary covering `[--text-base, ...]` in a
  single bank-0 image. All sections are concatenated into one address
  space. Inter-section relocations are resolved at link time.
- `omfEmit` wraps that flat binary in a single OMF segment (KIND=CODE,
  ORG=0, SEGNUM=1, body = one DS opcode + END). No relocation records
  emitted (image is already absolute).
- `crt0` enables LC RAM, zeroes BSS, runs `.init_array`, calls `main`.
- All cross-function calls already use JSL (3-byte long); we never emit
  JSR. That's accidentally helpful for multi-segment.

## Target

A program that builds 4 segments, say:

- Segment 1 ("MAIN"): crt0 + main + a few hot routines, in bank 1
- Segment 2 ("CODE"): bulk of code, in bank 2
- Segment 3 ("DATA"): rodata, in bank 3
- Segment 4 ("BSS"): uninitialized data + heap, in bank 4

GS/OS Loader places each segment, applies inter-segment relocations
(every `JSL foo` where `foo` lives in a different segment becomes a
`JSL` whose operand is patched at load time with the absolute address),
and jumps to the entry.

## The four hard problems

### 1. Section → segment assignment policy

We need a deterministic rule that maps every input object's `.text` /
`.rodata` / `.bss` / `.init_array` section into a specific segment.
Three options:

**A. Per-object → one segment.** Each `.o` becomes one segment. Simple
mental model; bad locality (many tiny segments, lots of inter-segment
JSLs); GS/OS Loader has 8KB+ minimum overhead per segment.

**B. Greedy bin-packing.** Compute total code size; cap each segment at
N bytes (e.g.
32KB to leave headroom); pack `.text` sections into segments greedily in
input order. Same for `.rodata` / `.bss`. Predictable, but a function
near the end of segment N might want to JSL a function at the start of
segment N+1. That is a common pattern, and every such call becomes
inter-segment.

**C. Static call graph + clustering.** Compute call graph from the
relocations, cluster co-calling functions together, pack clusters into
segments to minimize inter-segment edges. Best locality, real linker
work.

**Recommendation: B for v1.** Add a `--segment-cap` option (default
32768). Real applications will want C eventually, but B unblocks "my
program is bigger than 64KB" today.

### 2. Inter-segment relocation tracking

When a `JSL foo` reloc resolves to a function in a different segment, we
MUST emit an OMF relocation record instead of patching the bytes
in-place. Currently `link816` patches everything at link time and emits
zero reloc records.

The reloc model becomes per-segment:

- Intra-segment IMM16 / PCREL: patch at link time, no OMF record.
- Intra-segment IMM24 (JSL): patch the segment-relative offset at link
  time and also emit an OMF reloc record, because the load bank is not
  known until the loader places the segment.
- Inter-segment IMM24 (cross-bank JSL): emit `INTERSEG` opcode (`E3`)
  pointing at `(target_segment_num, offset_within_segment)`.
- Inter-segment IMM16 data ref: requires the data segment to land in the
  same bank as the referencing code OR we need the loader to fail
  (16-bit absolute can't cross banks). In v1, force all data refs to be
  to a "data segment" that's in a fixed bank, OR rewrite to long
  addressing.

The IMM16 cross-segment problem is the killer. Three responses:

i. **Punt:** Disallow it. All `.rodata` references must be in the same
segment as the code, OR refs to global data must use long addressing
(rewrite at compile time via `__attribute__((far))`).

ii.
**Promote to long at link time:** Detect IMM16 cross-segment refs,
rewrite the instruction's encoding from absolute (3-byte) to
absolute-long (4-byte). This changes code size and shifts everything
after the patched site, so it is invasive.

iii. **Same-bank constraint:** Ensure the data segment's bank == the
code's DBR. Means all code segments share one DBR, all data lives in one
segment in that DBR's bank.

**Recommendation: iii for v1.** All `.rodata` lives in one segment in
the bank our code uses for DBR. crt0 already pins DBR to bank 0 (some
test code switches to bank 2 with `pha;plb`, but nothing does so in
general). For v1, all `.rodata` goes in bank 0 alongside the first text
segment, and code segments in higher banks reference data via long
absolute addressing. We still need to confirm what addressing modes our
backend actually emits for global access.

### 3. crt0 / loader contract

Current crt0 assumes flat layout:

```
__start:
    setup CPU mode, stack
    enable LC RAM
    zero BSS [__bss_start..__bss_end]
    run .init_array
    jsl main
    spin
```

Multi-segment changes:

- BSS may span multiple segments (bank 0 LC + bank N segment). The
  `__bss_start` / `__bss_end` symbols need to be per-segment, OR a loop
  over a list of `(start, end)` pairs the linker emits.
- `.init_array` ditto.
- LC RAM enable only applies to bank 0, which is fine.
- The OMF Loader will handle the actual memory placement; crt0 just runs
  after Loader is done.
- The Loader's entry call lands at the segment marked with the entry
  field. By convention that's segment 1.

**Decision:** Designate segment 1 as the "init segment" containing crt0
+ its required symbols (`__bss_start_seg1`, `__init_array_start_seg1`,
etc.). The linker emits a `__bss_table` and an `__init_array_table`:
arrays of `(start, end)` pointers walked by crt0. Same idea Mac OS X's
loader uses for multi-segment programs.

### 4. Build pipeline + tests

- `link816 --segment-cap N` emits multiple `(image, base, syms)` triples
  plus inter-segment reloc records.
- New intermediate format between linker and `omfEmit`: a small manifest
  file listing each segment's body, base, name, and reloc records.
  Easier than passing all that on the CLI.
- `omfEmit` reads the manifest and emits a single multi-segment OMF file
  with proper INTERSEG opcodes.
- Smoke needs a new test: build a program with `--segment-cap 8192` so
  it forces ≥2 segments even for our small benches; verify under MAME
  via a GS/OS-loader-aware test path. (We don't have GS/OS-loaded tests
  today; see "Risks" below.)

## Phased implementation

### Phase 1: linker emits per-segment images + manifest

- `link816 --segment-cap N --manifest manifest.json -o out`
- Pack `.text` greedy into segments 1..K capped at N bytes each.
- All `.rodata` into segment K+1 (the "data segment").
- All `.bss` into segment K+2.
- Resolve intra-segment relocations.
- Write inter-segment relocations into the manifest.
- Emit one flat binary per segment; manifest references them by path.

### Phase 2: omfEmit consumes manifest, emits multi-segment OMF

- One OMF segment header per manifest entry.
- DS opcodes for body bytes.
- INTERSEG (`E3`) opcodes for inter-segment reloc patch sites.
- RELOC (`E2`) opcodes for intra-segment relocations that need load-time
  fixup (JSL targets within same segment but different bank than
  expected).
- END opcode terminator per segment.

### Phase 3: runtime updates

- Linker emits `__bss_table[]` and `__init_array_table[]` instead of
  single `__bss_start`/`__bss_end` symbols.
- crt0 walks those tables.
- `crt0.s` drops the hardcoded LC enable when segment 1 isn't in bank 0
  (configurable).

### Phase 4: tests + smoke

- Bench harness builds with `--segment-cap 8192` to force multi-segment
  even for small programs; verify output size growth (should be small:
  just OMF headers + reloc records overhead).
- Need a GS/OS-aware MAME test path (boot a ProDOS volume with our OMF
  binary, let GS/OS Loader load it, check markers in bank 2).
  This is the test we deferred earlier in the GS/OS smoke task.

**Phase 4 reopens the GS/OS-volume smoke decision**: multi-segment is
the main reason to even care about that.

## Scope estimate

- Phase 1: 2-3 sessions (linker rework, careful with reloc accounting)
- Phase 2: 1 session (mostly OMF format work, well-specified)
- Phase 3: 1 session (crt0 + linker symbol table changes)
- Phase 4: 2-3 sessions (GS/OS-loaded test infra is the slog, not the
  multi-segment logic itself)

Total: ~6-8 focused sessions. Phases 1-3 deliver a working multi-segment
binary; phase 4 makes it testable in CI.

## Risks

- **DBR management is genuinely tricky.** Code in segment 2 (bank 2)
  doing `lda foo` where foo is in segment K+1 (bank 0): the absolute
  fetch uses DBR. If DBR != bank-of-foo, we read garbage. The cleanest
  rule (DBR=0 always; data refs use long via `__attribute__((far))` or a
  backend pass that promotes them) requires backend cooperation we don't
  have. v1's "all data in one segment in DBR's bank" works but
  constrains data size to ~60KB.
- **The Loader's behaviour around segment placement is poorly
  documented.** Apple's GS/OS Loader picks banks dynamically; we may end
  up with code segments in banks the loader chose, with relocations that
  work, but layouts that surprise us. Mitigation: use STATIC segments
  (KIND bit) initially so the loader can't move them.
- **Smoke needs a real GS/OS volume image.** This is the same blocker as
  the deferred GS/OS file I/O smoke: it needs a 2img/po image with a
  ProDOS volume + a way to run our OMF through the actual loader.
  Without that, multi-segment logic is testable only by inspection of
  the OMF bytes and a hand-rolled mini-loader (which we'd have to
  write).

## Recommendation

Start Phase 1. The linker work is contained, mostly mechanical, and the
manifest format gives us a clean handoff to `omfEmit` work in Phase 2.
We can validate Phase 1 by inspecting the per-segment images + manifest
before any OMF / loader work.
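
To make the Phase 1 handoff concrete, here is a minimal Python sketch of the greedy packing (option B) and the segment list a manifest could carry. It is an illustration under assumptions, not `link816`'s code: the `Section` and `Segment` types, the helper names, and the manifest keys (`segment_cap`, `segments`, `num`, `kind`, `size`, `sections`) are all hypothetical, and reloc records and image paths are left out.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Section:
    name: str  # hypothetical label, e.g. "main.o:.text"
    size: int  # size in bytes


@dataclass
class Segment:
    num: int                      # 1-based segment number
    kind: str                     # "CODE", "DATA" or "BSS"
    sections: list = field(default_factory=list)
    size: int = 0


def pack_greedy(sections, cap, first_num=1):
    """Pack .text sections in input order; open a new segment when the
    next section would push the current one past the cap."""
    segments = []
    for sec in sections:
        if sec.size > cap:
            raise ValueError(f"{sec.name}: {sec.size} bytes exceeds cap {cap}")
        if not segments or segments[-1].size + sec.size > cap:
            segments.append(Segment(num=first_num + len(segments), kind="CODE"))
        segments[-1].sections.append(sec.name)
        segments[-1].size += sec.size
    return segments


def single_segment(num, kind, sections):
    """Put every section of one kind into a single segment."""
    return Segment(num, kind, [s.name for s in sections],
                   sum(s.size for s in sections))


def build_manifest(text, rodata, bss, cap=32768):
    """Return a manifest dict: code segments 1..K, data K+1, bss K+2."""
    code = pack_greedy(text, cap)
    k = len(code)
    segs = code + [single_segment(k + 1, "DATA", rodata),
                   single_segment(k + 2, "BSS", bss)]
    return {
        "segment_cap": cap,
        "segments": [
            {"num": s.num, "kind": s.kind, "size": s.size,
             "sections": s.sections}
            for s in segs
        ],
    }


if __name__ == "__main__":
    # Made-up section sizes, chosen so an 8192-byte cap forces two code segments.
    text = [Section("a.o:.text", 5000), Section("b.o:.text", 4000),
            Section("c.o:.text", 3000)]
    manifest = build_manifest(text, [Section("a.o:.rodata", 100)],
                              [Section("a.o:.bss", 200)], cap=8192)
    print(json.dumps(manifest, indent=2))
```

Packing strictly in input order keeps the output deterministic, which is what makes inspecting the manifest a usable Phase 1 check. The cost is the weak locality already noted for option B.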
Phase 4's GS/OS-volume test path is the biggest unknown. It is
reasonable to defer that decision until Phases 1-3 are working; at that
point we can decide whether to invest in proper GS/OS-loaded smoke or
accept "multi-segment OMF emits valid bytes per the spec" as the test
bar.
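
The per-segment reloc model from problem 2 can also be written as a small decision function, which might help when checking reloc accounting by inspection. This Python sketch restates the rules listed there and adds two assumptions of its own: an inter-segment PCREL reference is treated as a link error, and the v1 same-bank rule is modeled by a single hypothetical `data_seg` number. The `Action` names are illustrative and are not OMF opcode names.

```python
from enum import Enum


class Action(Enum):
    PATCH = "patch at link time, no OMF record"
    PATCH_AND_RELOC = "patch segment-relative offset, also emit an intra-segment OMF reloc"
    INTERSEG = "emit an inter-segment OMF reloc (target segment number, offset)"
    SAME_BANK_DATA = "accept: target is the data segment that shares the DBR bank"
    ERROR = "reject at link time"


def classify(kind, src_seg, dst_seg, data_seg=None):
    """Return the Action for one relocation.

    kind     -- "IMM16", "IMM24" or "PCREL"
    src_seg  -- segment number holding the patch site
    dst_seg  -- segment number holding the target symbol
    data_seg -- segment number of the single v1 data segment, if any
    """
    if kind not in ("IMM16", "IMM24", "PCREL"):
        raise ValueError(f"unknown reloc kind: {kind}")
    same = src_seg == dst_seg
    if kind == "IMM24":
        # The load bank is unknown at link time, so even an intra-segment
        # JSL needs a load-time fixup.
        return Action.PATCH_AND_RELOC if same else Action.INTERSEG
    if same:
        return Action.PATCH
    if kind == "IMM16" and dst_seg == data_seg:
        # v1 option iii: 16-bit data refs only reach the DBR-bank data segment.
        return Action.SAME_BANK_DATA
    # Assumption: any other cross-segment 16-bit or PC-relative ref is an error.
    return Action.ERROR
```

How the 16-bit offset of a `SAME_BANK_DATA` reference gets fixed up (at link time for a fixed-address segment, or by a loader record) is left open here, as it is in the plan.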