# Multi-segment OMF support: plan

## Why

Single-segment cap: ~60KB usable in bank 0 after the IO window
($C000-$CFFF), the stack at $0FFF, and crt0 / runtime overhead. Real IIgs
applications need 100s of KB across multiple banks. GS/OS Loader is
designed for this: load each segment into its chosen bank, fix up
inter-segment references at load time, jump to the entry segment.

## Today

- `link816` produces a flat binary covering `[--text-base, ...]` in a
  single bank-0 image. All sections are concatenated into one address
  space. Inter-section relocations are resolved at link time.
- `omfEmit` wraps that flat binary in a single OMF segment (KIND=CODE,
  ORG=0, SEGNUM=1, body = one LCONST opcode + END). No relocation records
  are emitted (the image is already absolute).
- `crt0` enables LC RAM, zeroes BSS, runs `.init_array`, calls `main`.
- All cross-function calls already use JSL (3-byte long address); we
  never emit JSR. That's accidentally helpful for multi-segment.

## Target

A program that builds 4 segments, say:
- Segment 1 ("MAIN"): crt0 + main + a few hot routines, in bank 1
- Segment 2 ("CODE"): bulk of code, in bank 2
- Segment 3 ("DATA"): rodata, in bank 3
- Segment 4 ("BSS"): uninitialized data + heap, in bank 4

GS/OS Loader places each segment, applies inter-segment relocations
(every `JSL foo` where `foo` lives in a different segment becomes a
`JSL <segment-relative-addr>` patched at load time with the absolute
address), and jumps to the entry.

## The four hard problems

### 1. Section → segment assignment policy

We need a deterministic rule that maps every input object's `.text` /
`.rodata` / `.bss` / `.init_array` section into a specific segment.
Three options:

**A. Per-object → one segment.** Each `.o` becomes one segment. Simple
mental model; bad locality (many tiny segments, lots of inter-segment
JSLs); GS/OS Loader has 8KB+ minimum overhead per segment.

**B. Greedy bin-packing.** Compute total code size; cap each segment at
N bytes (e.g. 32KB to leave headroom); pack `.text` sections into
segments greedily in input order. Same for `.rodata` / `.bss`.
Predictable, but a function near the end of segment N might want to
JSL a function at the start of segment N+1 (a common pattern), and
every such call becomes inter-segment.

**C. Static call graph + clustering.** Compute call graph from the
relocations, cluster co-calling functions together, pack clusters into
segments to minimize inter-segment edges. Best locality, real linker
work.

**Recommendation: B for v1.** Add a `--segment-cap` option (default
32768). Real applications will want C eventually, but B unblocks
"my program is bigger than 64KB" today.

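Option B can be sketched in a few lines. This is only an illustration of the packing rule, not `link816` code: the `Section` and `Segment` types and the `pack_greedy` name are hypothetical, and the sketch assumes a section larger than the cap is a hard error.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    name: str  # hypothetical label, e.g. "foo.o:.text"
    size: int  # size in bytes


@dataclass
class Segment:
    number: int                       # 1-based segment number
    sections: list = field(default_factory=list)
    used: int = 0                     # bytes packed so far


def pack_greedy(sections, cap=32768):
    """Pack sections into segments in input order, never exceeding cap."""
    segments = []
    current = None
    for sec in sections:
        if sec.size > cap:
            # Assumption: an oversized section is an error, not split.
            raise ValueError(f"{sec.name} ({sec.size} bytes) exceeds cap {cap}")
        if current is None or current.used + sec.size > cap:
            # Current segment is full (or none exists yet): open a new one.
            current = Segment(number=len(segments) + 1)
            segments.append(current)
        current.sections.append(sec)
        current.used += sec.size
    return segments
```

Because sections are never reordered, the result depends only on input order and the cap, which gives the determinism asked for above. The cost is the one already noted: neighbouring functions can end up on opposite sides of a segment boundary.
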
### 2. Inter-segment relocation tracking

When a `JSL foo` reloc resolves to a function in a different segment,
we MUST emit an OMF relocation record instead of patching the bytes
in-place. Currently `link816` patches everything at link time and emits
zero reloc records.

The reloc model becomes per-segment:

- Intra-segment IMM16 / PCREL: patch at link time, no OMF record.
- Intra-segment IMM24 (JSL): patch the segment-relative offset into the
  low 24 bits at link time, and also emit an OMF reloc record so the
  loader can adjust it once the segment is placed. We don't know the
  load bank at link time.
- Inter-segment IMM24 (cross-bank JSL): emit `INTERSEG` opcode (`E3`)
  pointing at `(target_segment_num, offset_within_segment)`.
- Inter-segment IMM16 data ref: requires the data segment to land in
  the same bank as the referencing code OR we need the loader to fail
  (16-bit absolute can't cross banks). In v1, force all data refs to be
  to a "data segment" that's in a fixed bank, OR rewrite to long
  addressing.

The IMM16 cross-segment problem is the killer. Three responses:

- i. **Punt:** Disallow it. All `.rodata` references must be in the
  same segment as the code, OR refs to global data must use long
  addressing (rewrite at compile time via `__attribute__((far))`).
- ii. **Promote to long at link time:** Detect IMM16 cross-segment
  refs, rewrite the instruction's encoding from absolute (3-byte)
  to absolute-long (4-byte). Changes code size and shifts everything
  after the patched site, so it is invasive.
- iii. **Same-bank constraint:** Ensure the data segment's bank ==
  the code's DBR. Means all code segments share one DBR, all data
  lives in one segment in that DBR's bank.

**Recommendation: iii for v1.** All `.rodata` lives in one segment
in the bank our code uses for DBR. We already pin DBR to bank 0 in
crt0 (some tests switch DBR to bank 2 with `pha; plb`, but not in
general). For v1, all `.rodata` goes in bank 0 alongside the
first text segment, and code segments in higher banks reference data
via long absolute addressing. We still need to confirm what addressing
modes our backend actually emits for global access.

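The per-segment reloc model from this section, combined with the v1 same-bank rule, can be written as a small decision function. This is a sketch under assumptions rather than the linker's real logic: the `Action` enum, `classify`, and the `data_seg` parameter are hypothetical names, cross-segment PCREL is assumed to be an error (the list above does not cover it), and a cross-segment IMM16 ref into the single data segment is assumed to be expressible as a 2-byte INTERSEG record.

```python
from enum import Enum


class Action(Enum):
    PATCH = "patch at link time, no OMF record"
    RELOC = "patch offset at link time and emit a RELOC record"
    INTERSEG = "emit an INTERSEG record"
    ERROR = "reject: not expressible under the v1 rules"


def classify(kind, src_seg, dst_seg, data_seg=None):
    """Decide how to handle one relocation.

    kind      -- "IMM16", "IMM24" or "PCREL" (names from the list above)
    src_seg   -- segment number containing the patch site
    dst_seg   -- segment number containing the target symbol
    data_seg  -- the one data segment sharing the code's DBR bank, if any
    """
    same_segment = src_seg == dst_seg
    if kind == "IMM24":
        return Action.RELOC if same_segment else Action.INTERSEG
    if kind == "PCREL":
        # Assumption: a relative branch can never leave its segment.
        return Action.PATCH if same_segment else Action.ERROR
    if kind == "IMM16":
        if same_segment:
            return Action.PATCH
        # v1 rule: 16-bit refs may only cross into the DBR-bank data segment.
        return Action.INTERSEG if dst_seg == data_seg else Action.ERROR
    raise ValueError(f"unknown relocation kind: {kind}")
```

Making `ERROR` an explicit outcome means a stray 16-bit cross-segment reference fails at link time instead of silently reading from the wrong bank at run time.
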
### 3. crt0 / loader contract

Current crt0 assumes flat layout:

```
__start:
    setup CPU mode, stack
    enable LC RAM
    zero BSS [__bss_start..__bss_end]
    run .init_array
    jsl main
    spin
```

Multi-segment changes:

- BSS may span multiple segments (bank 0 LC + bank N segment). The
  `__bss_start` / `__bss_end` symbols need to be per-segment, OR crt0
  loops over a list of `(start, end)` pairs the linker emits.
- `.init_array` ditto.
- LC RAM enable only applies to bank 0, which is fine.
- The OMF Loader will handle the actual memory placement; crt0 just
  runs after Loader is done.
- The Loader's entry call lands at the segment marked with the entry
  field. By convention that's segment 1.

**Decision:** Designate segment 1 as the "init segment" containing crt0 +
its required symbols (`__bss_start_seg1`, `__init_array_start_seg1`,
etc.), and have the linker emit a `__bss_table` and an
`__init_array_table`: arrays of `(start, end)` pointers walked by crt0.
Same idea Mac OS X's loader uses for multi-segment programs.

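To make the table walk concrete, here is a small Python model of what crt0 would do with `__bss_table` and `__init_array_table`. It is a simulation, not 65816 code. It assumes `end` is exclusive, models memory as a flat `bytearray`, and models each init array as a list of callables rather than raw pointers. The function names are hypothetical.

```python
def zero_bss(memory, bss_table):
    """Zero every [start, end) range in bss_table, one range per segment."""
    for start, end in bss_table:
        # Assumption: end is exclusive, matching a start/end symbol pair.
        memory[start:end] = bytes(end - start)


def run_init_arrays(init_array_table):
    """Call every constructor, segment by segment, in table order."""
    for constructors in init_array_table:
        for ctor in constructors:
            ctor()


# Two BSS ranges, standing in for a bank-0 range and a bank-N range.
memory = bytearray(b"\xff" * 32)
zero_bss(memory, [(4, 8), (16, 20)])
```

The real crt0 would do the same thing with a pointer loop; the piece worth agreeing on early is the table layout (pair order, exclusive end, and how the table is terminated or counted).
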
### 4. Build pipeline + tests

- `link816 --segment-cap N` emits multiple `(image, base, syms)`
  triples plus inter-segment reloc records.
- New intermediate format between linker and `omfEmit`: a small
  manifest file listing each segment's body, base, name, and reloc
  records. Easier than passing all that on the CLI.
- `omfEmit` reads the manifest and emits a single multi-segment OMF
  file with proper INTERSEG opcodes.
- Smoke needs a new test: build a program with `--segment-cap 8192` so it
  forces ≥2 segments even for our small benches; verify under MAME via
  a GS/OS-loader-aware test path. (We don't have GS/OS-loaded tests
  today; see "Risks" below.)

## Phased implementation

### Phase 1: linker emits per-segment images + manifest
- `link816 --segment-cap N --manifest manifest.json -o out`
- Pack `.text` greedy into segments 1..K capped at N bytes each.
- All `.rodata` into segment K+1 (the "data segment").
- All `.bss` into segment K+2.
- Resolve intra-segment relocations.
- Write inter-segment relocations into the manifest.
- Emit one flat binary per segment; manifest references them by path.

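One possible shape for the Phase 1 manifest is sketched below. Nothing here is a settled format: every field name (`number`, `name`, `base`, `image`, `segment`, `offset`, `size`, `target_segment`, `target_offset`), the `build_manifest` helper, and the image file names are hypothetical. They were chosen only to cover the items the plan says the manifest must carry (body, base, name, reloc records).

```python
import json


def build_manifest(segments, relocs):
    """Return a JSON-serialisable manifest dict (all field names hypothetical)."""
    return {
        "version": 1,
        "segments": [
            {"number": num, "name": name, "base": base, "image": image_path}
            for num, name, base, image_path in segments
        ],
        "relocs": [
            {"segment": seg, "offset": offset, "size": size,
             "target_segment": tseg, "target_offset": toff}
            for seg, offset, size, tseg, toff in relocs
        ],
    }


manifest = build_manifest(
    segments=[(1, "MAIN", 0, "out.seg1.bin"), (2, "CODE", 0, "out.seg2.bin")],
    # One 3-byte JSL operand at offset $11 of segment 1, targeting segment 2 + $40.
    relocs=[(1, 0x0011, 3, 2, 0x0040)],
)
text = json.dumps(manifest, indent=2)
```

Keeping relocations as `(segment, offset, size, target_segment, target_offset)` tuples makes Phase 1 checkable by eye before any OMF work exists.
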
### Phase 2: omfEmit consumes manifest, emits multi-segment OMF
- One OMF segment header per manifest entry.
- LCONST (`F2`) opcodes for body bytes (DS, `F1`, only for zero-filled
  runs such as BSS).
- INTERSEG (`E3`) opcodes for inter-segment reloc patch sites.
- RELOC (`E2`) opcodes for intra-segment relocations that need
  load-time fixup (JSL targets within the same segment, whose bank is
  only known once the loader places the segment).
- END opcode terminator per segment.

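The body records for Phase 2 are small enough to sketch directly. The layouts below follow our reading of Apple's OMF definition and should be checked against the reference before `omfEmit` relies on them: little-endian fields, NUMLEN = 4, RELOC = `$E2`, INTERSEG = `$E3`, LCONST = `$F2`, END = `$00`. The helper names are hypothetical, and segment headers are out of scope.

```python
import struct

END, RELOC, INTERSEG, LCONST = 0x00, 0xE2, 0xE3, 0xF2


def lconst(data):
    """LCONST: opcode, 4-byte length, then the literal bytes."""
    return struct.pack("<BI", LCONST, len(data)) + data


def reloc(size, shift, offset, value):
    """RELOC: patch `size` bytes at `offset` with the relocated `value`."""
    return struct.pack("<BBbII", RELOC, size, shift, offset, value)


def interseg(size, shift, offset, file_num, seg_num, target_offset):
    """INTERSEG: patch `size` bytes at `offset` with an address in another segment."""
    return struct.pack("<BBbIHHI", INTERSEG, size, shift, offset,
                       file_num, seg_num, target_offset)


def segment_body(image, records):
    """One segment body: the image as LCONST, then reloc records, then END."""
    return lconst(image) + b"".join(records) + bytes([END])


# A JSL (opcode $22) whose 3-byte operand at offset 1 targets segment 2 + $40.
body = segment_body(b"\x22\x00\x00\x00", [interseg(3, 0, 1, 1, 2, 0x40)])
```

The sketch uses the long record forms only. The compressed variants (cRELOC, cINTERSEG, SUPER) produce smaller files but can wait until the plain records are known to load.
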
### Phase 3: runtime updates
- Linker emits `__bss_table[]` and `__init_array_table[]` instead of
  single `__bss_start`/`__bss_end` symbols.
- crt0 walks those tables.
- `crt0.s` makes the hardcoded LC enable conditional, skipping it when
  segment 1 isn't in bank 0 (configurable).

### Phase 4: tests + smoke
- Bench harness builds with `--segment-cap 8192` to force multi-segment
  even for small programs; verify output size growth (should be small:
  just OMF header + reloc record overhead).
- Need a GS/OS-aware MAME test path (boot a ProDOS volume with our OMF
  binary, let GS/OS Loader load it, check markers in bank 2). This is
  the test we deferred earlier in the GS/OS smoke task. **Phase 4
  reopens the GS/OS-volume smoke decision**: multi-segment is the
  main reason to even care about that.

## Scope estimate

- Phase 1: 2-3 sessions (linker rework, careful with reloc accounting)
- Phase 2: 1 session (mostly OMF format work, well-specified)
- Phase 3: 1 session (crt0 + linker symbol table changes)
- Phase 4: 2-3 sessions (GS/OS-loaded test infra is the slog, not the
  multi-segment logic itself)

Total: ~6-8 focused sessions. Phases 1-3 deliver a working
multi-segment binary; phase 4 makes it testable in CI.

## Risks

- **DBR management is genuinely tricky.** Code in segment 2 (bank 2)
  doing `lda foo` where foo is in segment K+1 (bank 0): the absolute
  fetch uses DBR. If DBR != bank-of-foo, we read garbage. The cleanest
  rule (DBR=0 always; data refs use long via `__attribute__((far))` or
  a backend pass that promotes them) requires backend cooperation
  we don't have. v1's "all data in one segment in DBR's bank" works
  but constrains data size to ~60KB.
- **The Loader's behaviour around segment placement is poorly
  documented.** Apple's GS/OS Loader picks banks dynamically; we may
  end up with code segments in banks the loader chose, with relocations
  that work, but layouts that surprise us. Mitigation: use STATIC
  segments (KIND bit) initially so the loader can't move them.
- **Smoke needs a real GS/OS volume image.** This is the same blocker
  as the deferred GS/OS file I/O smoke: it needs a 2img/po image with a
  ProDOS volume + a way to run our OMF through the actual loader.
  Without that, multi-segment logic is testable only by inspection of
  the OMF bytes and a hand-rolled mini-loader (which we'd have to
  write).

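The hand-rolled mini-loader from the last risk could start as small as the sketch below. It only models the patching step: given per-segment images, a chosen load address for each segment, and a list of relocations, it writes the absolute target address into each patch site. The record fields and the `apply_relocs` name are hypothetical, and real Loader behaviour (memory allocation, KIND handling, bit shifts) is deliberately left out.

```python
def apply_relocs(images, bases, relocs):
    """Patch each reloc site with base-of-target-segment + target offset.

    images -- {segment number: bytearray of that segment's body}
    bases  -- {segment number: 24-bit load address chosen for the segment}
    relocs -- dicts with segment, offset, size, target_segment, target_offset
    """
    for r in relocs:
        address = bases[r["target_segment"]] + r["target_offset"]
        site = r["offset"]
        # Little-endian, truncated to the operand size (3 bytes for a JSL).
        patch = address.to_bytes(4, "little")[: r["size"]]
        images[r["segment"]][site:site + r["size"]] = patch


# Segment 1 holds a JSL (opcode $22) with a zeroed operand; the target is
# segment 2 + $1234, and segment 2 has been "loaded" at $02/0000.
images = {1: bytearray([0x22, 0x00, 0x00, 0x00]), 2: bytearray(0x2000)}
bases = {1: 0x010000, 2: 0x020000}
apply_relocs(images, bases, [
    {"segment": 1, "offset": 1, "size": 3, "target_segment": 2, "target_offset": 0x1234},
])
```

Running the same relocations with different `bases` is a cheap way to check that nothing in the images still depends on a fixed bank.
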
## Recommendation

Start Phase 1. The linker work is contained, mostly mechanical, and
the manifest format gives us a clean handoff to `omfEmit` work in
Phase 2. We can validate Phase 1 by inspecting the per-segment images
and manifest before any OMF / loader work.

Phase 4's GS/OS-volume test path is the biggest unknown. Reasonable to
defer that decision until Phases 1-3 are working. At that point we
can decide whether to invest in proper GS/OS-loaded smoke or accept
"multi-segment OMF emits valid bytes per the spec" as the test bar.