1 files changed, 502 insertions, 0 deletions
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..c09e97e
--- /dev/null
+++ b/README.md
@@ -0,0 +1,502 @@
+# pixpat
+
+A small C++ library for **pixel format conversion** and **test pattern
+generation**, with a C API and Python bindings.
+
+## Why pixpat
+
+- **Templated C++ core.** Each pixel format is described once as a
+  *layout* — component order, bit widths, plane shape — and the
+  conversion code is generated from those descriptions by the C++
+  compiler. Adding a new format or component order is a few lines of
+  layout, not a new conversion routine.
+- **16-bit normalized pivot.** Conversions don't go format-to-format.
+  They unpack into a 16-bit RGB or YUV intermediate and pack from it,
+  so cross-color-kind conversions (RGB ↔ YUV, with arbitrary matrix
+  and range) cost the same as same-color-kind ones, precision is
+  preserved when both endpoints are wider than 8 bits, and the work
+  scales as N+M instead of N×M.
+- **Built to drop into pipelines.** Caller-owned buffers and a
+  freestanding C++ core (`-fno-exceptions -fno-rtti`, no libstdc++
+  runtime dep) mean pixpat fits inside the inner loop of a DRM/KMS,
+  V4L2, or GPU-upload path without copies or runtime baggage.
+
+## Why not pixpat
+
+- **You need maximum throughput.** Hand-tuned conversion paths —
+  OpenCV with its SIMD intrinsics, vendor-supplied codecs, ffmpeg's
+  `swscale` — still outrun pixpat on the heavily-trafficked format
+  pairs. pixpat aims for "fast enough" out of a small generated
+  codebase, not for raw peak speed.
+- **You want a pure-Python library.** The Python package is a thin
+  `ctypes` wrapper around `libpixpat.so`. You need a native wheel
+  matching your architecture; there is no pure-Python fallback.
+- **You need GPU conversion.** pixpat is CPU-only.
+
+## What it offers
+
+- **Test patterns** — `kmstest`, SMPTE RP 219-1 bars, `plain` solid
+  fill, R/G/B/gray ramps, checkerboard, color-bar overlays, zone
+  plate. All take a single per-pattern parameter string.
+- **Format conversion** between packed / semiplanar / planar YUV,
+  RGB, raw Bayer, and grayscale. Cross-color-kind conversions
+  (RGB ↔ YUV) honor BT.601 / BT.709 / BT.2020 matrices and
+  limited / full quantization range.
+- **Caller-owned pixel memory.** Callers pass plane pointers and
+  strides; pixpat never allocates pixel buffers. The only internal
+  allocation is a small per-thread normalized line buffer reused
+  across cold-path calls.
+- **DRM / kms++ / pixutils format names** (`XRGB8888`, `NV12`, …)
+  rather than DRM/V4L2 four-character codes (`XR24`, …). See
+  [Format names and byte order](#format-names-and-byte-order) — the
+  convention disagrees with OpenCV's.
+- **Optional multi-threading** via a single `num_threads` knob on
+  both entry points.
+- **Build-time tailoring.** A small TOML config selects which formats
+  and patterns are compiled in, which formats are read-only /
+  write-only, and which formats get a fully-fused fast path.
+
+## Supported formats
+
+The default build ships every format in the catalog. Each one works as
+a `pixpat_draw_pattern` target, a `pixpat_convert` source, and a
+`pixpat_convert` destination.
+
+- **RGB packed** — `RGB332`, `RGB565`, `BGR565`, the four 1555
+  permutations `{XRGB,ARGB,XBGR,ABGR}1555`, the six 4444 permutations
+  `{XRGB,ARGB,XBGR,ABGR,RGBX,RGBA}4444`, `RGB888`, `BGR888`, the
+  eight 8888 permutations `{XRGB,ARGB,XBGR,ABGR,RGBX,RGBA,BGRX,BGRA}8888`,
+  the eight 10-bit permutations `{XRGB,ARGB,XBGR,ABGR}2101010` and
+  `{RGBX,RGBA,BGRX,BGRA}1010102`, and `ABGR16161616`. `R8` is a
+  single-channel form; on read the unspecified channels are
+  synthesized as G=B=R.
+- **YUV packed** — `YUYV`, `YVYU`, `UYVY`, `VYUY`, `Y210`, `Y212`,
+  `Y216`, `VUY888`, `XVUY8888`, `XVUY2101010`, `AVUY16161616`.
+- **YUV semiplanar** — `NV12`, `NV21`, `NV16`, `NV61`, `P030`, `P230`.
+- **YUV planar** — `YUV444`, `YVU444`, `YUV422`, `YVU422`, `YUV420`,
+  `YVU420`, `T430`.
+- **Grayscale** — `Y8`, `Y10`, `Y12`, `Y16`, `XYYY2101010`, `Y10P`,
+  `Y12P`.
+- **Bayer unpacked** — `SRGGB` / `SBGGR` / `SGRBG` / `SGBRG` at
+  8 / 10 / 12 / 16 bit. Reads use a bilinear demosaic.
+- **Bayer MIPI-packed** — the same four phases at 10P / 12P.
+
+### Format names and byte order
+
+Format names follow the DRM / kms++ / pixutils convention:
+components are listed **MSB-first** inside the storage word. So
+`XRGB8888` means a 32-bit word with X in the highest byte and B in
+the lowest, and `BGR888` is a 24-bit format with B at the highest
+byte and R at byte 0.
+
+OpenCV uses the **opposite** convention — its `BGR` is byte-order, so
+OpenCV `BGR` and pixpat `RGB888` describe the same in-memory layout.
+Keep this in mind when comparing pipelines.
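+
+To make the naming rule concrete, here is a minimal Python sketch
+(not part of pixpat) that packs one `XRGB8888` pixel as a 32-bit word
+and prints its bytes in memory order. Little-endian storage is an
+assumption, consistent with B landing in the lowest byte above, and
+`xrgb8888_bytes` is a hypothetical helper:
+
+```python
+import struct
+
+def xrgb8888_bytes(r, g, b, x=0):
+    """Pack one XRGB8888 pixel; components are named MSB-first."""
+    word = (x << 24) | (r << 16) | (g << 8) | b
+    # "<I" is a little-endian unsigned 32-bit word (assumed here):
+    # the lowest byte of the word is stored first in memory.
+    return struct.pack("<I", word)
+
+pixel = xrgb8888_bytes(r=0x11, g=0x22, b=0x33)
+print(pixel.hex())  # 33221100, i.e. B, G, R, X in memory order
+```
+
+The first three bytes come out as B, G, R, which is the order OpenCV
+calls `BGR`.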
+
+## Quickstart
+
+### C
+
+```c
+#include <pixpat/pixpat.h>
+
+uint8_t pixels[1920 * 1080 * 4];
+pixpat_buffer buf = {
+    .format     = "XRGB8888",
+    .width      = 1920,
+    .height     = 1080,
+    .num_planes = 1,
+    .planes     = { pixels },
+    .strides    = { 1920 * 4 },
+};
+
+pixpat_draw_pattern(&buf, "smpte", NULL);  /* NULL opts → BT.601 / limited / auto threads */
+```
+
+A `pixpat_convert(dst, src, opts)` call has the same shape: two
+`pixpat_buffer`s plus a small options struct (also nullable). See the
+public header at `pixpat-native/inc/pixpat/pixpat.h`.
+
+### Python
+
+```python
+import pixpat
+
+w, h = 1920, 1080
+data = bytearray(w * h * 4)
+buf = pixpat.Buffer(planes=[data], fmt="XRGB8888",
+                    width=w, height=h, strides=[w * 4])
+
+pixpat.draw_pattern(buf, "smpte")
+```
+
+The Python `Buffer` accepts anything that supports the buffer protocol
+— `bytearray`, `array.array`, `numpy.ndarray`, `mmap.mmap`,
+`memoryview`. Source buffers may be read-only; destination buffers
+must be writable.
+
+## Components
+
+- **`libpixpat`** — the C++ implementation, exposed through a small
+  C ABI in [`pixpat-native/inc/pixpat/pixpat.h`](pixpat-native/inc/pixpat/pixpat.h).
+  Built with Meson; produces both shared and static libraries plus a
+  pkg-config file.
+- **`pixpat` (Python)** — thin `ctypes` bindings over the C ABI. No
+  CPython extension, so a single wheel works on any CPython ≥ 3.9 for
+  a given architecture.
+
+## Building
+
+`libpixpat` is built with [Meson](https://mesonbuild.com/). A C++20
+compiler and Python 3 (used by the build-time codegen) are required;
+there are no third-party runtime dependencies.
+
+```sh
+meson setup build
+meson compile -C build
+```
+
+This produces `build/libpixpat.so` (and `.a`) and a `pixpat.pc`
+pkg-config file; the public header is already in the tree at
+`pixpat-native/inc/pixpat/pixpat.h`. To install system-wide:
+
+```sh
+meson install -C build
+```
+
+### Cross-compiling
+
+A cross file for aarch64 Linux ships in the tree:
+
+```sh
+meson setup build-aarch64 --cross-file pixpat-native/cross/aarch64-linux-gnu.txt
+meson compile -C build-aarch64
+```
+
+### Native tests
+
+```sh
+meson test -C build
+```
+
+These are smoke tests that exercise the public ABI from C and C++.
+Behavioral coverage (matrix correctness, threading, subsampling,
+Bayer demosaic, …) lives in the Python test suite.
+
+### Selecting a build profile
+
+The default profile compiles every format and pattern. To pick a
+different one, point Meson at an alternate TOML file via the `config`
+option:
+
+```sh
+meson setup build-min -Dconfig=pixpat-native/profiles/no_hotpath.toml
+```
+
+A few example profiles ship in `pixpat-native/profiles/`. See
+[Build-time configuration and codegen](#build-time-configuration-and-codegen)
+below for what the TOML controls.
+
+### Python install
+
+The Python package wraps the C ABI via `ctypes`, so installing it
+just means compiling `libpixpat.so` and bundling it as package data.
+For a native install:
+
+```sh
+pip install .
+```
+
+`setup.py` invokes meson during the wheel build (into
+`pixpat-python/build/native/`), copies the resulting `.so` into the
+package, and stamps the wheel for the host architecture. Requires
+`meson`, `ninja`, and a C++ compiler on the host.
+
+To cross-compile a wheel for another architecture, use the helper:
+
+```sh
+pixpat-python/scripts/build_wheel.sh x86_64    # or aarch64
+```
+
+The resulting wheel lands in `dist/`, tagged for the chosen
+architecture; meson's per-arch build dir lands at
+`pixpat-python/build-<arch>/native/`.
+
+### Editable Python install for development
+
+```sh
+meson setup build
+meson compile -C build
+pip install -e .
+```
+
+The editable install symlinks `build/libpixpat.so.0.0.0` into the
+package, so rebuilding the native side is picked up without
+re-installing.
+
+### Python tests
+
+From the repo root, after an editable install:
+
+```sh
+pytest pixpat-python/tests
+```
+
+For micro-benchmarking the `draw_pattern` and `convert` paths across
+formats, see `pixpat-python/scripts/perf_test.py`. This is a
+development tool, not part of the supported API surface.
+
+## Architecture
+
+The rest of this document covers how `libpixpat` is put together
+internally: how a conversion is structured, how formats are
+described, how the build is configured, and the supported compiler
+and runtime.
+
+### Conversion pipeline
+
+A conversion is the composition of three stages:
+
+```
+Source  →  ColorXfm  →  Sink
+```
+
+- A **source** unpacks caller-memory pixels into a **normalized
+  pixel** — `RGB16` or `YUV16`, four `uint16_t` components.
+- A **ColorXfm** maps one normalized pixel type to another: identity
+  for same-color-kind conversions, the selected matrix/range for
+  cross-color-kind ones. Template-specialized, so the identity case
+  vanishes at compile time.
+- A **sink** packs normalized pixels into destination memory.
+
+Each sink declares a `block_h × block_w` block matching its chroma
+subsampling (1×1 for unsubsampled formats, 2×2 for `NV12`). The
+converter materializes one block on the stack per iteration; under
+`-O3` it stays in registers for most sinks.
+
+#### Hot path vs cold path
+
+The normalized pixel type does double duty:
+
+- On the **hot path**, with `-O3`, the compiler keeps it in registers
+  across the source / ColorXfm / sink boundary — no per-line buffer
+  is involved.
+- On the **cold path**, two short legs — *unpack to norm* and
+  *pack from norm* — share a per-thread normalized line buffer.
+  Each leg is a templated function: one body per source, one per
+  sink. Cross-color-kind conversions add an in-place ColorXfm pass over
+  the buffer between the two legs.
+
+Whether a particular conversion runs on the hot or cold path is
+decided by the dispatch tier described in [Two-tier
+dispatch](#two-tier-dispatch).
+
+### Layout descriptor
+
+A pixel format is described once, declaratively, as a C++20
+non-type-template-parameter (NTTP) value. Three small types cover
+nearly every format (MIPI-packed Bayer is hand-rolled instead):
+
+```cpp
+enum class C : uint8_t { X, A, R, G, B, Y, U, V };
+
+struct Comp { C c; uint8_t bits; uint8_t shift; };
+
+template <typename Storage, Comp... Cs>
+struct Plane;
+
+template <ColorKind Kind, size_t Hsub, size_t Vsub, typename... Planes>
+struct Layout;
+```
+
+A `Plane` describes one storage word (`uint32_t`, `uint16_t`, …) and
+the components packed into it at given bit offsets. A `Layout` lists
+the planes, the color kind (RGB or YUV), and the chroma subsampling
+factors. Two named formats for comparison:
+
+```cpp
+using XRGB8888 = Layout<ColorKind::RGB, 1, 1,
+    Plane<uint32_t, Comp{C::B,8,0}, Comp{C::G,8,8},
+                    Comp{C::R,8,16}, Comp{C::X,8,24}>>;
+
+using NV12 = Layout<ColorKind::YUV, 2, 2,
+    Plane<uint8_t,  Comp{C::Y,8,0}>,
+    Plane<uint16_t, Comp{C::U,8,0}, Comp{C::V,8,8}>>;
+```
+
+`Plane` exposes `constexpr` helpers — `find_pos<C>`, `pack(values)`,
+`unpack(word)`, `bytes_per_pixel`, … — that the I/O templates use to
+emit per-format read and write code.
+
+### Patterns
+
+A pattern is a synthetic source: a small C++ struct exposing
+`sample(x, y, W, H)` that returns one normalized pixel. The list of
+supported names and their parameters is in
+[`pixpat-native/inc/pixpat/pixpat.h`](pixpat-native/inc/pixpat/pixpat.h).
+
+Dispatch is intentionally simple: every (pattern, sink) pair takes
+one normalized-pivot path. Per-pattern fill writes the normalized
+line buffer in the destination's color kind — folding the
+cross-color-kind `ColorXfm` into the per-pixel fill so that constant
+patterns collapse to `memset` under `-O3`. Per-sink pack encodes the
+line into the destination memory layout. Total cost is *O(N + M)*:
+adding a pattern is one fill specialization, adding a format is
+automatic.
+
+The SMPTE pattern's pixel values are spec-defined in BT.709 /
+Limited. Other rec/range settings are accepted but produce
+visibly-wrong colors — pixpat does not silently override the caller's
+color spec.
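+
+To see why the matrix choice is visible, the sketch below encodes the
+same RGB color with the BT.601 and BT.709 luma coefficients at 8-bit
+limited range. It uses the standard textbook formulas, not pixpat's
+own code (which this README does not show), and the function name is
+hypothetical:
+
+```python
+def rgb_to_ycbcr_limited8(r, g, b, kr, kb):
+    """Encode R'G'B' in [0, 1] as 8-bit limited-range Y'CbCr."""
+    kg = 1.0 - kr - kb
+    y = kr * r + kg * g + kb * b
+    cb = (b - y) / (2.0 * (1.0 - kb))
+    cr = (r - y) / (2.0 * (1.0 - kr))
+    # Limited range: luma spans 16..235, chroma 16..240 around 128.
+    return (round(16 + 219 * y),
+            round(128 + 224 * cb),
+            round(128 + 224 * cr))
+
+BT601 = dict(kr=0.299, kb=0.114)
+BT709 = dict(kr=0.2126, kb=0.0722)
+
+green = (0.0, 1.0, 0.0)
+print(rgb_to_ycbcr_limited8(*green, **BT601))
+print(rgb_to_ycbcr_limited8(*green, **BT709))
+```
+
+The two printed triples differ, so code values written under one
+matrix and decoded under the other land on a different color.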
+
+`pixpat_pattern_opts::params` is parsed once at the C entry point
+into a `Params` instance, then handed to the pattern constructor;
+patterns query keys by name and never see raw strings. The Python
+wrapper accepts either the wire string or a `Mapping[str, Any]`.
+
+### Two-tier dispatch
+
+The naïve approach — instantiate `Converter<Source, Sink>` for every
+source/sink pair — produces an N×N matrix of fully-fused inner loops.
+With this many formats the binary balloons quickly. So pixpat splits
+the conversion table into two tiers.
+
+**Hot pivot.** A small set of pivot formats gets the fully-fused
+treatment. Real-world conversion paths almost always have a common
+interchange format on at least one endpoint — OpenCV, Qt,
+framebuffers, and video pipelines all gravitate toward a single
+8-bit RGB form. Picking that form as the pivot makes the typical
+user path fully inlined; the rarer hardware-to-hardware path stays
+on the cold path. The default profile uses **`BGR888`** as its
+single pivot, the format OpenCV, Qt, and framebuffers all use. Each
+pivot covers itself as both source and destination, paired with
+every other format.
+
+**Cold path.** Every other source/sink pair walks each row group
+through the per-thread normalized line buffer:
+
+```
+for each row group:
+    unpack_to_norm<Source>(buf, src, ...)
+    if RGB ↔ YUV:
+        in-place ColorXfm pass over buf
+    pack_from_norm<Sink>(dst, buf, ...)
+```
+
+`unpack_to_norm<Source>` and `pack_from_norm<Sink>` are plain
+templated functions — one body per source, one per sink, not per
+pair. With ~2N legs plus two cross-color-kind helpers, the cold path fits
+in a small fixed code budget independent of the number of pairs.
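+
+As an illustration of the "one body per source, one per sink" idea,
+here is a toy Python model of the cold path for three 8-bit RGB
+formats. It is a sketch only: the tables and `convert_row` are
+hypothetical names, not pixpat's API, and the real legs are C++
+templates.
+
+```python
+BPP = {"RGB888": 3, "BGR888": 3, "XRGB8888": 4}
+
+# One unpack leg per source: memory bytes -> 16-bit (r, g, b).
+# Names are MSB-first, so RGB888 is B, G, R in memory. v * 257 is
+# 8-bit bit replication (0xFF -> 0xFFFF).
+UNPACK = {
+    "RGB888":   lambda px: (px[2] * 257, px[1] * 257, px[0] * 257),
+    "BGR888":   lambda px: (px[0] * 257, px[1] * 257, px[2] * 257),
+    "XRGB8888": lambda px: (px[2] * 257, px[1] * 257, px[0] * 257),
+}
+
+# One pack leg per sink: 16-bit (r, g, b) -> memory bytes.
+PACK = {
+    "RGB888":   lambda n: bytes((n[2] >> 8, n[1] >> 8, n[0] >> 8)),
+    "BGR888":   lambda n: bytes((n[0] >> 8, n[1] >> 8, n[2] >> 8)),
+    "XRGB8888": lambda n: bytes((n[2] >> 8, n[1] >> 8, n[0] >> 8, 0)),
+}
+
+def convert_row(src_fmt, dst_fmt, row):
+    step = BPP[src_fmt]
+    # Leg 1: unpack the row into a normalized line buffer.
+    norm = [UNPACK[src_fmt](row[i:i + step])
+            for i in range(0, len(row), step)]
+    # An RGB <-> YUV pass over `norm` would go here; these toy
+    # formats are all RGB, so nothing is needed.
+    # Leg 2: pack the normalized line into the destination layout.
+    return b"".join(PACK[dst_fmt](n) for n in norm)
+
+out = convert_row("BGR888", "XRGB8888", bytes([0x11, 0x22, 0x33]))
+print(out.hex())  # 33221100
+```
+
+Three formats give nine source/sink pairs, yet only six leg bodies
+exist; adding a fourth format adds two bodies, not seven pairs.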
+
+Adding a hot pivot is mechanical (one entry in the build config —
+see the next section). Whether a pivot is worth the code size
+depends on whether real workloads actually use that format on one
+endpoint, so the choice is workload-driven.
+
+### Source / sink shapes
+
+Internally the I/O templates are grouped by *iteration shape*. Every
+format reuses one of these template shapes, plus its own `Layout`:
+
+- **Packed** (RGB or YUV) — `XRGB8888`, `BGR888`, …
+- **Packed-YUV** — `YUYV` group.
+- **Semiplanar** — `NV12` group.
+- **Multi-pixel semiplanar** — `P030`, `P230`. Sink uses the
+  streaming entry point.
+- **Planar** — `YUV420`, `YUV444`, …
+- **Multi-pixel planar** — `T430`.
+- **Gray** — single-component YUV; chroma synthesized at neutral on
+  read.
+- **Mono RGB** — RGB counterpart of Gray (`R8`); G=B=R synthesized on
+  read.
+- **Multi-pixel gray** — `XYYY2101010`.
+- **Gray MIPI-packed** — `Y10P`, `Y12P`.
+- **Bayer** — phase-aware R/G/B selection. Reads use a 3×3 bilinear
+  demosaic; edges clamp.
+- **Bayer MIPI-packed** — the 10P / 12P byte layout, hand-rolled
+  because the bit packing doesn't fit `Plane<Storage, Comp...>`.
+
+Adding a new format usually means one of: writing a new `Layout` and
+reusing one of these templates verbatim; adding a new layout shape to
+an existing template group; or, rarely, adding a new template group.
+
+### Threading
+
+A worker fan-out helper splits the image into row stripes — one
+disjoint `[start, end)` row range per worker — and runs the same
+converter body on each stripe. Stripe boundaries are aligned to the
+destination's vertical subsampling so chroma blocks aren't split
+across workers. Workers are joined before the call returns.
+
+`num_threads = 0` selects a sensible default (one per online CPU,
+capped); `1` runs the conversion inline on the calling thread with
+no thread-spawn overhead; `N > 1` uses exactly `N` workers.
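+
+The stripe split can be sketched in a few lines of Python. This is
+one plausible way to do it, not necessarily how pixpat divides the
+rows; `split_rows` and its parameters are hypothetical:
+
+```python
+def split_rows(height, workers, v_align):
+    """Return disjoint [start, end) row ranges aligned to v_align."""
+    blocks = -(-height // v_align)          # ceil(height / v_align)
+    workers = max(1, min(workers, blocks))  # at most one per block
+    stripes = []
+    for i in range(workers):
+        # Divide whole blocks between workers, then scale to rows.
+        start = (blocks * i // workers) * v_align
+        end = (blocks * (i + 1) // workers) * v_align
+        stripes.append((start, min(end, height)))
+    return stripes
+
+print(split_rows(1080, 4, 2))
+# [(0, 270), (270, 540), (540, 810), (810, 1080)]
+```
+
+Every stripe starts on a multiple of `v_align`, so a 2×2 chroma block
+never straddles two workers.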
+
+### Color math and bit depth
+
+Color math runs in `float`. Normalized components are `uint16_t`;
+N-bit stored values bit-replicate to 16 bits on decode (so `0xFF →
+0xFFFF`) and truncate via `norm >> (16 - N)` on encode. Per-component
+decode→encode at the same bit depth is exact; conversions that cross
+color kinds or narrow the bit depth are not, because they go through
+float color math or truncation.
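+
+The decode and encode rules above fit in a short Python sketch. The
+function names are hypothetical and the loop is just one way to
+express bit replication; pixpat's C++ code may do it differently.
+
+```python
+def decode_to_norm(value, bits):
+    """Bit-replicate an N-bit value to 16 bits."""
+    norm = 0
+    filled = 0
+    while filled < 16:
+        shift = 16 - filled - bits
+        # Place another copy of the value below the bits already
+        # filled; the last copy may be cut off on the right.
+        norm |= (value << shift) if shift >= 0 else (value >> -shift)
+        filled += bits
+    return norm
+
+def encode_from_norm(norm, bits):
+    """Truncate a 16-bit normalized value back to N bits."""
+    return norm >> (16 - bits)
+
+print(hex(decode_to_norm(0xFF, 8)))    # 0xffff
+print(hex(decode_to_norm(0x200, 10)))  # 0x8020
+```
+
+Because the first copy of the value always occupies the top N bits,
+`encode_from_norm(decode_to_norm(v, n), n)` returns `v` for every
+N-bit `v`.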
+
+Alpha rules:
+
+- Source without A → `a = 0xFFFF` (opaque).
+- Same-color-kind ColorXfm → `a` unchanged.
+- Cross-color-kind ColorXfm → `a` reset to `0xFFFF`.
+- Sinks with A encode `a`; sinks with X write zero.
+
+### Build-time configuration and codegen
+
+Which formats and patterns are compiled in, which formats are
+read-only / write-only, and which formats get the fully-fused
+hot-pivot treatment are all decided at build time by a small TOML
+config. The default lives at `pixpat-native/profiles/pixpat.toml`; pick
+a different one with Meson's `-Dconfig=…` option.
+
+The TOML options:
+
+```toml
+hot_pivots         = ["BGR888"]                # which formats get fully-fused arms
+patterns           = ["kmstest", "smpte"]      # which patterns to compile in
+[features]
+pattern            = true                       # toggle the draw-pattern entry point
+convert            = true                       # toggle the convert entry point
+default_format_caps = "rw"                      # default per-format read/write
+[formats]
+# RGB888 = "r"          # readable only
+# YUV420 = "off"        # not in this build
+```
+
+At configure time, a Python codegen step reads the TOML and the format
+and pattern catalogs and emits two generated files:
+
+- `pixpat_config.h` — the C-side feature flags.
+- `pixpat_caps.inc` — two parallel arrays, `FormatCaps[]` and
+  `PatternCaps[]`. Each entry carries booleans like `readable`,
+  `writable`, `hot_src`, `hot_dst` (formats) or `enabled` (patterns).
+
+The dispatch code reads those caps inside `if constexpr` guards, so a
+disabled format / pattern / hot arm produces no template
+instantiation at all — the corresponding code is never generated.
+This is how a constrained target shrinks the binary: turn off the
+patterns and formats it doesn't need; the rest disappears from the
+output.
+
+Three example profiles ship in `pixpat-native/profiles/`
+illustrating the knobs (no hot path, pattern-only, hot pivot moved
+to a different format).
+
+### Compiler
+
+Hot-path performance depends heavily on the inlined inner loop being
+auto-vectorized. CI builds and tests under both gcc and clang; both
+work, but performance differs:
+
+- Under clang 18 (built with `-O3 -march=native`) the templated inner
+  loops vectorize cleanly, the normalized pixel stays register-resident,
+  and constant patterns collapse to `memset`.
+- Under gcc 13 the same loops vectorize for some shapes but not others —
+  RGB→RGB in particular drops considerably.