mirror of https://github.com/unshackle-dl/unshackle.git synced 2026-06-10 03:02:09 +00:00

imSp4rky 29232925d5 feat(subtitle): data-driven conversion registry + SubtitleEdit 5 support

Replace the hardcoded conversion if/elif in Subtitle.convert with a capability-matrix backend registry (subtitle_convert.py): each backend declares the source->target pairs it supports plus a rank, and run_conversion tries them in order as a real fallback chain. conversion_method pins a backend but still falls back (pin-then-fallback).

- Detect the cross-platform SubtitleEdit 5+ CLI (seconv) and use its --flag syntax for convert, SDH stripping, and reverse-RTL
- Protect styled ASS/SSA from automatic SRT downconversion; honor an explicit --sub-format / sidecar_format
- Read segmented fVTT (wvtt) and fTTML (stpp) directly from fragmented MP4
- Improve ASS/SSA font detection: inline \fn overrides, Format-located Fontname column, @-prefix strip, case-insensitive de-dup; covers SSA too
- Update SUBTITLE_CONFIG.md, example yaml, README; add regression tests and a backend benchmark script

2026-06-07 22:21:25 -06:00

Subtitle Processing Configuration

This document covers subtitle processing and formatting options under the top-level subtitle: key in unshackle.yaml.

For the canonical example, see unshackle/unshackle-example.yaml.

subtitle (dict)

Control subtitle conversion, SDH (hearing-impaired) stripping, formatting preservation, and output behavior.

conversion_method: Which backend to convert subtitles with. Default: auto.

Routing is data-driven (unshackle/core/tracks/subtitle_convert.py): a registry of backends each declares the source→target codec pairs it supports plus a preference rank. For a conversion, the available backends that support the pair are tried in rank order — a real fallback chain. A non-auto value pins that backend first, then still falls back through the chain if it can't handle the pair or errors (pin-then-fallback). A service may also set preferred_conversion_method on its tracks; an explicit conversion_method in config always wins.
- auto: Best available backend by rank — SubtitleEdit (if installed) for highest fidelity; otherwise subby for WebVTT/fVTT/SAMI→SRT (adds CommonIssuesFixer cleanup), pysubs2 for SSA/ASS and the broad format set, pycaption as last resort.
- subby: Prefer subby (CommonIssuesFixer); reads WebVTT/fVTT/SAMI, writes SRT (and TTML/VTT via an SRT intermediate).
- subtitleedit: Prefer SubtitleEdit / seconv. Highest fidelity — preserves positioning/italics.
- pycaption: Prefer pycaption. Flattens positioning/italics, writes only SRT/TTML/WebVTT.
- pysubs2: Prefer pysubs2 (SRT, SSA, ASS, WebVTT, TTML, SAMI, MicroDVD, MPL2, TMP). The only pure-Python backend that reads ASS/SSA, so it is the default for styled SubStation sources.
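The rank-ordered routing and pin-then-fallback behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in subtitle_convert.py: the Backend class, candidate_order, run_chain, the rank numbers, and the demo registry are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Tuple


@dataclass(frozen=True)
class Backend:
    """Hypothetical registry entry; not the real unshackle class."""
    name: str
    rank: int                           # lower rank is tried earlier (assumption)
    pairs: FrozenSet[Tuple[str, str]]   # supported (source, target) codec pairs
    convert: Callable[[str], str]       # raises on failure


def candidate_order(registry: List[Backend], source: str, target: str,
                    pinned: str = "auto") -> List[Backend]:
    """Backends that support the pair, in rank order, with a pinned backend first."""
    usable = sorted((b for b in registry if (source, target) in b.pairs),
                    key=lambda b: b.rank)
    if pinned != "auto":
        # Stable sort: the pinned backend moves to the front, the rest keep rank order.
        usable.sort(key=lambda b: b.name != pinned)
    return usable


def run_chain(registry: List[Backend], text: str, source: str, target: str,
              pinned: str = "auto") -> Tuple[str, str]:
    """Try each candidate in turn; return (backend name, output) of the first success."""
    errors = []
    for backend in candidate_order(registry, source, target, pinned):
        try:
            return backend.name, backend.convert(text)
        except Exception as exc:  # any backend error triggers the fallback
            errors.append(f"{backend.name}: {exc}")
    raise RuntimeError(f"no backend could convert {source}->{target}: {errors}")


def _fail(_text: str) -> str:
    raise ValueError("simulated failure")


# Demo registry with made-up ranks and stand-in converters.
REGISTRY = [
    Backend("subtitleedit", 0, frozenset({("vtt", "srt"), ("ass", "srt")}), _fail),
    Backend("subby", 1, frozenset({("vtt", "srt")}), lambda t: "subby:" + t),
    Backend("pysubs2", 2, frozenset({("vtt", "srt"), ("ass", "srt")}), lambda t: "pysubs2:" + t),
]
```

In this sketch a pinned backend that does not support the pair is simply absent from the candidate list, so the chain proceeds with the remaining backends in rank order.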
Styled-subtitle protection: ASS/SSA are never automatically downconverted to SRT (the conversion is skipped and the original kept) — SRT cannot carry their positioning/colours/styling. This applies to the default muxed track only; explicit requests still convert: a per-download --sub-format srt for the muxed track, or sidecar_format: srt for sidecars. To keep raw styled sidecars, set sidecar_format: original.
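The protection rule above amounts to a small decision function. The sketch below is an assumption about how such a check could look, not unshackle's implementation; should_convert and the explicit flag are hypothetical, and deciding whether a request counts as explicit (a per-download --sub-format or a sidecar_format setting) is left to the caller.

```python
STYLED_CODECS = {"ass", "ssa"}


def should_convert(source: str, target: str, explicit: bool) -> bool:
    """Return True when a conversion from source to target should run.

    explicit is True when the user asked for the target format directly.
    """
    if source == target:
        return False  # nothing to do
    if source in STYLED_CODECS and target == "srt" and not explicit:
        return False  # keep the styled original rather than lose styling
    return True
```

For example, should_convert("ass", "srt", explicit=False) keeps the original track, while the same call with explicit=True converts it.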

Segmented subtitles (fVTT/WVTT and fTTML/STPP from DASH/HLS, e.g. HBO Max) are read directly from the fragmented MP4: fVTT via subby's WVTTConverter, fTTML via pycaption's box parsing. These formats are input-only: unshackle can convert from them but cannot write them.

SubtitleEdit on Linux/macOS: install the SubtitleEdit 5+ CLI (SeConv / seconv, the self-contained cross-platform build from the SubtitleEdit releases) onto PATH or into unshackle/binaries/. unshackle targets the SubtitleEdit 5+ command syntax. The Windows SubtitleEdit.exe is the GUI app — use the SeConv CLI binary for headless conversion.
sdh_method: How to strip SDH cues. Default: auto.
- auto: Try subby for SRT first, then SubtitleEdit (when conversion_method is auto/subtitleedit and the binary is available), then subtitle-filter as the final fallback.
- subby: Use subby's SDHStripper. Only operates on SRT; for other codecs the call returns without stripping.
- subtitleedit: Use SubtitleEdit's --remove-text-for-hi (SE5 CLI) when the binary is available; otherwise falls through to subtitle-filter.
- filter-subs: Use the subtitle-filter library directly (rm_fonts, rm_ast, rm_music, rm_effects, rm_names, rm_author).
strip_sdh: Enable/disable automatic SDH stripping for tracks flagged as SDH. Default: true.
convert_before_strip: When falling through to the subtitle-filter path, auto-convert non-SRT subtitles to SRT first for better compatibility. Default: true. Has no effect when SubtitleEdit handles stripping directly.
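As a rough illustration of how sdh_method and convert_before_strip interact, the sketch below returns the strippers that would be attempted for a track. It only models the rules as worded in this document; sdh_strategy and needs_srt_conversion are hypothetical helpers, and the real code may order or gate these steps differently.

```python
from typing import List


def sdh_strategy(sdh_method: str, codec: str, conversion_method: str = "auto",
                 subtitleedit_available: bool = False) -> List[str]:
    """Ordered list of stripper names to attempt (empty list = no stripping)."""
    if sdh_method == "subby":
        # subby's SDHStripper only operates on SRT.
        return ["subby"] if codec == "srt" else []
    if sdh_method == "subtitleedit":
        return ["subtitleedit"] if subtitleedit_available else ["subtitle-filter"]
    if sdh_method == "filter-subs":
        return ["subtitle-filter"]
    # auto: subby for SRT, then SubtitleEdit when allowed, then subtitle-filter.
    order = []
    if codec == "srt":
        order.append("subby")
    if subtitleedit_available and conversion_method in ("auto", "subtitleedit"):
        order.append("subtitleedit")
    order.append("subtitle-filter")
    return order


def needs_srt_conversion(stripper: str, codec: str, convert_before_strip: bool = True) -> bool:
    """convert_before_strip only matters on the subtitle-filter path for non-SRT input."""
    return stripper == "subtitle-filter" and codec != "srt" and convert_before_strip
```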
preserve_formatting: Keep original subtitle tags and positioning during WebVTT processing. When true, sanitized WebVTT is written back without round-tripping through pycaption, preserving tags like <i>, <b>, and line: positioning. Default: true.
output_mode: Controls how subtitles are included in the output. Default: mux.
- mux: Embed subtitles in the MKV container only.
- sidecar: Save subtitles as separate files only (not muxed).
- both: Embed in the MKV container and save as sidecar files.
sidecar_format: Format for sidecar subtitle files (used when output_mode is sidecar or both). Default: srt.
- srt: SubRip.
- vtt: WebVTT.
- ass: Advanced SubStation Alpha.
- original: Keep the subtitle in its current format without conversion.

Example:

subtitle:
  conversion_method: auto
  sdh_method: auto
  strip_sdh: true
  convert_before_strip: true
  preserve_formatting: true
  output_mode: mux
  sidecar_format: srt
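When generating or checking an unshackle.yaml programmatically, the documented values and defaults can be captured in a small validator. This is a sketch built only from the options listed in this document; validate_subtitle_config is a hypothetical helper, and unshackle itself may accept values or keys not covered here.

```python
from typing import Any, Dict, List, Tuple

# Allowed values and defaults as listed in this document.
ALLOWED = {
    "conversion_method": {"auto", "subby", "subtitleedit", "pycaption", "pysubs2"},
    "sdh_method": {"auto", "subby", "subtitleedit", "filter-subs"},
    "output_mode": {"mux", "sidecar", "both"},
    "sidecar_format": {"srt", "vtt", "ass", "original"},
}
BOOLEAN_KEYS = {"strip_sdh", "convert_before_strip", "preserve_formatting"}
DEFAULTS: Dict[str, Any] = {
    "conversion_method": "auto",
    "sdh_method": "auto",
    "strip_sdh": True,
    "convert_before_strip": True,
    "preserve_formatting": True,
    "output_mode": "mux",
    "sidecar_format": "srt",
}


def validate_subtitle_config(cfg: Dict[str, Any]) -> Tuple[Dict[str, Any], List[str]]:
    """Merge cfg over the defaults and return (merged config, list of problems)."""
    merged = {**DEFAULTS, **cfg}
    problems = []
    for key, value in merged.items():
        if key in ALLOWED:
            if value not in ALLOWED[key]:
                problems.append(f"{key}: unsupported value {value!r}")
        elif key in BOOLEAN_KEYS:
            if not isinstance(value, bool):
                problems.append(f"{key}: expected true/false, got {value!r}")
        else:
            problems.append(f"unknown key: {key}")
    return merged, problems
```

The dict passed in would be the parsed content of the subtitle: block, for example {"output_mode": "both", "sidecar_format": "ass"}.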

WebVTT Sanitization (automatic, not configurable)

After download, WebVTT and segmented WebVTT (fVTT/WVTT) tracks pass through a fixed sanitization pipeline before any conversion or muxing:

1. Segment merge: segmented DASH/HLS WebVTT is stitched via merge_segmented_webvtt (uses pysubs2 for lenient parsing when conversion_method is auto or pysubs2, otherwise pycaption directly).
2. Negative timestamps: sanitize_webvtt_timestamps rewrites -HH:MM:SS.mmm cues to 00:00:00.000.
3. Cue identifiers: sanitize_webvtt_cue_identifiers strips letter+digit IDs (e.g. Q0, S12) on their own line before a timing line, which otherwise confuse parsers like pysubs2.
4. Overlapping cues: merge_overlapping_webvtt_cues collapses cues with start times within 50 ms and matching end times into a single multi-line cue, ordered by line: percentage (lower % = higher on screen = first line).
5. Fallback hardening: when preserve_formatting is false and the first pycaption parse fails, sanitize_webvtt retries with a WEBVTT header guard, hour-padded timings, and another negative-timestamp pass; if that still fails, the sanitized text is written as-is.
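The negative-timestamp and cue-identifier steps in the pipeline above are essentially line-level rewrites. The following sketch shows one way to express them with regular expressions; it is not the code behind sanitize_webvtt_timestamps or sanitize_webvtt_cue_identifiers, and the function names and exact patterns (for instance, allowing more than one leading letter in an ID) are assumptions.

```python
import re

# A minus sign directly followed by HH:MM:SS.mmm (assumed shape of a negative timestamp).
NEGATIVE_TS = re.compile(r"-\d{2,}:\d{2}:\d{2}\.\d{3}")
# Letters followed by digits on a line of their own, e.g. Q0 or S12.
CUE_ID = re.compile(r"^[A-Za-z]+\d+$")


def clamp_negative_timestamps(vtt: str) -> str:
    """Rewrite negative timestamps on timing lines to 00:00:00.000."""
    out = []
    for line in vtt.splitlines():
        if "-->" in line:  # only touch timing lines, never cue text
            line = NEGATIVE_TS.sub("00:00:00.000", line)
        out.append(line)
    return "\n".join(out)


def strip_cue_identifiers(vtt: str) -> str:
    """Drop letter+digit identifier lines that sit directly above a timing line."""
    lines = vtt.splitlines()
    kept = []
    for i, line in enumerate(lines):
        following = lines[i + 1] if i + 1 < len(lines) else ""
        if CUE_ID.match(line.strip()) and "-->" in following:
            continue
        kept.append(line)
    return "\n".join(kept)
```

Restricting the timestamp rewrite to lines containing --> keeps cue text that happens to contain a hyphenated time from being altered.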

sanitize_broken_webvtt and space_webvtt_headers additionally run inside Subtitle.parse() to drop malformed --> lines and reflow merged-segment headers. merge_same_cues and filter_unwanted_cues (which drops cues containing only &nbsp; or whitespace) run only on the pycaption path.

These behaviors are intentional and have no config knobs — they apply to every WebVTT track regardless of conversion_method.
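The overlapping-cue merge described in the pipeline can be illustrated on already-parsed cues. This is a sketch under assumptions, not merge_overlapping_webvtt_cues itself: the Cue class and merge_overlapping are hypothetical, cues are assumed to carry millisecond timings and an optional line: percentage, and cues without a line: value are placed last.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cue:
    start_ms: int
    end_ms: int
    text: str
    line_pct: Optional[float] = None  # value of the WebVTT line: setting, if present


def merge_overlapping(cues: List[Cue], tolerance_ms: int = 50) -> List[Cue]:
    """Collapse cues whose starts are within tolerance_ms and whose ends match."""
    groups: List[List[Cue]] = []
    for cue in sorted(cues, key=lambda c: (c.start_ms, c.end_ms)):
        for group in groups:
            head = group[0]
            if abs(cue.start_ms - head.start_ms) <= tolerance_ms and cue.end_ms == head.end_ms:
                group.append(cue)
                break
        else:
            groups.append([cue])

    merged = []
    for group in groups:
        # Lower line: percentage is higher on screen, so it becomes the first text line.
        ordered = sorted(group, key=lambda c: c.line_pct if c.line_pct is not None else float("inf"))
        merged.append(Cue(
            start_ms=min(c.start_ms for c in group),
            end_ms=group[0].end_ms,
            text="\n".join(c.text for c in ordered),
            line_pct=ordered[0].line_pct,
        ))
    return merged
```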

Related

- Filename sanitization (e.g. parenthesis handling, unidecode bracket artifacts from PR #105) lives in unshackle/core/utilities.py::sanitize_filename and is governed by output_template, not the subtitle: config block.
- Subtitle codec support is defined in unshackle/core/tracks/subtitle.py; the conversion backend registry, capability matrix, and ranks live in unshackle/core/tracks/subtitle_convert.py.
