diff --git a/.gitignore b/.gitignore index a36bf89..3aa99fc 100644 --- a/.gitignore +++ b/.gitignore @@ -24,6 +24,8 @@ device_private_key device_vmp_blob unshackle/binaries/* !unshackle/binaries/placehere.txt +# test fixtures (binary subtitle samples) must be tracked despite the *.mp4 rule above +!tests/tracks/fixtures/*.mp4 unshackle/cache/ unshackle/cookies/ unshackle/certs/ diff --git a/README.md b/README.md index f36d0fd..5f552aa 100644 --- a/README.md +++ b/README.md @@ -47,6 +47,10 @@ External tools on your `PATH` (recommended versions): - [Bento4](https://github.com/axiomatic-systems/Bento4) - ≥ 1.6.0-639 - [dovi_tool](https://github.com/quietvoid/dovi_tool) - ≥ 2.1 +Optional: + +- [SubtitleEdit](https://github.com/SubtitleEdit/subtitleedit/releases) - ≥ 5.0 (`SeConv` CLI) + ## License [GPL-3.0](LICENSE). Do not use unshackle for content you lack the rights to. Keep the core free and open; keep service code private. Be kind. diff --git a/docs/SUBTITLE_CONFIG.md b/docs/SUBTITLE_CONFIG.md index 20adbb8..58d55ef 100644 --- a/docs/SUBTITLE_CONFIG.md +++ b/docs/SUBTITLE_CONFIG.md @@ -8,17 +8,44 @@ For the canonical example, see `unshackle/unshackle-example.yaml`. Control subtitle conversion, SDH (hearing-impaired) stripping, formatting preservation, and output behavior. -- `conversion_method`: How to convert subtitles between formats. Default: `auto`. - - `auto`: Smart routing - subby for WebVTT/fVTT/SAMI; for SSA/ASS/MicroDVD/MPL2/TMP use SubtitleEdit when available, otherwise pysubs2; standard pycaption/SubtitleEdit pipeline for everything else. - - `subby`: Always use subby with `CommonIssuesFixer` (falls back to standard if the source codec isn't supported by subby). - - `subtitleedit`: Prefer SubtitleEdit when available; otherwise fall back to the standard pycaption pipeline. - - `pycaption`: Use only the pycaption library (no SubtitleEdit, no subby). Limited to SRT, TTML, and WebVTT outputs. 
- - `pysubs2`: Use pysubs2 (supports SRT, SSA, ASS, WebVTT, TTML, SAMI, MicroDVD, MPL2, TMP). +- `conversion_method`: Which backend to convert subtitles with. Default: `auto`. + + Routing is data-driven (`unshackle/core/tracks/subtitle_convert.py`): a registry of backends each + declares the source→target codec pairs it supports plus a preference rank. For a conversion, the + available backends that support the pair are tried in rank order — a real fallback chain. A + non-`auto` value **pins** that backend first, then still falls back through the chain if it can't + handle the pair or errors (pin-then-fallback). A service may also set `preferred_conversion_method` + on its tracks; an explicit `conversion_method` in config always wins. + + - `auto`: Best available backend by rank — SubtitleEdit (if installed) for highest fidelity; + otherwise subby for WebVTT/fVTT/SAMI→SRT (adds `CommonIssuesFixer` cleanup), pysubs2 for SSA/ASS + and the broad format set, pycaption as last resort. + - `subby`: Prefer subby (`CommonIssuesFixer`); reads WebVTT/fVTT/SAMI, writes SRT (and TTML/VTT via + an SRT intermediate). + - `subtitleedit`: Prefer SubtitleEdit / `seconv`. Highest fidelity — preserves positioning/italics. + - `pycaption`: Prefer pycaption. **Flattens positioning/italics**, writes only SRT/TTML/WebVTT. + - `pysubs2`: Prefer pysubs2 (SRT, SSA, ASS, WebVTT, TTML, SAMI, MicroDVD, MPL2, TMP). The only + pure-Python backend that reads ASS/SSA, so it is the default for styled SubStation sources. + + **Styled-subtitle protection**: ASS/SSA are never *automatically* downconverted to SRT (the + conversion is skipped and the original kept) — SRT cannot carry their positioning/colours/styling. + This applies to the default muxed track only; explicit requests still convert: a per-download + `--sub-format srt` for the muxed track, or `sidecar_format: srt` for sidecars. To keep raw styled + sidecars, set `sidecar_format: original`. 
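The routing and styled-subtitle rules described above can be sketched in a few lines. This is a minimal illustration, not the code in `subtitle_convert.py`: the `Backend` dataclass, the `resolve` and `should_skip` helpers, the numeric ranks, and the plain-string codec names are all hypothetical stand-ins. Only the resulting orderings are taken from the behaviour described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    """Hypothetical registry entry (assumption, not the real class)."""

    name: str
    rank: int  # assumed convention: lower rank is tried first
    pairs: frozenset  # supported (source, target) codec pairs
    available: bool = True


def resolve(registry, source, target, pin=None):
    """Usable backends in rank order; a pinned backend moves to the front."""
    usable = sorted(
        (b for b in registry if b.available and (source, target) in b.pairs),
        key=lambda b: b.rank,
    )
    pinned = [b for b in usable if b.name == pin]
    # Pin-then-fallback: the rest of the ranked chain stays behind the pin.
    return pinned + [b for b in usable if b.name != pin]


STYLED = {"ASS", "SSA"}


def should_skip(source, target, forced):
    """Styled SubStation sources are not downconverted to SRT unless forced."""
    return source in STYLED and target == "SRT" and not forced


# Illustrative registry; SubtitleEdit is marked as not installed.
REGISTRY = [
    Backend("subtitleedit", 0, frozenset({("VTT", "SRT"), ("ASS", "SRT")}), available=False),
    Backend("subby", 1, frozenset({("VTT", "SRT")})),
    Backend("pysubs2", 2, frozenset({("VTT", "SRT"), ("ASS", "SRT")})),
    Backend("pycaption", 3, frozenset({("VTT", "SRT")})),
]

print([b.name for b in resolve(REGISTRY, "VTT", "SRT")])
print([b.name for b in resolve(REGISTRY, "VTT", "SRT", pin="pysubs2")])
print([b.name for b in resolve(REGISTRY, "ASS", "SRT")])
print(should_skip("ASS", "SRT", forced=False))
```

In this sketch, pinning a backend that is unavailable or cannot handle the pair simply yields the ranked chain, which is why a non-`auto` `conversion_method` never leaves a conversion without a fallback.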
+ + **Segmented subtitles** (`fVTT`/WVTT and `fTTML`/STPP from DASH/HLS, e.g. HBO Max) are read directly + from the fragmented MP4: fVTT via subby's `WVTTConverter`, fTTML via pycaption's box parsing. They + can be converted *from* but not *to*. + + **SubtitleEdit on Linux/macOS**: install the SubtitleEdit 5+ CLI (`SeConv` / `seconv`, the + self-contained cross-platform build from the SubtitleEdit releases) onto `PATH` or into + `unshackle/binaries/`. unshackle targets the SubtitleEdit **5+** command syntax. The Windows + `SubtitleEdit.exe` is the GUI app — use the `SeConv` CLI binary for headless conversion. - `sdh_method`: How to strip SDH cues. Default: `auto`. - `auto`: Try subby for SRT first, then SubtitleEdit (when `conversion_method` is `auto`/`subtitleedit` and the binary is available), then subtitle-filter as the final fallback. - `subby`: Use subby's `SDHStripper`. **Only operates on SRT**; for other codecs the call returns without stripping. - - `subtitleedit`: Use SubtitleEdit's `/RemoveTextForHI` when the binary is available; otherwise falls through to subtitle-filter. + - `subtitleedit`: Use SubtitleEdit's `--remove-text-for-hi` (SE5 CLI) when the binary is available; otherwise falls through to subtitle-filter. - `filter-subs`: Use the `subtitle-filter` library directly (`rm_fonts`, `rm_ast`, `rm_music`, `rm_effects`, `rm_names`, `rm_author`). - `strip_sdh`: Enable/disable automatic SDH stripping for tracks flagged as SDH. Default: `true`. @@ -68,6 +95,7 @@ These behaviors are intentional and have no config knobs — they apply to every ## Related - Filename sanitization (e.g. parenthesis handling, unidecode bracket artifacts from PR #105) lives in `unshackle/core/utilities.py::sanitize_filename` and is governed by `output_template`, not the `subtitle:` config block. -- Subtitle codec support and the conversion matrix are defined in `unshackle/core/tracks/subtitle.py`. 
+- Subtitle codec support is defined in `unshackle/core/tracks/subtitle.py`; the conversion backend + registry, capability matrix, and ranks live in `unshackle/core/tracks/subtitle_convert.py`. --- diff --git a/scripts/bench_subtitle_backends.py b/scripts/bench_subtitle_backends.py new file mode 100644 index 0000000..5333099 --- /dev/null +++ b/scripts/bench_subtitle_backends.py @@ -0,0 +1,98 @@ +#!/usr/bin/env python3 +""" +Benchmark subtitle conversion backends to (re-)tune the preference ranks in +``unshackle/core/tracks/subtitle_convert.py``. + +Runs every backend that can read each input file, converting to a target format (default +SRT), and reports cue count, leaked ASS override tags, and output size — so you can compare +fidelity per (source, target) pair on real files. Read-only: copies inputs to a temp dir. + +Usage: + uv run python scripts/bench_subtitle_backends.py [ ...] [--target SRT] + +Example: + uv run python scripts/bench_subtitle_backends.py downloads/ +""" + +from __future__ import annotations + +import argparse +import re +import shutil +import tempfile +from pathlib import Path + +from unshackle.core.tracks import subtitle_convert as sc +from unshackle.core.tracks.subtitle import Subtitle + +Codec = Subtitle.Codec + +EXT_TO_CODEC = { + ".srt": Codec.SubRip, + ".vtt": Codec.WebVTT, + ".ass": Codec.SubStationAlphav4, + ".ssa": Codec.SubStationAlpha, + ".ttml": Codec.TimedTextMarkupLang, + ".smi": Codec.SAMI, + ".sami": Codec.SAMI, +} + + +def gather(paths: list[str]) -> list[Path]: + files: list[Path] = [] + for p in paths: + path = Path(p) + if path.is_dir(): + files.extend(f for f in path.rglob("*") if f.suffix.lower() in EXT_TO_CODEC) + elif path.suffix.lower() in EXT_TO_CODEC: + files.append(path) + return sorted(files) + + +def metrics(text: str) -> tuple[int, int, int]: + cues = len(re.findall(r"-->", text)) + ass_residue = len(re.findall(r"\{\\", text)) + return cues, ass_residue, len(text) + + +def main() -> None: + ap = 
argparse.ArgumentParser() + ap.add_argument("paths", nargs="+", help="subtitle files or directories") + ap.add_argument("--target", default="SRT", help="target codec value (SRT, VTT, ASS, ...)") + args = ap.parse_args() + + target = Codec(args.target.upper()) + files = gather(args.paths) + if not files: + print("No subtitle files found.") + return + + tmp = Path(tempfile.mkdtemp(prefix="subbench_")) + print(f"{'file':40} {'source':10} {'backend':12} {'ok':3} {'cues':>5} {'resid':>5} {'bytes':>7}") + for f in files: + source = EXT_TO_CODEC[f.suffix.lower()] + if source == target: + continue + for backend in sc.REGISTRY: + if not (backend.is_available() and backend.can_convert(source, target)): + continue + work = tmp / f"{f.stem}.{backend.name}{f.suffix}" + shutil.copy2(f, work) + sub = Subtitle(url="x", language="en", codec=source) + sub.path = work + try: + # Call the backend directly so each row reflects only that backend (no fallback). + out = work.with_suffix(f".{target.value.lower()}") + backend.convert(sub, target, out) + cues, resid, size = metrics(out.read_text("utf8", errors="replace")) + print( + f"{f.name[:40]:40} {source.name[:10]:10} {backend.name:12} {'Y':3} {cues:>5} {resid:>5} {size:>7}" + ) + except Exception as e: # noqa: BLE001 - benchmark reports failures, does not raise + print( + f"{f.name[:40]:40} {source.name[:10]:10} {backend.name:12} {'N':3} {'-':>5} {'-':>5} {'-':>7} {type(e).__name__}" + ) + + +if __name__ == "__main__": + main() diff --git a/tests/tracks/fixtures/segmented.wvtt.mp4 b/tests/tracks/fixtures/segmented.wvtt.mp4 new file mode 100644 index 0000000..60019fa Binary files /dev/null and b/tests/tracks/fixtures/segmented.wvtt.mp4 differ diff --git a/tests/tracks/test_subtitle_convert.py b/tests/tracks/test_subtitle_convert.py new file mode 100644 index 0000000..1a65f4f --- /dev/null +++ b/tests/tracks/test_subtitle_convert.py @@ -0,0 +1,314 @@ +"""Tests for the data-driven subtitle conversion registry 
(``tracks/subtitle_convert.py``). + +Covers three things the refactor must guarantee: +- the capability matrix resolves the right backend chain per (source, target) and env + (SubtitleEdit present or not), +- ``conversion_method`` pins a backend but still falls back (pin-then-fallback), +- styled SubStation (ASS/SSA) is never auto-downconverted to SRT unless explicitly forced. + +Backends pysubs2/subby/pycaption are hard deps so the conversion paths run in CI without +SubtitleEdit; SubtitleEdit availability is simulated by patching ``binaries.SubtitleEdit``. +""" + +from __future__ import annotations + +import pathlib +import re +import struct + +import pytest + +from unshackle.core import binaries +from unshackle.core.tracks import subtitle_convert as sc +from unshackle.core.tracks.subtitle import Subtitle + +Codec = Subtitle.Codec + +VTT_SAMPLE = """WEBVTT + +1 +00:00:01.000 --> 00:00:02.000 +Hello + +2 +00:00:03.000 --> 00:00:04.000 +World +""" + +ASS_SAMPLE = """[Script Info] +ScriptType: v4.00+ + +[V4+ Styles] +Format: Name, Fontname, Fontsize, PrimaryColour, SecondaryColour, OutlineColour, BackColour, Bold, Italic, Underline, StrikeOut, ScaleX, ScaleY, Spacing, Angle, BorderStyle, Outline, Shadow, Alignment, MarginL, MarginR, MarginV, Encoding +Style: Default,Arial,20,&H00FFFFFF,&H000000FF,&H00000000,&H00000000,0,0,0,0,100,100,0,0,1,2,1,2,10,10,18,1 + +[Events] +Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text +Dialogue: 0,0:00:01.00,0:00:02.00,Default,,0,0,0,,{\\i1}Hello{\\i0} +Dialogue: 0,0:00:03.00,0:00:04.00,Default,,0,0,0,,World +""" + + +@pytest.fixture(autouse=True) +def _no_subtitleedit(monkeypatch): + """Default every test to a SubtitleEdit-less environment; tests opt in when needed.""" + monkeypatch.setattr(binaries, "SubtitleEdit", None) + + +def make_sub(tmp_path, name: str, text: str, codec: Codec) -> Subtitle: + path = tmp_path / name + path.write_text(text, encoding="utf8") + sub = 
Subtitle(url="https://example.test/x", language="en", codec=codec) + sub.path = path + return sub + + +def cue_count(path) -> int: + return len(re.findall(r"-->", path.read_text("utf8"))) + + +# --- capability matrix / resolver ------------------------------------------------------- + + +def test_resolve_webvtt_to_srt_order(): + chain = [b.name for b in sc.resolve_backends(Codec.WebVTT, Codec.SubRip)] + assert chain == ["subby", "pysubs2", "pycaption"] + + +def test_resolve_ass_to_srt_only_pysubs2_without_subtitleedit(): + # subby and pycaption cannot read ASS, so only pysubs2 remains. + chain = [b.name for b in sc.resolve_backends(Codec.SubStationAlphav4, Codec.SubRip)] + assert chain == ["pysubs2"] + + +def test_subtitleedit_ranks_first_when_available(monkeypatch): + monkeypatch.setattr(binaries, "SubtitleEdit", "/usr/bin/seconv") + chain = [b.name for b in sc.resolve_backends(Codec.WebVTT, Codec.SubRip)] + assert chain[0] == "subtitleedit" + + +def test_pin_then_fallback_orders_pin_first(): + chain = [b.name for b in sc.resolve_backends(Codec.WebVTT, Codec.SubRip, pin="pysubs2")] + assert chain[0] == "pysubs2" + assert "subby" in chain # fallbacks remain after the pin + + +def test_pin_unavailable_falls_back_to_ranked_chain(): + # subtitleedit pinned but not installed -> just the ranked available backends. + chain = [b.name for b in sc.resolve_backends(Codec.WebVTT, Codec.SubRip, pin="subtitleedit")] + assert chain == ["subby", "pysubs2", "pycaption"] + + +def test_fallback_runs_when_first_backend_fails(tmp_path, monkeypatch): + monkeypatch.setattr("unshackle.core.config.config.subtitle", {}, raising=False) + + def boom(self, source, src, target, out): + raise RuntimeError("backend exploded") + + # WebVTT->SRT chain is [subby, pysubs2, pycaption]; kill subby, expect pysubs2 to finish. 
+ monkeypatch.setattr(sc.SubbyBackend, "convert", boom) + sub = make_sub(tmp_path, "x.vtt", VTT_SAMPLE, Codec.WebVTT) + out = sub.convert(Codec.SubRip, forced=True) + assert sub.codec == Codec.SubRip + assert cue_count(out) == 2 + + +def test_no_backend_for_unsupported_target_raises(tmp_path): + sub = make_sub(tmp_path, "x.ass", ASS_SAMPLE, Codec.SubStationAlphav4) + with pytest.raises(NotImplementedError): + sub.convert(Codec.fVTT, forced=True) # no backend writes segmented fVTT + + +# --- styled-ASS protection -------------------------------------------------------------- + + +def test_ass_to_srt_kept_as_is_when_not_forced(tmp_path, monkeypatch): + monkeypatch.setattr("unshackle.core.config.config.subtitle", {}, raising=False) + sub = make_sub(tmp_path, "x.ass", ASS_SAMPLE, Codec.SubStationAlphav4) + out = sub.convert(Codec.SubRip, forced=False) + assert sub.codec == Codec.SubStationAlphav4 # unchanged + assert out == sub.path + assert out.suffix == ".ass" + + +def test_ass_to_srt_converts_when_forced(tmp_path, monkeypatch): + monkeypatch.setattr("unshackle.core.config.config.subtitle", {}, raising=False) + sub = make_sub(tmp_path, "x.ass", ASS_SAMPLE, Codec.SubStationAlphav4) + out = sub.convert(Codec.SubRip, forced=True) + assert sub.codec == Codec.SubRip + assert out.suffix == ".srt" + assert cue_count(out) == 2 + assert "{\\" not in out.read_text("utf8") # override tags stripped + + +# --- conversion paths ------------------------------------------------------------------- + + +def test_webvtt_to_srt_conversion(tmp_path, monkeypatch): + monkeypatch.setattr("unshackle.core.config.config.subtitle", {}, raising=False) + sub = make_sub(tmp_path, "x.vtt", VTT_SAMPLE, Codec.WebVTT) + out = sub.convert(Codec.SubRip, forced=True) + assert sub.codec == Codec.SubRip + assert cue_count(out) == 2 + + +def test_same_codec_is_noop(tmp_path): + sub = make_sub(tmp_path, "x.srt", "1\n00:00:01,000 --> 00:00:02,000\nHi\n", Codec.SubRip) + assert sub.convert(Codec.SubRip) == 
sub.path + assert sub.codec == Codec.SubRip + + +# --- ASS/SSA font detection ------------------------------------------------------------ + +FONT_ASS = """[Script Info] +ScriptType: v4.00+ + +[V4+ Styles] +Format: Name, Fontname, Fontsize, PrimaryColour, Bold, Italic, Alignment, MarginV, Encoding +Style: Default,Trebuchet MS,24,&H00FFFFFF,0,0,2,18,1 +Style: sign,@Arial Unicode MS,20,&H00FFFFFF,0,0,8,10,1 + +[Events] +Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text +Dialogue: 0,0:00:01.00,0:00:02.00,Default,,0,0,0,,{\\fnTimes New Roman}A sign +Dialogue: 0,0:00:03.00,0:00:04.00,Default,,0,0,0,,{\\fntimes new roman}lower case +Dialogue: 0,0:00:05.00,0:00:06.00,Default,,0,0,0,,{\\fnGeorgia\\b1}bold note +""" + + +def test_extract_fonts_styles_and_inline_overrides(): + fonts = Subtitle.extract_fonts(FONT_ASS) + # Style fontnames (column located via Format line, @-prefix stripped) + inline \fn overrides + assert fonts == {"Trebuchet MS", "Arial Unicode MS", "Times New Roman", "Georgia"} + # case-insensitive de-dup keeps the mixed-case spelling, not "times new roman" + assert "times new roman" not in fonts + + +def test_extract_fonts_handles_non_default_column_order(): + ass = ( + "[V4+ Styles]\n" + "Format: Name, Fontsize, Fontname, Bold\n" # Fontname not in the usual position + "Style: Main,28,Verdana,0\n" + ) + assert Subtitle.extract_fonts(ass) == {"Verdana"} + + +# --- non-Latin scripts (RTL / CJK) preserved through conversion ------------------------ + +CJK_RTL_VTT = """WEBVTT + +1 +00:00:01.000 --> 00:00:02.000 +مرحبا بالعالم + +2 +00:00:03.000 --> 00:00:04.000 +안녕하세요 + +3 +00:00:05.000 --> 00:00:06.000 +你好世界 +""" + + +@pytest.mark.parametrize( + "pattern", + [r"[؀-ۿ]", r"[가-힣]", r"[一-鿿]"], # Arabic, Hangul, CJK +) +def test_non_latin_scripts_survive_vtt_to_srt(tmp_path, monkeypatch, pattern): + monkeypatch.setattr("unshackle.core.config.config.subtitle", {}, raising=False) + sub = make_sub(tmp_path, "x.vtt", CJK_RTL_VTT, 
Codec.WebVTT) + out = sub.convert(Codec.SubRip, forced=True) + text = out.read_text("utf8") + assert cue_count(out) == 3 + assert re.search(pattern, text) # script survived the round-trip, no mojibake + + +# --- SDH stripping ---------------------------------------------------------------------- + +SDH_SRT = """1 +00:00:01,000 --> 00:00:02,000 +[door creaks] + +2 +00:00:03,000 --> 00:00:04,000 +Hello there. + +3 +00:00:05,000 --> 00:00:06,000 +♪ upbeat music ♪ +""" + + +def test_sdh_stripping_removes_effects_keeps_dialogue(tmp_path, monkeypatch): + # subby's SDHStripper runs on SRT without SubtitleEdit installed. + monkeypatch.setattr("unshackle.core.config.config.subtitle", {"sdh_method": "subby"}, raising=False) + sub = make_sub(tmp_path, "x.srt", SDH_SRT, Codec.SubRip) + sub.strip_hearing_impaired() + out = sub.path.read_text("utf8") + assert "Hello there." in out # real dialogue kept + assert "door creaks" not in out # bracketed effect removed (subby SDHStripper) + + +# --- segmented (box-encapsulated) formats: fVTT (wvtt) / fTTML (stpp) -------------------- +# These ship from DASH/HLS as fragmented MP4 (e.g. HBO Max). The downloader concatenates +# init + media segments into one file; parse() reads the MP4 boxes directly. 
+ +FIXTURES = pathlib.Path(__file__).parent / "fixtures" + + +def caption_total(caption_set) -> int: + return sum(len(caption_set.get_captions(lang)) for lang in caption_set.get_languages()) + + +def build_stpp_mp4(*ttml_fragments: str) -> bytes: + """A minimal stpp-style MP4: ftyp + one mdat per TTML fragment (what fTTML.parse reads).""" + + def box(box_type: bytes, payload: bytes) -> bytes: + return struct.pack(">I", 8 + len(payload)) + box_type + payload + + data = box(b"ftyp", b"isom" + struct.pack(">I", 0) + b"isomiso6") + for frag in ttml_fragments: + data += box(b"mdat", frag.encode("utf8")) + return data + + +def test_segmented_fvtt_parses_and_converts(tmp_path, monkeypatch): + monkeypatch.setattr("unshackle.core.config.config.subtitle", {}, raising=False) + data = (FIXTURES / "segmented.wvtt.mp4").read_bytes() + + caption_set = Subtitle.parse(data, Codec.fVTT) + assert caption_total(caption_set) == 2 + + seg = tmp_path / "seg.wvtt" + seg.write_bytes(data) + sub = Subtitle(url="https://example.test/x", language="en", codec=Codec.fVTT) + sub.path = seg + # download() converts fVTT -> WebVTT (not "forced"); chain is subby then pycaption. + out = sub.convert(Codec.WebVTT) + assert sub.codec == Codec.WebVTT + assert cue_count(out) == 2 + + +def test_segmented_fttml_parses_and_converts(tmp_path, monkeypatch): + monkeypatch.setattr("unshackle.core.config.config.subtitle", {}, raising=False) + frag = ( + '' + '
' + '

Line {a}

' + "
" + ) + data = build_stpp_mp4(frag.format(a=1, b=2), frag.format(a=3, b=4)) + + caption_set = Subtitle.parse(data, Codec.fTTML) + assert caption_total(caption_set) == 2 + + seg = tmp_path / "seg.stpp" + seg.write_bytes(data) + sub = Subtitle(url="https://example.test/x", language="en", codec=Codec.fTTML) + sub.path = seg + # download() converts fTTML -> TTML (only pycaption can read fTTML); then -> SRT. + sub.convert(Codec.TimedTextMarkupLang) + assert sub.codec == Codec.TimedTextMarkupLang + out = sub.convert(Codec.SubRip, forced=True) + assert cue_count(out) == 2 diff --git a/unshackle/commands/dl.py b/unshackle/commands/dl.py index 64af19b..231065d 100644 --- a/unshackle/commands/dl.py +++ b/unshackle/commands/dl.py @@ -311,7 +311,7 @@ class dl: ) temp_sub.path = temp_path try: - temp_sub.convert(target_codec) + temp_sub.convert(target_codec, forced=True) if temp_sub.path and temp_sub.path.exists(): shutil.copy2(temp_sub.path, sidecar_path) finally: @@ -528,9 +528,13 @@ class dl: @click.option("-S", "--subs-only", is_flag=True, default=False, help="Only download subtitle tracks.") @click.option("-C", "--chapters-only", is_flag=True, default=False, help="Only download chapter markers.") @click.option("-ns", "--no-subs", is_flag=True, default=False, help="Do not download subtitle tracks.") - @click.option("--skip-subtitle-errors", is_flag=True, default=False, - help="If a subtitle track fails to download, skip it and continue instead of " - "aborting the whole title (video/audio failures stay fatal).") + @click.option( + "--skip-subtitle-errors", + is_flag=True, + default=False, + help="If a subtitle track fails to download, skip it and continue instead of " + "aborting the whole title (video/audio failures stay fatal).", + ) @click.option("-na", "--no-audio", is_flag=True, default=False, help="Do not download audio tracks.") @click.option("-nc", "--no-chapters", is_flag=True, default=False, help="Do not download chapter markers.") @click.option("-nv", 
"--no-video", is_flag=True, default=False, help="Do not download video tracks.") @@ -2367,18 +2371,16 @@ class dl: for subtitle in title.tracks.subtitles: if sub_format: if subtitle.codec != sub_format: - subtitle.convert(sub_format) + subtitle.convert(sub_format, forced=True) elif subtitle.codec == Subtitle.Codec.TimedTextMarkupLang: # MKV does not support TTML, VTT is the next best option subtitle.convert(Subtitle.Codec.WebVTT) with console.status("Checking Subtitles for Fonts..."): - font_names = [] + font_names: list[str] = [] for subtitle in title.tracks.subtitles: - if subtitle.codec == Subtitle.Codec.SubStationAlphav4: - for line in subtitle.path.read_text("utf8").splitlines(): - if line.startswith("Style: "): - font_names.append(line.removeprefix("Style: ").split(",")[1].strip()) + if subtitle.codec in (Subtitle.Codec.SubStationAlpha, Subtitle.Codec.SubStationAlphav4): + font_names.extend(Subtitle.extract_fonts(subtitle.path.read_text("utf8"))) font_count, missing_fonts = self.attach_subtitle_fonts(font_names, title, temp_font_files) diff --git a/unshackle/core/binaries.py b/unshackle/core/binaries.py index dd0a114..65f92d0 100644 --- a/unshackle/core/binaries.py +++ b/unshackle/core/binaries.py @@ -37,7 +37,7 @@ def find(*names: str) -> Optional[Path]: FFMPEG = find("ffmpeg") FFProbe = find("ffprobe") FFPlay = find("ffplay") -SubtitleEdit = find("SubtitleEdit") +SubtitleEdit = find("SubtitleEdit", "seconv") # seconv = cross-platform subtitleedit-cli (.NET 8) ShakaPackager = find( "shaka-packager", "packager", diff --git a/unshackle/core/tracks/subtitle.py b/unshackle/core/tracks/subtitle.py index 847056a..4a5ccd0 100644 --- a/unshackle/core/tracks/subtitle.py +++ b/unshackle/core/tracks/subtitle.py @@ -11,13 +11,12 @@ from pathlib import Path from typing import Any, Callable, Iterable, Optional, Union import pycaption -import pysubs2 import requests from construct import Container from pycaption import Caption, CaptionList, CaptionNode, WebVTTReader from 
pycaption.geometry import Layout from pymp4.parser import MP4 -from subby import CommonIssuesFixer, SAMIConverter, SDHStripper, WebVTTConverter, WVTTConverter +from subby import CommonIssuesFixer, SAMIConverter, SDHStripper from subtitle_filter import Subtitles from unshackle.core import binaries @@ -600,306 +599,73 @@ class Subtitle(Track): return "\n".join(sanitized_lines) - def convert_with_subby(self, codec: Subtitle.Codec) -> Path: + def convert(self, codec: Subtitle.Codec, *, forced: bool = False) -> Path: """ - Convert subtitle using subby library for better format support and processing. + Convert this Subtitle to another format. - This method leverages subby's advanced subtitle processing capabilities - including better WebVTT handling, SDH stripping, and common issue fixing. + Backend selection is data-driven (see ``tracks/subtitle_convert.py``): the best + available backend that supports source->target is used, falling back through the + capability chain on failure. The backend can be pinned via the ``conversion_method`` + config key (``auto`` | ``subby`` | ``pysubs2`` | ``subtitleedit`` | ``pycaption``), + or nudged per-service via ``preferred_conversion_method``; an explicit config value + always wins. + + ``forced`` marks an explicit user request (``--sub-format``). Lossy downconverts of + styled formats (SSA/ASS -> SRT) are skipped unless ``forced`` is True. 
""" + from unshackle.core.tracks.subtitle_convert import run_conversion if not self.path or not self.path.exists(): raise ValueError("You must download the subtitle track first.") - if self.codec == codec: - return self.path + method = ( + config.subtitle.get("conversion_method") or getattr(self, "preferred_conversion_method", None) or "auto" + ) + pin = None if method == "auto" else method + return run_conversion(self, codec, pin=pin, forced=forced) - output_path = self.path.with_suffix(f".{codec.value.lower()}") - original_path = self.path - - try: - # Convert to SRT using subby first - srt_subtitles = None - - if self.codec == Subtitle.Codec.WebVTT: - converter = WebVTTConverter() - srt_subtitles = converter.from_file(self.path) - if self.codec == Subtitle.Codec.fVTT: - converter = WVTTConverter() - srt_subtitles = converter.from_file(self.path) - elif self.codec == Subtitle.Codec.SAMI: - converter = SAMIConverter() - srt_subtitles = converter.from_file(self.path) - - if srt_subtitles is not None: - # Apply common fixes - fixer = CommonIssuesFixer() - fixed_srt, _ = fixer.from_srt(srt_subtitles) - - # If target is SRT, we're done - if codec == Subtitle.Codec.SubRip: - fixed_srt.save(output_path, encoding="utf8") - else: - # Convert from SRT to target format using existing pycaption logic - temp_srt_path = self.path.with_suffix(".temp.srt") - fixed_srt.save(temp_srt_path, encoding="utf8") - - # Parse the SRT and convert to target format - caption_set = self.parse(temp_srt_path.read_bytes(), Subtitle.Codec.SubRip) - self.merge_same_cues(caption_set) - - writer = { - Subtitle.Codec.TimedTextMarkupLang: pycaption.DFXPWriter, - Subtitle.Codec.WebVTT: pycaption.WebVTTWriter, - }.get(codec) - - if writer: - subtitle_text = writer().write(caption_set) - output_path.write_text(subtitle_text, encoding="utf8") - else: - # Fall back to existing conversion method - temp_srt_path.unlink() - return self._convert_standard(codec) - - temp_srt_path.unlink() - - if 
original_path.exists() and original_path != output_path: - original_path.unlink() - - self.path = output_path - self.codec = codec - - if callable(self.OnConverted): - self.OnConverted(codec) - - return output_path - else: - # Fall back to existing conversion method - return self._convert_standard(codec) - - except Exception: - # Fall back to existing conversion method on any error - return self._convert_standard(codec) - - def convert_with_pysubs2(self, codec: Subtitle.Codec) -> Path: + @staticmethod + def extract_fonts(text: str) -> set[str]: """ - Convert subtitle using pysubs2 library for broad format support. + Font names referenced by an ASS/SSA subtitle. - pysubs2 is a pure-Python library supporting SubRip (SRT), SubStation Alpha - (SSA/ASS), WebVTT, TTML, SAMI, MicroDVD, MPL2, and TMP formats. + Covers both sources that need attaching for correct rendering: + - the ``Fontname`` column of every ``Style:`` line in ``[V4+ Styles]``/``[V4 Styles]`` + (column located from the section's ``Format:`` line, not assumed by index), and + - inline ``\\fn`` font overrides inside ``Dialogue`` override blocks. + + Leading ``@`` (vertical-writing prefix) is stripped and names are de-duplicated + case-insensitively, preferring a mixed-case spelling over an all-lowercase one. """ - if not self.path or not self.path.exists(): - raise ValueError("You must download the subtitle track first.") + names: set[str] = set() + name_index = 1 # ASS default Style order: Name, Fontname, ... 
+ in_styles = False + for line in text.splitlines(): + stripped = line.strip() + if stripped.startswith("["): + in_styles = stripped.lower() in ("[v4+ styles]", "[v4 styles]") + continue + if not in_styles: + continue + if stripped.lower().startswith("format:"): + columns = [c.strip().lower() for c in stripped.split(":", 1)[1].split(",")] + if "fontname" in columns: + name_index = columns.index("fontname") + elif stripped.lower().startswith("style:"): + fields = stripped.split(":", 1)[1].split(",") + if len(fields) > name_index: + names.add(fields[name_index].strip()) - if self.codec == codec: - return self.path + names.update(match.strip() for match in re.findall(r"\\fn([^\\}]+)", text)) - output_path = self.path.with_suffix(f".{codec.value.lower()}") - original_path = self.path - - codec_to_pysubs2_format = { - Subtitle.Codec.SubRip: "srt", - Subtitle.Codec.SubStationAlpha: "ssa", - Subtitle.Codec.SubStationAlphav4: "ass", - Subtitle.Codec.WebVTT: "vtt", - Subtitle.Codec.TimedTextMarkupLang: "ttml", - Subtitle.Codec.SAMI: "sami", - Subtitle.Codec.MicroDVD: "microdvd", - Subtitle.Codec.MPL2: "mpl2", - Subtitle.Codec.TMP: "tmp", - } - - pysubs2_output_format = codec_to_pysubs2_format.get(codec) - if pysubs2_output_format is None: - return self._convert_standard(codec) - - try: - subs = pysubs2.load(str(self.path), encoding="utf-8") - - subs.save(str(output_path), format_=pysubs2_output_format, encoding="utf-8") - - if original_path.exists() and original_path != output_path: - original_path.unlink() - - self.path = output_path - self.codec = codec - - if callable(self.OnConverted): - self.OnConverted(codec) - - return output_path - - except Exception: - return self._convert_standard(codec) - - def convert(self, codec: Subtitle.Codec) -> Path: - """ - Convert this Subtitle to another Format. 
- - The conversion method is determined by the 'conversion_method' setting in config: - - 'auto' (default): Uses subby for WebVTT/fVTT/SAMI; for SSA/ASS/MicroDVD/MPL2/TMP - uses SubtitleEdit if available, otherwise pysubs2; standard for others - - 'subby': Always uses subby with CommonIssuesFixer - - 'subtitleedit': Uses SubtitleEdit when available, falls back to pycaption - - 'pycaption': Uses only pycaption library - - 'pysubs2': Uses pysubs2 library - """ - # Check configuration for conversion method - conversion_method = config.subtitle.get("conversion_method", "auto") - - if conversion_method == "subby": - return self.convert_with_subby(codec) - elif conversion_method == "subtitleedit": - return self._convert_standard(codec) - elif conversion_method == "pycaption": - return self._convert_pycaption_only(codec) - elif conversion_method == "pysubs2": - return self.convert_with_pysubs2(codec) - elif conversion_method == "auto": - if self.codec in (Subtitle.Codec.WebVTT, Subtitle.Codec.fVTT, Subtitle.Codec.SAMI): - return self.convert_with_subby(codec) - elif self.codec in ( - Subtitle.Codec.SubStationAlpha, - Subtitle.Codec.SubStationAlphav4, - Subtitle.Codec.MicroDVD, - Subtitle.Codec.MPL2, - Subtitle.Codec.TMP, - ): - if binaries.SubtitleEdit: - return self._convert_standard(codec) - else: - return self.convert_with_pysubs2(codec) - else: - return self._convert_standard(codec) - else: - return self._convert_standard(codec) - - def _convert_pycaption_only(self, codec: Subtitle.Codec) -> Path: - """ - Convert subtitle using only pycaption library (no SubtitleEdit, no subby). - - This is the original conversion method that only uses pycaption. 
- """ - if not self.path or not self.path.exists(): - raise ValueError("You must download the subtitle track first.") - - if self.codec == codec: - return self.path - - output_path = self.path.with_suffix(f".{codec.value.lower()}") - original_path = self.path - - # Use only pycaption for conversion - writer = { - Subtitle.Codec.SubRip: pycaption.SRTWriter, - Subtitle.Codec.TimedTextMarkupLang: pycaption.DFXPWriter, - Subtitle.Codec.WebVTT: pycaption.WebVTTWriter, - }.get(codec) - - if writer is None: - raise NotImplementedError(f"Cannot convert {self.codec.name} to {codec.name} using pycaption only.") - - caption_set = self.parse(self.path.read_bytes(), self.codec) - Subtitle.merge_same_cues(caption_set) - if codec == Subtitle.Codec.WebVTT: - Subtitle.filter_unwanted_cues(caption_set) - subtitle_text = writer().write(caption_set) - - output_path.write_text(subtitle_text, encoding="utf8") - - if original_path.exists() and original_path != output_path: - original_path.unlink() - - self.path = output_path - self.codec = codec - - if callable(self.OnConverted): - self.OnConverted(codec) - - return output_path - - def _convert_standard(self, codec: Subtitle.Codec) -> Path: - """ - Convert this Subtitle to another Format. - - The file path location of the Subtitle data will be kept at the same - location but the file extension will be changed appropriately. - - Supported formats: - - SubRip - SubtitleEdit or pycaption.SRTWriter - - TimedTextMarkupLang - SubtitleEdit or pycaption.DFXPWriter - - WebVTT - SubtitleEdit or pycaption.WebVTTWriter - - SubStationAlphav4 - SubtitleEdit - - SAMI - subby.SAMIConverter (when available) - - fTTML* - custom code using some pycaption functions - - fVTT* - custom code using some pycaption functions - *: Can read from format, but cannot convert to format - - Note: It currently prioritizes using SubtitleEdit over PyCaption as - I have personally noticed more oddities with PyCaption parsing over - SubtitleEdit. 
Especially when working with TTML/DFXP where it would - often have timecodes and stuff mixed in/duplicated. - - Returns the new file path of the Subtitle. - """ - if not self.path or not self.path.exists(): - raise ValueError("You must download the subtitle track first.") - - if self.codec == codec: - return self.path - - output_path = self.path.with_suffix(f".{codec.value.lower()}") - original_path = self.path - - if binaries.SubtitleEdit and self.codec not in (Subtitle.Codec.fTTML, Subtitle.Codec.fVTT): - sub_edit_format = { - Subtitle.Codec.SubRip: "subrip", - Subtitle.Codec.SubStationAlpha: "substationalpha", - Subtitle.Codec.SubStationAlphav4: "advancedsubstationalpha", - Subtitle.Codec.TimedTextMarkupLang: "timedtext1.0", - Subtitle.Codec.WebVTT: "webvtt", - Subtitle.Codec.SAMI: "sami", - Subtitle.Codec.MicroDVD: "microdvd", - }.get(codec, codec.name.lower()) - sub_edit_args = [ - str(binaries.SubtitleEdit), - "/convert", - str(self.path), - sub_edit_format, - f"/outputfilename:{output_path.name}", - "/encoding:utf8", - ] - if codec == Subtitle.Codec.SubRip: - sub_edit_args.append("/ConvertColorsToDialog") - subprocess.run(sub_edit_args, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) - else: - writer = { - # pycaption generally only supports these subtitle formats - Subtitle.Codec.SubRip: pycaption.SRTWriter, - Subtitle.Codec.TimedTextMarkupLang: pycaption.DFXPWriter, - Subtitle.Codec.WebVTT: pycaption.WebVTTWriter, - }.get(codec) - if writer is None: - raise NotImplementedError(f"Cannot yet convert {self.codec.name} to {codec.name}.") - - caption_set = self.parse(self.path.read_bytes(), self.codec) - Subtitle.merge_same_cues(caption_set) - if codec == Subtitle.Codec.WebVTT: - Subtitle.filter_unwanted_cues(caption_set) - subtitle_text = writer().write(caption_set) - - output_path.write_text(subtitle_text, encoding="utf8") - - if original_path.exists() and original_path != output_path: - original_path.unlink() - - self.path = output_path - 
self.codec = codec - - if callable(self.OnConverted): - self.OnConverted(codec) - - return output_path + canonical: dict[str, str] = {} + for name in (raw.lstrip("@").strip() for raw in names): + if not name: + continue + key = name.lower() + if key not in canonical or (name != name.lower() and canonical[key] == canonical[key].lower()): + canonical[key] = name + return set(canonical.values()) @staticmethod def parse(data: bytes, codec: Subtitle.Codec) -> pycaption.CaptionSet: @@ -1267,25 +1033,13 @@ class Subtitle(Track): ) if binaries.SubtitleEdit and use_subtitleedit: - output_format = { - Subtitle.Codec.SubRip: "subrip", - Subtitle.Codec.SubStationAlpha: "substationalpha", - Subtitle.Codec.SubStationAlphav4: "advancedsubstationalpha", - Subtitle.Codec.TimedTextMarkupLang: "timedtext1.0", - Subtitle.Codec.WebVTT: "webvtt", - Subtitle.Codec.SAMI: "sami", - Subtitle.Codec.MicroDVD: "microdvd", - }.get(self.codec, self.codec.name.lower()) + from unshackle.core.tracks.subtitle_convert import SUBTITLE_EDIT_FORMATS, subtitleedit_args + + output_format = SUBTITLE_EDIT_FORMATS.get(self.codec, self.codec.name.lower()) subprocess.run( - [ - str(binaries.SubtitleEdit), - "/convert", - str(self.path), - output_format, - "/encoding:utf8", - "/overwrite", - "/RemoveTextForHI", - ], + subtitleedit_args( + binaries.SubtitleEdit, self.path, output_format, output_folder=self.path.parent, remove_hi=True + ), check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, @@ -1330,26 +1084,14 @@ class Subtitle(Track): if not binaries.SubtitleEdit: raise EnvironmentError("SubtitleEdit executable not found...") - output_format = { - Subtitle.Codec.SubRip: "subrip", - Subtitle.Codec.SubStationAlpha: "substationalpha", - Subtitle.Codec.SubStationAlphav4: "advancedsubstationalpha", - Subtitle.Codec.TimedTextMarkupLang: "timedtext1.0", - Subtitle.Codec.WebVTT: "webvtt", - Subtitle.Codec.SAMI: "sami", - Subtitle.Codec.MicroDVD: "microdvd", - }.get(self.codec, self.codec.name.lower()) + 
from unshackle.core.tracks.subtitle_convert import SUBTITLE_EDIT_FORMATS, subtitleedit_args + + output_format = SUBTITLE_EDIT_FORMATS.get(self.codec, self.codec.name.lower()) subprocess.run( - [ - str(binaries.SubtitleEdit), - "/convert", - str(self.path), - output_format, - "/ReverseRtlStartEnd", - "/encoding:utf8", - "/overwrite", - ], + subtitleedit_args( + binaries.SubtitleEdit, self.path, output_format, output_folder=self.path.parent, reverse_rtl=True + ), check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, diff --git a/unshackle/core/tracks/subtitle_convert.py b/unshackle/core/tracks/subtitle_convert.py new file mode 100644 index 0000000..71f2273 --- /dev/null +++ b/unshackle/core/tracks/subtitle_convert.py @@ -0,0 +1,313 @@ +""" +Subtitle conversion backend registry. + +Routing is data-driven: each backend declares which (source -> target) codec pairs it can +read/write, whether it is available in the current environment, and a preference rank. +``resolve_backends`` filters the registry to the available backends that support the +requested pair and orders them by rank; ``run_conversion`` tries each in turn (a real +fallback chain) until one succeeds. + +The public entry point stays ``Subtitle.convert`` / ``Subtitle.strip_hearing_impaired`` in +subtitle.py — this module only holds the selection + conversion logic so subtitle.py keeps +the codec enum, ``parse``, sanitizers and cue helpers (the collaborators backends reuse). +""" + +from __future__ import annotations + +import logging +import subprocess +from pathlib import Path +from typing import Optional, Protocol + +import pycaption +import pysubs2 +from subby import CommonIssuesFixer, SAMIConverter, WebVTTConverter, WVTTConverter + +from unshackle.core import binaries +from unshackle.core.tracks.subtitle import Subtitle + +log = logging.getLogger("subtitle") + +Codec = Subtitle.Codec + +# SubtitleEdit (and the cross-platform seconv port) /convert format names. 
+# Shared by SubtitleEditBackend, strip_hearing_impaired and reverse_rtl so the map lives once. +SUBTITLE_EDIT_FORMATS: dict[Codec, str] = { + Codec.SubRip: "subrip", + Codec.SubStationAlpha: "substationalpha", + Codec.SubStationAlphav4: "advancedsubstationalpha", + Codec.TimedTextMarkupLang: "timedtext1.0", + Codec.WebVTT: "webvtt", + Codec.SAMI: "sami", + Codec.MicroDVD: "microdvd", +} + +# pycaption can only WRITE these three formats. +PYCAPTION_WRITERS = { + Codec.SubRip: pycaption.SRTWriter, + Codec.TimedTextMarkupLang: pycaption.DFXPWriter, + Codec.WebVTT: pycaption.WebVTTWriter, +} + +# pysubs2 format identifiers per codec. +PYSUBS2_FORMATS: dict[Codec, str] = { + Codec.SubRip: "srt", + Codec.SubStationAlpha: "ssa", + Codec.SubStationAlphav4: "ass", + Codec.WebVTT: "vtt", + Codec.TimedTextMarkupLang: "ttml", + Codec.SAMI: "sami", + Codec.MicroDVD: "microdvd", + Codec.MPL2: "mpl2", + Codec.TMP: "tmp", +} + + +def subtitleedit_args( + binary: object, + src: Path, + fmt: str, + *, + output_folder: Optional[Path] = None, + convert_colors: bool = False, + remove_hi: bool = False, + reverse_rtl: bool = False, +) -> list[str]: + """ + Build a SubtitleEdit batch-convert command. + + Targets the SubtitleEdit 5+ CLI (``SeConv`` / ``seconv`` on every platform), which takes + ``--flags`` with a positional ``<input> <format>`` (no legacy ``/convert`` verb). The + SE5 converter names the output ``<stem>.<ext>``; pass ``output_folder`` to + place it next to a chosen path (a bare ``--output-filename`` resolves against the *cwd*, + not the input dir, so we always steer with ``--output-folder``). ``--overwrite`` is always + set so re-runs and in-place transforms (SDH/RTL) don't fail on an existing file.
+ """ + args = [str(binary), str(src), fmt, "--encoding:utf-8", "--overwrite"] + if output_folder is not None: + args.append(f"--output-folder:{output_folder}") + if convert_colors: + args.append("--convert-colors-to-dialog") + if remove_hi: + args.append("--remove-text-for-hi") + if reverse_rtl: + args.append("--reverse-rtl-start-end") + return args + + +# Styled SubStation formats flattened to SRT lose positioning/colours/italics. +# Never performed automatically — only when the user explicitly forces a target format. +LOSSY_DOWNCONVERTS: frozenset[tuple[Codec, Codec]] = frozenset( + { + (Codec.SubStationAlpha, Codec.SubRip), + (Codec.SubStationAlphav4, Codec.SubRip), + } +) + + +class SubtitleBackend(Protocol): + name: str + + def is_available(self) -> bool: ... + + def can_convert(self, source: Codec, target: Codec) -> bool: ... + + def rank(self, source: Codec, target: Codec) -> int: ... + + def convert(self, source: Codec, src: Path, target: Codec, out: Path) -> None: + """Convert ``src`` (a ``source`` file) to ``target``, writing to ``out``. Raise on failure.""" + ... + + +class SubtitleEditBackend: + """SubtitleEdit / seconv CLI. Highest fidelity (keeps positioning + italics) when present.""" + + name = "subtitleedit" + reads = frozenset(SUBTITLE_EDIT_FORMATS) + writes = frozenset(SUBTITLE_EDIT_FORMATS) + + def is_available(self) -> bool: + return bool(binaries.SubtitleEdit) + + def can_convert(self, source: Codec, target: Codec) -> bool: + # Segmented box formats cannot be read by SubtitleEdit. 
+ if source in (Codec.fTTML, Codec.fVTT): + return False + return source in self.reads and target in self.writes + + def rank(self, source: Codec, target: Codec) -> int: + return 0 + + def convert(self, source: Codec, src: Path, target: Codec, out: Path) -> None: + args = subtitleedit_args( + binaries.SubtitleEdit, + src, + SUBTITLE_EDIT_FORMATS[target], + output_folder=out.parent, + convert_colors=(target == Codec.SubRip), + ) + subprocess.run(args, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) + # SE5 names the output <stem>.<ext>, which may differ from our target + # suffix (e.g. timedtext1.0 -> .ttml). Normalise it onto `out`. + if not out.exists(): + produced = next((p for p in src.parent.glob(f"{src.stem}.*") if p not in (src, out)), None) + if produced is None: + raise FileNotFoundError(f"SubtitleEdit produced no output for {src.name} -> {target.name}") + produced.replace(out) + + +class Pysubs2Backend: + """pysubs2 — pure Python, broad format support, best fidelity for SSA/ASS (native style model).""" + + name = "pysubs2" + formats = frozenset(PYSUBS2_FORMATS) + + def is_available(self) -> bool: + return True + + def can_convert(self, source: Codec, target: Codec) -> bool: + return source in self.formats and target in self.formats + + def rank(self, source: Codec, target: Codec) -> int: + # Preferred reader for styled SubStation sources; solid general fallback otherwise. + return 1 if source in (Codec.SubStationAlpha, Codec.SubStationAlphav4) else 2 + + def convert(self, source: Codec, src: Path, target: Codec, out: Path) -> None: + subs = pysubs2.load(str(src), encoding="utf-8") + subs.save(str(out), format_=PYSUBS2_FORMATS[target], encoding="utf-8") + + +class SubbyBackend: + """subby — purpose-built for streaming subs. WebVTT/fVTT/SAMI -> SRT + CommonIssuesFixer cleanup.""" + + name = "subby" + reads = frozenset({Codec.WebVTT, Codec.fVTT, Codec.SAMI}) + # Native SRT output; non-SRT targets re-encoded from the SRT intermediate via pycaption.
+ writes = frozenset({Codec.SubRip, Codec.TimedTextMarkupLang, Codec.WebVTT}) + converters = { + Codec.WebVTT: WebVTTConverter, + Codec.fVTT: WVTTConverter, + Codec.SAMI: SAMIConverter, + } + + def is_available(self) -> bool: + return True + + def can_convert(self, source: Codec, target: Codec) -> bool: + return source in self.reads and target in self.writes + + def rank(self, source: Codec, target: Codec) -> int: + # Great for *->SRT (adds cleanup); the SRT intermediate is lossy for other targets. + return 1 if target == Codec.SubRip else 5 + + def convert(self, source: Codec, src: Path, target: Codec, out: Path) -> None: + srt_subtitles = self.converters[source]().from_file(src) + fixed_srt, _ = CommonIssuesFixer().from_srt(srt_subtitles) + if target == Codec.SubRip: + fixed_srt.save(out, encoding="utf8") + return + temp_srt = src.with_suffix(".temp.srt") + fixed_srt.save(temp_srt, encoding="utf8") + try: + caption_set = Subtitle.parse(temp_srt.read_bytes(), Codec.SubRip) + Subtitle.merge_same_cues(caption_set) + out.write_text(PYCAPTION_WRITERS[target]().write(caption_set), encoding="utf8") + finally: + temp_srt.unlink(missing_ok=True) + + +class PycaptionBackend: + """pycaption — last resort. 
Note: flattens positioning/italics (devine #39), so ranked last.""" + + name = "pycaption" + reads = frozenset({Codec.SubRip, Codec.TimedTextMarkupLang, Codec.WebVTT, Codec.SAMI, Codec.fTTML, Codec.fVTT}) + writes = frozenset(PYCAPTION_WRITERS) + + def is_available(self) -> bool: + return True + + def can_convert(self, source: Codec, target: Codec) -> bool: + return source in self.reads and target in self.writes + + def rank(self, source: Codec, target: Codec) -> int: + return 9 + + def convert(self, source: Codec, src: Path, target: Codec, out: Path) -> None: + caption_set = Subtitle.parse(src.read_bytes(), source) + Subtitle.merge_same_cues(caption_set) + if target == Codec.WebVTT: + Subtitle.filter_unwanted_cues(caption_set) + out.write_text(PYCAPTION_WRITERS[target]().write(caption_set), encoding="utf8") + + +REGISTRY: list[SubtitleBackend] = [ + SubtitleEditBackend(), + SubbyBackend(), + Pysubs2Backend(), + PycaptionBackend(), +] + + +def resolve_backends(source: Codec, target: Codec, *, pin: Optional[str] = None) -> list[SubtitleBackend]: + """Available backends that support source->target, ordered by rank. 
A pin is tried first.""" + available = [b for b in REGISTRY if b.is_available() and b.can_convert(source, target)] + if pin: + pinned = [b for b in available if b.name == pin] + rest = sorted((b for b in available if b.name != pin), key=lambda b: b.rank(source, target)) + return pinned + rest + return sorted(available, key=lambda b: b.rank(source, target)) + + +def finalize(sub: Subtitle, target: Codec, out: Path) -> Path: + """Swap the track onto the converted file and fire the OnConverted callback.""" + original = sub.path + if original and original.exists() and original != out: + original.unlink() + sub.path = out + sub.codec = target + if callable(sub.OnConverted): + sub.OnConverted(target) + return out + + +def run_conversion(sub: Subtitle, target: Codec, *, pin: Optional[str] = None, forced: bool = False) -> Path: + """ + Convert ``sub`` to ``target`` using the best available backend, falling back through the + capability chain on failure. + + ``forced`` is True only for explicit user requests (``--sub-format``); lossy downconverts + (styled SubStation -> SRT) are skipped unless forced. 
+ """ + if sub.path is None or not sub.path.exists(): + raise ValueError("You must download the subtitle track first.") + if sub.codec is None: + raise ValueError("Subtitle has no codec to convert from.") + source, src = sub.codec, sub.path + + if source == target: + return src + + if (source, target) in LOSSY_DOWNCONVERTS and not forced: + log.info( + f"Keeping {source.name} subtitle as-is " + f"(skipping lossy auto-conversion to {target.name}; pass --sub-format to force)" + ) + return src + + chain = resolve_backends(source, target, pin=pin) + if not chain: + raise NotImplementedError(f"Cannot convert {source.name} to {target.name}.") + + out = src.with_suffix(f".{target.value.lower()}") + last_exc: Optional[Exception] = None + for backend in chain: + try: + backend.convert(source, src, target, out) + except Exception as e: + last_exc = e + log.debug(f"Subtitle backend {backend.name} failed ({source.name}->{target.name}): {e}") + continue + log.debug(f"Converted subtitle {source.name}->{target.name} via {backend.name}") + return finalize(sub, target, out) + + raise RuntimeError(f"All subtitle backends failed for {source.name}->{target.name}") from last_exc diff --git a/unshackle/unshackle-example.yaml b/unshackle/unshackle-example.yaml index 6e63ae8..0efec36 100644 --- a/unshackle/unshackle-example.yaml +++ b/unshackle/unshackle-example.yaml @@ -448,30 +448,41 @@ filenames: # - pysubs2: Use pysubs2 library (supports SRT/SSA/ASS/WebVTT/TTML/SAMI/MicroDVD/MPL2/TMP) subtitle: conversion_method: auto - # sdh_method: Method to use for SDH (hearing impaired) stripping + # Which backend converts subtitles (data-driven registry, pin-then-fallback) + # - auto (default): best available by rank (SubtitleEdit > subby/pysubs2 > pycaption) + # - subby | pysubs2 | subtitleedit | pycaption: pin that backend first, still falls back + # Styled ASS/SSA are never auto-downconverted to SRT (kept as-is); --sub-format srt overrides. 
+ # SubtitleEdit on Linux/macOS = install the SE5 "SeConv" (seconv) CLI on PATH or unshackle/binaries/. + + sdh_method: auto + # Method to use for SDH (hearing impaired) stripping # - auto (default): Try subby (SRT only), then SubtitleEdit (if available), then subtitle-filter # - subby: Use subby library (SRT only) - # - subtitleedit: Use SubtitleEdit tool (Windows only, falls back to subtitle-filter) + # - subtitleedit: Use SubtitleEdit / seconv (SE5 CLI, cross-platform), falls back to subtitle-filter # - filter-subs: Use subtitle-filter library directly - sdh_method: auto - # strip_sdh: Automatically create stripped (non-SDH) versions of SDH subtitles - # Set to false to disable automatic SDH stripping entirely (default: true) + strip_sdh: true - # convert_before_strip: Auto-convert VTT/other formats to SRT before using subtitle-filter - # This ensures compatibility when subtitle-filter is used as fallback (default: true) + # Automatically create stripped (non-SDH) versions of SDH subtitles + # Set to false to disable automatic SDH stripping entirely (default: true) + convert_before_strip: true - # preserve_formatting: Preserve original subtitle formatting (tags, positioning, styling) + # Auto-convert VTT/other formats to SRT before using subtitle-filter + # This ensures compatibility when subtitle-filter is used as fallback (default: true) + + preserve_formatting: true + # Preserve original subtitle formatting (tags, positioning, styling) # When true, skips pycaption processing for WebVTT files to keep tags like <i>, <b>, positioning intact # Combined with no sub_format setting, ensures subtitles remain in their original format (default: true) - preserve_formatting: true - # output_mode: Output mode for subtitles + + output_mode: mux + # Output mode for subtitles # - mux: Embed subtitles in MKV container only (default) # - sidecar: Save subtitles as separate files only # - both: Embed in MKV AND save as sidecar files - output_mode: mux - # sidecar_format: Format for
sidecar subtitle files - # Options: srt, vtt, ass, original (keep current format) + sidecar_format: srt + # Format for sidecar subtitle files + # Options: srt, vtt, ass, original (keep current format) # Configuration for pywidevine and pyplayready's serve functionality # Also used for remote services (unshackle serve)