mirror of
https://github.com/unshackle-dl/unshackle.git
synced 2026-06-10 03:02:09 +00:00
feat(subtitle): data-driven conversion registry + SubtitleEdit 5 support
Replace the hardcoded conversion if/elif in Subtitle.convert with a capability-matrix backend registry (subtitle_convert.py): each backend declares the source->target pairs it supports plus a rank, and run_conversion tries them in order as a real fallback chain. conversion_method pins a backend but still falls back (pin-then-fallback).

- Detect the cross-platform SubtitleEdit 5+ CLI (seconv) and use its --flag syntax for convert, SDH stripping, and reverse-RTL
- Protect styled ASS/SSA from automatic SRT downconversion; honor an explicit --sub-format / sidecar_format
- Read segmented fVTT (wvtt) and fTTML (stpp) directly from fragmented MP4
- Improve ASS/SSA font detection: inline \fn overrides, Format-located Fontname column, @-prefix strip, case-insensitive de-dup; covers SSA too
- Update SUBTITLE_CONFIG.md, example yaml, README; add regression tests and a backend benchmark script
2
.gitignore
vendored
@@ -24,6 +24,8 @@ device_private_key
|
|||||||
device_vmp_blob
|
device_vmp_blob
|
||||||
unshackle/binaries/*
|
unshackle/binaries/*
|
||||||
!unshackle/binaries/placehere.txt
|
!unshackle/binaries/placehere.txt
|
||||||
|
# test fixtures (binary subtitle samples) must be tracked despite the *.mp4 rule above
|
||||||
|
!tests/tracks/fixtures/*.mp4
|
||||||
unshackle/cache/
|
unshackle/cache/
|
||||||
unshackle/cookies/
|
unshackle/cookies/
|
||||||
unshackle/certs/
|
unshackle/certs/
|
||||||
|
|||||||
@@ -47,6 +47,10 @@ External tools on your `PATH` (recommended versions):
|
|||||||
- [Bento4](https://github.com/axiomatic-systems/Bento4) - ≥ 1.6.0-639
|
- [Bento4](https://github.com/axiomatic-systems/Bento4) - ≥ 1.6.0-639
|
||||||
- [dovi_tool](https://github.com/quietvoid/dovi_tool) - ≥ 2.1
|
- [dovi_tool](https://github.com/quietvoid/dovi_tool) - ≥ 2.1
|
||||||
|
|
||||||
|
Optional:
|
||||||
|
|
||||||
|
- [SubtitleEdit](https://github.com/SubtitleEdit/subtitleedit/releases) - ≥ 5.0 (`SeConv` CLI)
|
||||||
|
|
||||||
## License
|
## License
|
||||||
|
|
||||||
[GPL-3.0](LICENSE). Do not use unshackle for content you lack the rights to. Keep the core free and open; keep service code private. Be kind.
|
[GPL-3.0](LICENSE). Do not use unshackle for content you lack the rights to. Keep the core free and open; keep service code private. Be kind.
|
||||||
|
|||||||
@@ -8,17 +8,44 @@ For the canonical example, see `unshackle/unshackle-example.yaml`.
|
|||||||
|
|
||||||
Control subtitle conversion, SDH (hearing-impaired) stripping, formatting preservation, and output behavior.
|
Control subtitle conversion, SDH (hearing-impaired) stripping, formatting preservation, and output behavior.
|
||||||
|
|
||||||
- `conversion_method`: How to convert subtitles between formats. Default: `auto`.
|
- `conversion_method`: Which backend to convert subtitles with. Default: `auto`.
|
||||||
- `auto`: Smart routing - subby for WebVTT/fVTT/SAMI; for SSA/ASS/MicroDVD/MPL2/TMP use SubtitleEdit when available, otherwise pysubs2; standard pycaption/SubtitleEdit pipeline for everything else.
|
|
||||||
- `subby`: Always use subby with `CommonIssuesFixer` (falls back to standard if the source codec isn't supported by subby).
|
Routing is data-driven (`unshackle/core/tracks/subtitle_convert.py`): a registry of backends each
|
||||||
- `subtitleedit`: Prefer SubtitleEdit when available; otherwise fall back to the standard pycaption pipeline.
|
declares the source→target codec pairs it supports plus a preference rank. For a conversion, the
|
||||||
- `pycaption`: Use only the pycaption library (no SubtitleEdit, no subby). Limited to SRT, TTML, and WebVTT outputs.
|
available backends that support the pair are tried in rank order — a real fallback chain. A
|
||||||
- `pysubs2`: Use pysubs2 (supports SRT, SSA, ASS, WebVTT, TTML, SAMI, MicroDVD, MPL2, TMP).
|
non-`auto` value **pins** that backend first, then still falls back through the chain if it can't
|
||||||
|
handle the pair or errors (pin-then-fallback). A service may also set `preferred_conversion_method`
|
||||||
|
on its tracks; an explicit `conversion_method` in config always wins.
|
||||||
|
|
||||||
|
- `auto`: Best available backend by rank — SubtitleEdit (if installed) for highest fidelity;
|
||||||
|
otherwise subby for WebVTT/fVTT/SAMI→SRT (adds `CommonIssuesFixer` cleanup), pysubs2 for SSA/ASS
|
||||||
|
and the broad format set, pycaption as last resort.
|
||||||
|
- `subby`: Prefer subby (`CommonIssuesFixer`); reads WebVTT/fVTT/SAMI, writes SRT (and TTML/VTT via
|
||||||
|
an SRT intermediate).
|
||||||
|
- `subtitleedit`: Prefer SubtitleEdit / `seconv`. Highest fidelity — preserves positioning/italics.
|
||||||
|
- `pycaption`: Prefer pycaption. **Flattens positioning/italics**, writes only SRT/TTML/WebVTT.
|
||||||
|
- `pysubs2`: Prefer pysubs2 (SRT, SSA, ASS, WebVTT, TTML, SAMI, MicroDVD, MPL2, TMP). The only
|
||||||
|
pure-Python backend that reads ASS/SSA, so it is the default for styled SubStation sources.
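The rank-ordered chain with pin-then-fallback described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in `subtitle_convert.py`: the `Backend` dataclass, `resolve_chain`, `run_chain`, the rank numbers, and the codec strings are all hypothetical names and values chosen for the example.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Backend:
    """Hypothetical backend record: name, rank (lower is preferred), supported pairs."""

    name: str
    rank: int
    pairs: frozenset[tuple[str, str]]
    available: bool = True

    def can_convert(self, source: str, target: str) -> bool:
        return (source, target) in self.pairs


# Illustrative registry; ranks and pairs are assumptions, not unshackle's real matrix.
REGISTRY = [
    Backend("subtitleedit", 0, frozenset({("vtt", "srt"), ("ass", "srt")}), available=False),
    Backend("subby", 1, frozenset({("vtt", "srt")})),
    Backend("pysubs2", 2, frozenset({("vtt", "srt"), ("ass", "srt")})),
    Backend("pycaption", 3, frozenset({("vtt", "srt")})),
]


def resolve_chain(registry, source: str, target: str, pin: Optional[str] = None) -> list[Backend]:
    """Return available backends that support the pair, best rank first."""
    usable = [b for b in registry if b.available and b.can_convert(source, target)]
    chain = sorted(usable, key=lambda b: b.rank)
    if pin and pin != "auto":
        # Stable sort: the pinned backend moves to the front, the rest keep rank order.
        chain.sort(key=lambda b: b.name != pin)
    return chain


def run_chain(chain: list[Backend], attempt: Callable[[Backend], str]) -> str:
    """Try each backend in order; the first one that does not raise wins."""
    errors = []
    for backend in chain:
        try:
            return attempt(backend)
        except Exception as exc:  # a failing backend falls through to the next one
            errors.append(f"{backend.name}: {exc}")
    raise NotImplementedError("no backend succeeded: " + "; ".join(errors))


if __name__ == "__main__":
    print([b.name for b in resolve_chain(REGISTRY, "vtt", "srt", pin="pysubs2")])
```

In this sketch an unavailable pin (for example `subtitleedit` when the binary is missing) simply leaves the ranked chain unchanged, which mirrors the pin-then-fallback behaviour the text describes.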
|
||||||
|
|
||||||
|
**Styled-subtitle protection**: ASS/SSA are never *automatically* downconverted to SRT (the
|
||||||
|
conversion is skipped and the original kept) — SRT cannot carry their positioning/colours/styling.
|
||||||
|
This applies to the default muxed track only; explicit requests still convert: a per-download
|
||||||
|
`--sub-format srt` for the muxed track, or `sidecar_format: srt` for sidecars. To keep raw styled
|
||||||
|
sidecars, set `sidecar_format: original`.
|
||||||
|
|
||||||
|
**Segmented subtitles** (`fVTT`/WVTT and `fTTML`/STPP from DASH/HLS, e.g. HBO Max) are read directly
|
||||||
|
from the fragmented MP4: fVTT via subby's `WVTTConverter`, fTTML via pycaption's box parsing. They
|
||||||
|
can be converted *from* but not *to*.
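For orientation, a fragmented MP4 is a sequence of boxes, each carrying a 4-byte big-endian size and a 4-byte type ahead of its payload. The sketch below walks the top-level boxes and collects `mdat` payloads, which is the general idea behind reading segmented subtitles straight from the container. It is an illustration under that assumption, not unshackle's parser; `iter_boxes` and `mdat_payloads` are hypothetical helper names, and real wvtt samples typically nest further boxes inside `mdat` that this sketch does not decode.

```python
import struct
from typing import Iterator


def iter_boxes(data: bytes) -> Iterator[tuple[bytes, bytes]]:
    """Yield (type, payload) for each top-level ISO BMFF box in data."""
    offset = 0
    while offset + 8 <= len(data):
        size, box_type = struct.unpack_from(">I4s", data, offset)
        header = 8
        if size == 1:
            # A size of 1 means a 64-bit "largesize" field follows the type.
            if offset + 16 > len(data):
                break
            size = struct.unpack_from(">Q", data, offset + 8)[0]
            header = 16
        elif size == 0:
            # A size of 0 means the box runs to the end of the data.
            size = len(data) - offset
        if size < header:
            break  # malformed box; stop rather than loop forever
        yield box_type, data[offset + header : offset + size]
        offset += size


def mdat_payloads(data: bytes) -> list[bytes]:
    """Collect the payload of every top-level mdat box, in file order."""
    return [payload for box_type, payload in iter_boxes(data) if box_type == b"mdat"]


if __name__ == "__main__":
    def box(box_type: bytes, payload: bytes) -> bytes:
        return struct.pack(">I", 8 + len(payload)) + box_type + payload

    sample = box(b"ftyp", b"isom") + box(b"mdat", b"<tt>one</tt>") + box(b"mdat", b"<tt>two</tt>")
    print(mdat_payloads(sample))
```

For an stpp-style file each `mdat` payload would hold one TTML fragment, so the collected payloads can be handed to a TTML reader one by one.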
|
||||||
|
|
||||||
|
**SubtitleEdit on Linux/macOS**: install the SubtitleEdit 5+ CLI (`SeConv` / `seconv`, the
|
||||||
|
self-contained cross-platform build from the SubtitleEdit releases) onto `PATH` or into
|
||||||
|
`unshackle/binaries/`. unshackle targets the SubtitleEdit **5+** command syntax. The Windows
|
||||||
|
`SubtitleEdit.exe` is the GUI app — use the `SeConv` CLI binary for headless conversion.
|
||||||
|
|
||||||
- `sdh_method`: How to strip SDH cues. Default: `auto`.
|
- `sdh_method`: How to strip SDH cues. Default: `auto`.
|
||||||
- `auto`: Try subby for SRT first, then SubtitleEdit (when `conversion_method` is `auto`/`subtitleedit` and the binary is available), then subtitle-filter as the final fallback.
|
- `auto`: Try subby for SRT first, then SubtitleEdit (when `conversion_method` is `auto`/`subtitleedit` and the binary is available), then subtitle-filter as the final fallback.
|
||||||
- `subby`: Use subby's `SDHStripper`. **Only operates on SRT**; for other codecs the call returns without stripping.
|
- `subby`: Use subby's `SDHStripper`. **Only operates on SRT**; for other codecs the call returns without stripping.
|
||||||
- `subtitleedit`: Use SubtitleEdit's `/RemoveTextForHI` when the binary is available; otherwise falls through to subtitle-filter.
|
- `subtitleedit`: Use SubtitleEdit's `--remove-text-for-hi` (SE5 CLI) when the binary is available; otherwise falls through to subtitle-filter.
|
||||||
- `filter-subs`: Use the `subtitle-filter` library directly (`rm_fonts`, `rm_ast`, `rm_music`, `rm_effects`, `rm_names`, `rm_author`).
|
- `filter-subs`: Use the `subtitle-filter` library directly (`rm_fonts`, `rm_ast`, `rm_music`, `rm_effects`, `rm_names`, `rm_author`).
|
||||||
|
|
||||||
- `strip_sdh`: Enable/disable automatic SDH stripping for tracks flagged as SDH. Default: `true`.
|
- `strip_sdh`: Enable/disable automatic SDH stripping for tracks flagged as SDH. Default: `true`.
|
||||||
@@ -68,6 +95,7 @@ These behaviors are intentional and have no config knobs — they apply to every
|
|||||||
## Related
|
## Related
|
||||||
|
|
||||||
- Filename sanitization (e.g. parenthesis handling, unidecode bracket artifacts from PR #105) lives in `unshackle/core/utilities.py::sanitize_filename` and is governed by `output_template`, not the `subtitle:` config block.
|
- Filename sanitization (e.g. parenthesis handling, unidecode bracket artifacts from PR #105) lives in `unshackle/core/utilities.py::sanitize_filename` and is governed by `output_template`, not the `subtitle:` config block.
|
||||||
- Subtitle codec support and the conversion matrix are defined in `unshackle/core/tracks/subtitle.py`.
|
- Subtitle codec support is defined in `unshackle/core/tracks/subtitle.py`; the conversion backend
|
||||||
|
registry, capability matrix, and ranks live in `unshackle/core/tracks/subtitle_convert.py`.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|||||||
98
scripts/bench_subtitle_backends.py
Normal file
@@ -0,0 +1,98 @@
#!/usr/bin/env python3
"""
Benchmark subtitle conversion backends to (re-)tune the preference ranks in
``unshackle/core/tracks/subtitle_convert.py``.

Runs every backend that can read each input file, converting to a target format (default
SRT), and reports cue count, leaked ASS override tags, and output size — so you can compare
fidelity per (source, target) pair on real files. Read-only: copies inputs to a temp dir.

Usage:
    uv run python scripts/bench_subtitle_backends.py <file-or-dir> [<file-or-dir> ...] [--target SRT]

Example:
    uv run python scripts/bench_subtitle_backends.py downloads/
"""

from __future__ import annotations

import argparse
import re
import shutil
import tempfile
from pathlib import Path

from unshackle.core.tracks import subtitle_convert as sc
from unshackle.core.tracks.subtitle import Subtitle

Codec = Subtitle.Codec

EXT_TO_CODEC = {
    ".srt": Codec.SubRip,
    ".vtt": Codec.WebVTT,
    ".ass": Codec.SubStationAlphav4,
    ".ssa": Codec.SubStationAlpha,
    ".ttml": Codec.TimedTextMarkupLang,
    ".smi": Codec.SAMI,
    ".sami": Codec.SAMI,
}


def gather(paths: list[str]) -> list[Path]:
    files: list[Path] = []
    for p in paths:
        path = Path(p)
        if path.is_dir():
            files.extend(f for f in path.rglob("*") if f.suffix.lower() in EXT_TO_CODEC)
        elif path.suffix.lower() in EXT_TO_CODEC:
            files.append(path)
    return sorted(files)


def metrics(text: str) -> tuple[int, int, int]:
    cues = len(re.findall(r"-->", text))
    ass_residue = len(re.findall(r"\{\\", text))
    return cues, ass_residue, len(text)


def main() -> None:
    ap = argparse.ArgumentParser()
    ap.add_argument("paths", nargs="+", help="subtitle files or directories")
    ap.add_argument("--target", default="SRT", help="target codec value (SRT, VTT, ASS, ...)")
    args = ap.parse_args()

    target = Codec(args.target.upper())
    files = gather(args.paths)
    if not files:
        print("No subtitle files found.")
        return

    tmp = Path(tempfile.mkdtemp(prefix="subbench_"))
    print(f"{'file':40} {'source':10} {'backend':12} {'ok':3} {'cues':>5} {'resid':>5} {'bytes':>7}")
    for f in files:
        source = EXT_TO_CODEC[f.suffix.lower()]
        if source == target:
            continue
        for backend in sc.REGISTRY:
            if not (backend.is_available() and backend.can_convert(source, target)):
                continue
            work = tmp / f"{f.stem}.{backend.name}{f.suffix}"
            shutil.copy2(f, work)
            sub = Subtitle(url="x", language="en", codec=source)
            sub.path = work
            try:
                # Call the backend directly so each row reflects only that backend (no fallback).
                out = work.with_suffix(f".{target.value.lower()}")
                backend.convert(sub, target, out)
                cues, resid, size = metrics(out.read_text("utf8", errors="replace"))
                print(
                    f"{f.name[:40]:40} {source.name[:10]:10} {backend.name:12} {'Y':3} {cues:>5} {resid:>5} {size:>7}"
                )
            except Exception as e:  # noqa: BLE001 - benchmark reports failures, does not raise
                print(
                    f"{f.name[:40]:40} {source.name[:10]:10} {backend.name:12} {'N':3} {'-':>5} {'-':>5} {'-':>7} {type(e).__name__}"
                )


if __name__ == "__main__":
    main()
BIN
tests/tracks/fixtures/segmented.wvtt.mp4
Normal file
Binary file not shown.
314
tests/tracks/test_subtitle_convert.py
Normal file
@@ -0,0 +1,314 @@
|
|||||||
|
"""Tests for the data-driven subtitle conversion registry (``tracks/subtitle_convert.py``).
|
||||||
|
|
||||||
|
Covers three things the refactor must guarantee:
|
||||||
|
- the capability matrix resolves the right backend chain per (source, target) and env
|
||||||
|
(SubtitleEdit present or not),
|
||||||
|
- ``conversion_method`` pins a backend but still falls back (pin-then-fallback),
|
||||||
|
- styled SubStation (ASS/SSA) is never auto-downconverted to SRT unless explicitly forced.
|
||||||
|
|
||||||
|
Backends pysubs2/subby/pycaption are hard deps so the conversion paths run in CI without
|
||||||
|
SubtitleEdit; SubtitleEdit availability is simulated by patching ``binaries.SubtitleEdit``.
|
||||||
|
"""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import pathlib
|
||||||
|
import re
|
||||||
|
import struct
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
from unshackle.core import binaries
|
||||||
|
from unshackle.core.tracks import subtitle_convert as sc
|
||||||
|
from unshackle.core.tracks.subtitle import Subtitle
|
||||||
|
|
||||||
|
Codec = Subtitle.Codec
|
||||||
|
|
||||||
|
VTT_SAMPLE = """WEBVTT
|
||||||
|
|
||||||
|
1
|
||||||
|
00:00:01.000 --> 00:00:02.000
|
||||||
|
Hello
|
||||||
|
|
||||||
|
2
|
||||||
|
00:00:03.000 --> 00:00:04.000
|
||||||
|
World
|
||||||
|
"""
|
||||||
|
|
||||||
|
ASS_SAMPLE = """[Script Info]
|
||||||
|
ScriptType: v4.00+
|
||||||
|
|
||||||
|
[V4+ Styles]
|
||||||
|
Format: Name, Fontname, Fontsize, PrimaryColour, SecondaryColour, OutlineColour, BackColour, Bold, Italic, Underline, StrikeOut, ScaleX, ScaleY, Spacing, Angle, BorderStyle, Outline, Shadow, Alignment, MarginL, MarginR, MarginV, Encoding
|
||||||
|
Style: Default,Arial,20,&H00FFFFFF,&H000000FF,&H00000000,&H00000000,0,0,0,0,100,100,0,0,1,2,1,2,10,10,18,1
|
||||||
|
|
||||||
|
[Events]
|
||||||
|
Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text
|
||||||
|
Dialogue: 0,0:00:01.00,0:00:02.00,Default,,0,0,0,,{\\i1}Hello{\\i0}
|
||||||
|
Dialogue: 0,0:00:03.00,0:00:04.00,Default,,0,0,0,,World
|
||||||
|
"""
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.fixture(autouse=True)
|
||||||
|
def _no_subtitleedit(monkeypatch):
|
||||||
|
"""Default every test to a SubtitleEdit-less environment; tests opt in when needed."""
|
||||||
|
monkeypatch.setattr(binaries, "SubtitleEdit", None)
|
||||||
|
|
||||||
|
|
||||||
|
def make_sub(tmp_path, name: str, text: str, codec: Codec) -> Subtitle:
|
||||||
|
path = tmp_path / name
|
||||||
|
path.write_text(text, encoding="utf8")
|
||||||
|
sub = Subtitle(url="https://example.test/x", language="en", codec=codec)
|
||||||
|
sub.path = path
|
||||||
|
return sub
|
||||||
|
|
||||||
|
|
||||||
|
def cue_count(path) -> int:
|
||||||
|
return len(re.findall(r"-->", path.read_text("utf8")))
|
||||||
|
|
||||||
|
|
||||||
|
# --- capability matrix / resolver -------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
|
def test_resolve_webvtt_to_srt_order():
|
||||||
|
chain = [b.name for b in sc.resolve_backends(Codec.WebVTT, Codec.SubRip)]
|
||||||
|
assert chain == ["subby", "pysubs2", "pycaption"]
|
||||||
|
|
||||||
|
|
||||||
|
def test_resolve_ass_to_srt_only_pysubs2_without_subtitleedit():
|
||||||
|
# subby and pycaption cannot read ASS, so only pysubs2 remains.
|
||||||
|
chain = [b.name for b in sc.resolve_backends(Codec.SubStationAlphav4, Codec.SubRip)]
|
||||||
|
assert chain == ["pysubs2"]
|
||||||
|
|
||||||
|
|
||||||
|
def test_subtitleedit_ranks_first_when_available(monkeypatch):
|
||||||
|
monkeypatch.setattr(binaries, "SubtitleEdit", "/usr/bin/seconv")
|
||||||
|
chain = [b.name for b in sc.resolve_backends(Codec.WebVTT, Codec.SubRip)]
|
||||||
|
assert chain[0] == "subtitleedit"
|
||||||
|
|
||||||
|
|
||||||
|
def test_pin_then_fallback_orders_pin_first():
|
||||||
|
chain = [b.name for b in sc.resolve_backends(Codec.WebVTT, Codec.SubRip, pin="pysubs2")]
|
||||||
|
assert chain[0] == "pysubs2"
|
||||||
|
assert "subby" in chain # fallbacks remain after the pin
|
||||||
|
|
||||||
|
|
||||||
|
def test_pin_unavailable_falls_back_to_ranked_chain():
|
||||||
|
# subtitleedit pinned but not installed -> just the ranked available backends.
|
||||||
|
chain = [b.name for b in sc.resolve_backends(Codec.WebVTT, Codec.SubRip, pin="subtitleedit")]
|
||||||
|
assert chain == ["subby", "pysubs2", "pycaption"]
|
||||||
|
|
||||||
|
|
||||||
|
def test_fallback_runs_when_first_backend_fails(tmp_path, monkeypatch):
|
||||||
|
monkeypatch.setattr("unshackle.core.config.config.subtitle", {}, raising=False)
|
||||||
|
|
||||||
|
def boom(self, source, src, target, out):
|
||||||
|
raise RuntimeError("backend exploded")
|
||||||
|
|
||||||
|
# WebVTT->SRT chain is [subby, pysubs2, pycaption]; kill subby, expect pysubs2 to finish.
|
||||||
|
monkeypatch.setattr(sc.SubbyBackend, "convert", boom)
|
||||||
|
sub = make_sub(tmp_path, "x.vtt", VTT_SAMPLE, Codec.WebVTT)
|
||||||
|
out = sub.convert(Codec.SubRip, forced=True)
|
||||||
|
assert sub.codec == Codec.SubRip
|
||||||
|
assert cue_count(out) == 2
|
||||||
|
|
||||||
|
|
||||||
|
def test_no_backend_for_unsupported_target_raises(tmp_path):
|
||||||
|
sub = make_sub(tmp_path, "x.ass", ASS_SAMPLE, Codec.SubStationAlphav4)
|
||||||
|
with pytest.raises(NotImplementedError):
|
||||||
|
sub.convert(Codec.fVTT, forced=True) # no backend writes segmented fVTT
|
||||||
|
|
||||||
|
|
||||||
|
# --- styled-ASS protection --------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
|
def test_ass_to_srt_kept_as_is_when_not_forced(tmp_path, monkeypatch):
|
||||||
|
monkeypatch.setattr("unshackle.core.config.config.subtitle", {}, raising=False)
|
||||||
|
sub = make_sub(tmp_path, "x.ass", ASS_SAMPLE, Codec.SubStationAlphav4)
|
||||||
|
out = sub.convert(Codec.SubRip, forced=False)
|
||||||
|
assert sub.codec == Codec.SubStationAlphav4 # unchanged
|
||||||
|
assert out == sub.path
|
||||||
|
assert out.suffix == ".ass"
|
||||||
|
|
||||||
|
|
||||||
|
def test_ass_to_srt_converts_when_forced(tmp_path, monkeypatch):
|
||||||
|
monkeypatch.setattr("unshackle.core.config.config.subtitle", {}, raising=False)
|
||||||
|
sub = make_sub(tmp_path, "x.ass", ASS_SAMPLE, Codec.SubStationAlphav4)
|
||||||
|
out = sub.convert(Codec.SubRip, forced=True)
|
||||||
|
assert sub.codec == Codec.SubRip
|
||||||
|
assert out.suffix == ".srt"
|
||||||
|
assert cue_count(out) == 2
|
||||||
|
assert "{\\" not in out.read_text("utf8") # override tags stripped
|
||||||
|
|
||||||
|
|
||||||
|
# --- conversion paths -------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
|
def test_webvtt_to_srt_conversion(tmp_path, monkeypatch):
|
||||||
|
monkeypatch.setattr("unshackle.core.config.config.subtitle", {}, raising=False)
|
||||||
|
sub = make_sub(tmp_path, "x.vtt", VTT_SAMPLE, Codec.WebVTT)
|
||||||
|
out = sub.convert(Codec.SubRip, forced=True)
|
||||||
|
assert sub.codec == Codec.SubRip
|
||||||
|
assert cue_count(out) == 2
|
||||||
|
|
||||||
|
|
||||||
|
def test_same_codec_is_noop(tmp_path):
|
||||||
|
sub = make_sub(tmp_path, "x.srt", "1\n00:00:01,000 --> 00:00:02,000\nHi\n", Codec.SubRip)
|
||||||
|
assert sub.convert(Codec.SubRip) == sub.path
|
||||||
|
assert sub.codec == Codec.SubRip
|
||||||
|
|
||||||
|
|
||||||
|
# --- ASS/SSA font detection ------------------------------------------------------------
|
||||||
|
|
||||||
|
FONT_ASS = """[Script Info]
|
||||||
|
ScriptType: v4.00+
|
||||||
|
|
||||||
|
[V4+ Styles]
|
||||||
|
Format: Name, Fontname, Fontsize, PrimaryColour, Bold, Italic, Alignment, MarginV, Encoding
|
||||||
|
Style: Default,Trebuchet MS,24,&H00FFFFFF,0,0,2,18,1
|
||||||
|
Style: sign,@Arial Unicode MS,20,&H00FFFFFF,0,0,8,10,1
|
||||||
|
|
||||||
|
[Events]
|
||||||
|
Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text
|
||||||
|
Dialogue: 0,0:00:01.00,0:00:02.00,Default,,0,0,0,,{\\fnTimes New Roman}A sign
|
||||||
|
Dialogue: 0,0:00:03.00,0:00:04.00,Default,,0,0,0,,{\\fntimes new roman}lower case
|
||||||
|
Dialogue: 0,0:00:05.00,0:00:06.00,Default,,0,0,0,,{\\fnGeorgia\\b1}bold note
|
||||||
|
"""
|
||||||
|
|
||||||
|
|
||||||
|
def test_extract_fonts_styles_and_inline_overrides():
|
||||||
|
fonts = Subtitle.extract_fonts(FONT_ASS)
|
||||||
|
# Style fontnames (column located via Format line, @-prefix stripped) + inline \fn overrides
|
||||||
|
assert fonts == {"Trebuchet MS", "Arial Unicode MS", "Times New Roman", "Georgia"}
|
||||||
|
# case-insensitive de-dup keeps the mixed-case spelling, not "times new roman"
|
||||||
|
assert "times new roman" not in fonts
|
||||||
|
|
||||||
|
|
||||||
|
def test_extract_fonts_handles_non_default_column_order():
|
||||||
|
ass = (
|
||||||
|
"[V4+ Styles]\n"
|
||||||
|
"Format: Name, Fontsize, Fontname, Bold\n" # Fontname not in the usual position
|
||||||
|
"Style: Main,28,Verdana,0\n"
|
||||||
|
)
|
||||||
|
assert Subtitle.extract_fonts(ass) == {"Verdana"}
|
||||||
|
|
||||||
|
|
||||||
|
# --- non-Latin scripts (RTL / CJK) preserved through conversion ------------------------
|
||||||
|
|
||||||
|
CJK_RTL_VTT = """WEBVTT
|
||||||
|
|
||||||
|
1
|
||||||
|
00:00:01.000 --> 00:00:02.000
|
||||||
|
مرحبا بالعالم
|
||||||
|
|
||||||
|
2
|
||||||
|
00:00:03.000 --> 00:00:04.000
|
||||||
|
안녕하세요
|
||||||
|
|
||||||
|
3
|
||||||
|
00:00:05.000 --> 00:00:06.000
|
||||||
|
你好世界
|
||||||
|
"""
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize(
|
||||||
|
"pattern",
|
||||||
|
[r"[\u0600-\u06ff]", r"[가-힣]", r"[一-鿿]"], # Arabic, Hangul, CJK
|
||||||
|
)
|
||||||
|
def test_non_latin_scripts_survive_vtt_to_srt(tmp_path, monkeypatch, pattern):
|
||||||
|
monkeypatch.setattr("unshackle.core.config.config.subtitle", {}, raising=False)
|
||||||
|
sub = make_sub(tmp_path, "x.vtt", CJK_RTL_VTT, Codec.WebVTT)
|
||||||
|
out = sub.convert(Codec.SubRip, forced=True)
|
||||||
|
text = out.read_text("utf8")
|
||||||
|
assert cue_count(out) == 3
|
||||||
|
assert re.search(pattern, text) # script survived the round-trip, no mojibake
|
||||||
|
|
||||||
|
|
||||||
|
# --- SDH stripping ----------------------------------------------------------------------
|
||||||
|
|
||||||
|
SDH_SRT = """1
|
||||||
|
00:00:01,000 --> 00:00:02,000
|
||||||
|
[door creaks]
|
||||||
|
|
||||||
|
2
|
||||||
|
00:00:03,000 --> 00:00:04,000
|
||||||
|
Hello there.
|
||||||
|
|
||||||
|
3
|
||||||
|
00:00:05,000 --> 00:00:06,000
|
||||||
|
♪ upbeat music ♪
|
||||||
|
"""
|
||||||
|
|
||||||
|
|
||||||
|
def test_sdh_stripping_removes_effects_keeps_dialogue(tmp_path, monkeypatch):
|
||||||
|
# subby's SDHStripper runs on SRT without SubtitleEdit installed.
|
||||||
|
monkeypatch.setattr("unshackle.core.config.config.subtitle", {"sdh_method": "subby"}, raising=False)
|
||||||
|
sub = make_sub(tmp_path, "x.srt", SDH_SRT, Codec.SubRip)
|
||||||
|
sub.strip_hearing_impaired()
|
||||||
|
out = sub.path.read_text("utf8")
|
||||||
|
assert "Hello there." in out # real dialogue kept
|
||||||
|
assert "door creaks" not in out # bracketed effect removed (subby SDHStripper)
|
||||||
|
|
||||||
|
|
||||||
|
# --- segmented (box-encapsulated) formats: fVTT (wvtt) / fTTML (stpp) --------------------
|
||||||
|
# These ship from DASH/HLS as fragmented MP4 (e.g. HBO Max). The downloader concatenates
|
||||||
|
# init + media segments into one file; parse() reads the MP4 boxes directly.
|
||||||
|
|
||||||
|
FIXTURES = pathlib.Path(__file__).parent / "fixtures"
|
||||||
|
|
||||||
|
|
||||||
|
def caption_total(caption_set) -> int:
|
||||||
|
return sum(len(caption_set.get_captions(lang)) for lang in caption_set.get_languages())
|
||||||
|
|
||||||
|
|
||||||
|
def build_stpp_mp4(*ttml_fragments: str) -> bytes:
|
||||||
|
"""A minimal stpp-style MP4: ftyp + one mdat per TTML fragment (what fTTML.parse reads)."""
|
||||||
|
|
||||||
|
def box(box_type: bytes, payload: bytes) -> bytes:
|
||||||
|
return struct.pack(">I", 8 + len(payload)) + box_type + payload
|
||||||
|
|
||||||
|
data = box(b"ftyp", b"isom" + struct.pack(">I", 0) + b"isomiso6")
|
||||||
|
for frag in ttml_fragments:
|
||||||
|
data += box(b"mdat", frag.encode("utf8"))
|
||||||
|
return data
|
||||||
|
|
||||||
|
|
||||||
|
def test_segmented_fvtt_parses_and_converts(tmp_path, monkeypatch):
|
||||||
|
monkeypatch.setattr("unshackle.core.config.config.subtitle", {}, raising=False)
|
||||||
|
data = (FIXTURES / "segmented.wvtt.mp4").read_bytes()
|
||||||
|
|
||||||
|
caption_set = Subtitle.parse(data, Codec.fVTT)
|
||||||
|
assert caption_total(caption_set) == 2
|
||||||
|
|
||||||
|
seg = tmp_path / "seg.wvtt"
|
||||||
|
seg.write_bytes(data)
|
||||||
|
sub = Subtitle(url="https://example.test/x", language="en", codec=Codec.fVTT)
|
||||||
|
sub.path = seg
|
||||||
|
# download() converts fVTT -> WebVTT (not "forced"); chain is subby then pycaption.
|
||||||
|
out = sub.convert(Codec.WebVTT)
|
||||||
|
assert sub.codec == Codec.WebVTT
|
||||||
|
assert cue_count(out) == 2
|
||||||
|
|
||||||
|
|
||||||
|
def test_segmented_fttml_parses_and_converts(tmp_path, monkeypatch):
|
||||||
|
monkeypatch.setattr("unshackle.core.config.config.subtitle", {}, raising=False)
|
||||||
|
frag = (
|
||||||
|
'<?xml version="1.0" encoding="utf-8"?>'
|
||||||
|
'<tt xmlns="http://www.w3.org/ns/ttml" xml:lang="en"><body><div>'
|
||||||
|
'<p begin="00:00:0{a}.000" end="00:00:0{b}.000">Line {a}</p>'
|
||||||
|
"</div></body></tt>"
|
||||||
|
)
|
||||||
|
data = build_stpp_mp4(frag.format(a=1, b=2), frag.format(a=3, b=4))
|
||||||
|
|
||||||
|
caption_set = Subtitle.parse(data, Codec.fTTML)
|
||||||
|
assert caption_total(caption_set) == 2
|
||||||
|
|
||||||
|
seg = tmp_path / "seg.stpp"
|
||||||
|
seg.write_bytes(data)
|
||||||
|
sub = Subtitle(url="https://example.test/x", language="en", codec=Codec.fTTML)
|
||||||
|
sub.path = seg
|
||||||
|
# download() converts fTTML -> TTML (only pycaption can read fTTML); then -> SRT.
|
||||||
|
sub.convert(Codec.TimedTextMarkupLang)
|
||||||
|
assert sub.codec == Codec.TimedTextMarkupLang
|
||||||
|
out = sub.convert(Codec.SubRip, forced=True)
|
||||||
|
assert cue_count(out) == 2
|
||||||
@@ -311,7 +311,7 @@ class dl:
|
|||||||
)
|
)
|
||||||
temp_sub.path = temp_path
|
temp_sub.path = temp_path
|
||||||
try:
|
try:
|
||||||
temp_sub.convert(target_codec)
|
temp_sub.convert(target_codec, forced=True)
|
||||||
if temp_sub.path and temp_sub.path.exists():
|
if temp_sub.path and temp_sub.path.exists():
|
||||||
shutil.copy2(temp_sub.path, sidecar_path)
|
shutil.copy2(temp_sub.path, sidecar_path)
|
||||||
finally:
|
finally:
|
||||||
@@ -528,9 +528,13 @@ class dl:
|
|||||||
@click.option("-S", "--subs-only", is_flag=True, default=False, help="Only download subtitle tracks.")
|
@click.option("-S", "--subs-only", is_flag=True, default=False, help="Only download subtitle tracks.")
|
||||||
@click.option("-C", "--chapters-only", is_flag=True, default=False, help="Only download chapter markers.")
|
@click.option("-C", "--chapters-only", is_flag=True, default=False, help="Only download chapter markers.")
|
||||||
@click.option("-ns", "--no-subs", is_flag=True, default=False, help="Do not download subtitle tracks.")
|
@click.option("-ns", "--no-subs", is_flag=True, default=False, help="Do not download subtitle tracks.")
|
||||||
@click.option("--skip-subtitle-errors", is_flag=True, default=False,
|
@click.option(
|
||||||
|
"--skip-subtitle-errors",
|
||||||
|
is_flag=True,
|
||||||
|
default=False,
|
||||||
help="If a subtitle track fails to download, skip it and continue instead of "
|
help="If a subtitle track fails to download, skip it and continue instead of "
|
||||||
"aborting the whole title (video/audio failures stay fatal).")
|
"aborting the whole title (video/audio failures stay fatal).",
|
||||||
|
)
|
||||||
@click.option("-na", "--no-audio", is_flag=True, default=False, help="Do not download audio tracks.")
|
@click.option("-na", "--no-audio", is_flag=True, default=False, help="Do not download audio tracks.")
|
||||||
@click.option("-nc", "--no-chapters", is_flag=True, default=False, help="Do not download chapter markers.")
|
@click.option("-nc", "--no-chapters", is_flag=True, default=False, help="Do not download chapter markers.")
|
||||||
@click.option("-nv", "--no-video", is_flag=True, default=False, help="Do not download video tracks.")
|
@click.option("-nv", "--no-video", is_flag=True, default=False, help="Do not download video tracks.")
|
||||||
@@ -2367,18 +2371,16 @@ class dl:
|
|||||||
for subtitle in title.tracks.subtitles:
|
for subtitle in title.tracks.subtitles:
|
||||||
if sub_format:
|
if sub_format:
|
||||||
if subtitle.codec != sub_format:
|
if subtitle.codec != sub_format:
|
||||||
subtitle.convert(sub_format)
|
subtitle.convert(sub_format, forced=True)
|
||||||
elif subtitle.codec == Subtitle.Codec.TimedTextMarkupLang:
|
elif subtitle.codec == Subtitle.Codec.TimedTextMarkupLang:
|
||||||
# MKV does not support TTML, VTT is the next best option
|
# MKV does not support TTML, VTT is the next best option
|
||||||
subtitle.convert(Subtitle.Codec.WebVTT)
|
subtitle.convert(Subtitle.Codec.WebVTT)
|
||||||
|
|
||||||
with console.status("Checking Subtitles for Fonts..."):
|
with console.status("Checking Subtitles for Fonts..."):
|
||||||
font_names = []
|
font_names: list[str] = []
|
||||||
for subtitle in title.tracks.subtitles:
|
for subtitle in title.tracks.subtitles:
|
||||||
if subtitle.codec == Subtitle.Codec.SubStationAlphav4:
|
if subtitle.codec in (Subtitle.Codec.SubStationAlpha, Subtitle.Codec.SubStationAlphav4):
|
||||||
for line in subtitle.path.read_text("utf8").splitlines():
|
font_names.extend(Subtitle.extract_fonts(subtitle.path.read_text("utf8")))
|
||||||
if line.startswith("Style: "):
|
|
||||||
font_names.append(line.removeprefix("Style: ").split(",")[1].strip())
|
|
||||||
|
|
||||||
font_count, missing_fonts = self.attach_subtitle_fonts(font_names, title, temp_font_files)
|
font_count, missing_fonts = self.attach_subtitle_fonts(font_names, title, temp_font_files)
|
||||||
|
|
||||||
|
|||||||
@@ -37,7 +37,7 @@ def find(*names: str) -> Optional[Path]:
|
|||||||
FFMPEG = find("ffmpeg")
|
FFMPEG = find("ffmpeg")
|
||||||
FFProbe = find("ffprobe")
|
FFProbe = find("ffprobe")
|
||||||
FFPlay = find("ffplay")
|
FFPlay = find("ffplay")
|
||||||
SubtitleEdit = find("SubtitleEdit")
|
SubtitleEdit = find("SubtitleEdit", "seconv") # seconv = cross-platform subtitleedit-cli (.NET 8)
|
||||||
ShakaPackager = find(
|
ShakaPackager = find(
|
||||||
"shaka-packager",
|
"shaka-packager",
|
||||||
"packager",
|
"packager",
|
||||||
|
|||||||
@@ -11,13 +11,12 @@ from pathlib import Path
|
|||||||
from typing import Any, Callable, Iterable, Optional, Union
|
from typing import Any, Callable, Iterable, Optional, Union
|
||||||
|
|
||||||
import pycaption
|
import pycaption
|
||||||
import pysubs2
|
|
||||||
import requests
|
import requests
|
||||||
from construct import Container
|
from construct import Container
|
||||||
from pycaption import Caption, CaptionList, CaptionNode, WebVTTReader
|
from pycaption import Caption, CaptionList, CaptionNode, WebVTTReader
|
||||||
from pycaption.geometry import Layout
|
from pycaption.geometry import Layout
|
||||||
from pymp4.parser import MP4
|
from pymp4.parser import MP4
|
||||||
from subby import CommonIssuesFixer, SAMIConverter, SDHStripper, WebVTTConverter, WVTTConverter
|
from subby import CommonIssuesFixer, SAMIConverter, SDHStripper
|
||||||
from subtitle_filter import Subtitles
|
from subtitle_filter import Subtitles
|
||||||
|
|
||||||
from unshackle.core import binaries
|
from unshackle.core import binaries
|
||||||
@@ -600,306 +599,73 @@ class Subtitle(Track):
|
|||||||
|
|
||||||
return "\n".join(sanitized_lines)
|
return "\n".join(sanitized_lines)
|
||||||
|
|
||||||
def convert_with_subby(self, codec: Subtitle.Codec) -> Path:
|
def convert(self, codec: Subtitle.Codec, *, forced: bool = False) -> Path:
|
||||||
"""
|
"""
|
||||||
Convert subtitle using subby library for better format support and processing.
|
Convert this Subtitle to another format.
|
||||||
|
|
||||||
This method leverages subby's advanced subtitle processing capabilities
|
Backend selection is data-driven (see ``tracks/subtitle_convert.py``): the best
|
||||||
including better WebVTT handling, SDH stripping, and common issue fixing.
|
available backend that supports source->target is used, falling back through the
|
||||||
|
capability chain on failure. The backend can be pinned via the ``conversion_method``
|
||||||
|
config key (``auto`` | ``subby`` | ``pysubs2`` | ``subtitleedit`` | ``pycaption``),
|
||||||
|
or nudged per-service via ``preferred_conversion_method``; an explicit config value
|
||||||
|
always wins.
|
||||||
|
|
||||||
|
``forced`` marks an explicit user request (``--sub-format``). Lossy downconverts of
|
||||||
|
styled formats (SSA/ASS -> SRT) are skipped unless ``forced`` is True.
|
||||||
"""
|
"""
|
||||||
|
from unshackle.core.tracks.subtitle_convert import run_conversion
|
||||||
|
|
||||||
if not self.path or not self.path.exists():
|
if not self.path or not self.path.exists():
|
||||||
raise ValueError("You must download the subtitle track first.")
|
raise ValueError("You must download the subtitle track first.")
|
||||||
|
|
||||||
if self.codec == codec:
|
method = (
|
||||||
return self.path
|
config.subtitle.get("conversion_method") or getattr(self, "preferred_conversion_method", None) or "auto"
|
||||||
|
)
|
||||||
|
pin = None if method == "auto" else method
|
||||||
|
return run_conversion(self, codec, pin=pin, forced=forced)
|
||||||
|
|
||||||
output_path = self.path.with_suffix(f".{codec.value.lower()}")
|
@staticmethod
|
||||||
original_path = self.path
|
def extract_fonts(text: str) -> set[str]:
|
||||||
|
|
||||||
try:
|
|
||||||
# Convert to SRT using subby first
|
|
||||||
srt_subtitles = None
|
|
||||||
|
|
||||||
if self.codec == Subtitle.Codec.WebVTT:
|
|
||||||
converter = WebVTTConverter()
|
|
||||||
srt_subtitles = converter.from_file(self.path)
|
|
||||||
if self.codec == Subtitle.Codec.fVTT:
|
|
||||||
converter = WVTTConverter()
|
|
||||||
srt_subtitles = converter.from_file(self.path)
|
|
||||||
elif self.codec == Subtitle.Codec.SAMI:
|
|
||||||
converter = SAMIConverter()
|
|
||||||
srt_subtitles = converter.from_file(self.path)
|
|
||||||
|
|
||||||
if srt_subtitles is not None:
|
|
||||||
# Apply common fixes
|
|
||||||
fixer = CommonIssuesFixer()
|
|
||||||
fixed_srt, _ = fixer.from_srt(srt_subtitles)
|
|
||||||
|
|
||||||
# If target is SRT, we're done
|
|
||||||
if codec == Subtitle.Codec.SubRip:
|
|
||||||
fixed_srt.save(output_path, encoding="utf8")
|
|
||||||
else:
|
|
||||||
# Convert from SRT to target format using existing pycaption logic
|
|
||||||
temp_srt_path = self.path.with_suffix(".temp.srt")
|
|
||||||
fixed_srt.save(temp_srt_path, encoding="utf8")
|
|
||||||
|
|
||||||
# Parse the SRT and convert to target format
|
|
||||||
caption_set = self.parse(temp_srt_path.read_bytes(), Subtitle.Codec.SubRip)
|
|
||||||
self.merge_same_cues(caption_set)
|
|
||||||
|
|
||||||
writer = {
|
|
||||||
Subtitle.Codec.TimedTextMarkupLang: pycaption.DFXPWriter,
|
|
||||||
Subtitle.Codec.WebVTT: pycaption.WebVTTWriter,
|
|
||||||
}.get(codec)
|
|
||||||
|
|
||||||
if writer:
|
|
||||||
subtitle_text = writer().write(caption_set)
|
|
||||||
output_path.write_text(subtitle_text, encoding="utf8")
|
|
||||||
else:
|
|
||||||
# Fall back to existing conversion method
|
|
||||||
temp_srt_path.unlink()
|
|
||||||
return self._convert_standard(codec)
|
|
||||||
|
|
||||||
temp_srt_path.unlink()
|
|
||||||
|
|
||||||
if original_path.exists() and original_path != output_path:
|
|
||||||
original_path.unlink()
|
|
||||||
|
|
||||||
self.path = output_path
|
|
||||||
self.codec = codec
|
|
||||||
|
|
||||||
if callable(self.OnConverted):
|
|
||||||
self.OnConverted(codec)
|
|
||||||
|
|
||||||
return output_path
|
|
||||||
else:
|
|
||||||
# Fall back to existing conversion method
|
|
||||||
return self._convert_standard(codec)
|
|
||||||
|
|
||||||
except Exception:
|
|
||||||
# Fall back to existing conversion method on any error
|
|
||||||
return self._convert_standard(codec)
|
|
||||||
|
|
||||||
def convert_with_pysubs2(self, codec: Subtitle.Codec) -> Path:
|
|
||||||
"""
|
"""
|
||||||
Convert subtitle using pysubs2 library for broad format support.
|
Font names referenced by an ASS/SSA subtitle.
|
||||||
|
|
||||||
pysubs2 is a pure-Python library supporting SubRip (SRT), SubStation Alpha
|
Covers both sources that need attaching for correct rendering:
|
||||||
(SSA/ASS), WebVTT, TTML, SAMI, MicroDVD, MPL2, and TMP formats.
|
- the ``Fontname`` column of every ``Style:`` line in ``[V4+ Styles]``/``[V4 Styles]``
|
||||||
|
(column located from the section's ``Format:`` line, not assumed by index), and
|
||||||
|
- inline ``\\fn`` font overrides inside ``Dialogue`` override blocks.
|
||||||
|
|
||||||
|
Leading ``@`` (vertical-writing prefix) is stripped and names are de-duplicated
|
||||||
|
case-insensitively, preferring a mixed-case spelling over an all-lowercase one.
|
||||||
"""
|
"""
|
||||||
if not self.path or not self.path.exists():
|
names: set[str] = set()
|
||||||
raise ValueError("You must download the subtitle track first.")
|
name_index = 1 # ASS default Style order: Name, Fontname, ...
|
||||||
|
in_styles = False
|
||||||
|
for line in text.splitlines():
|
||||||
|
stripped = line.strip()
|
||||||
|
if stripped.startswith("["):
|
||||||
|
in_styles = stripped.lower() in ("[v4+ styles]", "[v4 styles]")
|
||||||
|
continue
|
||||||
|
if not in_styles:
|
||||||
|
continue
|
||||||
|
if stripped.lower().startswith("format:"):
|
||||||
|
columns = [c.strip().lower() for c in stripped.split(":", 1)[1].split(",")]
|
||||||
|
if "fontname" in columns:
|
||||||
|
name_index = columns.index("fontname")
|
||||||
|
elif stripped.lower().startswith("style:"):
|
||||||
|
fields = stripped.split(":", 1)[1].split(",")
|
||||||
|
if len(fields) > name_index:
|
||||||
|
names.add(fields[name_index].strip())
|
||||||
|
|
||||||
if self.codec == codec:
|
names.update(match.strip() for match in re.findall(r"\\fn([^\\}]+)", text))
|
||||||
return self.path
|
|
||||||
|
|
||||||
output_path = self.path.with_suffix(f".{codec.value.lower()}")
|
canonical: dict[str, str] = {}
|
||||||
original_path = self.path
|
for name in (raw.lstrip("@").strip() for raw in names):
|
||||||
|
if not name:
|
||||||
codec_to_pysubs2_format = {
|
continue
|
||||||
Subtitle.Codec.SubRip: "srt",
|
key = name.lower()
|
||||||
Subtitle.Codec.SubStationAlpha: "ssa",
|
if key not in canonical or (name != name.lower() and canonical[key] == canonical[key].lower()):
|
||||||
Subtitle.Codec.SubStationAlphav4: "ass",
|
canonical[key] = name
|
||||||
Subtitle.Codec.WebVTT: "vtt",
|
return set(canonical.values())
|
||||||
Subtitle.Codec.TimedTextMarkupLang: "ttml",
|
|
||||||
Subtitle.Codec.SAMI: "sami",
|
|
||||||
Subtitle.Codec.MicroDVD: "microdvd",
|
|
||||||
Subtitle.Codec.MPL2: "mpl2",
|
|
||||||
Subtitle.Codec.TMP: "tmp",
|
|
||||||
}
|
|
||||||
|
|
||||||
pysubs2_output_format = codec_to_pysubs2_format.get(codec)
|
|
||||||
if pysubs2_output_format is None:
|
|
||||||
return self._convert_standard(codec)
|
|
||||||
|
|
||||||
try:
|
|
||||||
subs = pysubs2.load(str(self.path), encoding="utf-8")
|
|
||||||
|
|
||||||
subs.save(str(output_path), format_=pysubs2_output_format, encoding="utf-8")
|
|
||||||
|
|
||||||
if original_path.exists() and original_path != output_path:
|
|
||||||
original_path.unlink()
|
|
||||||
|
|
||||||
self.path = output_path
|
|
||||||
self.codec = codec
|
|
||||||
|
|
||||||
if callable(self.OnConverted):
|
|
||||||
self.OnConverted(codec)
|
|
||||||
|
|
||||||
return output_path
|
|
||||||
|
|
||||||
except Exception:
|
|
||||||
return self._convert_standard(codec)
|
|
||||||
|
|
||||||
def convert(self, codec: Subtitle.Codec) -> Path:
|
|
||||||
"""
|
|
||||||
Convert this Subtitle to another Format.
|
|
||||||
|
|
||||||
The conversion method is determined by the 'conversion_method' setting in config:
|
|
||||||
- 'auto' (default): Uses subby for WebVTT/fVTT/SAMI; for SSA/ASS/MicroDVD/MPL2/TMP
|
|
||||||
uses SubtitleEdit if available, otherwise pysubs2; standard for others
|
|
||||||
- 'subby': Always uses subby with CommonIssuesFixer
|
|
||||||
- 'subtitleedit': Uses SubtitleEdit when available, falls back to pycaption
|
|
||||||
- 'pycaption': Uses only pycaption library
|
|
||||||
- 'pysubs2': Uses pysubs2 library
|
|
||||||
"""
|
|
||||||
# Check configuration for conversion method
|
|
||||||
conversion_method = config.subtitle.get("conversion_method", "auto")
|
|
||||||
|
|
||||||
if conversion_method == "subby":
|
|
||||||
return self.convert_with_subby(codec)
|
|
||||||
elif conversion_method == "subtitleedit":
|
|
||||||
return self._convert_standard(codec)
|
|
||||||
elif conversion_method == "pycaption":
|
|
||||||
return self._convert_pycaption_only(codec)
|
|
||||||
elif conversion_method == "pysubs2":
|
|
||||||
return self.convert_with_pysubs2(codec)
|
|
||||||
elif conversion_method == "auto":
|
|
||||||
if self.codec in (Subtitle.Codec.WebVTT, Subtitle.Codec.fVTT, Subtitle.Codec.SAMI):
|
|
||||||
return self.convert_with_subby(codec)
|
|
||||||
elif self.codec in (
|
|
||||||
Subtitle.Codec.SubStationAlpha,
|
|
||||||
Subtitle.Codec.SubStationAlphav4,
|
|
||||||
Subtitle.Codec.MicroDVD,
|
|
||||||
Subtitle.Codec.MPL2,
|
|
||||||
Subtitle.Codec.TMP,
|
|
||||||
):
|
|
||||||
if binaries.SubtitleEdit:
|
|
||||||
return self._convert_standard(codec)
|
|
||||||
else:
|
|
||||||
return self.convert_with_pysubs2(codec)
|
|
||||||
else:
|
|
||||||
return self._convert_standard(codec)
|
|
||||||
else:
|
|
||||||
return self._convert_standard(codec)
|
|
||||||
|
|
||||||
def _convert_pycaption_only(self, codec: Subtitle.Codec) -> Path:
|
|
||||||
"""
|
|
||||||
Convert subtitle using only pycaption library (no SubtitleEdit, no subby).
|
|
||||||
|
|
||||||
This is the original conversion method that only uses pycaption.
|
|
||||||
"""
|
|
||||||
if not self.path or not self.path.exists():
|
|
||||||
raise ValueError("You must download the subtitle track first.")
|
|
||||||
|
|
||||||
if self.codec == codec:
|
|
||||||
return self.path
|
|
||||||
|
|
||||||
output_path = self.path.with_suffix(f".{codec.value.lower()}")
|
|
||||||
original_path = self.path
|
|
||||||
|
|
||||||
# Use only pycaption for conversion
|
|
||||||
writer = {
|
|
||||||
Subtitle.Codec.SubRip: pycaption.SRTWriter,
|
|
||||||
Subtitle.Codec.TimedTextMarkupLang: pycaption.DFXPWriter,
|
|
||||||
Subtitle.Codec.WebVTT: pycaption.WebVTTWriter,
|
|
||||||
}.get(codec)
|
|
||||||
|
|
||||||
if writer is None:
|
|
||||||
raise NotImplementedError(f"Cannot convert {self.codec.name} to {codec.name} using pycaption only.")
|
|
||||||
|
|
||||||
caption_set = self.parse(self.path.read_bytes(), self.codec)
|
|
||||||
Subtitle.merge_same_cues(caption_set)
|
|
||||||
if codec == Subtitle.Codec.WebVTT:
|
|
||||||
Subtitle.filter_unwanted_cues(caption_set)
|
|
||||||
subtitle_text = writer().write(caption_set)
|
|
||||||
|
|
||||||
output_path.write_text(subtitle_text, encoding="utf8")
|
|
||||||
|
|
||||||
if original_path.exists() and original_path != output_path:
|
|
||||||
original_path.unlink()
|
|
||||||
|
|
||||||
self.path = output_path
|
|
||||||
self.codec = codec
|
|
||||||
|
|
||||||
if callable(self.OnConverted):
|
|
||||||
self.OnConverted(codec)
|
|
||||||
|
|
||||||
return output_path
|
|
||||||
|
|
||||||
def _convert_standard(self, codec: Subtitle.Codec) -> Path:
|
|
||||||
"""
|
|
||||||
Convert this Subtitle to another Format.
|
|
||||||
|
|
||||||
The file path location of the Subtitle data will be kept at the same
|
|
||||||
location but the file extension will be changed appropriately.
|
|
||||||
|
|
||||||
Supported formats:
|
|
||||||
- SubRip - SubtitleEdit or pycaption.SRTWriter
|
|
||||||
- TimedTextMarkupLang - SubtitleEdit or pycaption.DFXPWriter
|
|
||||||
- WebVTT - SubtitleEdit or pycaption.WebVTTWriter
|
|
||||||
- SubStationAlphav4 - SubtitleEdit
|
|
||||||
- SAMI - subby.SAMIConverter (when available)
|
|
||||||
- fTTML* - custom code using some pycaption functions
|
|
||||||
- fVTT* - custom code using some pycaption functions
|
|
||||||
*: Can read from format, but cannot convert to format
|
|
||||||
|
|
||||||
Note: It currently prioritizes using SubtitleEdit over PyCaption as
|
|
||||||
I have personally noticed more oddities with PyCaption parsing over
|
|
||||||
SubtitleEdit. Especially when working with TTML/DFXP where it would
|
|
||||||
often have timecodes and stuff mixed in/duplicated.
|
|
||||||
|
|
||||||
Returns the new file path of the Subtitle.
|
|
||||||
"""
|
|
||||||
if not self.path or not self.path.exists():
|
|
||||||
raise ValueError("You must download the subtitle track first.")
|
|
||||||
|
|
||||||
if self.codec == codec:
|
|
||||||
return self.path
|
|
||||||
|
|
||||||
output_path = self.path.with_suffix(f".{codec.value.lower()}")
|
|
||||||
original_path = self.path
|
|
||||||
|
|
||||||
if binaries.SubtitleEdit and self.codec not in (Subtitle.Codec.fTTML, Subtitle.Codec.fVTT):
|
|
||||||
sub_edit_format = {
|
|
||||||
Subtitle.Codec.SubRip: "subrip",
|
|
||||||
Subtitle.Codec.SubStationAlpha: "substationalpha",
|
|
||||||
Subtitle.Codec.SubStationAlphav4: "advancedsubstationalpha",
|
|
||||||
Subtitle.Codec.TimedTextMarkupLang: "timedtext1.0",
|
|
||||||
Subtitle.Codec.WebVTT: "webvtt",
|
|
||||||
Subtitle.Codec.SAMI: "sami",
|
|
||||||
Subtitle.Codec.MicroDVD: "microdvd",
|
|
||||||
}.get(codec, codec.name.lower())
|
|
||||||
sub_edit_args = [
|
|
||||||
str(binaries.SubtitleEdit),
|
|
||||||
"/convert",
|
|
||||||
str(self.path),
|
|
||||||
sub_edit_format,
|
|
||||||
f"/outputfilename:{output_path.name}",
|
|
||||||
"/encoding:utf8",
|
|
||||||
]
|
|
||||||
if codec == Subtitle.Codec.SubRip:
|
|
||||||
sub_edit_args.append("/ConvertColorsToDialog")
|
|
||||||
subprocess.run(sub_edit_args, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
|
|
||||||
else:
|
|
||||||
writer = {
|
|
||||||
# pycaption generally only supports these subtitle formats
|
|
||||||
Subtitle.Codec.SubRip: pycaption.SRTWriter,
|
|
||||||
Subtitle.Codec.TimedTextMarkupLang: pycaption.DFXPWriter,
|
|
||||||
Subtitle.Codec.WebVTT: pycaption.WebVTTWriter,
|
|
||||||
}.get(codec)
|
|
||||||
if writer is None:
|
|
||||||
raise NotImplementedError(f"Cannot yet convert {self.codec.name} to {codec.name}.")
|
|
||||||
|
|
||||||
caption_set = self.parse(self.path.read_bytes(), self.codec)
|
|
||||||
Subtitle.merge_same_cues(caption_set)
|
|
||||||
if codec == Subtitle.Codec.WebVTT:
|
|
||||||
Subtitle.filter_unwanted_cues(caption_set)
|
|
||||||
subtitle_text = writer().write(caption_set)
|
|
||||||
|
|
||||||
output_path.write_text(subtitle_text, encoding="utf8")
|
|
||||||
|
|
||||||
if original_path.exists() and original_path != output_path:
|
|
||||||
original_path.unlink()
|
|
||||||
|
|
||||||
self.path = output_path
|
|
||||||
self.codec = codec
|
|
||||||
|
|
||||||
if callable(self.OnConverted):
|
|
||||||
self.OnConverted(codec)
|
|
||||||
|
|
||||||
return output_path
|
|
||||||
|
|
||||||
@staticmethod
|
@staticmethod
|
||||||
def parse(data: bytes, codec: Subtitle.Codec) -> pycaption.CaptionSet:
|
def parse(data: bytes, codec: Subtitle.Codec) -> pycaption.CaptionSet:
|
||||||
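Reviewer sketch for the new `extract_fonts` helper in the hunk above: a standalone copy of the same scan, followed by a small made-up ASS sample, to show how the `Format:` lookup, the inline `\fn` overrides, the `@` stripping and the case-insensitive de-duplication interact. The sample text and the module-level function name are illustrative assumptions; the logic is meant to mirror the diff, not replace it.

```python
import re


def extract_fonts(text: str) -> set[str]:
    """Standalone sketch of the font scan shown in the diff."""
    names: set[str] = set()
    name_index = 1  # default Style order: Name, Fontname, ...
    in_styles = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("["):
            # Only the style sections carry Style: lines we care about.
            in_styles = stripped.lower() in ("[v4+ styles]", "[v4 styles]")
            continue
        if not in_styles:
            continue
        lowered = stripped.lower()
        if lowered.startswith("format:"):
            columns = [c.strip().lower() for c in stripped.split(":", 1)[1].split(",")]
            if "fontname" in columns:
                name_index = columns.index("fontname")
        elif lowered.startswith("style:"):
            fields = stripped.split(":", 1)[1].split(",")
            if len(fields) > name_index:
                names.add(fields[name_index].strip())

    # Inline \fn overrides, matched anywhere in the file.
    names.update(m.strip() for m in re.findall(r"\\fn([^\\}]+)", text))

    # Strip the vertical-writing "@" prefix, then de-duplicate case-insensitively,
    # preferring a mixed-case spelling over an all-lowercase one.
    canonical: dict[str, str] = {}
    for name in (raw.lstrip("@").strip() for raw in names):
        if not name:
            continue
        key = name.lower()
        if key not in canonical or (name != name.lower() and canonical[key] == canonical[key].lower()):
            canonical[key] = name
    return set(canonical.values())


SAMPLE = """[V4+ Styles]
Format: Name, Fontname, Fontsize
Style: Default,Arial,20
Style: Vertical,@MS Gothic,20

[Events]
Format: Layer, Start, End, Style, Text
Dialogue: 0,0:00:01.00,0:00:02.00,Default,{\\fnComic Sans MS\\b1}Hi
Dialogue: 0,0:00:03.00,0:00:04.00,Default,{\\fnarial}Again
"""

fonts = extract_fonts(SAMPLE)
```

With this sample, `Arial` and the inline `arial` collapse into one entry and `@MS Gothic` is reported without its prefix.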
@@ -1267,25 +1033,13 @@ class Subtitle(Track):
|
|||||||
)
|
)
|
||||||
|
|
||||||
if binaries.SubtitleEdit and use_subtitleedit:
|
if binaries.SubtitleEdit and use_subtitleedit:
|
||||||
output_format = {
|
from unshackle.core.tracks.subtitle_convert import SUBTITLE_EDIT_FORMATS, subtitleedit_args
|
||||||
Subtitle.Codec.SubRip: "subrip",
|
|
||||||
Subtitle.Codec.SubStationAlpha: "substationalpha",
|
output_format = SUBTITLE_EDIT_FORMATS.get(self.codec, self.codec.name.lower())
|
||||||
Subtitle.Codec.SubStationAlphav4: "advancedsubstationalpha",
|
|
||||||
Subtitle.Codec.TimedTextMarkupLang: "timedtext1.0",
|
|
||||||
Subtitle.Codec.WebVTT: "webvtt",
|
|
||||||
Subtitle.Codec.SAMI: "sami",
|
|
||||||
Subtitle.Codec.MicroDVD: "microdvd",
|
|
||||||
}.get(self.codec, self.codec.name.lower())
|
|
||||||
subprocess.run(
|
subprocess.run(
|
||||||
[
|
subtitleedit_args(
|
||||||
str(binaries.SubtitleEdit),
|
binaries.SubtitleEdit, self.path, output_format, output_folder=self.path.parent, remove_hi=True
|
||||||
"/convert",
|
),
|
||||||
str(self.path),
|
|
||||||
output_format,
|
|
||||||
"/encoding:utf8",
|
|
||||||
"/overwrite",
|
|
||||||
"/RemoveTextForHI",
|
|
||||||
],
|
|
||||||
check=True,
|
check=True,
|
||||||
stdout=subprocess.DEVNULL,
|
stdout=subprocess.DEVNULL,
|
||||||
stderr=subprocess.DEVNULL,
|
stderr=subprocess.DEVNULL,
|
||||||
@@ -1330,26 +1084,14 @@ class Subtitle(Track):
|
|||||||
if not binaries.SubtitleEdit:
|
if not binaries.SubtitleEdit:
|
||||||
raise EnvironmentError("SubtitleEdit executable not found...")
|
raise EnvironmentError("SubtitleEdit executable not found...")
|
||||||
|
|
||||||
output_format = {
|
from unshackle.core.tracks.subtitle_convert import SUBTITLE_EDIT_FORMATS, subtitleedit_args
|
||||||
Subtitle.Codec.SubRip: "subrip",
|
|
||||||
Subtitle.Codec.SubStationAlpha: "substationalpha",
|
output_format = SUBTITLE_EDIT_FORMATS.get(self.codec, self.codec.name.lower())
|
||||||
Subtitle.Codec.SubStationAlphav4: "advancedsubstationalpha",
|
|
||||||
Subtitle.Codec.TimedTextMarkupLang: "timedtext1.0",
|
|
||||||
Subtitle.Codec.WebVTT: "webvtt",
|
|
||||||
Subtitle.Codec.SAMI: "sami",
|
|
||||||
Subtitle.Codec.MicroDVD: "microdvd",
|
|
||||||
}.get(self.codec, self.codec.name.lower())
|
|
||||||
|
|
||||||
subprocess.run(
|
subprocess.run(
|
||||||
[
|
subtitleedit_args(
|
||||||
str(binaries.SubtitleEdit),
|
binaries.SubtitleEdit, self.path, output_format, output_folder=self.path.parent, reverse_rtl=True
|
||||||
"/convert",
|
),
|
||||||
str(self.path),
|
|
||||||
output_format,
|
|
||||||
"/ReverseRtlStartEnd",
|
|
||||||
"/encoding:utf8",
|
|
||||||
"/overwrite",
|
|
||||||
],
|
|
||||||
check=True,
|
check=True,
|
||||||
stdout=subprocess.DEVNULL,
|
stdout=subprocess.DEVNULL,
|
||||||
stderr=subprocess.DEVNULL,
|
stderr=subprocess.DEVNULL,
|
||||||
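Reviewer sketch for the two `subprocess.run` call sites above, which now delegate to `subtitleedit_args`: a trimmed, standalone builder showing the argument list each call would produce. The flag spellings are copied from this commit and have not been checked here against the SubtitleEdit CLI itself; `build_se5_args` and the sample paths are hypothetical names used only for illustration.

```python
from pathlib import Path
from typing import Optional


def build_se5_args(
    binary: str,
    src: Path,
    fmt: str,
    *,
    output_folder: Optional[Path] = None,
    remove_hi: bool = False,
    reverse_rtl: bool = False,
) -> list[str]:
    """Build a positional <input> <format> command followed by --flags."""
    args = [binary, str(src), fmt, "--encoding:utf-8", "--overwrite"]
    if output_folder is not None:
        args.append(f"--output-folder:{output_folder}")
    if remove_hi:
        args.append("--remove-text-for-hi")
    if reverse_rtl:
        args.append("--reverse-rtl-start-end")
    return args


source = Path("subs") / "episode.srt"

# SDH stripping keeps the source format and writes next to the input.
sdh_args = build_se5_args("seconv", source, "subrip", output_folder=source.parent, remove_hi=True)

# Reverse-RTL uses the same shape with a different trailing flag.
rtl_args = build_se5_args("seconv", source, "subrip", output_folder=source.parent, reverse_rtl=True)
```

Keeping the flag list in one builder means both call sites differ only in which keyword they switch on.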
|
|||||||
313
unshackle/core/tracks/subtitle_convert.py
Normal file
@@ -0,0 +1,313 @@
|
|||||||
|
"""
|
||||||
|
Subtitle conversion backend registry.
|
||||||
|
|
||||||
|
Routing is data-driven: each backend declares which (source -> target) codec pairs it can
|
||||||
|
read/write, whether it is available in the current environment, and a preference rank.
|
||||||
|
``resolve_backends`` filters the registry to the available backends that support the
|
||||||
|
requested pair and orders them by rank; ``run_conversion`` tries each in turn (a real
|
||||||
|
fallback chain) until one succeeds.
|
||||||
|
|
||||||
|
The public entry point stays ``Subtitle.convert`` / ``Subtitle.strip_hearing_impaired`` in
|
||||||
|
subtitle.py — this module only holds the selection + conversion logic so subtitle.py keeps
|
||||||
|
the codec enum, ``parse``, sanitizers and cue helpers (the collaborators backends reuse).
|
||||||
|
"""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import logging
|
||||||
|
import subprocess
|
||||||
|
from pathlib import Path
|
||||||
|
from typing import Optional, Protocol
|
||||||
|
|
||||||
|
import pycaption
|
||||||
|
import pysubs2
|
||||||
|
from subby import CommonIssuesFixer, SAMIConverter, WebVTTConverter, WVTTConverter
|
||||||
|
|
||||||
|
from unshackle.core import binaries
|
||||||
|
from unshackle.core.tracks.subtitle import Subtitle
|
||||||
|
|
||||||
|
log = logging.getLogger("subtitle")
|
||||||
|
|
||||||
|
Codec = Subtitle.Codec
|
||||||
|
|
||||||
|
# SubtitleEdit (and the cross-platform seconv port) output format names.
|
||||||
|
# Shared by SubtitleEditBackend, strip_hearing_impaired and reverse_rtl so the map lives once.
|
||||||
|
SUBTITLE_EDIT_FORMATS: dict[Codec, str] = {
|
||||||
|
Codec.SubRip: "subrip",
|
||||||
|
Codec.SubStationAlpha: "substationalpha",
|
||||||
|
Codec.SubStationAlphav4: "advancedsubstationalpha",
|
||||||
|
Codec.TimedTextMarkupLang: "timedtext1.0",
|
||||||
|
Codec.WebVTT: "webvtt",
|
||||||
|
Codec.SAMI: "sami",
|
||||||
|
Codec.MicroDVD: "microdvd",
|
||||||
|
}
|
||||||
|
|
||||||
|
# pycaption can only WRITE these three formats.
|
||||||
|
PYCAPTION_WRITERS = {
|
||||||
|
Codec.SubRip: pycaption.SRTWriter,
|
||||||
|
Codec.TimedTextMarkupLang: pycaption.DFXPWriter,
|
||||||
|
Codec.WebVTT: pycaption.WebVTTWriter,
|
||||||
|
}
|
||||||
|
|
||||||
|
# pysubs2 format identifiers per codec.
|
||||||
|
PYSUBS2_FORMATS: dict[Codec, str] = {
|
||||||
|
Codec.SubRip: "srt",
|
||||||
|
Codec.SubStationAlpha: "ssa",
|
||||||
|
Codec.SubStationAlphav4: "ass",
|
||||||
|
Codec.WebVTT: "vtt",
|
||||||
|
Codec.TimedTextMarkupLang: "ttml",
|
||||||
|
Codec.SAMI: "sami",
|
||||||
|
Codec.MicroDVD: "microdvd",
|
||||||
|
Codec.MPL2: "mpl2",
|
||||||
|
Codec.TMP: "tmp",
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def subtitleedit_args(
|
||||||
|
binary: object,
|
||||||
|
src: Path,
|
||||||
|
fmt: str,
|
||||||
|
*,
|
||||||
|
output_folder: Optional[Path] = None,
|
||||||
|
convert_colors: bool = False,
|
||||||
|
remove_hi: bool = False,
|
||||||
|
reverse_rtl: bool = False,
|
||||||
|
) -> list[str]:
|
||||||
|
"""
|
||||||
|
Build a SubtitleEdit batch-convert command.
|
||||||
|
|
||||||
|
Targets the SubtitleEdit 5+ CLI (``SeConv`` / ``seconv`` on every platform), which takes
|
||||||
|
``--flags`` with a positional ``<pattern> <format>`` (no legacy ``/convert`` verb). The
|
||||||
|
SE5 converter names the output ``<input-stem>.<format-ext>``; pass ``output_folder`` to
|
||||||
|
place it next to a chosen path (a bare ``--output-filename`` resolves against the *cwd*,
|
||||||
|
not the input dir, so we always steer with ``--output-folder``). ``--overwrite`` is always
|
||||||
|
set so re-runs and in-place transforms (SDH/RTL) don't fail on an existing file.
|
||||||
|
"""
|
||||||
|
args = [str(binary), str(src), fmt, "--encoding:utf-8", "--overwrite"]
|
||||||
|
if output_folder is not None:
|
||||||
|
args.append(f"--output-folder:{output_folder}")
|
||||||
|
if convert_colors:
|
||||||
|
args.append("--convert-colors-to-dialog")
|
||||||
|
if remove_hi:
|
||||||
|
args.append("--remove-text-for-hi")
|
||||||
|
if reverse_rtl:
|
||||||
|
args.append("--reverse-rtl-start-end")
|
||||||
|
return args
|
||||||
|
|
||||||
|
|
||||||
|
# Styled SubStation formats flattened to SRT lose positioning/colours/italics.
|
||||||
|
# Never performed automatically — only when the user explicitly forces a target format.
|
||||||
|
LOSSY_DOWNCONVERTS: frozenset[tuple[Codec, Codec]] = frozenset(
|
||||||
|
{
|
||||||
|
(Codec.SubStationAlpha, Codec.SubRip),
|
||||||
|
(Codec.SubStationAlphav4, Codec.SubRip),
|
||||||
|
}
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
class SubtitleBackend(Protocol):
|
||||||
|
name: str
|
||||||
|
|
||||||
|
def is_available(self) -> bool: ...
|
||||||
|
|
||||||
|
def can_convert(self, source: Codec, target: Codec) -> bool: ...
|
||||||
|
|
||||||
|
def rank(self, source: Codec, target: Codec) -> int: ...
|
||||||
|
|
||||||
|
def convert(self, source: Codec, src: Path, target: Codec, out: Path) -> None:
|
||||||
|
"""Convert ``src`` (a ``source`` file) to ``target``, writing to ``out``. Raise on failure."""
|
||||||
|
...
|
||||||
|
|
||||||
|
|
||||||
|
class SubtitleEditBackend:
|
||||||
|
"""SubtitleEdit / seconv CLI. Highest fidelity (keeps positioning + italics) when present."""
|
||||||
|
|
||||||
|
name = "subtitleedit"
|
||||||
|
reads = frozenset(SUBTITLE_EDIT_FORMATS)
|
||||||
|
writes = frozenset(SUBTITLE_EDIT_FORMATS)
|
||||||
|
|
||||||
|
def is_available(self) -> bool:
|
||||||
|
return bool(binaries.SubtitleEdit)
|
||||||
|
|
||||||
|
def can_convert(self, source: Codec, target: Codec) -> bool:
|
||||||
|
# Segmented box formats cannot be read by SubtitleEdit.
|
||||||
|
if source in (Codec.fTTML, Codec.fVTT):
|
||||||
|
return False
|
||||||
|
return source in self.reads and target in self.writes
|
||||||
|
|
||||||
|
def rank(self, source: Codec, target: Codec) -> int:
|
||||||
|
return 0
|
||||||
|
|
||||||
|
def convert(self, source: Codec, src: Path, target: Codec, out: Path) -> None:
|
||||||
|
args = subtitleedit_args(
|
||||||
|
binaries.SubtitleEdit,
|
||||||
|
src,
|
||||||
|
SUBTITLE_EDIT_FORMATS[target],
|
||||||
|
output_folder=out.parent,
|
||||||
|
convert_colors=(target == Codec.SubRip),
|
||||||
|
)
|
||||||
|
subprocess.run(args, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
|
||||||
|
# SE5 names the output <input-stem>.<format-ext>, which may differ from our target
|
||||||
|
# suffix (e.g. timedtext1.0 -> .ttml). Normalise it onto `out`.
|
||||||
|
if not out.exists():
|
||||||
|
produced = next((p for p in src.parent.glob(f"{src.stem}.*") if p not in (src, out)), None)
|
||||||
|
if produced is None:
|
||||||
|
raise FileNotFoundError(f"SubtitleEdit produced no output for {src.name} -> {target.name}")
|
||||||
|
produced.replace(out)
|
||||||
|
|
||||||
|
|
||||||
|
class Pysubs2Backend:
|
||||||
|
"""pysubs2 — pure Python, broad format support, best fidelity for SSA/ASS (native style model)."""
|
||||||
|
|
||||||
|
name = "pysubs2"
|
||||||
|
formats = frozenset(PYSUBS2_FORMATS)
|
||||||
|
|
||||||
|
def is_available(self) -> bool:
|
||||||
|
return True
|
||||||
|
|
||||||
|
def can_convert(self, source: Codec, target: Codec) -> bool:
|
||||||
|
return source in self.formats and target in self.formats
|
||||||
|
|
||||||
|
def rank(self, source: Codec, target: Codec) -> int:
|
||||||
|
# Preferred reader for styled SubStation sources; solid general fallback otherwise.
|
||||||
|
return 1 if source in (Codec.SubStationAlpha, Codec.SubStationAlphav4) else 2
|
||||||
|
|
||||||
|
def convert(self, source: Codec, src: Path, target: Codec, out: Path) -> None:
|
||||||
|
subs = pysubs2.load(str(src), encoding="utf-8")
|
||||||
|
subs.save(str(out), format_=PYSUBS2_FORMATS[target], encoding="utf-8")
|
||||||
|
|
||||||
|
|
||||||
|
class SubbyBackend:
|
||||||
|
"""subby — purpose-built for streaming subs. WebVTT/fVTT/SAMI -> SRT + CommonIssuesFixer cleanup."""
|
||||||
|
|
||||||
|
name = "subby"
|
||||||
|
reads = frozenset({Codec.WebVTT, Codec.fVTT, Codec.SAMI})
|
||||||
|
# Native SRT output; non-SRT targets re-encoded from the SRT intermediate via pycaption.
|
||||||
|
writes = frozenset({Codec.SubRip, Codec.TimedTextMarkupLang, Codec.WebVTT})
|
||||||
|
converters = {
|
||||||
|
Codec.WebVTT: WebVTTConverter,
|
||||||
|
Codec.fVTT: WVTTConverter,
|
||||||
|
Codec.SAMI: SAMIConverter,
|
||||||
|
}
|
||||||
|
|
||||||
|
def is_available(self) -> bool:
|
||||||
|
return True
|
||||||
|
|
||||||
|
def can_convert(self, source: Codec, target: Codec) -> bool:
|
||||||
|
return source in self.reads and target in self.writes
|
||||||
|
|
||||||
|
def rank(self, source: Codec, target: Codec) -> int:
|
||||||
|
# Great for *->SRT (adds cleanup); the SRT intermediate is lossy for other targets.
|
||||||
|
return 1 if target == Codec.SubRip else 5
|
||||||
|
|
||||||
|
def convert(self, source: Codec, src: Path, target: Codec, out: Path) -> None:
|
||||||
|
srt_subtitles = self.converters[source]().from_file(src)
|
||||||
|
fixed_srt, _ = CommonIssuesFixer().from_srt(srt_subtitles)
|
||||||
|
if target == Codec.SubRip:
|
||||||
|
fixed_srt.save(out, encoding="utf8")
|
||||||
|
return
|
||||||
|
temp_srt = src.with_suffix(".temp.srt")
|
||||||
|
fixed_srt.save(temp_srt, encoding="utf8")
|
||||||
|
try:
|
||||||
|
caption_set = Subtitle.parse(temp_srt.read_bytes(), Codec.SubRip)
|
||||||
|
Subtitle.merge_same_cues(caption_set)
|
||||||
|
out.write_text(PYCAPTION_WRITERS[target]().write(caption_set), encoding="utf8")
|
||||||
|
finally:
|
||||||
|
temp_srt.unlink(missing_ok=True)
|
||||||
|
|
||||||
|
|
||||||
|
class PycaptionBackend:
|
||||||
|
"""pycaption — last resort. Note: flattens positioning/italics (devine #39), so ranked last."""
|
||||||
|
|
||||||
|
name = "pycaption"
|
||||||
|
reads = frozenset({Codec.SubRip, Codec.TimedTextMarkupLang, Codec.WebVTT, Codec.SAMI, Codec.fTTML, Codec.fVTT})
|
||||||
|
writes = frozenset(PYCAPTION_WRITERS)
|
||||||
|
|
||||||
|
def is_available(self) -> bool:
|
||||||
|
return True
|
||||||
|
|
||||||
|
def can_convert(self, source: Codec, target: Codec) -> bool:
|
||||||
|
return source in self.reads and target in self.writes
|
||||||
|
|
||||||
|
    def rank(self, source: Codec, target: Codec) -> int:
        return 9

    def convert(self, source: Codec, src: Path, target: Codec, out: Path) -> None:
        caption_set = Subtitle.parse(src.read_bytes(), source)
        Subtitle.merge_same_cues(caption_set)
        if target == Codec.WebVTT:
            Subtitle.filter_unwanted_cues(caption_set)
        out.write_text(PYCAPTION_WRITERS[target]().write(caption_set), encoding="utf8")


REGISTRY: list[SubtitleBackend] = [
    SubtitleEditBackend(),
    SubbyBackend(),
    Pysubs2Backend(),
    PycaptionBackend(),
]


def resolve_backends(source: Codec, target: Codec, *, pin: Optional[str] = None) -> list[SubtitleBackend]:
    """Available backends that support source->target, ordered by rank. A pin is tried first."""
    available = [b for b in REGISTRY if b.is_available() and b.can_convert(source, target)]
    if pin:
        pinned = [b for b in available if b.name == pin]
        rest = sorted((b for b in available if b.name != pin), key=lambda b: b.rank(source, target))
        return pinned + rest
    return sorted(available, key=lambda b: b.rank(source, target))


def finalize(sub: Subtitle, target: Codec, out: Path) -> Path:
    """Swap the track onto the converted file and fire the OnConverted callback."""
    original = sub.path
    if original and original.exists() and original != out:
        original.unlink()
    sub.path = out
    sub.codec = target
    if callable(sub.OnConverted):
        sub.OnConverted(target)
    return out


def run_conversion(sub: Subtitle, target: Codec, *, pin: Optional[str] = None, forced: bool = False) -> Path:
    """
    Convert ``sub`` to ``target`` using the best available backend, falling back through the
    capability chain on failure.

    ``forced`` is True only for explicit user requests (``--sub-format``); lossy downconverts
    (styled SubStation -> SRT) are skipped unless forced.
    """
    if sub.path is None or not sub.path.exists():
        raise ValueError("You must download the subtitle track first.")
    if sub.codec is None:
        raise ValueError("Subtitle has no codec to convert from.")
    source, src = sub.codec, sub.path

    if source == target:
        return src

    if (source, target) in LOSSY_DOWNCONVERTS and not forced:
        log.info(
            f"Keeping {source.name} subtitle as-is "
            f"(skipping lossy auto-conversion to {target.name}; pass --sub-format to force)"
        )
        return src

    chain = resolve_backends(source, target, pin=pin)
    if not chain:
        raise NotImplementedError(f"Cannot convert {source.name} to {target.name}.")

    out = src.with_suffix(f".{target.value.lower()}")
    last_exc: Optional[Exception] = None
    for backend in chain:
        try:
            backend.convert(source, src, target, out)
        except Exception as e:
            last_exc = e
            log.debug(f"Subtitle backend {backend.name} failed ({source.name}->{target.name}): {e}")
            continue
        log.debug(f"Converted subtitle {source.name}->{target.name} via {backend.name}")
        return finalize(sub, target, out)

    raise RuntimeError(f"All subtitle backends failed for {source.name}->{target.name}") from last_exc
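The pin-then-fallback ordering used by `resolve_backends` and `run_conversion` can be tried in isolation. The following is a minimal sketch, not the project's code: `DemoBackend`, `resolve`, and `run_chain` are hypothetical stand-ins, and every rank value except pycaption's 9 is made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DemoBackend:
    """Hypothetical stand-in for a SubtitleBackend (not the project's class)."""

    name: str
    rank: int  # lower rank is tried earlier; values here are illustrative
    available: bool = True
    fails: bool = False

    def convert(self, text: str) -> str:
        # Simulate a backend that is installed but cannot handle the input.
        if self.fails:
            raise RuntimeError(f"{self.name} could not convert")
        return f"{self.name}:{text}"


def resolve(backends: list[DemoBackend], pin: Optional[str] = None) -> list[DemoBackend]:
    """Order available backends by rank; a pinned backend moves to the front."""
    available = [b for b in backends if b.available]
    pinned = [b for b in available if b.name == pin]
    rest = sorted((b for b in available if b.name != pin), key=lambda b: b.rank)
    return pinned + rest


def run_chain(backends: list[DemoBackend], text: str, pin: Optional[str] = None) -> str:
    """Try each backend in order and return the first successful result."""
    last_exc: Optional[Exception] = None
    for backend in resolve(backends, pin):
        try:
            return backend.convert(text)
        except Exception as exc:  # keep going: this is the fallback chain
            last_exc = exc
    raise RuntimeError("all backends failed") from last_exc


demo = [
    DemoBackend("pycaption", rank=9),
    DemoBackend("subtitleedit", rank=1, available=False),  # e.g. CLI not installed
    DemoBackend("subby", rank=2, fails=True),
    DemoBackend("pysubs2", rank=3),
]

order_auto = [b.name for b in resolve(demo)]
order_pinned = [b.name for b in resolve(demo, pin="pycaption")]
result = run_chain(demo, "cue")
```

In this sketch an unavailable backend never enters the chain, a pinned backend is tried first but does not remove the others, and a backend that raises is skipped in favor of the next one.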
@@ -448,30 +448,41 @@ filenames:
 # - pysubs2: Use pysubs2 library (supports SRT/SSA/ASS/WebVTT/TTML/SAMI/MicroDVD/MPL2/TMP)
 subtitle:
   conversion_method: auto
-  # sdh_method: Method to use for SDH (hearing impaired) stripping
+  # Which backend converts subtitles (data-driven registry, pin-then-fallback)
+  # - auto (default): best available by rank (SubtitleEdit > subby/pysubs2 > pycaption)
+  # - subby | pysubs2 | subtitleedit | pycaption: pin that backend first, still falls back
+  # Styled ASS/SSA are never auto-downconverted to SRT (kept as-is); --sub-format srt overrides.
+  # SubtitleEdit on Linux/macOS = install the SE5 "SeConv" (seconv) CLI on PATH or unshackle/binaries/.
+
+  sdh_method: auto
+  # Method to use for SDH (hearing impaired) stripping
   # - auto (default): Try subby (SRT only), then SubtitleEdit (if available), then subtitle-filter
   # - subby: Use subby library (SRT only)
-  # - subtitleedit: Use SubtitleEdit tool (Windows only, falls back to subtitle-filter)
+  # - subtitleedit: Use SubtitleEdit / seconv (SE5 CLI, cross-platform), falls back to subtitle-filter
   # - filter-subs: Use subtitle-filter library directly
-  sdh_method: auto
-  # strip_sdh: Automatically create stripped (non-SDH) versions of SDH subtitles
-  # Set to false to disable automatic SDH stripping entirely (default: true)
+
   strip_sdh: true
-  # convert_before_strip: Auto-convert VTT/other formats to SRT before using subtitle-filter
-  # This ensures compatibility when subtitle-filter is used as fallback (default: true)
+  # Automatically create stripped (non-SDH) versions of SDH subtitles
+  # Set to false to disable automatic SDH stripping entirely (default: true)
+
   convert_before_strip: true
-  # preserve_formatting: Preserve original subtitle formatting (tags, positioning, styling)
+  # Auto-convert VTT/other formats to SRT before using subtitle-filter
+  # This ensures compatibility when subtitle-filter is used as fallback (default: true)
+
+  preserve_formatting: true
+  # Preserve original subtitle formatting (tags, positioning, styling)
   # When true, skips pycaption processing for WebVTT files to keep tags like <i>, <b>, positioning intact
   # Combined with no sub_format setting, ensures subtitles remain in their original format (default: true)
-  preserve_formatting: true
-  # output_mode: Output mode for subtitles
+
+  output_mode: mux
+  # Output mode for subtitles
   # - mux: Embed subtitles in MKV container only (default)
   # - sidecar: Save subtitles as separate files only
   # - both: Embed in MKV AND save as sidecar files
-  output_mode: mux
-  # sidecar_format: Format for sidecar subtitle files
-  # Options: srt, vtt, ass, original (keep current format)
+
   sidecar_format: srt
+  # Format for sidecar subtitle files
+  # Options: srt, vtt, ass, original (keep current format)
 
 # Configuration for pywidevine and pyplayready's serve functionality
 # Also used for remote services (unshackle serve)
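The config comment about styled ASS/SSA describes a guard that skips lossy conversions unless the user asks for them explicitly. A rough sketch of that decision follows, under stated assumptions: `DemoCodec`, `LOSSY_PAIRS`, and `should_convert` are hypothetical names, the set contents are a guess at what `LOSSY_DOWNCONVERTS` holds, and every ASS/SSA input is treated as styled.

```python
from enum import Enum


class DemoCodec(Enum):
    """Hypothetical codec enum; the project's own Codec type is not reproduced here."""

    ASS = "ASS"
    SSA = "SSA"
    SRT = "SRT"
    VTT = "VTT"


# Assumed contents: SubStation formats lose their styling when written as SRT.
LOSSY_PAIRS = {
    (DemoCodec.ASS, DemoCodec.SRT),
    (DemoCodec.SSA, DemoCodec.SRT),
}


def should_convert(source: DemoCodec, target: DemoCodec, *, forced: bool = False) -> bool:
    """Return True when a conversion should actually run."""
    if source == target:
        return False  # nothing to do, keep the file as-is
    if (source, target) in LOSSY_PAIRS and not forced:
        return False  # automatic conversion would drop styling, so skip it
    return True


auto_ass_to_srt = should_convert(DemoCodec.ASS, DemoCodec.SRT)
forced_ass_to_srt = should_convert(DemoCodec.ASS, DemoCodec.SRT, forced=True)
auto_vtt_to_srt = should_convert(DemoCodec.VTT, DemoCodec.SRT)
```

Here `forced=True` plays the role of an explicit `--sub-format srt` request, which overrides the protection; conversions outside the lossy set proceed automatically.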