We have a lot of implicit and often slightly different assumptions about how we handle subtitles. Here, I'll try to document a best-practice approach. After some discussion, we can maybe add this to the Opencast documentation and make it the official approach.
I'll try to suggest common ways of storing and identifying subtitles from upload, through processing, to finally showing them in the player.
Let's make subtitles first-class citizens in Opencast, storing them as tracks alongside audio and video streams and letting workflows handle them by default.
- Let us use flavors of the form `captions/<processing>`, e.g. `captions/source` and `captions/delivery`. This makes it easy to identify subtitles, and easy to publish everything that is ready for publication.
- Let us attach tags holding additional information of the form `lang:<lang>`, `generator:<type>[:<id>]` and `type:<caption-type>` to each subtitle track.
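To make the proposal concrete, here is what a subtitle track following these conventions could look like, sketched as a TypeScript structure (the interface is a simplified illustration, not Opencast's actual media package model; the URL is made up):

```typescript
// Simplified stand-in for a media package element; Opencast's real model
// is richer (XML media packages, element IDs, checksums, …).
interface SubtitleTrack {
  flavor: string;    // "captions/<processing>"
  tags: string[];    // additional, optional metadata
  mimetype: string;  // always "text/vtt" for subtitles
  url: string;
}

// A German, auto-generated closed-caption track, ready for publication:
const track: SubtitleTrack = {
  flavor: "captions/delivery",
  tags: ["lang:deu", "generator:auto:vosk", "type:closed-caption"],
  mimetype: "text/vtt",
  url: "https://example.org/captions.vtt",
};
```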
While there are many components in the history of Opencast dealing with subtitles (Matterhorn admin interface, Engage player, …), the most relevant ones are Opencast's HTML5 players: Theodul and Paella.
Since it is still in beta, I am leaving Paella Player 7 out of this historical overview.
Theodul supports WebVTT files and considers both tracks and attachments when loading subtitles. It only supports a single subtitle stream. If both exist, it prefers tracks over attachments. The only selection criterion is that the media package element must have the mime type `text/vtt`.
Examples:

- track, flavor `captions/delivery`, no tags, mime type `text/vtt` (would be loaded)
- track, flavor `captions/vtt`, tag `lang:en`, mime type `text/plain` (would be ignored, since the mime type is not `text/vtt`)

See `loadAndAppendCaptions(…)` in `engage-theodul-plugin-video-videojs/src/main/resources/static/main.js` for more details.
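In pseudo-TypeScript, the described behavior amounts to something like the following sketch (illustrative only, not the actual Theodul code referenced above; the element shape is a simplified stand-in):

```typescript
interface MediaPackageElement {
  flavor: string;
  tags: string[];
  mimetype: string;
  url: string;
}

// Select by mime type only, preferring tracks over attachments.
function findTheodulCaptions(
  tracks: MediaPackageElement[],
  attachments: MediaPackageElement[],
): MediaPackageElement | undefined {
  const isVtt = (e: MediaPackageElement) => e.mimetype === "text/vtt";
  // Only a single subtitle stream is supported, so the first match wins.
  return tracks.find(isVtt) ?? attachments.find(isVtt);
}
```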
Paella supports DFXP, WebVTT and SubRip files (not entirely sure). It supports loading subtitles from attachments, catalogs and (since ≥ 12.x) tracks. It supports multiple subtitle streams. Subtitle selection happens by selecting all media package elements (only attachments in Opencast ≤ 11) with the main flavor `captions`.
The sub-flavor is split at the first `+` character. The first part is used as the format identifier; the second part, if present, as a language identifier. If no language was detected, the Paella player looks for a tag of the form `lang:<language>` to use as the language identifier. The language identifier is also used as the language description.
Examples:

- `captions/vtt+de`, tag `lang:de` becomes: format `vtt`, language `de`, description `de`
- `captions/vtt`, tag `lang:en` becomes: format `vtt`, language `en`, description `en`
- `captions/delivery`, no tags, mime type `text/vtt` becomes: format `delivery`, no language

See `getCaptions(…)` in `engage-paella-player/src/main/paella-opencast/plugins/es.upv.paella.opencast.loader/03_oc_search_converter.js` for more details.
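A sketch of the described parsing (illustrative only; the real logic lives in `getCaptions(…)`, referenced above):

```typescript
interface CaptionInfo {
  format: string;
  lang?: string;
  description?: string;
}

// Sketch of the described flavor/tag handling, not the actual Paella code.
function parsePaellaCaption(subFlavor: string, tags: string[]): CaptionInfo {
  // Split the sub-flavor at the first "+": format, then optional language.
  const plus = subFlavor.indexOf("+");
  const format = plus === -1 ? subFlavor : subFlavor.slice(0, plus);
  let lang = plus === -1 ? undefined : subFlavor.slice(plus + 1);

  // Fall back to a lang:<language> tag if the sub-flavor carries no language.
  if (!lang) {
    lang = tags.find((t) => t.startsWith("lang:"))?.slice("lang:".length);
  }

  // The language identifier doubles as the human-readable description.
  return { format, lang, description: lang };
}

parsePaellaCaption("vtt+de", ["lang:de"]); // { format: "vtt", lang: "de", description: "de" }
parsePaellaCaption("delivery", []);        // { format: "delivery", lang: undefined, … }
```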
Going forward, we need to support subtitles not only in players, but in several other components to help users work with them. So let's talk about how things should work in the future.
The Web Video Text Tracks Format (WebVTT) is a widely adopted W3C standard for subtitles/captions on the web. In this context, it has by now completely replaced all competing formats. We do not need to support any other formats.
For simplicity, I suggest supporting WebVTT only.
If other formats still exist in an archive, Opencast can convert many of them to WebVTT using FFmpeg.
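For instance, converting a SubRip file needs nothing more than letting FFmpeg infer the formats from the file extensions (a minimal Node.js sketch; the file names are made up):

```typescript
import { execFileSync } from "node:child_process";

// Convert any subtitle format FFmpeg can read (SubRip, DFXP/TTML, …) to
// WebVTT; FFmpeg picks the output format from the ".vtt" extension.
execFileSync("ffmpeg", ["-y", "-i", "captions.srt", "captions.vtt"]);
```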
In the past, two media package categories have been used to store subtitles: tracks and attachments.
Since Opencast allows cutting of events, and with the trend to push this task more and more towards end users, we cannot control the order of generating subtitles versus cutting the event.
Therefore, it is important that subtitles are treated like other tracks, since that allows us to cut the subtitles along with video and audio whenever that becomes necessary.
To allow for this, I suggest always storing subtitles in the tracks section of media packages.
Using tracks also allows us to easily use tools like FFmpeg to convert subtitles between different formats, in the same way we convert audio and video using the `encode` operation.
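Cutting can then use the exact same toolchain. A sketch, assuming FFmpeg's input seeking applies to subtitle-only inputs the way it does to audio and video (timestamps and file names are made up):

```typescript
import { execFileSync } from "node:child_process";

// Keep four minutes of cues starting at minute one. Seeking before the
// input should shift the remaining cue timestamps towards zero, matching
// audio/video trimmed with the same options.
execFileSync("ffmpeg", [
  "-y",
  "-ss", "00:01:00",
  "-t", "00:04:00",
  "-i", "captions.vtt",
  "captions-cut.vtt",
]);
```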
With auto-generated and archived subtitles, as well as with editing in the hands of users, we need to be able to distinguish uncut from cut subtitles. That is why using `captions/vtt` for everything does not really work.
More than that, we may have several different subtitle streams. They can differ because they are generated differently, they can be in different languages, or they can be either closed captions or subtitles.
Encoding all of this in a two (arguably three) component flavor is hard, and we should consider putting most of this additional information in more flexible places instead.
With video and audio streams, the main flavor describes the kind of stream (`presenter`, `presentation`, …) while the sub-flavor expresses the processing state (`source`, `work`, `delivery`). Having unrelated sets of captions would mean that we also have unrelated audio streams in a media package. That is unlikely. We can hopefully assume that supporting a single set of captions is enough, and we can always stick to `captions` as the main flavor, making the set of captions easily identifiable.
As sub-flavor, we should use the processing state similar to what we use for video and audio streams. This makes it easy for us to distinguish between source material and material which has been processed (e.g. cut).
I suggest always using `captions` as the main flavor while using the processing state as the sub-flavor, similar to other media package tracks. For example, a caption could be flavored `captions/source` when ingested and `captions/delivery` when it is ready for publication.
Previously, the language was sometimes attached to the sub-flavor in the form `captions/source+en`. I suggest moving this information to tags instead: tags are more flexible, and encoding the language in the sub-flavor makes generic handling of media package tracks harder. For example, it is no longer possible to publish `*/delivery` if captions are flavored `captions/delivery+en`.
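To see why, consider a simplified sketch of wildcard flavor matching (illustrative; not Opencast's actual matcher):

```typescript
// Simplified flavor matcher: "*" matches any main or sub-flavor part.
function matchesFlavor(selector: string, flavor: string): boolean {
  const [selMain, selSub] = selector.split("/");
  const [main, sub] = flavor.split("/");
  return (selMain === "*" || selMain === main)
      && (selSub === "*" || selSub === sub);
}

matchesFlavor("*/delivery", "presenter/delivery");   // true
matchesFlavor("*/delivery", "captions/delivery");    // true
matchesFlavor("*/delivery", "captions/delivery+en"); // false: the language in
                                                     // the sub-flavor breaks it
```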
Tags are very flexible and can hold all sorts of additional information about media package elements. I suggest defining common tags to specify the language of a subtitle track, how it was generated, and whether it contains subtitles or closed captions.
All tags should be optional, and components should work without them being present, falling back to generic displays or not showing the information at all.
When processing subtitles, Opencast's components should either keep these tags as they are, or adjust them accordingly.
To specify the language of a subtitle track, we can use tags of the form `lang:<language>`, where `language` is a 3-letter ISO 639 language code. Multiple tags can be added in case multiple languages are used.

Just using 3-letter ISO language codes does not allow us to specify regions. For example, we cannot distinguish between British English (`en-GB`) and American English (`en-US`). Do we need that? If so, we could use RFC 3066 language codes instead.
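Either way, players would not need their own translation tables to display these codes: JavaScript's built-in `Intl.DisplayNames` resolves both 3-letter ISO 639 codes and region-qualified tags (a sketch of how a component might label a `lang:` tag):

```typescript
// Resolve a language code to a human-readable label in the UI language.
const names = new Intl.DisplayNames(["en"], { type: "language" });

console.log(names.of("deu"));   // "German" (3-letter ISO 639 code)
console.log(names.of("en-GB")); // "British English" (RFC 3066 / BCP 47 tag)
```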
Specifying how subtitles were generated, in particular whether they were generated automatically, can help users. Therefore, I suggest adding tags of the form `generator:<type>[:<id>]`, where `type` should be either `manual` or `auto`, and the optional generator `id` may be added to specify the system generating the subtitle.
For example, a manually created subtitle should be tagged `generator:manual`, while an auto-generated subtitle should be tagged `generator:auto`, and one generated by Vosk may even be tagged `generator:auto:vosk` to be more specific.
If no `generator` tag is specified, components should make no claims about how the subtitle may have been generated.
For accessibility in particular, it is important to know whether a subtitle track is actually a subtitle or a closed caption (e.g. including additional information for deaf users). If we have this information, we should include it with a tag of the form `type:<caption-type>`, either `type:subtitle` or `type:closed-caption`.
To summarize, I suggest including additional information about subtitles with tags of the form `lang:<language-code>`, `generator:<type>[:<id>]` and `type:<caption-type>`.
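Taken together, consuming these tags stays simple. A sketch of how a component might parse them (the metadata shape is illustrative, not an existing Opencast interface):

```typescript
interface SubtitleMeta {
  langs: string[];                      // from lang:<language-code>
  generatorType?: "manual" | "auto";    // from generator:<type>[:<id>]
  generatorId?: string;
  type?: "subtitle" | "closed-caption"; // from type:<caption-type>
}

function parseSubtitleTags(tags: string[]): SubtitleMeta {
  const meta: SubtitleMeta = { langs: [] };
  for (const tag of tags) {
    const [key, ...rest] = tag.split(":");
    if (key === "lang" && rest[0]) {
      meta.langs.push(rest[0]);
    } else if (key === "generator") {
      meta.generatorType = rest[0] as "manual" | "auto";
      meta.generatorId = rest[1]; // optional, e.g. "vosk"
    } else if (key === "type") {
      meta.type = rest.join(":") as "subtitle" | "closed-caption";
    }
  }
  return meta; // all tags are optional; missing ones simply stay undefined
}

parseSubtitleTags(["lang:eng", "generator:auto:vosk", "type:subtitle"]);
// { langs: ["eng"], generatorType: "auto", generatorId: "vosk", type: "subtitle" }
```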
Since the format of subtitle tracks should always be WebVTT, the mime type should always be `text/vtt`.
This makes selecting all subtitles easier and could even serve as an alternative to using `captions` as the main flavor, although no operations in Opencast currently let you select tracks by mime type.
For internal components, no fallbacks should be necessary unless adopters want to reprocess already generated and/or published subtitles. We already have the means to access published media package elements. If we provide a simple operation to move attachments flavored `captions/vtt+<lang>` to the new track-based format, we should be good to go.
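Such an operation could boil down to a small metadata transformation, sketched here with a simplified element shape (republishing the actual files, and mapping 2-letter to 3-letter language codes, is left out):

```typescript
interface Element {
  flavor: string;
  tags: string[];
  mimetype: string;
  url: string;
}

// Turn an old-style attachment (e.g. flavor "captions/vtt+en") into a
// track following the new convention.
function migrateCaptionAttachment(attachment: Element): Element {
  const [, subFlavor] = attachment.flavor.split("/");
  const plus = subFlavor.indexOf("+");
  const lang = plus === -1 ? undefined : subFlavor.slice(plus + 1);
  return {
    ...attachment,
    flavor: "captions/delivery",
    tags: lang ? [...attachment.tags, `lang:${lang}`] : attachment.tags,
  };
}

migrateCaptionAttachment({
  flavor: "captions/vtt+en",
  tags: [],
  mimetype: "text/vtt",
  url: "https://example.org/captions.vtt",
});
// → { flavor: "captions/delivery", tags: ["lang:en"], … }
```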
For already published media, it would be great if we could still have a fallback in players, so that old media are still displayed correctly. That is why I suggest that players implement the following fallback mechanism:

1. Select all attachments with the main flavor `captions`. This should already give us a list of all old subtitles.
2. If a sub-flavor of the form `…+<lang>` is present, treat it as if a tag of the form `lang:<lang>` were present.
3. Let players fall back to these attachments if no subtitle tracks are present.
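Sketched in the same simplified model as above, the whole fallback might look like this (illustrative, not actual player code):

```typescript
interface Element {
  flavor: string; // "<main>/<sub>"
  tags: string[];
}

// Fallback for pre-migration publications: attachments are only consulted
// when the media package contains no subtitle tracks.
function selectCaptions(tracks: Element[], attachments: Element[]): Element[] {
  const subtitleTracks = tracks.filter((t) => t.flavor.startsWith("captions/"));
  if (subtitleTracks.length > 0) {
    return subtitleTracks; // new-style tracks win
  }
  return attachments
    .filter((a) => a.flavor.startsWith("captions/"))
    .map((a) => {
      // Derive lang:<lang> from a "…+<lang>" sub-flavor if present.
      const plus = a.flavor.indexOf("+");
      if (plus === -1) return a;
      const lang = a.flavor.slice(plus + 1);
      return { ...a, tags: [...a.tags, `lang:${lang}`] };
    });
}
```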