By Werner Ruotsalainen on Mon, 04/23/2012
UPDATE (12/Sept/2012): in the meantime, Subler, one of the best remuxer and MP4 editor tools, has also received OCR capabilities. It has some advantages over the character recognition of SubRip, which is introduced in the article below. Please read THIS for a full tutorial.
UPDATE (28/08/2012): today, I've played quite a bit with Project-X to make mass-demuxing possible: to quickly demux the DVB subtitle tracks from my TS files, as HandBrake (the tool I use for recompressing the files to H.264/MP4 ones) can't read DVB subtitles. (It doesn't even list them in the source TS files.)
Unfortunately, just dragging more than one file to the list view doesn't seem to work. If you run demuxing (“prepare>>”), then, after the first source video, all the subsequently demuxed tracks will have the same name as those of the first source video, differing only in a continuously increasing index. (This index, BTW, shows the number of the extracted subtitle track if there is more than one, as is the case with most Finnish broadcasts with foreign-language originals on the YLE / AVA channels. NB: for the last 4-5 months, the YLE files distributed by TVKaista have no longer contained teletext information or a track for the hearing-impaired: while there still is a Dutch (“Nederlands”) DVB subtitle track, it won't have any information. It's only in truly foreign-language original TS files that there will be any kind of subtitles.)
I've found out the following:
- you don't need to supply any(!!!) runtime parameter: once you've configured the app (PreSettings > PreSettings), the configuration will be permanently stored in the configuration file “X.ini”. For example, if you configure the app in the GUI to only export subtitle tracks, then, when you switch to the command line (making sure X.ini is in the same directory), the settings will be remembered and only subtitle tracks will be demuxed. This is a BIG advantage as no one needs to learn myriads of command-line options.
- while you can pass more than one file to Project-X using wildcards like * (for example, “*.ts” or “A-plus_2011*.ts” as a parameter of “java -jar ProjectX.jar”), with more than 3-4 input files, only zero-sized output will be created. (I've played with this quite a bit and couldn't find an internal solution that avoids external, shell-based file lookup and looping.) That is, never ever pass wildcards directly to Project-X on the command line!
- to process many files matched by a wildcard, use the just-mentioned shell looping instead. In OS X, for example, you'll want to use “for f in *.ts; do java -jar ProjectX.jar "$f"; done”. This command iterates over all the .ts files in the current directory and passes their names to Project-X one after another. This works even with hundreds of gigabytes of source TS files and greatly helps in quickly and automatically(!!) demuxing only the types of tracks you need.
HERE's a shell script doing this.
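A minimal sketch of such a loop follows; it is not necessarily identical to the linked script. It assumes that ProjectX.jar and the already-configured X.ini sit in the directory you run it from. The function names `demux_all` and `list_empty_outputs` and the `PROJECTX_JAR` variable are hypothetical, introduced here only for illustration.

```shell
#!/bin/sh
# Sketch only: one Project-X run per .ts file, as in the one-liner above.
# Assumption: ProjectX.jar and X.ini are in the current directory.
# PROJECTX_JAR is a hypothetical override for the jar location.

demux_all() {
    dir="${1:-.}"
    jar="${PROJECTX_JAR:-ProjectX.jar}"
    for f in "$dir"/*.ts; do
        # If no .ts file exists, the pattern stays unexpanded; skip it.
        [ -e "$f" ] || continue
        echo "Demuxing: $f"
        # Quoting "$f" keeps file names with spaces intact.
        java -jar "$jar" "$f" || echo "Project-X failed on: $f" >&2
    done
}

# List zero-sized files (other than .ts sources) in a directory, i.e. the
# symptom described above for runs with too many input files.
list_empty_outputs() {
    find "${1:-.}" -maxdepth 1 -type f ! -name '*.ts' -size 0 -print
}
```

After sourcing the file, calling `demux_all /path/to/recordings` processes that directory, and `list_empty_outputs /path/to/recordings` afterwards shows whether any empty files were left behind.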
1, you can also export SUB (as opposed to the far less supported SUP) files directly from Project-X. (That is, you won't need to run SubtitleCreator at all for the SUP -> SUB conversion, which also means that, if you have a Mac, you can stay right under OS X if all you want is to extract SUB files from your TS files.) Just make sure you check the checkbox “additional export as VobSub” in the “Subtitle” tab of PreSettings > PreSettings (annotated by a red rectangle below):
2, Also, you can greatly speed up the demuxing process by not letting the app demux the audio and video streams. Just uncheck the following checkboxes on the “Output” tab:
Original article follows:
(Note that this article is also part of my Multimedia bible series. I publish it as a separate entity as I've received some E-mails from my readers asking for more info on handling subtitles in files containing recordings of standard digital DVB TV broadcasts.)
You may already have seen subtitles in DVB recordings. If you haven't, take a look at, for example, THIS with either VLC on your desktop computer or, say, GoodPlayer on your iPhone / iPad / iPod touch (iDevice for short). In the former, select Video > Subtitles Track > DVB Subtitles [Suomi] or [Nederlands]; in the latter, just swipe the screen up/down to switch subtitles (the name of the currently selected one will be displayed).
What you see are DVB subtitles recorded from TV. (The “Suomi” track only contains Swedish -> Finnish translations; the “Nederlands” one will contain even the transcription of the originally Finnish speech.) In this article, you'll see how you can extract (demux), convert and recognize these subtitles so that they become standard ones embeddable in standardized MP4 files.
Why is this needed, you may ask. The answer is simple: MPEG2 TS isn't natively played back by iDevices, as opposed to H.264 in a MOV/MP4 container, which is much more advanced in every respect (storage requirements etc.). Playing TS files therefore means far higher CPU / battery usage, a far hotter device (particularly with the iPad 3) and the need to purchase GoodPlayer, the best TS player. (The second-best TS player, XBMC, is free, but it requires jailbreaking, which is still not available for a wide array of model / iOS combinations.)
However, the MP4 containers are far more restricted than more advanced ones like, for example, the most popular video container format as of today, Matroska (using .MKV files). Among other things, they, officially, can't include any kind of non-dumb (think of ASS or even SSA subtitles) or graphical (VobSub, SUP etc.) subtitle formats. (Incidentally, they do support MP4 Closed Captions, which are entirely different, are very rarely used in non-Apple products and, in general, are only utilized by the videos available in Apple's own iTunes Store, including the iTunes U.) While some players (for example, VLC – see THIS) do play back non-standard MP4 files with graphical (in particular, VobSub) subtitles, Apple's hardware decoder doesn't.
This means you MUST convert all your original subtitles to SRT so that they can be displayed by the hardware player (unless you use a third-party player that plays video using the hardware player but also displays the subtitles over the hardware player's output).
This conversion is, in some cases, far from trivial. A typical example is Transport Stream (TS) files, natively produced by most cable TV recorders and also offered by online TV archive services like TVKaista (in Finland). These services (assuming they don't use “hard”, that is, “burnt-in” subs) all use the SUP format for subtitles, which is, generally, a series of two-bit images shown by the TV as an overlay over the original video. These SUP files have absolutely no textual representation.
You'll need to process your source TS files the following way:
1, demux the TS file with Project-X to get the separated audio / video / subtitle tracks.
2, in SubtitleCreator, open the subtitle (SUP) file exported by Project-X and save it in VobSub format.
3, in SubRip, open the VobSub file saved by SubtitleCreator in Step 2 and run Optical Character Recognition (OCR for short) on it. Save the results as an SRT file.
(4, you can already directly embed the SRT file created in Step 3 in the target (converted) MP4 file with Subler, as is explained in my dedicated, previous articles.)
I've created a quick video showing the first three steps:
First, I start with Project-X (see Step 1 above). At 0:14, I open the source TS file and go (0:20) straight to “prepare >>”, where, after having made sure the “demux” radio button is selected in the upper left corner (I quickly hover the mouse cursor over it at 0:21), I click the Play / Pause button (0:22).
The demuxing process finishes at 0:29. Then, I reload the Finder view to show you the new files created by Project-X and switch to Windows. (Note that I could have run Project-X under Windows too; I only ran it under OS X because I prefer working there. The other two apps, unfortunately, only run under Windows; this is why I needed to use it.) There, I start SubtitleCreator (installed previously) at 0:36. At 0:41, I pick the newly-created subtitle file. At 0:50, I quickly scroll the bottom center scrollable area to show that all images have been loaded and then, at 0:53, I immediately select “File > Save VobSub”. There, at 0:57, I select the language (it'll be used by SubRip to present the accented characters in a toolbox for quick selection during the OCR training phase).
Then, at 1:17, I start SubRip (decompressed previously). At 1:22, via File > Open VOB(s), I press the Open Dir button and (slowly – sorry for the speed, I use Windows emulation, which needs a bit of time to read the mapped directories from OS X) navigate to the VobSub file created in the previous step.
When I press OK, the OCR window is immediately displayed by SubRip and I start the OCR process. At first, SubRip asks me to train all the characters it doesn't recognize: T, u, l, e, h, a, n, y, t, comma, ä (clicked on the language-specific toolbar below the input field), k, I, dot and so on.
Note that at 2:16, an error message is displayed (probably because I've selected the wrong subtitle track? Dunno. My other tests with the same input TS file resulted in correctly readable subtitles.) I (slowly) get rid of the error dialog and try continuing the OCR process only to find out the source is, for some reason, indeed messed up. After having realized this, I save the SRT file (at 2:43) and, finally, at 2:51, I quickly check out (in Total Commander) the contents of the just-created (and Subler-compliant) SRT file to show you it's indeed standard.
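If you'd rather check the result from a shell than eyeball it in a file manager, counting the timing lines is a quick sanity test. The sketch below assumes the SRT file was saved in an 8-bit or UTF-8 encoding (not UTF-16) and uses the usual `HH:MM:SS,mmm --> HH:MM:SS,mmm` timing layout; `count_srt_cues` is a hypothetical helper name, not part of SubRip or Subler.

```shell
#!/bin/sh
# Sketch only: count the lines that look like SRT timing lines, e.g.
#   00:00:01,000 --> 00:00:03,500
# The pattern is anchored at the start of the line only, so Windows-style
# CRLF line endings do not prevent a match.
count_srt_cues() {
    grep -c -E '^[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} --> [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}' "$1"
}
```

A count of zero for a file that should contain subtitles suggests that the save went wrong or that the file uses an encoding this simple pattern can't read.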
Additional remarks & tips (also elaborating on other and, at least for this, not recommended titles)
Note that the section below is in no way needed for the above to work. I only provide it as background on the subtitle handlers currently available, should you want to know why some of the other titles are incapable of handling the conversion process outlined above.
Project-X demuxes the stream just fine, also creating an ifo/sub pair for each of the (valid) subtitle tracks. For example, for THIS one, there'll be one with 10 and another with 2 subtitle pages.
You can directly read in the SUP files via the free, Windows-only (runs OK under Parallels) SupViewer for checking (but not for image export or OCR). This app also has a neat feature: merging two files and, this way, displaying, for example, subtitles in two languages at the same time. Upon loading the second subtitle SUP stream (see the “Load/merge secondary SUP” icon at the top center), you'll need to supply a Y offset (by default, -400) to the current one.
SupRip 1.16 (note the P!) doesn't work reliably: it was able to correctly decode (but, at least without OCR training, not correctly recognize) the (soft) timestamp in the SUP files demuxed from floatplanepassing.mts (linked from HERE), but not the ones demuxed (by Project-X) from Mother of mine, for which it displayed an exception.
SubRip 1.50 Beta 5 (note the B!) can't read SUP files demuxed from (M(2))TS files at all, “only” image sequences, original VOB files and VobSub files. The OCR process is highly reliable. It first asks for every new letter it encounters; typically, this is done only once for each character. After that, the recognition is 100% exact, even for non-English (here: Finnish) text. The training can be saved (Characters Matrix > Save Characters Matrix File) and later retrieved so that TS streams using the same character set don't need to be re-trained. In addition, before closing the subtitle text window containing the OCR'ed text (it's displayed automatically after the OCR process finishes and can also be toggled with the third icon on the top left), you can also run a dictionary-based spell checker (Corrections > Post OCR Spelling Correction) to further fix the errors. There are dictionaries for most major languages. The image sequence exported by DVD Subtitle Tools (see next paragraph) can be processed directly by SubRip, but SubRip can in no way process the single TXT file containing the timing data that DVD Subtitle Tools also exports. Note that there isn't a way of quickly removing / re-editing a character you've entered: you need to go to Characters Matrix > View/Edit Characters Matrix, select the last entry in the leftmost list and enter the right equivalent in the “Modify” input text field in the right center. (Alternatively, you can also delete the record.)
The decoder (DVDSupDecode.exe) of DVD Subtitle Tools 1.62, in addition to displaying timing information (when individual BMP files are displayed), is also capable of BMP dumping if you supply the additional command line switch “-bitmap” (as in “DVDSupDecode.exe -bitmap Kotikatsomo—Aideista-parhain--K11-_2011.05.09_YLE-TV1_12882718[copy]-02.sup”)
BDSup2Sub 4.0.0 (somewhat more up-to-date alternative; discussion) can't open the SUP files extracted from TS files either. It seems to be only compatible with native SUP files extracted from BD discs but not with those from standard SD transport streams.
SubtitleCreator 2.3 rc1: it also has some OCR capabilities, but they require Microsoft Office 2003 (it seems to be incompatible with Microsoft Office 2010, at least on my Windows XP running under Parallels). As with DVD Subtitle Tools, it's also able to export all the constituent images from a SUP input file. However, its real strength lies in its VobSub export capabilities: SubRip can read the VobSub file it creates with all its goodies (most importantly, the timecodes). (Note that SubtitleCreator also uses code from the older and no longer supported / developed sup2vobsub. That initially Finnish app (the original Finnish-language thread is HERE; the English-language one HERE) could also convert SUP files demuxed from TS streams into VobSub streams that SubRip, the de facto tool for OCR'ing originally image-based subtitles, can handle. Now that SubtitleCreator also has a GUI (as opposed to the strictly command-line sup2vobsub), however, I no longer recommend sup2vobsub.)