approve_matches: Created, split from search_subs

There are now two steps in subtitle searching:
1. Save all matches to a CSV.
2. Iterate over this CSV and manually approve or dismiss the matches.

This has the advantage that time consuming searching is done only once,
not every time you want to approve new matches.

Executables, README, and setup.py were updated to reflect this change.

Module common.py was created -- it contains code shared by search_subs
and approve_matches.
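
The split can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this commit: it uses the standard `csv` module with a `;` delimiter in place of `listio`, and `save_matches`, `load_matches`, and `approve` are hypothetical names.

```python
import csv
import io


def save_matches(matches, fileobj):
    """Step 1: write every match as one ';'-separated row."""
    writer = csv.writer(fileobj, delimiter=';')
    for match in matches:
        writer.writerow(
            (match['file_path'], match['time_start'], match['time_end'])
        )


def load_matches(fileobj):
    """Step 2a: read the saved rows back without searching again."""
    for row in csv.reader(fileobj, delimiter=';'):
        yield {
            'file_path': row[0],
            'time_start': float(row[1]),
            'time_end': float(row[2]),
        }


def approve(matches, decide):
    """Step 2b: turn each match into a 'Y'/'N' answer row.

    `decide` stands in for the interactive prompt (an assumption).
    """
    for match in matches:
        yield (
            'Y' if decide(match) else 'N',
            match['file_path'],
            match['time_start'],
            match['time_end'],
        )
```

Because the matches live in a file, the approval step can be repeated or resumed without paying for the search again.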
master
Jakub Valenta 4 years ago
parent
commit
a1befc7706
11 changed files with 420 additions and 255 deletions
  1. README.md (+50 -11)
  2. setup.py (+12 -6)
  3. tv-series-check-approved-subs (+0 -5)
  4. tv-series-matches-approve (+5 -0)
  5. tv-series-matches-check-approved (+5 -0)
  6. tv-series-matches-print-approved (+5 -0)
  7. tv-series-print-approved-subs (+0 -5)
  8. tv-series-search-subs (+1 -1)
  9. tv_series/approve_matches.py (+230 -0)
  10. tv_series/common.py (+72 -0)
  11. tv_series/search_subs.py (+40 -227)

+ 50
- 11
README.md

@@ -1,7 +1,7 @@
# TV Series Tools

- Download subtitles for all TV series episodes.
- Search subtitles for specific dialogues and create transcript for [videogrep](https://github.com/antiboredom/videogrep).
- Search subtitles for specific expressions.

## Installation

@@ -24,6 +24,10 @@ Then you can call the executables:
```
./tv-series-download-subs -h
./tv-series-find-episode-ids -h
./tv-series-search-subs -h
./tv-series-matches-approve -h
./tv-series-matches-check-approved -h
./tv-series-matches-print-approved -h
```

Or you can install this software as a Python package, which will also install all the dependencies and make the executables available globally:
@@ -33,11 +37,15 @@ python2 setup.py install

tv-series-download-subs -h
tv-series-find-episode-ids -h
tv-series-search-subs -h
tv-series-matches-approve -h
tv-series-matches-check-approved -h
tv-series-matches-print-approved -h
```

## Usage

This software works in two phases:
This software works in several phases:

### 1. Find IMDB IDs for all episodes of passed TV series

@@ -55,17 +63,17 @@ True Detective
Then call:

```
tv-series-find-episode-ids -i my_series.txt -o my_episode_ids.txt
tv-series-find-episode-ids -i my_series.txt -o my_episodes.csv
```

Episode IDs for all the TV series mentioned in `my_series.txt` will be written to `my_episode_ids.txt` like this:
Episode IDs for all the TV series mentioned in `my_series.txt` will be written to `my_episodes.csv` like this:

```
2510426
2580386
2639284
2545702
2639288
2510426;"Title of the episode"
2580386;"Title of the episode"
2639284;"Title of the episode"
2545702;"Title of the episode"
2639288;"Title of the episode"
```
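
If you want to read this file from your own script, a minimal sketch could look like the one below. It assumes the file is exactly the `;`-separated, double-quoted format shown above; `read_episodes` is a hypothetical helper, not part of this package.

```python
import csv
import io


def read_episodes(fileobj):
    """Yield (imdb_id, title) pairs from a ';'-separated episode file."""
    for row in csv.reader(fileobj, delimiter=';', quotechar='"'):
        if not row:
            # Skip empty lines.
            continue
        yield row[0], row[1]


# In-memory stand-in for my_episodes.csv.
sample = io.StringIO(
    '2510426;"Title of the episode"\n'
    '2580386;"Title of the episode"\n'
)
episodes = list(read_episodes(sample))
```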

### 2. Download subtitles for passed IMDB IDs
@@ -74,7 +82,7 @@ Sign up at [OpenSubtitles.org](https://www.opensubtitles.org/). Consider buying

Set environment variables `OPENSUB_USER` and `OPENSUB_PASSWD` to contain your OpenSubtitles.org credentials.

```sh
```
export OPENSUB_USER='you@example.com'
export OPENSUB_PASSWD='yourpassword'
```
@@ -82,11 +90,42 @@ export OPENSUB_PASSWD='yourpassword'
Then call:

```
tv-series-download-subs -i my_episode_ids.txt -o my_subs/
tv-series-download-subs -i my_episodes.csv -o my_subs/
```

All the episodes' subtitles will be downloaded to the directory `my_subs/` as SRT files.

### 3. Search downloaded subtitles

my_regex.txt:

```
one.*regular expression per line
case insensitive
```

```
tv-series-search-subs -i my_subs/ -p my_regex.txt -o my_matches.csv
```
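
To get a feel for how such a pattern file behaves, here is a small sketch. It assumes each non-empty line is compiled with `re.IGNORECASE` and that a subtitle line matches when any pattern is found in it; `compile_patterns` and `matches_any` are hypothetical names, and the package's own matching code may differ.

```python
import re


def compile_patterns(lines):
    """Compile one case-insensitive regex per non-empty line."""
    return [
        re.compile(line.strip(), re.IGNORECASE)
        for line in lines
        if line.strip()
    ]


def matches_any(text, regex_list):
    """Return True if at least one pattern is found anywhere in text."""
    return any(regex.search(text) for regex in regex_list)


patterns = compile_patterns([
    'one.*regular expression per line',
    'case insensitive',
])
```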

### 4. Approve matches

```
tv-series-matches-approve -i my_matches.csv -o my_answers.csv
```

### 5. Check all positive answers again

```
tv-series-matches-check-approved -i my_answers.csv -o my_answers_checked.csv
```

### 6. Print all positive answers

```
tv-series-matches-print-approved -i my_answers_checked.csv
```

## Help

Call any of the scripts mentioned in [Usage](#usage) with the parameter `-h` or `--help` to see full documentation. Example:


+ 12
- 6
setup.py

@@ -45,12 +45,18 @@ setup(

entry_points={
'console_scripts': [
'tv-series-download-subs=tv_series.download_subs:download_subs_and_cache_results',
'tv-series-find-episode-ids=tv_series.find_episode_ids:find_and_write_episode_ids',
'tv-series-search-subs=tv_series.search_subs:search_and_approve_subs',
'tv-series-chech-approved-subs=tv_series.search_subs:check_approved_subs',
'tv-series-print-approved-subs=tv_series.search_subs:print_approved_subs',
'tv-series-video=tv_series.video:create_super_cut',
'tv-series-download-subs='
'tv_series.download_subs:download_subs_and_cache_results',
'tv-series-find-episode-ids='
'tv_series.find_episode_ids:find_and_write_episode_ids',
'tv-series-search-subs='
'tv_series.search_subs:search_subs_and_save_matches',
'tv-series-matches-approve='
'tv_series.approve_matches:approve_matches_and_save_answers',
'tv-series-matches-check-approved='
'tv_series.approve_matches:check_positive_answers',
'tv-series-matches-print-approved='
'tv_series.approve_matches:print_positive_answers',
],
},
)

+ 0
- 5
tv-series-check-approved-subs

@@ -1,5 +0,0 @@
#!/usr/bin/env python

import tv_series.search_subs
if __name__ == '__main__':
    tv_series.search_subs.check_approved_subs()

+ 5
- 0
tv-series-matches-approve

@@ -0,0 +1,5 @@
#!/usr/bin/env python

import tv_series.approve_matches
if __name__ == '__main__':
    tv_series.approve_matches.approve_matches_and_save_answers()

+ 5
- 0
tv-series-matches-check-approved

@@ -0,0 +1,5 @@
#!/usr/bin/env python

import tv_series.approve_matches
if __name__ == '__main__':
    tv_series.approve_matches.check_positive_answers()

+ 5
- 0
tv-series-matches-print-approved

@@ -0,0 +1,5 @@
#!/usr/bin/env python

import tv_series.approve_matches
if __name__ == '__main__':
    tv_series.approve_matches.print_positive_answers()

+ 0
- 5
tv-series-print-approved-subs

@@ -1,5 +0,0 @@
#!/usr/bin/env python

import tv_series.search_subs
if __name__ == '__main__':
    tv_series.search_subs.print_approved_subs()

+ 1
- 1
tv-series-search-subs

@@ -2,4 +2,4 @@

import tv_series.search_subs
if __name__ == '__main__':
tv_series.search_subs.search_and_approve_subs()
tv_series.search_subs.search_subs_and_save_matches()

+ 230
- 0
tv_series/approve_matches.py

@@ -0,0 +1,230 @@
import math
import re
import sys

import listio
from termcolor import colored

from tv_series import common


def convert_match_to_answer(match, yes_or_no, start, end):
    return (
        yes_or_no.upper(),
        match['file_path'],
        match['time_start'],
        match['time_end'],
        start,
        end,
    )


def convert_answer_to_match(answer):
    return {
        'file_path': answer[1],
        'time_start': float(answer[2]),
        'time_end': float(answer[3]),
    }


def find_subs_context(subs, current, no=1):
    context = []
    for i in range(current - no, current + no + 1):
        if i < 0 or i >= len(subs):
            continue
        context.append(subs[i])
    return context


def add_subs_context_to_matches(matches, no=1):
    for match in matches:
        subs = common.read_subs(match['file_path'])
        if subs is None:
            continue
        for i, sub in enumerate(subs):
            sub_match = common.convert_sub_to_match(match['file_path'], sub)
            if sub_match != match:
                continue
            match['subs_context'] = find_subs_context(subs, i, no)
            yield match


def format_sub_match_with_context(match, color=True, i=None, total=None):
    if color:
        file_path_formatted = colored(match['file_path'], attrs=('bold',))
    else:
        file_path_formatted = match['file_path']

    if i is not None:
        if total is not None:
            index_formatted = '{}/{} '.format(i, total)
        else:
            index_formatted = '{} '.format(i)
    else:
        index_formatted = ''

    return(
        '\n'
        '{index_formatted}'
        '{file_path} {time_start} --> {time_end}\n'
        '\n'
        '{subs_context}\n'
        '\n'
        .format(
            file_path=file_path_formatted,
            time_start=match['time_start'],
            time_end=match['time_end'],
            subs_context=common.format_subs(match['subs_context'], color),
            index_formatted=index_formatted
        )
    )


def approve_matches(matches, totals=False):
    if totals:
        matches = list(matches)
        total = len(matches)
    else:
        total = None
    for i, match in enumerate(matches):
        print(format_sub_match_with_context(match, i=i+1, total=total))

        inp = None
        while inp is None or (inp not in ('y', 'n', '?', '') and
                              not re.match(r'^\d{1,2}$', inp)):
            print('Do you like this match? "y" = yes, "n" or nothing = no,'
                  ' "?" = ask again next time, "AB" start at line number A and'
                  ' end at B')
            inp = input('--> ')

        if inp == '?':
            continue
        if inp in ('', 'y', 'n'):
            if inp == '':
                yes_or_no = 'n'
            else:
                yes_or_no = inp
            no_start = 0
            no_end = no_start
        elif len(inp) == 1:
            yes_or_no = 'y'
            no_start = int(inp)
            no_end = no_start
        else:
            yes_or_no = 'y'
            no_start = -1 * int(inp[0])
            no_end = int(inp[1])

        i_middle = math.floor(len(match['subs_context']) / 2)
        i_start = i_middle + no_start
        i_end = i_middle + no_end
        yield convert_match_to_answer(
            match,
            yes_or_no,
            common.parse_sub_time(match['subs_context'][i_start].start),
            common.parse_sub_time(match['subs_context'][i_end].end)
        )


def filter_approved_answers(answers):
    return (
        answer
        for answer in answers
        if answer[0] == 'Y'
    )


def filter_not_answered_matches(matches, answer_matches):
    for match in matches:
        if match in answer_matches:
            print('ALREADY ANSWERED "{f}" {s} --> {e}'.format(
                f=match['file_path'],
                s=match['time_start'],
                e=match['time_end']))
            continue
        yield match


def approve_matches_and_save_answers():
    import argparse

    parser = argparse.ArgumentParser(
        description='TV Series Tools: Check matches and save answers'
    )
    parser.add_argument('--input', '-i', dest='inputfile', required=True,
                        help='path to a file with the matches')
    parser.add_argument('--output', '-o', dest='outputfile', required=True,
                        help='path to a file in which the answers will be'
                        ' stored')
    parser.add_argument('--totals', '-t', dest='totals', action='store_true',
                        help='show the total number of matches to answer;'
                        ' all matches must then be loaded up front, so'
                        ' with many matches it takes a while before the'
                        ' first question shows up')
    args = parser.parse_args()

    matches_list = listio.read_map(args.inputfile)
    matches = (common.convert_list_to_match(l) for l in matches_list)
    answers = listio.read_map(args.outputfile)
    # Using list instead of generator so that we can read it several times.
    matches_answered = [convert_answer_to_match(answer) for answer in answers]
    matches_not_answered = filter_not_answered_matches(
        matches,
        matches_answered
    )
    matches_with_context = add_subs_context_to_matches(matches_not_answered, 2)
    answers = approve_matches(matches_with_context, totals=args.totals)

    listio.write_map(args.outputfile, answers)

    sys.exit()


def check_positive_answers():
    import argparse

    parser = argparse.ArgumentParser(
        description='TV Series Tools: Re-check positive answers'
    )
    parser.add_argument('--input', '-i', dest='inputfile', required=True,
                        help='path to a file with the answers')
    parser.add_argument('--output', '-o', dest='outputfile', required=True,
                        help='path to a file in which the new answers will be'
                        ' stored')
    args = parser.parse_args()

    answers = listio.read_map(args.inputfile)
    approved = filter_approved_answers(answers)
    matches = (convert_answer_to_match(answer) for answer in approved)
    matches_with_context = add_subs_context_to_matches(matches, 3)
    new_answers = approve_matches(matches_with_context)

    listio.write_map(args.outputfile, new_answers)

    sys.exit()


def print_positive_answers():
    import argparse

    parser = argparse.ArgumentParser(
        description='TV Series Tools: Print positive answers with context'
    )
    parser.add_argument('--input', '-i', dest='inputfile', required=True,
                        help='path to a file with the answers')
    args = parser.parse_args()

    answers = listio.read_map(args.inputfile)
    approved = filter_approved_answers(answers)
    matches = (convert_answer_to_match(answer) for answer in approved)
    matches_with_context = add_subs_context_to_matches(matches, 3)
    # Iterate explicitly: a bare generator expression is never consumed,
    # so nothing would be printed.
    for match in matches_with_context:
        print(format_sub_match_with_context(match, color=False))

    sys.exit()


if __name__ == '__main__':
    approve_matches_and_save_answers()

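The answer syntax accepted by `approve_matches` is compact, so a standalone sketch of the parsing may help when reviewing it. The function below mirrors the branches in the diff above as I read them; `parse_answer` is a hypothetical name and is not part of this commit.

```python
import re


def parse_answer(inp):
    """Return (yes_or_no, no_start, no_end), or None for "ask again".

    no_start and no_end are offsets relative to the matched subtitle line.
    """
    if inp == '?':
        # Skip for now; the match stays unanswered.
        return None
    if inp in ('', 'n'):
        return ('n', 0, 0)
    if inp == 'y':
        return ('y', 0, 0)
    if re.fullmatch(r'\d', inp):
        # One digit: start and end share the same offset.
        offset = int(inp)
        return ('y', offset, offset)
    if re.fullmatch(r'\d\d', inp):
        # "AB": start A lines before the match, end B lines after it.
        return ('y', -int(inp[0]), int(inp[1]))
    raise ValueError('unrecognized answer: {!r}'.format(inp))
```

For example, the input `12` approves the match and extends it from one line before the matched subtitle to two lines after it.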
+ 72
- 0
tv_series/common.py

@@ -0,0 +1,72 @@
import math

import pysrt
from termcolor import colored


def parse_sub_text(s):
    return s.replace('\n', ' ')


def parse_sub_time(t):
    return t.ordinal / 1000


def convert_seconds_to_subriptime(t):
    milliseconds = t * 1000
    return pysrt.SubRipTime(milliseconds=milliseconds)


def convert_sub_to_match(file_path, sub):
    return {
        'file_path': file_path,
        'time_start': parse_sub_time(sub.start),
        'time_end': parse_sub_time(sub.end),
    }


def convert_list_to_match(l):
    return {
        'file_path': l[0],
        'time_start': float(l[1]),
        'time_end': float(l[2]),
    }


def convert_match_to_list(match):
    return (
        match['file_path'],
        match['time_start'],
        match['time_end'],
    )


def read_subs(file_path):
    try:
        return pysrt.open(file_path)
    except UnicodeDecodeError:
        print(colored('ERROR Failed to parse "{}"'
                      .format(file_path), 'red'))
        return None


def format_subs(subs, color=True):
    if type(subs) != tuple and type(subs) != list:
        subs = [subs]
    middle = math.floor(len(subs) / 2)
    formatted = []
    for i, sub in enumerate(subs):
        no = i - middle
        if no == 0 and color:
            text = colored(sub.text, 'blue')
        else:
            text = sub.text
        formatted.append('{no} {text:<80} {start} --> {end}'.format(
            no=abs(no),
            text=parse_sub_text(text),
            start='{:02d}:{:02d}:{:02d}'
            .format(sub.start.hours, sub.start.minutes, sub.start.seconds),
            end='{:02d}:{:02d}:{:02d}'
            .format(sub.end.hours, sub.end.minutes, sub.end.seconds),
        ))
    return '\n'.join(formatted)

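common.py keeps match times as float seconds (`t.ordinal / 1000`, with `ordinal` in milliseconds) and prints them as HH:MM:SS. The sketch below reproduces that conversion with the standard library only, so it can be tried without pysrt; `ms_to_seconds` and `format_hms` are hypothetical names for illustration.

```python
def ms_to_seconds(ordinal_ms):
    """Convert a millisecond count to float seconds."""
    return ordinal_ms / 1000


def format_hms(seconds):
    """Format float seconds as HH:MM:SS, dropping the fractional part."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return '{:02d}:{:02d}:{:02d}'.format(hours, minutes, secs)
```

Storing plain floats in the CSV means matches read back with `float()` compare equal to freshly computed ones, which the dictionary equality checks in approve_matches.py appear to rely on.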
+ 40
- 227
tv_series/search_subs.py

@@ -1,56 +1,11 @@
import math
import os
import re
import readline
import sys

import listio
import pysrt
from termcolor import colored


def parse_sub_text(s):
return s.replace('\n', ' ')


def parse_sub_time(t):
return t.ordinal / 1000


def convert_sub_to_match(file_path, sub):
return {
'file_path': file_path,
'time_start': parse_sub_time(sub.start),
'time_end': parse_sub_time(sub.end),
}


def convert_match_to_answer(match, yes_or_no, start, end):
return (
yes_or_no.upper(),
match['file_path'],
match['time_start'],
match['time_end'],
start,
end,
)


def convert_answer_to_match(answer):
return {
'file_path': answer[1],
'time_start': float(answer[2]),
'time_end': float(answer[3]),
}


def find_subs_context(subs, current, no=1):
context = []
for i in range(current - no, current + no + 1):
if i < 0 or i >= len(subs):
continue
context.append(subs[i])
return context
from tv_series import common


def compile_regex(pattern):
@@ -65,153 +20,53 @@ def search_line(line, regex_list):
return False


def read_subs(file_path):
try:
return pysrt.open(file_path)
except UnicodeDecodeError:
print('ERROR Subtitle file "{}" could not be read.'.format(file_path))
return None


def iter_subs(dir_path):
def iter_subs_files(dir_path):
d = os.scandir(dir_path)
for entry in sorted(d, key=lambda entry: entry.name):
if entry.name.startswith('.'):
continue
if entry.is_dir():
for ret in iter_subs(entry.path):
yield ret
for file_path in iter_subs_files(entry.path):
yield file_path
continue
if not entry.is_file() or not entry.name.endswith('.srt'):
continue
abspath = os.path.abspath(entry.path)
subs = read_subs(entry.path)
if subs is not None:
yield (abspath, subs)
file_path = os.path.abspath(entry.path)
print('READING "{}"'.format(file_path))
yield file_path


def search_subs(subs_dir, excl, regex_list):
for file_path, subs in subs_dir:
for i, sub in enumerate(subs):
match = convert_sub_to_match(file_path, sub)
def search_subs(paths_and_subs, excl, regex_list):
for file_path, subs in paths_and_subs:
for sub in subs:
match = common.convert_sub_to_match(file_path, sub)
if match in excl:
print('ANSWERED', match)
print('ALREADY PROCESSED "{f}" {s} --> {e}'.format(
f=match['file_path'],
s=match['time_start'],
e=match['time_end']))
continue
line = parse_sub_text(sub.text)
line = common.parse_sub_text(sub.text)
if not search_line(line, regex_list):
continue
subs_context = find_subs_context(subs, i, 2)
yield {
'file_path': match['file_path'],
'time_start': match['time_start'],
'time_end': match['time_end'],
'subs_context': subs_context,
}


def format_subs(subs, color=True):
if type(subs) != tuple and type(subs) != list:
subs = [subs]
middle = math.floor(len(subs) / 2)
formatted = []
for i, sub in enumerate(subs):
no = i - middle
if no == 0 and color:
text = colored(sub.text, 'red')
else:
text = sub.text
formatted.append('{no} {text:<80} {start} --> {end}'.format(
no=abs(no),
text=parse_sub_text(text),
start='{:02d}:{:02d}:{:02d}'
.format(sub.start.hours, sub.start.minutes, sub.start.seconds),
end='{:02d}:{:02d}:{:02d}'
.format(sub.end.hours, sub.end.minutes, sub.end.seconds),
))
return '\n'.join(formatted)


def format_sub_match_with_context(match, color=True):
if color:
file_path_formatted = colored(match['file_path'], attrs=('bold',))
else:
file_path_formatted = match['file_path']
return(
'\n'
'{file_path} {time_start} --> {time_end}\n'
'\n'
'{subs_context}\n'
'\n'
.format(
file_path=file_path_formatted,
time_start=match['time_start'],
time_end=match['time_end'],
subs_context=format_subs(match['subs_context'], color)
)
)


def approve_matches(matches):
for match in matches:
print(format_sub_match_with_context(match))

inp = None
while inp is None or (inp not in ('y', 'n', 'x', '') and
not re.match(r'^\d{1,2}$', inp)):
print('Do you like this match? "y" = yes, "n" or nothing = no,'
' "?" = ask again next time, "AB" start at line number A and'
' end at B')
inp = input('--> ')

if inp == '?':
continue
if inp in ('', 'y', 'n'):
if inp == '':
yes_or_no = 'n'
else:
yes_or_no = inp
no_start = 0
no_end = no_start
elif len(inp) == 1:
yes_or_no = 'y'
no_start = int(inp)
no_end = no_start
else:
yes_or_no = 'y'
no_start = -1 * int(inp[0])
no_end = int(inp[1])

i_middle = math.floor(len(match['subs_context']) / 2)
i_start = i_middle + no_start
i_end = i_middle + no_end
yield convert_match_to_answer(
match,
yes_or_no,
parse_sub_time(match['subs_context'][i_start].start),
parse_sub_time(match['subs_context'][i_end].end)
)
print(colored(
'MATCHED "{f}" {s} --> {e}'.format(
f=match['file_path'],
s=match['time_start'],
e=match['time_end']),
'green'))
yield match


def filter_approved_answers(answers):
for answer in answers:
if answer[0] == 'Y':
yield answer
def read_subs(paths):
for file_path in paths:
subs = common.read_subs(file_path)
if subs:
yield (file_path, subs)


def add_subs_context_to_matches(matches, no=1):
for match in matches:
subs = read_subs(match['file_path'])
if subs is None:
continue
for i, sub in enumerate(subs):
sub_match = convert_sub_to_match(match['file_path'], sub)
if sub_match != match:
continue
match['subs_context'] = find_subs_context(subs, i, no)
yield match


def search_and_approve_subs():
def search_subs_and_save_matches():
import argparse

parser = argparse.ArgumentParser(
@@ -220,74 +75,32 @@ def search_and_approve_subs():
parser.add_argument('--input', '-i', dest='inputdir', required=True,
help='path to a directory with the subtitle SRT files')
parser.add_argument('--output', '-o', dest='outputfile', required=True,
help='path to a file in which the answers will be'
help='path to a file in which the matches will be'
' stored')
parser.add_argument('--patterns', '-p', dest='patterns', required=True,
help='path to a file with search patterns')
args = parser.parse_args()

excl = [
convert_answer_to_match(answer)
for answer in listio.read_map(args.outputfile)
common.convert_list_to_match(match)
for match in listio.read_map(args.outputfile)
]
regex_list = [
compile_regex(pattern)
for pattern in listio.read_list(args.patterns)
]
subs_dir = iter_subs(args.inputdir)
matches_with_context = search_subs(subs_dir, excl, regex_list)
answers = approve_matches(matches_with_context)

listio.write_map(args.outputfile, answers)

sys.exit()


def check_approved_subs():
import argparse

parser = argparse.ArgumentParser(
description='TV Series Tools: Check approved subtitles'
)
parser.add_argument('--input', '-i', dest='inputfile', required=True,
help='path to a file with the answers')
parser.add_argument('--output', '-o', dest='outputfile', required=True,
help='path to a file in which the new answers will be'
' stored')
args = parser.parse_args()

answers = listio.read_map(args.inputfile)
approved = filter_approved_answers(answers)
matches = [convert_answer_to_match(answer) for answer in approved]
matches_with_context = add_subs_context_to_matches(matches, 3)
new_answers = approve_matches(matches_with_context)

listio.write_map(args.outputfile, new_answers)

sys.exit()


def print_approved_subs():
import argparse

parser = argparse.ArgumentParser(
description='TV Series Tools: Print answers with context'
paths = iter_subs_files(args.inputdir)
subs = read_subs(paths)
matches = search_subs(subs, excl, regex_list)
matches_list = (
common.convert_match_to_list(match)
for match in matches
)
parser.add_argument('--input', '-i', dest='inputfile', required=True,
help='path to a file with the answers')
args = parser.parse_args()

answers = listio.read_map(args.inputfile)
approved = filter_approved_answers(answers)
matches = [convert_answer_to_match(answer) for answer in approved]
matches_with_context = add_subs_context_to_matches(matches, 3)
[
print(format_sub_match_with_context(match, False))
for match in matches_with_context
]
listio.write_map(args.outputfile, matches_list)

sys.exit()


if __name__ == '__main__':
search_and_approve_subs()
search_subs_and_save_matches()
