elasticsearch - Searching subtitle data in Elasticsearch

Reposted. Author: 行者123. Updated: 2023-12-03 14:45:29

Given the following data (a simple SRT file):

1
00:02:17,440 --> 00:02:20,375
Senator, we're making our final

2
00:02:20,476 --> 00:02:22,501
approach into Coruscant.

...

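What is the best way to index this in Elasticsearch? Here is the catch: I want the search result highlights to link to the exact time indicated by the timestamp. There are also phrases that overlap multiple srt lines (such as "final approach" in the example above).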

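My ideas so far:
  • Index the srt file as a list type, with the timestamps as the index. I believe this would not match phrases that overlap multiple keys.
  • Create a custom tokenizer that indexes only the text part. I am not sure how far Elasticsearch can go in highlighting the original content.
  • Index only the text part and map it back to the timestamps outside of Elasticsearch.

  • Or is there another option that solves this elegantly?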
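
The third idea (index only the text and map matches back to timestamps outside of Elasticsearch) can be illustrated with a minimal sketch. It assumes the cue texts are joined with single spaces and that a plain substring search stands in for the real search engine; the helper names `build_offset_index` and `cues_for_phrase` are hypothetical.

```python
from bisect import bisect_right

# (start, end, text) tuples taken from the sample SRT data above
cues = [
    ("00:02:17,440", "00:02:20,375", "Senator, we're making our final"),
    ("00:02:20,476", "00:02:22,501", "approach into Coruscant."),
]


def build_offset_index(cues):
    """Join cue texts with single spaces and record where each cue starts."""
    starts, parts, pos = [], [], 0
    for _, _, text in cues:
        starts.append(pos)
        parts.append(text)
        pos += len(text) + 1  # +1 accounts for the joining space
    return " ".join(parts), starts


def cues_for_phrase(phrase, cues):
    """Return the cues a phrase touches, or [] if the phrase is not found."""
    full_text, starts = build_offset_index(cues)
    hit = full_text.find(phrase)
    if hit == -1:
        return []
    # Locate the cue containing the first and the last character of the hit
    first = bisect_right(starts, hit) - 1
    last = bisect_right(starts, hit + len(phrase) - 1) - 1
    return cues[first:last + 1]


for start, end, text in cues_for_phrase("final approach", cues):
    print(start, end, text)
```

The phrase "final approach" straddles both cues, so both are returned along with their timestamps; a phrase contained in one cue returns only that cue.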

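    Best answer

    Interesting problem. Here's my take.
    In essence, subtitles "don't know" about each other, which means it is best to include the previous and the next subtitle text in each document (n - 1, n, n + 1).
    So you'd need a document structure similar to: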

    {
      "sub_id" : 0,
      "start" : "00:02:17,440",
      "end" : "00:02:20,375",
      "text" : "Senator, we're making our final",
      "overlapping_text" : "Senator, we're making our final approach into Coruscant."
    }
    To arrive at such a document structure, I used the following (inspired by this excellent answer):
    from itertools import groupby
    from collections import namedtuple


    def parse_subs(fpath):
        # "chunk" our input file, delimited by blank lines
        with open(fpath) as f:
            res = [list(g) for b, g in groupby(f, lambda x: bool(x.strip())) if b]

        Subtitle = namedtuple('Subtitle', 'sub_id start end text')

        subs = []

        # grouping
        for sub in res:
            if len(sub) >= 3:  # not strictly necessary, but better safe than sorry
                sub = [x.strip() for x in sub]
                sub_id, start_end, *content = sub  # py3 syntax
                start, end = start_end.split(' --> ')

                # ints only
                sub_id = int(sub_id)

                # join multi-line text
                text = ', '.join(content)

                subs.append(Subtitle(
                    sub_id,
                    start,
                    end,
                    text
                ))

        es_ready_subs = []

        for index, sub_object in enumerate(subs):
            prev_sub_text = ''
            next_sub_text = ''

            if index > 0:
                prev_sub_text = subs[index - 1].text + ' '

            if index < len(subs) - 1:
                next_sub_text = ' ' + subs[index + 1].text

            es_ready_subs.append(dict(
                **sub_object._asdict(),
                overlapping_text=prev_sub_text + sub_object.text + next_sub_text
            ))

        return es_ready_subs
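
The parser above splits the timing line on ' --> ' and assumes it is well formed. As a hedged sketch of how that single step could be made more defensive, the hypothetical helper `parse_timing` below (not part of the original answer) uses a regular expression and returns None for anything that does not look like an SRT timing line.

```python
import re

# HH:MM:SS,mmm --> HH:MM:SS,mmm, tolerating extra whitespace around the arrow
TIMING_RE = re.compile(
    r"^(\d{2}:\d{2}:\d{2},\d{3})\s*-->\s*(\d{2}:\d{2}:\d{2},\d{3})"
)


def parse_timing(line):
    """Return (start, end) for an SRT timing line, or None if it is malformed."""
    match = TIMING_RE.match(line.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)


print(parse_timing("00:02:17,440 --> 00:02:20,375\n"))
print(parse_timing("Senator, we're making our final"))
```

A caller could skip any chunk for which `parse_timing` returns None instead of letting the tuple unpacking raise a ValueError.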
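    Once the subtitles are parsed, they can be ingested into ES. Before that is done, set up the following mapping so that your timestamps are properly searchable and sortable: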
    PUT my_subtitles_index
    {
      "mappings": {
        "properties": {
          "start": {
            "type": "text",
            "fields": {
              "as_timestamp": {
                "type": "date",
                "format": "HH:mm:ss,SSS"
              }
            }
          },
          "end": {
            "type": "text",
            "fields": {
              "as_timestamp": {
                "type": "date",
                "format": "HH:mm:ss,SSS"
              }
            }
          }
        }
      }
    }
    Once that is done, proceed to ingest:
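    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import bulk

    from utils.parse import parse_subs

    # With no arguments, older (7.x) clients connect to localhost:9200;
    # newer clients need an explicit hosts argument.
    es = Elasticsearch()

    es_ready_subs = parse_subs('subs.txt')

    # One bulk action per subtitle, using sub_id as the document id
    actions = [
        {
            "_index": "my_subtitles_index",
            "_id": sub_group['sub_id'],
            "_source": sub_group
        } for sub_group in es_ready_subs
    ]

    bulk(es, actions)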
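
For large subtitle files, the actions do not have to be materialized as a list, since `bulk` accepts any iterable. A minimal sketch of a generator variant follows; `generate_actions` is a hypothetical name and the sample document is the one from the structure shown earlier, so the block runs without a cluster.

```python
def generate_actions(subs, index_name="my_subtitles_index"):
    """Lazily yield one bulk action per parsed subtitle document."""
    for sub in subs:
        yield {
            "_index": index_name,
            "_id": sub["sub_id"],
            "_source": sub,
        }


sample_subs = [
    {
        "sub_id": 0,
        "start": "00:02:17,440",
        "end": "00:02:20,375",
        "text": "Senator, we're making our final",
        "overlapping_text": "Senator, we're making our final approach into Coruscant.",
    }
]

for action in generate_actions(sample_subs):
    print(action["_index"], action["_id"])

# With a live cluster this would be passed along as:
#     bulk(es, generate_actions(es_ready_subs))
```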
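    Once ingested, you can target the original subtitle `text` and boost it if it directly matches your phrase. Otherwise, add a fallback on `overlapping_text`, which ensures that both "overlapping" subtitles are returned.
    Before returning, you can make sure that the hits are ordered by `start`, ascending. This sort of defeats the purpose of boosting, but if you do sort, you can specify `track_scores:true` in the URI to ensure that the originally calculated scores are returned too.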
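
To make the boost, fallback and sort logic concrete, here is a toy simulation in plain Python. It is only a sketch: a case-insensitive substring test stands in for `match_phrase`, the scores are fixed constants (2 for a direct hit on `text`, 1 for a hit on `overlapping_text`) rather than real relevance scores, and `toy_score` and `toy_search` are hypothetical names.

```python
docs = [
    {
        "sub_id": 0,
        "start": "00:02:17,440",
        "text": "Senator, we're making our final",
        "overlapping_text": "Senator, we're making our final approach into Coruscant.",
    },
    {
        "sub_id": 1,
        "start": "00:02:20,476",
        "text": "approach into Coruscant.",
        "overlapping_text": "Senator, we're making our final approach into Coruscant. Very good, Lieutenant.",
    },
]


def toy_score(doc, phrase, boost=2.0):
    """Sum the two 'should' clauses: boosted text hit plus overlapping_text hit."""
    needle = phrase.lower()
    score = 0.0
    if needle in doc["text"].lower():
        score += boost
    if needle in doc["overlapping_text"].lower():
        score += 1.0
    return score


def toy_search(docs, phrase):
    """Keep matching docs, then order them by start time while keeping scores."""
    hits = [(toy_score(doc, phrase), doc) for doc in docs]
    hits = [(score, doc) for score, doc in hits if score > 0]
    # Zero-padded HH:MM:SS,mmm strings sort chronologically as plain strings
    return sorted(hits, key=lambda hit: hit[1]["start"])


for score, doc in toy_search(docs, "final approach"):
    print(doc["sub_id"], score)
```

Neither `text` field contains "final approach" on its own, so both documents match only through the fallback field. A phrase such as "into Coruscant" also matches the second document's `text` directly, so that document scores higher while still being listed second because of the sort.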
    Putting it all together:
    POST my_subtitles_index/_search?track_scores&filter_path=hits.hits
    {
      "query": {
        "bool": {
          "should": [
            {
              "match_phrase": {
                "text": {
                  "query": "final approach",
                  "boost": 2
                }
              }
            },
            {
              "match_phrase": {
                "overlapping_text": {
                  "query": "final approach"
                }
              }
            }
          ]
        }
      },
      "sort": [
        {
          "start.as_timestamp": {
            "order": "asc"
          }
        }
      ]
    }
    yields:
    {
      "hits" : {
        "hits" : [
          {
            "_index" : "my_subtitles_index",
            "_type" : "_doc",
            "_id" : "0",
            "_score" : 6.0236287,
            "_source" : {
              "sub_id" : 0,
              "start" : "00:02:17,440",
              "end" : "00:02:20,375",
              "text" : "Senator, we're making our final",
              "overlapping_text" : "Senator, we're making our final approach into Coruscant."
            },
            "sort" : [
              137440
            ]
          },
          {
            "_index" : "my_subtitles_index",
            "_type" : "_doc",
            "_id" : "1",
            "_score" : 5.502407,
            "_source" : {
              "sub_id" : 1,
              "start" : "00:02:20,476",
              "end" : "00:02:22,501",
              "text" : "approach into Coruscant.",
              "overlapping_text" : "Senator, we're making our final approach into Coruscant. Very good, Lieutenant."
            },
            "sort" : [
              140476
            ]
          }
        ]
      }
    }
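
The `sort` values in the response are the `start` timestamps expressed in milliseconds (00:02:17,440 is 2 minutes, 17 seconds and 440 ms, i.e. 137440 ms). The sketch below reproduces that arithmetic and shows how a hit could be turned into a link that jumps to the right moment, which is what the question asked for. The helper names, the example.com base URL and the `t` query parameter are all hypothetical.

```python
def srt_to_millis(timestamp):
    """Convert an SRT timestamp such as '00:02:17,440' to milliseconds."""
    hms, millis = timestamp.split(",")
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + int(millis)


def seek_url(base_url, timestamp):
    """Build a link that seeks to the subtitle's start, in whole seconds."""
    return f"{base_url}?t={srt_to_millis(timestamp) // 1000}"


print(srt_to_millis("00:02:17,440"))
print(srt_to_millis("00:02:20,476"))
print(seek_url("https://example.com/watch", "00:02:17,440"))
```

The two printed millisecond values match the `sort` entries of the two hits above.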

    Regarding elasticsearch - searching subtitle data in Elasticsearch, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/28431583/
