python - Speech to Text - Map speaker labels to the corresponding transcripts in a JSON response

Reposted by 行者123. Updated: 2023-12-05 05:16:05

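Every once in a while a piece of JSON data comes along that presents a challenge, and it can take hours to extract the desired information from it. I have the following JSON response, generated by a Speech To Text API engine.

It shows the transcripts, the utterance of each word with its timestamps, and the speaker labels for the two speakers in the conversation, speaker 0 and speaker 2.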

   {
    "results": [
        {
            "alternatives": [
                {
                    "timestamps": [
                        [
                            "the",
                            6.18,
                            6.63
                        ],
                        [
                            "weather",
                            6.63,
                            6.95
                        ],
                        [
                            "is",
                            6.95,
                            7.53
                        ],
                        [
                            "sunny",
                            7.73,
                            8.11
                        ],
                        [
                            "it's",
                            8.21,
                            8.5
                        ],
                        [
                            "time",
                            8.5,
                            8.66
                        ],
                        [
                            "to",
                            8.66,
                            8.81
                        ],
                        [
                            "sip",
                            8.81,
                            8.99
                        ],
                        [
                            "in",
                            8.99,
                            9.02
                        ],
                        [
                            "some",
                            9.02,
                            9.25
                        ],
                        [
                            "cold",
                            9.25,
                            9.32
                        ],
                        [
                            "beer",
                            9.32,
                            9.68
                        ]
                    ],
                    "confidence": 0.812,
                    "transcript": "the weather is sunny it's time to sip in some cold beer "
                }
            ],
            "final": "True"
        },
        {
            "alternatives": [
                {
                    "timestamps": [
                        [
                            "sure",
                            10.52,
                            10.88
                        ],
                        [
                            "that",
                            10.92,
                            11.19
                        ],
                        [
                            "sounds",
                            11.68,
                            11.82
                        ],
                        [
                            "like",
                            11.82,
                            12.11
                        ],
                        [
                            "a",
                            12.32,
                            12.96
                        ],
                        [
                            "plan",
                            12.99,
                            13.8
                        ]
                    ],
                    "confidence": 0.829,
                    "transcript": "sure that sounds like a plan"
                }
            ],
            "final": "True"
        }
    ],
    "result_index":0,
    "speaker_labels": [
        {
            "from": 6.18,
            "to": 6.63,
            "speaker": 0,
            "confidence": 0.475,
            "final": "False"
        },
        {
            "from": 6.63,
            "to": 6.95,
            "speaker": 0,
            "confidence": 0.475,
            "final": "False"
        },
        {
            "from": 6.95,
            "to": 7.53,
            "speaker": 0,
            "confidence": 0.475,
            "final": "False"
        },
        {
            "from": 7.73,
            "to": 8.11,
            "speaker": 0,
            "confidence": 0.499,
            "final": "False"
        },
        {
            "from": 8.21,
            "to": 8.5,
            "speaker": 0,
            "confidence": 0.472,
            "final": "False"
        },
        {
            "from": 8.5,
            "to": 8.66,
            "speaker": 0,
            "confidence": 0.472,
            "final": "False"
        },
        {
            "from": 8.66,
            "to": 8.81,
            "speaker": 0,
            "confidence": 0.472,
            "final": "False"
        },
        {
            "from": 8.81,
            "to": 8.99,
            "speaker": 0,
            "confidence": 0.472,
            "final": "False"
        },
        {
            "from": 8.99,
            "to": 9.02,
            "speaker": 0,
            "confidence": 0.472,
            "final": "False"
        },
        {
            "from": 9.02,
            "to": 9.25,
            "speaker": 0,
            "confidence": 0.472,
            "final": "False"
        },
        {
            "from": 9.25,
            "to": 9.32,
            "speaker": 0,
            "confidence": 0.472,
            "final": "False"
        },
        {
            "from": 9.32,
            "to": 9.68,
            "speaker": 0,
            "confidence": 0.472,
            "final": "False"
        },
        {
            "from": 10.52,
            "to": 10.88,
            "speaker": 2,
            "confidence": 0.441,
            "final": "False"
        },
        {
            "from": 10.92,
            "to": 11.19,
            "speaker": 2,
            "confidence": 0.364,
            "final": "False"
        },
        {
            "from": 11.68,
            "to": 11.82,
            "speaker": 2,
            "confidence": 0.372,
            "final": "False"
        },
        {
            "from": 11.82,
            "to": 12.11,
            "speaker": 2,
            "confidence": 0.372,
            "final": "False"
        },
        {
            "from": 12.32,
            "to": 12.96,
            "speaker": 2,
            "confidence": 0.383,
            "final": "False"
        },
        {
            "from": 12.99,
            "to": 13.8,
            "speaker": 2,
            "confidence": 0.428,
            "final": "False"
        }
    ]
}

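Forgive the indentation issues (if any), but the JSON is valid. I have been trying to map each transcript to its corresponding speaker label.

I want something like the output below. The JSON above is about 20,000 lines long, and extracting the speaker label from the timestamps and word utterances and putting it together with the transcript is a nightmare.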

[
    {
        "transcript": "the weather is sunny it's time to sip in some cold beer ",
        "speaker" : 0
    },
    {
        "transcript": "sure that sounds like a plan",
        "speaker" : 2
    }

]

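What I have tried so far: the JSON data is stored in a file named example.json. I have been able to put each word, with its corresponding timestamps and speaker label, into a list of tuples (see the output below):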

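import json

# Load the API response; adjust the path to wherever example.json is saved.
with open('example.json', 'r') as f:
    data = json.load(f)

l1 = []  # [word, start, end] entries from every result
l2 = []  # speaker label records
l3 = []  # (word, speaker, start, end) tuples

for i in data['results']:
    for j in i['alternatives'][0]['timestamps']:
        l1.append(j)

for m in data['speaker_labels']:
    l2.append(m)

# Match each word to the speaker label that starts at the same timestamp.
for q in l1:
    for n in l2:
        if q[1] == n['from']:
            l3.append((q[0], n['speaker'], q[1], q[2]))
print(l3)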

This gives the output:

 [('the', 0, 6.18, 6.63),
 ('weather', 0, 6.63, 6.95),
 ('is', 0, 6.95, 7.53),
 ('sunny', 0, 7.73, 8.11),
 ("it's", 0, 8.21, 8.5),
 ('time', 0, 8.5, 8.66),
 ('to', 0, 8.66, 8.81),
 ('sip', 0, 8.81, 8.99),
 ('in', 0, 8.99, 9.02),
 ('some', 0, 9.02, 9.25),
 ('cold', 0, 9.25, 9.32),
 ('beer', 0, 9.32, 9.68),
 ('sure', 2, 10.52, 10.88),
 ('that', 2, 10.92, 11.19),
 ('sounds', 2, 11.68, 11.82),
 ('like', 2, 11.82, 12.11),
 ('a', 2, 12.32, 12.96),
 ('plan', 2, 12.99, 13.8)]

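But now I am not sure how to associate the words with one another based on timestamp comparison, and how to "bucket" each set of words so that they form the transcript again with its speaker label.

I have also managed to get the transcripts into a list, but how do I now extract the speaker label for each transcript from the list above? Unfortunately, the speaker labels speaker 0 and speaker 2 are given for each word; I wish they had been given for each transcript instead.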

l4 = []  # one transcript string per result
for i in data['results']:
    l4.append(i['alternatives'][0]['transcript'])
print(l4)

This gives the output:

["the weather is sunny it's time to sip in some cold beer ",'sure that sounds like a plan']

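I have tried my best to explain the problem, but I am open to any feedback and will make changes if necessary. Also, I am pretty sure there is a better way to solve this than building multiple lists; any help is much appreciated.

For a larger dataset, see the pastebin. I hope this dataset can be helpful for benchmarking performance. I can provide an even larger dataset when available or if required.

Since I am dealing with large JSON data, performance is an important factor. Likewise, accurately achieving speaker isolation in overlapping transcriptions is another requirement.

Best answer

Using pandas, here is how I would tackle it.

Assuming the data is stored in a dictionary called data: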

import pandas as pd

labels = pd.DataFrame.from_records(data['speaker_labels'])

transcript_tstamps = pd.DataFrame.from_records(
    [t for r in data['results'] 
       for a in r['alternatives'] 
       for t in a['timestamps']], 
    columns=['word', 'from', 'to']
)
# this list comprehension more-efficiently de-nests the dictionary into
# records that can be used to create a DataFrame

df = labels.merge(transcript_tstamps)
# produces a dataframe of speakers to words based on timestamps from & to
# since I knew I wanted to merge on the from & to columns, 
# I named the columns thus when I created the transcript_tstamps data frame
# like this:
    confidence  final   from  speaker     to     word
0        0.475  False   6.18        0   6.63      the
1        0.475  False   6.63        0   6.95  weather
2        0.475  False   6.95        0   7.53       is
3        0.499  False   7.73        0   8.11    sunny
4        0.472  False   8.21        0   8.50     it's
5        0.472  False   8.50        0   8.66     time
6        0.472  False   8.66        0   8.81       to
7        0.472  False   8.81        0   8.99      sip
8        0.472  False   8.99        0   9.02       in
9        0.472  False   9.02        0   9.25     some
10       0.472  False   9.25        0   9.32     cold
11       0.472  False   9.32        0   9.68     beer
12       0.441  False  10.52        2  10.88     sure
13       0.364  False  10.92        2  11.19     that
14       0.372  False  11.68        2  11.82   sounds
15       0.372  False  11.82        2  12.11     like
16       0.383  False  12.32        2  12.96        a
17       0.428  False  12.99        2  13.80     plan
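
The merge above pairs rows only when both from and to match exactly, so a word whose timestamps have no identical speaker label silently drops out. As a minimal sketch (the three-word sample and the names words, checked and unmatched are made up for illustration), a left merge with indicator=True can show which words found no label:

```python
import pandas as pd

# Tiny hypothetical sample in the same shape as the API response.
labels = pd.DataFrame.from_records([
    {"from": 6.18, "to": 6.63, "speaker": 0},
    {"from": 6.63, "to": 6.95, "speaker": 0},
])
words = pd.DataFrame.from_records(
    [["the", 6.18, 6.63], ["weather", 6.63, 6.95], ["is", 6.95, 7.53]],
    columns=["word", "from", "to"],
)

# Left-merge from the words so every word is kept, and let pandas record
# in the `_merge` column whether a matching speaker label was found.
checked = words.merge(labels, on=["from", "to"], how="left", indicator=True)

# Words without a speaker label ("is" in this sample).
unmatched = checked.loc[checked["_merge"] == "left_only", "word"].tolist()
print(unmatched)
```

If the unmatched list is empty on the real data, the plain inner merge loses nothing.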

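Once the speaker and word data are joined, consecutive words from the same speaker need to be grouped together to derive the current speaker. For example, if the speaker array looked like [2,2,2,2,0,0,0,2,2,2,0,0,0,0], we would need to group the first four 2s together, then the next three 0s, then the three 2s, and then the remaining 0s.

Sort the data by ['from', 'to'], then set up a dummy variable called current_speaker for this, like so: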

df = df.sort_values(['from', 'to'])
df['current_speaker'] = (df.speaker.shift() != df.speaker).cumsum()
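
To see what the shift/cumsum expression does, here is a small sketch that applies it to the example speaker array from the text (the names speakers, changed and run_id are only for illustration):

```python
import pandas as pd

speakers = pd.Series([2, 2, 2, 2, 0, 0, 0, 2, 2, 2, 0, 0, 0, 0])

# True wherever the speaker differs from the previous row; the first row
# is compared against NaN, so it also counts as a change.
changed = speakers.shift() != speakers

# The running total of changes gives every consecutive run its own id.
run_id = changed.cumsum()
print(run_id.tolist())
# → [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4]
```

The two runs of speaker 2 receive different ids (1 and 3), which is why grouping on this column keeps separate turns apart, whereas grouping on speaker alone would merge them.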

From here, group by current_speaker, aggregate the words into a sentence, and convert to JSON. There is some additional renaming to fix the output JSON keys:

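transcripts = df.groupby('current_speaker').agg({
   'word': lambda x: ' '.join(x),
   # every row in a run shares one speaker, so 'min' just picks that value;
   # the string form avoids the warning newer pandas emits for the builtin
   'speaker': 'min'
}).rename(columns={'word': 'transcript'})
transcripts[['speaker', 'transcript']].to_json(orient='records')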
# produces the following output (indentation added by me for legibility):
'[{"speaker":0,
  "transcript":"the weather is sunny it\'s time to sip in some cold beer"},    
 {"speaker":2,
  "transcript":"sure that sounds like a plan"}]'

To add extra data about when each transcript starts and ends, you can add the min/max of from/to to the groupby:

transcripts = df.groupby('current_speaker').agg({
   'word': lambda x: ' '.join(x),
   'speaker': 'min',
   'from': 'min',  # start of the first word in the run
   'to': 'max'     # end of the last word in the run
}).rename(columns={'word': 'transcript'})
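
If pandas is not an option, roughly the same grouping can be sketched with the standard library alone. This is not part of the answer above: the function name transcripts_by_speaker and the miniature data dictionary are hypothetical, only the first alternative of each result is read, and a word with no exactly matching speaker label raises a KeyError.

```python
from itertools import groupby

# Hypothetical miniature of the API response, same shape as the real one.
data = {
    "results": [
        {"alternatives": [{"timestamps": [["the", 6.18, 6.63], ["weather", 6.63, 6.95]]}]},
        {"alternatives": [{"timestamps": [["sure", 10.52, 10.88], ["that", 10.92, 11.19]]}]},
    ],
    "speaker_labels": [
        {"from": 6.18, "to": 6.63, "speaker": 0},
        {"from": 6.63, "to": 6.95, "speaker": 0},
        {"from": 10.52, "to": 10.88, "speaker": 2},
        {"from": 10.92, "to": 11.19, "speaker": 2},
    ],
}

def transcripts_by_speaker(data):
    """Group consecutive words by speaker without pandas."""
    # One dictionary lookup per word instead of a nested loop over labels.
    speaker_at = {(s["from"], s["to"]): s["speaker"] for s in data["speaker_labels"]}
    # Sort words by (start, end) so consecutive runs follow the timeline.
    words = sorted(
        (start, end, word)
        for result in data["results"]
        for word, start, end in result["alternatives"][0]["timestamps"]
    )
    output = []
    for speaker, run in groupby(words, key=lambda w: speaker_at[(w[0], w[1])]):
        run = list(run)
        output.append({
            "transcript": " ".join(word for _, _, word in run),
            "speaker": speaker,
            "from": run[0][0],                     # earliest start in the run
            "to": max(end for _, end, _ in run),   # latest end in the run
        })
    return output

print(transcripts_by_speaker(data))
```

itertools.groupby only groups adjacent items, so it plays the same role here as the current_speaker column does in the pandas version.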

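Additionally (though this does not apply to this example dataset), you should perhaps pick the alternative with the highest confidence for each time slice.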
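
One way this could look, as a sketch only: the helper best_alternative and the two-alternative result below are invented for illustration, and treating a missing "confidence" key as 0.0 is an assumption of this sketch.

```python
def best_alternative(result):
    """Return the alternative with the highest confidence."""
    # Alternatives without a "confidence" key are treated as 0.0
    # (an assumption of this sketch).
    return max(result["alternatives"], key=lambda alt: alt.get("confidence", 0.0))

# Hypothetical result carrying two competing alternatives.
result = {
    "alternatives": [
        {"confidence": 0.61, "transcript": "sure that sounds like a plan"},
        {"confidence": 0.83, "transcript": "sure that sounds like a plant"},
    ]
}
print(best_alternative(result)["transcript"])
```

In the list comprehension that builds transcript_tstamps, the inner loop over r['alternatives'] could then be replaced by a single call to this helper per result.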

Regarding python - Speech to Text - Map speaker labels to the corresponding transcripts in a JSON response, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/50900340/
