
r - Tokenizing sentences with unnest_tokens(), ignoring abbreviations

Reposted. Author: 行者123. Updated: 2023-12-05 00:48:59

I'm using the excellent tidytext package to tokenize sentences across several paragraphs. For example, I want to take the following paragraph:

"I am perfectly convinced by it that Mr. Darcy has no defect. He owns it himself without disguise."

and tokenize it into two sentences:

  1. "I am perfectly convinced by it that Mr. Darcy has no defect."
  2. "He owns it himself without disguise."

However, when I use tidytext's default sentence tokenizer, I get three sentences.

Code

library(tibble)
library(tidytext)

# tibble() replaces the deprecated data_frame()
df <- tibble(Example_Text = c("I am perfectly convinced by it that Mr. Darcy has no defect. He owns it himself without disguise."))

unnest_tokens(df, input = "Example_Text", output = "Sentence", token = "sentences")

Result

# A tibble: 3 x 1
Sentence
<chr>
1 i am perfectly convinced by it that mr.
2 darcy has no defect.
3 he owns it himself without disguise.

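What is a simple way to tokenize sentences with tidytext without common abbreviations such as "Mr." or "Dr." being interpreted as the end of a sentence?

Best answer

You can use a regular expression as the splitting condition, but there is no guarantee that it will cover all common honorifics: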

unnest_tokens(df, input = "Example_Text", output = "Sentence", token = "regex",
pattern = "(?<!\\b\\p{L}r)\\.")

Result:

# A tibble: 2 x 1
Sentence
<chr>
1 i am perfectly convinced by it that mr. darcy has no defect
2 he owns it himself without disguise
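
To see what the lookbehind does and where it stops helping, the pattern can be tried outside tidytext. The sketch below is only an approximation: it uses base R's strsplit() with perl = TRUE as a stand-in for the regex tokenizer (the two may use different regex engines), and split_on_pattern() and the second test sentence are hypothetical names and data introduced for illustration.

```r
# Split on periods that are NOT preceded by a two-letter word ending in "r"
# (Mr, Dr, Sr, Jr, ...). Same pattern as in the answer above.
pattern <- "(?<!\\b\\p{L}r)\\."

# Hypothetical helper: split one string, trim whitespace, drop empty pieces.
split_on_pattern <- function(text, pattern) {
  parts <- strsplit(text, pattern, perl = TRUE)[[1]]
  parts <- trimws(parts)
  parts[nzchar(parts)]
}

darcy <- "I am perfectly convinced by it that Mr. Darcy has no defect. He owns it himself without disguise."
bennet <- "Mrs. Bennet was pleased. She said so."  # made-up sentence

darcy_parts <- split_on_pattern(darcy, pattern)
bennet_parts <- split_on_pattern(bennet, pattern)

print(darcy_parts)   # two pieces: "Mr." is protected by the lookbehind
print(bennet_parts)  # three pieces: "Mrs." is still split, since it ends in "rs"
```

The second case shows why the short pattern is only a partial fix: three-letter titles such as "Mrs" (and titles not ending in "r", such as "Ms") are not covered.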

You can of course always create your own list of common titles and build a regular expression from it:

titles <- c("Mr", "Dr", "Mrs", "Ms", "Sr", "Jr")
regex <- paste0("(?<!\\b(", paste(titles, collapse = "|"), "))\\.")
# > regex
# [1] "(?<!\\b(Mr|Dr|Mrs|Ms|Sr|Jr))\\."

unnest_tokens(df, input = "Example_Text", output = "Sentence", token = "regex",
pattern = regex)
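
The list-based pattern can be checked in plain R as well. One caveat: some regex engines require every lookbehind alternative to have a fixed length and reject an alternation nested inside a group, so this sketch writes the titles as top-level alternatives instead. That is a slight variation on the pattern above, not the answer's exact regex. build_title_pattern() and the test sentence are hypothetical, and base R's strsplit() with perl = TRUE again stands in for the tokenizer.

```r
# Hypothetical helper: build a "period not preceded by a title" pattern.
# Each title becomes its own top-level lookbehind alternative, e.g.
# (?<!\bMr|\bDr|\bMrs)\.  so every alternative has a fixed length.
build_title_pattern <- function(titles) {
  paste0("(?<!", paste0("\\b", titles, collapse = "|"), ")\\.")
}

titles <- c("Mr", "Dr", "Mrs", "Ms", "Sr", "Jr")
title_pattern <- build_title_pattern(titles)

text <- "Mrs. Bennet was pleased. Dr. Jones agreed."  # made-up sentence

# Split, trim whitespace, and drop empty pieces.
parts <- trimws(strsplit(text, title_pattern, perl = TRUE)[[1]])
parts <- parts[nzchar(parts)]

print(title_pattern)
print(parts)  # "Mrs. Bennet was pleased" and "Dr. Jones agreed"
```

Abbreviations outside the list (for example "Prof." or "St.") would still end a sentence, so the list needs to be extended to match the text being processed.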

Regarding "r - Tokenizing sentences with unnest_tokens(), ignoring abbreviations", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/47211643/
