gpt4 book ai didi

python - spaCy 句子分割在引号上失败

转载 作者:行者123 更新时间:2023-11-30 22:34:23 25 4
gpt4 key购买 nike

我正在使用 spaCy 解析一些新闻数据,并注意到存在引用的句子分割始终失败。还有其他人解决了这个问题吗?

这是一个可重现的示例 - 请注意下面输出中的句子 4。 spaCy 未能在引用的开头进行拆分,这与我正在处理的其他新闻文章是一致的。

非常感谢。

示例:

原始数据:

u'body': u'\n LONDON Nov 4 Britons hurt by lower incomes and rising food prices after the financial crisis have cut back on fruit and vegetables and turned instead to fatty, sugary, processed food, an academic study showed on Monday.Britain has seen food prices rise much more sharply than most other developed economies between 2005 and 2012, while wage growth has been low and unemployment has risen.The net effect has been that Britons are spending 8.5 percent less in real terms on food purchased at home than before the recession - with the trend even greater for pensioners and families with young children.The research is likely to be politically sensitive at a time when Britain\'s Conservative-led government is under pressure from the opposition Labour Party, over declining standards of living and sharply rising demand at food banks which hand out free food to the poorest Britons. People have economised by buying less food, measured in number of calories, but also on its quality, picking products that are less nutritious and higher in saturated fat and sugar."Various measures of nutritional quality declined over this period, with bigger decreases for pensioner households and households with young children," said the Institute for Fiscal Studies, an economics research body.OBESITY Families with children were prone to switching to more sugary food, while pensioners favoured food high in saturated fat, the study showed. Both groups often have lower incomes.While the economy is starting to show signs of growth after suffering the biggest hit to economic growth since records began during the 2008-09 recession, households\' disposable incomes are no higher than a decade ago. However, the IFS said a lower-quality diet was not an inevitable consequence of having less money, and that some households had been able to eat as healthily as before while spending less. More research was needed to see why this was not the case for other households, the researchers added.The study looked at data on more than 15,000 households\' shopping habits collected by market research company Kantar Worldpanel between 2005 and 2012.The figures do not include meals purchased or provided away from home, for example in restaurants or at schools, which in England provide free lunches for poorer pupils.The study was released alongside a piece of longer-term research from the IFS, which showed the English now consume 15-30 percent fewer calories than in 1980, despite higher obesity rates probably due to less physical activity.This contrasts with the United States, where calorie consumption has risen as well as obesity. The IFS said it was were researching further into trends in Britons\' physical activity over the period.',

要拆分的代码:

from __future__ import unicode_literals
import spacy
nlp = spacy.load('en')
doc1 = nlp(article_to_json['body'].decode('utf-8'), parse=True)

for number, sent in enumerate(doc1.sents):
print number, sent, "\n"

输出:

0 LONDON Nov 4 Britons hurt by lower incomes and rising food prices after the financial crisis have cut back on fruit and vegetables and turned instead to fatty, sugary, processed food, an academic study showed on Monday.

1 Britain has seen food prices rise much more sharply than most other developed economies between 2005 and 2012, while wage growth has been low and unemployment has risen.

2 The net effect has been that Britons are spending 8.5 percent less in real terms on food purchased at home than before the recession - with the trend even greater for pensioners and families with young children.

3 The research is likely to be politically sensitive at a time when Britain's Conservative-led government is under pressure from the opposition Labour Party, over declining standards of living and sharply rising demand at food banks which hand out free food to the poorest Britons.

4 People have economised by buying less food, measured in number of calories, but also on its quality, picking products that are less nutritious and higher in saturated fat and sugar."Various measures of nutritional quality declined over this period, with bigger decreases for pensioner households and households with young children," said the Institute for Fiscal Studies, an economics research body.

5 OBESITY Families with children were prone to switching to more sugary food, while pensioners favoured food high in saturated fat, the study showed.

6 Both groups often have lower incomes.

7 While the economy is starting to show signs of growth after suffering the biggest hit to economic growth since records began during the 2008-09 recession, households' disposable incomes are no higher than a decade ago.

8 However, the IFS said a lower-quality diet was not an inevitable consequence of having less money, and that some households had been able to eat as healthily as before while spending less.

9 More research was needed to see why this was not the case for other households, the researchers added.

10 The study looked at data on more than 15,000 households' shopping habits collected by market research company Kantar Worldpanel between 2005 and 2012.The figures do not include meals purchased or provided away from home, for example in restaurants or at schools, which in England provide free lunches for poorer pupils.

11 The study was released alongside a piece of longer-term research from the IFS, which showed the English now consume 15-30 percent fewer calories than in 1980, despite higher obesity rates probably due to less physical activity.

12 This contrasts with the United States, where calorie consumption has risen as well as obesity.

13 The IFS said it was were researching further into trends in Britons' physical activity over the period.

最佳答案

我用谷歌搜索了原始新闻文章,试图找出为什么你的数据看起来像这样(句子之间缺少空格,而我在正式的新闻文章中不会想到它),看起来最初的问题是没有HTML 段落之间插入空格。如果您可以通过从原始 HTML 中提取文章的方式来解决该问题(在遇到

时插入空格),则使用 spacy 或其他工具就不会遇到此问题。

标准工具中可用的模型通常会在新闻数据上进行训练,并且可以合理地期望它们能够很好地处理这样的数据,但它们期望句子之间有空格。除非您使用包含句子之间缺少空格的数据重新训练模型(或按照评论中的建议预处理数据),否则您将会遇到此类问题。

关于python - spaCy 句子分割在引号上失败,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/44853107/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com