
python - PyParsing bibliographic citations

Reposted. Author: 太空宇宙. Updated: 2023-11-04 00:41:45

I'm having some trouble using PyParsing. I need to parse bibliographic information out of a CV. An example:

AuthorA, B., AuthorB, M. R., AuthorC, V., and B. LastAuthor. Some sciency title. Name of the confernce, City, State, December 3, 2012

I came up with some code to parse (mainly) the author list and the date; the other information isn't particularly important to me.

from pyparsing import (Word, Literal, OneOrMore, alphanums, delimitedList,
                       printables, alphas, nums)

family_name = Word(alphanums+'-')
first_init = Word(alphanums+'.')
author = (family_name("LastName") + Literal(',').suppress() +
          OneOrMore(first_init("FirstInitials")))
last_author = first_init("FirstInitials") + family_name("LastName")

author_list = delimitedList(author) + Literal('and').suppress() + last_author

sentence = OneOrMore(Word(printables))
location = delimitedList(Word(printables))
date = Word(alphas) + Word(nums) + Literal(',').suppress() + Word(nums)

citation = (author_list('AuthorLst') + sentence('Title') + location('Location')
            + date('Date'))

# ntext holds the example citation shown above
citation.parseString(ntext)

However, it chokes on the "and" that separates the author list from the last author.

I get this error message:

---------------------------------------------------------------------------
ParseException Traceback (most recent call last)
<ipython-input-142-5d7946dcb775> in <module>()
15
16
---> 17 citation.parseString(ntext)

/Users/willdampier/anaconda/lib/python2.7/site-packages/pyparsing.pyc in parseString(self, instring, parseAll)
1123 else:
1124 # catch and re-raise exception from here, clears out pyparsing internal stack trace
-> 1125 raise exc
1126 else:
1127 return tokens

ParseException: Expected "and" (at char 40), (line:1, col:41)

Any suggestions?

Best answer

After defining author, add this line:

author.setName("author").setDebug()

to trace how the author expression is matched. Then, for better overall diagnostics, change your test line to:

author_list.runTests(ntext)
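
For reference, a self-contained version of this diagnostic run might look like the sketch below. It assumes that ntext holds the sample citation from the question and reuses the question's grammar unchanged, so the run is expected to fail. Capturing the return value of runTests is an addition for illustration only.

```python
from pyparsing import Word, Literal, OneOrMore, alphanums, delimitedList

# Assumption: ntext holds the sample citation from the question.
ntext = ("AuthorA, B., AuthorB, M. R., AuthorC, V., and B. LastAuthor. "
         "Some sciency title. Name of the confernce, City, State, "
         "December 3, 2012")

family_name = Word(alphanums + '-')
first_init = Word(alphanums + '.')
author = (family_name("LastName") + Literal(',').suppress()
          + OneOrMore(first_init("FirstInitials")))
# Trace every attempt to match an author.
author.setName("author").setDebug()
last_author = first_init("FirstInitials") + family_name("LastName")

# The question's original author_list, still without any fix.
author_list = delimitedList(author) + Literal('and').suppress() + last_author

# runTests prints a report and returns a (success, results) pair.
success, results = author_list.runTests(ntext)
print(success)
```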

With these changes, you will get output like this:

Match author at loc 0(1,1)
Matched author -> ['AuthorA', 'B.']
Match author at loc 12(1,13)
Matched author -> ['AuthorB', 'M.', 'R.']
Match author at loc 28(1,29)
Matched author -> ['AuthorC', 'V.']
Match author at loc 41(1,42)
Exception raised:Expected "," (at char 46), (line:1, col:47)

AuthorA, B., AuthorB, M. R., AuthorC, V., and B. LastAuthor. Some sciency title. Name of the confernce, City, State, December 3, 2012
^
FAIL: Expected "and" (at char 40), (line:1, col:41)

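So your immediate problem is that you are not handling the trailing ',' before the "and". You will also need to add the trailing '.' to your definition of author_list.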
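
One way to make both changes is sketched below. Making the comma before "and" optional is an assumption on my part (the sample always has it), and authors_text is just the author portion of the sample citation. The camelCase names follow the question; pyparsing 3 also provides snake_case equivalents such as parse_string.

```python
from pyparsing import (Word, Literal, OneOrMore, Optional, alphanums,
                       delimitedList)

family_name = Word(alphanums + '-')
first_init = Word(alphanums + '.')
author = (family_name("LastName") + Literal(',').suppress()
          + OneOrMore(first_init("FirstInitials")))
last_author = first_init("FirstInitials") + family_name("LastName")

author_list = (delimitedList(author)
               + Optional(Literal(',')).suppress()   # the ',' before "and"
               + Literal('and').suppress()
               + last_author
               + Literal('.').suppress())            # the '.' ending the list

# Assumption: only the author part of the sample citation is parsed here.
authors_text = "AuthorA, B., AuthorB, M. R., AuthorC, V., and B. LastAuthor."
tokens = author_list.parseString(authors_text).asList()
print(tokens)
```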

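From there, though, your sentence parser will cause problems, because it will consume the entire rest of the string. Since your main interest is in getting the date, this may work for you: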

stuff = OneOrMore(Word(printables), stopOn=date)
citation = (author_list('AuthorLst') + stuff('body') + date('Date'))
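
Put together with a repaired author_list, the stopOn approach can be tried end to end as in the following sketch. The author_list shown here is one possible repair (optional comma before "and", suppressed final period), not necessarily the exact expression used for the output in this answer, and ntext is assumed to hold the sample citation.

```python
from pyparsing import (Word, Literal, OneOrMore, Optional, alphanums, alphas,
                       nums, printables, delimitedList)

# Assumption: ntext holds the sample citation from the question.
ntext = ("AuthorA, B., AuthorB, M. R., AuthorC, V., and B. LastAuthor. "
         "Some sciency title. Name of the confernce, City, State, "
         "December 3, 2012")

family_name = Word(alphanums + '-')
first_init = Word(alphanums + '.')
author = (family_name("LastName") + Literal(',').suppress()
          + OneOrMore(first_init("FirstInitials")))
last_author = first_init("FirstInitials") + family_name("LastName")
author_list = (delimitedList(author) + Optional(Literal(',')).suppress()
               + Literal('and').suppress() + last_author
               + Literal('.').suppress())

date = Word(alphas) + Word(nums) + Literal(',').suppress() + Word(nums)
# Collect whitespace-delimited words, but stop as soon as a date starts.
stuff = OneOrMore(Word(printables), stopOn=date)
citation = author_list('AuthorLst') + stuff('body') + date('Date')

result = citation.parseString(ntext)
date_tokens = result['Date'].asList()
print(date_tokens)
```

Because stopOn is checked before each repetition, the words "Some" through "State," go to stuff and the date expression still gets its chance to match.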

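Finally, about your use of results names ("FirstInitials", "LastName", etc.): well done, this is a pyparsing feature I'm particularly pleased with. But you need to isolate the names for each author in the citation, otherwise you will only get the names of the last author. To do this, wrap each author in a pyparsing Group: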

# Group must be added to the pyparsing imports
author = Group(family_name("LastName") + Literal(',').suppress() +
               OneOrMore(first_init("FirstInitials")))
last_author = Group(first_init("FirstInitials") + family_name("LastName"))
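
The effect of Group can be seen by parsing the same text with and without it. In the sketch below, build_author_list is a hypothetical helper introduced only for this comparison, and the author_list inside it is one possible repair of the original definition.

```python
from pyparsing import (Word, Literal, OneOrMore, Optional, Group, alphanums,
                       delimitedList)

family_name = Word(alphanums + '-')
first_init = Word(alphanums + '.')
authors_text = "AuthorA, B., AuthorB, M. R., AuthorC, V., and B. LastAuthor."

def build_author_list(wrap):
    """Hypothetical helper: build the author list, wrapping each author."""
    author = wrap(family_name("LastName") + Literal(',').suppress()
                  + OneOrMore(first_init("FirstInitials")))
    last_author = wrap(first_init("FirstInitials") + family_name("LastName"))
    return (delimitedList(author) + Optional(Literal(',')).suppress()
            + Literal('and').suppress() + last_author
            + Literal('.').suppress())

# Without Group, every author writes to the same "LastName" key,
# so only the final value is kept.
flat = build_author_list(lambda expr: expr).parseString(authors_text)
flat_last_name = flat['LastName']
print(flat_last_name)

# With Group, each author keeps its own named fields.
grouped = build_author_list(Group).parseString(authors_text)
grouped_last_names = [a['LastName'] for a in grouped]
print(grouped_last_names)
```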

Now your author_list should give you a list of substructures. You can see them if you do this:

print(citation.parseString(ntext).dump())

With my changes, I get this for your sample text:

[['AuthorA', 'B.'], ['AuthorB', 'M.', 'R.'], ['AuthorC', 'V.'], ',', 
['B.', 'LastAuthor'], '.', 'Some', 'sciency', 'title.', 'Name', 'of',
'the', 'confernce,', 'City,', 'State,', 'December', '3', '2012']
- AuthorLst: [['AuthorA', 'B.'], ['AuthorB', 'M.', 'R.'],
['AuthorC', 'V.'], ',', ['B.', 'LastAuthor'], '.']
[0]:
['AuthorA', 'B.']
- FirstInitials: 'B.'
- LastName: 'AuthorA'
[1]:
['AuthorB', 'M.', 'R.']
- FirstInitials: 'R.'
- LastName: 'AuthorB'
[2]:
['AuthorC', 'V.']
- FirstInitials: 'V.'
- LastName: 'AuthorC'
[3]:
,
[4]:
['B.', 'LastAuthor']
- FirstInitials: 'B.'
- LastName: 'LastAuthor'
[5]:
.

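You still need to suppress the ',' and '.' punctuation, but that is just cleanup. Then you can easily iterate over your author list and get each author's name.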
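
As a rough sketch of that cleanup, the version below suppresses the punctuation and then walks the grouped authors. The comma and period names are introduced here for brevity, and ntext is assumed to hold the sample citation. Note that "FirstInitials" keeps only the last initial of each author (as in the dump above), so the full token list of each group is printed as well.

```python
from pyparsing import (Word, Literal, OneOrMore, Optional, Group, alphanums,
                       alphas, nums, printables, delimitedList)

# Assumption: ntext holds the sample citation from the question.
ntext = ("AuthorA, B., AuthorB, M. R., AuthorC, V., and B. LastAuthor. "
         "Some sciency title. Name of the confernce, City, State, "
         "December 3, 2012")

comma = Literal(',').suppress()
period = Literal('.').suppress()

family_name = Word(alphanums + '-')
first_init = Word(alphanums + '.')
author = Group(family_name("LastName") + comma
               + OneOrMore(first_init("FirstInitials")))
last_author = Group(first_init("FirstInitials") + family_name("LastName"))
# Punctuation around "and" is matched but kept out of the results.
author_list = (delimitedList(author) + Optional(comma)
               + Literal('and').suppress() + last_author + period)

date = Word(alphas) + Word(nums) + comma + Word(nums)
stuff = OneOrMore(Word(printables), stopOn=date)
citation = author_list('AuthorLst') + stuff('body') + date('Date')

result = citation.parseString(ntext)
last_names = [a['LastName'] for a in result['AuthorLst']]
for a in result['AuthorLst']:
    # a.asList() holds every token of this author, including all initials.
    print(a['LastName'], a.asList())
print(result['Date'].asList())
```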

Regarding python - PyParsing bibliographic citations, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/41651200/
