
language-agnostic - Clean text from a PDF

Reposted · Author: 行者123 · Updated: 2023-12-04 08:53:01

This is more of an algorithm question than a language-specific one, so I am happy to receive answers in any language: even pseudocode, or just an idea.

Here is my problem: I need to process a large number of papers that were brutally copy/pasted from PDF into .txt files. For 3.5 GB of text (the corpus I am using is the ACL Anthology Network, http://clair.si.umich.edu/clair/aan/DatasetContents.html), I have the abominable result of only about 16k papers.

The "junk" comes from formulas, images, tables and the like. It just pops up in the middle of the running text, so I cannot clean it with regular expressions, and I cannot think of any way to apply machine learning either. I have already spent a week on this, and have now decided to move on with a quick & dirty fix. I do not care about removing it completely, and I do not care about false negatives and false positives, as long as most of these text regions are deleted.

Some examples of the text: note that the formulas contain junk characters, but the tables and captions do not (yet they still make my sentences very long and therefore unparsable). Junk is shown in bold.

A simple one:

The experiments were repeated while inhibiting specialization of first the scheme with the most expansions, and then the two most expanded schemata. Measures of coverage and speedup are important 1 As long as we are interested in preserving the f-structure assigned to sentences, this notion of coverage is stricter than necessary. The same f-structure can in fact be assigned by more than one parse, so that in some cases a sentence is considered out of coverage even if the specialized grammar assigns to it the correct f-structure. 2'VPv' and 'VPverb[main]' cover VPs headed by a main verb. 'NPadj' covers NPs with adjectives attached. 205 The original rule: l/Pperfp --+ ADVP* SE (t ADJUNCT) ($ ADV_TYPE) = t,padv ~/r { @M_Head_Perfp I@M_Head_Passp } @( Anaph_Ctrl $) { AD VP+ SE ('~ ADJUNCT) ($ ADV_TYPE) = vpadv is replaced by the following: ADVP,[.E (~ ADJUNCT) (.l. ADV_TYPE) = vpadv l/'Pperfp --+ @PPadjunct @PPcase_obl {@M.Head_Pevfp [@M..Head_Passp} @( Anaph_Ctrl ~ ) V { @M_Head_Perfp I@M_Head_Passp } @( Anaph_Ctrl ~) Figure 1: The pruning of a rule from the actual French grammar. The "*" and the "+" signs have the usual interpretation as in regular expressions. A sub-expression enclosed in parenthesis is optional. Alternative sub-expressions are enclosed in curly brackets and separated by the "[" sign. An "@" followed by an identifier is a macro expansion operator, and is eventually replaced by further functional descriptions. Corpus --.. ,, 0.1[ Disambiguated Treebank treebank Human expert Grammar specialization Specialized grammar Figure 2: The setting for our experiments on grammar specialization. indicators of what can be achieved with this form of grammar pruning. However, they could potentially be misleading, since failure times for uncovered sentences might be considerably lower than their sentences times, had they not been out of coverage.



A hard one:

Table 4 summarizes the precision results for both English and Romanian coreference. The results indicate that the English coreference is more indicate than the Romanian coreference, but SNIZZLE improves coreference resolution in both languages. There were 64% cases when the English coreference was resolved by a heuristic with higher priority than the corresponding heuristic for the Romanian counterpart. This result explains why there is better precision enhancement for English Romanian SWIZZLE on English SWIZZLE on Romanian Nominal Pronominal 73% 89% 66% 78% 76% 93% 71°/o 82% Table 4: Coreference precision Total 84% 72% 87% 76% English Romanian SWIZZLE on English SWIZZLE on Romanian Nominal 69% 63% 66% 61% Pronominal Total 89% 78% 83% 72% 87% 77% 80% 70% Table 5: Coreference recall the English coreference. Table 5 also illustrates the recall results. The advantage of the data-driven coreference resolution over other methods is based on its better recall performance. This is explained by the fact that this method captures a larger variety of coreference patterns. Even though other coreference resolution systems perform better for some specific forms of systems, their recall results are surpassed by the systems approach. Multilingual coreference in turn improves more the precision than the recall of the monolingual data-driven coreference systems. In addition, Table 5 shows that the English coref- erence results in better recall than Romanian coref- erence. However, the recall shows a decrease for both languages for SNIZZLE because imprecise coreference links are deleted. As is usually the case, deleting data lowers the recall. All results were obtained by using the automatic scorer program developed for the MUC evaluations.



Note how this table contains no strange characters and sits right in the middle of a sentence: "This result explains why -TABLE HERE- there is better precision enhancement for the English coreference." I do not know where the table falls relative to the running text; in a case like this it could appear before, inside, or after the sentence. Also note that the table contents do not end with a period (neither do most captions in these papers...), so I cannot rely on punctuation to spot it. I am certainly fine with inaccurate boundaries, but I still need to do something about these tables. Some of them contain words rather than numbers, and in those cases I have nothing to go on: no junk characters, nothing. It is only obvious to a human :S

Best Answer

(I hate bad copy and paste.)

A few ideas you may find useful (I have used every one of them myself at one point or another):

  • (Very brute-force): use a tokenizer and a dictionary (a real dictionary, not the data structure): parse out the words, then delete anything that is not a dictionary word. This can be a problem if your text contains many company/product names, but that can be solved with the right word lists (some exist on the web; the ones I used are proprietary, so I cannot share them, sorry).
  • Given a set of clean documents (say 2K), build a tf/idf index over them and use it as a dictionary: in the other documents, delete any word that does not appear in the index (or appears with a very low tf/idf). This should leave you with fairly clean documents.
  • Use Amazon's Mechanical Turk: set up a task in which people reading a document mark the passages that make no sense. This should be easy to do on the Mechanical Turk platform (16.5K documents is not much). It may cost you a few hundred dollars, but you would probably get a pretty good cleaning of the text (so if it is the company's money, this may be your way out: they should pay for their mistakes :)).
  • Given that your documents come from the same domain (same topics, overall) and share exactly the same problems (same table captions, roughly the same formulas): break all the documents into sentences and try clustering the sentences with ML. If the table captions/formulas are relatively similar to one another, they should cluster nicely away from the rest of the sentences, and then you can clean the documents sentence by sentence (take a document, break it into sentences, and delete each sentence that belongs to a "weird" cluster).
  • Regarding "language-agnostic - Clean text from a PDF", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/10416077/
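The brute-force dictionary idea above can be sketched in a few lines of Python. The `DICTIONARY` set here is a tiny hypothetical stand-in; in practice you would load a real word list from a file:

```python
import re

# Tiny stand-in for a real dictionary file (e.g. a system word list).
DICTIONARY = {"the", "experiments", "were", "repeated", "while",
              "grammar", "coverage", "is", "stricter", "than", "necessary"}

def clean_line(line, dictionary=DICTIONARY):
    """Tokenize and keep only tokens that are dictionary words."""
    tokens = re.findall(r"[A-Za-z']+", line)
    return " ".join(t for t in tokens if t.lower() in dictionary)

print(clean_line("The experiments were repeated 1 l/Pperfp --+ ADVP* SE"))
# prints: The experiments were repeated
```

Note that this drops everything non-alphabetic, including legitimate numbers and punctuation, which may or may not be acceptable for your downstream parser.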
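The tf/idf-index idea can be approximated with plain document frequencies, a crude stand-in for a full tf/idf score. The `clean_docs` below are toy examples; you would use your 2K clean documents instead:

```python
import re
from collections import Counter

def build_df(docs):
    """Count, for each lowercased token, how many documents contain it."""
    df = Counter()
    for doc in docs:
        for tok in set(re.findall(r"[a-z']+", doc.lower())):
            df[tok] += 1
    return df

def filter_by_df(text, df, min_df=1):
    """Drop tokens that occur in fewer than min_df clean documents."""
    tokens = re.findall(r"[A-Za-z']+", text)
    return " ".join(t for t in tokens if df[t.lower()] >= min_df)

clean_docs = ["the grammar assigns a structure to sentences",
              "coverage results for the specialized grammar"]
df = build_df(clean_docs)
print(filter_by_df("the grammar ADVP l/Pperfp assigns coverage", df))
# prints: the grammar assigns coverage
```

Raising `min_df` (or switching to real tf/idf weights) makes the filter stricter at the cost of more false positives on rare but legitimate vocabulary.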
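The sentence-clustering idea can be illustrated with a deliberately minimal sketch: a single "weirdness" feature (fraction of characters that are neither letters nor spaces) and a hand-rolled 1-D 2-means. A real system would use richer features (character n-grams, token statistics) and a proper ML library; everything here is a toy assumption:

```python
def weirdness(sentence):
    """Fraction of characters that are neither letters nor whitespace."""
    if not sentence:
        return 0.0
    odd = sum(1 for c in sentence if not (c.isalpha() or c.isspace()))
    return odd / len(sentence)

def kmeans_1d(xs, iters=20):
    """Tiny 2-means on scalars; returns the two centroids."""
    c = [min(xs), max(xs)]
    for _ in range(iters):
        groups = ([], [])
        for x in xs:
            groups[abs(x - c[1]) < abs(x - c[0])].append(x)
        c = [sum(g) / len(g) if g else c[i] for i, g in enumerate(groups)]
    return c

sentences = [
    "The experiments were repeated while inhibiting specialization.",
    "l/Pperfp --+ ADVP* SE (t ADJUNCT) ($ ADV_TYPE) = t,padv ~/r {@M_Head}",
    "Measures of coverage and speedup are important indicators.",
]
feats = [weirdness(s) for s in sentences]
c_clean, c_junk = sorted(kmeans_1d(feats))
kept = [s for s, f in zip(sentences, feats)
        if abs(f - c_clean) <= abs(f - c_junk)]
print(kept)
```

This catches character-heavy formula junk but not clean-looking table text, which is exactly the hard case the question describes; for that, the feature set would need token-level cues (capitalization patterns, percent signs, sentence length).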
