python - Extracting surrounding words in Python from a string position

Reposted. Author: 行者123. Updated: 2023-11-28 19:18:37

Suppose I have a string:

string="""<p>It is common for content in Arabic, Hebrew, and other languages that use right-to-left scripts to include numerals or include text from  other scripts. Both of these typically flow  left-to-right within the overall right-to-left  context. </p> <p>This article tells you how to write HTML where text with different writing directions is mixed <em>within a paragraph or other HTML block</em> (ie. <dfn id="term_inline">inline or phrasal</dfn> content). (A companion article <a href="/International/questions/qa-html-dir"><cite>Structural markup and right-to-left text in HTML</cite></a> tells you how to use HTML markup for  elements such as <code class="kw">html</code>, and structural markup such as <code class="kw">p</code> or <code class="kw">div</code> and forms.)</p>"""

I have the positions of a phrase in this string, for example:

>>> pos = [m.start() for m in re.finditer("tells you", string)]
>>> pos
[263, 588]

From each position I need to extract a few words before it and a few words after it. How can I do this with Python and regular expressions?

For example:

def look_through(d, s):
    r = []
    content = readFile(d["path"])
    content = BeautifulSoup(content)
    content = content.getText()
    pos = [m.start() for m in re.finditer(s, content)]
    if pos:
        if "phrase" not in d:
            d["phrase"] = [s]
        else:
            d["phrase"].append(s)
        for p in pos:
            r.append({"content": content, "phrase": d["phrase"], "name": d["name"]})
    for b in d["decendent"] or []:
        r += look_through(b, s)
    return r

>>> dict = {
    "content": """<p>It is common for content in Arabic, Hebrew, and other languages that use right-to-left scripts to include numerals or include text from other scripts. Both of these typically flow left-to-right within the overall right-to-left context. </p>""",
    "name": "directory",
    "decendent": [
        {
            "content": """<p>This article tells you how to write HTML where text with different writing directions is mixed <em>within a paragraph or other HTML block</em> (ie. <dfn id="term_inline">inline or phrasal</dfn> content). (A companion article <a href="/International/questions/qa-html-dir"><cite>Structural markup and right-to-left text in HTML</cite></a> tells you how to use HTML markup for elements such as <code class="kw">html</code>, and structural markup such as <code class="kw">p</code> or <code class="kw">div</code> and forms.)</p>""",
            "name": "subdirectory",
            "decendent": None
        },
        {
            "content": """It tells you how to use HTML markup for elements such as <code class="kw">html</code>, and structural markup such as <code class="kw">p</code> or <code class="kw">div</code> and forms.)""",
            "name": "subdirectory_two",
            "decendent": [
                {
                    "content": "Name 4",
                    "name": "subsubdirectory",
                    "decendent": None
                }
            ]
        }
    ]
}

So the result should be:

>>> look_through(dict, "tells you")
[
{ "content": "This article tells you how to", "phrase": "tells you", "name": "subdirectory" },
{ "content": "It tells you how to use", "phrase": "tells you", "name": "subdirectory_two" }
]

Thanks!

Best answer

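What you want is a "concordance" of the regexp hits: say, two words before and two words after the position where the regexp matched. The simplest way to do this is to split your string at that position and anchor your searches to the ends of the pieces. For example, to get the two words before and after index 263 (your first m.start()), you could do: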

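pos = 263  # the first m.start(); text is the string from the question
m_left = re.search(r"(?:\s+\S+){,2}\s+\S*$", text[:pos])
m_right = re.search(r"^\S*\s+(?:\S+\s+){,2}", text[pos:])
# m_right was matched against text[pos:], so its end must be shifted by pos
print(text[m_left.start():pos + m_right.end()])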

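The first expression should be read backwards from the end of the string: it is anchored at the end ($), optionally skips a partial word if the split falls in the middle of a word (\S*), skips some whitespace (\s+), and then matches up to two ({,2}) whitespace-plus-word sequences (\s+\S+). It is "up to two" rather than exactly two because we want a short match to be returned if we run into the start of the string.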

The second regular expression does the same thing, but in the opposite direction.
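As a small illustration of the two anchored searches described above, here is a minimal sketch on a made-up sentence (the sample text and the mid-word position are assumptions chosen for the demonstration, not taken from the question). The right-hand pattern is written with an "up to two" quantifier, and the right-hand offset is shifted by pos because that search runs on a slice.

```python
import re

# Hypothetical sample text; pos deliberately lands in the middle of "four".
text = "one two three four five six seven"
pos = text.index("four") + 2

# Left side: anchored at the end of the head slice. The partial word "fo"
# is absorbed by \S*$ and is not counted as one of the two words.
m_left = re.search(r"(?:\s+\S+){,2}\s+\S*$", text[:pos])

# Right side: anchored at the start of the tail slice. The partial word "ur"
# is absorbed by ^\S* and is not counted either.
m_right = re.search(r"^\S*\s+(?:\S+\s+){,2}", text[pos:])

# m_right offsets are relative to text[pos:], hence the "pos +" shift.
snippet = text[m_left.start():pos + m_right.end()]
print(repr(snippet))
```

Two limitations of these patterns as written are worth keeping in mind: the left pattern needs whitespace in front of every word it matches, so a word at the very start of the text is not picked up, and re.search returns None when the head slice contains no whitespace at all (for example when pos is 0).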

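As for the index, you may prefer to start reading after the end of the regexp match rather than at its start. In that case, use m.end() as the start of the second string.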
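The difference between the two starting points can be seen in a short sketch. It uses a shortened sentence from the question as sample text; the variable names are illustrative only.

```python
import re

# Shortened sample based on the question's text.
text = "This article tells you how to write HTML"
m = re.search("tells you", text)

# Right-hand pattern: optional partial word, whitespace, up to two words.
right_re = re.compile(r"^\S*\s+(?:\S+\s+){,2}")

# Starting at m.start(): "tells" is consumed as the leading partial word,
# so the two counted words are "you" and "how".
from_start = right_re.search(text[m.start():])
after_start = text[m.start():m.start() + from_start.end()]

# Starting at m.end(): the two counted words are "how" and "to".
from_end = right_re.search(text[m.end():])
after_end = text[m.start():m.end() + from_end.end()]

print(repr(after_start))
print(repr(after_end))
```

Starting from m.end() keeps the whole phrase out of the word count, which matches the "tells you how to" shape of the output the question asks for.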

I think it is fairly obvious how to use this with a list of regexp matches.
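For readers who want that last step spelled out, here is one possible sketch that loops over every match. The function name concordance, its parameter n and the two compiled patterns are hypothetical. The patterns are mirror-image variants of the ones in the answer (word followed by whitespace on the left, whitespace followed by word on the right), chosen here so that a word touching either end of the text is still included and the searches never return None. The phrase is passed through re.escape on the assumption that it is a literal string rather than a pattern.

```python
import re

def concordance(text, phrase, n=2):
    """Return each occurrence of phrase with up to n words of context per side."""
    # Up to n "word + whitespace" groups, then an optional partial word, at the end.
    left_re = re.compile(r"(?:\S+\s+){0,%d}\S*$" % n)
    # An optional partial word, then up to n "whitespace + word" groups, at the start.
    right_re = re.compile(r"^\S*(?:\s+\S+){0,%d}" % n)
    snippets = []
    for m in re.finditer(re.escape(phrase), text):
        m_left = left_re.search(text[:m.start()])
        m_right = right_re.search(text[m.end():])
        start = m_left.start()
        # m_right offsets are relative to text[m.end():], so shift them back.
        end = m.end() + m_right.end()
        snippets.append(text[start:end].strip())
    return snippets

print(concordance("This article tells you how to write HTML", "tells you"))
```

To plug this into something like look_through, the content would first need to be reduced to plain text (the question does this with BeautifulSoup), after which each returned snippet can be stored as the "content" value of a result entry.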

Regarding "python - Extracting surrounding words in Python from a string position", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/30106082/
