python - Preserving line breaks when parsing with Scrapy in Python

Reposted. Author: 太空狗. Updated: 2023-10-30 00:46:28
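I wrote a Scrapy spider that extracts text from pages. The spider parses and outputs correctly on many pages, but it is thrown off by a few. I am trying to preserve the line breaks and formatting of the document. Pages such as http://www.state.gov/r/pa/prs/dpb/2011/04/160298.htm come out correctly formatted, like this: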

April 7, 2011

Mark C. Toner

2:03 p.m. EDT

MR. TONER: Good afternoon, everyone. A couple of things at the top, and then I’ll take your questions. We condemn the attack on innocent civilians in southern Israel in the strongest possible terms, as well as ongoing rocket fire from Gaza. As we have reiterated many times, there’s no justification for the targeting of innocent civilians, and those responsible for these terrorist acts should be held accountable. We are particularly concerned about reports that indicate the use of an advanced anti-tank weapon in an attack against civilians and reiterate that all countries have obligations under relevant United Nations Security Council resolutions to prevent illicit trafficking in arms and ammunition. Also just a brief statement --

QUESTION: Can we stay on that just for one second?

MR. TONER: Yeah. Go ahead, Matt.

QUESTION: Apparently, the target of that was a school bus. Does that add to your outrage?

MR. TONER: Well, any attack on innocent civilians is abhorrent, but certainly the nature of the attack is particularly so.

Pages such as http://www.state.gov/r/pa/prs/dpb/2009/04/121223.htm, on the other hand, come out without line breaks, like this:

April 2, 2009

Robert Wood

11:53 a.m. EDTMR. WOOD: Good morning, everyone. I think it’s just about still morning. Welcome to the briefing. I don’t have anything, so – sir.QUESTION: The North Koreans have moved fueling tankers, or whatever, close to the site. They may or may not be fueling this missile. What words of wisdom do you have for the North Koreans at this moment?MR. WOOD: Well, Matt, I’m not going to comment on, you know, intelligence matters. But let me just say again, we call on the North to desist from launching any type of missile. It would be counterproductive. It’s provocative. It further inflames tensions in the region. We want to see the North get back to the Six-Party framework and focus on denuclearization.Yes.QUESTION: Japan has also said they’re going to call for an emergency meeting in the Security Council, you know, should this launch go ahead. Is this something that you would also be looking for?MR. WOOD: Well, let’s see if this test happens. We certainly hope it doesn’t. Again, calling on the North not to do it. But certainly, we will – if that test does go forward, we will be having discussions with our allies.

The code I am using is as follows:

def parse_item(self, response):
    self.log('Hi, this is an item page! %s' % response.url)

    hxs = HtmlXPathSelector(response)

    speaker = hxs.select("//span[contains(@class, 'official_s_name')]")  # gets the speaker
    speaker = speaker.select('string()').extract()[0]  # extracts speaker text
    date = hxs.select('//*[@id="date_long"]')  # gets the date
    date = date.select('string()').extract()[0]  # extracts the date
    content = hxs.select('//*[@id="centerblock"]')  # gets the content
    content = content.select('string()').extract()[0]  # extracts the content

    texts = "%s\n\n%s\n\n%s" % (date, speaker, content)  # puts everything together in a string

    filename = ("/path/StateDailyBriefing-" + '%s' ".txt") % (date)  # creates a file using the date

    # opens the file defined above and writes 'texts' using utf-8
    with codecs.open(filename, 'w', encoding='utf-8') as output:
        output.write(texts)

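I think the problem lies in how the pages' HTML is formatted. On the pages whose text comes out wrong, paragraphs are separated by <br> <p></p>, whereas on the pages that come out correctly, each paragraph is wrapped in <p align="left" dir="ltr">. So although I have identified the cause, I am not sure how to output everything consistently in the correct form.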

Best answer

The problem is that when you take text() or string(), <br> tags are not converted to newlines.
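
A minimal sketch of this behaviour, using only the standard library rather than Scrapy: joining an element's text nodes with xml.etree.ElementTree approximates what XPath string() returns. The markup and the helper name string_value are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical snippet shaped like the badly formatted pages:
# the paragraphs are separated only by <br /> tags.
SNIPPET = (
    '<div id="centerblock">11:53 a.m. EDT<br />'
    'MR. WOOD: Good morning.<br />QUESTION: Yes?</div>'
)


def string_value(element):
    """Concatenate all descendant text nodes, roughly like XPath string()."""
    return "".join(element.itertext())


root = ET.fromstring(SNIPPET)
flattened = string_value(root)
# The <br /> elements carry no text of their own, so nothing separates the lines.
print(flattened)
```

The printed text runs the three lines together, which is the same symptom as in the 2009 briefing output.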

Workaround: replace the <br> tags before running the XPath query. Code:

response = response.replace(body=response.body.replace('<br />', '\n')) 
hxs = HtmlXPathSelector(response)
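
Note that the literal '<br />' only matches that exact spelling, while the question describes <br> tags. A hedged sketch of a more tolerant replacement follows. The helper name br_to_newline is hypothetical, and it operates on bytes on the assumption that response.body is a byte string, as it is in current Scrapy on Python 3.

```python
import re

# Matches <br>, <br/>, <br /> and <br clear="all">, in any letter case,
# without matching other tags that merely start with "br".
BR_TAG = re.compile(rb"<br\b[^>]*>", re.IGNORECASE)


def br_to_newline(body):
    """Return the HTML bytes with every <br> variant replaced by a newline."""
    return BR_TAG.sub(b"\n", body)


sample = b"11:53 a.m. EDT<br>MR. WOOD: Good morning.<BR/>QUESTION: Yes?<br />"
converted = br_to_newline(sample)
print(converted.decode("utf-8"))

# Assumed usage inside the spider, mirroring the one-liner above:
#     response = response.replace(body=br_to_newline(response.body))
```

A regular expression is used here only because the pattern is a single void tag; it is not a general way to parse HTML.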

One more suggestion: if you know the element contains only a single text node, you can use text() instead of string():

date = hxs.select('//*[@id="date_long"]/text()').extract()[0]
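
To illustrate when that shortcut is safe, here is a standard-library analogy (not Scrapy's API): ElementTree's .text holds only the text before an element's first child, roughly what text() followed by extract()[0] yields, while joining itertext() approximates string(). The markup below is invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical date elements: one with a single text node, one with nested markup.
simple = ET.fromstring('<span id="date_long">April 2, 2009</span>')
nested = ET.fromstring('<span id="date_long">April <b>2</b>, 2009</span>')

# Single text node: the first text node is the whole content.
simple_first = simple.text
simple_all = "".join(simple.itertext())

# Nested markup: the first text node is only a fragment of the content.
nested_first = nested.text
nested_all = "".join(nested.itertext())

print(repr(simple_first), repr(simple_all))
print(repr(nested_first), repr(nested_all))
```

When the element may contain child tags, taking only the first text node silently truncates the value, so string() is the safer choice there.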

Regarding "python - Preserving line breaks when parsing with Scrapy in Python", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/8748053/