
python - Output not displaying all UTF-8 correctly

Reposted. Author: 太空宇宙. Updated: 2023-11-04 01:12:39

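I am writing a website scraper for http://www.delfi.lt (using lxml and py3k on Windows 8); the goal is to output certain information to a .txt file. Since the site is in Lithuanian, ASCII obviously cannot be used as the encoding, so I try to write the text out as UTF-8. However, not all of the non-ASCII characters are written to the file correctly.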

For example, I get DELFI Å½inios > Dienos naujienos > UÅ¾sienyje instead of DELFI Žinios > Dienos naujienos > Užsienyje

This is as far as I have got with the scraper:

from lxml import html
import sys

# Takes in command line input, namely the URL of the story and (optionally) the name of the CSV file that will store all of the data
# Outputs a list consisting of two strings, the first will be the URL, and the second will be the name if given, otherwise it'll be an empty string
def accept_user_input():
    if len(sys.argv) < 2 or len(sys.argv) > 3:
        raise type('IncorrectNumberOfArgumentsException', (Exception,), {})('Should have at least one, up till two, arguments.')
    if len(sys.argv) == 2:
        return [sys.argv[1], '']
    else:
        return sys.argv[1:]

def main():
    url, name = accept_user_input()
    page = html.parse(url)

    title = page.find('//h1[@itemprop="headline"]')
    category = page.findall('//span[@itemprop="title"]')

    with open('output.txt', encoding='utf-8', mode='w') as f:
        f.write((title.text) + "\n")
        f.write(' > '.join([x.text for x in category]) + '\n')

if __name__ == "__main__":
    main()

Example run: python scraper.py http://www.delfi.lt/news/daily/world/ukraina-separatistai-siauteja-o-turcynovas-atnaujina-mobilizacija.d?id=64678799 produces a file named output.txt containing

Ukraina: separatistai siautÄja, O. TurÄynovas atnaujina mobilizacijÄ
DELFI Å½inios > Dienos naujienos > UÅ¾sienyje

instead of

Ukraina: separatistai siautėja, O. Turčynovas atnaujina mobilizaciją
DELFI Žinios > Dienos naujienos > Užsienyje

How do I get the script to output all of the text correctly?

Best answer

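Using requests and BeautifulSoup works for me. Passing r.content hands BeautifulSoup the raw response bytes, so it can work out the encoding itself: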

import requests
from bs4 import BeautifulSoup

def main():
    url, name = "http://www.delfi.lt/news/daily/world/ukraina-separatistai-siauteja-o-turcynovas-atnaujina-mobilizacija.d?id=64678799", "foo.csv"
    r = requests.get(url)

    # r.content is raw bytes; naming the parser explicitly avoids bs4's "no parser specified" warning
    page = BeautifulSoup(r.content, "html.parser")

    title = page.find("h1", {"itemprop": "headline"})
    category = page.find_all("span", {"itemprop": "title"})
    print(title)
    with open('output.txt', encoding='utf-8', mode='w') as f:
        f.write((title.text) + "\n")
        f.write(' > '.join([x.text for x in category]) + '\n')

if __name__ == "__main__":
    main()

Output:

Ukraina: separatistai siautėja, O. Turčynovas atnaujina mobilizacijąnaujausi susirėmimų vaizdo įrašai
DELFI Žinios > Dienos naujienos > Užsienyje
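
The point of this approach is that decoding starts from the raw bytes rather than from text already decoded with the wrong charset. Below is a minimal stdlib-only sketch of the same idea. The helper names charset_from_content_type and decode_body are hypothetical, and defaulting to UTF-8 when the header names no charset is an assumption that happens to suit this site. It is not how requests or BeautifulSoup are implemented.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: Optional[str]) -> Optional[str]:
    """Return the charset named in a Content-Type header value, or None."""
    if not header_value:
        return None
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the lowercased charset parameter, or None
    return msg.get_content_charset()


def decode_body(raw: bytes, content_type: Optional[str], default: str = "utf-8") -> str:
    """Decode raw response bytes, falling back to `default` (an assumption)."""
    charset = charset_from_content_type(content_type) or default
    return raw.decode(charset, errors="replace")


# Simulated response body: the UTF-8 bytes of a Lithuanian word
raw = "Užsienyje".encode("utf-8")
print(decode_body(raw, "text/html"))                  # no charset: default is used
print(decode_body(raw, "text/html; charset=UTF-8"))   # charset taken from the header
```

Keeping the bytes untouched until a charset has been chosen means a wrong guess can still be corrected later, which is harder once the text has been decoded and re-encoded.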

Changing the parser encoding also works:

parser = etree.HTMLParser(encoding="utf-8")
page = html.parse(url,parser)

So change your code to the following:

from lxml import html,etree
import sys

# Takes in command line input, namely the URL of the story and (optionally) the name of the CSV file that will store all of the data
# Outputs a list consisting of two strings, the first will be the URL, and the second will be the name if given, otherwise it'll be an empty string
def accept_user_input():
    if len(sys.argv) < 2 or len(sys.argv) > 3:
        raise type('IncorrectNumberOfArgumentsException', (Exception,), {})('Should have at least one, up till two, arguments.')
    if len(sys.argv) == 2:
        return [sys.argv[1], '']
    else:
        return sys.argv[1:]

def main():
    url, name = accept_user_input()
    # Tell lxml explicitly that the page is UTF-8 instead of letting it guess
    parser = etree.HTMLParser(encoding="utf-8")
    page = html.parse(url, parser)

    title = page.find('//h1[@itemprop="headline"]')
    category = page.findall('//span[@itemprop="title"]')

    with open('output.txt', encoding='utf-8', mode='w') as f:
        f.write((title.text) + "\n")
        f.write(' > '.join([x.text for x in category]) + '\n')

if __name__ == "__main__":
    main()
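
The garbled output in the question is consistent with UTF-8 bytes being decoded as ISO-8859-1 (Latin-1) somewhere before the text reached the file. That is an inference from the symptoms, not something confirmed from lxml itself. Here is a small stdlib-only sketch of that mechanism; the function names simulate_mojibake and repair_mojibake are made up for illustration.

```python
def simulate_mojibake(text: str) -> str:
    """Encode as UTF-8, then wrongly decode the bytes as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair_mojibake(garbled: str) -> str:
    """Undo the wrong decode: back to the original bytes, then decode as UTF-8."""
    return garbled.encode("latin-1").decode("utf-8")


expected = "DELFI Žinios > Dienos naujienos > Užsienyje"
garbled = simulate_mojibake(expected)

print(garbled)                               # each 'Ž'/'ž' becomes two Latin-1 characters
print(repair_mojibake(garbled) == expected)  # the round trip restores the text
```

The repair step is only a diagnostic aid. Setting the parser encoding as shown above fixes the problem at the source, so nothing needs repairing afterwards.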

Regarding python - Output not displaying all UTF-8 correctly, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/26815502/
