python - Why does text formatting in the Python 3 shell differ from the generated text file?

Reposted. Author: 太空宇宙. Updated: 2023-11-03 16:42:05

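I am trying to read a web page and output the formatted text to a text file. The code below prints to the shell with the formatting intact, but when I write it to a file everything ends up on a single line (with the newline characters showing up in the text as \n).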

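I have tried several approaches, such as not converting it to a string and using prettify from Beautiful Soup, but none of them produces a formatted text file. I think I am missing something fairly basic. Any help or guidance would be appreciated.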

# Import 
from urllib.request import urlopen
from bs4 import BeautifulSoup

#The actual code


URL = "https://simple.wikipedia.org/wiki/castle" #The target URL
html = urlopen(URL).read() # Reads the url to variable html
soup = BeautifulSoup(html, "lxml") # Uses BS4 to create the soup using the lxml parser
soup = soup.get_text() # Extracts the text
print(soup) # Prints to python 3.5.1 shell, formatted as I would expect


# Now writing what I have extracted to a text file
file = open("TextOutput.txt", 'w') # Creates the file and opens as write (w)
file.writelines(str(soup.encode('UTF-8'))) # Tried file.write/lines(soup); conversion to string and encoding as UTF-8 needed to avoid errors
file.close()

An example of the file output looks like this:

b'\n\n\nCastle - Simple English Wikipedia, the free encyclopedia\ndocument.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );\n(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Castle","wgTitle":"Castle","wgCurRevisionId":5333370,"wgRevisionId":5333370,"wgArticleId":15933,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":[""],"wgCategories":["Castles"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShort":["","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"wgRelevantPageName":"Castle","wgRelevantArticleId":15933,"wgRequestId":"VxUR5gpAIDAAAEXY6FMAAACC","wgIsProbablyEditable":true,"wgRestrictionEdit":[],"wgRestrictionMove":[],"wgWikiEditorEnabledModules":{"toolbar":true,"dialogs":true,"preview":false,"publish":false},"wgBetaFeaturesFeatures":[],"wgMediaViewerOnClick":true,"wgMediaViewerEnabledByDefault":true,"wgVisualEditor":{"pageLanguageCode":"en","pageLanguageDir":"ltr","usePageImages":true,"usePageDescriptions":true},"wgPreferredVariant":"en","wgRelatedArticles":null,"wgRelatedArticlesUseCirrusSearch":true,"wgRelatedArticlesOnlyUseCirrusSearch":false,"wgULSAcceptLanguageList":[],"wgULSCurrentAutonym":"English","wgCategoryTreePageCategoryOptions":"{\"mode\":0,\"hideprefix\":20,\"showcount\":true,\"namespaces\":false}","wgNoticeProject":"wikipedia","wgCentralNoticeCategoriesUsingLegacy":["Fundraising","fundraising"],"wgCentralAuthMobileDomain":false,"wgWikibaseItemId":"Q23413","wgVisualEditorToolbarScrollOffset":0});mw.loader.implement("user.options",function(
$,jQuery){mw.user.options.set({"variant":"en"});});mw.loader.implement("user.tokens",function ( $, jQuery ) {\nmw.user.tokens.set({"editToken":"+\\","patrolToken":"+\\","watchToken":"+\\","csrfToken":"+\\"});/@nomin*/;\n\n});mw.loader.load(["mw.MediaWikiPlayer.loader","mw.PopUpMediaTransform","mw.TMHGalleryHook.js","mediawiki.page.startup","mediawiki.legacy.wikibits","ext.centralauth.centralautologin","mmv.head","ext.visualEditor.desktopArticleTarget.init","ext.uls.init","ext.uls.interface","ext.centralNotice.bannerController","skins.vector.js"]);});\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCastle\n\nFrom Wikipedia, the free encyclopedia\n\n\n\t\t\t\t\tJump to:\t\t\t\t\tnavigation, \t\t\t\t\tsearch\n\n\n\n\n\nBodiam Castle in England surrounded by a water-filled moat.\n\n\n\n\n\n\nLichtenstein Castle\n\n\nA castle (from the Latin word castellum) is a fortified structure made in Europe and the Middle East during the Middle Ages. People argue about what the word castle means. However, it usually means a private structure of a lord or noble. This is different from a fortress, which is not a home, and from a fortified town, which was a public defence. For about 900\xc2\xa0years that castles were built they had many different shapes and different details.\nCastles began in Europe in the 9th and 10th centuries. They controlled the places surrounding them, and could both help in attacking and defending. Weapons could be fired from castles, or people could be protected from enemies in castles. However, castles were also a symbol of power. They could be used to control the people and roads around it.\nMany castles were built with earth and wood at first often using manual labour, and then had their defences replaced by stone instead. Early castles often used nature for protection, and did not have towers. By the late 12th and early 13th centuries, though, castles became longer and more complex.\n

Best answer

file.writelines(str(soup.encode('UTF-8'))) is a bit crazy. It:

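  1. Encodes the text (str) to binary (bytes)
  2. Wraps that in str to get its textual representation (that is, what you would type to recreate the bytes object, not the raw binary itself)
  3. Writes it one character at a time (writelines iterates whatever you give it, and a str iterates character by character)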
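
The three steps above can be reproduced in memory without fetching a page. This is only a minimal sketch: the sample string and the names text, encoded, as_repr and buf are illustrative assumptions and do not appear in the original code.

```python
import io

# Hypothetical sample standing in for the text returned by soup.get_text()
text = "café\nbar"

# Step 1: encode the text (str) to binary (bytes)
encoded = text.encode("utf-8")

# Step 2: str() on a bytes object returns its repr, not the decoded text
as_repr = str(encoded)

# Step 3: writelines() iterates its argument; iterating a str yields
# one character at a time, so each character is written separately
buf = io.StringIO()
buf.writelines(as_repr)

print(as_repr)                 # b'caf\xc3\xa9\nbar' (escapes are literal text)
print("\n" in buf.getvalue())  # False: no real line break is left
```

The in-memory buffer ends up holding exactly the repr string, which is why the output file shows a b'...' wrapper and backslash escapes instead of line breaks.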

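Step 3 is silly and inefficient, but basically harmless. Step 1 would be fine if you then wrote the raw binary to a file opened for binary writing, actually writing the bytes object. But steps 1 and 2 together mean that things like newlines are converted to a literal \n in the output instead of actually breaking the line. Non-ASCII content such as é is output as \xc3\xa9, and the whole thing is wrapped in b'' (or b"").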
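
For comparison, here is a rough sketch of the binary route mentioned above: keep step 1, skip step 2, and write the bytes object to a file opened in binary mode. The sample string, the temporary directory and the names out_dir, path and raw are assumptions made for illustration.

```python
import os
import tempfile

# Hypothetical sample standing in for the extracted page text
text = "café\nbar\n"

# Write into a throwaway directory so the sketch leaves the working dir alone
out_dir = tempfile.mkdtemp()
path = os.path.join(out_dir, "TextOutput.txt")

# Binary mode ("wb") accepts the bytes object directly; no str() wrapper
with open(path, "wb") as f:
    f.write(text.encode("utf-8"))

# Read the raw bytes back to confirm the line breaks survived
with open(path, "rb") as f:
    raw = f.read()

print(raw.decode("utf-8") == text)  # True
print(raw.count(b"\n"))             # 2
```

Binary mode performs no newline translation, so the file contains exactly the encoded bytes. Opening the file in text mode with an explicit encoding is usually the simpler choice.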

You want something like this:

# Open with UTF-8 encoding (in case your system defaults to something else)
with open("TextOutput.txt", 'w', encoding='utf-8') as file:
    # Write the text as a single block; soup is already the extracted
    # string here, because the question reassigns it to soup.get_text()
    file.write(soup)

For "python - Why does text formatting in the Python 3 shell differ from the generated text file?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/36701046/
