How to Scrape Romance of the Three Kingdoms with Python

Reposted by: qq735679552  Updated: 2022-09-27 22:32:09

This scraping tutorial has four parts:

1. Where to scrape from (where)

2. What to scrape (what)

3. How to scrape it (how)

4. How to save the scraped content (save)

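1. Where to scrape from

Romance of the Three Kingdoms on shicimingju.com. The index page used by the script in section 4 is http://www.shicimingju.com/book/sanguoyanyi.html.

2. What to scrape

The full text of Romance of the Three Kingdoms.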

3. How to scrape

Press F12 in Chrome to open DevTools on a chapter page. The chapter text sits inside this node:

    <div id="con" class="bookyuanjiao">

Once that node is located, its content only needs to be written to an HTML file:

    content = soup.find("div", {"class": "bookyuanjiao", "id": "con"})
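To make the lookup concrete without any third-party dependency, here is a minimal sketch that pulls the text out of the div with id "con" using only the standard library's html.parser. This is a swapped-in technique, not the BeautifulSoup call shown above. The names DivTextExtractor and extract_div_text and the sample markup are hypothetical, chosen only for illustration.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect the text inside the first <div> whose id equals target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # 0 means "currently outside the target div"
        self.done = False   # set once the target div has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            # Inside the target: count nested divs so the matching
            # closing tag can be recognised.
            if tag == "div":
                self.depth += 1
        elif tag == "div" and dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_div_text(html, target_id="con"):
    parser = DivTextExtractor(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


# Hypothetical markup that mimics the structure described above.
sample = (
    '<html><body><div id="nav">menu</div>'
    '<div id="con" class="bookyuanjiao"><h2>Chapter 1</h2>'
    '<p>First paragraph.</p><div class="note">A note.</div></div>'
    '<div id="footer">footer</div></body></html>'
)
text = extract_div_text(sample)
print(text)
```

The depth counter is what keeps a nested div from ending the capture early. BeautifulSoup handles that bookkeeping internally, which is why the one-line find call is enough in the article's version.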

4. How to save the scraped content

The idea is to take the chapter content, append it to a prepared HTML header, and save the result as an HTML file. The script targets Python 2: it uses urllib2, reload(sys), and the print statement.

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import os
    import sys
    import urllib2

    from bs4 import BeautifulSoup as BS
    from lxml import etree

    # Python 2 workaround: make implicit str/unicode conversions use GBK.
    reload(sys)
    sys.setdefaultencoding('gbk')

    # Create the output folder under the current working directory.
    sub_folder = os.path.join(os.getcwd(), "sanguoyanyi")
    if not os.path.exists(sub_folder):
        os.mkdir(sub_folder)
    path = sub_folder

    # Custom HTML used as the head of every saved chapter.
    with open(r'0.html', 'r') as head_file:
        head = head_file.read()

    domain = 'http://www.shicimingju.com/book/sanguoyanyi.html'
    t = domain.find(r'.html')
    new_domain = '/'.join(domain.split("/")[:-2])
    first_chapter_url = domain[:t] + "/" + str(1) + '.html'
    print first_chapter_url

    # Get the URLs from the chapter list.
    req = urllib2.Request(url=domain)
    resp = urllib2.urlopen(req)
    html = resp.read()
    soup = BS(html, 'lxml')
    chapter_list = soup.find("div", {"class": "bookyuanjiao", "id": "mulu"})
    sel = etree.HTML(str(chapter_list))
    result = sel.xpath('//li/a/@href')

    for each_link in result:
        each_chapter_link = new_domain + "/" + each_link
        print each_chapter_link
        req = urllib2.Request(url=each_chapter_link)
        resp = urllib2.urlopen(req)
        html = resp.read()
        soup = BS(html, 'lxml')
        content = soup.find("div", {"class": "bookyuanjiao", "id": "con"})
        # Strip the site suffix from the page title to get the chapter name.
        title = soup.title.text
        title = title.split(u'_《三国演义》_诗词名句网')[0]
        html = str(content)
        html = head + html + "</body></html>"
        filename = os.path.join(path, title + ".html")
        print filename
        # Write the chapter file.
        output = open(filename, 'w')
        output.write(html)
        output.close()
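The script builds each chapter URL by string concatenation, which only works if the href values have exactly the form it expects. As a hedged alternative, the sketch below uses Python 3's urllib.parse.urljoin, which resolves both root-relative and page-relative links against the index URL. The helper names fetch and resolve_chapter_links, the User-Agent header, the timeout, and the sample href values are assumptions for illustration; the real hrefs on the site may look different.

```python
from urllib.parse import urljoin
from urllib.request import Request, urlopen

INDEX_URL = "http://www.shicimingju.com/book/sanguoyanyi.html"


def fetch(url, timeout=10):
    """Download a page and return its raw bytes (Python 3 counterpart of
    the urllib2.Request / urllib2.urlopen pair in the script)."""
    # The User-Agent value is an assumption; some sites reject the default.
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req, timeout=timeout) as resp:
        return resp.read()


def resolve_chapter_links(index_url, hrefs):
    """Turn href values from the chapter list into absolute URLs."""
    return [urljoin(index_url, href) for href in hrefs]


# Hypothetical href values: one root-relative, one relative to the index page.
hrefs = ["/book/sanguoyanyi/1.html", "sanguoyanyi/2.html"]
urls = resolve_chapter_links(INDEX_URL, hrefs)

# What the script's concatenation would produce for the first href.
new_domain = "/".join(INDEX_URL.split("/")[:-2])
naive = new_domain + "/" + hrefs[0]

for url in urls:
    print(url)
print(naive)
```

With a root-relative href, the concatenated form ends up with a doubled slash after the host name, while urljoin yields a clean URL in both cases. fetch is defined but not called here so that the example runs without network access.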

The content of 0.html is as follows:

    <html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body>
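The save step can be sketched in Python 3 as follows, reusing the header markup from 0.html. This is an illustration under stated assumptions, not the article's own code: the helpers safe_filename and save_chapter are hypothetical, and the set of replaced characters is an assumption that covers characters Windows does not allow in file names. Writing with an explicit UTF-8 encoding keeps the file consistent with the charset declared in the header.

```python
import re
from pathlib import Path

# Same markup as the 0.html header file.
HEAD = (
    '<html><head><meta http-equiv="Content-Type" '
    'content="text/html; charset=utf-8"></head><body>'
)


def safe_filename(title):
    """Replace characters that are not allowed in Windows file names."""
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", title).strip()
    return cleaned or "untitled"


def save_chapter(folder, title, content_html, head=HEAD):
    """Wrap the chapter markup in the header and footer, then write it out."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / (safe_filename(title) + ".html")
    target.write_text(head + content_html + "</body></html>", encoding="utf-8")
    return target
```

A call such as save_chapter("sanguoyanyi", title, str(content)) would stand in for the last few lines of the loop in the script, where title and content are the values extracted from each chapter page.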
        
 
     
 
   

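Summary

That is how to scrape Romance of the Three Kingdoms with Python. I hope it helps anyone learning Python; if you have questions, feel free to leave a comment.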
