
Python Example: Scraping Baidu Wenku Documents

Reposted. Author: qq735679552. Updated: 2022-09-28 22:32:09

This article demonstrates, with a working example, how to scrape a Baidu Wenku document in Python and save it as a Word file. It is shared for your reference; the details follow:

# -*- coding: utf-8 -*-
from selenium import webdriver
from bs4 import BeautifulSoup
from docx import Document
from docx.enum.text import WD_ALIGN_PARAGRAPH  # used to center the title
from time import sleep

# Browser driver path (uncomment and adjust if chromedriver is not on PATH)
# BROWSER_PATH = r'C:\Users\Administrator\AppData\Local\Google\Chrome\Application\chromedriver.exe'
# Target URL
DEST_URL = 'https://wenku.baidu.com/view/aa31a84bcf84b9d528ea7a2c.html'
# Used to accumulate the document
doc_title = ''
doc_content_list = []

def find_doc(driver, init=True):
    global doc_content_list
    global doc_title
    stop_condition = False
    html = driver.page_source
    soup1 = BeautifulSoup(html, 'lxml')
    if init:  # first call: grab the title and expand the folded preview
        title_result = soup1.find('div', attrs={'class': 'doc-title'})
        doc_title = title_result.get_text()  # document title
        # Scroll the "expand full text" control into view and click it
        init_page = driver.find_element_by_xpath("//div[@class='foldpagewg-text-con']")
        print(type(init_page), init_page)
        driver.execute_script('arguments[0].scrollIntoView();', init_page)
        init_page.click()
        init = False
    else:
        try:
            # Raises if the pager is gone, which ends the recursion below
            page = driver.find_element_by_xpath("//div[@class='pagerwg-schedule']")
            next_page = driver.find_element_by_class_name("pagerwg-button")
            station = driver.find_element_by_xpath("//div[@class='bottombarwg-root border-none']")
            driver.execute_script('arguments[0].scrollIntoView(false);', station)
            next_page.click()
        except Exception:
            # Stop condition: no "next page" button left
            print("Element not found")
            stop_condition = True
    # Collect every paragraph tagged with class 'txt', strip spaces, and save it
    content_result = soup1.find_all('p', attrs={'class': 'txt'})
    for each in content_result:
        text = each.get_text().replace(' ', '')
        doc_content_list.append(text)
    sleep(2)  # guard against slow page loads
    if stop_condition is False:
        doc_title, doc_content_list = find_doc(driver, init)
    return doc_title, doc_content_list

def save(doc_title, doc_content_list):
    document = Document()
    heading = document.add_heading(doc_title, 0)
    heading.alignment = WD_ALIGN_PARAGRAPH.CENTER  # center the title
    for each in doc_content_list:
        document.add_paragraph(each)
    # Keep only the first whitespace-delimited token of the title for the file name
    t_title = doc_title.split()[0]
    document.save('百度文库-%s.docx' % t_title)
    print("\n\nCompleted: %s.docx, to read." % t_title)

if __name__ == '__main__':
    options = webdriver.ChromeOptions()
    # Pretend to be a mobile browser so Baidu Wenku serves the simpler mobile layout
    options.add_argument('user-agent="Mozilla/5.0 (Linux; Android 4.0.4; Galaxy Nexus Build/IMM76B) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.133 Mobile Safari/535.19"')
    # driver = webdriver.Chrome(BROWSER_PATH, chrome_options=options)
    driver = webdriver.Chrome(chrome_options=options)
    driver.get(DEST_URL)
    print("**********START**********")
    title, content = find_doc(driver, True)
    save(title, content)
    driver.quit()
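
A note on Selenium versions: the listing above targets the Selenium 3 API; find_element_by_xpath, find_element_by_class_name, and the chrome_options= keyword were all removed in Selenium 4. Below is a minimal sketch of the equivalent calls under Selenium 4, assuming the page still exposes the same class names (Baidu Wenku's markup may well have changed since this article was written):

from selenium import webdriver
from selenium.webdriver.common.by import By

DEST_URL = 'https://wenku.baidu.com/view/aa31a84bcf84b9d528ea7a2c.html'

options = webdriver.ChromeOptions()
options.add_argument('user-agent="Mozilla/5.0 (Linux; Android 4.0.4; Galaxy Nexus Build/IMM76B) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.133 Mobile Safari/535.19"')
driver = webdriver.Chrome(options=options)  # 'chrome_options=' became 'options=' in Selenium 4
driver.get(DEST_URL)

# Selenium 3: driver.find_element_by_xpath("//div[@class='foldpagewg-text-con']")
init_page = driver.find_element(By.XPATH, "//div[@class='foldpagewg-text-con']")
# Selenium 3: driver.find_element_by_class_name("pagerwg-button")
next_page = driver.find_element(By.CLASS_NAME, "pagerwg-button")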

I hope this article is helpful to readers working with Python.

