python - Web scraping: output CSV is messed up

Reposted by: 行者123. Updated: 2023-11-30 22:45:16


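This code is meant to loop through all the results pages, then loop through the results table on each page, and scrape all the data from the table as well as some information stored outside the table.

However, the resulting CSV file doesn't seem to have any sensible organization: each row has different categories of information in different columns. What I'm after is for every row to contain all of the defined categories of information (date, party, start date, end date, electoral district, registered association, whether or not the candidate was elected, name of candidate, address, and financial agent). Some of this data is stored in the table on each page, while the rest (date, party, district, registered association) is stored outside the table and needs to be associated with each candidate in each table row on each page. Additionally, there doesn't seem to be any output for "elected", "address", or "financial agent", and I'm not sure where I'm going wrong.

I would really appreciate it if you could help me figure out how to fix my code to achieve this output. The code is as follows: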

from bs4 import BeautifulSoup
import requests
import re
import csv

url = "http://www.elections.ca/WPAPPS/WPR/EN/NC?province=-1&distyear=2013&district=-1&party=-1&pageno={}&totalpages=55&totalcount=1368&secondaryaction=prev25"

rows = []

for i in range(1, 56):
    print(i)
    r  = requests.get(url.format(i))
    data = r.text
    soup = BeautifulSoup(data, "html.parser")
    links = []

    for link in soup.find_all('a', href=re.compile('selectedid=')):
        links.append("http://www.elections.ca" + link.get('href'))

    for link in links:
        r  = requests.get(link)
        data = r.text
        cat = BeautifulSoup(data, "html.parser")
        header = cat.find_all('span')
        tables = cat.find_all("table")[0].find_all("td")        

        rows.append({
            #"date": 
            header[2].contents[0],
            #"party": 
            re.sub("[\n\r/]", "", cat.find("legend").contents[2]).strip(),
            #"start_date": 
            header[3].contents[0],
            #"end_date": 
            header[5].contents[0],
            #"electoral district": 
            re.sub("[\n\r/]", "", cat.find_all('div', class_="group")[2].contents[2]).strip(),
            #"registered association": 
            re.sub("[\n\r/]", "", cat.find_all('div', class_="group")[2].contents[2]).strip().encode('latin-1'),
            #"elected": 
            re.sub("[\n\r/]", "", cat.find_all("table")[0].find_all("td", headers="elected/1")[0].contents[0]).strip(),
            #"name": 
            re.sub("[\n\r/]", "", cat.find_all("table")[0].find_all("td", headers="name/1")[0].contents[0]).strip(),
            #"address": 
            re.sub("[\n\r/]", "", cat.find_all("table")[0].find_all("td", headers="address/1")[0].contents[0]).strip(),
            #"financial_agent": 
            re.sub("[\n\r/]", "", cat.find_all("table")[0].find_all("td", headers="fa/1")[0].contents[0]).strip()
        })

with open('scrapeOutput.csv', 'w') as f_output:
   csv_output = csv.writer(f_output)
   csv_output.writerows(rows)

Best answer

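I think your dictionary is a bit messed up: you didn't assign any keys. Without key: value pairs, the braces create a set rather than a dictionary, so the values have no fixed order and equal values (including repeated empty strings) collapse into one, which would explain the scrambled columns. With the csv library, though, you can write the CSV easily without any manual conversion.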
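To see why the missing keys matter, here is a minimal sketch with made-up values (not data from the site). Braces holding bare values build a set, which has no fixed order and keeps only one copy of equal values, while braces holding key: value pairs build a dictionary.

```python
# Made-up example values; the real ones would come from the scraped page.
keyless = {"2013-01-01", "Example Party", "2013-01-01"}
keyed = {
    "date": "2013-01-01",
    "party": "Example Party",
    "start_date": "2013-01-01",
}

# Bare values in braces produce a set: the duplicate date is dropped
# and the iteration order is not guaranteed.
print(type(keyless).__name__)  # set
print(len(keyless))            # 2

# Keyed values produce a dict: every field is kept under its own name,
# and Python 3.7+ preserves the insertion order of the keys.
print(type(keyed).__name__)    # dict
print(list(keyed))             # ['date', 'party', 'start_date']
```

Because csv.writer simply iterates over each row, a set row ends up with its values in whatever order the set yields them.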

So assign the keys:

rows.append({
        "date": 
        header[2].contents[0],
        "party": 
        re.sub("[\n\r/]", "", cat.find("legend").contents[2]).strip(),
        "start_date": 
        header[3].contents[0],
        "end_date": 
        header[5].contents[0],
        "electoral district": 
        re.sub("[\n\r/]", "", cat.find_all('div', class_="group")[2].contents[2]).strip(),
        "registered association": 
        re.sub("[\n\r/]", "", cat.find_all('div', class_="group")[2].contents[2]).strip().encode('latin-1'),
        "elected": 
        re.sub("[\n\r/]", "", cat.find_all("table")[0].find_all("td", headers="elected/1")[0].contents[0]).strip(),
        "name": 
        re.sub("[\n\r/]", "", cat.find_all("table")[0].find_all("td", headers="name/1")[0].contents[0]).strip(),
        "address": 
        re.sub("[\n\r/]", "", cat.find_all("table")[0].find_all("td", headers="address/1")[0].contents[0]).strip(),
        "financial_agent": 
        re.sub("[\n\r/]", "", cat.find_all("table")[0].find_all("td", headers="fa/1")[0].contents[0]).strip()
    })

Then use DictWriter to write your csv:

with open('scrapeOutput.csv', 'w') as f_output:
    csv_output = csv.DictWriter(f_output, rows[0].keys())
    csv_output.writeheader() # Write header to understand the csv
    csv_output.writerows(rows)
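As a variation on the snippet above, the following sketch assumes Python 3 and passes an explicit column list instead of rows[0].keys(), so the column order does not depend on the first row. The field names and sample rows are hypothetical placeholders, and the output goes to an in-memory buffer so the example runs without touching the disk.

```python
import csv
import io

# Hypothetical subset of the columns, in the order they should appear.
FIELDNAMES = ["date", "party", "name"]

# Hypothetical rows; the second one is missing the "name" field.
rows = [
    {"date": "2013-01-01", "party": "Example Party", "name": "A. Candidate"},
    {"date": "2013-01-02", "party": "Other Party"},
]


def write_rows(f_output, rows, fieldnames):
    """Write a header plus one CSV line per dict to an open text file."""
    # restval fills in any field that a row dict does not contain.
    writer = csv.DictWriter(f_output, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)


buffer = io.StringIO(newline="")
write_rows(buffer, rows, FIELDNAMES)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file in Python 3, the same function can be called inside with open('scrapeOutput.csv', 'w', newline='') as f_output; the csv module documentation recommends newline='' so that no extra blank lines appear on Windows.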

I tested this and it works, but note that some of your fields (for example address or elected) are empty :)
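The question also asks for the page-level fields (date, party and so on) to be repeated on every candidate row, while the snippet above only reads the first table row. One possible approach, sketched here with hard-coded stand-ins instead of real scraped values, is to merge a per-page dictionary into a per-row dictionary and default any missing field to an empty string. The names page_info, table_rows, FIELDS and build_rows are hypothetical and not part of the original code.

```python
# Hypothetical stand-ins for values scraped from one detail page.
page_info = {"date": "2013-01-01", "party": "Example Party"}

# Hypothetical stand-ins for the cells of each table row on that page.
table_rows = [
    {"name": "A. Candidate", "elected": "Y", "financial_agent": "B. Agent"},
    {"name": "C. Candidate"},  # some cells were empty or missing
]

FIELDS = ["date", "party", "name", "elected", "address", "financial_agent"]


def build_rows(page_info, table_rows, fields):
    """Return one dict per table row, each carrying the page-level fields."""
    rows = []
    for cells in table_rows:
        merged = {**page_info, **cells}
        # Keep a fixed set of columns; absent values become "".
        rows.append({field: merged.get(field, "") for field in fields})
    return rows


rows = build_rows(page_info, table_rows, FIELDS)
for row in rows:
    print(row)
```

Each resulting dictionary has the same keys in the same order, so the list can be passed to DictWriter as shown earlier.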

Bye!

Regarding python - Web scraping: output CSV is messed up, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/41237695/
