
python - "'ResultSet' object has no attribute 'findAll'" error message in Beautiful Soup inside a function

Reposted. Author: 行者123. Updated: 2023-12-01 03:24:12

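I am learning Python, Beautiful Soup in particular, and I am working through Google's regular-expressions exercise, which uses a set of html files containing the popular baby names for different years (e.g. baby1990.html). If you are interested, the dataset can be found here: https://developers.google.com/edu/python/exercises/baby-names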

Each html file contains a table with the baby-name data, as shown below:

[Image: screenshot of the baby-names table]

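I wrote a function that extracts the baby names from the html files and stores them in dataframes: one dataframe per file, kept in a dictionary, plus a single dataframe that aggregates all of them.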

There are two tables in each html file. The table containing the baby data has the following html code:

<table width="100%" border="0" cellspacing="0" cellpadding="4" summary="formatting">

In this line, the distinguishing attribute is summary="formatting".
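As a minimal sketch of how that attribute can be used to pick the right table, the snippet below parses a small, made-up html fragment (not the real baby-names page) that has two tables and selects only the one carrying summary="formatting". The sample markup and the placeholder names are assumptions for illustration, and the built-in html.parser is used so that no extra parser is required.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of a page with two tables; only the second one
# carries summary="formatting".
SAMPLE_HTML = """
<table width="100%" border="0" cellspacing="0" cellpadding="4">
  <tr><td>navigation</td></tr>
</table>
<table width="100%" border="0" cellspacing="0" cellpadding="4" summary="formatting">
  <tr><td>1</td><td>BoyName</td><td>GirlName</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Every table on the page versus only the table with the distinguishing attribute.
all_tables = soup.find_all("table")
data_tables = soup.find_all("table", attrs={"summary": "formatting"})

first_cell = data_tables[0].find("td").get_text()
print(len(all_tables), len(data_tables), first_cell)
```

Passing the attribute through attrs works for any attribute name, which is why it is used here instead of a keyword argument.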

The function I wrote, edited according to the feedback I received, is as follows:

def babynames(path):

    # This function takes the path of the directory where the html files are stored and returns a list containing
    # a dataframe which encompasses all the tabular baby-names data in the files, as well as a dictionary holding
    # a separate dataframe for each html file

    # 0: Initialize objects
    dicnames = {}  # will hold the dataframes containing the tabular data of each year
    dfnames = pd.DataFrame([])  # will hold the aggregate data

    # 1: Create a list containing the full paths of the baby files in the directory indicated by the path argument
    # of the babynames function
    allfiles = files(path)

    # 2: Begin looping through the files

    for file in allfiles:
        with open(file,"r") as f: soup = bs(f.read(), 'lxml')  # Convert the file to a soup

        # 3. Initialize empty lists to hold the contents of the cells
        Rank = []
        Baby_1 = []
        Baby_2 = []
        df = pd.DataFrame([])

        # 4. Extract the table containing the baby data and loop through the rows of this table

        for row in soup.select("table[summary=formatting] tr"):

            # 5. Extract the cells

            cells = row.findAll("td")

            # 6. Convert to text and append to lists
            try:
                Rank.append(cells[0].find(text=True))
                Baby_1.append(cells[1].find(text=True))
                Baby_2.append(cells[2].find(text=True))
            except:
                print "file: " , file
                try:
                    print "cells[0]: " , cells[0]
                except:
                    print "cells[0] : NaN"
                try:
                    print "cells[1]: " , cells[1]
                except:
                    print "cells[1] : NaN"
                try:
                    print "cells[2]: " , cells[2]
                except:
                    print "cells[2] : NaN"

        # 7. Append the lists to the empty dataframe df
        df["Rank"] = Rank
        df["Baby_1"] = Baby_1
        df["Baby_2"] = Baby_2

        # 8. Append the year to the dataframe as a separate column
        df["Year"] = extractyear(file)  # Call the function extractyear() defined in the environment with input
                                        # the full pathname stored in variable file and examined in the current
                                        # iteration

        # 9. Rearrange the order of columns
        # df.columns.tolist() = ['Year', 'Rank', 'Baby_1', 'Baby_2']

        # 10. Store the dataframe in a dictionary as the value whose key is the name of the file
        pattern = re.compile(r'.*(baby\d\d\d\d).*')
        filename = re.search(pattern, file).group(1)
        dicnames[filename] = df

    # 11. Combine the dataframes stored in the dictionary dicnames into an aggregate dataframe dfnames
    for key, value in dicnames.iteritems():
        dfnames = pd.concat[dfnames, value]

    # 12. Store dfnames and dicnames in a list called result. Return result.
    result = [dfnames, dicnames]
    return result
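The function relies on two helpers, files() and extractyear(), that are defined elsewhere in the asker's environment and are not shown in the post. Purely as an assumption about what they might look like, a standard-library sketch could be the following; the glob pattern and the return types are guesses, not the asker's actual code.

```python
import glob
import os
import re


def files(path):
    """Hypothetical helper: sorted full paths of the baby*.html files under path."""
    return sorted(glob.glob(os.path.join(path, "baby*.html")))


def extractyear(filepath):
    """Hypothetical helper: the four-digit year in a name such as baby1990.html."""
    match = re.search(r"baby(\d{4})", os.path.basename(filepath))
    return int(match.group(1)) if match else None


print(extractyear("babynames/baby1990.html"))
```

Returning None for a name without a year keeps the helper from raising on stray files in the directory.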

When I run the function with a given path (the path of the directory where the html files are stored), I get the following error message:

result = babynames(path)

Output:

---------------------------------------------------------------------------


file: C:/Users/ALEX/MyFiles/JUPYTER NOTEBOOKS/google-python-exercises/babynames/baby1990.html
cells[0]: cells[0] : NaN
cells[1]: cells[1] : NaN
cells[2]: cells[2] : NaN
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-72-5c9ebdc4dcdb> in <module>()
----> 1 result = babynames(path)

<ipython-input-71-a0263a6790da> in babynames(path)
54
55 # 7. Append the lists to the empty dataframe df
---> 56 df["Rank"] = Rank
57 df["Baby_1"] = Baby_1
58 df["Baby_2"] = Baby_2

C:\users\alex\Anaconda2\lib\site-packages\pandas\core\frame.pyc in __setitem__(self, key, value)
2355 else:
2356 # set column
-> 2357 self._set_item(key, value)
2358
2359 def _setitem_slice(self, key, value):

C:\users\alex\Anaconda2\lib\site-packages\pandas\core\frame.pyc in _set_item(self, key, value)
2421
2422 self._ensure_valid_index(value)
-> 2423 value = self._sanitize_column(key, value)
2424 NDFrame._set_item(self, key, value)
2425

C:\users\alex\Anaconda2\lib\site-packages\pandas\core\frame.pyc in _sanitize_column(self, key, value)
2576
2577 # turn me into an ndarray
-> 2578 value = _sanitize_index(value, self.index, copy=False)
2579 if not isinstance(value, (np.ndarray, Index)):
2580 if isinstance(value, list) and len(value) > 0:

C:\users\alex\Anaconda2\lib\site-packages\pandas\core\series.pyc in _sanitize_index(data, index, copy)
2768
2769 if len(data) != len(index):
-> 2770 raise ValueError('Length of values does not match length of ' 'index')
2771
2772 if isinstance(data, PeriodIndex):

ValueError: Length of values does not match length of index

cells[0], cells[1] and cells[2] should have values.
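The check at the bottom of the traceback compares the length of the assigned values with the length of the dataframe's index. A minimal sketch with made-up values that reproduces the same kind of failure: the first assignment to an empty dataframe fixes the index length, and a later assignment of a list with a different length fails. Whether this is what happened inside babynames depends on where the column assignments sit relative to the row loop, which the post does not make clear, so treat it as one possible explanation only.

```python
import pandas as pd

df = pd.DataFrame([])

# Assigning a list to an empty dataframe succeeds and sets the index length to 2.
df["Rank"] = ["1", "2"]

# Assigning a list of a different length to the now non-empty dataframe fails.
try:
    df["Rank"] = ["1", "2", "3"]
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)

# Building the dataframe in one step from equal-length lists avoids the issue.
rank = ["1", "2", "3"]
baby_1 = ["a", "b", "c"]
clean = pd.DataFrame({"Rank": rank, "Baby_1": baby_1})
print(clean.shape)
```

The exact wording of the message differs between pandas versions, but the condition that triggers it is the same length comparison shown in the traceback.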

As I mentioned, there is another table before it, identified by the following html code:

<table width="100%" border="0" cellspacing="0" cellpadding="4">

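I also ran a version of the function that did not specify the table, because I had not noticed that there are two tables in the html file. With that version I did not get this type of error. Instead, I got an error message at line 6 stating that the indentation of the try statement was incorrect, which I do not understand, and an error message at line 9, where I try to rearrange the columns of the dataframe, which I could not understand either.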

Your advice will be appreciated.

Best answer

right_table is a ResultSet instance (basically a list of Tag instances representing elements); it does not have findAll() or find_all() methods.
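A minimal sketch of that difference, using a made-up html fragment and the built-in html.parser (both assumptions for illustration): calling find_all() on the ResultSet raises AttributeError, while calling it on one of its elements works.

```python
from bs4 import BeautifulSoup

# Made-up fragment for illustration.
SAMPLE_HTML = """
<table summary="formatting">
  <tr><td>1</td></tr>
  <tr><td>2</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all() returns a ResultSet, a list-like collection of Tag objects.
tables = soup.find_all("table")
print(type(tables).__name__, isinstance(tables, list))

# Calling a Tag method on the ResultSet itself raises AttributeError.
try:
    tables.find_all("tr")
    raised = False
except AttributeError:
    raised = True

# Each element of the ResultSet is a Tag, which does have find_all().
row_count = len(tables[0].find_all("tr"))
print(raised, row_count)
```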

Instead, if you have multiple elements, loop over the elements in right_table:

right_table = soup.find_all("table", summary="formatting")

for table in right_table:
    for row in table.findAll("tr"):
        # ...

Or, if there is only one, use find():

right_table = soup.find("table", summary="formatting")
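A short sketch of how find() behaves, on a made-up one-table fragment: it returns a single Tag, or None when nothing matches, so a None check is worth adding before calling methods on the result. It also illustrates that, in Beautiful Soup, only class_ gets the trailing-underscore treatment; a keyword such as summary_ is taken as a literal attribute name and therefore matches nothing here.

```python
from bs4 import BeautifulSoup

# Made-up fragment for illustration.
SAMPLE_HTML = '<table summary="formatting"><tr><td>1</td></tr></table>'
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns the first matching Tag, so find_all() can be called on it directly.
table = soup.find("table", attrs={"summary": "formatting"})
rows = table.find_all("tr") if table is not None else []

# When nothing matches, find() returns None rather than an empty list.
missing = soup.find("table", attrs={"summary": "no-such-value"})

# The keyword summary_ is looked up as an attribute literally named "summary_".
underscore = soup.find("table", summary_="formatting")

print(len(rows), missing, underscore)
```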

Or, use a single CSS selector:

for row in soup.select("table[summary=formatting] tr"):
    # ...
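Putting the CSS-selector variant to work, the sketch below collects the first three cells of each data row from a made-up fragment. It assumes the header row uses th cells and therefore yields no td elements, which would explain the single "NaN" row printed in the question; the real pages may differ. Rows with fewer than three td cells are skipped explicitly instead of relying on a bare try/except. The function name extract_rows is hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up fragment; assumes the header row uses <th> cells, so it has no <td>.
SAMPLE_HTML = """
<table summary="formatting">
  <tr><th>Rank</th><th>Male name</th><th>Female name</th></tr>
  <tr><td>1</td><td>BoyA</td><td>GirlA</td></tr>
  <tr><td>2</td><td>BoyB</td><td>GirlB</td></tr>
</table>
"""


def extract_rows(html):
    """Return (rank, name_1, name_2) tuples from the table with summary="formatting"."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for row in soup.select('table[summary="formatting"] tr'):
        cells = row.find_all("td")
        if len(cells) < 3:
            continue  # header or malformed row: nothing to extract
        records.append(tuple(cell.get_text(strip=True) for cell in cells[:3]))
    return records


records = extract_rows(SAMPLE_HTML)
print(records)
```

Because every record has exactly three fields, the list can be turned into a dataframe in one step, for example with pd.DataFrame(records, columns=["Rank", "Baby_1", "Baby_2"]), which avoids assigning columns of mismatched length.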

Regarding python - "'ResultSet' object has no attribute 'findAll'" error message in Beautiful Soup inside a function, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/41592627/
