gpt4 book ai didi

python - Pandas DataFrame 字符串中包含的 HTML

转载 作者:行者123 更新时间:2023-12-01 01:16:52 24 4
gpt4 key购买 nike

假设我有一个如下所示的字符串:

stuff = "<table><tr><td>Tuesday, January 15, 2019</td><td>2:44 PM EST</td><td>12</td><td>$530</td></tr><tr><td>Thursday, January 3, 2019</td><td>11:55 PM EST</td><td>11.5</td><td>$821</td></tr><tr><td>Friday, December 7, 2018</td><td>2:49 AM EST</td><td>11</td><td>$800</td></tr><tr><td>Wednesday, November 28, 2018</td><td>11:49 AM EST</td><td>9.5</td><td>$487</td></tr><tr><td>Monday, November 26, 2018</td><td>10:25 AM EST</td><td>11</td><td>$650</td></tr><tr><td>Thursday, November 22, 2018</td><td>5:52 PM EST</td><td>8.5</td><td>$792</td></tr><tr><td>Thursday, November 8, 2018</td><td>3:42 PM EST</td><td>11.5</td><td>$600</td></tr><tr><td>Saturday, September 29, 2018</td><td>9:40 PM EST</td><td>10</td><td>$470</td></tr><tr><td>Tuesday, September 4, 2018</td><td>4:11 PM EST</td><td>9.5</td><td>$649</td></tr><tr><td>Friday, July 13, 2018</td><td>2:07 PM EST</td><td>8</td><td>$650</td></tr><tr><td>Friday, July 6, 2018</td><td>1:21 PM EST</td><td>12</td><td>$495</td></tr><tr><td>Wednesday, June 13, 2018</td><td>5:14 PM EST</td><td>10</td><td>$450</td></tr><tr><td>Monday, June 4, 2018</td><td>4:24 PM EST</td><td>9.5</td><td>$476</td></tr><tr><td>Friday, April 13, 2018</td><td>9:16 AM EST</td><td>10.5</td><td>$650</td></tr><tr><td>Monday, March 5, 2018</td><td>7:23 AM EST</td><td>8.5</td><td>$560</td></tr><tr><td>Thursday, January 11, 2018</td><td>1:40 PM EST</td><td>12</td><td>$800</td></tr><tr><td>Saturday, January 6, 2018</td><td>3:13 PM EST</td><td>9</td><td>$600</td></tr><tr><td>Thursday, December 14, 2017</td><td>1:06 PM EST</td><td>7.5</td><td>$726</td></tr><tr><td>Thursday, November 9, 2017</td><td>6:10 PM EST</td><td>10.5</td><td>$601</td></tr><tr><td>Wednesday, September 20, 2017</td><td>9:40 AM EST</td><td>10.5</td><td>$850</td></tr><tr><td>Friday, July 6, 2018</td><td>1:21 PM EST</td><td>12</td><td>$495</td></tr><tr><td>Wednesday, June 13, 2018</td><td>5:14 PM EST</td><td>10</td><td>$450</td></tr><tr><td>Monday, June 4, 2018</td><td>4:24 PM EST</td><td>9.5</td><td>$476</td></tr><tr><td>Friday, April 13, 2018</td><td>9:16 AM EST</td><td>10.5</td><td>$650</td></tr><tr><td>Monday, March 5, 2018</td><td>7:23 AM EST</td><td>8.5</td><td>$560</td></tr><tr><td>Thursday, January 11, 2018</td><td>1:40 PM EST</td><td>12</td><td>$800</td></tr><tr><td>Saturday, January 6, 2018</td><td>3:13 PM EST</td><td>9</td><td>$600</td></tr><tr><td>Thursday, December 14, 2017</td><td>1:06 PM EST</td><td>7.5</td><td>$726</td></tr><tr><td>Thursday, November 9, 2017</td><td>6:10 PM EST</td><td>10.5</td><td>$601</td></tr><tr><td>Wednesday, September 20, 2017</td><td>9:40 AM EST</td><td>10.5</td><td>$850</td></tr><tr><td>Monday, July 24, 2017</td><td>12:22 PM EST</td><td>10.5</td><td>$600</td></tr><tr><td>Saturday, June 17, 2017</td><td>7:54 AM EST</td><td>11</td><td>$550</td></tr><tr><td>Saturday, June 10, 2017</td><td>7:32 PM EST</td><td>7.5</td><td>$750</td></tr><tr><td>Wednesday, May 24, 2017</td><td>3:10 PM EST</td><td>11</td><td>$741</td></tr><tr><td>Sunday, May 14, 2017</td><td>4:34 AM EST</td><td>10.5</td><td>$750</td></tr><tr><td>Monday, April 17, 2017</td><td>8:45 AM EST</td><td>10.5</td><td>$750</td></tr><tr><td>Saturday, April 1, 2017</td><td>9:44 PM EST</td><td>11</td><td>$750</td></tr><tr><td>Thursday, March 2, 2017</td><td>4:05 PM EST</td><td>11</td><td>$970</td></tr><tr><td>Thursday, February 23, 2017</td><td>3:03 PM EST</td><td>11.5</td><td>$675</td></tr><tr><td>Monday, January 23, 2017</td><td>3:29 PM EST</td><td>11</td><td>$726</td></tr><tr><td>Sunday, January 22, 2017</td><td>6:47 PM EST</td><td>11</td><td>$655</td></tr><tr><td>Friday, December 9, 2016</td><td>2:38 AM EST</td><td>10</td><td>$575</td></tr><tr><td>Thursday, December 8, 2016</td><td>5:23 PM EST</td><td>11.5</td><td>$1,200</td></tr><tr><td>Thursday, December 8, 2016</td><td>8:29 AM EST</td><td>12</td><td>$946</td></tr><tr><td>Saturday, November 26, 2016</td><td>3:09 PM EST</td><td>12</td><td>$1,031</td></tr><tr><td>Wednesday, November 23, 2016</td><td>3:45 PM EST</td><td>7.5</td><td>$650</td></tr><tr><td>Monday, November 21, 2016</td><td>7:23 AM EST</td><td>11</td><td>$1,031</td></tr><tr><td>Friday, November 18, 2016</td><td>5:12 PM EST</td><td>11</td><td>$1,031</td></tr><tr><td>Thursday, November 17, 2016</td><td>9:11 AM EST</td><td>11</td><td>$660</td></tr><tr><td>Tuesday, November 8, 2016</td><td>7:17 AM EST</td><td>6.5</td><td>$777</td></tr><tr><td>Saturday, September 24, 2016</td><td>5:57 PM EST</td><td>8</td><td>$815</td></tr><tr><td>Thursday, August 25, 2016</td><td>3:52 PM EST</td><td>6.5</td><td>$750</td></tr><tr><td>Saturday, August 20, 2016</td><td>2:20 PM EST</td><td>10.5</td><td>$721</td></tr><tr><td>Saturday, August 20, 2016</td><td>1:39 PM EST</td><td>8</td><td>$721</td></tr><tr><td>Thursday, July 21, 2016</td><td>1:21 PM EST</td><td>10.5</td><td>$650</td></tr><tr><td>Wednesday, July 20, 2016</td><td>6:14 AM EST</td><td>7.5</td><td>$777</td></tr><tr><td>Saturday, June 25, 2016</td><td>10:00 AM EST</td><td>9.5</td><td>$950</td></tr><tr><td>Thursday, June 23, 2016</td><td>5:26 PM EST</td><td>10.5</td><td>$580</td></tr><tr><td>Tuesday, June 21, 2016</td><td>1:19 PM EST</td><td>12.5</td><td>$600</td></tr><tr><td>Tuesday, May 31, 2016</td><td>10:06 AM EST</td><td>9.5</td><td>$828</td></tr></table>"

我如何在 Pandas 中使用类似 .read_html() 的东西来理解该字符串?

我通过使用 Selenium 来获得这个:

stuff = html_table.get_attribute('innerHTML')

我必须这样做,因为有一堆 JavaScript 阻止我访问东西。

最佳答案

使用read_html返回 DataFrame 列表,因此需要首先通过索引进行选择:

df = pd.read_html(stuff)[0]
print (df.head())
0 1 2 3
0 Tuesday, January 15, 2019 2:44 PM EST 12.0 $530
1 Thursday, January 3, 2019 11:55 PM EST 11.5 $821
2 Friday, December 7, 2018 2:49 AM EST 11.0 $800
3 Wednesday, November 28, 2018 11:49 AM EST 9.5 $487
4 Monday, November 26, 2018 10:25 AM EST 11.0 $650

然后,如果有必要,可以进行一些数据清理:

df.columns = ['date','time','val1','val2']

df['date'] = pd.to_datetime(df['date'] + '-' + df.pop('time').str[:-4],
format='%A, %B %d, %Y-%I:%M %p')

df['val2'] = df['val2'].replace(['\$',','],'', regex=True).astype(int)
print (df.head())
date val1 val2
0 2019-01-15 14:44:00 12.0 530
1 2019-01-03 23:55:00 11.5 821
2 2018-12-07 02:49:00 11.0 800
3 2018-11-28 11:49:00 9.5 487
4 2018-11-26 10:25:00 11.0 650

关于python - Pandas DataFrame 字符串中包含的 HTML,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/54264490/

24 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com