
python - Working with "unclean" text in Python

Reposted. Author: 太空宇宙. Updated: 2023-11-04 08:03:50

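Forgive my ignorance, I'm new to this. I searched for this and tried several examples, but most of what I found seems to work only in Python 2.7, and I need it to work in Python 3.5. I'm trying to extract just the cities from this list on Wikipedia: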

Cities in Oklahoma

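The tag names vary; otherwise I would try using requests, which would actually be ideal, since our list needs to update whenever Wikipedia is updated. Instead, I copied the data and pasted it into a txt document so that I could build a proof of concept and get the project approved. What I ended up with looks like this: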

1. Oklahoma City 1,012,389

2. Tulsa 609,450

3. Norman 110,925

4. Broken Arrow 98,850

5. Lawton (town) 96,867

6. Edmond 81,405

7. Moore 55,081

8. Midwest City 54,371

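I found a few things and tried several different approaches, thinking that if I found the right way to split the file, I could get all the lines that have content. Then I could split those again and return the item at index 1.

I was trying: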

file = open('cities_oklahoma.txt', 'r')
s = file.readline()

for line in s:
    line_has_txt = line.split() # I have no clue what should be here
    print([line_has_txt.split(' ')[1])

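Am I even close to what I'm trying to do here? Also note that I manipulated line 5 in the example to show some of the variation that can occur in the data. And, as you can see from line 1, some city names actually contain the word "city", which breaks my theory.

Best answer

If you want the list of cities: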

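import requests
from bs4 import BeautifulSoup

r = requests.get("https://en.wikipedia.org/wiki/List_of_towns_and_cities_in_Oklahoma_by_population#Largest_10_cities_by_population")

# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(r.content, "html.parser")

# Print the text of every paragraph inside the main content block
for p in soup.find("div", {"class": "mw-content-ltr"}).find_all("p"):
    print(p.text)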

This gives you all the cities and the heading:

The following list of towns and cities in Oklahoma, shows the incorporated places in the U.S. state of Oklahoma, in order of population according to the 2010 United States Census:[1]


1. Oklahoma City 1,012,389
2. Tulsa 609,450
3. Norman 110,925
4. Broken Arrow 98,850
5. Lawton 96,867
6. Edmond 81,405
7. Moore 55,081
8. Midwest City 54,371
9. Enid 49,379
10. Stillwater 45,688
11. Muskogee 39,223
12. Bartlesville 35,750
13. Shawnee 29,857
14. Owasso 28,915
.......................
359. Greenfield (town) 93
360. Roosevelt (town) 25
361. Cooperton (town) 12
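
Lines in this shape, whether printed by the loop above or pasted into a text file as in the question, can also be split with a regular expression rather than on single spaces. The following is only a sketch under assumptions: the sample text, `LINE_RE` and `parse_cities` are illustrative names that do not appear in the original post, and the pattern assumes every entry is "rank, dot, name, population".

```python
import re

# Sample lines in the same shape as the pasted text file (hypothetical in-memory data).
SAMPLE = """\
1. Oklahoma City 1,012,389

2. Tulsa 609,450

5. Lawton (town) 96,867

8. Midwest City 54,371
"""

# Rank and dot, then the name (which may contain spaces), then a final
# number that may contain thousands separators.
LINE_RE = re.compile(r"^\s*\d+\.\s+(?P<name>.+?)\s+(?P<population>[\d,]+)\s*$")


def parse_cities(lines):
    """Return (name, population) tuples for every line that looks like a list entry."""
    cities = []
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # blank line, heading, or anything else that is not an entry
        population = int(match.group("population").replace(",", ""))
        cities.append((match.group("name"), population))
    return cities


cities = parse_cities(SAMPLE.splitlines())
print(cities)
```

Because an open file yields its lines when iterated, the same function could be fed the question's file with something like `with open('cities_oklahoma.txt') as f: parse_cities(f)`. Names such as "Oklahoma City" survive intact because only the trailing number is split off.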

You can skip the heading and the empty strings. You would have to be more careful about what you filter, but this is the general idea:

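from pprint import pprint as pp

soup = BeautifulSoup(r.content, "html.parser")
ps = soup.find("div", {"class": "mw-content-ltr"}).find_all("p")

# ps[3:] skips the leading paragraphs (heading and empty strings).
# lstrip drops the "12. " rank; rsplit(None, 1) splits the name from the
# population on the last run of whitespace.
city_data = dict(p.text.lstrip("0123456789. ").rsplit(None, 1) for p in ps[3:])

pp(city_data)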

This gives you:

{'Achille town, Bryan County': '492',
'Ada': '16,810',
'Adair (town)': '790',
'Afton (town)': '1,049',
'Agra (town)': '339',
'Alex (town)': '550',
'Allen (town)': '932',
'Altus': '19,813',
'Alva': '4,945',
'Amber town, Grady County': '419',
'Anadarko': '6,762',
'Antlers': '2,453',
'Apache (town)': '1,444',
'Arapaho (town)': '796',
'Ardmore': '24,283',
'Arkoma (town)': '1,989',
'Arnett (town)': '524',
'Asher town, Pottawatomie County': '393',
'Atoka': '3,107',
'Avant (town)': '320',
'Barnsdall': '1,243',
'Bartlesville': '35,750',
'Beaver (town)': '1,515',
'Beggs': '1,321',
'Bernice (town)': '562',
'Bethany': '19,051',
'Bethel Acres (town)': '2,895',
'Billings (town)': '509',
'Binger (town)': '672',
'Bixby': '20,884',
'Blackwell': '7,092',
'Blair (town)': '818',
'Blanchard': '7,670',
'Boise City': '1,266',
'Bokchito (town)': '632',
'Bokoshe (town)': '512',
'Boley (town)': '1,184',
'Boswell (town)': '709',
'Bowlegs town, Seminole County': '405',
'Bray (town)': '1,209',
'Bristow': '4,222',
'Broken Arrow': '98,850',
'Broken Bow': '4,120',
'Buffalo (town)': '1,299',
'Burns Flat (town)': '2,057',
'Butler (town)': '287',
'Byng (town)': '1,175',
'Cache': '2,796',
'Caddo (town)': '997',
'Calera (town)': '2,164',
'Calumet (town)': '507',
'Canton (town)': '625',
'Canute (town)': '541',
'Carmen (town)': '355',
'Carnegie (town)': '1,723',
'Carney (town)': '647',
'Cashion (town)': '802',
'Catoosa': '7,151',
'Cement (town)': '501',
'Central High (town)': '1,199',
'Chandler': '3,100',
'Chattanooga town, Comanche County': '461',
'Checotah': '3,335',
'Chelsea (town)': '1,964',
'Cherokee': '1,498',
'Cheyenne (town)': '801',
'Chickasha': '16,036',
'Choctaw': '11,146',
'Chouteau (town)': '2,097',
'Claremore': '18,581',
'Clayton (town)': '821',
'Cleveland': '3,251',
'Clinton': '9,033',
'Coalgate': '1,967',
'Colbert (town)': '1,140',
'Colcord (town)': '815',
'Cole (town)': '555',
'Collinsville': '5,606',
'Comanche': '1,663',
'Commerce': '2,473',
'Cooperton (town)': '12',
'Copan (town)': '733',
'Corn (town)': '503',
'Covington (town)': '527',
'Coweta': '9,943',
'Coyle town, Logan County': '325',
'Crescent': '1,411',
'Crowder town, Pittsburg County': '430',
'Cushing': '7,826',
'Custer City (town)': '375',
'Cyril (town)': '1,059',
'Davenport (town)': '814',
'Davidson (town)': '315',
'Davis': '2,683',
'Del City': '21,332',
'Delaware town, Nowata County': '417',
'Depew (town)': '476',
'Dewar (town)': '888',
'Dewey': '3,432',
'Dickson (town)': '1,207',
'Dill City (town)': '562',
'Dover town, Kingfisher County': '464',
'Drummond (town)': '455',
'Drumright': '2,907',
'Duncan': '23,431',
'Durant': '15,856',
'Dustin town, Hughes County': '395',
'Earlsboro (town)': '628',
'East Duke (town)': '424',
'Edmond': '81,405',
'El Reno': '16,749',
'Eldorado town, Jackson County': '446',
'Elgin': '2,156',
'Elk City': '11,693',
'Elmore City (town)': '697',
'Empire City (town)': '955',
'Enid': '49,379',
'Erick': '1,052',
'Eufaula': '2,813',
'Fairfax (town)': '1,380',
'Fairland (town)': '1,057',
'Fairview': '2,579',
'Fanshawe (town)': '419',
'Fletcher (town)': '1,177',
'Forest Park (town)': '998',
'Forgan (town)': '547',
'Fort Cobb (town)': '634',
'Fort Coffee town, Le Flore County': '424',
'Fort Gibson (town)': '4,154',
'Fort Supply (town)': '330',
'Fort Towson (town)': '519',
'Francis (town)': '315',
'Frederick': '3,940',
'Gage (town)': '442',
'Garber': '822',
'Geary': '1,280',
'Geronimo (town)': '1,268',
'Glencoe (town)': '601',
'Glenpool': '10,808',
'Goldsby (town)': '1,801',
'Goodwell (town)': '1,293',
'Gore (town)': '977',
'Grandfield': '1,038',
'Granite (town)': '2,065',
'Greenfield (town)': '93',
'Grove': '6,623',
'Guthrie': '10,191',
'Guymon': '11,442',
'Haileyville': '813',
'Hammon (town)': '568',
'Harrah': '5,095',
'Hartshorne': '2,125',
'Haskell (town)': '2,007',
'Haworth (town)': '297',
'Healdton': '2,788',
'Heavener': '3,414',
'Helena (town)': '1,403',
'Hennessey (town)': '2,131',
'Henryetta': '5,927',
'Hinton (town)': '3,196',
'Hobart': '3,756',
'Holdenville': '5,771',
'Hollis': '2,060',
'Hominy': '3,565',
'Hooker': '1,918',
'Howe (town)': '802',
'Hugo': '5,301',
'Hulbert (town)': '590',
'Hydro (town)': '969',
'Idabel': '7,010',
'Indiahoma (town)': '344',
'Inola (town)': '1,788',
'Jay': '2,448',
'Jenks': '16,924',
'Jennings town, Pawnee County': '363',
'Jones (town)': '2,692',
'Kansas (town)': '802',
'Kaw City city, Kay County': '375',
'Kellyville (town)': '1,150',
'Keota (town)': '564',
'Ketchum Town, Craig County': '442',
'Keyes (town)': '324',
'Kiefer (town)': '1,685',
'Kingfisher': '4,633',
'Kingston (town)': '1,601',
'Kiowa (town)': '731',
'Konawa': '1,298',
'Krebs': '2,053',
'Lahoma (town)': '611',
'Lamont town, Grant County': '417',
'Langley (town)': '819',
'Langston (town)': '1,724',
'Laverne (town)': '1,344',
'Lawton': '96,867',
'Lexington': '2,152',
'Lindsay': '2,840',
'Locust Grove (town)': '1,423',
'Lone Grove': '5,054',
'Lone Wolf town, Kiowa County': '438',
'Luther (town)': '1,221',
'Madill': '3,770',
'Mangum': '3,010',
'Mannford (town)': '3,076',
'Mannsville (town)': '863',
'Marietta': '2,626',
'Marlow': '4,662',
'Maud': '1,048',
'Maysville (town)': '1,232',
'McAlester': '18,383',
'McCurtain (town)': '516',
'McLoud (town)': '4,044',
'Medford': '996',
'Medicine Park (town)': '382',
'Meeker (town)': '1,144',
'Miami': '13,570',
'Midwest City': '54,371',
'Mill Creek (town)': '319',
'Millerton (town)': '320',
'Minco': '1,632',
'Moore': '55,081',
'Mooreland (town)': '1,190',
'Morris': '1,479',
'Morrison (town)': '733',
'Mounds (town)': '1,168',
'Mountain Park town, Kiowa County': '409',
'Mountain View (town)': '795',
'Muldrow (town)': '3,466',
'Muskogee': '39,223',
'Mustang': '17,395',
'New Cordell': '2,915',
'Newcastle': '7,685',
'Newkirk': '2,317',
'Nichols Hills': '3,710',
'Nicoma Park': '2,393',
'Ninnekah (town)': '1,002',
'Noble': '6,481',
'Norman': '110,925',
'North Enid (town)': '860',
'North Miami town, Ottawa County': '374',
'Nowata': '3,731',
'Oakland town, Marshall County': '1,057',
'Oaks (town)': '288',
'Ochelata town, Washington County': '424',
'Oilton': '1,013',
'Okarche (town)': '1,215',
'Okay (town)': '620',
'Okeene (town)': '1,204',
'Okemah': '3,223',
'Oklahoma City': '1,012,389',
'Okmulgee': '12,321',
'Oktaha town, Muskogee County': '390',
'Olustee (town)': '607',
'Oologah (town)': '1,146',
'Owasso': '28,915',
'Paden (town)': '461',
'Panama (town)': '1,413',
'Paoli (town)': '610',
'Pauls Valley': '6,187',
'Pawhuska': '3,584',
'Pawnee': '2,196',
'Perkins': '2,831',
'Perry': '5,126',
'Piedmont': '5,720',
'Pink (town)': '2,058',
'Pocola (town)': '4,056',
'Ponca City': '25,387',
'Pond Creek': '856',
'Porter (town)': '566',
'Porum (town)': '727',
'Poteau': '8,520',
'Prague': '2,386',
'Prue town, Osage County': '465',
'Pryor': '9,539',
'Purcell': '5,884',
'Quapaw (town)': '906',
'Quinton (town)': '1,051',
'Ralston (town)': '330',
'Ramona (town)': '535',
'Randlett (town)': '438',
'Ravia (town)': '528',
'Red Oak (town)': '549',
'Ringling (town)': '1,037',
'Ringwood (town)': '497',
'Ripley town, Payne County': '403',
'Rock Island (town)': '646',
'Roff (town)': '725',
'Roland (town)': '3,169',
'Roosevelt (town)': '25',
'Rush Springs (town)': '1,231',
'Ryan (town)': '816',
'Salina (town)': '1,396',
'Sallisaw': '8,880',
'Sand Springs': '18,906',
'Sapulpa': '20,544',
'Savanna (town)': '686',
'Sayre': '4,375',
'Schulter (town)': '509',
'Seiling': '860',
'Seminole': '7,488',
'Sentinel (town)': '901',
'Shady Point (town)': '1,026',
'Shattuck (town)': '1,356',
'Shawnee': '29,857',
'Shidler': '441',
'Skiatook': '7,397',
'Slaughterville (town)': '4,137',
'Snyder': '1,394',
'Soper (town)': '261',
'South Coffeyville (town)': '785',
'Spavinaw (town)': '437',
'Spencer': '3,912',
'Sperry (town)': '1,206',
'Spiro (town)': '2,164',
'Springer (town)': '700',
'Sterling (town)': '793',
'Stigler': '2,685',
'Stillwater': '45,688',
'Stilwell': '3,949',
'Stonewall (town)': '470',
'Stratford (town)': '1,525',
'Stringtown town, Atoka County': '410',
'Stroud': '2,690',
'Sulphur': '4,929',
'Taft (town)': '250',
'Tahlequah': '15,753',
'Talihina (town)': '1,114',
'Taloga (town)': '299',
'Tecumseh': '6,457',
'Temple (town)': '1,002',
'Terral town, Jefferson County': '382',
'Texhoma (town)': '926',
'Thackerville town, Love County': '445',
'The Village': '8,929',
'Thomas': '1,181',
'Tipton (town)': '847',
'Tishomingo': '3,034',
'Tonkawa': '3,216',
'Tryon town, Lincoln County': '491',
'Tulsa': '609,450',
'Tupelo': '329',
'Tushka town, Atoka County': '312',
'Tuttle': '6,019',
'Tyrone (town)': '762',
'Union City (town)': '1,645',
'Valley Brook (town)': '765',
'Valliant (town)': '754',
'Velma (town)': '620',
'Verden (town)': '530',
'Verdigris (town)': '3,993',
'Vian (town)': '1,466',
'Vici (town)': '699',
'Vinita': '5,743',
'Wagoner': '8,323',
'Wakita town, Grant County': '344',
'Walters': '2,551',
'Wanette town, Pottawatomie County': '350',
'Wapanucka town, Johnston County': '438',
'Warner (town)': '1,641',
'Warr Acres': '10,043',
'Washington (town)': '618',
'Watonga': '5,111',
'Waukomis (town)': '1,286',
'Waurika': '2,064',
'Wayne (town)': '688',
'Waynoka': '927',
'Weatherford': '10,833',
'Webbers Falls (town)': '616',
'Welch (town)': '619',
'Weleetka (town)': '998',
'Wellston (town)': '788',
'West Siloam Springs (town)': '846',
'Westville (town)': '1,639',
'Wetumka': '1,282',
'Wewoka': '3,430',
'Wilburton': '2,843',
'Wilson': '1,724',
'Winchester (town)': '516',
'Wister (town)': '1,102',
'Woodward': '12,051',
'Wright City (town)': '762',
'Wyandotte (town)': '333',
'Wynnewood': '2,212',
'Wynona (town)': '437',
'Yale': '1,227',
'Yukon': '22,709'}
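
Many of the keys above still carry a qualifier such as "(town)" or "town, Bryan County". If only the bare place name is wanted, the qualifiers can be stripped afterwards. This is a minimal sketch, not part of the original answer: `clean_name` and the two patterns are hypothetical helpers written against the suffixes visible in this output, and other suffix forms may exist in the live page.

```python
import re

# Suffix patterns seen in the output above: "(town)" and "town, <Name> County".
PAREN_SUFFIX = re.compile(r"\s*\((?:town|city)\)$", re.IGNORECASE)
COUNTY_SUFFIX = re.compile(r"\s+(?:town|city),\s+.*County$", re.IGNORECASE)


def clean_name(raw):
    """Strip the place-type qualifiers so only the bare place name remains."""
    name = PAREN_SUFFIX.sub("", raw.strip())
    name = COUNTY_SUFFIX.sub("", name)
    return name


# A few keys copied from the dictionary above.
sample = {
    "Achille town, Bryan County": "492",
    "Adair (town)": "790",
    "Kaw City city, Kay County": "375",
    "Ketchum Town, Craig County": "442",
    "Oklahoma City": "1,012,389",
}
cleaned = {clean_name(name): population for name, population in sample.items()}
print(cleaned)
```

The county pattern requires a comma directly after "town" or "city", so names like "Oklahoma City" and "Kaw City" keep their final word. One caveat: if two places ended up with the same cleaned name, the later one would overwrite the earlier one in the dictionary.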

If you plan to analyze the data, you may find pandas useful:

import pandas as pd

# A generator of [name, population] pairs, one per paragraph
city_data = (p.text.lstrip("0123456789. ").rsplit(None, 1) for p in ps[3:])

df = pd.DataFrame(city_data, columns=["City", "Population"])

print(df)

Output:

             City  Population
0   Oklahoma City   1,012,389
1           Tulsa     609,450
2          Norman     110,925
3    Broken Arrow      98,850
4          Lawton      96,867
5          Edmond      81,405
6           Moore      55,081
7    Midwest City      54,371
8            Enid      49,379
9      Stillwater      45,688
10       Muskogee      39,223
11   Bartlesville      35,750
12        Shawnee      29,857
13         Owasso      28,915
14     Ponca City      25,387
15        Ardmore      24,283
16         Duncan      23,431
17          Yukon      22,709
18       Del City      21,332
19          Bixby      20,884
20        Sapulpa      20,544
21          Altus      19,813
22        Bethany      19,051
23   Sand Springs      18,906
24      Claremore      18,581
25      McAlester      18,383
26        Mustang      17,395
27          Jenks      16,924
28            Ada      16,810
29        El Reno      16,749
..            ...         ...

You will probably want to convert the Population column to int for any calculations:

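import locale

# Use the user's default locale; atoi only accepts "1,012,389" if that
# locale groups thousands with ","
locale.setlocale(locale.LC_NUMERIC, '')

df["Population"] = df["Population"].apply(locale.atoi)
print(df["Population"])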

0 1012389
1 609450
2 110925
3 98850
4 96867
5 81405
6 55081
7 54371
8 49379
9 45688
10 39223
11 35750
12 29857
..................
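
`locale.atoi` depends on the active locale grouping thousands with a comma, so on a machine whose default locale does not, the conversion may raise a `ValueError`. A locale-independent alternative, shown here only as a sketch with a hypothetical helper name `to_int`, is to remove the commas before calling `int`:

```python
def to_int(text):
    """Convert a string such as '1,012,389' to the integer 1012389."""
    return int(text.replace(",", ""))


# Values copied from the output above.
populations = ["1,012,389", "609,450", "93"]
converted = [to_int(p) for p in populations]
print(converted)
```

With the DataFrame from the answer, the same idea could be applied as `df["Population"].apply(to_int)`.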

Regarding python - Working with "unclean" text in Python, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/35142260/
