
python - Handling exceptions with df.apply

Reposted. Author: 行者123. Updated: 2023-12-01 08:38:06

I am using the tld Python library to extract the first-level domain from proxy request logs via the apply function.

When I hit an odd request that tld does not know how to handle (for example "http:1 CON" or "http:/login.cgi%00"), I get an error message like the following:

TldBadUrl: Is not a valid URL http:1 con!
TldBadUrl Traceback (most recent call last)
in engine
----> 1 new_fld_column = request_2['request'].apply(get_fld)

/usr/local/lib/python2.7/site-packages/pandas/core/series.pyc in apply(self, func, convert_dtype, args, **kwds)
2353 else:
2354 values = self.asobject
-> 2355 mapped = lib.map_infer(values, f, convert=convert_dtype)
2356
2357 if len(mapped) and isinstance(mapped[0], Series):

pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer (pandas/_libs/lib.c:66440)()

/home/cdsw/.local/lib/python2.7/site-packages/tld/utils.pyc in get_fld(url,
fail_silently, fix_protocol, search_public, search_private, **kwargs)
385 fix_protocol=fix_protocol,
386 search_public=search_public,
--> 387 search_private=search_private
388 )
389

/home/cdsw/.local/lib/python2.7/site-packages/tld/utils.pyc in process_url(url, fail_silently, fix_protocol, search_public, search_private)
289 return None, None, parsed_url
290 else:
--> 291 raise TldBadUrl(url=url)
292
293 domain_parts = domain_name.split('.')

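In the meantime, I have been filtering these out with many lines like the ones below, but there are hundreds or thousands of such rows in this dataset: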

request_2 = request_1[request_1['request'] != 'http:1 CON']
request_2 = request_2[request_2['request'] != 'http:/login.cgi%00']
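
When the bad values are already known, the one-line-per-value filtering can be collapsed into a single membership test. The following is a minimal sketch, assuming a hand-maintained list called bad_requests (a hypothetical name) and a small made-up sample in place of the real proxy log:

```python
import pandas as pd

# Hypothetical sample standing in for the proxy request log.
request_1 = pd.DataFrame({
    "request": [
        "https://login.microsoftonline.com",
        "http:1 CON",
        "https://ib.adnxs.com",
        "http:/login.cgi%00",
    ]
})

# Known-bad values collected by hand (illustrative list, not exhaustive).
bad_requests = ["http:1 CON", "http:/login.cgi%00"]

# Keep only the rows whose request is NOT in the bad list, then renumber.
request_2 = request_1[~request_1["request"].isin(bad_requests)].reset_index(drop=True)
print(request_2)
```

This still requires knowing every bad value in advance, which is the limitation the question is about.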

DataFrame:

request
request_url count
0 https://login.microsoftonline.com 24521
1 https://dt.adsafeprotected.com 11521
2 https://googleads.g.doubleclick.net 6252
3 https://fls-na.amazon.com 65225
4 https://v10.vortex-win.data.microsoft.com 7852222
5 https://ib.adnxs.com 12

Code:

from tld import get_tld
from tld import get_fld
from impala.dbapi import connect
from impala.util import as_pandas
import pandas as pd
import numpy as np

request = pd.read_csv('Proxy/Proxy_Analytics/Request_Grouped_By_Request_Count_12032018.csv')
# Remove rows where there were null values in the request column
request = request[pd.notnull(request['request'])]
# Reset index (reset_index returns a new DataFrame, so the result must be assigned)
request = request.reset_index(drop=True)
# Find the URLs that contain IP addresses and exclude them from the new DataFrame
request_1 = request[~request.request.str.findall(r'[0-9]+(?:\.[0-9]+){3}').astype(bool)]
# Reset index
request_1 = request_1.reset_index(drop=True)
# Apply get_fld to the request column (request_2 is request_1 after the manual filtering shown earlier)
new_fld_column = request_2['request'].apply(get_fld)

Is there a way to prevent this error from occurring and instead put the rows that would raise it into a separate DataFrame?

Best answer

If you wrap the function in a try/except clause, you can find the rows that failed by querying for the rows that contain NaN:

import numpy as np
import tld
from tld import get_fld

def try_get_fld(x):
    # Return the first-level domain, or NaN when tld rejects the URL
    try:
        return get_fld(x)
    except tld.exceptions.TldBadUrl:
        return np.nan

print(df)
request_url count
0 https://login.microsoftonline.com 24521
1 https://dt.adsafeprotected.com 11521
2 https://googleads.g.doubleclick.net 6252
3 https://fls-na.amazon.com 65225
4 https://v10.vortex-win.data.microsoft.com 7852222
5 https://ib.adnxs.com 12
6 http:1 CON 10
7 http:/login.cgi%00 200

df['flds'] = df['request_url'].apply(try_get_fld)
print(df['flds'])
0 microsoftonline.com
1 adsafeprotected.com
2 doubleclick.net
3 amazon.com
4 microsoft.com
5 adnxs.com
6 NaN
7 NaN
Name: flds, dtype: object

faulty_url_df = df[df['flds'].isna()]
print(faulty_url_df)

request_url count flds
6 http:1 CON 10 NaN
7 http:/login.cgi%00 200 NaN
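
If the reason for each failure is also of interest, the wrapper can return the error message alongside the value, and the frame can then be split in two. The sketch below is only an illustration under stated assumptions: parse_domain is a hypothetical stand-in for get_fld (it returns the hostname, not the first-level domain, and raises ValueError instead of TldBadUrl) so that the example runs without tld installed, and the column names fld and error are made up for the example.

```python
from urllib.parse import urlsplit

import pandas as pd


def parse_domain(url):
    """Hypothetical stand-in for tld.get_fld: return the hostname or raise ValueError."""
    host = urlsplit(url).hostname
    if not host or "." not in host:
        raise ValueError("Is not a valid URL %s!" % url)
    return host


def try_parse(url):
    """Return a (value, error) pair so the failure reason stays with its row."""
    try:
        return parse_domain(url), None
    except ValueError as exc:
        return None, str(exc)


df = pd.DataFrame({
    "request_url": ["https://ib.adnxs.com", "http:1 CON", "http:/login.cgi%00"],
    "count": [12, 10, 200],
})

# Expand the (value, error) tuples into two columns aligned with df's index.
parsed = pd.DataFrame(
    df["request_url"].apply(try_parse).tolist(),
    index=df.index,
    columns=["fld", "error"],
)
df = df.join(parsed)

# Split into rows that parsed cleanly and rows that raised.
failed = df["error"].notna()
good_df = df[~failed].drop(columns="error")
faulty_url_df = df[failed].drop(columns="fld")

print(good_df)
print(faulty_url_df)
```

With the real library, parse_domain would be replaced by get_fld and ValueError by tld.exceptions.TldBadUrl. The traceback in the question also shows that get_fld accepts a fail_silently argument, and that process_url returns None instead of raising when it is set, so df['request_url'].apply(get_fld, fail_silently=True) may be enough when the error message itself is not needed.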

About "python - Handling exceptions with df.apply": a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/53622946/
