python - Vectorizing a pandas DataFrame with multiple if-else clauses to split domains


Reposted by: 行者123. Last updated: 2023-12-01 01:30:28

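Please help me vectorize or otherwise speed up the pandas DataFrame code below; it is very slow.

I have the code below, and it works exactly the way I want. It takes domains with lots of subdomains and normalizes them to hostname + TLD.

I could not find any vectorization examples that use if-else statements.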

import pandas as pd
import time
#import file into dataframe

start = time.time()
path = "Desktop/dom1.csv"

df = pd.read_csv(path, delimiter=',', header='infer', encoding = "ISO-8859-1")

#strip out all ---- values
df2 = df[((df['domain'] != '----'))]

#extract only 2 columns from dataframe
df3 = df2[['domain', 'web.optimisedsize']]

#define tld and cdn lookup lists
tld = ['co.uk', 'com', 'org', 'gov.uk', 'co', 'net', 'news', 'it', 'in', 'es', 'tw', 'pe', 'io', 'ca', 'cat', 'com.au',
  'com.ar', 'com.mt', 'com.co', 'ws', 'to', 'es', 'de', 'us', 'br', 'im', 'gr', 'cc', 'cn', 'org.uk', 'me', 'ovh', 'be',
  'tv', 'tech', '..', 'life', 'com.mx', 'pl', 'uk', 'ru', 'cz', 'st', 'info', 'mobi', 'today', 'eu', 'fi', 'jp', 'life',
  '1', '2', '3', '4', '5', '6', '7', '8', '9', '0', 'earth', 'ninja', 'ie', 'im', 'ai', 'at', 'ch', 'ly', 'market', 'click',
  'fr', 'nl', 'se']
cdns = ['akamai', 'maxcdn', 'cloudflare']

#iterate through each row of the datafrme and split each domain at the dot
for row in df2.itertuples():
  index = df3.domain.str.split('.').tolist()
  cleandomain = []
  #iterate through each of the split domains
  for x in index:
    #if it isn't a string, then print the value directly in the cleandomain list
    if not isinstance(x, str):
        cleandomain.append(str(x))
    #if it's a string that encapsulates numbers, then it's an IP
    elif str(x)[-1].isnumeric():
        try:
            cleandomain.append(str(x[0])+'.'+str(x[1])+'.*.*')
        except IndexError:
            cleandomain.append(str(x))
    #if its in the CDN list, take a subdomain as well
    elif len(x) > 3 and str(x[len(x)-2]).rstrip() in cdns:
        try:
            cleandomain.append(str(x[len(x)-3])+'.'+str(x[len(x)-2])+'.'+str(x[len(x)-1]))
        except IndexError:
            cleandomain.append(str(x))
    elif len(x) > 3 and str(x[len(x)-3]).rstrip() in cdns:
        try:
            cleandomain.append(str(x[len(x)-4])+'.'+str(x[len(x)-3])+'.'+str(x[len(x)-2])+'.'+ str(x[len(x)-1]))
        except IndexError:
            cleandomain.append(str(x))
    #if its in the TLD list, do this
    elif len(x) > 2 and str(x[len(x)-2]).rstrip()+'.'+ str(x[len(x)-1]).rstrip() in tld:
        try:
            cleandomain.append(str(x[len(x)-3])+'.'+str(x[len(x)-2])+'.'+ str(x[len(x)-1]))
        except IndexError:
            cleandomain.append(str(x))
    elif len(x) > 2 and str(x[len(x)-1]) in tld:
        try:
            cleandomain.append(str(x[len(x)-2])+'.'+ str(x[len(x)-1]))
        except IndexError:
            cleandomain.append(str(x))
    #if its not in the TLD list, do this
    else:
      cleandomain.append(str(x))

#add the column to the dataframe
df3['newdomain2']=cleandomain
se = pd.Series(cleandomain)
df3['newdomain2'] = se.values

#select only the new domain column & usage
df4 = df3[['newdomain2', 'web.optimisedsize']]

#group by
df5 = df4.groupby(['newdomain2'])[['web.optimisedsize']].sum()

#sort
df6 = df5.sort_values(['web.optimisedsize'], ascending=True)
end = time.time()
print(df6)
print(end-start)

My input is this DataFrame:

In [4]: df
Out[4]:
                     Domain      Use
0        graph.facebook.com     4242
1            news.bbc.co.uk    23423
2  news.more.news.bbc.co.uk   234432
3       profile.username.co   235523
4           offers.o2.co.uk   235523
5     subdomain.pyspark.org     2325
6       uds.data.domain.net    23523
7         domain.akamai.net    23532
8           333.333.333.333  3432324

During processing, index splits it into:

[['graph', 'facebook', 'com'], ['news', 'bbc' .....

I then append the new domain to the original dataframe as a new column, and group by the new domain and sum to create the final dataframe.

In [10]: df
Out[10]:
                     Domain      Use         newdomain
0        graph.facebook.com     4242       facebook.com
1            news.bbc.co.uk    23423          bbc.co.uk
2  news.more.news.bbc.co.uk   234432          bbc.co.uk
3       profile.username.co   235523        username.co

Best answer

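One of the problems is that index = df3.domain.str.split('.').tolist() is executed on every iteration of the loop. When I moved this line outside the loop, the computation became roughly 2x faster: 587 ms vs. 1.1 s.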

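I also think your code is wrong. You never use the row variable; you use index instead. When you iterate over index, each element is always a list. Therefore, if not isinstance(x, str) is always True, so only the first branch ever runs. (You can see this in the line_profiler output below.)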
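The point about the first branch can be checked without pandas. The sketch below uses plain str.split as a stand-in for what df3.domain.str.split('.').tolist() returns (a list of lists); the sample domains are taken from the question, and the variable names are only for illustration.

```python
# Plain-Python stand-in for df3.domain.str.split('.').tolist():
# every element of the result is a list of labels, never a str.
domains = ["graph.facebook.com", "news.bbc.co.uk", "333.333.333.333"]
index = [d.split(".") for d in domains]

# The first condition of the if/elif chain, evaluated for each element.
first_branch_taken = [not isinstance(x, str) for x in index]

# What that first branch appends: the string form of the whole list.
appended = [str(x) for x in index]

print(first_branch_taken)  # [True, True, True]
print(appended[0])         # ['graph', 'facebook', 'com']
```

Because the first condition holds for every element, the IP, CDN, and TLD branches are never reached.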

String operations are generally not vectorizable. Even the .str accessor is actually a Python loop under the hood.
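Given that, one option is to accept a Python-level loop but make it a single pass. The function below is a hedged sketch of one reading of the rules the question seems to intend; it is not the answerer's code. The names normalize_domain, TLDS, and CDNS are hypothetical, TLDS is an abridged copy of the question's list, and sets replace lists so that membership tests do not scan every entry.

```python
# Abridged copy of the question's lookup lists (an assumption: extend as needed).
TLDS = frozenset({"co.uk", "gov.uk", "org.uk", "com", "org", "co", "net", "io", "de", "fr"})
CDNS = frozenset({"akamai", "maxcdn", "cloudflare"})


def normalize_domain(domain):
    """Reduce a host name to hostname + TLD (hypothetical helper)."""
    if not isinstance(domain, str):
        return str(domain)  # e.g. NaN for a missing value
    domain = domain.strip()
    parts = domain.split(".")
    n = len(parts)
    if parts[-1].isnumeric():
        # Last label is numeric, so treat it as an IP: keep two octets.
        return ".".join(parts[:2] + ["*", "*"]) if n >= 2 else domain
    if n > 3 and parts[-2] in CDNS:
        return ".".join(parts[-3:])  # CDN name plus one subdomain
    if n > 3 and parts[-3] in CDNS:
        return ".".join(parts[-4:])
    if n > 2 and ".".join(parts[-2:]) in TLDS:
        return ".".join(parts[-3:])  # two-label suffix such as co.uk
    if n > 2 and parts[-1] in TLDS:
        return ".".join(parts[-2:])  # single-label TLD such as com
    return domain


samples = ["graph.facebook.com", "news.more.news.bbc.co.uk", "333.333.333.333"]
cleaned = [normalize_domain(d) for d in samples]
print(cleaned)  # ['facebook.com', 'bbc.co.uk', '333.333.*.*']
```

With pandas, a function like this could be applied with df["Domain"].map(normalize_domain), which still loops in Python but visits each row only once.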

Here is the output of the line_profiler tool in a Jupyter Notebook. Setup (f is a function that wraps the code):

%load_ext line_profiler
%lprun -f f f(df2, df3)

Output:

Total time: 1.82219 s
File: <ipython-input-8-79f01a353d31>
Function: f at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def f(df2,df3):
     2         1       8093.0   8093.0      0.2      index = df3.Domain.str.split('.').tolist()
     3                                               #iterate through each row of the datafrme and split each domain at the dot
     4       901      11775.0     13.1      0.2      for row in df2.itertuples():
     5                                           
     6       900      26241.0     29.2      0.5        cleandomain = []
     7                                                 #iterate through each of the split domains
     8    810900     971082.0      1.2     18.8        for x in index:
     9                                                   #if it isn't a string, then print the value directly in the cleandomain list
    10    810000    1331253.0      1.6     25.8          if not isinstance(x, str):
    11    810000    2819163.0      3.5     54.6              cleandomain.append(str(x))
    12                                                   #if it's a string that encapsulates numbers, then it's an IP
    13                                                   elif str(x)[-1].isnumeric():
    14                                                       try:
    15                                                           cleandomain.append(str(x[0])+'.'+str(x[1])+'.*.*')
    16                                                       except IndexError:
    17                                                           cleandomain.append(str(x))
    18                                                   #if its in the CDN list, take a subdomain as well
    19                                                   elif len(x) > 3 and str(x[len(x)-2]).rstrip() in cdns:
    20                                                       try:
    21                                                           cleandomain.append(str(x[len(x)-3])+'.'+str(x[len(x)-2])+'.'+str(x[len(x)-1]))
    22                                                       except IndexError:
    23                                                           cleandomain.append(str(x))
    24                                                   elif len(x) > 3 and str(x[len(x)-3]).rstrip() in cdns:
    25                                                       try:
    26                                                           cleandomain.append(str(x[len(x)-4])+'.'+str(x[len(x)-3])+'.'+str(x[len(x)-2])+'.'+ str(x[len(x)-1]))
    27                                                       except IndexError:
    28                                                           cleandomain.append(str(x))
    29                                                   #if its in the TLD list, do this
    30                                                   elif len(x) > 2 and str(x[len(x)-2]).rstrip()+'.'+ str(x[len(x)-1]).rstrip() in tld:
    31                                                       try:
    32                                                           cleandomain.append(str(x[len(x)-3])+'.'+str(x[len(x)-2])+'.'+ str(x[len(x)-1]))
    33                                                       except IndexError:
    34                                                           cleandomain.append(str(x))
    35                                                   elif len(x) > 2 and str(x[len(x)-1]) in tld:
    36                                                       try:
    37                                                           cleandomain.append(str(x[len(x)-2])+'.'+ str(x[len(x)-1]))
    38                                                       except IndexError:
    39                                                           cleandomain.append(str(x))
    40                                                   #if its not in the TLD list, do this
    41                                                   else:
    42                                                     cleandomain.append(str(x))

My code:
Data preparation:

from io import StringIO
import pandas as pd
#import file into dataframe
TESTDATA=StringIO("""Domain,Use
      graph.facebook.com,   4242
          news.bbc.co.uk,  23423
news.more.news.bbc.co.uk, 234432
     profile.username.co, 235523
         offers.o2.co.uk, 235523
   subdomain.pyspark.org,   2325
     uds.data.domain.net,  23523
       domain.akamai.net,  23532
         333.333.333.333,3432324
""")
df=pd.read_csv(TESTDATA)
df["Domain"] = df.Domain.str.strip()
df = pd.concat([df]*100)

df2 = df
#extract only 2 columns from dataframe
df3 = df2
#define tld and cdn lookup lists
tld = ['co.uk', 'com', 'org', 'gov.uk', 'co', 'net', 'news', 'it', 'in', 'es', 'tw', 'pe', 'io', 'ca', 'cat', 'com.au',
  'com.ar', 'com.mt', 'com.co', 'ws', 'to', 'es', 'de', 'us', 'br', 'im', 'gr', 'cc', 'cn', 'org.uk', 'me', 'ovh', 'be',
  'tv', 'tech', '..', 'life', 'com.mx', 'pl', 'uk', 'ru', 'cz', 'st', 'info', 'mobi', 'today', 'eu', 'fi', 'jp', 'life',
  '1', '2', '3', '4', '5', '6', '7', '8', '9', '0', 'earth', 'ninja', 'ie', 'im', 'ai', 'at', 'ch', 'ly', 'market', 'click',
  'fr', 'nl', 'se']
cdns = ['akamai', 'maxcdn', 'cloudflare']

Timing in a Jupyter notebook:

%%timeit
index = df3.Domain.str.split('.').tolist()
#iterate through each row of the datafrme and split each domain at the dot
for row in df2.itertuples():

  cleandomain = []
  #iterate through each of the split domains
  for x in index:
    #if it isn't a string, then print the value directly in the cleandomain list
    if not isinstance(x, str):
        cleandomain.append(str(x))
    #if it's a string that encapsulates numbers, then it's an IP
    elif str(x)[-1].isnumeric():
        try:
            cleandomain.append(str(x[0])+'.'+str(x[1])+'.*.*')
        except IndexError:
            cleandomain.append(str(x))
    #if its in the CDN list, take a subdomain as well
    elif len(x) > 3 and str(x[len(x)-2]).rstrip() in cdns:
        try:
            cleandomain.append(str(x[len(x)-3])+'.'+str(x[len(x)-2])+'.'+str(x[len(x)-1]))
        except IndexError:
            cleandomain.append(str(x))
    elif len(x) > 3 and str(x[len(x)-3]).rstrip() in cdns:
        try:
            cleandomain.append(str(x[len(x)-4])+'.'+str(x[len(x)-3])+'.'+str(x[len(x)-2])+'.'+ str(x[len(x)-1]))
        except IndexError:
            cleandomain.append(str(x))
    #if its in the TLD list, do this
    elif len(x) > 2 and str(x[len(x)-2]).rstrip()+'.'+ str(x[len(x)-1]).rstrip() in tld:
        try:
            cleandomain.append(str(x[len(x)-3])+'.'+str(x[len(x)-2])+'.'+ str(x[len(x)-1]))
        except IndexError:
            cleandomain.append(str(x))
    elif len(x) > 2 and str(x[len(x)-1]) in tld:
        try:
            cleandomain.append(str(x[len(x)-2])+'.'+ str(x[len(x)-1]))
        except IndexError:
            cleandomain.append(str(x))
    #if its not in the TLD list, do this
    else:
      cleandomain.append(str(x))
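Since row is never used, the outer loop in the timed code appears to be redundant: it rebuilds the same cleandomain list once per row. A hedged sketch of an alternative is to normalize each distinct domain once, map the results back onto the column, and then group and sum. Here normalize and TWO_LABEL_SUFFIXES are simplified stand-ins introduced for illustration, not the question's full rule set, and the data is a small slice of the question's sample.

```python
import pandas as pd

# Simplified stand-in rules (assumption): keep three labels for a
# two-label suffix such as co.uk, otherwise keep the last two labels.
TWO_LABEL_SUFFIXES = {"co.uk", "gov.uk", "org.uk"}


def normalize(domain):
    parts = domain.split(".")
    keep = 3 if ".".join(parts[-2:]) in TWO_LABEL_SUFFIXES else 2
    return ".".join(parts[-keep:])


df = pd.DataFrame({
    "Domain": ["graph.facebook.com", "news.bbc.co.uk",
               "news.more.news.bbc.co.uk", "offers.o2.co.uk"],
    "Use": [4242, 23423, 234432, 235523],
})

# Normalize each distinct domain once, then map the results back.
lookup = {d: normalize(d) for d in df["Domain"].unique()}
df["newdomain"] = df["Domain"].map(lookup)

# Group by the normalized domain, sum the usage, sort ascending.
totals = df.groupby("newdomain")["Use"].sum().sort_values()
print(totals)
```

How much the lookup on unique values saves depends on how often domains repeat in the real data; with the answer's test frame (the sample concatenated 100 times) every domain repeats 100 times.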

Regarding python - Vectorizing a pandas DataFrame with multiple if-else clauses to split domains, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/52915301/
