python - Extracting information from a large SFrame or DataFrame with regular expressions, without loops


Reposted by: 行者123. Updated: 2023-11-30 22:43:38


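I have the following code, which uses a loop to extract some information and builds a new matrix from it. Because it relies on a loop, the code takes a very long time to finish.

Is there a better way to do this with GraphLab's SFrame or a pandas DataFrame? I appreciate any help!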

# This is the regex pattern
pattern_topic_entry_read = r"\d{15}/discussion_topics/(?P<topic>\d{9})/entries/(?P<entry>\d{9})/read"

# Using the pattern, I filter my records
requests_topic_entry_read = requests[requests['url'].apply(lambda x: regex.match(pattern_topic_entry_read, x) is not None)]

# Then for each record in the final set, 
# I need to extract topic and entry info using match.group
for request in requests_topic_entry_read:
    for match in regex.finditer(pattern_topic_entry_read, request['url']):
        topic, entry = match.group('topic'), match.group('entry')

        # Then, I need to create a new SFrame (or dataframe, or anything suitable) 
        newRow = gl.SFrame({'user_id':[request['user_id']], 
                            'url':[request['url']], 
                            'topic':[topic], 'entry':[entry]})

        # And, append it to my existing SFrame (or dataframe)
        entry_read_matrix = entry_read_matrix.append(newRow)

Some sample data:

user_id | url
1000    | /123456832960900/discussion_topics/770000832912345/read
1001    | /123456832960900/discussion_topics/770000832923456/view?per_page=832945307
1002    | /123456832960900/discussion_topics/770000834562343/entries/832350330/read
1003    | /123456832960900/discussion_topics/770000534344444/entries/832350367/read

I would like to get this:

user_id | topic           | entry
1002    | 770000834562343 | 832350330
1003    | 770000534344444 | 832350367

Best answer

pandas Series provide string functions for this. For example, with your data in df:

import io
import re
import pandas as pd
pattern = re.compile(r'.*/discussion_topics/(?P<topic>\d+)(?:/entries/(?P<entry>\d+))?')
df = pd.read_table(io.StringIO(data), sep=r'\s*\|\s*', index_col='user_id', engine='python')
df.url.str.extract(pattern, expand=True)

yields

                   topic      entry
user_id                            
1000     770000832912345        NaN
1001     770000832923456        NaN
1002     770000834562343  832350330
1003     770000534344444  832350367
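
The output above still contains the rows without an entry (1000 and 1001), whereas the question asks only for the rows that have both a topic and an entry. A minimal sketch of one way to get there follows. It assumes the sample rows from the question are held in a string; the names `data` and `result` and the stricter pattern that requires the `/entries/<id>/read` suffix are illustrative choices, not part of the original answer.

```python
import io

import pandas as pd

# Sample rows copied from the question; `data` is an illustrative name.
data = """user_id | url
1000 | /123456832960900/discussion_topics/770000832912345/read
1001 | /123456832960900/discussion_topics/770000832923456/view?per_page=832945307
1002 | /123456832960900/discussion_topics/770000834562343/entries/832350330/read
1003 | /123456832960900/discussion_topics/770000534344444/entries/832350367/read
"""

# A regex separator needs the Python parsing engine; keep the ids as strings.
df = pd.read_csv(io.StringIO(data), sep=r"\s*\|\s*", engine="python", dtype=str)

# Assumption: only URLs ending in /entries/<id>/read are of interest.
pattern = r"/discussion_topics/(?P<topic>\d+)/entries/(?P<entry>\d+)/read$"

# One vectorized call; rows that do not match get NaN in both columns.
extracted = df["url"].str.extract(pattern, expand=True)

# Attach user_id, drop the non-matching rows, and renumber the index.
result = (
    df[["user_id"]]
    .join(extracted)
    .dropna(subset=["entry"])
    .reset_index(drop=True)
)
print(result)
```

Because the extraction is a single `str.extract` call, no row-by-row `append` is needed; the filtering happens afterwards with `dropna`.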

Regarding python - extracting information from a large SFrame or DataFrame with regular expressions, without loops, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/41716911/


