sql - How to merge network traffic data flow pair rows in R?

Reposted by: 行者123 · Updated: 2023-12-02 03:46:17

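I have a lot of SiLK flow data that I would like to do some data mining on. It looks like the destination IP column of one row matches the source IP column of a row further down. How can I merge each source row with its matching destination row in R? Here is some simplified network traffic data: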

id    sip    dip    notes
1     20     30     20 is talking to 30
2     20     31     20 is talking to 31
3     20     32     20 is talking to 32
4     30     20     30 is responding to 20
5     31     20     31 is responding to 20
6     32     20     32 is responding to 20
7     20     32     20 is talking to 32 again
8     20     30     20 is talking to 30 again
9     32     20     32 is responding to 20 again
10    20     31     20 is talking to 31 again
11    31     20     31 is responding to 20 again
12    30     20     30 is responding to 20 again
13    21     30     21 is talking to 30
14    30     21     30 is responding to 21

I would like to merge the rows so that they look like this:

id_S    sip_S    dip_S    notes_S                      id_D    sip_D    dip_D    notes_D
1       20       30       20 is talking to 30          4       30       20       30 is responding to 20
2       20       31       20 is talking to 31          5       31       20       31 is responding to 20
3       20       32       20 is talking to 32          6       32       20       32 is responding to 20
7       20       32       20 is talking to 32 again    9       32       20       32 is responding to 20 again
8       20       30       20 is talking to 30 again    12      30       20       30 is responding to 20 again
10      20       31       20 is talking to 31 again    11      31       20       31 is responding to 20 again
13      21       30       21 is talking to 30          14      30       21       30 is responding to 21

I have more than a million rows of data. Doing this in SQL Express takes days and a lot of disk space:

WITH flowtest_merged AS(
SELECT
    s.id AS id_S,
    s.sip AS sip_S,
    s.dip AS dip_S,
    s.notes AS notes_S,
    d.id AS id_D,
    d.sip AS sip_D,
    d.dip AS dip_D,
    d.notes AS notes_D,
    ROW_NUMBER() OVER(PARTITION BY s.id ORDER BY d.id) AS RN
FROM
    flowtest AS s INNER JOIN
    flowtest AS d ON
    s.dip = d.sip AND /* The source id is talking to the destination id */
    s.sip = d.dip AND /* The destination id is responding to the source id */
    s.id < d.id AND /* The source id is the initiator of the exchange */
    s.sip < 30 /* shorthand for "I'm selecting the internal ip range here" */
)
SELECT
    id_S,
    sip_S,
    dip_S,
    notes_S,
    id_D,
    sip_D,
    dip_D,
    notes_D
FROM flowtest_merged
WHERE (RN = 1)

The problem is that I don't know how to do the ROW_NUMBER() OVER(PARTITION BY s.id ORDER BY d.id) part in R. So, if I rebuild the sample data frame in R:

> flowtest <- data.frame(
+     "id" = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14),
+     "sip" = c(20, 20, 20, 30, 31, 32, 20, 20, 32, 20, 31, 30, 21, 30),
+     "dip" = c(30, 31, 32, 20, 20, 20, 32, 30, 20, 31, 20, 20, 30, 21),
+     "notes" = c(
+         "20 is talking to 30",
+         "20 is talking to 31",
+         "20 is talking to 32",
+         "30 is responding to 20",
+         "31 is responding to 20",
+         "32 is responding to 20",
+         "20 is talking to 32 again",
+         "20 is talking to 30 again",
+         "32 is responding to 20 again",
+         "20 is talking to 31 again",
+         "31 is responding to 20 again",
+         "30 is responding to 20 again",
+         "21 is talking to 30",
+         "30 is responding to 21"),
+     stringsAsFactors = FALSE)

so that it looks the same as the SQL data:

> flowtest
   id sip dip                        notes
1   1  20  30          20 is talking to 30
2   2  20  31          20 is talking to 31
3   3  20  32          20 is talking to 32
4   4  30  20       30 is responding to 20
5   5  31  20       31 is responding to 20
6   6  32  20       32 is responding to 20
7   7  20  32    20 is talking to 32 again
8   8  20  30    20 is talking to 30 again
9   9  32  20 32 is responding to 20 again
10 10  20  31    20 is talking to 31 again
11 11  31  20 31 is responding to 20 again
12 12  30  20 30 is responding to 20 again
13 13  21  30          21 is talking to 30
14 14  30  21       30 is responding to 21

and then make a feeble attempt at a merge:

> flowtest_merged <- merge(
+     flowtest[,setdiff(colnames(flowtest), "dip")],
+     flowtest[,setdiff(colnames(flowtest), "sip")],
+     by.x = "sip",
+     by.y = "dip",
+     all = FALSE,
+     suffixes = c("_S", "_D"))

the result has many, many rows (and the wrong columns):

> flowtest_merged
   sip id_S                      notes_S id_D                      notes_D
1   20    1          20 is talking to 30    5       31 is responding to 20
2   20    1          20 is talking to 30    6       32 is responding to 20
3   20    1          20 is talking to 30   11 31 is responding to 20 again
4   20    1          20 is talking to 30    4       30 is responding to 20
5   20    1          20 is talking to 30    9 32 is responding to 20 again
6   20    1          20 is talking to 30   12 30 is responding to 20 again
7   20    2          20 is talking to 31    5       31 is responding to 20
8   20    2          20 is talking to 31    6       32 is responding to 20
9   20    2          20 is talking to 31   11 31 is responding to 20 again
10  20    2          20 is talking to 31    4       30 is responding to 20
11  20    2          20 is talking to 31    9 32 is responding to 20 again
12  20    2          20 is talking to 31   12 30 is responding to 20 again
13  20    3          20 is talking to 32    5       31 is responding to 20
14  20    3          20 is talking to 32    6       32 is responding to 20
15  20    3          20 is talking to 32   11 31 is responding to 20 again
16  20    3          20 is talking to 32    4       30 is responding to 20
17  20    3          20 is talking to 32    9 32 is responding to 20 again
18  20    3          20 is talking to 32   12 30 is responding to 20 again
19  20    8    20 is talking to 30 again    5       31 is responding to 20
20  20    8    20 is talking to 30 again    6       32 is responding to 20
21  20    8    20 is talking to 30 again   11 31 is responding to 20 again
22  20    8    20 is talking to 30 again    4       30 is responding to 20
23  20    8    20 is talking to 30 again    9 32 is responding to 20 again
24  20    8    20 is talking to 30 again   12 30 is responding to 20 again
25  20   10    20 is talking to 31 again    5       31 is responding to 20
26  20   10    20 is talking to 31 again    6       32 is responding to 20
27  20   10    20 is talking to 31 again   11 31 is responding to 20 again
28  20   10    20 is talking to 31 again    4       30 is responding to 20
29  20   10    20 is talking to 31 again    9 32 is responding to 20 again
30  20   10    20 is talking to 31 again   12 30 is responding to 20 again
31  20    7    20 is talking to 32 again    5       31 is responding to 20
32  20    7    20 is talking to 32 again    6       32 is responding to 20
33  20    7    20 is talking to 32 again   11 31 is responding to 20 again
34  20    7    20 is talking to 32 again    4       30 is responding to 20
35  20    7    20 is talking to 32 again    9 32 is responding to 20 again
36  20    7    20 is talking to 32 again   12 30 is responding to 20 again
37  21   13          21 is talking to 30   14       30 is responding to 21
38  30    4       30 is responding to 20    1          20 is talking to 30
39  30    4       30 is responding to 20    8    20 is talking to 30 again
40  30    4       30 is responding to 20   13          21 is talking to 30
41  30   14       30 is responding to 21    1          20 is talking to 30
42  30   14       30 is responding to 21    8    20 is talking to 30 again
43  30   14       30 is responding to 21   13          21 is talking to 30
44  30   12 30 is responding to 20 again    1          20 is talking to 30
45  30   12 30 is responding to 20 again    8    20 is talking to 30 again
46  30   12 30 is responding to 20 again   13          21 is talking to 30
47  31    5       31 is responding to 20    2          20 is talking to 31
48  31    5       31 is responding to 20   10    20 is talking to 31 again
49  31   11 31 is responding to 20 again    2          20 is talking to 31
50  31   11 31 is responding to 20 again   10    20 is talking to 31 again
51  32    9 32 is responding to 20 again    3          20 is talking to 32
52  32    9 32 is responding to 20 again    7    20 is talking to 32 again
53  32    6       32 is responding to 20    3          20 is talking to 32
54  32    6       32 is responding to 20    7    20 is talking to 32 again
>

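In other words, I am not merging one row with exactly one other row as I had hoped. How do I merge a source row with its destination row?

Thanks

Dave

Edit: here is the first matching pair from the real data: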

UID|SIP|DIP|PROTOCOL|SPORT|DPORT|PACKETS|BYTES|FLAGS|STIME|DURATION|ETIME|SENSOR|FLOWTYPE|ICMP_TYPE|ICMP_CODE|APPLICATION|INPUT|OUTPUT|TIMEOUT|CONTINUATION|INIT_FLAGS|SESSION_FLAGS|BLACKLIST|WHITELIST|NORMALIZED_DOMAIN|COUNTRY
720109425873|3232248427|3232248333|17|57554|53|1|70|0|2013-01-01 00:00:15.046|0|2013-01-01 00:00:15.046|THERMOPYLAE|6|||0|0|0|0|0|0|0|N|Y|erath.mechesrx.net|NULL
...
720107126014|3232248333|3232248427|17|53|57868|2|238|0|2013-01-01 00:02:15.827|0|2013-01-01 00:02:15.827|THERMOPYLAE|6|||0|0|0|0|0|0|0|N|Y|NULL|NULL

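Best answer

library(data.table)
# Split the dataset into a "talking" part and a "responding" part.
# This will take a few seconds for several million entries.
# flowtest is the data frame built in the question.
a <- data.table(flowtest[grepl("talking", flowtest$notes, fixed = TRUE), ], key = c("sip", "dip"))
b <- data.table(flowtest[grepl("responding", flowtest$notes, fixed = TRUE), ], key = c("dip", "sip"))
# Create a second id that numbers the exchanges within each host pair
a[, id2 := seq_len(.N), by = key(a)]
b[, id2 := seq_len(.N), by = key(b)]

# Swap sip/dip in b so both tables share the same orientation, then merge
setnames(b, c("sip", "dip"), c("dip", "sip"))
merge(a, b, by = c("sip", "dip", "id2"), all = TRUE)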

#    sip dip id2 id.x                   notes.x id.y                      notes.y
# 1:  20  30   1    1       20 is talking to 30    4       30 is responding to 20
# 2:  20  30   2    8 20 is talking to 30 again   12 30 is responding to 20 again
# 3:  20  31   1    2       20 is talking to 31    5       31 is responding to 20
# 4:  20  31   2   10 20 is talking to 31 again   11 31 is responding to 20 again
# 5:  20  32   1    3       20 is talking to 32    6       32 is responding to 20
# 6:  20  32   2    7 20 is talking to 32 again    9 32 is responding to 20 again
# 7:  21  30   1   13       21 is talking to 30   14       30 is responding to 21
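The split above depends on the text of the notes column, which the real SiLK records shown in the question's edit do not have. As a minimal sketch in base R, the question's SQL (self-join on swapped sip/dip, then ROW_NUMBER() ... = 1) could be written roughly as follows. It assumes, as the SQL does, that `sip < 30` stands in for the internal address range; the names `flows`, `pairs`, and `first_reply` are illustrative and not from the original post.

```r
# Sample data: same id/sip/dip values as the question, notes omitted
flows <- data.frame(
  id  = 1:14,
  sip = c(20, 20, 20, 30, 31, 32, 20, 20, 32, 20, 31, 30, 21, 30),
  dip = c(30, 31, 32, 20, 20, 20, 32, 30, 20, 31, 20, 20, 30, 21)
)

# Assumption: sip < 30 marks the internal initiators, as in the SQL
s <- flows[flows$sip < 30, ]

# Self-join: a reply has the source's dip as its sip and the source's sip as its dip
pairs <- merge(s, flows,
               by.x = c("sip", "dip"), by.y = c("dip", "sip"),
               suffixes = c("_S", "_D"))

# The reply must come after the request (s.id < d.id in the SQL)
pairs <- pairs[pairs$id_S < pairs$id_D, ]

# ROW_NUMBER() OVER (PARTITION BY id_S ORDER BY id_D):
# sort by partition and order column, then count within each partition
pairs <- pairs[order(pairs$id_S, pairs$id_D), ]
pairs$rn <- ave(pairs$id_D, pairs$id_S, FUN = seq_along)

# WHERE rn = 1: keep only the first later reply for each source row
first_reply <- pairs[pairs$rn == 1, c("id_S", "sip", "dip", "id_D")]
print(first_reply)
```

On the sample data this pairs ids 1 with 4, 2 with 5, 3 with 6, 7 with 9, 8 with 12, 10 with 11, and 13 with 14, which matches the desired output in the question. Like the SQL, it lets two requests that precede the same reply both match that reply, and the intermediate self-join can grow large on millions of rows, so treat it as an illustration of the logic rather than a tuned solution.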

If it is possible for one partner to talk twice while the other partner never responds, I'm not sure how you would want to handle that.
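One way to at least surface that case is to number requests and responses within each host pair and keep every request in the join, so that a request without a reply ends up with NA on the response side. The sketch below uses base R on a small hypothetical sample (not data from the question), and again assumes that `sip < 30` identifies the initiating side; `seq_no`, `paired`, and `unanswered` are illustrative names.

```r
# Hypothetical sample: 20 talks to 30 twice, but 30 responds only once
flows <- data.frame(
  id  = 1:5,
  sip = c(20, 20, 30, 20, 31),
  dip = c(30, 30, 20, 31, 20)
)

# Assumption: hosts below 30 initiate, everything else responds
req  <- flows[flows$sip < 30, ]
resp <- flows[flows$sip >= 30, ]

# Number each request/response within its (sip, dip) pair, in id order
req$seq_no  <- ave(req$id,  req$sip,  req$dip,  FUN = seq_along)
resp$seq_no <- ave(resp$id, resp$sip, resp$dip, FUN = seq_along)

# Left join on swapped sip/dip plus the sequence number;
# all.x = TRUE keeps requests that never got a reply
paired <- merge(req, resp,
                by.x = c("sip", "dip", "seq_no"),
                by.y = c("dip", "sip", "seq_no"),
                all.x = TRUE, suffixes = c("_S", "_D"))

# Requests with no matching response have id_D == NA
unanswered <- paired[is.na(paired$id_D), ]
print(unanswered)
```

Here the second 20-to-30 request (id 2) is reported as unanswered. Note that pairing by sequence number cannot tell which of several requests a single reply belongs to: if an earlier request goes unanswered, later replies shift onto it, so on real flow data a timestamp or port comparison would probably be needed as well.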

Regarding "sql - How to merge network traffic data flow pair rows in R?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/16875030/
