I have two data.frames with several columns in common (here: date, city, ctry, and (other_)number).
I would now like to merge them on these columns while tolerating some degree of difference:
threshold.numbers <- 3
threshold.date <- 5 # in days
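If the difference between the date entries is > threshold.date (in days), or the difference between the number entries is > threshold.numbers, I don't want the rows to be merged.
If the entry in city is a substring of the other df's entry in the city column, I do want the rows to be merged. [If anyone has a better idea for testing the similarity of the actual city names, I'd be happy to hear it.] Merged rows should keep the first df's entries for date, city and country, but both (other_)number columns and all other columns of either df.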
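On the bracketed question about comparing city names, one possible approach is to combine the substring test with an edit-distance check. The sketch below is only an illustration: city_similar and its max_dist cutoff are hypothetical names and values, not part of the original post, and it relies on base R's grepl and utils::adist.

```r
# Hypothetical helper: two city names count as similar when one contains
# the other, or when their edit distance is at most max_dist.
# max_dist = 2 is an assumed cutoff, not a value from the question.
city_similar <- function(a, b, max_dist = 2) {
  a <- tolower(as.character(a))
  b <- tolower(as.character(b))
  # Substring test in both directions, treating the names as literal text.
  contains <- mapply(
    function(x, y) grepl(x, y, fixed = TRUE) || grepl(y, x, fixed = TRUE),
    a, b
  )
  # Generalized Levenshtein distance for each pair of names.
  dist <- mapply(function(x, y) utils::adist(x, y)[1, 1], a, b)
  unname(contains | dist <= max_dist)
}

city_similar(c("Paris", "London", "Bern", "Copenhagen"),
             c("East-Paris", "near London", "Zurich", "Kopenhagen"))
```

The substring test covers cases such as "East-Paris", while the edit distance catches spelling variants such as "Kopenhagen"; how strict the cutoff should be depends on the data.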
df1 <- data.frame(date = c("2003-08-29", "1999-06-12", "2000-08-29", "1999-02-24", "2001-04-17",
                           "1999-06-30", "1999-03-16", "1999-07-16", "2001-08-29", "2002-07-30"),
                  city = c("Berlin", "Paris", "London", "Rome", "Bern",
                           "Copenhagen", "Warsaw", "Moscow", "Tunis", "Vienna"),
                  ctry = c("Germany", "France", "UK", "Italy", "Switzerland",
                           "Denmark", "Poland", "Russia", "Tunisia", "Austria"),
                  number = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100),
                  col = c("apple", "banana", "pear", "banana", "lemon", "cucumber", "apple", "peach", "cherry", "cherry"))
df2 <- data.frame(date = c("2003-08-29", "1999-06-12", "2000-08-29", "1999-02-24", "2001-04-17", # all identical to df1
                           "1999-06-29", "1999-03-14", "1999-07-17", # all 1-2 days different
                           "2000-01-29", "2002-07-01"), # all very different (> 2 weeks)
                  city = c("Berlin", "East-Paris", "near London", "Rome", # same or slight differences
                           "Zurich", # completely different
                           "Copenhagen", "Warsaw", "Moscow", "Tunis", "Vienna"), # same
                  ctry = c("Germany", "France", "UK", "Italy", "Switzerland", # all the same
                           "Denmark", "Poland", "Russia", "Tunisia", "Austria"),
                  other_number = c(13, 17, 3100, 45, 51, 61, 780, 85, 90, 101), # slightly different to very different
                  other_col = c("yellow", "green", "blue", "red", "purple", "orange", "blue", "red", "black", "beige"))
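I would like to merge the two data.frames and get a df in which rows are merged whenever they meet the conditions above.
In the expected output below, the first column shows the original case number followed by "." if the rows were merged, or by 1 or 2 if the row comes from df1 or df2.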
    date       city        ctry        number col      other_number other_col # comment
1.  2003-08-29 Berlin      Germany     10     apple    13           yellow    # matched on date, city, number
2.  1999-06-12 Paris       France      20     banana   17           green     # matched on date, city similar, number - other_number == threshold.numbers
31  2000-08-29 London      UK          30     pear     <NA>         <NA>      # not matched: number - other_number > threshold.numbers
32  2000-08-29 near London UK          <NA>   <NA>     3100         blue      #
41  1999-02-24 Rome        Italy       40     banana   <NA>         <NA>      # not matched: number - other_number > threshold.numbers
42  1999-02-24 Rome        Italy       <NA>   <NA>     45           red       #
51  2001-04-17 Bern        Switzerland 50     lemon    <NA>         <NA>      # not matched: cities different (dates okay, numbers okay)
52  2001-04-17 Zurich      Switzerland <NA>   <NA>     51           purple    #
6.  1999-06-30 Copenhagen  Denmark     60     cucumber 61           orange    # matched: date difference < threshold.date (cities okay, numbers okay)
71  1999-03-16 Warsaw      Poland      70     apple    <NA>         <NA>      # not matched: number - other_number > threshold.numbers (dates okay)
72  1999-03-14 Warsaw      Poland      <NA>   <NA>     780          blue      #
81  1999-07-16 Moscow      Russia      80     peach    <NA>         <NA>      # not matched: number - other_number > threshold.numbers (dates okay)
82  1999-07-17 Moscow      Russia      <NA>   <NA>     85           red       #
91  2001-08-29 Tunis       Tunisia     90     cherry   <NA>         <NA>      # not matched: date difference > threshold.date (cities okay, numbers okay)
92  2000-01-29 Tunis       Tunisia     <NA>   <NA>     90           black     #
101 2002-07-30 Vienna      Austria     100    cherry   <NA>         <NA>      # not matched: date difference > threshold.date (cities okay, numbers okay)
102 2002-07-01 Vienna      Austria     <NA>   <NA>     101          beige     #
In pseudocode, the logic is:
if there is a case where abs("date_df2" - "date_df1") <= threshold.date:
  if "ctry_df2" == "ctry_df1":
    if "city_df2" ~ "city_df1":
      if abs("number_df2" - "number_df1") <= threshold.numbers:
        merge and go to next row in df2
else:
  add row to df1
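The pseudocode above can be written as a single R predicate. This is a minimal sketch under stated assumptions: is_match is a hypothetical helper name, and the "~" comparison is read as "one city name contains the other", which is one possible interpretation of the question.

```r
# Thresholds as defined in the question.
threshold.numbers <- 3
threshold.date <- 5  # in days

# Hypothetical helper: TRUE when a df1 row and a df2 row meet every condition.
is_match <- function(date1, city1, ctry1, number1,
                     date2, city2, ctry2, number2) {
  # Elapsed days between the two dates.
  date_ok <- abs(as.numeric(as.Date(date2) - as.Date(date1))) <= threshold.date
  ctry_ok <- ctry1 == ctry2
  # Assumption: "~" means one name is a literal substring of the other.
  city_ok <- grepl(city1, city2, fixed = TRUE) || grepl(city2, city1, fixed = TRUE)
  number_ok <- abs(number2 - number1) <= threshold.numbers
  date_ok && ctry_ok && city_ok && number_ok
}

# Paris / East-Paris from the example data: all conditions hold.
is_match("1999-06-12", "Paris", "France", 20,
         "1999-06-12", "East-Paris", "France", 17)
```

The helper compares one pair of rows at a time; applying it to whole data frames still requires generating the candidate pairs first.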
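Best answer
I first convert the city names to character vectors, since (if I understand correctly) you want to match city names from df1 that are contained in the city names of df2.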
df1$city<-as.character(df1$city)
df2$city<-as.character(df2$city)
Then merge the two data frames by country:
df <- merge(df1, df2, by = "ctry")
> df
ctry date.x city.x number col date.y city.y other_number other_col
1 Austria 2002-07-30 Vienna 100 cherry 2002-07-01 Vienna 101 beige
2 Denmark 1999-06-30 Copenhagen 60 cucumber 1999-06-29 Copenhagen 61 orange
3 France 1999-06-12 Paris 20 banana 1999-06-12 East-Paris 17 green
4 Germany 2003-08-29 Berlin 10 apple 2003-08-29 Berlin 13 yellow
5 Italy 1999-02-24 Rome 40 banana 1999-02-24 Rome 45 red
6 Poland 1999-03-16 Warsaw 70 apple 1999-03-14 Warsaw 780 blue
7 Russia 1999-07-16 Moscow 80 peach 1999-07-17 Moscow 85 red
8 Switzerland 2001-04-17 Bern 50 lemon 2001-04-17 Zurich 51 purple
9 Tunisia 2001-08-29 Tunis 90 cherry 2000-01-29 Tunis 90 black
10 UK 2000-08-29 London 30 pear 2000-08-29 near London 3100 blue
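In the example every country occurs once per data frame, so merging on ctry gives one candidate pair per country. When a key value occurs several times, merge returns every combination of the rows that share it, and each combination can then be tested against the thresholds. A small sketch with made-up rows (the data frames a and b are hypothetical, not from the post):

```r
# Hypothetical data: two French cities in each data frame.
a <- data.frame(ctry = c("France", "France"),
                city = c("Paris", "Lyon"),
                stringsAsFactors = FALSE)
b <- data.frame(ctry = c("France", "France"),
                city = c("East-Paris", "Nice"),
                stringsAsFactors = FALSE)

# merge() pairs every row of a with every row of b sharing the same ctry.
pairs <- merge(a, b, by = "ctry")
nrow(pairs)   # 4 candidate pairs
names(pairs)  # "ctry" "city.x" "city.y"
```

With many rows per country the number of candidate pairs grows quickly, so a more selective key may be needed on large data.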
The stringr package lets you check whether city.x is contained in city.y (see the last column):
library(stringr)
df$city_keep<-str_detect(df$city.y,df$city.x) # this returns logical vector if city.x is contained in city.y (works one way)
> df
ctry date.x city.x number col date.y city.y other_number other_col city_keep
1 Austria 2002-07-30 Vienna 100 cherry 2002-07-01 Vienna 101 beige TRUE
2 Denmark 1999-06-30 Copenhagen 60 cucumber 1999-06-29 Copenhagen 61 orange TRUE
3 France 1999-06-12 Paris 20 banana 1999-06-12 East-Paris 17 green TRUE
4 Germany 2003-08-29 Berlin 10 apple 2003-08-29 Berlin 13 yellow TRUE
5 Italy 1999-02-24 Rome 40 banana 1999-02-24 Rome 45 red TRUE
6 Poland 1999-03-16 Warsaw 70 apple 1999-03-14 Warsaw 780 blue TRUE
7 Russia 1999-07-16 Moscow 80 peach 1999-07-17 Moscow 85 red TRUE
8 Switzerland 2001-04-17 Bern 50 lemon 2001-04-17 Zurich 51 purple FALSE
9 Tunisia 2001-08-29 Tunis 90 cherry 2000-01-29 Tunis 90 black TRUE
10 UK 2000-08-29 London 30 pear 2000-08-29 near London 3100 blue TRUE
Then compute the difference in days between the dates and the difference between the numbers. The dates are compared as Date objects so that the year is taken into account:
df$dayDiff <- abs(as.numeric(as.Date(df$date.x) - as.Date(df$date.y)))
df$numDiff <- abs(df$number - df$other_number)
> df
ctry date.x city.x number col date.y city.y other_number other_col city_keep dayDiff numDiff
1 Austria 2002-07-30 Vienna 100 cherry 2002-07-01 Vienna 101 beige TRUE 29 1
2 Denmark 1999-06-30 Copenhagen 60 cucumber 1999-06-29 Copenhagen 61 orange TRUE 1 1
3 France 1999-06-12 Paris 20 banana 1999-06-12 East-Paris 17 green TRUE 0 3
4 Germany 2003-08-29 Berlin 10 apple 2003-08-29 Berlin 13 yellow TRUE 0 3
5 Italy 1999-02-24 Rome 40 banana 1999-02-24 Rome 45 red TRUE 0 5
6 Poland 1999-03-16 Warsaw 70 apple 1999-03-14 Warsaw 780 blue TRUE 2 710
7 Russia 1999-07-16 Moscow 80 peach 1999-07-17 Moscow 85 red TRUE 1 5
8 Switzerland 2001-04-17 Bern 50 lemon 2001-04-17 Zurich 51 purple FALSE 0 1
9 Tunisia 2001-08-29 Tunis 90 cherry 2000-01-29 Tunis 90 black TRUE 578 0
10 UK 2000-08-29 London 30 pear 2000-08-29 near London 3100 blue TRUE 0 3070
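A note on the day difference: comparing day-of-year values (the yday field of POSIXlt) ignores the year, whereas subtracting Date objects gives the elapsed days. The sketch below contrasts the two on the Tunis dates from the example and on a hypothetical pair of dates four years apart (d3 and d4 are made up for illustration).

```r
# Tunis dates from the example data.
d1 <- as.Date("2001-08-29")
d2 <- as.Date("2000-01-29")

yday_diff <- abs(as.POSIXlt(d1)$yday - as.POSIXlt(d2)$yday)
true_diff <- abs(as.numeric(d1 - d2))
yday_diff  # 212: day-of-year positions only
true_diff  # 578: elapsed days, including the year change

# Hypothetical pair four years apart: close in day-of-year, far apart in time.
d3 <- as.Date("1999-06-12")
d4 <- as.Date("2003-06-13")
abs(as.POSIXlt(d3)$yday - as.POSIXlt(d4)$yday)  # 1, which would pass a 5-day threshold
abs(as.numeric(d3 - d4)) > 5                    # TRUE: the dates are years apart
```

For the example data both measures lead to the same filtered rows, but with dates spread over several years the day-of-year version could accept pairs that are far apart.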
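Now keep only the rows that satisfy the thresholds defined in the question and whose city names match:
df <- df[df$dayDiff <= threshold.date & df$numDiff <= threshold.numbers & df$city_keep, ]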
> df
ctry date.x city.x number col date.y city.y other_number other_col city_keep dayDiff numDiff
2 Denmark 1999-06-30 Copenhagen 60 cucumber 1999-06-29 Copenhagen 61 orange TRUE 1 1
3 France 1999-06-12 Paris 20 banana 1999-06-12 East-Paris 17 green TRUE 0 3
4 Germany 2003-08-29 Berlin 10 apple 2003-08-29 Berlin 13 yellow TRUE 0 3
Finally, drop the helper columns and the df2 copies of date and city:
df <- subset(df, select = -c(city.y, date.y, city_keep, dayDiff, numDiff))
> df
ctry date.x city.x number col other_number other_col
2 Denmark 1999-06-30 Copenhagen 60 cucumber 61 orange
3 France 1999-06-12 Paris 20 banana 17 green
4 Germany 2003-08-29 Berlin 10 apple 13 yellow
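The result above contains only the merged rows, whereas the question also asks to retain the unmatched rows of both data frames together with an indicator of their origin. One way this might be done is sketched below on three rows of the example data; the id1, id2 and source columns are hypothetical additions, and the city test is the one-directional substring check used in this answer.

```r
threshold.numbers <- 3
threshold.date <- 5  # in days

# Three rows from the example data.
df1 <- data.frame(date = c("2003-08-29", "2000-08-29", "2001-04-17"),
                  city = c("Berlin", "London", "Bern"),
                  ctry = c("Germany", "UK", "Switzerland"),
                  number = c(10, 30, 50),
                  col = c("apple", "pear", "lemon"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(date = c("2003-08-29", "2000-08-29", "2001-04-17"),
                  city = c("Berlin", "near London", "Zurich"),
                  ctry = c("Germany", "UK", "Switzerland"),
                  other_number = c(13, 3100, 51),
                  other_col = c("yellow", "blue", "purple"),
                  stringsAsFactors = FALSE)

# Hypothetical row ids, used to find rows that never got a partner.
df1$id1 <- seq_len(nrow(df1))
df2$id2 <- seq_len(nrow(df2))

# Candidate pairs by country, then the three threshold checks.
cand <- merge(df1, df2, by = "ctry")
city_ok <- mapply(function(x, y) grepl(x, y, fixed = TRUE), cand$city.x, cand$city.y)
keep <- abs(as.numeric(as.Date(cand$date.x) - as.Date(cand$date.y))) <= threshold.date &
  abs(cand$number - cand$other_number) <= threshold.numbers &
  city_ok
matched <- cand[keep, ]

# Merged rows keep df1's date and city plus both number and col columns.
out_matched <- data.frame(
  date = matched$date.x, city = matched$city.x, ctry = matched$ctry,
  number = matched$number, col = matched$col,
  other_number = matched$other_number, other_col = matched$other_col,
  source = rep("merged", nrow(matched)), stringsAsFactors = FALSE)

# Unmatched df1 rows: the df2 columns are filled with NA.
only1 <- df1[!df1$id1 %in% matched$id1, ]
out1 <- data.frame(
  date = only1$date, city = only1$city, ctry = only1$ctry,
  number = only1$number, col = only1$col,
  other_number = rep(NA_real_, nrow(only1)),
  other_col = rep(NA_character_, nrow(only1)),
  source = rep("df1", nrow(only1)), stringsAsFactors = FALSE)

# Unmatched df2 rows: the df1 columns are filled with NA.
only2 <- df2[!df2$id2 %in% matched$id2, ]
out2 <- data.frame(
  date = only2$date, city = only2$city, ctry = only2$ctry,
  number = rep(NA_real_, nrow(only2)),
  col = rep(NA_character_, nrow(only2)),
  other_number = only2$other_number, other_col = only2$other_col,
  source = rep("df2", nrow(only2)), stringsAsFactors = FALSE)

result <- rbind(out_matched, out1, out2)
result
```

If one df1 row satisfies the conditions for several df2 rows, this sketch keeps every such pair; picking a single best partner would need an extra rule, which the question does not specify.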
Regarding "r - Merge data frames based on multiple columns and thresholds", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/58715919/