I am building on an answer given to a previous question about fuzzy matching with stringdist.
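I have two large datasets (~30k rows each) of long strings (consumer product names) that I want to fuzzy match by generating distance scores. Some overlap is expected between the two lists of product names, but some products are unique to each list.
The problem: my computer struggles to expand a grid with this much data, and R keeps crashing. I have an idea that might help optimize this, but I can't get it to work.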
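Most of my strings can be subset by brand name (e.g. pantene, neutrogena, etc.). Rather than computing distances between all combinations of strings, I would like to grep for the brand name, subset the data, and then compute the distances within each subset.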
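To see why partitioning by brand should help, a rough back-of-the-envelope count of the string pairs is useful. The sketch below assumes 30,000 titles per list (as stated above) and uses made-up per-brand counts purely for illustration; titles that match no brand are ignored here.

```r
# Sizes taken from the question: roughly 30,000 titles in each list.
n_tar <- 30000
n_wg  <- 30000
full_pairs <- n_tar * n_wg   # rows expand.grid() would create for the full grid

# Hypothetical per-brand counts, invented for illustration only.
brand_counts <- data.frame(
  brand = c("pantene", "neutrogena", "revlon"),
  tar   = c(4000, 6000, 2000),
  wg    = c(5000, 3000, 1000)
)
# With partitioning, pairs are only formed within a brand.
partitioned_pairs <- sum(brand_counts$tar * brand_counts$wg)

# Approximate size of a single numeric distance column (8 bytes per double), in GB.
gb <- function(pairs) pairs * 8 / 1e9

cat("full grid:  ", full_pairs, "pairs, about", gb(full_pairs), "GB\n")
cat("partitioned:", partitioned_pairs, "pairs, about", gb(partitioned_pairs), "GB\n")
```

Under these assumed counts the distance column alone shrinks from several gigabytes to a few hundred megabytes, before counting the two name columns that expand.grid() also stores.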
First, I use the same function as in the previous post.
# Function by @C8H10N4O2
greedyAssign <- function(a,b,d){
  x <- numeric(length(a)) # assign variable: 0 for unassigned but assignable,
  # 1 for already assigned, -1 for unassigned and unassignable
  while(any(x==0)){
    min_d <- min(d[x==0]) # identify closest pair, arbitrarily selecting 1st if multiple pairs
    a_sel <- a[d==min_d & x==0][1]
    b_sel <- b[d==min_d & a == a_sel & x==0][1]
    x[a==a_sel & b == b_sel] <- 1
    x[x==0 & (a==a_sel|b==b_sel)] <- -1
  }
  cbind(a=a[x==1],b=b[x==1],d=d[x==1])
}
Next, I define a vector of brand names, brand_filter.
brand_filter<-c("pantene","neutrogena","maybelline", "revlon", "colour prevails", "nyx professional makeup",
"covergirl", "no7", "milani", "japonesque", "rimmel", "thebalm",
"physicians formula", "e.l.f.", "almay", "soap \\& glory", "l'oreal paris"
)
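I then use a for loop with grep to filter both datasets (tar and wg) before expanding the grid, computing the string distances, and applying the assignment function.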
dput(brand_filter[1:15])
library(stringdist)
all_matches <- NULL # initialise the accumulator so the first rbind() has something to bind to
for (i in seq_along(brand_filter)) {
  # keep only the titles that mention the current brand
  d1<-tar$product.title_r[grep(brand_filter[i], tar$product.title_r)]
  d2<-wg$product.title_r[grep(brand_filter[i], wg$product.title_r)]
  d <- expand.grid(d1,d2) # Distance matrix in long form
  names(d) <- c("a_name","b_name")
  d$dist <- stringdist(d$a_name,d$b_name, method="jw") # String edit distance (use your favorite function here)
  match<-data.frame(greedyAssign(as.character(d$a_name),as.character(d$b_name),d$dist))
  all_matches<-rbind(all_matches,match)
}
tar$product.title_r<-
c("neutrogena oil free moisture ultra gentle facial moisturizer",
"pantene pro v overnight miracle repair serum", "able life space saver walker regal rose",
"neutrogena acne stress control triple action toner", "e.l.f. studio eyebrow kit,light",
"neutrogena mineral sheers loose powder foundation,natural beige",
"neutrogena naturals multivitamin nourishing moisturizer lotion",
"neutrogena healthy skin liquid makeup spf 20,classic ivory",
"able life bedside sturdy rail black", "neutrogena norwegian formula moisture wrap body lotion",
"e.l.f. shape & stay wax pencil clear", "pantene pro v damage detox rebuilding conditioner",
"pantene pro v 3 minute miracle curl perfection daily conditioner",
"l'oreal paris visible lift radiance booster", "neutrogena triple moisture professional deep recovery hair mask",
"neutrogena bb cream, spf 30,light/medium", "neutrogena transparent facial bar, original formula fragrance free",
"pantene pro v daily moisture renewal hydrating conditioner",
"pantene pro v repair and protect 3 minute miracle deep conditioner",
"pantene pro v medium thick hair solutions shampoo", "pantene daily moisture renewal foaming conditioner",
"neutrogena hydro boost hydrating cleansing gel", "neutrogena rapid clear foaming scrub acne treatment",
"pantene radiant color shine foaming conditioner", "pantene pro v ice shine luminous conditioner",
"neutrogena acne stress control power clear scrub", "pantene pro v truly relaxed hair moisturizing shampoo",
"e.l.f. studio angled contour brush", "pantene pro v radiant color shine 2 in 1 shampoo & conditioner",
"neutrogena healthy skin pressed powder compact,fair 10", "paul mitchell foaming pomade",
"e.l.f. mad for matte eyeshadow palette,nude mood", "e.l.f. skincare starter kit",
"pantene pro v smooth and sleek 3 minute miracle deep conditioner",
"depend real fit incontinence underwear for men, maximum absorbency, small/medium, gray gray",
"e.l.f. pointed powder brush", "neutrogena nourishing longwear makeup, spf 20,buff",
"pantene pro v 2in1 medium thick hair solutions shampoo & conditioner",
"e.l.f. studio makeup remover cleansing cloths", "pantene pro v truly relaxed hair moisturizing shampoo",
"pantene pro v color hair solutions shampoo", "neutrogena makeup remover cleansing towelettes fragrance free",
"paul mitchell firm style freeze and shine super spray", "depend real fit incontinence briefs for men, maximum absorbency large/extra large gray & blue",
"neutrogena t gel therapeutic shampoo original formula", "pantene pro v gold series repairing mask treatment",
"neutrogena hydro boost water gel spf15", "pantene pro v full & strong flexible conditioner",
"neutrogena healthy skin liquid makeup honey", "lierac liftissime re lifting eye serum",
"e.l.f. eyebrow stencil kit clear", "neutrogena skinclearing blemish concealer,buff 09",
"pantene pro v beautiful lengths shampoo", "pantene pro v classic clean conditioner",
"e.l.f. hd mattifying balm,clear", "pantene pro v volume root lifting spray hair gel",
"neutrogena ageless intensives anti wrinkle deep wrinkle daily moisturizer spf 20",
"pantene pro v classic clean 2 in 1 shampoo & conditioner", "pantene pro v daily moisture renewal moisturizing combing cream",
"pantene pro v 3 minute miracle radiant color deep conditioner",
"e.l.f. makeup remover pen", "pantene pro v fine hair solutions 2 in 1 shampoo & conditioner",
"neutrogena t gel therapeutic shampoo", "e.l.f. beautifully bare luminous matte makeup primer",
"e.l.f. beautifully bare eyeshadow,nude linen", "pantene pro v volume texturizing non aerosol hairspray",
"pantene pro v breakage defense conditioner", "ardell fashion lashes",
"neutrogena oil free eye makeup remover", "pantene pro v gold series moisture boost shampoo",
"neutrogena oil free acne wash pink grapefruit facial cleanser",
"e.l.f. studio cream eyeliner,black", "neutrogena makeup remover cleansing towelettes",
"neutrogena all in 1 acne control daily scrub", "neutrogena ultra gentle hydrating cleanser, creamy formula",
"neutrogena oil free acne wash cream cleanser pink grapefruit",
"e.l.f. aqua beauty primer mist clear", "e.l.f. mineral infused face primer",
"pantene pro v gold series deep hydrating co wash", "e.l.f. studio angled foundation brush",
"neutrogena oil free cleansing wipes pink grapefruit", "e.l.f. perfect finish hd powder,clear",
"neutrogena norwegian formula hand cream scented", "able life ez door knob grips",
"neutrogena body clear body scrub, salicylic acid acne treatment",
"depend real fit for men briefs maximum absorbency small/medium gray",
"depend real fit incontinence underwear for men, maximum absorbency, large/xlarge",
"neutrogena oil free facial moisturizer lotion spf 35", "special k cereal fruit & yogurt",
"teatrical stem cell facial moisturizer", "e.l.f. beautifully bare sheer tint finishing powder,light/medium",
"neutrogena triple moisture professional daily deep conditioner",
"e.l.f. studio kabuki face brush", "l'oreal paris visible lift radiance cheek duo,201 romantic in rose",
"e.l.f. studio blush,candid coral", "pantene pro v repair & protect shampoo",
"neutrogena 3 in 1 concealer for eyes,light", "neutrogena healthy skin compact makeup spf 55,classic ivory",
"able life auto assist grab bar", "e.l.f. beautifully bare foundation serum spf 25,fair/light"
)
wg$product.title_r<-c("pantene curl perfection leave in conditioning spray 8.5 fl oz",
"neutrogena shine control powder", "neutrogena ultra sheer body mist sunscreen broad spectrum spf 100+ 5oz",
"neutrogena healthy lengths mascara", "pantene pro v gold series repairing mask 7.6oz",
"neutrogena healthy skin blends", "pantene shaping hair gel 6.8 fl oz",
"pantene pro v repair and protect dream care conditioner 23.7 fl oz",
"pantene pro v air spray extra hold alcohol free hairspray 7oz",
"pantene pro v sheer volume dream care conditioner 23.7 fl oz",
"pantene pro v curl perfection moisturizing shampoo 20.1oz",
"neutrogena hydro boost hydrating lip shine 0.12oz", "pantene 3 minute miracle sheer volume deep conditioner 8oz",
"pantene curl perfection controlling curl crme 7.6oz", "neutrogena moisture shine lip soothers spf 20",
"neutrogena t/sal therapeutic shampoo scalp build up control 4.5oz",
"pantene pro v color preserve volume conditioner 17.7oz", "neutrogena healthy skin glow sheers light shades 1.1 fl oz",
"pantene pro v sheer volume shampoo", "pantene curl perfection conditioner 12.6 fl oz",
"pantene pro v classic clean dream care conditioner 23.7 fl oz",
"pantene pro v beautiful lengths dream care conditioner 23.7 fl oz",
"pantene pro v smooth and sleek dream care shampoo 25 fl oz",
"pantene sheer volume foam conditioner 6oz", "pantene pro v ultimate 10 bb shampoo",
"pantene pro v radiant color shine dream care shampoo 25 fl oz",
"pantene pro v curl enhancing spray gel 5.7 fl oz", "pantene extra strong hold level 4 hold hairspray 11oz",
"neutrogena ultra sheer broad spectrum sunscreen body mist spf 30 5oz",
"pantene pro v classic clean dream care shampoo 25 fl oz",
"neutrogena healthy volume mascara", "neutrogena hydro boost hydrating concealer",
"neutrogena men skin clearing acne wash 3 pk", "pantene pro v daily moisture renewal dream care shampoo 25 fl oz",
"pantene daily moisture renewal shampoo", "pantene daily moisture renewal hair shampoo travel size 3.38 fl oz",
"neutrogena anti residue gentle clarifying shampoo 6 fl oz",
"pantene radiant colour shine foam conditioner 6oz", "neutrogena hydro boost plumping mascara",
"pantene pro v curl perfection moisturizing conditioner 17.7oz",
"neutrogena healthy skin liquid makeup fair shades 1 fl oz",
"pantene pro v radiant color shine dream care conditioner 23.7 fl oz",
"pantene pro v micellar shampoo 17.9 fl oz", "neutrogena wet skin sunscreen spray broad spectrum spf 50 5 fl oz",
"pantene pro v beautiful lengths shampoo", "pantene pro v ultimate 10 bb conditioner",
"neutrogena mineral sheers compact powder", "neutrogena healthy skin anti aging perfector",
"pantene pro v micellar shampoo 10.1 fl oz", "neutrogena build a tan lotion 6.7oz"
)
Best answer
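I may be missing something, but greedyAssign seems more complicated than necessary. For example, even with base R's fuzzy matching (the adist function), you can get much more vectorized code.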
fuzzy.matcher = function(a,b) {
  dists <- adist(a,b) # calculate the distance matrix
  simi <- -dists # convert it to a similarity matrix
  bestbyindex <- max.col(simi) # column of the best match in each row
  matches <- cbind( a, b[bestbyindex], apply(simi,1,max) )
  return(matches)
}
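# pmatch() only matches when the whole title is a prefix of a brand name, so it
# returns NA for these long titles. Instead, record the first brand pattern found in each title.
brand.index = function(titles) {
  idx = rep(NA_integer_, length(titles))
  for (i in seq_along(brand_filter)) {
    idx[is.na(idx) & grepl(brand_filter[i], titles)] = i
  }
  idx
}
brand.index.tar = brand.index(tar$product.title_r)
brand.index.wg = brand.index(wg$product.title_r)
split.tar = split(tar$product.title_r, brand.index.tar) # Separate the titles into one character vector per brand.
split.wg = split(wg$product.title_r, brand.index.wg)
common = intersect(names(split.tar), names(split.wg)) # brands present in both datasets
mapply(fuzzy.matcher, split.tar[common], split.wg[common], SIMPLIFY = FALSE)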
The matching itself is done by fuzzy.matcher(), inside the loop that mapply() runs over the brands.
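A quick way to sanity-check the per-brand approach is to run it end to end on a handful of titles. The sketch below uses base R only; the short title vectors, the two-element brand list, and the helper names assign_brand and best_match are hypothetical and not part of the original answer.

```r
# Made-up inputs, for illustration only.
brands <- c("pantene", "neutrogena")
tar_titles <- c("pantene pro v classic clean conditioner",
                "neutrogena t gel therapeutic shampoo",
                "pantene pro v beautiful lengths shampoo")
wg_titles  <- c("neutrogena t/sal therapeutic shampoo 4.5oz",
                "pantene pro v beautiful lengths shampoo",
                "pantene pro v classic clean dream care conditioner")

# Hypothetical helper: name of the first brand found in each title, NA if none.
assign_brand <- function(titles, brands) {
  out <- rep(NA_character_, length(titles))
  for (b in brands) {
    out[is.na(out) & grepl(b, titles, fixed = TRUE)] <- b
  }
  out
}

# Hypothetical helper: for each element of a, the closest element of b
# by generalized Levenshtein distance (base R adist()).
best_match <- function(a, b) {
  d <- adist(a, b)                # length(a) x length(b) distance matrix
  j <- apply(d, 1, which.min)     # index of the first minimum in each row
  data.frame(a = a, b = b[j], dist = d[cbind(seq_along(a), j)],
             stringsAsFactors = FALSE)
}

# split() drops titles whose brand is NA, so unbranded titles are left out here.
tar_split <- split(tar_titles, assign_brand(tar_titles, brands))
wg_split  <- split(wg_titles,  assign_brand(wg_titles,  brands))
common    <- intersect(names(tar_split), names(wg_split))  # brands in both lists

result <- do.call(rbind, Map(best_match, tar_split[common], wg_split[common]))
print(result)
```

Unlike greedyAssign, this picks the best candidate for each title independently, so two titles in one list can end up matched to the same title in the other. Whether that is acceptable depends on whether a one-to-one assignment is required.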
This Q&A (r - partitioning data on a variable to speed up a "fuzzy match" with stringdist) comes from Stack Overflow: https://stackoverflow.com/questions/51757806/