r - (空间)在一个点的 X 米内找到所有点的有效方法？-6ren

r - (空间)在一个点的 X 米内找到所有点的有效方法？

转载作者：行者123 更新时间：2023-12-04 03:56:15

我有一个大型空间数据集(12M 行)。几何图形是 map 上的点。对于数据集中的每一行，我想找到该点 500 米内的所有点。

在 r 中，使用 sf，我一直试图通过并行循环遍历每一行并运行 st_buffer 和 st_intersects，然后将结果保存为键值格式的列表(键是原点，值是邻居)。

问题是数据集太大。即使并行到 60 个以上的内核，操作时间也太长(> 1 周，通常会崩溃)。

这种蛮力方法的替代方案是什么？是否可以使用 sf 构建索引？也许将操作推送到外部数据库？

代表:

library(sf)
library(tidyverse)
library(parallel)
library(foreach)


# example data, convert to decimal:
nc <- st_read(system.file("shape/nc.shp", package="sf")) %>% st_transform(32618)
# expand the data a a bit to make the example more interesting:
nc <- rbind(nc,nc,nc)
nc <- nc %>% mutate(Id = row_number())


## can run in parallel if desired:
# num_cores <- parallel::detectCores()-2
# cl <- makeSOCKcluster(num_cores)
# registerDoSNOW(cl)

# or just run in sequence:
registerDoSEQ()

neighbors <- foreach(ii = 1:nrow(nc)
                      , .verbose = FALSE
                      , .errorhandling = "pass") %dopar% {

                        l = 500 # 500 meters

                        # isolate the row as the origin point:
                        row_interest <- filter(nc, row_number()==ii)

                        # create the buffer:
                        buffer <- row_interest %>% st_buffer(dist = l)

                        # extract the row numbers of the neighbors
                        comps_idx <- suppressMessages(st_intersects(buffer, nc))[[1]]

                        # get all the neighbors:
                        comps <- nc %>% filter(row_number() %in% comps_idx)

                        # remove the geometry:
                        comps <- comps %>% st_set_geometry(NULL)

                        # flow control in case there are no neibors:
                        if(nrow(comps)>0) {
                          comps$Origin_Key <- row_interest$Id
                        } else {
                          comps <- data_frame("lat" = NA_integer_,"lon" = NA_integer_, "bbl" = row_interest$bbl)
                          comps$Origin_Key <- row_interest$Id
                        }


                        return(comps)
                      }

closeAllConnections()

length(neighbors)==nrow(nc)
[1] TRUE

最佳答案

与 sf 一起工作时对象，显式循环要执行的功能
诸如相交之类的二元运算通常会适得其反(另请参见
How can I speed up spatial operations in `dplyr::mutate()`? )

一种类似于您的方法(即缓冲和相交)，但没有
显式 for循环效果更好。

让我们看看它在 50000 点的相当大的数据集上的表现如何:

library(sf)
library(spdep)
library(sf)

pts <- data.frame(x = runif(50000, 0, 100000),
                  y = runif(50000, 0, 100000))
pts     <- sf::st_as_sf(pts, coords = c("x", "y"), remove = F)
pts_buf <- sf::st_buffer(pts, 5000)
coords  <- sf::st_coordinates(pts)

microbenchmark::microbenchmark(
  sf_int = {int <- sf::st_intersects(pts_buf, pts)},
  spdep  = {x   <- spdep::dnearneigh(coords, 0, 5000)}
  , times = 1)
#> Unit: seconds
#>    expr       min        lq      mean    median        uq       max neval
#>  sf_int  21.56186  21.56186  21.56186  21.56186  21.56186  21.56186     1
#>   spdep 108.89683 108.89683 108.89683 108.89683 108.89683 108.89683     1

您可以在这里看到 st_intersects方法比速度快 5 倍 dnearneigh一。

不幸的是，这不太可能解决您的问题。看执行
我们得到不同大小的数据集的时间:

subs <- c(1000, 3000, 5000, 10000, 15000, 30000, 50000)
times <- NULL
for (sub in subs[1:7]) {
  pts_sub <- pts[1:sub,]
  buf_sub <- pts_buf[1:sub,]
  t0 <- Sys.time()
  int <- sf::st_intersects(buf_sub, pts_sub)
  times <- cbind(times, as.numeric(difftime(Sys.time() , t0, units = "secs")))
}

plot(subs, times)

times <- as.numeric(times)
reg <- lm(times~subs+I(subs^2))
summary(reg)
#> 
#> Call:
#> lm(formula = times ~ subs + I(subs^2))
#> 
#> Residuals:
#>        1        2        3        4        5        6        7 
#> -0.16680 -0.02686  0.03808  0.21431  0.10824 -0.23193  0.06496 
#> 
#> Coefficients:
#>               Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  2.429e-01  1.371e-01   1.772    0.151    
#> subs        -2.388e-05  1.717e-05  -1.391    0.237    
#> I(subs^2)    8.986e-09  3.317e-10  27.087  1.1e-05 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.1908 on 4 degrees of freedom
#> Multiple R-squared:  0.9996, Adjusted R-squared:  0.9994 
#> F-statistic:  5110 on 2 and 4 DF,  p-value: 1.531e-07

在这里，我们看到时间和时间之间几乎完美的二次关系
点数(正如预期的那样)。在一个 10M 点的子集上，假设
行为不会改变，你会得到:

predict(reg, newdata = data.frame(subs = 10E6))
#>        1 
#> 898355.4

，这相当于大约 10 天，假设趋势是恒定的
当进一步增加点数时(但同样会发生在 dnearneigh ...)

我的建议是将您的点“拆分”成块，然后进行处理
每个拆分的基础。

例如，您可以在开始时订购您的积分
x 轴，然后使用 data.table 轻松快速地提取缓冲区和点的子集，以便与它们进行比较。

显然，“点”缓冲区需要大于“缓冲区”，根据
到比较距离。因此，例如，如果您制作 pts_buf 的子集和
[50000 - 55000]中的质心， pts的对应子集应该包括
[49500 - 55500] 范围内的点。
这种方法很容易通过将不同的子集分配给 foreach 中的不同内核或类似的构造。

我什至不知道在这里使用空间对象/操作是否有益，因为一旦我们有了坐标，所有需要的是计算和设置欧氏距离的子集:我怀疑仔细编码的蛮力 data.table基于的方法也可能是一个可行的解决方案。

哼!

更新

最后，我决定试一试，看看我们可以从这种方法中获得多少速度。这是一个可能的实现:

points_in_distance_parallel <- function(in_pts,
                                        maxdist,
                                        ncuts = 10) {

  require(doParallel)
  require(foreach)
  require(data.table)
  require(sf)
  # convert points to data.table and create a unique identifier
  pts <-  data.table(in_pts)
  pts <- pts[, or_id := 1:dim(in_pts)[1]]

  # divide the extent in quadrants in ncuts*ncuts quadrants and assign each
  # point to a quadrant, then create the index over "xcut"
  range_x  <- range(pts$x)
  limits_x <-(range_x[1] + (0:ncuts)*(range_x[2] - range_x[1])/ncuts)
  range_y  <- range(pts$y)
  limits_y <- range_y[1] + (0:ncuts)*(range_y[2] - range_y[1])/ncuts
  pts[, `:=`(xcut =  as.integer(cut(x, ncuts, labels = 1:ncuts)),
             ycut = as.integer(cut(y, ncuts, labels = 1:ncuts)))] %>%
    setkey(xcut, ycut)

  results <- list()

  cl <- parallel::makeCluster(parallel::detectCores() - 2, type =
                                ifelse(.Platform$OS.type != "windows", "FORK",
                                       "PSOCK"))
  doParallel::registerDoParallel(cl)
  # start cycling over quadrants
  out <- foreach(cutx = seq_len(ncuts)), .packages = c("sf", "data.table")) %dopar% {

    count <- 0

    # get the points included in a x-slice extended by `dist`, and build
    # an index over y
    min_x_comp    <- ifelse(cutx == 1, limits_x[cutx], (limits_x[cutx] - maxdist))
    max_x_comp    <- ifelse(cutx == ncuts,
                            limits_x[cutx + 1],
                            (limits_x[cutx + 1] + maxdist))
    subpts_x <- pts[x >= min_x_comp & x < max_x_comp] %>%
      setkey(y)

    for (cuty in seq_len(pts$ycut)) {

      count <- count + 1

      # subset over subpts_x to find the final set of points needed for the
      # comparisons
      min_y_comp  <- ifelse(cuty == 1,
                            limits_y[cuty],
                            (limits_y[cuty] - maxdist))
      max_y_comp  <- ifelse(cuty == ncuts,
                            limits_y[cuty + 1],
                            (limits_y[cuty + 1] + maxdist))
      subpts_comp <- subpts_x[y >= min_y_comp & y < max_y_comp]

      # subset over subpts_comp to get the points included in a x/y chunk,
      # which "neighbours" we want to find. Then buffer them.
      subpts_buf <- subpts_comp[ycut == cuty & xcut == cutx] %>%
        sf::st_as_sf() %>%
        st_buffer(maxdist)

      # retransform to sf since data.tables lost the geometric attrributes
      subpts_comp <- sf::st_as_sf(subpts_comp)

      # compute the intersection and save results in a element of "results".
      # For each point, save its "or_id" and the "or_ids" of the points within "dist"

      inters <- sf::st_intersects(subpts_buf, subpts_comp)

      # save results
      results[[count]] <- data.table(
        id = subpts_buf$or_id,
        int_ids = lapply(inters, FUN = function(x) subpts_comp$or_id[x]))

    }
    return(data.table::rbindlist(results))
  }
parallel::stopCluster(cl)
data.table::rbindlist(out)
}

该函数将 作为输入积分sf对象 , 目标距离和一个数字
“削减”用于在象限中划分范围，并在输出中提供
一个数据框，其中对于每个原始点，其中的点的“id” maxdist被举报 在 int_ids列表栏 .

在具有不同数量的均匀分布点的测试数据集上，
和 maxdist 的两个值我得到了这些结果(“并行”运行是使用 6 个内核完成的):

所以，我们得到 速度提升 5-6 倍 已经在“串行”实现上，和 另一个 5X 多亏了 6 个内核的并行化。
尽管此处显示的时间仅是指示性的，并且与
我们构建的特定测试数据集(在分布较不均匀的数据集上，我希望速度提升较低)我认为这非常好。

哼!

PS:更彻底的分析可以在这里找到:

https://lbusettspatialr.blogspot.it/2018/02/speeding-up-spatial-analyses-by.html

关于r - (空间)在一个点的 X 米内找到所有点的有效方法？，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/48650274/

文章推荐： visual-studio-code - 用于启动配置的 vscode 键盘快捷键

文章推荐： webpack - extract-text-webpack-plugin 不创建 CSS 文件

文章推荐： c++11 - move 容器的迭代器？

Ruby 方法() 方法
我想了解 Ruby 方法 methods() 是如何工作的。我尝试使用“ruby 方法”在 Google 上搜索，但这不是我需要的。我也看过 ruby-doc.org，但我没有找到这种方法。
VBS教程：方法-Test 方法
Test 方法对指定的字符串执行一个正则表达式搜索，并返回一个 Boolean 值指示是否找到匹配的模式。 object.Test(string) 参数 object 必选项。总是一个
VBS教程：方法-Replace 方法
Replace 方法替换在正则表达式查找中找到的文本。 object.Replace(string1, string2) 参数 object 必选项。总是一个 RegExp 对象的名称。
VBS教程：方法-Raise 方法
Raise 方法生成运行时错误 object.Raise(number, source, description, helpfile, helpcontext) 参数 object 应为
VBS教程：方法-Execute 方法
Execute 方法对指定的字符串执行正则表达式搜索。 object.Execute(string) 参数 object 必选项。总是一个 RegExp 对象的名称。 string
VBS教程：方法-Clear 方法
Clear 方法清除 Err 对象的所有属性设置。 object.Clear object 应为 Err 对象的名称。说明在错误处理后，使用 Clear 显式地清除 Err 对象。此
VBS教程：方法-CopyFile 方法
CopyFile 方法将一个或多个文件从某位置复制到另一位置。 object.CopyFile source, destination[, overwrite] 参数 object 必选
VBS教程：方法-Copy 方法
Copy 方法将指定的文件或文件夹从某位置复制到另一位置。 object.Copy destination[, overwrite] 参数 object 必选项。应为 File 或 F
VBS教程：方法-Close 方法
Close 方法关闭打开的 TextStream 文件。 object.Close object 应为 TextStream 对象的名称。说明下面例子举例说明如何使用 Close 方
VBS教程：方法-BuildPath 方法
BuildPath 方法向现有路径后添加名称。 object.BuildPath(path, name) 参数 object 必选项。应为 FileSystemObject 对象的名称
VBS教程：方法-GetFolder 方法
GetFolder 方法返回与指定的路径中某文件夹相应的 Folder 对象。 object.GetFolder(folderspec) 参数 object 必选项。应为 FileSy
VBS教程：方法-GetFileName 方法
GetFileName 方法返回指定路径（不是指定驱动器路径部分）的最后一个文件或文件夹。 object.GetFileName(pathspec) 参数 object 必选项。应为
VBS教程：方法-GetFile 方法
GetFile 方法返回与指定路径中某文件相应的 File 对象。 object.GetFile(filespec) 参数 object 必选项。应为 FileSystemObject
VBS教程：方法-GetExtensionName 方法
GetExtensionName 方法返回字符串，该字符串包含路径最后一个组成部分的扩展名。 object.GetExtensionName(path) 参数 object 必选项。应
VBS教程：方法-GetDriveName 方法
GetDriveName 方法返回包含指定路径中驱动器名的字符串。 object.GetDriveName(path) 参数 object 必选项。应为 FileSystemObjec
VBS教程：方法-GetDrive 方法
GetDrive 方法返回与指定的路径中驱动器相对应的 Drive 对象。 object.GetDrive drivespec 参数 object 必选项。应为 FileSystemO
VBS教程：方法-GetBaseName 方法
GetBaseName 方法返回字符串，其中包含文件的基本名 (不带扩展名), 或者提供的路径说明中的文件夹。 object.GetBaseName(path) 参数 object 必
VBS教程：方法-GetAbsolutePathName 方法
GetAbsolutePathName 方法从提供的指定路径中返回完整且含义明确的路径。 object.GetAbsolutePathName(pathspec) 参数 object
VBS教程：方法-FolderExists 方法
FolderExists 方法如果指定的文件夹存在，则返回 True；否则返回 False。 object.FolderExists(folderspec) 参数 object 必选项
VBS教程：方法-FileExists 方法
FileExists 方法如果指定的文件存在返回 True；否则返回 False。 object.FileExists(filespec) 参数 object 必选项。应为 FileS

行者123

个人简介

我是一名优秀的程序员,十分优秀！

作者热门文章

滴滴打车优惠券免费领取

全站热门文章

首页

博学

6Ren·AI

商城

r - (空间)在一个点的 X 米内找到所有点的有效方法？