r - Efficiently ordering randomly generated transects in R

Reposted by: 行者123 Updated: 2023-12-03 13:32:43

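Question

I am looking for a way to efficiently order randomly selected sampling transects that occur around a fixed object. Once generated, these transects need to be ordered in a way that makes sense spatially, so that the distance travelled is minimized. This would be done by ensuring that the end point of the current transect is as close as possible to the start point of the next transect. Also, none of the transects can be repeated.

Because there are thousands of transects to order, and this is a very tedious task to do manually, I am trying to use R to automate the process. I have already generated the transects, each with a start and end point whose location is given on a 360-degree system (e.g., 0 is north, 90 is east, 180 is south, 270 is west). I have also written some code that seems to indicate the start point and ID of the next transect, but it has a few problems: (1) it can produce errors depending on the start and end points being considered, (2) it does not achieve what I ultimately need it to, and (3) as is, the code itself seems overly complicated, and I can't help but wonder whether there is a more straightforward way to do this.

Ideally, the code would reorder the transects to match the order in which they should be flown, rather than the order in which they were initially entered.

The data

For simplicity, let's pretend there are just 10 transects to order.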

# Transect ID for the start point
StID <- c(seq(1, 10, 1))

# Location of transect start point, based on a 360-degree circle
StPt <- c(342.1, 189.3, 116.5, 67.9, 72, 208.4, 173.2, 97.8, 168.7, 138.2)

# Transect ID for the end point
EndID <- c(seq(1, 10, 1))

# Location of transect end point, based on a 360-degree circle
EndPt <- c(122.3, 313.9, 198.7, 160.4, 166, 26.7, 312.7, 273.7, 288.8, 287.5)

# Dataframe
df <- cbind.data.frame(StPt, StID, EndPt, EndID)

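What I have tried

Please feel free to ignore this code; there has to be a better way, and it does not really achieve the desired result. Right now I am using a nested loop, which is hard to follow intuitively but represents my best attempt so far.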

# Create two new columns that will be populated using a loop
df$StPt_Next <- NA
df$ID_Next <- NA

# Also create a list to be populated as end and start points are matched
used <- c(df$StPt[1]) #puts the start point of transect #1 into the used vector since we will start with 1 and do not want to have it used again

# Then, for every row in the dataframe...
for (i in seq(1,length(df$EndPt)-1, 1)){ # Selects all rows except the last one as the last transect should have no "next" transect
  # generate some print statements to indicate that the script is indeed running while you wait....
  print(paste("######## ENDPOINT", i, ":", df$EndPt[i], " ########"))
  print(paste("searching for a start point that fits criteria to follow this endpoint",sep=""))
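  # sequentially select each end point
  # (note: column 1 of df is StPt, not EndPt; the output shown below was produced with this line as written)
  valueEndPt <- df[i,1]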
  # and order the index by taking the absolute difference of end and start points and, if this value is greater than 180, also subtract from 360 so all differences are less than 180, then order differences from smallest to largest  
  orderx <- order(ifelse(360-abs(df$StPt-valueEndPt) > 180, 
                         abs(df$StPt-valueEndPt),
                         360-abs(df$StPt-valueEndPt)))
  tmp <- as.data.frame(orderx)
  # specify index value
  index=1
  # for as long as there is an "NA" present in the StPt_Next created before for loop...  
  while (is.na(df$StPt_Next[i])) {
    #select the value of the ordered index in sequential order     
    j=orderx[index]
    # if the start point associated with a given index is present in the list of used values...
    if (df$StPt[j] %in% used){
      # then have R print a statement indicate this is the case...
      print(paste("passing ",df$StPt[j], " as it has already been used",sep=""))
      # and move onto the next index
      index=index+1
      # break statement intended to skip the remainder of the code for values that have already been used      
      next
      # if the start point associated with a given index is not present in the list of used values...      
    } else {
      # then identify the start point value associated with that index ID... 
      valueStPt <- df$StPt[j]
      # and have R print a statement indicating an attempt is being made to use the next value      
      print(paste("trying ",df$StPt[j],sep=""))
      # if the end transect number is different from the start end transect number...
      if (df$EndID[i] != df$StID[j]) { 
        # then put the start point in the new column...
        df$StPt_Next[i] <- df$StPt[j]
        # note which record this start point came from for ease of reference/troubleshooting...
        df$ID_Next[i] <- j
        # have R print a statement that indicates a value for the new column has beed selected...        
        print(paste("using ",df$StPt[j],sep=""))
        # and add that start point to the list of used ones
        used <- c(used,df$StPt[j])
        # otherwise, if the end transect number matches the start end transect number...
      } else {
        # keep NA in this column and try again
        df$StPt_Next[i] <- NA
        # and indicate that this particular matched pair can not be used
        print(paste("cant use ",valueStPt," as the column EndID (related to index in EndPt) and StID (related to index in StPt) values are matching",sep=""))
      }# end if else statement to ensure that start and end points come from different transects
      # and move onto the next index
      index=index+1
    }# end if else statement to determine if a given start point still needs to be used
  }# end while loop to identify if there are still NA's in the new column
}# end for loop

Output

When the code does not produce an explicit error, such as for the example data provided, the output is as follows:

    StPt StID EndPt EndID StPt_Next ID_Next
1  342.1    1 122.3     1      67.9       4
2  189.3    2 313.9     2     173.2       7
3  116.5    3 198.7     3      97.8       8
4   67.9    4 160.4     4      72.0       5
5   72.0    5 166.0     5     116.5       3
6  208.4    6  26.7     6     189.3       2
7  173.2    7 312.7     7     168.7       9
8   97.8    8 273.7     8     138.2      10
9  168.7    9 288.8     9     208.4       6
10 138.2   10 287.5    10        NA      NA

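The last two columns were generated by the code and added to the original data frame. StPt_Next holds the location of the next closest start point, and ID_Next indicates the transect ID associated with that next start point. The ID_Next column indicates that the transects should be flown in the order 1, 4, 5, 3, 8, 10, NA (i.e., the end), while 2, 7, 9, 6 form their own loop that returns to 2.

There are two specific problems that I can't solve:

(1) There is a problem with forming one continuous chain. I think this is related to 1 being the starting transect and 10 being the last transect no matter what, but I don't know how to indicate in the code that the second-to-last transect must be matched with 10, so that the sequence includes all 10 transects before terminating at the "NA" that represents the final end point.

(2) To truly automate this process, once the early termination of the sequence caused by the premature introduction of "NA" as ID_Next is fixed, a new column would be created that allows the transects to be reordered by the most efficient progression rather than by the original order of their EndID/StID.

Expected result

If we pretend that we only had 6 transects to order, and ignore the 4 that could not be ordered because of the premature introduction of "NA", this would be the expected result: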

    StPt StID EndPt EndID StPt_Next ID_Next TransNum
1  342.1    1 122.3     1      67.9       4        1
4   67.9    4 160.4     4      72.0       5        2
5   72.0    5 166.0     5     116.5       3        3
3  116.5    3 198.7     3      97.8       8        4
8   97.8    8 273.7     8     138.2      10        5
10 138.2   10 287.5    10        NA      NA        6

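Edit: a note about the error message the code explicitly produces

As indicated previously, the code has a few flaws. Another flaw is that it often produces an error when trying to order a large number of transects. I'm not entirely sure at which step the error is generated, but I'm guessing it has to do with being unable to match up the last few transects, possibly because they do not meet the criteria set by "orderx". The print statements say "trying NA" instead of a start point from the data, which leads to this error: "Error in if (df$EndID[i] != df$StID[j]) { : missing value where TRUE/FALSE needed". I'm guessing there needs to be another if-else statement that somehow says "if the remaining points do not meet the orderx criteria, just force them to match up with whatever transect is left, so that everything is assigned a StPt_Next and ID_Next".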
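
The "trying NA" symptom is consistent with the while loop stepping past the end of orderx once every candidate has been rejected. The following is only a minimal sketch of that mechanism on made-up three-element vectors (not the question's data): indexing one position past the end yields NA, NA is not found in `used`, and the comparison inside `if` then fails.

```r
# Made-up toy vectors, for illustration only
StPt   <- c(10, 20, 30)
StID   <- 1:3
used   <- c(10, 20, 30)        # every start point has already been consumed
orderx <- c(3, 1, 2)

index <- length(orderx) + 1    # the loop counter has run off the end
j <- orderx[index]             # indexing past the end gives NA

is_used <- StPt[j] %in% used   # FALSE: NA is not in `used`, so the else branch runs

# The comparison in the else branch now evaluates to NA, which `if` rejects
msg <- tryCatch(
  if (1 != StID[j]) "matched" else "same transect",
  error = function(e) conditionMessage(e)
)
print(msg)
```

If this is indeed the cause, stopping the while loop when index exceeds length(orderx) would avoid the error, although it would not by itself produce a single continuous chain.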

Here is a larger dataset that will produce the error:

EndPt <- c(158.7,245.1,187.1,298.2,346.8,317.2,74.5,274.2,153.4,246.7,193.6,302.3,6.8,359.1,235.4,134.5,111.2,240.5,359.2,121.3,224.5,212.6,155.1,353.1,181.7,334,249.3,43.9,38.5,75.7,344.3,45.1,285.7,155.5,183.8,60.6,301,132.1,75.9,112,342.1,302.1,288.1,47.4,331.3,3.4,185.3,62,323.7,188,313.1,171.6,187.6,291.4,19.2,210.3,93.3,24.8,83.1,193.8,112.7,204.3,223.3,210.7,201.2,41.3,79.7,175.4,260.7,279.5,82.4,200.2,254.2,228.9,1.4,299.9,102.7,123.7,172.9,23.2,207.3,320.1,344.6,39.9,223.8,106.6,156.6,45.7,236.3,98.1,337.2,296.1,194,307.1,86.6,65.5,86.6,296.4,94.7,279.9)

StPt <- c(56.3,158.1,82.4,185.5,243.9,195.6,335,167,39.4,151.7,99.8,177.2,246.8,266.1,118.2,358.6,357.9,99.6,209.9,342.8,106.5,86.4,35.7,200.6,65.6,212.5,159.1,297,285.9,300.9,177,245.2,153.1,8.1,76.5,322.4,190.8,35.2,342.6,8.8,244.6,202,176.2,308.3,184.2,267.2,26.6,293.8,167.3,30.5,176,74.3,96.9,186.7,288.2,62.6,331.4,254.7,324.1,73.4,16.4,64,110.9,74.4,69.8,298.8,336.6,58.8,170.1,173.2,330.8,92.6,129.2,124.7,262.3,140.4,321.2,34,79.5,263,66.4,172.8,205.5,288,98.5,335.2,38.7,289.7,112.7,350.7,243.2,185.4,63.9,170.3,326.3,322.9,320.6,199.2,287.1,158.1)

EndID <- c(seq(1, 100, 1))

StID <- c(seq(1, 100, 1))

df <- cbind.data.frame(StPt, StID, EndPt, EndID)

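Any suggestions would be greatly appreciated!

Best answer

As @chinsoon12 pointed out, hidden in your question is an (asymmetric) travelling salesman problem (ATSP). The asymmetry arises because the start and end points of your transects are different.

The ATSP is a famous NP-hard problem (its decision version is NP-complete). Exact solutions are therefore very difficult even for medium-sized problems (see Wikipedia for more information), so in most cases the best we can do is an approximation or a heuristic. Since you mention thousands of transects, this is at least a medium-sized problem.

Instead of coding an ATSP approximation algorithm from scratch, you can use the existing TSP package for R, which includes several approximation algorithms. See the package's reference documentation.

Below is how I used the TSP package on your problem. Starting with the setup (assuming I have already run StPt, StID, EndPt, and EndID as in your question):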

install.packages("TSP")
library(TSP)
library(dplyr)

# Dataframe
df <- cbind.data.frame(StPt, StID, EndPt, EndID)
# filter to 6 example nodes for requested comparison
df = df %>% filter(StID %in% c(1,3,4,5,8,10))

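We will use an ATSP built from a distance matrix. Position [row, col] in the matrix is the cost/distance of going from the end of transect row to the start of transect col. This code creates the entire distance matrix.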

# distance calculation
transec_distance = function(end,start){
  abs_dist = abs(start-end)
  ifelse(360-abs_dist > 180, abs_dist, 360-abs_dist)
}

# distance matrix
matrix_distance = matrix(data = NA, nrow = nrow(df), ncol = nrow(df))

for(start_id in 1:nrow(df)){
  start_point = df[start_id,'StPt']

  for(end_id in 1:nrow(df)){
    end_point = df[end_id,'EndPt']
    matrix_distance[end_id,start_id] = transec_distance(end_point, start_point)
  }
}

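Note that there are more efficient ways to construct a distance matrix. However, I chose this approach for its clarity. Depending on your computer and the exact number of transects, this code may run very slowly.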
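
As one possible faster variant, here is a minimal sketch that builds the same kind of matrix with base R's outer() instead of two nested loops. It reuses the transec_distance rule from above and, purely for illustration, only the first three transects of the question's sample data.

```r
# Circular distance on a 360-degree system (same rule as transec_distance above)
transec_distance <- function(end, start) {
  abs_dist <- abs(start - end)
  ifelse(360 - abs_dist > 180, abs_dist, 360 - abs_dist)
}

# First three transects of the question's sample data
StPt  <- c(342.1, 189.3, 116.5)
EndPt <- c(122.3, 313.9, 198.7)

# outer() evaluates the function on every (EndPt[i], StPt[j]) pair, so
# matrix_distance[i, j] is the cost of going from the end of transect i
# to the start of transect j, matching the [row, col] convention above
matrix_distance <- outer(EndPt, StPt, transec_distance)
print(matrix_distance)
```

This relies on transec_distance being vectorised, which it is because it uses abs() and ifelse().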

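Also note that the size of this matrix grows quadratically with the number of transects, so for a large number of transects you may find that there is not enough memory.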
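
To get a feel for the quadratic growth, here is a rough back-of-the-envelope estimate. The transect count of 5000 is an assumption chosen for illustration (the question only says "thousands"), and the estimate covers the raw numeric matrix only, not any copies made along the way.

```r
n_transects <- 5000               # assumed count, for illustration only
bytes <- n_transects^2 * 8        # R stores each numeric value as an 8-byte double
mib   <- bytes / 1024^2           # convert bytes to MiB
print(sprintf("%d transects -> about %.0f MiB for one distance matrix", n_transects, mib))
```

Doubling the number of transects quadruples this figure.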

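The solving step itself is brief. The distance matrix is converted to an ATSP object, and the ATSP object is passed to the solver. We then add the ordering/travel information to the original df.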

answer = solve_TSP(as.ATSP(matrix_distance))
# get length of cycle
print(answer)

# sort df to same order as solution
df_w_answer = df[as.numeric(answer),]
# add info about next transect to each transect
df_w_answer = df_w_answer %>%
  mutate(visit_order = 1:nrow(df_w_answer)) %>%
  mutate(next_StID = lead(StID, order_by = visit_order),
         next_StPt = lead(StPt, order_by = visit_order))
# add info about next transect to each transect (for final transect)
df_w_answer[df_w_answer$visit_order == nrow(df_w_answer),'next_StID'] =
  df_w_answer[df_w_answer$visit_order == 1,'StID']
df_w_answer[df_w_answer$visit_order == nrow(df_w_answer),'next_StPt'] =
  df_w_answer[df_w_answer$visit_order == 1,'StPt']
# compute distance between end of each transect and start of next
df_w_answer = df_w_answer %>% mutate(dist_between = transec_distance(EndPt, next_StPt))

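At this point we have a cycle. You can pick any node as the starting point and follow the order given in df, from EndID to next_StID, and you will cover every transect in (a good approximation to) the minimum distance.

However, in your "expected result" you have a path solution (e.g., starting at transect 1 and ending at transect 10). We can turn the cycle into a path by excluding the single most expensive transition: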

# as path (without returning to start)
min_distance = sum(df_w_answer$dist_between) - max(df_w_answer$dist_between)
path_start = df_w_answer[df_w_answer$dist_between == max(df_w_answer$dist_between), 'next_StID']
path_end = df_w_answer[df_w_answer$dist_between == max(df_w_answer$dist_between), 'EndID']

print(sprintf("minimum cost path = %.2f, starting at node %d, ending at node %d",
              min_distance, path_start, path_end))
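
To go one step further and obtain an explicit flying order (the TransNum column from the question), the cycle can be rotated so that it starts just after the most expensive transition. The following is a minimal base-R sketch, not part of the TSP package. It uses the cycle 1, 5, 8, 4, 3, 10 that this answer reports for the six-transect example, with gap values worked out by hand from the question's sample data; the names tour, cut, idx and path are illustrative.

```r
# Cycle for the six-transect example, as transect IDs in visiting order
tour <- c(1, 5, 8, 4, 3, 10)
# Gap from the end of each transect to the start of the next one in the cycle
# (the last entry is the closing leg 10 -> 1); computed by hand from the sample data
dist_between <- c(50.3, 68.2, 154.2, 43.9, 60.5, 54.6)

n   <- length(tour)
cut <- which.max(dist_between)             # position of the most expensive transition
idx <- ((cut + seq_len(n) - 1) %% n) + 1   # start just after the cut and wrap around

# Flying order as a path, with a running transect number
path <- data.frame(StID = tour[idx], TransNum = seq_len(n))
path_cost <- sum(dist_between) - dist_between[cut]
print(path)
print(path_cost)
```

Joining path back onto df by StID would then give the reordered data frame.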

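Running all of the above gave me a different, but better, answer than your expected result. I get the following order: 1 --> 5 --> 8 --> 4 --> 3 --> 10 --> 1.

Your path from transect 1 to transect 10 has a total distance of 428; if we also return from transect 10 to transect 1, making it a cycle, the total distance is 483.

Using the TSP package in R, we get a path from 1 to 10 with a total distance of 377, or 431 as a cycle.

If we instead start at node 4 and end at node 8, we get a total distance of 277.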
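
These totals can be checked directly from the question's sample data. The sketch below is only a verification aid; route_cost is a hypothetical helper name, not something from the TSP package. It sums the gap between the end of each transect and the start of the next one for a given visiting order.

```r
# Sample data from the question (position in the vector = transect ID)
StPt  <- c(342.1, 189.3, 116.5, 67.9, 72, 208.4, 173.2, 97.8, 168.7, 138.2)
EndPt <- c(122.3, 313.9, 198.7, 160.4, 166, 26.7, 312.7, 273.7, 288.8, 287.5)

# Circular distance on a 360-degree system (same rule as transec_distance above)
transec_distance <- function(end, start) {
  abs_dist <- abs(start - end)
  ifelse(360 - abs_dist > 180, abs_dist, 360 - abs_dist)
}

# Total gap travelled when visiting transects in the order `ord`;
# close = TRUE adds the leg from the last transect back to the first
route_cost <- function(ord, close = FALSE) {
  from <- ord[-length(ord)]
  to   <- ord[-1]
  if (close) {
    from <- c(from, ord[length(ord)])
    to   <- c(to, ord[1])
  }
  sum(transec_distance(EndPt[from], StPt[to]))
}

asker_path  <- route_cost(c(1, 4, 5, 3, 8, 10))                # 428.7
asker_cycle <- route_cost(c(1, 4, 5, 3, 8, 10), close = TRUE)  # 483.3
tsp_path    <- route_cost(c(1, 5, 8, 4, 3, 10))                # 377.1
tsp_cycle   <- route_cost(c(1, 5, 8, 4, 3, 10), close = TRUE)  # 431.7
best_path   <- route_cost(c(4, 3, 10, 1, 5, 8))                # 277.5
```

The results agree with the figures quoted above, which appear to be rounded down to whole degrees.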

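Some additional notes:

Not all TSP solvers are deterministic, so you may get a different answer if you run the code again or supply the input rows in a different order.

The TSP is a more general problem than the transect problem you describe. Your problem may have enough additional/special structure that it can be solved exactly in a reasonable time, but that takes your question into the realm of mathematics.

If you run out of memory when creating the distance matrix, take a look at the documentation for the TSP package. It contains several examples that use geographic coordinates as input rather than a distance matrix. This is a much smaller input (presumably the package computes distances on the fly), so if you convert your start and end points to coordinates and specify Euclidean distance (or some other common distance function), you could get around (some of) your computer's memory limits.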
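
As a starting point for that conversion, here is a minimal sketch that maps compass bearings to planar coordinates. It assumes every point lies at the same distance from the fixed object (a unit radius here), which the question does not state; bearing_to_xy is a hypothetical helper name. Note also that each transect still has two points (start and end), so this alone does not express the asymmetric problem as a one-point-per-city Euclidean TSP.

```r
# Convert compass bearings (0 = north, 90 = east, clockwise) to x/y coordinates.
# Assumption: all points sit on a circle of the given radius around the object.
bearing_to_xy <- function(bearing_deg, radius = 1) {
  theta <- bearing_deg * pi / 180
  cbind(x = radius * sin(theta), y = radius * cos(theta))
}

# North, east, south and west map to (0, 1), (1, 0), (0, -1) and (-1, 0)
xy <- bearing_to_xy(c(0, 90, 180, 270))
print(round(xy, 3))
```

On a circle, the straight-line (chord) distance between two points increases with their angular separation, so nearest neighbours are the same under both measures, although the route totals will differ.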

Regarding "r - Efficiently ordering randomly generated transects in R", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/62124593/