.net - Optimizing large data comparison using For loops

Reposted. Author: 塔克拉玛干. Updated: 2023-11-03 04:24:42

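Okay, here's the deal. I have two text files. Each contains 500 lines (sentences).

I have loaded them into memory, each into its own array (data type: String). Let's call them array A and array B.

Next, I take the first sentence in array A and split it into another array, C, using SPACE as the delimiter, so that I get its words.

Then, for each sentence in array B, I split it into array D, again using SPACE as the delimiter to get its words, and compare every word in array C with every word in array D to calculate the match percentage between the two sentences.

I calculate the average match percentage of the first sentence in array A against all the sentences in array B.

I then store it in an array E, which holds every sentence from array A together with its average match percentage.

I repeat what I did for the first sentence for every title in array A.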

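The problem is that processing each title in array A takes about 15 seconds. Is there any way I can optimize this to speed it up?

Hardware: AMD Phenom I, 32-bit, quad core

Code: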

Imports System.IO
Imports System.Object
Imports System.Xml
Imports System.Text.RegularExpressions

Module Module1

    Sub Main()
        'Important File Paths
        Dim titlesFilePath As String = Environment.CurrentDirectory & "\titles.txt"
        Dim xmlTitlesFilePath As String = Environment.CurrentDirectory & "\extractedTitles.txt"
        Dim stopWordsFilePath As String = Environment.CurrentDirectory & "\stopWords.txt"

        'Import Important Data From Files -> Memory
        Dim titles As Array = FileToArray(titlesFilePath)
        Dim stopWords As Array = FileToArray(stopWordsFilePath)
        Dim xmlDataUnprocessed As Array = FileToArray(xmlTitlesFilePath)

        'Delimiters To Filter Titles For
        Dim userDefinedDelimeters(4, 1)

        userDefinedDelimeters(0, 0) = "-"
        userDefinedDelimeters(0, 1) = " "

        userDefinedDelimeters(1, 0) = ","
        userDefinedDelimeters(1, 1) = " "

        userDefinedDelimeters(2, 0) = "—"
        userDefinedDelimeters(2, 1) = " "

        userDefinedDelimeters(3, 0) = "'s"
        userDefinedDelimeters(3, 1) = ""

        userDefinedDelimeters(4, 0) = "'"
        userDefinedDelimeters(4, 1) = " "

        'Declare Important Variables
        Dim xmlData(xmlDataUnprocessed.Length / 2, 1)
        Dim xmlTurn = 0
        Dim xmlDataCount = 0

        'Create Feed Title/URL Array (lines alternate: title, then URL)
        For i = 0 To (xmlDataUnprocessed.Length - 1)
            If xmlTurn = 0 Then
                xmlData(xmlDataCount, 0) = xmlDataUnprocessed(i)
                xmlTurn = 1
            Else
                xmlData(xmlDataCount, 1) = xmlDataUnprocessed(i)
                xmlTurn = 0

                xmlDataCount += 1
            End If
        Next


        'CPU-Intensive Stuff Occurs
        Dim xmlTitle As String
        Dim xmlTitleWords As Array
        Dim savedTitleWords As Array
        Dim titleResults(xmlData.GetUpperBound(0) - 1, 1)
        Dim titlePercentageMatch As Integer
        Dim numberOfTitlesMatched As Integer


        For i = 0 To xmlData.GetUpperBound(0) - 1
            Console.WriteLine("Working On Title No. " & i & " Out Of " & xmlData.GetUpperBound(0) - 1)
            titlePercentageMatch = 0
            numberOfTitlesMatched = 0

            xmlTitle = xmlData(i, 0)
            xmlTitle = processTitle(stopWords, userDefinedDelimeters, xmlTitle)
            xmlTitleWords = xmlTitle.Split(" ")

            For Each title In titles
                title = processTitle(stopWords, userDefinedDelimeters, title)
                savedTitleWords = title.split(" ")
                Dim compareResult = compareTitle(xmlTitleWords, savedTitleWords)
                If compareResult > 0 Then
                    titlePercentageMatch += compareResult
                    numberOfTitlesMatched += 1
                End If
            Next

            titleResults(i, 0) = xmlData(i, 0)
            titleResults(i, 1) = (titlePercentageMatch / numberOfTitlesMatched)
        Next

        'GetUpperBound(0) is already the last valid index, so no "- 1" here;
        'subtracting 1 would skip the final result.
        For i = 0 To titleResults.GetUpperBound(0)
            Console.WriteLine(titleResults(i, 0) & " ---> " & titleResults(i, 1) & vbCrLf)
        Next

        Console.Read()
    End Sub

    Function compareTitle(ByRef xmlTitleWords As Array, ByRef savedTitleWords As Array)
        Dim NumberOfMatches = 0

        For Each xmlWord In xmlTitleWords
            For Each savedWord In savedTitleWords
                If (xmlWord.ToString.ToLower = savedWord.ToString.ToLower) Then
                    NumberOfMatches += 1
                End If
            Next
        Next

        Return ((NumberOfMatches / xmlTitleWords.Length) * 100)
    End Function

    Function processTitle(ByRef stopWordArray As Array, ByRef delimArray As Array, ByVal title As String)
        title = removeStopWords(stopWordArray, title)
        title = removeDelims(delimArray, title)

        Return title
    End Function

    Function removeStopWords(ByRef stopWordsArray As Array, ByVal sentence As String)
        For i = 0 To stopWordsArray.Length - 1
            If sentence.ToLower.Contains(" " & stopWordsArray(i).ToString.ToLower & " ") = True Then
                sentence = Microsoft.VisualBasic.Strings.Replace(sentence, " " & stopWordsArray(i) & " ", " ", 1, -1, Constants.vbTextCompare)
                'ElseIf sentence.ToLower.Contains(stopWordsArray(i).ToString.ToLower & " ") = True Then
                'sentence = Microsoft.VisualBasic.Strings.Replace(sentence, stopWordsArray(i) & " ", "", 1, -1, Constants.vbTextCompare)
            End If

            sentence = Regex.Replace(sentence, "\s+", " ")

            Dim Words = sentence.ToLower.Split(" ")

            If Words(0).ToString.ToLower & " " = stopWordsArray(i).ToString.ToLower & " " Then
                sentence = sentence.Remove(0, stopWordsArray(i).ToString.ToLower.Length + 1)
            End If

            Words = sentence.ToLower.Split(" ")
            Dim LastWord = Words(Words.Length - 1)
            'Console.WriteLine(LastWord & "++")

            If " " & LastWord.ToString.ToLower = " " & stopWordsArray(i).ToString.ToLower Then
                sentence = sentence.Remove(sentence.Length - 1 - LastWord.Length, stopWordsArray(i).ToString.ToLower.Length + 1)
            End If

        Next

        sentence = Regex.Replace(sentence, "\s+", " ")

        Return sentence
    End Function

    Function removeDelims(ByRef delimArray As Array, ByVal sentence As String)
        'GetUpperBound(0) is the last valid row index; looping to
        'GetUpperBound(0) - 1 would never apply the final delimiter.
        For i = 0 To delimArray.GetUpperBound(0)
            sentence = sentence.Replace(delimArray(i, 0), delimArray(i, 1))
        Next
        sentence = Regex.Replace(sentence, "\s+", " ")
        Return sentence
    End Function

    Function FileToArray(ByVal filePath As String) As String()
        Dim content As String
        Dim lines As New ArrayList
        Dim sr As System.IO.StreamReader

        ' read the file's lines into an ArrayList
        Try
            sr = New System.IO.StreamReader(filePath)
            Do While sr.Peek() >= 0
                lines.Add(sr.ReadLine())
            Loop
        Finally
            If Not sr Is Nothing Then sr.Close()
        End Try

        ' convert from ArrayList to a String array
        Return CType(lines.ToArray(GetType(String)), String())
    End Function

End Module

Edit: I hope that wasn't too confusing. Sorry about that! Edit 2: Source provided :P

Best answer

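Your basic algorithm is N*M*A^2, where

  • N = number of titles in the first file
  • M = number of titles in the second file
  • A = average number of words per title

With 500*500*5^2, you are doing 6,250,000 case-insensitive string comparisons. But that is not all you are doing. Your inner loop calls processTitle for every title on every pass of the outer loop, so each saved title is reprocessed once per feed title. It doesn't need to do that.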
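The point about hoisting the title processing out of the nested loops can be sketched roughly as follows. This is a minimal illustration, not the asker's code: `Normalize` is a hypothetical, simplified stand-in for `processTitle`, the sample titles and stop words are made up, and `MatchPercent` assumes the same scoring rule as `compareTitle`.

```vbnet
Imports System
Imports System.Linq

Module HoistPreprocessing

    ' Hypothetical stand-in for processTitle: lower-case the title,
    ' split it on spaces, and drop stop words.
    Function Normalize(title As String, stopWords As String()) As String()
        Return title.ToLowerInvariant().
            Split({" "c}, StringSplitOptions.RemoveEmptyEntries).
            Where(Function(w) Not stopWords.Contains(w)).
            ToArray()
    End Function

    ' Assumed to mirror compareTitle: count every matching word pair and
    ' divide by the number of words in the first title.
    Function MatchPercent(a As String(), b As String()) As Double
        If a.Length = 0 Then Return 0
        Dim matches As Integer = 0
        For Each x As String In a
            For Each y As String In b
                If x = y Then matches += 1
            Next
        Next
        Return matches / a.Length * 100
    End Function

    Sub Main()
        ' Made-up sample data.
        Dim stopWords As String() = {"the", "a", "of"}
        Dim feedTitles As String() = {"The quick brown fox", "A tale of two cities"}
        Dim savedTitles As String() = {"Quick fox jumps", "Two cities", "Unrelated words"}

        ' Normalize every title exactly once, before the nested loops run.
        Dim feedWords As String()() = feedTitles.Select(Function(t) Normalize(t, stopWords)).ToArray()
        Dim savedWords As String()() = savedTitles.Select(Function(t) Normalize(t, stopWords)).ToArray()

        For i As Integer = 0 To feedWords.Length - 1
            Dim total As Double = 0
            Dim matched As Integer = 0
            For Each saved As String() In savedWords
                Dim score As Double = MatchPercent(feedWords(i), saved)
                If score > 0 Then
                    total += score
                    matched += 1
                End If
            Next
            ' Guard against dividing by zero when nothing matched.
            Dim average As Double = If(matched > 0, total / matched, 0.0)
            Console.WriteLine(feedTitles(i) & " ---> " & average)
        Next
    End Sub

End Module
```

With this shape, the cleanup work runs N + M times instead of N * M times, and the nested loops only compare words.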

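Single-threaded

What you can do is add a preprocessing step that replaces each word with an integer (a symbol) that represents it. To do this, look each word up in a dictionary; if it has no symbol yet, assign a new unique one (for example, keep an integer counter and use the next value).

Your main processing loop will then look much like the previous one, but it will perform integer comparisons, which are much faster. In fact, you want this processing step to do nothing but the comparisons and the gathering of statistics. Everything else should be moved out of it.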
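One way the symbol idea might look in code is sketched below. The names `Intern`, `Encode`, and `MatchPercent` are hypothetical, and the case-insensitive dictionary is an assumption made here so that the case folding of the original comparison happens once, during preprocessing.

```vbnet
Imports System
Imports System.Collections.Generic

Module SymbolTable

    ' Maps each distinct word (ignoring case) to a small integer symbol.
    Private ReadOnly symbols As New Dictionary(Of String, Integer)(StringComparer.OrdinalIgnoreCase)

    ' Returns the existing symbol for a word, or assigns the next unused one.
    Function Intern(word As String) As Integer
        Dim id As Integer
        If Not symbols.TryGetValue(word, id) Then
            id = symbols.Count
            symbols.Add(word, id)
        End If
        Return id
    End Function

    ' Preprocessing step: turn a title into an array of integer symbols.
    Function Encode(title As String) As Integer()
        Dim words As String() = title.Split({" "c}, StringSplitOptions.RemoveEmptyEntries)
        Dim ids(words.Length - 1) As Integer
        For i As Integer = 0 To words.Length - 1
            ids(i) = Intern(words(i))
        Next
        Return ids
    End Function

    ' Main-loop work: integer comparisons only, no string handling.
    Function MatchPercent(a As Integer(), b As Integer()) As Double
        If a.Length = 0 Then Return 0
        Dim matches As Integer = 0
        For Each x As Integer In a
            For Each y As Integer In b
                If x = y Then matches += 1
            Next
        Next
        Return matches / a.Length * 100
    End Function

    Sub Main()
        ' Made-up sample titles.
        Dim first As Integer() = Encode("Quick brown fox")
        Dim second As Integer() = Encode("the QUICK red Fox")
        Console.WriteLine(MatchPercent(first, second))
    End Sub

End Module
```

The dictionary is only touched while encoding; once every title is an `Integer()`, the hot loop never allocates or lower-cases a string.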

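Multi-threaded

Keep the preprocessing step.

Parallelize your processing step. One way is to use Parallel.For() for the outermost loop: Parallel.For(0, xmlData.GetUpperBound(0), Sub(i) ... End Sub), where the action is the loop body above. Note that the second argument of Parallel.For is exclusive, so it is xmlData.GetUpperBound(0) rather than the inclusive bound xmlData.GetUpperBound(0) - 1 used by the original For loop. The TPL will probably balance the load well (using your 4 cores evenly).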
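A rough sketch of the Parallel.For variant, assuming the titles have already been encoded as integer arrays by a preprocessing step. `ScoreAll` and `MatchPercent` are hypothetical names and the sample data is made up. The counters are declared inside the lambda so that each iteration has its own copy, unlike the shared variables in the original Main.

```vbnet
Imports System
Imports System.Threading.Tasks

Module ParallelScoring

    ' Integer-only comparison of two encoded titles.
    Function MatchPercent(a As Integer(), b As Integer()) As Double
        If a.Length = 0 Then Return 0
        Dim matches As Integer = 0
        For Each x As Integer In a
            For Each y As Integer In b
                If x = y Then matches += 1
            Next
        Next
        Return matches / a.Length * 100
    End Function

    ' Averages the non-zero match percentages of each feed title
    ' against all saved titles, one feed title per parallel iteration.
    Function ScoreAll(feed As Integer()(), saved As Integer()()) As Double()
        Dim results(feed.Length - 1) As Double
        ' The upper bound is exclusive, so this visits 0 .. feed.Length - 1.
        Parallel.For(0, feed.Length,
            Sub(i)
                Dim total As Double = 0
                Dim matched As Integer = 0
                For Each other As Integer() In saved
                    Dim score As Double = MatchPercent(feed(i), other)
                    If score > 0 Then
                        total += score
                        matched += 1
                    End If
                Next
                ' Each iteration writes only its own slot, so no lock is needed.
                results(i) = If(matched > 0, total / matched, 0.0)
            End Sub)
        Return results
    End Function

    Sub Main()
        ' Made-up encoded titles.
        Dim feed As Integer()() = {New Integer() {0, 1, 2}, New Integer() {3, 4}}
        Dim saved As Integer()() = {New Integer() {0, 2}, New Integer() {4}, New Integer() {9}}
        For Each average As Double In ScoreAll(feed, saved)
            Console.WriteLine(average)
        Next
    End Sub

End Module
```

Progress messages written with Console.WriteLine from inside the lambda would appear out of order, so they are left out here.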

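Another way is to use the Task Parallel Library to start tasks that each operate on 1/4 of the data, and then start a continuation that uses the results.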
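The chunked approach could be sketched along these lines. `ScoreTitle` is a hypothetical placeholder for the real per-title work (it only squares the index so the example stays self-contained), and `ProcessInChunks` is likewise an assumed name. The continuation is created with Task.Factory.ContinueWhenAll and runs once every chunk has finished.

```vbnet
Imports System
Imports System.Threading.Tasks

Module ChunkedTasks

    ' Hypothetical placeholder for the real per-title scoring.
    Function ScoreTitle(index As Integer) As Double
        Return index * index
    End Function

    ' Splits the index range 0 .. count - 1 into the given number of chunks,
    ' processes each chunk in its own task, and returns the filled array.
    Function ProcessInChunks(count As Integer, chunks As Integer) As Double()
        Dim results(count - 1) As Double
        Dim workers(chunks - 1) As Task
        For c As Integer = 0 To chunks - 1
            ' Copy the bounds into locals so each lambda captures its own range.
            Dim first As Integer = (c * count) \ chunks
            Dim last As Integer = ((c + 1) * count) \ chunks - 1
            workers(c) = Task.Factory.StartNew(
                Sub()
                    For i As Integer = first To last
                        results(i) = ScoreTitle(i)
                    Next
                End Sub)
        Next
        ' Continuation: runs only after all chunk tasks have completed.
        Dim done As Task(Of Double()) =
            Task.Factory.ContinueWhenAll(Of Double())(workers, Function(finished As Task()) results)
        Return done.Result
    End Function

    Sub Main()
        ' Four chunks, matching the four cores mentioned in the question.
        For Each value As Double In ProcessInChunks(10, 4)
            Console.WriteLine(value)
        Next
    End Sub

End Module
```

Compared with Parallel.For, this fixes the partitioning up front, which is simple but cannot rebalance if one quarter of the titles happens to be slower than the others.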

Regarding ".net - Optimizing large data comparison using For loops", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/13385393/
