algorithm - 如何在伪多项式时间内从文本字符串表中找到字符串的最佳编码？-6ren

algorithm - 如何在伪多项式时间内从文本字符串表中找到字符串的最佳编码？

转载作者：行者123 更新时间：2023-12-04 16:23:56

问题:

Consider the following data compression technique. We have a table ofm text strings, each at most k in length. We want to encode a datastring D of length n using as few text strings as possible. Forexample, if our table contains (a,ba,abab,b) and the data string isbababbaababa, the best way to encode it is (b,abab,ba,abab,a)—a totalof five code words. Give an O(nmk) algorithm to find the length of thebest encoding. You may assume that every string has at least oneencoding in terms of the table.

我在skiena的算法设计手册中发现了这个问题并尝试解决。

我最好的猜测是将字符串(D)的所有可能的子字符串与长度(m)的表中的所有文本匹配，时间复杂度:O(n*n*k*m) .

有没有更好的方法在 O(nmk) 伪多项式时间内解决这个问题？

最佳答案

如果我误解了，请纠正我，但听起来你的解决方案的时间复杂度会比 O(nnkm) 大。

要生成包含 D 的所有元素的 D 的所有可能分区的列表，需要 O(nn)。然后，对于每个分区中的每个子集(其中最多有 n 个子集)，您需要验证表中是否存在匹配，每个子集需要 O(km)。在 n*n 个可能的分区之后(实际上不是 n*n，而是 n 的贝尔数)，匹配最多 n 个集合，每个集合将得到 O(nnnmk) 的复杂度

也就是说，我认为有一种更快的方法来尝试构建一棵树。

从 D 的开头开始，您可以尝试匹配表中的所有字符串。表中的每个字符串代表一个分支，如果表字符串在第一个位置与 D 不匹配，则分支终止。如果匹配，则从刚刚匹配的表字符串的末尾开始重复该过程。

此方法的一个技巧是您只需要保留产生唯一长度的一系列匹配项。以广度优先的方式重复这个过程意味着如果已经发现了一组达到特定长度的表字符串，那么任何再次达到该长度的表字符串都保证包含相同数量或更多的字符串。表但仍然匹配 D 的相同子字符串。例如，表字符串 {abab,ab} 都将产生长度为 4 的字符串 'abab'，但其中一个将在第一次迭代时匹配，一个将在第一次迭代时匹配第二。在这个例子中，包含 'ab'+'ab' 的编码总是比包含 'abab' 的编码差，所以我们丢弃了 'ab'+'ab' 编码。可以使用一个简单的 O(1) 查找表来检查这一点，并且以前见过的任何长度也可以终止其分支。因此，最多可以发现 n 个唯一长度，每个长度只能发现一次，这意味着最多有 n 个可能的分支。

仍然必须检查表中的字符串，这仍然是 O(mk)。添加分支，我相信这会产生 O(nmk) 的复杂度，因为最多创建 n 个分支。

例如，字符串 'aaaaaa' 上的表 {a, aa} 每次迭代将包含 2 个分支，共 3 次迭代。每个分支都需要 O(mk) 来检查，所以这个例子的时间复杂度是 O(nmk)。

编辑:由于对我提出的实现有些怀疑，我用 C 实现了它并做了一些分析。


#include <stdlib.h>
#include <string.h>
#include <stdio.h>

const char* alphabet = "abcdef";
const int alphabetLen = strlen(alphabet);

void printTable(char** table, int k){
    printf("Table: {%s", table[0]);
    for(int i=1; i<k; i++){
        printf(", %s", table[i]);
    }
    printf("}\n");
}

char** generateTable(int k, int m){

    char** table = (char**)malloc(sizeof(char*)*k);
    for(int i=0; i<k; i++){
        int tableStringLen = (rand() % (m-1)) + 1;
        table[i] = (char*)malloc(tableStringLen+1);
        for(int j=0; j<tableStringLen; j++){
            table[i][j] = alphabet[rand() % alphabetLen];
        }
        table[i][tableStringLen] = 0;
    }

    return table;
}

// Unfortunately, the length of the string is partially determined by the size of m
// it's done this way to avoid generating a random string and ensuring the table can encode it
char* generateString(char** table, int k, int m){
    int minLength = rand()%200 + m*2;
    char* string = (char*)malloc(minLength + m);

    int j = 0;
    for(int i=0; j<minLength; i++){
        int tableStringIndex = rand() % k;
        strcpy(string+j, table[tableStringIndex]);
        j += strlen(table[tableStringIndex]);
    }
    string[j] = 0;

    return string;
}

void printSolution(char* string, int* branchLengths, int n){
    if(!n){ return; }
    printSolution(string, branchLengths, branchLengths[n]);
    printf("%.*s ", n-branchLengths[n], string+branchLengths[n]);
}

int* solve(char* string, char** table, int n, int m, int k, int* comparisons){
    int* currentBranchEnds = (int*)calloc(sizeof(int), n); // keeps track of all of the current ends
    int currentBranchCount = 1; // keeps track of how many active branch ends there are

    int* newBranchEnds = (int*)calloc(sizeof(int), n); // keeps track of the new branches on each iteration
    int newBranchCount = 0;

    int* branchLengths = (int*)calloc(sizeof(int), (n+1)); // used to reconstruct the solution in the end

    // used for O(1) length lookups
    int* tableStringLengths = (int*)malloc(sizeof(int) * k);
    for(int i=0; i<k; i++){ tableStringLengths[i] = strlen(table[i]); }

    *comparisons = 0;

    // continue searching while the entire string hasn't been encoded
    while(!branchLengths[n]){

        // for every active branch
        for(int i=0; i<currentBranchCount; i++){


            // try all table strings
            for(int j=0; j<k; j++){
                int newBranchEnd = currentBranchEnds[i] + tableStringLengths[j];
                
                // if the new length (branch end) already exists OR
                // if the new branch would be too long for the string, discard the branch
                *comparisons += 1;
                if(newBranchEnd > n || branchLengths[newBranchEnd]){ continue; }

                // check to see if the table string matches the target string at this position
                // could be done with strcmp, but is done in a loop here to be explicit about complexity
                char match = 1;
                for(int l=0; table[j][l]; l++){
                    *comparisons += 1;
                    if(string[currentBranchEnds[i] + l] != table[j][l]){
                        match = 0;
                        break;
                    }
                }

                // if it matches, we can create a new branch at this position
                *comparisons += 1;
                if(match){
                    branchLengths[newBranchEnd] = currentBranchEnds[i];
                    newBranchEnds[newBranchCount] = newBranchEnd;
                    newBranchCount += 1;
                }
            }
        }

        // swap the branch ends arrays to save copying
        int* tmp = currentBranchEnds;
        currentBranchEnds = newBranchEnds;
        newBranchEnds = tmp;

        currentBranchCount = newBranchCount;
        newBranchCount = 0;
    }

    free(currentBranchEnds);
    free(newBranchEnds);
    free(tableStringLengths);

    return branchLengths;
}

int main(){
    int k = rand() % 30 + 2;
    int m = rand() % 15 + 2;

    char** table = generateTable(k, m);
    printTable(table, k);

    char* string = generateString(table, k, m);
    int n = strlen(string);
    printf("String: %s\n", string);

    int comparisons;
    int* solution = solve(string, table, n, m, k, &comparisons);
    printf("Solution: ");
    printSolution(string, solution, n);
    printf("\n");
    printf("Comparisons: %d\n", comparisons);

    for(int i=0; i<k; i++){ free(table[i]); }
    free(table);
    free(solution);
    free(string);
}

分析是通过对随机生成的 n、m 和 k 值运行 >500000 次算法生成的。如代码所示，“比较”的次数在每个 if 语句中都会增加，并且绘制了每个 n、m 和 k 值的平均比较次数。

比较次数(计算得出的)与 n、m 和 k 的大小之间显然存在线性关系。这表明复杂度是 O(nmk)

关于algorithm - 如何在伪多项式时间内从文本字符串表中找到字符串的最佳编码？，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/69025441/

文章推荐： r - 如何识别两个数据矩阵之间的哪些列和行匹配？

文章推荐： python - 操作 numpy 数组(连接内部子数组)

文章推荐： Haskell RankNTypes - 函数域的限制

algorithm - 隐私和匿名化 "Algorithm"
我在一本书(Interview Question)中读到这个问题，想在这里详细讨论这个问题。请点亮它。问题如下:- 隐私和匿名化马萨诸塞州集团保险委员会早在 1990 年代中期就有一个绝妙的主意
algorithm - 微软技术面试 : Matrix Algorithm
我最近接受了一次面试，面试官给了我一些伪代码并提出了相关问题。不幸的是，由于准备不足，我无法回答他的问题。由于时间关系，我无法向他请教该问题的解决方案。如果有人可以指导我并帮助我理解问题，以便我可以改
algorithm - 获取二叉树中给定值的根到节点的距离 : Algorithm correctness
这是我的代码 public int getDist(Node root, int value) { if (root == null && value !=0) return
algorithm - 交叉点 : Strassen's Algorithm
就效率而言，Strassen 算法应该停止递归并应用乘法的最佳交叉点是多少？我知道这与具体的实现和硬件密切相关，但对于一般情况应该有某种指南或某人的一些实验结果。在网上搜索了一下，问了一些他们认为
algorithm - 图书请求 : Distributed algorithms
我想学习一些关于分布式算法的知识，所以我正在寻找任何书籍推荐。我对理论书籍更感兴趣，因为实现只是个人喜好问题(我可能会使用 erlang(或 c#))。但另一方面，我不想对算法进行原始的数学分析。只是
algorithm - "classical algorithms"的真实世界实现
我想知道你们中有多少人实现了计算机科学的“ classical algorithms ”，例如 Dijkstra's algorithm或现实世界中的数据结构(例如二叉搜索树)，而不是学术项目？当有
algorithm - 我试图找到一个 "bartender algorithm"
我正在解决旧编程竞赛中的一些示例问题。在这个问题中，我们得到了我们有多少调酒师以及他们知道哪些食谱的信息。制作每杯鸡尾酒需要 1 分钟，我们需要使用所有调酒师计算是否可以在 5 分钟内完成订单。解决
algorithm - 函数式编程中存在 "algorithms"吗？
关闭。这个问题是opinion-based .它目前不接受答案。想要改进这个问题？更新问题，以便 editing this post 可以用事实和引用来回答它. 关闭 8 年前。 Improve
javascript - if (!options.algorithms) throw new Error ('algorithms should be set' );错误 : algorithms should be set
我开始学习 Nodejs，但我被困在中间的某个地方。我从 npm 安装了一个新库，它是 express -jwt ，它在运行后显示某种错误。附上代码和错误日志，请帮助我! const jwt = re
algorithm - SSL 证书 : Signature Algorithm shows "sha256rsa" but thumbprint algorithm shows "sha1"
我有一个证书，其中签名算法显示“sha256rsa”，但指纹算法显示“sha1”。我的证书 SHA1/SHA2 的标识是什么？谢谢! 最佳答案 TL;TR:签名和指纹是完全不同的东西。对于证书的强度
algorithm - "algorithm problem size"到底是什么意思？
我目前在我的大学学习数据结构类(class)，并且在之前的类(class)中做过一些算法分析，但这是我在之前的类(class)中遇到的最困难的部分。我们现在将在我的数据结构类(class)中学习算法分
algorithm - 选择不相邻的单元格 : algorithm's time complexity
有一个由 N 个 1x1 方格组成的区域，并且该区域的所有部分都是相连的(没有任何方格无法到达的方格)。下面是一些面积的例子。我想在这个区域中选择一些方块，并且两个相邻的方块不能一起选择(对角接触
algorithm - 粗糙度降低 : Algorithm for smoothing out shapes
我有一些多边形形状的点列表，我想将其包含在我页面上的 Google map 中。我已经从原始数据中删除了尽可能多的不必要的多边形，现在我剩下大约 12 个，但它们非常详细以至于导致了问题。现在我的文
algorithm - Marching Squares Algorithm 的位移步骤
我目前正在实现 Marching Squares用于计算等高线曲线，我对此处提到的位移位的使用有疑问 Compose the 4 bits at the corners of the cell to
algorithm - 理解约束满足问题 : map coloring algorithm
我正在尝试针对给定算法的约束满足问题实现此递归回溯函数: function BACKTRACKING-SEARCH(csp) returns solution/failure return R
algorithm - 将矩阵除以矩阵 : Bartlett Correlation Algorithm
是否有包含反函数的库？作为项目的一部分，我目前正在研究测向算法。我正在使用巴特利特相关性。在 Bartlett 相关性中，我需要将已经是 3 次矩阵乘法(包括 Hermitian 转置)的分子除以作
algorithm - 多项式时间 : Accepting and Decision Algorithms
关闭。这个问题不符合Stack Overflow guidelines .它目前不接受答案。这个问题似乎与 help center 中定义的范围内的编程无关。 . 关闭 8 年前。 Improve
algorithm - 长波紫外线 - 1394 : And There Was One Algorithm
问题的链接是UVA - 1394 : And There Was One . 朴素的算法是扫描整个数组并在每次迭代中标记第 k 个元素并在最后停止:这需要 O(n^2) 时间。我搜索了一种替代算法并
algorithm - 什么是 "Decentralized Uniqueness Algorithm"？
COM 中创建 GUID 的函数 (CoCreateGUID) 使用“分散唯一性算法”，但我的问题是，它是什么？谁能解释一下？最佳答案一种生成 ID 的方法，该 ID 具有一定的唯一性保证，而不
algorithm - 最小化颜色 : a variation of the knapsack algorithm?
在做一个项目时我遇到了这个问题，我将在这个问题的实际领域之外重新措辞(我想我可以谈论烟花的口径和形状，但这会使理解更加复杂).我正在寻找一种(可能是近似的)算法来解决它。我有 n 个不同大小的容器，

行者123

个人简介

我是一名优秀的程序员,十分优秀！

作者热门文章

滴滴打车优惠券免费领取

全站热门文章

首页

博学

6Ren·AI

商城

algorithm - 如何在伪多项式时间内从文本字符串表中找到字符串的最佳编码？