gpt4 book ai didi

algorithm - 我们是否必须创建一个所有节点都有 3 个子节点的树?

转载 作者:行者123 更新时间:2023-12-01 22:31:50 24 4
gpt4 key购买 nike

构建霍夫曼树的步骤输入是唯一字符及其出现频率的数组,输出是霍夫曼树。

  1. 为每个唯一字符创建一个叶子节点,并构建一个所有叶子节点的最小堆(Min Heap用作优先级队列。频率字段的值用于比较最小堆中的两个节点。最初,出现频率最低的字符在根)

  2. 从最小堆中提取频率最小的两个节点。

  3. 创建一个新的内部节点,其频率等于两个节点频率的总和。将第一个提取的节点作为其左子节点,将另一个提取的节点作为其右子节点。将此节点添加到最小堆。

  4. 重复步骤#2 和#3,直到堆只包含一个节点。剩下的节点就是根节点,树就完成了。

在堆中,一个节点最多可以有 2 个子节点,对吧?

因此,如果我们想在三元系统(即使用符号 0 、 1 和 2 的编码词)中推广霍夫曼算法,我们可以做什么?我们是否必须创建一个所有节点都有 3 个子节点的树?

编辑:

我认为应该是这样的。

构建霍夫曼树的步骤输入是唯一字符及其出现频率的数组,输出是哈夫曼树。

  1. 为每个唯一字符创建一个叶节点,并构建所有叶节点的最小堆

  2. 从最小堆中提取频率最小的三个节点。

  3. 创建一个新的内部节点,其频率等于三个节点频率的总和。将提取的第一个节点作为其左子节点,将第二个提取的节点作为其中间子节点,将第三个提取的节点作为其右子节点。将此节点添加到最小堆。

  4. 重复步骤#2 和#3,直到堆只包含一个节点。剩下的节点就是根节点,树就完成了。

我们如何证明该算法产生最佳三元码?

编辑 2:假设我们有频率 5、9、12、13、16、45。

它们的数量是偶数,所以我们添加一个频率为 0 的虚拟节点。我们是否将其放在数组的末尾?那么,会不会是这样呢?

enter image description here

那么会不会有下面的堆呢?

enter image description here

然后:

enter image description here

然后:

enter image description here

还是我理解错了?

最佳答案

是的!您必须创建具有 3 个子节点的所有节点。为什么是3?你也可以有 n-ary huffman coding使用带有 n 个子节点的节点。这棵树看起来像这样-(对于 n=3)

    *
/ | \
* * *
/|\
* * *

三元码字的哈夫曼算法

为了便于引用,我给出了算法。

HUFFMAN_TERNARY(C)
{
IF |C|=EVEN
THEN ADD DUMMY CHARACTER Z WITH FREQUENCY 0.

N=|C|
Q=C; //WE ARE BASICALLY HEAPIFYING THE CHARACTERS

FOR I=1 TO floor(N/2)
{
ALLOCATE NEW_NODE;
LEFT[NEW_NODE]= U= EXTRACT_MIN(Q)
MID[NEW_NODE] = V= EXTRACT_MIN(Q)
RIGHT[NEW_NODE]=W= EXTRACT_MIN(Q)
F[NEW_NODE]=F[U]+F[V]+F[W];
INSERT(Q,NEW_NODE);
}
RETURN EXTRACT_MIN(Q);
} //END-OF-ALGO

Why are we adding extra nodes? To make the number of nodes odd.(Why?) Because we want to get out of the for loop with just one node in Q.


Why floor(N/2)? At first we take 3 nodes. Then replace with it 1 node.There are N-2 nodes. After that we always take 3 nodes (if not available 1 node,it is never possible to get 2 nodes because of the dummy node) and replace with 1. In each iteration we are reducing it by 2 nodes. So that's why we are using the term floor(N/2).

使用一些示例字符集在纸上自行检查。你会明白的。

正确性

我在这里引用了 Cormen, Rivest 的“Introduction to Algorithms”。

证明:一步一步的数学证明太长,无法在这里发布,但它与书中给出的证明非常相似。

想法

Any optimal tree has the lowest three frequencies at the lowest level.(We have to prove this).(using contradiction) Suppose it is not the case then we could switch a leaf with a higher frequency from the lowest level with one of the lowest three leaves and obtain a lower average length. Without any loss of generality, we can assume that all the three lowest frequencies are the children of the same node. if they are at the same level, the average length does not change irrespective of where the frequencies are). They only differ in the last digit of their codeword (one will be 0,1 or 2).

Again as the binary codewords we have to contract the three nodes and make a new character out of it having frequency=total of three node's(character's) frequencies. Like binary Huffman codes, we see that the cost of the optimal tree is the sum of the tree with the three symbols contracted and the eliminated sub-tree which had the nodes before contraction. Since it has been proved that the sub-tree has to be present in the final optimal tree, we can optimize on the tree with the contracted newly created node.

例子

  • 假设字符集包含频率5,9,12,13,16,45。

  • 现在 N=6-> 偶数。所以添加 freq=0 的虚拟字符

  • 现在N=7,C中的freq为0,5,9,12,13,16,45

  • 现在使用 min priority queue得到 3 个值。 0然后 5然后 9 .

  • 添加它们,在优先级队列中插入 freq=0+9+5 的新字符。这样继续。

树会变成这样

         100
/ | \
/ | \
/ | \
39 16 45 step-3
/ | \
14 12 13 step-2
/ | \
0 5 9 step-1

最终证明

我现在将直接模拟 Cormen 的证明。

引理1.设C是一个字母表,其中属于C的每个字符d的频率为c.freq。让x , y 和 z 是 C 中频率最低的三个字符。那么存在C 的最佳前缀代码,其中 x 、y 和 z 的代码字具有相同的长度,仅在最后一位不同。


证明:

想法

  • 首先考虑生成任意最优前缀码的任意树T。

  • 然后我们将对其进行修改,使其成为代表另一个最佳前缀的树,使得字符 x、y、z 在最大深度处显示为兄弟节点。

  • 如果我们可以构建这样一棵树,那么 x、y 和 z 的代码字将具有相同的长度,仅在最后一位不同。

证明--

  • 令 a,b,c 为 T 中最大深度的同级叶子的三个字符。
  • Without loss of generality, we assume that a.freq < b:freq < c.freq and x.freq < y.freq < z.freq. Since x.freq and y.freq and z.freq are the 3 lowest leaf frequencies, in order (means there are no frequencies between them) and a.freq , b.freq and c.freq are two arbitrary frequencies, in order, we have x.freq < a:freq and
    y.freq < b.freq and z.freq< c.freq.

In the remainder of the proof we can have x.freq=a.freq or y.freq=b.freq or z.freq=c.freq. But if x.freq=b.freq or x.freq=c.freq or y.freq=c.freq then all of them are same. WHY?? Let's see. Suppose x!=y,y!=z,x!=z but z=c and x<y<z in order and aa<b<c. Also x!=a. --> x<a y!=b. --> y<b z!=c. --> z<c but z=c is given. This contradicts our assumption. (Thus proves). The lemma would be trivially true. Thus we will assume x!=b and x!=c.

 T1

* |
/ | \ |
* * x +---d(x)
/ | \ |
y * z +---d(y) or d(z)
/|\ |
a b c +---d(a) or d(b) or d(c) actually d(a)=d(b)=d(c)

T2
*
/ | \
* * a
/ | \
y * z
/|\
x b c

T3

*
/ | \
* * x
/ | \
b * z
/|\
x y c

T4
*
/ | \
* * a
/ | \
b * c
/|\
x y z


In case of T1 costt1= x.freq*d(x)+ cost_of_other_nodes + y.freq*d(y) + z.freq*d(z) + d(a)*a.freq + b.freq*d(b) + c.freq*d(c)
In case of T2 costt2= x.freq*d(a)+ cost_of_other_nodes + y.freq*d(y) + z.freq*d(z) + d(x)*a.freq + b.freq*d(b) + c.freq*d(c)
costt1-costt2= x.freq*[d(x)-d(a)]+0 + 0 + 0 + a.freq[d(a)-d(x)]+0 + 0
= (a.freq-x.freq)*(d(a)-d(x))
>= 0
So costt1>=costt2. --->(1)
Similarly we can show costt2 >= costt3--->(2)

And costt3 >= costt4--->(3)

From (1),(2) and (3) we get

costt1>=costt4.-->(4)

But T1 is optimal.

So costt1<=costt4 -->(5)

From (4) and (5) we get costt1=costt2.

SO, T4 is an optimal tree in which x,y,and z appears as sibling leaves at maximum depth, from which the lemma follows.

引理 2

Let C be a given alphabet with frequency c.freq defined for each character c belonging to C. Let x , y, z be three characters in C with minimum frequency. Let C1 be the alphabet C with the characters x and y removed and a new character z1 added, so that C1 = C - {x,y,z} union {z1}. Define f for C1 as for C, except that z1.freq=x.freq+y.freq+z.freq. Let T1 be any tree representing an optimal prefix code for the alphabet C1. Then the tree T , obtained from T1 by replacing the leaf node for z with an internal node having x , y and z as children, represents an optimal prefix code for the alphabet C.

证明:

看,我们正在从 T1-> T 进行转换。所以我们必须找到一种方法来表达 T,即用 costt1 表示 costt。

     *                            *              
/ | \ / | \
* * * * * *
/ | \ / | \
* * * ----> * z1 *
/|\
x y z

对于属于(C-{x,y,z})的c,dT(c)=dT1(c)。 [深度对应T和T1树]

因此 c.freq*dT(c)=c.freq*dT1(c) .

dT(x)=dT(y)=dT(z)=dT1(z1)+1

So we have `x.freq*dT(x)+y.freq*dT(y)+z.freq*dT(z)=(x.freq+y.freq+z.freq)(dT1(z)+1)`
= `z1.freq*dT1(z1)+x.freq+y.freq+z.freq`
Adding both side the cost of other nodes which is same in both T and T1.
x.freq*dT(x)+y.freq*dT(y)+z.freq*dT(z)+cost_of_other_nodes= z1.freq*dT1(z1)+x.freq+y.freq+z.freq+cost_of_other_nodes

So costt=costt1+x.freq+y.freq+z.freq
or equivalently

costt1=costt-x.freq-y.freq-z.freq ---->(1)

现在我们用反证法证明引理。

我们现在用反证法证明引理。假设 T 不代表C 的最佳前缀代码。然后存在最佳树 T2,使得 costt2 < costt .不失一般性(根据引理 1),T2 有 x 和 y 和 z 作为 sibling 。设 T3 是树 T2,其中 x 和 y 和 z 的共同父代被替换为 a叶 z1 与频率 z1.freq=x.freq+y.freq+z.freq然后

costt3 = costt2-x.freq-y.freq-z.freq
< costt-x.freq-y.freq-z.freq
= costt1 (From 1)

与 T1 代表最佳前缀码的假设相矛盾对于 C1。因此,T必须代表字母表C的最佳前缀码。

-Proved.

程序 HUFFMAN 产生最佳前缀代码。证明:直接来自引理 1 和 2。

注意.: 术语来自Introduction to Algorithms 3rd edition Cormen Rivest

关于algorithm - 我们是否必须创建一个所有节点都有 3 个子节点的树?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/29200112/

24 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com