What's the rationale for null terminated strings?

Reposted. Author: bug小助手. Updated: 2023-10-25 19:00:45



As much as I love C and C++, I can't help but scratch my head at the choice of null terminated strings:





  • Length prefixed (i.e. Pascal) strings existed before C

  • Length prefixed strings make several algorithms faster by allowing constant time length lookup.

  • Length prefixed strings make it more difficult to cause buffer overrun errors.

  • Even on a 32 bit machine, if you allow the string to be the size of available memory, a length prefixed string is only three bytes wider than a null terminated string. On 16 bit machines this is a single byte. On 64 bit machines, 4GB is a reasonable string length limit, but even if you want to expand it to the size of the machine word, 64 bit machines usually have ample memory making the extra seven bytes sort of a null argument. I know the original C standard was written for insanely poor machines (in terms of memory), but the efficiency argument doesn't sell me here.

  • Pretty much every other language (e.g. Perl, Pascal, Python, Java, C#, etc.) uses length prefixed strings. These languages usually beat C in string manipulation benchmarks because they are more efficient with strings.

  • C++ rectified this a bit with the std::basic_string template, but plain character arrays expecting null terminated strings are still pervasive. This is also imperfect because it requires heap allocation.

  • Null terminated strings have to reserve a character (namely, null), which cannot exist in the string, while length prefixed strings can contain embedded nulls.



Several of these things have come to light more recently than C, so it would make sense for C to not have known of them. However, several were plain well before C came to be. Why would null terminated strings have been chosen instead of the obviously superior length prefixing?




EDIT: Since some asked for facts (and didn't like the ones I already provided) on my efficiency point above, they stem from a few things:





  • Concat using null terminated strings requires O(n + m) time complexity. Length prefixing often requires only O(m).

  • Length using null terminated strings requires O(n) time complexity. Length prefixing is O(1).

  • Length and concat are by far the most common string operations. There are several cases where null terminated strings can be more efficient, but these occur much less often.
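The complexity claims in this list can be illustrated with a minimal sketch. The `counted_str` type and both function names are hypothetical, introduced only for this example, and the counted version assumes the destination already has spare capacity (it appends in place rather than building a new string).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical counted string: the length lives beside the bytes. */
typedef struct {
    size_t len;   /* bytes in use */
    size_t cap;   /* bytes available in buf */
    char  *buf;   /* not NUL-terminated */
} counted_str;

/* O(n + m): strcat first walks dst to find its terminator, then copies src.
 * The caller must guarantee that dst has room for the result. */
void append_nul(char *dst, const char *src)
{
    strcat(dst, src);
}

/* O(m): the end of dst is already known, so only src is touched.
 * Returns 0 on success, -1 if src does not fit. */
int append_counted(counted_str *dst, const char *src, size_t src_len)
{
    assert(dst != NULL && src != NULL);
    if (dst->cap - dst->len < src_len)
        return -1;
    memcpy(dst->buf + dst->len, src, src_len);
    dst->len += src_len;
    return 0;
}
```

Appending in place is the case where the O(m) bound holds; building a brand new string still has to copy both inputs in either representation.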



From answers below, these are some cases where null terminated strings are more efficient:





  • When you need to cut off the start of a string and need to pass it to some method. You can't really do this in constant time with length prefixing even if you are allowed to destroy the original string, because the length prefix probably needs to follow alignment rules.

  • In some cases where you're just looping through the string character by character you might be able to save a CPU register. Note that this works only in the case that you haven't dynamically allocated the string (Because then you'd have to free it, necessitating using that CPU register you saved to hold the pointer you originally got from malloc and friends).



None of the above are nearly as common as length and concat.




There's one more asserted in the answers below:





  • You need to cut off the end of the string



but this one is incorrect -- it's the same amount of time for null terminated and length prefixed strings. (Null terminated strings just stick a null where you want the new end to be, length prefixers just subtract from the prefix.)
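As a rough illustration of that parenthetical, both truncations are a single store. The function names are made up for this sketch, and the prefixed variant assumes a Pascal-style layout with a one-byte length in front of the data.

```c
#include <assert.h>
#include <stddef.h>

/* NUL-terminated: overwrite the byte at the new end with the terminator.
 * The caller guarantees new_len is not past the current end. */
void truncate_nul(char *s, size_t new_len)
{
    assert(s != NULL);
    s[new_len] = '\0';
}

/* Length-prefixed (assumed layout: p[0] is the length, data follows):
 * shrinking the string only rewrites the prefix. */
void truncate_prefixed(unsigned char *p, unsigned char new_len)
{
    assert(p != NULL && new_len <= p[0]);
    p[0] = new_len;
}
```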



More replies

I always thought it was a rite of passage for all C++ programmers to write their own string library.


What's this about expecting rational explanations now. I suppose you'll want to hear a rationale for x86 or DOS next? As far as I'm concerned, the worst technology wins. Every time. And the worst string representation.


Why do you claim length prefixing strings are superior? After all, C became popular because it used null-terminated strings, which set it apart from the other languages.


@Daniel: C became popular because it's a simple, efficient, and portable representation of programs executable on Von Neumann machines, and because it was used for Unix. It certainly isn't because it decided to use null terminated strings. If it was a good design decision, people would have copied it, and they haven't. They've certainly copied pretty much everything else from C.


Concat is only O(m) with length-prefixing if you destroy one of the strings. Otherwise, same speed. The most common uses of C strings (historically) were printing and scanning. In both of these, null-termination is faster because it saves one register.

Recommended answers

From the horse's mouth





None of BCPL, B, or C supports
character data strongly in the
language; each treats strings much
like vectors of integers and
supplements general rules by a few
conventions. In both BCPL and B a
string literal denotes the address of
a static area initialized with the
characters of the string, packed into
cells. In BCPL, the first packed byte
contains the number of characters in
the string; in B, there is no count
and strings are terminated by a
special character, which B spelled
*e. This change was made partially
to avoid the limitation on the length
of a string caused by holding the
count in an 8- or 9-bit slot, and
partly because maintaining the count
seemed, in our experience, less
convenient than using a terminator.




Dennis M Ritchie, Development of the C Language




C doesn't have a string as part of the language. A 'string' in C is just a pointer to char. So maybe you're asking the wrong question.



"What's the rationale for leaving out a string type" might be more relevant. To that I would point out that C is not an object oriented language and only has basic value types. A string is a higher level concept that has to be implemented by in some way combining values of other types. C is at a lower level of abstraction.



in light of the raging squall below:


I just want to point out that I'm not trying to say this is a stupid or bad question, or that the C way of representing strings is the best choice. I'm trying to clarify that the question would be more succinctly put if you take into account the fact that C has no mechanism for differentiating a string as a datatype from a byte array. Is this the best choice in light of the processing and memory power of today's computers? Probably not. But hindsight is always 20/20 and all that :)



The question is asked as a Length Prefixed Strings (LPS) vs zero terminated strings (SZ) thing, but it mostly exposes the benefits of length prefixed strings. That may seem overwhelming, but to be honest we should also consider the drawbacks of LPS and the advantages of SZ.



As I understand it, the question may even be understood as a biased way to ask "what are the advantages of Zero Terminated Strings ?".




Advantages (I see) of Zero Terminated Strings:





  • very simple, no need to introduce new concepts in language, char
    arrays/char pointers can do.

  • the core language just includes minimal syntactic sugar to convert
    something between double quotes to a
    bunch of chars (really a bunch of
    bytes). In some cases it can be used
    to initialize things completely
    unrelated with text. For instance xpm
    image file format is a valid C source
    that contains image data encoded as a
    string.

  • by the way, you can put a zero in a string literal, the compiler will
    just also add another one at the end of the literal: "this\0is\0valid\0C".
    Is it a string ? or four strings ? Or a bunch of bytes...

  • flat implementation, no hidden indirection, no hidden integer.

  • no hidden memory allocation involved (well, some infamous non
    standard functions like strdup
    perform allocation, but that's mostly
    a source of problem).

  • no specific issue for small or large hardware (imagine the burden to
    manage 32 bits prefix length on 8
    bits microcontrollers, or the
    restrictions of limiting string size
    to less than 256 bytes, that was a problem I actually had with Turbo Pascal eons ago).

  • implementation of string manipulation is just a handful of
    very simple library functions

  • efficient for the main use of strings : constant text read
    sequentially from a known start
    (mostly messages to the user).

  • the terminating zero is not even mandatory, all necessary tools
    to manipulate chars like a bunch of
    bytes are available. When performing
    array initialisation in C, you can
    even avoid the NUL terminator. Just
    set the right size. char a[3] =
    "foo";
    is valid C (not C++) and
    won't put a final zero in a.

  • coherent with the unix point of view "everything is file", including
    "files" that have no intrinsic length
    like stdin, stdout. You should remember that the open, read and write primitives are implemented
    at a very low level. They are not library calls, but system calls. And the same API is used
    for binary or text files. File reading primitives get a buffer address and a size and return
    the new size. And you can use strings as the buffer to write. Using another kind of string
    representation would imply you can't easily use a literal string as the buffer to output, or
    you would have to make it have a very strange behavior when casting it to char*. Namely
    not to return the address of the string, but instead to return the actual data.

  • very easy to manipulate text data read from a file in-place, without useless copy of buffer,
    just insert zeroes at the right places (well, this does not work on string literals in modern C, as they are nowadays usually kept in a non-modifiable data segment).

  • prepending some int value of whatever size would imply alignment issues. The initial
    length should be aligned, but there is no reason to do that for the character data (and
    again, forcing alignment of strings would imply problems when treating them as a bunch of
    bytes).

  • length is known at compile time for constant literal strings (sizeof). So why would
    anyone want to store it in memory prepending it to actual data ?

  • in a way C is doing as (nearly) everyone else: strings are viewed as arrays of char. As array length is not managed by C, it is logical that length is not managed for strings either. The only surprising thing is the 0 item added at the end, but that happens only at core language level when typing a string between double quotes. Users can perfectly well call string manipulation functions passing a length, or even use plain memcpy instead. SZ are just a facility. In most other languages array length is managed, so it is logical that the same holds for strings.

  • in modern times anyway, 1-byte character sets are not enough and you often have to deal with encoded Unicode strings where the number of characters is very different from the number of bytes. It implies that users will probably want more than "just the size", but also other information. Keeping the length gives us nothing (in particular, no natural place to store it) regarding these other useful pieces of information.
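Two of the literal-related points in the list above (embedded zeros in a literal, and an array initializer that drops the terminator) can be checked with a small sketch. The identifiers `packed`, `no_nul` and `count_pieces` are invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A literal with embedded zeros: the compiler still appends one more. */
const char packed[] = "this\0is\0valid\0C";

/* Exactly three bytes, no terminator: valid C, rejected by C++. */
const char no_nul[3] = "foo";

/* Count the NUL-terminated pieces stored back to back in buf[0..size).
 * Assumes the last byte of the buffer is a terminator. */
size_t count_pieces(const char *buf, size_t size)
{
    assert(buf != NULL && size > 0 && buf[size - 1] == '\0');
    size_t n = 0;
    for (size_t off = 0; off < size; off += strlen(buf + off) + 1)
        n++;
    return n;
}
```

Here `sizeof packed` covers all sixteen bytes while `strlen(packed)` stops at the first zero, so the same storage reads as one string or as four, depending on how the caller walks it.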



That said, there is no need to complain in the rare cases where standard C strings are indeed inefficient. Libs are available. If I followed that trend, I should complain that standard C does not include any regex support functions... but really everybody knows it's not a real problem, as there are libraries available for that purpose. So when string manipulation efficiency is wanted, why not use a library like bstring? Or even C++ strings?



EDIT: I recently had a look at D strings. It is interesting to see that the solution chosen is neither a size prefix nor zero termination. As in C, literal strings enclosed in double quotes are just shorthand for immutable char arrays, and the language also has a string keyword meaning exactly that (immutable char array).



But D arrays are much richer than C arrays. In the case of static arrays, the length is known at compile time, so there is no need to store it at run time. In the case of dynamic arrays, the length is available, but the D documentation does not state where it is kept. For all we know, the compiler could choose to keep it in some register, or in some variable stored far away from the character data.



On normal char arrays or non-literal strings there is no final zero, hence the programmer has to add it himself if he wants to call some C function from D. In the particular case of literal strings, however, the D compiler still puts a zero at the end of each string (to allow an easy cast to C strings, making it easier to call C functions?), but this zero is not part of the string (D does not count it in the string size).



The only thing that disappointed me somewhat is that strings are supposed to be UTF-8, but length apparently still returns a number of bytes (at least it's true on my compiler, gdc) even when using multi-byte chars. It is unclear to me whether it's a compiler bug or on purpose. (OK, I have probably found out what happened. To tell the D compiler that your source uses UTF-8 you have to put some stupid byte order mark at the beginning. I write stupid because I know of no editor doing that, especially for UTF-8, which is supposed to be ASCII compatible).



I think it has historical reasons; I found this on Wikipedia:




At the time C (and the languages that
it was derived from) were developed,
memory was extremely limited, so using
only one byte of overhead to store the
length of a string was attractive. The
only popular alternative at that time,
usually called a "Pascal string"
(though also used by early versions of
BASIC), used a leading byte to store
the length of the string. This allows
the string to contain NUL and made
finding the length need only one
memory access (O(1) (constant) time).
But one byte limits the length to 255.
This length limitation was far more
restrictive than the problems with the
C string, so the C string in general
won out.




Calavera is right, but as people don't seem to get his point, I'll provide some code examples.




First, let's consider what C is: a simple language, where all code has a pretty direct translation into machine language. All types fit into registers and on the stack, and it doesn't require an operating system or a big run-time library to run, since it was meant to write these things (a task to which it is superbly well-suited, considering there isn't even a likely competitor to this day).



If C had a string type, like int or char, it would be a type which didn't fit in a register or in the stack, and would require memory allocation (with all its supporting infrastructure) to be handled in any way. All of which go against the basic tenets of C.




So, a string in C is:




char *s;


So, let's assume then that this were length-prefixed. Let's write the code to concatenate two strings:




#include <stdlib.h>
#include <string.h>

char* concat(char* s1, char* s2)
{
    /* What? What is the type of the length of the string? */
    int l1 = *(int*) s1;
    /* How much? How much must I skip? */
    char *s1s = s1 + sizeof(int);
    int l2 = *(int*) s2;
    char *s2s = s2 + sizeof(int);
    int l3 = l1 + l2;
    char *s3 = (char*) malloc(l3 + sizeof(int));
    if (s3 == NULL) return NULL; /* allocation failed */
    char *s3s = s3 + sizeof(int);
    memcpy(s3s, s1s, l1);
    memcpy(s3s + l1, s2s, l2);
    *(int*) s3 = l3;
    return s3;
}


Another alternative would be using a struct to define a string:




struct string {
    int len; /* cannot be left implementation-defined */
    char* buf;
};


At this point, all string manipulation would require two allocations to be made, which, in practice, means you'd go through a library to do any handling of it.




The funny thing is... structs like that do exist in C! They are just not used for the day-to-day handling of messages displayed to the user.



So, here is the point Calavera is making: there is no string type in C. To do anything with one, you'd have to take a pointer and decode it as a pointer to two different types, and then the size of the string's length becomes very relevant and cannot just be left as "implementation defined".



Now, C can handle memory in any way, and the mem functions in the library (in <string.h>, even!) provide all the tooling you need to handle memory as a pair of pointer and size. The so-called "strings" in C were created for just one purpose: showing messages in the context of writing an operating system intended for text terminals. And, for that, null termination is enough.
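As a small sketch of the pointer-and-size style, the standard `memchr` and `memcmp` work on data with embedded zeros, where `strlen` would stop early. The wrapper names `find_byte` and `regions_equal` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Offset of the first byte equal to c in buf[0..len), or len if absent. */
size_t find_byte(const void *buf, size_t len, int c)
{
    assert(buf != NULL);
    const unsigned char *hit = memchr(buf, c, len);
    return hit ? (size_t)(hit - (const unsigned char *)buf) : len;
}

/* Compare two (pointer, size) regions for equality, zeros included. */
int regions_equal(const void *a, size_t alen, const void *b, size_t blen)
{
    return alen == blen && memcmp(a, b, alen) == 0;
}
```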




Obviously for performance and safety, you'll want to keep the length of a string while you're working with it rather than repeatedly performing strlen or the equivalent on it. However, storing the length in a fixed location just before the string contents is an incredibly bad design. As Jörgen pointed out in the comments on Sanjit's answer, it precludes treating the tail of a string as a string, which for example makes a lot of common operations like path_to_filename or filename_to_extension impossible without allocating new memory (and incurring the possibility of failure and error handling). And then of course there's the issue that nobody can agree how many bytes the string length field should occupy (plenty of bad "Pascal string" languages used 16-bit fields or even 24-bit fields which preclude processing of long strings).
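The tail-of-a-string point can be sketched as follows. The two function names are taken from the paragraph above, but the bodies are only one plausible implementation, assuming '/' as the path separator.

```c
#include <assert.h>
#include <string.h>

/* Part of path after the last '/', or path itself if there is none.
 * The result points into the caller's string; nothing is allocated. */
const char *path_to_filename(const char *path)
{
    assert(path != NULL);
    const char *slash = strrchr(path, '/');
    return slash ? slash + 1 : path;
}

/* Extension including the '.', or the empty tail if there is none. */
const char *filename_to_extension(const char *name)
{
    assert(name != NULL);
    const char *dot = strrchr(name, '.');
    return dot ? dot : name + strlen(name);
}
```

Both results are ordinary NUL-terminated strings that share storage with the input, which is what a length stored immediately before the data would rule out.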




C's design of letting the programmer choose if/where/how to store the length is much more flexible and powerful. But of course the programmer has to be smart. C punishes stupidity with programs that crash, grind to a halt, or give your enemies root.




Laziness, register frugality and portability, considering the assembly guts of any language, especially C, which is one step above assembly (thus inheriting a lot of assembly legacy code).
You would agree that a null char would be useless in those ASCII days, so it was free to use as a terminator (and probably as good as an EOF control char).



let's see in pseudo code




function readString(string) // 1 parameter: 1 register or 1 stack entry
    pointer=addressOf(string)
    while(string[pointer]!=CONTROL_CHAR) do
        read(string[pointer])
        increment pointer


total 1 register use




case 2




function readString(length,string) // 2 parameters: 2 registers used or 2 stack entries
    pointer=addressOf(string)
    while(length>0) do
        read(string[pointer])
        increment pointer
        decrement length


total 2 register used




That might seem shortsighted today, but consider the frugality in code and registers (which were at a PREMIUM at that time, the time when, you know, they used punch cards). Being faster as well (when processor speed could be counted in kHz), this "hack" was pretty darn good and easily portable to register-less processors.
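In C, the two loop shapes being compared might look like the sketch below. The function names are invented, and the summing body is just a stand-in for "read each character". How many registers a modern optimizing compiler actually uses is not guaranteed by this code; it only shows which values each loop has to keep alive.

```c
#include <assert.h>
#include <stddef.h>

/* Sentinel loop: besides the accumulator, the pointer is the only loop state. */
unsigned sum_sentinel(const char *s)
{
    assert(s != NULL);
    unsigned sum = 0;
    while (*s != '\0')
        sum += (unsigned char)*s++;
    return sum;
}

/* Counted loop: a pointer and a remaining count are both live. */
unsigned sum_counted(const char *s, size_t len)
{
    assert(s != NULL);
    unsigned sum = 0;
    while (len-- > 0)
        sum += (unsigned char)*s++;
    return sum;
}
```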




For argument sake I will implement 2 common string operation




stringLength(string)
    pointer=addressOf(string)
    while(string[pointer]!=CONTROL_CHAR) do
        increment pointer
    return pointer-addressOf(string)


complexity O(n) where in most case PASCAL string is O(1) because the length of the string is pre-pended to the string structure (that would also mean that this operation would have to be carried in an earlier stage).




concatString(string1,string2)
    length1=stringLength(string1)
    length2=stringLength(string2)
    string3=allocate(length1+length2+1) // +1 for the terminator
    pointer1=addressOf(string1)
    pointer3=addressOf(string3)
    while(string1[pointer1]!=CONTROL_CHAR) do
        string3[pointer3]=string1[pointer1]
        increment pointer3
        increment pointer1
    pointer2=addressOf(string2)
    while(string2[pointer2]!=CONTROL_CHAR) do
        string3[pointer3]=string2[pointer2]
        increment pointer3
        increment pointer2
    string3[pointer3]=CONTROL_CHAR
    return string3


complexity O(n); prepending the string length wouldn't change the complexity of the operation, although I admit it would take about a third of the time.



On the other hand, if you use PASCAL strings you would have to redesign your API to take into account register length and byte endianness. The PASCAL string had the well-known limitation of 255 chars (0xFF) because the length was stored in 1 byte (8 bits), and if you wanted a longer string (16 bits -> anything) you would have to take the architecture into account in one layer of your code, which in most cases would mean incompatible string APIs if you wanted longer strings.



Example:




One file was written with your prepended string API on an 8-bit computer and then has to be read on, say, a 32-bit computer. What would the lazy program do? It would consider that your 4 bytes are the length of the string, allocate that much memory, then attempt to read that many bytes.
Another case would be a 32-bit PPC string (big endian) read on an x86 (little endian); of course, if you don't know that one was written by the other, there would be trouble.
A 1-byte length (0x00000001) would become 16777216 (0x01000000), that is 16 MB for reading a 1-byte string.
Of course you would say that people should agree on one standard, but even 16-bit Unicode comes in little and big endianness.




Of course C would have its issues too, but it would be very little affected by the issues raised here.



In many ways, C was primitive. And I loved it.




It was a step above assembly language, giving you nearly the same performance with a language that was much easier to write and maintain.




The null terminator is simple and requires no special support by the language.




Looking back, it doesn't seem that convenient. But I used assembly language back in the 80s and it seemed very convenient at the time. I just think software is continually evolving, and the platforms and tools continually get more and more sophisticated.




Assuming for a moment that C implemented strings the Pascal way, by prefixing them with their length: is a 7-char-long string the same DATA TYPE as a 3-char string? If the answer is yes, then what kind of code should the compiler generate when I assign the former to the latter? Should the string be truncated, or automatically resized? If resized, should that operation be protected by a lock so as to make it thread safe? The C approach sidestepped all these issues, like it or not :)



Somehow I understood the question to imply there's no compiler support for length-prefixed strings in C. The following example shows that you can at least start your own C string library, where string lengths are counted at compile time, with a construct like this:



#include <stdio.h>

#define PREFIX_STR(s) ((prefix_str_t){ sizeof(s)-1, (s) })

typedef struct { int n; char * p; } prefix_str_t;

int main() {
    prefix_str_t string1, string2;

    string1 = PREFIX_STR("Hello!");
    string2 = PREFIX_STR("Allows \0 chars (even if printf directly doesn't)");

    printf("%d %s\n", string1.n, string1.p); /* prints: "6 Hello!" */
    printf("%d %s\n", string2.n, string2.p); /* prints: "48 Allows " */

    return 0;
}


This is not, however, free of issues: you need to track when the string pointer must be explicitly freed and when it is statically allocated (literal char array).



Edit: As a more direct answer to the question, my view is this was the way C could support both having string length available (as a compile time constant), should you need it, but still with no memory overhead if you want to use only pointers and zero termination.




Of course it seems like working with zero-terminated strings was the recommended practice, since the standard library in general doesn't take string lengths as arguments, and since extracting the length isn't as straightforward code as char * s = "abc", as my example shows.





"Even on a 32 bit machine, if you allow the string to be the size of available memory, a length prefixed string is only three bytes wider than a null terminated string."




First, an extra 3 bytes may be considerable overhead for short strings. In particular, a zero-length string now takes 4 times as much memory. Some of us are using 64-bit machines, so we either need 8 bytes to store a zero-length string, or the string format can't cope with the longest strings the platform supports.



There may also be alignment issues to deal with. Suppose I have a block of memory containing 7 strings, like "solo\0second\0\0four\0five\0\0seventh". The second string starts at offset 5. The hardware may require that 32-bit integers be aligned at an address that is a multiple of 4, so you have to add padding, increasing the overhead even further. The C representation is very memory-efficient in comparison. (Memory-efficiency is good; it helps cache performance, for example.)
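To put rough numbers on this, the sketch below totals the storage for the seven example strings in both layouts. It assumes a 4-byte length prefix and that every prefixed record is padded to a multiple of 4 so the next prefix stays aligned; the function names are invented for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Bytes needed to pack n strings back to back, each with a NUL terminator. */
size_t packed_nul_size(const char *const strs[], size_t n)
{
    assert(strs != NULL);
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += strlen(strs[i]) + 1;
    return total;
}

/* Bytes needed when each string carries a 4-byte length prefix and each
 * record is padded so the following prefix lands on a 4-byte boundary. */
size_t packed_prefixed_size(const char *const strs[], size_t n)
{
    assert(strs != NULL);
    size_t total = 0;
    for (size_t i = 0; i < n; i++) {
        size_t item = 4 + strlen(strs[i]);
        total += (item + 3) & ~(size_t)3;   /* round up to a multiple of 4 */
    }
    return total;
}
```

For the block "solo", "second", "", "four", "five", "", "seventh", this gives 32 bytes with terminators against 56 bytes with aligned prefixes.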




One point not yet mentioned: when C was designed, there were many machines where a 'char' was not eight bits (even today there are DSP platforms where it isn't). If one decides that strings are to be length-prefixed, how many 'char's worth of length prefix should one use? Using two would impose an artificial limit on string length for machines with 8-bit char and 32-bit addressing space, while wasting space on machines with 16-bit char and 16-bit addressing space.




If one wanted to allow arbitrary-length strings to be stored efficiently, and if 'char' were always 8-bits, one could--for some expense in speed and code size--define a scheme where a string prefixed by an even number N would be N/2 bytes long, a string prefixed by an odd value N and an even value M (reading backward) could be ((N-1) + M*char_max)/2, etc. and require that any buffer which claims to offer a certain amount of space to hold a string must allow enough bytes preceding that space to handle the maximum length. The fact that 'char' isn't always 8 bits, however, would complicate such a scheme, since the number of 'char' required to hold a string's length would vary depending upon the CPU architecture.



The null termination allows for fast pointer based operations.




Not a Rationale necessarily but a counterpoint to length-encoded





  1. Certain forms of dynamic length encoding are superior to static length encoding as far as memory is concerned, it all depends on usage. Just look at UTF-8 for proof. It's essentially an extensible character array for encoding a single character. This uses a single bit for each extended byte. NUL termination uses 8 bits. Length-prefix I think can be reasonably termed infinite length as well by using 64 bits. How often you hit the case of your extra bits is the deciding factor. Only 1 extremely large string? Who cares if you're using 8 or 64 bits? Many small strings (Ie Strings of English words)? Then your prefix costs are a large percentage.


  2. Length-prefixed strings allowing time savings is not a real thing. Whether your supplied data is required to have length provided, you're counting at compile time, or you're truly being provided dynamic data that you must encode as a string. These sizes are computed at some point in the algorithm. A separate variable to store the size of a null terminated string can be provided. Which makes the comparison on time-savings moot. One just has an extra NUL at the end... but if the length encode doesn't include that NUL then there's literally no difference between the two. There's no algorithmic change required at all. Just a pre-pass you have to manually design yourself instead of having a compiler/runtime do it for you. C is mostly about doing things manually.


  3. Length-prefix being optional is a selling point. I don't always need that extra info for an algorithm, so being required to compute it for every string means my precompute+compute time can never drop below O(n). (E.g. a hardware random number generator producing 1-128. I can pull from an "infinite string". Let's say it only generates characters so fast. So our string length changes all the time. But my usage of the data probably doesn't care how many random bytes I have. It just wants the next available unused byte as soon as it can get it after a request. I could be waiting on the device. But I could also have a buffer of characters pre-read. A length comparison is a needless waste of computation. A null check is more efficient.)


  4. Length-prefix is a good guard against buffer overflow? So is sane usage of library functions and implementation. What if I pass in malformed data? My buffer is 2 bytes long but I tell the function it's 7! Ex: If gets() was intended to be used on known data it could've had an internal buffer check that tested compiled buffers and malloc() calls and still follow spec. If it was meant to be used as a pipe for unknown STDIN to arrive at an unknown buffer then clearly one can't know about the buffer size, which means a length arg is pointless; you need something else here like a canary check. For that matter, you can't length-prefix some streams and inputs, you just can't. Which means the length check has to be built into the algorithm and not a magic part of the typing system. TL;DR NUL-terminated never had to be unsafe, it just ended up that way via misuse.


  5. counter-counter point: NUL-termination is annoying on binary. You either need to do length-prefix here or transform NUL bytes in some way: escape-codes, range remapping, etc... which of course means more-memory-usage/reduced-information/more-operations-per-byte. Length-prefix mostly wins the war here. The only upside to a transform is that no additional functions have to be written to cover the length-prefix strings. Which means on your more optimized sub-O(n) routines you can have them automatically act as their O(n) equivalents without adding more code. Downside is, of course, time/memory/compression waste when used on NUL heavy strings. Depending on how much of your library you end up duplicating to operate on binary data, it may make sense to work solely with length-prefix strings. That said one could also do the same with length-prefix strings... -1 length could mean NUL-terminated and you could use NUL-terminated strings inside length-terminated.


  6. Concat: "O(n+m) vs O(m)" I'm assuming you're referring to m as the total length of the string after concatenating because they both have to have that number of operations minimum (you can't just tack-on to string 1, what if you have to realloc?). And I'm assuming n is a mythical amount of operations you no longer have to do because of a pre-compute. If so, then the answer is simple: pre-compute. If you're insisting you'll always have enough memory to not need to realloc and that's the basis of the big-O notation then the answer is even more simple: do binary search on allocated memory for end of string 1, clearly there's a large swath of infinite zeros after string 1 for us to not worry about realloc. There, easily got n to log(n) and I barely tried. Which if you recall log(n) is essentially only ever as large as 64 on a real computer, which is essentially like saying O(64+m), which is essentially O(m). (And yes that logic has been used in run-time analysis of real data structures in-use today. It's not bullshit off the top of my head.)


  7. Concat()/Len() again: Memoize results. Easy. Turns all computes into pre-computes if possible/necessary. This is an algorithmic decision. It's not an enforced constraint of the language.


  8. String suffix passing is easier/possible with NUL termination. Depending on how length-prefix is implemented it can be destructive on original string and can sometimes not even be possible. Requiring a copy and pass O(n) instead of O(1).


  9. Argument-passing/de-referencing is less for NUL-terminated versus length-prefix. Obviously because you're passing less information. If you don't need length, then this saves a lot of footprint and allows optimizations.


  10. You can cheat. It's really just a pointer. Who says you have to read it as a string? What if you want to read it as a single character or a float? What if you want to do the opposite and read a float as a string? If you're careful you can do this with NUL-termination. You can't do this with length-prefix, it's a data type distinctly different from a pointer typically. You'd most likely have to build a string byte-by-byte and get the length. Of course if you wanted something like an entire float (probably has a NUL inside it) you'd have to read byte-by-byte anyway, but the details are left to you to decide.
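As a small illustration of point 8, here is a sketch of O(1) suffix passing with a NUL-terminated string. The helper name base_name is hypothetical; only strrchr is a real library call.

```c
#include <string.h>

/* Return the part of path after the last '/', or the whole string if
   there is no '/'. Nothing is copied: the result points into the
   caller's buffer and shares the original terminator. */
const char *base_name(const char *path)
{
    const char *slash = strrchr(path, '/');
    return slash ? slash + 1 : path;
}
```

With a length stored in front of the character data, the same suffix would need its own header, so you would either copy the tail or overwrite bytes of the original. A separate pointer-plus-length pair avoids that, at the cost of passing two values instead of one.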




TL;DR Are you using binary data? If no, then NUL-termination allows more algorithmic freedom. If yes, then code quantity vs speed/memory/compression is your main concern. A blend of the two approaches or memoization might be best.




Many design decisions surrounding C stem from the fact that when it was originally implemented, parameter passing was somewhat expensive. Given a choice between e.g.




/* K&R-style definition: base address and index passed separately */
void add_element_to_next(arr, offset)
char arr[];
int offset;
{
    arr[offset] += arr[offset+1];
}

char array[40];

void test()
{
    int i;
    for (i=0; i<39; i++)
        add_element_to_next(array, i);
}


versus



/* K&R-style definition: a single pointer combines base and index */
void add_element_to_next(p)
char *p;
{
    p[0] += p[1];
}

char array[40];

void test()
{
    int i;
    for (i=0; i<39; i++)
        add_element_to_next(array+i);
}


the latter would have been slightly cheaper (and thus preferred) since it only required passing one parameter rather than two. If the method being called didn't need to know the base address of the array nor the index within it, passing a single pointer combining the two would be cheaper than passing the values separately.




While there are many reasonable ways in which C could have encoded string lengths, the approaches that had been invented up to that time would all have required any function that needs to work with part of a string to accept the base address of the string and the desired index as two separate parameters. Using zero-byte termination made it possible to avoid that requirement. Although other approaches would be better with today's machines (modern compilers often pass parameters in registers, and memcpy can be optimized in ways strcpy()-equivalents cannot), enough production code uses zero-byte terminated strings that it's hard to change to anything else.



PS--In exchange for a slight speed penalty on some operations, and a tiny bit of extra overhead on longer strings, it would have been possible to have methods that work with strings accept pointers directly to strings, bounds-checked string buffers, or data structures identifying substrings of another string. A function like "strcat" would have looked something like [modern syntax]




void strcat(unsigned char *dest, unsigned char *src)
{
    struct STRING_INFO d,s;
    str_size_t copy_length;

    get_string_info(&d, dest);
    get_string_info(&s, src);
    if (d.si_buff_size > d.si_length) // Destination is resizable buffer
    {
        // Copy no more than the space remaining in the destination
        copy_length = d.si_buff_size - d.si_length;
        if (s.si_length < copy_length)
            copy_length = s.si_length;
        memcpy(d.buff + d.si_length, s.buff, copy_length);
        d.si_length += copy_length;
        update_string_length(&d);
    }
}


A little bigger than the K&R strcat method, but it would support bounds-checking, which the K&R method doesn't. Further, unlike the current method, it would be possible to easily concatenate an arbitrary substring, e.g.




/* Concatenate 10th through 24th characters from src to dest */

void catpart(unsigned char *dest, unsigned char *src)
{
    struct SUBSTRING_INFO inf; /* storage for the temporary substring descriptor */
    src = temp_substring(&inf, src, 10, 24);
    strcat(dest, src);
}


Note that the lifetime of the string returned by temp_substring would be limited by those of inf and src, whichever was shorter (which is why the method requires inf to be passed in--if it were local to temp_substring, it would die when that method returned).



In terms of memory cost, strings and buffers up to 64 bytes would have one byte of overhead (same as zero-terminated strings); longer strings would have slightly more (whether one allowed amounts of overhead between two bytes and the maximum required would be a time/space tradeoff). A special value of the length/mode byte would be used to indicate that a string function was given a structure containing a flag byte, a pointer, and a buffer length (which could then index arbitrarily into any other string).




Of course, K&R didn't implement any such thing, but that's most likely because they didn't want to spend much effort on string handling--an area where even today many languages seem rather anemic.




According to Joel Spolsky in this blog post,





It's because the PDP-7 microprocessor, on which UNIX and the C programming language were invented, had an ASCIZ string type. ASCIZ meant "ASCII with a Z (zero) at the end."




After seeing all the other answers here, I'm convinced that even if this is true, it's only part of the reason for C having null-terminated "strings". That post is quite illuminating as to how simple things like strings can actually be quite hard.




I don't buy the "C has no string" answer. True, C does not support built-in higher-level types, but you can still represent data structures in C and that's what a string is. The fact that a string is just a pointer in C does not mean the first N bytes cannot take on special meaning as the length.



Windows/COM developers will be very familiar with the BSTR type, which is exactly like this - a length-prefixed C string where the actual character data does not start at byte 0.



So it seems that the decision to use null-termination is simply what people preferred, not a necessity of the language.
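For anyone who hasn't seen that kind of layout, here is a rough sketch of a length-prefixed string in plain C, loosely modeled on the idea behind BSTR. The names lp_new, lp_len and lp_free are invented for this example, and the header here is a size_t holding a byte count, which is an assumption of the sketch and not the real BSTR definition.

```c
#include <stdlib.h>
#include <string.h>

/* Allocate header + data + terminator in one block and hand back a
   pointer to the data, so the result can still be passed to functions
   that expect a NUL-terminated char pointer. */
char *lp_new(const char *src, size_t len)
{
    char *block = malloc(sizeof(size_t) + len + 1);
    if (block == NULL)
        return NULL;
    memcpy(block, &len, sizeof len);            /* length header */
    memcpy(block + sizeof(size_t), src, len);   /* payload, may contain NULs */
    block[sizeof(size_t) + len] = '\0';         /* terminator for C APIs */
    return block + sizeof(size_t);
}

/* O(1) length: read the header stored just before the data. */
size_t lp_len(const char *s)
{
    size_t len;
    memcpy(&len, s - sizeof(size_t), sizeof len);
    return len;
}

/* Free the whole block, starting from the hidden header. */
void lp_free(char *s)
{
    if (s != NULL)
        free(s - sizeof(size_t));
}
```

The catch is the one raised elsewhere in this thread: nothing in the type system says whether a given char pointer has a header in front of it, so lp_len on an ordinary string is undefined behavior.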




One advantage of NUL-termination over length-prefixing, which I have not seen anyone mention, is the simplicity of string comparison. Consider the comparison standard which returns a signed result for less-than, equal, or greater-than. For length-prefixing the algorithm has to be something along the following lines:




  1. Compare the two lengths; record the smaller, and note if they are equal (this last step might be deferred to step 3).

  2. Scan the two character sequences, subtracting characters at matching indices (or use a dual pointer scan). Stop either when the difference is nonzero, returning the difference, or when the number of characters scanned is equal to the smaller length.

  3. When the smaller length is reached, one string is a prefix of the other. Return negative or positive value according to which is shorter, or zero if of equal length.


Contrast this with the NUL-termination algorithm:




  1. Scan the two character sequences, subtracting characters at matching indices [note that this is handled better with moving pointers]. Stop when the difference is nonzero, returning the difference. NOTE: If one string is a PROPER prefix of the other, one of the characters in the subtraction will be NUL, i.e. zero, and the comparison will naturally stop there.

  2. If the difference is zero, -only then- check if either character is NUL. If so, return zero, otherwise continue to next character.


The NUL-terminated case is simpler, and very easy to implement efficiently with a dual pointer scan. The length-prefixed case does at least as much work, nearly always more. If your algorithm has to do a lot of string comparisons [e.g. a compiler!], the NUL-terminated case wins out. Nowadays that might not be as important, but back in the day, heck yeah.
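The two algorithms above can be sketched in C roughly as follows. The names cmp_nul and cmp_len are made up for this example, and cmp_len takes the lengths as separate arguments instead of reading an in-band prefix, which is a simplifying assumption.

```c
#include <stddef.h>

/* NUL-terminated: one loop, and the terminator check only happens
   when the two characters are equal. */
int cmp_nul(const char *a, const char *b)
{
    for (;; a++, b++) {
        int diff = (unsigned char)*a - (unsigned char)*b;
        if (diff != 0)
            return diff;     /* also covers the proper-prefix case */
        if (*a == '\0')
            return 0;        /* both strings ended together */
    }
}

/* Length-based: find the shorter length, scan that many characters,
   then break ties on length. */
int cmp_len(const char *a, size_t alen, const char *b, size_t blen)
{
    size_t n = alen < blen ? alen : blen;
    size_t i;
    for (i = 0; i < n; i++) {
        int diff = (unsigned char)a[i] - (unsigned char)b[i];
        if (diff != 0)
            return diff;
    }
    if (alen == blen)
        return 0;
    return alen < blen ? -1 : 1;
}
```

The length-based version needs a loop counter and a final length comparison that the NUL-terminated version gets for free from the data.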




gcc accepts the code below:



char s[4] = "abcd";




and it's OK if we treat it as an array of chars rather than a string. That is, we can access it with s[0], s[1], s[2], and s[3], or even with memcpy(dest, s, 4). But we'll get garbage characters when we try puts(s), or worse with strcpy(dest, s), because there is no terminating NUL to stop at.
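A short sketch of the safe ways to handle such an array follows. The function names copy_tag and print_tag are just for illustration. Note this initializer is valid C but is rejected by C++.

```c
#include <stdio.h>
#include <string.h>

/* Exactly four characters: the initializer leaves no room for a NUL. */
static const char s[4] = "abcd";

/* Safe: copy a known number of bytes, then terminate explicitly. */
void copy_tag(char dest[5])
{
    memcpy(dest, s, sizeof s);
    dest[sizeof s] = '\0';
}

/* Safe: the precision limits how many bytes printf will read. */
void print_tag(void)
{
    printf("%.4s\n", s);
}
```

Anything that searches for a terminator, such as puts(s) or strlen(s), reads past the end of the array, which is undefined behavior.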




I think the better question is why you think C owes you anything. C was designed to give you what you need, nothing more. You need to lose the mentality that the language must provide you with everything. Or just continue to use your higher-level languages that will give you the luxury of String, Calendar, Containers; and in the case of Java you get one thing in tonnes of variety: multiple types of String, multiple types of unordered_map(s).


Too bad for you, this was not the purpose of C. C was not designed to be a bloated language that offers everything from a pin to an anchor. Instead you must rely on third party libraries or your own. And there is nothing easier than creating a simple struct that will contain a string and its size.


struct String
{
const char *s;
size_t len;
};

You know what the problem is with this though. It is not standard. Another language might decide to organize the len before the string. Another language might decide to use a pointer to end instead. Another might decide to use six pointers to make the String more efficient. However a null terminated string is the most standard format for a string; which you can use to interface with any language. Even Java JNI uses null terminated strings.



Lastly, it is a common saying: the right data structure for the task. If you find that you need to know the size of a string more than anything else, well, use a string structure that allows you to do that optimally. But don't claim that that operation is used more than anything else by everybody. Like, why is knowing the size of a string more important than reading its contents? I find that reading the contents of a string is what I mostly do, so I use null terminated strings instead of std::string, which saves me 5 pointers on a GCC compiler. If I can even save 2 pointers that is good.


More answers

Another relevant quote: "...the semantics of strings are fully subsumed by more general rules governing all arrays, and as a result the language is simpler to describe..."


char *temp = "foo bar"; is a valid statement in C... hey! isn't that a string? isn't it null terminated?


@Yanick: that's just a convenient way to tell the compiler to create an array of char with a null at the end. it's not a 'string'


@calavera: But it could have just as simply meant "Create a memory buffer with this string content and a two byte length prefix",


@Billy: well since a 'string' is really just a pointer to char, which is equivalent to a pointer to byte, how would you know that the buffer you're dealing with is really intended to be a 'string'? you would need a new type other than char/byte* to denote this. maybe a struct?


I think @calavera is right, C doesn't have a data type for strings. Ok, you can consider an array of chars like a string, but this doesn't mean it's always a string (for string I mean a sequence of characters with a definite meaning). A binary file is an array of chars, but those chars don't mean anything for a human.


... Continued... Several of your points I think are just plain wrong, e.g. the "everything is a file" argument. Files are sequential access, C strings are not. Length prefixing can also be done with minimal syntactic sugar. The only reasonable argument here is about trying to manage 32 bit prefixes on small (i.e. 8 bit) hardware; I think that could be simply solved by saying the size of the length is determined by the implementation. After all, that's what std::basic_string does.

@Billy ONeal: really there are two different parts in my answer. One is about what is part of the 'core C language', the other one is about what standard libraries should deliver. Regarding string support, there is only one item from the core language: the meaning of a bunch of bytes enclosed in double quotes. I am not really happier than you with C's behavior. I feel that magically adding that zero at the end of every bunch of bytes enclosed in double quotes is bad enough. I would prefer an explicit \0 at the end when the programmer wants that, instead of the implicit one. Prepending a length is much worse.

@Billy ONeal: that is just not true, the user cares about what is core and what is libraries. The biggest point is when C is used to implement an OS. At that level no libraries are available. C is also often used in embedded contexts or for programming devices where you often have the same kind of restrictions. In many cases Joes should probably not use C at all nowadays: "OK, you want it on the console ? Do you have a console ? No ? Too bad..."

@Billy "Well, for the .01% of C programmers implementing operating systems, fine." The other programmers can take a hike. C was created to write an operating system.


Why? Because it says it is a general purpose language? Does it say what the people who wrote it were doing when it was created? What was it used for during the first few years of its life? So, what is it that it says that disagrees with me? It is a general purpose language created to write an operating system. Does it deny it?

@muntoo Hmm... compatibility?


@muntoo: Because that would break monumental amounts of existing C and C++ code.

@muntoo: Paradigms come and go, but legacy code is forever. Any future version of C would have to continue to support 0-terminated strings, otherwise 30+ years' worth of legacy code would have to be rewritten (which isn't going to happen). And as long as the old way is available, that's what people will continue to use, since that's what they're familiar with.


@muntoo: Believe me, sometimes I wish I could. But I'd still prefer 0-terminated strings over Pascal strings.


Talk about legacy ... C++ strings are now mandated to be NUL-terminated.


1. +1. 2. Obviously if the default behavior of the language would have been made using length prefixes, there would have been other things to make that easier. For example, all your casts there would have been hidden by calls to strlen and friends instead. As for the problem with "leaving it up to the implementation", you could say that the prefix is whatever a short is on the target box. Then all your casting would still work. 3. I can come up with contrived scenarios all day long that make one or the other system look bad.


@Billy The library thing is true enough, aside from the fact that C was designed for minimal or no library usage. The use of prototypes, for instance, was not common early on. Saying the prefix is short effectively limits the size of the string, which seems to be one thing they weren't keen on. Myself, having worked with 8-bit BASIC and Pascal strings, fixed-size COBOL strings and similar things, I quickly became a huge fan of unlimited-size C strings. Nowadays, a 32-bit size will handle any practical string, but adding those bytes early on was problematic.

@Billy: First, thank you Daniel... you seem to understand what I'm getting at. Second, Billy, I think you're still missing the point that is being made here. I for one am not arguing the pros and cons of prefixing string data-types with their length. What I am saying, and what Daniel very clearly emphasized, is that there was a decision made in the implementation of C to not handle that argument at all. Strings don't exist as far as the basic language is concerned. The decision on how to handle strings is left to the programmer... and null termination became popular.


+1 by me. One further thing I'd like to add; a struct as you propose it misses an important step towards a real string type: it is not aware of characters. It's an array of "char" (a "char" in machine lingo is as much a character as a "word" is what humans would call a word in a sentence). A string of characters is a higher-level concept which could be implemented on top of an array of char if you introduced the notion of encoding.


@DanielC.Sobral: Also, the struct you mention wouldn't require two allocations. Either use it as you have it on the stack (so only buf requires an allocation), or use struct string {int len; char buf[]}; and allocate the whole thing with one allocation as a flexible array member, and pass it around as a string*. (Or, arguably, struct string {int capacity; int len; char buf[]}; for obvious performance reasons.)
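A minimal sketch of that single-allocation layout, assuming C99 flexible array members. The constructor name string_new is made up, and size_t is used for the two fields instead of the int in the comment above, which is a choice made for this sketch.

```c
#include <stdlib.h>
#include <string.h>

struct string {
    size_t capacity;  /* bytes available in buf, not counting the terminator */
    size_t len;       /* bytes currently in use */
    char buf[];       /* flexible array member: characters follow the header */
};

/* One malloc holds both the header and the characters. */
struct string *string_new(const char *src, size_t capacity)
{
    size_t len = strlen(src);
    struct string *s;

    if (len > capacity)
        capacity = len;               /* always leave room for src */
    s = malloc(sizeof *s + capacity + 1);
    if (s == NULL)
        return NULL;
    s->capacity = capacity;
    s->len = len;
    memcpy(s->buf, src, len + 1);     /* keep a terminator for C APIs */
    return s;
}
```

A single free(s) releases everything, and s->buf can still be handed to any function that expects a NUL-terminated string.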


+1. It would be nice to have a standard place to store the length though so that those of us who want something like length prefixing didn't have to write tons of "glue code" everywhere.


There's no possible standard place relative to the string data, but you can of course use a separate local variable (recomputing it rather than passing it when the latter isn't convenient and the former isn't too wasteful) or a structure with a pointer to the string (and even better, a flag indicating whether the structure "owns" the pointer for allocation purposes or whether it's a reference to a string owned elsewhere). And of course you can include a flexible array member in the structure for the flexibility to allocate the string with the structure when it suits you.

@deemoowoor: Concat: O(m+n) with nullterm strings, O(n) typical everywhere else. Length O(n) with nullterm strings, O(1) everywhere else. Join: O(n^2) with nullterm strings, O(n) everywhere else. There are some cases where null terminated strings are more efficient (i.e. the just add one to pointer case), but concat and length are by far the most common operations (length at least is required for formatting, file output, console display, etc). If you cache the length to amortize the O(n) you've merely made my point that the length should be stored with the string.


I agree that in today's code this type of string is inefficient and prone to error, but for example console display doesn't really have to know the length of the string to display it efficiently, file output didn't really need to know about string length (only allocating clusters on the go), and string formatting at that time was done on a fixed string length in most cases. Anyway, you must be writing bad code if your concat in C has O(n^2) complexity; I am pretty sure I can write one with O(n) complexity.

@dvhh: I didn't say n^2 -- I said m + n -- it's still linear, but you need to seek to the end of the original string in order to do the concatenation, whereas with a length prefix no seeking is required. (This is really just another consequence of length requiring linear time)
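To put the concat and join numbers in this exchange side by side, here is a sketch of the two ways to join several NUL-terminated words. The function names are invented; both assume the caller has made dst large enough.

```c
#include <string.h>

/* Repeated strcat: every call rescans dst from the start to find its
   end, so joining many pieces costs time quadratic in the output. */
void join_rescan(char *dst, const char *const *words, size_t n)
{
    size_t i;
    dst[0] = '\0';
    for (i = 0; i < n; i++)
        strcat(dst, words[i]);
}

/* Remembering where the output ends: each byte of the result is
   written once, so the whole join is linear. */
void join_tracked(char *dst, const char *const *words, size_t n)
{
    char *end = dst;
    size_t i;
    for (i = 0; i < n; i++) {
        size_t len = strlen(words[i]);
        memcpy(end, words[i], len);
        end += len;
    }
    *end = '\0';
}
```

The second version is the "cache the end" trick. It works fine with NUL-terminated strings, but it is also exactly the bookkeeping a stored length would have done for you.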


@Billy ONeal: from mere curiosity I did a grep on my current C project (about 50000 lines of code) for string manipulation function calls. strlen 101, strcpy and variants (strncpy, strlcpy) : 85 (I also have several hundreds of literal strings used for message, implied copies), strcmp: 56, strcat : 13 (and 6 are concatenations to zero length string to call strncat). I agree a length prefixed will speedup calls to strlen, but not to strcpy or strcmp (maybe if strcmp API does not use common prefix). The most interesting thing regarding the above comments is that strcat is very rare.


@Hurkyl: That's not true. In the null terminated case, at each comparison step you need to have the pointer to the string (1 register) load the character it points to (2 registers) and compare with 0 (3 registers). In the length prefixed case you need to compare the pointer to the string (1 register) with the pointer to the end of the string (2 registers) and load the character it points to (3 registers again).


I don't see what's any more primitive about null terminated strings than anything else. Pascal predates C and it uses length prefixing. Sure, it was limited to 255 characters per string, but simply using a 16 bit field would have solved the problem in the vast majority of cases.

The fact that it limited the number of characters is exactly the type of issue you need to think about when doing something like that. Yes, you could make it longer, but back then bytes mattered. And is a 16-bit field going to be long enough for all cases? C'mon, you must admit that null termination is conceptually primitive.

Either you limit the length of the string or you limit the content (no null characters), or you accept the extra overhead of a 4 to 8 byte count. There's no free lunch. At the time of inception the null terminated string made perfect sense. In assembly I sometimes used the top bit of a character to mark the end of a string, saving even one more byte!
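For the curious, the top-bit trick looks something like the following in C. This is only a sketch: the names hb_len and hb_to_cstr are made up, and it assumes 7-bit ASCII text and non-empty strings, since an empty string has no last character to mark.

```c
#include <stddef.h>

/* Length of a 7-bit ASCII string whose final character has bit 7 set. */
size_t hb_len(const unsigned char *s)
{
    size_t n = 1;
    while ((*s++ & 0x80) == 0)
        n++;
    return n;
}

/* Decode into an ordinary NUL-terminated string.
   dst needs room for hb_len(s) + 1 bytes. */
void hb_to_cstr(char *dst, const unsigned char *s)
{
    unsigned char c;
    do {
        c = *s++;
        *dst++ = (char)(c & 0x7F);   /* strip the end marker */
    } while ((c & 0x80) == 0);
    *dst = '\0';
}
```

It saves the terminator byte, but it restricts the content to 7-bit characters, which is the same "no free lunch" trade in a different spot.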


Exactly, Mark: There's no free lunch. It's always a compromise. These days, we don't need to make the same sort of compromises. But back then, this approach seemed as good as any other.


Err.. no it didn't. The C approach doesn't allow assigning the 7 char long string to the 3 char long string at all.


@Billy ONeal: why not? As far as I understand it in this case, all strings are the same data type (char*), so the length doesn't matter. Unlike Pascal. But that was a limitation of Pascal, rather than a problem with length-prefixed strings.


@Billy: I think you just restated Cristian's point. C deals with these issues by not dealing with them at all. You're still thinking in terms of C actually containing a notion of a string. It's just a pointer, so you can assign it to whatever you want.


It's like The Matrix: "there is no string".

@calavera: I don't see how that proves anything. You can solve it the same way with length prefixing... i.e. don't allow the assignment at all.


Problem is that libraries don't know the existence of your struct, and still handle things like embedded nulls incorrectly. Also, this doesn't really answer the question I asked.


That's true. So the bigger problem is there's no better standard way to provide interfaces with string parameters than plain old zero-terminated strings. I'd still claim, there are libraries which support feeding in pointer-length pairs (well, at least you can construct a C++ std::string with them).


Even if you store a length, you should never allow strings with embedded nulls. This is basic common sense. If your data might have nulls in it, you should never use it with functions which expect strings.


@supercat: From a standpoint of security I'd welcome that redundancy. Otherwise ignorant (or sleep-deprived) programmers end up concatenating binary data and strings and passing them into things that expect [null-terminated] strings...


@R..: While methods that expect null-terminated strings generally expect a char*, many methods which don't expect null termination also expect a char*. A more significant benefit of separating the types would relate to Unicode behavior. It may be worthwhile for a string implementation to maintain flags for whether strings are known to contain certain kinds of characters, or are known not to contain them [e.g. finding the 999,990th code point in a million-character string which is known not to contain any characters beyond the basic multilingual plane will be orders of magnitude faster...


I believe I addressed all of this in the question. Yes, on x64 platforms a 32 bit prefix can't fit all possible strings. On the other hand, you never want a string that big as a null terminated string, because to do anything you have to examine all 4 billion bytes to find the end for almost every operation you could want to do to it. Also, I'm not saying that null terminated strings are always evil -- if you're building one of these block structures and your specific application is sped up by that kind of construction, go for it. I just wish the default behavior of the language didn't do that.


I quoted that part of your question because in my view it underrated the efficiency issue. Doubling or quadrupling memory requirements (on 16-bit and 32-bit respectively) can be a big performance cost. Long strings may be slow, but at least they are supported and still work. My other point, about alignment, you don't mention at all.


Alignment may be dealt with by specifying that values beyond UCHAR_MAX should behave as though packed and unpacked using byte accesses and bit-shifting. A suitably-designed string type could offer storage efficiency essentially comparable to zero-terminated strings, while also allowing bounds-checking on buffers for no additional memory overhead (use one bit in the prefix to say whether a buffer is "full"; if it isn't and the last byte is non-zero, that byte would represent the remaining space. If the buffer isn't full and the last byte is zero, then the last 256 bytes would be unused, so...


...one could store within that space the exact number of unused bytes, with zero additional memory cost). The cost of working with the prefixes would be offset by the ability to use methods like fgets() without having to pass the string length (since buffers would know how big they were).


The prefix could easily be of implementation-defined size, just as is sizeof(char).


@BillyONeal: sizeof(char) is one. Always. One could have the prefix be an implementation-defined size, but it would be awkward. Further, there's no real way of knowing what the "right" size should be. If one is holding lots of 4-character strings, zero-padding would impose 25% overhead, while a four-byte length prefix would impose 100% overhead. Further, the time spent packing and unpacking four-byte length prefixes could exceed the cost of scanning 4-byte strings for the zero byte.

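The 25% and 100% figures above are just bookkeeping bytes divided by payload bytes. A tiny sketch of that arithmetic follows; the function name is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Bookkeeping bytes (terminator or length prefix) as a percentage of the payload. */
static unsigned overhead_percent(size_t payload, size_t bookkeeping)
{
    assert(payload > 0);
    return (unsigned)(bookkeeping * 100 / payload);
}
```

For a 4-character string, a 1-byte terminator gives `overhead_percent(4, 1)`, which is 25, and a 4-byte prefix gives `overhead_percent(4, 4)`, which is 100.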

Ah, yes. You're right. The prefix could easily be something other than char though. Anything that would make alignment requirements on the target platform work out would be fine. I'm not going to go there though -- I've already argued this to death.


Assuming strings were length-prefixed, probably the sanest thing to do would be a size_t prefix (memory waste be damned, it would be the sanest --- allowing strings of any possible length that could possibly fit into memory). In fact, that's kind of what D does; arrays are struct { size_t length; T* ptr; }, and strings are just arrays of immutable(char).


@TimČas: Sorry--I read your use of "prefix" as referring to a length stored in memory immediately preceding the characters themselves, since you said "kind of" what D does, I thought you were expecting strings to be something like struct {size_t length; char text[]; }

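The two layouts being distinguished here can be written out side by side in C. This is only a sketch: the type names `slice` and `prefixed` and the helper `prefixed_new` are illustrative, and the second struct relies on a C99 flexible array member.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* D-style slice: length and pointer travel together; the characters live elsewhere. */
struct slice {
    size_t length;
    const char *ptr;
};

/* In-memory prefix: the length sits immediately before the characters. */
struct prefixed {
    size_t length;
    char text[];            /* C99 flexible array member */
};

/* Allocate a prefixed string holding a copy of n bytes from src.
   Returns NULL on allocation failure; the caller frees the result. */
static struct prefixed *prefixed_new(const char *src, size_t n)
{
    assert(src != NULL);
    struct prefixed *p = malloc(sizeof *p + n);
    if (p == NULL)
        return NULL;
    p->length = n;
    memcpy(p->text, src, n);
    return p;
}
```

A `slice` can refer to the characters of a `prefixed` string (or any part of them) without copying, but not the other way round.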

Huh? What "fast pointer operations" don't work with length prefixing? More importantly, other languages which use length prefixing are faster than C w.r.t. string manipulation.


@billy: With length prefixed strings, you can't just take a string pointer and add 4 to it, and expect it to still be a valid string, because it doesn't have a length prefix (not a valid one anyway).

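A sketch of the "add 4 to the pointer" operation, with made-up names. For a NUL-terminated string the tail is just pointer arithmetic. A length stored in memory in front of the characters cannot offer this without copying, though a separate pointer-plus-length pair (the hypothetical `struct view` below) can.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* NUL-terminated: any pointer into the string is itself a valid string.
   The caller must ensure skip <= strlen(s). */
static const char *tail_cstr(const char *s, size_t skip)
{
    assert(s != NULL);
    return s + skip;
}

/* Hypothetical pointer-plus-length view; taking the tail is also O(1). */
struct view {
    const char *ptr;
    size_t len;
};

static struct view tail_view(struct view v, size_t skip)
{
    assert(skip <= v.len);
    struct view t = { v.ptr + skip, v.len - skip };
    return t;
}
```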

@j_random_hacker: Concatenation is much worse for asciiz strings (O(m+n) instead of potentially O(n)), and concat is much more common than any of the other operations listed here.

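The concatenation cost can be sketched as follows, assuming a hypothetical buffer type that remembers its own length. `strcat` has to walk the existing m bytes to find the terminator before copying n more, whereas the append below copies n bytes straight to a known offset.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical buffer that tracks its length and capacity; not a standard type. */
struct buf {
    char *data;
    size_t len;   /* bytes in use */
    size_t cap;   /* bytes available */
};

/* Append n bytes without scanning the existing contents.
   Returns 0 on success, -1 if the bytes do not fit. */
static int buf_append(struct buf *b, const char *src, size_t n)
{
    assert(b != NULL && src != NULL && b->len <= b->cap);
    if (n > b->cap - b->len)
        return -1;
    memcpy(b->data + b->len, src, n);
    b->len += n;
    return 0;
}
```

Because the capacity is stored alongside the length, the same check that makes the append cheap also rejects an overrun.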

there's one tiiny little operation that becomes more expensive with null-terminated strings: strlen. I'd say that's a bit of a drawback.


@Billy ONeal: Everyone else also supports regex. So what? Use libraries; that's what they are made for. C is about maximal efficiency and minimalism, not batteries included. C also lets you implement length-prefixed strings with structs very easily, and nothing forbids you from implementing string manipulation by managing your own length and char buffers. That's usually what I do when I want efficiency and use C; not calling a handful of functions that expect a zero at the end of a char buffer is not a problem.


9 was kinda off-base/misrepresented. A length prefix doesn't have this problem; passing the length as a separate variable does. We were talking about prefixes but I got carried away. Still a good thing to think about, so I'll leave it there. :d


There's nothing that would have prevented char* arr from pointing to a structure of the form struct { int length; char characters[ANYSIZE_ARRAY] }; or similar which would still be passable as a single parameter.


@BillyONeal: Two problems with that approach: (1) It would only allow passing the string as a whole, whereas the present approach also allows passing the tail of a string; (2) it will waste significant space when used with small strings. If K&R wanted to spend some time on strings they could have made things much more robust, but I don't think they intended that their new language would be in use ten years later, much less forty.


This bit about the calling convention is a just-so story with no relation to reality ... it wasn't a consideration in the design. And register-based calling conventions had already been "invented". Also, approaches such as two pointers weren't an option because structs weren't first class ... only primitives were assignable or passable; struct copying didn't arrive until UNIX V7. Needing memcpy (which also didn't exist) just to copy a string pointer is a joke. Try writing a full program, not just isolated functions, if you're making a pretense of language design.


"that's most likely because they didn't want to spend much effort on string handling" -- nonsense; the entire application domain of early UNIX was string handling. If it hadn't been for that, we never would have heard of it.


'I don't think "the char buffer begins with an int containing the length" is any more magical' -- it is if you're going to make str[n] refer to the right char. These are the sorts of things that the folks discussing this don't think about.


Look, I respect Joel for a lot of things; but this is something where he's speculating. Hans Passant's answer comes directly from C's inventors.


Yes, but if what Spolsky says is true at all, then it would have been part of the "convenience" they were referring to. That's partly why I included this answer.


AFAIK .ASCIZ was just an assembler statement to build a sequence of bytes, followed by 0. It just means that zero terminated string was a well established concept at that time. It does not mean that zero terminated strings were something related to the architecture of a PDP-*, except that you could write tight loops consisting of MOVB (copy a byte) and BNE (branch if the last byte copied was not zero).

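The tight MOVB/BNE loop described above has a well-known C counterpart, sketched here with an illustrative function name. The copy and the zero test happen in one expression, so the loop body is empty.

```c
#include <assert.h>
#include <stddef.h>

/* Copy src to dst including the terminating zero, and return dst.
   The assignment inside the loop condition plays the role of MOVB,
   and the comparison against zero plays the role of BNE.
   dst must have room for the whole string plus its terminator. */
static char *copy_asciz(char *dst, const char *src)
{
    assert(dst != NULL && src != NULL);
    char *d = dst;
    while ((*d++ = *src++) != '\0')
        ;
    return dst;
}
```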

It's supposed to show that C is an old, flabby, decrepit language.


This comparison advantage applies only to relational comparisons. For equality comparisons, length-prefixed strings will generally win out since strings of unequal length can be recognized as unequal without having to examine any of the content, and content comparison can be done using multi-byte chunks without having to stop as soon as a zero byte is found. Further, the advantage isn't really as great as you state since the scenario where the difference is zero will be true of all but the last loop iteration.

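The equality case can be sketched like this, assuming a hypothetical pointer-plus-length type called `span`. Unequal lengths are rejected before any content is read, and `memcmp` is free to compare in multi-byte chunks because it is told the length up front.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical pointer-plus-length string; not a standard type. */
struct span {
    const char *ptr;
    size_t len;
};

static bool span_equal(struct span a, struct span b)
{
    assert(a.ptr != NULL && b.ptr != NULL);
    if (a.len != b.len)
        return false;                         /* no content examined */
    return memcmp(a.ptr, b.ptr, a.len) == 0;  /* length known in advance */
}
```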

@Adrian W. This is valid C. Exact-length strings are special-cased, and the NUL is omitted for them. This is generally an unwise practice, but it can be useful in cases like populating header structs that use FourCC "strings".

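A sketch of the FourCC case, using a made-up header layout (real file formats define their own fields). In C, a `char[4]` may be initialized from a four-character literal, and the terminating NUL is simply not stored. C++ rejects the same initializer, and some compilers warn about it.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical chunk header, for illustration only. */
struct chunk_header {
    char fourcc[4];     /* exactly four bytes, no terminator */
    unsigned size;
};

static struct chunk_header make_riff_header(unsigned size)
{
    /* Valid C: "RIFF" fills all four elements and the trailing NUL is dropped. */
    struct chunk_header h = { "RIFF", size };
    return h;
}
```

Because `fourcc` has no terminator, it must be handled with `memcmp` or `memcpy` and an explicit length of 4, never with `strlen` or `strcmp`.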

You are right. This is valid C; it will compile and behave as kkaaii described. The reason for the downvotes (not mine...) is probably that this answer does not address OP's question in any way.


Not having the multibillion dollar single byte mistake would not make C a "bloated" language.

