python - Apparently, Python strings are not "born equal"

Reposted by: 太空狗 · Updated: 2023-10-30 02:27:50

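I'm trying to wrap my brain around "text encoding standards". When interpreting a bunch of bytes as "text", one has to know which "encoding scheme" applies. Possible candidates that I know of:
ascii: a very basic encoding scheme that supports 128 characters.
cp-1252: the Windows encoding scheme for the Latin alphabet. Also known as "ANSI".
utf-8: an encoding scheme for the Unicode table (1,114,112 code points). Represents each character with one byte where possible and with more bytes where needed (at most 4 bytes).
utf-16: another encoding scheme for the Unicode table (1,114,112 code points). Represents each character with at least 2 bytes and at most 4 bytes.
utf-32: another encoding scheme for the Unicode table. Represents every character with 4 bytes.
And so on...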
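As a rough illustration of the byte counts listed above, the following sketch encodes a few sample characters and reports how many bytes each scheme needs. The sample characters and the helper name `encoded_sizes` are chosen here for illustration only; the `-le` codec variants are used so that no byte order mark is counted.

```python
# Illustrative sketch: how many bytes does one character take in each scheme?
SAMPLES = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # A, e-acute, euro sign, an emoji
CODECS = ["utf-8", "utf-16-le", "utf-32-le"]


def encoded_sizes(ch):
    """Return {codec: number of bytes} for a single character."""
    return {codec: len(ch.encode(codec)) for codec in CODECS}


for ch in SAMPLES:
    # ascii() keeps the output printable on any console code page.
    print(ascii(ch), encoded_sizes(ch))

# cp1252 is a single-byte scheme; ascii cannot represent e-acute at all.
print("\u00e9".encode("cp1252"))
try:
    "\u00e9".encode("ascii")
except UnicodeEncodeError as exc:
    print("ascii failed:", exc.reason)
```

Only characters in the ASCII range take a single byte in utf-8, which is why plain source code often looks identical in ascii, cp-1252 and utf-8.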
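Now I would expect Python to consistently use one encoding scheme for its built-in string type. I ran the test below, and the result makes me shiver. I'm starting to believe that Python does not always stick to one encoding scheme for storing strings internally. In other words: Python strings seem to be "not born equal"...
Edit:
I forgot to mention that I'm using Python 3.x. Sorry :-)
1. The test
I have two simple text files in a folder: myAnsi.txt and myUtf.txt. As you might guess, the first is encoded in the CP-1252 scheme, also known as ANSI. The latter is encoded in utf-8. In my test I open each file and read its contents. I assign the contents to a native Python string variable. Then I close the file. After that, I create a new file and write the contents of the string variable to it. Here is the code that does all of this: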

    ##############################
    #    TEST ON THE ANSI-coded  #
    #    FILE                    #
    ##############################
    import os
    file = open(os.getcwd() + '\\myAnsi.txt', 'r')
    fileText = file.read()
    file.close()

    file = open(os.getcwd() + '\\outputAnsi.txt', 'w')
    file.write(fileText)
    file.close()

    # A print statement here like:
    #    >> print(fileText)
    # will raise an exception.
    # But if you're typing this code in a python terminal,
    # you can just write:
    #    >> fileText
    # and get the content printed. In my case, it is the exact
    # content of the file.
    # PS: I use the native windows cmd.exe as my Python terminal ;-)

    ##############################
    #    TEST ON THE Utf-coded   #
    #    FILE                    #
    ##############################
    import os
    file = open(os.getcwd() + '\\myUtf.txt', 'r')
    fileText = file.read()
    file.close()

    file = open(os.getcwd() + '\\outputUtf.txt', 'w')
    file.write(fileText)
    file.close()

    # A print statement here like:
    #    >> print(fileText)
    # will just work fine (at least for me).

    ############# END OF TEST #############

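2. The result I expected
Let us suppose that Python always sticks to one internal encoding scheme (for example utf-8) for all its strings. Assigning other content to a string would then cause some kind of implicit conversion. Under these assumptions, I would expect both output files to be utf-8: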

    outputAnsi.txt   ->   utf-8 encoded
    outputUtf.txt    ->   utf-8 encoded

3. The result I got
The result I got is:

    outputAnsi.txt   ->   CP-1252 encoded (ANSI)
    outputUtf.txt    ->   utf-8 encoded

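From these results, I have to conclude that the string variable fileText somehow stores the encoding scheme it adheres to.
Many people tell me in their answers:
"If no encoding is passed explicitly, open() uses the preferred system encoding both for reading and for writing."
I just can't wrap my head around that statement. If open() uses the "preferred system encoding", let's say cp1252 for the sake of example, then both *.txt output files should be encoded that way, shouldn't they?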
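4. Questions
My test raises several questions for me:
(1) When I open a file to read its contents, how does Python know the encoding scheme of that file? I did not specify it when opening the file.
(2) Apparently a Python string can adhere to any encoding scheme supported by Python. So not all Python strings are born equal. How do I find out the encoding scheme of a particular string, and how do I convert it? Or how do I make sure that a freshly created Python string is of the expected type?
(3) When I create a file, how does Python decide in which encoding scheme the file will be created? I did not specify the encoding scheme when creating those files in my test. Nevertheless, Python made a different (!) decision in each case.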
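5. Extra info (based on comments on this question):
Python version: Python 3.x (installed from Anaconda)
Operating system: Windows 10
Terminal: standard Windows command prompt cmd.exe
Some questions were raised about the temporary variable fileText. Apparently the instruction print(fileText) does not work for the ANSI case. An exception gets thrown. But in the Python terminal window, I can simply type the variable name fileText and get the file contents printed out.
File encoding detection: the bottom-right corner of Notepad++ for a first check, an online tool for a second check: https://nlp.fi.muni.cz/projects/chared/
The output files outputAnsi.txt and outputUtf.txt do not exist at the start of the test. They are created when I issue the open() command with the 'w' option.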
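6. The actual files (for completeness):
I got several comments encouraging me to share the actual files I'm running this test on. Those files are quite big, so I trimmed them down and redid the test. The results are similar. Here are the files (of course, my files contain source code, what else?):
myAnsi.txt file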

/*
******************************************************************************
**
**  File        : LinkerScript.ld
**
**  Author      : Auto-generated by Ac6 System Workbench
**
**  Abstract    : Linker script for STM32F746NGHx Device from STM32F7 series
**
**  Target      : STMicroelectronics STM32
**
**  Distribution: The file is distributed “as is,” without any warranty
**                of any kind.
**
*****************************************************************************
** @attention
**
** <h2><center>&copy; COPYRIGHT(c) 2014 Ac6</center></h2>
**
*****************************************************************************
*/

/* Entry Point */
/*ENTRY(Reset_Handler)*/
ENTRY(Default_Handler)

/* Highest address of the user mode stack */
_estack = 0x20050000;    /* end of RAM */

_Min_Heap_Size = 0;      /* required amount of heap  */
_Min_Stack_Size = 0x400; /* required amount of stack */

/* Memories definition */
MEMORY
{
  RAM (xrw)     : ORIGIN = 0x20000000, LENGTH = 320K
  ROM (rx)      : ORIGIN = 0x8000000, LENGTH = 1024K
}

The print statement on the fileText variable results in the following exception:

>>> print(fileText)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\Anaconda3\lib\encodings\cp850.py", line 19, in encode
        return codecs.charmap_encode(input,self.errors,encoding_map)[0]
    UnicodeEncodeError: 'charmap' codec can't encode character '\u201c' in position 357: character maps to <undefined>

But simply typing the variable name prints the contents without any problem:

>>> fileText
    ### contents of the file are printed out :-) ###
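One plausible explanation for the difference, not spelled out in the original post: `print()` has to encode the text with the console's code page (cp850 in the traceback above), and that code page has no mapping for `'\u201c'`. The interactive prompt instead shows `repr(fileText)` through `sys.displayhook`, which falls back to backslash escapes for characters the console cannot encode. The sketch below imitates both behaviors with plain `str.encode` calls; the sample text is made up.

```python
# Sketch: why print() can fail on a cp850 console while echoing the variable works.
text = "\u201cas is,\u201d"  # curly quotes, as in the header of myAnsi.txt

# What print() effectively has to do on a cp850 console: strict encoding.
try:
    text.encode("cp850")
    strict_ok = True
except UnicodeEncodeError as exc:
    strict_ok = False
    print("strict encoding failed:", exc)

# What the interactive echo falls back to: unencodable characters are escaped.
escaped = text.encode("cp850", errors="backslashreplace")
print(escaped)
```

So the echoed value is an escaped representation of the string rather than the raw text, which is why it never triggers the encoding error.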

myUtf.txt file

/*--------------------------------------------------------------------------------------------------------------------*/
/*           _ _ _                                                                                                    */
/*          / -,- \                   __  _            _                                                              */
/*         //  |  \\                 / __\ | ___   ___| | __                   _            _                         */
/*         |   0--,|                / /  | |/ _ \ / __| |/ /    __ ___ _ _  __| |_ __ _ _ _| |_ ___                   */
/*         \\     //               / /___| | (_) | (__|   <    / _/ _ \ ' \(_-<  _/ _` | ' \  _(_-<                   */
/*          \_-_-_/                \____/|_|\___/ \___|_|\_\   \__\___/_||_/__/\__\__,_|_||_\__/__/                   */
/*--------------------------------------------------------------------------------------------------------------------*/


#include "clock_constants.h"
#include "../CMSIS/stm32f7xx.h"
#include "stm32f7xx_hal_rcc.h"


/*--------------------------------------------------------------------------------------------------*/
/*          S y s t e m C o r e C l o c k       i n i t i a l        v a l u e                      */
/*--------------------------------------------------------------------------------------------------*/
/*                                                                                                  */
/* This variable is updated in three ways:                                                          */
/*      1) by calling CMSIS function SystemCoreClockUpdate()                                        */
/*      2) by calling HAL API function HAL_RCC_GetHCLKFreq()                                        */
/*      3) each time HAL_RCC_ClockConfig() is called to configure the system clock frequency        */
/*          Note: If you use this function to configure the system clock; then there                */
/*                is no need to call the 2 first functions listed above, since SystemCoreClock      */
/*                variable is updated automatically.                                                */
/*                                                                                                  */
uint32_t SystemCoreClock = 16000000;
const uint8_t AHBPrescTable[16] = {0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 6, 7, 8, 9};


/*--------------------------------------------------------------------------------------------------*/
/*          S y s t e m C o r e C l o c k       v a l u e      u p d a t e                          */
/*--------------------------------------------------------------------------------------------------*/
/*                                                                                                  */
/* @brief  Update SystemCoreClock variable according to Clock Register Values.                      */
/*         The SystemCoreClock variable contains the core clock (HCLK), it can                      */
/*         be used by the user application to setup the SysTick timer or configure                  */
/*         other parameters.                                                                        */
/*--------------------------------------------------------------------------------------------------*/

Best answer

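When no encoding is passed explicitly, open() uses the preferred system encoding both for reading and for writing (I'm not sure exactly how the preferred encoding is detected on Windows).
So, when you write: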

file = open(os.getcwd() + '\\myAnsi.txt', 'r')
file = open(os.getcwd() + '\\outputAnsi.txt', 'w')
file = open(os.getcwd() + '\\myUtf.txt', 'r')
file = open(os.getcwd() + '\\outputUtf.txt', 'w')

All four files are opened with the same encoding, both for reading and for writing.
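To check which encoding that is on a given machine, something like the following sketch may help. It compares the `encoding` attribute of a file opened without `encoding=` against `locale.getpreferredencoding(False)`, which the `open()` documentation names as the text-mode default (newer Python versions describe the same default in terms of `locale.getencoding()`). The helper name `default_open_encoding` and the scratch file are illustrative.

```python
import codecs
import locale
import os
import tempfile


def default_open_encoding():
    """Open a scratch file without encoding= and report what open() picked."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with open(path, "w") as f:  # no encoding passed, as in the question
            return f.encoding
    finally:
        os.remove(path)


used = default_open_encoding()
preferred = locale.getpreferredencoding(False)
print("open() default:", used)
print("preferred     :", preferred)
# codecs.lookup() normalises spelling differences such as "UTF-8" vs "utf-8".
print(codecs.lookup(used).name == codecs.lookup(preferred).name)
```

On a Western European Windows installation of that era this would typically report cp1252, but the value depends on the system configuration.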
If you want to be sure the files are opened with a specific encoding, you have to pass encoding='cp1252' or encoding='utf-8':

file = open(os.getcwd() + '\\myAnsi.txt', 'r', encoding='cp1252')
file = open(os.getcwd() + '\\outputAnsi.txt', 'w', encoding='cp1252')
file = open(os.getcwd() + '\\myUtf.txt', 'r', encoding='utf-8')
file = open(os.getcwd() + '\\outputUtf.txt', 'w', encoding='utf-8')

(By the way, I'm not a Windows expert, but I think you can write 'myAnsi.txt' instead of os.getcwd() + '\\myAnsi.txt')
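For the "how do I convert it" part of the question, explicit encodings on both sides are enough. The following is a minimal sketch, assuming the source file really is cp1252; the function name `transcode`, the output file name and the sample bytes are made up for the demonstration, and a temporary folder stands in for the real one.

```python
import os
import tempfile


def transcode(src_path, dst_path, src_encoding, dst_encoding):
    """Rewrite a text file in another encoding."""
    # newline="" keeps line endings untouched in both directions.
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        dst.write(text)


with tempfile.TemporaryDirectory() as folder:
    ansi_path = os.path.join(folder, "myAnsi.txt")
    utf_path = os.path.join(folder, "outputUtf8.txt")
    with open(ansi_path, "wb") as f:
        f.write(b"\x93as is,\x94\r\n")  # cp1252 bytes for curly quotes
    transcode(ansi_path, utf_path, "cp1252", "utf-8")
    with open(utf_path, "rb") as f:
        result = f.read()

print(result)  # each curly quote now takes three bytes
```

Using `with` blocks also closes the files automatically, so the explicit `file.close()` calls from the question are not needed.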
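Besides that, you have to consider that some characters are represented the same way in different encodings. For example, the string hello has the same representation in ASCII, CP-1252 and UTF-8. In general, you have to use some non-ASCII characters to see any difference: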

>>> 'hello'.encode('cp1252')
b'hello'
>>> 'hello'.encode('utf-8')
b'hello'  # different encoding, same byte representation

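Not only that, but some byte strings are perfectly valid in two different encodings, even though they mean different things. So when you decode a file with the wrong encoding, you may not get an error but a weird string instead: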

>>> b'\xe2\x82\xac'.decode('utf-8')
'€'
>>> b'\xe2\x82\xac'.decode('cp1252')
'â‚¬'  # same byte representation, different string
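Since the bytes themselves do not say how they were encoded, any detection is a guess. A common heuristic, sketched below under the assumption that the only candidates are UTF-8 and cp1252, is to try strict UTF-8 first: non-ASCII cp1252 text is rarely valid UTF-8 by accident. The function name `decode_guess` is hypothetical, and the approach can still misfire, especially on short inputs.

```python
def decode_guess(raw):
    """Decode bytes as UTF-8 if they are valid UTF-8, otherwise as cp1252.

    Returns a (text, codec_name) pair. This is a heuristic, not a detector.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # errors="replace" covers the few byte values cp1252 leaves undefined.
        return raw.decode("cp1252", errors="replace"), "cp1252"


ansi_sample = "\u201cas is\u201d".encode("cp1252")  # b'\x93as is\x94'
utf8_sample = "\u20ac".encode("utf-8")              # b'\xe2\x82\xac'

for sample in (ansi_sample, utf8_sample, b"hello"):
    text, codec = decode_guess(sample)
    print(sample, "->", ascii(text), "via", codec)
```

Pure ASCII input is reported as utf-8 here simply because it is valid in both encodings.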

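For the record, CPython 3.3 and later (PEP 393) stores each string internally with a fixed width of 1, 2 or 4 bytes per code point (Latin-1, UCS-2 or UCS-4) rather than as UTF-8 or UTF-16. It picks the narrowest width that fits the widest character in the string. Because the width is fixed, there are no continuation bytes or surrogate pairs, so index lookups are always O(1). This is an implementation detail: at the language level a str is a sequence of Unicode code points and carries no encoding of its own.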
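The per-character storage width can be observed indirectly with `sys.getsizeof`. The sketch below is specific to CPython 3.3 or later and relies on its compact string layout; other interpreters may report different numbers. The helper name `bytes_per_char` is illustrative.

```python
import sys


def bytes_per_char(ch, n=1000):
    """Estimate per-character storage by comparing two string lengths."""
    # The fixed object header cancels out in the subtraction.
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n


# a, e-acute, euro sign, an emoji
for ch in ("a", "\u00e9", "\u20ac", "\U0001F600"):
    print(ascii(ch), bytes_per_char(ch))
```

Whatever width is used, two strings with the same code points compare equal, so the storage width never leaks into program behavior.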
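In short, you read both files using the system encoding and wrote both files using that same encoding, so no conversion took place. Decoding bytes and re-encoding them with the same single-byte codec gives back the original bytes, which is why each output file appears to have kept the encoding of its input. The contents of the files you read are compatible with both cp-1252 and utf-8.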
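The experiment from the question can be reproduced in miniature. The sketch below assumes cp1252 as the system encoding and passes it explicitly so that the result does not depend on the machine; the helper name `round_trip` and both sample byte strings are made up (the second one deliberately contains a non-ASCII character, unlike the myUtf.txt excerpt shown above).

```python
import os
import tempfile


def round_trip(raw, codec):
    """Write raw bytes to a file, read it as text, write the text back, return the new bytes."""
    with tempfile.TemporaryDirectory() as folder:
        src = os.path.join(folder, "in.txt")
        dst = os.path.join(folder, "out.txt")
        with open(src, "wb") as f:
            f.write(raw)
        # newline="" disables line-ending translation in both directions.
        with open(src, "r", encoding=codec, newline="") as f:
            text = f.read()
        with open(dst, "w", encoding=codec, newline="") as f:
            f.write(text)
        with open(dst, "rb") as f:
            return f.read()


ansi_bytes = "\u201cas is,\u201d".encode("cp1252")  # like the header of myAnsi.txt
utf8_bytes = "\u20ac 16000000".encode("utf-8")       # a UTF-8 file with non-ASCII text

print(round_trip(ansi_bytes, "cp1252") == ansi_bytes)
print(round_trip(utf8_bytes, "cp1252") == utf8_bytes)
```

In the second case the intermediate string is mojibake, yet the bytes written back are identical, so an external tool still reports the output as utf-8.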

Regarding "python - Apparently, Python strings are not 'born equal'", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/39213688/
