
python - Spark results encoding error with UTF-8 text in an HDInsight cluster: 'ascii' codec can't encode characters in position: ordinal not in range(128)


I'm trying to process a Hebrew-character UTF-8 TSV file with Spark on a Linux HDInsight cluster, but I get an encoding error. Any suggestions?

Here is my PySpark notebook code:

from pyspark.sql import *
# Create an RDD from sample data
transactionsText = sc.textFile("/people.txt")

header = transactionsText.first()

# Create a schema for our data
Entry = Row('id','name','age')

# Parse the data and create a schema
transactionsParts = transactionsText.filter(lambda x: x != header).map(lambda l: l.encode('utf-8').split("\t"))
transactions = transactionsParts.map(lambda p: Entry(str(p[0]),str(p[1]),int(p[2])))

# Infer the schema and create a table
transactionsTable = sqlContext.createDataFrame(transactions)

# SQL can be run over DataFrames that have been registered as a table.
results = sqlContext.sql("SELECT name FROM transactionsTempTable")

# The results of SQL queries are RDDs and support all the normal RDD operations.
names = results.map(lambda p: "name: " + p.name)

for name in names.collect():
    print(name)

The error:

'ascii' codec can't encode characters in position 6-11: ordinal not in range(128)
Traceback (most recent call last):
UnicodeEncodeError: 'ascii' codec can't encode characters in position 6-11: ordinal not in range(128)
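This error comes from Python 2 implicitly falling back to the 'ascii' codec whenever a unicode string that holds non-ASCII characters is converted to, or mixed with, a byte string. A minimal sketch reproducing the same exception outside Spark (assuming Python 2, as this Spark setup uses):

# -*- coding: utf-8 -*-
# Minimal reproduction (assumes Python 2, as on the cluster).
# str() implicitly encodes with the 'ascii' codec, which cannot
# represent Hebrew letters, so it raises the same UnicodeEncodeError.
name = u'\u05d2\u05d9\u05d0'  # the Hebrew name from the sample file
try:
    str(name)  # implicit ascii encoding fails here
except UnicodeEncodeError as e:
    print(e)   # 'ascii' codec can't encode characters in position 0-2: ...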

Hebrew text file content:

id  name    age 
1 גיא 37
2 maor 32
3 danny 55

When I try an English file, it works fine:

English text file content:

id  name    age
1 guy 37
2 maor 32
3 danny 55

Output:

name: guy
name: maor
name: danny

Best Answer

If you run the following code with the Hebrew text:

from pyspark.sql import *

path = "/people.txt"
transactionsText = sc.textFile(path)

header = transactionsText.first()

# Create a schema for our data
Entry = Row('id','name','age')

# Parse the data and create a schema
transactionsParts = transactionsText.filter(lambda x: x != header).map(lambda l: l.split("\t"))

transactions = transactionsParts.map(lambda p: Entry(unicode(p[0]), unicode(p[1]), unicode(p[2])))

transactions.collect()

you will notice that what you get back is a list whose values are of type unicode:

[Row(id=u'1', name=u'\u05d2\u05d9\u05d0', age=u'37'), Row(id=u'2', name=u'maor', age=u'32 '), Row(id=u'3', name=u'danny', age=u'55')]

Now we register a table from the transactions RDD:

table_name = "transactionsTempTable"

# Infer the schema and create a table
transactionsDf = sqlContext.createDataFrame(transactions)
transactionsDf.registerTempTable(table_name)

# SQL can be run over DataFrames that have been registered as a table.
results = sqlContext.sql("SELECT name FROM {}".format(table_name))

results.collect()

You will notice that all the strings in the PySpark DataFrame returned from sqlContext.sql(...) are of Python type unicode:

[Row(name=u'\u05d2\u05d9\u05d0'), Row(name=u'maor'), Row(name=u'danny')]
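With everything kept as unicode, the per-name output the question was after can be printed by encoding explicitly at the output boundary. A sketch (assuming a Python 2 driver and a UTF-8 terminal):

# Sketch (assumes Python 2 and a UTF-8 terminal): encode explicitly
# at the output boundary instead of letting the 'ascii' codec kick in.
for row in results.collect():
    print("name: " + row.name.encode('utf-8'))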

Now, running:

%%sql
SELECT * FROM transactionsTempTable

will give the expected results:

name: גיא
name: maor
name: danny

Note that if you want to do any processing on these names, you need to work with them as unicode strings. From this article:

When you’re dealing with text manipulations (finding the number of characters in a string or cutting a string on word boundaries) you should be dealing with unicode strings as they abstract characters in a manner that’s appropriate for thinking of them as a sequence of letters that you will see on a page. When dealing with I/O, reading to and from the disk, printing to a terminal, sending something over a network link, etc, you should be dealing with byte str as those devices are going to need to deal with concrete implementations of what bytes represent your abstract characters.
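Concretely for the Hebrew data here: character-level operations only make sense on the unicode value, while sizes for disk and network apply to its UTF-8 bytes. A small illustration (again assuming Python 2):

# Illustration (Python 2 assumed): text manipulation vs. I/O bytes.
name = u'\u05d2\u05d9\u05d0'    # unicode: three Hebrew letters
encoded = name.encode('utf-8')  # byte str: what is read from / written to disk

print(len(name))     # 3 -- characters, what text manipulation should count
print(len(encoded))  # 6 -- bytes, what I/O devices actually handle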

Regarding python - Spark results encoding error with UTF-8 text in an HDInsight cluster: 'ascii' codec can't encode characters in position: ordinal not in range(128), we found a similar question on Stack Overflow: https://stackoverflow.com/questions/37698276/
