
apache-spark - Joining more than 2 tables in Spark SQL


I am trying to write a query in Spark SQL that joins three tables, but the query output is actually null. It works fine with a single table. My join query itself is correct, because I have already run it against an Oracle database. What correction do I need to apply here? The Spark version is 2.0.0.

from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)

lines = sc.textFile("/Users/Hadoop_IPFile/purchase")
lines2 = sc.textFile("/Users/Hadoop_IPFile/customer")
lines3 = sc.textFile("/Users/Hadoop_IPFile/book")

parts = lines.map(lambda l: l.split("\t"))
purchase = parts.map(lambda p: Row(year=p[0],cid=p[1],isbn=p[2],seller=p[3],price=int(p[4])))
schemapurchase = sqlContext.createDataFrame(purchase)
schemapurchase.registerTempTable("purchase")


parts2 = lines.map(lambda l: l.split("\t"))
customer = parts2.map(lambda p: Row(cid=p[0],name=p[1],age=p[2],city=p[3],sex=p[4]))
schemacustomer = sqlContext.createDataFrame(customer)
schemacustomer.registerTempTable("customer")

parts3 = lines.map(lambda l: l.split("\t"))
book = parts3.map(lambda p: Row(isbn=p[0],name=p[1]))
schemabook = sqlContext.createDataFrame(book)
schemabook.registerTempTable("book")

result_purchase = sqlContext.sql("""SELECT DISTINCT customer.name AS name FROM purchase JOIN book ON purchase.isbn = book.isbn JOIN customer ON customer.cid = purchase.cid WHERE customer.name != 'Harry Smith' AND purchase.isbn IN (SELECT purchase.isbn FROM customer JOIN purchase ON customer.cid = purchase.cid WHERE customer.name = 'Harry Smith')""")

result = result_purchase.rdd.map(lambda p: "name: " + p.name).collect()
for name in result:
    print(name)


DataSet
---------
Purchase
1999 C1 B1 Amazon 90
2001 C1 B2 Amazon 20
2008 C2 B2 Barnes Noble 30
2008 C3 B3 Amazon 28
2009 C2 B1 Borders 90
2010 C4 B3 Barnes Noble 26


Customer
C1 Jackie Chan 50 Dayton M
C2 Harry Smith 30 Beavercreek M
C3 Ellen Smith 28 Beavercreek F
C4 John Chan 20 Dayton M

Book
B1 Novel
B2 Drama
B3 Poem
I found the following suggestion on some web pages, but it still does not work:
schemapurchase.join(schemabook, schemapurchase.isbn == schemabook.isbn) 
schemapurchase.join(schemacustomer, schemapurchase.cid == schemacustomer.cid)
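
Chained together on the DataFrames defined above, those two suggestions would look roughly like this (a minimal PySpark sketch; the column names are taken from the Row definitions earlier in the question):

# Chain both joins, then pick a few columns to inspect the combined rows.
joined = (schemapurchase
    .join(schemabook, schemapurchase.isbn == schemabook.isbn)
    .join(schemacustomer, schemapurchase.cid == schemacustomer.cid))
joined.select(schemacustomer.name, schemabook.isbn, schemapurchase.price).show()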

Best Answer

Given input DataFrames like the ones in your example (sorry if some of the column names are wrong, I guessed them):

Purchase:

+----+---+----+------------+-----+
|year|cid|isbn| shop|price|
+----+---+----+------------+-----+
|1999| C1| B1| Amazon| 90|
|2001| C1| B2| Amazon| 20|
|2008| C2| B2|Barnes Noble| 30|
|2008| C3| B3| Amazon| 28|
|2009| C2| B1| Borders| 90|
|2010| C4| B3|Barnes Noble| 26|
+----+---+----+------------+-----+

Customer:
+---+-----------+---+-----------+-----+
|cid| name|age| city|genre|
+---+-----------+---+-----------+-----+
| C1|Jackie Chan| 50| Dayton| M|
| C2|Harry Smith| 30|Beavercreek| M|
| C3|Ellen Smith| 28|Beavercreek| F|
| C4| John Chan| 20| Dayton| M|
+---+-----------+---+-----------+-----+

Book:
+----+-----+
|isbn|genre|
+----+-----+
| B1|Novel|
| B2|Drama|
| B3| Poem|
+----+-----+

You can translate that SQL query with DataFrame functions as follows:
val result = purchase.join(book, purchase("isbn")===book("isbn"))
.join(customer, customer("cid")===purchase("cid"))
.where(customer("name") !== "Harry Smith")
.join(temp, purchase("isbn")===temp("purchase_isbn"))
.select(customer("name").as("NAME")).distinct()

where "temp" is the result of the "SELECT IN" subquery, which can be thought of as the result of another join:
val temp = customer.join(purchase, customer("cid")===purchase("cid") )
.where(customer("name")==="Harry Smith")
.select(purchase("isbn").as("purchase_isbn"))


+-------------+
|purchase_isbn|
+-------------+
| B2|
| B1|
+-------------+

So the final result is:
+-----------+
| NAME|
+-----------+
|Jackie Chan|
+-----------+

Take this answer as a starting point for further thought (for example, too many joins can have a bad impact on performance).
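
Since the question itself uses PySpark, the same chain can also be written with the Python DataFrame API. This is a minimal, untested sketch that assumes the schemapurchase, schemacustomer and schemabook DataFrames registered in the question (with the column names defined there):

# Subquery: ISBNs of the books that Harry Smith bought.
temp = (schemacustomer
    .join(schemapurchase, schemacustomer.cid == schemapurchase.cid)
    .where(schemacustomer.name == "Harry Smith")
    .select(schemapurchase.isbn.alias("purchase_isbn")))

# Main query: other customers who bought at least one of those books.
result = (schemapurchase
    .join(schemabook, schemapurchase.isbn == schemabook.isbn)
    .join(schemacustomer, schemacustomer.cid == schemapurchase.cid)
    .where(schemacustomer.name != "Harry Smith")
    .join(temp, schemapurchase.isbn == temp.purchase_isbn)
    .select(schemacustomer.name.alias("NAME"))
    .distinct())

result.show()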

Regarding apache-spark - joining more than 2 tables in Spark SQL, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/42150443/
