This post looks at what join operations on a DataFrame actually return.
left join
from pyspark.sql import Row
# Assumes `sc` is an existing SparkContext, e.g. the one the pyspark shell provides.
rdd = sc.parallelize([Row(name='Alice', age=5, height=80), Row(name='Alice', age=10, height=80)])
df = rdd.toDF()
rdd1 = sc.parallelize([Row(name='Alice', weight=45)])
df1 = rdd1.toDF()
df.join(df1, ["name"], "left").show()
"""
+-----+---+------+------+
| name|age|height|weight|
+-----+---+------+------+
|Alice|  5|    80|    45|
|Alice| 10|    80|    45|
+-----+---+------+------+
"""
rdd2 = sc.parallelize([Row(name='Alice', weight=45), Row(name='Alice', weight=45)])
df2 = rdd2.toDF()
df.join(df2, ["name"], "left").show()
"""
+-----+---+------+------+
| name|age|height|weight|
+-----+---+------+------+
|Alice|  5|    80|    45|
|Alice|  5|    80|    45|
|Alice| 10|    80|    45|
|Alice| 10|    80|    45|
+-----+---+------+------+
"""
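The examples above give the same results with a plain join (the default inner join), because every row on the left has a match on the right.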
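Why two rows become four is easier to see without Spark. The following is a minimal plain-Python sketch of left-join semantics, not Spark's implementation; the helper `left_join` and the dict-based rows are hypothetical names used only for illustration. Each left row is paired with every right row that shares its key, so 2 left rows times 2 matching right rows gives 4 output rows.

```python
def left_join(left, right, key):
    """Pair every left row with each matching right row; keep unmatched left rows."""
    # Names of the right-side columns other than the join key.
    right_cols = []
    for r_row in right:
        for col in r_row:
            if col != key and col not in right_cols:
                right_cols.append(col)

    result = []
    for l_row in left:
        matches = [r_row for r_row in right if r_row[key] == l_row[key]]
        if not matches:
            # No match: the right-side columns are filled with None (null in Spark).
            row = dict(l_row)
            row.update({col: None for col in right_cols})
            result.append(row)
        for r_row in matches:
            # One output row per matching right row.
            row = dict(l_row)
            row.update({col: r_row.get(col) for col in right_cols})
            result.append(row)
    return result


left = [{"name": "Alice", "age": 5, "height": 80},
        {"name": "Alice", "age": 10, "height": 80}]
right_one = [{"name": "Alice", "weight": 45}]
right_two = [{"name": "Alice", "weight": 45}, {"name": "Alice", "weight": 45}]

rows_one = left_join(left, right_one, "name")
rows_two = left_join(left, right_two, "name")
print(len(rows_one))  # 2
print(len(rows_two))  # 4
```

If duplicate keys on the right side are not intended, deduplicating the right side before joining avoids the multiplication.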
outer join
rdd = sc.parallelize([Row(name='Alice', age=5, height=80),Row(name='Bob', age=5, height=80)])
df = rdd.toDF()
rdd = sc.parallelize([Row(name='Alice', age=5, height=80),Row(name='Cycy', age=10, height=80)])
df1 = rdd.toDF()
df.join(df1, ["name"], "outer").show()
"""
+-----+----+------+----+------+
| name| age|height| age|height|
+-----+----+------+----+------+
| Cycy|null|  null|  10|    80|
|  Bob|   5|    80|null|  null|
|Alice|   5|    80|   5|    80|
+-----+----+------+----+------+
"""
rdd1 = sc.parallelize([Row(name='Alice', age=5, height=80),Row(name='Cycy', age=10, height=80)])
df2 = rdd1.toDF()
# show() prints the table itself and returns None, so no print is needed.
df.join(df1, ["name"], "outer").join(df2, ["name"], "outer").show()
"""
+-----+----+------+----+------+----+------+
| name| age|height| age|height| age|height|
+-----+----+------+----+------+----+------+
| Cycy|null|  null|  10|    80|  10|    80|
|  Bob|   5|    80|null|  null|null|  null|
|Alice|   5|    80|   5|    80|   5|    80|
+-----+----+------+----+------+----+------+
"""
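The null-filling in the outer-join output can also be reproduced in plain Python. This is a minimal sketch of full outer join semantics, not Spark's implementation; the helper `full_outer_join` and the `(key, left_row, right_row)` tuple layout are hypothetical choices made for illustration. A key present on only one side keeps its row and gets `None` for the other side.

```python
def full_outer_join(left, right, key):
    """Return (key, left_row_or_None, right_row_or_None) tuples."""
    result = []
    matched_right = set()  # indexes of right rows that found a partner
    for l_row in left:
        found = False
        for i, r_row in enumerate(right):
            if l_row[key] == r_row[key]:
                result.append((l_row[key], l_row, r_row))
                matched_right.add(i)
                found = True
        if not found:
            # Key exists only on the left: the right side is None.
            result.append((l_row[key], l_row, None))
    for i, r_row in enumerate(right):
        if i not in matched_right:
            # Key exists only on the right: the left side is None.
            result.append((r_row[key], None, r_row))
    return result


left = [{"name": "Alice", "age": 5, "height": 80},
        {"name": "Bob", "age": 5, "height": 80}]
right = [{"name": "Alice", "age": 5, "height": 80},
         {"name": "Cycy", "age": 10, "height": 80}]

joined = full_outer_join(left, right, "name")
for name, l_row, r_row in joined:
    print(name, l_row, r_row)
```

Keeping the two sides as separate dicts sidesteps the repeated `age` and `height` column names that appear in the Spark output above, where both sides share the same non-key columns.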