21-SparkSQL02


Author: CrUelAnGElPG | Published: 2018-09-06 09:36

DataFrame

python pandas

R

RDD MapReduce

DataFrame vs Dataset(1.6)

DS: Java, Scala

DF: 4 languages (Scala, Java, Python, R)

SchemaRDD (< 1.3) ==> DataFrame (>= 1.3)

A Dataset is a distributed collection of data.

Peeling the onion: read the definition one layer at a time

A DataFrame is a distributed collection of data

organized into named columns

conceptually equivalent to a table in a relational database

DataFrame = Dataset[Row]
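The relationship above can be sketched in a few lines of Scala. This is a minimal illustration, assuming Spark 2.x or later is on the classpath and a local master; the `Person` case class and its sample values are hypothetical and only serve the example.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

// Hypothetical record type used only for this illustration.
case class Person(name: String, age: Int, height: Double)

object DatasetVsDataFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DatasetVsDataFrame")
      .master("local[*]") // assumption: local run
      .getOrCreate()
    import spark.implicits._

    // A typed Dataset: each element is a Person.
    val ds: Dataset[Person] = Seq(Person("alice", 30, 1.65), Person("bob", 25, 1.80)).toDS()

    // toDF() drops the static type; each element is now a generic Row.
    val df: DataFrame = ds.toDF()

    // This assignment compiles because DataFrame is a type alias for Dataset[Row].
    val rows: Dataset[Row] = df

    // as[Person] goes back to the typed view.
    val typed: Dataset[Person] = rows.as[Person]

    df.printSchema()
    typed.filter(_.age > 26).show()

    spark.stop()
  }
}
```

With the typed Dataset, a misspelled field such as `_.agee` fails at compile time; on the DataFrame, a misspelled column name is only caught when the query is analyzed at run time.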

DataFrame vs RDD vs Dataset

Concept: collection

API: map, filter, flatMap, ...

Data structure

textFile(path)

RDD[Person]

name age height

spark.sql("").show()
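To make the difference in data structure concrete, the sketch below turns an `RDD[Person]` into a DataFrame and queries it with `spark.sql`. It assumes Spark 2.x or later with a local master; the `Person` fields follow the "name age height" note, while the sample rows and the view name `people` are hypothetical.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical record type; the field names follow the "name age height" note.
case class Person(name: String, age: Int, height: Double)

object SqlOnDataFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SqlOnDataFrame")
      .master("local[*]") // assumption: local run
      .getOrCreate()
    import spark.implicits._

    // An RDD[Person] only knows that it holds Person objects;
    // Spark cannot look inside them to see columns.
    val rdd: RDD[Person] = spark.sparkContext.parallelize(Seq(
      Person("alice", 30, 1.65),
      Person("bob", 25, 1.80)
    ))

    // toDF() exposes the fields as named, typed columns.
    val df = rdd.toDF()
    df.printSchema()

    // Register the DataFrame under a (hypothetical) view name so SQL can refer to it.
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name, height FROM people WHERE age > 26").show()

    spark.stop()
  }
}
```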

Spark SQL entry point

Spark < 2.0: SQLContext, HiveContext

Spark >= 2.0: SparkSession
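A minimal sketch of the Spark 2.x entry point follows. The application name and the local master are assumptions for illustration; in a real deployment the master is usually supplied by `spark-submit`.

```scala
import org.apache.spark.sql.SparkSession

object EntryPoint {
  def main(args: Array[String]): Unit = {
    // Spark 2.x and later: one SparkSession covers what SQLContext and HiveContext did.
    val spark = SparkSession.builder()
      .appName("EntryPoint")   // hypothetical application name
      .master("local[*]")      // assumption: local run
      // .enableHiveSupport()  // HiveContext counterpart; needs Hive classes on the classpath
      .getOrCreate()

    // The older entry points remain reachable from the session.
    val sc = spark.sparkContext
    val sqlContext = spark.sqlContext

    println(s"Spark version: ${spark.version}, app name: ${sc.appName}")
    println(s"Same session behind sqlContext: ${sqlContext.sparkSession eq spark}")

    spark.sql("SELECT 1 AS one").show()
    spark.stop()
  }
}
```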

spark.read.format("json").load(path)

spark.read.format("text").load(path)

spark.read.format("parquet").load(path)

spark.read.format("orc").load(path)
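The `format(...).load(path)` pattern can be tried end to end with the sketch below. It assumes Spark 2.x or later with a local master; the sample rows and the temporary directory are hypothetical and exist only so that the reads have something to load. ORC is left out because older 2.x releases need Hive support for it.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object ReadFormats {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ReadFormats")
      .master("local[*]") // assumption: local run
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data, written to a temporary directory.
    val base = Files.createTempDirectory("read-formats").toString
    val sample = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")
    sample.write.format("json").save(s"$base/json")
    sample.write.format("parquet").save(s"$base/parquet")

    // Generic form: name the source, then load a path.
    val fromJson = spark.read.format("json").load(s"$base/json")
    val fromParquet = spark.read.format("parquet").load(s"$base/parquet")

    // Shorthand for the same JSON read.
    val alsoJson = spark.read.json(s"$base/json")

    // The text source yields a single string column named "value".
    val asText = spark.read.format("text").load(s"$base/json")

    fromJson.printSchema()
    fromParquet.printSchema()
    alsoJson.show()
    asText.show(false)

    spark.stop()
  }
}
```

JSON carries no schema, so Spark infers one by scanning the data; Parquet stores the schema in the files, so the two `printSchema()` outputs may differ in column types.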

In front of the source code, there are no secrets

infos.txt ==> DataFrame

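import spark.implicits._  // brings toDF() into scope (imported automatically in spark-shell)

// Student is a case class with four fields, defined elsewhere
val students = sc.textFile("file:///home/hadoop/data/student.data")
  .map(_.split("\\|"))  // split takes a regex, so the pipe must be escaped
  .map(x => Student(x(0), x(1), x(2), x(3)))
  .toDF()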
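A self-contained version of the same conversion might look like the sketch below. The notes do not show the `Student` definition, so the four field names here are hypothetical, and in-memory sample lines stand in for the data file so that the example runs without it.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical schema: the notes only show that a record has four fields.
case class Student(id: String, name: String, phone: String, email: String)

object StudentsToDF {
  // Parse one pipe-delimited line. split takes a regex, so "|" must be escaped.
  // Note: split drops trailing empty fields; split("\\|", -1) would keep them.
  def parseStudent(line: String): Student = {
    val x = line.split("\\|")
    Student(x(0), x(1), x(2), x(3))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("StudentsToDF")
      .master("local[*]") // assumption: local run
      .getOrCreate()
    import spark.implicits._ // provides toDF() on RDDs of case classes

    // Hypothetical sample lines standing in for the contents of the data file.
    val lines = spark.sparkContext.parallelize(Seq(
      "1|alice|13800000000|alice@example.com",
      "2|bob|13900000000|bob@example.com"
    ))

    // The column names and types are taken from the case class by reflection.
    val students = lines.map(parseStudent).toDF()
    students.printSchema()
    students.show()

    spark.stop()
  }
}
```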

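show()

=> show(20, true): prints the first 20 rows and truncates strings longer than 20 characters

show(5): prints only the first 5 rows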
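The `show` overloads can be compared side by side with a small sketch. It assumes Spark 2.x or later with a local master; the generated 30-row DataFrame and its column name `padded` are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object ShowVariants {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ShowVariants")
      .master("local[*]") // assumption: local run
      .getOrCreate()

    // 30 rows, each with a 30-character string, so both defaults are visible.
    val df = spark.range(30).selectExpr("id", "repeat('x', 30) AS padded")

    df.show()         // same as show(20, true): 20 rows, long strings truncated
    df.show(5)        // first 5 rows, strings still truncated
    df.show(5, false) // first 5 rows, full string values

    spark.stop()
  }
}
```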
