本文针对Notebook的使用,做一个具体的场景示例——航空公司数据导入及分析。

步骤一:创建集群并通过knox账号访问Notebook

创建集群参考: https://help.aliyun.com/document_detail/167621.html,需注意要设置ram子账号及保存好knox密码,登录WebUI时候需要用到。

步骤二:创建Notebook、导入数据、进行数据分析

示例Note下载:CASE2-Airline_Flight_Data.zpln

示例数据下载:airline_statistic_usa.csv

1. 读取OSS数据、打印schema,创建TempView

Load OSS data

%spark

val sparkDF = spark.read.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.load("oss://databricks-demo-hangzhou/airline_statistic_usa.csv")

Print schema

%spark

sparkDF.printSchema()
sparkDF.show()

Create Temp View

%spark

sparkDF.createOrReplaceTempView("usa_flights")
airline读取数据导入等

2. 查询分析:Analysis,Top 10 Average Distance Traveled By Flight Carrier

%sql

SELECT OP_UNIQUE_CARRIER, CAST(AVG(DISTANCE) AS INT) AS AvgDistance
FROM usa_flights
GROUP BY OP_UNIQUE_CARRIER
ORDER BY AvgDistance
DESC
LIMIT 20
分析前10数据

3. 定义UDF:Define UDF

%pyspark

from pyspark.sql.types import IntegerType

def isDelayed(x):
    if x == None:
        return 0
    elif int(x) > 10:
        return 1
    else:
        return 0

spark.udf.register("isDelayed", isDelayed, IntegerType())
Case2-03

4. 使用UDF进行查询:Analysis,Top 10 Total Delayed Flights By Carrier

%sql

SELECT OP_UNIQUE_CARRIER, SUM(isDelayed(DEP_DELAY)) AS NumOfDelayed
FROM usa_flights
GROUP BY OP_UNIQUE_CARRIER
ORDER BY NumOfDelayed
DESC
LIMIT 10
Case3-04