美文网首页
Hive从入门到精通11:Hive实战1

Hive从入门到精通11:Hive实战1

作者: 金字塔下的小蜗牛 | 来源:发表于2020-04-03 08:29 被阅读0次

本节通过实战案例来介绍Hive的相关用法。

测试数据:ml-100k.zip 提取码:n5a9

1.数据格式说明

将上述测试数据的压缩文件解压后,可以看到目录下的数据文件,它们的格式分别为:

ml-data.tar.gz -- Compressed tar file.  To rebuild the u data files do this:
   gunzip ml-data.tar.gz
   tar xvf ml-data.tar
   mku.sh

u.data -- The full u data set, 100000 ratings by 943 users on 1682 items. Each user has rated at least 20 movies. Users and items are numbered consecutively from 1\. The data is randomly ordered.This is a tab separated list of
  user id | item id | rating | timestamp. 
The time stamps are unix seconds since 1/1/1970 UTC.

u.info -- The number of users, items, and ratings in the u data set.

u.item -- Information about the items (movies); this is a tab separated list of 
  movie id | movie title | release date | video release date | 
  IMDb URL | unknown | Action | Adventure | Animation | Children's | 
  Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | 
  Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | 
  Western |  
The last 19 fields are the genres, a 1 indicates the movie is of that genre, a 0 indicates it is not; movies can be in several genres at once. The movie ids are the ones used in the u.data data set.

u.genre -- A list of the genres.

u.user -- Demographic information about the users; this is a tab separated list of
   user id | age | gender | occupation | zip code
The user ids are the ones used in the u.data data set.

u.occupation -- A list of the occupations.

u1.base、u1.test、u2.base、u2.test、u3.base、u3.test、u4.base、u4.test、u5.base、u5.test
-- The data sets u1.base and u1.test through u5.base and u5.test  are 80%/20% splits of the u data into training and test data. Each of u1, ..., u5 have disjoint test sets; this if for 5 fold cross validation (where you repeat your experiment with each training and test set and average the results). These data sets can be generated from u.data by mku.sh.

ua.base、ua.test、ub.base、ua.test 
-- The data sets ua.base, ua.test, ub.base, and ub.test split the u data into a training set and a test set with exactly 10 ratings per user in the test set. The sets ub.test and ub.test are disjoint. These data sets can be generated from u.data by mku.sh.

allbut.pl -- The script that generates training and test sets where all but n of a users ratings are in the training data.

mku.sh -- A shell script to generate all the u data sets from u.data.

2.创建数据库:ml_100k

hive> create database ml_100k;
OK
Time taken: 0.019 seconds
hive> show databases;
OK
default
ml_100k
Time taken: 0.007 seconds, Fetched: 2 row(s)
hive> use ml_100k;
OK
Time taken: 0.01 seconds

3.创建内部表

# 创建u.data表:user id | item id | rating | timestamp
create table u_data ( user_id string,
item_id string,
rating string,
dt string ) row format delimited fields terminated by '\t';

# 创建u.info表:count | typename
create table u_info ( count string,
typename string ) row format delimited fields terminated by ' ';

# 创建u.item表:movie id | movie title | release date | video release date | IMDb URL | unknown | Action | Adventure | Animation | hildren's | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western | 
create table u_item ( movie_id string,
movie_title string,
release_date string,
video_release_date string,
IMDb_URL string,
unknown string,
Action string,
Adventure string,
Animation string,
Childrens string,
Comedy string,
Crime string,
Documentary string,
Drama string,
Fantasy string,
Film_Noir string,
Horror string,
Musical string,
Mystery string,
Romance string,
Sci_Fi string,
Thriller string,
War string,
Western string
) row format delimited fields terminated by '|';

# 创建u.genre表:genre name | genre id
create table u_genre ( genre_name string,
genre_id string ) row format delimited fields terminated by '|';

# 创建u.user表:user id | age | gender | occupation | zip code
create table u_user ( user_id string,
age string,
gender string,
occupation string,
zip_code string ) row format delimited fields terminated by '|';

# 创建u.occupation表:occupation name
create table u_occupation ( occupation_name string ) 
row format delimited fields terminated by '\n';

# 创建u1.base表:user id | item id | rating | timestamp
create table u1_base ( user_id string,
item_id string,
rating string,
dt string ) row format delimited fields terminated by '\t';

# 创建u1.test表:user id | item id | rating | timestamp
create table u1_test ( user_id string,
item_id string,
rating string,
dt string ) row format delimited fields terminated by '\t';

# 创建u2.base表:user id | item id | rating | timestamp
create table u2_base ( user_id string,
item_id string,
rating string,
dt string ) row format delimited fields terminated by '\t';

# 创建u2.test表:user id | item id | rating | timestamp
create table u2_test ( user_id string,
item_id string,
rating string,
dt string ) row format delimited fields terminated by '\t';

# 创建u3.base表:user id | item id | rating | timestamp
create table u3_base ( user_id string,
item_id string,
rating string,
dt string ) row format delimited fields terminated by '\t';

# 创建u3.test表:user id | item id | rating | timestamp
create table u3_test ( user_id string,
item_id string,
rating string,
dt string ) row format delimited fields terminated by '\t';

# 创建u4.base表:user id | item id | rating | timestamp
create table u4_base ( user_id string,
item_id string,
rating string,
dt string ) row format delimited fields terminated by '\t';

# 创建u4.test表:user id | item id | rating | timestamp
create table u4_test ( user_id string,
item_id string,
rating string,
dt string ) row format delimited fields terminated by '\t';

# 创建u5.base表:user id | item id | rating | timestamp
create table u5_base ( user_id string,
item_id string,
rating string,
dt string ) row format delimited fields terminated by '\t';

# 创建u5.test表:user id | item id | rating | timestamp
create table u5_test ( user_id string,
item_id string,
rating string,
dt string ) row format delimited fields terminated by '\t';

# 创建ua.base表:user id | item id | rating | timestamp
create table ua_base ( user_id string,
item_id string,
rating string,
dt string ) row format delimited fields terminated by '\t';

# 创建ua.test表:user id | item id | rating | timestamp
create table ua_test ( user_id string,
item_id string,
rating string,
dt string ) row format delimited fields terminated by '\t';

# 创建ub.base表:user id | item id | rating | timestamp
create table ub_base ( user_id string,
item_id string,
rating string,
dt string ) row format delimited fields terminated by '\t';

# 创建ub.test表:user id | item id | rating | timestamp
create table ub_test ( user_id string,
item_id string,
rating string,
dt string ) row format delimited fields terminated by '\t';

hive> show tables;
OK
u1_base
u1_test
u2_base
u2_test
u3_base
u3_test
u4_base
u4_test
u5_base
u5_test
u_data
u_genre
u_info
u_item
u_occupation
u_user
ua_base
ua_test
ub_base
ub_test
Time taken: 0.01 seconds, Fetched: 20 row(s)

3.从本地导入数据

# 导入u.data数据
load data local inpath '/root/input/data/ml-100k/u.data' overwrite into table u_data;

# 导入u.info数据
load data local inpath '/root/input/data/ml-100k/u.info' overwrite into table u_info;

# 导入u.item数据
load data local inpath '/root/input/data/ml-100k/u.item' overwrite into table u_item;

# 导入u.genre数据
load data local inpath '/root/input/data/ml-100k/u.genre' overwrite into table u_genre;

# 导入u.user数据
load data local inpath '/root/input/data/ml-100k/u.user' overwrite into table u_user;

# 导入u.occupation数据
load data local inpath '/root/input/data/ml-100k/u.occupation' overwrite into table u_occupation;

# 导入u1.base数据
load data local inpath '/root/input/data/ml-100k/u1.base' overwrite into table u1_base;

# 导入u1.test数据
load data local inpath '/root/input/data/ml-100k/u1.test' overwrite into table u1_test;

# 导入u2.base数据
load data local inpath '/root/input/data/ml-100k/u2.base' overwrite into table u2_base;

# 导入u2.test数据
load data local inpath '/root/input/data/ml-100k/u2.test' overwrite into table u2_test;

# 导入u3.base数据
load data local inpath '/root/input/data/ml-100k/u3.base' overwrite into table u3_base;

# 导入u3.test数据
load data local inpath '/root/input/data/ml-100k/u3.test' overwrite into table u3_test;

# 导入u4.base数据
load data local inpath '/root/input/data/ml-100k/u4.base' overwrite into table u4_base;

# 导入u4.test数据
load data local inpath '/root/input/data/ml-100k/u4.test' overwrite into table u4_test;

# 导入u5.base数据
load data local inpath '/root/input/data/ml-100k/u5.base' overwrite into table u5_base;

# 导入u5.test数据
load data local inpath '/root/input/data/ml-100k/u5.test' overwrite into table u5_test;

# 导入ua.base数据
load data local inpath '/root/input/data/ml-100k/ua.base' overwrite into table ua_base;

# 导入ua.test数据
load data local inpath '/root/input/data/ml-100k/ua.test' overwrite into table ua_test;

# 导入ub.base数据
load data local inpath '/root/input/data/ml-100k/ub.base' overwrite into table ub_base;

# 导入ub.test数据
load data local inpath '/root/input/data/ml-100k/ub.test' overwrite into table ub_test;

4.查询数据

## 设置打印表头
hive> set hive.cli.print.header=true;

## 查看表结构
hive> desc u_data;
OK
col_name        data_type       comment
user_id                 string
item_id                 string
rating                  string
dt                      string
Time taken: 0.022 seconds, Fetched: 4 row(s)

## 查看表的数据量
hive> select count(*) from u_user;
hive> select count(1) from u_user;
OK
_c0
943
Time taken: 70.972 seconds, Fetched: 1 row(s)

## 查看表的全部数据
hive> select * from u_info;
OK
u_info.count    u_info.typename
943     users
1682    items
100000  ratings
Time taken: 0.062 seconds, Fetched: 3 row(s)

## 查看表的部分数据(使用limit限制查询的数据量)
hive> select * from u_data limit 10;
OK
u_data.user_id  u_data.item_id  u_data.rating   u_data.dt
196     242     3       881250949
186     302     3       891717742
22      377     1       878887116
244     51      2       880606923
166     346     1       886397596
298     474     4       884182806
115     265     2       881171488
253     465     5       891628467
305     451     3       886324817
6       86      3       883603013
Time taken: 0.065 seconds, Fetched: 10 row(s)

## 使用distinct查看不重复数据
hive> select distinct gender from u_user;
OK
gender
F
M
Time taken: 73.046 seconds, Fetched: 2 row(s)

## 使用group by聚合数据
## 案例:查看男女用户的数量。
hive> select gender,count(1) as cnt from u_user group by gender;
OK
gender  cnt
F       273
M       670
Time taken: 73.219 seconds, Fetched: 2 row(s)

## 使用order by排序数据
## 案例:查看总得分最高的前10部电影。
hive> select item_id,sum(rating) as total_score from u_data group by item_id order by total_score desc limit 10;
OK
item_id total_score
50      2541.0
100     2111.0
181     2032.0
258     1936.0
174     1786.0
127     1769.0
286     1759.0
1       1753.0
98      1673.0
288     1645.0
Time taken: 155.293 seconds, Fetched: 10 row(s)

## 使用子查询
## 案例:查看男生总评分最高的前10部电影。
select item_id,sum(rating) as total_score from u_data 
where user_id in (select user_id from u_user where gender = 'M') group by item_id order by total_score desc limit 10;
item_id total_score
50      1900.0
100     1609.0
181     1535.0
258     1431.0
174     1398.0
127     1335.0
1       1302.0
56      1271.0
98      1241.0
7       1197.0
Time taken: 121.045 seconds, Fetched: 10 row(s)

## 使用join多表查询
## 案例:查看最多5分好评的电影名字。
select t2.movie_title from
( select item_id, count(1) as total_score from u_data where rating = '5' group by item_id order by total_score desc limit 1 ) t1
join u_item t2 on t1.item_id = t2.movie_id;
OK
t2.movie_title
Star Wars (1977)
Time taken: 169.247 seconds, Fetched: 1 row(s)

相关文章

网友评论

      本文标题:Hive从入门到精通11:Hive实战1

      本文链接:https://www.haomeiwen.com/subject/mhdkdhtx.html