sql系列——hive之array、map、struct、java函数（udf）、python函数、分隔符、json

sql系列——hive之array、map、struct、java函数（udf）、python函数、分隔符、json_tuple的处理

发布日期：2021-09-30 09:33:45 浏览次数：1 分类：技术文章

本文共 2852 字，大约阅读时间需要 9 分钟。

原始数据

1 huangbo guangzhou,xianggang,shenzhen a1:30,a2:20,a3:100 beijing,112233,13522334455,5002 xuzheng xianggang b2:50,b3:40 tianjin,223344,13644556677,6003 wangbaoqiang beijing,zhejinag c1:200 chongqinjg,334455,15622334455,20

表：

use class;create table cdt(id int, name string, work_location array
   
    , piaofang map
    
     , address struct
     
      ) row format delimited fields terminated by "\t" collection items terminated by "," map keys terminated by ":" lines terminated by "\n";

1、数据类型处理

1.1、array数组：选取全部，按索引选取

1.2、map：查询全部，查询某个键

1.3、struct：选取全部，选取某个字段

2、函数、udf、udaf、udtf

2.1、显示自定义函数的信息

show functions;desc function substr;desc function extended substr;

2.2、udf、udaf、udtf

当 Hive 提供的内置函数无法满足业务处理需要时，此时就可以考虑使用用户自定义函数。

UDF（user-defined function）作用于单个数据行，产生一个数据行作为输出。（数学函数，字符串函数）

UDAF（用户定义聚集函数 User- Defined Aggregation Funcation）：接收多个输入数据行，并产生一个输出数据行。（count，max）

UDTF（表格生成函数 User-Defined Table Functions）：接收一行输入，输出多行（explode）

2.3、java中使用jar包方式使用udf

A、ToLowerCase.java：

import org.apache.hadoop.hive.ql.exec.UDF;public class ToLowerCase extends UDF{        // 必须是 public，并且 evaluate 方法可以重载    public String evaluate(String field) {    String result = field.toLowerCase();    return result;    }    }

B、打成 jar 包上传到服务器

C、将 jar 包添加到 hive 的 classpath

add JAR /home/hadoop/udf.jar;

D、查看jar包

list jar

E、创建临时函数与开发好的 class 关联起来

create temporary function tolowercase as 'com.study.hive.udf.ToLowerCase';

F、至此，便可以在 hql 在使用自定义的函数

2.4、使用transform+python方法创建函数

先编辑一个 python 脚本文件

########python######代码## vi weekday_mapper.py#!/bin/pythonimport sysimport datetimefor line in sys.stdin: line = line.strip() movie,rate,unixtime,userid = line.split('\t') weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday() print '\t'.join([movie, rate, str(weekday),userid])

保存文件然后，将文件加入 hive 的 classpath，其中rate是\t分隔的表：

hive>add file /home/hadoop/weekday_mapper.py;hive> insert into table lastjsontable select transform(movie,rate,unixtime,userid)using 'python weekday_mapper.py' as(movie,rate,weekday,userid) from rate;

创建最后的用来存储调用 python 脚本解析出来的数据的表：lastjsontable

create table lastjsontable(movie int, rate int, weekday int, userid int) row format delimitedfields terminated by '\t';

最后查询看数据是否正确

select distinct(weekday) from lastjsontable;

3、特殊符号（分隔符）的处理

Hive 对文件中字段的分隔符默认情况下只支持单字节分隔符，如果数据文件中的分隔符是多字符的，如下所示：

01||huangbo

02||xuzheng

03||wangbaoqiang

创建表

create table t_bi_reg(id string,name string)row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'with serdeproperties('input.regex'='(.*)\\|\\|(.*)','output.format.string'='%1$s %2$s')stored as textfile;

导入数据并查询

0: jdbc:hive2://hadoop3:10000> load data local inpath '/home/hadoop/data.txt' into table t_bi_reg;No rows affected (0.747 seconds)0: jdbc:hive2://hadoop3:10000> select a.* from t_bi_reg a;