
Hive SQL Notes and Summary

Enter the Hive shell: hive, or hive --service cli

Hive startup modes:
CLI mode: run the /hive/bin/hive executable directly, or hive --service cli
Web UI: hive --service hwi
Remote service (port 10000): hive --service hiveserver
Remote service in the background (keeps running after the terminal closes): nohup hive --service hiveserver &

Show all functions: hive> show functions;
Show a function's usage: hive> describe function substr;
Show how many MapReduce jobs Hive uses for a query: hive> explain select a.id from tbname a;
--------------------------------------------------------------------------
Table structure operations: managed and external tables
A managed table moves its data into Hive's warehouse directory; an external table does not.
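Beyond the short help shown above, Hive also supports an EXTENDED form of DESCRIBE FUNCTION that additionally prints example calls (the exact output varies by Hive version); a small sketch:

```sql
-- Short help for a built-in function
DESCRIBE FUNCTION substr;

-- EXTENDED form also prints example invocations (output varies by version)
DESCRIBE FUNCTION EXTENDED substr;
```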

Rule of thumb: if all processing is done by Hive, use managed tables; if the same dataset is shared between Hive and other tools, use external tables.
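A minimal sketch of the difference (the table names and the HDFS path /data/ext_demo are illustrative, not from these notes): dropping a managed table deletes both metadata and data, while dropping an external table removes only the metadata and leaves the files in place.

```sql
-- Managed table: DROP removes the metadata and the files under the warehouse dir
CREATE TABLE managed_demo (id INT, name STRING);

-- External table: DROP removes only the metadata; the files stay at LOCATION
-- ('/data/ext_demo' is a hypothetical HDFS path)
CREATE EXTERNAL TABLE ext_demo (id INT, name STRING)
LOCATION '/data/ext_demo';

DROP TABLE managed_demo;   -- data is deleted
DROP TABLE ext_demo;       -- /data/ext_demo is left untouched
```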

Create a table (usually stored as textfile):
hive> create table tbName (id int, name string) stored as textfile;
Create a table with a field delimiter (the data being loaded must use this delimiter, otherwise fields load as NULL; missing columns default to NULL):
hive> create table tbName (id int, name string) row format delimited fields terminated by ',';
Create an external table:
hive> create external table extbName (id int, name string);
Create a table with a single partition column ds (a partitioned table declares its partition space when the table is created):
hive> create table tbName2 (id int, name string) partitioned by (ds string);
Create a table with two partition columns:
hive> create table tbname3 (id int, content string) partitioned by (day string, hour string);
Add a column:
hive> alter table tbName add columns (new_col int);
Add a column with a comment:
hive> alter table tbName add columns (new_col2 int comment 'a comment');
Rename a table:
hive> alter table tbName rename to tbName3;
Drop a table (removes the metadata; for a managed table the data is removed too):
hive> drop table tbName;
Delete only the data (keep the metadata, remove the data files):
hive> dfs -rmr 'warehouse/my-table';
Drop a partition (its metadata and data are removed together):
hive> alter table tbname2 drop partition (ds='2008-08-08', hour='09');
--------------------------------------------------------------------------
Metadata and loading (moving data from HDFS into a table is an instantaneous operation):
Load data from a local file into a table (the file must have a suffix; missing columns default to NULL):
hive> load data local inpath 'myTest.txt' overwrite into table tbName;
Add a single partition to an existing table and point it at data:
hive> alter table tbname2 add partition (ds='20120701') location '/user/hadoop/his_trans/record/20120701';
Add two partitions to an existing table and point them at data:
hive> alter table tbname2 add partition (ds='2008-08-08', hour='08') location '/path/pv1.txt' partition (ds='2008-08-08', hour='09') location '/path/pv2.txt';
Add a partition without specifying data (metadata only):
hive> alter table tbname2 add partition (ds='2013-12-12');
Syntax for loading data into a partitioned table (no transformation is applied when loading; the Load operation simply copies the data to the table's location) [not recommended]:
hive> load data local inpath 'part.txt' overwrite into table tbName2 partition (ds='2013-12-12');
hive> load data inpath '/user/hadoop/*' into table tbname3 partition (day='2008-08-08', hour='08');
--------------------------------------------------------------------------
SQL operations:
Describe a table:
hive> describe tbname;
hive> desc tbname;
List all tables:
hive> show tables;
List tables matching a regular expression:
hive> show tables '.*s';
Select all rows (launches no MapReduce job):
hive> select * from tbName;
Select a single column (launches a MapReduce job):
hive> select a.id from tbname a;
Partition-based query:
hive> select a.* from tbname2 a where a.ds='2013-12-12';
Show the partitions of a table:
hive> show partitions tbname2;
avg/sum/count/group by/order by (desc)/limit:
hive> select logdate, count(logdate) as count from access_1 group by logdate order by count limit 5;
Inner join:
hive> SELECT sales.*, things.* FROM sales JOIN things ON (sales.id = things.id);
Outer joins:
hive> SELECT sales.*, things.* FROM sales LEFT OUTER JOIN things ON (sales.id = things.id);
hive> SELECT sales.*, things.* FROM sales RIGHT OUTER JOIN things ON (sales.id = things.id);
hive> SELECT sales.*, things.* FROM sales FULL OUTER JOIN things ON (sales.id = things.id);
IN queries: Hive does not support IN subqueries, but LEFT SEMI JOIN can be used instead:
hive> SELECT * FROM things LEFT SEMI JOIN sales ON (sales.id = things.id);
This is equivalent to the SQL statement: SELECT * FROM things WHERE things.id IN (SELECT id FROM sales);
Map join: Hive can load the smaller table into each mapper's memory to perform the join:
hive> SELECT /*+ MAPJOIN(things) */ sales.*, things.* FROM sales JOIN things ON (sales.id = things.id);
INSERT OVERWRITE TABLE ... SELECT (the target table must already exist):
hive> FROM records2
    > INSERT OVERWRITE TABLE stations_by_year SELECT year, COUNT(DISTINCT station) GROUP BY year
    > INSERT OVERWRITE TABLE records_by_year SELECT year, COUNT(1) GROUP BY year
    > INSERT OVERWRITE TABLE good_records_by_year SELECT year, COUNT(1) WHERE temperature != 9999 AND (quality = 0 OR quality = 1 OR quality = 4 OR quality = 5 OR quality = 9) GROUP BY year;
CREATE TABLE ... AS SELECT (the target table must not already exist):
hive> CREATE TABLE target AS SELECT col1, col2 FROM source;
Create a view:
hive> CREATE VIEW valid_records AS SELECT * FROM records2 WHERE temperature != 9999;
View details of a view:
hive> DESCRIBE EXTENDED valid_records;
--------------------------------------------------------------------------
Write query results to an HDFS directory:
hive> insert overwrite directory '/tmp/hdfs_out' select a.* from tbname2 a where a.ds='2013-12-12';
Write query results to a local directory:
hive> insert overwrite local directory '/tmp/local_out' select ds, count(1) from tbname group by ds;
hive> insert overwrite table events select a.* from tbname a where a.id < 100;
hive> insert overwrite local directory '/tmp/sum' select sum(a.pc) from tbpc a;
Insert the aggregated results of one table into another:
hive> from tbname a insert overwrite table events select a.bar, count(1) where a.foo > 0 group by a.bar;
hive> insert overwrite table events select a.bar, count(1) from tbname a where a.foo > 0 group by a.bar;
JOIN:
hive> from tbname t1 join tbname2 t2 on (t1.id = t2.id) insert overwrite table events select t1.id, t2.ds;
Insert from one source table into multiple destinations:
FROM src
INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200
INSERT OVERWRITE TABLE dest3 PARTITION (ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300;
Stream data through an external script:
hive> FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat' WHERE a.ds > '2008-08-09';
This streams the data in the map phase through the script /bin/cat (like Hadoop Streaming). Similarly, streaming can be used on the reduce side (see the Hive tutorial for examples).
--------------------------------------------------------------------------
### Errors ###
Problem: all loaded data is NULL.
Cause: a field-delimiter mismatch, so deserialization of the data fails; the field delimiter must be declared when the table is defined.
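A minimal sketch of the fix (table and file names are illustrative): if the file is comma-separated, the table must declare that delimiter, otherwise Hive splits rows on its default \001 (Ctrl-A) delimiter and the fields fail to deserialize.

```sql
-- Data file 'myTest.txt' contains lines like:  1,Alice
-- Without ROW FORMAT, Hive splits on the default \001 delimiter, so the
-- whole line is offered to the first column; an INT column then fails to
-- parse and loads as NULL.
CREATE TABLE tb_ok (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH 'myTest.txt' OVERWRITE INTO TABLE tb_ok;
SELECT * FROM tb_ok;  -- fields now parse instead of loading as NULL
```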
