Hive >> mail # user >> how to handling complex log file(compressed, 200G)


Re: how to handling complex log file(compressed, 200G)
Hi Kiwon Lee

There isn't anything specific you need to do in Hive DDL or DML to parse gz files. You only need to ensure that 'org.apache.hadoop.io.compress.GzipCodec' is available in the 'io.compression.codecs' property in core-site.xml.
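
One quick way to check the codec setting from the Hive CLI is to print the property. This is only a sketch and assumes the Hadoop configuration is visible to your Hive session:

```sql
-- Prints the current value of the property, or reports it as undefined.
-- GzipCodec should appear in the comma-separated list if it is set explicitly.
SET io.compression.codecs;
```

Keep in mind that gzip files are not splittable, so each .gz file is read by a single mapper.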

To parse log files you can use RegexSerDe. A sample DDL for loading Apache log files can be found at
https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-ApacheWeblogData
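
For the log format in your mail, a RegexSerDe table could look roughly like the sketch below. The table name log_raw, the LOCATION path and the regex are my assumptions, not tested against your data. The contrib RegexSerDe only supports STRING columns, so the key/value part is kept as a string and turned into a map at query time with str_to_map. Depending on your Hive version you may need to ADD JAR the hive-contrib jar first, or use 'org.apache.hadoop.hive.serde2.RegexSerDe' instead.

```sql
-- Hypothetical table over the raw .gz log files (plain text, so TEXTFILE).
-- Expected line shape: 127.0.0.1 [2012Avg08] "a=abc&b=adf&c=aadfad"
CREATE EXTERNAL TABLE log_raw (
  ip STRING,
  dt STRING,
  kv STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  -- One capture group per column; the regex must match the whole line.
  "input.regex" = "([^ ]*) \\[([^\\]]*)\\] \"([^\"]*)\""
)
STORED AS TEXTFILE
LOCATION '/path/to/logs';  -- placeholder path

-- Split the third column on '&' and '=' and look up key 'b'.
SELECT str_to_map(kv, '&', '=')['b']
FROM log_raw
LIMIT 10;
```

Lines that do not match the regex come back as NULL columns, so it is worth checking a small sample first.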
You can create a partitioned table by using the 'PARTITIONED BY' clause when creating the table. A sample DDL is below:

CREATE TABLE page_view(viewTime INT, userid BIGINT,
                    page_url STRING, referrer_url STRING,
                    ip STRING COMMENT 'IP Address of the User')
    COMMENT 'This is the page view table'
    PARTITIONED BY(dt STRING, country STRING)
    ROW FORMAT DELIMITED
            FIELDS TERMINATED BY '1'
    STORED AS SEQUENCEFILE;

If your data is already partitioned in HDFS, you can create a partitioned table and then add partitions to it with the 'ALTER TABLE ADD PARTITION' statement, specifying the directory that corresponds to each partition.
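
As an illustration against the page_view table above, adding one existing directory as a partition could look like this. The partition values and the path are placeholders I made up:

```sql
-- Register an existing HDFS directory as one partition of page_view.
-- Partition values and LOCATION are placeholders; use your own layout.
ALTER TABLE page_view ADD PARTITION (dt='2012-08-18', country='US')
LOCATION '/path/to/page_view/dt=2012-08-18/country=US';
```

You would repeat the statement once per partition directory.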

If the data is not partitioned in HDFS but you would like it to be partitioned in Hive, take a look at Dynamic Partition Insert.
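
A minimal sketch of a dynamic partition insert into page_view, assuming a hypothetical unpartitioned staging table page_view_stg that carries dt and country as ordinary columns:

```sql
-- Enable dynamic partitioning; nonstrict lets every partition column be dynamic.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The partition columns (dt, country) must come last in the SELECT list,
-- in the same order as in the PARTITION clause.
INSERT OVERWRITE TABLE page_view PARTITION (dt, country)
SELECT viewTime, userid, page_url, referrer_url, ip, dt, country
FROM page_view_stg;  -- hypothetical staging table
```

Hive then creates one partition for each distinct (dt, country) pair it finds in the staging data.
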
Regards
Bejoy KS

Sent from handheld, please excuse typos.

-----Original Message-----
From: Kiwon Lee <[EMAIL PROTECTED]>
Date: Sat, 18 Aug 2012 00:29:20
To: <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
Subject: how to handling complex log file(compressed, 200G)

Hi,

I have complex log files (compressed ".gz", 200G) on HDFS.

+ log file format :
127.0.0.1 [2012Avg08] "a=abc&b=adf&c=aadfad"

I think the DDL would be something like:
CREATE TABLE log_tb (ip STRING, dt STRING, kv Map<STRING, STRING>)
ROW FORMAT SERDE "??"
STORED AS SEQUENCEFILE;

I want the results below.
SELECT kv['b']
FROM log_tb
LIMIT 10;

1) How do I parse a complex log file (compressed ".gz", 200G)?

2) If I have to use a SerDe, which SerDe should I use?

3) Is there an existing SerDe (input/output) for this, or do I need a user-defined class?

4) If I want to partition the log data, what DDL and DML should I use? Please share sample SQL (DDL, DML).
Thanks.
