Hive >> mail # user >> loading logfile into hive tables using certain format


Re: loading logfile into hive tables using certain format
I tried importing Apache2 logs into Hive a few weeks ago and took a look
at SERDEPROPERTIES, but it was too complicated, and pasting others'
demos didn't work.

Then I came up with another solution: apache2 log -> Apache Pig (for
ETL) -> Hive external table. But I ran into a problem with Pig (which
was later solved with Cheolsoo Park's help), so I finally picked up
NodeJS.

Node is actually quite great; some advanced features of JavaScript
make ETL a lot easier and more flexible, but it can't benefit from Hadoop.

So, to sum up, I suggest you do some ETL on your logs before importing
them into Hive.

1. ETL with Apache Pig:
--- Pig script
read = LOAD '/home/test/input/apacheLog'
USING PigStorage(' ')
--- space as separator
AS (
          ip:CHARARRAY
        , identity:CHARARRAY
        , name:CHARARRAY
        , date:CHARARRAY
        , timezone:CHARARRAY
        , method:CHARARRAY
        , path:CHARARRAY
        , protocol:CHARARRAY
        , status:CHARARRAY
        , size:CHARARRAY
);
data = FOREACH read GENERATE
          ip
        , REPLACE(date,'\\[','')
--- Be careful with [ here: it must be escaped, because otherwise it
--- causes a regex warning and Pig throws away the whole field.
--- (Not documented anywhere.) REF: http://goo.gl/g1x1q
        , REPLACE(timezone,']','')
        , REPLACE(method,'"','')
        , path
        , REPLACE(protocol,'"','')
        , status
        , size;
STORE data INTO '/home/test/output/apacheLog' USING PigStorage(' ');
2. Import into Hive as external tables or external partitions.
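For step 2, the external table might look roughly like this (an untested sketch: the table name and column names are my own, and the LOCATION assumes the Pig STORE path used above):

```sql
-- Sketch: external table over the space-delimited Pig output above.
-- Column names are illustrative; `date` is avoided as a column name.
CREATE EXTERNAL TABLE apache_log (
  ip       STRING,
  log_date STRING,
  timezone STRING,
  method   STRING,
  path     STRING,
  protocol STRING,
  status   STRING,
  size     STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
STORED AS TEXTFILE
LOCATION '/home/test/output/apacheLog';
```

Since it is an external table, dropping it leaves the files in place, so Pig can keep writing fresh output under the same path (or into per-day directories you add as partitions).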

On Tue, Sep 4, 2012 at 1:08 PM, Elaine Gan <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I would like to seek help on loading logfiles into Hive tables.
>
> I learnt from the "Getting Started" page that we could create Hive
> tables as follows to import an apachelog into it.
> ------------------------------------------------------------------
> CREATE TABLE apachelog (
>   host STRING,
>   identity STRING,
>   user STRING,
>   time STRING,
>   request STRING,
>   status STRING,
>   size STRING,
>   referer STRING,
>   agent STRING)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
> WITH SERDEPROPERTIES (
>   "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?",
>   "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
> )
> STORED AS TEXTFILE;
> ------------------------------------------------------------------
>
> I was trying to do the same thing, but changing the value of my
> output.format.string; let's say I only need host, user, and request.
>
> CREATE TABLE apachelog (
>   host STRING,
>   user STRING,
>   request STRING)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
> WITH SERDEPROPERTIES (
>   "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?",
>   "output.format.string" = "%1$s %3$s %5$s"
> )
> STORED AS TEXTFILE;
>
> My questions are:
> (1) I specified only the %1, %3, and %5 variables to be loaded into my
> table columns, but it looks like Hive loads the first three variables
> instead (%1 %2 %3).
> Is there no way to make Hive load only the columns I want?
>
> (2) How can I skip lines which do not match the input.regex pattern?
>
> Thank you.
>
> lai
>
>
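On the quoted question (2): if you do the ETL in Pig as above, lines that don't fit the pattern can simply be filtered out before Hive ever sees them. A rough, untested sketch continuing from the `data` alias in the script above (the output path is illustrative):

```pig
-- Sketch: keep only rows whose status field is purely numeric;
-- malformed lines end up with a garbled or null status and are dropped.
good = FILTER data BY status MATCHES '[0-9]+';
STORE good INTO '/home/test/output/apacheLogClean' USING PigStorage(' ');
```

MATCHES tests the whole field against a Java regex, and a null field fails the filter, so genuinely broken rows fall out as well.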