Hive user mailing list: Table creation for logfile data
Baahu 2013-11-23, 13:36
Re: Table creation for logfile data
Put the logfile into a location on HDFS and create an external table
pointing to that location. The external table should just have one column,
a string:

CREATE EXTERNAL TABLE logfile_etl (message STRING) LOCATION '/etl/logfile';

I think that should work.
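
As a quick sanity check (a minimal sketch, assuming the files are already
sitting under /etl/logfile), you can confirm Hive sees the raw lines:

SELECT message FROM logfile_etl LIMIT 5;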

Then create another table:

CREATE TABLE logfile (ts STRING, ADD STRING, files STRING) PARTITIONED BY
(DAY STRING) STORED AS ORC;
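
If you want to be explicit about the codec (a hedged variant of the same DDL;
ZLIB is already the ORC default, SNAPPY trades a little size for speed), you
can name it in TBLPROPERTIES:

CREATE TABLE logfile (ts STRING, ADD STRING, files STRING)
PARTITIONED BY (DAY STRING)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='ZLIB');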

Copy the files into /etl/logfile.
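
If the files are already on HDFS somewhere else, an alternative (a sketch,
with /staging/logfile as a hypothetical source path) is to let Hive move them
for you; note that LOAD DATA INPATH moves the files rather than copying them:

LOAD DATA INPATH '/staging/logfile' INTO TABLE logfile_etl;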

Run this Hive script:

SET hive.exec.dynamic.partition=true;

SET hive.exec.dynamic.partition.mode=nonstrict;

SET hive.exec.max.dynamic.partitions=10000;

SET hive.exec.max.dynamic.partitions.pernode=1000;

INSERT INTO TABLE logfile PARTITION (day)
select substring(message, 0, 17) as ts,
  regexp_extract(message, '\\[([^\\]]+)\\]') as ADD,
  regexp_extract(message, '\\] \\[([^\\]]+)\\]') as files,
  concat('20', substring(message, 0, 8)) as day
from logfile_etl;
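
To verify the load (a sketch, assuming the sample date from the original
mail), list the partitions that dynamic partitioning created and query one of
them; the WHERE on the partition column lets Hive prune to a single directory:

SHOW PARTITIONS logfile;

SELECT ts, files FROM logfile WHERE day = '2013-11-23' LIMIT 10;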

Delete the files in /etl/logfile (or move them to an archive location).

That will get you a day-partitioned ORC table (with compression and ORC
goodness). I added the '20' in front of your date so the string sorts well,
although it probably would without it; it's early and I have not had coffee
yet. The regexes are a little messy, but based on your one line of data they
should work. Also: if you have data from pre-2000, the concat('20', ...) part
obviously needs to be updated. Note that I didn't use a regex on the date.
Why? It appears to be properly padded data, so a substring is fast. There are
many ways to skin this cat, so your way may be totally different from mine,
but this is how I'd approach it long term. (If it's a one-time thing, I might
not create the managed table, but if I did, having partitions and ORC files
will make things faster.) If there are syntax errors, I apologize; see the
earlier disclaimer about the lack of proper bean-sourced stimulants.
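
If you want to eyeball the regexes before running the full insert (a
throwaway sketch, assuming the external table above and passing the
capture-group index explicitly), run them against a few raw lines first:

SELECT regexp_extract(message, '\\[([^\\]]+)\\]', 1) AS add_field,
       regexp_extract(message, '\\] \\[([^\\]]+)\\]', 1) AS files_field
FROM logfile_etl LIMIT 5;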

On Sat, Nov 23, 2013 at 7:36 AM, Baahu <[EMAIL PROTECTED]> wrote:

> Hi,
> I have a messy log file which I want to use to create a table. I am only
> interested in retrieving 3 columns (time, ADD, files), which are in bold.
> Sample entry from log file
> *: 13-11-23 06:23:45 [ADD] [file1.zip|file2.zip] *  junkjunk|2013-11-23
> 06:23:44:592 EST|file3.zip xyz|2013-11-23 06:23:44:592 EST|file3.zip
>
> Can you please let me know how I should go about this? Regex seems to be
> the way out, but I am struggling with that as well!!
>
> Thanks,
> Baahu
>
>
Follow-ups in this thread:
Baahu 2013-11-24, 18:13
John Omernik 2013-11-25, 00:51
郭士伟 2013-11-25, 05:34
郭士伟 2013-11-25, 05:36