Table creation for logfile data (Hive user mailing list)


Re: Table creation for logfile data (from John Omernik)
I wouldn't worry about efficiency too much:

concat('20', split(date_field, '\\/')[2], '-', split(date_field, '\\/')[1],
'-', split(date_field, '\\/')[0]) as proper_date -- YYYY-MM-DD
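With the sample value from your mail, 22/11/13, that works out as:

-- date_field = '22/11/13'
-- split('22/11/13', '\\/')  => ["22", "11", "13"]
-- proper_date               => '2013-11-22'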
On Sun, Nov 24, 2013 at 12:13 PM, Baahu <[EMAIL PROTECTED]> wrote:

> Hi John,
> Thanks for the reply. I have been given a new format of data, and now the
> logs aren't as messy as they were earlier, but yes, your mail gave me
> pointers which helped me in handling the new data.
>
> Now I am stuck while handling a date format: I am getting dates in the
> form 22/11/13, which is dd/mm/yy, and I have to rearrange this to
> yyyy/mm/dd. Can you please shed some light on this? I think we need to
> use split() to get the tokens and then rearrange, but I am not able to
> think of an efficient way to do this.
>
> Thanks.
>
>
> On Sun, Nov 24, 2013 at 5:25 PM, John Omernik <[EMAIL PROTECTED]> wrote:
>
>> Put the logfile into a location on HDFS, and create an external table
>> pointing to that location. The external table should just have one column,
>> a string:
>> CREATE EXTERNAL TABLE logfile_etl (message STRING) LOCATION
>> '/etl/logfile';
>>
>> I think that should work.
>>
>> Then create another table:
>>
>> CREATE TABLE logfile (ts STRING, `add` STRING, files STRING) PARTITIONED BY
>> (day STRING) STORED AS ORC;
>>
>> Copy the log files into /etl/logfile.
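>>
>> For example, from a shell (the file name here is just a placeholder):
>>
>> hadoop fs -put server.log /etl/logfile/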
>>
>> Then run this Hive script:
>>
>> SET hive.exec.dynamic.partition=true;
>>
>> SET hive.exec.dynamic.partition.mode=nonstrict;
>>
>> SET hive.exec.max.dynamic.partitions=10000;
>>
>> SET hive.exec.max.dynamic.partitions.pernode=1000;
>>
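>> Before running the INSERT below, you can eyeball the regexes on a few
>> rows; the expected values come from your sample line:
>>
>> select regexp_extract(message, '\\[([^\\]]+)\\]', 1),
>>        regexp_extract(message, '\\] \\[([^\\]]+)\\]', 1)
>> from logfile_etl limit 5;
>> -- the sample line gives: ADD and file1.zip|file2.zip
>>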
>> INSERT INTO TABLE logfile PARTITION (day)
>> select substring(message, 0, 17) as ts,
>> regexp_extract(message, '\\[([^\\]]+)\\]', 1) as `add`,
>> regexp_extract(message, '\\] \\[([^\\]]+)\\]', 1) as files,
>> concat('20', substring(message, 0, 8)) as day
>> from logfile_etl;
>>
>> Then delete the files in /etl/logfile (or move them to an archival location).
>>
>> That will get you a day-partitioned ORC table, with compression and ORC
>> goodness. (I added the 20 in front of your date so that the string sorts
>> well, although it probably would without it; it's early, and I have not
>> had coffee yet.) The regexes are a little messy, but based on your one
>> line of data they should work. Also: if you have data from pre-2000,
>> obviously the concat('20' ...) bit needs to be updated. Note, I didn't
>> use a regex on the date... why? It appears to be properly padded data,
>> therefore a substring is fast. This type of stuff has so many ways to
>> skin a cat, so your way may be totally different from my way, but this
>> is how I'd approach it long term. (If it's a one-time thing, I might not
>> create the managed table, but if you do, having partitions and ORC files
>> will make things faster.) If there are syntax errors I apologize; see
>> the earlier disclaimer about the lack of proper bean-sourced stimulants.
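>>
>> Once it's loaded, a query that filters on the partition column only
>> reads that day's files, e.g. (date taken from your sample line):
>>
>> select count(*) from logfile where day = '2013-11-23';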
>>
>>
>>
>>
>>
>> On Sat, Nov 23, 2013 at 7:36 AM, Baahu <[EMAIL PROTECTED]> wrote:
>>
>>> Hi,
>>> I have a messy log file which I want to use to create a table. I am only
>>> interested in retrieving 3 columns (time, ADD, files), which are in bold.
>>> Sample entry from the log file:
>>> *: 13-11-23 06:23:45 [ADD] [file1.zip|file2.zip] *  junkjunk|2013-11-23
>>> 06:23:44:592 EST|file3.zip xyz|2013-11-23 06:23:44:592 EST|file3.zip
>>>
>>> Can you please let me know how I should go about this? Regex seems to be
>>> the way out, but I am struggling with that as well!!
>>>
>>> Thanks,
>>> Baahu
>>>
>>>
>>
>
>
> --
> Twitter: http://twitter.com/Baahu
>
>