Hive user mailing list: Table creation for logfile data


Baahu 2013-11-23, 13:36
John Omernik 2013-11-24, 11:55
Baahu 2013-11-24, 18:13

Re: Table creation for logfile data
I wouldn't worry about efficiency too much:

concat('20', split(date_field, '\\/')[2], '-', split(date_field, '\\/')[1],
'-', split(date_field, '\\/')[0]) as proper_date -- YYYY-MM-DD
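If you'd rather not juggle array indexes, Hive's built-in date functions can
do the same rearrangement; a minimal sketch, assuming the field is always
dd/MM/yy like your example:

select from_unixtime(unix_timestamp(date_field, 'dd/MM/yy'), 'yyyy-MM-dd')
  as proper_date  -- 22/11/13 -> 2013-11-22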
On Sun, Nov 24, 2013 at 12:13 PM, Baahu <[EMAIL PROTECTED]> wrote:

> Hi John,
> Thanks for the reply. I have been given a new format of data and now the
> logs aren't as messy as they were earlier, but yes, your mail gave me
> pointers which helped me in handling the new data.
>
> Now I am stuck while handling a date format: I am getting dates in the
> form 22/11/13, which is dd/mm/yy, and I have to rearrange this to
> yyyy/mm/dd. Can you please shed some light on this? I think we need to use
> split() to get the tokens and then rearrange, but I am not able to think
> of an efficient way to do this.
>
> Thanks.
>
>
> On Sun, Nov 24, 2013 at 5:25 PM, John Omernik <[EMAIL PROTECTED]> wrote:
>
>> Put the logfile into a location on HDFS, and create an external table
>> pointing to that location. The external table should just have one
>> column, a string:
>>
>> CREATE EXTERNAL TABLE logfile_etl (message STRING) LOCATION
>> '/etl/logfile';
>>
>> I think that should work.
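>>
>> A quick sanity check on the external table (just a sketch; it should show
>> each raw line landing in the single string column):
>>
>> SELECT message FROM logfile_etl LIMIT 5;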
>>
>> Then create another table:
>>
>> -- backticks because ADD is a Hive keyword
>> CREATE TABLE logfile (ts STRING, `add` STRING, files STRING) PARTITIONED
>> BY (day STRING) STORED AS ORC;
>>
>> copy your log files into /etl/logfile
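>>
>> If the files already sit somewhere on HDFS, you can also let Hive move
>> them into the table location; a sketch with a hypothetical source path:
>>
>> LOAD DATA INPATH '/tmp/raw_logs/logfile.log' INTO TABLE logfile_etl;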
>>
>> then run this Hive script:
>>
>> SET hive.exec.dynamic.partition=true;
>>
>> SET hive.exec.dynamic.partition.mode=nonstrict;
>>
>> SET hive.exec.max.dynamic.partitions=10000;
>>
>> SET hive.exec.max.dynamic.partitions.pernode=1000;
>>
>> INSERT INTO TABLE logfile PARTITION (day)
>> select substring(message, 0, 17) as ts,
>> regexp_extract(message, '\\[([^\\]]+)\\]', 1) as `add`,
>> regexp_extract(message, '\\] \\[([^\\]]+)\\]', 1) as files,
>> concat('20', substring(message, 0, 8)) as day
>> from logfile_etl;
>>
>> then delete the files in /etl/logfile (or move them to an archival location)
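>>
>> If you stay inside the Hive CLI, that cleanup can be done with a dfs
>> command; a sketch (the archive path is hypothetical and must already
>> exist on HDFS):
>>
>> dfs -mv /etl/logfile/* /etl/archive/;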
>>
>> That will get you a day-partitioned ORC table (with compression and ORC
>> goodness). I added the '20' in front of your date so that the string
>> sorts well, although it probably would without it; it's early, and I have
>> not had coffee yet. The regexes are a little messy, but based on your one
>> line of data, they should work. Also: if you have data from pre-2000,
>> obviously the concat('20', ...) thing needs to be updated. Note, I didn't
>> use a regex on the date... why? It appears to be properly padded data,
>> therefore a substring is fast. This type of stuff has so many ways to
>> skin a cat, so your way may be totally different from my way, but this is
>> how I'd approach it long term. (If it's a one-time thing, I might not
>> create the managed table, but if so, having partitions and ORC files will
>> make things faster.) If there are syntax errors I apologize; see earlier
>> disclaimer about lack of proper bean-sourced stimulants.
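>>
>> One way to test the two regexes before running the full insert is to
>> point them at your sample line as a string literal; a sketch (Hive 0.13+
>> allows SELECT without a FROM clause; on older versions add FROM
>> some_table LIMIT 1):
>>
>> SELECT regexp_extract(': 13-11-23 06:23:45 [ADD] [file1.zip|file2.zip]',
>>     '\\[([^\\]]+)\\]', 1) as add_field,    -- expect: ADD
>>   regexp_extract(': 13-11-23 06:23:45 [ADD] [file1.zip|file2.zip]',
>>     '\\] \\[([^\\]]+)\\]', 1) as files_field; -- expect: file1.zip|file2.zip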
>>
>>
>>
>>
>>
>> On Sat, Nov 23, 2013 at 7:36 AM, Baahu <[EMAIL PROTECTED]> wrote:
>>
>>> Hi,
>>> I have a messy log file which I want to use to create a table. I am only
>>> interested in retrieving 3 columns (time, ADD, files), which are in bold.
>>> Sample entry from the log file:
>>> *: 13-11-23 06:23:45 [ADD] [file1.zip|file2.zip] *  junkjunk|2013-11-23
>>> 06:23:44:592 EST|file3.zip xyz|2013-11-23 06:23:44:592 EST|file3.zip
>>>
>>> Can you please let me know how I should go about this? Regex seems to be
>>> the way out, but I am struggling with that as well!
>>>
>>> Thanks,
>>> Baahu
>>>
>>>
>>
>
>
> --
> Twitter: http://twitter.com/Baahu
>
>
郭士伟 2013-11-25, 05:34
郭士伟 2013-11-25, 05:36