Re: Table creation for logfile data
John Omernik 2013-11-25, 00:51
I wouldn't worry about efficiency too much:
concat('20', split(date_field, '\\/')[2], '-', split(date_field, '\\/')[1],
'-', split(date_field, '\\/')[0]) as proper_date -- YYYY-MM-DD
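(If you want to sanity-check the rearrangement outside Hive first, the same shuffle is easy to mimic; a minimal Python sketch, assuming all dates are post-2000 and zero-padded like your sample:)

```python
def proper_date(date_field):
    # Mirror of the Hive expression: split dd/mm/yy on "/" and
    # reassemble as YYYY-MM-DD, prefixing "20" to the two-digit year.
    dd, mm, yy = date_field.split("/")
    return "20{}-{}-{}".format(yy, mm, dd)

print(proper_date("22/11/13"))  # 2013-11-22
```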
On Sun, Nov 24, 2013 at 12:13 PM, Baahu <[EMAIL PROTECTED]> wrote:
> Hi John,
> Thanks for the reply. I have been given a new format of data and now the
> logs aren't as messy as they were earlier, but yes, your mail gave me
> pointers which helped me in handling the new data.
> Now I am stuck while handling a date format: I am getting dates in the
> form 22/11/13, which is dd/mm/yy, and I have to rearrange this to
> yyyy/mm/dd. Can you please shed some light on this? I think we need to
> use split() to get the tokens and then rearrange, but I am not able to
> think of an efficient way to do this.
> On Sun, Nov 24, 2013 at 5:25 PM, John Omernik <[EMAIL PROTECTED]> wrote:
>> Put the logfile into a location on HDFS, and create an external table
>> pointing to that location. The external table should just have one
>> column, a string:
>> CREATE EXTERNAL TABLE logfile_etl (message STRING) LOCATION
>> I think that should work.
>> Then create another table:
>> CREATE TABLE logfile (ts STRING, ADD STRING, files STRING) PARTITIONED BY
>> (DAY STRING) STORED AS ORC;
>> Copy the files into /etl/logfile, then run this hive file:
>> SET hive.exec.dynamic.partition=true;
>> SET hive.exec.dynamic.partition.mode=nonstrict;
>> SET hive.exec.max.dynamic.partitions=10000;
>> SET hive.exec.max.dynamic.partitions.pernode=1000;
>> INSERT INTO TABLE logfile
>> SELECT substring(message_line, 1, 17) as ts,
>> regexp_extract(message_line, '\\[([^\\]]+)\\]', 1) as ADD,
>> regexp_extract(message_line, '\\] \\[([^\\]]+)\\]', 1) as files,
>> concat('20', substring(message_line, 1, 8)) as day
>> FROM logfile_etl;
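(The two bracket patterns are easy to sanity-check against your sample line outside Hive; here is a quick Python sketch using the same regexes, minus the HiveQL string escaping, and plain slices in place of substring():)

```python
import re

# First 46 chars of the sample log line from the earlier mail.
line = "13-11-23 06:23:45 [ADD] [file1.zip|file2.zip] junk..."

ts = line[0:17]                                        # substring(message_line, 1, 17)
add = re.search(r"\[([^\]]+)\]", line).group(1)        # first bracketed field
files = re.search(r"\] \[([^\]]+)\]", line).group(1)   # second bracketed field
day = "20" + line[0:8]                                 # concat('20', substring(..., 1, 8))

print(ts, add, files, day)
# 13-11-23 06:23:45 ADD file1.zip|file2.zip 2013-11-23
```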
>> Then delete the files in /etl/logfile (or move them to an archive).
>> That will get you a day-partitioned ORC table (with compression and ORC
>> goodness). I added the 20 in front of your date so that the string sorts
>> well, although it probably would without it; it's early, and I have not
>> had coffee yet. The regexes are a little messy, but based on your one
>> line of data they should work. Also: if you have data from pre-2000,
>> obviously the concat('20', ...) thing needs to be updated. Note, I
>> didn't use a regex on the date... why? It appears to be properly padded
>> data, therefore a substring is fast. This type of stuff has so many ways
>> to skin a cat, so your way may be totally different from my way, but
>> this is how I'd approach it long term. (If it's a one-time thing, I may
>> not create the managed table, but if so, having partitions and ORC files
>> will make things faster.) If there are syntax errors I apologize; see
>> earlier disclaimer about lack of proper bean-sourced stimulants.
>> On Sat, Nov 23, 2013 at 7:36 AM, Baahu <[EMAIL PROTECTED]> wrote:
>>> I have a messy log file which I want to use to create a table. I am only
>>> interested in retrieving 3 columns (time, ADD, files), which are in bold.
>>> Sample entry from log file
>>> *: 13-11-23 06:23:45 [ADD] [file1.zip|file2.zip] * junkjunk|2013-11-23
>>> 06:23:44:592 EST|file3.zip xyz|2013-11-23 06:23:44:592 EST|file3.zip
>>> Can you please let me know how I should go about this? Regex seems to be
>>> the way out, but I am struggling with that as well!!