HDFS >> mail # user >> Concatenate adjacent lines with hadoop


Concatenate adjacent lines with hadoop
Hi

Please find below the issue I need to solve. Thank you in advance for your
help and tips.

I have log files in which log lines are sometimes split (this happens when
a log line exceeds a specific length):

Dec 16 21:47:20 d.14b48e47-abf2-403e-8a1a-04e821a42cb6 app[web.3]
LOGTAG<TAB>FIELD-0<TAB>....<TAB>FIELD-MAX
Dec 16 21:47:20 d.14b48e47-abf2-403e-8a1a-04e821a42cb6 app[web.3]
LOGTAG<TAB>FIELD-0<TAB>....<TAB>FIELD-N      <======= log line is split here
Dec 16 21:47:20 d.14b48e47-abf2-403e-8a1a-04e821a42cb6 app[web.3]
FIELD-N<TAB>FIELD-N+1 .....FIELD-MAX

Can I "reconcile"/"concatenate" split log lines with a Hadoop MapReduce
job?

In other words, using a MapReduce job, can I concatenate the following 2
adjacent lines (provided that I can detect them)

Dec 16 21:47:20 d.14b48e47-abf2-403e-8a1a-04e821a42cb6 app[web.3]
LOGTAG<TAB>FIELD-0<TAB>....<TAB>FIELD-N      <======= log line is split here
Dec 16 21:47:20 d.14b48e47-abf2-403e-8a1a-04e821a42cb6 app[web.3]
FIELD-N<TAB>FIELD-N+1 .....FIELD-MAX

into

Dec 16 21:47:20 d.14b48e47-abf2-403e-8a1a-04e821a42cb6 app[web.3]
LOGTAG<TAB>FIELD-0<TAB>....<TAB>FIELD-N<TAB>FIELD-N+1 .....FIELD-MAX
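
To make the intended transformation concrete, here is a minimal sketch in plain Python. It shows only the merge logic on an ordered stream of lines; it is not a Hadoop job. It rests on several assumptions that are not confirmed above: every line starts with a header of the form "<month> <day> <time> <source> <process>" followed by one space, a complete line's payload starts with the literal LOGTAG, and a continuation line carries the same header as the line it continues. The names HEADER and merge_split_lines are hypothetical.

```python
import re

# Assumed header layout: "<Mon> <day> <HH:MM:SS> <source> <process>",
# followed by a single space and then the payload.
HEADER = re.compile(r"^(\S+\s+\d+\s+\d{2}:\d{2}:\d{2}\s+\S+\s+\S+) (.*)$")


def merge_split_lines(lines, tag="LOGTAG"):
    """Yield log lines, appending each continuation to the line before it.

    Assumption: a line is a continuation when its payload does not start
    with `tag` and its header equals the header of the previous line.
    """
    pending = None  # (header, payload) of the line that may still be continued
    for raw in lines:
        line = raw.rstrip("\n")
        match = HEADER.match(line)
        if match is None:
            # Unrecognized layout: flush what we have and pass the line through.
            if pending is not None:
                yield pending[0] + " " + pending[1]
                pending = None
            yield line
            continue
        header, payload = match.groups()
        if (pending is not None
                and not payload.startswith(tag)
                and header == pending[0]):
            # Continuation: glue the payloads together with no separator,
            # because the split falls in the middle of a field.
            pending = (header, pending[1] + payload)
        else:
            if pending is not None:
                yield pending[0] + " " + pending[1]
            pending = (header, payload)
    if pending is not None:
        yield pending[0] + " " + pending[1]
```

The sketch relies on seeing the lines in their original order. In a MapReduce job, the two halves of a split line can fall into different input splits, so that case would need separate handling that is not covered here.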

Thank you!