Hadoop >> mail # user >> What if file format is dependent upon first few lines?


What if file format is dependent upon first few lines?
Below is a fake sample of a Microsoft IIS log:
#Software: Microsoft Internet Information Services 7.5
#Version: 1.0
#Date: 2013-07-04 20:00:00
#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) sc-status sc-substatus sc-win32-status time-taken
2013-07-04 20:00:00 1.1.1.1 GET /test.gif xxx 80 - 2.2.2.2 someuserAgent 200 0 0 390
2013-07-04 20:00:00 1.1.1.1 GET /test.gif xxx 80 - 3.3.3.3 someuserAgent 200 0 0 390
2013-07-04 20:00:00 1.1.1.1 GET /test.gif xxx 80 - 4.4.4.4 someuserAgent 200 0 0 390
...

The first four lines describe the file format, which is required to parse
each log line. This means the log file can NOT simply be split; otherwise
the second and later splits would lose the "file format" information.
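
To make the dependency concrete, here is a minimal sketch in plain Java (no
Hadoop classes) of how the #Fields directive drives parsing. The class and
method names are made up for illustration, and it assumes that values are
separated by whitespace and never contain spaces themselves, as in the
sample above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Hadoop or IIS: shows that a log line
// cannot be interpreted without the "#Fields:" directive from the header.
public class IisFieldsSketch {

    // Extracts the column names from a "#Fields:" directive line.
    public static String[] parseFields(String directive) {
        String prefix = "#Fields:";
        if (!directive.startsWith(prefix)) {
            throw new IllegalArgumentException("Not a #Fields directive: " + directive);
        }
        return directive.substring(prefix.length()).trim().split("\\s+");
    }

    // Pairs each whitespace-separated value with its column name.
    // Assumption: values never contain spaces.
    public static Map<String, String> parseLine(String[] fields, String line) {
        String[] values = line.trim().split("\\s+");
        if (values.length != fields.length) {
            throw new IllegalArgumentException(
                "Expected " + fields.length + " values but got " + values.length);
        }
        Map<String, String> record = new LinkedHashMap<>();
        for (int i = 0; i < fields.length; i++) {
            record.put(fields[i], values[i]);
        }
        return record;
    }

    public static void main(String[] args) {
        String[] fields = parseFields(
            "#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port "
            + "cs-username c-ip cs(User-Agent) sc-status sc-substatus "
            + "sc-win32-status time-taken");
        Map<String, String> record = parseLine(fields,
            "2013-07-04 20:00:00 1.1.1.1 GET /test.gif xxx 80 - 2.2.2.2 "
            + "someuserAgent 200 0 0 390");
        // Prints "2.2.2.2 200" for the sample line above.
        System.out.println(record.get("c-ip") + " " + record.get("sc-status"));
    }
}
```

A mapper that only sees a data line has no way to build the fields array,
which is exactly the problem with a naive split.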

How could each mapper get the first few lines in the file?
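
One possible direction, sketched here in plain Java only as an illustration:
whatever processes a split could first read the leading "#" lines from the
start of the same file, and only then handle its own byte range. The class
and method names are hypothetical. In a real job the reader would presumably
come from opening the split's file at offset 0 (for example via
FileSplit.getPath() and FileSystem.open()), which is not shown here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: collects the header directives of an IIS-style log.
public class HeaderReaderSketch {

    // Reads lines from a reader positioned at the START of the file and
    // returns the leading lines that begin with "#". It stops at the first
    // data line, so only the small header is read, not the whole file.
    public static List<String> readHeader(BufferedReader fromFileStart) {
        List<String> header = new ArrayList<>();
        try {
            String line;
            while ((line = fromFileStart.readLine()) != null && line.startsWith("#")) {
                header.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return header;
    }

    public static void main(String[] args) {
        // Stand-in for the real log file; a StringReader keeps the sketch
        // self-contained.
        String file =
            "#Software: Microsoft Internet Information Services 7.5\n"
            + "#Version: 1.0\n"
            + "#Date: 2013-07-04 20:00:00\n"
            + "#Fields: date time s-ip cs-method\n"
            + "2013-07-04 20:00:00 1.1.1.1 GET\n";
        List<String> header = readHeader(new BufferedReader(new StringReader(file)));
        // Prints 4: the four directive lines, without the data line.
        System.out.println(header.size());
    }
}
```

The cost of this idea is one extra small read per split. It also assumes
the directives appear only once, at the top of the file.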