Hadoop >> mail # user >> Looking for advice on file structure...


Looking for advice on file structure...

   All,

   I'm looking for a little bit of advice on how to format files.

   The problem is that I have log files from a number of different sources.  The data elements in the log files overlap by about 80%, but each log file also has unique data items that I want to keep and be able to access from my Map/Reduce jobs.  There also isn't a single obvious key to the log file entries.  A quick example would be two different log files.  Log file 1 has 3 columns of data types A,B,C and is tab delimited.  Log file 2 has data types A,B,C,D and is pipe delimited.  I'd like to pre-process them into files where the map/reduce job can consistently access data element A across both types of log files and also access element D if it exists.
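
   To make the example concrete, here is a rough sketch of the kind of pre-processing I have in mind. It is only an illustration: the schema names (tab_log, pipe_log), the helper parse_line and the sample values are all made up, and A-D stand in for the real field names.

```python
# Hypothetical column layouts for the two example log files.
# The names "tab_log"/"pipe_log" and the fields A-D are placeholders.
SCHEMAS = {
    "tab_log": {"delimiter": "\t", "fields": ["A", "B", "C"]},
    "pipe_log": {"delimiter": "|", "fields": ["A", "B", "C", "D"]},
}


def parse_line(source, line):
    """Turn one raw log line into a dict keyed by field name."""
    schema = SCHEMAS[source]
    values = line.rstrip("\n").split(schema["delimiter"])
    return dict(zip(schema["fields"], values))


# Made-up sample lines, one per source.
rec1 = parse_line("tab_log", "a1\tb1\tc1\n")
rec2 = parse_line("pipe_log", "a2|b2|c2|d2\n")

# Element A is read the same way regardless of the source format.
print(rec1["A"], rec2["A"])
# Element D is only there for the pipe-delimited source; .get() returns None otherwise.
print(rec1.get("D"), rec2.get("D"))
```

   With records in this shape, a job could read A the same way for every source and treat D as optional.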

    I suspect the best answer would be to pre-process the files into a common file format that allows for variable data values within a log line.   What I'm wondering is: has anyone else solved this type of problem, and did you find a solution you liked?

   So far I've been looking at using SequenceFiles.  There isn't a logical key, so my thought was to just use a line number as the key in the sequence file, similar to what the default text input format does, although that feels a little weird.  For the value, since I want somewhat arbitrary key/value pairs, my thought was to store a serialized HashMap.
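
   Roughly, the record layout I'm picturing looks like the sketch below. It is not real SequenceFile code: a plain Python list stands in for the file, JSON stands in for the serialized HashMap, and the helper names to_records and read_field are invented for the illustration.

```python
import json


def to_records(parsed_lines):
    """Yield (line_number, serialized_map) pairs, one per log line.

    The line number is only an arbitrary key; JSON is a stand-in for
    whatever map serialization would actually be used.
    """
    for line_number, fields in enumerate(parsed_lines, start=1):
        yield line_number, json.dumps(fields, sort_keys=True)


def read_field(serialized, name, default=None):
    """Deserialize one value and look up a field that may be absent."""
    return json.loads(serialized).get(name, default)


# Made-up parsed lines: one from each kind of log file.
parsed = [
    {"A": "a1", "B": "b1", "C": "c1"},
    {"A": "a2", "B": "b2", "C": "c2", "D": "d2"},
]
records = list(to_records(parsed))

for key, value in records:
    # A is always present; D falls back to None when the source lacks it.
    print(key, read_field(value, "A"), read_field(value, "D"))
```

   The key carries no meaning here, which is the part that feels odd to me; all of the useful data lives in the value.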

    Any thoughts on whether I'm trying to re-invent the wheel here or going off in a strange direction?

    Thanks

    Andy