Looking for advice on file structure...
Andy Sautins 2009-08-18, 20:47
All,

I'm looking for a little bit of advice on how to format files. The problem is that I have log files from a number of different sources. The data elements in the log files overlap by about 80%, but there are unique data items in each of the log files that I want to keep and be able to access from my Map/Reduce jobs. There also isn't a single obvious key for the log file entries.

A quick example would be two different log files. Log file 1 has 3 columns of data types A, B, C and is tab delimited. Log file 2 has data types A, B, C, D and is pipe delimited. I'd like to pre-process them into files where, in the Map/Reduce job, I could consistently access data element A across both types of log files and also access element D if it exists.

I suspect the best answer would be to pre-process the files into a common file format that allows for variable data values within a log line. What I'm wondering is, has anyone else solved this type of problem, and did you find a solution you liked?

Where I've been looking so far is SequenceFiles. There isn't a logical key, so my thought was to just use a line number as the key in the sequence file, similar to the default map file input format, although that feels a little weird. For the value, since I want somewhat arbitrary key/values, my thought was to just store a serialized HashMap.

Any thoughts as to whether I'm trying to re-invent the wheel here or going off in a strange direction?

Thanks,
Andy
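
To make the pre-processing idea in the post concrete, here is a minimal sketch in plain Java, not a definitive implementation. It turns a line from either example log format into a map keyed by element name, so element A is read the same way from both sources and element D is present only where the source has it. The class name `LogNormalizer`, the `parse` helper, the column lists, and the sample lines are all hypothetical. The sketch deliberately leaves out the Hadoop side (writing the records to a SequenceFile), which would depend on the cluster setup.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class LogNormalizer {

    // Hypothetical column layouts for the two example log formats in the post.
    static final String[] TAB_COLUMNS = {"A", "B", "C"};
    static final String[] PIPE_COLUMNS = {"A", "B", "C", "D"};

    /** Splits one log line on a literal delimiter and labels each field by column name. */
    static Map<String, String> parse(String line, String delimiter, String[] columns) {
        // Pattern.quote makes the delimiter literal ("|" is a regex metacharacter);
        // the -1 limit keeps trailing empty fields instead of dropping them.
        String[] fields = line.split(Pattern.quote(delimiter), -1);
        Map<String, String> record = new LinkedHashMap<>();
        for (int i = 0; i < columns.length && i < fields.length; i++) {
            record.put(columns[i], fields[i]);
        }
        return record;
    }

    public static void main(String[] args) {
        // Made-up sample lines, one per log format.
        Map<String, String> fromTab = parse("a1\tb1\tc1", "\t", TAB_COLUMNS);
        Map<String, String> fromPipe = parse("a2|b2|c2|d2", "|", PIPE_COLUMNS);

        // Element A is accessed the same way regardless of the source format.
        System.out.println(fromTab.get("A"));
        System.out.println(fromPipe.get("A"));

        // Element D exists only for the pipe-delimited source.
        System.out.println(fromTab.containsKey("D"));
        System.out.println(fromPipe.get("D"));
    }
}
```

A map like this corresponds to the "serialized HashMap" value described in the post. A job reading the records would check `containsKey("D")` before using element D.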