Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
MapReduce, mail # user - Simple normalizing of data


Copy link to this message
-
Simple normalizing of data
Tim Robertson 2009-12-01, 09:05
Hi all,

I am processing a large tab file to format it suitable for loading
into a database with a predefined schema.
I have a tab file with a column that I need to normalize out to
another table and reference it with a foreign key from the original
file.  I would like to hear if my proposed processing is suitable:

Original file has:  ID, .... Institution_Code ....
I need to normalize out the distinct Institution Codes to a separate
table and have an Institution_Code_ID in the original.

Job 1: Generate a file of Institution_Code:<ID>
Job 2: Use output of Job1 and using the row number as the
Institution_Code_ID, generate ID:rowNumber
Job 3: do a reduce side join of the original input and the output of
job 2 to generate
   ID, ..... InstitutionCodeID ....

Job 1 output can then be used to populate the  institutionCodes table,
and Job 3 will have the appropriate FK so can populate the main table.

Is this approach sane?  Are there better approaches?

Many thanks,
Tim