Simple normalizing of data
Hi all,

I am processing a large tab-separated file to get it into a format
suitable for loading into a database with a predefined schema.
The file has a column that I need to normalize out into a separate
table and reference with a foreign key from the original rows.  I
would like to hear whether my proposed processing is suitable:

Original file has:  ID, .... Institution_Code ....
I need to normalize out the distinct Institution Codes to a separate
table and have an Institution_Code_ID in the original.

Job 1: Generate a file of Institution_Code:<ID> pairs.
Job 2: Using the output of Job 1, with the row number as the
Institution_Code_ID, generate ID:rowNumber pairs.
Job 3: Do a reduce-side join of the original input and the output of
Job 2 to generate
   ID, ..... InstitutionCodeID ....
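
To make the plan concrete, here is a rough sketch of what I have in
mind for Job 1, assuming purely for illustration that the ID is the
first tab-separated column and Institution_Code the sixth (the column
positions and class names are made up):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ExtractInstitutionCodes {

  // Emit (Institution_Code, original row ID) for every line of the tab file.
  public static class CodeMapper extends Mapper<Object, Text, Text, Text> {
    private final Text code = new Text();
    private final Text rowId = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] cols = value.toString().split("\t", -1);
      if (cols.length > 5) {          // illustrative column positions
        rowId.set(cols[0]);
        code.set(cols[5]);
        context.write(code, rowId);
      }
    }
  }

  // Write one "Institution_Code<TAB>ID" line per pair, grouped by code.
  public static class CodeReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text code, Iterable<Text> ids, Context context)
        throws IOException, InterruptedException {
      for (Text id : ids) {
        context.write(code, id);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "extract institution codes");
    job.setJarByClass(ExtractInstitutionCodes.class);
    job.setMapperClass(CodeMapper.class);
    job.setReducerClass(CodeReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}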

Job 1's output can then be used to populate the institutionCodes table,
and Job 3's output will have the appropriate FK, so it can populate the
main table.
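
For Job 2, the simplest way I can see to get a stable row number is to
force a single reducer and keep a running counter.  A rough sketch,
assuming Job 1 wrote lines of "Institution_Code<TAB>ID" (again, class
and field names are just for illustration):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AssignInstitutionCodeIds {

  // Re-key Job 1's output by institution code so each code arrives grouped.
  public static class CodeIdMapper extends Mapper<Object, Text, Text, Text> {
    private final Text code = new Text();
    private final Text originalId = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split("\t", -1);
      if (parts.length >= 2) {
        code.set(parts[0]);
        originalId.set(parts[1]);
        context.write(code, originalId);
      }
    }
  }

  // Single reducer: each new code gets the next row number, and every
  // original ID under that code is written as "ID<TAB>Institution_Code_ID".
  public static class RowNumberReducer
      extends Reducer<Text, Text, Text, LongWritable> {
    private long rowNumber = 0;
    private final LongWritable codeId = new LongWritable();

    @Override
    protected void reduce(Text code, Iterable<Text> originalIds, Context context)
        throws IOException, InterruptedException {
      rowNumber++;                    // one Institution_Code_ID per distinct code
      codeId.set(rowNumber);
      for (Text id : originalIds) {
        context.write(id, codeId);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "assign institution code ids");
    job.setJarByClass(AssignInstitutionCodeIds.class);
    job.setMapperClass(CodeIdMapper.class);
    job.setReducerClass(RowNumberReducer.class);
    job.setNumReduceTasks(1);         // one reducer so the counter is a global row number
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}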

Is this approach sane?  Are there better approaches?

Many thanks,
Tim