Sqoop >> mail # user >> How does one preprocess the data so that they can be exported using sqoop


Re: How does one preprocess the data so that they can be exported using sqoop
Hi Matthieu,
since Sqoop imports/exports data as-is, relying on natural keys usually does not work as expected. You should prepare your data with proper surrogate keys and then use Sqoop as a bulk load tool to move everything to or from the database.
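Just to make the idea concrete, here is a rough single-machine sketch of that preparation step (in practice you would do the same thing as a mapreduce/pig/hive job over HDFS). It assumes each event line can be parsed into an event type plus a JSON properties map, and the field and file names simply follow the example from your first mail:

    # Sketch only: split one event file into user / product / user_product
    # files with generated surrogate keys, ready to be picked up by
    # "sqoop export". Parsing assumes "<eventType>\t<JSON properties>"
    # per line -- adapt it to your real event format.
    import json

    def prepare(event_file, out_dir):
        user_ids, product_ids = {}, {}   # natural key -> surrogate key

        def surrogate(table, natural_key):
            # reuse the key if this natural key was seen before, else assign the next id
            return table.setdefault(natural_key, len(table) + 1)

        with open(event_file) as events, \
             open(out_dir + "/user.tsv", "w") as users, \
             open(out_dir + "/product.tsv", "w") as products, \
             open(out_dir + "/user_product.tsv", "w") as links:
            for line in events:
                event_type, raw = line.rstrip("\n").split("\t", 1)
                props = json.loads(raw)
                if event_type == "user":
                    uid = surrogate(user_ids, props["name"])
                    users.write("%d\t%s\t%s\n" % (uid, props["name"], props.get("address", "")))
                elif event_type == "product":
                    pid = surrogate(product_ids, props["ref"])
                    products.write("%d\t%s\t%s\n" % (pid, props["ref"], props.get("color", "")))
                elif event_type == "productActivity":
                    uid = surrogate(user_ids, props["user"])
                    pid = surrogate(product_ids, props["product"])
                    links.write("%d\t%d\t%s\n" % (uid, pid, props.get("action", "")))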

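Once the three prepared files are on HDFS (one directory per table), each of them can be bulk loaded with a plain export along the lines of: sqoop export --connect jdbc:postgresql://<host>/<db> --table user_product --export-dir /prepared/user_product --input-fields-terminated-by '\t'. Those are the standard export options; the connection string, table name and path are placeholders for your setup.
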
Jarcec

On Mon, May 20, 2013 at 09:27:52AM -0700, Matthieu Labour wrote:
> Jarcec
>
> Thank you for your response
>
> May I ask you for a tip on best practices with respect to keys? When preparing
> the data for Sqoop, do people end up using natural keys? Or do folks
> generate surrogate keys while preparing the data?
>
> As an example:
>
> instead of having
> user: naturalUserIdentifier, address ....
> product: productRef, productProperties...
> user_product: naturalUserIdentifier, productRef
>
> I would like
> user: userSurrogateKey, userNaturalIdentifier, address ....
> product: productSurrogateKey, productRef, productProperties...
> user_product: userSurrogateKey, productSurrogateKey
>
> Thanks a lot
> -matt
>
>
> On Sun, May 19, 2013 at 11:37 PM, Jarek Jarcec Cecho <[EMAIL PROTECTED]> wrote:
>
> > Hi Matthieu,
> > Sqoop is currently a highly specialized EL (extract-load) tool, not a
> > generic ETL (extract-transform-load) tool. Thus you need to execute a custom
> > mapreduce/pig/hive job that separates the three different logical
> > tables and prepares the data into a format that Sqoop can process.
> >
> > Jarcec
> >
> > On Fri, May 17, 2013 at 05:44:03PM -0400, Matthieu Labour wrote:
> > > Hi
> > >
> > > I would be grateful for any tips on how to "prepare" the data so they can
> > > be exported to a PostgreSQL database using Sqoop.
> > >
> > > As an example:
> > >
> > > Given some files of events (user events, product events,
> > > productActivity events):
> > >
> > > [file0001]
> > > event:user properties:{name:"john" ...}
> > > event:product properties:{ref:123,color:"blue",...
> > > event:productActivity properties:{user:"john", product:"ref", action:"buy"}
> > > .....
> > >
> > > How does one come up with the primary keys and the user_product join
> > > table ready to be exported?
> > >
> > > In other words:
> > >
> > > function(Input:eventfile) => output:[productFile, userFile,
> > > user_productFile with auto-generated primary keys]
> > >
> > > What goes into the function?
> > >
> > > I hope this makes sense!
> > >
> > > Thank you in advance for any help
> > >
> > > -matt
> >