Sqoop, mail # user - How does one preprocess the data so that they can be exported using sqoop


Re: How does one preprocess the data so that they can be exported using sqoop
Matthieu Labour 2013-05-20, 16:27
Jarcec

Thank you for your response

May I ask for a tip on best practices with respect to keys? When preparing
the data for Sqoop, do people end up using natural keys, or do folks
generate surrogate keys while preparing the data?

As an example:

instead of having
user: naturalUserIdentifier, address ....
product: productRef, productProperties...
user_product: naturalUserIdentifier, productRef

I would like
user: userSurrogateKey, userNaturalIdentifier, address ....
product: productSurrogateKey, productRef, productProperties...
user_product: userSurrogateKey, productSurrogateKey
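
To make the second layout concrete, here is a minimal single-process sketch in Python of assigning surrogate keys and rewriting the join table to use them. The sample rows, the function name assign_surrogate_keys, and the choice of sequential integers starting at 1 are all assumptions made for illustration; a distributed job would need some other way to hand out unique ids.

```python
from itertools import count


def assign_surrogate_keys(natural_keys, start=1):
    """Map each distinct natural key to a sequential integer, in first-seen order."""
    mapping = {}
    counter = count(start)
    for key in natural_keys:
        if key not in mapping:
            mapping[key] = next(counter)
    return mapping


# Hypothetical sample rows, keyed by natural identifiers.
users = [("john", "1 Main St"), ("jane", "2 Oak Ave")]
products = [("123", "blue"), ("456", "red")]
user_product = [("john", "123"), ("jane", "123"), ("john", "456")]

user_ids = assign_surrogate_keys(name for name, _ in users)
product_ids = assign_surrogate_keys(ref for ref, _ in products)

# Each table keeps its natural key as an ordinary column next to the surrogate key.
user_rows = [(user_ids[name], name, address) for name, address in users]
product_rows = [(product_ids[ref], ref, color) for ref, color in products]

# The join table refers only to surrogate keys.
user_product_rows = [(user_ids[u], product_ids[p]) for u, p in user_product]
```

Keeping the natural identifier as a regular column means the mapping can be rebuilt from the exported tables if later batches need to reuse the same surrogate keys.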

Thanks a lot
-matt
On Sun, May 19, 2013 at 11:37 PM, Jarek Jarcec Cecho <[EMAIL PROTECTED]> wrote:

> Hi Matthieu,
> Sqoop is currently a highly specialized EL (extract-load) tool and not a
> generic ETL (extract-transform-load) tool. Thus you need to run a custom
> MapReduce/Pig/Hive job that separates the three logical tables and
> prepares the data in a format that Sqoop can process.
>
> Jarcec
>
> On Fri, May 17, 2013 at 05:44:03PM -0400, Matthieu Labour wrote:
> > Hi
> >
> > I would be grateful for any tips on how to "prepare" the data so they can
> > be exported to a PostgreSQL database using Sqoop.
> >
> > As an example:
> >
> > Given some files of events (user events, product events,
> > productActivity events):
> >
> > [file0001]
> > event:user properties:{name:"john" ...}
> > event:product properties:{ref:123,color:"blue",...}
> > event:productActivity properties:{user:"john", product:"ref", action:"buy"}
> > .....
> >
> > How does one come up with the primary keys and the user_product join
> > table ready to be exported?
> >
> > In other words:
> >
> > function(Input:eventfile) => output:[productFile, userFile,
> > user_productFile with auto-generated primary keys]
> >
> > What goes into the function?
> >
> > I hope this makes sense!
> >
> > Thank you in advance for any help
> >
> > -matt
>
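
As a rough illustration of the separation step described in the reply above, the sketch below splits a mixed event stream into three delimited outputs, one per logical table. It is plain single-process Python rather than a MapReduce/Pig/Hive job, and it assumes one JSON object per line with "event" and "properties" fields, which is not the exact format shown in the question. The names split_events and to_delimited are hypothetical, and natural keys are used as-is here. Comma-separated, newline-terminated text is chosen because that is what Sqoop export expects by default.

```python
import csv
import io
import json


def split_events(lines):
    """Separate a mixed event stream into users, products, and user-product links."""
    users, products, user_product = {}, {}, set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        kind, props = record["event"], record["properties"]
        if kind == "user":
            users[props["name"]] = props  # de-duplicate on the natural key
        elif kind == "product":
            products[props["ref"]] = props
        elif kind == "productActivity":
            user_product.add((props["user"], props["product"]))
    return users, products, sorted(user_product)


def to_delimited(rows):
    """Serialize rows as comma-separated, newline-terminated text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical sample input in the assumed JSON-lines format.
sample = [
    '{"event": "user", "properties": {"name": "john"}}',
    '{"event": "product", "properties": {"ref": 123, "color": "blue"}}',
    '{"event": "productActivity", "properties": {"user": "john", "product": 123, "action": "buy"}}',
]

users, products, links = split_events(sample)
user_file = to_delimited([[name] for name in users])
product_file = to_delimited([[ref, p.get("color", "")] for ref, p in products.items()])
user_product_file = to_delimited(links)
```

Each resulting string would be written to its own directory on HDFS and exported to its own table, with the join table exported after the two tables it refers to if foreign key constraints are in place.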