Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Sqoop, mail # user - How does one preprocess the data so that they can be exported using sqoop


Copy link to this message
-
Re: How does one preprocess the data so that they can be exported using sqoop
Jarek Jarcec Cecho 2013-05-20, 16:58
Hi Matthieu,
as Sqoop is import/exporting data as they are, using natural keys is not usually working as expected. You should prepare your data with proper surrogate keys and use Sqoop as a bulk load tool to get all your data from/to database.

Jarcec

On Mon, May 20, 2013 at 09:27:52AM -0700, Matthieu Labour wrote:
> Jarcec
>
> Thank you for your response
>
> May I ask you a tip on best practices with respect to keys. When preparing
> the data for sqoop do people end up using natural keys? Or do folks
> generate surrogate keys while preparing the data?
>
> As an example:
>
> instead of having
> user: naturalUserIdentifier, address ....
> product: productRef, productProperties...
> user_product: natural user Identifier, productRef
>
> I would like
> user: userSurrogateKey, userNaturaIdentifier, address ....
> product: productSurrogateKey, productRef, productProperties...
> user_product: userSurrogateKey, productSurrogateKey
>
> Thanks a lot
> -matt
>
>
> On Sun, May 19, 2013 at 11:37 PM, Jarek Jarcec Cecho <[EMAIL PROTECTED]>wrote:
>
> > Hi Matthieu,
> > Sqoop is currently highly specialized EL tool (extract-load) and not a
> > generic ETL tool (extract-transform-load). Thus you need to execute custom
> > mapreduce/pig/hive job that will separate all three different logical
> > tables and prepare data into format that Sqoop can process.
> >
> > Jarcec
> >
> > On Fri, May 17, 2013 at 05:44:03PM -0400, Matthieu Labour wrote:
> > > Hi
> > >
> > > I would be grateful for any tips on how to "prepare" the data so they can
> > > be exported to a Postgesql Database using sqoop.
> > >
> > > As an example:
> > >
> > > Provided some files of events. (user events, product events,
> > > productActivity events)
> > >
> > > [file0001]
> > > event:user propertes:{name:"john" ...}
> > > event:product properties:{ref:123,color:"blue",...
> > > event:productActivity properties:{user:"john", product:"ref",
> > action:"buy"}
> > > .....
> > >
> > > How does one come up with the primary keys and the user_product join
> > table
> > > ready to be exported?
> > >
> > > On other words.
> > >
> > > function(Input:eventfile) => output:[productFile, userFile,
> > > user_productFile with auto generated primary keys ]
> > >
> > > what goes into function?
> > >
> > > I hope this makes sense!
> > >
> > > Thank you in advance for any help
> > >
> > > -matt
> >