Pig, mail # user - Splitting by unique values in a relation


Re: Splitting by unique values in a relation
Ruslan Al-Fakikh 2013-09-16, 02:25
Sorry, I didn't understand that the values were not known in advance. Yes, in
your case MultiStorage is a good way to split the data according to the values
of a column. It worked for me in similar cases.

Thanks
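
For readers who want to see the idea outside Pig, splitting on values that are
not known in advance can be sketched in plain Python. This is only an
illustration of the concept, not how MultiStorage is implemented: the function
names partition_by_key and write_partitions, the sample rows, and the ".tsv"
file naming are all assumptions made for this example.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path


def partition_by_key(rows, key_index=0):
    """Group rows by the value at key_index; keys are discovered from the data."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row)
    return dict(groups)


def write_partitions(groups, out_dir):
    """Write one tab-separated file per key and return the file paths by key.

    Assumes every key is safe to use as a file name (hypothetical simplification).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = {}
    for key, rows in groups.items():
        path = out_dir / f"{key}.tsv"
        with path.open("w", newline="") as handle:
            writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
            writer.writerows(rows)
        paths[key] = path
    return paths


# Hypothetical sample of relation A: (customer_id, data)
relation_a = [
    ("c1", "apple"),
    ("c2", "banana"),
    ("c1", "cherry"),
    ("c3", "date"),
]

groups = partition_by_key(relation_a)
with tempfile.TemporaryDirectory() as tmp:
    paths = write_partitions(groups, tmp)
    for key in sorted(paths):
        print(key, paths[key].name, len(groups[key]))
```

The point of the sketch is that no list of customer_ids is needed up front:
the output files are determined by whatever values appear in the key column,
which is the property that SPLIT, with its fixed conditions, does not offer.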
On Mon, Sep 16, 2013 at 4:06 AM, praveenesh kumar <[EMAIL PROTECTED]> wrote:

> Okay, I may not have explained the scenario properly. Apologies if I was
> not clear enough about my problem.
>
> My scenario -
>
> I have a relation A that contains an unknown number (N) of unique
> customer_ids. I want to create N output files, one per customer_id. I was
> thinking of finding the unique customer_ids first, but then I was not sure
> how to proceed, which is why I posted the question.
>
> Through some further googling, I found piggybank's MultiStorage UDF, which
> does this kind of operation and would do the job in my case.
> Anyway, I was wondering: if I had to do some other operation, e.g.
> filtering by unique customer_ids, how would you achieve that in Pig?
>
> SPLIT needs known criteria to split a relation into several relations.
> Please correct me if I am wrong there. When the values are unknown, how can
> we achieve the same thing?
>
> Regards
> Praveenesh
>
>
> On Mon, Sep 16, 2013 at 12:44 AM, Shahab Yunus <[EMAIL PROTECTED]> wrote:
>
> > Correction to my earlier comment. The following statement that I wrote
> > was wrong:
> > 'Won't SPLIT always give you 2 relations?'
> >
> > It is basically what Praveenesh himself mentioned, i.e. a
> > pre-defined/known number of relations/splits.
> >
> > Regards,
> > Shahab
> >
> >
> > On Sun, Sep 15, 2013 at 7:41 PM, praveenesh kumar <[EMAIL PROTECTED]> wrote:
> >
> > > I can use SPLIT only when I know the values by which I need to split.
> > > Here the customer_ids are unknown to me. I don't know how many of them
> > > exist in my data. Hence SPLIT is not the answer to my problem.
> > >
> > > Anyway, I have found that piggybank's MultiStorage is much closer to
> > > what I am looking for. I was just wondering whether there is a better
> > > or different way to do the same thing.
> > >
> > > Regards
> > > Praveenesh
> > >
> > >
> > > On Mon, Sep 16, 2013 at 12:36 AM, Ruslan Al-Fakikh <[EMAIL PROTECTED]> wrote:
> > >
> > > > Hi!
> > > >
> > > > Have you tried the SPLIT operator?
> > > > http://pig.apache.org/docs/r0.11.1/basic.html#SPLIT
> > > > After splitting the relation into two separate relations you can
> > > > STORE them into different locations.
> > > >
> > > > Best Regards,
> > > > Ruslan Al-Fakikh
> > > > https://www.odesk.com/users/~015b7b5f617eb89923
> > > >
> > > >
> > > > On Sun, Sep 15, 2013 at 11:03 PM, praveenesh kumar <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I have a relation A with (customer_id, data).
> > > > > I want to get the unique customer_ids and split the data into new
> > > > > files/relations, one per customer_id. What is the most efficient
> > > > > way to do that?
> > > > >
> > > > > I can get the distinct customer_ids in a relation, but I don't
> > > > > understand how I can use it to split the data by customer_id.
> > > > >
> > > > > Regards
> > > > > Praveenesh
> > > > >
> > > >
> > >
> >
>