drd_ 2009-11-20, 18:18
Re: PIG bin/labeling relation
Unless you actually need the ordinal numbers, you can do it all in one step:
B = ORDER A BY x PARALLEL 100;
STORE B INTO ......

This creates 100 part files that are ordered both within each file and
across files: the first part file holds roughly the first hundredth of
the sorted data, the second holds the next hundredth, and so on. The
split points are approximate, so some files may be slightly bigger than
others, but for a big enough dataset they should be roughly equal.

-D
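
To illustrate why the part files come out ordered but only roughly equal in size, here is a minimal Python sketch of the general sample-then-range-partition idea. It is not Pig's actual implementation; the function name `range_partition` and the `sample_size` and `seed` parameters are assumptions made up for this example.

```python
import bisect
import random


def range_partition(values, num_parts, sample_size=1000, seed=0):
    """Split values into num_parts sorted lists whose concatenation is sorted.

    Hypothetical helper: a single-process stand-in for a sampled range
    partitioner, not the code Pig runs.
    """
    if not values:
        return [[] for _ in range(num_parts)]
    rng = random.Random(seed)
    # Estimate the key distribution from a random sample instead of a full sort.
    sample = sorted(rng.sample(values, min(sample_size, len(values))))
    # Pick num_parts - 1 cut points at evenly spaced positions in the sample.
    cuts = [sample[len(sample) * i // num_parts] for i in range(1, num_parts)]
    parts = [[] for _ in range(num_parts)]
    for v in values:
        # bisect_right counts the cut points <= v, which is the part index.
        parts[bisect.bisect_right(cuts, v)].append(v)
    # Each part is sorted independently, like one reducer writing one part file.
    return [sorted(p) for p in parts]


if __name__ == "__main__":
    rng = random.Random(42)
    data = [rng.randint(0, 10**6) for _ in range(100_000)]
    parts = range_partition(data, 100)
    sizes = [len(p) for p in parts]
    # Sizes differ from part to part because the cut points come from a sample.
    print("smallest part:", min(sizes), "largest part:", max(sizes))
```

Because the cut points are estimated from a sample, the parts are never exactly equal, but every value in part k is less than or equal to every value in part k+1, which is the property the reply relies on.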

On Fri, Nov 20, 2009 at 1:18 PM, drd_ <[EMAIL PROTECTED]> wrote:
>
> I am using PIG and this is what I am trying to do:
>
> 1) Sort a relation A into B by a field x. The smallest value of x is first.
> Just use ORDER BY.
>
> 2) Label each tuple in B with a number denoting its order in the sorted
> relation. So the first tuple would be labeled with a 1, the second tuple
> with a 2, the third with a 3 and so on. Not certain how to do this.
>
> 3) Derive a relation C where each row is a bag of tuples. The first row
> contains the first n1 tuples from relation B, the second row contains the
> tuples from B labeled (n1 + 1) to n2, the third row contains the tuples
> from B labeled (n2 + 1) to n3 and so on to n100. This step is simple (just
> use filter) once we've labeled each tuple in B with a number.
>
> The question: how do I do step 2).
>
> thanks
> --
> View this message in context: http://old.nabble.com/PIG-bin-labeling-relation-tp26443615p26443615.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>
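
For readers who do need the ordinal labels, steps 2 and 3 of the quoted question can be sketched in plain Python as follows. This is a single-machine illustration of the logic only, not a Pig script; the helper names `label_sorted` and `bin_by_label` and the `boundaries` list (standing in for n1, n2, ...) are hypothetical.

```python
from bisect import bisect_left


def label_sorted(records, key):
    """Step 2: return (label, record) pairs with labels 1..N in ascending key order."""
    return list(enumerate(sorted(records, key=key), start=1))


def bin_by_label(labeled, boundaries):
    """Step 3: bin 0 holds labels 1..boundaries[0], bin k holds
    labels boundaries[k-1]+1 .. boundaries[k]. Labels past the last
    boundary are dropped."""
    bins = [[] for _ in boundaries]
    for label, record in labeled:
        # bisect_left counts the boundaries strictly below this label.
        k = bisect_left(boundaries, label)
        if k < len(bins):
            bins[k].append(record)
    return bins


if __name__ == "__main__":
    records = [("c", 30), ("a", 10), ("d", 40), ("b", 20), ("e", 50)]
    labeled = label_sorted(records, key=lambda r: r[1])
    print(bin_by_label(labeled, [2, 5]))
    # prints [[('a', 10), ('b', 20)], [('c', 30), ('d', 40), ('e', 50)]]
```

The global numbering in `label_sorted` is the part that is awkward to distribute, since each label depends on how many tuples precede it in the whole relation. That is why the reply suggests skipping the labels when only the ordered bins are needed.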