drd_ 2009-11-20, 18:18
Re: PIG bin/labeling relation
Dmitriy Ryaboy 2009-11-21, 20:04
Unless you actually need the ordinal numbers, you can do it all in one step:

    B = ORDER A by x PARALLEL 100;
    Store B into ......

This will create 100 ordered part files, with the first part file containing the first 100th of the data, the second -- the next 100th, and so on. The fragments are approximate in size, so some may be slightly bigger than others, but for a big enough dataset, they should be roughly equal.

-D

On Fri, Nov 20, 2009 at 1:18 PM, drd_ <[EMAIL PROTECTED]> wrote:
>
> I am using PIG and this is what I am trying to do:
>
> 1) Sort a relation A into B by a field x. The smallest value of x is first.
> Just use SORT.
>
> 2) Label each tuple in B with a number denoting its order in the sorted
> relation. So the first tuple would be labeled with a 1, the second tuple
> with a 2, the third with a 3 and so on. Not certain how to do this.
>
> 3) Derive a relation C where each row is a bag of tuples. The first row
> contains the first n1 tuples from relation B, the second row contains the
> tuples from B labeled (n1 + 1) to n2 from, the third row contains the tuples
> from B labeled (n2 + 1) to n3 and so on to n100. This step is simple (just
> use filter) once we've labeled each tuple in B with a number.
>
> The question: how do I do step 2).
>
> thanks
> --
> View this message in context: http://old.nabble.com/PIG-bin-labeling-relation-tp26443615p26443615.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
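
For reference, the one-step approach from the reply might look roughly like this as a complete script. This is only a sketch: the input path, the output path, and the schema (a field x plus a second field called payload) are made up for illustration and are not from the original posts.

```pig
-- Hypothetical input path and schema; adjust both to the real data.
A = LOAD 'input/data' AS (x:int, payload:chararray);

-- Sort on x, smallest first, using 100 reducers.
-- Each reducer writes one part file, so the output is 100 ordered bins.
B = ORDER A BY x PARALLEL 100;

-- Hypothetical output path.
STORE B INTO 'output/sorted_bins';
```

If the ordinal labels from step 2 are really needed, later Pig releases (0.11 and up, as far as I know) have a RANK operator, e.g. `C = RANK A BY x;`. It did not exist when this thread was written, and tuples with equal x share a rank, so it is not an exact match for a strict 1, 2, 3 numbering.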