Re: Problem while using merge join
As far as I know that is not possible, because I have to extend the
LoadFunc class, and this class requires this method:

 public abstract Tuple getNext() throws IOException;

Or do you have another idea?
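
To illustrate the constraint: even if I build the bag inside the load
function, getNext() still has to wrap it in a tuple. Roughly like this
(only a sketch; the currentRowQualifiers field is made up for the example):

 import java.io.IOException;
 import org.apache.pig.data.BagFactory;
 import org.apache.pig.data.DataBag;
 import org.apache.pig.data.DataByteArray;
 import org.apache.pig.data.Tuple;
 import org.apache.pig.data.TupleFactory;

 // Sketch only: sits inside a LoadFunc subclass like my modified HBaseStorage.
 // currentRowQualifiers is a made-up field holding the column qualifiers of
 // the row the RecordReader just delivered.
 @Override
 public Tuple getNext() throws IOException {
     if (currentRowQualifiers == null) {
         return null;                        // no more input
     }
     DataBag bag = BagFactory.getInstance().newDefaultBag();
     for (byte[] qualifier : currentRowQualifiers) {
         Tuple col = TupleFactory.getInstance().newTuple(1);
         col.set(0, new DataByteArray(qualifier));
         bag.add(col);
     }
     Tuple result = TupleFactory.getInstance().newTuple(1);
     result.set(0, bag);                     // the bag still travels inside a tuple
     return result;
 }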

And yes, the columns are sorted. In my modified HBaseStorage load function,
only one row is loaded (in every case). So there are no conflicts with
other rows, because there are no other rows :)

Btw, the batch(1) workaround works fine so far. It's not faster, but it's
also not slower, so it's okay for me. The merge join works now too; at first
I had exactly the same error as described here:
https://issues.apache.org/jira/browse/PIG-2495 ... but after adding the
lines from the patch, the merge join worked.
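
For reference, the batch(1) part of the workaround is basically just this in
the scan setup (a sketch; the wrapper class and method names are made up for
the example, the rest of the configuration is omitted):

 import org.apache.hadoop.hbase.client.Scan;

 public class BatchOneScanExample {
     // Sketch: builds the Scan that the modified HBaseStorage hands to the scanner.
     public static Scan newScan() {
         Scan scan = new Scan();
         // With batch = 1, each call to ResultScanner.next() returns at most
         // one column of the current row instead of the whole row, so getNext()
         // can emit one tuple per column.
         scan.setBatch(1);
         return scan;
     }
 }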

There is one issue left. If I try to join the joined bag with another
bag, I get this exception:

Caused by:
org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogicalToPhysicalTranslatorException:
ERROR 1103: Merge join/Cogroup only supports Filter, Foreach, Ascending
Sort, or Load as its predecessors. Found :

Here is the Pig program: http://pastebin.com/BeziRrdD . After the first
merge join the bag is sorted, or am I wrong? Or do I have to execute a
sort after it? Normally I would join all 3 bags in one multi-way join, but
that doesn't work with the merge feature.

regards,
john
2013/9/13 Pradeep Gollakota <[EMAIL PROTECTED]>

> I think a better option is to completely bypass the HBaseStorage mechanism.
> Since you've already modified it, just put your 2nd UDF in there and have
> it return the data that you need right away.
>
> Another question I have is, are you absolutely positive that your data will
> continue to be sorted if you've projected away the row key? The columns are
> only sorted intra-row.
>
>
> On Fri, Sep 13, 2013 at 12:06 PM, John <[EMAIL PROTECTED]> wrote:
>
> > Sure, it is not so fast while loading, but on the other hand I can save
> > the foreach operation after the load function. The best way would be to
> > get all columns and return a bag, but I see no way to do that because
> > LoadFunc returns a Tuple and not a Bag. I will try this way and see how
> > fast it is. If there are other ideas to make it faster I will try them.
> >
> > regards,
> > john
> >
> >
> > 2013/9/13 Shahab Yunus <[EMAIL PROTECTED]>
> >
> > > Wouldn't this slow down your data retrieval? One column in each call
> > > instead of a batch?
> > >
> > > Regards,
> > > Shahab
> > >
> > >
> > > On Fri, Sep 13, 2013 at 2:34 PM, John <[EMAIL PROTECTED]>
> > wrote:
> > >
> > > > I think I might have found a way to transform it directly into a bag.
> > > > Inside the HBaseStorage() load function I have set the HBase scan batch
> > > > to 1, so I get one column for every scan.next() instead of all columns.
> > > > See
> > > > http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html
> > > >
> > > > setBatch(int batch)
> > > > Set the maximum number of values to return for each call to next()
> > > >
> > > > I think this will work. Any idea if this way has disadvantages?
> > > >
> > > > regards
> > > >
> > > >
> > > > 2013/9/13 John <[EMAIL PROTECTED]>
> > > >
> > > > > hi,
> > > > >
> > > > > the join key is in the bag, that's the problem. The load function
> > > > > returns only one element, $0, and that is the map. This map is
> > > > > transformed in the next step with the UDF "MapToBagUDF" into a bag.
> > > > > For example, the load function returns this: ([col1,col2,col3]),
> > > > > then this map inside the tuple is transformed to:
> > > > >
> > > > > (col1)
> > > > > (col2)
> > > > > (col3)
> > > > >
> > > > > Maybe there is a way to transform the map directly in the load
> > > > > function into a bag? The problem I see is that the getNext() method
> > > > > in the LoadFunc has to return a Tuple and not a Bag. :/
> > > > >
> > > > >
> > > > >
> > > > > 2013/9/13 Pradeep Gollakota <[EMAIL PROTECTED]>
> > > > >