Pig >> mail # user >> Problem while using merge join


Re: Problem while using merge join
Sure, it is not so fast while loading, but on the other hand I can save the
foreach operation after the load function. The best way would be to get all
columns and return a bag, but I see no way to do that because the LoadFunc
returns a Tuple, not a Bag. I will try this approach and see how fast it is.
If there are other ideas to make it faster, I will try them.
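As a sketch of the layout the thread is working toward (table names, the column family 'cf', and the '-loadKey true' option below are illustrative assumptions, not taken from John's actual script): both relations come straight from sorted HBase scans, with nothing between the loads and the join, which is the shape Pig's merge join accepts.

```pig
-- Hedged sketch only: table names, column family 'cf', and the
-- '-loadKey true' option are illustrative, not from the thread.
-- HBase scans return rows in row-key order, so both inputs arrive
-- sorted on the join key with no FOREACH/UDF in between.
left_rel  = LOAD 'hbase://tableA'
            USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:*', '-loadKey true')
            AS (rowkey:chararray, cols:map[]);
right_rel = LOAD 'hbase://tableB'
            USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:*', '-loadKey true')
            AS (rowkey:chararray, cols:map[]);
joined    = JOIN left_rel BY rowkey, right_rel BY rowkey USING 'merge';
```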

regards,
john
2013/9/13 Shahab Yunus <[EMAIL PROTECTED]>

> Wouldn't this slow down your data retrieval? One column in each call
> instead of a batch?
>
> Regards,
> Shahab
>
>
> On Fri, Sep 13, 2013 at 2:34 PM, John <[EMAIL PROTECTED]> wrote:
>
> > I think I might have found a way to transform it directly into a bag.
> > Inside the HBaseStorage() load function I have set the HBase scan batch
> > to 1, so I get one column for every scan.next() instead of all columns.
> > See
> > http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html
> >
> > setBatch(int batch)
> > Set the maximum number of values to return for each call to next()
> >
> > I think this will work. Any idea if this approach has disadvantages?
> >
> > regards
> >
> >
> > 2013/9/13 John <[EMAIL PROTECTED]>
> >
> > > hi,
> > >
> > > the join key is in the bag, thats the problem. The Load Function
> returns
> > > only one element 0$ and that is the map. This map is transformed in the
> > > next step with the UDF "MapToBagUDF" into a bag. for example the load
> > > functions returns this ([col1,col2,col3), then this map inside the
> tuple
> > is
> > > transformed to:
> > >
> > > (col1)
> > > (col2)
> > > (col3)
> > >
> > > Maybe there is is way to transform the map directly in the load
> function
> > > into a bag? The problem I see is that the next() Method in the LoadFunc
> > has
> > > to be a Tuple and no Bag. :/
> > >
> > >
> > >
> > > 2013/9/13 Pradeep Gollakota <[EMAIL PROTECTED]>
> > >
> > >> Since your join key is not in the Bag, can you do your join first
> > >> and then execute your UDF?
> > >>
> > >>
> > >> On Fri, Sep 13, 2013 at 10:04 AM, John <[EMAIL PROTECTED]>
> > >> wrote:
> > >>
> > >> > Okay, I think I have found the problem. Here:
> > >> > http://pig.apache.org/docs/r0.11.1/perf.html#merge-joins ... it is
> > >> > written:
> > >> >
> > >> > There may be filter statements and foreach statements between the
> > >> > sorted data source and the join statement. The foreach statement
> > >> > should meet the following conditions:
> > >> >
> > >> >    - There should be no UDFs in the foreach statement.
> > >> >    - The foreach statement should not change the position of the
> > >> >      join keys.
> > >> >    - There should be no transformation on the join keys which will
> > >> >      change the sort order.
> > >> >
> > >> >
> > >> > I have to use a UDF to transform the Map into a Bag ... any
> > >> > workaround idea?
> > >> >
> > >> > thanks
> > >> >
> > >> >
> > >> > 2013/9/13 John <[EMAIL PROTECTED]>
> > >> >
> > >> > > Hi,
> > >> > >
> > >> > > I am trying to use a merge join for 2 bags. Here is my pig code:
> > >> > > http://pastebin.com/Y9b2UtNk .
> > >> > >
> > >> > > But I got this error:
> > >> > >
> > >> > > Caused by:
> > >> > > org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogicalToPhysicalTranslatorException:
> > >> > > ERROR 1103: Merge join/Cogroup only supports Filter, Foreach,
> > >> > > Ascending Sort, or Load as its predecessors. Found
> > >> > >
> > >> > > I think the reason is that there is no sort function or something
> > >> > > like this. But the bags are definitely sorted. How can I do the
> > >> > > merge join?
> > >> > >
> > >> > > thanks
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>
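To summarize the thread as a sketch (relation and input names are illustrative; only MapToBagUDF is from the thread): the first plan is rejected with ERROR 1103 because a UDF sits between the sorted load and the merge join; the second joins first and applies the UDF afterwards, as Pradeep suggested, which only works when the join key is not inside the bag.

```pig
-- Illustrative sketch; only 'MapToBagUDF' comes from the thread.

-- Plan 1: rejected with ERROR 1103 -- a FOREACH containing a UDF
-- sits between the sorted load and the merge join.
a  = LOAD 'left_input'  AS (k:chararray, m:map[]);
b  = LOAD 'right_input' AS (k:chararray, v:chararray);
a2 = FOREACH a GENERATE k, MapToBagUDF(m);   -- UDF breaks the precondition
j  = JOIN a2 BY k, b BY k USING 'merge';

-- Plan 2: merge-join the sorted inputs first, then run the UDF on
-- the joined result (only possible if the join key is not in the bag).
j2 = JOIN a BY k, b BY k USING 'merge';
r  = FOREACH j2 GENERATE a::k, MapToBagUDF(a::m), b::v;
```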