Problem while using merge join (Pig user mailing list)


John 2013-09-13, 16:37
John 2013-09-13, 17:04
Pradeep Gollakota 2013-09-13, 17:41
John 2013-09-13, 17:58
John 2013-09-13, 18:34
Shahab Yunus 2013-09-13, 19:00
John 2013-09-13, 19:06
Re: Problem while using merge join
I think a better option is to completely bypass the HBaseStorage mechanism.
Since you've already modified it, just put your 2nd UDF in there and have
it return the data that you need right away.

Another question I have is, are you absolutely positive that your data will
continue to be sorted if you've projected away the row key? The columns are
only sorted intra-row.
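
A minimal sketch of that idea, assuming the modified HBaseStorage still hands back a single map field (as John describes below); the class name HBaseBagStorage and the (qualifier, value) field layout are made up for illustration:

// Sketch: fold the map-to-bag conversion into the loader itself, so no
// foreach/UDF is needed between the load and the merge join.
import java.io.IOException;
import java.util.Map;

import org.apache.commons.cli.ParseException;
import org.apache.pig.backend.hadoop.hbase.HBaseStorage;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class HBaseBagStorage extends HBaseStorage {
    private static final TupleFactory TF = TupleFactory.getInstance();
    private static final BagFactory BF = BagFactory.getInstance();

    public HBaseBagStorage(String columnList) throws ParseException, IOException {
        super(columnList);
    }

    @Override
    public Tuple getNext() throws IOException {
        Tuple row = super.getNext();                 // e.g. ([col1#v1,col2#v2,...])
        if (row == null) return null;                // end of input
        @SuppressWarnings("unchecked")
        Map<String, Object> map = (Map<String, Object>) row.get(0);
        DataBag bag = BF.newDefaultBag();
        for (Map.Entry<String, Object> e : map.entrySet()) {
            Tuple t = TF.newTuple(2);
            t.set(0, e.getKey());                    // column qualifier
            t.set(1, e.getValue());                  // cell value
            bag.add(t);
        }
        return TF.newTuple(bag);                     // a Tuple *containing* a bag
    }
}

Note that getNext() still returns a Tuple, as the LoadFunc contract requires; the bag simply sits inside it. Keeping the row key as an extra leading field, rather than projecting it away, would also preserve the ordering a merge join depends on.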
On Fri, Sep 13, 2013 at 12:06 PM, John <[EMAIL PROTECTED]> wrote:

> Sure, it is not so fast while loading, but on the other hand I can save the
> foreach operation after the load function. The best way would be to get all
> columns and return a bag, but I see no way to do that because the LoadFunc
> returns a Tuple and not a Bag. I will try this way and see how fast it is.
> If there are other ideas to make it faster, I will try them.
>
> regards,
> john
>
>
> 2013/9/13 Shahab Yunus <[EMAIL PROTECTED]>
>
> > Wouldn't this slow down your data retrieval? One column in each call
> > instead of a batch?
> >
> > Regards,
> > Shahab
> >
> >
> > On Fri, Sep 13, 2013 at 2:34 PM, John <[EMAIL PROTECTED]> wrote:
> >
> > > I think I might have found a way to transform it directly into a bag.
> > > Inside the HBaseStorage() load function I have set the HBase scan batch
> > > to 1, so I get one column for every scan.next() instead of all columns.
> > > See
> > > http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html
> > >
> > > setBatch(int batch)
> > > Set the maximum number of values to return for each call to next()
> > >
> > > I think this will work. Any idea if this approach has disadvantages?
> > >
> > > regards
> > >
> > >
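
A rough sketch of the scan behavior being described, using the plain HBase client of that era rather than Pig; the table name and the printing are placeholders:

// Sketch: with setBatch(1), each Result returned by the scanner carries at
// most one cell, so the iteration sees one column per next() call.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchOneScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");   // placeholder table name
        Scan scan = new Scan();
        scan.setBatch(1);                             // at most one cell per next()
        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result r : scanner) {                // one Result per cell now,
                for (KeyValue kv : r.raw()) {         // not one per row
                    System.out.println(Bytes.toString(kv.getQualifier())
                            + " = " + Bytes.toString(kv.getValue()));
                }
            }
        } finally {
            scanner.close();
            table.close();
        }
    }
}

Pairing setBatch(1) with scanner caching (Scan.setCaching) would help offset the per-call overhead Shahab asks about above.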
> > > 2013/9/13 John <[EMAIL PROTECTED]>
> > >
> > > > hi,
> > > >
> > > > the join key is in the bag, that's the problem. The load function
> > > > returns only one element, $0, and that is the map. This map is
> > > > transformed in the next step with the UDF "MapToBagUDF" into a bag.
> > > > For example, the load function returns this: ([col1,col2,col3]);
> > > > then this map inside the tuple is transformed to:
> > > >
> > > > (col1)
> > > > (col2)
> > > > (col3)
> > > >
> > > > Maybe there is a way to transform the map directly in the load
> > > > function into a bag? The problem I see is that the getNext() method
> > > > in the LoadFunc has to return a Tuple and not a Bag. :/
> > > >
> > > >
> > > >
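
The MapToBagUDF itself isn't shown in the thread; a plausible sketch of such a UDF, emitting one single-field tuple per map key, might look like this:

// Sketch of a map-to-bag EvalFunc; the class name comes from the thread,
// the body is a guess at its logic.
import java.io.IOException;
import java.util.Map;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class MapToBagUDF extends EvalFunc<DataBag> {
    private static final TupleFactory TF = TupleFactory.getInstance();
    private static final BagFactory BF = BagFactory.getInstance();

    @Override
    public DataBag exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) return null;
        @SuppressWarnings("unchecked")
        Map<String, Object> map = (Map<String, Object>) input.get(0);
        if (map == null) return null;
        DataBag bag = BF.newDefaultBag();
        for (String key : map.keySet()) {
            bag.add(TF.newTuple(key));   // ([col1,col2,col3]) -> (col1)(col2)(col3)
        }
        return bag;
    }
}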
> > > > 2013/9/13 Pradeep Gollakota <[EMAIL PROTECTED]>
> > > >
> > > >> Since your join key is not in the Bag, can you do your join first
> > > >> and then execute your UDF?
> > > >>
> > > >>
> > > >> On Fri, Sep 13, 2013 at 10:04 AM, John <[EMAIL PROTECTED]> wrote:
> > > >>
> > > >> > Okay, I think I have found the problem. Here:
> > > >> > http://pig.apache.org/docs/r0.11.1/perf.html#merge-joins ... it is
> > > >> > written:
> > > >> >
> > > >> > There may be filter statements and foreach statements between the
> > > >> > sorted data source and the join statement. The foreach statement
> > > >> > should meet the following conditions:
> > > >> >
> > > >> >    - There should be no UDFs in the foreach statement.
> > > >> >    - The foreach statement should not change the position of the
> > > >> >      join keys.
> > > >> >    - There should be no transformation on the join keys which will
> > > >> >      change the sort order.
> > > >> >
> > > >> > I have to use a UDF to transform the map into a bag ... any
> > > >> > workaround idea?
> > > >> >
> > > >> > thanks
> > > >> >
> > > >> > 2013/9/13 John <[EMAIL PROTECTED]>
> > > >> >
> > > >> > > Hi,
> > > >> > >
> > > >> > > I am trying to use a merge join for 2 bags. Here is my pig code:
> > > >> > > http://pastebin.com/Y9b2UtNk .
> > > >> > >
> > > >> > > But I got this error:
> > > >> > >
> > > >> > > Caused by:
> > > >> > > org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogicalToPhysicalTranslatorException:
> > > >> > > ERROR 1103: Merge join/Cogroup only supports Filter, Foreach,
John 2013-09-13, 20:51