Mat Kelcey 2012-08-29, 23:55
Hello!
Considering the following two relations...
grunt> querys = load 'query' as (id:int, token:chararray);
grunt> dump querys
(11,foo)
(12,bar)
(13,frog)
and
grunt> documents = load 'document' as (id:int, text:chararray);
grunt> dump documents;
(21,foo bar frog)
(22,hello frog)
Is it possible to do a join where the query:token is not equal to, but contained in, documents:text?
eg
(11,foo,21,foo bar frog)
(12,bar,21,foo bar frog)
(13,frog,21,foo bar frog)
(13,frog,22,hello frog)
I can certainly do this in Java map/reduce (as we all had to in the dark days before pig) but is there a way to hack this together with a custom udf or some other weird join backdoor (custom partitioner for a group or something whacky)???
It's been a long day, maybe I'm just missing something super obvious..
Cheers! Mat
Mat Kelcey 2012-08-30, 00:08
For the sake of discussion I actually simplified things but perhaps in a critical way...
Query actually has 3 token fields and Document has 2 text fields, and I really require token1 to be in text1, token2 to also be in text1, and token3 to be in text2. (Damn bizarre NLP)
These additional complexities might change things...
Jonathan Coveney 2012-08-30, 00:06
You're not missing anything obvious... what you're trying to do, on face value, is not an easy thing to do. In M/R, joining is done based on partitioning to the same reducer...how can you do that if you have a case
foo
bar

foo bar
and foo is sent to reducer 1, bar to reducer 2? There's no way to know where keys should be sent.
That said, there are options.
Option 1: a cross. Undesirable because of data explosion.
Option 2: If one of the data sets is small enough to fit in memory, you can make a UDF that brings it in, and does the join for you. This is essentially option 1.
Option 3: Less generically, exploit the join you're actually doing. In the dummy example, it looks like you're checking if a token is contained in another string. You could convert this into a join by tokenizing, flattening, doing the join, etc. I don't know how close your real use case is to what you posted.
Jon
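Option 3 above (tokenise, flatten, join) can be sketched roughly as follows. This is plain Python rather than Pig, only to show the dataflow: the function name `containment_join` is made up for illustration, and whitespace splitting stands in for whatever tokenisation the real data needs.

```python
from collections import defaultdict

# Sample rows copied from the original question.
querys = [(11, "foo"), (12, "bar"), (13, "frog")]
documents = [(21, "foo bar frog"), (22, "hello frog")]


def containment_join(querys, documents):
    """Join each query to every document whose text contains its token."""
    # Tokenise and flatten: one entry per distinct token per document, so the
    # containment test becomes an ordinary equality join on the token.
    docs_by_token = defaultdict(list)
    for doc_id, text in documents:
        for token in set(text.split()):  # assumption: whitespace tokenisation
            docs_by_token[token].append((doc_id, text))

    # Equi-join on the token; in map/reduce terms the token is the shuffle key.
    joined = []
    for query_id, token in querys:
        for doc_id, text in docs_by_token.get(token, []):
            joined.append((query_id, token, doc_id, text))
    return sorted(joined)


result = containment_join(querys, documents)
for row in result:
    print(row)
```

Deduplicating tokens per document (the `set(...)` call) avoids emitting the same pair twice when a token repeats within one text. The cost is that the document side grows by roughly one row per distinct token.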
Mat Kelcey 2012-08-30, 00:14
Unfortunately neither side is small enough for a cross, or for an in-memory replicated join.
But option 3 does make sense; I think I'm overthinking things. I can use a UDF to do the equivalent of tokenisation and then, like you say, just do a join.
For the multiple conditions I can just do all three joins, count the matches, and keep only the cases where all three match.
Thanks! Mat
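The "do all three, count the matches" idea might look something like the sketch below, again in plain Python rather than Pig. The schemas `(id, token1, token2, token3)` and `(id, text1, text2)`, the sample rows and the name `match_all_three` are all assumptions, since the thread never shows the real data.

```python
from collections import defaultdict

# Hypothetical rows: the thread does not show the real schema or data.
querys = [(1, "foo", "bar", "frog"), (2, "foo", "frog", "bar")]
documents = [(21, "foo bar", "hello frog"), (22, "foo", "frog")]


def match_all_three(querys, documents):
    """Return (query_id, doc_id) pairs where all three conditions hold."""
    # Flatten documents into join keys of the form (text field, token).
    docs_by_key = defaultdict(set)
    for doc_id, text1, text2 in documents:
        for token in text1.split():
            docs_by_key[("text1", token)].add(doc_id)
        for token in text2.split():
            docs_by_key[("text2", token)].add(doc_id)

    # Each query contributes three requirements; record which of the three
    # slots matched for every (query, document) pair.
    matched_slots = defaultdict(set)
    for query_id, token1, token2, token3 in querys:
        requirements = [
            (1, "text1", token1),  # token1 must be in text1
            (2, "text1", token2),  # token2 must be in text1
            (3, "text2", token3),  # token3 must be in text2
        ]
        for slot, field, token in requirements:
            for doc_id in docs_by_key.get((field, token), ()):
                matched_slots[(query_id, doc_id)].add(slot)

    # Keep only the pairs where all three slots matched.
    return sorted(pair for pair, slots in matched_slots.items() if len(slots) == 3)


matches = match_all_three(querys, documents)
print(matches)
```

Counting distinct slots rather than raw join rows keeps the result correct when token1 and token2 happen to be the same word.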
Mat Kelcey 2012-08-30, 00:29
Actually, given the nature of my Query data I might just pack a few Bloom filters and stream Document through a UDF; I've got plenty of data and can guard against mistakes downstream.
It's wonderful what leaving the office and getting on the bus does for your thought process....
Mat
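For illustration only, a Bloom-filter pass over the documents could be sketched like this in plain Python. The `BloomFilter` class, its sizing parameters and the third sample document are all made up here; this is not Pig's built-in Bloom support or any particular library's API. A Bloom filter can report false positives but never false negatives, which is why mistakes have to be guarded against downstream.

```python
import hashlib


class BloomFilter:
    """Tiny illustrative Bloom filter; the sizes are arbitrary assumptions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


# Pack the query tokens into the filter once...
query_tokens = BloomFilter()
for token in ["foo", "bar", "frog"]:
    query_tokens.add(token)

# ...then stream documents past it, keeping any that might contain a query token.
documents = [(21, "foo bar frog"), (22, "hello frog"), (23, "nothing relevant here")]
kept = [
    (doc_id, text)
    for doc_id, text in documents
    if any(token in query_tokens for token in text.split())
]
print(kept)
```

Note that this only filters documents; it does not say which query matched, so it is a pre-filter rather than a join.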
Mat Kelcey 2012-08-30, 00:48
And I just realised this last statement makes no sense in the context of my original contrived example (I originally asked about a join, not a filter). Don't mind me! :)
Russell Jurney 2012-08-30, 00:04
Join on a dummy key or CROSS, then plug the token in a udf.
Russell Jurney twitter.com/rjurney [EMAIL PROTECTED] datasyndrome.com
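The cross-then-filter suggestion can be sketched as below, in plain Python rather than Pig. The predicate `token_in_text` is a hypothetical stand-in for the UDF, and it assumes whole-token matching on whitespace.

```python
from itertools import product

# Sample rows copied from the original question.
querys = [(11, "foo"), (12, "bar"), (13, "frog")]
documents = [(21, "foo bar frog"), (22, "hello frog")]


def token_in_text(token, text):
    """Stand-in for the UDF: true when token is one of the text's tokens."""
    return token in text.split()


# The cross pairs every query with every document: len(querys) * len(documents)
# candidate rows, which is the data explosion that makes this costly at scale.
crossed = product(querys, documents)

result = [
    (query_id, token, doc_id, text)
    for (query_id, token), (doc_id, text) in crossed
    if token_in_text(token, text)
]
for row in result:
    print(row)
```

This handles any predicate, not just token containment, but every pair is evaluated, so it only suits inputs where the product of the two sizes is manageable.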