唐亮
2011-12-13, 03:54
Prashant Kommireddi
2011-12-13, 04:12
Thejas Nair
2011-12-13, 18:56
Andrew Wells
2011-12-13, 22:30
Andrew Wells
2011-12-13, 22:32
jiang licht
2011-12-13, 23:13
Thejas Nair
2011-12-13, 23:32
唐亮
2011-12-14, 03:55
Dmitriy Ryaboy
2011-12-14, 04:39
Prashant Kommireddi
2011-12-14, 05:37
Jonathan Coveney
2011-12-14, 05:57
唐亮
2011-12-14, 06:41
Prashant Kommireddi
2011-12-14, 06:49
唐亮
2011-12-14, 06:54
Jonathan Coveney
2011-12-14, 06:56
Prashant Kommireddi
2011-12-14, 07:16
唐亮
2011-12-14, 09:59
唐亮
2011-12-14, 10:13
Dmitriy Ryaboy
2011-12-14, 18:28
jiang licht
2011-12-14, 19:18
Prashant Kommireddi
2011-12-14, 19:23
Prashant Kommireddi
2011-12-14, 19:27
唐亮
2011-12-15, 02:28
Prashant Kommireddi
2011-12-15, 02:33
唐亮
2011-12-15, 05:26
Prashant Kommireddi
2011-12-15, 08:05
唐亮
2011-12-16, 05:12
唐亮
2011-12-18, 08:24
Prashant Kommireddi
2011-12-18, 10:17
唐亮
2011-12-19, 02:49
Dmitriy Ryaboy
2011-12-19, 04:55
-
Implement Binary Search in PIG
唐亮 2011-12-13, 03:54
Hi all,
How can I implement a binary search in Pig?
In one relation there is a bag whose items are sorted, and I want to check whether a specific item exists in the bag.
In a UDF I can't randomly access the items in a DataBag container, so I have to copy the items from the DataBag into an ArrayList, and this is time-consuming.
How can I implement the binary search efficiently in Pig?
-
Re: Implement Binary Search in PIG
Prashant Kommireddi 2011-12-13, 04:12
How many elements do you have in the Bag? Can you hold the elements in a Tuple instead of a Bag?
-Prashant
-
Re: Implement Binary Search in PIG
Thejas Nair 2011-12-13, 18:56
Bags can be very large and might not fit into memory; in such cases some or all of the bag might have to be stored on disk, and random access on the bag is then not efficient. That is why the DataBag interface does not support it.
As Prashant suggested, storing the items in a tuple would be a good alternative if you want random access for a binary search.
-Thejas
-
Re: Implement Binary Search in PIG
Andrew Wells 2011-12-13, 22:30
I don't think this can be done.
Pig is just a Hadoop job, and the idea behind Hadoop is to read all the data in a file. So by the time you have put all the data into an array, you would have been better off just checking each element for the one you were looking for.
What you would get is [n + lg(n)] (n to fill the array plus lg(n) for the search), which is still just [n].
Second, Hadoop is all about large-scale data analysis, usually more than 100GB, so putting this into memory is out of the question.
Third, Hadoop is efficient because it processes this large amount of data by splitting it across multiple processes. To do an efficient binary search, you would need to do it in one mapper or one reducer.
My opinion is: just don't fight Hadoop/Pig.
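To illustrate the point about a single lookup: if the items can only be read front to back, a plain scan that stops early costs no more than copying them into an array first. The following is a rough sketch only, not Pig code. `Iterable<Long>` merely stands in for iterating over a bag, and the class and method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

final class SingleLookupScan {
    /**
     * Checks whether target occurs in a sorted sequence that only supports
     * forward iteration. Because the input is sorted in ascending order,
     * the scan can stop as soon as it sees a value larger than target.
     */
    static boolean contains(Iterable<Long> sortedItems, long target) {
        for (long item : sortedItems) {
            if (item == target) {
                return true;
            }
            if (item > target) {
                return false; // every later item is larger still
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Long> items = Arrays.asList(1L, 3L, 5L, 8L);
        System.out.println(contains(items, 5L));
        System.out.println(contains(items, 4L));
    }
}
```

This only covers the one-lookup case. When the same sorted data is searched many times, the one-off cost of copying it into an array is spread over all lookups, which is where binary search pays off.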
-
Re: Implement Binary Search in PIG
Andrew Wells 2011-12-13, 22:32
Oh, I might as well make a suggestion for random access: try looking into HBase.
-
Re: Implement Binary Search in PIG
jiang licht 2011-12-13, 23:13
Generally speaking, fancy single-machine algorithms are often not doable in a MapReduce manner; think about graph operations. So, going back to the original goal, what you want is to search for an occurrence of something in something else. To do this in Pig, I guess one could do a left outer join: in the result, any tuple where the other side of the join is null is a mismatch. Will this work? But I believe one would not try to do a binary search on a bag unless it is small. Generally speaking, either a map-side or a reduce-side search will do the job for you.
Best regards,
Michael
-
Re: Implement Binary Search in PIG
Thejas Nair 2011-12-13, 23:32
My assumption is that 唐亮 is trying to do binary search on bags within the tuples in a relation (i.e. the schema of the relation has a bag column). I don't think he is trying to treat the entire relation as one bag and do binary search on that.
-Thejas
-
Re: Implement Binary Search in PIG
唐亮 2011-12-14, 03:55
Thank you all!
The details are: a bag contains many "IP segments" whose schema is (ipStart:long, ipEnd:long, locName:chararray). The bag holds about 30000 tuples, and I want to check whether an IP belongs to one of the segments in the bag.
I want to order the IP segments by (ipStart, ipEnd) in MR, and then do a binary search in a UDF to check whether an IP is in the bag.
If I enumerated every IP, there would be more than 100000000 single IPs, so I think a JOIN in Pig would also be time-consuming.
Please help me deal with this efficiently!
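The range lookup described here can be sketched in plain Java. This is only an illustration under stated assumptions: the segments are already sorted by ipStart, they do not overlap, and they sit in a random-access list such as an ArrayList. The class names `IpSegment` and `IpSegmentSearch` are hypothetical and are not part of Pig.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical holder for one IP segment; fields follow the schema in the message. */
final class IpSegment {
    final long ipStart;
    final long ipEnd;
    final String locName;

    IpSegment(long ipStart, long ipEnd, String locName) {
        this.ipStart = ipStart;
        this.ipEnd = ipEnd;
        this.locName = locName;
    }
}

final class IpSegmentSearch {
    /**
     * Returns the segment whose [ipStart, ipEnd] range contains ip, or null.
     * Assumes the list is sorted by ipStart and the segments do not overlap.
     */
    static IpSegment find(List<IpSegment> sorted, long ip) {
        int lo = 0;
        int hi = sorted.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1; // unsigned shift avoids int overflow
            IpSegment seg = sorted.get(mid);
            if (ip < seg.ipStart) {
                hi = mid - 1;          // ip lies before this segment
            } else if (ip > seg.ipEnd) {
                lo = mid + 1;          // ip lies after this segment
            } else {
                return seg;            // ipStart <= ip <= ipEnd
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<IpSegment> segs = new ArrayList<>();
        segs.add(new IpSegment(100L, 199L, "A"));
        segs.add(new IpSegment(300L, 399L, "B"));
        IpSegment hit = find(segs, 350L);
        System.out.println(hit == null ? "no match" : hit.locName);
    }
}
```

With about 30000 segments, each lookup needs roughly 15 comparisons, so the cost is dominated by getting the segments into a random-access structure once rather than by the searches themselves.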
-
Re: Implement Binary Search in PIG
Dmitriy Ryaboy 2011-12-14, 04:39
Do you have many such bags or just one? If one, and you want to look up many IPs in it, it might be more efficient to serialize this relation to HDFS and write a lookup UDF that specifies the serialized data set as a file to put in the distributed cache. At init time, load the file into memory; then, for every IP, do the binary search in exec().
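The load-once, search-many pattern suggested here might look roughly like the sketch below. It deliberately leaves out all Pig and Hadoop plumbing (the UDF class, the distributed cache, where the file comes from) and shows only the in-memory part. The tab-separated line format (ipStart, ipEnd, locName), the sorted and non-overlapping input, and the name `SegmentTable` are all assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical in-memory lookup table, loaded once and then searched many times. */
final class SegmentTable {
    private final long[] starts;
    private final long[] ends;
    private final String[] names;

    private SegmentTable(long[] starts, long[] ends, String[] names) {
        this.starts = starts;
        this.ends = ends;
        this.names = names;
    }

    /** Reads segments that are already sorted by start, one per tab-separated line. */
    static SegmentTable load(Reader in) {
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(in)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.isEmpty()) {
                    rows.add(line.split("\t", 3));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        int n = rows.size();
        long[] starts = new long[n];
        long[] ends = new long[n];
        String[] names = new String[n];
        for (int i = 0; i < n; i++) {
            String[] fields = rows.get(i);
            starts[i] = Long.parseLong(fields[0]);
            ends[i] = Long.parseLong(fields[1]);
            names[i] = fields[2];
        }
        return new SegmentTable(starts, ends, names);
    }

    /** Returns the location name for ip, or null when no segment contains it. */
    String lookup(long ip) {
        int pos = Arrays.binarySearch(starts, ip);
        // A negative result encodes the insertion point as -(point) - 1;
        // the only candidate is the segment just before that point.
        int idx = pos >= 0 ? pos : -pos - 2;
        if (idx < 0 || ip > ends[idx]) {
            return null;
        }
        return names[idx];
    }

    public static void main(String[] args) {
        SegmentTable table = load(new StringReader("100\t199\tA\n300\t399\tB\n"));
        System.out.println(table.lookup(150L));
    }
}
```

In a UDF, the idea would be to build such a table the first time exec() runs, keep it in a field, and call `lookup` for every input IP after that.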
-
Re: Implement Binary Search in PIG
Prashant Kommireddi 2011-12-14, 05:37
I am lost when you say "If enumerate every IP, it will be more than 100000000 single IPs".
If each bag is a collection of 30000 tuples, it might not be too bad on memory if you used a Tuple to store the segments instead:
(8 bytes long + 8 bytes long + 20 bytes for chararray) = 36 bytes
Let's say we incur an additional overhead of 4X this, which is ~160 bytes per tuple.
Total per bag = 30000 x 160 = ~5 MB
You could probably store the IP segments as a Tuple and test it on your servers.
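The back-of-the-envelope estimate can be reproduced in a few lines. The figures come from the message itself (two 8-byte longs, an assumed 20-byte name, a guessed 4x overhead rounded up to about 160 bytes). Real per-tuple overhead in a JVM will differ, so this is a rough sketch, and the class name is made up.

```java
final class BagMemoryEstimate {
    /** Raw payload per segment: two longs plus an assumed 20-byte name. */
    static long rawBytesPerTuple() {
        return 8 + 8 + 20;
    }

    /** Total bytes for a given tuple count and per-tuple size. */
    static long totalBytes(long tuples, long bytesPerTuple) {
        return tuples * bytesPerTuple;
    }

    public static void main(String[] args) {
        long raw = rawBytesPerTuple();        // 36 bytes of payload
        long withOverhead = raw * 4;          // 144, rounded up to ~160 in the message
        long total = totalBytes(30_000, 160); // 4,800,000 bytes, i.e. roughly 5 MB
        System.out.println(raw + " " + withOverhead + " " + total);
    }
}
```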
-
Re: Implement Binary Search in PIG
Jonathan Coveney 2011-12-14, 05:57
It's funny, but if you look wayyyy in the past, I actually asked a bunch of questions that circled around, literally, this exact problem.
Dmitriy and Prashant are correct: the best way is to make a UDF that can do the lookup really efficiently. This is what the MaxMind API does, for example.
-
Re: Implement Binary Search in PIG
唐亮 2011-12-14, 06:41
Then how can I transfer all the items in a Bag to a Tuple?
-
Re: Implement Binary Search in PIG
Prashant Kommireddi 2011-12-14, 06:49
How are you storing the segments in a Bag? Can you forward the script?
-
Re: Implement Binary Search in PIG唐亮 2011-12-14, 06:54
The detailed Pig code is below:

raw_ip_segment = load ...
ip_segs = foreach raw_ip_segment generate ipstart, ipend, name;
group_ip_segs = group ip_segs all;

order_ip_segs = foreach group_ip_segs {
    order_seg = order ip_segs by ipstart, ipend;
    generate 't' as tag, order_seg;
}
describe order_ip_segs
order_ip_segs: {tag: chararray,order_seg: {ipstart: long,ipend: long,poid: chararray}}

Here, order_ip_segs::order_seg is a BAG. How can I convert it to a TUPLE?

And can I access the TUPLE randomly in a UDF?
-
Re: Implement Binary Search in PIGJonathan Coveney 2011-12-14, 06:56
Here is a super naive UDF (pseudocode). It assumes that you have the data in HDFS, per Dmitriy's suggestion.

public MyUdf() {
    Get data from distributed cache
    load data into a TreeMap
}

public T exec(Tuple input) {
    TreeMap.get(input.get(0));
}

and so on. You might want to lazily initialize the TreeMap, because hitting the distributed cache and building the TreeMap is costly, so you only want to do it on execution.

There's another, slightly crazier way, which may be what people above alluded to: if you know how many elements you have, you can make a binary tree with an array of fixed size (where the root is at 0, its children are at 1 and 2, their children are at 3 through 6, and so on). So you could actually construct a binary search tree as an array. This is a bit more of a pain, but you'd be able to use it in various places, and Pig would handle serialization.
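The TreeMap idea above can be sketched in plain Java, without any Pig classes. This is only a sketch under assumptions: the class name SegmentTreeMap and its methods are hypothetical, segments are assumed not to overlap, and ranges are treated as inclusive. Note that it uses floorEntry rather than get, since get only matches an exact key while an IP usually falls somewhere inside a segment.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper, not part of Pig: maps an IP (as a long) to the segment containing it. */
final class SegmentTreeMap {
    /** Value stored per segment: inclusive end of the range and its location name. */
    private static final class Segment {
        final long end;
        final String name;

        Segment(long end, String name) {
            this.end = end;
            this.name = name;
        }
    }

    // Keyed by segment start; assumes segments do not overlap.
    private final TreeMap<Long, Segment> byStart = new TreeMap<>();

    void add(long start, long end, String name) {
        byStart.put(start, new Segment(end, name));
    }

    /** Returns the name of the segment containing ip, or null if no segment contains it. */
    String lookup(long ip) {
        // floorEntry returns the entry with the greatest start <= ip, in O(log n).
        Map.Entry<Long, Segment> e = byStart.floorEntry(ip);
        if (e == null || ip > e.getValue().end) {
            return null;
        }
        return e.getValue().name;
    }
}
```

In a UDF, the map would presumably be built once (lazily, on the first call to exec) and then reused for every input IP.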
-
Re: Implement Binary Search in PIGPrashant Kommireddi 2011-12-14, 07:16
Seems like at the end of this you have a single bag with all the elements, and you would like to check whether an element exists in it based on ipstart/ipend.

1. Use FLATTEN http://pig.apache.org/docs/r0.9.1/basic.html#flatten - this will convert the Bag to Tuple:
   to_tuple = FOREACH order_ip_segs GENERATE tag, FLATTEN(order_seg); -- This is O(n)
2. Now write a UDF that can access the elements positionally for the binary search.
3. Dmitriy and Jonathan's ideas with DistributedCache could perform better than the above approach, so you could go down that route too.
-
Re: Implement Binary Search in PIG唐亮 2011-12-14, 09:59
Hi Prashant Kommireddi,
If I do 1. and 2. as you mentioned, the schema will be {tag, ipStart, ipEnd, locName}.

BUT, how should I write the UDF? In particular, how should I set the type of the input parameter?

Currently, the UDF code is as below, and its input parameter is a DataBag:

public class GetProvinceNameFromIPNum extends EvalFunc<String> {

    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return UnknownIP;
        if (input.size() != 2) {
            throw new IOException("Expected input's size is 2, but is: " + input.size());
        }

        Object o1 = input.get(0); // This should be the IP you want to look up
        if (!(o1 instanceof Long)) {
            throw new IOException("Expected input 1 to be Long, but got "
                    + o1.getClass().getName());
        }
        Object o2 = input.get(1); // This is the Bag of IP segs
        if (!(o2 instanceof DataBag)) { // Should I change it to "(o2 instanceof Tuple)"?
            throw new IOException("Expected input 2 to be DataBag, but got "
                    + o2.getClass().getName());
        }

        ........... other codes ...........
    }

}
-
Re: Implement Binary Search in PIG唐亮 2011-12-14, 10:13
I am not using HBase at the moment, so maybe I can't use DistributedCache.

And if I FLATTEN the DataBag, the results are separate Tuples, so the UDF can process only one Tuple at a time, which makes a binary search impossible.

So, please help and show me the detailed solution.
Thanks!
-
Re: Implement Binary Search in PIGDmitriy Ryaboy 2011-12-14, 18:28
hbase has nothing to do with distributed cache.
-
Re: Implement Binary Search in PIGjiang licht 2011-12-14, 19:18
If that list of IP pairs is mostly static and will be used frequently, maybe just copy it to HDFS with a high replication factor. Then read it from HDFS and use it as a lookup table, a binary tree, or a TreeMap kind of thing, instead of using the distributed cache, if that sounds easier to do.
Best regards,
Michael
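Whichever way the file reaches the task (read from HDFS or shipped through the distributed cache), the lookup table has to be loaded into memory once. A minimal sketch of that step in plain Java follows. Everything in it is an assumption for illustration: the class name SegmentTable is hypothetical, the file is assumed to hold one tab-separated "ipStart, ipEnd, locName" record per line, and the lines are assumed to be sorted by ipStart already. It reads from a generic Reader so that it does not depend on any Hadoop API.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not a Pig or Hadoop class: holds the segment table in memory. */
final class SegmentTable {
    final long[] starts;
    final long[] ends;
    final String[] names;

    private SegmentTable(long[] starts, long[] ends, String[] names) {
        this.starts = starts;
        this.ends = ends;
        this.names = names;
    }

    /**
     * Reads lines of the assumed form "ipStart<TAB>ipEnd<TAB>locName".
     * The lines are assumed to be sorted by ipStart; no sorting is done here.
     */
    static SegmentTable load(Reader source) {
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (!line.isEmpty()) {
                    rows.add(line.split("\t", 3));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        int n = rows.size();
        long[] starts = new long[n];
        long[] ends = new long[n];
        String[] names = new String[n];
        for (int i = 0; i < n; i++) {
            String[] fields = rows.get(i);
            starts[i] = Long.parseLong(fields[0]);
            ends[i] = Long.parseLong(fields[1]);
            names[i] = fields[2];
        }
        return new SegmentTable(starts, ends, names);
    }
}
```

Parallel primitive arrays keep the footprint small for a table of about 30000 segments and allow random access for a binary search.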
-
Re: Implement Binary Search in PIGPrashant Kommireddi 2011-12-14, 19:23
Try this
public class GetProvinceNameFromIPNum extends EvalFunc<String> {

    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return UnknownIP;
        if (input.size() != 2) {
            throw new IOException("Expected input's size is 2, but is: " + input.size());
        }

        Object o1 = input.get(0); // This should be the IP you want to look up
        if (!(o1 instanceof Long)) {
            throw new IOException("Expected input 1 to be Long, but got "
                    + o1.getClass().getName());
        }
        Object o2 = input.get(1); // The IP segs, now passed as a Tuple instead of a Bag
        if (!(o2 instanceof Tuple)) {
            throw new IOException("Expected input 2 to be Tuple, but got "
                    + o2.getClass().getName());
        }

        Long toSearch = (Long) o1;
        Tuple listOfTuples = (Tuple) o2;
        int numTuples = listOfTuples.size();
        return binarySearch(listOfTuples, toSearch, 0, numTuples - 1);
    }

    // I do not know what you would like your binary search to return
    // (boolean/int/String); String is used here to match EvalFunc<String>.
    // Change it as per your need.
    public String binarySearch(Tuple tuple, long toSearch, int low, int high) {
        // Your binary search implementation here
    }
}

NOTE: You can check the input type at compile time by implementing outputSchema(Schema schema). Take a look at http://ofps.oreilly.com/titles/9781449302641/writing_udfs.html
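The binary search left open in the skeleton above could look roughly like the following. This is a sketch in plain Java rather than against Pig's Tuple type: the class name SegmentSearch and the parallel-array layout are assumptions for illustration, the segments are assumed to be sorted by start and not to overlap, and both range ends are treated as inclusive.

```java
/** Hypothetical helper: binary search over IP segments sorted by start. */
final class SegmentSearch {
    /**
     * starts and ends are parallel arrays describing inclusive ranges,
     * sorted by start and assumed not to overlap.
     * Returns the index of the segment containing ip, or -1 if none does.
     */
    static int find(long[] starts, long[] ends, long ip) {
        int low = 0;
        int high = starts.length - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1; // unsigned shift avoids int overflow
            if (ip < starts[mid]) {
                high = mid - 1;           // ip lies before this segment
            } else if (ip > ends[mid]) {
                low = mid + 1;            // ip lies after this segment
            } else {
                return mid;               // starts[mid] <= ip <= ends[mid]
            }
        }
        return -1;
    }
}
```

With about 30000 segments, each lookup needs at most around 15 comparisons, so the cost is dominated by loading the segments once, not by the search itself.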
-
Re: Implement Binary Search in PIG Prashant Kommireddi 2011-12-14, 19:27
Michael,
Reading the list straight from HDFS (even with a high replication factor) would have no benefit over using the DistributedCache. On a large cluster it would mean poor performance. If the file is static and needs to be looked up across the cluster, DistributedCache would be the better approach.
Thanks,
Prashant
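A minimal sketch of the kind of in-memory lookup table that the DistributedCache route implies: each task loads the static segment file once and then answers lookups in O(log n). Everything here is an assumption for illustration: the class name `IpSegmentTable`, the tab-separated line format `ipstart<TAB>ipend<TAB>name`, the made-up sample names, and the premise that segments do not overlap. How the file reaches the task (DistributedCache, HDFS read) is deliberately left out.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory IP segment table; a hypothetical helper, not part of Pig or Hadoop. */
final class IpSegmentTable {
    /** One segment: inclusive end of the range plus its location name. */
    private static final class Segment {
        final long ipEnd;
        final String name;
        Segment(long ipEnd, String name) { this.ipEnd = ipEnd; this.name = name; }
    }

    // Keyed by ipstart so floorEntry() finds the last segment starting at or before an IP.
    private final TreeMap<Long, Segment> byStart = new TreeMap<>();

    /** Builds the table from lines of the assumed form "ipstart<TAB>ipend<TAB>name". */
    static IpSegmentTable fromLines(List<String> lines) {
        IpSegmentTable table = new IpSegmentTable();
        for (String line : lines) {
            if (line.trim().isEmpty()) continue;  // skip blank lines
            String[] f = line.split("\t");
            table.byStart.put(Long.parseLong(f[0]), new Segment(Long.parseLong(f[1]), f[2]));
        }
        return table;
    }

    /** Returns the name of the segment containing ip, or null when no segment covers it. */
    String lookup(long ip) {
        Map.Entry<Long, Segment> e = byStart.floorEntry(ip);  // O(log n)
        return (e != null && ip <= e.getValue().ipEnd) ? e.getValue().name : null;
    }

    public static void main(String[] args) {
        // Made-up sample data, only to show the call pattern.
        IpSegmentTable t = fromLines(Arrays.asList("10\t20\tBeijing", "30\t40\tShanghai"));
        System.out.println(t.lookup(15L));
        System.out.println(t.lookup(25L));
    }
}
```

A `TreeMap` is used instead of a hand-written binary search because `floorEntry` already gives the "last segment starting at or before this IP" lookup; the table would be built once per task and reused for every input row.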
-
Re: Implement Binary Search in PIG 唐亮 2011-12-15, 02:28
Now the question is:
How should I put all the "IP Segments" in one TUPLE?
Please help me!
-
Re: Implement Binary Search in PIG Prashant Kommireddi 2011-12-15, 02:33
When you flatten your BAG all your segments are within a single tuple.
Something like
((tag, ipstart, ipend, loc), (tag, ipstart, ipend, loc)...(tagN, ipstartN, ipendN, locN))
You can access the inner tuples positionally.
-
Re: Implement Binary Search in PIG 唐亮 2011-12-15, 05:26
Hi Prashant Kommireddi,
If so, how should I write the UDF, especially the data types in the UDF?
-
Re: Implement Binary Search in PIG Prashant Kommireddi 2011-12-15, 08:05
Not sure what you mean. Have you tried the code I forwarded? Are you facing any issues there?
If your question is about the binarySearch implementation, here is a pseudo-code-ish version. I have not tested it; please treat it as a general idea of how to go about accessing the elements within the Tuple.
Also, I am assuming you have defined a schema for the (inner) Tuple contents.

public String binarySearch(Tuple tuple, long toSearch, int low, int high) throws IOException {
    if (low > high)
        return "NOT FOUND"; // Handle this the way you would like

    if (tuple == null)
        throw new IllegalArgumentException("Tuple is null"); // Handle this the way you would like

    int mid = low + (high - low) / 2;
    Tuple midTuple = (Tuple) tuple.get(mid); // Tuple.get() returns Object, so cast
    String tag = midTuple.get(0).toString();
    long ipstart = (Long) midTuple.get(1);
    long ipend = (Long) midTuple.get(2);
    String loc = midTuple.get(3).toString();

    if (toSearch == ipstart) { // Or ipend, I am not sure how you want to search
        return loc;
    } else if (toSearch < ipstart) {
        return binarySearch(tuple, toSearch, low, mid - 1);
    } else {
        return binarySearch(tuple, toSearch, mid + 1, high);
    }
}
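The method in this message matches only when the IP equals `ipstart`. For a segment table the lookup is more likely a containment test (`ipstart <= ip <= ipend`), so here is a minimal sketch of that variant, written iteratively over plain arrays so it runs without Pig. The class name `IpRangeSearch`, the parallel-array layout, and the sample values are assumptions for illustration; it also assumes the segments are sorted by `ipstart` and do not overlap. How the segments get from the Pig relation into the arrays is outside this sketch.

```java
/** Range lookup by binary search; a hypothetical helper, not Pig API. */
final class IpRangeSearch {
    private IpRangeSearch() {}

    /**
     * Returns the index i with starts[i] <= ip <= ends[i], or -1 if no segment contains ip.
     * Assumes starts is sorted ascending and segments do not overlap.
     */
    static int find(long[] starts, long[] ends, long ip) {
        int low = 0;
        int high = starts.length - 1;
        while (low <= high) {
            int mid = low + (high - low) / 2;  // avoids int overflow of low + high
            if (ip < starts[mid]) {
                high = mid - 1;                // ip lies before segment mid
            } else if (ip > ends[mid]) {
                low = mid + 1;                 // ip lies after segment mid
            } else {
                return mid;                    // starts[mid] <= ip <= ends[mid]
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Made-up sample segments, only to show the call pattern.
        long[] starts = {10L, 30L, 50L};
        long[] ends = {20L, 40L, 60L};
        String[] names = {"A", "B", "C"};
        int i = find(starts, ends, 35L);
        System.out.println(i >= 0 ? names[i] : "NOT FOUND");
    }
}
```

Returning an index rather than the name keeps the search independent of how the location names are stored; the caller maps the index back to whatever payload it holds.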
-
Re: Implement Binary Search in PIG 唐亮 2011-12-16, 05:12
Thanks Prashant Kommireddi,
But my question is:
How do I call the UDF in PIG, and in particular which parameters should I pass to it?
-
Re: Implement Binary Search in PIG 唐亮 2011-12-18, 08:24
Prashant Kommireddi,
How do I call your UDF in a PIG script?
Thanks!
-
Re: Implement Binary Search in PIG Prashant Kommireddi 2011-12-18, 10:17
to_tuple = FOREACH order_ip_segs GENERATE tag, FLATTEN(order_seg);
result = FOREACH to_tuple GENERATE GetProvinceNameFromIPNum(toSearch, *);
-
Re: Implement Binary Search in PIG唐亮 2011-12-19, 02:49
Prashant Kommireddi,
Thank you very much! Your code looks good, especially the use of '*'.

However, I am still not sure about the details. My Pig script is below:

-- Load IP segments
raw_ip_segment = load ...
ip_segs = foreach raw_ip_segment generate ipstart, ipend, name;
group_ip_segs = group ip_segs all;

order_ip_segs = foreach group_ip_segs {
    order_seg = order ip_segs by ipstart, ipend;
    generate 't' as tag, order_seg;
}
describe order_ip_segs
order_ip_segs: {tag: chararray,order_seg: {ipstart: long,ipend: long,poid: chararray}}

-- Load IPs from the log
ip_log = load ...
ip_tag = foreach ip_log generate 't' as tag, ip;

-- Join by tag
join_ip_tag = join order_ip_segs by tag, ip_tag by tag;

retain_ip_segs = foreach join_ip_tag generate ip_tag::ip as ip, order_ip_segs::order_seg as order_seg;
-- ip: the IP I want to look up
-- order_seg: ordered IP segments used for the binary search

Can you show me the remaining steps in detail, such as the code of the UDF and the Pig script that calls it?

On 2011-12-18 at 6:17 PM, Prashant Kommireddi <[EMAIL PROTECTED]> wrote:

> to_tuple = FOREACH order_ip_segs GENERATE tag, FLATTEN(order_seg);
>
> result = FOREACH to_tuple GENERATE GetProvinceNameFromIPNum(toSearch, *);
>
> 2011/12/18 唐亮 <[EMAIL PROTECTED]>
>
> > Prashant Kommireddi,
> > How do I call your UDF in a Pig script?
> >
> > Thanks!
> >
> > On 2011-12-16 at 1:12 PM, 唐亮 <[EMAIL PROTECTED]> wrote:
> >
> > > Thanks Prashant Kommireddi,
> > >
> > > But my question is:
> > > How do I call the UDF in Pig, and in particular which parameters do I pass to it?
> > >
> > > On 2011-12-15 at 4:05 PM, Prashant Kommireddi <[EMAIL PROTECTED]> wrote:
> > >
> > > > Not sure what you mean. Have you tried the code I forwarded? Are you
> > > > facing any issues there?
> > > >
> > > > If your question is regarding the binarySearch implementation, here is a
> > > > pseudo-code'ish implementation. I have not tested this, please treat this
> > > > as a general idea on how to go about accessing the elements within the
> > > > Tuple.
> > > >
> > > > ALSO, I am assuming you have defined a schema for the (inner) Tuple contents.
> > > >
> > > > public String binarySearch(Tuple tuple, long toSearch, int low, int high)
> > > >         throws ExecException
> > > > {
> > > >     if (low > high)
> > > >         return "NOT FOUND"; // Handle this the way you would like
> > > >
> > > >     if (tuple == null)
> > > >         throw new IllegalArgumentException("Tuple is null"); // Handle this the way you would like
> > > >
> > > >     int mid = (low + high) / 2;
> > > >     Tuple midTuple = (Tuple) tuple.get(mid); // Tuple.get returns Object
> > > >     String tag = midTuple.get(0).toString();
> > > >     long ipstart = (Long) midTuple.get(1);
> > > >     long ipend = (Long) midTuple.get(2);
> > > >     String loc = midTuple.get(3).toString();
> > > >
> > > >     if (toSearch == ipstart) // Or ipend, I am not sure how you want to search
> > > >     {
> > > >         return loc;
> > > >     }
> > > >     else if (toSearch < ipstart)
> > > >         return binarySearch(tuple, toSearch, low, mid - 1);
> > > >     else
> > > >         return binarySearch(tuple, toSearch, mid + 1, high);
> > > > }
> > > >
> > > > 2011/12/14 唐亮 <[EMAIL PROTECTED]>
> > > >
> > > > > Hi Prashant Kommireddi,
> > > > >
> > > > > If so, how should I write the UDF, especially the data types in the UDF?
> > > > >
> > > > > 2011/12/15 Prashant Kommireddi <[EMAIL PROTECTED]>
> > > > >
> > > > > > When you flatten your BAG all your segments are within a single tuple.
> > > > > > Something like
> > > > > >
> > > > > > ((tag, ipstart, ipend, loc), (tag, ipstart, ipend, loc)...(tagN,
> > > > > > ipstartN, ipendN, locN))
> > > > > >
> > > > > > You can access the inner tuples positionally.
> > > > > >
> > > > > > On Dec 14, 2011, at 6:28 PM, "唐亮" <[EMAIL PROTECTED]> wrote:
> > > > > >
> > > > > > > Now the question is:
> > > > > > > How should I put all the "IP Segments" in one TUPLE?
> > > > > > >
> > > > > > > Please help me!
> > > > > > >
> > > > > > > 2011/12/15 Prashant Kommireddi <[EMAIL PROTECTED]>
-
Re: Implement Binary Search in PIGDmitriy Ryaboy 2011-12-19, 04:55
There's a very detailed write-up about this here:
http://ofps.oreilly.com/titles/9781449302641/writing_udfs.html
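
For reference, the range lookup discussed in this thread can be sketched in plain Java, independent of Pig. This is only an illustration under stated assumptions: the segments are sorted by ipstart and do not overlap, and the names IpSegmentSearch, Segment, and lookup are hypothetical, not part of Pig or of any code posted above. Unlike the pseudo-code quoted earlier, it matches any IP inside [ipstart, ipend] rather than only an exact ipstart. Inside a real UDF the list would first have to be filled from the tuples of the bag, which is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a Pig class: binary search over sorted IP ranges.
final class IpSegmentSearch {

    /** One IP segment: an inclusive numeric range plus a label (assumed layout). */
    static final class Segment {
        final long ipStart;
        final long ipEnd;
        final String name;

        Segment(long ipStart, long ipEnd, String name) {
            this.ipStart = ipStart;
            this.ipEnd = ipEnd;
            this.name = name;
        }
    }

    /**
     * Returns the name of the segment whose range contains ip, or null if
     * no segment contains it. Assumes the list is sorted by ipStart and the
     * ranges do not overlap.
     */
    static String lookup(List<Segment> sorted, long ip) {
        int low = 0;
        int high = sorted.size() - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1; // unsigned shift avoids int overflow
            Segment s = sorted.get(mid);
            if (ip < s.ipStart) {
                high = mid - 1;           // ip lies before this segment
            } else if (ip > s.ipEnd) {
                low = mid + 1;            // ip lies after this segment
            } else {
                return s.name;            // ipStart <= ip <= ipEnd
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<Segment> segs = new ArrayList<>();
        segs.add(new Segment(100L, 199L, "A"));
        segs.add(new Segment(300L, 399L, "B"));
        segs.add(new Segment(400L, 450L, "C"));

        System.out.println(lookup(segs, 150L)); // inside the first range
        System.out.println(lookup(segs, 250L)); // falls in a gap between ranges
        System.out.println(lookup(segs, 450L)); // upper bound of the last range
    }
}
```

The loop is iterative rather than recursive, so the stack depth stays constant however many segments there are. The list is read-only during lookup, so one sorted copy could be reused across many lookups if the surrounding code allows it.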