Pig, mail # user - Implement Binary Search in PIG


Re: Implement Binary Search in PIG
Thejas Nair 2011-12-13, 23:32
My assumption is that 唐亮 is trying to do a binary search on bags within
the tuples in a relation (i.e., the schema of the relation has a bag column). I
don't think he is trying to treat the entire relation as one bag and do
a binary search on that.
-Thejas
On 12/13/11 2:30 PM, Andrew Wells wrote:
> I don't think this can be done.
>
> Pig is just a Hadoop job, and the idea behind Hadoop is to read all the
> data in a file.
>
> First, by the time you have put all the data into an array, you would have
> been better off just checking each element for the one you were looking for.
>
> What you would get is a cost of n + lg(n), which is effectively just n once
> you count putting the data into an array.
> Second, Hadoop is all about large data analysis, usually more than 100 GB,
> so putting this into memory is out of the question.
> Third, Hadoop is efficient because it processes this large amount of data
> by splitting it up across multiple processes. To do an efficient binary
> search, you would need to do this in one mapper or one reducer.
>
> My opinion is: just don't fight Hadoop/Pig.
>
>
>
> On Tue, Dec 13, 2011 at 1:56 PM, Thejas Nair <[EMAIL PROTECTED]> wrote:
>
>> Bags can be very large and might not fit into memory; in such cases, some
>> or all of the bag might have to be stored on disk, and random access on
>> the bag is not efficient. That is why the DataBag interface does not
>> support it.
>>
>> As Prashant suggested, storing it in a tuple would be a good alternative,
>> if you want to have random access to do binary search.
>>
>> -Thejas
>>
>>
>>
>> On 12/12/11 7:54 PM, 唐亮 wrote:
>>
>>> Hi all,
>>> How can I implement a binary search in Pig?
>>>
>>> In one relation, there is a bag whose items are sorted,
>>> and I want to check whether a specific item exists in the bag.
>>>
>>> In a UDF, I can't randomly access items in the DataBag container,
>>> so I have to copy the items from the DataBag into an ArrayList, which is
>>> time-consuming.
>>>
>>> How can I implement the binary search efficiently in Pig?
>>>
>>>
>>
>
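
The trade-off discussed in this thread (a bag only supports iteration, while a tuple allows random access) can be illustrated with a minimal sketch. It deliberately uses plain Java collections rather than Pig's own DataBag and Tuple classes, so an Iterable stands in for the bag and a List stands in for the tuple's fields. The class and method names (BagSearchSketch, containsByScan, containsByBinarySearch) are hypothetical and chosen only for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class BagSearchSketch {

    // Iterator-only access, standing in for a bag: the best we can do is a
    // linear scan, stopping early because the items are assumed to be sorted.
    public static <T extends Comparable<T>> boolean containsByScan(
            Iterable<T> sortedItems, T target) {
        for (T item : sortedItems) {
            int cmp = item.compareTo(target);
            if (cmp == 0) {
                return true;   // found the target
            }
            if (cmp > 0) {
                return false;  // passed the position where it would be
            }
        }
        return false;
    }

    // Random access, standing in for a tuple's fields: binary search needs
    // only about lg(n) comparisons on a random-access list.
    public static <T extends Comparable<T>> boolean containsByBinarySearch(
            List<T> sortedItems, T target) {
        return Collections.binarySearch(sortedItems, target) >= 0;
    }

    public static void main(String[] args) {
        // Sorted sample data, made up for this example.
        List<Integer> items = Arrays.asList(2, 3, 5, 7, 11, 13);
        System.out.println(containsByScan(items, 7));
        System.out.println(containsByBinarySearch(items, 7));
        System.out.println(containsByBinarySearch(items, 8));
    }
}
```

If the items first have to be copied out of an iterator-only container into a list, the copy alone already costs n steps, so a single lookup is no cheaper than the early-exit scan. Binary search only pays off when the data is already held in a random-access form.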