Pig, mail # user - Flatten a Bag on One Line?


Re: Flatten a Bag on One Line?
Eli Finkelshteyn 2012-02-13, 06:36
Hey Folks,
Sorry it took so long to get back on this. The function I wound up using
is really simple:

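# Jython UDF: Pig hands the bag to Python as a list of tuples.
@outputSchema("t:tuple()")
def bagToTuple(bag):
    # Take the first field of each tuple in the bag and pack them into one tuple.
    return tuple([item[0] for item in bag])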

You would use this in Pig to get what I wanted by just running that
function on a bag and then flattening the result, for example:

flattened_line = FOREACH line_with_bag GENERATE something,
something_else, flatten(myfuncs.bagToTuple(some_bag));
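
For anyone who wants to sanity-check the function outside of Pig, here is a
minimal sketch in plain Python. The stand-in outputSchema decorator, the
sample_bag name, and the tuple concatenation that imitates FLATTEN are all
assumptions made for illustration; inside Pig the decorator comes from the
Jython runtime and FLATTEN is done by Pig itself.

```python
# Stand-in for the decorator that Pig's Jython runtime supplies automatically.
# Assumption: outside Pig it only needs to hand the function back unchanged.
def outputSchema(schema):
    def wrap(func):
        func.outputSchema = schema
        return func
    return wrap


@outputSchema("t:tuple()")
def bagToTuple(bag):
    # Keep the first field of every tuple in the bag.
    return tuple([item[0] for item in bag])


# Hypothetical sample: the bag from the original question, as a list of 1-tuples.
sample_bag = [(48567859,), (15996334,), (15897772,)]
flat = bagToTuple(sample_bag)

# Rough imitation of FLATTEN on a tuple: its fields join the same row
# instead of producing one row per bag member.
row = (98812, 3) + flat
print(row)
```

The point of returning a tuple rather than a bag is that flattening a tuple
keeps everything on one row, which is what the original question asked for.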

Thejas, I created a JIRA for this here
<https://issues.apache.org/jira/browse/PIG-2529>. This is the first one
I've ever made, so please excuse me if I messed anything up in the format.

Cheers,
Eli

On 2/10/12 7:07 PM, Thejas Nair wrote:
> Pig doesn't have a piggybank for python udfs, but it makes sense to
> create one.
> Please attach your udf to a new JIRA, and we can figure out where to put
> it.
>
> -Thejas
>
>
> On 2/10/12 1:14 PM, Eli Finkelshteyn wrote:
>> I was going to do this as a python udf, but haven't had a chance yet
>> since other stuff I was working on took priority. As soon as I do write
>> it, I'll be sure to upload it here. On a related note: is there a
>> piggybank for python udfs I could contribute it to for posterity?
>>
>> Eli
>>
>> On 2/10/12 11:09 AM, pablomar wrote:
>>> what about something like this?
>>> (typing on the phone, forgive any mistake)
>>>
>>> public class Flat extends EvalFunc<Tuple>
>>> {
>>>     public Tuple exec(Tuple input) throws IOException
>>>     {
>>>         try
>>>         {
>>>             List<Object> list = new LinkedList<Object>();
>>>             DataBag bag = (DataBag) input.get(0);
>>>             Iterator it = bag.iterator();
>>>             while (it.hasNext())
>>>             {
>>>                 Tuple t = (Tuple) it.next();
>>>                 if (t != null && t.size() > 0)
>>>                     list.add(t.get(0));
>>>             }
>>>
>>>             TupleFactory fac = TupleFactory.getInstance();
>>>             return fac.newTuple(list);
>>>         }
>>>         catch....
>>>
>>> On 2/10/12, Brendan Gill<[EMAIL PROTECTED]> wrote:
>>>> Eli,
>>>>
>>>> I'm trying to do exactly this, but am pretty new to Pig. Any chance you
>>>> would share what the UDF would look like? Then I can tailor it to our
>>>> needs.
>>>>
>>>> Much appreciated if possible,
>>>>
>>>> Brendan
>>>>
>>>>
>>>>
>>>> On Thu, Feb 9, 2012 at 9:20 PM, Eli Finkelshteyn<[EMAIL PROTECTED]>
>>>> wrote:
>>>>
>>>>> Thanks. Was hoping/assuming there was a built-in, but I guess UDF it is.
>>>>>
>>>>> Eli
>>>>>
>>>>>
>>>>> On 2/9/12 2:14 PM, Yulia Tolskaya wrote:
>>>>>
>>>>>> I actually can't think of an easy way to do this without it becoming a
>>>>>> cross product. You could just write a really simple UDF that takes a bag
>>>>>> and spits out just the members.
>>>>>>
>>>>>> Yulia
>>>>>>
>>>>>> On 2/9/12 1:26 PM, "Eli Finkelshteyn" <[EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>>> This is probably easy, but my PigLatin is rusty, and I don't seem to be
>>>>>>> able to find an answer on Google. If I have a record of the form:
>>>>>>>
>>>>>>> 98812 3 {(48567859),(15996334),(15897772)}
>>>>>>>
>>>>>>> How can I flatten that bag to leave all members on a single row, i.e.:
>>>>>>>
>>>>>>> 98812 3 48567859 15996334 15897772
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Eli
>>>>>>>
>>
>