Pig >> mail # user >> Flatten a Bag on One Line?


Re: Flatten a Bag on One Line?
Hey Folks,
Sorry it took so long to get back on this. The function I wound up using
is really simple:

@outputSchema("t:tuple()")
def bagToTuple(bag):
    # Keep the first field of each tuple in the bag, in bag order.
    return tuple(item[0] for item in bag)
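
To try the function outside of Pig, something like the following sketch might help. It assumes that the outputSchema decorator is supplied by Pig when the script is registered as a Jython UDF, so the stand-in decorator below is hypothetical and only exists to let the file run under plain Python. The bag is modelled as a list of one-field tuples, using the sample record from the original question.

```python
# Hypothetical stand-in for the outputSchema decorator that Pig supplies to
# Jython UDFs. Assumption: for local testing it only needs to record the
# schema string on the function.
def outputSchema(schema):
    def wrap(func):
        func.outputSchema = schema
        return func
    return wrap


@outputSchema("t:tuple()")
def bagToTuple(bag):
    # Each bag member is a one-field tuple; keep only that field.
    return tuple([item[0] for item in bag])


# The bag from the original question, modelled as a list of 1-tuples.
some_bag = [(48567859,), (15996334,), (15897772,)]

# Appending the tuple's fields to the other columns gives the single row
# that was asked for.
row = (98812, 3) + bagToTuple(some_bag)
print(row)  # (98812, 3, 48567859, 15996334, 15897772)
```

The stand-in decorator should be dropped before registering the script with Pig.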

You would use this in Pig to get what I wanted by just running that
function on a bag and then flattening the result, for example:

flattened_line = FOREACH line_with_bag GENERATE something,
something_else, flatten(myfuncs.bagToTuple(some_bag));
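
The reason the bag has to become a tuple first is that FLATTEN treats the two differently: a flattened bag yields one output row per member (the cross product mentioned earlier in the thread), while a flattened tuple has its fields placed into the same row. Below is a rough plain-Python model of that difference. The helper names flatten_bag and flatten_tuple are made up for illustration and are not part of Pig.

```python
def flatten_bag(prefix, bag):
    """Model of FLATTEN on a bag: one output row per bag member."""
    return [prefix + member for member in bag]


def flatten_tuple(prefix, tup):
    """Model of FLATTEN on a tuple: its fields join the same single row."""
    return [prefix + tup]


prefix = (98812, 3)
some_bag = [(48567859,), (15996334,), (15897772,)]

# Flattening the bag directly multiplies the row.
rows_from_bag = flatten_bag(prefix, some_bag)
for r in rows_from_bag:
    print(r)

# Converting the bag to a tuple first keeps everything on one row.
as_tuple = tuple(member[0] for member in some_bag)
rows_from_tuple = flatten_tuple(prefix, as_tuple)
print(rows_from_tuple[0])
```

The first loop prints three rows, one per bag member; the last print shows the single combined row.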

Thejas, I created a JIRA for this here
<https://issues.apache.org/jira/browse/PIG-2529>. This is the first one
I've ever made, so please excuse me if I messed anything up in the format.

Cheers,
Eli

On 2/10/12 7:07 PM, Thejas Nair wrote:
> Pig doesn't have a piggybank for python udfs, but it makes sense to
> create one.
> Please attach your UDF to a new JIRA, and we can figure out where to
> put it.
>
> -Thejas
>
>
> On 2/10/12 1:14 PM, Eli Finkelshteyn wrote:
>> I was going to do this as a python udf, but haven't had a chance yet
>> since other stuff I was working on took priority. As soon as I do write
>> it, I'll be sure to upload it here. On a related note: is there a
>> piggybank for python udfs I could contribute it to for posterity?
>>
>> Eli
>>
>> On 2/10/12 11:09 AM, pablomar wrote:
>>> what about something like this?
>>> (typing on the phone, forgive any mistake)
>>>
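>>> import java.io.IOException;
>>> import java.util.LinkedList;
>>> import java.util.List;
>>> import org.apache.pig.EvalFunc;
>>> import org.apache.pig.data.DataBag;
>>> import org.apache.pig.data.Tuple;
>>> import org.apache.pig.data.TupleFactory;
>>>
>>> public class Flat extends EvalFunc<Tuple>
>>> {
>>>     public Tuple exec(Tuple input) throws IOException
>>>     {
>>>         try
>>>         {
>>>             // Collect the first field of every tuple in the bag.
>>>             List<Object> list = new LinkedList<Object>();
>>>             DataBag bag = (DataBag) input.get(0);
>>>             for (Tuple t : bag)
>>>             {
>>>                 if (t != null && t.size() > 0)
>>>                     list.add(t.get(0));
>>>             }
>>>
>>>             TupleFactory fac = TupleFactory.getInstance();
>>>             return fac.newTuple(list);
>>>         }
>>>         catch (Exception e)
>>>         {
>>>             // The original message left the catch clause elided;
>>>             // wrapping in IOException is one possible choice.
>>>             throw new IOException("Flat: could not flatten bag", e);
>>>         }
>>>     }
>>> }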
>>>
>>> On 2/10/12, Brendan Gill<[EMAIL PROTECTED]> wrote:
>>>> Eli,
>>>>
>>>> I'm trying to do exactly this, but am pretty new to Pig. Any chance
>>>> you
>>>> would share what the UDF would look like? Then I can tailor it to our
>>>> needs.
>>>>
>>>> Much appreciated if possible,
>>>>
>>>> Brendan
>>>>
>>>>
>>>>
>>>> On Thu, Feb 9, 2012 at 9:20 PM, Eli Finkelshteyn<[EMAIL PROTECTED]>
>>>> wrote:
>>>>
>>>>> Thanks. Was hoping/assuming there was a built-in, but I guess udf it
>>>>> is.
>>>>>
>>>>> Eli
>>>>>
>>>>>
>>>>> On 2/9/12 2:14 PM, Yulia Tolskaya wrote:
>>>>>
>>>>>> I actually can't think of an easy way to do this without it
>>>>>> becoming a cross product. You could just write a really simple
>>>>>> UDF that takes a bag and spits out just the members.
>>>>>>
>>>>>> Yulia
>>>>>>
>>>>>> On 2/9/12 1:26 PM, "Eli Finkelshteyn" <[EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>>> This is probably easy, but my Pig Latin is rusty, and I don't seem
>>>>>>> to be able to find an answer on Google. If I have a record of the
>>>>>>> form:
>>>>>>>
>>>>>>> 98812 3 {(48567859),(15996334),(15897772)}
>>>>>>>
>>>>>>> How can I flatten that bag to leave all members on a single row,
>>>>>>> i.e.:
>>>>>>>
>>>>>>> 98812 3 48567859 15996334 15897772
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Eli
>>>>>>>
>>
>
