Hive, mail # user - A GenericUDF Function to Extract a Field From an Array of Structs


RE: A GenericUDF Function to Extract a Field From an Array of Structs
Peter Chu 2013-03-29, 06:04
Sorry, the test should be the following (I changed extract_shas to extract_product_category):
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF.DeferredObject;
import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.testng.annotations.Test;

import java.util.ArrayList;
import java.util.List;

public class TestGenericUDFExtractProductCategory
{
    ArrayList<String> fieldNames = new ArrayList<String>();
    ArrayList<ObjectInspector> fieldObjectInspectors = new ArrayList<ObjectInspector>();

    @Test
    public void simpleTest()
        throws Exception
    {
        ListObjectInspector firstInspector = new MyListObjectInspector();

        ArrayList test = new ArrayList();
        test.add("test");

        ArrayList test2 = new ArrayList();
        test2.add(test);

        StructObjectInspector soi = ObjectInspectorFactory.getStandardStructObjectInspector(test, test2);

        fieldNames.add("productCategory");
        fieldObjectInspectors.add(PrimitiveObjectInspectorFactory.writableStringObjectInspector);

        GenericUDF.DeferredObject firstDeferredObject = new MyDeferredObject(test2);

        GenericUDF extract_product_category = new GenericUDFExtractProductCategory();

        extract_product_category.initialize(new ObjectInspector[]{firstInspector});

        extract_product_category.evaluate(new DeferredObject[]{firstDeferredObject});
    }

    public class MyDeferredObject implements DeferredObject
    {
        private Object value;

        public MyDeferredObject(Object value) {
            this.value = value;
        }

        @Override
        public Object get() throws HiveException
        {
            return value;
        }
    }

    private class MyListObjectInspector implements ListObjectInspector
    {
        @Override
        public ObjectInspector getListElementObjectInspector()
        {
            return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldObjectInspectors);
        }

        @Override
        public Object getListElement(Object data, int index)
        {
            List myList = (List) data;
            if (myList == null || index > myList.size()) {
                return null;
            }
            return myList.get(index);
        }

        @Override
        public int getListLength(Object data)
        {
            if (data == null) {
                return -1;
            }
            return ((List) data).size();
        }

        @Override
        public List<?> getList(Object data)
        {
            return (List) data;
        }

        @Override
        public String getTypeName()
        {
            return null;  //To change body of implemented methods use File | Settings | File Templates.
        }

        @Override
        public Category getCategory()
        {
            return Category.LIST;
        }
    }
}

From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Subject: A GenericUDF Function to Extract a Field From an Array of Structs
Date: Thu, 28 Mar 2013 14:16:33 -0700
I am trying to write a GenericUDF function that collects all values of a specific struct field within an array for each record and returns them as an array.
I wrote the UDF (as below), and it seems to work but:
1) It does not work when I run it on an external table, but it works fine on a managed table. Any idea why?
2) I am having a tough time writing a test for this. I have attached the test I have so far, and it does not work: I always get 'java.util.ArrayList cannot be cast to org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector' or 'cannot cast String to LazyString'. My question is: how do I supply a list of structs to the evaluate method?
Any help will be greatly appreciated.
Thanks,
Peter
The table:
CREATE EXTERNAL TABLE FOO (
    TS string,
    customerId string,
    products array< struct<productCategory:string> >
  )
  PARTITIONED BY (ds string)
  ROW FORMAT SERDE 'some.serde'
  WITH SERDEPROPERTIES ('error.ignore'='true')
  LOCATION 'some_locations'
  ;
A row holds: 1340321132000, 'some_company', [{"productCategory":"footwear"},{"productCategory":"eyewear"}]
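
To make the intended result concrete, here is a minimal plain-Java sketch of the extraction for a row like the one above. It has no Hive dependency and is not the UDF itself: the class ExtractFieldSketch and the method extractField are hypothetical names, and it assumes each struct is modelled as a list of its field values in declaration order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Hive: models the extraction on plain Java collections.
final class ExtractFieldSketch {

    // Assumption: each struct is a List<Object> of field values in declaration order,
    // and the array column is a List of such structs.
    static List<String> extractField(List<List<Object>> structs, int fieldIndex) {
        if (structs == null) {
            return null; // a missing array yields a missing result
        }
        List<String> result = new ArrayList<String>(structs.size());
        for (List<Object> struct : structs) {
            // Keep one output element per input struct, even when the struct or field is null.
            Object value = (struct == null) ? null : struct.get(fieldIndex);
            result.add(value == null ? null : value.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // The products column of the sample row: two structs with one field each.
        List<List<Object>> products = new ArrayList<List<Object>>();
        products.add(Arrays.<Object>asList("footwear"));
        products.add(Arrays.<Object>asList("eyewear"));
        System.out.println(extractField(products, 0)); // prints [footwear, eyewear]
    }
}
```

In a real GenericUDF the same loop would go through the list and struct ObjectInspectors rather than casting to List directly, since the runtime representation depends on the SerDe.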
This is my code:
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.lazy.LazyString;
import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector.Category;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.StructField;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;
import org.apache.hadoop.io.Text;

import java.util.ArrayList;

@Description(name = "extract_product_category",
        value = "_FUNC_( array< struct<productCategory:string> > ) - Collect all product category field values inside an array of struct(s), and return the results in an array<string>",
        extended = "Example:\n SELECT _FUNC_(array_of_structs_with_product_catego