Sqoop, mail # user - Using Pig to load data imported with Sqoop


Re: Using Pig to load data imported with Sqoop
Jarek Jarcec Cecho 2013-11-07, 18:39
I believe that Pig's SequenceFileLoader is not compatible with custom Writables at the moment. Per the docs, the loader is only able to work with the following types:

  Text, IntWritable, LongWritable, FloatWritable, DoubleWritable, BooleanWritable, ByteWritable
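
As a rough illustration of that restriction, the sketch below models a loader that recognises only a fixed list of Writable class names and has no conversion for anything else. It does not use any Pig or Hadoop classes; the names SupportedWritables, PIG_TYPES and pigTypeFor are hypothetical, and the Pig type on the right-hand side of each entry is an assumed mapping rather than something taken from the piggybank source. A Sqoop-generated record class such as com.acme.Dual simply falls outside the list.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

public class SupportedWritables {
    // Writable types listed in the docs, mapped to an ASSUMED Pig type name.
    static final Map<String, String> PIG_TYPES = new LinkedHashMap<>();
    static {
        PIG_TYPES.put("org.apache.hadoop.io.Text", "chararray");
        PIG_TYPES.put("org.apache.hadoop.io.IntWritable", "int");
        PIG_TYPES.put("org.apache.hadoop.io.LongWritable", "long");
        PIG_TYPES.put("org.apache.hadoop.io.FloatWritable", "float");
        PIG_TYPES.put("org.apache.hadoop.io.DoubleWritable", "double");
        PIG_TYPES.put("org.apache.hadoop.io.BooleanWritable", "boolean");
        PIG_TYPES.put("org.apache.hadoop.io.ByteWritable", "byte");
    }

    // Hypothetical helper: returns the Pig type for a supported Writable,
    // or an empty Optional when the class is not in the list.
    static Optional<String> pigTypeFor(String writableClassName) {
        return Optional.ofNullable(PIG_TYPES.get(writableClassName));
    }

    public static void main(String[] args) {
        // A built-in Writable is recognised.
        System.out.println(pigTypeFor("org.apache.hadoop.io.Text"));
        // A custom, Sqoop-generated record class is not.
        System.out.println(pigTypeFor("com.acme.Dual"));
    }
}
```

Running main prints Optional[chararray] for Text and Optional.empty for com.acme.Dual, which is the situation a custom Writable ends up in.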

Jarcec

On Mon, Nov 04, 2013 at 07:18:43PM +1100, Andre Araujo wrote:
> Hi, all,
>
> I've loaded some data with Sqoop from Oracle onto HDFS, storing it as
> SequenceFiles and I'm having problems loading it with Pig.
> I'm using Sqoop 1.4.3 and used the following steps (simplified example
> using the DUAL table).
>
> Any ideas of why it loads incorrectly? Am I missing any steps?
>
> Thanks,
> Andre
>
>
>
> *1. Imported data from the table onto HDFS (the DUAL table has only 1 row
> with 1 field containing the string "X") *
>
> sqoop import -D mapred.child.java.opts="$JDBC_JAVA_OPTS" --connect $CONNSTR
>  -m 1 --query "select DUMMY from dual where \$CONDITIONS" --target-dir test
> --as-sequencefile --class-name com.acme.Dual
>
> The Dual.java file is attached.
>
> *2. Generated the Dual.jar file:*
>
> javac -cp
> /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/sqoop/sqoop-1.4.3-cdh4.3.0.jar:/opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop/client-0.20/hadoop-core-2.0.0-mr1-cdh4.3.0.jar:/opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop/hadoop-common.jar:.
> com/acme/Dual.java
> jar cf /tmp/Dual.jar com/acme/Dual.class
>
> *3. Tried to load the data with Pig; however, the field value is read as 0
> (zero) instead of the string "X":*
>
> REGISTER
> /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/pig/piggybank.jar;
> REGISTER
> /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/sqoop/sqoop-1.4.3-cdh4.3.0.jar
> REGISTER
> /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop/client-0.20/hadoop-core-2.0.0-mr1-cdh4.3.0.jar
> REGISTER
> /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop/hadoop-common.jar
> REGISTER /tmp/Dual.jar
> DEFINE SequenceFileLoader
> org.apache.pig.piggybank.storage.SequenceFileLoader();
> log = LOAD 'test' USING SequenceFileLoader AS (DUMMY:chararray);
> DUMP log;
>
>
> ...
> 2013-11-04 03:21:32,325 [main] INFO  org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
>
> HadoopVersion   PigVersion      UserId  StartedAt       FinishedAt      Features
> 2.0.0-cdh4.3.0  0.11.0-cdh4.3.0 araujo  2013-11-04 03:21:12     2013-11-04 03:21:32     UNKNOWN
>
> Success!
>
> Job Stats (time in seconds):
> JobId   Maps    Reduces MaxMapTime      MinMapTIme      AvgMapTime      MedianMapTime   MaxReduceTime   MinReduceTime   AvgReduceTime   MedianReducetime        Alias   Feature Outputs
> job_201310230912_0065   1       0       6       6       6       6       0       0       0       0       log     MAP_ONLY        hdfs://n1.hadoop.cto.pythian.com:8020/tmp/temp-805635901/tmp-702886222,
>
> Input(s):
> Successfully read 1 records (479 bytes) from: "hdfs://n1.hadoop.cto.pythian.com:8020/user/araujo/test"
>
> Output(s):
> Successfully stored 1 records (8 bytes) in: "hdfs://n1.hadoop.cto.pythian.com:8020/tmp/temp-805635901/tmp-702886222"
>
> Counters:
> Total records written : 1
> Total bytes written : 8
> Spillable Memory Manager spill count : 0
> Total bags proactively spilled: 0
> Total records proactively spilled: 0
>
> Job DAG:
> job_201310230912_0065
>
>
> 2013-11-04 03:21:32,338 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
> 2013-11-04 03:21:32,342 [main] INFO  org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
> 2013-11-04 03:21:32,350 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
> 2013-11-04 03:21:32,350 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
> *(0)  <--- THIS SHOULD SHOW "X"*
>
>
> --
> André Araújo
> Database Administrator / SDM