-
table from sequence file
Sagar Naik 2010-04-15, 08:06
Hi
My data is in the value field of a sequence file, and the value has subfields in it. I am trying to create a table using these subfields. Example:
<KEY> <VALUE>
<KEY_FIELD1, KEY_FIELD2> forms the key
<VALUE_FIELD1, VALUE_FIELD2, VALUE_FIELD3> forms the value.
So I am trying to create a table from VALUE_FIELD*:
CREATE EXTERNAL TABLE table_name (VALUE_FIELD1 BIGINT, VALUE_FIELD2 STRING, VALUE_FIELD3 BIGINT) STORED AS SEQUENCEFILE;
I am planning to write a custom SerDe implementation and a custom SequenceFileReader. Please let me know if I am on the right track.
-Sagar
-
Re: table from sequence file
Arvind Prabhakar 2010-04-15, 19:00
Hi Sagar,
It looks like your source file has custom writable types in it. If that is the case, implementing a SerDe that works with that type may not be that straightforward, although it is doable.
An alternative would be to implement a custom RecordReader that converts the value of your custom writable to a Struct type, which can then be queried directly.
Arvind
-
Re: table from sequence file
Edward Capriolo 2010-04-15, 20:23
I am actually having a lot of trouble with this. I have a sequence file that opens fine with:
/home/edward/hadoop/hadoop-0.20.2/bin/hadoop dfs -text /home/edward/Downloads/seq/seq
create external table keyonly( ver string , theid int, thedate string )
row format delimited fields terminated by ','
STORED AS
inputformat 'org.apache.hadoop.mapred.SequenceFileAsTextInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.HiveNullValueSequenceFileOutputFormat'
location '/home/edward/Downloads/seq';
Also tried
inputformat 'org.apache.hadoop.mapred.SequenceFileInputFormat'
or stored as SEQUENCEFILE
I always get this...
2010-04-15 13:10:43,849 ERROR CliDriver (SessionState.java:printError(255)) - Failed with exception java.io.IOException:java.io.EOFException
java.io.IOException: java.io.EOFException
    at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:332)
    at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:120)
    at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:681)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:146)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:197)
    at org.apache.hadoop.hive.ql.QTestUtil.executeClient(QTestUtil.java:510)
    at org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_key_only(TestCliDriver.java:79)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at junit.framework.TestCase.runTest(TestCase.java:154)
    at junit.framework.TestCase.runBare(TestCase.java:127)
    at junit.framework.TestResult$1.protect(TestResult.java:106)
    at junit.framework.TestResult.runProtected(TestResult.java:124)
    at junit.framework.TestResult.run(TestResult.java:109)
    at junit.framework.TestCase.run(TestCase.java:118)
    at junit.framework.TestSuite.runTest(TestSuite.java:208)
    at junit.framework.TestSuite.run(TestSuite.java:203)
    at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:422)
    at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:931)
    at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:785)
Caused by: java.io.EOFException
    at java.util.zip.GZIPInputStream.readUByte(GZIPInputStream.java:207)
    at java.util.zip.GZIPInputStream.readUShort(GZIPInputStream.java:197)
    at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:136)
    at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:58)
    at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:68)
    at org.apache.hadoop.io.compress.GzipCodec$GzipInputStream$ResetableGZIPInputStream.<init>(GzipCodec.java:92)
    at org.apache.hadoop.io.compress.GzipCodec$GzipInputStream.<init>(GzipCodec.java:101)
    at org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:169)
    at org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:179)
    at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1520)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1428)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:43)
    at org.apache.hadoop.mapred.SequenceFileAsTextRecordReader.<init>(SequenceFileAsTextRecordReader.java:44)
    at org.apache.hadoop.mapred.SequenceFileAsTextInputFormat.getRecordReader(SequenceFileAsTextInputFormat.java:43)
    at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:296)
    at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:311)
    ... 21 more
Does anyone have a clue on what I am doing wrong??
-
Re: table from sequence file
Arvind Prabhakar 2010-04-15, 23:23
The SequenceFileAsTextInputFormat converts the sequence record values to strings by calling toString(). Assuming that your data has a custom writable with multiple fields in it, I don't think it is possible for you to map the individual fields to different columns.
Can you try doing the following:
create external table dummy( fullvalue string) stored as inputformat 'org.apache.hadoop.mapred.SequenceFileAsTextInputFormat' outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' location '/home/edward/Downloads/seq';
and then doing a select * from dummy.
Arvind
-
Re: table from sequence file
Sagar Naik 2010-04-16, 01:01
Hi Arvind,
You guessed correctly.
We have custom writables. I looked at the TextRecordReader implementation to get an idea of how a RecordReader works.
It looks like createRow creates an instance and next(...) populates this instance. createRow returns an instance of Writable.
Is the Writable instance the same as the "struct" from your reply?
How is this Writable instance mapped to column names? Is there something in the command-line syntax that binds the Writable instance to column names and values? Or will the ObjectInspector do it magically?
-Sagar
-
Re: table from sequence file
Edward Capriolo 2010-04-16, 02:00
[edward@ec hive]$ head -1 /home/edward/Downloads/seq/seq | od -a
0000000   S   E   Q ack  em   o   r   g   .   a   p   a   c   h   e   .
0000020   h   a   d   o   o   p   .   i   o   .   T   e   x   t  em   o
0000040   r   g   .   a   p   a   c   h   e   .   h   a   d   o   o   p
0000060   .   i   o   .   T   e   x   t soh soh   '   o   r   g   .   a
0000100   p   a   c   h   e   .   h   a   d   o   o   p   .   i   o   .
0000120   c   o   m   p   r   e   s   s   .   G   z   i   p   C   o   d
0000140   e   c nul nul nul nul   =   4  ff   Y   F   s   V  so   4   "
0000160   R   +   X enq dle   T del del del del   =   4  ff   Y   F   s
0000200   V  so   4   "   R   +   X enq dle   T soh etb  us  vt  bs nul
2010-04-15 18:45:24,954 ERROR CliDriver (SessionState.java:printError(255)) - Failed with exception java.io.IOException:java.io.EOFException
java.io.IOException: java.io.EOFException
    at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:332)
    at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:120)
    at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:681)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:146)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:197)
    at org.apache.hadoop.hive.ql.QTestUtil.executeClient(QTestUtil.java:510)
    at org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_key_only(TestCliDriver.java:79)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at junit.framework.TestCase.runTest(TestCase.java:154)
    at junit.framework.TestCase.runBare(TestCase.java:127)
    at junit.framework.TestResult$1.protect(TestResult.java:106)
    at junit.framework.TestResult.runProtected
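For readers following the od -a dump above, its leading bytes line up with the SequenceFile header layout: the magic "SEQ", a version byte (ack = 6), the length-prefixed key and value class names (em = 25 characters each), two boolean flags (soh soh, meaning compressed and block-compressed), and the length-prefixed codec class name (' = 39 characters). The following is a minimal sketch of decoding those fields, not Hadoop's own reader. SeqHeaderSketch and its methods are hypothetical names, and it assumes every string length fits in a single byte, which holds for this dump.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Sketch: decode the leading fields of a version-6 SequenceFile header. */
class SeqHeaderSketch {
    final int version;
    final String keyClass;
    final String valueClass;
    final boolean compressed;
    final boolean blockCompressed;
    final String codecClass; // null when the file is not compressed

    private SeqHeaderSketch(ByteBuffer in) {
        // Magic bytes "SEQ" followed by a one-byte format version.
        if (in.get() != 'S' || in.get() != 'E' || in.get() != 'Q') {
            throw new IllegalArgumentException("not a SequenceFile header");
        }
        version = in.get();
        keyClass = readShortString(in);
        valueClass = readShortString(in);
        compressed = in.get() != 0;
        blockCompressed = in.get() != 0;
        codecClass = compressed ? readShortString(in) : null;
        // The metadata count and 16-byte sync marker follow; not parsed here.
    }

    static SeqHeaderSketch parse(byte[] header) {
        return new SeqHeaderSketch(ByteBuffer.wrap(header));
    }

    // Assumption: the length prefix is a single non-negative byte (0..127).
    static String readShortString(ByteBuffer in) {
        int len = in.get();
        if (len < 0) {
            throw new IllegalArgumentException("multi-byte length not handled in this sketch");
        }
        byte[] buf = new byte[len];
        in.get(buf);
        return new String(buf, StandardCharsets.UTF_8);
    }

    static void putShortString(ByteBuffer out, String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.put((byte) bytes.length).put(bytes);
    }

    /** Rebuilds the header bytes shown in the dump, up to the codec name. */
    static byte[] sampleHeader() {
        String text = "org.apache.hadoop.io.Text";
        String gzip = "org.apache.hadoop.io.compress.GzipCodec";
        ByteBuffer out = ByteBuffer.allocate(128);
        out.put(new byte[] {'S', 'E', 'Q', 6});
        putShortString(out, text);          // key class
        putShortString(out, text);          // value class
        out.put((byte) 1).put((byte) 1);    // compressed, block-compressed
        putShortString(out, gzip);          // codec class
        byte[] result = new byte[out.position()];
        out.flip();
        out.get(result);
        return result;
    }

    public static void main(String[] args) {
        SeqHeaderSketch h = parse(sampleHeader());
        System.out.println("version=" + h.version + " key=" + h.keyClass
                + " value=" + h.valueClass + " compressed=" + h.compressed
                + " block=" + h.blockCompressed + " codec=" + h.codecClass);
    }
}
```

Read this way, the file in question has Text keys and Text values and is block-compressed with GzipCodec, which matches the GzipCodec frames in the trace.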
-
Re: table from sequence file
Arvind Prabhakar 2010-04-16, 16:58
Sagar,
Unfortunately it is more complicated than that. The idea behind the record reader implementation is to convert the underlying writable into a type that is understood by the SerDe layer. At this time, the SerDe layer seems to understand BytesWritable and Text types. So if you could take your custom type and emit a BytesWritable that represents a struct implementation of the same, it would work.
Another alternative which would be simple to implement would be to do the following:
1. Modify your custom writable so that it has a toString() method that generates a parsable representation of the fields. For example you could use the JSON representation in your toString() method.
2. Create the external table with inputformat 'org.apache.hadoop.mapred.SequenceFileAsTextInputFormat' and outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat', mapping the entire value type to a single string column.
3. Use the UDFJson to extract the individual attributes from the JSON string that is emitted from the select query.
You can use this output to populate a new table that now has the parsed values separated out in the warehouse.
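Step 1 above might look roughly like the following. This is a minimal sketch, not the poster's actual class: ValueRecordSketch and the JSON key names are hypothetical, and a real version would also implement org.apache.hadoop.io.Writable (omitted here so the example compiles without Hadoop on the classpath).

```java
import java.util.Objects;

/** Sketch: a value type whose toString() emits a JSON object. */
class ValueRecordSketch {
    private final long valueField1;
    private final String valueField2;
    private final long valueField3;

    ValueRecordSketch(long valueField1, String valueField2, long valueField3) {
        this.valueField1 = valueField1;
        this.valueField2 = Objects.requireNonNull(valueField2);
        this.valueField3 = valueField3;
    }

    // Escapes quotes, backslashes and control characters so the output stays valid JSON.
    private static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '"' || c == '\\') {
                sb.append('\\').append(c);
            } else if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    @Override
    public String toString() {
        return "{\"value_field1\":" + valueField1
                + ",\"value_field2\":\"" + escape(valueField2) + "\""
                + ",\"value_field3\":" + valueField3 + "}";
    }

    public static void main(String[] args) {
        System.out.println(new ValueRecordSketch(42L, "example", 7L));
    }
}
```

With the whole value mapped to a single string column as in step 2, the JSON UDF from step 3 is exposed in HiveQL as get_json_object, so a path such as '$.value_field2' would pull out one attribute.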
Arvind
-
Re: table from sequence file
Arvind Prabhakar 2010-04-16, 17:13
The compression being used here, gzip, is not suitable for splitting of the input files. That could be the reason why you are seeing this exception. Can you try a different compression scheme such as bzip2, or perhaps not compressing the files at all?
Arvind
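As a side note on the trace itself: the innermost frames (GZIPInputStream.<init>, readHeader, readUByte) are where the JDK throws EOFException when a gzip stream ends before a complete header has been read. The sketch below reproduces that with an empty stream using only java.util.zip. The empty input is a stand-in chosen for illustration; the thread does not establish what bytes the codec was actually handed, and GzipHeaderSketch is a hypothetical name.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Sketch: GZIPInputStream reads the gzip header in its constructor. */
class GzipHeaderSketch {

    // Returns "ok" if the stream opens, otherwise the exception class name.
    static String tryOpen(byte[] data) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            return "ok";
        } catch (IOException e) {
            return e.getClass().getName();
        }
    }

    // Compresses raw bytes so there is a well-formed stream to compare against.
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buf)) {
            out.write(raw);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        // Empty input: the constructor fails while reading the header.
        System.out.println(tryOpen(new byte[0]));
        // Well-formed gzip data: the constructor succeeds.
        System.out.println(tryOpen(gzip("hello".getBytes())));
    }
}
```

The point is only that the failure happens while the stream is being opened, before any record is decoded, which is consistent with the exception surfacing from SequenceFile$Reader.init.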
-
Re: table from sequence file
Sagar Naik 2010-04-16, 18:04
Hi Arvind,
Thanks for the explanation.
I am a newbie, so I am not familiar with the terms. Is a struct implementation a POJO, or something else?
My guess is that a struct is a simple POJO. If so, a simple POJO represented in bytes would be passed in a BytesWritable. Should that work?
-Sagar
-
Re: table from sequence file
Arvind Prabhakar 2010-04-16, 19:05
I think it will be better to take a look at LazySimpleSerDe to see how it serializes and deserializes Struct types. Your implementation should be such that it works with this SerDe seamlessly.
More specifically, creating a simple POJO may not work due to inherent marshaling/encoding semantics that must be observed to conform to the BytesWritable contracts.
Arvind
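To make the encoding point concrete: with Hive's default delimiters, LazySimpleSerDe reads a row as text whose top-level fields are separated by the Ctrl-A character (byte 0x01), with \N standing for NULL. The following is a minimal sketch of producing such bytes for the three value fields. It assumes a table created with default delimiters and does no escaping of delimiter characters inside strings; DelimitedRowSketch is a hypothetical name, and wrapping the bytes in a BytesWritable or Text is omitted so the example runs without Hadoop.

```java
import java.nio.charset.StandardCharsets;

/** Sketch: encode one row in the delimited text form LazySimpleSerDe reads by default. */
class DelimitedRowSketch {
    // Hive's default field delimiter (Ctrl-A) and default NULL marker.
    static final char FIELD_DELIM = '\u0001';
    static final String NULL_MARKER = "\\N";

    static byte[] encodeRow(Long valueField1, String valueField2, Long valueField3) {
        String row = text(valueField1) + FIELD_DELIM
                + text(valueField2) + FIELD_DELIM
                + text(valueField3);
        return row.getBytes(StandardCharsets.UTF_8);
    }

    // Numbers are written as decimal text; nulls become the NULL marker.
    private static String text(Object value) {
        return value == null ? NULL_MARKER : value.toString();
    }

    public static void main(String[] args) {
        byte[] row = encodeRow(10L, "abc", 20L);
        // Print with the delimiter made visible.
        System.out.println(new String(row, StandardCharsets.UTF_8).replace(FIELD_DELIM, '|'));
    }
}
```

A custom RecordReader would hand bytes like these to the SerDe instead of a serialized Java object, which is why a plain POJO dumped to bytes is unlikely to be understood.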