-
using the key from a SequenceFile
Ruben de Vries 2012-04-19, 10:07

I'm trying to migrate part of our current Hadoop jobs from plain MapReduce jobs to Hive.

Previously the data was stored in SequenceFiles whose keys contain valuable data. However, if I load the data into a table I lose that key data (or at least I can't access it from Hive). I want to somehow use the key from the sequence file in Hive.

I know this has come up before, since I can find hints of people needing it, but I can't find a working solution, and since I'm not very good with Java I can't get it done myself :(

Does anyone have a snippet of something like this working? I get errors like:

    ../hive/mapred/CustomSeqRecordReader.java:14: cannot find symbol
    [javac] symbol  : constructor SequenceFileRecordReader()
    [javac] location: class org.apache.hadoop.mapred.SequenceFileRecordReader<K,V>
    [javac] public class CustomSeqRecordReader<K, V> extends SequenceFileRecordReader<K, V> implements RecordReader<K, V> {

I hope someone has a snippet or can help me out; I would really love to be able to switch part of our jobs to Hive.

Ruben de Vries
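
A note on the compile error above: javac is looking for a no-argument constructor SequenceFileRecordReader() because a subclass that does not explicitly call a parent constructor gets an implicit call to super(), and the parent class apparently has no such constructor. The sketch below reproduces the situation with stand-in classes only. BaseReader and CustomReader are hypothetical names, and the two String parameters merely stand in for whatever arguments the real parent constructor takes (in the old mapred API that is, as far as I know, a Configuration and a FileSplit, but check the Javadoc for your Hadoop version).

```java
public class ConstructorDemo {
    // Stand-in for a library class that only offers a constructor with arguments.
    static class BaseReader {
        protected final String conf;
        protected final String split;

        BaseReader(String conf, String split) {
            this.conf = conf;
            this.split = split;
        }
    }

    // Writing "class CustomReader extends BaseReader {}" without a constructor
    // would fail to compile, because the implicit constructor tries to call
    // BaseReader(), which does not exist.
    static class CustomReader extends BaseReader {
        CustomReader(String conf, String split) {
            super(conf, split); // forward the arguments the parent requires
        }

        String describe() {
            return conf + ":" + split;
        }
    }

    public static void main(String[] args) {
        CustomReader reader = new CustomReader("job-conf", "split-0");
        System.out.println(reader.describe());
    }
}
```

The same pattern should apply to a custom record reader: declare a constructor that accepts the parent's arguments and pass them on with super(...).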
-
Re: using the key from a SequenceFile
madhu phatak 2012-04-19, 10:15

A SerDe will let you create custom data from your sequence file:
https://cwiki.apache.org/confluence/display/Hive/SerDe

--
https://github.com/zinnia-phatak-dev/Nectar
-
RE: using the key from a SequenceFile
Ruben de Vries 2012-04-19, 11:13

AFAIK a SerDe only serializes/deserializes the value part of the sequence file :( ?
-
Re: using the key from a SequenceFile
madhu phatak 2012-04-19, 11:26

http://grokbase.com/p/hive/user/111gqvs0g0/%E2%80%8Fsequence-file-custom-serdes-question

According to this thread, Hive ignores the key part. You may have to write a custom InputFormat that combines both the key and the value.

--
https://github.com/zinnia-phatak-dev/Nectar
-
Re: using the key from a SequenceFile
David Kulp 2012-04-19, 12:12

I'm trying to achieve something very similar. I want to write an MR program that writes its results to a record-based SequenceFile that is directly readable from Hive, as though it had been created using "STORED AS SEQUENCEFILE" with, say, BinarySortableSerDe.

From this discussion it seems that Hive does not (or cannot) take advantage of the key/value structure of a SequenceFile, but instead requires a value that is serialized using a SerDe. Is that right?

If so, does that mean the right approach is to use the BinarySortableSerDe to pass the collector a row's worth of data as the Writable value? And would Hive "just work" on such data?

If SequenceFileOutputFormat is used, will it automatically place sync markers in the file to allow for file splitting?

Thanks!

(PS. As an aside, Avro would be better. Wouldn't it be a huge win for MapReduce to have an AvroOutputFileFormat and for Hive to have a SerDe that reads such files? It seems like there's a natural correspondence between the richer data representations of an SQL schema and an Avro schema, and there's already code for working with Avro in MR as input.)
-
RE: using the key from a SequenceFile
Ruben de Vries 2012-04-19, 12:21

Hive can handle a sequence file just like a text file; it simply omits the key completely and uses only the value part. Other than that, you won't notice the difference between a sequence file and a plain text file.
-
Re: using the key from a SequenceFile
David Kulp 2012-04-19, 12:52

But I'm not clear on how to write a single row of multiple values in my MR program, since my only way to output data is to send values to the collector. Are you saying that there's no row delimiter and I simply make repeated calls to the collector, e.g.

    output.collect(null, row1col1);
    output.collect(null, row1col2);
    ...
    output.collect(null, row2col1);
    output.collect(null, row2col2);

If that's the case, then there's no explicit row boundary in the data, which also implies that there's no reliable way to split such a file later when Hive runs an MR job over it.

Or is it along the lines of the following?

    ArrayList<Object> row = new ArrayList<Object>();
    row.add(row1col1);
    row.add(row1col2);
    output.collect(null, row);

Thanks in advance!
-
Re: using the key from a SequenceFile
Owen O'Malley 2012-04-19, 13:09

On Thu, Apr 19, 2012 at 3:07 AM, Ruben de Vries <[EMAIL PROTECTED]> wrote:
> Previously the data was stored in sequencefiles with the keys containing
> valuable data!

I think you'll want to define your table using a custom InputFormat that creates a virtual row based on both the key and the value, and then use the 'STORED AS INPUTFORMAT ...' clause.

-- Owen
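
The "virtual row" idea can be sketched without any Hadoop classes. The example below is only an illustration under stated assumptions: VirtualRowIterator is a hypothetical name, a plain Iterator over Map.Entry stands in for a record reader, and String stands in for the Writable key and value types. A real implementation would presumably do the same concatenation inside the next(key, value) method of a custom record reader returned by the custom InputFormat.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class VirtualRowDemo {
    // Wraps an iterator over (key, value) records and yields one combined
    // row per record, so the key is no longer dropped by the consumer.
    static class VirtualRowIterator implements Iterator<String> {
        private final Iterator<Map.Entry<String, String>> records;
        private final String delimiter;

        VirtualRowIterator(Iterator<Map.Entry<String, String>> records, String delimiter) {
            this.records = records;
            this.delimiter = delimiter;
        }

        @Override
        public boolean hasNext() {
            return records.hasNext();
        }

        @Override
        public String next() {
            Map.Entry<String, String> record = records.next();
            // The key becomes the first column of the virtual row.
            return record.getKey() + delimiter + record.getValue();
        }
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> data = new ArrayList<>();
        data.add(new AbstractMap.SimpleEntry<>("user42", "login"));
        data.add(new AbstractMap.SimpleEntry<>("user43", "logout"));

        Iterator<String> rows = new VirtualRowIterator(data.iterator(), "\t");
        while (rows.hasNext()) {
            System.out.println(rows.next());
        }
    }
}
```

The delimiter is passed in so that it can match whatever field separator the table's row format expects.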
-
Re: using the key from a SequenceFile
Dilip Joseph 2012-04-19, 15:46

An example input format for using SequenceFile keys in Hive is at https://gist.github.com/2421795. The code just reverses how the key and value are accessed in the standard SequenceFileRecordReader and SequenceFileInputFormat that come with Hadoop.

You can use this custom input format by specifying the following when you create the table:

    STORED AS INPUTFORMAT 'com.mycompany.SequenceFileKeyInputFormat'

Dilip

--
Dilip Antony Joseph
http://csgrad.blogspot.com
http://www.marydilip.info
-
RE: using the key from a SequenceFile
Ruben de Vries 2012-04-19, 15:49

You're a lifesaver!
-
Re: using the key from a SequenceFile
David Kulp 2012-04-19, 18:12

To answer my own question -- so that someone else may benefit some day -- I've found that there is nothing special about key or value formats in a SequenceFile. As has been noted, keys are ignored. Each new key/value pair is seen as a new row from Hive's perspective. There's no concept of using Writables, such as ArrayWritable, to create nested structures in a value field that are automatically parsed by Hive. There are no record delimiters known to SequenceFile. There's just an ignored key and a value that is just a byte stream.

Thus, the simplest approach is to use the Lazy SerDe format to create a multi-column row in an MR program that will be read by Hive. For example, your MR program would set the output format to SequenceFile and the value class to Text:

    conf.setOutputFormat(SequenceFileOutputFormat.class);
    conf.setOutputValueClass(Text.class);

The reducer (or the mapper, if there is no reducer) would send values to the collector with Control-A delimiters between column values. There are no special formats for numbers, for example, in this approach. For example:

    output.collect(dummy, col1 + "\001" + col2);

In Hive, create your table with "STORED AS SEQUENCEFILE" and you should be golden.

You can presumably use one of the alternative serializers in your MR program, but I haven't tried it yet.

-d
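
The Control-A row format described above can be sketched in plain Java with no Hadoop dependency. HiveRowDemo, joinColumns and splitColumns are hypothetical helper names used for illustration only; in an MR job the joined string would be wrapped in a Text and passed to output.collect as the value. The sketch assumes non-null column values that do not themselves contain the delimiter.

```java
public class HiveRowDemo {
    // Control-A (byte 0x01), the field delimiter used in the message above.
    static final String CTRL_A = "\001";

    // Joins column values into a single row string, using the plain text
    // form of each value; numbers get no special encoding.
    static String joinColumns(Object... columns) {
        StringBuilder row = new StringBuilder();
        for (int i = 0; i < columns.length; i++) {
            if (i > 0) {
                row.append(CTRL_A);
            }
            row.append(columns[i]);
        }
        return row.toString();
    }

    // Splits a row back into its columns; the -1 limit keeps trailing
    // empty columns instead of silently dropping them.
    static String[] splitColumns(String row) {
        return row.split(CTRL_A, -1);
    }

    public static void main(String[] args) {
        String row = joinColumns("alice", 42, 3.5);
        for (String column : splitColumns(row)) {
            System.out.println(column);
        }
    }
}
```

Keeping the join in one helper makes it easier to add escaping or NULL handling later if the data turns out to need it.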