I have two processes: one writes sequence files directly to HDFS, and the other is a Hive table that reads these files.
All works well with the exception that I am only flushing the files periodically. SequenceFile input format gets angry when it encounters 0-bytes seq files.
I was considering a flush and sync on the first record write. I was also thinking I should be able to hack the SequenceFile input format to skip 0-byte files and not throw an exception on readFully(), which it sometimes does.
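A reader-side guard along these lines could be sketched as follows. This is only an illustration on the local filesystem using plain java.nio, not Hadoop's reader code: the class and method names (SeqHeaderProbe, hasSequenceFileMagic) are hypothetical, and the only format detail it relies on is that a sequence file starts with the bytes 'S', 'E', 'Q' followed by a version byte.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: checks whether a file is long enough to hold the
// start of a sequence file header and begins with the "SEQ" magic bytes.
final class SeqHeaderProbe {

    static boolean hasSequenceFileMagic(Path file) throws IOException {
        byte[] head = new byte[4]; // 3 magic bytes + 1 version byte
        try (DataInputStream in = new DataInputStream(Files.newInputStream(file))) {
            // readFully throws EOFException when fewer than 4 bytes exist,
            // which is the case for a 0-byte file.
            in.readFully(head);
        } catch (EOFException tooShort) {
            return false; // skip the file instead of failing the whole read
        }
        return head[0] == 'S' && head[1] == 'E' && head[2] == 'Q';
    }
}
```

A caller would skip any file for which this returns false rather than handing it to the record reader. It does not help with the later EOFException in the middle of a record, which comes from a partially written value.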
I looked at the source out of curiosity: in the latest version (2.4), the header is flushed during writer creation. Of course, the key/value classes are provided. By 0 bytes, do you really mean without even the header, or 0 bytes of payload? On Tue, Jul 22, 2014 at 11:05 PM, Bertrand Dechoux <[EMAIL PROTECTED]> wrote:
Sometimes even though they show as 0 bytes you can read data from them. Sometimes it blows up with a stack trace I have lost. On Tue, Jul 22, 2014 at 5:45 PM, Bertrand Dechoux <[EMAIL PROTECTED]> wrote:
Caused by: java.io.EOFException
    at java.io.DataInputStream.readByte(DataInputStream.java:267)
    at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
    at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:329)
    at org.apache.hadoop.io.SequenceFile$Reader.readBuffer(SequenceFile.java:2072)
    at org.apache.hadoop.io.SequenceFile$Reader.seekToCurrentValue(SequenceFile.java:2139)
    at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:2214)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:109)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:84)
    at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:274)
    ... 15 more
On Tue, Jul 22, 2014 at 6:14 PM, Edward Capriolo <[EMAIL PROTECTED]> wrote:
Anyway, a solution (seen in Flume, if I remember correctly) is to have a good file naming strategy. For example, all new files should end in ".open", and the suffix is removed only when they are finished. Then, for processing, you only target the latter.
I am not sure this will help. The sequence file reader will still try to open the file regardless of its name.
For Hive, you might need to adapt the strategy a bit, because Hive may not be able to target only files with a specific name (you are the expert). Simply moving the file from a temporary directory to the table directory would have the same effect, because from the point of view of HDFS it is the same operation: a metadata change only.
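The write-then-rename idea can be sketched as below. This is a local-filesystem stand-in using java.nio, not HDFS code; on HDFS the equivalent step would presumably be a FileSystem.rename call. The class name, the method name and the ".open" constant are illustrative assumptions.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: data is written under an in-progress name and only
// becomes visible under its final name once it is complete.
final class OpenSuffixWriter {

    static final String OPEN_SUFFIX = ".open";

    static Path writeThenPublish(Path dir, String finalName, byte[] payload) throws IOException {
        Path inProgress = dir.resolve(finalName + OPEN_SUFFIX);
        Path published = dir.resolve(finalName);

        // Readers that ignore "*.open" never see a half-written file.
        Files.write(inProgress, payload);

        // Rename within the same directory; requested as an atomic move so
        // the final name appears all at once.
        return Files.move(inProgress, published, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

The trade-off is that the data stays invisible to readers until the rename happens.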
I would like to consider the files as soon as there is reasonable data in them. If I have to rename/move files, I will not be able to see the data until it is moved in or renamed. (I am building files for N minutes before closing them.) The problem only happens with 0-byte files; files that are currently being written work fine.
It seems like the split calculation could throw away 0-byte files before we ever get down to the record reader and parsing the header. An interesting thing is that even though dfs -ls shows the files as 0 bytes, sometimes I can dfs -text these 0-byte files and they actually have data! Sometimes when I dfs -text them I get the exception attached!
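Dropping 0-byte files before they reach the reader could look roughly like this. It is a sketch against the local filesystem with java.nio, not Hive's or Hadoop's split code (there, the listed length would presumably come from FileStatus.getLen()); the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: lists the files in a directory that are worth
// handing to a reader, ignoring anything whose listed size is zero.
final class NonEmptyFileFilter {

    static List<Path> readableInputs(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                    .filter(Files::isRegularFile)
                    // File.length() returns 0 for empty (or vanished) files.
                    .filter(p -> p.toFile().length() > 0)
                    .sorted()
                    .collect(Collectors.toList());
        }
    }
}
```

Note that a filter on the listed size would also skip the files described above that show as 0 bytes yet still return data, so it trades freshness for robustness.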
So it is interesting that the semantics here are not obvious. Can we map reduce a file being written? How does it work etc? It would be nice to understand the semantics here. On Wed, Jul 23, 2014 at 2:00 AM, Bertrand Dechoux <[EMAIL PROTECTED]> wrote:
I don't know if anything is tracked on Hadoop/MapReduce side.
Bertrand Dechoux On Wed, Jul 23, 2014 at 5:15 PM, Edward Capriolo <[EMAIL PROTECTED]> wrote: