Re: File formats in Hadoop: Sequence files vs AVRO vs RC vs ORC
Rahul Bhattacharjee 2013-09-30, 17:45
Sequence files are language-neutral, as Avro is. That said, I am not sure how
well libraries in other languages support processing sequence files.
On Mon, Sep 30, 2013 at 11:10 PM, Peyman Mohajerian <[EMAIL PROTECTED]> wrote:
> It is not recommended to keep data at rest in sequence file format,
> because it is Java-specific and you cannot easily share it with non-Java
> systems; it is ideal for running map/reduce jobs. One approach would
> be to bring all the data of different formats into HDFS as is and then
> convert it to a single format that works best for you, depending on
> whether you will export this data or not (in addition to many other
> considerations). But as already mentioned, Hive can directly read any of
> these formats.
> On Mon, Sep 30, 2013 at 1:08 AM, Raj K Singh <[EMAIL PROTECTED]> wrote:
>> For processing XML files, Hadoop comes with a class for this purpose
>> called StreamXmlRecordReader. You can use it by setting your input format
>> to StreamInputFormat and setting the
>> stream.recordreader.class property to
>> For JSON files, the open-source project Elephant Bird, which contains some
>> useful utilities for working with LZO compression, has an
>> LzoJsonInputFormat that can read JSON, but it requires the input
>> file to be LZOP-compressed. We’ll use this code as a template for our own JSON
>> InputFormat, which doesn’t have the LZOP compression requirement.
>> If you are dealing with small files, the sequence file format comes to the
>> rescue: it stores sequences of binary key-value pairs. Sequence files
>> are well suited as a format for MapReduce data since they are
>> splittable and support compression.
>> Raj K Singh
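To make the "sequence of binary key-value pairs" idea concrete, here is a minimal sketch in plain Java of packing many small files into one container, with the file name as key and the file bytes as value. This is only an illustration of the concept: the class name SmallFilePacker and the record layout are made up for this example and are not the Hadoop SequenceFile on-disk format. A real sequence file also carries a header and sync markers (which is what allows splitting) and can compress records or blocks.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only: NOT the Hadoop SequenceFile format.
final class SmallFilePacker {

    // Writes a record count, then one length-prefixed (name, content) record per file.
    static byte[] pack(Map<String, byte[]> files) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(files.size());
            for (Map.Entry<String, byte[]> entry : files.entrySet()) {
                out.writeUTF(entry.getKey());              // key: original file name
                out.writeInt(entry.getValue().length);     // value length in bytes
                out.write(entry.getValue());               // value: raw file content
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Reads the records back in the order they were written.
    static Map<String, byte[]> unpack(byte[] container) {
        Map<String, byte[]> files = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(container))) {
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                String name = in.readUTF();
                byte[] content = new byte[in.readInt()];
                in.readFully(content);
                files.put(name, content);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return files;
    }

    public static void main(String[] args) {
        Map<String, byte[]> files = new LinkedHashMap<>();
        files.put("a.xml", "<doc id=\"1\"/>".getBytes(StandardCharsets.UTF_8));
        files.put("b.xml", "<doc id=\"2\"/>".getBytes(StandardCharsets.UTF_8));
        byte[] container = pack(files);
        Map<String, byte[]> restored = unpack(container);
        System.out.println(restored.size() + " records in " + container.length + " bytes");
    }
}
```

In an actual job you would presumably use Hadoop's own SequenceFile writer and reader classes rather than a hand-rolled layout like this one; the sketch only shows why one container file can stand in for many small ones.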
>> On Mon, Sep 30, 2013 at 1:10 PM, Wolfgang Wyremba <[EMAIL PROTECTED]> wrote:
>>> The file format topic is still confusing me, and I would appreciate it if you
>>> could share your thoughts and experience with me.
>>> From reading different books/articles/websites I understand that
>>> - Sequence files (used frequently but not only for binary data),
>>> - AVRO,
>>> - RC (developed to work best with Hive - columnar storage) and
>>> - ORC (a successor of RC to give Hive another performance boost - Stinger)
>>> are all container file formats to solve the "small files problem" and all
>>> support compression and splitting.
>>> Additionally, each file format was developed with specific
>>> in mind.
>>> Imagine I have the following text source data
>>> - 1 TB of XML documents (some millions of small files)
>>> - 1 TB of JSON documents (some hundred thousands of medium sized files)
>>> - 1 TB of Apache log files (some thousands of bigger files)
>>> How should I store this data in HDFS to process it using Java MapReduce,
>>> Pig and Hive?
>>> I want to use the best tool for my specific problem - with "best"
>>> performance of course - i.e. maybe one problem on the Apache log data can be
>>> best solved using Java MapReduce, another one using Hive or Pig.
>>> Should I simply put the data into HDFS as it comes from the source - i.e. as
>>> plain text files?
>>> Or should I convert all my data to a container file format like sequence
>>> files, AVRO, RC or ORC?
>>> Based on this example, I believe
>>> - the XML documents will need to be converted to a container file
>>> to overcome the "small files problem".
>>> - the JSON documents could/should not be affected by the "small files
>>> problem".
>>> - the Apache files should definitely not be affected by the "small files
>>> problem", so they could be stored as plain text files.
>>> So, some source data needs to be converted to a container file format,
>>> others not necessarily.
>>> But what is really advisable?
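A rough back-of-the-envelope sketch of why the XML case in particular runs into the "small files problem" is below. The numbers are assumptions for illustration, not from the thread: 5 million files for "some millions", a 128 MB HDFS block size, packing into 1 GB container files, and the often-quoted rule of thumb of roughly 150 bytes of NameNode memory per file or block object. The class name SmallFilesEstimate is hypothetical.

```java
// Hypothetical estimate; all concrete numbers below are assumptions.
final class SmallFilesEstimate {
    static final long MB = 1024L * 1024L;
    static final long TB = 1024L * 1024L * MB;
    // Assumption: rule-of-thumb NameNode memory cost per file or block object.
    static final long BYTES_PER_OBJECT = 150L;

    // Blocks needed for one file of the given size (ceiling division).
    static long blocksFor(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    // One NameNode object per file plus one per block of that file.
    static long namenodeObjects(long fileCount, long bytesPerFile, long blockBytes) {
        return fileCount * (1 + blocksFor(bytesPerFile, blockBytes));
    }

    public static void main(String[] args) {
        long blockBytes = 128 * MB;      // assumed HDFS block size
        long smallFiles = 5_000_000L;    // assumed count for "some millions" of XML files
        long containerBytes = 1024 * MB; // assumed size of each packed container file

        // 1 TB stored as is: every small file occupies its own block.
        long asIs = namenodeObjects(smallFiles, TB / smallFiles, blockBytes);
        // The same 1 TB packed into 1 GB containers.
        long packed = namenodeObjects(TB / containerBytes, containerBytes, blockBytes);

        System.out.println("objects as is:  " + asIs
                + " (~" + asIs * BYTES_PER_OBJECT / MB + " MB of NameNode heap)");
        System.out.println("objects packed: " + packed
                + " (~" + packed * BYTES_PER_OBJECT / MB + " MB of NameNode heap)");
    }
}
```

Under these assumptions the as-is layout needs about a thousand times more NameNode objects than the packed one, and a file-per-split input format would also launch one map task per small file. The same arithmetic with a few thousand large log files gives a small object count either way, which fits the intuition that the Apache logs can stay as plain text.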