Re: Can I write to a compressed file which is located in HDFS?
Hi
    AFAIK it is not possible to append to a compressed file.

If you have files in an HDFS directory that you need to compress (say, the files for an hour), you can use MapReduce to do that by setting mapred.output.compress=true and mapred.output.compression.codec to the codec you prefer.
You'd get the compressed blocks in the output dir.
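
A minimal sketch of such a job, assuming an identity map-only pass over text logs; the class names, paths, and gzip codec are placeholders, and the FileOutputFormat calls below just set the two properties above programmatically:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressDir {

  // Pass-through mapper: drop the byte-offset key, keep each log line unchanged.
  public static class LineMapper
      extends Mapper<LongWritable, Text, NullWritable, Text> {
    protected void map(LongWritable key, Text value, Context ctx)
        throws java.io.IOException, InterruptedException {
      ctx.write(NullWritable.get(), value);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "compress-logs");
    job.setJarByClass(CompressDir.class);
    job.setMapperClass(LineMapper.class);
    job.setNumReduceTasks(0);                                 // map-only: just rewrite the files
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));     // uncompressed input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1]));   // compressed output dir
    // Same effect as mapred.output.compress=true and
    // mapred.output.compression.codec=<the codec you prefer>:
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}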

You can use the API to read from standard input and write compressed output to HDFS, like:
-get the Hadoop conf
-get the required compression codec
-write to a CompressionOutputStream.
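
A minimal sketch of those three steps (the gzip codec and the destination-path argument are placeholders; any installed codec such as LZO or Snappy could be swapped in):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class StdinToCompressedHdfs {
  public static void main(String[] args) throws Exception {
    // 1. get the Hadoop conf (and from it the HDFS FileSystem)
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // 2. get the compression codec (gzip here as an example)
    CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
    // 3. wrap the HDFS output stream in a CompressionOutputStream
    Path out = new Path(args[0] + codec.getDefaultExtension());
    CompressionOutputStream cos = codec.createOutputStream(fs.create(out));
    // stream stdin into the compressed HDFS file
    IOUtils.copyBytes(System.in, cos, 4096, false);
    cos.finish();
    cos.close();
  }
}

Note that this creates a new compressed file each time it runs; as mentioned above, you can't reopen that file later and append to it.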

You should get a well-detailed explanation of this in the book 'Hadoop: The Definitive Guide' by Tom White.

Regards
Bejoy K S

From handheld, Please excuse typos.

-----Original Message-----
From: Xiaobin She <[EMAIL PROTECTED]>
Date: Tue, 7 Feb 2012 14:24:01
To: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>; David Sinclair<[EMAIL PROTECTED]>
Subject: Re: Can I write to a compressed file which is located in HDFS?

hi Bejoy and David,

thank you for your help.

So I can't directly write or append logs to a compressed file in
hdfs, right?

Can I compress a file which is already in hdfs and has not been compressed?

If I can , how can I do that?

Thanks!

2012/2/6 <[EMAIL PROTECTED]>

> Hi
>       I agree with David on that point; you can achieve step 1 of my
> previous response with Flume, i.e. load the real-time inflow of data in
> compressed format into hdfs. You can specify a time interval or data size
> in the Flume collector that determines when to flush data on to hdfs.
>
> Regards
> Bejoy K S
>
> From handheld, Please excuse typos.
>
> -----Original Message-----
> From: David Sinclair <[EMAIL PROTECTED]>
> Date: Mon, 6 Feb 2012 09:06:00
> To: <[EMAIL PROTECTED]>
> Cc: <[EMAIL PROTECTED]>
> Subject: Re: Can I write to a compressed file which is located in HDFS?
>
> Hi,
>
> You may want to have a look at the Flume project from Cloudera. I use it
> for writing data into HDFS.
>
> https://ccp.cloudera.com/display/SUPPORT/Downloads
>
> dave
>
> 2012/2/6 Xiaobin She <[EMAIL PROTECTED]>
>
> > hi Bejoy ,
> >
> > thank you for your reply.
> >
> > actually I have set up a test cluster which has one namenode/jobtracker
> > and two datanode/tasktracker nodes, and I have run a test on this cluster.
> >
> > I fetch the log file of one of our modules from the log collector machines
> > by rsync, and then I use the hive command line tool to load this log file
> > into the hive warehouse, which simply copies the file from the local
> > filesystem to hdfs.
> >
> > And I have run some analysis on these data with hive; all of this runs well.
> >
> > But now I want to avoid the fetch step which uses rsync, and write the
> > logs into hdfs files directly from the servers which generate these logs.
> >
> > And it seems easy to do this if the file located in hdfs is not
> > compressed.
> >
> > But how do I write or append logs to a file that is compressed and located
> > in hdfs?
> >
> > Is this possible?
> >
> > Or is this a bad practice?
> >
> > Thanks!
> >
> >
> >
> > 2012/2/6 <[EMAIL PROTECTED]>
> >
> > > Hi
> > >     If your log files amount to at least one block size per hour, you can
> > > go ahead as follows:
> > > - run a scheduled job every hour that compresses the log files for that
> > > hour and stores them on to hdfs (you can use LZO or even Snappy to compress)
> > > - if your hive does more frequent analysis on this data, store it as
> > > PARTITIONED BY (Date, Hour). While loading into hdfs also follow a
> > > directory / sub-dir structure. Once data is in hdfs, issue an ALTER TABLE
> > > ADD PARTITION statement on the corresponding hive table (a rough sketch of
> > > this step follows after the quoted thread)
> > > - in the Hive DDL use the appropriate input format (Hive has an ApacheLog
> > > input format already)
> > >
> > >
> > > Regards
> > > Bejoy K S
> > >
> > > From handheld, Please excuse typos.
> > >
> > > -----Original Message-----
> > > From: Xiaobin She <[EMAIL PROTECTED]>
> > > Date: Mon, 6 Feb 2012 16:41:50
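
A rough sketch of the ALTER TABLE ADD PARTITION step mentioned in the quoted advice above, assuming a table named logs partitioned by (dt STRING, hr STRING), a matching dt=/hr= directory layout, and a HiveServer2 JDBC endpoint; all of these names and the URL are placeholders, not from the thread:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class AddHourlyPartition {
  public static void main(String[] args) throws Exception {
    String dt = args[0];   // e.g. "2012-02-06"
    String hr = args[1];   // e.g. "16"
    // directory the hourly compression job wrote its output to
    String dir = "/user/hive/warehouse/logs/dt=" + dt + "/hr=" + hr;

    Class.forName("org.apache.hive.jdbc.HiveDriver");
    Connection conn =
        DriverManager.getConnection("jdbc:hive2://localhost:10000/default");
    Statement stmt = conn.createStatement();
    // Assumes a table created roughly as:
    //   CREATE TABLE logs (...) PARTITIONED BY (dt STRING, hr STRING) ...
    stmt.execute("ALTER TABLE logs ADD IF NOT EXISTS PARTITION "
        + "(dt='" + dt + "', hr='" + hr + "') LOCATION '" + dir + "'");
    stmt.close();
    conn.close();
  }
}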