Re: Can I write to an compressed file which is located in hdfs?
bejoy.hadoop@... 2012-02-07, 07:53
Hi
AFAIK it is not possible to append to a compressed file.

If you have files in a dir on hdfs and you need to compress them (like the files for an hour), you can use MapReduce to do that by setting
mapred.output.compress = true and
mapred.output.compression.codec = 'theCodecYouPrefer'
You'd get the blocks compressed in the output dir.

You can use the API to read from standard input, like:
- get the hadoop conf
- register the required compression codec
- write to a CompressionOutputStream

You should get a detailed explanation of this in the book 'Hadoop: The Definitive Guide' by Tom White.

Regards
Bejoy K S

From handheld, Please excuse typos.

-----Original Message-----
From: Xiaobin She <[EMAIL PROTECTED]>
Date: Tue, 7 Feb 2012 14:24:01
To: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>; David Sinclair<[EMAIL PROTECTED]>
Subject: Re: Can I write to an compressed file which is located in hdfs?

hi Bejoy and David,

thank you for your help.

So I can't directly write logs or append logs to a compressed file in hdfs, right?

Can I compress a file which is already in hdfs and has not been compressed?

If I can, how can I do that?

Thanks!

2012/2/6 <[EMAIL PROTECTED]>

> Hi
> I agree with David on that point. You can achieve step 1 of my
> previous response with flume, i.e. load the real-time inflow of data in
> compressed format into hdfs. You can specify a time interval or data size
> in the flume collector that determines when to flush data on to hdfs.
>
> Regards
> Bejoy K S
>
> From handheld, Please excuse typos.
>
> -----Original Message-----
> From: David Sinclair <[EMAIL PROTECTED]>
> Date: Mon, 6 Feb 2012 09:06:00
> To: <[EMAIL PROTECTED]>
> Cc: <[EMAIL PROTECTED]>
> Subject: Re: Can I write to an compressed file which is located in hdfs?
>
> Hi,
>
> You may want to have a look at the Flume project from Cloudera. I use it
> for writing data into HDFS.
>
> https://ccp.cloudera.com/display/SUPPORT/Downloads
>
> dave
>
> 2012/2/6 Xiaobin She <[EMAIL PROTECTED]>
>
> > hi Bejoy,
> >
> > thank you for your reply.
> >
> > actually I have set up a test cluster which has one namenode/jobtracker
> > and two datanode/tasktracker, and I have run a test on this cluster.
> >
> > I fetch the log file of one of our modules from the log collector
> > machines by rsync, and then I use the hive command line tool to load
> > this log file into the hive warehouse, which simply copies the file
> > from the local filesystem to hdfs.
> >
> > And I have run some analysis on this data with hive, and all of this
> > runs well.
> >
> > But now I want to avoid the fetch step which uses rsync, and write the
> > logs into hdfs files directly from the servers which generate these logs.
> >
> > And it seems easy to do this job if the file located in hdfs is not
> > compressed.
> >
> > But how do I write or append logs to a file that is compressed and
> > located in hdfs?
> >
> > Is this possible?
> >
> > Or is this a bad practice?
> >
> > Thanks!
> >
> > 2012/2/6 <[EMAIL PROTECTED]>
> >
> > > Hi
> > > If you have enough log files to fill at least one block size in an
> > > hour, you can go ahead as follows:
> > > - run a scheduled job every hour that compresses the log files for that
> > > hour and stores them on to hdfs (can use LZO or even Snappy to compress)
> > > - if your hive does more frequent analysis on this data, store it as
> > > PARTITIONED BY (Date,Hour). While loading into hdfs also follow a
> > > directory - sub dir structure. Once data is in hdfs, issue an Alter Table
> > > Add Partition statement on the corresponding hive table.
> > > - in the Hive DDL use the appropriate Input format (Hive has some
> > > ApacheLog Input Format already)
> > >
> > > Regards
> > > Bejoy K S
> > >
> > > From handheld, Please excuse typos.
> > >
> > > -----Original Message-----
> > > From: Xiaobin She <[EMAIL PROTECTED]>
> > > Date: Mon, 6 Feb 2012 16:41:50
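
For the API route mentioned in the top reply (read from standard input, write to a compression stream), here is a minimal sketch of the pattern. It is not Hadoop code: it swaps in the JDK's GZIPOutputStream as a stand-in so that it runs without a cluster. The class name StreamCompressSketch and the helpers compress/decompress are made up for illustration. On a real cluster the wrapped stream would presumably come from a Hadoop CompressionCodec instead, as noted in the comments.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical class name; a plain-JDK stand-in for the Hadoop codec path.
public class StreamCompressSketch {

    // Copies everything from `in` into a gzip stream wrapped around `rawOut`
    // and returns the number of uncompressed bytes read.
    // Assumption: with Hadoop, the GZIPOutputStream below would be replaced by
    // codec.createOutputStream(fs.create(path)), which returns a
    // CompressionOutputStream. That path is not exercised in this sketch.
    static long compress(InputStream in, OutputStream rawOut) throws IOException {
        long total = 0;
        byte[] buffer = new byte[4096];
        // Closing the gzip stream writes the trailer and also closes rawOut.
        try (GZIPOutputStream gz = new GZIPOutputStream(rawOut)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                gz.write(buffer, 0, n);
                total += n;
            }
        }
        return total;
    }

    // Inflates a gzip byte array; only here so the round trip can be checked.
    static byte[] decompress(byte[] compressed) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try (GZIPInputStream gz =
                 new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            int n;
            while ((n = gz.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java StreamCompressSketch <output.gz> < input");
            return;
        }
        // Stream standard input into a newly created compressed file.
        try (OutputStream file = new FileOutputStream(args[0])) {
            long n = compress(System.in, file);
            System.err.println("compressed " + n + " input bytes into " + args[0]);
        }
    }
}
```

Note that the sketch always creates a new compressed file from a stream. It does not append to an existing one, which matches the advice above to compress per time window (e.g. per hour) rather than appending.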