Re: Pig write to single file
What I'm doing is at the end of each day I dedupe all my log files and store them in LZO format in an archive directory. I thought that since LZO is splittable and Hadoop likes larger files, this would be best. Is this not the case?

And to answer your question, there seem to be two files, around 800 MB in size.
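
A minimal sketch of what that nightly job could look like, assuming the hadoop-lzo package (which provides com.hadoop.compression.lzo.LzopCodec) is deployed on the cluster; the property names are the mapred.* ones current for Hadoop 1.x:

-- enable LZO-compressed output for this script
-- (assumes hadoop-lzo is installed on every node)
SET mapred.output.compress 'true';
SET mapred.output.compression.codec 'com.hadoop.compression.lzo.LzopCodec';

logs   = LOAD '$input';
unique = DISTINCT logs;
STORE unique INTO '$archive_dir';

One caveat on the splittability point above: .lzo files are only splittable once they have been indexed, e.g. with hadoop-lzo's DistributedLzoIndexer, so an indexing pass belongs in the nightly job as well.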

On May 1, 2013, at 10:17 AM, Mike Sukmanowsky <[EMAIL PROTECTED]> wrote:

> How many output files are you getting?  You can use SET DEFAULT_PARALLEL 1;
> so you don't have to specify parallelism on each reduce phase.
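
A minimal sketch of that suggestion, reusing the aliases from Mark's script below:

SET DEFAULT_PARALLEL 1;    -- one reducer for every reduce phase in the script

rows   = LOAD '$input';
unique = DISTINCT rows;    -- no per-statement PARALLEL clause needed now
STORE unique INTO '$output';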
>
> In general though, I wouldn't recommend forcing your output into one file
> (parallelism is good).  Just write a shell/python/ruby/perl script that
> concatenates the part files after the full job finishes.
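
Assuming the output stays on HDFS, that concatenation step may not even need a custom script: hadoop fs -getmerge '$output' deduped.txt pulls every part file under '$output' into a single local file (note that getmerge writes to the local filesystem, not back to HDFS).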
>
>
> On Wed, May 1, 2013 at 12:51 PM, Mark <[EMAIL PROTECTED]> wrote:
>
>> Thought I understood how to output to a single file, but it doesn't seem to
>> be working. Anything I'm missing here?
>>
>>
>> -- Dedupe and store
>>
>> rows   = LOAD '$input';
>> unique = DISTINCT rows PARALLEL 1;
>>
>> STORE unique INTO '$output';
>>
>>
>>
>
>
> --
> Mike Sukmanowsky
>
> Product Lead, http://parse.ly
> 989 Avenue of the Americas, 3rd Floor
> New York, NY  10018
> p: +1 (416) 953-4248
> e: [EMAIL PROTECTED]