Re: How to use CombineFileInputFormat in Pig
What load function are you using? If it implements any of the
interfaces listed here, Pig turns off split combination:
http://pig.apache.org/docs/r0.9.1/perf.html#combine-files

-Thejas
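
For reference, a minimal sketch of setting the two properties discussed in this thread explicitly from a Pig script. The input path, the alias names, and the 64 MB value are assumptions for illustration only. PigStorage is used because the documentation linked above states that split combination works with it.

```pig
-- Split combination defaults to true; setting it here only makes that explicit.
set pig.splitCombination true;
-- Assumed value: 64 MB expressed in bytes. Adjust to your own block size.
set pig.maxCombinedSplitSize 67108864;

-- Hypothetical input directory holding the many small files.
small = LOAD '/data/small_files' USING PigStorage() AS (line:chararray);

-- Any simple job will do; this one just counts the records.
grouped = GROUP small ALL;
cnt = FOREACH grouped GENERATE COUNT(small);
DUMP cnt;
```

If combination takes effect, the job output should include the "Total input paths (combined) to process" line mentioned in the quoted messages below.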
On 1/11/12 11:07 PM, Marcel Holle wrote:
> My pig.properties specifies only these parameters: log4jconf,
> fs.default.name, mapred.job.tracker. So it should use the
> CombineFileInputFormat by default. I have 100,000 files of around 16 KB each.
>
> 2012/1/11 Prashant Kommireddi<[EMAIL PROTECTED]>
>
>> Hi Marcel,
>>
>> You might not find "pig.splitCombination" in your configuration if it is
>> not set manually. Pig internally defaults it to true.
>>
>> What is the value of "pig.maxCombinedSplitSize"? If you are not setting it
>> manually, it should be equal to your block size. What is the individual
>> file size of the small files?
>>
>> Thanks,
>> Prashant
>>
>>
>> On Wed, Jan 11, 2012 at 3:18 PM, Marcel Holle
>> <[EMAIL PROTECTED]>wrote:
>>
>>> If I got it right I should see an output like "Total input paths (combined)
>>> to process : 7" when I run a pig script, but I'm missing the "(combined)"
>>> part, so CombineFileInputFormat is not used? Where could I find the pig
>>> configuration? I think I have to check the "pig.splitCombination" value.
>>>
>>> 2012/1/11 Daniel Dai<[EMAIL PROTECTED]>
>>>
>>>> Check PIG-1518.
>>>>
>>>> Daniel
>>>>
>>>> On Wed, Jan 11, 2012 at 11:01 AM, Marcel Holle
>>>> <[EMAIL PROTECTED]>wrote:
>>>>
>>>>> How could I verify this information? Could you point me to a config or the
>>>>> source code?
>>>>>
>>>>> 2012/1/11 Daniel Dai<[EMAIL PROTECTED]>
>>>>>
>>>>>> It is default in 0.8 as well.
>>>>>>
>>>>>> Daniel
>>>>>>
>>>>>> On Wed, Jan 11, 2012 at 10:43 AM, Marcel Holle
>>>>>> <[EMAIL PROTECTED]>wrote:
>>>>>>
>>>>>>> Is there also a way to activate the CombineFileInputFormat in Pig 0.8.1?
>>>>>>>
>>>>>>> 2012/1/10 Alex Rovner<[EMAIL PROTECTED]>
>>>>>>>
>>>>>>>> In versions 0.9+, the default is CombineFileInputFormat.
>>>>>>>>
>>>>>>>> On Tue, Jan 10, 2012 at 8:10 PM, Marcel Holle
>>>>>>>> <[EMAIL PROTECTED]>wrote:
>>>>>>>>
>>>>>>>>> How could I use the CombineFileInputFormat in Pig? I have a performance
>>>>>>>>> issue with lots of small files which I want to get rid of. I think by
>>>>>>>>> default the FileInputFormat is used.
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>