Re: Loader for small files
You could store your data with a smaller block size. Do something like
hadoop fs -Ddfs.block.size=1048576 -Dfs.local.block.size=1048576 -cp /org-input /small-block-input
You might only need one of those parameters. You can verify the block size with
hadoop fsck /small-block-input
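
As a rough sanity check on what fsck should report, the block count is about the file size divided by the block size, rounded up. A minimal sketch, assuming a hypothetical 50 MB input file (the file size is made up for illustration; the block size is the one from the copy command):

```shell
# Hypothetical input file size: 50 MB, in bytes (an assumption, not from the thread)
FILE_SIZE=$((50 * 1024 * 1024))
# Block size used in the copy command: 1 MB, in bytes
BLOCK_SIZE=1048576

# Ceiling division: a partial final block still counts as one block
BLOCKS=$(( (FILE_SIZE + BLOCK_SIZE - 1) / BLOCK_SIZE ))
echo "expected blocks: $BLOCKS"
```

If fsck reports far fewer blocks than this kind of estimate, the smaller block size probably was not applied to the copy.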

In your Pig script, you'll probably need to set
pig.maxCombinedSplitSize
to something around the block size.
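
For example, with the 1 MB block size used above, the setting could look roughly like this. This is only a sketch: the script name my_script.pig is a placeholder, and the right value for your data may differ.

```shell
# 1 MB in bytes, matching the block size used in the copy command
BLOCK_SIZE=$((1024 * 1024))

# Statement to put at the top of the Pig script
PIG_SET="set pig.maxCombinedSplitSize $BLOCK_SIZE;"
echo "$PIG_SET"

# Alternatively, pass it on the command line (my_script.pig is a placeholder):
# pig -Dpig.maxCombinedSplitSize=$BLOCK_SIZE my_script.pig
```

The idea is to keep Pig from combining the small splits back into one large split, which would again send everything to a single mapper.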

David

On Feb 11, 2013, at 1:24 PM, Something Something <[EMAIL PROTECTED]> wrote:

> Sorry.. Moving 'hbase' mailing list to BCC 'cause this is not related to
> HBase.  Adding 'hadoop' user group.
>
> On Mon, Feb 11, 2013 at 10:22 AM, Something Something <
> [EMAIL PROTECTED]> wrote:
>
>> Hello,
>>
>> We are running into performance issues with Pig/Hadoop because our input
>> files are small.  Everything goes to only 1 Mapper.  To get around this, we
>> are trying to use our own Loader like this:
>>
>> 1)  Extend PigStorage:
>>
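>> import org.apache.hadoop.mapreduce.InputFormat;
>> import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
>> import org.apache.pig.builtin.PigStorage;
>>
>> // Loader that reuses PigStorage parsing but swaps in NLineInputFormat
>> public class SmallFileStorage extends PigStorage {
>>
>>    public SmallFileStorage(String delimiter) {
>>        super(delimiter);
>>    }
>>
>>    // Split the input every N lines (N comes from
>>    // mapreduce.input.lineinputformat.linespermap) instead of by block
>>    @Override
>>    public InputFormat getInputFormat() {
>>        return new NLineInputFormat();
>>    }
>> }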
>>
>>
>>
>> 2)  Add command line argument to the Pig command as follows:
>>
>> -Dmapreduce.input.lineinputformat.linespermap=500000
>>
>>
>>
>> 3)  Use SmallFileStorage in the Pig script as follows:
>>
>> USING com.xxx.yyy.SmallFileStorage ('\t')
>>
>>
>> But this doesn't seem to work.  We still see that everything is going to
>> one mapper.  Before we spend any more time on this, I am wondering if this
>> is a good approach – OR – if there's a better approach?  Please let me
>> know.  Thanks.
>>
>>
>>