You can control the map split size, and therefore the number of mappers, by
setting "pig.maxCombinedSplitSize", "mapred.max.split.size", and
"mapred.min.split.size". The first is a Pig parameter and the last two are
Hadoop parameters.
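
As a rough sketch, these properties could be set at the top of the Pig script
so that small input splits are combined and fewer map tasks are launched. The
256 MB value is only an illustrative assumption, not a recommendation, and
this assumes a Pig version whose SET command accepts arbitrary properties:

```pig
-- Hypothetical values: 268435456 bytes = 256 MB; tune for your data and cluster.
-- Pig: combine small input splits until a combined split reaches this size.
SET pig.maxCombinedSplitSize 268435456;
-- Hadoop: lower and upper bounds on the size of a single input split.
SET mapred.min.split.size 268435456;
SET mapred.max.split.size 268435456;
```

Larger splits mean fewer mappers, and so fewer concurrent DB writers in a
map-only job. Whether these settings take effect may depend on the loader,
since a custom LOAD function can compute its own splits.
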
Daniel
On 03/24/2011 06:18 PM, Dexin Wang wrote:
> Thanks for your explanation Alex.
>
> In some cases, there isn't even a reduce phase. For example, we have some
> raw data, after our custom LOAD function and some filter function, it
> directly goes into DB. And since we don't have control on number of mappers,
> we end up with too many DB writers. That's why I had to add that artificial
> reduce phase I mentioned earlier so that we can throttle it down.
>
> We could also do what someone else suggested: add a post-processing step
> that writes the output to HDFS and loads the DB from that. But for other
> reasons we would rather not do that if we don't have to.
>
> On Thu, Mar 17, 2011 at 2:16 PM, Alex Rovner<[EMAIL PROTECTED]> wrote:
>
>> Dexin,
>>
>> You can control the number of reducers by adding the following to your pig
>> script:
>>
>> SET default_parallel 29;
>>
>> Pig will run with 29 reducers with the above statement.
>>
>> As far as the bulk insert goes:
>>
>> We are using MS-SQL as our database, but MySQL would be able to handle the
>> bulk insert the same way.
>>
>> Essentially, we direct the output of the job into a temporary folder so
>> that we know which output belongs to this particular run. If you set the
>> number of reducers to 29, you will have 29 files in the temp folder after
>> the job completes. You can then run a bulk insert SQL command on each of
>> the resulting files, either pointing at HDFS through FUSE (the way we do
>> it) or copying the files to a Samba share or NFS and pointing the SQL
>> server at that location.
>>
>> In order to bulk insert you would have to either (A) do this in a
>> post-processing script or (B) write your own storage func that takes care
>> of it. A storage func is tricky, since you will need to implement your own
>> OutputCommitter (see https://issues.apache.org/jira/browse/PIG-1891).
>>
>> Let me know if you have further questions.
>>
>> Alex
>>
>>
>> On Thu, Mar 17, 2011 at 5:00 PM, Dexin Wang<[EMAIL PROTECTED]> wrote:
>>
>>> Can you describe a bit more about your bulk insert technique? And the way
>>> you control the number of reducers is also by adding artificial ORDER or
>>> GROUP step?
>>>
>>> Thanks!
>>>
>>>
>>> On Thu, Mar 17, 2011 at 1:33 PM, Alex Rovner<[EMAIL PROTECTED]>wrote:
>>>
>>>> We use a bulk insert technique after the job completes. You can control
>>>> the size of each bulk insert by controlling the number of reducers.
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On Mar 17, 2011, at 2:03 PM, Dexin Wang<[EMAIL PROTECTED]> wrote:
>>>>
>>>>> We do some processing in hadoop, then as the last step we write the
>>>>> result to a database. The database is not good at handling hundreds of
>>>>> concurrent connections and fast writes, so we need to throttle down the
>>>>> number of tasks that write to the DB. Since we have no control over the
>>>>> number of mappers, we add an artificial reducer step to achieve that,
>>>>> either by doing GROUP or ORDER, like this:
>>>>>
>>>>> sorted_data = ORDER data BY f1 PARALLEL 10;
>>>>> -- then write sorted_data to DB
>>>>>
>>>>> or
>>>>>
>>>>> grouped_data = GROUP data BY f1 PARALLEL 10;
>>>>> data_to_write = FOREACH grouped_data GENERATE $1;
>>>>>
>>>>> I feel neither is a good approach. They just add unnecessary computing
>>>>> time, especially the first one. And GROUP may produce bags that are too
>>>>> large. Any better suggestions?
>>>