Thanks for your explanation, Alex.
In some cases, there isn't even a reduce phase. For example, we have some
raw data that, after our custom LOAD function and some filter function,
goes directly into the DB. And since we don't have control over the number
of mappers, we end up with too many DB writers. That's why I had to add the
artificial reduce phase I mentioned earlier, so that we can throttle it down.
We could also do what someone else suggested - add a post-processing step
that writes output to HDFS and load the DB from that. But there are other
considerations that make us prefer not to do that if we don't have to.
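For what it's worth, here is a minimal sketch of the pipeline I'm describing
(the loader, storage func, and field names are made up for illustration):

```pig
-- Hypothetical pipeline: custom loader, filter, then a forced reduce phase
-- so that only 10 writers hit the database concurrently.
raw = LOAD 'input' USING MyCustomLoader() AS (f1, f2);
filtered = FILTER raw BY f2 IS NOT NULL;
-- GROUP (or ORDER) introduces a reduce phase; PARALLEL caps the reducer
-- count, and therefore the number of concurrent DB connections.
grouped = GROUP filtered BY f1 PARALLEL 10;
throttled = FOREACH grouped GENERATE FLATTEN(filtered);
STORE throttled INTO 'db' USING MyDbStorage();
```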
On Thu, Mar 17, 2011 at 2:16 PM, Alex Rovner <[EMAIL PROTECTED]> wrote:
> You can control the number of reducers by adding the following to your Pig
> script:
> SET default_parallel 29;
> Pig will run with 29 reducers with the above statement.
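> A quick sketch of how this interacts with the PARALLEL clause (relation
> names here are hypothetical): default_parallel sets the job-wide reducer
> count, and PARALLEL on an individual operator overrides it.
>
> ```pig
> SET default_parallel 29;
> -- Every reduce-side operator now uses 29 reducers...
> grouped = GROUP data BY key;
> -- ...unless overridden on a specific statement:
> small_grouped = GROUP data BY other_key PARALLEL 5;
> ```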
> As far as the bulk insert goes:
> We are using MS-SQL as our database, but MySQL would be able to handle the
> bulk insert the same way.
> Essentially we direct the output of the job into a temporary folder so
> that we know the output of this particular run. If you set the number of
> reducers to 29, you will have 29 files in the temp folder after the job
> completes. You can then run a bulk insert SQL command on each of the
> resulting files, pointing the SQL server at HDFS either through FUSE (the
> way we do it), or by copying the resulting files to a Samba share or NFS
> and pointing the SQL server to that location.
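> For MS-SQL, each per-file load is roughly this (table name, path, and
> delimiters below are just an illustration - adjust to match what your
> reducers actually emit):
>
> ```sql
> -- Sketch: one BULK INSERT per reducer output file, with HDFS exposed
> -- via FUSE (or the files copied to a share the SQL server can reach).
> BULK INSERT staging_table
> FROM '/mnt/hdfs/tmp/job_output/part-00000'
> WITH (FIELDTERMINATOR = '\t', ROWTERMINATOR = '\n');
> ```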
> In order to bulk insert you would have to either (a) do this in a
> post-processing script, or (b) write your own storage func that takes care
> of it. The storage func route is tricky since you will need to implement
> your own OutputCommitter (see https://issues.apache.org/jira/browse/PIG-1891).
> Let me know if you have further questions.
> On Thu, Mar 17, 2011 at 5:00 PM, Dexin Wang <[EMAIL PROTECTED]> wrote:
>> Can you describe your bulk insert technique a bit more? And do you also
>> control the number of reducers by adding an artificial ORDER or GROUP
>> step?
>> On Thu, Mar 17, 2011 at 1:33 PM, Alex Rovner <[EMAIL PROTECTED]>wrote:
>>> We use a bulk insert technique after the job completes. You can control
>>> the size of each bulk insert by controlling the number of reducers.
>>> On Mar 17, 2011, at 2:03 PM, Dexin Wang <[EMAIL PROTECTED]> wrote:
>>> > We do some processing in Hadoop, then as the last step we write the
>>> > results to a database. The database is not good at handling hundreds
>>> > of concurrent connections and fast writes, so we need to throttle down
>>> > the number of processes that write to the DB. Since we have no control
>>> > over the number of mappers, we add an artificial reducer step to
>>> > achieve that, either by doing ORDER or GROUP like this:
>>> > sorted_data = ORDER data BY f1 PARALLEL 10;
>>> > -- then write sorted_data to DB
>>> > or
>>> > grouped_data = GROUP data BY f1 PARALLEL 10;
>>> > data_to_write = FOREACH grouped_data GENERATE FLATTEN($1);
>>> > I feel neither is a good approach. They just add unnecessary
>>> > computation, especially the first one. And GROUP may result in bags
>>> > that are too large. Any better suggestions?