In my opinion, MultiStorage should work just fine if you have a small number
of buckets (0-100+; I'm not sure about the exact limit, but definitely not
512), even if you have a large number of records in one bucket.
But I think this method is error-prone in the face of task failures. A more
scalable way is to generate files with tagged names and then move them into
one directory.
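To make the tagged-names idea concrete, here is a minimal local-filesystem sketch. The function name, the `key-taskid` naming scheme, and the staging layout are all made up for illustration; a real Pig/Hadoop implementation would stage into a task attempt directory on HDFS and rely on the output committer for the final move.

```python
import os
import shutil
import tempfile

def write_bucket_files(records, final_dir, task_id):
    """Write each bucket's records to a staging file tagged with the
    task id, then move the files into the shared output directory only
    once the task has finished successfully. A failed or re-run task
    never leaves partial files in final_dir."""
    os.makedirs(final_dir, exist_ok=True)
    staging = tempfile.mkdtemp(prefix=f"attempt_{task_id}_")
    # Group records by bucket key.
    buckets = {}
    for key, value in records:
        buckets.setdefault(key, []).append(value)
    # Write one tagged file per bucket into the staging directory.
    for key, values in buckets.items():
        path = os.path.join(staging, f"{key}-{task_id}")
        with open(path, "w") as f:
            f.write("\n".join(values) + "\n")
    # "Commit": move the tagged files into the final directory.
    for name in os.listdir(staging):
        shutil.move(os.path.join(staging, name),
                    os.path.join(final_dir, name))
    os.rmdir(staging)
```

Because every file name carries the task id, retried attempts of the same task overwrite their own output rather than clobbering another task's files.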
If you take a bag of grouped tuples and change your partitioner so that
more than one reducer writes into one directory, that should work too. But
this is only useful if your bucket sizes are uniformly distributed (and
again there is a limit on the number of buckets).
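The partitioner idea above can be sketched as a plain function. This is an assumption-laden illustration, not Pig's or Hadoop's actual partitioner API: it just shows how a hash over the bucket key plus a per-record offset can fan one bucket out across a fixed number of reducers, all of which then write into that bucket's directory.

```python
import zlib

def fanout_partition(bucket_key, record, reducers_per_bucket, num_reducers):
    """Assign a (bucket, record) pair to one of `reducers_per_bucket`
    reducers reserved for that bucket, so a large bucket is split
    across several reducers instead of overloading one."""
    # Stable base reducer for this bucket (crc32 is deterministic,
    # unlike Python's salted built-in hash()).
    base = zlib.crc32(bucket_key.encode()) % num_reducers
    # Spread this bucket's records over a small window of reducers.
    offset = zlib.crc32(record.encode()) % reducers_per_bucket
    return (base + offset) % num_reducers
```

Note this only helps when buckets are roughly the same size: a skewed bucket still lands on at most `reducers_per_bucket` reducers while small buckets waste their extra slots.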
On Thu, March 31, 2011 5:17 pm, Dmitriy Ryaboy wrote:
> I think the problem there is # of unique keys -- one winds up creating
> way too many filehandles all at the same time. I may be misunderstanding
> the nature of the bug. If I do understand it correctly, it's endemic to the
> whole concept of MultiStorage; creating 7K files * # reducers sounds like
> a really bad thing to do; if you are running into the problem, you
> probably shouldn't be using MultiStorage.
> Or am I misreading what's happening?
> On Thu, Mar 31, 2011 at 9:12 AM, Jonathan Holloway <
> [EMAIL PROTECTED]> wrote:
>> Hi all,
>> I'm working with some data at the moment, for which I needed to
>> generate multiple reports for a given grouped set of data, by name. I
>> wasn't initially sure how to do this. I came across MultiStorage
>> in Pig contrib, but I'm a little worried about the 7k limit there at
>> the moment due to a bug:
>> Does anybody know what the issue here is? I can take a look at it if
>> necessary, if someone can point me in the right direction in terms of
>> fixing it. I've currently hacked MultiStorage to take a bag and spit
>> out the contained tuples with a tab delimiter between them. Is
>> this the best way to go?
>> Just looking for some feedback.