Two other points -
If you have several input files, write a custom InputFormat that overrides
protected boolean isSplitable(JobContext context, Path file) { return false; }
so each file goes whole to a single mapper and you never have the problem of
starting in the middle of a quoted string -
If the input is not truly massive, you can simply write a piece of code to
find the longest quoted string by reading the entire file - on a single box
you can handle tens of gigs per hour.
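A rough sketch of that single-box scan, in plain Java with no Hadoop dependency. The class and method names are made up for illustration, and it assumes a bare '"' both opens and closes a quoted string, with no escaped or doubled quotes; real data may need more careful handling.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of Hadoop.
class LongestQuote {

    // Returns the length in characters of the longest quoted string in the input.
    // Assumption: every '"' toggles quote state; escape sequences are NOT handled.
    static long longestQuotedLength(Reader in) throws IOException {
        long longest = 0;
        long current = 0;
        boolean inQuote = false;
        int c;
        while ((c = in.read()) != -1) {
            if (c == '"') {
                if (inQuote) {
                    // Closing quote: record the length of the string just ended.
                    longest = Math.max(longest, current);
                }
                inQuote = !inQuote;
                current = 0;
            } else if (inQuote) {
                current++; // newlines inside quotes are counted too
            }
        }
        // An unterminated quote at end of input still counts toward the maximum.
        if (inQuote) {
            longest = Math.max(longest, current);
        }
        return longest;
    }

    public static void main(String[] args) throws IOException {
        // Assumes UTF-8 input; args[0] is the path of the file to scan.
        try (BufferedReader reader =
                 Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            System.out.println(longestQuotedLength(reader));
        }
    }
}
```

If the number that comes back is comfortably small, a fixed read-back limit of that size becomes a measured assumption about the data rather than a guess.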
On Wed, Feb 22, 2012 at 3:22 PM, Keith Wiley <[EMAIL PROTECTED]> wrote:
> Thanks for responding. Unfortunately, the data already exists. I have no
> way of instituting limitations on the format, much less reformatting it to
> suit my needs. It is true that I can make some general assumptions about
> the data (unrealistically long strings are unlikely to occur), but I can't
> write a steadfastly robust reader under such assumptions.
> The problem is that even if I impose an assumption of limited length
> strings, that doesn't prescribe a method for handling the possibility of an
> error. If a string really is too long and the reader fails to detect it,
> I'm not sure how to ensure that the reader or subsequent map task fails in
> a clean fashion.
> If I could at least impose an assumption of this sort...and then detect
> and fail cleanly on violations of the assumption, that would go a long way.
> I'll think about it.
> On Feb 22, 2012, at 14:59 , Steve Lewis wrote:
> > It sounds like you may need to give up a little to make things work -
> > Suppose, for example, that you placed a limit on the length of a quoted
> > string, say 1024 characters - the reader can then either start at the
> > beginning or read back by, say, 1024 characters to see if the start is in
> > a quote and proceed accordingly - if quoted strings can be of arbitrary
> > length there may be no good solution
> Keith Wiley [EMAIL PROTECTED] keithwiley.com
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033