MapReduce, mail # user - CSV files as input


Re: CSV files as input
Steve Lewis 2012-02-23, 00:13
Two other points:
If you have several input files, write a custom input format that overrides
 protected boolean isSplitable(JobContext context, Path file) { return false; }
so that each file is read whole by a single mapper and you never have the
problem of starting in the middle of a quoted string.
If the input is not truly massive, you can simply write a piece of code to
find the longest quoted string by reading the entire file. On a single box
you can handle tens of gigs per hour.
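
To make the second point concrete, here is a rough sketch of such a scan. It is
only an illustration, not tested against your data: the class and method names
are made up, and it assumes RFC 4180 style quoting, where fields are wrapped in
double quotes and a literal quote inside a field is written as two quotes. If
your files escape quotes differently, the loop would need to change.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class LongestQuotedField {

    /**
     * Streams through CSV text and returns the length, in characters, of the
     * longest double-quoted field. Assumes a literal quote inside a quoted
     * field is written as two quotes (""). Throws if the input ends while
     * still inside a quoted field, so a malformed file fails loudly.
     */
    public static long longestQuotedField(Reader in) throws IOException {
        long longest = 0;
        long current = 0;
        boolean inQuotes = false;
        int c = in.read();
        while (c != -1) {
            int next = in.read();            // one character of lookahead
            if (!inQuotes) {
                if (c == '"') {              // opening quote
                    inQuotes = true;
                    current = 0;
                }
            } else if (c == '"') {
                if (next == '"') {           // escaped quote: counts as one character
                    current++;
                    next = in.read();        // skip the second quote of the pair
                } else {                     // closing quote
                    inQuotes = false;
                    longest = Math.max(longest, current);
                }
            } else {
                current++;                   // ordinary character inside the field
            }
            c = next;
        }
        if (inQuotes) {
            throw new IOException("unterminated quoted field at end of input");
        }
        return longest;
    }

    public static void main(String[] args) throws IOException {
        // Second record contains an embedded newline inside the quoted field.
        String sample = "id,comment\n1,\"hello, world\"\n2,\"line one\nline two\"\n";
        System.out.println(longestQuotedField(new StringReader(sample))); // prints 17
    }
}
```

For a real file you would pass a buffered reader (for example from
java.nio.file.Files.newBufferedReader) instead of the StringReader. The number
it reports could then serve as the look-back limit, with the reader failing the
task if it ever sees a longer field.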

On Wed, Feb 22, 2012 at 3:22 PM, Keith Wiley <[EMAIL PROTECTED]> wrote:

> Thanks for responding.  Unfortunately, the data already exists.  I have no
> way of instituting limitations on the format, much less reformatting it to
> suit my needs.  It is true that I can make some general assumptions about
> the data (unrealistically long strings are unlikely to occur), but I can't
> write a steadfastly robust reader under such assumptions.
>
> The problem is that even if I impose an assumption of limited length
> strings, that doesn't prescribe a method for handling the possibility of an
> error.  If a string really is too long and the reader fails to detect it,
> I'm not sure how to ensure that the reader or subsequent map task fails in
> a clean fashion.
>
> If I could at least impose an assumption of this sort...and then detect
> and fail cleanly on violations of the assumption, that would go a long way.
>
> I'll think about it.
>
> Thanks.
>
> On Feb 22, 2012, at 14:59 , Steve Lewis wrote:
>
> > It sounds like you may need to give up a little to make things work.
> > Suppose, for example, that you placed a limit on the length of a quoted
> > string, say 1024 characters. The reader can then either start at the
> > beginning or read back by, say, 1024 characters to see if the start is in
> > a quote and proceed accordingly. If quoted strings can be of arbitrary
> > length there may be no good solution.
>
>
> ________________________________________________________________________________
> Keith Wiley     [EMAIL PROTECTED]     keithwiley.com
> music.keithwiley.com
>
> "I do not feel obliged to believe that the same God who has endowed us with
> sense, reason, and intellect has intended us to forgo their use."
>                                           --  Galileo Galilei
>
> ________________________________________________________________________________
>
>
--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com