MapReduce user mailing list - CSV files as input


Re: CSV files as input
Steve Lewis 2012-02-22, 22:59
It sounds like you may need to give up a little to make things work.
Suppose, for example, that you placed a limit on the length of a quoted
string, say 1024 characters. The reader could then either start at the
beginning of the file or read back by, say, 1024 characters to see whether
the split start falls inside a quote, and proceed accordingly. If quoted
strings can be of arbitrary length, there may be no good solution.
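
A minimal sketch of that idea in Java (the class and method names are mine,
and this is a heuristic rather than a tested implementation: it assumes the
look-back window itself begins outside any quoted field, which the length
limit is meant to make plausible; note that an escaped quote written as ""
adds two to the count, so it does not disturb the parity):

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataInputStream;

    public class QuoteStateProbe {

      // Assumed upper bound on the length of any quoted string.
      static final int MAX_QUOTED = 1024;

      // Guess whether splitStart falls inside a quoted field by counting
      // double quotes in the MAX_QUOTED bytes that precede it: odd parity
      // means a quote was opened and not yet closed.
      static boolean startsInsideQuote(FSDataInputStream in, long splitStart)
          throws IOException {
        long from = Math.max(0, splitStart - MAX_QUOTED);
        byte[] window = new byte[(int) (splitStart - from)];
        in.readFully(from, window);  // positioned read; stream offset unchanged
        int quotes = 0;
        for (byte b : window) {
          if (b == '"') {
            quotes++;
          }
        }
        return (quotes & 1) == 1;    // odd parity => inside a quote
      }
    }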

On Wed, Feb 22, 2012 at 11:01 AM, Keith Wiley <[EMAIL PROTECTED]> wrote:

> It seems nearly impossible to use CSV files as Hadoop input.  I see that
> there is a CsvRecordInput class, but have found virtually no examples
> online of how to use it...and the one example I did find blatantly assumed
> that the CSV records were delimited by endlines...which is not what the CSV
> spec guarantees.
>  Based on my analysis below, I don't see how CSV input is possible, so I
> don't understand how CsvRecordInput can work (and I am having trouble
> understanding the completely undocumented CsvRecordInput.java; it isn't
> clear how that class is intended to be used).  If CsvRecordInput solves all
> my problems, then great, but how do I use it?
>
> I need to process CSV files which will almost certainly contain quoted
> endlines.  I have attempted to derive my own record reader for this task
> and conclude that it is virtually impossible without reading from the
> beginning of the file.  I explain below.
>
> Consider this: assuming a split starts at some arbitrary point in the
> file, the standard record reader approach would be to initialize the record
> reader by reading past the tail of the current, partial record and beginning
> at the start of the next full record...but there is no way to positively
> identify the end of a CSV record if you start at an arbitrary location
> without potentially reading to the end of the file!
>
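For reference, the standard initialization described above looks roughly
like this for newline-delimited text (a simplified sketch in the spirit of
Hadoop's LineRecordReader; the class and method names are mine). The step
that breaks for CSV is the baked-in assumption that the first newline after
the split start ends a record:

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.util.LineReader;

    public class NewlineSkipSketch {

      // Skip the (presumably partial) first line of a split and return the
      // offset of the first byte the reader should treat as a record start.
      // With quoted newlines, the line we discard may end mid-record.
      static long skipPartialFirstLine(FSDataInputStream in, long splitStart)
          throws IOException {
        in.seek(splitStart);
        long pos = splitStart;
        if (splitStart != 0) {
          LineReader reader = new LineReader(in);
          pos += reader.readLine(new Text());  // consume through the next '\n'
        }
        return pos;
      }
    }
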
> For example, we must consider the possibility that the split begins in the
> middle of a quoted string (so an endline cannot be trusted to delimit a
> record, since it may fall within the string).  We must therefore scan for a
> possible end-quote to close the string, but if we *didn't* begin within a
> string there may *be no end-quote at all* (the entire CSV file might not
> contain a single quoted string).  The only way to identify that we did not
> begin within a quoted string is to scan to the end of the CSV file (not the
> end of the *split*, mind you).
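
To make that concrete, consider a contrived input (illustrative data, not
from the original message) in which a split boundary happens to land at the
start of the third line:

    id,note
    1,"a note that
    2,could pass for a record"
    3,another row

A reader that begins at the "2," line sees something that parses as a
plausible record, but it is actually the middle of record 1's quoted field,
and nothing on that line reveals the difference; only a backward scan,
potentially to the start of the file, settles it.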
>
> So, initializing a CSV record reader with absolute, error-free confidence
> requires reading not only the entire split at the time of initialization
> (grossly inefficient in itself), but potentially the entire file, which may
> not even reside on the current node!
>
> I'm at a loss.  How can Hadoop take CSV files as input?  It must be
> possible.  CSV is a very plain and common way to arrange textual data,
> which is Hadoop's forte; I'm sure people are processing CSV data with
> Hadoop, and it seems like a natural fit...but I can't imagine how to enable
> Hadoop to read it under the conditions imposed by Hadoop file splits.
>
> Blech.  Help!
>
>
> ________________________________________________________________________________
> Keith Wiley     [EMAIL PROTECTED]     keithwiley.com
> music.keithwiley.com
>
> "Luminous beings are we, not this crude matter."
>                                           --  Yoda
>
> ________________________________________________________________________________
>
>
--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com