Re: CSV files as input
Steve Lewis 2012-02-22, 22:59
It sounds like you may need to give up a little to make things work.
Suppose, for example, that you placed a limit on the length of a quoted
string, say 1024 characters. The reader can then either start at the
beginning of the file or read back by, say, 1024 characters to see whether
the split starts inside a quote, and proceed accordingly. If quoted strings
can be of arbitrary length, there may be no good solution.
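
A rough sketch of how the bounded-length idea could be made concrete, in
plain Python rather than Hadoop API code. It is a variation on the fixed
look-back: it walks backward from the split start until it finds a run of
more than max_quoted consecutive non-quote characters. Under the assumed
limit, such a run cannot lie inside a quoted string, so the quote state at
that point is known. The names find_safe_anchor, first_record_start and
max_quoted are made up for illustration, and the data is assumed to be an
in-memory string with '"' as the quote character and '\n' as the record
separator.

```python
QUOTE = '"'


def find_safe_anchor(data: str, split_start: int, max_quoted: int) -> int:
    """Walk backward to an offset known to lie outside any quoted string.

    Assumption: no quoted string holds more than max_quoted characters
    between its quotes, so a longer quote-free run must be unquoted text.
    Falls back to offset 0 (start of file), which is always unquoted.
    """
    run = 0  # length of the quote-free run seen so far
    for i in range(split_start - 1, -1, -1):
        if data[i] == QUOTE:
            run = 0
        else:
            run += 1
            if run > max_quoted:
                return i
    return 0


def first_record_start(data: str, split_start: int, max_quoted: int = 1024) -> int:
    """Return the offset of the first record beginning at or after split_start."""
    if split_start == 0:
        return 0
    in_quotes = False
    for i in range(find_safe_anchor(data, split_start, max_quoted), len(data)):
        ch = data[i]
        if ch == QUOTE:
            # A doubled quote ("") toggles twice, so the net state is unchanged.
            in_quotes = not in_quotes
        elif ch == "\n" and not in_quotes and i + 1 >= split_start:
            return i + 1
    return len(data)  # no further record starts in this file


if __name__ == "__main__":
    sample = 'a,b\n"x\ny",c\nd,e\n'
    # Offset 7 sits just after a newline that is inside a quoted string.
    print(first_record_start(sample, 7, max_quoted=3))
```

The look-back is only short when quotes are sparse. If quote characters
appear more often than once every max_quoted characters, this sketch keeps
walking back, in the worst case to the start of the file, so a real reader
would probably want a cap and a fallback.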
On Wed, Feb 22, 2012 at 11:01 AM, Keith Wiley <[EMAIL PROTECTED]> wrote:
> It seems nearly impossible to use CSV files as Hadoop input. I see that
> there is a CsvRecordInput class, but have found virtually no examples
> online of how to use it...and the one example I did find blatantly assumed
> that the CSV records were delimited by endlines...which is not CSV spec.
> Based on my analysis below, I don't see how CSV input is possible, so I
> don't understand how CsvRecordInput can work (and I am having trouble
> understanding the completely undocumented CsvRecordInput.java; it isn't
> clear how that class is intended to be used). If CsvRecordInput solves all
> my problems, then great, but how do I use it?
> I need to process CSV files which will almost certainly contain quoted
> endlines. I have attempted to derive my own record reader for this task
> and conclude that it is virtually impossible without reading from the
> beginning of the file. I explain below.
> Consider this: Assuming a split starts at some arbitrary point in the
> file, the standard record reader approach would be to initialize the record
> reader by reading to the end of the current mid-record and beginning the
> record reader at the start of the next full record...but there is no way to
> positively identify the end of a CSV record if you start at an arbitrary
> location without potentially reading to the end of the file!
> For example, we must consider the possibility that the split begins in the
> middle of a quoted string (therefore, endlines do not delimit records
> because they may be within a string). We must therefore scan for a
> possible end-quote to close the string, but if we *didn't* begin within a
> string there may *be no end-quote at all* (the entire CSV file might not
> contain a single quoted string). The only way to identify that we did not
> begin within a quoted string is to scan to the end of the CSV file (not the
> end of the *split*, mind you).
> So, initializing a CSV record reader with absolute error-free confidence
> potentially requires reading not only the entire split at the time of
> initialization (grossly inefficient in itself), but potentially requires
> reading the entire file, which may not even reside on the current node!
> I'm at a loss. How can Hadoop take CSV files as input? It must be
> possible. CSV is a very plain and common way to arrange textual data,
> which is Hadoop's forte; I'm sure people are processing CSV data with
> Hadoop; it seems like a natural fit...but I can't imagine how to enable
> Hadoop to read it under the conditions of Hadoop file splits.
> Blech. Help!
> Keith Wiley [EMAIL PROTECTED] keithwiley.com
> "Luminous beings are we, not this crude matter."
> -- Yoda
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033