Re: CSV files as input
It sounds like you may need to give up a little to make things work.
Suppose, for example, that you placed a limit on the length of a quoted
string, say 1024 characters. The reader can then either start at the
beginning of the file or read back by, say, 1024 characters to see whether
the split start falls inside a quote, and proceed accordingly. If quoted
strings can be of arbitrary length, there may be no good solution.
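
For illustration, a minimal sketch of that idea in Java; the
MAX_QUOTED_LEN cap and the provablyOutsideQuote helper are hypothetical,
not part of any Hadoop API:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;

class QuoteStateProbe {
    // Sketch only: assumes no quoted CSV string exceeds this many bytes.
    static final int MAX_QUOTED_LEN = 1024;

    // Returns true when the split start provably lies outside any quoted
    // string: if no '"' appears in the MAX_QUOTED_LEN bytes before the
    // split, a quote opened earlier would have to exceed the length cap
    // to still be open here.
    static boolean provablyOutsideQuote(FSDataInputStream in, long splitStart)
            throws IOException {
        long windowStart = Math.max(0, splitStart - MAX_QUOTED_LEN);
        byte[] window = new byte[(int) (splitStart - windowStart)];
        in.seek(windowStart);
        in.readFully(window);
        for (byte b : window) {
            if (b == '"') {
                return false; // quote seen: state is ambiguous, fall back
            }
        }
        return true; // no quote could still be open at splitStart
    }
}

If a quote does show up in the window, one fallback (still assuming the
cap) is to scan further back until a quote-free stretch of MAX_QUOTED_LEN
bytes is found, then parse forward from there tracking quote state.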

On Wed, Feb 22, 2012 at 11:01 AM, Keith Wiley <[EMAIL PROTECTED]> wrote:

> It seems nearly impossible to use CSV files as Hadoop input.  I see that
> there is a CsvRecordInput class, but have found virtually no examples
> online of how to use it...and the one example I did find blatantly assumed
> that the CSV records were delimited by endlines...which is not what the
> CSV spec says.
>  Based on my analysis below, I don't see how CSV input is possible, so I
> don't understand how CsvRecordInput can work (and I am having trouble
> understanding the completely undocumented CsvRecordInput.java; it isn't
> clear how that class is intended to be used).  If CsvRecordInput solves all
> my problems, then great, but how do I use it?
>
> I need to process CSV files which will almost certainly contain quoted
> endlines.  I have attempted to derive my own record reader for this task
> and conclude that it is virtually impossible without reading from the
> beginning of the file.  I explain below.
>
> Consider this: Assuming a split starts at some arbitrary point in the
> file, the standard record reader approach would be to initialize the record
> reader by reading past the end of the current (partial) record and beginning
> the record reader at the start of the next full record...but there is no way
> to positively identify the end of a CSV record if you start at an arbitrary
> location without potentially reading to the end of the file!
>
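For reference, the "standard record reader approach" mentioned above looks
roughly like this for newline-delimited text, a simplified sketch in the
spirit of Hadoop's LineRecordReader, not its actual source:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;

class SkipToNextRecord {
    // Simplified sketch: a reader whose split starts mid-file discards
    // the partial record at the split start (the previous split's reader
    // consumes it past its own end) and begins at the next '\n' boundary.
    static long initialize(FSDataInputStream in, long splitStart)
            throws IOException {
        in.seek(splitStart);
        long pos = splitStart;
        if (splitStart != 0) {
            int b;
            while ((b = in.read()) != -1) {
                pos++;
                if (b == '\n') break; // next record starts here
            }
        }
        // Only correct if '\n' always ends a record, which is exactly
        // what quoted CSV endlines break.
        return pos;
    }
}
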
> For example, we must consider the possibility that the split begins in the
> middle of a quoted string (therefore, endlines do not delimit records
> because they may be within a string).  We must therefore scan for a
> possible end-quote to close the string, but if we *didn't* begin within a
> string there may *be no end-quote at all* (the entire CSV file might not
> contain a single quoted string).  The only way to identify that we did not
> begin within a quoted string is to scan to the end of the CSV file (not the
> end of the *split* mind you).
>
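A concrete instance of that ambiguity (illustrative data, not from the
original mail): suppose the file contains

    a,b,"line one
    line two",c

and a split happens to begin at the second line. The reader sees
line two",c and the lone quote is locally consistent both as the close of
a string opened before the split (which it is) and as the open of a new
string; nothing within the split can distinguish the two cases.
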
> So, initializing a CSV record reader with absolute error-free confidence
> potentially requires reading not only the entire split at the time of
> initialization (grossly inefficient in itself), but potentially requires
> reading the entire file, which may not even reside on the current node!
>
> I'm at a loss.  How can Hadoop take CSV files as input?  It must be
> possible.  CSV is a very plain and common way to arrange textual data,
> which is Hadoop's forte; I'm sure people are processing CSV data with
> Hadoop, it seems like a natural fit...but I can't imagine how to enable
> Hadoop to read it under the conditions of Hadoop file splits.
>
> Blech.  Help!
>
>
> ________________________________________________________________________________
> Keith Wiley     [EMAIL PROTECTED]     keithwiley.com
> music.keithwiley.com
>
> "Luminous beings are we, not this crude matter."
>                                           --  Yoda
>
> ________________________________________________________________________________
>
>
--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com