NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
MapReduce >> mail # user >> CSV files as input


Re: CSV files as input
Two other points:

If you have several input files, write a custom input format that overrides
 protected boolean isSplitable(JobContext context, Path file) { return false; }
Each file then goes to a single reader, so you never have the problem of
starting in the middle of a file.

If the input is not truly massive, you can simply write a piece of code to
find the longest quoted string by reading the entire file. On a single box
you can handle tens of gigs per hour.
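
A rough sketch of such a scan in plain Java, in case it helps. It is only an illustration: it assumes fields are quoted with a double quote and that an embedded quote is written as two double quotes, and it counts length in Java chars rather than bytes. The class and method names (LongestQuotedScanner, longestQuoted) are made up for this example and are not part of Hadoop.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not a Hadoop class: streams through a CSV file once
// and reports the length of the longest quoted string it sees.
final class LongestQuotedScanner {

    /**
     * Returns the length (in chars) of the longest quoted string, or 0 if
     * the input has none. Assumes "" inside quotes is an escaped quote.
     * Throws IOException if the input ends while a quote is still open.
     */
    static long longestQuoted(Reader in) throws IOException {
        long longest = 0;
        long current = 0;
        boolean inQuotes = false;
        int next = in.read();               // one character of lookahead
        while (next != -1) {
            int c = next;
            next = in.read();
            if (!inQuotes) {
                if (c == '"') {             // opening quote
                    inQuotes = true;
                    current = 0;
                }
            } else if (c == '"') {
                if (next == '"') {          // escaped quote, counts as one char
                    current++;
                    next = in.read();       // skip the second quote
                } else {                    // closing quote
                    inQuotes = false;
                    longest = Math.max(longest, current);
                }
            } else {
                current++;                  // ordinary char, newlines included
            }
        }
        if (inQuotes) {
            throw new IOException("input ended inside a quoted string");
        }
        return longest;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path of a local CSV file; UTF-8 is an assumption.
        try (BufferedReader reader =
                 Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            System.out.println(longestQuoted(reader));
        }
    }
}
```

If the number it prints is under the limit you want to assume (say 1024 characters), the assumption holds for the data you already have. An IOException here means the file ends inside an open quote, which is at least a clean way to find out.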

On Wed, Feb 22, 2012 at 3:22 PM, Keith Wiley <[EMAIL PROTECTED]> wrote:

> Thanks for responding.  Unfortunately, the data already exists.  I have no
> way of instituting limitations on the format, much less reformatting it to
> suit my needs.  It is true that I can make some general assumptions about
> the data (unrealistically long strings are unlikely to occur), but I can't
> write a steadfastly robust reader under such assumptions.
>
> The problem is that even if I impose an assumption of limited length
> strings, that doesn't prescribe a method for handling the possibility of an
> error.  If a string really is too long and the reader fails to detect it,
> I'm not sure how to ensure that the reader or subsequent map task fails in
> a clean fashion.
>
> If I could at least impose an assumption of this sort...and then detect
> and fail cleanly on violations of the assumption, that would go a long way.
>
> I'll think about it.
>
> Thanks.
>
> On Feb 22, 2012, at 14:59 , Steve Lewis wrote:
>
> > It sounds like you may need to give up a little to make things work.
> > Suppose, for example, that you placed a limit on the length of a quoted
> > string, say 1024 characters. The reader can then either start at the
> > beginning or read back by, say, 1024 characters to see if the start is
> > in a quote and proceed accordingly. If quoted strings can be of
> > arbitrary length, there may be no good solution.
>
>
> ________________________________________________________________________________
> Keith Wiley     [EMAIL PROTECTED]     keithwiley.com
> music.keithwiley.com
>
> "I do not feel obliged to believe that the same God who has endowed us with
> sense, reason, and intellect has intended us to forgo their use."
>                                           --  Galileo Galilei
>
> ________________________________________________________________________________
>
>
--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com