Hadoop, mail # user - Best practices on splitting an input line?


Best practices on splitting an input line?
Andy Sautins 2009-02-10, 20:18


   I have a question.  I've dabbled with different ways of tokenizing an
input file line for processing.  In my somewhat limited tests there seem
to be noticeable performance differences between tokenizing methods.
Roughly, for splitting a line into tokens ( tab delimited in my case ),
Scanner is the slowest, followed by String.split, with StringTokenizer
being the fastest.  StringTokenizer, for my application, has the
unfortunate characteristic of not returning blank tokens ( i.e., parsing
"a,b,c,,d" would return "a","b","c","d" instead of "a","b","c","","d").
The WordCount example uses StringTokenizer, which makes sense to me,
except I'm currently getting hung up on it not returning blank tokens.  I
did run across the com.Ostermiller.util StringTokenizer replacement that
handles null/blank tokens
(http://ostermiller.org/utils/StringTokenizer.html ), which seems
possible to use, but it sure seems like someone else must have already
solved this problem better than I have.
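
To make the blank-token behavior concrete, here is a minimal sketch comparing java.util.StringTokenizer with String.split on the "a,b,c,,d" example.  The class and method names ( BlankTokenDemo, withTokenizer, withSplit ) are made up for illustration, and it says nothing about relative speed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class; names are illustrative only.
class BlankTokenDemo {

    // StringTokenizer skips over runs of delimiters, so empty tokens
    // between consecutive delimiters are never returned.
    static List<String> withTokenizer(String line, String delim) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line, delim);
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    // String.split keeps empty strings between delimiters. A negative
    // limit also keeps trailing empty strings, which the one-argument
    // split(regex) would discard.
    static List<String> withSplit(String line, String regex) {
        return Arrays.asList(line.split(regex, -1));
    }

    public static void main(String[] args) {
        System.out.println(withTokenizer("a,b,c,,d", ",").size()); // 4
        System.out.println(withSplit("a,b,c,,d", ",").size());     // 5
        System.out.println(withSplit("a,b,,", ",").size());        // 4
    }
}
```

Note that the first argument to split is a regular expression, so a delimiter such as "|" or "." would need escaping.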

 

   So, my question is, is there a "best practice" for splitting an input
line especially when NULL tokens are expected ( i.e., two consecutive
delimiter characters )?
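
One possible approach, sketched here only as an illustration and not as an established best practice, is a hand-rolled splitter built on String.indexOf for a single-character delimiter.  It returns an empty string for every pair of consecutive delimiters.  The class name CharSplitter is hypothetical, and this has not been benchmarked against the other methods.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of Hadoop or the JDK.
class CharSplitter {

    // Splits on a single delimiter character and keeps empty fields,
    // including leading and trailing ones.
    static List<String> split(String line, char delim) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        int end;
        while ((end = line.indexOf(delim, start)) >= 0) {
            fields.add(line.substring(start, end)); // may be ""
            start = end + 1;
        }
        fields.add(line.substring(start)); // last field, may be ""
        return fields;
    }

    public static void main(String[] args) {
        List<String> fields = split("a\tb\t\td", '\t');
        System.out.println(fields.size());           // 4
        System.out.println(fields.get(2).isEmpty()); // true
    }
}
```

An empty input line yields a single empty field here, which may or may not be what a given job wants.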

 

   Any thoughts would be appreciated.

 

   Thanks

 

   Andy