Best practices on splitting an input line?
Andy Sautins 2009-02-10, 20:18
I have a question. I've dabbled with different ways of tokenizing an input file line for processing. In my somewhat limited tests there seem to be noticeable performance differences between the tokenizing methods. For splitting a line into tokens (tab-delimited in my case), it roughly seems that Scanner is the slowest, followed by String.split, with StringTokenizer being the fastest.

StringTokenizer, for my application, has the unfortunate characteristic of not returning blank tokens (i.e., parsing "a,b,c,,d" returns "a","b","c","d" instead of "a","b","c","","d"). The WordCount example uses StringTokenizer, which makes sense to me, except that I'm currently getting hung up on it not returning blank tokens.

I did run across the com.Ostermiller.util StringTokenizer replacement that handles null/blank tokens (http://ostermiller.org/utils/StringTokenizer.html), which seems possible to use, but it sure seems like someone else has already solved this problem better than I have.

So, my question is: is there a "best practice" for splitting an input line, especially when NULL tokens are expected (i.e., two consecutive delimiter characters)?

Any thoughts would be appreciated.

Thanks
Andy
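
To make the blank-token behavior described above concrete, here is a minimal sketch in plain Java. The class name SplitDemo and its helper methods are hypothetical and only for illustration; it is not a benchmark and makes no claim about which approach is fastest. It compares StringTokenizer, String.split with a limit of -1, and a hand-rolled indexOf loop on a single-character delimiter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class; names are illustrative only.
class SplitDemo {

    // StringTokenizer treats a run of delimiters as one separator,
    // so empty fields are never returned.
    static List<String> withTokenizer(String line, String delim) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line, delim);
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    // String.split takes a regex. A negative limit keeps every empty
    // field, including trailing ones (the default limit of 0 drops
    // trailing empty strings).
    static List<String> withSplit(String line, String regex) {
        return Arrays.asList(line.split(regex, -1));
    }

    // Manual scan with indexOf: no regex involved, keeps all empty fields.
    static List<String> withIndexOf(String line, char delim) {
        List<String> out = new ArrayList<>();
        int start = 0;
        int pos;
        while ((pos = line.indexOf(delim, start)) >= 0) {
            out.add(line.substring(start, pos));
            start = pos + 1;
        }
        out.add(line.substring(start)); // last field (may be empty)
        return out;
    }

    public static void main(String[] args) {
        String line = "a,b,c,,d";
        System.out.println(withTokenizer(line, ",")); // [a, b, c, d]
        System.out.println(withSplit(line, ","));     // [a, b, c, , d]
        System.out.println(withIndexOf(line, ','));   // [a, b, c, , d]
    }
}
```

For tab-delimited input the same calls would take "\t" (or '\t' for the indexOf variant). Whether the manual loop is actually faster than String.split for a given workload is something to measure rather than assume.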