string conversion problems
Nikolay Korovaiko 2010-07-16, 01:18
Hi everyone,
I hope this is the right place for my question. If not, please, feel free to ignore it ;) and I'm sorry for any inconvenience made :(
I'm writing a simple program for enumerating triangles in directed graphs for my project. First, for each input arc (e.g. a b, b c, c a; note: a tab symbol serves as the delimiter) I want my map function to output the following pairs ([a, to_b], [b, from_a], [a_b, -1]):
public void map(LongWritable key, Text value,
                OutputCollector<Text, Text> output,
                Reporter reporter) throws IOException {
    String line = value.toString();
    String [] tokens = line.split(" ");
    output.collect(new Text(tokens[0]), new Text("to_"+tokens[1]));
    output.collect(new Text(tokens[1]), new Text("from_"+tokens[0]));
    output.collect(new Text(tokens[0]+"_"+tokens[1]), new Text("-1"));
}
Now my reduce function is supposed to cross join all pairs that have both to_'s and from_'s and to simply propagate any other pairs whose keys contain "_".
public void reduce(Text key, Iterator<Text> values,
                   OutputCollector<Text, Text> output,
                   Reporter reporter) throws IOException {
    String key_s = key.toString();
    if (key_s.indexOf("_")>0)
        output.collect(key, new Text("completed"));
    else {
        HashMap <String, ArrayList<String>> lists = new HashMap <String, ArrayList<String>> ();
        while (values.hasNext()) {
            String line = values.next().toString();
            String[] tokens = line.split("_");
            if (!lists.containsKey(tokens[0])) {
                lists.put(tokens[0], new ArrayList<String>());
            }
            lists.get(tokens[0]).add(tokens[1]);
        }
        for (String t : lists.get("to"))
            for (String f : lists.get("from"))
                output.collect(new Text(t+"_"+f), key);
    }
}
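For reference, the cross join that the else-branch is aiming for can be sketched in plain Java, without any Hadoop types. This is only an illustration under assumptions: the class and method names (CrossJoinSketch, crossJoin) are hypothetical, and the two guards (skipping values without an underscore, tolerating a missing "to" or "from" side) are additions that are not in the posted reducer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CrossJoinSketch {
    // Groups values such as "to_b" / "from_c" by their prefix and pairs every
    // "to" entry with every "from" entry, like the reducer's else-branch.
    static List<String> crossJoin(List<String> values) {
        Map<String, List<String>> lists = new HashMap<>();
        for (String value : values) {
            // Limit 2 keeps any further underscores inside the node name.
            String[] tokens = value.split("_", 2);
            if (tokens.length < 2) {
                continue; // no underscore: skip instead of indexing tokens[1]
            }
            lists.computeIfAbsent(tokens[0], k -> new ArrayList<>()).add(tokens[1]);
        }
        List<String> pairs = new ArrayList<>();
        // getOrDefault avoids a NullPointerException when one side is missing.
        for (String t : lists.getOrDefault("to", new ArrayList<>())) {
            for (String f : lists.getOrDefault("from", new ArrayList<>())) {
                pairs.add(t + "_" + f);
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // "x" has no underscore and is skipped; prints [b_c]
        System.out.println(crossJoin(Arrays.asList("to_b", "from_c", "x")));
    }
}
```

With guards like these, a malformed value no longer aborts the whole reduce call, which makes it easier to see which values are actually arriving.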
And this is where the most exciting stuff happens. tokens[1] throws an ArrayIndexOutOfBoundsException. If you scroll up, you can see that by this point the iterator should give values like "to_a", "from_b", "to_b", etc. When I just output these values, everything looks OK and I get "to_a", "from_b". But split() doesn't work at all; moreover, line.length() is always 1 and indexOf("_") returns -1! The very same indexOf WORKS PERFECTLY for keys, where we have pairs whose keys contain "_" and look like "a_b", "b_c".
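One way to narrow down a symptom like this is to print what each value really contains, character by character, rather than the string itself. The helper below is a hypothetical debugging sketch (ValueInspector and describe are made-up names, not part of Hadoop); in a reducer it could be applied to values.next().toString() and the result emitted or logged.

```java
class ValueInspector {
    // Returns the length plus each character with its Unicode code point,
    // so tabs, spaces and other invisible characters become visible.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder("length=" + s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            sb.append(String.format(" [%d]=U+%04X", i, (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: length=4 [0]=U+0074 [1]=U+006F [2]=U+005F [3]=U+0062
        System.out.println(describe("to_b"));
        // a tab shows up as U+0009
        System.out.println(describe("a\tb"));
    }
}
```

If the reported length is 1, the value reaching split("_") is not the "to_b" style string that was expected, and the code point shows which single character it is.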
I'm really puzzled with all this. MapReduce is supposed to save lives making everything simple. Instead I spent several hours to just spot this...
I'd really appreciate your help, guys!!! Thanks in advance!
Re: string conversion problems
Jeff Bean 2010-07-16, 16:16
Is the tab the delimiter between records or between keys and values on the input?
in other words does the input file look like this:
a\tb
b\tc
c\ta
or does it look like this:
a b\tb c\tc a\t
?
Jeff
Re: string conversion problems
Nikolay Korovaiko 2010-07-16, 17:16
First, thank you very much for the reply!
so, this is my input:
a\tb
b\tc
c\ta
In other words, a map function initially receives the whole string a\tb as its value, and it processes my input data correctly. To check this, I changed my reduce function to simply emit the merged pairs coming from the map. However, when I tried to cross join the cases where I have both to_'s and from_'s (for example, a reducer gets the pair <a, to_b ; from_c>) by splitting each value from the reducer's iterator with split("_"), it just didn't work. Even though without this additional logic the reducer DOES output these values <a, to_b ; from_c>, so it GETS them. The same split works just fine for keys in the reduce function, i.e. it distinguishes a composite key like "a_b" from a simple key like "a". My guess is that Hadoop sorts the values for a reducer behind the scenes and this somehow messes up the initial character encoding. I'm using the Text class as a serializable wrapper for my strings. I guess there is no other option for it?)))
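On the encoding guess: Hadoop's Text stores its contents as UTF-8, and a UTF-8 round trip cannot drop or alter an ASCII character such as the underscore. The sketch below does not use Text itself; it only imitates serialization with String.getBytes and the String constructor, and Utf8RoundTrip is a hypothetical name chosen for the illustration.

```java
import java.nio.charset.StandardCharsets;

class Utf8RoundTrip {
    // Encodes to UTF-8 bytes and decodes again, imitating serialization.
    static String roundTrip(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String value = roundTrip("to_b");
        String[] tokens = value.split("_");
        // prints: 4 to b
        System.out.println(value.length() + " " + tokens[0] + " " + tokens[1]);
    }
}
```

So if a value arrives with length 1, it is more likely that a different value was emitted upstream than that the sort phase damaged the characters.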
I wanna try to get rid of composite keys first (the last output.collect in a map function) to make things a bit simpler and test it again then.
Re: string conversion problems
Jeff Bean 2010-07-16, 20:33
Whitespace characters are funny. You showed me this code in the mapper:
String [] tokens = line.split(" ");
Which doesn't actually match for tab, which would be line.split("\t");
This would still execute, but you'd have keys and values that look right going into the reducer, but you might not catch that you have value substrings appended to the key because you didn't split correctly.
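The difference between the two delimiters is easy to check outside Hadoop. The sketch below is a standalone illustration with hypothetical names (SplitDelimiterDemo, countTokens). Note that for a line holding a single tab-separated arc, splitting on a space returns just one token, so in that case tokens[1] would already fail in the mapper.

```java
class SplitDelimiterDemo {
    // Number of tokens produced by splitting the line on the given regex.
    static int countTokens(String line, String regex) {
        return line.split(regex).length;
    }

    public static void main(String[] args) {
        String line = "a\tb"; // one arc, tab-delimited
        System.out.println(countTokens(line, " "));  // prints 1: no space to split on
        System.out.println(countTokens(line, "\t")); // prints 2: "a" and "b"
    }
}
```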
This is just from eyeballing the code. Let me know if I'm on the right track.
Jeff
Re: string conversion problems
cvkkumar 2010-07-17, 05:57
Hi,
You could also try
String [] tokens = line.split("\\s+");
Even this is by just eyeballing... Do let us know.
Regards,
CVK
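A short sketch of how "\\s+" behaves: it treats any run of spaces and tabs as one separator. One caveat is that leading whitespace produces an empty first token, so this illustration trims the line first; the trim() call and the names WhitespaceSplitDemo and tokens are assumptions added here, not part of the suggestion above.

```java
import java.util.Arrays;

class WhitespaceSplitDemo {
    // Splits on any run of whitespace; trim() prevents an empty leading token.
    static String[] tokens(String line) {
        return line.trim().split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokens("a\tb")));      // prints [a, b]
        System.out.println(Arrays.toString(tokens(" a \t b ")));  // prints [a, b]
        // Without trim(), the leading space yields an empty first token.
        System.out.println(" a b".split("\\s+").length);          // prints 3
    }
}
```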
Re: string conversion problems
Nikolay Korovaiko 2010-07-17, 06:26
Hi guys! Thank you very much for the help!
I've actually tried both "\\t" and "\\s+", but neither of them worked... (" ") might not work in some other cases, but it splits keys and values correctly in this particular one. I've also set my delimiter to a comma just to be on the safe side.
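On the escaping itself: in Java, "\t" is a string holding a real tab character, while "\\t" is a backslash followed by t, which the regex engine then reads as a tab. Both therefore split a tab-delimited line the same way, as this small sketch shows (TabEscapeDemo and sameSplit are hypothetical names).

```java
import java.util.Arrays;

class TabEscapeDemo {
    // True when the literal-tab pattern and the regex-escape pattern
    // produce identical tokens for the given line.
    static boolean sameSplit(String line) {
        return Arrays.equals(line.split("\t"), line.split("\\t"));
    }

    public static void main(String[] args) {
        System.out.println(sameSplit("a\tb"));                      // prints true
        System.out.println(Arrays.toString("a\tb".split("\\t")));   // prints [a, b]
    }
}
```

Since both forms behave identically here, a failure that persists with either of them points away from the delimiter and toward the values reaching the reducer.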
The only reason I used (" ") is that I cannot get used to the double-backslashed special symbols in Java's regexps. I'm very sorry for the confusion(((
Regards,
Nikolay