Accumulo >> mail # user >> MapReduce mapper not seeing all rows


Re: MapReduce mapper not seeing all rows
I think we found the culprit.  Nothing is wrong with the map/reduce logic
or the AccumuloRowInputFormat.

We were validating the output of the MapReduce job from the Accumulo
shell, and the output row key we were looking for happened to begin with a
quote, which made it look like there weren't any rows (
https://issues.apache.org/jira/browse/ACCUMULO-1116)

On top of that, we were recursively grepping directories of log files
that contained a special character, which made it look like the
mapper hadn't actually iterated over some rows, when in fact it had.

A confluence of (user) errors :)

Thanks for the help along the way!

Mike
On Tue, Feb 26, 2013 at 5:05 PM, Mike Hugo <[EMAIL PROTECTED]> wrote:

> Thanks Billie,
>
> The TimestampFilter is configured with an end time:
>
>         IteratorSetting timestampIterator = new IteratorSetting(1,
> "tsBefore", TimestampFilter.class);
>         TimestampFilter.setEnd(timestampIterator, endTime, true);
>
> We have validated that all the records we're interested in have a
> timestamp that's less than the end time we're passing in.  E.g. the
> timestamp being passed to the timestamp filter is
> 1361907184183 and a sample timestamp on a record in the table is
> 1361849294237.
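[Editor's note: as a sanity check on the numbers above, the per-key test an inclusive end-time filter performs reduces to `timestamp <= end`. A minimal self-contained sketch (simplified; the real TimestampFilter also supports a start bound):

```java
public class TimestampCheck {
    // Simplified version of the per-key test an end-time filter applies;
    // setEnd(iter, endTime, true) corresponds to endInclusive = true.
    static boolean accept(long timestamp, long end, boolean endInclusive) {
        return endInclusive ? timestamp <= end : timestamp < end;
    }

    public static void main(String[] args) {
        long end = 1361907184183L;    // end time passed to the filter
        long sample = 1361849294237L; // sample record timestamp from the table
        System.out.println(accept(sample, end, true)); // prints true
    }
}
```

So the sample record should indeed pass the filter, ruling the end time out as the cause.]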
>
> The only difference between the two runs is whether we set the ranges or
> not:  AccumuloRowInputFormat.setRanges(job.getConfiguration(), ranges);
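[Editor's note: for reference, a prefix range like the "foo" range mentioned later in this thread covers [prefix, followingPrefix(prefix)). A minimal sketch of the end-row computation, in the spirit of Accumulo's Range.prefix (assuming the last byte isn't 0xFF):

```java
import java.nio.charset.StandardCharsets;

public class PrefixRange {
    // Smallest row that sorts after every row starting with prefix:
    // increment the last byte (simplified; ignores the 0xFF edge case).
    static String followingPrefix(String prefix) {
        byte[] bytes = prefix.getBytes(StandardCharsets.UTF_8);
        bytes[bytes.length - 1]++;
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // A prefix scan for "foo" covers rows in ["foo", "fop").
        System.out.println(followingPrefix("foo")); // prints fop
    }
}
```
]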
>
> Running a scan from the Accumulo shell, we see all the data is there; the
> same holds for a scan via the Java API (not MapReduce, just a straight-up
> scanner), but for some reason the Mapper just never hits those rows.
>
> Is there any other visibility-type issue I might be hitting?  I don't
> think there is, as the two MapReduce runs (one with a range, one
> without) are kicked off the same way, with the same username/password, and
> by the same unix user.
>
> Any other thoughts?  I'm sure we're missing something simple but I can't
> pinpoint it.
>
> Thanks,
>
> Mike
>
>
>
> On Tue, Feb 26, 2013 at 4:45 PM, Billie Rinaldi <[EMAIL PROTECTED]> wrote:
>
>> On Tue, Feb 26, 2013 at 12:31 PM, Mike Hugo <[EMAIL PROTECTED]> wrote:
>>
>>> Our row keys are a combination of two elements, like this:
>>>
>>> foo/bar
>>> foo/baz
>>> foo/bee
>>>
>>> eee/blah
>>> eee/boo
>>>
>>> When running without any ranges set, we're missing an entire prefix
>>> worth - e.g. we don't get any rows that start with "foo"
>>>
>>
>> That sounds like a clue, because Accumulo doesn't know about the format
>> of your row keys.  If it were dropping arbitrary rows, I would expect you
>> to see some foo-prefixed rows and not others.  Are there any other
>> differences in the two runs?  How is the TimestampFilter configured?
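[Editor's note: the reasoning above can be illustrated with the sample keys from this thread. Accumulo stores rows in lexicographic order, so every row sharing a prefix occupies one contiguous span; losing an entire prefix therefore suggests a contiguous range was skipped, not that arbitrary rows were dropped:

```java
import java.util.Arrays;
import java.util.TreeSet;

public class RowOrder {
    public static void main(String[] args) {
        // Rows sort lexicographically, so the "foo/" rows form one
        // contiguous span, as do the "eee/" rows.
        TreeSet<String> rows = new TreeSet<>(Arrays.asList(
                "foo/bar", "foo/baz", "foo/bee", "eee/blah", "eee/boo"));
        System.out.println(rows);
        // prints [eee/blah, eee/boo, foo/bar, foo/baz, foo/bee]
    }
}
```
]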
>>
>> Billie
>>
>>
>>
>>>
>>> When I tried running with the range set, I did a prefix range on "foo"
>>> and it then found the rows starting with "foo"
>>>
>>>
>>> On Tue, Feb 26, 2013 at 2:28 PM, Billie Rinaldi <[EMAIL PROTECTED]> wrote:
>>>
>>>> Have you noticed any pattern in the rows it seems to be missing?  E.g.
>>>> every other row, the last row in each tablet, etc.?  When you set a range,
>>>> what range did you set?
>>>>
>>>> Billie
>>>>
>>>>
>>>>
>>>> On Tue, Feb 26, 2013 at 12:17 PM, Mike Hugo <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I'm running a MapReduce job over a table using
>>>>> AccumuloRowInputFormat.  For debugging purposes I'm logging the
>>>>> key.getRow() so I can see what rows it's finding as it progresses.
>>>>>
>>>>> If I don't specify any ranges on the input format, it skips a
>>>>> significant number of rows - that is, I don't see any logging indicating
>>>>> that it traversed them.
>>>>>
>>>>> To see if it was a visibility issue, I tried explicitly setting a
>>>>> range, like this:
>>>>>
>>>>>         AccumuloRowInputFormat.setRanges(job.getConfiguration(),
>>>>> ranges);
>>>>>
>>>>> When doing that it does process the rows that it otherwise skips.
>>>>>
>>>>> The same TimestampFilter is being applied in both scenarios, no other