Accumulo, mail # user - MapReduce mapper not seeing all rows


Re: MapReduce mapper not seeing all rows
Mike Hugo 2013-02-26, 23:05
Thanks Billie,

The TimestampFilter is configured with an end time:

        IteratorSetting timestampIterator =
            new IteratorSetting(1, "tsBefore", TimestampFilter.class);
        TimestampFilter.setEnd(timestampIterator, endTime, true);

We have validated that all the records we're interested in have a timestamp
that's less than the end time we're passing in.  For example, the end time
passed to the TimestampFilter is 1361907184183, and a sample timestamp on a
record in the table is 1361849294237.
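
To make the comparison concrete, here is a minimal sketch of the inclusive end-time check applied to the two values above. The class `EndTimeCheck` and method `passesEnd` are hypothetical stand-ins written for illustration; they are not TimestampFilter's actual source, only the comparison I would expect an inclusive end bound to perform.

```java
// Hypothetical stand-in for an end-time bound check; NOT TimestampFilter's source.
class EndTimeCheck {
    // Returns true when a key's timestamp falls at or before the end bound
    // (inclusive), or strictly before it (exclusive).
    static boolean passesEnd(long timestamp, long end, boolean endInclusive) {
        return endInclusive ? timestamp <= end : timestamp < end;
    }

    public static void main(String[] args) {
        long endTime = 1361907184183L;   // value passed to TimestampFilter.setEnd
        long sampleTs = 1361849294237L;  // timestamp on a sample record in the table
        System.out.println("sample passes end bound: " + passesEnd(sampleTs, endTime, true));
    }
}
```

If this check holds for every record of interest, the end bound alone should not account for the missing rows.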

The only difference between the two runs is whether we set the ranges or
not:  AccumuloRowInputFormat.setRanges(job.getConfiguration(), ranges);

A scan from the Accumulo shell shows that all the data is there, as does a
scan via the Java API (not MapReduce, just a plain Scanner), but for some
reason the Mapper never hits those rows.
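
One way to pin down exactly which rows go missing is to diff the row IDs the scanner returns against the row IDs the Mapper logs, grouped by prefix. The sketch below is plain Java with no Accumulo dependency. The class and method names are hypothetical, and it assumes both sets of row IDs have already been collected into string collections and that the prefix is the part of the row before the first "/".

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper for comparing scanner output with mapper logging.
class MissingRowsByPrefix {
    // Assumption: the prefix is everything before the first '/'.
    static String prefixOf(String row) {
        int slash = row.indexOf('/');
        return slash < 0 ? row : row.substring(0, slash);
    }

    // Counts, per prefix, the distinct rows the scanner saw but the mapper did not.
    static Map<String, Integer> missingByPrefix(Collection<String> scannerRows,
                                                Collection<String> mapperRows) {
        Set<String> seenByMapper = new HashSet<>(mapperRows);
        Map<String, Integer> missing = new TreeMap<>();
        for (String row : new TreeSet<>(scannerRows)) {   // de-duplicate scanner rows
            if (!seenByMapper.contains(row)) {
                missing.merge(prefixOf(row), 1, Integer::sum);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Illustrative row IDs taken from the examples in this thread.
        List<String> scannerRows =
            Arrays.asList("foo/bar", "foo/baz", "foo/bee", "eee/blah", "eee/boo");
        List<String> mapperRows = Arrays.asList("eee/blah", "eee/boo");
        System.out.println(missingByPrefix(scannerRows, mapperRows));
    }
}
```

A result concentrated in one prefix, rather than spread across many, would point toward a specific range or split being skipped instead of individual rows being filtered.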

Is there any other visibility-related issue I might be hitting?  I don't
think there is, as the two MapReduce runs (one with a range, one
without) are kicked off the same way, with the same username/password, and
by the same unix user.

Any other thoughts?  I'm sure we're missing something simple but I can't
pinpoint it.

Thanks,

Mike

On Tue, Feb 26, 2013 at 4:45 PM, Billie Rinaldi <[EMAIL PROTECTED]> wrote:

> On Tue, Feb 26, 2013 at 12:31 PM, Mike Hugo <[EMAIL PROTECTED]> wrote:
>
>> Our row keys are a combination of two elements, like this:
>>
>> foo/bar
>> foo/baz
>> foo/bee
>>
>> eee/blah
>> eee/boo
>>
>> When running without any ranges set, we're missing an entire prefix's
>> worth of rows - e.g. we don't get any rows that start with "foo"
>>
>
> That sounds like a clue, because Accumulo doesn't know about the format of
> your row keys.  If it were dropping arbitrary rows, I would expect you to
> see some foo-prefixed rows and not others.  Are there any other differences
> in the two runs?  How is the TimestampFilter configured?
>
> Billie
>
>
>
>>
>> When I tried running with the range set, I did a prefix range on "foo"
>> and it then found the rows starting with "foo"
>>
>>
>> On Tue, Feb 26, 2013 at 2:28 PM, Billie Rinaldi <[EMAIL PROTECTED]> wrote:
>>
>>> Have you noticed any pattern in the rows it seems to be missing?  E.g.
>>> every other row, the last row in each tablet, etc.?  When you set a range,
>>> what range did you set?
>>>
>>> Billie
>>>
>>>
>>>
>>> On Tue, Feb 26, 2013 at 12:17 PM, Mike Hugo <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hello,
>>>>
>>>> I'm running a map reduce job over a table using AccumuloRowInputFormat.
>>>>  For debugging purposes I'm logging the key.getRow() so I can see what rows
>>>> it's finding as it progresses.
>>>>
>>>> If I don't specify any ranges on the input format, it skips a
>>>> significant number of rows - that is, I don't see any logging
>>>> indicating that it traversed them.
>>>>
>>>> To see if it was a visibility issue, I tried explicitly setting a
>>>> range, like this:
>>>>
>>>>         AccumuloRowInputFormat.setRanges(job.getConfiguration(), ranges);
>>>>
>>>> When doing that it does process the rows that it otherwise skips.
>>>>
>>>> The same TimestampFilter is being applied in both scenarios, no other
>>>> filters / iterators are being used.
>>>>
>>>> Any thoughts on why, when run without the ranges specified, it isn't
>>>> seeing a significant portion of the data?
>>>>
>>>> Thanks,
>>>>
>>>> Mike
>>>>
>>>
>>>
>>
>