Accumulo >> mail # user >> MapReduce mapper not seeing all rows


Re: MapReduce mapper not seeing all rows
On Tue, Feb 26, 2013 at 12:31 PM, Mike Hugo <[EMAIL PROTECTED]> wrote:

> Our row keys are a combination of two elements, like this:
>
> foo/bar
> foo/baz
> foo/bee
>
> eee/blah
> eee/boo
>
> When running without any ranges set, we're missing an entire prefix's worth
> of rows, e.g. we don't get any rows that start with "foo"
>

That sounds like a clue, because Accumulo doesn't know about the format of
your row keys.  If it were dropping arbitrary rows, I would expect you to
see some foo-prefixed rows and not others.  Are there any other differences
in the two runs?  How is the TimestampFilter configured?

Billie
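
One way to check for the kind of pattern discussed above is to compare the rows logged by the two runs and tally the missing ones per prefix. The following is a minimal sketch in plain Java, not Accumulo code. The class `MissingRowReport`, its methods, and the rule that the prefix is everything before the first "/" are hypothetical, chosen only to match the row keys quoted in this thread. The "seen" list assumes, for illustration, that the eee rows did show up in the run without ranges.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class MissingRowReport {

    // Returns the part of a row key before the first '/',
    // or the whole key if it contains no '/'.
    static String prefixOf(String row) {
        int slash = row.indexOf('/');
        return slash < 0 ? row : row.substring(0, slash);
    }

    // For every prefix found in 'expected', returns a two-element array:
    // [0] = number of expected rows with that prefix,
    // [1] = number of those rows that do not appear in 'seen'.
    static Map<String, int[]> missingByPrefix(Collection<String> expected,
                                              Collection<String> seen) {
        Set<String> seenSet = new HashSet<>(seen);
        Map<String, int[]> report = new TreeMap<>();
        for (String row : expected) {
            int[] counts = report.computeIfAbsent(prefixOf(row), k -> new int[2]);
            counts[0]++;
            if (!seenSet.contains(row)) {
                counts[1]++;
            }
        }
        return report;
    }

    public static void main(String[] args) {
        // Rows logged by the run with an explicit range (sample keys from the thread).
        List<String> expected = Arrays.asList(
                "foo/bar", "foo/baz", "foo/bee", "eee/blah", "eee/boo");
        // Rows logged by the run without ranges (assumed for illustration).
        List<String> seen = Arrays.asList("eee/blah", "eee/boo");

        for (Map.Entry<String, int[]> e : missingByPrefix(expected, seen).entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue()[1]
                    + " of " + e.getValue()[0] + " rows missing");
        }
    }
}
```

If every missing row falls under a single prefix, as in this sample, that points away from arbitrary row loss and toward something that differs between the two runs.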

>
> When I tried running with the range set, I did a prefix range on "foo" and
> it then found the rows starting with "foo"
>
>
> On Tue, Feb 26, 2013 at 2:28 PM, Billie Rinaldi <[EMAIL PROTECTED]> wrote:
>
>> Have you noticed any pattern in the rows it seems to be missing?  E.g.
>> every other row, the last row in each tablet, etc.?  When you set a range,
>> what range did you set?
>>
>> Billie
>>
>>
>>
>> On Tue, Feb 26, 2013 at 12:17 PM, Mike Hugo <[EMAIL PROTECTED]> wrote:
>>
>>> Hello,
>>>
>>> I'm running a map reduce job over a table using AccumuloRowInputFormat.
>>>  For debugging purposes I'm logging the key.getRow() so I can see what rows
>>> it's finding as it progresses.
>>>
>>> If I don't specify any ranges on the input format, it skips a significant
>>> number of rows - that is, I don't see any logging indicating that it
>>> traversed them.
>>>
>>> To see if it was a visibility issue, I tried explicitly setting a range,
>>> like this:
>>>
>>>         AccumuloRowInputFormat.setRanges(job.getConfiguration(), ranges);
>>>
>>> When doing that it does process the rows that it otherwise skips.
>>>
>>> The same TimestampFilter is being applied in both scenarios, no other
>>> filters / iterators are being used.
>>>
>>> Any thoughts on why, when run without the ranges specified, it isn't
>>> seeing a significant portion of the data?
>>>
>>> Thanks,
>>>
>>> Mike
>>>
>>
>>
>
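
For readers unfamiliar with the "prefix range" mentioned in the quoted messages: over sorted row keys, it can be thought of as the interval from the prefix itself (inclusive) up to the smallest key that sorts after every key starting with that prefix (exclusive). The sketch below illustrates this with a plain `java.util.TreeSet` standing in for a sorted table. It is not Accumulo code, and the class `PrefixScan` and its methods are hypothetical. It also compares Java strings by UTF-16 code unit, whereas Accumulo sorts rows as bytes, so treat it only as an illustration of the idea.

```java
import java.util.Arrays;
import java.util.NavigableSet;
import java.util.SortedSet;
import java.util.TreeSet;

final class PrefixScan {

    // Returns the smallest string that sorts after every string starting
    // with 'prefix', by incrementing the prefix's last character.
    // Assumes a non-empty prefix whose last char is not Character.MAX_VALUE.
    static String followingPrefix(String prefix) {
        if (prefix.isEmpty()) {
            throw new IllegalArgumentException("prefix must not be empty");
        }
        char last = prefix.charAt(prefix.length() - 1);
        if (last == Character.MAX_VALUE) {
            throw new IllegalArgumentException("last character cannot be incremented");
        }
        return prefix.substring(0, prefix.length() - 1) + (char) (last + 1);
    }

    // Returns a view of the rows in [prefix, followingPrefix(prefix)).
    static SortedSet<String> rowsWithPrefix(NavigableSet<String> rows, String prefix) {
        return rows.subSet(prefix, true, followingPrefix(prefix), false);
    }

    public static void main(String[] args) {
        // Sample row keys taken from the thread.
        NavigableSet<String> rows = new TreeSet<>(Arrays.asList(
                "foo/bar", "foo/baz", "foo/bee", "eee/blah", "eee/boo"));
        System.out.println(rowsWithPrefix(rows, "foo"));
    }
}
```

With the sample keys, the interval for "foo" runs from "foo" up to, but not including, "fop", which covers the three foo rows and none of the eee rows.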