kiran chitturi 2013-02-10, 00:14
Ted Yu 2013-02-10, 00:43
kiran chitturi 2013-02-10, 00:49
lars hofhansl 2013-02-10, 02:17
kiran chitturi 2013-02-10, 02:51
Re: Inconsistent row count between mapreduce and shell count
lars hofhansl 2013-02-10, 04:38
That all looks as it should.
Unless you somehow pointed the M/R job to another cluster I have no good explanation.
Would be interesting to see whether in the absence of writes you'd always get precisely the same numbers.
(Looks like it might be the case; your 2nd run is not wildly different from the first.)
This is a bit disconcerting. Is there anything "interesting" in the logs?
Aside: For performance reasons you'd probably want to enable scanner caching for the M/R: -Dhbase.client.scanner.caching=100 (or 1000)
And also turn off speculative execution (we should do that by default): -Dmapred.map.tasks.speculative.execution=false
It might be the speculative execution that throws the job off, I am just guessing now.
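
As a rough sketch, the two options above could be combined with the rowcounter command quoted in this thread as follows. The paths, quorum hosts, and table name are copied from that command; the caching value of 100 is just the first of the two values suggested above, and the RUN switch is a hypothetical dry-run guard added for this example only.

```shell
#!/bin/sh
# Paths and quorum are taken from the command quoted in this thread;
# adjust them for your own cluster.
HADOOP=/opt/hadoop-1.0.4/bin/hadoop
HBASE_JAR=/opt/hbase-0.94.1/hbase-0.94.1.jar

# Generic -D options go before the table name.
# Caching value 100 is an assumption (1000 is the other suggested value).
CMD="$HADOOP jar $HBASE_JAR rowcounter \
-Dhbase.client.scanner.caching=100 \
-Dmapred.map.tasks.speculative.execution=false \
-Dhbase.zookeeper.quorum=LucidN1,LucidN2,LucidN3 \
documents"

# Dry run by default: print the command. Set RUN=1 (hypothetical switch
# for this sketch) to actually launch the job.
echo "$CMD"
if [ "${RUN:-0}" = "1" ]; then
  $CMD
fi
```

Printing the command first makes it easy to confirm that both -D options are in place before the job is submitted.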
From: kiran chitturi <[EMAIL PROTECTED]>
To: user <[EMAIL PROTECTED]>; lars hofhansl <[EMAIL PROTECTED]>
Sent: Saturday, February 9, 2013 6:51 PM
Subject: Re: Inconsistent row count between mapreduce and shell count
On Sat, Feb 9, 2013 at 9:17 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
>Hmm... Can you show us the exact commands you executed?
Below are the exact commands that I used.
In the HBase shell, for the table documents, I used
The mapreduce command is
/opt/hadoop-1.0.4/bin/hadoop jar /opt/hbase-0.94.1/hbase-0.94.1.jar rowcounter -Dhbase.zookeeper.quorum="LucidN1,LucidN2,LucidN3" documents
>And just to rule out the obvious:
>1. There were no writes while you did the row count?
Actually, we have a few automated programs that write tweets to the table over time, so there may be writes while the row count is running.
Should I disable writes while running the MapReduce job?
>2. In the RowCount M/R case you specified neither a range nor any columns?
>Do you always get the exact same numbers in both cases? Or do they vary?
I just ran another MapReduce job, and this time the number is 1394234. The count from the shell is 2157447.
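
For reference, a small sketch of the arithmetic on the four counts reported in this thread (shell count versus RowCounter ROWS, for both runs). The variable names are made up for this example; only the numbers come from the messages.

```shell
#!/bin/sh
# Counts reported in this thread.
SHELL_1=2152416; MR_1=1389991   # first run: shell count, RowCounter ROWS
SHELL_2=2157447; MR_2=1394234   # second run: shell count, RowCounter ROWS

# Gap between the shell count and the M/R count in each run.
GAP_1=$((SHELL_1 - MR_1))
GAP_2=$((SHELL_2 - MR_2))
echo "run 1 gap: $GAP_1"   # prints "run 1 gap: 762425"
echo "run 2 gap: $GAP_2"   # prints "run 2 gap: 763213"

# Growth of each count between the two runs (writes were ongoing).
SHELL_GROWTH=$((SHELL_2 - SHELL_1))
MR_GROWTH=$((MR_2 - MR_1))
echo "shell growth: $SHELL_GROWTH, M/R growth: $MR_GROWTH"
```

The gap stays at roughly 762,000 to 763,000 rows in both runs, while each count grows by only a few thousand, so the ongoing writes alone do not appear large enough to account for the difference.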
>----- Original Message -----
>From: kiran chitturi <[EMAIL PROTECTED]>
>To: user <[EMAIL PROTECTED]>
>Sent: Saturday, February 9, 2013 4:49 PM
>Subject: Re: Inconsistent row count between mapreduce and shell count
>Yes. I just counted the number of regions in
>'http://machine1:60010/table.jsp?name=documents' and the count is 53, which
>is equal to the number of completed tasks in Hadoop.
>On Sat, Feb 9, 2013 at 7:43 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
>> Apart from the 5 killed tasks, was the number of successful tasks equal to
>> the number of regions in your table ?
>> On Sat, Feb 9, 2013 at 4:14 PM, kiran chitturi <[EMAIL PROTECTED]> wrote:
>> > Hi!
>> > I am using HBase 0.94.1 on a distributed cluster of 20 nodes.
>> > When I ran count on a table in the HBase shell, I got a count of
>> > 2152416 rows.
>> > When I did the same thing using the rowcounter MapReduce job, I got the
>> > value below
>> > org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper$Counters
>> > 13/02/10 00:05:06 INFO mapred.JobClient: ROWS=1389991
>> > The same thing happened when I used Pig to count or do other operations.
>> > The two results are inconsistent.
>> > During the MapReduce job, I noticed that 5 tasks were
>> > killed. When I traced them back to the tasktracker logs of the node, the
>> > logs show entries similar to the ones below.
>> > 2013-02-09_23:58:58.40665 13/02/09 23:58:58 INFO mapred.TaskTracker: JVM
>> > with ID: jvm_201302090035_0015_m_1905604998 given task:
>> > attempt_201302090035_0015_m_000012_1
>> > 2013-02-09_23:59:03.57016 13/02/09 23:59:03 INFO mapred.TaskTracker:
>> > Received KillTaskAction for task: attempt_201302090035_0015_m_000012_1
>> > 2013-02-09_23:59:03.57034 13/02/09 23:59:03 INFO mapred.TaskTracker:
>> > to purge task: attempt_201302090035_0015_m_000012_1
kiran chitturi 2013-02-10, 05:46
Ted Yu 2013-02-10, 07:05