
Accumulo, mail # user - Efficient Tablet Merging [SEC=UNOFFICIAL]


Dickson, Matt MR 2013-10-02, 03:58
Eric Newton 2013-10-02, 13:05
Dickson, Matt MR 2013-10-02, 22:35
Eric Newton 2013-10-03, 03:28
Adam Fuchs 2013-10-03, 12:07
Dickson, Matt MR 2013-10-03, 06:29
Eric Newton 2013-10-03, 13:51
Dickson, Matt MR 2013-10-04, 00:43
Eric Newton 2013-10-04, 01:20
Dickson, Matt MR 2013-10-04, 03:20
Re: Efficient Tablet Merging [SEC=UNOFFICIAL]
Eric Newton 2013-10-04, 03:27
Great details... but I need to sleep.  I'll dig in more tomorrow.  Sorry!

On Thu, Oct 3, 2013 at 11:20 PM, Dickson, Matt MR
<[EMAIL PROTECTED]> wrote:
> UNOFFICIAL
>
> Hi Eric,
> Our answers are inline below (they were in blue in the original email). Just a
> note that we do have the write-ahead log disabled for ingest performance.
> We have a public holiday on Monday, so we may be delayed in our response.
>
> Cheers
> Matt
>
> ________________________________
> From: Eric Newton [mailto:[EMAIL PROTECTED]]
> Sent: Friday, 4 October 2013 11:20
>
> To: [EMAIL PROTECTED]
> Subject: Re: Efficient Tablet Merging [SEC=UNOFFICIAL]
>
> Any errors on those servers?  Each server should be checking periodically
> for compactions; some crazy errors might escape error handling, though that
> is rare these days.
> In the tserver debug log there is a recurring error: "Internal error
> processing applyUpdates
> org.apache.accumulo.server.tabletserver.HoldTimeoutException: Commits are
> held"
>
> Also found in the tserver log:
> ERROR: Failed to find midpoint Filesystem closed
> WARN: Tablet .... has too many files, batch lookup can not run
>
> Are you experiencing any table level errors?  Unable to read or write files?
> No table level errors or read errors
>
>
> How full is HDFS?
> 32%
>
> If you scan the !METADATA table, are you seeing any trend in the tablets
> that have problems?
> By getting the extent IDs of the large tablets and then finding each
> tablet's range with 'getsplits -v', I have scanned the !METADATA table and
> can see a massive number of *.rf files associated with that range.
> Is there anything in particular I should look at?
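The per-tablet file count Matt is eyeballing in the !METADATA scan can be tallied mechanically. This is a minimal sketch, not Accumulo code: it assumes each metadata entry has already been reduced to a (row, column family, qualifier) triple, that a tablet's row looks like `tableId;endRow`, and that data files are listed under a `file` column family with the file path as the qualifier. The `MetadataEntry` class, the sample rows and the threshold of 15 are hypothetical and illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MetadataFileTally {

    /** Hypothetical, simplified view of one !METADATA key: tablet row, column family, qualifier. */
    static final class MetadataEntry {
        final String row;
        final String family;
        final String qualifier;

        MetadataEntry(String row, String family, String qualifier) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
        }
    }

    /** Counts *.rf qualifiers in the assumed "file" column family, per tablet row. */
    static Map<String, Integer> countFiles(List<MetadataEntry> entries) {
        Map<String, Integer> counts = new TreeMap<>();
        for (MetadataEntry e : entries) {
            if (e.family.equals("file") && e.qualifier.endsWith(".rf")) {
                counts.merge(e.row, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Returns the tablet rows whose file count is strictly above the (hypothetical) threshold. */
    static List<String> tabletsOver(Map<String, Integer> counts, int threshold) {
        List<String> over = new ArrayList<>();
        for (Map.Entry<String, Integer> c : counts.entrySet()) {
            if (c.getValue() > threshold) {
                over.add(c.getKey());
            }
        }
        return over;
    }

    public static void main(String[] args) {
        // Hypothetical sample rows; real ones would come from scanning !METADATA.
        List<MetadataEntry> entries = new ArrayList<>();
        for (int i = 0; i < 40; i++) {
            entries.add(new MetadataEntry("2;20131003", "file", "/t-0001/F" + i + ".rf"));
        }
        entries.add(new MetadataEntry("2;20131004", "file", "/t-0002/A0.rf"));
        entries.add(new MetadataEntry("2;20131004", "loc", "tserver1:9997")); // not a file entry, ignored

        Map<String, Integer> counts = countFiles(entries);
        System.out.println("files per tablet: " + counts);
        System.out.println("tablets over threshold: " + tabletsOver(counts, 15));
    }
}
```

Using a `TreeMap` keeps the output in row order, which makes it easy to line up with the ranges printed by `getsplits -v`.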
>
> At this point, we're looking for logged anomalies, the earlier the better.
> Anything red or yellow on the monitor pages.
> I ran one of the scans that hang and then saw the following:
>
> Several "WARN Exception syncing java.lang.reflect.InvocationTargetException"
>
> Several "ERROR  Unexpected error writing to log, retrying attempt 1
>     InvocationTargetException
>     Caused by LeaseExpiredException: Lease mismatch on /accumulo/wal/...
> owned by DFSClient_NOMAPREDUCE_56390516_13 but is accessed by
> DFSClient_NOMAPREDUCE_1080760417_13"
>
> "ERROR TTransportException: java.net.SocketTimeoutException: ... while
> waiting for channel to be ready for write. ...."
>
> A bunch of "WARN Tablet 234234234 has too many files..."
>
> On Thu, Oct 3, 2013 at 8:43 PM, Dickson, Matt MR
> <[EMAIL PROTECTED]> wrote:
>>
>> UNOFFICIAL
>>
>> We have restarted the tablet servers that contain tablets with high
>> volumes of files and did not see any major compactions (majcs) run.
>>
>> Some more details are:
>> On 3 of our nodes we have 10-15 times the number of entries that are on
>> the other nodes. When I view the tablets for one of these nodes, there are
>> 2 tablets with almost 10 times the number of entries of the others.
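The "10-15 times" skew can be put on a consistent footing by dividing each server's entry count by the median across all servers, so that a value near 1 is typical and a value near 10 marks a hot server. A minimal sketch: the server names and counts below are hypothetical, not figures from the thread, and the per-server counts are assumed to have been copied from the monitor.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

public class EntrySkew {

    /** Median of the counts (mean of the two middle values when the count is even). */
    static double median(long[] counts) {
        long[] sorted = counts.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        if (n % 2 == 1) {
            return sorted[n / 2];
        }
        return (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    /** Ratio of each server's entry count to the median; values well above 1 flag hot servers. */
    static Map<String, Double> skewRatios(Map<String, Long> entriesPerServer) {
        long[] counts = entriesPerServer.values().stream().mapToLong(Long::longValue).toArray();
        double med = median(counts);
        Map<String, Double> ratios = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : entriesPerServer.entrySet()) {
            ratios.put(e.getKey(), e.getValue() / med);
        }
        return ratios;
    }

    public static void main(String[] args) {
        // Hypothetical entry counts per tablet server.
        Map<String, Long> entries = new LinkedHashMap<>();
        entries.put("tserver1", 1_000_000L);
        entries.put("tserver2", 1_100_000L);
        entries.put("tserver3", 900_000L);
        entries.put("tserver4", 12_000_000L);
        System.out.println(skewRatios(entries));
    }
}
```

The median is used rather than the mean so that the few hot servers do not inflate the baseline they are compared against.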
>>
>> When we query on the date row IDs, the queries now hang. There are
>> several scans running on the 3 nodes with the higher entry counts that are
>> not completing. Can I cancel these?
>>
>> In the logs we are getting "tablet ..... has too many files, batch lookup
>> can not run"
>>
>> At this point I'm stuck for ideas, so any suggestions would be great.
>>
>> ________________________________
>> From: Eric Newton [mailto:[EMAIL PROTECTED]]
>> Sent: Thursday, 3 October 2013 23:52
>>
>> To: [EMAIL PROTECTED]
>> Subject: Re: Efficient Tablet Merging [SEC=UNOFFICIAL]
>>
>> You should have a major compaction running if your tablet has too many
>> files.  If you don't, something is wrong. It does take some time to re-write
>> 10G of data.
>>
>> If many merges occurred on a single tablet server, you may have these
>> many-file tablets on the same server, and there are not enough major
>> compaction threads to re-write those files right away.  If that's true, you
>> may wish to restart the tablet server in order to get the tablets pushed to
>> other idle servers.
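Eric's point about a shortage of major compaction threads can be illustrated with a simple scheduling model. If many many-file tablets land on one tablet server, their compactions queue behind that server's fixed pool of threads, and the time to clear the backlog is the makespan of running those jobs on the pool. This is a minimal sketch under simplifying assumptions: every compaction's duration is known up front, jobs start in the given order on whichever thread frees up first, and the thread counts and 30-minute durations are hypothetical, not read from any real configuration.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

public class CompactionBacklog {

    /**
     * Time at which the last job finishes when jobs (durations, e.g. in minutes) are
     * started in order on whichever of {@code threads} workers becomes free first.
     */
    static long drainTime(long[] jobDurations, int threads) {
        if (threads <= 0) {
            throw new IllegalArgumentException("threads must be positive");
        }
        // Each element is the time at which one worker thread next becomes free.
        PriorityQueue<Long> freeAt = new PriorityQueue<>();
        for (int i = 0; i < threads; i++) {
            freeAt.add(0L);
        }
        long finish = 0;
        for (long d : jobDurations) {
            long start = freeAt.poll(); // earliest-free thread takes the next job
            long end = start + d;
            freeAt.add(end);
            finish = Math.max(finish, end);
        }
        return finish;
    }

    public static void main(String[] args) {
        // Hypothetical: twelve 30-minute compactions queued on one server.
        long[] jobs = new long[12];
        Arrays.fill(jobs, 30L);
        System.out.println("one server, 3 threads:    " + drainTime(jobs, 3) + " min");
        // The same jobs spread over four servers with 3 threads each (12 in total).
        System.out.println("four servers, 12 threads: " + drainTime(jobs, 12) + " min");
    }
}
```

With these hypothetical numbers the model gives 120 minutes on one server versus 30 minutes once the tablets are spread out, which is the intuition behind restarting the tablet server so its tablets migrate to idle servers.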
>>
>> Again, if you don't have major compactions running, you will want to start
Kristopher Kane 2013-10-04, 02:02
Dickson, Matt MR 2013-10-03, 03:45