Accumulo >> mail # user >> Efficient Tablet Merging [SEC=UNOFFICIAL]

Re: Efficient Tablet Merging [SEC=UNOFFICIAL]
Great details... but I need to sleep.  I'll dig in more tomorrow.  Sorry!

On Thu, Oct 3, 2013 at 11:20 PM, Dickson, Matt MR
> Hi Eric,
> Our answers are inline below. Just a note that we do have the write-ahead log
> disabled for ingest performance.
> We have a public holiday on Monday, so we may be delayed in our response.
> Cheers
> Matt
> ________________________________
> From: Eric Newton [mailto:[EMAIL PROTECTED]]
> Sent: Friday, 4 October 2013 11:20
> Subject: Re: Efficient Tablet Merging [SEC=UNOFFICIAL]
> Any errors on those servers?  Each server should be checking periodically
> for compactions, some crazy errors might escape error handling, though that
> is rare these days.
> In the tserver debug log there is a repeating error of  "Internal error
> processing applyUpdates
> org.apache.accumulo.server.tabletserver.HoldTimeoutException: Commits are
> held"
> Also found in the tserver log:
> ERROR: Failed to find midpoint Filesystem closed
> WARN: Tablet .... has too many files, batch lookup cannot run
> Are you experiencing any table level errors?  Unable to read or write files?
> No table level errors or read errors
> How full is HDFS?
> 32%
> If you scan the !METADATA table, are you seeing any trend in the tablets
> that have problems?
> By getting the extent id of the tablets that are large and then finding the
> range of that tablet by using 'getsplits -v' I have scanned the !METADATA
> table and can see a massive number of *.rf files associated with the range.
> Is there anything in particular I should look at?
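The !METADATA inspection described above can be run non-interactively from the Accumulo shell. The table id ("2"), split row ("20131003"), and user below are placeholders; substitute the values from your own `tables -l` and `getsplits -v` output:

```shell
# 1. Find the table id, which is the !METADATA row prefix:
accumulo shell -u root -e "tables -l"
# 2. List the RFiles for the oversized tablet. Rows in !METADATA are
#    "<tableId>;<endRow>", and each entry in the "file" column family
#    is one RFile backing that tablet:
accumulo shell -u root -e 'scan -t !METADATA -b "2;20131003" -e "2;20131003\x00" -c file'
# Hundreds of "file" entries for a single tablet confirm that major
# compactions are not keeping up with ingest/merges.
```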
> At this point, we're looking for logged anomalies, the earlier the better.
> Anything red or yellow on the monitor pages.
> I ran one of the scans that hang and then saw the following:
> Several "WARN Exception syncing java.lang.reflect.InvocationTargetException"
> Several "ERROR  Unexpected error writing to log, retrying attempt 1
>     InvocationTargetException
>     Caused by LeaseExpiredException: Lease mismatch on /accumulo/wal/...
> owned by DFSClient_NONMAPREDUCE_56390516_13 but is accessed by
> DFSClient_NONMAPREDUCE_1080760417_13"
> "ERROR TTransportException: java.net.SocketTimeoutException: ... while
> waiting for channel to be ready for write. ...."
> Bunch of "WARN Tablet 234234234 has too many files..."
> On Thu, Oct 3, 2013 at 8:43 PM, Dickson, Matt MR
> <[EMAIL PROTECTED]> wrote:
>> We have restarted the tablet servers that contain tablets with high
>> volumes of files and did not see any majc's run.
>> Some more details are:
>> On 3 of our nodes we have 10-15 times the number of entries that are on
>> the other nodes.  When I view the tablets for one of these nodes there are 2
>> tablets with almost 10 times the number of entries of the others.
>> When we query on the date rowids, the queries now hang. There
>> are several scans running on the 3 nodes with the higher entry counts and
>> they are not completing; can I cancel these?
>> In the logs we are getting "tablet ..... has too many files, batch lookup
>> can not run"
>> At this point I'm stuck for ideas, so any suggestions would be great.
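One way to attack the many-file tablets directly is to queue a manual major compaction over just the affected range from the shell. The table name and begin/end rows below are placeholders (take the rows from `getsplits -v`); `-w` makes the command block until the compaction finishes:

```shell
# Force a major compaction of only the overloaded range, merging the
# accumulated RFiles back down so batch lookups can run again.
accumulo shell -u root -e 'compact -t mytable -b 20131001 -e 20131004 -w'
```

This adds load while it runs, so it is best issued against one range at a time rather than the whole table.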
>> ________________________________
>> From: Eric Newton [mailto:[EMAIL PROTECTED]]
>> Sent: Thursday, 3 October 2013 23:52
>> Subject: Re: Efficient Tablet Merging [SEC=UNOFFICIAL]
>> You should have a major compaction running if your tablet has too many
>> files.  If you don't, something is wrong. It does take some time to re-write
>> 10G of data.
>> If many merges occurred on a single tablet server, you may have these
>> many-file tablets on the same server, and there are not enough major
>> compaction threads to re-write those files right away.  If that's true, you
>> may wish to restart the tablet server in order to get the tablets pushed to
>> other idle servers.
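A gentler alternative to killing the tablet server process is a graceful stop of just that one server, which lets the master reassign its tablets to the idle servers. The host and port below are placeholders (9997 is the default tserver port):

```shell
# Gracefully stop a single tablet server; its tablets (including the
# many-file ones) are reassigned across the remaining servers, which
# can then pick up the pending major compactions.
accumulo admin stop tserver-host:9997
```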
>> Again, if you don't have major compactions running, you will want to start