Hive >> mail # user >> OrcFile writing failing with multiple threads


Andrew Psaltis 2013-05-24, 18:39
Owen OMalley 2013-05-24, 20:15
Andrew Psaltis 2013-05-24, 20:28
Re: OrcFile writing failing with multiple threads
Sorry for dropping this thread for so long. Are you using the orc.Writer
interface directly or going through the OrcOutputFormat? I forgot that I
had updated the Writer to synchronize things. It looks like all of the
methods in Writer are synchronized. I haven't had a chance to investigate
this further yet.

-- Owen
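
The locking pattern discussed here (callers synchronizing on the shared writer around each addRow call) can be sketched in plain Java. Note that `UnsafeWriter` below is a hypothetical stand-in for the ORC `Writer`, not the real `org.apache.hadoop.hive.ql.io.orc` API; it exists only to show the external-synchronization pattern that applies when the writer's own methods are not synchronized.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SyncWriteSketch {
    // Hypothetical stand-in for an ORC Writer whose addRow is NOT thread-safe.
    static class UnsafeWriter {
        private long rows = 0;
        void addRow(Object row) { rows++; } // plain increment, not atomic
        long rowCount() { return rows; }
    }

    static long run(int threads, int rowsPerThread) throws InterruptedException {
        final UnsafeWriter writer = new UnsafeWriter();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < rowsPerThread; i++) {
                    // Each caller locks the shared writer before adding a row,
                    // so concurrent addRow calls never interleave.
                    synchronized (writer) {
                        writer.addRow("row");
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return writer.rowCount();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(4, 10_000)); // prints 40000
    }
}
```

Without the `synchronized (writer)` block, the unguarded increments can be lost under contention, which is the same class of race that corrupts the writer's internal dictionary structures.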
On Fri, May 24, 2013 at 1:28 PM, Andrew Psaltis <
[EMAIL PROTECTED]> wrote:

>  Here is a snippet from the file header comment of the WriterImpl for ORC:
>
>  /**
>  …………
>  * This class is synchronized so that multi-threaded access is ok. In
>  * particular, because the MemoryManager is shared between writers, this
> class
>  * assumes that checkMemory may be called from a separate thread.
>  */
>
>  And then the addRow looks like this:
>
>   public void addRow(Object row) throws IOException {
>     synchronized (this) {
>       treeWriter.write(row);
>       rowsInStripe += 1;
>       if (buildIndex) {
>         rowsInIndex += 1;
>         if (rowsInIndex >= rowIndexStride) {
>           createRowIndexEntry();
>         }
>       }
>     }
>     memoryManager.addedRow();
>   }
>
>
>  Am I missing something here about the synchronized(this)? Perhaps I am
> looking in the wrong place.
>
>  Thanks,
> agp
>
>
>   From: Owen O'Malley <[EMAIL PROTECTED]>
> Reply-To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> Date: Friday, May 24, 2013 2:15 PM
> To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> Subject: Re: OrcFile writing failing with multiple threads
>
>   Currently, ORC writers, like the Java collections API, don't lock
> themselves. You should synchronize on the writer before adding a row. I'm
> open to making the writers synchronized.
>
>  -- Owen
>
>
> On Fri, May 24, 2013 at 11:39 AM, Andrew Psaltis <
> [EMAIL PROTECTED]> wrote:
>
>>  All,
>> I have a test application that is attempting to add rows to an OrcFile
>> from multiple threads; however, every time I do, I get exceptions with
>> stack traces like the following:
>>
>>  java.lang.IndexOutOfBoundsException: Index 4 is outside of 0..5
>> at
>> org.apache.hadoop.hive.ql.io.orc.DynamicIntArray.get(DynamicIntArray.java:73)
>> at
>> org.apache.hadoop.hive.ql.io.orc.StringRedBlackTree.compareValue(StringRedBlackTree.java:55)
>> at
>> org.apache.hadoop.hive.ql.io.orc.RedBlackTree.add(RedBlackTree.java:192)
>> at
>> org.apache.hadoop.hive.ql.io.orc.RedBlackTree.add(RedBlackTree.java:199)
>> at
>> org.apache.hadoop.hive.ql.io.orc.RedBlackTree.add(RedBlackTree.java:300)
>> at
>> org.apache.hadoop.hive.ql.io.orc.StringRedBlackTree.add(StringRedBlackTree.java:45)
>> at
>> org.apache.hadoop.hive.ql.io.orc.WriterImpl$StringTreeWriter.write(WriterImpl.java:723)
>> at
>> org.apache.hadoop.hive.ql.io.orc.WriterImpl$MapTreeWriter.write(WriterImpl.java:1093)
>> at
>> org.apache.hadoop.hive.ql.io.orc.WriterImpl$StructTreeWriter.write(WriterImpl.java:996)
>> at
>> org.apache.hadoop.hive.ql.io.orc.WriterImpl.addRow(WriterImpl.java:1450)
>> at OrcFileTester$BigRowWriter.run(OrcFileTester.java:129)
>> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>> at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:722)
>>
>>
>>  Below is the source code for my sample app, which is heavily based on the
>> TestOrcFile test case using BigRow. Is there something I am doing wrong
>> here, or is this a legitimate bug in the ORC writer?
>>
>>  Thanks in advance,
>> Andrew
>>
>>
>>  ------------------------- Java app code follows
>> ---------------------------------
>>  import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.fs.FileSystem;
>> import org.apache.hadoop.fs.Path;
>> import org.apache.hadoop.hive.ql.io.orc.CompressionKind;
>
Andrew Psaltis 2013-06-11, 17:10