Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Sqoop >> mail # dev >> Re: Review Request 8559: Add Accumulo support to Sqoop


Copy link to this message
-
Re: Review Request 8559: Add Accumulo support to Sqoop

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/8559/#review27458
-----------------------------------------------------------

src/docs/user/accumulo-args.txt
<https://reviews.apache.org/r/8559/#comment53357>

    Similar to above.
    
    "Import into a test Accumulo instance"
    

src/docs/user/accumulo.txt
<https://reviews.apache.org/r/8559/#comment53358>

    Could you change this to take an argument for a single visibility to apply to everything?
    
    Or would you prefer I make it a follow on issue?

src/docs/user/accumulo.txt
<https://reviews.apache.org/r/8559/#comment53359>

    a brief bit here about what information I need in order to connect things to my Accumulo instance would help.

src/docs/user/accumulo.txt
<https://reviews.apache.org/r/8559/#comment53355>

    Since these docs are geared towards end users, I'd recommend phrasing similar to:
    
    "The Accumulo import tool has a testing mode, activated with the --accumulo-test-mode flag. When run in testing mode, the import spins up a small test instance of Accumulo rather than writing to an existing cluster."
    
    If a paragraph is added above about ocnnecting to an Accumulo instance, then also add a sentence that the zookeeper and instance name aren't needed.

src/java/org/apache/sqoop/accumulo/AccumuloMutationProcessor.java
<https://reviews.apache.org/r/8559/#comment53374>

    The settings for batch size and max latency (and maybe threads) should be optional command line args, since they can have a large impact on performance.
    
    These settings (2GB and 5 seconds) don't seem like appropriate defaults.
    
    for the max latency a default of 0 or Long.MAX_VALUE would be preferable, presuming Sqoop is going for high throughput.
    
    for max memory buffer to hold before writing, a few MB or 10s of MB should be sufficient.
    
    Might also want to switch to the non-deprecated createBatchWriter method.

src/java/org/apache/sqoop/accumulo/AccumuloMutationProcessor.java
<https://reviews.apache.org/r/8559/#comment53372>

    Instead of wrapping in RuntimeException, consider IOException

src/java/org/apache/sqoop/accumulo/AccumuloMutationProcessor.java
<https://reviews.apache.org/r/8559/#comment53371>

    Pending mutations can get rejected for reasons other than the table not being found.
    
    Since this is a failure to write, probably better to wrap in an IOException instead of RuntimeException.

src/java/org/apache/sqoop/mapreduce/AccumuloImportJob.java
<https://reviews.apache.org/r/8559/#comment53362>

    This should use MiniAccumuloCluster instead of MockInstance.
    
    http://accumulo.apache.org/1.5/accumulo_user_manual.html#_mini_accumulo_cluster
    
    It gives a much more realistic test environment for how Accumulo will behave.

src/java/org/apache/sqoop/mapreduce/AccumuloImportJob.java
<https://reviews.apache.org/r/8559/#comment53363>

    Logging at a unimportant level is preferable to swallowing.
    
    LOG.debug or LOG.info?
    
    When doing ops, I definitely care if some other process created the table while I was telling this process to do it.

src/java/org/apache/sqoop/tool/BaseSqoopTool.java
<https://reviews.apache.org/r/8559/#comment53365>

    Please don't include format fixes unrelated to this patch.

src/java/org/apache/sqoop/tool/BaseSqoopTool.java
<https://reviews.apache.org/r/8559/#comment53366>

    Please don't include format fixes unrelated to this patch.

src/java/org/apache/sqoop/tool/BaseSqoopTool.java
<https://reviews.apache.org/r/8559/#comment53367>

    Please don't include format fixes unrelated to this patch.

src/java/org/apache/sqoop/tool/BaseSqoopTool.java
<https://reviews.apache.org/r/8559/#comment53368>

    Avro and Sequence file arguments don't work with Accumulo as well, I think?
    

src/java/org/apache/sqoop/tool/ImportTool.java
<https://reviews.apache.org/r/8559/#comment53369>

    nit: trailing whitespace

src/test/org/apache/sqoop/accumulo/AccumuloTestCase.java
<https://reviews.apache.org/r/8559/#comment53375>

    why this logic for only running on Hadoop 0.20.x?
    
    Presuming keeping these only on hadoop 0.20 is needed, does ant have something analogous to Maven profiles that could instead isolate these tests to only run given a specific hadoop version?
    
    As is, this logic will falsely inflate the number of passing tests when they're actually being ignored.

src/test/org/apache/sqoop/accumulo/TestAccumuloImport.java
<https://reviews.apache.org/r/8559/#comment53379>

    I'd like to see test cases for dealing with the failure path on writing and on close, but I can't think of a good way to do that with the current Accumulo offerings. :/

src/test/org/apache/sqoop/accumulo/TestAccumuloImport.java
<https://reviews.apache.org/r/8559/#comment53376>

    Could you rewrite this to use the @Test(expects=IOException.class) form?

src/test/org/apache/sqoop/accumulo/TestAccumuloImport.java
<https://reviews.apache.org/r/8559/#comment53377>

    Could you rewrite this to use the @Test(expects=IOException.class) form?

src/test/org/apache/sqoop/accumulo/TestAccumuloQueryImport.java
<https://reviews.apache.org/r/8559/#comment53378>

    Could you rewrite this to use the @Test(expects=IOException.class) form?
- Sean Busbey
On Oct. 23, 2013, 9:24 p.m., Philip Grim wrote: