Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Pig, mail # dev - Review Request 16507: PIG-3642 Direct HDFS access for small jobs (fetch)


Copy link to this message
-
Re: Review Request 16507: PIG-3642 Direct HDFS access for small jobs (fetch)
Cheolsoo Park 2013-12-30, 21:50

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16507/#review30936
-----------------------------------------------------------
This a great work. Thank you very much!

I have few minor comments below mostly about tests.
/trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchLauncher.java
<https://reviews.apache.org/r/16507/#comment59191>

    Replace UDFContext.getUdFContext() with udfContext.

/trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchOptimizer.java
<https://reviews.apache.org/r/16507/#comment59252>

    Do you mind removing trailing white spaces in the patch?

/trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchOptimizer.java
<https://reviews.apache.org/r/16507/#comment59192>

    Can you move this to PigConfiguration?

/trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POStream.java
<https://reviews.apache.org/r/16507/#comment59194>

    Do you mind removing unused imports?
    - import java.util.LinkedList;
    - import org.apache.pig.impl.util.IdentityHashSet;
    - import org.apache.pig.pen.util.LineageTracer;

/trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POStream.java
<https://reviews.apache.org/r/16507/#comment59276>

    Do you mind adding a comment explaining about isFetchable in POStream?

/trunk/test/org/apache/pig/test/TestAssert.java
<https://reviews.apache.org/r/16507/#comment59195>

    Does this else block ever get executed given the we're running the test with opt.fetch on?
    
    I think you can do either-
    
    1) explicitly set opt.fetch to true or false in setup(),
    
    or
    
    2) change the test to run the query twice with opt.fetch on and off to ensure we're not breaking anything when opt.fetch is off.

/trunk/test/org/apache/pig/test/TestEvalPipeline2.java
<https://reviews.apache.org/r/16507/#comment59197>

    Can you change the test to not take into account the order of records instead of handling fetchEnabled separately? Since union doesn't guarantee any ordering, the test shouldn't assume it in the first place.
    

/trunk/test/org/apache/pig/test/TestEvalPipeline2.java
<https://reviews.apache.org/r/16507/#comment59248>

    Same as above. Given that opt.fetch is on in the test, the bincond will be always true.
    
    I think you can do either-
    
    1) explicitly set opt.fetch to true or false in setup(),
    
    or
    
    2) change the test to run the query twice with opt.fetch on and off.

/trunk/test/org/apache/pig/test/TestPigRunner.java
<https://reviews.apache.org/r/16507/#comment59196>

    Why is this changed? I think the default value for opt.multiquery is true.

/trunk/test/org/apache/pig/test/TestPigRunner.java
<https://reviews.apache.org/r/16507/#comment59250>

    Same here. The default value for opt.multiquery is true.
- Cheolsoo Park
On Dec. 29, 2013, 11:19 p.m., Lorand Bendig wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/16507/
> -----------------------------------------------------------
>
> (Updated Dec. 29, 2013, 11:19 p.m.)
>
>
> Review request for pig.
>
>
> Bugs: PIG-3642
>     https://issues.apache.org/jira/browse/PIG-3642
>
>
> Repository: pig
>
>
> Description
> -------
>
> With this patch I'd like to add the possibility to directly read data from HDFS instead of launching MR jobs in case of simple (map-only) tasks. Hive already has this feature (fetch). This patch shares some similarities with the local mode of Pig 0.6. Here, fetching kicks off when the following holds for a script:
>
>     it contains only LIMIT, FILTER, UNION (if no split is generated), STREAM, (nested) FOREACH with expression operators, custom UDFs..etc
>     no scalar aliases
>     no SampleLoader