Pig, mail # dev - Review Request 16507: PIG-3642 Direct HDFS access for small jobs (fetch)


Re: Review Request 16507: PIG-3642 Direct HDFS access for small jobs (fetch)
Cheolsoo Park 2014-01-05, 00:49

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16507/#review31213
-----------------------------------------------------------

Ship it!
Looks good to me. I will commit it after running unit tests and e2e tests.

I found a minor bug, noted below. I will fix it when I commit.
/trunk/src/org/apache/pig/backend/hadoop/executionengine/HExecutionEngine.java
<https://reviews.apache.org/r/16507/#comment59615>

    I think a "return" is missing here. Without it, explain still outputs the MR plan even if the plan is fetchable.
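
    A minimal sketch of the fall-through described in this comment. It is not the actual HExecutionEngine code: the class name, the method signature, and the printed strings are hypothetical and only illustrate why an early "return" is needed once the fetch plan has been printed.

```java
import java.io.PrintStream;

public class ExplainSketch {
    // Hypothetical stand-in for the explain logic; NOT the real HExecutionEngine API.
    static void explain(boolean fetchable, PrintStream out) {
        if (fetchable) {
            out.println("fetch plan");
            // Without this return, execution falls through and the MR plan
            // is printed as well, even though the plan is fetchable.
            return;
        }
        out.println("MR plan");
    }

    public static void main(String[] args) {
        explain(true, System.out);   // fetchable script: only the fetch plan is printed
        explain(false, System.out);  // non-fetchable script: the MR plan is printed
    }
}
```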

- Cheolsoo Park


On Jan. 3, 2014, 10:57 p.m., Lorand Bendig wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/16507/
> -----------------------------------------------------------
>
> (Updated Jan. 3, 2014, 10:57 p.m.)
>
>
> Review request for pig.
>
>
> Bugs: PIG-3642
>     https://issues.apache.org/jira/browse/PIG-3642
>
>
> Repository: pig
>
>
> Description
> -------
>
> With this patch I'd like to add the ability to read data directly from HDFS instead of launching MR jobs for simple (map-only) tasks. Hive already has this feature (fetch). This patch shares some similarities with the local mode of Pig 0.6. Fetching kicks in when all of the following hold for a script:
>
>     it contains only LIMIT, FILTER, UNION (if no split is generated), STREAM, (nested) FOREACH with expression operators, custom UDFs, etc.
>     no scalar aliases
>     no SampleLoader
>     single leaf job
>     DUMP (no STORE)
>
> The feature is enabled by default and can be toggled with:
>
>     -N or -no_fetch
>     set opt.fetch true/false;
>
> There's no STORE support because I wanted to make it explicit that this "optimization" is meant for launching small, simple scripts during development rather than for querying and filtering a large number of rows on the client machine. However, a threshold on the (estimated) input size could be used to decide whether to prefer fetch over MR jobs, similar to what Hive's 'hive.fetch.task.conversion.threshold' does (perhaps through Pig's LoadMetadata#getStatistics?).
>
>
> Diffs
> -----
>
>   /trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java 1555255
>   /trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/FixedWidthLoader.java 1555255
>   /trunk/src/org/apache/pig/Main.java 1555255
>   /trunk/src/org/apache/pig/PigConfiguration.java 1555255
>   /trunk/src/org/apache/pig/PigServer.java 1555255
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/HExecutionEngine.java 1555255
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchLauncher.java PRE-CREATION
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchOptimizer.java PRE-CREATION
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchPOStoreImpl.java PRE-CREATION
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchProgressableReporter.java PRE-CREATION
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java 1555255
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1555255
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POStream.java 1555255
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/util/MapRedUtil.java 1555255
>   /trunk/src/org/apache/pig/impl/io/FileLocalizer.java 1555255
>   /trunk/src/org/apache/pig/impl/util/PropertiesUtil.java 1555255
>   /trunk/src/org/apache/pig/impl/util/Utils.java 1555255
>   /trunk/src/org/apache/pig/newplan/logical/expression/ExpToPhyTranslationVisitor.java 1555255
>   /trunk/src/org/apache/pig/tools/pigstats/SimpleFetchPigStats.java PRE-CREATION
>   /trunk/test/org/apache/pig/test/TestAssert.java 1555255
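
The fetch conditions listed in the description can be read as a single predicate over the script's plan. The sketch below is only an illustration under stated assumptions: the Op enum, the isFetchable method, and its parameters are hypothetical and do not correspond to the FetchOptimizer code in the patch, which works on Pig's physical plan. The "UNION only if no split is generated" condition is simplified away here.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

public class FetchEligibilitySketch {
    // Hypothetical operator kinds; the real patch inspects Pig physical operators.
    enum Op { LOAD, LIMIT, FILTER, UNION, STREAM, FOREACH, GROUP, JOIN, ORDER }

    // Operators the description names as compatible with fetch, plus LOAD as the source.
    static final Set<Op> ALLOWED =
        EnumSet.of(Op.LOAD, Op.LIMIT, Op.FILTER, Op.UNION, Op.STREAM, Op.FOREACH);

    // All parameters are assumptions made for this sketch.
    static boolean isFetchable(List<Op> ops, boolean fetchEnabled, boolean hasScalars,
                               boolean usesSampleLoader, int leafJobs, boolean endsWithDump) {
        return fetchEnabled            // not disabled via -N / -no_fetch or opt.fetch
            && !hasScalars             // no scalar aliases
            && !usesSampleLoader       // no SampleLoader
            && leafJobs == 1           // single leaf job
            && endsWithDump            // DUMP, not STORE
            && ALLOWED.containsAll(ops);
    }

    public static void main(String[] args) {
        // LOAD -> FILTER -> LIMIT -> DUMP qualifies for fetch in this sketch.
        System.out.println(isFetchable(
            Arrays.asList(Op.LOAD, Op.FILTER, Op.LIMIT), true, false, false, 1, true));
        // A GROUP needs a reduce phase, so the script falls back to MR.
        System.out.println(isFetchable(
            Arrays.asList(Op.LOAD, Op.GROUP), true, false, false, 1, true));
    }
}
```

A script that fails any one of the checks would simply take the regular MR path, so the predicate only ever narrows the set of scripts that are fetched.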