Pig >> mail # dev >> Review Request 16507: PIG-3642 Direct HDFS access for small jobs (fetch)


Lorand Bendig 2013-12-29, 23:19
Cheolsoo Park 2013-12-30, 21:50
Lorand Bendig 2014-01-02, 14:04
Lorand Bendig 2014-01-02, 14:05
Lorand Bendig 2014-01-03, 22:57
Cheolsoo Park 2014-01-05, 00:49
Lorand Bendig 2014-01-05, 12:56
Re: Review Request 16507: PIG-3642 Direct HDFS access for small jobs (fetch)

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16507/#review31099
-----------------------------------------------------------
I have one last comment below. Other than that, everything looks good.

Also, can you document this? I think it's worth mentioning in the "Performance and Efficiency" section of the manual. You can post a doc patch in a separate jira if you'd like.
/trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchOptimizer.java
<https://reviews.apache.org/r/16507/#comment59452>

    This won't work if the temporary file storage is not InterStorage. It can be one of Inter, TFile, and SequenceFile storages.
    
    See here:
    https://github.com/apache/pig/blob/trunk/src/org/apache/pig/impl/util/Utils.java#L347
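
To illustrate the point, here is a minimal, self-contained sketch of choosing the temporary storage from configuration instead of assuming InterStorage. It does not call Pig: the `TmpStorage` enum and `resolve` method are hypothetical, and the two property names and the "tfile"/"seqfile" values are my assumption of what Pig uses, so please verify them against PigConfiguration and the Utils.java link above.

```java
import java.util.Properties;

public class TmpStorageSketch {
    // Hypothetical stand-in for the three temporary storage kinds named above.
    enum TmpStorage { INTER, TFILE, SEQFILE }

    // Assumed property names; check PigConfiguration before relying on them.
    static final String COMPRESSION_ENABLED = "pig.tmpfilecompression";
    static final String COMPRESSION_STORAGE = "pig.tmpfilecompression.storage";

    // Picks the storage kind from configuration rather than hard-coding INTER.
    static TmpStorage resolve(Properties props) {
        boolean compressed =
            Boolean.parseBoolean(props.getProperty(COMPRESSION_ENABLED, "false"));
        if (!compressed) {
            return TmpStorage.INTER;
        }
        // Simplification: any value other than "seqfile" is treated as TFILE here.
        String kind = props.getProperty(COMPRESSION_STORAGE, "tfile");
        return "seqfile".equalsIgnoreCase(kind) ? TmpStorage.SEQFILE : TmpStorage.TFILE;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        System.out.println(resolve(props));   // no compression configured

        props.setProperty(COMPRESSION_ENABLED, "true");
        System.out.println(resolve(props));   // compression on, storage left at default

        props.setProperty(COMPRESSION_STORAGE, "seqfile");
        System.out.println(resolve(props));   // compression on, sequence file requested
    }
}
```

The fetch code would then build its loader from the resolved kind rather than from a fixed class.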
    
- Cheolsoo Park
On Jan. 2, 2014, 2:05 p.m., Lorand Bendig wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/16507/
> -----------------------------------------------------------
>
> (Updated Jan. 2, 2014, 2:05 p.m.)
>
>
> Review request for pig.
>
>
> Bugs: PIG-3642
>     https://issues.apache.org/jira/browse/PIG-3642
>
>
> Repository: pig
>
>
> Description
> -------
>
> With this patch I'd like to add the possibility to directly read data from HDFS instead of launching MR jobs in case of simple (map-only) tasks. Hive already has this feature (fetch). This patch shares some similarities with the local mode of Pig 0.6. Here, fetching kicks off when the following holds for a script:
>
>     it contains only LIMIT, FILTER, UNION (if no split is generated), STREAM, (nested) FOREACH with expression operators, custom UDFs, etc.
>     no scalar aliases
>     no SampleLoader
>     single leaf job
>     DUMP (no STORE)
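
The conditions listed above could be sketched roughly as follows. This is not the code in FetchOptimizer.java: the `Op` enum, the `isFetchable` method, and its parameters are hypothetical names for illustration only, LOAD is added to the allowed set as an assumption, and the "no split is generated" check for UNION is left out.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

public class FetchEligibilitySketch {
    // Hypothetical operator tags; these are not Pig's physical operator classes.
    enum Op { LOAD, LIMIT, FILTER, UNION, STREAM, FOREACH, GROUP, JOIN, ORDER, DUMP, STORE }

    // Operators the description lists as fetchable (LOAD and DUMP added by assumption).
    static final Set<Op> ALLOWED = EnumSet.of(
        Op.LOAD, Op.LIMIT, Op.FILTER, Op.UNION, Op.STREAM, Op.FOREACH, Op.DUMP);

    static boolean isFetchable(List<Op> plan, int leafCount, boolean hasScalarAliases,
                               boolean usesSampleLoader, boolean fetchEnabled) {
        // The feature switch, single leaf, no scalar aliases, no SampleLoader.
        if (!fetchEnabled || leafCount != 1 || hasScalarAliases || usesSampleLoader) {
            return false;
        }
        // The script must end in DUMP, not STORE.
        if (plan.isEmpty() || plan.get(plan.size() - 1) != Op.DUMP) {
            return false;
        }
        // Every operator must be one of the simple, map-only ones.
        return ALLOWED.containsAll(plan);
    }

    public static void main(String[] args) {
        List<Op> simple = Arrays.asList(Op.LOAD, Op.FILTER, Op.LIMIT, Op.DUMP);
        List<Op> grouped = Arrays.asList(Op.LOAD, Op.GROUP, Op.FOREACH, Op.DUMP);
        List<Op> stored = Arrays.asList(Op.LOAD, Op.FILTER, Op.STORE);

        System.out.println(isFetchable(simple, 1, false, false, true));
        System.out.println(isFetchable(grouped, 1, false, false, true));
        System.out.println(isFetchable(stored, 1, false, false, true));
    }
}
```

Under these assumptions only the first plan qualifies; the GROUP plan and the STORE plan would still go through MR.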
>
> The feature is enabled by default and can be toggled with:
>
>     -N or -no_fetch
>     set opt.fetch true/false;
>
> There's no STORE support because I wanted to make it explicit that this "optimization" is for launching small/simple scripts during development, rather than for querying and filtering a large number of rows on the client machine. However, a threshold could be given on the (estimated) input size to determine whether to prefer fetch over MR jobs, similar to what Hive's 'hive.fetch.task.conversion.threshold' does (perhaps through Pig's LoadMetadata#getStatistics?).
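
The threshold idea could look something like the sketch below. Nothing here is part of the patch: `preferFetch`, the default value, and the convention that a negative estimate means "size unknown" are all assumptions made for illustration.

```java
public class FetchThresholdSketch {
    // Hypothetical default; the description does not propose a value.
    static final long DEFAULT_THRESHOLD_BYTES = 100L * 1024 * 1024;

    // estimatedInputBytes < 0 is used here to mean "no estimate available".
    static boolean preferFetch(long estimatedInputBytes, long thresholdBytes) {
        if (estimatedInputBytes < 0) {
            // Without an estimate, stay conservative and fall back to MR.
            return false;
        }
        return estimatedInputBytes <= thresholdBytes;
    }

    public static void main(String[] args) {
        System.out.println(preferFetch(4L * 1024 * 1024, DEFAULT_THRESHOLD_BYTES));
        System.out.println(preferFetch(2L * 1024 * 1024 * 1024, DEFAULT_THRESHOLD_BYTES));
        System.out.println(preferFetch(-1, DEFAULT_THRESHOLD_BYTES));
    }
}
```

Whether an unknown size should fall back to MR or to fetch is a design choice; the sketch picks MR so that a missing estimate never pulls a large input onto the client.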
>
>
> Diffs
> -----
>
>   /trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java 1554785
>   /trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/FixedWidthLoader.java 1554785
>   /trunk/src/org/apache/pig/Main.java 1554785
>   /trunk/src/org/apache/pig/PigConfiguration.java 1554785
>   /trunk/src/org/apache/pig/PigServer.java 1554785
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/HExecutionEngine.java 1554785
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchLauncher.java PRE-CREATION
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchOptimizer.java PRE-CREATION
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchPOStoreImpl.java PRE-CREATION
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchProgressableReporter.java PRE-CREATION
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java 1554785
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1554785
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POStream.java 1554785
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/util/MapRedUtil.java 1554785
>   /trunk/src/org/apache/pig/impl/util/PropertiesUtil.java 1554785
>   /trunk/src/org/apache/pig/newplan/logical/expression/ExpToPhyTranslationVisitor.java 1554785
Lorand Bendig 2014-01-03, 23:00