Pig, mail # user - Loading CSV Files & LOAD large files behavior in local mode


Re: Loading CSV Files & LOAD large files behavior in local mode
Thejas M Nair 2010-08-20, 14:24
To clarify what Jeff said: in your case, intermediate data before the join
will be stored to disk only if the operations before the join require a
separate map-reduce job.
If the operations between the load and the join are non-blocking, such as a
filter or foreach, then the data will be streamed through them and won't
need to be stored on disk.
-Thejas
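
To illustrate the distinction above, here is a minimal Pig Latin sketch. The
file paths, field names, and the age threshold are all made up for the
example, and the exact number of jobs Pig plans may vary by version and
optimizer settings, so treat it as an illustration rather than a guarantee.

```pig
-- Hypothetical inputs: paths and schemas are assumptions for this sketch.
users  = LOAD 'users.txt'  USING PigStorage(',') AS (id:int, name:chararray, age:int);
orders = LOAD 'orders.txt' USING PigStorage(',') AS (user_id:int, amount:double);

-- FILTER and FOREACH are non-blocking: they can run in the map phase of the
-- join's own job, so their output is streamed instead of being stored first.
adults = FILTER users BY age >= 18;
slim   = FOREACH adults GENERATE id, name;

-- GROUP is blocking: it needs its own reduce phase, so Pig would typically
-- run it as a separate map-reduce job and write its result to a temporary
-- location before the join reads it.
by_user = GROUP orders BY user_id;
totals  = FOREACH by_user GENERATE group AS user_id, SUM(orders.amount) AS total;

joined = JOIN slim BY id, totals BY user_id;
STORE joined INTO 'joined_out';
```

Running `EXPLAIN joined;` in the Grunt shell prints the map-reduce plan, which
is one way to check how many jobs Pig intends to run for a script like this.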

On 8/20/10 12:40 AM, "Jeff Zhang" <[EMAIL PROTECTED]> wrote:

> Actually, the intermediate data won't be stored in memory. It will be
> stored in a tmp directory on HDFS, and Pig will clean up the
> intermediate data when the job is finished.
>
> Yes, BinStorage is a binary format for storing intermediate data, and it
> knows how to deserialize that data back into tuples.
>
> On Fri, Aug 20, 2010 at 3:35 PM, Defenestrator
> <[EMAIL PROTECTED]> wrote:
>> Right, in cases where you have to load multiple large relations and then do
>> some processing on each relation (filtering, aggregation) before joining
>> them together, one wouldn't want to have all of the relations and
>> intermediate state in memory before the join.
>>
>> So is BinStorage just storing the Tuples in an internal binary format that
>> is easily converted back to a Tuple when loaded (i.e. no csv parsing
>> necessary)?
>>
>> Thanks.
>>
>> On Fri, Aug 20, 2010 at 12:06 AM, Jeff Zhang <[EMAIL PROTECTED]> wrote:
>>
>>> What do you mean by "multiple relations with many tuples"? Do you mean
>>> joining multiple data sets?
>>> Also, Pig uses BinStorage for storing intermediate data.
>>>
>>>
>>> On Fri, Aug 20, 2010 at 2:42 PM, Defenestrator
>>> <[EMAIL PROTECTED]> wrote:
>>>> Thanks, Jeff.
>>>>
>>>> A quick follow-up question about loading/storing data: what is the best
>>>> practice when dealing with multiple relations with many tuples? Do
>>>> people typically STORE intermediate relations to minimize memory usage
>>>> and reload the intermediate data for use later on in the same script?
>>>> I ask because I noticed that tuples are written out using the
>>>> TupleFormat, which outputs text with additional parentheses, so a
>>>> subsequent PigStorage LOAD would get extra parenthesis characters, right?
>>>>
>>>> On Thu, Aug 19, 2010 at 1:50 AM, Jeff Zhang <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> I am afraid you will have to write your own LoadFunc to interpret the text.
>>>>> As of Pig 0.7, local mode uses Hadoop's standalone local mode,
>>>>> so it won't store all the data in memory; the data will be read
>>>>> as a stream. However, this mode needs more memory because each task is
>>>>> executed in another JVM.
>>>>>
>>>>>
>>>>> On Thu, Aug 19, 2010 at 12:48 AM, Defenestrator
>>>>> <[EMAIL PROTECTED]> wrote:
>>>>>> What loader should I use on CSV files with quoted strings that contain
>>>>>> embedded commas?  (I.e., embedded commas should not be treated as
>>>>>> separators.)
>>>>>>
>>>>>> And when LOADing large files in local mode, does Pig just store them all
>>>>>> in memory?  Or does it have memory management a la buffer managers in
>>>>>> DBMSs?
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Best Regards
>>>>>
>>>>> Jeff Zhang