-
Loading CSV Files & LOAD large files behavior in local mode
Defenestrator 2010-08-19, 07:48
What loader should I use on csv files with quoted strings that contain embedded commas? (i.e. Embedded commas should not be a separator.)
And when LOADing large files in local mode, does Pig just store it all in memory? Or does it have memory management ala buffer managers in DBMS's?
-
Re: Loading CSV Files & LOAD large files behavior in local mode
Jeff Zhang 2010-08-19, 08:50
I am afraid you will have to write your own LoadFunc to interpret the text. Since Pig 0.7, local mode uses Hadoop's standalone local mode, so it won't store all the data in memory; the data is read as a stream. However, this mode needs more memory because each task is executed in another JVM.
-- Best Regards
Jeff Zhang
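Editor's sketch for the custom-loader suggestion above: the core of such a LoadFunc is a line splitter that treats commas inside double quotes as data. The class `QuotedCsvSplitter` and its `split` method are hypothetical names, not part of Pig, and the sketch assumes one record per line (no newlines inside quoted fields) with `""` as the escape for a literal quote.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not a Pig class): splits one CSV line, honoring double quotes. */
public class QuotedCsvSplitter {

    public static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // doubled quote is a literal quote
                        i++;
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);       // a comma here is data, not a separator
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());      // last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) {
        // Print one field per line so embedded commas stay visible.
        for (String field : split("1,\"Smith, John\",NY")) {
            System.out.println(field);
        }
    }
}
```

In a real loader, something like `split` would presumably be called on each line the loader reads, with the resulting fields copied into a tuple; that wiring depends on the Pig version and is left out here.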
-
Re: Loading CSV Files & LOAD large files behavior in local mode
Defenestrator 2010-08-20, 06:42
Thanks, Jeff.
A quick follow-up question about loading and storing data: what is the best practice when dealing with multiple relations with many tuples? Do people typically STORE intermediate relations to minimize memory usage and then re-LOAD the intermediate data for use later in the same script? I ask because I noticed that tuples are written out using the TupleFormat, which outputs text with additional parentheses, so a subsequent PigStorage LOAD would get extra parenthesis characters, right?
-
Re: Loading CSV Files & LOAD large files behavior in local mode
Jeff Zhang 2010-08-20, 07:06
What do you mean by "multiple relations with many tuples"? Do you mean joining multiple data sets? Also, Pig uses BinStorage for storing intermediate data.
-- Best Regards
Jeff Zhang
-
Re: Loading CSV Files & LOAD large files behavior in local mode
Defenestrator 2010-08-20, 07:35
Right, I mean cases where you have to load multiple large relations and then do some processing on each relation (filtering, aggregation) before joining them together. One wouldn't want to have all of the relations and intermediate state in memory before the join.
So is BinStorage just storing the Tuples in an internal binary format that is easily converted back to a Tuple when loaded (i.e. no csv parsing necessary)?
Thanks.
-
Re: Loading CSV Files & LOAD large files behavior in local mode
Jeff Zhang 2010-08-20, 07:40
Actually, the intermediate data won't be stored in memory. It will be stored in a tmp directory on HDFS, and Pig will clean up the intermediate data for you when the job is finished.
Yes, BinStorage is a binary format for storing intermediate data, and it knows how to deserialize the data back into tuples.
-- Best Regards
Jeff Zhang
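Editor's sketch of why a binary intermediate format sidesteps the parenthesis and comma problem raised earlier in the thread: fields are written with explicit lengths, so nothing has to be re-parsed from text on reload. The layout below (a field count followed by length-prefixed strings) is an assumption made purely for illustration; it is NOT Pig's actual BinStorage format, and `BinaryRoundTrip` is a hypothetical class.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; this is not the BinStorage on-disk layout. */
public class BinaryRoundTrip {

    /** Writes the field count, then each field as a length-prefixed string. */
    public static byte[] write(List<String> tuple) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(tuple.size());
            for (String field : tuple) {
                out.writeUTF(field);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the fields back; no delimiter or parenthesis parsing is involved. */
    public static List<String> read(byte[] data) {
        List<String> tuple = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                tuple.add(in.readUTF());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tuple;
    }

    public static void main(String[] args) {
        // Fields containing commas and parentheses survive unchanged.
        List<String> original = Arrays.asList("1", "Smith, John", "(nested)");
        System.out.println(read(write(original)).equals(original)); // prints true
    }
}
```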
-
Re: Loading CSV Files & LOAD large files behavior in local mode
Thejas M Nair 2010-08-20, 14:24
To clarify what Jeff said, intermediate data before the join in your case will be stored to disk only if the operations before the join require a separate map-reduce job. If the operations between the load and the join are non-blocking, such as a filter or foreach, then the data will be streamed through them and won't need to be stored on disk. -Thejas
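Editor's sketch of the blocking versus non-blocking distinction, as a plain Java analogy rather than Pig's actual execution engine. A filter can hand each record on as soon as it has seen it, whereas a grouping step cannot finish any group until the whole input has been consumed. The class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.TreeMap;
import java.util.function.Predicate;

/** Hypothetical analogy; not how Pig implements its operators. */
public class StreamingVsBlocking {

    /** Non-blocking: yields matching records one at a time, holding at most one. */
    public static <T> Iterator<T> filter(Iterator<T> input, Predicate<T> keep) {
        return new Iterator<T>() {
            private T pending;
            private boolean ready;

            @Override
            public boolean hasNext() {
                while (!ready && input.hasNext()) {
                    T candidate = input.next();
                    if (keep.test(candidate)) {
                        pending = candidate;
                        ready = true;
                    }
                }
                return ready;
            }

            @Override
            public T next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                ready = false;
                return pending;
            }
        };
    }

    /** Blocking: no group is complete until the entire input has been read. */
    public static Map<Integer, List<String>> groupByLength(Iterator<String> input) {
        Map<Integer, List<String>> groups = new TreeMap<>();
        while (input.hasNext()) {
            String s = input.next();
            groups.computeIfAbsent(s.length(), k -> new ArrayList<>()).add(s);
        }
        return groups;
    }

    public static void main(String[] args) {
        Iterator<String> words = Arrays.asList("a", "bb", "ccc", "dd").iterator();
        // The filter streams into the grouping step without materializing its output.
        System.out.println(groupByLength(filter(words, s -> s.length() > 1)).keySet());
    }
}
```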