Pig, mail # user - Mixed input formats in LOAD path


Re: Mixed input formats in LOAD path
Johannes Schwenk 2012-06-15, 16:05
Well, I don't consider this data format migration strategy to be a
hack. The only thing that is somewhat "hacky" and definitely not elegant
is having the logger create an empty file for each known format!

Do you have any advice on how to design our Pig scripts so that they
account for migration situations like the one described in my earlier mail?

Thanks,
Johannes
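
One way to avoid the empty placeholder files is to generate the Pig script from whatever formats are actually present. The following is a minimal sketch, not a tested production setup. The `.tsv` and `.avro` extensions, the relation names, the function names, and the piggybank `AvroStorage` class path are all assumptions that do not come from the thread. It checks a local directory, as in the `pig -x local` session quoted below. On HDFS the existence check would need a Hadoop filesystem command instead.

```shell
#!/usr/bin/env bash
# Sketch: emit LOAD statements only for formats that have files in DIR.
# Hypothetical names: has_files, emit_pig_script, relations tsv/avro/logs.

# has_files DIR EXT: succeeds if at least one DIR/*.EXT exists
has_files() {
    local dir=$1 ext=$2 f
    for f in "$dir"/*."$ext"; do
        # an unmatched glob stays literal, so test that the path exists
        [ -e "$f" ] && return 0
    done
    return 1
}

# emit_pig_script DIR: print a Pig script to stdout, fail if DIR has no input
emit_pig_script() {
    local dir=$1 joined
    local relations=()
    if has_files "$dir" tsv; then
        # PigStorage splits on tab by default
        echo "tsv = LOAD '$dir/*.tsv' USING PigStorage();"
        relations+=(tsv)
    fi
    if has_files "$dir" avro; then
        # assumed loader class; it must be REGISTERed in the real script
        echo "avro = LOAD '$dir/*.avro' USING org.apache.pig.piggybank.storage.avro.AvroStorage();"
        relations+=(avro)
    fi
    case ${#relations[@]} in
        0) echo "no input files in $dir" >&2; return 1 ;;
        1) echo "logs = FOREACH ${relations[0]} GENERATE *;" ;;
        *) joined=$(IFS=,; echo "${relations[*]}")
           echo "logs = UNION ${joined//,/, };" ;;
    esac
}
```

A wrapper could run `emit_pig_script "$LOG_PATH" > load_logs.pig` and then pass the generated file to `pig`. The two relations will generally have different schemas, so a real script would probably need to align their fields before the UNION.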

On 15.06.2012 15:55, Ruslan Al-Fakikh wrote:
> Hey,
>
> You can keep a single empty file per format. That way Pig won't fail.
> But basically I recommend avoiding situations that need hacks or
> custom formats. In my experience, you'll soon get into trouble
> with that.
>
> Thanks
>
> On Fri, Jun 15, 2012 at 5:39 PM, Johannes Schwenk
> <[EMAIL PROTECTED]> wrote:
>> Thanks a lot Ruslan, that seems like one possible direction!
>>
>> One thing remains to be resolved: I don't know whether I will get
>> Avro in the input, or CSV, or TSV, or all of them... So how could I get
>> Pig not to choke on missing input files?
>>
>> Johannes
>>
>> On 15.06.2012 15:24, Ruslan Al-Fakikh wrote:
>>> I guess you could use globbing to select the files by extension,
>>> like this:
>>> $ ls
>>> input.avro  input.txt
>>> $ cat input.avro
>>> avro1
>>> avro2
>>> $ cat input.txt
>>> txt1
>>> txt2
>>>
>>> [cloudera@localhost workpig]$ pig -x local
>>> 2012-06-15 17:21:09,613 [main] INFO  org.apache.pig.Main - Logging
>>> error messages to: /home/cloudera/workpig/pig_1339766469585.log
>>> 2012-06-15 17:21:09,892 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
>>> Connecting to hadoop file system at: file:///
>>> grunt> txt = LOAD '*.txt';
>>> grunt> avro = LOAD '*.avro';
>>> grunt> result = UNION txt, avro;
>>> grunt> DUMP result;
>>> (txt1)
>>> (txt2)
>>> (avro1)
>>> (avro2)
>>>
>>> Please note that the input.avro file in this example is actually plain
>>> text, not Avro. For real Avro files you'll need to use the Avro loader
>>> in the LOAD statement.
>>>
>>> Ruslan
>>>
>>> On Fri, Jun 15, 2012 at 4:52 PM, Johannes Schwenk
>>> <[EMAIL PROTECTED]> wrote:
>>>> Hi Ruslan,
>>>>
>>>> thanks for your answer!
>>>>
>>>> I only have the input path, but do not know which file formats the
>>>> different files in that path have. All files in the path belong to
>>>> one relation, however, so I want to load them at once. A union of
>>>> separately loaded files would be OK too, if that is possible to
>>>> achieve. The important thing is that the LOAD automatically takes
>>>> care of the different formats.
>>>>
>>>> To illustrate further, consider the following scenario:
>>>>
>>>> 1. Our logging system writes log data to LOG_PATH.
>>>> 2. The current format is tab separated values.
>>>> 3. We LOAD '$LOG_PATH'
>>>> 4. We switch to Avro format and have to migrate.
>>>> 5. The migration cannot happen instantly, so at some point in time
>>>> some files in LOG_PATH might still be in TSV format while others
>>>> have already been switched to Avro.
>>>>
>>>> Thanks,
>>>> Johannes
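
When the files in LOG_PATH carry no distinguishing extension, the format can be sniffed from the content, because Avro object container files begin with the four bytes `O`, `b`, `j`, 0x01. The sketch below assumes local files and treats everything that is not Avro as "other". The function names `is_avro` and `classify` are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: classify files by content instead of by extension.

# is_avro FILE: succeeds if FILE starts with the Avro container magic bytes
is_avro() {
    local magic
    # dump the first four bytes as hex and strip od's spacing
    magic=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
    [ "$magic" = "4f626a01" ]
}

# classify DIR: print "avro<TAB>path" or "other<TAB>path" per regular file
classify() {
    local f
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        if is_avro "$f"; then
            printf 'avro\t%s\n' "$f"
        else
            printf 'other\t%s\n' "$f"
        fi
    done
}
```

The resulting lists could then feed separate LOAD statements, one per format, followed by a UNION as suggested earlier in the thread.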
>>>>
>>>> On 15.06.2012 14:37, Ruslan Al-Fakikh wrote:
>>>>> Hi Johannes,
>>>>>
>>>>> I guess you'd have to write a custom Loader for such a situation, but
>>>>> why do you need to load everything in one pass? You can load different
>>>>> types of files separately (using multiple LOAD statements) and do a
>>>>> join or a union afterwards.
>>>>>
>>>>> Ruslan
>>>>>
>>>>> On Fri, Jun 15, 2012 at 4:13 PM, Johannes Schwenk
>>>>> <[EMAIL PROTECTED]> wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> is it possible to have an input path (as the parameter to a LOAD statement)
>>>>>> that contains several files in *different formats*, say serialized Avro
>>>>>> data and tab-separated values, and make Pig read the data into one alias?
>>>>>> I guess I have to write a UDF for this? How should I start? Can you
>>>>>> sketch out a rough plan on how to proceed?
>>>>>>
>>>>>>
>>>>>> Greetings,
>>>>>> Johannes Schwenk
>>>>>>
>>>>>> --
>>>>>> Software Developer (Reporting)
>>>>>> ________________________________________________________

Johannes Schwenk

Software Developer (Reporting)
________________________________________________________

ADITION technologies AG
Schwarzwaldstraße 78b
79117 Freiburg

http://www.adition.com
