Pig, mail # user - Mixed input formats in LOAD path


Re: Mixed input formats in LOAD path
Ruslan Al-Fakikh 2012-06-15, 13:55
Hey,

You can keep a single empty file per format. That way Pig won't fail.
But in general I recommend avoiding situations that need hacks or
custom formats. In my experience, you'll soon run into trouble
with them.

Thanks

On Fri, Jun 15, 2012 at 5:39 PM, Johannes Schwenk
<[EMAIL PROTECTED]> wrote:
> Thanks a lot Ruslan, that seems one possible direction!
>
> One thing remains to be resolved: I don't know whether the input will
> contain Avro, CSV, TSV, or all of them... So how can I keep Pig from
> choking on missing input files?
>
> Johannes
>
> On 15.06.2012 15:24, Ruslan Al-Fakikh wrote:
>> I guess you could use globbing for extracting the files by extensions,
>> like this:
>> $ ls
>> input.avro  input.txt
>> $ cat input.avro
>> avro1
>> avro2
>> $ cat input.txt
>> txt1
>> txt2
>>
>> [cloudera@localhost workpig]$ pig -x local
>> 2012-06-15 17:21:09,613 [main] INFO  org.apache.pig.Main - Logging
>> error messages to: /home/cloudera/workpig/pig_1339766469585.log
>> 2012-06-15 17:21:09,892 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
>> Connecting to hadoop file system at: file:///
>> grunt> txt = LOAD '*.txt';
>> grunt> avro = LOAD '*.avro';
>> grunt> result = UNION txt, avro;
>> grunt> DUMP result;
>> (txt1)
>> (txt2)
>> (avro1)
>> (avro2)
>>
>> Please note that the input.avro file in this example is not actually
>> Avro, just plain text. For real Avro data you'll need to use the Avro
>> loader in the LOAD statement.
>>
>> Ruslan
>>
>> On Fri, Jun 15, 2012 at 4:52 PM, Johannes Schwenk
>> <[EMAIL PROTECTED]> wrote:
>>> Hi Ruslan,
>>>
>>> Thanks for your answer!
>>>
>>> I only have the input path and do not know which file formats the
>>> files in that path have. However, all files in the path belong to
>>> one relation, so I want to load them at once. A union of separately
>>> loaded files would be OK too, if that is possible. What matters is
>>> that the LOAD automatically takes care of the different formats.
>>>
>>> To illustrate further, consider the following scenario:
>>>
>>> 1. Our logging system writes log data to LOG_PATH.
>>> 2. The current format is tab separated values.
>>> 3. We LOAD '$LOG_PATH'
>>> 4. We switch to Avro format and have to migrate.
>>> 5. The migration cannot happen instantly, so at some point in time
>>> some files in LOG_PATH may still be in TSV format while others have
>>> already been switched to Avro.
>>>
>>> Thanks,
>>> Johannes
>>>
>>> On 15.06.2012 14:37, Ruslan Al-Fakikh wrote:
>>>> Hi Johannes,
>>>>
>>>> I guess you'd have to write a custom Loader for such a situation, but
>>>> why do you need to load everything in one pass? You can load different
>>>> types of files separately (having multiple LOAD statements) and make a
>>>> join or a union afterwards.
>>>>
>>>> Ruslan
>>>>
>>>> On Fri, Jun 15, 2012 at 4:13 PM, Johannes Schwenk
>>>> <[EMAIL PROTECTED]> wrote:
>>>>> Hi all,
>>>>>
>>>>> Is it possible to have an input path (as the parameter to a LOAD statement)
>>>>> that contains several files in *different formats*, say serialized Avro
>>>>> data and tab-separated values, and have Pig read the data into one alias?
>>>>> I guess I have to write a UDF for this? How should I start? Can you
>>>>> sketch a rough plan for how to proceed?
>>>>>
>>>>>
>>>>> Greetings,
>>>>> Johannes Schwenk
>>>>>
>>>>> --
>>>>> Software Developer (Reporting)
>>>>> ________________________________________________________
>>>>>
>>>>> ADITION technologies AG
>>>>> Schwarzwaldstraße 78b
>>>>> 79117 Freiburg
>>>>>
>>>>> http://www.adition.com
>>>>>
>>>>> T +49 / (0)761 / 88147 - 30
>>>>> F +49 / (0)761 / 88147 - 77
>>>>> SUPPORT +49  / (0)1805 - ADITION
>>>>>
>>>>> (Landline rate 14 ct/min; mobile rates max. 42 ct/min)
>>>>>
>>>>> Registered at the Düsseldorf District Court (Amtsgericht Düsseldorf) under HRB 54076
>>>>> Executive Board: Andreas Kleiser, Jörg Klekamp, Tihomir Perkovic, Marcus Schlüter
>>>>> Chairman of the Supervisory Board: Attorney Daniel Raimer
>>>>
Best Regards,
Ruslan Al-Fakikh
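
For readers following the thread, the "separate LOADs plus UNION" idea could look roughly like the sketch below once a real Avro loader is used. This is only an illustration under several assumptions that are not from the thread: that the files can be told apart by the extensions .tsv and .avro, that the piggybank AvroStorage loader and its dependency jars are available (the jar names shown are placeholders), and that LOG_PATH is passed in as a parameter.

```pig
-- Sketch only; jar names and file extensions are assumptions.
-- piggybank's AvroStorage also needs the Avro, Jackson and json-simple jars on the classpath.
REGISTER piggybank.jar;
REGISTER avro.jar;
REGISTER json-simple.jar;

-- Old-format files: tab-separated values (tab is also PigStorage's default delimiter).
tsv = LOAD '$LOG_PATH/*.tsv' USING PigStorage('\t');

-- Migrated files: Avro, with the schema read from the files themselves.
avro = LOAD '$LOG_PATH/*.avro'
       USING org.apache.pig.piggybank.storage.avro.AvroStorage();

-- UNION does not reconcile differing schemas, so in practice both relations
-- should first be projected to the same fields.
result = UNION tsv, avro;
DUMP result;
```

It could be invoked with something like `pig -param LOG_PATH=/some/log/dir script.pig`. If one of the globs matches no files, the LOAD will still fail, which is the problem the empty placeholder file per format is meant to work around.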