Pig >> mail # user >> Mixed input formats in LOAD path


Johannes Schwenk 2012-06-15, 12:13
Ruslan Al-Fakikh 2012-06-15, 12:37
Johannes Schwenk 2012-06-15, 12:52
Ruslan Al-Fakikh 2012-06-15, 13:24
Re: Mixed input formats in LOAD path
Thanks a lot Ruslan, that seems like one possible direction!

One thing remains to be resolved: I don't know whether the input will
contain Avro, CSV, TSV, or all of them. So how can I keep Pig from
choking on missing input files?

Johannes
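
[Editor's sketch] Pig Latin has no conditional control flow, so one way to avoid a LOAD on a glob that matches nothing is to let a small driver script look at the directory first and emit LOAD statements only for the formats that are actually present. The following is a minimal sketch of that idea, not a tested production recipe. Assumptions: the files live on the local file system (as in Ruslan's `pig -x local` session; on HDFS the listing would have to go through Hadoop instead), anything that is not Avro is treated as tab-separated text, and the names `sniff_format`, `build_pig_script`, and `LOADERS` are hypothetical. The format check relies on the fact that Avro object container files begin with the four bytes `Obj\x01`. The piggybank `AvroStorage` class also needs the piggybank and Avro jars registered in the real script.

```python
import glob
import os

# First four bytes of every Avro object container file.
AVRO_MAGIC = b"Obj\x01"

# Assumption: loader expressions per format. PigStorage() defaults to
# tab-delimited input; the piggybank AvroStorage needs its jars REGISTERed.
LOADERS = {
    "avro": "org.apache.pig.piggybank.storage.avro.AvroStorage()",
    "tsv": "PigStorage()",
}


def sniff_format(path):
    """Return 'avro' if the file starts with the Avro magic bytes, else 'tsv'."""
    with open(path, "rb") as handle:
        return "avro" if handle.read(4) == AVRO_MAGIC else "tsv"


def build_pig_script(log_path):
    """Build Pig statements that load only the formats found in log_path."""
    by_format = {}
    for path in sorted(glob.glob(os.path.join(log_path, "*"))):
        if os.path.isfile(path):
            by_format.setdefault(sniff_format(path), []).append(path)

    if not by_format:
        raise ValueError("no input files found in " + log_path)

    lines = []
    for fmt in sorted(by_format):
        # Assumption: the input location is passed on as a comma-separated
        # list of paths, which Hadoop file input formats accept.
        paths = ",".join(by_format[fmt])
        lines.append("%s = LOAD '%s' USING %s;" % (fmt, paths, LOADERS[fmt]))

    aliases = sorted(by_format)
    if len(aliases) == 1:
        # Only one format present: no UNION needed, just rename the alias.
        lines.append("result = FOREACH %s GENERATE *;" % aliases[0])
    else:
        lines.append("result = UNION %s;" % ", ".join(aliases))
    return "\n".join(lines)
```

The generated text could be written to a file and run with `pig -x local`. Because the files are classified by content rather than by extension, this also covers the case where the format of each file in the path is not known in advance. The sketch does not reconcile schemas: the TSV branch would still need field names and casts so that both sides of the UNION line up.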

On 15.06.2012 15:24, Ruslan Al-Fakikh wrote:
> I guess you could use globbing for extracting the files by extensions,
> like this:
> $ ls
> input.avro  input.txt
> $ cat input.avro
> avro1
> avro2
> $ cat input.txt
> txt1
> txt2
>
> [cloudera@localhost workpig]$ pig -x local
> 2012-06-15 17:21:09,613 [main] INFO  org.apache.pig.Main - Logging
> error messages to: /home/cloudera/workpig/pig_1339766469585.log
> 2012-06-15 17:21:09,892 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
> Connecting to hadoop file system at: file:///
> grunt> txt = LOAD '*.txt';
> grunt> avro = LOAD '*.avro';
> grunt> result = UNION txt, avro;
> grunt> DUMP result;
> (txt1)
> (txt2)
> (avro1)
> (avro2)
>
> Please note that the input.avro file in this example is not actually
> Avro (it is plain text), so for real Avro data you'll need to use the
> Avro loader in the LOAD statement.
>
> Ruslan
>
> On Fri, Jun 15, 2012 at 4:52 PM, Johannes Schwenk
> <[EMAIL PROTECTED]> wrote:
>> Hi Ruslan,
>>
>> thanks for your answer!
>>
>> I have only the input path and do not know which file formats the
>> files in that path have. All files in the path belong to one relation,
>> however, so I want to load them at once. A union of separately loaded
>> files would be OK too, if that is possible to achieve. What matters is
>> that the LOAD automatically takes care of the different formats.
>>
>> To illustrate further consider the following scenario:
>>
>> 1. Our logging system writes log data to LOG_PATH.
>> 2. The current format is tab separated values.
>> 3. We LOAD '$LOG_PATH'
>> 4. We switch to Avro format and have to migrate.
>> 5. The migration cannot happen instantly, so at some point in time some
>> files in LOG_PATH might still be in TSV format while others have
>> already been switched to Avro.
>>
>> Thanks,
>> Johannes
>>
>> On 15.06.2012 14:37, Ruslan Al-Fakikh wrote:
>>> Hi Johannes,
>>>
>>> I guess you'd have to write a custom Loader for such a situation, but
>>> why do you need to load everything in one pass? You can load different
>>> types of files separately (having multiple LOAD statements) and make a
>>> join or a union afterwards.
>>>
>>> Ruslan
>>>
>>> On Fri, Jun 15, 2012 at 4:13 PM, Johannes Schwenk
>>> <[EMAIL PROTECTED]> wrote:
>>>> Hi all,
>>>>
>>>> is it possible to have an input path (as parameter to a LOAD statement)
>>>> that contains several files in *different formats* - say serialized Avro
>>>> data and tab separated values and make pig read the data into one alias?
>>>> I guess I have to write a UDF for this? How should I start? Can you
>>>> sketch out a rough plan on how to proceed?
>>>>
>>>>
>>>> Greetings,
>>>> Johannes Schwenk
>>>>
>>>> --
>>>> Software Developer (Reporting)
>>>> ________________________________________________________
>>>>
>>>> ADITION technologies AG
>>>> Schwarzwaldstraße 78b
>>>> 79117 Freiburg
>>>>
>>>> http://www.adition.com
>>>>
>>>> T +49 / (0)761 / 88147 - 30
>>>> F +49 / (0)761 / 88147 - 77
>>>> SUPPORT +49  / (0)1805 - ADITION
>>>>
>>>> (Landline rate 14 ct/min; mobile rates at most 42 ct/min)
>>>>
>>>> Registered at the Amtsgericht Düsseldorf (district court) under HRB 54076
>>>> Executive Board: Andreas Kleiser, Jörg Klekamp, Tihomir Perkovic, Marcus Schlüter
>>>> Chairman of the Supervisory Board: Daniel Raimer, attorney-at-law
>>>> VAT ID No.: DE 218 858 434
>>>>
>>>
>>>
>>>
>>
>>
>>
Ruslan Al-Fakikh 2012-06-15, 13:55
Johannes Schwenk 2012-06-15, 16:05