Pig, mail # user - Mixed input formats in LOAD path


Re: Mixed input formats in LOAD path
Ruslan Al-Fakikh 2012-06-15, 13:24
I guess you could use globbing to select the files by extension,
like this:
$ ls
input.avro  input.txt
$ cat input.avro
avro1
avro2
$ cat input.txt
txt1
txt2

[cloudera@localhost workpig]$ pig -x local
2012-06-15 17:21:09,613 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/cloudera/workpig/pig_1339766469585.log
2012-06-15 17:21:09,892 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
grunt> txt = LOAD '*.txt';
grunt> avro = LOAD '*.avro';
grunt> result = UNION txt, avro;
grunt> DUMP result;
(txt1)
(txt2)
(avro1)
(avro2)

Please note that the input.avro file in this demo is not actually Avro,
just plain text. With real Avro files you'll need to specify the Avro
loader in the corresponding LOAD statement.
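
Since the extension alone does not prove the format (the demo
input.avro above is a case in point), one option is to check the file
header before deciding which LOAD statement a file belongs to. Avro
object container files start with the four bytes 'O', 'b', 'j', 0x01.
Below is a minimal sketch for local files only, assuming a POSIX shell
with od and tr available; the function names is_avro and classify_dir
are made up for illustration and are not part of Pig or Avro:

```shell
#!/bin/sh
# Hypothetical helper (not part of Pig or Avro): succeed if the given
# local file starts with the Avro container magic bytes "Obj" + 0x01.
is_avro() {
    # Dump the first 4 bytes as hex and strip spaces and newlines,
    # which yields "4f626a01" for an Avro object container file.
    magic=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
    [ "$magic" = "4f626a01" ]
}

# Hypothetical helper: print every regular file in a directory
# (default: current directory), prefixed with the detected format.
# Anything without the Avro magic is reported as "text".
classify_dir() {
    for f in "${1:-.}"/*; do
        [ -f "$f" ] || continue
        if is_avro "$f"; then
            echo "avro $f"
        else
            echo "text $f"
        fi
    done
}

# Usage: classify_dir /path/to/logs
```

Run on the demo directory above, classify_dir would report both files
as text, because input.avro carries no Avro header. For files on HDFS
the bytes would have to be fetched first (for example with
hadoop fs -cat), which this sketch does not cover.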

Ruslan

On Fri, Jun 15, 2012 at 4:52 PM, Johannes Schwenk
<[EMAIL PROTECTED]> wrote:
> Hi Ruslan,
>
> thanks for your answer!
>
> I have only the input path, but do not know which file format the
> different files in that path have. All files in the path belong to
> one relation, however, so I want to load them at once. A union of
> separately loaded files would be OK too, if that is possible to
> achieve. What matters is that the LOAD automatically takes care of
> the different formats.
>
> To illustrate further consider the following scenario:
>
> 1. Our logging system writes log data to LOG_PATH.
> 2. The current format is tab separated values.
> 3. We LOAD '$LOG_PATH'
> 4. We switch to Avro format and have to migrate.
> 5. The migration cannot happen instantly, so at some point in time
> some files in LOG_PATH may still be in TSV format while others have
> already been switched to Avro.
>
> Thanks,
> Johannes
>
> On 15.06.2012 14:37, Ruslan Al-Fakikh wrote:
>> Hi Johannes,
>>
>> I guess you'd have to write a custom Loader for such a situation, but
>> why do you need to load everything in one pass? You can load different
>> types of files separately (having multiple LOAD statements) and make a
>> join or a union afterwards.
>>
>> Ruslan
>>
>> On Fri, Jun 15, 2012 at 4:13 PM, Johannes Schwenk
>> <[EMAIL PROTECTED]> wrote:
>>> Hi all,
>>>
>>> is it possible to have an input path (as parameter to a LOAD statement)
>>> that contains several files in *different formats* - say serialized Avro
>>> data and tab separated values - and make Pig read the data into one alias?
>>> I guess I have to write a UDF for this? How should I start, can you
>>> sketch out a rough plan on how to proceed?
>>>
>>>
>>> Greetings,
>>> Johannes Schwenk
>>>
>>> --
>>> Software Developer (Reporting)
>>> ________________________________________________________
>>>
>>> ADITION technologies AG
>>> Schwarzwaldstraße 78b
>>> 79117 Freiburg
>>>
>>> http://www.adition.com
>>>
>>> T +49 / (0)761 / 88147 - 30
>>> F +49 / (0)761 / 88147 - 77
>>> SUPPORT +49  / (0)1805 - ADITION
>>>
>>> (Landline rate 14 ct/min; mobile rates at most 42 ct/min)
>>>
>>> Registered at the Amtsgericht Düsseldorf (district court) under HRB 54076
>>> Executive Board: Andreas Kleiser, Jörg Klekamp, Tihomir Perkovic, Marcus Schlüter
>>> Chairman of the Supervisory Board: Attorney Daniel Raimer
>>> VAT ID No.: DE 218 858 434
>>>
>>
>>
>>

--
Best Regards,
Ruslan Al-Fakikh