Pig >> mail # user >> Mixed input formats in LOAD path


Re: Mixed input formats in LOAD path
I guess you could use globbing to select the files by extension,
like this:
$ ls
input.avro  input.txt
$ cat input.avro
avro1
avro2
$ cat input.txt
txt1
txt2

[cloudera@localhost workpig]$ pig -x local
2012-06-15 17:21:09,613 [main] INFO  org.apache.pig.Main - Logging
error messages to: /home/cloudera/workpig/pig_1339766469585.log
2012-06-15 17:21:09,892 [main] INFO
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
Connecting to hadoop file system at: file:///
grunt> txt = LOAD '*.txt';
grunt> avro = LOAD '*.avro';
grunt> result = UNION txt, avro;
grunt> DUMP result;
(txt1)
(txt2)
(avro1)
(avro2)

Please note that input.avro in this example is actually plain text, not
Avro, so for real Avro files you'll need to specify the Avro loader in the
LOAD statement.
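
For real Avro files, the two LOAD statements might look roughly like the
sketch below. It is only an illustration under several assumptions: the jar
paths are placeholders, the TSV field names and types (ts, msg) are made up,
the files can be told apart by extension, and the Avro loader is taken to be
the piggybank AvroStorage class. UNION ONSCHEMA merges the two relations by
field name, so it only gives a sensible result if the Avro schema uses the
same field names as the AS clause.

```pig
-- Placeholder jar locations; AvroStorage also needs its own dependencies
-- on the classpath, which are not listed here.
REGISTER /path/to/piggybank.jar;
REGISTER /path/to/avro.jar;

-- Hypothetical schema for the tab-separated files.
tsv = LOAD '$LOG_PATH/*.txt' USING PigStorage('\t')
      AS (ts:long, msg:chararray);

-- The Avro loader derives the schema from the files themselves.
avro = LOAD '$LOG_PATH/*.avro'
       USING org.apache.pig.piggybank.storage.avro.AvroStorage();

-- Merge by field name rather than by position.
result = UNION ONSCHEMA tsv, avro;
```

The script would be run with something like
pig -param LOG_PATH=/some/dir script.pig, where the directory and script
name are again placeholders.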

Ruslan

On Fri, Jun 15, 2012 at 4:52 PM, Johannes Schwenk
<[EMAIL PROTECTED]> wrote:
> Hi Ruslan,
>
> thanks for your answer!
>
> I have only the input path, and I do not know which file formats the
> different files in that path have. However, all files in the path
> belong to one relation, so I want to load them at once. A union of
> separately loaded files would be fine too, if that is possible.
> What matters is that the LOAD takes care of the different formats
> automatically.
>
> To illustrate further consider the following scenario:
>
> 1. Our logging system writes log data to LOG_PATH.
> 2. The current format is tab separated values.
> 3. We LOAD '$LOG_PATH'
> 4. We switch to Avro format and have to migrate.
> 5. The migration cannot happen instantly, so at some point in time
> some files in LOG_PATH may still be in TSV format while others have
> already been switched to Avro.
>
> Thanks,
> Johannes
>
> On 15.06.2012 14:37, Ruslan Al-Fakikh wrote:
>> Hi Johannes,
>>
>> I guess you'd have to write a custom Loader for such a situation, but
>> why do you need to load everything in one pass? You can load different
>> types of files separately (having multiple LOAD statements) and make a
>> join or a union afterwards.
>>
>> Ruslan
>>
>> On Fri, Jun 15, 2012 at 4:13 PM, Johannes Schwenk
>> <[EMAIL PROTECTED]> wrote:
>>> Hi all,
>>>
>>> is it possible to have an input path (as parameter to a LOAD statement)
>>> that contains several files in *different formats* - say serialized Avro
>>> data and tab-separated values - and make Pig read the data into one alias?
>>> I guess I have to write a UDF for this? How should I start? Can you
>>> sketch out a rough plan on how to proceed?
>>>
>>>
>>> Greetings,
>>> Johannes Schwenk
>>>
>>> --
>>> Software Developer (Reporting)
>>> ________________________________________________________
>>>
>>> ADITION technologies AG
>>> Schwarzwaldstraße 78b
>>> 79117 Freiburg
>>>
>>> http://www.adition.com
>>>
>>> T +49 / (0)761 / 88147 - 30
>>> F +49 / (0)761 / 88147 - 77
>>> SUPPORT +49  / (0)1805 - ADITION
>>>
>>> (Landline rate 14 ct/min; mobile rates at most 42 ct/min)
>>>
>>> Registered at the Düsseldorf District Court under HRB 54076
>>> Executive Board: Andreas Kleiser, Jörg Klekamp, Tihomir Perkovic, Marcus Schlüter
>>> Chairman of the Supervisory Board: Daniel Raimer, attorney at law
>>> VAT ID No.: DE 218 858 434
>>>
>>
>>
>>
>
>
>

--
Best Regards,
Ruslan Al-Fakikh