Pig user mailing list: Mixed input formats in LOAD path


Re: Mixed input formats in LOAD path
Hey,

You can keep a single empty file per format. That way Pig won't fail.
But I generally recommend avoiding situations that need hacks or
custom formats. In my experience, you'll soon get into trouble
with them.
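
A minimal sketch of the placeholder idea in shell, assuming the files are
told apart by extension as in the globbing example quoted below. The helper
name `ensure_placeholder` and the file name `empty.<ext>` are made up for
illustration. Whether a given loader accepts a zero-byte file is also an
assumption to verify; an Avro loader in particular may expect a valid file
header.

```shell
# Hypothetical helper: make sure a glob like "$dir"/*.ext matches at least
# one file, so that a LOAD on that glob does not fail on missing input.
ensure_placeholder() {
    dir=$1
    ext=$2
    # If any file with this extension already exists, there is nothing to do.
    for f in "$dir"/*."$ext"; do
        [ -e "$f" ] && return 0
    done
    # Otherwise create an empty placeholder. The name avoids a leading "_"
    # or ".", since Hadoop input formats usually skip such files as hidden.
    : > "$dir/empty.$ext"
}
```

Calling `ensure_placeholder "$LOG_PATH" txt` and
`ensure_placeholder "$LOG_PATH" avro` before starting Pig would cover both
globs on a local file system. On HDFS the same idea would need
`hadoop fs -touchz` instead of a local redirect.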

Thanks

On Fri, Jun 15, 2012 at 5:39 PM, Johannes Schwenk
<[EMAIL PROTECTED]> wrote:
> Thanks a lot Ruslan, that seems one possible direction!
>
> One thing remains to be resolved: I don't know whether I will get
> Avro in the input, or CSV, TSV, or all of them... So how could I get Pig
> not to choke on missing input files?
>
> Johannes
>
> On 15.06.2012 15:24, Ruslan Al-Fakikh wrote:
>> I guess you could use globbing to select the files by extension,
>> like this:
>> $ ls
>> input.avro  input.txt
>> $ cat input.avro
>> avro1
>> avro2
>> $ cat input.txt
>> txt1
>> txt2
>>
>> [cloudera@localhost workpig]$ pig -x local
>> 2012-06-15 17:21:09,613 [main] INFO  org.apache.pig.Main - Logging
>> error messages to: /home/cloudera/workpig/pig_1339766469585.log
>> 2012-06-15 17:21:09,892 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
>> Connecting to hadoop file system at: file:///
>> grunt> txt = LOAD '*.txt';
>> grunt> avro = LOAD '*.avro';
>> grunt> result = UNION txt, avro;
>> grunt> DUMP result;
>> (txt1)
>> (txt2)
>> (avro1)
>> (avro2)
>>
>> Please note that the input.avro file in this example is not actually
>> Avro (it is plain text). For real Avro data you'll need to use the
>> Avro loader in the LOAD statement.
>>
>> Ruslan
>>
>> On Fri, Jun 15, 2012 at 4:52 PM, Johannes Schwenk
>> <[EMAIL PROTECTED]> wrote:
>>> Hi Ruslan,
>>>
>>> thanks for your answer!
>>>
>>> I have only the input path, and do not know which file formats the
>>> different files in that path have. All files in the path belong to
>>> one relation, however, so I want to load them at once. A union of
>>> separately loaded files would be OK too, if that is possible to
>>> achieve. What matters is that the LOAD automatically takes care of the
>>> different formats.
>>>
>>> To illustrate further consider the following scenario:
>>>
>>> 1. Our logging system writes log data to LOG_PATH.
>>> 2. The current format is tab separated values.
>>> 3. We LOAD '$LOG_PATH'
>>> 4. We switch to Avro format and have to migrate.
>>> 5. The migration cannot happen instantly, so at some point in time
>>> some files in LOG_PATH might still have the TSV format while
>>> others are already switched to Avro.
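
One way to see how far such a migration has progressed, without trusting
file extensions, is to look at the first bytes of each file: Avro object
container files begin with the four bytes `Obj` followed by 0x01. A small
shell sketch of that check, where the function name `is_avro` is
hypothetical and `head -c` is assumed to be available on the system:

```shell
# Hypothetical helper: succeed if the file starts with the Avro object
# container magic bytes 'O' 'b' 'j' 0x01, fail otherwise.
is_avro() {
    # Dump the first four bytes as hex and strip od's spacing and newline.
    magic=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
    [ "$magic" = "4f626a01" ]
}
```

A loop such as `for f in "$LOG_PATH"/*; do is_avro "$f" && echo "$f"; done`
would then list the local files that already look like Avro. Anything else
would still have to be treated as TSV by assumption, since TSV has no
header to check.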
>>>
>>> Thanks,
>>> Johannes
>>>
>>> On 15.06.2012 14:37, Ruslan Al-Fakikh wrote:
>>>> Hi Johannes,
>>>>
>>>> I guess you'd have to write a custom Loader for such a situation, but
>>>> why do you need to load everything in one pass? You can load different
>>>> types of files separately (having multiple LOAD statements) and make a
>>>> join or a union afterwards.
>>>>
>>>> Ruslan
>>>>
>>>> On Fri, Jun 15, 2012 at 4:13 PM, Johannes Schwenk
>>>> <[EMAIL PROTECTED]> wrote:
>>>>> Hi all,
>>>>>
>>>>> is it possible to have an input path (as parameter to a LOAD statement)
>>>>> that contains several files in *different formats*, say serialized Avro
>>>>> data and tab-separated values, and make Pig read the data into one alias?
>>>>> I guess I have to write a UDF for this? How should I start? Can you
>>>>> sketch out a rough plan on how to proceed?
>>>>>
>>>>>
>>>>> Greetings,
>>>>> Johannes Schwenk
>>>>>
>>>>> --
>>>>> Software Developer (Reporting)
>>>>> ________________________________________________________
>>>>>
>>>>> ADITION technologies AG
>>>>> Schwarzwaldstraße 78b
>>>>> 79117 Freiburg
>>>>>
>>>>> http://www.adition.com
>>>>>
>>>>> T +49 / (0)761 / 88147 - 30
>>>>> F +49 / (0)761 / 88147 - 77
>>>>> SUPPORT +49  / (0)1805 - ADITION
>>>>>
>>>>> (Landline rate 14 ct/min; mobile rates at most 42 ct/min)
>>>>>
>>>>> Registered at the Amtsgericht Düsseldorf (district court) under HRB 54076
>>>>> Executive Board: Andreas Kleiser, Jörg Klekamp, Tihomir Perkovic, Marcus Schlüter
>>>>> Chairman of the Supervisory Board: Daniel Raimer, attorney at law
>>>>
Best Regards,
Ruslan Al-Fakikh