Pig >> mail # user >> Best Practice: store depending on data content


Re: Best Practice: store depending on data content
Hi Markus,

I am currently doing almost the same task, but in Hive.
In Hive you can use the native Avro+Hive integration:
https://issues.apache.org/jira/browse/HIVE-895
or the haivvreo project if you are not using the latest version of Hive.
Hive also has a Dynamic Partition feature that can separate
your data by a column value.

As for HCatalog, I decided against using it after some investigation, because:
1) It is still incubating
2) It is not supported by Cloudera (the distribution provider we are
currently using)

I think it would be perfect if MultiStorage were generic in the
way you described, but I am not familiar with it.

Ruslan

On Thu, Jun 28, 2012 at 9:27 PM, Thejas Nair <[EMAIL PROTECTED]> wrote:
> I am not aware of any work on adding those features to MultiStorage.
>
> I think the best way to do this is to use HCatalog. (It makes the Hive
> metastore available to all of Hadoop, so you get metadata for your data as
> well.)
> You can associate an OutputFormat+SerDe with a table (instead of a file name
> ending), and HCatStorer will automatically pick the right format.
>
> Thanks,
> Thejas
>
>
>
> On 6/28/12 2:17 AM, Markus Resch wrote:
>>
>> Thanks Thejas,
>>
>> This _really_ helped a lot :)
>> One additional question on this:
>> As far as I can see, MultiStorage is currently only capable of writing CSV
>> output, right? Is there any ongoing effort to make this
>> storage more generic regarding the format of the output data? For our
>> needs we would require Avro output as well as a special proprietary
>> binary encoding for which we have already created our own storage. I'm
>> thinking about a storage that selects a writer method
>> depending on the file name's ending.
>>
>> Do you know of such efforts?
>>
>> Thanks
>>
>> Markus
>>
>>
>> On Friday, 22.06.2012, at 11:23 -0700, Thejas Nair wrote:
>>>
>>> You can use MultiStorage store func -
>>>
>>> http://pig.apache.org/docs/r0.9.1/api/org/apache/pig/piggybank/storage/MultiStorage.html
>>>
>>> Or if you want something more flexible, and want metadata as well, use
>>> HCatalog. Specify the keys on which you want to partition as the
>>> partition keys of the table. Then use HCatStorer() to store the data.
>>> See http://incubator.apache.org/hcatalog/docs/r0.4.0/index.html
>>>
>>> Thanks,
>>> Thejas
>>>
>>>
>>>
>>> On 6/22/12 4:54 AM, Markus Resch wrote:
>>>>
>>>> Hey everyone,
>>>>
>>>> We're doing some aggregation. The result contains a key, and we want to
>>>> write a single output file for each key value. Is it possible to store files
>>>> like this, in particular by adjusting the path according to the key's value?
>>>>
>>>> Example:
>>>> Input = LOAD 'my/data.avro' USING AvroStorage;
>>>> [.... doing stuff....]
>>>> Output = GROUP AggregatesValues BY Key;
>>>> FOREACH Output Store * into
>>>> '/my/output/path/by/$Output.Key/Result.avro'
>>>>
>>>> I know this example does not work. But is anything similar
>>>> possible? And if not, as I assume, is there some framework in the Hadoop
>>>> world that can do this?
>>>>
>>>>
>>>> Thanks
>>>>
>>>> Markus
>>>>
>>>>
>>
>>
>
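
For readers landing on this thread, here is a minimal sketch of what the MultiStorage suggestion could look like in a Pig script. It is not taken from any of the messages above: the jar location, the input path, the schema and the field names (agg_key, amount, total) are all assumptions made for illustration. Note that no per-key STORE loop is needed; the aggregated relation is stored once, and MultiStorage splits it by the field at the given index.

```pig
-- Sketch only: the jar path, input path and schema below are hypothetical.
REGISTER /path/to/piggybank.jar;

-- Load some tab-separated input (the original post used Avro input instead).
records = LOAD 'my/data.tsv' USING PigStorage('\t')
          AS (agg_key:chararray, amount:long);

-- Aggregate per key; the key ends up as field 0 of the result.
grouped    = GROUP records BY agg_key;
aggregates = FOREACH grouped GENERATE group AS agg_key,
                                      SUM(records.amount) AS total;

-- MultiStorage(parentPath, splitFieldIndex): write the tuples into one
-- subdirectory of the parent path per distinct value of field 0.
STORE aggregates INTO '/my/output/path/by'
      USING org.apache.pig.piggybank.storage.MultiStorage('/my/output/path/by', '0');

-- HCatalog alternative mentioned in the thread. Assumption: a table named
-- aggregates_by_key already exists in the metastore, partitioned on agg_key.
-- STORE aggregates INTO 'aggregates_by_key'
--       USING org.apache.hcatalog.pig.HCatStorer();
```

With MultiStorage the output is delimited text, which matches the limitation Markus points out; writing Avro or a custom binary format per key would still need a different store function or the HCatalog route.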