Pig, mail # user - Best Practice: store depending on data content


Re: Best Practice: store depending on data content
Alan Gates 2012-06-29, 17:13
On a different topic, I'm interested in why you refuse to use a project in the incubator. Incubation is the Apache process by which a community is built around the code. It says nothing about the maturity of the code.

Alan.

On Jun 28, 2012, at 10:59 AM, Ruslan Al-Fakikh wrote:

> Hi Markus,
>
> Currently I am doing almost the same task. But in Hive.
> In Hive you can use the native Avro+Hive integration:
> https://issues.apache.org/jira/browse/HIVE-895
> Or the haivvreo project if you are not using the latest version of Hive.
> Also there is a Dynamic Partition feature in Hive that can separate
> your data by a column value.
>
> As for HCatalog - I refused to use it after some investigation, because:
> 1) It is still incubating
> 2) It is not supported by Cloudera (the distribution provider we are
> currently using)
>
> I think it would be perfect if MultiStorage would be generic in the
> way you described, but I am not familiar with it.
>
> Ruslan
>
> On Thu, Jun 28, 2012 at 9:27 PM, Thejas Nair <[EMAIL PROTECTED]> wrote:
>> I am not aware of any work on adding those features to MultiStorage.
>>
>> I think the best way to do this is to use HCatalog. (It makes the Hive
>> metastore available for all of Hadoop, so you get metadata for your data as
>> well).
>> You can associate an outputformat+serde with a table (instead of a file name
>> ending), and HCatStorer will automatically pick the right format.
>>
>> Thanks,
>> Thejas
>>
>>
>>
>> On 6/28/12 2:17 AM, Markus Resch wrote:
>>>
>>> Thanks Thejas,
>>>
>>> This _really_ helped a lot :)
>>> An additional question on this:
>>> As far as I can see, MultiStorage is currently only capable of writing CSV
>>> output, right? Is there any ongoing effort to make this
>>> storage more generic regarding the format of the output data? For our
>>> needs we would require AVRO output as well as a special proprietary
>>> binary encoding for which we already created our own storage. I'm
>>> thinking about a storage that selects a writer method
>>> depending on the file name's ending.
>>>
>>> Do you know of such efforts?
>>>
>>> Thanks
>>>
>>> Markus
>>>
>>>
>>> On Friday, 22.06.2012, at 11:23 -0700, Thejas Nair wrote:
>>>>
>>>> You can use MultiStorage store func -
>>>>
>>>> http://pig.apache.org/docs/r0.9.1/api/org/apache/pig/piggybank/storage/MultiStorage.html
>>>>
>>>> Or if you want something more flexible, and have metadata as well, use
>>>> HCatalog. Specify the keys on which you want to partition as your
>>>> partition keys in the table. Then use HCatStorer() to store the data.
>>>> See http://incubator.apache.org/hcatalog/docs/r0.4.0/index.html
>>>>
>>>> Thanks,
>>>> Thejas
>>>>
>>>>
>>>>
>>>> On 6/22/12 4:54 AM, Markus Resch wrote:
>>>>>
>>>>> Hey everyone,
>>>>>
>>>>> We're doing some aggregation. The result contains a key where we want to
>>>>> have a single output file for each key. Is it possible to store files
>>>>> like this? Especially adjusting the path by the key's value.
>>>>>
>>>>> Example:
>>>>> Input = LOAD 'my/data.avro' USING AvroStorage;
>>>>> [.... doing stuff....]
>>>>> Output = GROUP AggregatesValues BY Key;
>>>>> FOREACH Output Store * into
>>>>> '/my/output/path/by/$Output.Key/Result.avro'
>>>>>
>>>>> I know this example does not work. But is anything similar
>>>>> possible? And if, as I assume, it is not: is there some framework in the
>>>>> Hadoop world that can do such things?
>>>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>> Markus
>>>>>
>>>>>
>>>
>>>
>>
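
For readers following the MultiStorage suggestion quoted above, here is a minimal sketch of how the original use case might look in Pig Latin. It is an illustration under stated assumptions, not a tested recipe: the piggybank.jar location, the input file 'my/data.tsv', and the field names (key, value, total) are all hypothetical, and the input is assumed to be tab-delimited text rather than Avro. As noted in the thread, MultiStorage writes delimited text, so this does not produce Avro output.

```pig
-- Assumption: piggybank.jar is available at this (hypothetical) path.
REGISTER /path/to/piggybank.jar;

-- Hypothetical input: tab-delimited records of (key, value).
records = LOAD 'my/data.tsv' USING PigStorage('\t')
          AS (key:chararray, value:long);

-- Aggregate per key; 'key' becomes field 0 of each output tuple.
grouped    = GROUP records BY key;
aggregates = FOREACH grouped GENERATE group AS key,
                                      SUM(records.value) AS total;

-- MultiStorage(parentPath, splitFieldIndex): each tuple is written
-- beneath parentPath in a subdirectory named after the value of
-- field 0, i.e. /my/output/path/by/<key>/...
STORE aggregates INTO '/my/output/path/by'
    USING org.apache.pig.piggybank.storage.MultiStorage('/my/output/path/by', '0');
```

The split field is given as a zero-based index in string form, so the key has to be projected into a known position before the STORE.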