Pig >> mail # user >> Best Practice: store depending on data content


Markus Resch 2012-06-22, 11:54
Thejas Nair 2012-06-22, 18:23
Markus Resch 2012-06-28, 09:17
Thejas Nair 2012-06-28, 17:27
Ruslan Al-Fakikh 2012-06-28, 17:59
Re: Best Practice: store depending on data content
On a different topic, I'm interested in why you refuse to use a project in the incubator.  Incubation is the Apache process by which a community is built around the code.  It says nothing about the maturity of the code.

Alan.

On Jun 28, 2012, at 10:59 AM, Ruslan Al-Fakikh wrote:

> Hi Markus,
>
> Currently I am doing almost the same task. But in Hive.
> In Hive you can use the native Avro+Hive integration:
> https://issues.apache.org/jira/browse/HIVE-895
> Or the haivvreo project, if you are not using the latest version of Hive.
> Also there is a Dynamic Partition feature in Hive that can separate
> your data by a column value.
>
> As for HCatalog - I refused to use it after some investigation, because:
> 1) It is still incubating
> 2) It is not supported by Cloudera (the distribution provider we are
> currently using)
>
> I think it would be perfect if MultiStorage were generic in the
> way you described, but I am not familiar with it.
>
> Ruslan
>
> On Thu, Jun 28, 2012 at 9:27 PM, Thejas Nair <[EMAIL PROTECTED]> wrote:
>> I am not aware of any work on adding those features to MultiStorage.
>>
>> I think the best way to do this is to use HCatalog. (It makes the Hive
>> metastore available to all of Hadoop, so you get metadata for your data as
>> well.)
>> You can associate an outputformat+serde with a table (instead of a file name
>> ending), and HCatStorer will automatically pick the right format.
>>
>> Thanks,
>> Thejas
>>
>>
>>
>> On 6/28/12 2:17 AM, Markus Resch wrote:
>>>
>>> Thanks Thejas,
>>>
>>> This _really_ helped a lot :)
>>> An additional question on this:
>>> As far as I can see, MultiStorage is currently only capable of writing CSV
>>> output, right? Is there any ongoing effort to make this
>>> storage more generic regarding the format of the output data? For our
>>> needs we would require Avro output as well as a special proprietary
>>> binary encoding for which we have already created our own storage. I'm
>>> thinking of a storage that selects a writer method
>>> depending on the file name's ending.
>>>
>>> Do you know of such efforts?
>>>
>>> Thanks
>>>
>>> Markus
>>>
>>>
>>> On Friday, 22.06.2012, at 11:23 -0700, Thejas Nair wrote:
>>>>
>>>> You can use the MultiStorage store func -
>>>>
>>>> http://pig.apache.org/docs/r0.9.1/api/org/apache/pig/piggybank/storage/MultiStorage.html
>>>>
>>>> Or if you want something more flexible, and want metadata as well, use
>>>> HCatalog. Specify the keys on which you want to partition as the
>>>> partition keys of the table. Then use HCatStorer() to store the data.
>>>> See http://incubator.apache.org/hcatalog/docs/r0.4.0/index.html
>>>>
>>>> Thanks,
>>>> Thejas
>>>>
>>>>
>>>>
>>>> On 6/22/12 4:54 AM, Markus Resch wrote:
>>>>>
>>>>> Hey everyone,
>>>>>
>>>>> We're doing some aggregation. The result contains a key, and we want a
>>>>> single output file for each key value. Is it possible to store files
>>>>> like this, in particular adjusting the path by the key's value?
>>>>>
>>>>> Example:
>>>>> Input = LOAD 'my/data.avro' USING AvroStorage;
>>>>> [.... doing stuff....]
>>>>> Output = GROUP AggregatesValues BY Key;
>>>>> FOREACH Output Store * into
>>>>> '/my/output/path/by/$Output.Key/Result.avro'
>>>>>
>>>>> I know this example does not work. But is anything similar
>>>>> possible? And if, as I assume, it is not: is there some framework in the
>>>>> Hadoop world that can do this?
>>>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>> Markus
>>>>>
>>>>>
>>>
>>>
>>
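
For reference, the MultiStorage suggestion from the thread could look roughly like the sketch below. It is only an illustration under stated assumptions: the piggybank jar location, the tab-separated input file, and the field names (key, value) are hypothetical and not taken from the thread. The two-argument constructor (parent output path, index of the field to split on) follows the MultiStorage Javadoc linked above. As noted in the thread, MultiStorage writes delimited text, not Avro.

```pig
-- Hypothetical jar location; point this at your own piggybank build.
REGISTER /path/to/piggybank.jar;

-- Hypothetical input: tab-separated (key, value) records.
records = LOAD 'my/data.tsv' USING PigStorage('\t')
          AS (key:chararray, value:long);

-- Aggregate per key, keeping the key as field 0 of the result.
grouped    = GROUP records BY key;
aggregates = FOREACH grouped GENERATE group AS key,
                                      SUM(records.value) AS total;

-- MultiStorage splits the output on the field at the given index ('0' = key),
-- writing each distinct key value into its own subdirectory of the parent path.
STORE aggregates INTO '/my/output/path'
    USING org.apache.pig.piggybank.storage.MultiStorage('/my/output/path', '0');
```

With this layout every distinct key ends up under its own directory below /my/output/path, which is close to the per-key path asked for in the original question. The file name inside each directory is chosen by MultiStorage, not by the script.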
Ruslan Al-Fakikh 2012-07-02, 12:57
Alan Gates 2012-07-02, 18:43
Dmitriy Ryaboy 2012-07-02, 17:37
Ruslan Al-Fakikh 2012-07-03, 09:56
Dmitriy Ryaboy 2012-07-04, 00:37
Ruslan Al-Fakikh 2012-07-05, 15:01
Markus Resch 2012-07-03, 11:30