Pig, mail # user - Best Practice: store depending on data content


Markus Resch 2012-06-22, 11:54
Thejas Nair 2012-06-22, 18:23
Markus Resch 2012-06-28, 09:17
Thejas Nair 2012-06-28, 17:27
Ruslan Al-Fakikh 2012-06-28, 17:59
Alan Gates 2012-06-29, 17:13
Ruslan Al-Fakikh 2012-07-02, 12:57
Alan Gates 2012-07-02, 18:43
Re: Best Practice: store depending on data content
Dmitriy Ryaboy 2012-07-02, 17:37
"It would give me the list of datasets in one place accessible from all
tools,"

And that's exactly why you want it.

D

On Mon, Jul 2, 2012 at 5:57 AM, Ruslan Al-Fakikh <[EMAIL PROTECTED]> wrote:
> Hey Alan,
>
> I am not familiar with Apache processes, so I could be wrong in my
> point 1; I am sorry.
> Basically my impression was that Cloudera is pushing the Avro format for
> intercommunication between Hadoop tools like Pig, Hive and MapReduce.
> https://ccp.cloudera.com/display/CDHDOC/Avro+Usage
> http://www.cloudera.com/blog/2011/07/avro-data-interop/
> And if I decide to use Avro, then HCatalog becomes a little redundant.
> It would give me the list of datasets in one place accessible from all
> tools, but all the columns (names and types) would be stored in Avro
> schemas, and the Hive metastore becomes just a stub for those Avro schemas:
> https://github.com/jghoman/haivvreo#creating-avro-backed-hive-tables
> And having those Avro schemas I could access the data from Pig and
> MapReduce without HCatalog. Though I haven't figured out how to get by
> without Hive partitions yet.
>
> Best Regards,
> Ruslan
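
For reference, the Pig side of the Avro route described above might look roughly like the sketch below, using the piggybank AvroStorage. The jar paths, data paths and relation name are placeholders, not taken from the thread, and it assumes a piggybank build whose AvroStorage can derive the output Avro schema from the Pig schema.

  -- placeholder jar paths; AvroStorage needs the Avro and json-simple jars available
  REGISTER /path/to/piggybank.jar;
  REGISTER /path/to/avro.jar;
  REGISTER /path/to/json-simple.jar;

  -- load Avro data; the schema is read from the .avro files themselves
  events = LOAD '/data/events' USING org.apache.pig.piggybank.storage.avro.AvroStorage();

  -- write Avro back out; here the output schema is derived from the Pig schema
  STORE events INTO '/data/events_out'
      USING org.apache.pig.piggybank.storage.avro.AvroStorage();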
>
> On Fri, Jun 29, 2012 at 9:13 PM, Alan Gates <[EMAIL PROTECTED]> wrote:
>> On a different topic, I'm interested in why you refuse to use a project in the incubator.  Incubation is the Apache process by which a community is built around the code.  It says nothing about the maturity of the code.
>>
>> Alan.
>>
>> On Jun 28, 2012, at 10:59 AM, Ruslan Al-Fakikh wrote:
>>
>>> Hi Markus,
>>>
>>> Currently I am doing almost the same task, but in Hive.
>>> In Hive you can use the native Avro+Hive integration:
>>> https://issues.apache.org/jira/browse/HIVE-895
>>> Or the haivvreo project if you are not using the latest version of Hive.
>>> Also there is a Dynamic Partition feature in Hive that can separate
>>> your data by a column value.
>>>
>>> As for HCatalog - I refused to use it after some investigation, because:
>>> 1) It is still incubating
>>> 2) It is not supported by Cloudera (the distribution provider we are
>>> currently using)
>>>
>>> I think it would be perfect if MultiStorage were generic in the
>>> way you described, but I am not familiar with it.
>>>
>>> Ruslan
>>>
>>> On Thu, Jun 28, 2012 at 9:27 PM, Thejas Nair <[EMAIL PROTECTED]> wrote:
>>>> I am not aware of any work on adding those features to MultiStorage.
>>>>
>>>> I think the best way to do this is to use HCatalog. (It makes the Hive
>>>> metastore available to all of Hadoop, so you get metadata for your data as
>>>> well.)
>>>> You can associate an outputformat+serde with a table (instead of a file name
>>>> ending), and HCatStorer will automatically pick the right format.
>>>>
>>>> Thanks,
>>>> Thejas
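
A rough sketch of what this HCatalog route could look like from Pig; the table name 'web_events', its columns, and the 'ds' partition value are invented for illustration, and the HCatalog jars plus hive-site.xml are assumed to be on Pig's classpath.

  -- the table's storage format is not named here at all: HCatStorer looks up the
  -- SerDe/OutputFormat for 'web_events' in the Hive metastore and writes the
  -- records into the ds=20120628 partition
  raw = LOAD 'input' AS (user:chararray, url:chararray);

  STORE raw INTO 'web_events' USING org.apache.hcatalog.pig.HCatStorer('ds=20120628');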
>>>>
>>>>
>>>>
>>>> On 6/28/12 2:17 AM, Markus Resch wrote:
>>>>>
>>>>> Thanks Thejas,
>>>>>
>>>>> This _really_ helped a lot :)
>>>>> Some additional question on this:
>>>>> As far as I can see, MultiStorage is currently only capable of writing CSV
>>>>> output, right? Is there currently any ongoing effort to make this
>>>>> storage more generic regarding the format of the output data? For our
>>>>> needs we would require Avro output as well as a special proprietary
>>>>> binary encoding for which we have already created our own storage. I'm
>>>>> thinking of a storage that would select a certain writer method
>>>>> depending on the file name's ending.
>>>>>
>>>>> Do you know of such efforts?
>>>>>
>>>>> Thanks
>>>>>
>>>>> Markus
>>>>>
>>>>>
>>>>> On Friday, 22.06.2012, at 11:23 -0700, Thejas Nair wrote:
>>>>>>
>>>>>> You can use MultiStorage store func -
>>>>>>
>>>>>> http://pig.apache.org/docs/r0.9.1/api/org/apache/pig/piggybank/storage/MultiStorage.html
>>>>>>
>>>>>> Or if you want something more flexible, and have metadata as well, use
>>>>>> HCatalog. Specify the keys on which you want to partition as the
>>>>>> partition keys of the table. Then use HCatStorer() to store the data.
>>>>>> See http://incubator.apache.org/hcatalog/docs/r0.4.0/index.html
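
For comparison, a minimal sketch of the MultiStorage option mentioned above; the relation, field names and output paths are invented, and field index 2 (the third column) is the one the output is split on.

  REGISTER /path/to/piggybank.jar;

  logs = LOAD 'input' AS (ts:long, user:chararray, country:chararray);

  -- arguments: output dir, index of the field to split on (2 = country),
  -- compression ('none'), and the field delimiter for the text output;
  -- records land under /out/by_country/<country value>/...
  STORE logs INTO '/out/by_country'
      USING org.apache.pig.piggybank.storage.MultiStorage('/out/by_country', '2', 'none', ',');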
Ruslan Al-Fakikh 2012-07-03, 09:56
Dmitriy Ryaboy 2012-07-04, 00:37
Ruslan Al-Fakikh 2012-07-05, 15:01
Markus Resch 2012-07-03, 11:30