Pig >> mail # user >> Best Practice: store depending on data content


Re: Best Practice: store depending on data content
That is a very interesting offtopic:)
I think I will reinvestigate HCatalog some day and come up with
specific questions.

Thanks a lot for explaining

On Wed, Jul 4, 2012 at 4:37 AM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:
> Imagine increasing the number of datasets by a couple orders of
> magnitude. "ls" stops being a good browsing tool pretty quickly.
>
> Then, add the need to manage quotas and retention policies for
> different data producers, to find resources across multiple teams, to
> have a web ui for easy metadata search...
>
> (and now we are totally and thoroughly offtopic. Sorry.)
>
> D
>
> On Tue, Jul 3, 2012 at 2:56 AM, Ruslan Al-Fakikh
> <[EMAIL PROTECTED]> wrote:
>> Dmitriy,
>>
>> In our organization we use file paths for this purpose like this:
>> /incoming/datasetA
>> /incoming/datasetB
>> /reports/datasetC
>> etc
>>
>> On Mon, Jul 2, 2012 at 9:37 PM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:
>>> "It would give me the list of datasets in one place accessible from all
>>> tools,"
>>>
>>> And that's exactly why you want it.
>>>
>>> D
>>>
>>> On Mon, Jul 2, 2012 at 5:57 AM, Ruslan Al-Fakikh <[EMAIL PROTECTED]> wrote:
>>>> Hey Alan,
>>>>
>>>> I am not familiar with Apache processes, so I could be wrong in my
>>>> point 1, I am sorry.
>>>> Basically my impression was that Cloudera is pushing Avro format for
>>>> intercommunications between hadoop tools like pig, hive and mapreduce.
>>>> https://ccp.cloudera.com/display/CDHDOC/Avro+Usage
>>>> http://www.cloudera.com/blog/2011/07/avro-data-interop/
>>>> And if I decide to use Avro then HCatalog becomes a little redundant.
>>>> It would give me the list of datasets in one place accessible from all
>>>> tools, but all the columns (names and types) would be stored in Avro
>>>> schemas and Hive metastore becomes just a stub for those Avro schemas:
>>>> https://github.com/jghoman/haivvreo#creating-avro-backed-hive-tables
>>>> And having those avro schemas I could access data from pig and
>>>> mapreduce without HCatalog. Though I haven't figured out how to deal
>>>> without hive partitions yet.
>>>>
>>>> Best Regards,
>>>> Ruslan
>>>>
>>>> On Fri, Jun 29, 2012 at 9:13 PM, Alan Gates <[EMAIL PROTECTED]> wrote:
>>>>> On a different topic, I'm interested in why you refuse to use a project in the incubator.  Incubation is the Apache process by which a community is built around the code.  It says nothing about the maturity of the code.
>>>>>
>>>>> Alan.
>>>>>
>>>>> On Jun 28, 2012, at 10:59 AM, Ruslan Al-Fakikh wrote:
>>>>>
>>>>>> Hi Markus,
>>>>>>
>>>>>> Currently I am doing almost the same task. But in Hive.
>>>>>> In Hive you can use the native Avro+Hive integration:
>>>>>> https://issues.apache.org/jira/browse/HIVE-895
>>>>>> Or the haivvreo project if you are not using the latest version of Hive.
>>>>>> Also there is a Dynamic Partition feature in Hive that can separate
>>>>>> your data by a column value.
>>>>>>
>>>>>> As for HCatalog - I refused to use it after some investigation, because:
>>>>>> 1) It is still incubating
>>>>>> 2) It is not supported by Cloudera (the distribution provider we are
>>>>>> currently using)
>>>>>>
>>>>>> I think it would be perfect if MultiStorage would be generic in the
>>>>>> way you described, but I am not familiar with it.
>>>>>>
>>>>>> Ruslan
>>>>>>
>>>>>> On Thu, Jun 28, 2012 at 9:27 PM, Thejas Nair <[EMAIL PROTECTED]> wrote:
>>>>>>> I am not aware of any work on adding those features to MultiStorage.
>>>>>>>
>>>>>>> I think the best way to do this is to use HCatalog. (It makes the hive
>>>>>>> metastore available for all of hadoop, so you get metadata for your data as
>>>>>>> well).
>>>>>>> You can associate an outputformat+serde for a table (instead of file name
>>>>>>> ending), and HCatStorage will automatically pick the right format.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Thejas
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 6/28/12 2:17 AM, Markus Resch wrote:
>>>>>>>>
>>>>>>>> Thanks Thejas,
>>>>>>>>
>>>>>>>> This _really_ helped a lot :)