Pig, mail # user - Best Practice: store depending on data content


Re: Best Practice: store depending on data content
Ruslan Al-Fakikh 2012-07-03, 09:56
Dmitriy,

In our organization we use a file path convention for this purpose, for example:
/incoming/datasetA
/incoming/datasetB
/reports/datasetC
etc
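
A minimal sketch of how such a path convention can double as a dataset list. It assumes every dataset lives at /<area>/<dataset>; the function name group_datasets is hypothetical, and the paths are passed in as plain strings rather than read from HDFS:

```python
from pathlib import PurePosixPath


def group_datasets(paths):
    """Group absolute dataset paths of the form /<area>/<dataset> by area."""
    catalog = {}
    for raw in paths:
        parts = PurePosixPath(raw).parts  # e.g. ('/', 'incoming', 'datasetA')
        if len(parts) < 3:
            continue  # skip anything that does not follow the convention
        area, dataset = parts[1], parts[2]
        catalog.setdefault(area, []).append(dataset)
    # De-duplicate and sort dataset names within each area.
    return {area: sorted(set(names)) for area, names in catalog.items()}


paths = ["/incoming/datasetA", "/incoming/datasetB", "/reports/datasetC"]
print(group_datasets(paths))
```

In practice the list of paths would come from a directory listing of the cluster file system; that part is left out here because it depends on the client library in use.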

On Mon, Jul 2, 2012 at 9:37 PM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:
> "It would give me the list of datasets in one place accessible from all
> tools,"
>
> And that's exactly why you want it.
>
> D
>
> On Mon, Jul 2, 2012 at 5:57 AM, Ruslan Al-Fakikh <[EMAIL PROTECTED]> wrote:
>> Hey Alan,
>>
>> I am not familiar with Apache processes, so I could be wrong in my
>> point 1, I am sorry.
>> Basically, my impression was that Cloudera is pushing the Avro format for
>> intercommunication between Hadoop tools like Pig, Hive and MapReduce.
>> https://ccp.cloudera.com/display/CDHDOC/Avro+Usage
>> http://www.cloudera.com/blog/2011/07/avro-data-interop/
>> And if I decide to use Avro then HCatalog becomes a little redundant.
>> It would give me the list of datasets in one place accessible from all
>> tools, but all the columns (names and types) would be stored in Avro
>> schemas and Hive metastore becomes just a stub for those Avro schemas:
>> https://github.com/jghoman/haivvreo#creating-avro-backed-hive-tables
>> And having those avro schemas I could access data from pig and
>> mapreduce without HCatalog. Though I haven't figured out how to deal
>> without hive partitions yet.
>>
>> Best Regards,
>> Ruslan
>>
>> On Fri, Jun 29, 2012 at 9:13 PM, Alan Gates <[EMAIL PROTECTED]> wrote:
>>> On a different topic, I'm interested in why you refuse to use a project in the incubator. Incubation is the Apache process by which a community is built around the code. It says nothing about the maturity of the code.
>>>
>>> Alan.
>>>
>>> On Jun 28, 2012, at 10:59 AM, Ruslan Al-Fakikh wrote:
>>>
>>>> Hi Markus,
>>>>
>>>> Currently I am doing almost the same task. But in Hive.
>>>> In Hive you can use the native Avro+Hive integration:
>>>> https://issues.apache.org/jira/browse/HIVE-895
>>>> Or the haivvreo project if you are not using the latest version of Hive.
>>>> Also there is a Dynamic Partition feature in Hive that can separate
>>>> your data by a column value.
>>>>
>>>> As for HCatalog - I refused to use it after some investigation, because:
>>>> 1) It is still incubating
>>>> 2) It is not supported by Cloudera (the distribution provider we are
>>>> currently using)
>>>>
>>>> I think it would be perfect if MultiStorage were generic in the
>>>> way you described, but I am not familiar with it.
>>>>
>>>> Ruslan
>>>>
>>>> On Thu, Jun 28, 2012 at 9:27 PM, Thejas Nair <[EMAIL PROTECTED]> wrote:
>>>>> I am not aware of any work on adding those features to MultiStorage.
>>>>>
>>>>> I think the best way to do this is to use HCatalog. (It makes the Hive
>>>>> metastore available to all of Hadoop, so you get metadata for your data as
>>>>> well.)
>>>>> You can associate an outputformat+serde with a table (instead of a file name
>>>>> ending), and HCatStorer will automatically pick the right format.
>>>>>
>>>>> Thanks,
>>>>> Thejas
>>>>>
>>>>>
>>>>>
>>>>> On 6/28/12 2:17 AM, Markus Resch wrote:
>>>>>>
>>>>>> Thanks Thejas,
>>>>>>
>>>>>> This _really_ helped a lot :)
>>>>>> An additional question on this:
>>>>>> As far as I can see, MultiStorage is currently only capable of writing CSV
>>>>>> output, right? Is there any ongoing effort to make this
>>>>>> storage more generic regarding the format of the output data? For our
>>>>>> needs we would require Avro output as well as a special proprietary
>>>>>> binary encoding for which we have already created our own storage. I'm
>>>>>> thinking about a storage that selects a writer method
>>>>>> depending on the file name's ending.
>>>>>>
>>>>>> Do you know of such efforts?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Markus
>>>>>>
>>>>>>
>>>>>> On Friday, 22.06.2012, at 11:23 -0700, Thejas Nair wrote:
>>>>>>>
>>>>>>> You can use the MultiStorage store func:
>>>>>>>
>>>>>>> http://pig.apache.org/docs/r0.9.1/api/org/apache/pig/piggybank/storage/MultiStorage.html
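
The two ideas discussed in the quoted thread (splitting output by the value of a field, and picking a writer from the file name ending) can be sketched together in a few lines. This is plain Python, not Pig or piggybank code; store_by_field, the WRITERS registry, the part<ending> file name and the two sample formats are all hypothetical stand-ins, and the directory layout only approximates what a real store function would produce:

```python
import csv
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def write_csv(path, rows):
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)


def write_jsonl(path, rows):
    # One JSON array per line.
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


# Hypothetical registry: file name ending -> writer function.
WRITERS = {".csv": write_csv, ".jsonl": write_jsonl}


def store_by_field(rows, out_dir, split_index, ending=".csv"):
    """Write rows to <out_dir>/<key>/part<ending>, one directory per key."""
    writer = WRITERS[ending]  # raises KeyError for an unregistered ending
    groups = defaultdict(list)
    for row in rows:
        groups[str(row[split_index])].append(list(row))
    written = {}
    for key, group in groups.items():
        target = Path(out_dir) / key
        target.mkdir(parents=True, exist_ok=True)
        path = target / f"part{ending}"
        writer(path, group)
        written[key] = path
    return written


rows = [("a", 1), ("b", 2), ("a", 3)]
with tempfile.TemporaryDirectory() as tmp:
    files = store_by_field(rows, tmp, split_index=0, ending=".csv")
    for key in sorted(files):
        print(key, files[key].name)
```

Supporting another format, such as Avro or a proprietary binary encoding, would then only require registering one more writer function under its file name ending.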

Best Regards,
Ruslan Al-Fakikh