-
Best Practice: store depending on data content
Markus Resch 2012-06-22, 11:54
Hey everyone,
We're doing some aggregation. The result contains a key, and we want a single output file for each key value. Is it possible to store files like this, in particular adjusting the output path by the key's value?

Example:
Input = LOAD 'my/data.avro' USING AvroStorage;
[.... doing stuff....]
Output = GROUP AggregatesValues BY Key;
FOREACH Output Store * into '/my/output/path/by/$Output.Key/Result.avro'

I know this example does not work, but is anything similar possible? And if, as I assume, it is not: is there some framework in the Hadoop world that can do this?

Thanks
Markus
-
Re: Best Practice: store depending on data content
Thejas Nair 2012-06-22, 18:23
You can use the MultiStorage store func:
http://pig.apache.org/docs/r0.9.1/api/org/apache/pig/piggybank/storage/MultiStorage.html

Or, if you want something more flexible and want metadata as well, use HCatalog. Specify the keys on which you want to partition as the partition keys of the table, then use HCatStorer() to store the data. See
http://incubator.apache.org/hcatalog/docs/r0.4.0/index.html

Thanks,
Thejas
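As a rough illustration of the MultiStorage suggestion, the sketch below splits a relation by its first field. The jar path, input path, alias and field names are assumptions made for this example; only the MultiStorage class and its two arguments (parent path, split field index) are taken from the piggybank documentation linked above.

```pig
-- Hypothetical location of the piggybank jar; adjust to your installation.
REGISTER /path/to/piggybank.jar;

-- Hypothetical tab-separated aggregation result with the key in the first column.
aggregates = LOAD '/my/input/aggregates' USING PigStorage('\t')
             AS (key:chararray, total:long);

-- '0' is the zero-based index of the field whose value selects the output directory.
STORE aggregates INTO '/my/output/path/by'
      USING org.apache.pig.piggybank.storage.MultiStorage('/my/output/path/by', '0');
```

Each distinct key value should end up in its own subdirectory under /my/output/path/by. A subdirectory may hold more than one part file, since every task that sees a key writes its own file, and the output is delimited text rather than Avro. The aliases Input and Output from the original example are avoided here because both appear in Pig's list of reserved keywords.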
-
Re: Best Practice: store depending on data content
Markus Resch 2012-06-28, 09:17
Thanks Thejas,

This _really_ helped a lot :)
One additional question: as far as I can see, MultiStorage is currently only capable of writing CSV output, right? Is there any ongoing effort to make this storage more generic regarding the format of the output data? For our needs we would require Avro output as well as a special proprietary binary encoding for which we have already created our own storage. I'm thinking of a storage that selects a writer method depending on the file name's ending.

Do you know of such efforts?

Thanks
Markus
-
Re: Best Practice: store depending on data content
Thejas Nair 2012-06-28, 17:27
I am not aware of any work on adding those features to MultiStorage.

I think the best way to do this is to use HCatalog. (It makes the Hive metastore available to all of Hadoop, so you get metadata for your data as well.) You can associate an output format and SerDe with a table (instead of relying on the file name ending), and HCatStorer will automatically pick the right format.

Thanks,
Thejas
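A minimal sketch of the HCatalog route, assuming a table named aggregates already exists in the metastore and is partitioned by a string column named key (both names are made up for this example). When no partition specification is passed to HCatStorer, the partition column is expected to be present in the relation, and one partition is written per distinct value. The class name is the one used in the HCatalog 0.4 line referenced in this thread; later releases may use a different package.

```pig
-- Assumed table, created beforehand through HCatalog/Hive DDL, for example:
--   CREATE TABLE aggregates (total BIGINT) PARTITIONED BY (key STRING);
-- The output format and SerDe are whatever that table definition declares.

-- Hypothetical tab-separated aggregation result.
aggregates = LOAD '/my/input/aggregates' USING PigStorage('\t')
             AS (total:long, key:chararray);

-- No partition spec: partitions are derived from the 'key' column of each record.
STORE aggregates INTO 'aggregates' USING org.apache.hcatalog.pig.HCatStorer();
```

Pig has to be started with the HCatalog and Hive jars on its classpath for this to run; how that is done depends on the installation.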
-
Re: Best Practice: store depending on data content
Ruslan Al-Fakikh 2012-06-28, 17:59
Hi Markus,

Currently I am doing almost the same task, but in Hive. In Hive you can use the native Avro+Hive integration:
https://issues.apache.org/jira/browse/HIVE-895
Or the haivvreo project if you are not using the latest version of Hive. There is also a Dynamic Partition feature in Hive that can separate your data by a column value.

As for HCatalog, I decided not to use it after some investigation, because:
1) It is still incubating
2) It is not supported by Cloudera (the distribution provider we are currently using)

I think it would be perfect if MultiStorage were generic in the way you described, but I am not familiar with it.

Ruslan
-
Re: Best Practice: store depending on data content
Alan Gates 2012-06-29, 17:13
On a different topic, I'm interested in why you refuse to use a project in the incubator. Incubation is the Apache process by which a community is built around the code. It says nothing about the maturity of the code.

Alan.
-
Re: Best Practice: store depending on data content
Ruslan Al-Fakikh 2012-07-02, 12:57
Hey Alan,

I am not familiar with Apache processes, so I could be wrong on my point 1, sorry. Basically my impression was that Cloudera is pushing the Avro format for intercommunication between Hadoop tools like Pig, Hive and MapReduce:
https://ccp.cloudera.com/display/CDHDOC/Avro+Usage
http://www.cloudera.com/blog/2011/07/avro-data-interop/
And if I decide to use Avro, then HCatalog becomes a little redundant. It would give me the list of datasets in one place accessible from all tools, but all the columns (names and types) would be stored in Avro schemas, and the Hive metastore becomes just a stub for those Avro schemas:
https://github.com/jghoman/haivvreo#creating-avro-backed-hive-tables
And having those Avro schemas I could access the data from Pig and MapReduce without HCatalog. Though I haven't figured out how to manage without Hive partitions yet.

Best Regards,
Ruslan
-
Re: Best Practice: store depending on data content
Alan Gates 2012-07-02, 18:43
On Jul 2, 2012, at 5:57 AM, Ruslan Al-Fakikh wrote:

> Hey Alan,
>
> I am not familiar with Apache processes, so I could be wrong in my
> point 1, I am sorry.

I wasn't trying to say you were right or wrong, just trying to understand your perspective.

> Basically my impressions was that Cloudera is pushing Avro format for
> intercommunications between hadoop tools like pig, hive and mapreduce.
> https://ccp.cloudera.com/display/CDHDOC/Avro+Usage
> http://www.cloudera.com/blog/2011/07/avro-data-interop/
> And if I decide to use Avro then HCatalog becomes a little redundant.
> It would give me the list of datasets in one place accessible from all
> tools, but all the columns (names and types) would be stored in Avro
> schemas and Hive metastore becomes just a stub for those Avro schemas:
> https://github.com/jghoman/haivvreo#creating-avro-backed-hive-tables
> And having those avro schemas I could access data from pig and
> mapreduce without HCatalog. Though I haven't figured out how to deal
> without hive partitions yet.

It's true Avro can store schema data. HCatalog does much more than this and aspires to add to that set of features in the future. It will soon provide a REST API for external systems to interact with the metadata. It allows you to store data in HBase or other non-HDFS systems. In the future it will provide interfaces to data life cycle management tools like cleaning tools, replication tools, etc. And it does not bind you to one storage format.

That said, if you don't need any of these things Avro may be a good solution for your situation. Definitely choose the tool that best fits your need.

Alan.
-
Re: Best Practice: store depending on data content
Dmitriy Ryaboy 2012-07-02, 17:37
"It would give me the list of datasets in one place accessible from all tools,"

And that's exactly why you want it.

D
-
Re: Best Practice: store depending on data content
Ruslan Al-Fakikh 2012-07-03, 09:56
Dmitriy,

In our organization we use file paths for this purpose, like this:
/incoming/datasetA
/incoming/datasetB
/reports/datasetC
etc.

Best Regards,
Ruslan Al-Fakikh
-
Re: Best Practice: store depending on data content
Dmitriy Ryaboy 2012-07-04, 00:37
Imagine increasing the number of datasets by a couple of orders of magnitude. "ls" stops being a good browsing tool pretty quickly.

Then add the need to manage quotas and retention policies for different data producers, to find resources across multiple teams, to have a web UI for easy metadata search...

(And now we are totally and thoroughly off-topic. Sorry.)

D
-
Re: Best Practice: store depending on data content
Ruslan Al-Fakikh 2012-07-05, 15:01
That is a very interesting off-topic :) I think I will reinvestigate HCatalog some day and come up with specific questions. Thanks a lot for explaining.
-
Re: Best Practice: store depending on data content
Markus Resch 2012-07-03, 11:30
In our case we have
/result/CustomerId1
/result/CustomerId2
/result/CustomerId3
/result/CustomerId4
[...]

As we have a _lot_ of customers ;) we don't want to add an extra line of code to each script. I think MultiStorage is perfect for our use case, but we need to extend it for Avro usage.

Best
Markus
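For reference, the per-customer layout described above can be sketched with the stock piggybank MultiStorage roughly as follows. This is only a minimal sketch under assumptions: the piggybank.jar path, the input location '/my/aggregates', and the relation and field names are hypothetical and not taken from the thread. It also writes delimited text only, which is exactly the limitation discussed in this thread, so it does not cover the Avro case.

```pig
-- Hypothetical jar location; adjust to wherever piggybank.jar lives.
REGISTER /path/to/piggybank.jar;

-- Hypothetical input: aggregated rows with the customer id in field 0.
results = LOAD '/my/aggregates' USING PigStorage('\t')
          AS (customer_id:chararray, metric:chararray, value:long);

-- MultiStorage(parentPath, splitFieldIndex): each row is written below
-- a subdirectory of /result named after the value of field 0.
STORE results INTO '/result'
      USING org.apache.pig.piggybank.storage.MultiStorage('/result', '0');
```

With this, a single STORE statement should produce one directory per distinct customer id (for example /result/CustomerId1), so no extra line per customer is needed in the script.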