Daniel Käfer 2012-10-25, 19:24
Hello all,

I'm looking for a reference architecture for Hadoop. The only result I found is the Lambda Architecture from Nathan Marz [0].

By architecture I mean answers to questions like:
- How should I store the data? CSV, Thrift, ProtoBuf?
- How should I model the data? ER model, star schema, something new?
- Normalized or denormalized, or both (master data normalized, then transformed to denormalized, like ETL)?
- How should I combine databases and HDFS files?

Are there any other documented architectures for Hadoop?

Regards
Daniel Käfer

[0] http://www.manning.com/marz/ (only a preprint so far, not completed)
Re: reference architecture
Steve Loughran 2012-10-25, 21:10
On 25 October 2012 20:24, Daniel Käfer <[EMAIL PROTECTED]> wrote:

> Hello all,
>
> I'm looking for a reference architecture for hadoop. The only result I
> found is Lambda architecture from Nathan Marz[0].

I quite like the new Hadoop in Practice for a lot of that, especially the answer to #2, "how to store the data", where he looks at all the options. Joining is the other big issue.

http://steveloughran.blogspot.co.uk/2012/10/hadoop-in-practice-applied-hadoop.html

Regarding storing DB data, HBase-on-HDFS is where people keep it; Pig and Hive can work with that as well as rawer data kept in HDFS directly.

> With architecture I mean answers to question like:
> - How should I store the data? CSV, Thrift, ProtoBuf
> - How should I model the data? ER-Model, Starschema, something new?
> - normalized or denormalized or both (master data normalized, then
>   transformation to denormalized, like ETL)
> - How should I combine database and HDFS-Files?
>
> Are there any other documented architectures for hadoop?
>
> Regards
> Daniel Käfer
>
> [0] http://www.manning.com/marz/ just a preprint yet, not completed
Re: reference architecture
Daniel Käfer 2012-10-25, 22:17
On Thursday, 25.10.2012, at 22:10 +0100, Steve Loughran wrote:
> I quite like the new Hadoop in Practice for a lot of that, especially
> the answer to #2, "how to store the data", where he looks at all the
> options

Part 3, "Big Data Patterns", looks very interesting. I am going to read the book.

On Thursday, 25.10.2012, at 22:10 +0100, Steve Loughran wrote:
> Regarding storing DB data, HBase-on-HDFS is where people keep it; Pig
> and Hive can work with that as well as rawer data kept in HDFS
> directly

But is that the best idea? HBase is great for random reads and small range scans, but Hive (SQL) performance on HBase is 4-5x slower than on plain HDFS [0].

I guess keeping the first data (raw data) in HDFS and the final data in HBase is a good idea. But how should the data be stored between individual MapReduce jobs?

[0] Todd Lipcon, http://de.slideshare.net/cloudera/chicago-data-summit-apache-hbase-an-introduction p. 19. I have not benchmarked the performance myself.
Re: reference architecture
Steve Loughran 2012-10-26, 17:25
On 25 October 2012 23:17, Daniel Käfer <[EMAIL PROTECTED]> wrote:
> On Thursday, 25.10.2012, at 22:10 +0100, Steve Loughran wrote:
> > Regarding storing DB data, HBase-on-HDFS is where people keep it; Pig
> > and Hive can work with that as well as rawer data kept in HDFS
> > directly
>
> But is that the best idea? HBase is great for random read and small
> range scan. But the Hive (SQL) performance is 4-5x slower than plain
> HDFS. [0]
>
> I guess first data (raw data) in HDFS and last data in HBase is a good
> idea. But how to store the data between individual mapreduce jobs?

Depends on the amount of data and expected use. If it's transient food for the next MR jobs: HDFS
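
A minimal sketch of that pattern, assuming a local temporary directory as a stand-in for a transient HDFS path. The function names `job_one`, `job_two`, and `run_pipeline`, the `user,amount` input layout, and the JSON-lines intermediate format are all hypothetical and not taken from the thread:

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def job_one(raw_lines, out_dir):
    """First job: parse raw 'user,amount' lines into one JSON record per line."""
    out_dir.mkdir(parents=True, exist_ok=True)
    part = out_dir / "part-00000"
    with part.open("w", encoding="utf-8") as fh:
        for line in raw_lines:
            user, amount = line.strip().split(",")
            fh.write(json.dumps({"user": user, "amount": int(amount)}) + "\n")
    return part


def job_two(in_dir):
    """Second job: read every part file left by the first job, sum per user."""
    totals = defaultdict(int)
    for part in sorted(in_dir.glob("part-*")):
        with part.open(encoding="utf-8") as fh:
            for line in fh:
                record = json.loads(line)
                totals[record["user"]] += record["amount"]
    return dict(totals)


def run_pipeline(raw_lines):
    # Assumption: the temporary directory plays the role of a transient
    # HDFS path that is removed once the downstream job has consumed it.
    with tempfile.TemporaryDirectory() as tmp:
        intermediate = Path(tmp) / "job1-output"
        job_one(raw_lines, intermediate)
        return job_two(intermediate)


if __name__ == "__main__":
    print(run_pipeline(["alice,1", "bob,2", "alice,3"]))
```

The intermediate directory exists only for the lifetime of the pipeline, which mirrors treating the data between jobs as disposable rather than loading it into a database.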
Re: reference architecture
Daniel Käfer 2012-10-27, 08:09
On Friday, 26.10.2012, at 18:25 +0100, Steve Loughran wrote:
> Depends on the amount of data and expected use. If it's transient food
> for the next MR jobs: HDFS

Thanks for your help
Re: reference architecture
Russell Jurney 2012-10-27, 08:42
I define one of these in the book Agile Data, from O'Reilly. I express opinions on all the matters you ask about.

But you don't have to take my word for it... It's a reading rainbow! Jordi!

Russell Jurney http://datasyndrome.com

On Oct 27, 2012, at 1:09 AM, "Daniel Käfer" <[EMAIL PROTECTED]> wrote:

> On Friday, 26.10.2012, at 18:25 +0100, Steve Loughran wrote:
>> Depends on the amount of data and expected use. If it's transient food
>> for the next MR jobs: HDFS
>
> Thanks for your help
Re: reference architecture
Russell Jurney 2012-10-27, 09:19
Russell Jurney http://datasyndrome.com

On Oct 25, 2012, at 12:24 PM, "Daniel Käfer" <[EMAIL PROTECTED]> wrote:

> Hello all,
>
> I'm looking for a reference architecture for hadoop. The only result I
> found is Lambda architecture from Nathan Marz[0].
>
> With architecture I mean answers to question like:
> - How should I store the data? CSV, Thrift, ProtoBuf

You should use Avro.

> - How should I model the data? ER-Model, Starschema, something new?

You should use document format.

> - normalized or denormalized or both (master data normalized, then
>   transformation to denormalized, like ETL)

Denormalized fully, into document format.

> - How should I combine database and HDFS-Files?

Don't. Put everything on HDFS.

> Are there any other documented architectures for hadoop?

I really did make an example in my book. It is just one example, but you wanted answers to questions that always 'depend.' You can check it out in slides: http://www.slideshare.net/mobile/hortonworks/agile-analytics-applications-on-hadoop

> Regards
> Daniel Käfer
>
> [0] http://www.manning.com/marz/ just a preprint yet, not completed
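
To make "denormalized fully, into document format" concrete, here is a minimal sketch, not taken from the book or the slides. The customer/order tables, the field names, and the `denormalize` helper are hypothetical; the schema is written in Avro's JSON schema syntax, and actually serializing records would need an Avro library, which is not shown here:

```python
import json

# Hypothetical Avro schema (JSON syntax) for one denormalized customer document.
CUSTOMER_DOCUMENT_SCHEMA = {
    "type": "record",
    "name": "CustomerDocument",
    "fields": [
        {"name": "customer_id", "type": "int"},
        {"name": "name", "type": "string"},
        {
            "name": "orders",
            "type": {
                "type": "array",
                "items": {
                    "type": "record",
                    "name": "Order",
                    "fields": [
                        {"name": "order_id", "type": "int"},
                        {"name": "total", "type": "double"},
                    ],
                },
            },
        },
    ],
}


def denormalize(customers, orders):
    """Fold normalized order rows into one nested document per customer."""
    orders_by_customer = {}
    for order in orders:
        orders_by_customer.setdefault(order["customer_id"], []).append(
            {"order_id": order["order_id"], "total": order["total"]}
        )
    return [
        {
            "customer_id": customer["customer_id"],
            "name": customer["name"],
            # Customers without orders still get an (empty) orders array.
            "orders": orders_by_customer.get(customer["customer_id"], []),
        }
        for customer in customers
    ]


if __name__ == "__main__":
    customers = [{"customer_id": 1, "name": "Ada"}, {"customer_id": 2, "name": "Bob"}]
    orders = [
        {"order_id": 10, "customer_id": 1, "total": 9.5},
        {"order_id": 11, "customer_id": 1, "total": 20.0},
    ]
    for document in denormalize(customers, orders):
        print(json.dumps(document))
```

Each document carries everything a downstream job needs about one customer, so no join against a separate orders table is required at read time.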
Re: reference architecture
Mohammad Tariq 2012-10-27, 09:26
Thank you so much everybody, for the valuable comments.

On Saturday, October 27, 2012, Russell Jurney <[EMAIL PROTECTED]> wrote:

> Russell Jurney http://datasyndrome.com
>
> On Oct 25, 2012, at 12:24 PM, "Daniel Käfer" <[EMAIL PROTECTED]> wrote:
>
>> Hello all,
>>
>> I'm looking for a reference architecture for hadoop. The only result I
>> found is Lambda architecture from Nathan Marz[0].
>>
>> With architecture I mean answers to question like:
>> - How should I store the data? CSV, Thrift, ProtoBuf
> You should use Avro.
>> - How should I model the data? ER-Model, Starschema, something new?
> You should use document format.
>> - normalized or denormalized or both (master data normalized, then
>>   transformation to denormalized, like ETL)
> Denormalized fully, into document format.
>> - How should I combine database and HDFS-Files?
> Don't. Put everything on HDFS.
>>
>> Are there any other documented architectures for hadoop?
> I really did make an example in my book. It is just one example, but
> you wanted answers to questions that always 'depend.' You can check it
> out in slides: http://www.slideshare.net/mobile/hortonworks/agile-analytics-applications-on-hadoop
>>
>> Regards
>> Daniel Käfer
>>
>> [0] http://www.manning.com/marz/ just a preprint yet, not completed

--
Regards,
Mohammad Tariq
Re: reference architecture
Daniel Käfer 2012-10-29, 21:16
Thank you, that book is exactly what I'm looking for.

Regards
Daniel Käfer

On Saturday, 27.10.2012, at 02:19 -0700, Russell Jurney wrote:
> Russell Jurney http://datasyndrome.com
>
> On Oct 25, 2012, at 12:24 PM, "Daniel Käfer" <[EMAIL PROTECTED]> wrote:
>
> > Hello all,
> >
> > I'm looking for a reference architecture for hadoop. The only result I
> > found is Lambda architecture from Nathan Marz[0].
> >
> > With architecture I mean answers to question like:
> > - How should I store the data? CSV, Thrift, ProtoBuf
> You should use Avro.
> > - How should I model the data? ER-Model, Starschema, something new?
> You should use document format.
> > - normalized or denormalized or both (master data normalized, then
> >   transformation to denormalized, like ETL)
> Denormalized fully, into document format.
> > - How should I combine database and HDFS-Files?
> Don't. Put everything on HDFS.
> >
> > Are there any other documented architectures for hadoop?
> I really did make an example in my book. It is just one example, but
> you wanted answers to questions that always 'depend.' You can check it
> out in slides: http://www.slideshare.net/mobile/hortonworks/agile-analytics-applications-on-hadoop
> >
> > Regards
> > Daniel Käfer
> >
> > [0] http://www.manning.com/marz/ just a preprint yet, not completed