-
MR on HDFS data inserted via HBase? Otis Gospodnetic 2010-01-14, 04:06
Hello,

If I import data into HBase, can I still run a hand-written MapReduce job over that data in HDFS? That is, not using TableInputFormat to read the data back out via HBase.

Similarly, can one run Hive or Pig scripts against that data, but again, without Hive or Pig reading the data via HBase, but rather getting to it directly via HDFS? I'm asking because I'm wondering whether storing data in HBase means I can no longer use Hive and Pig to run my ad-hoc jobs.

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
-
Re: MR on HDFS data inserted via HBase? Amandeep Khurana 2010-01-14, 04:08
HBase has its own file format. Writing your own job to read data from it will not be trivial, but it is not impossible.

Why would you want to use the underlying data files in the MR jobs? Is there any limitation in using the HBase API?
-
Re: MR on HDFS data inserted via HBase? Otis Gospodnetic 2010-01-14, 04:28
Hello,

----- Original Message ----
> From: Amandeep Khurana <[EMAIL PROTECTED]>

> HBase has its own file format. Reading data from it in your own job will not
> be trivial to write, but not impossible.

You are referring to HTable, HFile, etc.?

> Why would you want to use the underlying data files in the MR jobs? Any
> limitation in using the HBase api?

Are you referring to writing an MR job that makes use of TableInputFormat and TableOutputFormat as mentioned on http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/mapreduce/package-summary.html#sink ?

I think that would work.

But I'd also like to be able to run Hive/Pig scripts over the data, and I *think* neither supports reading it from HBase. They can obviously read it from files in HDFS, which is why I was asking. But it sounds like anything that wants to read HBase's data behind its back, without going through the HBase API, would have to know how to read from HFile & friends? (And again, I think/assume Hive and Pig don't know how to do that.)

Thanks,
Otis
-
Re: MR on HDFS data inserted via HBase? Ryan Rawson 2010-01-14, 04:35
Hey,

It isn't as simple as 'read HBase's files'. You will also need to:
- get at data that is only available in the memory of the regionserver
- merge multiple HFiles
- do delete processing, etc., i.e. reproduce the regionserver read path

Due to #1, I don't feel like this is a particularly fruitful avenue of approach.

-ryan
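The three points in the list above can be illustrated with a toy model. This is a minimal Python sketch, not HBase code: the `Cell` tuple, `read_path`, and the sample data are all hypothetical names invented for illustration, and real HBase additionally deals with column families, qualifiers, multiple versions, and several kinds of delete markers. It only shows why the files on disk alone give a different answer than the full read path.

```python
import heapq
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical toy cell: (row_key, timestamp, value).
# A value of None stands in for a delete marker (tombstone).
Cell = Tuple[str, int, Optional[str]]


def _sort_key(cell: Cell) -> Tuple[str, int]:
    # Row key ascending, then newest timestamp first.
    return (cell[0], -cell[1])


def read_path(memstore: Iterable[Cell],
              store_files: List[Iterable[Cell]]) -> Dict[str, str]:
    """Merge in-memory cells with every store file and resolve each row.

    Assumption for this sketch: one value per row, newest timestamp wins,
    and a tombstone as the newest cell hides the row entirely.
    """
    sources = [sorted(memstore, key=_sort_key)]
    sources += [sorted(f, key=_sort_key) for f in store_files]
    merged = heapq.merge(*sources, key=_sort_key)

    result: Dict[str, str] = {}
    seen = set()
    for row, _ts, value in merged:
        if row in seen:
            continue  # a newer cell for this row was already handled
        seen.add(row)
        if value is not None:  # skip rows whose newest cell is a tombstone
            result[row] = value
    return result


# Two flushed files plus edits that exist only in memory.
file_a = [("row1", 1, "a"), ("row2", 1, "b")]
file_b = [("row1", 2, "a2"), ("row3", 3, "c")]
memstore = [("row2", 5, None), ("row4", 6, "d")]

full_view = read_path(memstore, [file_a, file_b])
# {'row1': 'a2', 'row3': 'c', 'row4': 'd'}

files_only = read_path([], [file_a, file_b])
# {'row1': 'a2', 'row2': 'b', 'row3': 'c'}
```

In this toy model, a job that reads only the files still sees the deleted `row2` and never sees `row4`, because both changes live only in memory. Merging the files correctly is doable; the in-memory part is what a job reading HDFS directly cannot reach.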
-
Re: MR on HDFS data inserted via HBase? Amandeep Khurana 2010-01-14, 04:36
Yes, by API I mean TableInputFormat and TableOutputFormat.

Pig has a connector to HBase. I'm not sure if Hive has one yet.

Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz
-
Re: MR on HDFS data inserted via HBase? Amandeep Khurana 2010-01-14, 04:37
> - data that is only available in memory of the regionserver

Precisely the reason why I said it's non-trivial.
-
Re: MR on HDFS data inserted via HBase? Otis Gospodnetic 2010-01-14, 05:03
Thanks. That has already turned me off the idea. :) Thanks for the quick advice, Amandeep & Ryan! (Saw that 1M inserts/sec, impressive.)

Otis
-
Re: MR on HDFS data inserted via HBase? Andrew Purtell 2010-01-14, 10:29
There is ongoing work on an HBase SerDe for Hive:

https://issues.apache.org/jira/browse/HIVE-705
https://issues.apache.org/jira/browse/HIVE-806

- Andy
-
Re: MR on HDFS data inserted via HBase? Otis Gospodnetic 2010-01-14, 16:16
Yeah, I'm watching them in JIRA. Thanks.

Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch