Sam Seigal 2011-04-14, 01:12
I have a requirement where I have large sets of incoming data into a system I own.
A single unit of data in this set has a set of immutable attributes plus state attached to it. The state is dynamic and can change at any time. What is the best way to run analytical queries on data of this nature?
One way is to maintain this data in a separate store, take a point-in-time snapshot, and then import it into HDFS for analysis using Hadoop MapReduce. I do not see this approach scaling, since moving data is obviously expensive. If I were to maintain this data directly as SequenceFiles in HDFS, how would updates work?
I am new to Hadoop/HDFS, so any suggestions or critique are welcome. I know that HBase works around this problem through multi-version concurrency control techniques. Is that the only option? Are there any alternatives?
Also note that all the aggregation and analysis I want to do is time based, i.e. sum of x on pivot y over a day, 2 days, a week, a month, etc. For such use cases, is it advisable to use HDFS directly, or to use systems built on top of Hadoop such as Hive or HBase?
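To make the time-based aggregation concrete, here is a minimal sketch of "sum of x on pivot y" over a trailing window. It is plain in-memory Python, not Hadoop, Hive, or HBase code; the record layout, the sample values, and the function name `sum_by_pivot` are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical records: (timestamp, pivot y, value x). Layout is illustrative only.
records = [
    (datetime(2011, 4, 8, 12, 0), "A", 10),
    (datetime(2011, 4, 13, 9, 0), "A", 5),
    (datetime(2011, 4, 13, 18, 0), "B", 2),
    (datetime(2011, 4, 14, 1, 0), "A", 3),
]

def sum_by_pivot(records, end, window):
    """Sum x per pivot y for records with end - window < timestamp <= end."""
    start = end - window
    totals = defaultdict(int)
    for ts, pivot, x in records:
        if start < ts <= end:
            totals[pivot] += x
    return dict(totals)

end = datetime(2011, 4, 14, 12, 0)
one_day = sum_by_pivot(records, end, timedelta(days=1))
one_week = sum_by_pivot(records, end, timedelta(weeks=1))
```

The same function covers a day, 2 days, a week, or a month by changing only the `window` argument; at scale, the loop would be replaced by whatever the chosen system provides for grouping and summing.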
Ted Dunning 2011-04-14, 03:12
HBase is very good at this kind of thing.
Depending on your aggregation needs, OpenTSDB might be interesting, since it stores and queries large amounts of time-ordered data similar to what you want to do.
It isn't clear to me whether your data is primarily about current state or about time-embedded state transitions. You can easily store both in HBase, but the arrangements will be a bit different.
Michael Segel 2011-04-14, 17:06
James, if I understand correctly, you get a set of immutable attributes and then a state which can change.
If you wanted to use HBase, I'd say create a unique identifier for your immutable attributes, then store the unique id, timestamp, and state. This assumes that you're really interested in looking at the state change over time.
So what you end up with is one table of immutable attributes with a unique key, and then another table that uses the same unique key as the row key, with timestamps as the column names and the state as the value.
HTH
-Mike
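The two-table layout described in this message can be sketched as follows. This is an in-memory model using plain dicts, not the HBase client API; the table names, helper functions, and sample values are all hypothetical.

```python
# In-memory stand-ins for two HBase tables (plain dicts, not the HBase API).
attributes_by_id = {}  # row key -> immutable attributes
states_by_id = {}      # row key -> {timestamp column -> state value}

def put_unit(unit_id, attributes):
    # Immutable attributes are written once; a repeated write is ignored.
    attributes_by_id.setdefault(unit_id, dict(attributes))

def put_state(unit_id, timestamp, state):
    # One column per timestamp, with the state as the cell value.
    states_by_id.setdefault(unit_id, {})[timestamp] = state

def state_history(unit_id):
    # State changes in timestamp order.
    return sorted(states_by_id.get(unit_id, {}).items())

put_unit("unit-1", {"color": "red"})
put_state("unit-1", 1302743578, "A")
put_state("unit-1", 1302800000, "B")
history = state_history("unit-1")
```

Keeping the attributes and the states under the same key means either table can be read on its own, and a unit's full record is a lookup in both.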
James Seigel Tynt 2011-04-14, 17:18
If all the Seigel/Seigal/Segel gang don't chime in, it'd be weird.
What size of data are we talking?
James
Michel Segel 2011-04-14, 19:19
Sorry, it appears to be a flock of us...
OK, bad pun...
I didn't see Ted's response, but it looks like we're thinking along the same lines. I was going to ask about the data size too, but it's really a moot point. The size of the immutable data set doesn't really matter; the solution would be the same. Consider it some blob which is >= the size of a SHA-1 hash value. In fact, that hash could be your unique key.
So you get your blob, timestamp, and then state value. You hash the blob, store the blob in one table using the hash as the row key, and then store the state in a second table, with the timestamp as the column name and the hash value as the row key. Use two separate tables, because if you stored them as separate column families you might have some performance issues due to the size difference between the column families.
This would be a pretty straightforward solution in HBase.
Sent from a remote device. Please excuse any typos...
Mike Segel
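The hashing step above can be sketched as follows. This uses Python's standard `hashlib` and plain dicts in place of HBase tables; the names `blob_table`, `state_columns`, `row_key`, and `record_state` are hypothetical, as are the sample blob and timestamps.

```python
import hashlib

# In-memory stand-ins for the two tables (plain dicts, not the HBase API).
blob_table = {}     # SHA-1 hex digest -> immutable blob
state_columns = {}  # SHA-1 hex digest -> {timestamp column -> state}

def row_key(blob: bytes) -> str:
    # SHA-1 of the immutable blob serves as the unique row key in both tables.
    return hashlib.sha1(blob).hexdigest()

def record_state(blob: bytes, timestamp: int, state: str) -> str:
    key = row_key(blob)
    blob_table.setdefault(key, blob)                      # blob stored once
    state_columns.setdefault(key, {})[timestamp] = state  # one column per timestamp
    return key

key_first = record_state(b"immutable attributes", 1302743578, "A")
key_second = record_state(b"immutable attributes", 1302800000, "B")
```

Because the key is derived from the blob itself, a later state update for the same unit lands in the same row without any lookup to find its identifier.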
Sam Seigal 2011-04-15, 00:55
How does HBase compare to Hive when it comes to dynamic data sets? Does Hive support multi-version concurrency control? I am new to Hadoop, so I am trying to get an idea of how to evaluate these different technologies and provide concrete justifications for choosing one over the other.
Also, I am not interested in how a state changes over time. I am only interested in the current state of a data unit, and in aggregating it with other data in the same state over a time range (for example, 5000 records exist in state A on April 14th, 2000 records exist in state B on April 13th, etc.). The analysis will vary depending on how the state changes over time.
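A minimal sketch of the "current state" aggregation described here, in plain Python. It assumes each unit is bucketed by the day of its most recent state change, which is one possible reading of the requirement; the event layout and the name `current_state_counts` are hypothetical.

```python
from collections import Counter
from datetime import date

# Hypothetical state-change events: (unit id, day of change, new state).
events = [
    ("u1", date(2011, 4, 13), "A"),
    ("u1", date(2011, 4, 14), "B"),
    ("u2", date(2011, 4, 13), "B"),
    ("u3", date(2011, 4, 14), "A"),
]

def current_state_counts(events):
    """Count units per (day, state), using only each unit's latest state."""
    latest = {}
    for unit, day, state in events:
        # Keep the most recent change per unit; earlier states are discarded.
        if unit not in latest or day >= latest[unit][0]:
            latest[unit] = (day, state)
    return Counter(latest.values())

counts = current_state_counts(events)
```

Here `u1` is counted only under state B on April 14th, since its earlier state A has been superseded.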