RE: Dynamic Data Sets

If I understand correctly, you get a set of immutable attributes plus a state that can change.

If you wanted to use HBase...
I'd say create a unique identifier for your immutable attributes, then store the unique id, timestamp, and state, assuming
that you're really interested in looking at how the state changes over time.

So what you end up with is one table of immutable attributes, keyed by the unique id, and a second table that uses the same unique id as its row key, where each column name is a timestamp and the column value is the state at that time.
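
To make the layout concrete, here is a rough sketch of the two tables using plain Python dicts in place of HBase. It is not HBase client code. The names (attributes, state_history, put_state, state_at, states_between) are made up for illustration, and timestamps are assumed to be plain integers.

```python
from collections import defaultdict

# Table 1 (hypothetical): row key = unique id, value = immutable attributes.
attributes = {}

# Table 2 (hypothetical): row key = the same unique id,
# columns = {timestamp: state}, one column per state change.
state_history = defaultdict(dict)


def put_attributes(uid, attrs):
    """Write the immutable attributes once; refuse to overwrite them."""
    if uid in attributes:
        raise KeyError(f"attributes for {uid!r} already exist")
    attributes[uid] = dict(attrs)


def put_state(uid, timestamp, state):
    """Record a state change as a new column named by its timestamp."""
    state_history[uid][timestamp] = state


def states_between(uid, start, end):
    """Return (timestamp, state) pairs with start <= timestamp < end, oldest first."""
    row = state_history.get(uid, {})
    return sorted((ts, s) for ts, s in row.items() if start <= ts < end)


def state_at(uid, timestamp):
    """Return the most recent state at or before timestamp, or None."""
    row = state_history.get(uid, {})
    earlier = [ts for ts in row if ts <= timestamp]
    return row[max(earlier)] if earlier else None
```

A state change never rewrites an existing cell. It only adds a column to the row, so the full history stays available for time-based queries such as states_between(uid, start_of_day, end_of_day).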


> Date: Wed, 13 Apr 2011 18:12:58 -0700
> Subject: Dynamic Data Sets
> I have a requirement where I have large sets of incoming data into a
> system I own.
> A single unit of data in this set has a set of immutable attributes +
> state attached to it. The state is dynamic and can change at any time.
> What is the best way to run analytical queries on data of this nature?
> One way is to maintain this data in a separate store, take a snapshot
> in point of time, and then import into the HDFS filesystem for
> analysis using Hadoop Map-Reduce. I do not see this approach scaling,
> since moving data is obviously expensive.
> If I were to maintain this data directly as Sequence Files in HDFS, how
> would updates work?
> I am new to Hadoop/HDFS, so any suggestions/critique are welcome. I
> know that HBase works around this problem through multi-version
> concurrency control techniques. Is that the only option? Are there
> any alternatives?
> Also note that all the aggregation and analysis I want to do is time based,
> i.e. sum of x on pivot y over a day, 2 days, a week, a month, etc. For such
> use cases, is it advisable to use HDFS directly or to use systems built
> on top of Hadoop like Hive or HBase?