Hive >> mail # user >> What's the right data storage/representation?

Re: What's the right data storage/representation?
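Hive tables can sit directly on top of S3 storage, so you don't really need a separate export process. If the update history is an external table whose location is an S3 path, the data outlives the cluster, and a new cluster only needs the table definition re-created.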
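
As a rough illustration of that idea, here is a minimal Python sketch that builds the HiveQL for an external table backed by S3. The table name, column names, and bucket path are all made up for the example, and the tab-delimited text layout is just an assumption; adjust to whatever your data actually looks like.

```python
def external_table_ddl(table, columns, s3_location):
    """Build a HiveQL statement for an external table whose data lives in S3.

    `columns` is a list of (name, hive_type) pairs. All names passed in
    below are hypothetical and only serve the example.
    """
    cols = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{s3_location}'"
    )


# Hypothetical update-history table; the bucket and path are placeholders.
ddl = external_table_ddl(
    "status_history",
    [
        ("item_id", "BIGINT"),
        ("status", "STRING"),
        ("location", "STRING"),
        ("changed_at", "STRING"),
    ],
    "s3://example-bucket/hive/status_history/",
)
print(ddl)
```

Because the table is declared EXTERNAL, dropping it (or losing the cluster) removes only the metadata, not the files under the S3 location. Running the same statement on a freshly started cluster should make the existing data queryable again.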

On May 15, 2012, at 11:35 AM, Jon Palmer wrote:

> That seems like a very reasonable approach. However, if we use a technology like Amazon Elastic MapReduce, my Hive cluster is (potentially) going to be destroyed and recreated. As a result, I'd need to export the update-history Hive table to some other store (like S3) so that it can be re-imported the next time the Hive cluster is spun up. Do I have that right?
> Jon
> -----Original Message-----
> From: shrikanth shankar [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, May 15, 2012 1:14 PM
> Subject: Re: What's the right data storage/representation?
> I would agree on keeping track of the history of updates in a separate table in Hive (you may not need to maintain it in the application tier). This seems to be the "Slowly Changing Dimension" pattern used in other (more traditional) data warehouses. I suspect the challenge here would be writing an ETL process to maintain the Hive table based on the current status of the application db table.
> Shrikanth
> On May 15, 2012, at 9:41 AM, Owen O'Malley wrote:
>> On Tue, May 15, 2012 at 5:11 AM, Jon Palmer <[EMAIL PROTECTED]> wrote:
>>> I can see a few potential solutions:
>>> 1. Don't solve it. Accept that you have some artifacts in your
>>> reporting data that cannot be recovered from the source data.
>>> 2. Create status and location history tables in the application db and
>>> use that during the analytics process.
>>> 3. Log the status and location change 'events' to some other log file
>>> and use those logs in the Hive analysis.
>> I would probably create a Hive table that includes the status and
>> location updates. One of the advantages of Hive & Hadoop is that it is
>> easy to store the raw information in bulk and continue to process it.
>> Once you have the information, you will likely find new uses for it.
>> -- Owen
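
For the history table discussed in the quoted thread, here is a small Python sketch of the "Slowly Changing Dimension" bookkeeping: compare the current application-db snapshot against the history and append a new version only when status or location changed. The row layout (`valid_from` / `valid_to`), the function name, and the sample values are assumptions for illustration, not anything from the original posts; in practice this logic would live in the ETL job that maintains the Hive table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HistoryRow:
    item_id: int
    status: str
    location: str
    valid_from: str          # timestamp when this version became current
    valid_to: Optional[str]  # None while this version is still current


def apply_snapshot(history, snapshot, as_of):
    """Fold one snapshot of the application table into the history.

    `snapshot` maps item_id -> (status, location). Changed items get their
    current row closed at `as_of` and a new open row appended.
    """
    current = {row.item_id: row for row in history if row.valid_to is None}
    for item_id, (status, location) in snapshot.items():
        row = current.get(item_id)
        if row is not None and (row.status, row.location) == (status, location):
            continue  # unchanged since the last snapshot
        if row is not None:
            row.valid_to = as_of  # close the superseded version
        history.append(HistoryRow(item_id, status, location, as_of, None))
    return history


# Hypothetical daily snapshots of a single item.
history = []
apply_snapshot(history, {1: ("open", "Boston")}, "2012-05-14T00:00:00")
apply_snapshot(history, {1: ("shipped", "Boston")}, "2012-05-15T00:00:00")
apply_snapshot(history, {1: ("shipped", "Boston")}, "2012-05-16T00:00:00")
for row in history:
    print(row)
```

Note that a snapshot-based approach like this only sees the state at each run; changes that happen and revert between two runs are lost, which is the trade-off against logging every change event.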