Re: Efficient way to read a large number of files in S3 and upload their content to HBase

On 05/24/2012 04:47 PM, Amandeep Khurana wrote:
> Thanks for that description. I'm not entirely sure why you want to use
> HBase here. You've got incoming logs that you want to process in batch
> to do calculations on. This can be done by running MR jobs on the flat
> files themselves. You could use Java MR, Hive or Pig to accomplish this.
> Why do you want HBase here?
The main reason to use HBase is the number of rows involved in the
process. It could provide an efficient and "quick" way to store all of this.
Hive could be an option too.
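
For what it's worth, here is a rough, untested sketch of the flat-file
approach you describe, using plain Java MapReduce to count clicks per
campaign. The tab-separated layout, the position of the campaign id, and
the paths are only assumptions on my part:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.LongSumReducer;

public class ClicksPerCampaign {

  // Emits (campaignId, 1) for every click record in the flat log files.
  public static class ClickMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {

    private static final LongWritable ONE = new LongWritable(1);
    private final Text campaign = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Assumed layout: sourceUrl \t userId \t userAgent \t campaignId \t ...
      String[] fields = value.toString().split("\t");
      if (fields.length > 3) {
        campaign.set(fields[3]);
        context.write(campaign, ONE);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "clicks-per-campaign");
    job.setJarByClass(ClicksPerCampaign.class);
    job.setMapperClass(ClickMapper.class);
    job.setReducerClass(LongSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    // Could point at s3n:// directly, or at HDFS after a distcp.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}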

I will discuss all this again with the dev team.
Thanks a lot for your answers.
> -ak
> On Thursday, May 24, 2012 at 12:52 PM, Marcos Ortiz wrote:
>> On 05/24/2012 03:21 PM, Amandeep Khurana wrote:
>>> Marcos
>>> Can you elaborate on your use case a little bit? What is the nature of
>>> the data in S3, and why do you want to use HBase? Why do you want to
>>> combine HFiles and upload back to S3? It'll help us answer your
>>> questions better.
>>> Amandeep
>> Ok, let me explain more.
>> We are working on an ads optimization platform on top of Hadoop and HBase.
>> Another team in my organization creates a log file per user click
>> and stores this file in S3. I discussed with them that a better approach
>> would be to store this "workflow" log in HBase instead of S3, because
>> that way we can skip the extra step of reading the file content from S3,
>> building the HFile and uploading it to HBase.
>> The file in S3 contains the basic information for the operation:
>> - Source URL
>> - User ID
>> - User agent
>> - Campaign ID
>> and more fields.
>> We then want to run MapReduce jobs on top of HBase to do calculations
>> and generate reports on this data.
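
To illustrate what I mean by MapReduce jobs on top of HBase, something
along these lines could work (untested; the "clicks" table, the "d" family
and the qualifiers are just placeholders I'm making up):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.LongSumReducer;

public class CampaignReport {

  // Counts rows per campaign id read directly from the HBase table.
  static class ClickTableMapper extends TableMapper<Text, LongWritable> {

    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(ImmutableBytesWritable row, Result result, Context context)
        throws IOException, InterruptedException {
      byte[] campaign = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("campaign_id"));
      if (campaign != null) {
        context.write(new Text(Bytes.toString(campaign)), ONE);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "campaign-report");
    job.setJarByClass(CampaignReport.class);
    Scan scan = new Scan();
    scan.setCaching(500);        // larger scanner caching for MR scans
    scan.setCacheBlocks(false);  // don't pollute the block cache
    TableMapReduceUtil.initTableMapperJob(
        "clicks", scan, ClickTableMapper.class, Text.class, LongWritable.class, job);
    job.setReducerClass(LongSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileOutputFormat.setOutputPath(job, new Path(args[0]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}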
>> We are evaluating HBase because our current solution is on top of
>> PostgreSQL, and the main issue is that when a campaign is launched on
>> the platform, the INSERTs and UPDATEs to PostgreSQL can jump from 1 to
>> 100 clicks per second in a short time. In some preliminary tests, within
>> two days the table where we store the "workflow" log grew to 350,000
>> tuples, so it could become a problem.
>> For that reason, we want to migrate this to HBase.
>> But I think that generating a file in S3 and then uploading it to HBase
>> is not the best way to do this, because we can always create the
>> workflow log entry for every user click, build a Put for it, and write
>> it directly to HBase; and to avoid blocking on writes, I'm evaluating
>> the asynchronous API released by StumbleUpon. [1]
>> What do you think about this?
>> [1] https://github.com/stumbleupon/asynchbase
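
To make [1] more concrete, this is roughly what I have in mind for the
per-click write path (untested; the table name, column family, qualifiers
and row key layout are just placeholders):

import com.stumbleupon.async.Callback;
import com.stumbleupon.async.Deferred;
import org.hbase.async.HBaseClient;
import org.hbase.async.PutRequest;

public class AsyncClickWriter {

  private static final byte[] TABLE  = "clicks".getBytes();  // placeholder table
  private static final byte[] FAMILY = "d".getBytes();       // placeholder family

  private final HBaseClient client;

  public AsyncClickWriter(String zkQuorum) {
    this.client = new HBaseClient(zkQuorum);
  }

  // One click -> a couple of column writes; asynchbase batches and flushes
  // the RPCs in the background, so this call does not block the caller.
  public Deferred<Object> writeClick(String campaignId, String userId,
                                     String sourceUrl, String userAgent) {
    // Placeholder row key: campaign + user + timestamp.
    byte[] row = (campaignId + ":" + userId + ":" + System.currentTimeMillis()).getBytes();
    client.put(new PutRequest(TABLE, row, FAMILY, "ua".getBytes(), userAgent.getBytes()));
    Deferred<Object> d =
        client.put(new PutRequest(TABLE, row, FAMILY, "url".getBytes(), sourceUrl.getBytes()));
    return d.addErrback(new Callback<Object, Exception>() {
      public Object call(Exception e) {
        System.err.println("Put failed: " + e);  // log and swallow; real code might retry
        return null;
      }
    });
  }

  public void close() throws Exception {
    client.shutdown().joinUninterruptibly();  // flushes any pending RPCs
  }
}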
>>> On May 24, 2012, at 12:19 PM, Marcos Ortiz <[EMAIL PROTECTED]> wrote:
>>>> Thanks a lot for your answer, Amandeep.
>>>> On 05/24/2012 02:55 PM, Amandeep Khurana wrote:
>>>>> Marcos,
>>>>> You could do a distcp from S3 to HDFS and then do a bulk import
>>>>> into HBase.
>>>> The number of files is very large, so we want to combine some of the
>>>> files and then construct the HFile to upload to HBase.
>>>> Any example of a custom FileMerger for this?
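
I don't know of a ready-made FileMerger, but maybe the incremental bulk
load path already covers this: a job whose mapper turns the staged log
files into Puts, lets HFileOutputFormat write the region-aligned HFiles,
and then loads them with completebulkload. A rough, untested sketch
(table, family and field layout are assumptions):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ClickLogBulkLoad {

  // One log line -> one Put; no custom file merging is needed because
  // HFileOutputFormat plus the total order partitioner write region-aligned HFiles.
  static class LogToPutMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

    private static final byte[] FAMILY = Bytes.toBytes("d");  // placeholder family

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      // Assumed layout: sourceUrl \t userId \t userAgent \t campaignId
      String[] f = line.toString().split("\t");
      if (f.length < 4) return;
      byte[] row = Bytes.toBytes(f[3] + ":" + f[1] + ":" + offset.get());
      Put put = new Put(row);
      put.add(FAMILY, Bytes.toBytes("url"), Bytes.toBytes(f[0]));
      put.add(FAMILY, Bytes.toBytes("ua"),  Bytes.toBytes(f[2]));
      context.write(new ImmutableBytesWritable(row), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "click-log-bulkload");
    job.setJarByClass(ClickLogBulkLoad.class);
    job.setMapperClass(LogToPutMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);
    // args[0]: HDFS staging dir, e.g. filled by 'hadoop distcp' from the S3 bucket
    FileInputFormat.addInputPath(job, new Path(args[0]));
    // args[1]: where the generated HFiles go
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    HTable table = new HTable(conf, "clicks");               // placeholder table
    HFileOutputFormat.configureIncrementalLoad(job, table);  // wires reducer/partitioner
    // Afterwards, load the HFiles with the completebulkload tool
    // (org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles).
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}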
>>>>> Are you running HBase on EC2 or on your own hardware?
>>>> We have created a small HBase cluster on our own hardware, but we want
>>>> to build another cluster on top of Amazon EC2. That could be very good
>>>> for the integration between S3 and the HBase cluster.
>>>> Regards
>>>>> -Amandeep
>>>>> On Thursday, May 24, 2012 at 11:52 AM, Marcos Ortiz wrote:
>>>>>> Regards to all the list.
>>>>>> We are using Amazon S3 to store millions of files with a certain
>>>>>> format, and we want to read the content of these files and then
>>>>>> upload it to an HBase cluster.
>>>>>> Has anyone done this?
>>>>>> Can you recommend an efficient way to do this?
>>>>>> Best wishes.
>>>>>> --
>>>>>> Marcos Luis Ortíz Valmaseda
>>>>>> Data Engineer && Sr. System Administrator at UCI

Marcos Luis Ortíz Valmaseda
  Data Engineer && Sr. System Administrator at UCI
  Twitter: @marcosluis2186

