Re: HBase tasks
Pavel Hančar 2013-04-09, 19:21
Thanks for the answer. Yes, I meant an in-memory column family. But does it
matter whether my two column families are in one table or in separate
tables? Or is it somehow stupid to have a table with only one CF?
I have one column family with the pictures and another with two columns.
The first column holds vectors (small text files) extracted from those
pictures, and the second holds diminished copies of the pictures.
I have a program that measures similarity distances between the vectors. I
want a real-time web application that calculates the distances and displays
the diminished pictures sorted by them. My question is whether I should use
MapReduce or whether there is an alternative. MapReduce seems to me quite
I use CDH3 (HBase 0.90.6). Right now I'm developing everything on my laptop
with a small amount of data, but we expect to have a cluster of about 30
nodes with hundreds of GB. On the laptop I have 3430 pictures and the
response time with MapReduce is 26 sec. I thought I could speed up the
processing if the second CF was in-memory, but the response time is the
same. I mean, MapReduce does so many disk writes/reads that it can hardly be
quicker. Or is there any way to do all the processing in memory? I feel
especially stupid when my only reducer writes its output to disk and then I
immediately read it with a Java web application. Can I somehow get an
Iterator instead of the output file from the reducer?
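With only a few thousand vectors, one possible alternative is to skip the
MapReduce job and compute the distances inside the web application itself,
after reading the vectors with a client-side scan (the scan APIs mentioned in
the quoted reply below). The sketch that follows is only an illustration
under assumptions, not HBase code: `SimilarityRanker`, `Hit`, and the
Euclidean `distance` are hypothetical names and choices, and the vectors are
assumed to be already loaded into a map keyed by row key. It shows just the
last step, ranking in memory and handing back an Iterator rather than a file.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of HBase: ranks stored vectors by their
// distance to a query vector, entirely in memory.
final class SimilarityRanker {

    // One ranked hit: the row key of a picture and its distance to the query.
    static final class Hit {
        final String rowKey;
        final double distance;

        Hit(String rowKey, double distance) {
            this.rowKey = rowKey;
            this.distance = distance;
        }
    }

    // Euclidean distance; an assumption standing in for whatever measure
    // the real similarity program uses.
    static double distance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vector lengths differ");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Computes every distance, sorts ascending, and returns an Iterator
    // over the hits, so nothing is written to disk.
    static Iterator<Hit> rank(Map<String, double[]> vectorsByRowKey, double[] query) {
        List<Hit> hits = new ArrayList<Hit>(vectorsByRowKey.size());
        for (Map.Entry<String, double[]> e : vectorsByRowKey.entrySet()) {
            hits.add(new Hit(e.getKey(), distance(e.getValue(), query)));
        }
        Collections.sort(hits, new Comparator<Hit>() {
            @Override
            public int compare(Hit x, Hit y) {
                return Double.compare(x.distance, y.distance);
            }
        });
        return hits.iterator();
    }
}
```

The web application could then call `SimilarityRanker.rank(vectors, query)`
per request and fetch the diminished pictures for the first few row keys.
Whether this still holds up at hundreds of GB is a separate question; at that
size the distance computation would probably have to run closer to the data.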
2013/4/8 Anoop Sam John <[EMAIL PROTECTED]>
> >But what to do, if I have an HBase in-memory table,
> Why do you say in-memory table? Is all the data in memory? Can you explain
> a bit about this?
> Yes, there is an MR job to scan the HBase table data (full or part).
> When you say you want to retrieve data fast, what is the amount of data?
> How many regions? Have you done any testing with the scan APIs?
> Which version of HBase?
> From: Pavel Hančar [[EMAIL PROTECTED]]
> Sent: Saturday, April 06, 2013 10:15 PM
> To: [EMAIL PROTECTED]
> Subject: HBase tasks
> Maybe I don't understand one basic thing. MapReduce jobs are meant for long
> jobs that process big data. But what should I do if I have an HBase
> in-memory table where I would like to process all (or selected) records
> with minimal response time? Also MapReduce?
> If so, are there any features to speed up the processing? Is it possible to
> avoid some disk writes/reads?
> I am trying to compare some vectors extracted from pictures and sort the
> output with a single empty reducer. Then a web application picks up the
> output. In particular, the final write of the single reducer's output,
> followed by the web application reading it, seems strange to me. Is it
> possible to get an iterator from the reducer instead of the output file?
> Pavel Hančar