I'm a newbie, but maybe this will also add some value...
It is my understanding that MapReduce is essentially a distributed "group by".
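To make that concrete, here is a minimal single-machine sketch of the map/shuffle/reduce phases behind that "group by". The sales records and the sum-by-region aggregation are made-up examples, not anything from a real system:

```python
from collections import defaultdict

# Toy records: (region, sale amount). In a real job these would be
# spread across many machines instead of one list.
records = [
    ("east", 10), ("west", 5), ("east", 7), ("north", 3), ("west", 2),
]

# Map phase: emit (key, value) pairs; here the records already are pairs.
mapped = [(region, amount) for region, amount in records]

# Shuffle phase: group all values by key. This is the "group by" step;
# on a cluster it happens over the network between machines.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: aggregate each group independently, which is why it
# parallelizes so well -- each key's reduce can run on a different node.
totals = {key: sum(values) for key, values in groups.items()}
print(totals)  # {'east': 17, 'west': 7, 'north': 3}
```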
When you run a statement like that against petabytes of data, it can
take a long time, first and foremost because before you can apply the
group-by logic you have to read the data off disk.
If your disks read at 100 MB/s, you can do the math: the query will take
at least that long to complete.
If you need this info really fast, say in the next hour to support
personalization features on an e-commerce site, or for a month-end report
that needs to be complete in 2 hours, then it would be nice to put equal
parts of your data on hundreds of disks and run the same algorithm in
parallel.
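The back-of-the-envelope math above can be sketched like this, assuming an illustrative 1 TB dataset (the numbers are only for illustration):

```python
# Time just to scan the data once, before any group-by logic runs.
dataset_bytes = 1 * 10**12          # 1 TB of data
disk_throughput = 100 * 10**6       # 100 MB/s per disk

serial_seconds = dataset_bytes / disk_throughput
print(serial_seconds / 3600)        # ~2.8 hours on a single disk

# Split the same data evenly across 100 disks and scan in parallel.
parallel_seconds = serial_seconds / 100
print(parallel_seconds)             # ~100 seconds
```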
But that's only if your bottleneck is disk.
What if your dataset is relatively small, but the calculation done on
each incoming element is expensive? Then your bottleneck is CPU power.
There are a lot of bottlenecks you could run into: number of threads,
latency of remote APIs or a remote database you hit as you analyze the
data, and so on.
There's a book called Programming Collective Intelligence from O'Reilly
that should help you out too.
On Tue, May 21, 2013 at 11:02 PM, Sai Sai <[EMAIL PROTECTED]> wrote:
> Excellent Sanjay, really excellent input. Many Thanks for this input.
> I have been always thinking about some ideas but never knowing what to
> proceed with.
> Thanks again.
> *From:* Sanjay Subramanian <[EMAIL PROTECTED]>
> *To:* "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> *Sent:* Tuesday, 21 May 2013 11:51 PM
> *Subject:* Re: Project ideas
> My $0.02 is look around and see problems u can solve…It's better to
> get a list of problems and see if u can model a solution using map-reduce
> An example is as follows
> Build a Car Pricing Model based on advertisements on Craigslist
> Recommend a price to the Craigslist car seller when the user gives info
> about make,model,color,miles
> DATA required
> Collect RSS feeds daily from Craigslist (don't pound their website, else
> they will lock u down)
> DESIGN COMPONENTS
> - Daily RSS Collector - pulls data and puts into HDFS
> - Data Loader - Structures the columns u need to analyze and puts into HDFS
> - Hive Aggregator and analyzer - studies and queries data and brings out
> recommendation models for car pricing
> - REST Web service to return query results in XML/JSON
> - iPhone App that talks to web service and gets info
> There u go…this should keep a couple of students busy for 3 months
> I find this kind of problem statement and solution simpler to
> understand because it's all there in the real world !
> An example of my way of thinking led to me founding this non profit
> called www.medicalsidefx.org that gives users valuable metrics regarding
> medical side fx.
> It uses Hadoop to aggregate, Lucene to search….This year I am redesigning
> the core to use Hive :-)
> Good luck
> From: Michael Segel <[EMAIL PROTECTED]>
> Reply-To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> Date: Tuesday, May 21, 2013 6:46 AM
> To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> Subject: Re: Project ideas
> Drink heavily?
> Let me rephrase.
> Part of the exercise is for you, the student, to come up with the idea,
> not to solicit someone else for a suggestion. This is how you learn.
> The exercise is to get you to think about the following:
> 1) What is Hadoop
> 2) How does it work
> 3) Why would you want to use it
> You need to understand #1 and #2 to be able to answer #3.
> But at the same time... you need to also incorporate your own view of
> the world.
> What are your hobbies? What do you like to do?
> What scares you the most? What excites you the most?
> Why are you here?