MapReduce, mail # dev - Suggestion of Research topic in Hadoop for PhD research

Re: Suggestion of Research topic in Hadoop for PhD research
Steve Loughran 2012-06-19, 08:25
On 18 June 2012 18:17, Suresh S <[EMAIL PROTECTED]> wrote:

> Dear Sir/Madam,
>                  I joined as a Research scholar(PhD) recently.
> I am interested to do research in cloud computing. Last month i was attend
> one workshop.
> From that, i know about Hadoop. I am very much intrested to do research in
> hadoop.
> Please give some topics and problems to work. Thanks in advance.
> *Regards*

Given you are doing a PhD, presumably you are expected to start by reading
the state of the art before diving into the depths of your own work.

For that reason, I'm attaching a .bib file containing papers you may want
to read.

This list is incomplete and biased towards work I was doing last year on
data integrity within Hadoop. It omits:
 -all of Lamport's work on Distributed Computing
 -all the classic RDBMS papers, including:
[Chamberlin81] D. Chamberlin et al., *A History and Evaluation of System R*,
1981
[Codd71] E. F. Codd, *A Database Sublanguage Founded on the Relational
Calculus*, 1971
[Date84] C. J. Date, *A Critique of the SQL Database Language*, 1984
 -everything from Google, Yahoo!, Amazon and Microsoft Research groups,
Facebook, etc.
 -the work done in the 1980s and early 1990s on "massively parallel"
computers. They tried out a lot of designs there, some of which could have
relevance again.

Regarding working inside Hadoop itself, be aware that

   - The code is big, complicated and needs testing on large clusters.
   - It's in use in production, which makes people reluctant to accept
   large changes to the core

There are some tactics to address that, especially if you are looking at
the classic CS-hard problems of scheduling, data placement, etc.:

   - Work in your own scheduler
   - Use the block placement plugin
   - Find other plugin points, or help design one for the specific area you
   want to play in.
   - YARN lets you run completely different applications in a Hadoop
   cluster
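
To make "plugin point" concrete: the general pattern is that the platform defines an interface, ships a default implementation, and loads whichever class the configuration names. Below is a minimal, self-contained sketch of that pattern. Every name in it (PlacementPolicy, FirstFitPolicy, ReversePolicy, PolicyLoader) is hypothetical and made up for illustration; none of this is Hadoop's actual API, so check the real interfaces in the source tree before building on it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical plugin interface: NOT Hadoop's real API, only the shape of one.
interface PlacementPolicy {
    // Choose up to `replicas` target nodes for a block from the candidates.
    List<String> chooseTargets(List<String> nodes, int replicas);
}

// Stand-in for the default behaviour shipped with the platform:
// take the first N candidate nodes.
class FirstFitPolicy implements PlacementPolicy {
    public List<String> chooseTargets(List<String> nodes, int replicas) {
        return new ArrayList<>(nodes.subList(0, Math.min(replicas, nodes.size())));
    }
}

// Stand-in for a research policy developed outside the core:
// take the last N candidate nodes instead.
class ReversePolicy implements PlacementPolicy {
    public List<String> chooseTargets(List<String> nodes, int replicas) {
        List<String> out = new ArrayList<>();
        for (int i = nodes.size() - 1; i >= 0 && out.size() < replicas; i--) {
            out.add(nodes.get(i));
        }
        return out;
    }
}

class PolicyLoader {
    // Instantiate the policy whose class name came from configuration.
    static PlacementPolicy load(String className) {
        try {
            return Class.forName(className)
                    .asSubclass(PlacementPolicy.class)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot load policy " + className, e);
        }
    }
}
```

With this in place, PolicyLoader.load(ReversePolicy.class.getName()) returns the research policy without any change to the loader or the default class, which is the property that makes plugin points attractive when the core is hard to modify.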

Another thing to be aware of is that because of the R&D money being
invested in the platform, sometimes it does change dramatically -and it is
hard to compete with the efforts of a team of full-time developers. For
example, I've long complained that Hadoop wasn't that good in a virtual
world, and last week VMware published a patch that contains many tens of
thousands of lines of code to address it. Anyone doing a PhD on the same
problem would now be in trouble.

This is why working on a related-but-higher-level stack such as Asterix or
Stratosphere may be a good approach; another is to pick a specific
application problem and look at implementing it within the Hadoop platform.

.bib file, in no particular order:

@Article{ Chen94:raid,
    author = "Peter M. Chen and Edward K. Lee and Garth A. Gibson and Randy
H. Katz and David A. Patterson",
    title = "RAID: High-Performance, Reliable Secondary Storage",
    journal = "ACM Computing Surveys",
    year = "1994",
    volume = "26",
    pages = "145--185"
}

@Misc{ Ghemawat03:gfs,
    author = "Sanjay Ghemawat and Howard Gobioff and Shun-Tak Leung",
    title = "The Google File System",
    year = "2003"
}

@TechReport{ Gray05:diskFailureRates,
    title = "Empirical Measurements of Disk Failure Rates and Error Rates",
    author = "Jim Gray and Catharine van Ingen",
    institution = "Microsoft",
    number = "MSR-TR-2005-166",
    month = dec,
    year = "2005",
    url = "http://research.microsoft.com/apps/pubs/default.aspx?id=64599"
}

@PhDThesis{ fielding:rest,
    author = "Roy Thomas Fielding",
    title = "Architectural Styles and the Design of Network-based Software Architectures",
    year = 2000,
    school = "University of California",
    type = "{Ph.D.} dissertation",
    note = "http://www.ics.uci.edu/~fielding/pubs/dissertation/top.htm"
}

@InProceedings{ indiana:rmiperf,
    title = "{Requirements for and Evaluation of {RMI} Protocols for
Scientific Computing}",
    author = "Madhusudhan Govindaraju and others",
    institution = "Department of Computer Science Indiana University",
    url = "http://www.extreme.indiana.edu/xgws/papers/sc00_paper/index.html",
    year = "2000",
    booktitle = "Proceedings Supercomputing 2000"
}

@InProceedings{ indiana:soap-limits,
    title = "Investigating the Limits of {SOAP} Performance for Scientific Computing",
    author = "Kenneth Chiu and Madhusudhan Govindaraju and Randall Bramley",
    booktitle = "Proceedings of HPDC 2002",
    year = 2002,
    note = ""
}

@TechReport{ paper:RMI,
    institution = "Sun Microsystems",
    title = "{Java Remote Method Invocation - Distributed Computing for Java}",
    year = 1997,
    author = "{Sun Microsystems}",
    note = ""
}

@TechReport{ spec:DOM,
    institution = "W3C",
    author = "Vidur Apparao and others",
    year = "1998",
    title = "{Document Object Model (DOM)}",
    note = "http://www.w3.org/DOM/"
}

@TechReport{ ietf:rfc2616,
    title = "{RFC 2616: Hypertext Transfer Protocol -- HTTP/1.1}",
    author = "R. Fielding and J. Gettys and J. Mogul and H. Frystyk and L.
Masinter and P. Leach and T. Berners-Lee",
    institution = "IETF",
    note = "http://ietf.org/rfc/rfc2616.txt",
    year = "1999"
}

@Misc{ harold:xom,
    title = "{What's Wrong with XML APIs (and how to fix them)}",
    year = "2002",
    author = "Elliotte Rusty Harold",
    note = "http://www.cafeconleche.org/XOM/whatswrong/"
}

@Article{ parnas:interfaces,
    author = "David L. Parnas",
    title = "{Use of Abstract Interfaces in the Development of Software for
Embedded Computer Systems}",
    year = "1974"
}

@Book{ vinoski:CORBA,
    title = "{Advanced CORBA(R) Programming with C++}",
    year = "1999",
    author = "Michi Henning and Steve Vinoski",
    publisher = "Addison-Wesley"
}

@Book{ neward:EEJ,
    title = "{Effective Enterprise Java}",
    year = "2004",