HDFS, mail # user - structured data split


臧冬松 2011-11-11, 07:43
Denny Ye 2011-11-11, 09:50
臧冬松 2011-11-11, 10:11
Bejoy KS 2011-11-11, 11:01
臧冬松 2011-11-11, 12:46
bejoy.hadoop@... 2011-11-11, 13:25
Harsh J 2011-11-11, 13:54
Bejoy KS 2011-11-11, 14:38
Harsh J 2011-11-11, 16:06
Bejoy KS 2011-11-11, 16:27
臧冬松 2011-11-11, 14:12
Will Maier 2011-11-11, 14:26
Charles Earl 2011-11-11, 14:42

Re: structured data split
Bejoy KS 2011-11-11, 15:10
Hi Donal
         I don't have much exposure to the domain you are pointing to, but
from a plain MapReduce developer's perspective, here is how I would approach
processing such a data format with MapReduce:
- If the data is flowing in continuously, I'd use Flume to collect the
binary data, write it into sequence files, and load those into HDFS.
- If it is already existing large data, I'd use a SequenceFile writer to
write the binary data as sequence files into HDFS, where HDFS takes care of
the splits (see the sketch after this list).
- I'd use SequenceFileInputFormat for my MapReduce job.
- If my application code is in a compatible language other than Java, I'd
use the Streaming API to run my MapReduce job.
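
A minimal sketch of the SequenceFile loading step, assuming each event can
be pulled out of the source files as a byte[] blob; the readNextEvent()
helper and the event framing are hypothetical stand-ins for your
format-specific code:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;

    public class EventLoader {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Key: event sequence number; value: the raw event bytes.
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, new Path(args[0]),
                    LongWritable.class, BytesWritable.class);
            try {
                long eventId = 0;
                byte[] event;
                while ((event = readNextEvent()) != null) { // hypothetical helper
                    writer.append(new LongWritable(eventId++),
                                  new BytesWritable(event));
                }
            } finally {
                writer.close();
            }
        }

        // Placeholder: real code would parse the experiment's binary format.
        private static byte[] readNextEvent() { return null; }
    }

On the MapReduce side the job then just declares the input format, e.g.
job.setInputFormatClass(SequenceFileInputFormat.class), and each map() call
receives one event.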

If there are any specific constraints on reading your data, then as Will
mentioned you may need to write a custom InputFormat for processing it; a
rough skeleton follows.
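
A minimal sketch of such a domain-specific InputFormat (new MapReduce API),
assuming fixed-size event records and a file length that is a multiple of
the record size; EventInputFormat and EVENT_SIZE are illustrative
assumptions, and real ROOT files would need a format-aware reader:

    import java.io.IOException;

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class EventInputFormat
            extends FileInputFormat<LongWritable, BytesWritable> {

        static final int EVENT_SIZE = 4096; // hypothetical fixed record size

        @Override
        public RecordReader<LongWritable, BytesWritable> createRecordReader(
                InputSplit split, TaskAttemptContext context) {
            return new EventRecordReader();
        }

        static class EventRecordReader
                extends RecordReader<LongWritable, BytesWritable> {
            private FSDataInputStream in;
            private long start, end, pos;
            private final LongWritable key = new LongWritable();
            private final BytesWritable value = new BytesWritable();

            @Override
            public void initialize(InputSplit genericSplit,
                    TaskAttemptContext ctx) throws IOException {
                FileSplit split = (FileSplit) genericSplit;
                Path file = split.getPath();
                in = file.getFileSystem(ctx.getConfiguration()).open(file);
                // Advance to the first event boundary at or after the split
                // start; the straddling event belongs to the previous split.
                start = ((split.getStart() + EVENT_SIZE - 1) / EVENT_SIZE)
                        * EVENT_SIZE;
                end = split.getStart() + split.getLength();
                pos = start;
            }

            @Override
            public boolean nextKeyValue() throws IOException {
                if (pos >= end) return false;
                byte[] buf = new byte[EVENT_SIZE];
                in.readFully(pos, buf);     // positioned read of one event
                key.set(pos / EVENT_SIZE);  // event index as the key
                value.set(buf, 0, EVENT_SIZE);
                pos += EVENT_SIZE;
                return true;
            }

            @Override public LongWritable getCurrentKey() { return key; }
            @Override public BytesWritable getCurrentValue() { return value; }
            @Override public float getProgress() {
                return end == start ? 1f : (pos - start) / (float) (end - start);
            }
            @Override public void close() throws IOException { in.close(); }
        }
    }

Splits still come from FileInputFormat's default block-based logic; the
reader just re-aligns each split to event boundaries so no event is
processed twice.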
Hope it helps!...
On Fri, Nov 11, 2011 at 8:12 PM, Charles Earl <[EMAIL PROTECTED]> wrote:

> Hi,
> Please also feel free to contact me. I'm working with the STAR project at
> Brookhaven Lab, and we are trying to build an MR workflow for analysis of
> particle data. I've done some preliminary experiments running ROOT and
> other nuclear physics analysis software in MR and have been looking at
> various file layouts.
> Charles
> On Nov 11, 2011, at 9:26 AM, Will Maier wrote:
>
> > Hi Donal-
> >
> > On Fri, Nov 11, 2011 at 10:12:44PM +0800, 臧冬松 wrote:
> >> My scenario is that I have lots of files from a High Energy Physics
> >> experiment. These files are in binary format, about 2 GB each, but
> >> basically they are composed of lots of "Events", and each Event is
> >> independent of the others. The physicists use a C++ program called
> >> ROOT to analyze these files and write the output to a result file
> >> (using open(), read(), write()). I'm considering how to store the
> >> files in HDFS and use MapReduce to analyze them.
> >
> > May I ask which experiment you're working on? We run an HDFS cluster at
> > one of the analysis centers for the CMS detector at the LHC. I'm not
> > aware of anyone using Hadoop's MR for analysis, though about 10 PB of
> > LHC data is now stored in HDFS. For your/our use case, I think that you
> > would have to implement a domain-specific InputFormat yielding Events.
> > ROOT files would be stored as-is in HDFS.
> >
> > In CMS, we mostly run traditional HEP simulation and analysis workflows
> > using plain batch jobs managed by common schedulers like Condor or PBS.
> > These of course lack some of the features of the MR schedulers (like
> > location awareness), but have some advantages. For example, we run
> > Condor schedulers that transparently manage workflows of tens of
> > thousands of jobs on dozens of heterogeneous clusters across North
> > America.
> >
> > Feel free to contact me off-list if you have more HEP-specific
> > questions about HDFS.
> >
> > Thanks!
> >
> > --
> >
> > Will Maier - UW High Energy Physics
> > cel: 608.438.6162
> > tel: 608.263.9692
> > web: http://www.hep.wisc.edu/~wcmaier/
>
>
臧冬松 2011-11-11, 15:57
臧冬松 2011-11-14, 08:32