You could run the Flume collectors on other machines and write a source that connects to the sockets on the data generators.
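A minimal sketch of that collector-side shim: connect to one generator's socket and hand each line to a downstream sink (in practice, the Flume source). The host and port are hypothetical placeholders.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.Socket;
import java.util.function.Consumer;

// Sketch of a collector-side reader: tails a generator's TCP socket
// and forwards each line as one event. "generator-host"/9000 below
// are placeholders, not real addresses from this thread.
public class SocketTail {
    // Read lines from `in` until EOF, handing each to `sink`.
    static void relayLines(BufferedReader in, Consumer<String> sink) throws IOException {
        String line;
        while ((line = in.readLine()) != null) {
            sink.accept(line);   // forward one event downstream
        }
    }

    public static void main(String[] args) throws IOException {
        // In real use the sink would feed the Flume collector rather
        // than stdout.
        try (Socket s = new Socket("generator-host", 9000);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(s.getInputStream()))) {
            relayLines(in, System.out::println);
        }
    }
}
```

One such reader per generator socket keeps the generators themselves untouched; only outbound firewall access to the ports is needed.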
On Dec 15, 2011, at 21:27, "Periya.Data" <[EMAIL PROTECTED]> wrote:
> Sorry, I misworded my statement. What I meant was that the source machines must stay untouched: the admins do not want us installing additional tools on them. All I have are the source addresses and port numbers. Once I decide which technique(s) to use, I will be given access through the firewalls, along with the necessary credentials.
> On Thu, Dec 15, 2011 at 5:05 PM, Russell Jurney <[EMAIL PROTECTED]> wrote:
> Just curious - what is the situation you're in where no collectors are
> possible? Sounds interesting.
> Russell Jurney
> [EMAIL PROTECTED]
> On Dec 15, 2011, at 5:01 PM, "Periya.Data" <[EMAIL PROTECTED]> wrote:
> > Hi all,
> > I would like to know what options I have for ingesting terabytes of data
> > that are being generated very quickly by a small set of sources. I have
> > thought about:
> > 1. Flume.
> > 2. One or more intermediate staging servers where the data can be
> > offloaded, then loaded into HDFS with dfs -put.
> > 3. Anything else?
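For option 2, the staging-server side can be as simple as a periodic shell step; the /staging and /ingest paths below are made up for illustration, and the commands need a configured Hadoop client on the staging box:

```shell
# Staging-box side of option 2: batch completed files, then push them
# into HDFS in one shot. All paths are illustrative placeholders.
DAY=$(date +%Y-%m-%d)
mv /staging/incoming/*.log /staging/ready/
hadoop dfs -mkdir /ingest/$DAY
hadoop dfs -put /staging/ready/*.log /ingest/$DAY/
rm /staging/ready/*.log
```

Several such pushes can run in parallel against different target directories if one stream cannot keep up.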
> > Suppose I cannot use Flume (since agents cannot be installed on the
> > sources), and suppose I do not have the luxury of an intermediate
> > staging area. What options do I have? In that case I might have to
> > ingest data directly into HDFS, preferably in parallel.
> > I have read about a technique that uses MapReduce, where each map reads
> > data and stores it in HDFS through the Java API; running multiple map
> > tasks in parallel would give parallel ingestion. It would be nice to
> > know about ways to ingest data "directly" into HDFS, given my constraints.
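The per-task copy loop at the heart of that idea can be sketched as below. Plain java.io streams are used so the sketch stands alone; in real code `in` would be the generator's socket stream and `out` would come from the Hadoop API, e.g. `FileSystem.create(new Path(...))`, with one map task (or thread) per source.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Sketch of one direct-ingestion task: stream bytes from a source
// into a sink. In a MapReduce-based ingester, each map task would
// run one of these loops with `out` backed by an HDFS file
// (org.apache.hadoop.fs.FileSystem.create) -- stubbed out here so
// the sketch has no cluster dependency.
public class IngestTask {
    // Copy everything from `in` to `out`; returns the byte count.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[64 * 1024];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
            total += n;
        }
        out.flush();
        return total;
    }
}
```

With a small, fixed set of sources, one task per source is usually enough; parallelism beyond that comes from HDFS writing each file's blocks across many datanodes.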
> > Suggestions are appreciated,
> > /PD.