Re: Data Lineage
Thank you, Connor.

From what I understand, I can use a serializer to write the data in my own
format. The language in the documentation is a bit vague, so Connor, could you
help me with the following question:
                    For a scenario where I know my log files are delimited
by \t, I would like to add columns at the start of every event row which
indicate the timestamp and file name. Can this be done by a serializer?

If it's possible I'll send it to our Java devs :)
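To make the question concrete, here is a rough, self-contained sketch of the transformation I have in mind. The class and method names are just placeholders, and I'm assuming a real implementation would extend Flume's EventSerializer and pull the timestamp and file name from event headers set by interceptors; this only shows the column-prepending logic itself:

```java
// Sketch only: the line transformation a custom serializer would apply.
// In a real Flume EventSerializer the timestamp and file name would be
// read from the event's headers (set upstream by interceptors); here
// they are plain parameters so the logic can be shown standalone.
public class LineagePrefixer {

    // Join timestamp and file-name columns onto a tab-delimited event body.
    static String prefix(String timestamp, String fileName, String body) {
        return timestamp + "\t" + fileName + "\t" + body;
    }

    public static void main(String[] args) {
        String body = "GET\t/index.html\t200";  // original event row
        // Prints the row with two extra leading tab-separated columns.
        System.out.println(prefix("1359990000000", "access.log", body));
    }
}
```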
On Mon, Feb 4, 2013 at 8:51 PM, Connor Woodson <[EMAIL PROTECTED]>wrote:

> You will want to look at the Serializer
> <http://flume.apache.org/FlumeUserGuide.html#event-serializers> component.
> The default serializer is TEXT, which will only write out the body of your
> event, discarding all headers. You can switch to one of the other
> serializers, or if none of them suit your purpose you are able to create
> your own that, for instance, could write the event in JSON format thus
> preserving the headers.
>
> (Only two serializers are currently documented. You can see all of the ones
> currently in Flume here:
> <https://github.com/apache/flume/tree/trunk/flume-ng-core/src/main/java/org/apache/flume/serialization>
> It looks like there's only one additional one there, and it might be exactly
> what you're looking for.)
>
> If you want more detail on creating a custom serializer, or how to use one
> of the existing ones, please ask.
>
> - Connor
>
>
> On Mon, Feb 4, 2013 at 7:38 AM, Tzur Turkenitz <[EMAIL PROTECTED]> wrote:
>
>> Hello All,
>>
>> In my company we are worried about data lineage. Big files can be split
>> into smaller files (block size) inside HDFS, and smaller files can be
>> aggregated into larger files. We want to have some kind of control
>> regarding data lineage and the ability to map source files to files in
>> HDFS. Using interceptors we can add various keys like timestamp, static,
>> file header, etc.
>>
>> After a file has been processed and inserted into HDFS, do those keys
>> still exist and remain viewable if I choose to cat the file in Hadoop?
>> (I did cat the files and didn't see any of the keys.) Or do the keys only
>> exist during the process and never get saved into the file?
>>
>> Alternatively, is it possible to append those keys to the file using
>> Flume's built-in components?
>>
>> I appreciate the help,
>>
>> Tzur
>>
>
>
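For anyone finding this thread later: if I'm reading the user guide right, the serializer is selected per sink in the agent configuration. A minimal HDFS-sink fragment might look like the following (the agent name, sink name, path, and the custom class are all placeholders of mine, not tested config):

```properties
# Hypothetical agent/sink names. "serializer" picks the EventSerializer;
# TEXT is the default, and a custom one is named by its Builder class.
agent1.sinks.hdfsSink.type = hdfs
agent1.sinks.hdfsSink.hdfs.path = hdfs://namenode/flume/events
agent1.sinks.hdfsSink.hdfs.fileType = DataStream
agent1.sinks.hdfsSink.serializer = com.example.LineageSerializer$Builder
```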
--
Regards,
Tzur Turkenitz
Vision.BI
http://www.vision.bi/

"*Facts are stubborn things, but statistics are more pliable*"
-Mark Twain