Re: Hadoop import
OK, got it. How are partitions determined? Is this something the
producer is responsible for, or can it be handled automatically by the
broker?
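
For context: the partition is generally chosen on the producer side (commonly by hashing a message key), not assigned by the broker. A minimal sketch of that idea in Java, using hypothetical class and method names rather than the actual Kafka producer API:

// Illustrative only -- hypothetical names, not the real Kafka partitioner interface.
// Shows the common producer-side scheme: derive a partition from the message key.
public class KeyHashPartitionChooser {

    // Map a message key onto one of numPartitions partitions.
    public int choosePartition(String key, int numPartitions) {
        if (key == null) {
            // No key: fall back to a round-robin/random style assignment.
            return (int) (System.nanoTime() % numPartitions);
        }
        // Mask the sign bit so the result is never negative.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }
}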

On 11/6/11 11:13 AM, Neha Narkhede wrote:
>> Ok so the partitioning is done on the Hadoop side during importing and has
>> nothing to do with Kafka partitions.
> That's right.
>
> Kafka partitions help scale consumption by allowing multiple consumer
> processes to pull data for a topic in parallel. The parallelism factor is
> limited by the total number of Kafka partitions. For example, if a
> topic has 2 partitions, 2 Hadoop mappers can pull data for the entire
> topic in parallel. If another topic has 8 partitions, the parallelism
> factor increases by 4x: 8 mappers can pull all the data for this
> topic at the same time.
>
> Thanks,
> Neha
>
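To illustrate the parallelism point above: one puller per partition, so the partition count caps how many consumers (or Hadoop mappers) can make progress at once. A minimal sketch with plain Java threads standing in for mappers; all names here are hypothetical and this is not the contrib/hadoop-consumer code:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: spawn one worker per partition of a topic.
public class ParallelPullSketch {
    public static void main(String[] args) {
        int numPartitions = 8;  // e.g. a topic with 8 partitions
        ExecutorService pool = Executors.newFixedThreadPool(numPartitions);
        for (int p = 0; p < numPartitions; p++) {
            final int partition = p;
            pool.submit(new Runnable() {
                public void run() {
                    // Each worker pulls only its own partition, so at most
                    // numPartitions workers can be pulling at the same time.
                    System.out.println("pulling partition " + partition);
                }
            });
        }
        pool.shutdown();
    }
}
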
> On Sun, Nov 6, 2011 at 11:00 AM, Mark<[EMAIL PROTECTED]>  wrote:
>> Ok so the partitioning is done on the Hadoop side during importing and has
>> nothing to do with Kafka partitions. Would you mind explaining what Kafka
>> partitions are used for and when one should use them?
>>
>>
>>
>> On 11/6/11 10:52 AM, Neha Narkhede wrote:
>>> We use Avro serialization for the message data and use Avro schemas to
>>> convert event objects into Kafka message payload on the producers. On
>>> the Hadoop side, we use Avro schemas to deserialize Kafka message
>>> payload back into an event object. Each such event object has a
>>> timestamp field that the Hadoop job uses to put the message into its
>>> hourly and daily partition. So if the Hadoop job runs every 15 mins,
>>> it will run 4 times to collect data into the current hour's partition.
>>>
>>> Very soon, the plan is to open-source this Avro-Hadoop pipeline as well.
>>>
>>> Thanks,
>>> Neha
>>>
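A rough sketch of the general pattern described above, assuming Avro's GenericRecord API and a made-up two-field Event schema with a long timestamp; this is not LinkedIn's actual pipeline code:

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

// Producer side: turn an event (with a timestamp field) into bytes that
// become the Kafka message payload. The Hadoop side would reverse this
// with a GenericDatumReader against the same schema.
public class AvroEventSerializer {

    private static final String SCHEMA_JSON =
        "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
      + "{\"name\":\"timestamp\",\"type\":\"long\"},"
      + "{\"name\":\"payload\",\"type\":\"string\"}]}";
    private static final Schema SCHEMA = new Schema.Parser().parse(SCHEMA_JSON);

    public byte[] serialize(long timestampMs, String payload) throws Exception {
        GenericRecord event = new GenericData.Record(SCHEMA);
        event.put("timestamp", timestampMs);
        event.put("payload", payload);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(SCHEMA).write(event, encoder);
        encoder.flush();
        return out.toByteArray();
    }
}
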
>>> On Sun, Nov 6, 2011 at 10:37 AM, Mark<[EMAIL PROTECTED]>    wrote:
>>>> "At LinkedIn we use the InputFormat provided in contrib/hadoop-consumer
>>>> to
>>>> load the data for
>>>> topics in daily and hourly partitions."
>>>>
>>>> Sorry for my ignorance but what exactly do you mean by loading the data
>>>> in
>>>> daily and hourly partitions?
>>>>
>>>>
>>>> On 11/6/11 10:26 AM, Neha Narkhede wrote:
>>>>> There should be no changes to the way you create topics to achieve
>>>>> this kind of HDFS data load for Kafka. At LinkedIn we use the
>>>>> InputFormat provided in contrib/hadoop-consumer to load the data for
>>>>> topics in daily and hourly partitions. These Hadoop jobs run every 10
>>>>> mins or so. So the maximum delay of data being available from
>>>>> producer->Hadoop is around 10 mins.
>>>>>
>>>>> Thanks,
>>>>> Neha
>>>>>
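As an illustration of the hourly partitioning idea (not the actual contrib/hadoop-consumer code), a sketch that maps an event's timestamp onto an hourly HDFS path; the path layout and class name are made up:

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Derive the hourly output directory for an event from its timestamp.
// A daily partition would just use the shorter "yyyy/MM/dd" pattern.
public class HourlyPartitioner {

    private final SimpleDateFormat hourlyFormat;

    public HourlyPartitioner() {
        hourlyFormat = new SimpleDateFormat("yyyy/MM/dd/HH");
        hourlyFormat.setTimeZone(TimeZone.getTimeZone("UTC"));
    }

    // e.g. /data/search_logs/hourly/2011/11/06/18 for a timestamp in that hour
    public String hourlyPath(String topic, long eventTimestampMs) {
        return "/data/" + topic + "/hourly/"
            + hourlyFormat.format(new Date(eventTimestampMs));
    }
}
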
>>>>> On Sun, Nov 6, 2011 at 8:45 AM, Mark<[EMAIL PROTECTED]>
>>>>>   wrote:
>>>>>> This is more of a general design question, but what is the preferred way
>>>>>> of importing logs from Kafka to HDFS when you want your data segmented
>>>>>> by hour or day? Is there any way to say "Import only this {hour|day} of
>>>>>> logs", or does one need to create their topics around the way they would
>>>>>> like to import them, i.e. Topic: "search_logs/2011/11/06"? If it's the
>>>>>> latter, is there any documentation/best practices on topic/key design?
>>>>>>
>>>>>> Thanks
>>>>>>