Hadoop >> mail # user >> Re: Decision Tree Implementation in Hadoop MapReduce


Re: Decision Tree Implementation in Hadoop MapReduce
To make it more general, it's better to separate them, since there might be
multiple batches of training (or to-be-labeled) data, and you only need to train
the model once (if your data is stable).

OK, I will go for the second one.
If we keep them separate, the training and prediction jobs will have no
connection with each other, so we need to say which test data belongs to which
training data. The prediction job should then load the corresponding
playtennnis_tree.txt (so the result file should be named in a way that makes
its training data recognizable from the file name) and predict the test data.
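
The naming idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class `ModelLookup`, its methods, and the `<training set>_tree.txt` pattern are hypothetical names, not something agreed on in this thread, and the sketch checks the local filesystem rather than HDFS.

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class ModelLookup {
    /** Hypothetical convention: training set "M1_train" maps to "M1_train_tree.txt". */
    public static String modelFileName(String trainingSetName) {
        return trainingSetName + "_tree.txt";
    }

    /** Finds the stored model for a training set, failing loudly if none was saved. */
    public static Path resolveModel(Path modelDir, String trainingSetName) {
        Path model = modelDir.resolve(modelFileName(trainingSetName));
        // Assumption: models live on the local filesystem; an HDFS job would check there instead.
        if (!Files.isRegularFile(model)) {
            throw new IllegalArgumentException(
                "No trained model for '" + trainingSetName + "' at " + model);
        }
        return model;
    }
}
```

With a convention like this, the prediction job only needs the name of the training set to find its model, and a request that names a training set with no stored model fails immediately instead of silently using the wrong tree.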
On Mon, Dec 2, 2013 at 10:29 AM, Yexi Jiang <[EMAIL PROTECTED]> wrote:

> Actually, the training and testing (or prediction) do not need to be
> done in one shot. If you need to do them consecutively in your particular
> scenario, you can do it as you said.
>
> To make it more general, it's better to separate them, since there might
> be multiple batches of training (or to-be-labeled) data, and you only need to
> train the model once (if your data is stable).
>
>
> 2013/12/1 unmesha sreeveni <[EMAIL PROTECTED]>
>
>> 1. I just thought of building a model in a project named, say, DT, and when
>> a huge input comes, running another MR job, test.java, within DT.
>> If we are not chaining jobs, do we need to create separate projects,
>> DT_build and DT_test?
>> Or is there no need for a separate project?
>>
>> 2. M1_train - dataset for training.
>>
>> M1_test - test data or prediction.
>> 1. Will the input for prediction be a single record, or a set of records
>> given at once?
>> 2. We also need to ensure in our program that M1_test belongs to M1_train
>> only. We should check that too, right? If M1_test is given to
>> M2_train, it should show an error, shouldn't it?
>>
>> Is anything wrong in my inference?
>> Are you able to guess what I am trying to accomplish?
>> I am confused about whether I need to create only one project that includes
>> both train and test, or two projects.
>>
>>
>> On Mon, Dec 2, 2013 at 9:54 AM, Yexi Jiang <[EMAIL PROTECTED]> wrote:
>>
>>> What is your motivation for chaining jobs?
>>>
>>>
>>> 2013/12/1 unmesha sreeveni <[EMAIL PROTECTED]>
>>>
>>>> Thanks, Yexi. A very nice explanation, thanks a lot.
>>>> It is explained in a very simple way that is really understandable for
>>>> beginners.
>>>> I can go for chaining jobs, right?
>>>>
>>>> On Sun, Dec 1, 2013 at 8:55 PM, Yexi Jiang <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> In my opinion:
>>>>>
>>>>> 1. Build the decision tree model with the training data.
>>>>> 2. Store it somewhere.
>>>>> 3. When the unlabeled data is available:
>>>>>    3.1 If the unlabeled data is huge, write another MR job to process
>>>>> it: load the model at the setup stage and use the model to label the
>>>>> records one by one in the map stage. There is no need for a reducer.
>>>>>    3.2 If the unlabeled data is small, it is trivial.
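
Step 3.1 could be sketched roughly as follows. This is plain Java that only mimics the setup/map lifecycle, not actual Hadoop code: the class `LabelingSketch`, the `Node` type, the column layout, and the hard-coded play-tennis tree are all hypothetical stand-ins for a model that a real job would read from its stored file.

```java
import java.util.HashMap;
import java.util.Map;

public class LabelingSketch {
    /** A tree node: either a leaf carrying a label, or a split on one attribute column. */
    static final class Node {
        final String label;   // non-null only for leaves
        final int attribute;  // column index tested at this node, -1 for leaves
        final Map<String, Node> children = new HashMap<>();

        Node(String label) { this.label = label; this.attribute = -1; }
        Node(int attribute) { this.label = null; this.attribute = attribute; }

        Node on(String value, Node child) { children.put(value, child); return this; }
    }

    private Node model;  // loaded once, then reused for every record

    /** Stand-in for the setup stage: obtain the stored model a single time. */
    public void setup() {
        // Hypothetical hard-coded tree; a real job would parse the stored model file here.
        // Assumed columns: 0 = outlook, 1 = humidity, 2 = wind.
        model = new Node(0)
            .on("overcast", new Node("yes"))
            .on("sunny", new Node(1).on("high", new Node("no")).on("normal", new Node("yes")))
            .on("rain", new Node(2).on("strong", new Node("no")).on("weak", new Node("yes")));
    }

    /** Stand-in for the map stage: label one comma-separated record. */
    public String map(String line) {
        String[] fields = line.split(",");
        Node node = model;
        while (node.label == null) {
            if (node.attribute >= fields.length) return "unknown";  // record too short
            node = node.children.get(fields[node.attribute].trim());
            if (node == null) return "unknown";  // value never seen during training
        }
        return node.label;
    }
}
```

In a real Hadoop job the same logic would live in a `Mapper` subclass, with the model loaded in `setup` and each record labeled in `map`; calling `job.setNumReduceTasks(0)` makes the job map-only, which matches the point that no reducer is needed.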
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> 2013/12/1 unmesha sreeveni <[EMAIL PROTECTED]>
>>>>>
>>>>>> Thanks, Yexi.
>>>>>>
>>>>>> But how can it be accomplished?
>>>>>> The input to the Decision Tree MR job will be a set of data. But when
>>>>>> predicting, the input will be a single line of data without a class
>>>>>> label, right?
>>>>>> So what changes will be needed in the MR job? Should we design it like this?
>>>>>> 1. When a set of data comes in, build the decision tree.
>>>>>> 2. Otherwise, if a single line of data comes in, check the output of the
>>>>>> decision tree (the tree generated by the MR job) and predict the class label.
>>>>>>
>>>>>> -------
>>>>>>
>>>>>> M1_train - dataset for training.
>>>>>> M1_test - test data or prediction.
>>>>>> 1. Will the input for prediction be a single record, or a set of records
>>>>>> given at once?
>>>>>> 2. We also need to ensure in our program that M1_test belongs to M1_train
>>>>>> only. We should check that too, right? If M1_test is given to
>>>>>> M2_train, it should show an error, shouldn't it?
>>>>>>
>>>>>> Please tell me if my thoughts are wrong.
>>>>>>
>>>>>> On 11/30/13, Yexi Jiang <[EMAIL PROTECTED]> wrote

*Thanks & Regards*

Unmesha Sreeveni U.B

*Junior Developer*