MapReduce, mail # user - Decision Tree Implementation in Hadoop MapReduce


Re: Decision Tree Implementation in Hadoop MapReduce
Yexi Jiang 2013-12-02, 14:52
Yes, the user is responsible for using the correct model for a given piece
of testing (or unlabeled) data.
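
A minimal sketch of the prediction side discussed in this thread, written as plain Java so it runs without a cluster. The class name `DecisionTreeLabeler`, the one-node-per-line model format, and the `"unknown"` fallback are assumptions made for illustration; the thread does not specify how the stored tree is laid out. In an actual Hadoop job, the constructor logic would go in `Mapper.setup()`, the `label` call in `Mapper.map()`, and the job would be configured with zero reduce tasks.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DecisionTreeLabeler {
    // One tree node: a leaf carries a label, a split tests one attribute column.
    static final class Node {
        String label;       // non-null only for leaves
        int attribute;      // column index tested by a split node
        Map<String, Integer> children = new HashMap<>(); // attribute value -> child node id
    }

    private final Map<Integer, Node> nodes = new HashMap<>();

    // Plays the role of Mapper.setup(): parse the stored model once per task.
    // Assumed format: "<id> leaf <label>" or "<id> split <column> <value>=<childId> ..."
    public DecisionTreeLabeler(List<String> modelLines) {
        for (String line : modelLines) {
            if (line.trim().isEmpty()) continue;
            String[] parts = line.trim().split("\\s+");
            Node node = new Node();
            if (parts[1].equals("leaf")) {
                node.label = parts[2];
            } else {
                node.attribute = Integer.parseInt(parts[2]);
                for (int i = 3; i < parts.length; i++) {
                    String[] kv = parts[i].split("=");
                    node.children.put(kv[0], Integer.parseInt(kv[1]));
                }
            }
            nodes.put(Integer.parseInt(parts[0]), node);
        }
    }

    // Plays the role of Mapper.map(): label one comma-separated, unlabeled record.
    public String label(String record) {
        String[] values = record.split(",");
        Node node = nodes.get(0); // node 0 is the root by convention in this sketch
        while (node != null && node.label == null) {
            if (node.attribute >= values.length) return "unknown";
            Integer next = node.children.get(values[node.attribute].trim());
            if (next == null) return "unknown"; // value not covered by the tree
            node = nodes.get(next);
        }
        return node == null ? "unknown" : node.label;
    }

    public static void main(String[] args) {
        // Hypothetical play-tennis tree; columns: outlook, temperature, humidity, wind.
        List<String> model = Arrays.asList(
            "0 split 0 sunny=1 overcast=2 rainy=3",
            "1 split 2 high=4 normal=5",
            "2 leaf yes",
            "3 split 3 strong=6 weak=7",
            "4 leaf no",
            "5 leaf yes",
            "6 leaf no",
            "7 leaf yes");
        DecisionTreeLabeler labeler = new DecisionTreeLabeler(model);
        for (String record : Arrays.asList("sunny,hot,high,weak", "rainy,mild,normal,weak")) {
            System.out.println(record + " -> " + labeler.label(record));
        }
    }
}
```

Because the labeler only depends on the model lines it is given, which model file to pass for a given batch of test data stays the caller's decision, as described above.
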
2013/12/2 unmesha sreeveni <[EMAIL PROTECTED]>

> To make it more general, it's better to separate them, since there might
> be multiple batches of test (or to-be-labeled) data, and you only need to
> train the model once (if your data is stable).
>
> OK, I will go with the second option.
> If we keep them separate, the two jobs will have no connection to each
> other, so we must state which test data belongs to which training data,
> load the corresponding playtennnis_tree.txt for that training data (so the
> result file should be named in a way that identifies the training run it
> came from), and use it to predict the test data.
>
>
> On Mon, Dec 2, 2013 at 10:29 AM, Yexi Jiang <[EMAIL PROTECTED]> wrote:
>
>> Actually, training and testing (or prediction) do not have to be done in
>> one shot. If you need to run them consecutively in your particular
>> scenario, you can do it the way you described.
>>
>> To make it more general, it's better to separate them, since there might
>> be multiple batches of test (or to-be-labeled) data, and you only need to
>> train the model once (if your data is stable).
>>
>>
>> 2013/12/1 unmesha sreeveni <[EMAIL PROTECTED]>
>>
>>> 1. I just thought of building the model in a project named, say, DT, and
>>> when a huge input arrives, running another MR job, test.java, within DT.
>>> If we do not chain the jobs, we need to create separate projects,
>>> DT_build and DT_test, right?
>>> Or is there no need for a separate project?
>>>
>>> 2. M1_train - dataset for training.
>>>
>>> M1_test - test data, i.e. data for prediction.
>>> 1. Will the prediction input be a single record, or a set of records
>>> given at once?
>>> 2. We also need to ensure in our program that M1_test belongs to M1_train
>>> only. We should check that too, right? If M1_test is given to
>>> M2_train, it should show an error, shouldn't it?
>>>
>>> Is anything wrong in my inference?
>>> Are you able to guess what I am trying to accomplish?
>>> I am confused about whether I need to create a single project that
>>> includes both training and testing, or two projects.
>>>
>>>
>>> On Mon, Dec 2, 2013 at 9:54 AM, Yexi Jiang <[EMAIL PROTECTED]> wrote:
>>>
>>>> What is your motivation for chaining jobs?
>>>>
>>>>
>>>> 2013/12/1 unmesha sreeveni <[EMAIL PROTECTED]>
>>>>
>>>>> Thanks, Yexi. A very nice explanation, thanks a lot.
>>>>> It is explained in a very simple way that is really understandable for
>>>>> beginners.
>>>>> I can go with chaining jobs, right?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Sun, Dec 1, 2013 at 8:55 PM, Yexi Jiang <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> In my opinion:
>>>>>>
>>>>>> 1. Build the decision tree model with the training data.
>>>>>> 2. Store it somewhere.
>>>>>> 3. When the unlabeled data is available:
>>>>>>    3.1 If the unlabeled data is huge, write another MapReduce job to
>>>>>> process it: load the model in the setup stage and use the model to label
>>>>>> the records one by one in the map stage. There is no need for a reducer.
>>>>>>    3.2 If the unlabeled data is small, it is trivial.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> 2013/12/1 unmesha sreeveni <[EMAIL PROTECTED]>
>>>>>>
>>>>>>> Thanks, Yexi.
>>>>>>>
>>>>>>> But how can it be accomplished?
>>>>>>> The input to the Decision Tree MR job will be a set of data. But when
>>>>>>> predicting, the input will be a single line of data without a class
>>>>>>> label, right?
>>>>>>> So what changes will be needed in the MR job? Should we design it like this?
>>>>>>> 1. When a set of data arrives, build the decision tree.
>>>>>>> 2. Otherwise, if a single line of data arrives, check the output of the
>>>>>>> decision tree (the tree generated by the MR job) and predict the class label.
>>>>>>>
>>>>>>> -------
>>>>>>>
>>>>>>> M1_train - dataset for training.
>>>>>>> M1_test - test data or prediction.
>>>>>>> 1. Will the prediction input be a single record, or a set of records
>>>>>>> given at once?
>>>>>>> 2. We also need to ensure in our program that M1_test belongs to M1_train
Yexi Jiang,
ECS 251,  [EMAIL PROTECTED]
School of Computer and Information Science,
Florida International University
Homepage: http://users.cis.fiu.edu/~yjian004/