

Re: Decision Tree Implementation in Hadoop MapReduce
Yes, the user is responsible for using the correct model for a given piece
of testing (or unlabeled) data.
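
If an automatic check is still wanted, a minimal sketch is shown here. It assumes a purely hypothetical naming convention in which training set "M1_train" produces model file "M1_tree.txt" and test set "M1_test" may only be labeled with that file. The class name, method names, and suffixes are illustrative and do not come from this thread.

```java
/** Hypothetical helper that pairs a test set with the model built from its training set. */
class ModelNaming {
    // Strip a required suffix such as "_train" or "_test" to get the dataset id, e.g. "M1".
    static String datasetId(String name, String suffix) {
        if (!name.endsWith(suffix) || name.length() == suffix.length()) {
            throw new IllegalArgumentException("Expected a name ending in " + suffix + ": " + name);
        }
        return name.substring(0, name.length() - suffix.length());
    }

    // Name of the model file written by the training job (assumed convention).
    static String modelFileFor(String trainName) {
        return datasetId(trainName, "_train") + "_tree.txt";
    }

    // Return the model file if it matches the test set, otherwise fail fast.
    static String resolveModel(String testName, String modelFile) {
        String expected = datasetId(testName, "_test") + "_tree.txt";
        if (!expected.equals(modelFile)) {
            throw new IllegalArgumentException(
                testName + " expects " + expected + " but was given " + modelFile);
        }
        return modelFile;
    }
}
```

With this convention, resolveModel("M1_test", "M2_tree.txt") throws before any prediction job is started.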
2013/12/2 unmesha sreeveni <[EMAIL PROTECTED]>

> To make it more general, it's better to separate them, since there might
> be multiple batches of test (or to-be-labeled) data, and you only need to
> train the model once (if your data is stable).
>
> OK, I will go for the second one.
> So if we keep them separate, the two jobs will have no connection with
> each other, and we should specify which test data belongs to which
> training data. Then we load the corresponding playtennnis_tree.txt (the
> result file should be named so that its training data can be recognized
> from the file name) and predict the test data.
>
>
> On Mon, Dec 2, 2013 at 10:29 AM, Yexi Jiang <[EMAIL PROTECTED]> wrote:
>
>> Actually, training and testing (or prediction) do not have to be done
>> in one shot. If you need to do them consecutively in your particular
>> scenario, you can do it the way you described.
>>
>> To make it more general, it's better to separate them, since there might
>> be multiple batches of test (or to-be-labeled) data, and you only need to
>> train the model once (if your data is stable).
>>
>>
>> 2013/12/1 unmesha sreeveni <[EMAIL PROTECTED]>
>>
>>> 1. I just thought of building a model using a project named, say, DT,
>>> and when a huge input comes, running another MR job, test.java, within DT.
>>> If we are not chaining jobs, we need to create separate projects, right?
>>> DT_build and DT_test projects.
>>> Or is there no need for a separate project?
>>>
>>> 2. M1_train - dataset for training.
>>>
>>> M1_test - test data or prediction.
>>> 1. Will the input for prediction be a single record, or a set of records
>>> given at once?
>>> 2. We also need to ensure in our program that M1_test belongs to M1_train
>>> only. We should check that too, right? If M1_test is given to
>>> M2_train, it should show an error, shouldn't it?
>>>
>>> Is anything wrong in my inference?
>>> Are you able to guess what I am trying to accomplish?
>>> I am confused whether I need to create only one project that includes
>>> both train and test, or two projects.
>>>
>>>
>>> On Mon, Dec 2, 2013 at 9:54 AM, Yexi Jiang <[EMAIL PROTECTED]> wrote:
>>>
>>>> What is your motivation for chaining jobs?
>>>>
>>>>
>>>> 2013/12/1 unmesha sreeveni <[EMAIL PROTECTED]>
>>>>
>>>>> Thanks Yexi. A very nice explanation, thanks a lot.
>>>>> It is explained in a very simple way that is really understandable for
>>>>> beginners.
>>>>> I can go for chaining jobs, right?
>>>>>
>>>>> On Sun, Dec 1, 2013 at 8:55 PM, Yexi Jiang <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> In my opinion:
>>>>>>
>>>>>> 1. Build the decision tree model with the training data.
>>>>>> 2. Store it somewhere.
>>>>>> 3. When the unlabeled data is available:
>>>>>>    3.1 If the unlabeled data is huge, write another MR job to process
>>>>>>        it: load the model at the setup stage and use the model to label
>>>>>>        the records one by one in the map stage. There is no need for a
>>>>>>        reducer.
>>>>>>    3.2 If the unlabeled data is small, it is trivial.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> 2013/12/1 unmesha sreeveni <[EMAIL PROTECTED]>
>>>>>>
>>>>>>> Thanks Yexi,
>>>>>>>
>>>>>>> But how can it be accomplished?
>>>>>>> The input to the Decision Tree MR job will be a set of data. But when
>>>>>>> predicting, the input will be a single line of data without a class
>>>>>>> label, right?
>>>>>>> So what changes will be needed in the MR job? Should we design it like this:
>>>>>>> 1. When a set of data comes in, build the decision tree.
>>>>>>> 2. Else, if a single line of data comes in, check the output of the
>>>>>>> decision tree (the tree generated by the MR job) and predict the class label.
>>>>>>>
>>>>>>> -------
>>>>>>>
>>>>>>> M1_train - dataset for training.
>>>>>>> M1_test - test data or prediction.
>>>>>>> 1. Will the input for prediction be a single record, or a set of records
>>>>>>> given at once?
>>>>>>> 2. We also need to ensure in our program that M1_test belongs to M1_train
Yexi Jiang,
ECS 251,  [EMAIL PROTECTED]
School of Computer and Information Science,
Florida International University
Homepage: http://users.cis.fiu.edu/~yjian004/
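
The map-only labeling approach suggested earlier in this thread (build the tree, store it, load it at the setup stage, label records one by one in the map stage) could be sketched roughly as follows. This is plain Java with no Hadoop dependency so that it runs on its own. In a real job, the setup and map logic would presumably live in the corresponding methods of a Hadoop Mapper, with the number of reducers set to zero. The model file format (one root-to-leaf path per line, such as "outlook=sunny,humidity=high:no"), the attribute names, and the class name TreeLabeler are all assumptions made for illustration; the thread does not show what playtennnis_tree.txt actually contains.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Stand-in for a map-only prediction task; not the Hadoop API. */
class TreeLabeler {
    // Assumed column order of an unlabeled record (hypothetical).
    static final String[] ATTRIBUTES = {"outlook", "temperature", "humidity", "wind"};

    // One root-to-leaf path of the stored tree: its conditions and the leaf label.
    static final class Rule {
        final Map<String, String> conditions = new LinkedHashMap<>();
        String label;
    }

    private final List<Rule> rules = new ArrayList<>();

    // Counterpart of the setup stage: parse the stored model once per task.
    void setup(List<String> modelLines) {
        for (String line : modelLines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            int sep = trimmed.lastIndexOf(':');
            if (sep < 0) {
                throw new IllegalArgumentException("Missing ':' in model line: " + trimmed);
            }
            Rule rule = new Rule();
            rule.label = trimmed.substring(sep + 1);
            for (String cond : trimmed.substring(0, sep).split(",")) {
                if (cond.isEmpty()) {
                    continue; // a path with no conditions matches every record
                }
                String[] kv = cond.split("=", 2);
                if (kv.length != 2) {
                    throw new IllegalArgumentException("Bad condition: " + cond);
                }
                rule.conditions.put(kv[0], kv[1]);
            }
            rules.add(rule);
        }
    }

    // Counterpart of the map stage: label a single record; no reducer is involved.
    String map(String record) {
        String[] values = record.split(",");
        Map<String, String> row = new HashMap<>();
        for (int i = 0; i < ATTRIBUTES.length && i < values.length; i++) {
            row.put(ATTRIBUTES[i], values[i].trim());
        }
        for (Rule rule : rules) {
            boolean matches = true;
            for (Map.Entry<String, String> c : rule.conditions.entrySet()) {
                if (!c.getValue().equals(row.get(c.getKey()))) {
                    matches = false;
                    break;
                }
            }
            if (matches) {
                return rule.label;
            }
        }
        return "unknown"; // no path in the stored tree covers this record
    }
}
```

Because every record is labeled independently of the others, the same code handles a single line or a huge batch; only the number of map calls changes.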