-
Logistic regression package on Hadoop
Rajesh Nikam 2012-10-12, 13:06
Hi,
Could you please suggest a logistic regression package that can be used on Hadoop? I have a large data set and am looking for an LR package with kernel support.
Thanks Rajesh
-
Re: Logistic regression package on Hadoop
Harsh J 2012-10-12, 15:36
Hi Rajesh,
Please head over to the Apache Mahout project. See https://cwiki.apache.org/MAHOUT/logistic-regression.html
Apache Mahout is homed at http://mahout.apache.org and works well with Hadoop MR, etc.
-- Harsh J
-
Re: Logistic regression package on Hadoop
Ted Dunning 2012-10-12, 17:21
Harsh,
Thanks for the plug. Rajesh has been talking to us.
-
Re: Logistic regression package on Hadoop
Rajesh Nikam 2012-10-15, 12:34
Hi Harsh,
Thanks for the link to SGD in Mahout.
I have asked a question about an issue I hit when using SGD; the description is below. Ted Dunning mentioned there may be some issue with the data encoding.
However, I am not able to pinpoint the issue. Could you please let me know whether the problem is in the data format or in my usage?
The input files are attached.
I am using the Iris Plants Database from Michael Marshall. PFA iris.arff. I converted this to a CSV file just by updating the header: iris-3-classes.csv
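For readers who want to reproduce the "updating the header" step, here is a minimal Python sketch of one way to turn an ARFF file into a CSV with a header row. It is not the conversion that was actually used in this thread. The function name arff_to_csv and the two sample rows are illustrative assumptions; the attribute names are taken from the --predictors and --target arguments in the commands below. It handles only dense ARFF with unquoted attribute names.

```python
import csv
import io


def arff_to_csv(arff_text):
    """Convert dense ARFF text to CSV text with a header row (hypothetical helper)."""
    names = []
    rows = []
    in_data = False
    for raw_line in arff_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and ARFF comments, which start with '%'.
        if not line or line.startswith("%"):
            continue
        lower = line.lower()
        if in_data:
            rows.append([field.strip() for field in line.split(",")])
        elif lower.startswith("@attribute"):
            # "@attribute <name> <type>": the second token is the column name.
            names.append(line.split()[1])
        elif lower.startswith("@data"):
            in_data = True
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(names)
    writer.writerows(rows)
    return out.getvalue()


# Small sample in the shape of iris.arff (only two data rows, for illustration).
sample = """@RELATION iris
@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
"""

csv_text = arff_to_csv(sample)
print(csv_text)
```

The header names produced this way are the ones that --predictors and --target have to match.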
mahout org.apache.mahout.classifier.sgd.TrainLogistic --input /usr/local/mahout/trunk/iris-3-classes.csv --features 4 --output /usr/local/mahout/trunk/iris-3-classes.model --target class --categories 3 --predictors sepallength sepalwidth petallength petalwidth --types n
>> It gave the following error:
Exception in thread "main" java.lang.IllegalArgumentException: Can only call classifyScalar with two categories
I then created a CSV with only 2 classes. PFA iris-2-classes.csv
>> Trained iris-2-classes.csv with SGD:
mahout org.apache.mahout.classifier.sgd.TrainLogistic --input /usr/local/mahout/trunk/iris-2-classes.csv --features 4 --output /usr/local/mahout/trunk/iris-2-classes.model --target class --categories 2 --predictors sepallength sepalwidth petallength petalwidth --types n
mahout runlogistic --input /usr/local/mahout/trunk/iris-2-classes.csv --model /usr/local/mahout/trunk/iris-2-classes.model --auc --confusion
AUC = 0.14
confusion: [[50.0, 50.0], [0.0, 0.0]]
entropy: [[-0.6, -0.3], [-0.8, -0.4]]
>> The AUC seems too poor. I then changed --predictors:
mahout org.apache.mahout.classifier.sgd.TrainLogistic --input /usr/local/mahout/trunk/iris-2-classes.csv --features 4 --output /usr/local/mahout/trunk/iris-2-classes.model --target class --categories 2 --predictors sepalwidth petallength --types n n
mahout runlogistic --input /usr/local/mahout/trunk/iris-2-classes.csv --model /usr/local/mahout/trunk/iris-2-classes.model --auc --confusion --scores
AUC = 0.80
confusion: [[50.0, 50.0], [0.0, 0.0]]
entropy: [[-0.7, -0.3], [-0.7, -0.4]]
The AUC has improved; however, from the confusion matrix it seems everything is classified as class a.
Below is the output.
"target","model-output","log-likelihood"
0,0.492,-0.677017
0,0.493,-0.679192
0,0.493,-0.678355
0,0.493,-0.678724
0,0.492,-0.676583
0,0.491,-0.675182
0,0.492,-0.677452
0,0.492,-0.677419
0,0.493,-0.679628
0,0.493,-0.678724
0,0.491,-0.676116
0,0.492,-0.677386
0,0.493,-0.679192
0,0.493,-0.679291
0,0.491,-0.674912
0,0.490,-0.673081
0,0.491,-0.675313
0,0.492,-0.677017
0,0.491,-0.675616
0,0.491,-0.675682
0,0.492,-0.677353
0,0.491,-0.676116
0,0.492,-0.676714
0,0.492,-0.677788
0,0.492,-0.677287
0,0.493,-0.679126
0,0.492,-0.677386
0,0.492,-0.676984
0,0.492,-0.677452
0,0.492,-0.678256
0,0.493,-0.678691
0,0.492,-0.677419
0,0.491,-0.674381
0,0.490,-0.673980
0,0.493,-0.678724
0,0.493,-0.678387
0,0.492,-0.677050
0,0.493,-0.678724
0,0.493,-0.679225
0,0.492,-0.677419
0,0.492,-0.677050
0,0.495,-0.682279
0,0.493,-0.678355
0,0.492,-0.676951
0,0.491,-0.675550
0,0.493,-0.679192
0,0.491,-0.675649
0,0.493,-0.678322
0,0.491,-0.676116
0,0.492,-0.677887
1,0.492,-0.709316
1,0.492,-0.709248
1,0.492,-0.708935
1,0.494,-0.705048
1,0.493,-0.707488
1,0.493,-0.707454
1,0.492,-0.709765
1,0.494,-0.705258
1,0.493,-0.707936
1,0.493,-0.706803
1,0.495,-0.703539
1,0.493,-0.708249
1,0.494,-0.704601
1,0.493,-0.707970
1,0.493,-0.707597
1,0.492,-0.708765
1,0.492,-0.708351
1,0.493,-0.706871
1,0.494,-0.704770
1,0.494,-0.705908
1,0.492,-0.709350
1,0.493,-0.707285
1,0.493,-0.706247
1,0.493,-0.707522
1,0.493,-0.707835
1,0.492,-0.708317
1,0.493,-0.707556
1,0.492,-0.708520
1,0.493,-0.707902
1,0.494,-0.706220
1,0.494,-0.705427
1,0.494,-0.705393
1,0.493,-0.706803
1,0.493,-0.707210
1,0.492,-0.708351
1,0.492,-0.710146
1,0.492,-0.708867
1,0.494,-0.705183
1,0.493,-0.708215
1,0.494,-0.705942
1,0.493,-0.706525
1,0.492,-0.708385
1,0.493,-0.706389
1,0.494,-0.704811
1,0.493,-0.706905
1,0.493,-0.708249
1,0.493,-0.707801
1,0.493,-0.707835
1,0.494,-0.705604
1,0.493,-0.707319
AUC = 0.80
confusion: [[50.0, 50.0], [0.0, 0.0]]
entropy: [[-0.7, -0.3], [-0.7, -0.4]]
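A note on reading these numbers: every model-output value in this message is just below 0.5, so any rule that predicts class 1 only at 0.5 or above puts all rows in class 0, whatever the true target. AUC depends only on how the scores are ordered, so it can still be well above 0.5. The Python sketch below illustrates this with a handful of score values taken from the listing in this message. The 0.5 threshold, the function names, and the matrix layout (rows = predicted class, columns = actual class) are assumptions made for illustration, not a statement of how runlogistic computes its report.

```python
def confusion_matrix(targets, scores, threshold=0.5):
    """2x2 counts; rows = predicted class, columns = actual class (assumed layout)."""
    matrix = [[0, 0], [0, 0]]
    for target, score in zip(targets, scores):
        predicted = 1 if score >= threshold else 0
        matrix[predicted][target] += 1
    return matrix


def auc(targets, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for t, s in zip(targets, scores) if t == 1]
    negatives = [s for t, s in zip(targets, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# A few (target, model-output) values of the kind listed in this message.
targets = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.492, 0.493, 0.491, 0.490, 0.492, 0.494, 0.493, 0.495]

matrix = confusion_matrix(targets, scores)
area = auc(targets, scores)
print(matrix)  # [[4, 4], [0, 0]]: every row is predicted as class 0
print(area)    # 0.875: the ordering still separates the classes fairly well
```

On this reading, the model has learned a weak but usable ordering, while the scores themselves barely move away from 0.5.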
-
Re: Logistic regression package on Hadoop
Bertrand Dechoux 2012-10-15, 12:53
Hi Rajesh,
You may want to use the Mahout mailing list for Mahout-related questions: http://mahout.apache.org/mailinglists.html
Regards,
Bertrand