-
Bulk loading a CSV file into HBase
anil gupta 2012-03-05, 19:48
Hi All,
I am getting a "Bad line at offset" error in the stderr log of the tasks while testing bulk loading of a CSV file into HBase. I am using cdh3u2. Import of a TSV works fine.
Here is the command I ran:
sudo -u hdfs hadoop jar /usr/lib/hbase/hbase-0.90.4-cdh3u2.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,data:name,data:city testload /temp/csv -Dimporttsv.skip.bad.lines=true '-Dimporttsv.separator=,'
Job Stdout logs:
[root@ihub-namenode1 ihub]# sudo -u hdfs hadoop jar /usr/lib/hbase/hbase-0.90.4-cdh3u2.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,data:name,data:city testload /temp/csv -Dimporttsv.skip.bad.lines=true '-Dimporttsv.separator=,'
12/03/05 11:42:42 INFO zookeeper.ZooKeeper: Client environment:zookeeper.version=3.3.3-cdh3u2--1, built on 10/14/2011 05:17 GMT
12/03/05 11:42:42 INFO zookeeper.ZooKeeper: Client environment:host.name=ihub-namenode1
12/03/05 11:42:42 INFO zookeeper.ZooKeeper: Client environment:java.version=1.6.0_20
12/03/05 11:42:42 INFO zookeeper.ZooKeeper: Client environment:java.vendor=Sun Microsystems Inc.
12/03/05 11:42:42 INFO zookeeper.ZooKeeper: Client environment:java.home=/usr/java/jdk1.6.0_20/jre
12/03/05 11:42:42 INFO zookeeper.ZooKeeper: Client environment:java.class.path=/usr/lib/hadoop-0.20/conf:/usr/java/jdk1.6.0_20/jre//lib/tools.jar:/usr/lib/hadoop-0.20:/usr/lib/hadoop-0.20/hadoop-core-0.20.2-cdh3u2.jar:/usr/lib/hadoop-0.20/lib/ant-contrib-1.0b3.jar:/usr/lib/hadoop-0.20/lib/aspectjrt-1.6.5.jar:/usr/lib/hadoop-0.20/lib/aspectjtools-1.6.5.jar:/usr/lib/hadoop-0.20/lib/commons-cli-1.2.jar:/usr/lib/hadoop-0.20/lib/commons-codec-1.4.jar:/usr/lib/hadoop-0.20/lib/commons-daemon-1.0.1.jar:/usr/lib/hadoop-0.20/lib/commons-el-1.0.jar:/usr/lib/hadoop-0.20/lib/commons-httpclient-3.1.jar:/usr/lib/hadoop-0.20/lib/commons-logging-1.0.4.jar:/usr/lib/hadoop-0.20/lib/commons-logging-api-1.0.4.jar:/usr/lib/hadoop-0.20/lib/commons-net-1.4.1.jar:/usr/lib/hadoop-0.20/lib/core-3.1.1.jar:/usr/lib/hadoop-0.20/lib/guava-r06.jar:/usr/lib/hadoop-0.20/lib/hadoop-fairscheduler-0.20.2-cdh3u2.jar:/usr/lib/hadoop-0.20/lib/hsqldb-1.8.0.10.jar:/usr/lib/hadoop-0.20/lib/jackson-core-asl-1.5.2.jar:/usr/lib/hadoop-0.20/lib/jackson-mapper-asl-1.5.2.jar:/usr/lib/hadoop-0.20/lib/jasper-compiler-5.5.12.jar:/usr/lib/hadoop-0.20/lib/jasper-runtime-5.5.12.jar:/usr/lib/hadoop-0.20/lib/jets3t-0.6.1.jar:/usr/lib/hadoop-0.20/lib/jetty-6.1.26.cloudera.1.jar:/usr/lib/hadoop-0.20/lib/jetty-servlet-tester-6.1.26.cloudera.1.jar:/usr/lib/hadoop-0.20/lib/jetty-util-6.1.26.cloudera.1.jar:/usr/lib/hadoop-0.20/lib/jsch-0.1.42.jar:/usr/lib/hadoop-0.20/lib/junit-4.5.jar:/usr/lib/hadoop-0.20/lib/kfs-0.2.2.jar:/usr/lib/hadoop-0.20/lib/log4j-1.2.15.jar:/usr/lib/hadoop-0.20/lib/mockito-all-1.8.2.jar:/usr/lib/hadoop-0.20/lib/oro-2.0.8.jar:/usr/lib/hadoop-0.20/lib/servlet-api-2.5-20081211.jar:/usr/lib/hadoop-0.20/lib/servlet-api-2.5-6.1.14.jar:/usr/lib/hadoop-0.20/lib/slf4j-api-1.4.3.jar:/usr/lib/hadoop-0.20/lib/slf4j-log4j12-1.4.3.jar:/usr/lib/hadoop-0.20/lib/xmlenc-0.52.jar:/usr/lib/hadoop-0.20/lib/zookeeper.jar:/usr/lib/hadoop-0.20/lib/jsp-2.1/jsp-2.1.jar:/usr/lib/hadoop-0.20/lib/jsp-2.1/jsp-api-2.1.jar:/usr/lib/hadoop/lib:/usr/lib/hbase/lib:/usr/lib/sqoop/lib:/etc/hbase/conf
12/03/05 11:42:42 INFO zookeeper.ZooKeeper: Client environment:java.library.path=/usr/java/jdk1.6.0_20/jre/lib/amd64/server:/usr/java/jdk1.6.0_20/jre/lib/amd64:/usr/java/jdk1.6.0_20/jre/../lib/amd64:/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
12/03/05 11:42:42 INFO zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/tmp
12/03/05 11:42:42 INFO zookeeper.ZooKeeper: Client environment:java.compiler=<NA>
12/03/05 11:42:42 INFO zookeeper.ZooKeeper: Client environment:os.name=Linux
12/03/05 11:42:42 INFO zookeeper.ZooKeeper: Client environment:os.arch=amd64
12/03/05 11:42:42 INFO zookeeper.ZooKeeper: Client environment:os.version=2.6.32-71.el6.x86_64
12/03/05 11:42:42 INFO zookeeper.ZooKeeper: Client environment:user.name=hdfs
12/03/05 11:42:42 INFO zookeeper.ZooKeeper: Client environment:user.home=/usr/lib/hadoop-0.20
12/03/05 11:42:42 INFO zookeeper.ZooKeeper: Client environment:user.dir=/home/ihub
12/03/05 11:42:42 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=ihub-jobtracker1:2181 sessionTimeout=180000 watcher=hconnection
12/03/05 11:42:42 INFO zookeeper.ClientCnxn: Opening socket connection to server ihub-jobtracker1/192.168.1.98:2181
12/03/05 11:42:42 INFO zookeeper.ClientCnxn: Socket connection established to ihub-jobtracker1/192.168.1.98:2181, initiating session
12/03/05 11:42:42 INFO zookeeper.ClientCnxn: Session establishment complete on server ihub-jobtracker1/192.168.1.98:2181, sessionid 0x135d53c669a007a, negotiated timeout = 40000
12/03/05 11:42:42 INFO mapreduce.TableOutputFormat: Created table instance for testload
12/03/05 11:42:42 INFO input.FileInputFormat: Total input paths to process
12/03/05 11:42:42 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
12/03/05 11:42:42 WARN snappy.LoadSnappy: Snappy native library not loaded
12/03/05 11:42:42 INFO mapred.JobClient: Running job: job_201203021306_0017
12/03/05 11:42:43 INFO mapred.JobClient: map 0% reduce 0%
12/03/05 11:42:48 INFO mapred.JobClient: map 100% reduce 0%
12/03/05 11:42:48 INFO mapred.JobClient: Job complete: job_201203021306_0017
12/03/05 11:42:48 INFO mapred.JobClient: Counters: 13
12/03/05 11:42:48 INFO mapred.JobClient: Job Counters
12/03/05 11:42:48 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=5063
12/03/05 11:42:48 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/03/05 11:42:48 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/03/05 11:42:48 INFO mapred.JobClient: Launched map tasks=1
12/03/05 11:42:48 INFO mapred.JobClient: Data-local map tasks=1
12/03/05 11:42:48 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
12/03/05 1
-
Re: Bulk loading a CSV file into HBase
Stack 2012-03-05, 22:58
On Mon, Mar 5, 2012 at 11:48 AM, anil gupta <[EMAIL PROTECTED]> wrote:
> I am getting a "Bad line at offset" error in Stderr log of tasks while
> testing bulk loading a CSV file into HBase. I am using cdh3u2. Import of a
> TSV works fine.
It's either the encoding of your TSV and CSV files or a problem with the parsing code in the importtsv tool. Can you figure out which it is? Can you add a bit of debug output for the next time you run the job?
Thanks, St.Ack
-
Re: Bulk loading a CSV file into HBase
anil gupta 2012-03-06, 00:54
Hi St.Ack,
Thanks for the response. Both the TSV and CSV are UTF-8 files. Could you please let me know how to run bulk loading in debug mode? I don't know of any Hadoop option that can run a job in debug mode.
Thanks, Anil
-- Thanks & Regards, Anil Gupta
-
Re: Bulk loading a CSV file into HBase
Shrijeet Paliwal 2012-03-06, 01:06
Anil, Stack meant adding debug statements to the tool yourself.
-Shrijeet
-
Re: Bulk loading a CSV file into HBase
anil gupta 2012-03-08, 07:59
Hi Stack,
I decompiled the ImportTsv class and added some sysout statements in main() to figure out the problem. Please find the modified class here: http://pastebin.com/sKQcMXe4
With Keshav's help, I found that the CSV import works fine when I provide "-Dimporttsv.separator=," as the first command-line parameter after the class name. Here are the command and the console log of the successful CSV import:
sudo -u hdfs hadoop jar /usr/lib/hadoop/importdata.jar com.intuit.ihub.hbase.poc.ImportData -Dimporttsv.separator=, -Dimporttsv.columns=HBASE_ROW_KEY,data:name,data:city testload /temp/csv -Dimporttsv.skip.bad.lines=true
Command line Arguments::-Dimporttsv.separator=,
Command line Arguments::-Dimporttsv.columns=HBASE_ROW_KEY,data:name,data:city
Command line Arguments::testload
Command line Arguments::/temp/csv
Command line Arguments::-Dimporttsv.skip.bad.lines=true
OtherArguments==>testload
OtherArguments==>/temp/csv
OtherArguments==>-D
OtherArguments==>importtsv.skip.bad.lines=true
SEPARATOR as per jobconf:,
12/03/07 10:01:33 INFO zookeeper.ZooKeeper: Client environment:zookeeper.version=3.3.3-cdh3u2--1, built on 10/14/2011 05:17 GMT
12/03/07 10:01:33 INFO zookeeper.ZooKeeper: Client environment:host.name=ihub-namenode1
12/03/07 10:01:33 INFO zookeeper.ZooKeeper: Client environment:java.version=1.6.0_20
12/03/07 10:01:33 INFO zookeeper.ZooKeeper: Client environment:java.vendor=Sun Microsystems Inc.
12/03/07 10:01:33 INFO zookeeper.ZooKeeper: Client environment:java.home=/usr/java/jdk1.6.0_20/jre 12/03/07 10:01:33 INFO zookeeper.ZooKeeper: Client environment:java.class.path=/usr/lib/hadoop-0.20/conf:/usr/java/jdk1.6.0_20/jre//lib/tools.jar:/usr/lib/hadoop-0.20:/usr/lib/hadoop-0.20/hadoop-core-0.20.2-cdh3u2.jar:/usr/lib/hadoop-0.20/lib/ant-contrib-1.0b3.jar:/usr/lib/hadoop-0.20/lib/aspectjrt-1.6.5.jar:/usr/lib/hadoop-0.20/lib/aspectjtools-1.6.5.jar:/usr/lib/hadoop-0.20/lib/commons-cli-1.2.jar:/usr/lib/hadoop-0.20/lib/commons-codec-1.4.jar:/usr/lib/hadoop-0.20/lib/commons-daemon-1.0.1.jar:/usr/lib/hadoop-0.20/lib/commons-el-1.0.jar:/usr/lib/hadoop-0.20/lib/commons-httpclient-3.1.jar:/usr/lib/hadoop-0.20/lib/commons-logging-1.0.4.jar:/usr/lib/hadoop-0.20/lib/commons-logging-api-1.0.4.jar:/usr/lib/hadoop-0.20/lib/commons-net-1.4.1.jar:/usr/lib/hadoop-0.20/lib/core-3.1.1.jar:/usr/lib/hadoop-0.20/lib/guava-r06.jar:/usr/lib/hadoop-0.20/lib/hadoop-fairscheduler-0.20.2-cdh3u2.jar:/usr/lib/hadoop-0.20/lib/hbase.jar:/usr/lib/hadoop-0.20/lib/hsqldb-1.8.0.10.jar:/usr/lib/hadoop-0.20/lib/jackson-core-asl-1.5.2.jar:/usr/lib/hadoop-0.20/lib/jackson-mapper-asl-1.5.2.jar:/usr/lib/hadoop-0.20/lib/jasper-compiler-5.5.12.jar:/usr/lib/hadoop-0.20/lib/jasper-runtime-5.5.12.jar:/usr/lib/hadoop-0.20/lib/jets3t-0.6.1.jar:/usr/lib/hadoop-0.20/lib/jetty-6.1.26.cloudera.1.jar:/usr/lib/hadoop-0.20/lib/jetty-servlet-tester-6.1.26.cloudera.1.jar:/usr/lib/hadoop-0.20/lib/jetty-util-6.1.26.cloudera.1.jar:/usr/lib/hadoop-0.20/lib/jsch-0.1.42.jar:/usr/lib/hadoop-0.20/lib/junit-4.5.jar:/usr/lib/hadoop-0.20/lib/kfs-0.2.2.jar:/usr/lib/hadoop-0.20/lib/log4j-1.2.15.jar:/usr/lib/hadoop-0.20/lib/mockito-all-1.8.2.jar:/usr/lib/hadoop-0.20/lib/oro-2.0.8.jar:/usr/lib/hadoop-0.20/lib/servlet-api-2.5-20081211.jar:/usr/lib/hadoop-0.20/lib/servlet-api-2.5-6.1.14.jar:/usr/lib/hadoop-0.20/lib/slf4j-api-1.4.3.jar:/usr/lib/hadoop-0.20/lib/slf4j-log4j12-1.4.3.jar:/usr/lib/hadoop-0.20/lib/xmlenc-0.52.jar:/usr/lib/ha
doop-0.20/lib/zookeeper.jar:/usr/lib/hadoop-0.20/lib/jsp-2.1/jsp-2.1.jar:/usr/lib/hadoop-0.20/lib/jsp-2.1/jsp-api-2.1.jar:/usr/lib/hadoop/lib:/usr/lib/hbase/lib:/usr/lib/sqoop/lib:/etc/hbase/conf 12/03/07 10:01:33 INFO zookeeper.ZooKeeper: Client environment:java.library.path=/usr/java/jdk1.6.0_20/jre/lib/amd64/server:/usr/java/jdk1.6.0_20/jre/lib/amd64:/usr/java/jdk1.6.0_20/jre/../lib/amd64:/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib 12/03/07 10:01:33 INFO zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/tmp 12/03/07 10:01:33 INFO zookeeper.ZooKeeper: Client environment:java.compiler=<NA> 12/03/07 10:01:33 INFO zookeeper.ZooKeeper: Client environment:os.name=Linux 12/03/07 10:01:33 INFO zookeeper.ZooKeeper: Client environment:os.arch=amd64 12/03/07 10:01:33 INFO zookeeper.ZooKeeper: Client environment:os.version=2.6.32-71.el6.x86_64 12/03/07 10:01:33 INFO zookeeper.ZooKeeper: Client environment:user.name =hdfs 12/03/07 10:01:33 INFO zookeeper.ZooKeeper: Client environment:user.home=/usr/lib/hadoop-0.20 12/03/07 10:01:33 INFO zookeeper.ZooKeeper: Client environment:user.dir=/root 12/03/07 10:01:33 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=ihub-jobtracker1:2181 sessionTimeout=180000 watcher=hconnection 12/03/07 10:01:33 INFO zookeeper.ClientCnxn: Opening socket connection to server ihub-jobtracker1/192.168.1.98:2181 12/03/07 10:01:33 INFO zookeeper.ClientCnxn: Socket connection established to ihub-jobtracker1/192.168.1.98:2181, initiating session 12/03/07 10:01:33 INFO zookeeper.ClientCnxn: Session establishment complete on server ihub-jobtracker1/192.168.1.98:2181, sessionid 0x135d53c669a00ab, negotiated timeout = 40000 12/03/07 10:01:33 INFO mapreduce.TableOutputFormat: Created table instance for testload 12/03/07 10:01:33 INFO input.FileInputFormat: Total input paths to process 12/03/07 10:01:33 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable 12/03/07 10:01:33 WARN snappy.LoadSnappy: Snappy native library not loaded 12/03/07 10:01:34 INFO mapred.JobClient: Running job: job_201203021306_0028 12/03/07 10:01:35 INFO mapred.JobClient: map 0% reduce 0% 12/03/07 10:01:40 INFO mapred.JobClient: map 100% reduce 0% 12/03/07 10:01:41 INFO mapred.JobClient: Job complete: job_201203021306_0028 12/03/07 10:01:41 INFO mapred.JobClient: Counters: 13 12/03/07 10:01:41 INFO mapred.JobClient: Job Counters 12/03/07 10:01:41 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=5177 12/03/07 10:01:41 INF
-
Re: Bulk loading a CSV file into HBase
Stack 2012-03-08, 17:12
On Wed, Mar 7, 2012 at 11:59 PM, anil gupta <[EMAIL PROTECTED]> wrote:
> I tried to analyze the problem and as per my analysis there is a problem
> with "String[] otherArgs = new GenericOptionsParser(conf,
> args).getRemainingArgs();" on line#102. Let me know your views.
So, it's just where you put the option on the command line? If it's at the end, my guess is that the arg is presumed to be for the program. If it's before the program name, then it's for GenericOptionsParser to digest. That's sort of how it is expected to work, I'd say. It's confusing though. Can we do anything in the usage for the importtsv tool so that others don't have this issue?
Thanks, St.Ack
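One possible answer to the usage question, sketched under assumptions: after the generic options have been parsed, a tool could look at the leftover arguments and warn when any of them still looks like a -D option. The class and method names below (TrailingOptionCheck, findUnparsedOptions) are hypothetical and are not part of importtsv or Hadoop; the sample array is the OtherArguments output from the debug run above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of importtsv or Hadoop: flags leftover
// arguments that look like -D options the generic parser never consumed.
class TrailingOptionCheck {

    /** Returns every leftover argument that starts with "-D". */
    static List<String> findUnparsedOptions(String[] otherArgs) {
        List<String> suspicious = new ArrayList<>();
        for (String arg : otherArgs) {
            if (arg.startsWith("-D")) {
                suspicious.add(arg);
            }
        }
        return suspicious;
    }

    public static void main(String[] ignored) {
        // Leftover arguments as printed by the debug run in this thread.
        String[] otherArgs = {"testload", "/temp/csv", "-D", "importtsv.skip.bad.lines=true"};
        List<String> unparsed = findUnparsedOptions(otherArgs);
        if (!unparsed.isEmpty()) {
            System.err.println("WARNING: -D options found after the table name were not applied: "
                    + unparsed + ". Put all -D options before the table name.");
        }
    }
}
```

A check like this only reports the problem; it does not re-parse the options, so it avoids duplicating what GenericOptionsParser already does.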
-
Re: Bulk loading a CSV file into HBase
anil gupta 2012-03-08, 19:14
Hi Stack,
Yes, the separator argument is sensitive to its position in the command. Currently, it needs to be specified just after the program name. This is not mentioned in the docs.
I have two suggestions for fixing this so that others don't run into the same problem:
1. Update the HBase bulk load documentation and specify that the separator argument should come right after the program name.
2. Fix the problem in the code itself by handling the separator argument explicitly. (Still, I am wondering why only the separator value is not set in the jobconf automatically when it is not provided next to the program name.)
What's your take?
Thanks, Anil
-- Thanks & Regards, Anil Gupta
-
Re: Bulk loading a CSV file into HBase
Stack 2012-03-08, 19:27
On Thu, Mar 8, 2012 at 11:14 AM, anil gupta <[EMAIL PROTECTED]> wrote:
> 1. Update the HBase bulk load documentation and specify that separator
> argument should be next to program name.
This would help.
> 2. Fix the problem in the code itself by handling the separator argument
> explicitly. (Still, i am wondering why only separator value is not being
> set in jobconf automatically if it is not provided next to program name??)
This is probably too late IIRC. I haven't looked at the code, but GenericOptionsParser has probably already been run by the time the application starts to process args. Duplicating what GOP does in the application is probably not the way to go either?
St.Ack
-
Re: Bulk loading a CSV file into HBase
Shrijeet Paliwal 2012-03-08, 20:06
GenericOptionsParser stops parsing the arguments as soon as the first non-option argument is encountered (refer: http://commons.apache.org/cli/api-1.2/org/apache/commons/cli/Parser.html#parse(org.apache.commons.cli.Options, java.lang.String[], boolean))
So in this case, as soon as the parser sees the table name argument, it ignores all other properties specified with the -D option. Note that it ignores not only the separator but also the importtsv.skip.bad.lines option in your run that failed.
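The behaviour described here can be illustrated with a small stand-alone model. This is a minimal sketch, not Hadoop's GenericOptionsParser or commons-cli code: the class StopAtNonOptionDemo and its parse method are hypothetical, and it only understands -Dkey=value tokens. The two argument lists are the failing and the working command lines from this thread (the shell has already removed the quotes around the separator option).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model only: this is NOT Hadoop's GenericOptionsParser or commons-cli.
// It mimics one behaviour: option parsing stops at the first non-option token.
class StopAtNonOptionDemo {

    /** Properties picked up from -Dkey=value tokens, plus the arguments left unparsed. */
    static final class Parsed {
        final Map<String, String> props = new LinkedHashMap<>();
        final List<String> remaining = new ArrayList<>();
    }

    static Parsed parse(String[] args) {
        Parsed result = new Parsed();
        int i = 0;
        // Consume leading -Dkey=value tokens only.
        while (i < args.length && args[i].startsWith("-D") && args[i].indexOf('=') > 2) {
            int eq = args[i].indexOf('=');
            result.props.put(args[i].substring(2, eq), args[i].substring(eq + 1));
            i++;
        }
        // Everything from the first non-option onward is handed back untouched.
        result.remaining.addAll(Arrays.asList(args).subList(i, args.length));
        return result;
    }

    public static void main(String[] ignored) {
        String[] failingRun = {
            "-Dimporttsv.columns=HBASE_ROW_KEY,data:name,data:city",
            "testload", "/temp/csv",
            "-Dimporttsv.skip.bad.lines=true", "-Dimporttsv.separator=,"
        };
        String[] workingRun = {
            "-Dimporttsv.separator=,",
            "-Dimporttsv.columns=HBASE_ROW_KEY,data:name,data:city",
            "testload", "/temp/csv",
            "-Dimporttsv.skip.bad.lines=true"
        };
        Parsed failing = parse(failingRun);
        Parsed working = parse(workingRun);
        System.out.println("failing run, separator set: " + failing.props.containsKey("importtsv.separator"));
        System.out.println("failing run, leftover args: " + failing.remaining);
        System.out.println("working run, separator set: " + working.props.containsKey("importtsv.separator"));
        System.out.println("working run, leftover args: " + working.remaining);
    }
}
```

In this model the failing run never sets importtsv.separator because that option sits after the table name, while the working run sets it but still leaves importtsv.skip.bad.lines unparsed.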
-
Re: Bulk loading a CSV file into HBase
anil gupta 2012-03-08, 21:42
Yeah, after digging further into the code: Line#374 in GenericOptionsParser.java, "commandLine = parser.parse(opts, args, true);", is the culprit. Nice find, Shrijeet. That answers my question. :)
Stack: Could you please tell me the meaning of "IIRC"? Updating the document is good, but as per the behavior of parse(), any other -D option that follows the table name will also be ignored. Duplicating the GOP functionality does not seem to be a good idea. Maybe if, instead of invoking "parser.parse(opts, args, true);", we could somehow invoke "parser.parse(opts, args, false);", then all would be good. I haven't looked at the API to know whether that is possible. This is just food for thought.
Thanks, Anil
-
RE: Bulk loading a CSV file into HBase
Laxman 2012-03-09, 08:20
Hi Anil,
> instead of invoking "parser.parse(opts, args, true);" if somehow we can
> invoke "parser.parse(opts, args, false);" then all will be good. I haven't
> looked at the api to know about the possibility of same.
Changing to parser.parse(opts, args, false) solves this problem. I think we need to consider the following before going for this change.
This involves a behavior change in legacy Hadoop code. Directly changing from true to false may cause behavioral compatibility issues.
Also, setting it to false may not be correct all the time.
Case #1 java
"java -Dprop1=val1 <Class> arg1 arg2" is different from "java <Class> arg1 arg2 -Dprop1=val1"
In this case it looks like parser.parse(opts, args, true) is correct.
Case #2 linux
"ls -l /home" is the same as "ls /home -l"
In this case it looks like parser.parse(opts, args, false) is correct.
>> This is probably too late IIRC
I hope Stack also meant the same point here.
> Could you please tell me the meaning of "IIRC"?
IIRC - If I Recall/Remember Correctly
-- Regards, Laxman
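The difference between the two flag values can be sketched with a toy scanner. This is an illustration under assumptions, not the commons-cli implementation: StopFlagComparison and parsedOptions are hypothetical names, and the scanner only recognises tokens that start with -D. The argument list is the failing command line from this thread.

```java
import java.util.ArrayList;
import java.util.List;

// Toy scanner, not the commons-cli parser: shows how a stop-at-non-option
// flag changes which -D tokens are picked up.
class StopFlagComparison {

    /** Returns the -D tokens this simplified scanner would treat as options. */
    static List<String> parsedOptions(String[] args, boolean stopAtNonOption) {
        List<String> options = new ArrayList<>();
        for (String arg : args) {
            if (arg.startsWith("-D")) {
                options.add(arg);
            } else if (stopAtNonOption) {
                break; // the first positional argument ends option parsing
            }
            // With stopAtNonOption == false, positional arguments are skipped
            // and scanning continues to the end of the list.
        }
        return options;
    }

    public static void main(String[] ignored) {
        String[] failingRun = {
            "-Dimporttsv.columns=HBASE_ROW_KEY,data:name,data:city",
            "testload", "/temp/csv",
            "-Dimporttsv.skip.bad.lines=true", "-Dimporttsv.separator=,"
        };
        System.out.println("stopAtNonOption=true:  " + parsedOptions(failingRun, true));
        System.out.println("stopAtNonOption=false: " + parsedOptions(failingRun, false));
    }
}
```

With the flag set to true only the options before the table name are seen, which matches Case #1; with false the trailing options are seen as well, which matches Case #2.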
-
Re: Bulk loading a CSV file into HBase
anil gupta 2012-03-09, 16:29
Hi Lakshman,
As per your last email, updating the doc seems to be the easy and right approach.
Thanks, Anil Gupta
-
Re: Bulk loading a CSV file into HBase
Harsh J 2012-05-28, 12:36
Anil,
Sorry for the late bump, but just for your reference, this is because of: https://issues.apache.org/jira/browse/HADOOP-7995
Harsh J