Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Pig >> mail # dev >> Review Request: PIG-3015 Rewrite of AvroStorage

Copy link to this message
Re: Review Request: PIG-3015 Rewrite of AvroStorage

This is an automatically generated e-mail. To reply, visit:
Overall looks great! I haven't gone through the test cases yet, but here are my comments so far.
1) I noticed that I cannot load .avro files that are not record types. For example, I tried to load a .avro file whose schema is "int" as follows:

[cheolsoo@cheolsoo-mr1-0 pig-svn]$ java -jar avro-tools-1.5.4.jar getschema foo2/test_int.avro

[cheolsoo@cheolsoo-mr1-0 pig-svn]$ java -jar avro-tools-1.5.4.jar tojson foo2/test_int.avro

in = LOAD 'foo2/test_int.avro' USING AvroStorage('int');
DUMP in;

This gives me the following error:

Caused by: java.io.IOException: avroSchemaToResourceSchema only processes records

Can only Avro record type be loaded? Or am I doing something wrong?
2) TestAvroStorage needs to be more automated. To run it, I had to run the following commands:

ant clean compile-test
cd ./test/org/apache/pig/builtin/avro
python createests.py
cd -
ant clean test -Dtestcase=TestAvroStorage

Ideally, I should be able to run a single command: ant clean -Dtestcase=TestAvroStorage. Please let me know if you need help for this.
3) python createests.py fails with the following errors. I suppose that some files are missing:

creating data/avro/uncompressed/testDirectoryCounts.avro
Exception in thread "main" java.io.FileNotFoundException: data/json/testDirectoryCounts.json (No such file or directory)
creating evenFileNameTestDirectoryCounts.avro
Exception in thread "main" java.io.FileNotFoundException: data/json/evenFileNameTestDirectoryCounts.json (No such file or directory)
4) ant test -Dtestcase=TestAvroStorage fails with the following errors. I suppose that this is due to the missing files:

Testcase: testLoadDirectory took 0.005 sec
Testcase: testLoadGlob took 0.004 sec
Testcase: testPartialLoadGlob took 0.005 sec
5) Typo in the name of createests.py. It should be createtests.py.
6) Is createTests.bash needed at all? If not, can you remove it?
I have more comments inline:

    Can you update .eclipse.templates/.classpath too?


    Can you remove this comment?


    Can you change Exception to SchemaParseException?


    Wouldn't it be better to make this static?


    FileSystem.get(new Configuration()).open() will look for a file on the default FS that is specified by Hadoop conf. But it won't work in the following case:
    1) The default FS is set by installed Hadoop conf as follows:
    2) Run Pig in *local* mode and specify a schema file that is on local FS. This gives me the following error:
    java.io.FileNotFoundException: File does not exist: /user/cheolsoo/test_record.avsc
    3) Even if I specify the uri schema as file:/, this still gives me the following error:
    java.lang.IllegalArgumentException: Wrong FS: file:/home/cheolsoo/pig-svn/test_record.avsc, expected: hdfs://cheolsoo-mr1-0.ent.cloudera.com:8020
    Can you instead do the following?
    Path p = new Path(...);
    (FileSystem.get(p.toUri(), new Configuration()).open(p);
    Then, local FS will be used if file:/ is specified in path, and the default FS is if no uri scheme is specified.


    Same problem as above.


    Can you split this into two catch blocks: ParseException and IOException?
    It seems incorrect that the help message is printed when FileSystem throws an IOExcetion.


    Can you filter out files that start with "." as well?


    Same as above.


    This won't work in the following case. Let's say p matches two dirs, and one dir is empty.
    p = foo*
    I would expect the schema of bar.avro is returned, but I get an IOException instead.


    Typo: not => No.


    Can you rename this class without Fast?


    Is this needed?
    In the constructor, schema is supposed to be set. If not, there must be an error. Shouldn't we throw an exception instead of re-trying to set schema?
    Please correct me if I am wrong.


    Can you replace ".avro" with AvroOutputFormat.EXT?


    Can you move this comment to the checkSchema() method?


    Can you move this comment to the prepareToWrite() method?