Re: Review Request: PIG-3015 Rewrite of AvroStorage


> On Dec. 3, 2012, 7:22 p.m., Cheolsoo Park wrote:
> > Overall looks great! I haven't gone through the test cases yet, but here are my comments so far.
> >
> >
> > 1) I noticed that I cannot load .avro files that are not record types. For example, I tried to load a .avro file whose schema is "int" as follows:
> >
> > [cheolsoo@cheolsoo-mr1-0 pig-svn]$ java -jar avro-tools-1.5.4.jar getschema foo2/test_int.avro
> > "int"
> >
> > [cheolsoo@cheolsoo-mr1-0 pig-svn]$ java -jar avro-tools-1.5.4.jar tojson foo2/test_int.avro
> > 1
> >
> > in = LOAD 'foo2/test_int.avro' USING AvroStorage('int');
> > DUMP in;
> >
> > This gives me the following error:
> >
> > Caused by: java.io.IOException: avroSchemaToResourceSchema only processes records
> >
> > Can only Avro record types be loaded? Or am I doing something wrong?
> >
> >
> > 2) TestAvroStorage needs to be more automated. To run it, I had to run the following commands:
> >
> > ant clean compile-test
> > cd ./test/org/apache/pig/builtin/avro
> > python createests.py
> > cd -
> > ant clean test -Dtestcase=TestAvroStorage
> >
> > Ideally, I should be able to run a single command: ant clean test -Dtestcase=TestAvroStorage. Please let me know if you need help with this.
> >
> >
> > 3) python createests.py fails with the following errors. I suppose that some files are missing:
> >
> > creating data/avro/uncompressed/testDirectoryCounts.avro
> > Exception in thread "main" java.io.FileNotFoundException: data/json/testDirectoryCounts.json (No such file or directory)
> > ...
> > creating evenFileNameTestDirectoryCounts.avro
> > Exception in thread "main" java.io.FileNotFoundException: data/json/evenFileNameTestDirectoryCounts.json (No such file or directory)
> > ...
> >
> >
> > 4) ant test -Dtestcase=TestAvroStorage fails with the following errors. I suppose that this is due to the missing files:
> >
> > Testcase: testLoadDirectory took 0.005 sec
> >     FAILED
> > Testcase: testLoadGlob took 0.004 sec
> >     FAILED
> > Testcase: testPartialLoadGlob took 0.005 sec
> >     FAILED
> >
> >
> > 5) Typo in the name of createests.py. It should be createtests.py.
> >
> >
> > 6) Is createTests.bash needed at all? If not, can you remove it?
> >
> >
> > I have more comments inline:

Sounds like the python script isn't quite working correctly. I'll debug the script and make sure it generates all the required files.

Can I take you up on your offer to help automate that build process? I'm not exactly sure what to modify to automatically run the python script to create the test files.
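
In the meantime, one idea on my end (just a sketch, assuming the script ends up named createtests.py and stays under test/org/apache/pig/builtin/avro) is to have TestAvroStorage regenerate the data itself before the suite runs, something like:

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.InputStreamReader;
    import org.junit.BeforeClass;

    // Sketch only: regenerate the Avro test data before the suite runs so that
    // "ant clean test -Dtestcase=TestAvroStorage" works without any manual steps.
    @BeforeClass
    public static void createTestData() throws Exception {
        Process p = new ProcessBuilder("python", "createtests.py")
            .directory(new File("test/org/apache/pig/builtin/avro"))
            .redirectErrorStream(true)
            .start();
        // Drain the output so the child process can't block on a full pipe.
        BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()));
        while (r.readLine() != null) { /* discard */ }
        if (p.waitFor() != 0) {
            throw new IllegalStateException("createtests.py failed");
        }
    }

A dedicated Ant target that runs the script before the test target would work just as well, if that's the more idiomatic place for it.
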
> On Dec. 3, 2012, 7:22 p.m., Cheolsoo Park wrote:
> > src/org/apache/pig/builtin/AvroStorage.java, lines 296-305
> > <https://reviews.apache.org/r/8104/diff/1/?file=191564#file191564line296>
> >
> >     This won't work in the following case. Let's say p matches two dirs, and one dir is empty.
> >    
> >     p = foo*
> >    
> >     foo1
> >     foo2/bar.avro
> >    
> >     I would expect the schema of bar.avro to be returned, but I get an IOException instead.

Added a proper depth-first search to find the first file. (I decided to sort by modification date, most recent first.)
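
Roughly, the search works like this (a simplified sketch against the Hadoop FileSystem API, not the exact code from the patch, and the helper name is made up):

    import java.io.IOException;
    import java.util.Arrays;
    import java.util.Comparator;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch only: depth-first search for the first .avro file under p,
    // visiting the most recently modified entries first and skipping empty dirs.
    private static Path findFirstAvroFile(FileSystem fs, Path p) throws IOException {
        FileStatus[] children = fs.listStatus(p);
        Arrays.sort(children, new Comparator<FileStatus>() {
            public int compare(FileStatus a, FileStatus b) {
                long diff = b.getModificationTime() - a.getModificationTime();
                return diff < 0 ? -1 : (diff > 0 ? 1 : 0);
            }
        });
        for (FileStatus child : children) {
            if (child.isDir()) {
                Path found = findFirstAvroFile(fs, child.getPath());
                if (found != null) {
                    return found;       // empty directories simply fall through
                }
            } else if (child.getPath().getName().endsWith(".avro")) {
                return child.getPath();
            }
        }
        return null;                    // nothing under this directory
    }

With something along those lines, a glob like foo* that matches an empty foo1 and foo2/bar.avro returns the schema of bar.avro instead of throwing an IOException.
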
- Joseph
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/8104/#review13962
-----------------------------------------------------------
On Nov. 17, 2012, 5:28 a.m., Joseph Adler wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/8104/
> -----------------------------------------------------------
>
> (Updated Nov. 17, 2012, 5:28 a.m.)
>
>
> Review request for pig and Cheolsoo Park.
>
>
> Description
> -------
>
> The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.)