Re: data missing in writing an AVRO file.
Hi,

This is most likely related to AVRO-1364 (https://issues.apache.org/jira/browse/AVRO-1364). That issue is fixed in Avro 1.7.6, so first try updating your Avro C library.

-Mika

On Jan 27, 2014, at 6:35 PM, Amrith Kumar <[EMAIL PROTECTED]> wrote:

> Here is some additional debugging information …
>  
> I created a simple CSV file that looks like this:
>  
> ubuntu@petest1:/mnt/avrotest$ head maketest.csv
> "data1", "data2",
> 0, 1804289383,
> 1, 846930886,
> 2, 1681692777,
> 3, 1714636915,
> 4, 1957747793,
> 5, 424238335,
> 6, 719885386,
> 7, 1649760492,
> 8, 596516649,
> ubuntu@petest1:/mnt/avrotest$ tail maketest.csv
> 499990, 1910331393,
> 499991, 1091319779,
> 499992, 805782879,
> 499993, 1636478990,
> 499994, 1827956658,
> 499995, 1695362021,
> 499996, 1235853180,
> 499997, 208721086,
> 499998, 1836333752,
> 499999, 699496062,
>  
> Nothing fancy, just 500,000 rows of data with the row number in the first column and some random integer in the second.
>  
> Here is the avro conversion.
>  
> ubuntu@petest1:/mnt/avrotest$ csvtoavro -i maketest.csv -o maketest.avro
> 2014-01-27 11:28:40  csvtoavro: Processed maketest.csv with 500001 rows of data
>  
> Since the header row is also counted, it reports 500,001.
>  
> Now, here is the output from avrocat
>  
> ubuntu@petest1:/mnt/avrotest$ avrocat ./maketest.avro | head -n 10
> {"data1": "0", "data2": " 1804289383"}
> {"data1": "1", "data2": " 846930886"}
> {"data1": "2", "data2": " 1681692777"}
> {"data1": "3", "data2": " 1714636915"}
> {"data1": "4", "data2": " 1957747793"}
> {"data1": "5", "data2": " 424238335"}
> {"data1": "6", "data2": " 719885386"}
> {"data1": "7", "data2": " 1649760492"}
> {"data1": "8", "data2": " 596516649"}
> {"data1": "9", "data2": " 1189641421"}
> ubuntu@petest1:/mnt/avrotest$ avrocat ./maketest.avro | tail -n 10
> {"data1": "499944", "data2": " 929606694"}
> {"data1": "499945", "data2": " 973636875"}
> {"data1": "499946", "data2": " 1942285618"}
> {"data1": "499947", "data2": " 2089133167"}
> {"data1": "499948", "data2": " 213614747"}
> {"data1": "499949", "data2": " 599060422"}
> {"data1": "499950", "data2": " 1885053377"}
> {"data1": "499951", "data2": " 2100042242"}
> {"data1": "499952", "data2": " 1491280709"}
> {"data1": "499953", "data2": " 1103081139"}
> ubuntu@petest1:/mnt/avrotest$./maketest.avro
> ./maketest.avro 499954
>  
> For completeness, here is some data from the CSV file showing the values around the point where the AVRO file appears to end.
>  
> 499940, 1054581755,
> 499941, 600032353,
> 499942, 1997078786,
> 499943, 1508121989,
> 499944, 929606694,
> 499945, 973636875,
> 499946, 1942285618,
> 499947, 2089133167,
> 499948, 213614747,
> 499949, 599060422,
> 499950, 1885053377,
> 499951, 2100042242,
> 499952, 1491280709,
> 499953, 1103081139,
> 499954, 521709408,
> 499955, 494574550,
> 499956, 756884387,
> 499957, 2035729858,
> 499958, 1560742697,
> 499959, 923330093,
>  
> In other words, the last 46 rows of data appear to be missing.
>  
> -amrith
>  
> From: Amrith Kumar [mailto:[EMAIL PROTECTED]]
> Sent: Monday, January 27, 2014 11:23 AM
> To: [EMAIL PROTECTED]
> Subject: data missing in writing an AVRO file.
>  
> Greetings,
>  
> I’m attempting to convert some very large CSV files into AVRO format. To this end, I wrote a csvtoavro converter using C API v1.7.5.
>  
> The essence of the program is this:
>  
> // initialize line counter
> lineno = 0;
>  
> // make a schema first
> avro_schema_from_json_length (...);
>  
> // make a generic class from schema
> iface = avro_generic_class_from_schema( schema );
>  
> // get the record size and verify that it is 109
> avro_schema_record_size (schema);
>  
> // get a generic value
> avro_generic_value_new (iface, &tuple);
>  
> // make me an output file
> fp = fopen ( outputfile, "wb" );
>  
> // make me a filewriter
> avro_file_writer_create_fp (fp, outputfile, 0, schema, &db);
>  
> // now for the code to emit the data
>  
> while (...)
> {
>     avro_value_reset (&tuple);
>  
>     // get the CSV record into the tuple