Avro >> mail # user >> data missing in writing an AVRO file.


Re: data missing in writing an AVRO file.
Hi,

This is most likely related to AVRO-1364 (https://issues.apache.org/jira/browse/AVRO-1364). It is fixed in Avro 1.7.6, so first try updating your Avro C library.

-Mika

On Jan 27, 2014, at 6:35 PM, Amrith Kumar <[EMAIL PROTECTED]> wrote:

> Here is some additional debugging information …
>  
> I created this simple CSV file, which looks like this.
>  
> ubuntu@petest1:/mnt/avrotest$ head maketest.csv
> "data1", "data2",
> 0, 1804289383,
> 1, 846930886,
> 2, 1681692777,
> 3, 1714636915,
> 4, 1957747793,
> 5, 424238335,
> 6, 719885386,
> 7, 1649760492,
> 8, 596516649,
> ubuntu@petest1:/mnt/avrotest$ tail maketest.csv
> 499990, 1910331393,
> 499991, 1091319779,
> 499992, 805782879,
> 499993, 1636478990,
> 499994, 1827956658,
> 499995, 1695362021,
> 499996, 1235853180,
> 499997, 208721086,
> 499998, 1836333752,
> 499999, 699496062,
>  
> Nothing fancy, just 500,000 rows of data with the row number in the first column and some random integer in the second.
>  
> Here is the avro conversion.
>  
> ubuntu@petest1:/mnt/avrotest$ csvtoavro -i maketest.csv -o maketest.avro
> 2014-01-27 11:28:40  csvtoavro: Processed maketest.csv with 500001 rows of data
>  
> Since the header row is also counted, it reports 500,001.
>  
> Now, here is the output from avrocat
>  
> ubuntu@petest1:/mnt/avrotest$ avrocat ./maketest.avro | head -n 10
> {"data1": "0", "data2": " 1804289383"}
> {"data1": "1", "data2": " 846930886"}
> {"data1": "2", "data2": " 1681692777"}
> {"data1": "3", "data2": " 1714636915"}
> {"data1": "4", "data2": " 1957747793"}
> {"data1": "5", "data2": " 424238335"}
> {"data1": "6", "data2": " 719885386"}
> {"data1": "7", "data2": " 1649760492"}
> {"data1": "8", "data2": " 596516649"}
> {"data1": "9", "data2": " 1189641421"}
> ubuntu@petest1:/mnt/avrotest$ avrocat ./maketest.avro | tail -n 10
> {"data1": "499944", "data2": " 929606694"}
> {"data1": "499945", "data2": " 973636875"}
> {"data1": "499946", "data2": " 1942285618"}
> {"data1": "499947", "data2": " 2089133167"}
> {"data1": "499948", "data2": " 213614747"}
> {"data1": "499949", "data2": " 599060422"}
> {"data1": "499950", "data2": " 1885053377"}
> {"data1": "499951", "data2": " 2100042242"}
> {"data1": "499952", "data2": " 1491280709"}
> {"data1": "499953", "data2": " 1103081139"}
> ubuntu@petest1:/mnt/avrotest$./maketest.avro
> ./maketest.avro 499954
>  
> For completeness, here is some data from the CSV file showing values around where the AVRO file appears to end.
>  
> 499940, 1054581755,
> 499941, 600032353,
> 499942, 1997078786,
> 499943, 1508121989,
> 499944, 929606694,
> 499945, 973636875,
> 499946, 1942285618,
> 499947, 2089133167,
> 499948, 213614747,
> 499949, 599060422,
> 499950, 1885053377,
> 499951, 2100042242,
> 499952, 1491280709,
> 499953, 1103081139,
> 499954, 521709408,
> 499955, 494574550,
> 499956, 756884387,
> 499957, 2035729858,
> 499958, 1560742697,
> 499959, 923330093,
>  
> In other words, the last 46 rows of data appear to be missing.
>  
> -amrith
>  
> From: Amrith Kumar [mailto:[EMAIL PROTECTED]]
> Sent: Monday, January 27, 2014 11:23 AM
> To: [EMAIL PROTECTED]
> Subject: data missing in writing an AVRO file.
>  
> Greetings,
>  
> I’m attempting to convert some very large CSV files into AVRO format. To this end, I wrote a csvtoavro converter using C API v1.7.5.
>  
> The essence of the program is this:
>  
> // initialize line counter
> lineno = 0;
>  
> // make a schema first
> avro_schema_from_json_length (...);
>  
> // make a generic class from schema
> iface = avro_generic_class_from_schema( schema );
>  
> // get the record size and verify that it is 109
> avro_schema_record_size (schema);
>  
> // get a generic value
> avro_generic_value_new (iface, &tuple);
>  
> // make me an output file
> fp = fopen ( outputfile, "wb" );
>  
> // make me a filewriter
> avro_file_writer_create_fp (fp, outputfile, 0, schema, &db);
>  
> // now for the code to emit the data
>  
> while (...)
> {
>     avro_value_reset (&tuple);
>  
>     // get the CSV record into the tuple
 
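The row accounting in the quoted message (500001 CSV lines including the header, 499954 records read back, 46 missing) can be checked mechanically. What follows is only a minimal sketch using the C standard library. It does not call the Avro C API, and the helper names count_lines and missing_records are hypothetical, introduced here for illustration.

```c
#include <stdio.h>

/* Hypothetical helper: count the lines in an open text stream.
 * A final line that lacks a trailing '\n' is still counted. */
long count_lines(FILE *fp)
{
    long lines = 0;
    int c;
    int last = '\n';

    while ((c = fgetc(fp)) != EOF) {
        if (c == '\n')
            lines++;
        last = c;
    }
    if (last != '\n')
        lines++;
    return lines;
}

/* Hypothetical helper: number of records missing from the Avro file,
 * given the CSV line count (header included) and the number of
 * records read back from the Avro file. */
long missing_records(long csv_lines_with_header, long avro_records)
{
    long expected = csv_lines_with_header - 1; /* drop the header row */
    return expected - avro_records;
}
```

With the figures from the quoted message, missing_records(500001, 499954) evaluates to 46. Because the first column is a zero-based row number, the first missing row is 499954, which agrees with the last record shown by avrocat (499953).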