Avro, mail # user - Getting started with Avro + Reading from an Avro formatted file


Re: Getting started with Avro + Reading from an Avro formatted file
Harsh J 2012-01-24, 20:44
If you want to try out the Python API for Avro data files, I wrote a
short blog post on reading and writing them at
http://www.harshj.com/2010/04/25/writing-and-reading-avro-data-files-using-python/
which I think still holds. Hope this helps.
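
On the filtering question quoted below (point 3), here is a minimal
sketch of how the Python API could be combined with the standard csv
module. It assumes the avro package is installed and that each record
comes back as a dict; the helper names (iter_avro_records,
write_filtered_csv) are made up for illustration and are not part of
Avro.

```python
import csv


def iter_avro_records(path):
    """Yield each record in an Avro data file as a Python dict."""
    # Imported lazily so the CSV helper below works without avro installed.
    from avro.datafile import DataFileReader
    from avro.io import DatumReader

    reader = DataFileReader(open(path, "rb"), DatumReader())
    try:
        for record in reader:
            yield record
    finally:
        # Closing the DataFileReader also closes the underlying file.
        reader.close()


def write_filtered_csv(records, fields, predicate, out):
    """Write the chosen fields of every record matching predicate as CSV.

    records is any iterable of dicts, out is a writable text file object.
    Returns the number of data rows written (the header is not counted).
    """
    writer = csv.writer(out)
    writer.writerow(fields)  # header row
    count = 0
    for record in records:
        if predicate(record):
            # Missing fields become empty cells rather than raising.
            writer.writerow([record.get(name) for name in fields])
            count += 1
    return count
```

With a hypothetical file events.avro whose records have timestamp and
status fields, something like write_filtered_csv(iter_avro_records("events.avro"),
["timestamp", "status"], lambda r: r.get("status") == 500, out) would
replace the grep-then-awk step with a single pass over the file.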

On Wed, Jan 25, 2012 at 1:50 AM, selvi k <[EMAIL PROTECTED]> wrote:
> I found out what the issue was:
> I first needed to install snappy downloaded from here:
> http://code.google.com/p/snappy/
>
> After a simple ./configure, make and make install, 'easy_install avro'
> completed successfully.
>
> I will try out both the CSV conversion options and update this thread in a
> bit.
>
> -Selvi
>
>
>
> On Tue, Jan 24, 2012 at 2:37 PM, selvi k <[EMAIL PROTECTED]> wrote:
>>
>> Douglas and Harsh - Thanks a lot for the immediate and detailed replies!
>> Looks like both of these would work well for me.
>>
>>
>> In order to start trying these, I have tried a few things to get started
>> with Avro, but this is where I am stuck:
>>
>>
>> 1. I first downloaded the stable version in the form of
>> "avro-1.6.1.tar.gz". (I am working all this out on an Ubuntu 10.04 machine).
>>
>> I don't find a readme file and am not familiar with installing a Python
>> package, so I am not sure if what I am doing is correct. After some basic
>> googling, I did:
>>
>> avro-1.6.1$ ./setup.py build
>>
>> This appears to complete successfully. Then when I do this:
>>
>> ...avro-1.6.1$ sudo ./setup.py install
>>
>> I get an error message. (pasted at the end of this mail [1])
>>
>>
>> 2. I tried the technique suggested by Harsh, but it ends with a similar
>> error, pasted below in [2]
>>
>> /avro$ sudo easy_install avro
>>
>> Then I tried to install snappy by itself:
>>
>> /avro$ sudo easy_install python-snappy
>>
>> I get the same error.
>>
>> Also I read that this might help with this type of error, so I tried:
>>
>> avro$ sudo apt-get install python2.6-dev
>>
>> I ensured I have gcc and installed g++ too (because I wasn't sure what was
>> needed).
>>
>> I did see a similar error message reported here for Avro and OS X:
>> https://issues.apache.org/jira/browse/AVRO-981
>>
>> Before installing g++ and python-dev, the error message I was seeing from
>> easy_install python-snappy was different and shorter (attached below) [3].
>>
>>
>>
>>
>> Sorry if I should just be reading up on general Python development or
>> packages or installs (and/or other things), before I should even be
>> attempting to do this.  I'll be doing that now to move further.  But in case
>> anyone might have suggestions for the errors I am seeing, that would be
>> great.
>>
>>
>> I did find this Quick Start Guide from the main Avro wiki page, but when I
>> look through the Python example it is once again focused on client/server and
>> RPC communication between them:
>>
>> https://github.com/phunt/avro-rpc-quickstart
>>
>>
>> Also my understanding is that I must 'install' or deploy Avro before I can
>> try out the C bindings suggested by Douglas. I am stating this since I am
>> not exactly clear on what this meant: "especially since the C bindings
>> don't have any library dependencies to install". I am assuming it means I
>> don't need anything beyond a basic install of Avro.
>>
>>
>>
>> 3. With regard to the two suggested ways, would either of these
>> techniques allow me to filter my data records using some sort of condition
>> on a field (or a few fields)? If not, it seems like I would have to resort to
>> first grepping the log file with the condition I want, and then using either
>> of these two techniques to convert to a CSV file. This would still be much
>> better than what I am doing now, which is through not-so-pretty awk
>> invocations to retrieve the fields I need (after the initial grep). But if
>> the existing API allows me to scan through the log file and specify
>> conditions for fields, it might be much more efficient. I can imagine that I
>> might have to use the low-level API and write a program to do this, but I am

Harsh J
Customer Ops. Engineer, Cloudera