Getting started with Avro + Reading from an Avro formatted file
selvi k 2012-01-24, 15:31

Hello All,

I would like some suggestions on where I can start in the Avro project.

I want to be able to read from an Avro formatted log file (specifically the History Log file created at the end of a Hadoop job) and create a comma-separated file of certain log entries. I need a CSV file because this is the format accepted by the post-processing software I am working with (e.g. Matlab).

Initially I was using a Bash script to grep and awk from this file and create my CSV file, because I needed very few values from it and a quick script just worked. I didn't try to find out what format the log file was in and make use of that. (My bad!) Now that I need to scale up and want a reliable way to parse, I would like to do it the right way.

My question is this: for the above goal, could you please guide me with steps I can follow, such as reading material and libraries I could try to use? As I go through the Quick Start Guide and FAQ, I see that a lot of the information is geared to someone who wants to use the data serialization and RPC functionality provided by Avro. Given that I only want to be able to "read", where may I start?

I can comfortably script with Bash and Perl. Given that I only see support for Java, Python and Ruby, I think I can take this as an opportunity to learn Python and get up to speed.

Thanks a lot.

-Selvi
Re: Getting started with Avro + Reading from an Avro formatted file
Douglas Creager 2012-01-24, 15:54

> I can comfortably script with BASH and Perl. Given that I only see support
> for Java, Python and Ruby, I think I can take this as an opportunity to
> learn Python and get up to speed.

You could also take a look at the C bindings. We've recently added a couple of command-line tools for outputting the contents of an Avro file to stdout: avrocat and avropipe.

avrocat outputs each record in an Avro file on a single line, using the JSON encoding defined by the Avro spec [1]. avropipe produces a separate line for each "field" in each record; its output is (roughly speaking) what you'd get from piping the JSON encoding of each record through the jsonpipe [2] tool. (Technically speaking, it's what you get from putting all of the records into a JSON array, and sending that array through jsonpipe.)

[1] http://avro.apache.org/docs/current/spec.html#json_encoding
[2] https://github.com/dvxhouse/jsonpipe

So, with the example quickstop.db file, avrocat gives you:

$ avrocat examples/quickstop.db | head
{"ID": 1, "First": "Dante", "Last": "Hicks", "Phone": "(0)", "Age": 32}
{"ID": 2, "First": "Randal", "Last": "Graves", "Phone": "(555) 123-5678", "Age": 30}
{"ID": 3, "First": "Veronica", "Last": "Loughran", "Phone": "(555) 123-0987", "Age": 28}
{"ID": 4, "First": "Caitlin", "Last": "Bree", "Phone": "(555) 123-2323", "Age": 27}
{"ID": 5, "First": "Bob", "Last": "Silent", "Phone": "(555) 123-6422", "Age": 29}
{"ID": 6, "First": "Jay", "Last": "???", "Phone": "(0)", "Age": 26}
{"ID": 7, "First": "Dante", "Last": "Hicks", "Phone": "(1)", "Age": 32}
{"ID": 8, "First": "Randal", "Last": "Graves", "Phone": "(555) 123-5678", "Age": 30}
{"ID": 9, "First": "Veronica", "Last": "Loughran", "Phone": "(555) 123-0987", "Age": 28}
{"ID": 10, "First": "Caitlin", "Last": "Bree", "Phone": "(555) 123-2323", "Age": 27}

While avropipe gives you:

$ avropipe examples/quickstop.db | head -n 25
/ []
/0 {}
/0/ID 1
/0/First "Dante\u0000"
/0/Last "Hicks\u0000"
/0/Phone "(0)\u0000"
/0/Age 32
/1 {}
/1/ID 2
/1/First "Randal\u0000"
/1/Last "Graves\u0000"
/1/Phone "(555) 123-5678\u0000"
/1/Age 30
/2 {}
/2/ID 3
/2/First "Veronica\u0000"
/2/Last "Loughran\u0000"
/2/Phone "(555) 123-0987\u0000"
/2/Age 28
/3 {}
/3/ID 4
/3/First "Caitlin\u0000"
/3/Last "Bree\u0000"
/3/Phone "(555) 123-2323\u0000"
/3/Age 27

Although I'm seeing a bug there, since those NUL terminators shouldn't appear in the output. I'm going to open a ticket for that and fix it real quick.

But these tools might be exactly what you need, especially since the C bindings don't have any library dependencies to install.

cheers
–doug
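Since avrocat prints one JSON object per line, turning its output into CSV takes only a few lines of Python. What follows is a minimal sketch, not part of the Avro tools: the function name `json_lines_to_csv` and the choice of columns are assumptions made for illustration, and it assumes flat records like the quickstop.db ones shown above. Records that use unions or nested types come out of the Avro JSON encoding as nested objects and would need extra handling.

```python
import csv
import io
import json


def json_lines_to_csv(lines, fields, out):
    """Write one CSV row per JSON record, keeping only the named fields.

    `lines` is any iterable of strings holding one JSON object each,
    for example the lines printed by avrocat.
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(fields)  # header row with the chosen column names
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        record = json.loads(line)
        # A field missing from a record becomes an empty cell.
        writer.writerow([record.get(name, "") for name in fields])


if __name__ == "__main__":
    # Two records copied from the avrocat output above.
    sample = [
        '{"ID": 1, "First": "Dante", "Last": "Hicks", "Phone": "(0)", "Age": 32}',
        '{"ID": 2, "First": "Randal", "Last": "Graves", "Phone": "(555) 123-5678", "Age": 30}',
    ]
    buf = io.StringIO()
    json_lines_to_csv(sample, ["ID", "Last", "Age"], buf)
    print(buf.getvalue(), end="")
```

To use it in a pipeline, pass `sys.stdin` as `lines` and `sys.stdout` as `out`, then pipe avrocat into the script.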
Re: Getting started with Avro + Reading from an Avro formatted file
Harsh J 2012-01-24, 16:01

Selvi,

Expanding on Douglas' response: if you have installed Avro's Python libraries (the simplest way to get the latest stable release is "easy_install avro", or you can install from the distribution; post back if you need help with this), you can use the now-installed 'avro' executable:

$ ls
sample_input.avro

$ avro cat sample_input.avro --format csv
011990-99999,0,-619524000000
011990-99999,22,-619506000000
011990-99999,-11,-619484400000
012650-99999,111,-655531200000
012650-99999,78,-655509600000

Or write to a file, as you regularly would in a shell:

$ avro cat sample_input.avro --format csv > sample_input.csv

For more options on avro's cat and write commands:

$ avro --help

--
Harsh J
Customer Ops. Engineer, Cloudera
Re: Getting started with Avro + Reading from an Avro formatted file
selvi k 2012-01-24, 19:37

Douglas and Harsh - Thanks a lot for the immediate and detailed replies! Looks like both of these would work well for me.

I have tried a few things to get started with Avro, but this is where I am stuck:

1. I first downloaded the stable version in the form of "avro-1.6.1.tar.gz". (I am working all this out on an Ubuntu 10.04 machine.)

I don't find a readme file and am not familiar with installing a Python package, so I am not sure if what I am doing is correct. After some basic googling, I did:

avro-1.6.1$ ./setup.py build

This appears to complete successfully. Then when I do this:

...avro-1.6.1$ sudo ./setup.py install

I get an error message. (pasted at the end of this mail [1])

2. I tried the technique suggested by Harsh, but it ends with a similar error, pasted below in [2]:

/avro$ sudo easy_install avro

Then I tried to install snappy by itself:

/avro$ sudo easy_install python-snappy

I get the same error.

I also read that this might help with this type of error, so I tried:

avro$ sudo apt-get install python2.6-dev

I ensured I have gcc and installed g++ too (because I wasn't sure what was needed).

I did see a similar error message reported here for Avro and OS X: https://issues.apache.org/jira/browse/AVRO-981

Before installing g++ and python-dev, the error message I was seeing from easy_install python-snappy was different and shorter (attached below) [3].

Sorry if I should just be reading up on general Python development, packages or installs (and/or other things) before even attempting this. I'll be doing that now to move further. But if anyone has suggestions for the errors I am seeing, that would be great.

I did find this Quick Start Guide from the main Avro wiki page, but the Python example is once again focused on client/server and RPC communication between them:

https://github.com/phunt/avro-rpc-quickstart

Also, my understanding is that I must 'install' or deploy Avro before I can try out the C bindings suggested by Douglas. I am stating this since I am not exactly clear what this meant: "especially since the C bindings don't have any library dependencies to install". I am assuming it means I don't need anything *beyond* a basic install of Avro.

3. With regard to the two suggested ways, would either of these techniques allow me to filter my data records using some sort of condition on a field (or a few fields)? If not, it seems I would have to first grep the log file with the condition I want, and then use either of these two techniques to convert to a CSV file. This would still be much better than what I am doing now, which is through not-so-pretty awk invocations to retrieve the fields I need (after the initial grep). But if the existing API allows me to scan through the log file and specify conditions for fields, it might be much more efficient. I can imagine that I might have to use the low-level API and write a program to do this, but I am not sure at this point how to get started on this.

Any pointers would be really helpful!
Thank you,
Selvi

[1]

/avro-1.6.1$ sudo ./setup.py install
running install
Checking .pth file support in /usr/local/lib/python2.6/dist-packages/
/usr/bin/python -E -c pass
TEST PASSED: /usr/local/lib/python2.6/dist-packages/ appears to support .pth files
running bdist_egg
running egg_info
writing requirements to avro.egg-info/requires.txt
writing avro.egg-info/PKG-INFO
writing top-level names to avro.egg-info/top_level.txt
writing dependency_links to avro.egg-info/dependency_links.txt
reading manifest file 'avro.egg-info/SOURCES.txt'
writing manifest file 'avro.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
creating build/bdist.linux-x86_64
creating build/bdist.linux-x86_64/egg
creating build/bdist.linux-x86_64/egg/avro
copying build/lib.linux-x86_64-2.6/avro/io.py -> build/bdist.linux-x86_64/egg/avro
copying build/lib.linux-x86_64-2.6/avro/datafile.py -> build/bdist.linux-x86_64/egg/avro
copying build/lib.linux-x86_64-2.6/avro/tool.py -> build/bdist.linux-x86_64/egg/avro
copying build/lib.linux-x86_64-2.6/avro/txipc.py -> build/bdist.linux-x86_64/egg/avro
copying build/lib.linux-x86_64-2.6/avro/ipc.py -> build/bdist.linux-x86_64/egg/avro
copying build/lib.linux-x86_64-2.6/avro/protocol.py -> build/bdist.linux-x86_64/egg/avro
copying build/lib.linux-x86_64-2.6/avro/__init__.py -> build/bdist.linux-x86_64/egg/avro
copying build/lib.linux-x86_64-2.6/avro/schema.py -> build/bdist.linux-x86_64/egg/avro
byte-compiling build/bdist.linux-x86_64/egg/avro/io.py to io.pyc
byte-compiling build/bdist.linux-x86_64/egg/avro/datafile.py to datafile.pyc
byte-compiling build/bdist.linux-x86_64/egg/avro/tool.py to tool.pyc
byte-compiling build/bdist.linux-x86_64/egg/avro/txipc.py to txipc.pyc
byte-compiling build/bdist.linux-x86_64/egg/avro/ipc.py to ipc.pyc
byte-compiling build/bdist.linux-x86_64/egg/avro/protocol.py to protocol.pyc
byte-compiling build/bdist.linux-x86_64/egg/avro/__init__.py to __init__.pyc
byte-compiling build/bdist.linux-x86_64/egg/avro/schema.py to schema.pyc
creating build/bdist.linux-x86_64/egg/EGG-INFO
installing scripts to build/bdist.linux-x86_64/egg/EGG-INFO/scripts
running install_scripts
running build_scripts
creating build/bdist.linux-x86_64/egg/EGG-INFO/scripts
copying build/scripts-2.6/avro -> build/bdist.linux-x86_64/egg/EGG-INFO/scripts
changing mode of build/bdist.linux-x86_64/egg/EGG-INFO/scripts/avro to 755
copying avro.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
copying avro.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying avro.egg-info/dependency_links.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying avro.egg-info/requires.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying avro.egg-info/top_level.txt -> build/bdist.li
Re: Getting started with Avro + Reading from an Avro formatted file
selvi k 2012-01-24, 20:20

I found out what the issue was:

I first needed to install snappy, downloaded from here: http://code.google.com/p/snappy/

After a simple ./configure, make and make install, 'easy_install avro' completed successfully.

I will try out both the CSV conversion options and update this thread in a bit.

-Selvi
Re: Getting started with Avro + Reading from an Avro formatted file
Harsh J 2012-01-24, 20:44

If you want to try out the Python API for Avro datafiles, I had written a short blog post on reading/writing them at http://www.harshj.com/2010/04/25/writing-and-reading-avro-data-files-using-python/ which still holds good, I think.

Hope this helps.

--
Harsh J
Customer Ops. Engineer, Cloudera
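For readers who want a starting point before following the link, reading a data file with the Python API can be sketched roughly as below. This is a minimal sketch and not code from the blog post: it assumes the `avro` package is installed and uses its `DataFileReader` and `DatumReader` classes, while the helper names `read_avro_records` and `write_rows` are made up for illustration. The avro import sits inside the reader function so that the CSV part can be tried without the package.

```python
import csv
import io


def read_avro_records(path):
    """Yield each record of an Avro data file as a Python dict."""
    # Imported here so the CSV helper below works without avro installed.
    from avro.datafile import DataFileReader
    from avro.io import DatumReader

    reader = DataFileReader(open(path, "rb"), DatumReader())
    try:
        for record in reader:
            yield record
    finally:
        reader.close()  # also closes the underlying file


def write_rows(records, fields, out):
    """Write the chosen fields of each record as CSV, in the given order."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(fields)
    for record in records:
        # A field missing from a record becomes an empty cell.
        writer.writerow([record.get(name, "") for name in fields])


if __name__ == "__main__":
    # In-memory records stand in for read_avro_records("some_file.avro").
    demo = [{"name": "Person 8", "company": "Company 8"}]
    buf = io.StringIO()
    write_rows(demo, ["company", "name"], buf)
    print(buf.getvalue(), end="")
```

With a real file, `write_rows(read_avro_records(path), fields, sys.stdout)` would stream the records straight to CSV.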
Re: Getting started with Avro + Reading from an Avro formatted file
selvi k 2012-01-25, 02:46

I was able to set both up and use them. And they work like a charm! :)

- The advantage of the C version for me was that the output retained the field name for every field. Even though this makes it bulky, it will come in handy as I move my data through different processing steps, to eyeball it and look for patterns or issues.

- With the "avro cat" Python executable, in addition to the "--fields" flag, there is a great filtering option on the command line that allows all kinds of compound expressions. As an example, for someone else reading this thread:

$ avro cat test.db --format csv --filter="r['name']>'Person 45' and r['company']>'Company 7'"
Company 8,Person 8,"[u'http://myurl0.net', u'http://myurl1.net', u'http://myurl2.net']"
Company 9,Person 9,"[u'http://myurl0.net', u'http://myurl1.net', u'http://myurl2.net', u'http://myurl3.net']"

(The sample Avro file test.db was obtained easily by executing code from here: https://github.com/matteobertozzi/Hadoop/tree/master/avro-examples)

One drawback with the Python executable is that the fields in the CSV aren't in schema order. Given that my records will have at least a few tens of fields, this might mean I have to do some reordering. I do see a FIXME in the source code, from which I understand that the output is not in schema order, but I don't yet understand (from the code and the comments around it) what other ordering has actually been chosen.

I am definitely going to choose one of these, just not sure yet which one.

I appreciate the help very much! This has saved and will save me a lot of time.

Thanks,
-Selvi
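One way around the column-order drawback is to take the field order from the schema itself and write the CSV with those names as the column list. The sketch below only illustrates that idea and is not how "avro cat" works: it assumes the record schema is available as JSON text (how you obtain it is left open), that it is a plain record schema with a top-level "fields" list, and the function names are hypothetical.

```python
import csv
import io
import json


def schema_field_names(schema_json):
    """Return the field names of a record schema in declaration order."""
    schema = json.loads(schema_json)
    return [field["name"] for field in schema["fields"]]


def write_in_schema_order(records, schema_json, out):
    """Write dict records as CSV with columns in schema order."""
    names = schema_field_names(schema_json)
    writer = csv.DictWriter(
        out,
        fieldnames=names,
        restval="",             # missing fields become empty cells
        extrasaction="ignore",  # keys not in the schema are dropped
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)


if __name__ == "__main__":
    # A made-up schema resembling the test.db example above.
    schema_text = json.dumps({
        "type": "record",
        "name": "Person",
        "fields": [
            {"name": "name", "type": "string"},
            {"name": "company", "type": "string"},
        ],
    })
    buf = io.StringIO()
    write_in_schema_order(
        [{"company": "Company 8", "name": "Person 8"}], schema_text, buf
    )
    print(buf.getvalue(), end="")
```

Because the column list is fixed up front, the key order inside each record no longer matters.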
Re: Getting started with Avro + Reading from an Avro formatted file
Douglas Creager 2012-01-24, 21:00

> Also my understanding is that I must 'install' or deploy Avro before I can
> try out the C bindings suggested by Douglas. I am stating this since I am
> not exactly clear by what this meant: - "especially since the C bindings
> don't have any library dependencies to install". I am assuming it means, I
> don't need anything beyond a basic install of Avro.

Sorry about the misunderstanding. And you're right: you'd need to download, compile, and install the Avro C bindings, but they don't require any *additional* libraries to be installed. It's a pretty standard set of CMake build scripts, and there's an INSTALL file in the tarball containing more detailed instructions.

(I should also point out that in the next release, 1.6.2, the C bindings will support the zlib and lzma compression codecs. You'll need to have the zlib and xz/lzma libraries installed for those to work. If you don't have those libraries, you can still install the Avro C bindings; you just won't be able to read or write Avro data files that use the corresponding codecs.)

–doug
Re: Getting started with Avro + Reading from an Avro formatted file
selvi k 2012-01-25, 02:50

On Tue, Jan 24, 2012 at 4:00 PM, Douglas Creager <[EMAIL PROTECTED]> wrote:

> Sorry about the misunderstanding. And you're right: you'd need to
> download, compile, and install the Avro C bindings, but they don't require
> any *additional* libraries to be installed. It's a pretty standard set of
> CMake build scripts, and there's an INSTALL file in the tarball containing
> more detailed instructions.

Thank you, Doug, for pointing this out specifically. In things I have done in the past, APIs were part of the packages themselves, so I didn't even look outside the avro package. But after you mentioned it, I went back and saw that the main download page itself has a separate folder for the C bindings. After I installed cmake, the install was straightforward.

-Selvi

> (I should also point out that in the next release, 1.6.2, the C bindings
> will support the zlib and lzma compression codecs. You'll need to have the
> zlib and xz/lzma libraries installed for those to work. If you don't have
> those libraries, you can still install the Avro C bindings; you just won't
> be able to read or write Avro data files that use the corresponding codecs.)
>
> –doug
Re: Getting started with Avro + Reading from an Avro formatted file
Harsh J 2012-01-24, 21:06

Selvi,

(Forgot to reply to this before)

On Wed, Jan 25, 2012 at 1:07 AM, selvi k <[EMAIL PROTECTED]> wrote:
> 3. With regards to the two suggested ways, would either of these techniques
> allow me to filter my data records using some sort of a condition on a
> field? (or a few fields) If not it seems like I would have to resort to
> first grepping the log file with the condition I want, and then using either
> of these two techniques to convert to CSV file. This would still be much
> better than what I am doing now, which is through not-so-pretty awk
> invocations to retrieve the fields I need (after the initial grep). But if
> the existing API allows me to scan through the log file and specify
> conditions for fields, it might be much more efficient. I can imagine that I
> might have to use the low-level API and write a program to do this, but I am
> not sure at this point how to get started on this.

$ avro --help has some options that can help you out.

For "avro cat", the following may help:

--fields=FIELDS fields to show, comma separated (show all by default)

But no, the utility does not provide a way to filter anything out. It's a mere reader with some extensibility on fields/format. You'd have to do the filtering via your own full-fledged reader program, or via Bash using "avro cat" and grep/etc.

--
Harsh J
Customer Ops. Engineer, Cloudera
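Filtering in your own reader program can be quite small once the records are available as Python dicts. The following is a minimal sketch under stated assumptions: it reads records as one JSON object per line (the form avrocat prints), the helper names `parse_json_lines` and `filter_records` are made up for illustration, and the condition on "Age" is only an example.

```python
import io
import json


def parse_json_lines(lines):
    """Yield one dict per non-blank line of JSON text."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def filter_records(records, predicate):
    """Yield only the records for which predicate(record) is true."""
    return (record for record in records if predicate(record))


if __name__ == "__main__":
    # Stand-in for the output of a reader; the fields follow the
    # quickstop.db example earlier in the thread.
    sample = io.StringIO(
        '{"ID": 1, "First": "Dante", "Age": 32}\n'
        '{"ID": 4, "First": "Caitlin", "Age": 27}\n'
    )
    older = filter_records(parse_json_lines(sample), lambda r: r["Age"] >= 30)
    for record in older:
        print(record["ID"], record["First"])
```

The predicate is an ordinary function, so compound conditions on several fields can be combined with `and` / `or`.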
Re: Getting started with Avro + Reading from an Avro formatted file
selvi k 2012-01-25, 02:56

> $ avro --help has some options that can help you out.
>
> For "avro cat", the following may help:
>
> --fields=FIELDS fields to show, comma separated (show all by default)

Thanks a lot for this pointer, Harsh. This is how I chanced upon the filter flag.

I am going to take a look at the blog post next, to get to the programming side of things. Since the Hadoop system log files I am looking at seem to have multiple schemas for different log entries within the same file, I will have to explore more to understand them better. For now, it seems I will have to use a combination of readers, grep and avro cat/avrocat.

-Selvi
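When entries with different layouts share one file, a possible first step is to split the decoded records into groups that have the same set of field names, so that each group can go to its own CSV file. This is only a sketch of that idea: it assumes each log entry has already been decoded into a flat Python dict, and the function name and the sample field names are hypothetical, not taken from the Hadoop history format.

```python
import io
from collections import defaultdict


def group_by_shape(records):
    """Group dict records by the sorted tuple of their field names."""
    groups = defaultdict(list)
    for record in records:
        groups[tuple(sorted(record))].append(record)
    return dict(groups)


if __name__ == "__main__":
    # Made-up entries with two different layouts.
    entries = [
        {"job": "j1", "submitted": 100},
        {"task": "t1", "job": "j1", "finished": 250},
        {"job": "j2", "submitted": 300},
    ]
    out = io.StringIO()
    for shape, members in group_by_shape(entries).items():
        out.write("%s: %d record(s)\n" % (",".join(shape), len(members)))
    print(out.getvalue(), end="")
```

Each group has a uniform set of columns, so it can be written out with a single CSV header.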