Avro >> mail # dev >> 3x faster python reader


Uri Laserson 2013-04-29, 05:24
Doug Cutting 2013-04-29, 16:22
Philip Zeyliger 2013-04-29, 17:56
Uri Laserson 2013-04-29, 18:55
Miki Tebeka 2013-04-29, 21:32
Russell Jurney 2013-04-30, 06:10

Re: 3x faster python reader
Hi Miki,

Yes, I followed your model in remaking the Avro reader, but I performed the
schema resolution so that you could still specify separate writer/reader
schemas.  Your code is still 2.5x faster than mine when using the C
extensions.
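
For readers of the archive, here is a minimal sketch of what precomputing schema resolution for separate writer/reader schemas can look like. It is not the code from either implementation discussed in this thread. The names build_plan, read_record, and _PLAN_CACHE are hypothetical, and the sketch assumes flat records with only "long" and "string" fields (no unions, nested types, or type promotion).

```python
import io
import json

def read_long(buf):
    """Decode one zigzag varint, the Avro binary encoding for int/long."""
    shift = 0
    result = 0
    while True:
        byte = buf.read(1)[0]
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)

def read_string(buf):
    """Decode a string: a long byte length followed by UTF-8 bytes."""
    length = read_long(buf)
    return buf.read(length).decode("utf-8")

PRIMITIVE_READERS = {"long": read_long, "string": read_string}

# Hypothetical cache: one resolved plan per (writer, reader) schema pair.
_PLAN_CACHE = {}

def build_plan(writer_schema, reader_schema):
    """Resolve the two schemas once, so per-record reads do no schema work."""
    key = (json.dumps(writer_schema, sort_keys=True),
           json.dumps(reader_schema, sort_keys=True))
    plan = _PLAN_CACHE.get(key)
    if plan is None:
        reader_fields = {f["name"]: f for f in reader_schema["fields"]}
        writer_names = {f["name"] for f in writer_schema["fields"]}
        # Every writer field must be decoded in order; fields the reader
        # does not declare are decoded and then dropped.
        steps = [(f["name"], PRIMITIVE_READERS[f["type"]], f["name"] in reader_fields)
                 for f in writer_schema["fields"]]
        # Reader-only fields take their default (KeyError if none is given).
        defaults = {name: f["default"] for name, f in reader_fields.items()
                    if name not in writer_names}
        plan = _PLAN_CACHE[key] = {"steps": steps, "defaults": defaults}
    return plan

def read_record(buf, plan):
    """Plain function, no reader object: walk the precomputed steps."""
    record = dict(plan["defaults"])
    for name, read, keep in plan["steps"]:
        value = read(buf)
        if keep:
            record[name] = value
    return record

writer = {"type": "record", "name": "User", "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"}]}
reader = {"type": "record", "name": "User", "fields": [
    {"name": "name", "type": "string"},
    {"name": "score", "type": "long", "default": 0}]}

plan = build_plan(writer, reader)
# id=1 encodes as 0x02; "ab" encodes as length 2 (0x04) plus the bytes.
record = read_record(io.BytesIO(b"\x02\x04ab"), plan)
```

In this sketch the writer's "id" field is decoded and discarded because the reader schema does not declare it, and "score" is filled from the reader's default. Calling build_plan again with the same schema pair returns the cached plan.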

I personally find the current API somewhat confusing, so I'd be into
changing it.

Uri
On Mon, Apr 29, 2013 at 2:32 PM, Miki Tebeka <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I did the same for fastavro <https://bitbucket.org/tebeka/fastavro>. I
> found changing the current code while keeping the same API very hard.
>
> Another option is to leave the current code as version 1 and add the
> new code either as a new module under avro or as avro2.
>
> All the best,
> --
> Miki
>
>
> On Sun, Apr 28, 2013 at 10:24 PM, Uri Laserson <[EMAIL PROTECTED]> wrote:
>
> > Hi all,
> >
> > I rewrote some of the python code to read avro files.  I was able to
> > achieve a ~3x speedup over the current impl, and can probably do better
> > if it was cleaned up more.  The main changes are:
> > * Eliminated the object-oriented nature of the reader.  It's just
> > functions now.  Presumably this can be changed back, but it didn't
> > really seem like there was any reason for it.
> > * Given a reader and writer schema, it precomputes as much helpful info
> > as it can upfront and caches this in a dictionary that the read
> > functions use
> > * The code is compiled with Cython for speedup.
> >
> > How can this be used to improve the current python api?  Let me know how
> > I can be helpful...
> >
> > Uri
> >
> > --
> > Uri Laserson, PhD
> > Data Scientist, Cloudera
> > Twitter/GitHub: @laserson
> > +1 617 910 0447
> > [EMAIL PROTECTED]
> >
>

--
Uri Laserson, PhD
Data Scientist, Cloudera
Twitter/GitHub: @laserson
+1 617 910 0447
[EMAIL PROTECTED]