Avro >> mail # dev >> 3x faster python reader


Re: 3x faster python reader
It's probably too messy to go into a patch at this point.  I just put the
code up on a fork:

https://github.com/laserson/avro/tree/perf

Phil, perhaps we could sit down at some point and go through it briefly?

On Mon, Apr 29, 2013 at 10:56 AM, Philip Zeyliger <[EMAIL PROTECTED]> wrote:

> Hi Uri,
>
> Once you post to the JIRA, I'd be happy to review it.
>
> -- Philip
>
>
> On Mon, Apr 29, 2013 at 9:22 AM, Doug Cutting <[EMAIL PROTECTED]> wrote:
>
> > Uri,
> >
> > This sounds awesome!  Is the API compatible with the existing API?  If
> > it's incompatible and cannot easily be made compatible then perhaps we
> > can add it as the 'new' API and deprecate the old one.  Regardless,
> > please file an issue in Jira (issues.apache.org/jira/browse/AVRO) and
> > attach your patch there.
> >
> > Thanks,
> >
> > Doug
> >
> > On Sun, Apr 28, 2013 at 10:24 PM, Uri Laserson <[EMAIL PROTECTED]>
> > wrote:
> > > Hi all,
> > >
> > > I rewrote some of the python code to read avro files.  I was able to
> > > achieve a ~3x speedup over the current impl, and can probably do better
> > > if it was cleaned up more.  The main changes are:
> > > * Eliminated the object-oriented nature of the reader.  It's just
> > > functions now.  Presumably this can be changed back, but it didn't
> > > really seem like there was any reason for it.
> > > * Given a reader and writer schema, it precomputes as much helpful info
> > > as it can upfront and caches this in a dictionary that the read
> > > functions use
> > > * The code is compiled with Cython for speedup.
> > >
> > > How can this be used to improve the current python api?  Let me know
> > > how I can be helpful...
> > >
> > > Uri
> > >
> > > --
> > > Uri Laserson, PhD
> > > Data Scientist, Cloudera
> > > Twitter/GitHub: @laserson
> > > +1 617 910 0447
> > > [EMAIL PROTECTED]
> >
>

--
Uri Laserson, PhD
Data Scientist, Cloudera
Twitter/GitHub: @laserson
+1 617 910 0447
[EMAIL PROTECTED]
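
For readers following the thread, here is a rough sketch of the "precompute
from the schema and cache it in a dictionary" idea from the original message.
It is not the code in the fork linked above. The names make_reader, _build and
_PLAN_CACHE are made up for illustration, it handles only a single schema (no
reader/writer schema resolution), and it covers just the long, int, string and
record types. Plain functions are used instead of reader objects, in the
spirit of the first bullet.

```python
import io
import json


def read_long(buf):
    """Decode an Avro int/long: variable-length, zigzag-encoded."""
    b = buf.read(1)[0]
    n = b & 0x7F
    shift = 7
    while b & 0x80:  # high bit set means another byte follows
        b = buf.read(1)[0]
        n |= (b & 0x7F) << shift
        shift += 7
    return (n >> 1) ^ -(n & 1)  # undo zigzag encoding


def read_string(buf):
    """Decode an Avro string: a long byte count, then UTF-8 bytes."""
    return buf.read(read_long(buf)).decode("utf-8")


PRIMITIVE_READERS = {"int": read_long, "long": read_long, "string": read_string}

# Hypothetical cache: canonical schema JSON -> prepared read function.
_PLAN_CACHE = {}


def make_reader(schema):
    """Resolve a schema to a plain function once and cache the result."""
    key = json.dumps(schema, sort_keys=True)
    if key not in _PLAN_CACHE:
        _PLAN_CACHE[key] = _build(schema)
    return _PLAN_CACHE[key]


def _build(schema):
    if isinstance(schema, str):
        return PRIMITIVE_READERS[schema]
    if schema["type"] == "record":
        # Field names and field readers are looked up here, once,
        # instead of being re-derived for every record that is read.
        fields = [(f["name"], make_reader(f["type"])) for f in schema["fields"]]

        def read_record(buf):
            return {name: read(buf) for name, read in fields}

        return read_record
    raise NotImplementedError(schema["type"])
```

To try it, build a reader with make_reader() from a record schema given as a
Python dict, then call the returned function on an io.BytesIO holding the
encoded record. Calling make_reader() again with the same schema returns the
cached function, so the per-schema work is not repeated for each record.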