Re: How to load data in Drill
Hi Madhu,

For use in Pig, just add the line below to your script before you call
STORE.

set parquet.enable.dictionary false

This will force all of the values in your dataset to be encoded with
standard encoding, rather than the dictionary encoding. I will keep you
updated on our continued progress on full parquet support.
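
For context, a minimal sketch of where that line sits in a Pig script. The
paths, schema, and jar name are placeholders, and the storer class was
parquet.pig.ParquetStorer in pre-Apache parquet-mr builds
(org.apache.parquet.pig.ParquetStorer in later releases):

REGISTER parquet-pig-bundle.jar;                 -- placeholder jar name

-- disable dictionary encoding for the parquet files this script writes
set parquet.enable.dictionary false;

data = LOAD '/path/to/input.csv' USING PigStorage(',')
       AS (id:int, name:chararray);

-- the STORE that actually writes parquet; the set line must come before it
STORE data INTO '/path/to/output_parquet' USING parquet.pig.ParquetStorer();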

Thanks,
Jason Altekruse

On Tue, Jan 14, 2014 at 8:08 PM, Madhu Borkar <[EMAIL PROTECTED]> wrote:

> Please explain: what do I do when I create the parquet file?
>
>
> On Tue, Jan 14, 2014 at 2:42 PM, Jason Altekruse <[EMAIL PROTECTED]> wrote:
>
> > Hi Madhu,
> >
> > I'm not sure I completely understand your question. You can still use
> > varchar columns in parquet; you can even use compression codecs like
> > snappy and gzip. It is just the dictionary encoding for varchar columns
> > that we unfortunately have not implemented.
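
As a side note on the compression point, a minimal sketch of choosing a
codec when storing parquet from Pig; the parquet.compression property comes
from parquet-mr's output format and is assumed to be honored here (the
property name and accepted values may vary by version):

-- assumed property; values such as gzip and snappy
set parquet.compression gzip;

STORE data INTO '/path/to/output_parquet' USING parquet.pig.ParquetStorer();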
> >
> > -Jason Altekruse
> >
> >
> > On Tue, Jan 14, 2014 at 2:10 PM, Madhu Borkar <[EMAIL PROTECTED]>
> > wrote:
> >
> > > Hi Jason,
> > > So I should not use dictionary keys with varchar? Is that right?
> > >
> > > Thanks for response.
> > >
> > >
> > > On Sun, Jan 12, 2014 at 9:42 PM, Jason Altekruse <[EMAIL PROTECTED]> wrote:
> > >
> > > > Hello Madhu,
> > > >
> > > > I'm very sorry it's taken so long to get a response to you on this.
> > > > I have been attending school and not working on Drill full time for
> > > > the past couple of months.
> > > >
> > > > I did run Drill with your file in a debugger and confirmed my
> > > > suspicion of an encoding problem. Parquet supports a very
> > > > space-efficient dictionary encoding for varchar columns that are
> > > > easily described as a reasonably small list of values (into the
> > > > thousands or tens of thousands). This allows all of the values to be
> > > > stored once, and the actual values within the records to just index
> > > > into the dictionary with integers. When we were writing the parquet
> > > > implementation we realized that turning these integer keys into their
> > > > string values might not always be optimal. If the values are going to
> > > > be filtered, we can always filter the dictionary and then prune out
> > > > the integer keys that are no longer needed, rather than running
> > > > filter rules repeatedly on duplicated values as they appear
> > > > throughout the dataset. Similar optimizations can be done for sort
> > > > and a few other operations.
> > > >
> > > > For this reason, we did not bother writing code for handling the
> > > > materialization of dictionary-encoded values at read time, as this
> > > > code would just be a duplication of the join code we will need
> > > > elsewhere in the project. We tabled its implementation for when the
> > > > optimizer can handle more sophisticated rules to decide when it is
> > > > best to match the keys with their values, which is something we are
> > > > working on in the coming weeks.
> > > >
> > > > Unfortunately, for now you will have to avoid using dictionary
> > > > encoding for strings in your parquet files if you want to read them
> > > > with Drill; we hope to have this implemented soon. In the meantime I
> > > > will submit a pull request to have the reader report an error with a
> > > > descriptive message about the real problem, rather than just having
> > > > it run into the NPE.
> > > >
> > > > Thank you for your help testing Drill!
> > > > -Jason
> > > >
> > > >
> > > > On Wed, Dec 4, 2013 at 11:14 AM, Jinfeng Ni <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > Hi Tom,
> > > > >
> > > > > I can recreate the NPE using Madhu's file. I have asked Jason, who
> > > > > is the main developer of Drill's parquet reader, to take a look.
> > > > >
> > > > > Thanks!
> > > > >
> > > > > Jinfeng
> > > > >
> > > > >
> > > > > On Wed, Dec 4, 2013 at 7:02 AM, Tom Seddon <[EMAIL PROTECTED]> wrote: