-
ColumnarSerDe and LazyBinaryColumnarSerDe
Yin Huai 2012-03-06, 16:58
Hi,
Is LazyBinaryColumnarSerDe more space efficient than ColumnarSerDe in general?
Let me make my question more specific.
I generated two tables from the table lineitem of TPC-H using ColumnarSerDe and LazyBinaryColumnarSerDe as follows...

CREATE TABLE lineitem_rcfile_lazybinary
ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe"
STORED AS RCFile AS
SELECT * from lineitem;

CREATE TABLE lineitem_rcfile_lazy
ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe"
STORED AS RCFile AS
SELECT * from lineitem;
Since the serialization of LazyBinaryColumnarSerDe is binary-based and that of ColumnarSerDe is text-based, I expected lineitem_rcfile_lazybinary to be smaller than lineitem_rcfile_lazy. However, whether or not compression is enabled, lineitem_rcfile_lazybinary is a little bit larger than lineitem_rcfile_lazy. Did I use LazyBinaryColumnarSerDe in the wrong way?
btw, the row group size of RCFile is 32MB.
Thanks,
Yin
-
Re: ColumnarSerDe and LazyBinaryColumnarSerDe
yongqiang he 2012-03-06, 19:42
I guess LazyBinaryColumnarSerDe does not save space, but it is CPU efficient. Your tests align with internal tests we ran a long time ago.
-
Re: ColumnarSerDe and LazyBinaryColumnarSerDe
Yin Huai 2012-03-07, 18:35
Thanks.
I forgot to consider the DOUBLE data type in the table. In the case of lineitem, ColumnarSerDe can store a double in fewer bytes than LazyBinaryColumnarSerDe, which always uses 8 bytes.
Yin
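The size difference described above can be illustrated with a rough sketch. It compares the length of a short decimal text form with a fixed 8-byte binary double. The sample values are hypothetical, chosen only to resemble TPC-H lineitem numeric columns, and the text form is Python's own `repr`, not Hive's actual serialization. Per-field overhead in RCFile and the effect of compression are ignored.

```python
import struct

# Hypothetical sample values in the style of TPC-H lineitem numeric columns
# (quantity, discount, tax, extended price); they are not taken from the thread.
samples = [17.0, 0.04, 0.02, 21168.23]

def text_size(value: float) -> int:
    """Bytes needed for Python's shortest decimal text form, encoded as ASCII."""
    return len(repr(value).encode("ascii"))

def binary_size(value: float) -> int:
    """Bytes needed for a fixed-width IEEE 754 double (always 8)."""
    return len(struct.pack(">d", value))

for v in samples:
    print(f"{v!r:>10}  text={text_size(v)}  binary={binary_size(v)}")

total_text = sum(text_size(v) for v in samples)
total_binary = sum(binary_size(v) for v in samples)
print(f"total: text={total_text} bytes, binary={total_binary} bytes")
```

For these samples the text forms add up to 20 bytes against 32 bytes for the binary doubles. This is consistent with the observation that short decimal values can take less space as text than as fixed-width doubles, though the real on-disk difference depends on the data and the SerDe implementation.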