Hive, mail # user - Issue with Hive and table with lots of column


Re: Issue with Hive and table with lots of column
Edward Capriolo 2014-01-31, 19:52
Final table compression should not affect the deserialized size of the
data over the wire.
On Fri, Jan 31, 2014 at 2:49 PM, Stephen Sprague <[EMAIL PROTECTED]> wrote:

> Excellent progress, David.   So.  The most important thing we learned here is
> that it works (!) when running hive in local mode, and that this error is a
> limitation in HiveServer2.  That's important.
>
> so: a textfile storage format, and issues converting it to ORC. hmmm.
>
> follow-ups.
>
> 1. what is your query that fails?
>
> 2. can you add a "limit 1" to the end of your query and tell us if that
> works? this'll tell us if it's column or row bound.
>
> 3. bonus points. run these in local mode:
>       > set hive.exec.compress.output=true;
>       > set mapred.output.compression.type=BLOCK;
>       > set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
>       > create table blah stored as ORC as select * from <your table>;  # i'm curious if this'll work.
>       > show create table blah;  #send output back if previous step worked.
>
> 4. extra bonus.  change ORC to SEQUENCEFILE in #3 and see if that works any
> differently.  (sketch of #3/#4 below.)
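>
> roughly what i have in mind for #3 and #4 -- untested, and "your_table" is just a
> placeholder for your real table name:
>
>       # step 3: CTAS into ORC with compressed output, run in local mode
>       hive -e "
>         set hive.exec.compress.output=true;
>         set mapred.output.compression.type=BLOCK;
>         set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
>         create table blah stored as ORC as select * from your_table;
>         show create table blah;
>       "
>
>       # step 4: same thing, but stored as a sequencefile
>       hive -e "
>         create table blah_seq stored as SEQUENCEFILE as select * from your_table;
>         show create table blah_seq;
>       "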
>
>
>
> I'm wondering if compression would have any effect on the size of the
> internal ArrayList the thrift server uses.
>
>
>
> On Fri, Jan 31, 2014 at 9:21 AM, David Gayou <[EMAIL PROTECTED]> wrote:
>
>> Ok, so here is some news:
>>
>> I tried boosting HADOOP_HEAPSIZE to 8192,
>> and I also set mapred.child.java.opts to 512M.
>>
>> It doesn't seem to have any effect.
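>>
>> (In case it matters, this is roughly how I applied them -- just a sketch; the
>> hive-env.sh location and variable names may differ per distribution, and
>> "my_table" is a placeholder:)
>>
>>       # in $HIVE_HOME/conf/hive-env.sh, before restarting the Hive services
>>       export HADOOP_HEAPSIZE=8192
>>
>>       # per session, for the MapReduce task JVMs
>>       hive -e "set mapred.child.java.opts=-Xmx512m; select count(*) from my_table;"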
>>  ------
>>
>> I tried it using an ODBC driver => fails after a few minutes.
>> Using a local JDBC client (beeline) => runs forever without any error.
>>
>> Both go through HiveServer2.
>>
>> If I use local mode: it works!   (but that's not really what I need,
>> as I don't really know how to access it from my software)
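>>
>> (For what it's worth, this is roughly how I'm exercising the two paths -- untested
>> sketch, the connection string and table name are placeholders:)
>>
>>       # through HiveServer2: this is the path that fails or hangs
>>       beeline -u jdbc:hive2://localhost:10000 -n hiveuser -e "select * from my_wide_table limit 10"
>>
>>       # "local mode": the plain hive CLI, no HiveServer2 in between -- this works
>>       hive -e "select * from my_wide_table limit 10"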
>>
>> ------
>> I use a text file as storage.
>> I tried to use ORC, but I can't populate it with a LOAD DATA (it returns
>> a file format error).
>>
>> Using an "ALTER TABLE orange_large_train_3 SET FILEFORMAT ORC" after
>> populating the table, i have a file format error on select.
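>>
>> (I suspect LOAD DATA just moves the files without converting them, so the ORC
>> reader still sees plain text. Something like a CTAS from the text table into a
>> separate ORC table might be the way to convert instead -- untested sketch, the
>> "_orc" table name is just a placeholder:)
>>
>>       hive -e "create table orange_large_train_3_orc stored as ORC
>>                as select * from orange_large_train_3;"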
>>
>> ------
>>
>> @Edward :
>>
>> I've tried looking around for how to change the Thrift server heap size, but
>> haven't found anything.
>> Same thing for my client (I haven't found how to change its heap size).
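>>
>> (The closest thing I've found so far is the generic Hadoop env vars below; I'm
>> not sure they are what HiveServer2 and beeline actually honor, so treat this as
>> a guess:)
>>
>>       # server side, in hive-env.sh or the shell, before starting HiveServer2
>>       export HADOOP_HEAPSIZE=8192
>>       hive --service hiveserver2 &
>>
>>       # client side, before starting beeline
>>       export HADOOP_CLIENT_OPTS="-Xmx4g"
>>       beeline -u jdbc:hive2://localhost:10000 -n hiveuser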
>>
>> My use case really requires as many columns as possible.
>>
>>
>> Thanks a lot for your help
>>
>>
>> Regards
>>
>> David
>>
>>
>>
>>
>>
>> On Fri, Jan 31, 2014 at 1:12 AM, Edward Capriolo <[EMAIL PROTECTED]> wrote:
>>
>>> Ok, here are the problem(s): Thrift has frame size limits, and Thrift has to
>>> buffer rows into memory.
>>>
>>> The Hive Thrift server has a heap size; it needs to be big in this case.
>>>
>>> Your client needs a big heap size as well.
>>>
>>> The way to do this query, if it is possible at all, may be to turn the row
>>> lateral, potentially by treating it as a list; that will make queries on it awkward.
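>>>
>>> Roughly what I have in mind, as an untested sketch (the table and column names
>>> here are made up): collapse the thousands of scalar columns into one array
>>> column and address individual fields by position:
>>>
>>>       # one array column instead of thousands of scalar columns
>>>       hive -e "create table wide_as_list (features array<string>) stored as ORC;"
>>>
>>>       # populate it from the existing wide table, packing the columns into the array
>>>       hive -e "insert overwrite table wide_as_list
>>>                select array(cast(col1 as string), cast(col2 as string), cast(col3 as string))
>>>                from orange_large_train_3;"
>>>
>>>       # 'column 2' of the original table becomes index [1] of the array
>>>       hive -e "select features[1] from wide_as_list limit 10;"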
>>>
>>> Good luck
>>>
>>>
>>> On Thursday, January 30, 2014, Stephen Sprague <[EMAIL PROTECTED]>
>>> wrote:
>>> > oh. thinking some more about this, i forgot to ask some other basic
>>> > questions.
>>> >
>>> > a) what storage format are you using for the table (text, sequence,
>>> > rcfile, orc or custom)?   "show create table <table>" would yield that.
>>> >
>>> > b) what command is causing the stack trace?
>>> >
>>> > my thinking here is that rcfile and orc are column-based (i think), and if
>>> > you don't select all the columns, that could very well limit the size of the
>>> > "row" being returned and hence the size of the internal ArrayList.  OTOH,
>>> > if you're using "select *", um, you have my sympathies. :)
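>>> >
>>> > e.g., to test that theory, something along these lines (sketch only; the
>>> > column and table names are made up):
>>> >
>>> >       # full row: every one of the columns has to come back through thrift
>>> >       beeline -u jdbc:hive2://localhost:10000 -e "select * from my_wide_table limit 1"
>>> >
>>> >       # narrow projection: only a couple of columns come back
>>> >       beeline -u jdbc:hive2://localhost:10000 -e "select col1, col2 from my_wide_table limit 1"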
>>> >
>>> >
>>> >
>>> >
>>> > On Thu, Jan 30, 2014 at 11:33 AM, Stephen Sprague <[EMAIL PROTECTED]>
>>> wrote:
>>> >
>>> > thanks for the information. Up-to-date hive. Cluster on the smallish
>>> > side. And, well, it sure looks like a memory issue :) rather than an
>>> > inherent hive limitation.
>>> >
>>> > So.  I can only speak as a user (ie. not a hive developer) but what