Hive, mail # user - Is there a mechanism similar to hadoop -archive in hive (add archive is not apparently)


Re: Is there a mechanism similar to hadoop -archive in hive (add archive is not apparently)
Stephen Boesch 2013-06-20, 16:28
Stephen: would you be willing to share an example of specifying a
"directory" as the "add file" target? I have not seen this working.

I have attempted to use it as follows:

*We will access a script within the "hivetry" directory located here:*
hive> ! ls -l  /opt/am/ver/1.0/hive/hivetry/classifier_wf.py;
-rwxrwxr-x 1 hadoop hadoop 11241 Jun 18 19:37
/opt/am/ver/1.0/hive/hivetry/classifier_wf.py

*Add the directory  to hive:*
hive> add file /opt/am/ver/1.0/hive/hivetry;
Added resource: /opt/am/ver/1.0/hive/hivetry

*Attempt to run transform query using that script:*
*Attempt one: use the script name unqualified:*

hive>    from (select transform (aappname,qappname) using
'classifier_wf.py' as (aappname2 string, qappname2 string) from eqx )
o insert overwrite table c select o.aappname2, o.qappname2;

(Failed:   Caused by: java.io.IOException: Cannot run program
"classifier_wf.py": java.io.IOException: error=2, No such file or
directory)

*Attempt two: use the script name with the directory name prefix:*
hive>    from (select transform (aappname,qappname) using
'hive/classifier_wf.py' as (aappname2 string, qappname2 string) from
eqx ) o insert overwrite table c select o.aappname2, o.qappname2;

(Failed:   Caused by: java.io.IOException: Cannot run program
"hive/classifier_wf.py": java.io.IOException: error=2, No such file or
directory)
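
One way to see where added resources actually land on a task node is the experiment suggested in the quoted reply further down: a small transform script that reports its working directory and PATH on stderr. The following is only a minimal sketch under assumptions. The file name `whereami.py` and the helper `describe_environment` are hypothetical and do not come from the thread, and it assumes a Python interpreter is available on the task nodes.

```python
#!/usr/bin/env python
"""Hypothetical diagnostic transform script (whereami.py), not from the thread."""
import os
import sys


def describe_environment(cwd=None):
    """Return lines describing the working directory, PATH, and files below it."""
    if cwd is None:
        cwd = os.getcwd()
    lines = [
        "PWD=" + cwd,
        "PATH=" + os.environ.get("PATH", ""),
        "FILES:",
    ]
    # Walk the directory so files shipped alongside the script show up with
    # their relative paths. os.walk does not descend into symlinked
    # directories by default, so those appear only by name, not by content.
    for root, _dirs, files in os.walk(cwd):
        for name in sorted(files):
            lines.append("  " + os.path.relpath(os.path.join(root, name), cwd))
    return lines


def main():
    # Diagnostics go to stderr so they land in the task logs, not in the table.
    sys.stderr.write("\n".join(describe_environment()) + "\n")
    # Echo the input rows unchanged so the transform still produces output.
    for row in sys.stdin:
        sys.stdout.write(row)


if __name__ == "__main__":
    main()
```

It could be tried by adding the script with `add file` and naming it in the `using` clause of the same transform query (for example `using 'python whereami.py'`), then reading the task's stderr log. The relative paths it prints would show whether `classifier_wf.py` is visible at the top level, under a subdirectory, or not at all.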
2013/6/20 Stephen Sprague <[EMAIL PROTECTED]>

> yeah.  the archive isn't unpacked on the remote side. I think add archive
> is mostly used for finding java packages since CLASSPATH will reference the
> archive (and as such there is no need to expand it.)
>
>
> On Thu, Jun 20, 2013 at 9:00 AM, Stephen Boesch <[EMAIL PROTECTED]> wrote:
>
>> thx for the tip on "add <file>" where <file> is directory. I will try
>> that.
>>
>>
>> 2013/6/20 Stephen Sprague <[EMAIL PROTECTED]>
>>
>>> i personally only know of adding a .jar file via add archive but my
>>> experience there is very limited.  i believe if you 'add file' and the file
>>> is a directory it'll recursively take everything underneath but i know of
>>> nothing that inflates or untars things on the remote end automatically.
>>>
>>> i would 'add file' your python script and then within that untar your
>>> tarball to get at your model data. it's just a matter of figuring out the
>>> path to that tarball that's kinda up in the air when it's added as 'add
>>> file'.  Yeah. "local downloads directory".  What the literal path is,
>>> that's what i'd like to know. :)
>>>
>>>
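
The suggestion in the message above (ship the script and a tarball with `add file`, then unpack the tarball from inside the script) could be sketched roughly as follows. This assumes both files end up in the task's working directory, which the thread has not confirmed. The names `model.tar.gz`, `unpacked`, and `unpack_resources` are hypothetical.

```python
import os
import tarfile


def unpack_resources(tarball="model.tar.gz", dest="unpacked"):
    """Extract `tarball` into `dest` (if not already there) and return its path."""
    # A relative name is resolved against the task's working directory.
    # That the tarball lives there is an assumption, not a confirmed fact.
    tarball_path = os.path.abspath(tarball)
    if not os.path.exists(tarball_path):
        raise IOError("tarball not found at " + tarball_path)
    # Skip extraction when the destination already exists.
    if not os.path.isdir(dest):
        # Mode defaults to "r", which detects gzip/bz2 compression itself.
        # Only do this with archives you built yourself.
        with tarfile.open(tarball_path) as archive:
            archive.extractall(dest)
    return os.path.abspath(dest)
```

A transform script would call `unpack_resources()` once at startup and load its model data from the returned directory before reading rows from stdin.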
>>> On Thu, Jun 20, 2013 at 8:37 AM, Stephen Boesch <[EMAIL PROTECTED]> wrote:
>>>
>>>>
>>>> @Stephen:  given the  'relative' path for hive is from a local
>>>> downloads directory on each local tasktracker in the cluster,  it was my
>>>> thought that if the archive were actually being expanded then
>>>> somedir/somefileinthearchive  should work.  I will go ahead and test this
>>>> assumption.
>>>>
>>>> In the meantime, is there any facility available in hive for making
>>>> archived files available to hive jobs?  archive or hadoop archive ("har")
>>>> etc?
>>>>
>>>>
>>>> 2013/6/20 Stephen Sprague <[EMAIL PROTECTED]>
>>>>
>>>>> what would be interesting would be to run a little experiment and find
>>>>> out what the default PATH is on your data nodes.  How much of a pain would
>>>>> it be to run a little python script to print to stderr the values of the
>>>>> environment variables $PATH and $PWD (or the shell command 'pwd')?
>>>>>
>>>>> that's of course going through normal channels of "add file".
>>>>>
>>>>> the thing is given you're using a relative path "hive/parse_qx.py"
>>>>> you need to know what the "current directory" is when the process runs on
>>>>> the data nodes.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Jun 20, 2013 at 5:32 AM, Stephen Boesch <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>>
>>>>>> We have a few dozen files that need to be made available to all
>>>>>> mappers/reducers in the cluster while running hive transformation steps.
>>>>>>
>>>>>> It seems the "add archive"  does not make the entries unarchived and