Re: Is there a mechanism similar to hadoop -archive in hive (add archive is not apparently)
thx for the tip on "add <file>" where <file> is a directory. I will try that.
2013/6/20 Stephen Sprague <[EMAIL PROTECTED]>

> i personally only know of adding a .jar file via add archive but my
> experience there is very limited.  i believe if you 'add file' and the file
> is a directory it'll recursively take everything underneath but i know of
> nothing that inflates or untars things on the remote end automatically.
>
> i would 'add file' your python script and then within that untar your
> tarball to get at your model data. it's just a matter of figuring out the
> path to that tarball that's kinda up in the air when it's added as 'add
> file'.  Yeah. "local downloads directory".  What the literal path is, is
> what i'd like to know. :)
>
>
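A minimal sketch of the "untar inside the script" idea described above, in Python. It assumes the tarball is shipped alongside the script and lands either in the task's current working directory or next to the script itself; that is exactly the part left open in this thread, so treat both search locations as guesses. The name modelArchive.tgz comes from the original question; find_tarball, unpack_model and the model_data directory are hypothetical names used only for illustration.

```python
import os
import sys
import tarfile


def find_tarball(name, search_dirs):
    """Return the first existing path to `name` in search_dirs, or None."""
    for d in search_dirs:
        candidate = os.path.join(d, name)
        if os.path.isfile(candidate):
            return candidate
    return None


def unpack_model(name="modelArchive.tgz", dest="model_data"):
    """Locate the tarball and extract it into `dest`; return `dest`."""
    # Assumption: the tarball sits in the cwd or beside this script.
    script_dir = os.path.dirname(os.path.abspath(sys.argv[0]))
    tarball = find_tarball(name, [os.getcwd(), script_dir])
    if tarball is None:
        raise IOError("could not find %s in cwd or script dir" % name)
    with tarfile.open(tarball, "r:*") as tar:
        tar.extractall(dest)
    return dest


def main():
    model_dir = unpack_model()
    # stderr ends up in the task logs, so the real path can be read from there.
    sys.stderr.write("model unpacked to %s\n" % os.path.abspath(model_dir))
    for line in sys.stdin:
        # Placeholder: rows are echoed unchanged; real code would apply the model.
        sys.stdout.write(line)
```

A real transform script would call main() at the bottom. Extraction happens once per task start, so whether this beats loading N separately added files would have to be measured.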
> On Thu, Jun 20, 2013 at 8:37 AM, Stephen Boesch <[EMAIL PROTECTED]> wrote:
>
>>
>> @Stephen:  given the  'relative' path for hive is from a local downloads
>> directory on each local tasktracker in the cluster,  it was my thought that
>> if the archive were actually being expanded then
>> somedir/somefileinthearchive  should work.  I will go ahead and test this
>> assumption.
>>
>> In the meantime, is there any facility available in hive for making
>> archived files available to hive jobs?  archive or hadoop archive ("har")
>> etc?
>>
>>
>> 2013/6/20 Stephen Sprague <[EMAIL PROTECTED]>
>>
>>> what would be interesting would be to run a little experiment and find
>>> out what the default PATH is on your data nodes.  How much of a pain would
>>> it be to run a little python script to print to stderr the values of the
>>> environment variables $PATH and $PWD (or the shell command 'pwd') ?
>>>
>>> that's of course going through normal channels of "add file".
>>>
>>> the thing is given you're using a relative path "hive/parse_qx.py"  you
>>> need to know what the "current directory" is when the process runs on the
>>> data nodes.
>>>
>>>
>>>
>>>
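The experiment suggested above could look roughly like this in Python: print $PATH, $PWD and the actual working directory to stderr, then pass the input rows through untouched so the query still completes. The directory listing is an extra not mentioned in the thread, added because it shows which files really landed next to the script. describe_environment and run are hypothetical names.

```python
import os
import sys


def describe_environment(environ=None, cwd=None):
    """Return lines describing PATH, PWD, the cwd and the cwd's contents."""
    environ = os.environ if environ is None else environ
    cwd = os.getcwd() if cwd is None else cwd
    return [
        "PATH=%s" % environ.get("PATH", "<unset>"),
        "PWD=%s" % environ.get("PWD", "<unset>"),
        "cwd=%s" % cwd,
        "listing=%s" % sorted(os.listdir(cwd)),
    ]


def run(stdin=sys.stdin, stdout=sys.stdout, stderr=sys.stderr):
    """Write the diagnostics to stderr, then echo every input row to stdout."""
    for line in describe_environment():
        stderr.write(line + "\n")
    for row in stdin:
        stdout.write(row)
```

Saved under a hypothetical name such as print_env.py with a call to run() at the bottom, it would be shipped with "add file print_env.py;" and invoked as "using 'python print_env.py'" in the transform; the output should then show up in the task's stderr log.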
>>> On Thu, Jun 20, 2013 at 5:32 AM, Stephen Boesch <[EMAIL PROTECTED]> wrote:
>>>
>>>>
>>>> We have a few dozen files that need to be made available to all
>>>> mappers/reducers in the cluster while running  hive transformation steps .
>>>>
>>>> It seems that "add archive" does not unpack the archive's entries and
>>>> make them available directly on the default file path - and that is
>>>> what we are looking for.
>>>>
>>>> To illustrate:
>>>>
>>>>    add file modelfile.1;
>>>>    add file modelfile.2;
>>>>    ..
>>>>    add file modelfile.N;
>>>>
>>>>   Then, our model that is invoked during the transformation step *does*
>>>> have correct access to its model files in the default path.
>>>>
>>>> But .. those model files take a few *minutes* to all load..
>>>>
>>>> instead when we try:
>>>>    add archive modelArchive.tgz;
>>>>
>>>> The problem is the archive does not get exploded apparently ..
>>>>
>>>> I have an archive for example that contains shell scripts under the
>>>> "hive" directory stored inside.  I am *not* able to access
>>>> hive/my-shell-script.sh  after adding the archive. Specifically the
>>>> following fails:
>>>>
>>>> $ tar -tvf appm*.tar.gz | grep launch-quixey_to_xml
>>>> -rwxrwxr-x stephenb/stephenb    664 2013-06-18 17:46
>>>> appminer/bin/launch-quixey_to_xml.sh
>>>>
>>>> from (select transform (aappname,qappname)
>>>> using 'hive/parse_qx.py' as (aappname2 string, qappname2 string)
>>>> from eqx ) o insert overwrite table c select o.aappname2, o.qappname2;
>>>>
>>>> Cannot run program "hive/parse_qx.py": java.io.IOException: error=2, No such file or directory
>>>>
>>>>
>>>>
>>>>
>>>
>>
>