Re: Lookup objects in distributed cache
Jan Dolinár 2013-04-04, 07:11
GenericUDTF has an initialize() method that is called only once per task. So
if you read your files in this method and keep the resulting structures in
memory, the overhead is relatively small (reading 15 MB per mapper is
negligible compared to several GB of processed data).
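
To illustrate the load-once idea, here is a minimal sketch in plain Java with no Hive dependencies. The class name LookupTables, the tab-separated file layout, and the loadCount() helper are all assumptions made for the example, not anything from Hive or from your files. In a real UDTF you would call loadOnce() from initialize(ObjectInspector[]) and lookup() from process(Object[]); how you resolve the path of the cached file is left out here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical holder for a lookup structure that is built once per task. */
final class LookupTables {
    // Stays null until the file has been read; afterwards it is reused as-is.
    private Map<String, List<String>> byKey;
    // Counts real file reads, only so the example can show the file is read once.
    private int loadCount = 0;

    /**
     * Reads a tab-separated "key<TAB>value" file (an assumed layout) into a
     * Map of String to List. A second call does nothing.
     */
    void loadOnce(Path file) throws IOException {
        if (byKey != null) {
            return; // already loaded in this task
        }
        Map<String, List<String>> m = new HashMap<>();
        try (BufferedReader r = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = r.readLine()) != null) {
                int tab = line.indexOf('\t');
                if (tab < 0) {
                    continue; // skip lines without a tab separator
                }
                m.computeIfAbsent(line.substring(0, tab), k -> new ArrayList<>())
                 .add(line.substring(tab + 1));
            }
        }
        byKey = m;
        loadCount++;
    }

    /** Returns the values for a key, or an empty list if absent or not loaded. */
    List<String> lookup(String key) {
        if (byKey == null) {
            return Collections.emptyList();
        }
        return byKey.getOrDefault(key, Collections.emptyList());
    }

    int loadCount() {
        return loadCount;
    }
}
```

The same pattern repeats for each of the 7 files: one field per structure, all filled in the single initialize() call, then only in-memory lookups per row.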
On Wed, Apr 3, 2013 at 10:35 PM, vivek thakre <[EMAIL PROTECTED]> wrote:
> I want to implement some functionality as a UDTF. It involves reading
> 7 different text files and creating lookup structures such as Map,
> Set, List, Map of String and List, etc. to be used in the logic.
> These files are small, 15 MB on average.
> I can add these files to the distributed cache and access them in the UDTF,
> read the files, and create the necessary lookup data structures, but this
> would mean that the files are opened, read and closed every time the UDTF is
> called.
> Is there a way that I can read the files just once, create the data
> structures needed, put them in the distributed cache and access them from
> the UDTF?
> I don't think creating Hive tables from these files and doing a map-side
> join is possible, as the functionality that I want to implement is fairly
> complex and I am not sure it can be done with just a Hive query and joins,
> without a UDTF.
> Thanks in advance.