Re: new feature in hive: links
To kick-start the review, I did a quick pass over the doc. A few questions
popped out at me, which I asked. Sambavi was kind enough to come back with
replies for them. I am continuing to look into it, and I would encourage
other folks to look into it as well.
Thanks,

Ashutosh
<Begin Forward Message>
Hi Ashutosh,

Thanks for looking through the design and providing your feedback!

Responses below:

* What exactly is contained in tracking capacity usage? One component is
disk space, which I presume you are going to track by summing sizes under
the database directory. Are you also thinking of tracking resource usage
in terms of CPU/memory/network utilization for different teams?

Right now, the capacity usage we will track in Hive is disk space (managed
tables that belong to the namespace + imported tables). We will track the
mappers and reducers that the namespace utilizes directly from Hadoop.

* Each namespace (ns) will have exactly one database. If so, are users not
allowed to create/use additional databases in such a deployment? Not
necessarily a problem, just trying to understand the design.

Yes, you are correct – this is a limitation of the design. Introducing a
new concept seemed heavyweight, so you can instead think of namespaces as
“self-contained” databases. But it means that a given namespace cannot have
sub-databases in it.

* How are you going to keep metadata consistent across two ns? If metadata
gets updated in a remote ns, will it get automatically updated in the
user's local ns? If yes, how will this be implemented? If no, then every
time the user needs data from the remote ns, she has to bring the metadata
up to date in her own ns. How will she do that?

Metadata will be kept in sync for linked tables. We will make ALTER TABLE
on the remote table (the source of the link) cause an update to the target
of the link. Note that from a Hive perspective, the metadata for the source
and target of a link lives in the same metastore.


* Is it even required that the metadata of two linked tables be consistent?
It seems the user has to run "alter link add partition" herself for each
partition. She can choose to add only a few partitions; in that case, the
tables in the two ns have a different number of partitions and thus
different data.

What you say above is true for static links. For dynamic links, ADD and
DROP PARTITION on the source of the link will cause the target to get those
partitions as well (we trap ALTER TABLE ADD/DROP PARTITION to provide this
behavior).
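
A hypothetical side-by-side of the two behaviors (the STATIC/DYNAMIC
keywords here are placeholders for however the final DDL ends up
distinguishing the two kinds of link):

    -- Static link: each partition must be added to the link explicitly
    CREATE STATIC LINK TO TABLE gold.clicks;          -- assumed syntax
    ALTER LINK gold.clicks ADD PARTITION (ds='2012-05-21');
    -- only ds='2012-05-21' is visible through the link

    -- Dynamic link: partitions track the source automatically
    CREATE DYNAMIC LINK TO TABLE gold.impressions;    -- assumed syntax
    ALTER TABLE gold.impressions ADD PARTITION (ds='2012-05-22');
    -- the trapped ADD PARTITION also adds ds='2012-05-22' to the link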


* Who is allowed to create links?

Any user who has CREATE/ALL privileges on the database. We could
potentially create a new privilege for this, but I think the CREATE
privilege should suffice. We can similarly map the ALTER and DROP
privileges to the appropriate operations.
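
Under that mapping, creating a link would look something like this sketch
(the GRANT statement is Hive's existing authorization syntax; the CREATE
LINK DDL and the names are illustrative):

    -- The existing CREATE privilege on the target database is enough
    GRANT CREATE ON DATABASE silver TO USER alice;

    -- alice can now create links into 'silver'
    CREATE LINK TO TABLE gold.clicks;      -- proposed DDL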


* Once a user creates a link, who can use it? If everyone is allowed
access, then I don't see how this is different from the problem you outline
in the first alternative design option, wherein a user who has access to
two ns via roles has access to the data in both.

The link creates metadata in the target database, so you can only access
data that has been linked into that database (access is via the T@Y or Y.T
syntax, depending on the chosen design option). Note that this is different
from having a role that a user maps to, since in that case there is no
local metadata in the target database specifying whether the imported data
is accessible from that database.
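
For example, once table T from namespace Y has been linked into the local
database, queries reference it explicitly under either candidate syntax
(both forms are design options under discussion; the ds partition column
is made up):

    -- Design option 1: T@Y reference
    SELECT COUNT(*) FROM T@Y WHERE ds = '2012-05-21';

    -- Design option 2: database-qualified Y.T reference
    SELECT COUNT(*) FROM Y.T WHERE ds = '2012-05-21';

    -- A table that has not been linked into this database stays
    -- inaccessible here, whatever roles the user holds elsewhere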


* If links are first-class concepts, then the authorization model also
needs to understand them? I don't see any mention of that.

Yes, you are correct. We need to account for the authorization model.

* I see there is an HDFS JIRA for implementing hard links of files at the
HDFS layer, so that takes care of linking physical data on HDFS. What about
tables whose data is stored in external systems, for example HBase? Does
HBase also need to implement hard-linking of its tables for Hive to make
use of this feature? What about other storage handlers like Cassandra,
MongoDB, etc.?

The link does not create a link on HDFS. It just points to the source
table/partitions. You can think of it as a Hive-level link, so there is no
need for any changes/features from the other storage handlers.
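
So linking an HBase-backed table is a pure metadata operation, along these
lines (sketch only; the DDL and gold.page_views are illustrative):

    -- The source happens to use the HBase storage handler
    CREATE LINK TO TABLE gold.page_views;  -- proposed DDL

    -- No files are copied and no HBase-level hard link is made: the
    -- metastore entry for the link points at gold.page_views, and reads
    -- go through the source table's own storage handler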


* Migration will involve a two-step process of distcp'ing data from one
cluster to another and then replicating one MySQL instance to another. Are
there any other steps? Do you plan to (later) build tools to automate this
migration process?

Yes, we will be building tools to enable migration of a namespace.
Migration will involve replicating the metadata and the data, as you
mention above.

* When migrating a ns from one datacenter to another, will links be dropped
or will they also be preserved?

We will preserve them – by copying the data for the links to the other
datacenter.

Hope that helps. Please ask any more questions that come up as you read the
design.

Thanks!

Sambavi

On Mon, May 21, 2012 at 3:34 PM, Namit Jain <[EMAIL PROTECTED]> wrote: