I have a scenario that our organization is trying to implement.
Suppose we have various data sources, for example RDBMS and HDFS.
*Source Dataset Types:*
1. Whole data set
2. Joined data set
3. Filtered data set
We need to pull data from one source to another, e.g. from HDFS to
RDBMS or vice versa, based on a condition: out of the whole data at the
source, the destination may need only specific (filtered) data, the whole
data, or joined data. Which approach should we take to pull the data for
each of the dataset types above?
Here is what I am thinking:
CASE-1: Data from HDFS to HDFS (different cluster), whole data:
we will use *distcp*
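For CASE-1, a typical distcp invocation might look like the sketch below; the NameNode host names and paths are placeholders, not values from my setup:

```shell
# Copy a whole dataset between clusters.
# -update skips files already present at the target with the same size,
# -p preserves file attributes such as permissions.
hadoop distcp -update -p \
    hdfs://source-nn:8020/data/events \
    hdfs://target-nn:8020/data/events
```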
CASE-2: Data from HDFS to HDFS (different cluster), conditional (filtered)
data: we will use a *custom MapReduce program* that performs the filter
operation and then loads the result.
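For CASE-2, instead of a fully custom MapReduce program, Hadoop Streaming with a simple filter command may be enough when the condition is expressible as a line filter. A sketch, where the jar path, input/output URIs, and the `status=ACTIVE` condition are all assumptions:

```shell
# Map-only filter job: keep only lines matching the condition and write
# the result to the target cluster's HDFS.
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -D mapreduce.job.reduces=0 \
    -input  hdfs://source-nn:8020/data/events \
    -output hdfs://target-nn:8020/data/events_filtered \
    -mapper "grep 'status=ACTIVE'"
```

Setting reduces to 0 makes it a map-only job, so mapper output goes straight to the output directory without a shuffle.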
CASE-3: Data from HDFS to RDBMS (whole data): *Sqoop* (sqoop export)
CASE-4: Data from HDFS to RDBMS (conditional data): *Sqoop*. Note that, as
far as I know, sqoop export has no filter option (--where applies only to
imports), so the data would have to be filtered in HDFS first and the
filtered output exported.
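For CASE-3, a sqoop export along these lines should work; the JDBC URL, credentials file, table, and directory are placeholders:

```shell
# Export HDFS files into an existing RDBMS table (the table must already
# exist; Sqoop maps each input record to an INSERT).
sqoop export \
    --connect jdbc:mysql://db-host/sales \
    --username etl \
    --password-file /user/etl/.dbpass \
    --table orders \
    --export-dir /data/orders \
    --input-fields-terminated-by ','
```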
CASE-5: Some data from RDBMS and some data from HDFS, then filter and
load into HDFS: *JDBC with a MapReduce program*
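One way to sketch CASE-5 without hand-writing JDBC code is to stage the RDBMS side into HDFS with a conditional Sqoop import, then join and filter the two datasets with a MapReduce (or Hive/Pig) job. Everything below is hypothetical: the connection details, the `customers`/`orders` names, the `region` condition, and the `my-join-job.jar` job:

```shell
# Step 1: pull only the needed rows from the RDBMS into a staging dir
# (--where pushes the filter down to the database).
sqoop import \
    --connect jdbc:mysql://db-host/sales \
    --username etl \
    --password-file /user/etl/.dbpass \
    --table customers \
    --where "region = 'EU'" \
    --target-dir /staging/customers_eu

# Step 2: join the staged data with the existing HDFS dataset using a
# custom MR job (or a Hive/Pig join) and write the result to HDFS.
hadoop jar my-join-job.jar com.example.JoinAndFilter \
    /staging/customers_eu /data/orders /data/joined_output
```

This avoids opening JDBC connections from inside mappers, which can overwhelm the database when many tasks run in parallel.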
Note: Can anyone suggest whether I am wrong here, or whether there is a
simpler way to do any of this?