-
Re: How to solve one Scenario in hadoop ?
Vikas Jadhav 2013-03-06, 19:46
I would go with the first case, because if the data size is large it will distribute the data across multiple nodes.

On Tue, Mar 5, 2013 at 10:57 AM, samir das mohapatra <[EMAIL PROTECTED]> wrote:
> Hi All,
> I have one scenario where our organization is trying to implement Hadoop.
>
> Scenario Statement:
> ---------------------------------------
> Suppose we have various data sources, for example RDBMS, HDFS, Streaming.
>
> *Source Dataset Types:*
> 1. Single Source
> 2. Joining Sources
> 3. Filtered Data set
> 4. Specific columns
>
> We need to pull the data from one source to another; it could be from HDFS to RDBMS or vice versa, based on a condition. That means that, out of the whole data from the source, we need only the specific data, the whole data, or the joined data in the destination. So which direction should we go to pull the data, based on the above dataset type conditions?
>
> I am thinking:
>
> CASE-1 DATA from HDFS to HDFS (different cluster), whole data: we will use *distcp*
> CASE-2 DATA from HDFS to HDFS (different cluster), conditional data (filtered data): we will use a *CUSTOM MAP REDUCE PROGRAM where we will do the filter operation, then load*
> CASE-3 DATA from HDFS to RDBMS (whole data): *SQOOP*
> CASE-4 DATA from HDFS to RDBMS (conditional data): *SQOOP*
> CASE-5 SOME DATA FROM RDBMS and SOME DATA FROM HDFS, then do filter and load into HDFS: *JDBC WITH Map/Reduce program*
>
> Note: Can anyone suggest whether I am wrong and we need to do something other than this, which would be easy to do?
>
> Regards,
> samir.

--
Thanx and Regards
Vikas Jadhav
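
For CASE-1 and CASE-3 in the quoted question, the two whole-data copies come down to one command each. Here is a minimal sketch in Python that only assembles the argument lists, assuming the `hadoop` and `sqoop` binaries are on the PATH. The function names, URIs, table name and paths are hypothetical placeholders, not values from this thread.

```python
import shlex


def distcp_command(src_uri, dst_uri):
    """CASE-1: copy a whole HDFS path from one cluster to another."""
    return ["hadoop", "distcp", src_uri, dst_uri]


def sqoop_export_command(jdbc_url, table, export_dir):
    """CASE-3: export a whole HDFS directory into an RDBMS table."""
    return [
        "sqoop", "export",
        "--connect", jdbc_url,       # JDBC connection string of the target database
        "--table", table,            # target table, assumed to exist already
        "--export-dir", export_dir,  # HDFS directory holding the records to export
    ]


def as_shell(command):
    """Render an argument list as a single shell-quoted string."""
    return " ".join(shlex.quote(arg) for arg in command)


if __name__ == "__main__":
    # Hypothetical cluster addresses and table name, for illustration only.
    print(as_shell(distcp_command("hdfs://nn1:8020/data/in", "hdfs://nn2:8020/data/in")))
    print(as_shell(sqoop_export_command("jdbc:mysql://dbhost/sales", "orders", "/data/orders")))
```

Each list can be passed to `subprocess.run` as is. Connection credentials and field delimiters are left out and would need to be added for a real export.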
-
Re: How to solve one Scenario in hadoop ?
Dino Kečo 2013-03-06, 19:53
I would suggest Hive in these cases because it is easy to manage multiple data sources, it uses SQL-like syntax, it scales because it runs on Hadoop, and it has joins implemented and optimized.
Regards,
Dino
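
The filter step in CASE-2 of the original question can be run as a map-only job. The sketch below is a Hadoop Streaming mapper in Python, not the custom Java MapReduce program the question mentions. It assumes tab-separated records and a made-up condition (third column equals `ACTIVE`). The function names and the condition are hypothetical.

```python
import sys


def keep(fields):
    """Hypothetical filter condition: the third column must equal 'ACTIVE'."""
    return len(fields) > 2 and fields[2] == "ACTIVE"


def filter_lines(lines, sep="\t"):
    """Yield only the records that satisfy keep(), with the newline stripped."""
    for line in lines:
        record = line.rstrip("\n")
        if keep(record.split(sep)):
            yield record


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Mapper entry point: read records from stdin, write the kept ones to stdout."""
    for record in filter_lines(stdin):
        stdout.write(record + "\n")

# When saved as a mapper script, end the file with a call to main().
```

With the number of reducers set to zero, the mapper output is written straight to the job's output directory. The filtered result can then be copied to the other cluster in the same way as the whole-data case.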