|
Ibrahim Yakti
2012-12-24, 13:08
Dean Wampler
2012-12-24, 14:51
Ibrahim Yakti
2012-12-24, 15:34
Dean Wampler
2012-12-24, 18:12
Ibrahim Yakti
2012-12-24, 18:25
Kshiva Kps
2012-12-25, 05:50
Mohammad Tariq
2012-12-25, 05:56
Mohammad Tariq
2012-12-25, 05:59
Ibrahim Yakti
2012-12-26, 06:27
Ibrahim Yakti
2012-12-26, 13:54
Mohammad Tariq
2012-12-26, 14:52
Ibrahim Yakti
2012-12-26, 14:56
Mohammad Tariq
2012-12-24, 13:19
Ibrahim Yakti
2012-12-24, 13:30
Mohammad Tariq
2012-12-24, 13:35
Ibrahim Yakti
2012-12-24, 13:38
Mohammad Tariq
2012-12-24, 14:03
Ibrahim Yakti
2012-12-24, 14:08
Mohammad Tariq
2012-12-24, 14:25
Ibrahim Yakti
2012-12-24, 14:28
Jeremiah Peschka
2012-12-24, 14:22
Edward Capriolo
2012-12-24, 14:28
Mohammad Tariq
2012-12-24, 14:31
Ibrahim Yakti
2012-12-24, 14:29
Edward Capriolo
2012-12-24, 14:37
Ibrahim Yakti
2012-12-24, 14:41
|
-
Reflect MySQL updates into HiveIbrahim Yakti 2012-12-24, 13:08
Hi All,
We are new to hadoop and hive, we are trying to use hive to run analytical queries and we are using sqoop to import data into hive, in our RDBMS the data updated very frequently and this needs to be reflected to hive. Hive does not support update/delete but there are many workarounds to do this task. What's in our mind is importing all the tables into hive as is, then we build the required tables for reporting. My questions are: 1. What is the best way to reflect MySQL updates into Hive with minimal resources? 2. Is sqoop the right tool to do the ETL? 3. Is Hive the right tool to do this kind of queries or we should search for alternatives? Any hint will be useful, thanks in advanced. -- Ibrahim +
Ibrahim Yakti 2012-12-24, 13:08
-
Re: Reflect MySQL updates into HiveDean Wampler 2012-12-24, 14:51
This is not as hard as it sounds. The hardest part is setting up the
incremental query against your MySQL database. Then you can write the results to new files in the HDFS directory for the table and Hive will see them immediately. Yes, even though Hive doesn't support updates, it doesn't care how many files are in the directory. The trick is to avoid lots of little files. As others have suggested, you should consider partitioning the data, perhaps by time. Say you import about a few HDFS blocks-worth of data each day, then use year/month/day partitioning to speed up your Hive queries. You'll need to add the partitions to the table as you go, but actually, you can add those once a month, for example, for all partitions. Hive doesn't care if the partition directories don't exist yet or the directories are empty. I also recommend using an external table, which gives you more flexibility on directory layout, etc. Sqoop might be the easiest tool for importing the data, as it will even generate a Hive table schema from the original MySQL table. However, that feature may not be useful in this case, as you already have the table. I think Oozie is horribly complex to use and overkill for this purpose. A simple bash script triggered periodically by cron is all you need. If you aren't using a partitioned table, you have a single sqoop command to run. If you have partitioned data, you'll also need a hive statement in the script to create the partition, unless you do those in batch once a month, etc., etc. Hope this helps, dean On Mon, Dec 24, 2012 at 7:08 AM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote: > Hi All, > > We are new to hadoop and hive, we are trying to use hive to > run analytical queries and we are using sqoop to import data into hive, in > our RDBMS the data updated very frequently and this needs to be reflected > to hive. Hive does not support update/delete but there are many workarounds > to do this task. > > What's in our mind is importing all the tables into hive as is, then we > build the required tables for reporting. > > My questions are: > > 1. What is the best way to reflect MySQL updates into Hive with > minimal resources? > 2. Is sqoop the right tool to do the ETL? > 3. Is Hive the right tool to do this kind of queries or we should > search for alternatives? > > Any hint will be useful, thanks in advanced. > > -- > Ibrahim > -- *Dean Wampler, Ph.D.* thinkbiganalytics.com +1-312-339-1330 +
Dean Wampler 2012-12-24, 14:51
-
Re: Reflect MySQL updates into HiveIbrahim Yakti 2012-12-24, 15:34
Thanks Dean for the great reply, setting incremental import should be easy,
if I partitioned my data how hive will get me the updated rows only considering that the row may have multiple fields that will be updated over time? and how will I manage the tables that based on multiple sources? and do you recommend to import the data to HDFS instead of Hive directly? Won't we have a lot of duplicated records then? Regarding automation we were thinking to use sqoop-job command or crons as you suggested. So, the suggested flow as follows: MySQL ---(Extract / Load)---> HDFS (Table/Year/Month/Day) ---> Load in Hive as External Table ---(Transform Data & Join Tables)--> Save it in Hive tables for reporting. Correct? Appreciated. -- Ibrahim On Mon, Dec 24, 2012 at 5:51 PM, Dean Wampler < [EMAIL PROTECTED]> wrote: > This is not as hard as it sounds. The hardest part is setting up the > incremental query against your MySQL database. Then you can write the > results to new files in the HDFS directory for the table and Hive will see > them immediately. Yes, even though Hive doesn't support updates, it doesn't > care how many files are in the directory. The trick is to avoid lots of > little files. > > As others have suggested, you should consider partitioning the data, > perhaps by time. Say you import about a few HDFS blocks-worth of data each > day, then use year/month/day partitioning to speed up your Hive queries. > You'll need to add the partitions to the table as you go, but actually, you > can add those once a month, for example, for all partitions. Hive doesn't > care if the partition directories don't exist yet or the directories are > empty. I also recommend using an external table, which gives you more > flexibility on directory layout, etc. > > Sqoop might be the easiest tool for importing the data, as it will even > generate a Hive table schema from the original MySQL table. However, that > feature may not be useful in this case, as you already have the table. > > I think Oozie is horribly complex to use and overkill for this purpose. A > simple bash script triggered periodically by cron is all you need. If you > aren't using a partitioned table, you have a single sqoop command to run. > If you have partitioned data, you'll also need a hive statement in the > script to create the partition, unless you do those in batch once a month, > etc., etc. > > Hope this helps, > dean > > On Mon, Dec 24, 2012 at 7:08 AM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote: > >> Hi All, >> >> We are new to hadoop and hive, we are trying to use hive to >> run analytical queries and we are using sqoop to import data into hive, in >> our RDBMS the data updated very frequently and this needs to be reflected >> to hive. Hive does not support update/delete but there are many workarounds >> to do this task. >> >> What's in our mind is importing all the tables into hive as is, then we >> build the required tables for reporting. >> >> My questions are: >> >> 1. What is the best way to reflect MySQL updates into Hive with >> minimal resources? >> 2. Is sqoop the right tool to do the ETL? >> 3. Is Hive the right tool to do this kind of queries or we should >> search for alternatives? >> >> Any hint will be useful, thanks in advanced. >> >> -- >> Ibrahim >> > > > > -- > *Dean Wampler, Ph.D.* > thinkbiganalytics.com > +1-312-339-1330 > > +
Ibrahim Yakti 2012-12-24, 15:34
-
Re: Reflect MySQL updates into HiveDean Wampler 2012-12-24, 18:12
Looks good, but a few suggestions. If you can eliminate duplicates, etc. as
you ingest the data into HDFS, that would eliminate a cleansing step. Note that if the target directory in HDFS IS the specified location for an external Hive table/partition, then there will be no separate step to "load in Hive as External Table". It's already there! Your "transform data..." is a common pattern; stage "raw" data into a location, then use Hive (or Pig) to transform it into the final form and INSERT INTO the final Hive table. dean On Mon, Dec 24, 2012 at 9:34 AM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote: > Thanks Dean for the great reply, setting incremental import should be > easy, if I partitioned my data how hive will get me the updated rows only > considering that the row may have multiple fields that will be updated over > time? and how will I manage the tables that based on multiple sources? and > do you recommend to import the data to HDFS instead of Hive directly? Won't > we have a lot of duplicated records then? > > Regarding automation we were thinking to use sqoop-job command or crons as > you suggested. > > So, the suggested flow as follows: > > MySQL ---(Extract / Load)---> HDFS (Table/Year/Month/Day) ---> Load in > Hive as External Table ---(Transform Data & Join Tables)--> Save it in Hive > tables for reporting. > > > Correct? > > Appreciated. > > > -- > Ibrahim > > > On Mon, Dec 24, 2012 at 5:51 PM, Dean Wampler < > [EMAIL PROTECTED]> wrote: > >> This is not as hard as it sounds. The hardest part is setting up the >> incremental query against your MySQL database. Then you can write the >> results to new files in the HDFS directory for the table and Hive will see >> them immediately. Yes, even though Hive doesn't support updates, it doesn't >> care how many files are in the directory. The trick is to avoid lots of >> little files. >> >> As others have suggested, you should consider partitioning the data, >> perhaps by time. Say you import about a few HDFS blocks-worth of data each >> day, then use year/month/day partitioning to speed up your Hive queries. >> You'll need to add the partitions to the table as you go, but actually, you >> can add those once a month, for example, for all partitions. Hive doesn't >> care if the partition directories don't exist yet or the directories are >> empty. I also recommend using an external table, which gives you more >> flexibility on directory layout, etc. >> >> Sqoop might be the easiest tool for importing the data, as it will even >> generate a Hive table schema from the original MySQL table. However, that >> feature may not be useful in this case, as you already have the table. >> >> I think Oozie is horribly complex to use and overkill for this purpose. A >> simple bash script triggered periodically by cron is all you need. If you >> aren't using a partitioned table, you have a single sqoop command to run. >> If you have partitioned data, you'll also need a hive statement in the >> script to create the partition, unless you do those in batch once a month, >> etc., etc. >> >> Hope this helps, >> dean >> >> On Mon, Dec 24, 2012 at 7:08 AM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote: >> >>> Hi All, >>> >>> We are new to hadoop and hive, we are trying to use hive to >>> run analytical queries and we are using sqoop to import data into hive, in >>> our RDBMS the data updated very frequently and this needs to be reflected >>> to hive. Hive does not support update/delete but there are many workarounds >>> to do this task. >>> >>> What's in our mind is importing all the tables into hive as is, then we >>> build the required tables for reporting. >>> >>> My questions are: >>> >>> 1. What is the best way to reflect MySQL updates into Hive with >>> minimal resources? >>> 2. Is sqoop the right tool to do the ETL? >>> 3. Is Hive the right tool to do this kind of queries or we should >>> search for alternatives? >>> >>> Any hint will be useful, thanks in advanced. > *Dean Wampler, Ph.D.* thinkbiganalytics.com +1-312-339-1330 +
Dean Wampler 2012-12-24, 18:12
-
Re: Reflect MySQL updates into HiveIbrahim Yakti 2012-12-24, 18:25
My problem is in eliminating the duplicates and only keep the correct data,
any advise please? On Dec 24, 2012 9:13 PM, "Dean Wampler" <[EMAIL PROTECTED]> wrote: > Looks good, but a few suggestions. If you can eliminate duplicates, etc. > as you ingest the data into HDFS, that would eliminate a cleansing step. > Note that if the target directory in HDFS IS the specified location for an > external Hive table/partition, then there will be no separate step to "load > in Hive as External Table". It's already there! > > Your "transform data..." is a common pattern; stage "raw" data into a > location, then use Hive (or Pig) to transform it into the final form and > INSERT INTO the final Hive table. > > dean > > On Mon, Dec 24, 2012 at 9:34 AM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote: > >> Thanks Dean for the great reply, setting incremental import should be >> easy, if I partitioned my data how hive will get me the updated rows only >> considering that the row may have multiple fields that will be updated over >> time? and how will I manage the tables that based on multiple sources? and >> do you recommend to import the data to HDFS instead of Hive directly? Won't >> we have a lot of duplicated records then? >> >> Regarding automation we were thinking to use sqoop-job command or crons >> as you suggested. >> >> So, the suggested flow as follows: >> >> MySQL ---(Extract / Load)---> HDFS (Table/Year/Month/Day) ---> Load in >> Hive as External Table ---(Transform Data & Join Tables)--> Save it in Hive >> tables for reporting. >> >> >> Correct? >> >> Appreciated. >> >> >> -- >> Ibrahim >> >> >> On Mon, Dec 24, 2012 at 5:51 PM, Dean Wampler < >> [EMAIL PROTECTED]> wrote: >> >>> This is not as hard as it sounds. The hardest part is setting up the >>> incremental query against your MySQL database. Then you can write the >>> results to new files in the HDFS directory for the table and Hive will see >>> them immediately. Yes, even though Hive doesn't support updates, it doesn't >>> care how many files are in the directory. The trick is to avoid lots of >>> little files. >>> >>> As others have suggested, you should consider partitioning the data, >>> perhaps by time. Say you import about a few HDFS blocks-worth of data each >>> day, then use year/month/day partitioning to speed up your Hive queries. >>> You'll need to add the partitions to the table as you go, but actually, you >>> can add those once a month, for example, for all partitions. Hive doesn't >>> care if the partition directories don't exist yet or the directories are >>> empty. I also recommend using an external table, which gives you more >>> flexibility on directory layout, etc. >>> >>> Sqoop might be the easiest tool for importing the data, as it will even >>> generate a Hive table schema from the original MySQL table. However, that >>> feature may not be useful in this case, as you already have the table. >>> >>> I think Oozie is horribly complex to use and overkill for this purpose. >>> A simple bash script triggered periodically by cron is all you need. If you >>> aren't using a partitioned table, you have a single sqoop command to run. >>> If you have partitioned data, you'll also need a hive statement in the >>> script to create the partition, unless you do those in batch once a month, >>> etc., etc. >>> >>> Hope this helps, >>> dean >>> >>> On Mon, Dec 24, 2012 at 7:08 AM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote: >>> >>>> Hi All, >>>> >>>> We are new to hadoop and hive, we are trying to use hive to >>>> run analytical queries and we are using sqoop to import data into hive, in >>>> our RDBMS the data updated very frequently and this needs to be reflected >>>> to hive. Hive does not support update/delete but there are many workarounds >>>> to do this task. >>>> >>>> What's in our mind is importing all the tables into hive as is, then we >>>> build the required tables for reporting. >>>> >>>> My questions are: >>>> >>>> 1. What is the best way to reflect MySQL updates into Hive with +
Ibrahim Yakti 2012-12-24, 18:25
-
Re: Reflect MySQL updates into HiveKshiva Kps 2012-12-25, 05:50
Hi,
Is there any Hive editors and where we can write 100 to 150 Hive scripts I'm believing is not essay to do in CLI mode all scripts . Like IDE for JAVA /TOAD for SQL pls advice , many thanks Thanks On Mon, Dec 24, 2012 at 8:21 PM, Dean Wampler < [EMAIL PROTECTED]> wrote: > This is not as hard as it sounds. The hardest part is setting up the > incremental query against your MySQL database. Then you can write the > results to new files in the HDFS directory for the table and Hive will see > them immediately. Yes, even though Hive doesn't support updates, it doesn't > care how many files are in the directory. The trick is to avoid lots of > little files. > > As others have suggested, you should consider partitioning the data, > perhaps by time. Say you import about a few HDFS blocks-worth of data each > day, then use year/month/day partitioning to speed up your Hive queries. > You'll need to add the partitions to the table as you go, but actually, you > can add those once a month, for example, for all partitions. Hive doesn't > care if the partition directories don't exist yet or the directories are > empty. I also recommend using an external table, which gives you more > flexibility on directory layout, etc. > > Sqoop might be the easiest tool for importing the data, as it will even > generate a Hive table schema from the original MySQL table. However, that > feature may not be useful in this case, as you already have the table. > > I think Oozie is horribly complex to use and overkill for this purpose. A > simple bash script triggered periodically by cron is all you need. If you > aren't using a partitioned table, you have a single sqoop command to run. > If you have partitioned data, you'll also need a hive statement in the > script to create the partition, unless you do those in batch once a month, > etc., etc. > > Hope this helps, > dean > > On Mon, Dec 24, 2012 at 7:08 AM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote: > >> Hi All, >> >> We are new to hadoop and hive, we are trying to use hive to >> run analytical queries and we are using sqoop to import data into hive, in >> our RDBMS the data updated very frequently and this needs to be reflected >> to hive. Hive does not support update/delete but there are many workarounds >> to do this task. >> >> What's in our mind is importing all the tables into hive as is, then we >> build the required tables for reporting. >> >> My questions are: >> >> 1. What is the best way to reflect MySQL updates into Hive with >> minimal resources? >> 2. Is sqoop the right tool to do the ETL? >> 3. Is Hive the right tool to do this kind of queries or we should >> search for alternatives? >> >> Any hint will be useful, thanks in advanced. >> >> -- >> Ibrahim >> > > > > -- > *Dean Wampler, Ph.D.* > thinkbiganalytics.com > +1-312-339-1330 > > +
Kshiva Kps 2012-12-25, 05:50
-
Re: Reflect MySQL updates into HiveMohammad Tariq 2012-12-25, 05:56
Have a look at Beeswax.
BTW, do you have access to Google at your station?Same question on the Pig mailing list as well, that too twice. Best Regards, Tariq +91-9741563634 https://mtariq.jux.com/ On Tue, Dec 25, 2012 at 11:20 AM, Kshiva Kps <[EMAIL PROTECTED]> wrote: > Hi, > > Is there any Hive editors and where we can write 100 to 150 Hive scripts > I'm believing is not essay to do in CLI mode all scripts . > Like IDE for JAVA /TOAD for SQL pls advice , many thanks > > > Thanks > > On Mon, Dec 24, 2012 at 8:21 PM, Dean Wampler < > [EMAIL PROTECTED]> wrote: > >> This is not as hard as it sounds. The hardest part is setting up the >> incremental query against your MySQL database. Then you can write the >> results to new files in the HDFS directory for the table and Hive will see >> them immediately. Yes, even though Hive doesn't support updates, it doesn't >> care how many files are in the directory. The trick is to avoid lots of >> little files. >> >> As others have suggested, you should consider partitioning the data, >> perhaps by time. Say you import about a few HDFS blocks-worth of data each >> day, then use year/month/day partitioning to speed up your Hive queries. >> You'll need to add the partitions to the table as you go, but actually, you >> can add those once a month, for example, for all partitions. Hive doesn't >> care if the partition directories don't exist yet or the directories are >> empty. I also recommend using an external table, which gives you more >> flexibility on directory layout, etc. >> >> Sqoop might be the easiest tool for importing the data, as it will even >> generate a Hive table schema from the original MySQL table. However, that >> feature may not be useful in this case, as you already have the table. >> >> I think Oozie is horribly complex to use and overkill for this purpose. A >> simple bash script triggered periodically by cron is all you need. If you >> aren't using a partitioned table, you have a single sqoop command to run. >> If you have partitioned data, you'll also need a hive statement in the >> script to create the partition, unless you do those in batch once a month, >> etc., etc. >> >> Hope this helps, >> dean >> >> On Mon, Dec 24, 2012 at 7:08 AM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote: >> >>> Hi All, >>> >>> We are new to hadoop and hive, we are trying to use hive to >>> run analytical queries and we are using sqoop to import data into hive, in >>> our RDBMS the data updated very frequently and this needs to be reflected >>> to hive. Hive does not support update/delete but there are many workarounds >>> to do this task. >>> >>> What's in our mind is importing all the tables into hive as is, then we >>> build the required tables for reporting. >>> >>> My questions are: >>> >>> 1. What is the best way to reflect MySQL updates into Hive with >>> minimal resources? >>> 2. Is sqoop the right tool to do the ETL? >>> 3. Is Hive the right tool to do this kind of queries or we should >>> search for alternatives? >>> >>> Any hint will be useful, thanks in advanced. >>> >>> -- >>> Ibrahim >>> >> >> >> >> -- >> *Dean Wampler, Ph.D.* >> thinkbiganalytics.com >> +1-312-339-1330 >> >> > +
Mohammad Tariq 2012-12-25, 05:56
-
Re: Reflect MySQL updates into HiveMohammad Tariq 2012-12-25, 05:59
Also, have a look at this :
http://www.catb.org/~esr/faqs/smart-questions.html Best Regards, Tariq +91-9741563634 https://mtariq.jux.com/ On Tue, Dec 25, 2012 at 11:26 AM, Mohammad Tariq <[EMAIL PROTECTED]> wrote: > Have a look at Beeswax. > > BTW, do you have access to Google at your station?Same question on the Pig > mailing list as well, that too twice. > > Best Regards, > Tariq > +91-9741563634 > https://mtariq.jux.com/ > > > On Tue, Dec 25, 2012 at 11:20 AM, Kshiva Kps <[EMAIL PROTECTED]> wrote: > >> Hi, >> >> Is there any Hive editors and where we can write 100 to 150 Hive scripts >> I'm believing is not essay to do in CLI mode all scripts . >> Like IDE for JAVA /TOAD for SQL pls advice , many thanks >> >> >> Thanks >> >> On Mon, Dec 24, 2012 at 8:21 PM, Dean Wampler < >> [EMAIL PROTECTED]> wrote: >> >>> This is not as hard as it sounds. The hardest part is setting up the >>> incremental query against your MySQL database. Then you can write the >>> results to new files in the HDFS directory for the table and Hive will see >>> them immediately. Yes, even though Hive doesn't support updates, it doesn't >>> care how many files are in the directory. The trick is to avoid lots of >>> little files. >>> >>> As others have suggested, you should consider partitioning the data, >>> perhaps by time. Say you import about a few HDFS blocks-worth of data each >>> day, then use year/month/day partitioning to speed up your Hive queries. >>> You'll need to add the partitions to the table as you go, but actually, you >>> can add those once a month, for example, for all partitions. Hive doesn't >>> care if the partition directories don't exist yet or the directories are >>> empty. I also recommend using an external table, which gives you more >>> flexibility on directory layout, etc. >>> >>> Sqoop might be the easiest tool for importing the data, as it will even >>> generate a Hive table schema from the original MySQL table. However, that >>> feature may not be useful in this case, as you already have the table. >>> >>> I think Oozie is horribly complex to use and overkill for this purpose. >>> A simple bash script triggered periodically by cron is all you need. If you >>> aren't using a partitioned table, you have a single sqoop command to run. >>> If you have partitioned data, you'll also need a hive statement in the >>> script to create the partition, unless you do those in batch once a month, >>> etc., etc. >>> >>> Hope this helps, >>> dean >>> >>> On Mon, Dec 24, 2012 at 7:08 AM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote: >>> >>>> Hi All, >>>> >>>> We are new to hadoop and hive, we are trying to use hive to >>>> run analytical queries and we are using sqoop to import data into hive, in >>>> our RDBMS the data updated very frequently and this needs to be reflected >>>> to hive. Hive does not support update/delete but there are many workarounds >>>> to do this task. >>>> >>>> What's in our mind is importing all the tables into hive as is, then we >>>> build the required tables for reporting. >>>> >>>> My questions are: >>>> >>>> 1. What is the best way to reflect MySQL updates into Hive with >>>> minimal resources? >>>> 2. Is sqoop the right tool to do the ETL? >>>> 3. Is Hive the right tool to do this kind of queries or we should >>>> search for alternatives? >>>> >>>> Any hint will be useful, thanks in advanced. >>>> >>>> -- >>>> Ibrahim >>>> >>> >>> >>> >>> -- >>> *Dean Wampler, Ph.D.* >>> thinkbiganalytics.com >>> +1-312-339-1330 >>> >>> >> > +
Mohammad Tariq 2012-12-25, 05:59
-
Re: Reflect MySQL updates into HiveIbrahim Yakti 2012-12-26, 06:27
Mohammad, I am not sure if the answers & the link were to me or to Kshiva's
question. if I have partitioned my data based on status for example, when I run the update query it will add the updated data on a new partition (success or shipped for example) and it will keep the old data (confirmed or paid for example), right? -- Ibrahim On Tue, Dec 25, 2012 at 8:59 AM, Mohammad Tariq <[EMAIL PROTECTED]> wrote: > Also, have a look at this : > http://www.catb.org/~esr/faqs/smart-questions.html > > Best Regards, > Tariq > +91-9741563634 > https://mtariq.jux.com/ > > > On Tue, Dec 25, 2012 at 11:26 AM, Mohammad Tariq <[EMAIL PROTECTED]>wrote: > >> Have a look at Beeswax. >> >> BTW, do you have access to Google at your station?Same question on the >> Pig mailing list as well, that too twice. >> >> Best Regards, >> Tariq >> +91-9741563634 >> https://mtariq.jux.com/ >> >> >> On Tue, Dec 25, 2012 at 11:20 AM, Kshiva Kps <[EMAIL PROTECTED]> wrote: >> >>> Hi, >>> >>> Is there any Hive editors and where we can write 100 to 150 Hive scripts >>> I'm believing is not essay to do in CLI mode all scripts . >>> Like IDE for JAVA /TOAD for SQL pls advice , many thanks >>> >>> >>> Thanks >>> >>> On Mon, Dec 24, 2012 at 8:21 PM, Dean Wampler < >>> [EMAIL PROTECTED]> wrote: >>> >>>> This is not as hard as it sounds. The hardest part is setting up the >>>> incremental query against your MySQL database. Then you can write the >>>> results to new files in the HDFS directory for the table and Hive will see >>>> them immediately. Yes, even though Hive doesn't support updates, it doesn't >>>> care how many files are in the directory. The trick is to avoid lots of >>>> little files. >>>> >>>> As others have suggested, you should consider partitioning the data, >>>> perhaps by time. Say you import about a few HDFS blocks-worth of data each >>>> day, then use year/month/day partitioning to speed up your Hive queries. >>>> You'll need to add the partitions to the table as you go, but actually, you >>>> can add those once a month, for example, for all partitions. Hive doesn't >>>> care if the partition directories don't exist yet or the directories are >>>> empty. I also recommend using an external table, which gives you more >>>> flexibility on directory layout, etc. >>>> >>>> Sqoop might be the easiest tool for importing the data, as it will even >>>> generate a Hive table schema from the original MySQL table. However, that >>>> feature may not be useful in this case, as you already have the table. >>>> >>>> I think Oozie is horribly complex to use and overkill for this purpose. >>>> A simple bash script triggered periodically by cron is all you need. If you >>>> aren't using a partitioned table, you have a single sqoop command to run. >>>> If you have partitioned data, you'll also need a hive statement in the >>>> script to create the partition, unless you do those in batch once a month, >>>> etc., etc. >>>> >>>> Hope this helps, >>>> dean >>>> >>>> On Mon, Dec 24, 2012 at 7:08 AM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote: >>>> >>>>> Hi All, >>>>> >>>>> We are new to hadoop and hive, we are trying to use hive to >>>>> run analytical queries and we are using sqoop to import data into hive, in >>>>> our RDBMS the data updated very frequently and this needs to be reflected >>>>> to hive. Hive does not support update/delete but there are many workarounds >>>>> to do this task. >>>>> >>>>> What's in our mind is importing all the tables into hive as is, then >>>>> we build the required tables for reporting. >>>>> >>>>> My questions are: >>>>> >>>>> 1. What is the best way to reflect MySQL updates into Hive with >>>>> minimal resources? >>>>> 2. Is sqoop the right tool to do the ETL? >>>>> 3. Is Hive the right tool to do this kind of queries or we should >>>>> search for alternatives? >>>>> >>>>> Any hint will be useful, thanks in advanced. >>>>> >>>>> -- >>>>> Ibrahim >>>>> >>>> >>>> >>>> >>>> -- >>>> *Dean Wampler, Ph.D.* > +
Ibrahim Yakti 2012-12-26, 06:27
-
Re: Reflect MySQL updates into HiveIbrahim Yakti 2012-12-26, 13:54
After more reading, a suggested scenario looks like:
MySQL ---(Extract / Load)---> HDFS ---> Load into HBase --> Read as external in Hive ---(Transform Data & Join Tables)--> Use hive for Joins & Queries ---> Update HBase as needed & Reload in Hive. What do you think please? -- Ibrahim On Wed, Dec 26, 2012 at 9:27 AM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote: > Mohammad, I am not sure if the answers & the link were to me or to > Kshiva's question. > > if I have partitioned my data based on status for example, when I run the > update query it will add the updated data on a new partition (success or > shipped for example) and it will keep the old data (confirmed or paid for > example), right? > > > -- > Ibrahim > > > On Tue, Dec 25, 2012 at 8:59 AM, Mohammad Tariq <[EMAIL PROTECTED]>wrote: > >> Also, have a look at this : >> http://www.catb.org/~esr/faqs/smart-questions.html >> >> Best Regards, >> Tariq >> +91-9741563634 >> https://mtariq.jux.com/ >> >> >> On Tue, Dec 25, 2012 at 11:26 AM, Mohammad Tariq <[EMAIL PROTECTED]>wrote: >> >>> Have a look at Beeswax. >>> >>> BTW, do you have access to Google at your station?Same question on the >>> Pig mailing list as well, that too twice. >>> >>> Best Regards, >>> Tariq >>> +91-9741563634 >>> https://mtariq.jux.com/ >>> >>> >>> On Tue, Dec 25, 2012 at 11:20 AM, Kshiva Kps <[EMAIL PROTECTED]>wrote: >>> >>>> Hi, >>>> >>>> Is there any Hive editors and where we can write 100 to 150 Hive >>>> scripts I'm believing is not essay to do in CLI mode all scripts . >>>> Like IDE for JAVA /TOAD for SQL pls advice , many thanks >>>> >>>> >>>> Thanks >>>> >>>> On Mon, Dec 24, 2012 at 8:21 PM, Dean Wampler < >>>> [EMAIL PROTECTED]> wrote: >>>> >>>>> This is not as hard as it sounds. The hardest part is setting up the >>>>> incremental query against your MySQL database. Then you can write the >>>>> results to new files in the HDFS directory for the table and Hive will see >>>>> them immediately. Yes, even though Hive doesn't support updates, it doesn't >>>>> care how many files are in the directory. The trick is to avoid lots of >>>>> little files. >>>>> >>>>> As others have suggested, you should consider partitioning the data, >>>>> perhaps by time. Say you import about a few HDFS blocks-worth of data each >>>>> day, then use year/month/day partitioning to speed up your Hive queries. >>>>> You'll need to add the partitions to the table as you go, but actually, you >>>>> can add those once a month, for example, for all partitions. Hive doesn't >>>>> care if the partition directories don't exist yet or the directories are >>>>> empty. I also recommend using an external table, which gives you more >>>>> flexibility on directory layout, etc. >>>>> >>>>> Sqoop might be the easiest tool for importing the data, as it will >>>>> even generate a Hive table schema from the original MySQL table. However, >>>>> that feature may not be useful in this case, as you already have the table. >>>>> >>>>> I think Oozie is horribly complex to use and overkill for this >>>>> purpose. A simple bash script triggered periodically by cron is all you >>>>> need. If you aren't using a partitioned table, you have a single sqoop >>>>> command to run. If you have partitioned data, you'll also need a hive >>>>> statement in the script to create the partition, unless you do those in >>>>> batch once a month, etc., etc. >>>>> >>>>> Hope this helps, >>>>> dean >>>>> >>>>> On Mon, Dec 24, 2012 at 7:08 AM, Ibrahim Yakti <[EMAIL PROTECTED]>wrote: >>>>> >>>>>> Hi All, >>>>>> >>>>>> We are new to hadoop and hive, we are trying to use hive to >>>>>> run analytical queries and we are using sqoop to import data into hive, in >>>>>> our RDBMS the data updated very frequently and this needs to be reflected >>>>>> to hive. Hive does not support update/delete but there are many workarounds >>>>>> to do this task. >>>>>> >>>>>> What's in our mind is importing all the tables into hive as is, then >>>>>> we build the required tables for reporting. +
Ibrahim Yakti 2012-12-26, 13:54
-
Re: Reflect MySQL updates into HiveMohammad Tariq 2012-12-26, 14:52
Hello Ibrahim,
Sorry for the late response. Those replies were for Kshiva. I saw his question(exactly same as this one) multiple times on Pig mailing list as well, so just thought of giving some pointers to him on how to use the list. I should have specified it properly. Apologies for creating the nuisance. Coming back to the actual point, yes the flow is fine. Normally people do it like this. But I was looking for some alternate way, so that we don't have to go through this long process for the updates. I'll let you know once I find something useful. But till now I haven't found anything better than whatever Dean sir has suggested. Please, do let me know if you find something before me. Many thanks. Best Regards, Tariq +91-9741563634 https://mtariq.jux.com/ On Wed, Dec 26, 2012 at 7:24 PM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote: > After more reading, a suggested scenario looks like: > > MySQL ---(Extract / Load)---> HDFS ---> Load into HBase --> Read as > external in Hive ---(Transform Data & Join Tables)--> Use hive for Joins & > Queries ---> Update HBase as needed & Reload in Hive. > > What do you think please? > > > > -- > Ibrahim > > > On Wed, Dec 26, 2012 at 9:27 AM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote: > >> Mohammad, I am not sure if the answers & the link were to me or to >> Kshiva's question. >> >> if I have partitioned my data based on status for example, when I run the >> update query it will add the updated data on a new partition (success or >> shipped for example) and it will keep the old data (confirmed or paid for >> example), right? >> >> >> -- >> Ibrahim >> >> >> On Tue, Dec 25, 2012 at 8:59 AM, Mohammad Tariq <[EMAIL PROTECTED]>wrote: >> >>> Also, have a look at this : >>> http://www.catb.org/~esr/faqs/smart-questions.html >>> >>> Best Regards, >>> Tariq >>> +91-9741563634 >>> https://mtariq.jux.com/ >>> >>> >>> On Tue, Dec 25, 2012 at 11:26 AM, Mohammad Tariq <[EMAIL PROTECTED]>wrote: >>> >>>> Have a look at Beeswax. >>>> >>>> BTW, do you have access to Google at your station?Same question on the >>>> Pig mailing list as well, that too twice. >>>> >>>> Best Regards, >>>> Tariq >>>> +91-9741563634 >>>> https://mtariq.jux.com/ >>>> >>>> >>>> On Tue, Dec 25, 2012 at 11:20 AM, Kshiva Kps <[EMAIL PROTECTED]>wrote: >>>> >>>>> Hi, >>>>> >>>>> Is there any Hive editors and where we can write 100 to 150 Hive >>>>> scripts I'm believing is not essay to do in CLI mode all scripts . >>>>> Like IDE for JAVA /TOAD for SQL pls advice , many thanks >>>>> >>>>> >>>>> Thanks >>>>> >>>>> On Mon, Dec 24, 2012 at 8:21 PM, Dean Wampler < >>>>> [EMAIL PROTECTED]> wrote: >>>>> >>>>>> This is not as hard as it sounds. The hardest part is setting up the >>>>>> incremental query against your MySQL database. Then you can write the >>>>>> results to new files in the HDFS directory for the table and Hive will see >>>>>> them immediately. Yes, even though Hive doesn't support updates, it doesn't >>>>>> care how many files are in the directory. The trick is to avoid lots of >>>>>> little files. >>>>>> >>>>>> As others have suggested, you should consider partitioning the data, >>>>>> perhaps by time. Say you import about a few HDFS blocks-worth of data each >>>>>> day, then use year/month/day partitioning to speed up your Hive queries. >>>>>> You'll need to add the partitions to the table as you go, but actually, you >>>>>> can add those once a month, for example, for all partitions. Hive doesn't >>>>>> care if the partition directories don't exist yet or the directories are >>>>>> empty. I also recommend using an external table, which gives you more >>>>>> flexibility on directory layout, etc. >>>>>> >>>>>> Sqoop might be the easiest tool for importing the data, as it will >>>>>> even generate a Hive table schema from the original MySQL table. However, >>>>>> that feature may not be useful in this case, as you already have the table. >>>>>> >>>>>> I think Oozie is horribly complex to use and overkill for this +
Mohammad Tariq 2012-12-26, 14:52
-
Re: Reflect MySQL updates into HiveIbrahim Yakti 2012-12-26, 14:56
Thanks Mohammad, I will be waiting ... meanwhile, seems I will get into
HBase and give it a try ... unless someone advised with something better/easier. -- Ibrahim On Wed, Dec 26, 2012 at 5:52 PM, Mohammad Tariq <[EMAIL PROTECTED]> wrote: > Hello Ibrahim, > > Sorry for the late response. Those replies were for Kshiva. I > saw his question(exactly same as this one) multiple times on Pig mailing > list as well, so just thought of giving some pointers to him on how to use > the list. I should have specified it properly. Apologies for creating the > nuisance. > > Coming back to the actual point, yes the flow is fine. Normally people do > it like this. But I was looking for some alternate way, so that we don't > have to go through this long process for the updates. I'll let you know > once I find something useful. But till now I haven't found anything better > than whatever Dean sir has suggested. Please, do let me know if you find > something before me. > > Many thanks. > > > Best Regards, > Tariq > +91-9741563634 > https://mtariq.jux.com/ > > > On Wed, Dec 26, 2012 at 7:24 PM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote: > >> After more reading, a suggested scenario looks like: >> >> MySQL ---(Extract / Load)---> HDFS ---> Load into HBase --> Read as >> external in Hive ---(Transform Data & Join Tables)--> Use hive for Joins & >> Queries ---> Update HBase as needed & Reload in Hive. >> >> What do you think please? >> >> >> >> -- >> Ibrahim >> >> >> On Wed, Dec 26, 2012 at 9:27 AM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote: >> >>> Mohammad, I am not sure if the answers & the link were to me or to >>> Kshiva's question. >>> >>> if I have partitioned my data based on status for example, when I run >>> the update query it will add the updated data on a new partition (success >>> or shipped for example) and it will keep the old data (confirmed or paid >>> for example), right? >>> >>> >>> -- >>> Ibrahim >>> >>> >>> On Tue, Dec 25, 2012 at 8:59 AM, Mohammad Tariq <[EMAIL PROTECTED]>wrote: >>> >>>> Also, have a look at this : >>>> http://www.catb.org/~esr/faqs/smart-questions.html >>>> >>>> Best Regards, >>>> Tariq >>>> +91-9741563634 >>>> https://mtariq.jux.com/ >>>> >>>> >>>> On Tue, Dec 25, 2012 at 11:26 AM, Mohammad Tariq <[EMAIL PROTECTED]>wrote: >>>> >>>>> Have a look at Beeswax. >>>>> >>>>> BTW, do you have access to Google at your station?Same question on the >>>>> Pig mailing list as well, that too twice. >>>>> >>>>> Best Regards, >>>>> Tariq >>>>> +91-9741563634 >>>>> https://mtariq.jux.com/ >>>>> >>>>> >>>>> On Tue, Dec 25, 2012 at 11:20 AM, Kshiva Kps <[EMAIL PROTECTED]>wrote: >>>>> >>>>>> Hi, >>>>>> >>>>>> Is there any Hive editors and where we can write 100 to 150 Hive >>>>>> scripts I'm believing is not essay to do in CLI mode all scripts . >>>>>> Like IDE for JAVA /TOAD for SQL pls advice , many thanks >>>>>> >>>>>> >>>>>> Thanks >>>>>> >>>>>> On Mon, Dec 24, 2012 at 8:21 PM, Dean Wampler < >>>>>> [EMAIL PROTECTED]> wrote: >>>>>> >>>>>>> This is not as hard as it sounds. The hardest part is setting up the >>>>>>> incremental query against your MySQL database. Then you can write the >>>>>>> results to new files in the HDFS directory for the table and Hive will see >>>>>>> them immediately. Yes, even though Hive doesn't support updates, it doesn't >>>>>>> care how many files are in the directory. The trick is to avoid lots of >>>>>>> little files. >>>>>>> >>>>>>> As others have suggested, you should consider partitioning the data, >>>>>>> perhaps by time. Say you import about a few HDFS blocks-worth of data each >>>>>>> day, then use year/month/day partitioning to speed up your Hive queries. >>>>>>> You'll need to add the partitions to the table as you go, but actually, you >>>>>>> can add those once a month, for example, for all partitions. Hive doesn't >>>>>>> care if the partition directories don't exist yet or the directories are >>>>>>> empty. I also recommend using an external table, which gives you more +
Ibrahim Yakti 2012-12-26, 14:56
-
Re: Reflect MySQL updates into HiveMohammad Tariq 2012-12-24, 13:19
Hello Ibrahim,
A quick questio. Are you planning to replace your SQL DB with Hive? If that is the case, I would not suggest to do that. Both are meant for entirely different purposes. Hive is for batch processing and not for real time system. So if you are requirements involve real time things, you need to think before moving ahead. Yes, Sqoop is 'the' tool. It is primarily meant for this purpose. HTH Best Regards, Tariq +91-9741563634 https://mtariq.jux.com/ On Mon, Dec 24, 2012 at 6:38 PM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote: > Hi All, > > We are new to hadoop and hive, we are trying to use hive to > run analytical queries and we are using sqoop to import data into hive, in > our RDBMS the data updated very frequently and this needs to be reflected > to hive. Hive does not support update/delete but there are many workarounds > to do this task. > > What's in our mind is importing all the tables into hive as is, then we > build the required tables for reporting. > > My questions are: > > 1. What is the best way to reflect MySQL updates into Hive with > minimal resources? > 2. Is sqoop the right tool to do the ETL? > 3. Is Hive the right tool to do this kind of queries or we should > search for alternatives? > > Any hint will be useful, thanks in advanced. > > -- > Ibrahim > +
Mohammad Tariq 2012-12-24, 13:19
-
Re: Reflect MySQL updates into HiveIbrahim Yakti 2012-12-24, 13:30
Thanks Mohammad, No, we do not have any plans to replace our RDBMS with
Hive. Hadoop/Hive will be used as Data Warehouse & batch processing computing, as I said we want to use Hive for analytical queries. -- Ibrahim On Mon, Dec 24, 2012 at 4:19 PM, Mohammad Tariq <[EMAIL PROTECTED]> wrote: > Hello Ibrahim, > > A quick questio. Are you planning to replace your SQL DB with Hive? > If that is the case, I would not suggest to do that. Both are meant for > entirely different purposes. Hive is for batch processing and not for real > time system. So if you are requirements involve real time things, you need > to think before moving ahead. > > Yes, Sqoop is 'the' tool. It is primarily meant for this purpose. > > HTH > > Best Regards, > Tariq > +91-9741563634 > https://mtariq.jux.com/ > > > On Mon, Dec 24, 2012 at 6:38 PM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote: > >> Hi All, >> >> We are new to hadoop and hive, we are trying to use hive to >> run analytical queries and we are using sqoop to import data into hive, in >> our RDBMS the data updated very frequently and this needs to be reflected >> to hive. Hive does not support update/delete but there are many workarounds >> to do this task. >> >> What's in our mind is importing all the tables into hive as is, then we >> build the required tables for reporting. >> >> My questions are: >> >> 1. What is the best way to reflect MySQL updates into Hive with >> minimal resources? >> 2. Is sqoop the right tool to do the ETL? >> 3. Is Hive the right tool to do this kind of queries or we should >> search for alternatives? >> >> Any hint will be useful, thanks in advanced. >> >> -- >> Ibrahim >> > > +
Ibrahim Yakti 2012-12-24, 13:30
-
Re: Reflect MySQL updates into HiveMohammad Tariq 2012-12-24, 13:35
Cool. Then go ahead :)
Just in case you need something in realtime, you can have a look at Impala.(I know nobody likes to get preached, but just in case ;) ). Best Regards, Tariq +91-9741563634 https://mtariq.jux.com/ On Mon, Dec 24, 2012 at 7:00 PM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote: > Thanks Mohammad, No, we do not have any plans to replace our RDBMS with > Hive. Hadoop/Hive will be used as Data Warehouse & batch processing > computing, as I said we want to use Hive for analytical queries. > > > -- > Ibrahim > > > On Mon, Dec 24, 2012 at 4:19 PM, Mohammad Tariq <[EMAIL PROTECTED]>wrote: > >> Hello Ibrahim, >> >> A quick questio. Are you planning to replace your SQL DB with Hive? >> If that is the case, I would not suggest to do that. Both are meant for >> entirely different purposes. Hive is for batch processing and not for real >> time system. So if you are requirements involve real time things, you need >> to think before moving ahead. >> >> Yes, Sqoop is 'the' tool. It is primarily meant for this purpose. >> >> HTH >> >> Best Regards, >> Tariq >> +91-9741563634 >> https://mtariq.jux.com/ >> >> >> On Mon, Dec 24, 2012 at 6:38 PM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote: >> >>> Hi All, >>> >>> We are new to hadoop and hive, we are trying to use hive to >>> run analytical queries and we are using sqoop to import data into hive, in >>> our RDBMS the data updated very frequently and this needs to be reflected >>> to hive. Hive does not support update/delete but there are many workarounds >>> to do this task. >>> >>> What's in our mind is importing all the tables into hive as is, then we >>> build the required tables for reporting. >>> >>> My questions are: >>> >>> 1. What is the best way to reflect MySQL updates into Hive with >>> minimal resources? >>> 2. Is sqoop the right tool to do the ETL? >>> 3. Is Hive the right tool to do this kind of queries or we should >>> search for alternatives? >>> >>> Any hint will be useful, thanks in advanced. >>> >>> -- >>> Ibrahim >>> >> >> > +
Mohammad Tariq 2012-12-24, 13:35
-
Re: Reflect MySQL updates into HiveIbrahim Yakti 2012-12-24, 13:38
My question was how to reflect MySQL updates to hadoop/hive, this is our
problem now. -- Ibrahim On Mon, Dec 24, 2012 at 4:35 PM, Mohammad Tariq <[EMAIL PROTECTED]> wrote: > Cool. Then go ahead :) > > Just in case you need something in realtime, you can have a look at > Impala.(I know nobody likes to get preached, but just in case ;) ). > > Best Regards, > Tariq > +91-9741563634 > https://mtariq.jux.com/ > > > On Mon, Dec 24, 2012 at 7:00 PM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote: > >> Thanks Mohammad, No, we do not have any plans to replace our RDBMS with >> Hive. Hadoop/Hive will be used as Data Warehouse & batch processing >> computing, as I said we want to use Hive for analytical queries. >> >> >> -- >> Ibrahim >> >> >> On Mon, Dec 24, 2012 at 4:19 PM, Mohammad Tariq <[EMAIL PROTECTED]>wrote: >> >>> Hello Ibrahim, >>> >>> A quick questio. Are you planning to replace your SQL DB with Hive? >>> If that is the case, I would not suggest to do that. Both are meant for >>> entirely different purposes. Hive is for batch processing and not for real >>> time system. So if you are requirements involve real time things, you need >>> to think before moving ahead. >>> >>> Yes, Sqoop is 'the' tool. It is primarily meant for this purpose. >>> >>> HTH >>> >>> Best Regards, >>> Tariq >>> +91-9741563634 >>> https://mtariq.jux.com/ >>> >>> >>> On Mon, Dec 24, 2012 at 6:38 PM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote: >>> >>>> Hi All, >>>> >>>> We are new to hadoop and hive, we are trying to use hive to >>>> run analytical queries and we are using sqoop to import data into hive, in >>>> our RDBMS the data updated very frequently and this needs to be reflected >>>> to hive. Hive does not support update/delete but there are many workarounds >>>> to do this task. >>>> >>>> What's in our mind is importing all the tables into hive as is, then we >>>> build the required tables for reporting. >>>> >>>> My questions are: >>>> >>>> 1. What is the best way to reflect MySQL updates into Hive with >>>> minimal resources? >>>> 2. Is sqoop the right tool to do the ETL? >>>> 3. Is Hive the right tool to do this kind of queries or we should >>>> search for alternatives? >>>> >>>> Any hint will be useful, thanks in advanced. >>>> >>>> -- >>>> Ibrahim >>>> >>> >>> >> > +
Ibrahim Yakti 2012-12-24, 13:38
-
Re: Reflect MySQL updates into HiveMohammad Tariq 2012-12-24, 14:03
You can use Apache Oozie to schedule your imports.
Alternatively, you can have an additional column in your SQL table, say LastUpdatedTime or something. As soon as there is a change in this column you can start the import from this point. This way you don't have to import all the things everytime there is a change in your table. You just have to move only the most recent data, say only the 'delta' amount of data. Best Regards, Tariq +91-9741563634 https://mtariq.jux.com/ On Mon, Dec 24, 2012 at 7:08 PM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote: > My question was how to reflect MySQL updates to hadoop/hive, this is our > problem now. > > > -- > Ibrahim > > > On Mon, Dec 24, 2012 at 4:35 PM, Mohammad Tariq <[EMAIL PROTECTED]>wrote: > >> Cool. Then go ahead :) >> >> Just in case you need something in realtime, you can have a look at >> Impala.(I know nobody likes to get preached, but just in case ;) ). >> >> Best Regards, >> Tariq >> +91-9741563634 >> https://mtariq.jux.com/ >> >> >> On Mon, Dec 24, 2012 at 7:00 PM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote: >> >>> Thanks Mohammad, No, we do not have any plans to replace our RDBMS with >>> Hive. Hadoop/Hive will be used as Data Warehouse & batch processing >>> computing, as I said we want to use Hive for analytical queries. >>> >>> >>> -- >>> Ibrahim >>> >>> >>> On Mon, Dec 24, 2012 at 4:19 PM, Mohammad Tariq <[EMAIL PROTECTED]>wrote: >>> >>>> Hello Ibrahim, >>>> >>>> A quick questio. Are you planning to replace your SQL DB with >>>> Hive? If that is the case, I would not suggest to do that. Both are meant >>>> for entirely different purposes. Hive is for batch processing and not for >>>> real time system. So if you are requirements involve real time things, you >>>> need to think before moving ahead. >>>> >>>> Yes, Sqoop is 'the' tool. It is primarily meant for this purpose. >>>> >>>> HTH >>>> >>>> Best Regards, >>>> Tariq >>>> +91-9741563634 >>>> https://mtariq.jux.com/ >>>> >>>> >>>> On Mon, Dec 24, 2012 at 6:38 PM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote: >>>> >>>>> Hi All, >>>>> >>>>> We are new to hadoop and hive, we are trying to use hive to >>>>> run analytical queries and we are using sqoop to import data into hive, in >>>>> our RDBMS the data updated very frequently and this needs to be reflected >>>>> to hive. Hive does not support update/delete but there are many workarounds >>>>> to do this task. >>>>> >>>>> What's in our mind is importing all the tables into hive as is, then >>>>> we build the required tables for reporting. >>>>> >>>>> My questions are: >>>>> >>>>> 1. What is the best way to reflect MySQL updates into Hive with >>>>> minimal resources? >>>>> 2. Is sqoop the right tool to do the ETL? >>>>> 3. Is Hive the right tool to do this kind of queries or we should >>>>> search for alternatives? >>>>> >>>>> Any hint will be useful, thanks in advanced. >>>>> >>>>> -- >>>>> Ibrahim >>>>> >>>> >>>> >>> >> > +
Mohammad Tariq 2012-12-24, 14:03
-
Re: Reflect MySQL updates into HiveIbrahim Yakti 2012-12-24, 14:08
This already done, but Hive does not support update nor deletion of data,
so when I import the data after specific "last_update_time" records, hive will append it not replace. -- Ibrahim On Mon, Dec 24, 2012 at 5:03 PM, Mohammad Tariq <[EMAIL PROTECTED]> wrote: > You can use Apache Oozie to schedule your imports. > > Alternatively, you can have an additional column in your SQL table, say > LastUpdatedTime or something. As soon as there is a change in this column > you can start the import from this point. This way you don't have to import > all the things everytime there is a change in your table. You just have to > move only the most recent data, say only the 'delta' amount of data. > > Best Regards, > Tariq > +91-9741563634 > https://mtariq.jux.com/ > > > On Mon, Dec 24, 2012 at 7:08 PM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote: > >> My question was how to reflect MySQL updates to hadoop/hive, this is our >> problem now. >> >> >> -- >> Ibrahim >> >> >> On Mon, Dec 24, 2012 at 4:35 PM, Mohammad Tariq <[EMAIL PROTECTED]>wrote: >> >>> Cool. Then go ahead :) >>> >>> Just in case you need something in realtime, you can have a look at >>> Impala.(I know nobody likes to get preached, but just in case ;) ). >>> >>> Best Regards, >>> Tariq >>> +91-9741563634 >>> https://mtariq.jux.com/ >>> >>> >>> On Mon, Dec 24, 2012 at 7:00 PM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote: >>> >>>> Thanks Mohammad, No, we do not have any plans to replace our RDBMS with >>>> Hive. Hadoop/Hive will be used as Data Warehouse & batch processing >>>> computing, as I said we want to use Hive for analytical queries. >>>> >>>> >>>> -- >>>> Ibrahim >>>> >>>> >>>> On Mon, Dec 24, 2012 at 4:19 PM, Mohammad Tariq <[EMAIL PROTECTED]>wrote: >>>> >>>>> Hello Ibrahim, >>>>> >>>>> A quick questio. Are you planning to replace your SQL DB with >>>>> Hive? If that is the case, I would not suggest to do that. Both are meant >>>>> for entirely different purposes. Hive is for batch processing and not for >>>>> real time system. So if you are requirements involve real time things, you >>>>> need to think before moving ahead. >>>>> >>>>> Yes, Sqoop is 'the' tool. It is primarily meant for this purpose. >>>>> >>>>> HTH >>>>> >>>>> Best Regards, >>>>> Tariq >>>>> +91-9741563634 >>>>> https://mtariq.jux.com/ >>>>> >>>>> >>>>> On Mon, Dec 24, 2012 at 6:38 PM, Ibrahim Yakti <[EMAIL PROTECTED]>wrote: >>>>> >>>>>> Hi All, >>>>>> >>>>>> We are new to hadoop and hive, we are trying to use hive to >>>>>> run analytical queries and we are using sqoop to import data into hive, in >>>>>> our RDBMS the data updated very frequently and this needs to be reflected >>>>>> to hive. Hive does not support update/delete but there are many workarounds >>>>>> to do this task. >>>>>> >>>>>> What's in our mind is importing all the tables into hive as is, then >>>>>> we build the required tables for reporting. >>>>>> >>>>>> My questions are: >>>>>> >>>>>> 1. What is the best way to reflect MySQL updates into Hive with >>>>>> minimal resources? >>>>>> 2. Is sqoop the right tool to do the ETL? >>>>>> 3. Is Hive the right tool to do this kind of queries or we should >>>>>> search for alternatives? >>>>>> >>>>>> Any hint will be useful, thanks in advanced. >>>>>> >>>>>> -- >>>>>> Ibrahim >>>>>> >>>>> >>>>> >>>> >>> >> > +
Ibrahim Yakti 2012-12-24, 14:08
-
Re: Reflect MySQL updates into HiveMohammad Tariq 2012-12-24, 14:25
I was actually trying to answer you actual questions. What are you
currently doing to tackle this update problem and what kind of tweak you are looking for?There is no direct solution to achieve this, out-of-the-box, as you have said. Best Regards, Tariq +91-9741563634 https://mtariq.jux.com/ On Mon, Dec 24, 2012 at 7:38 PM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote: > This already done, but Hive does not support update nor deletion of data, > so when I import the data after specific "last_update_time" records, hive > will append it not replace. > > > -- > Ibrahim > > > On Mon, Dec 24, 2012 at 5:03 PM, Mohammad Tariq <[EMAIL PROTECTED]>wrote: > >> You can use Apache Oozie to schedule your imports. >> >> Alternatively, you can have an additional column in your SQL table, say >> LastUpdatedTime or something. As soon as there is a change in this column >> you can start the import from this point. This way you don't have to import >> all the things everytime there is a change in your table. You just have to >> move only the most recent data, say only the 'delta' amount of data. >> >> Best Regards, >> Tariq >> +91-9741563634 >> https://mtariq.jux.com/ >> >> >> On Mon, Dec 24, 2012 at 7:08 PM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote: >> >>> My question was how to reflect MySQL updates to hadoop/hive, this is our >>> problem now. >>> >>> >>> -- >>> Ibrahim >>> >>> >>> On Mon, Dec 24, 2012 at 4:35 PM, Mohammad Tariq <[EMAIL PROTECTED]>wrote: >>> >>>> Cool. Then go ahead :) >>>> >>>> Just in case you need something in realtime, you can have a look at >>>> Impala.(I know nobody likes to get preached, but just in case ;) ). >>>> >>>> Best Regards, >>>> Tariq >>>> +91-9741563634 >>>> https://mtariq.jux.com/ >>>> >>>> >>>> On Mon, Dec 24, 2012 at 7:00 PM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote: >>>> >>>>> Thanks Mohammad, No, we do not have any plans to replace our RDBMS >>>>> with Hive. Hadoop/Hive will be used as Data Warehouse & batch processing >>>>> computing, as I said we want to use Hive for analytical queries. >>>>> >>>>> >>>>> -- >>>>> Ibrahim >>>>> >>>>> >>>>> On Mon, Dec 24, 2012 at 4:19 PM, Mohammad Tariq <[EMAIL PROTECTED]>wrote: >>>>> >>>>>> Hello Ibrahim, >>>>>> >>>>>> A quick questio. Are you planning to replace your SQL DB with >>>>>> Hive? If that is the case, I would not suggest to do that. Both are meant >>>>>> for entirely different purposes. Hive is for batch processing and not for >>>>>> real time system. So if you are requirements involve real time things, you >>>>>> need to think before moving ahead. >>>>>> >>>>>> Yes, Sqoop is 'the' tool. It is primarily meant for this purpose. >>>>>> >>>>>> HTH >>>>>> >>>>>> Best Regards, >>>>>> Tariq >>>>>> +91-9741563634 >>>>>> https://mtariq.jux.com/ >>>>>> >>>>>> >>>>>> On Mon, Dec 24, 2012 at 6:38 PM, Ibrahim Yakti <[EMAIL PROTECTED]>wrote: >>>>>> >>>>>>> Hi All, >>>>>>> >>>>>>> We are new to hadoop and hive, we are trying to use hive to >>>>>>> run analytical queries and we are using sqoop to import data into hive, in >>>>>>> our RDBMS the data updated very frequently and this needs to be reflected >>>>>>> to hive. Hive does not support update/delete but there are many workarounds >>>>>>> to do this task. >>>>>>> >>>>>>> What's in our mind is importing all the tables into hive as is, then >>>>>>> we build the required tables for reporting. >>>>>>> >>>>>>> My questions are: >>>>>>> >>>>>>> 1. What is the best way to reflect MySQL updates into Hive with >>>>>>> minimal resources? >>>>>>> 2. Is sqoop the right tool to do the ETL? >>>>>>> 3. Is Hive the right tool to do this kind of queries or we >>>>>>> should search for alternatives? >>>>>>> >>>>>>> Any hint will be useful, thanks in advanced. >>>>>>> >>>>>>> -- >>>>>>> Ibrahim >>>>>>> >>>>>> >>>>>> >>>>> >>>> >>> >> > +
Mohammad Tariq 2012-12-24, 14:25
-
Re: Reflect MySQL updates into HiveIbrahim Yakti 2012-12-24, 14:28
What if you have many columns that need to be updated? a simple example:
confirmation date, payment status(es) + status update time, delivery, ... etc ? on what base you will set your partition and how the old data will be removed because the updated data will be reloaded in other partition if I partition using payment status for example. -- Ibrahim On Mon, Dec 24, 2012 at 5:25 PM, Mohammad Tariq <[EMAIL PROTECTED]> wrote: > I was actually trying to answer you actual questions. What are you > currently doing to tackle this update problem and what kind of tweak you > are looking for?There is no direct solution to achieve this, > out-of-the-box, as you have said. > > Best Regards, > Tariq > +91-9741563634 > https://mtariq.jux.com/ > > > On Mon, Dec 24, 2012 at 7:38 PM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote: > >> This already done, but Hive does not support update nor deletion of data, >> so when I import the data after specific "last_update_time" records, hive >> will append it not replace. >> >> >> -- >> Ibrahim >> >> >> On Mon, Dec 24, 2012 at 5:03 PM, Mohammad Tariq <[EMAIL PROTECTED]>wrote: >> >>> You can use Apache Oozie to schedule your imports. >>> >>> Alternatively, you can have an additional column in your SQL table, say >>> LastUpdatedTime or something. As soon as there is a change in this column >>> you can start the import from this point. This way you don't have to import >>> all the things everytime there is a change in your table. You just have to >>> move only the most recent data, say only the 'delta' amount of data. >>> >>> Best Regards, >>> Tariq >>> +91-9741563634 >>> https://mtariq.jux.com/ >>> >>> >>> On Mon, Dec 24, 2012 at 7:08 PM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote: >>> >>>> My question was how to reflect MySQL updates to hadoop/hive, this is >>>> our problem now. >>>> >>>> >>>> -- >>>> Ibrahim >>>> >>>> >>>> On Mon, Dec 24, 2012 at 4:35 PM, Mohammad Tariq <[EMAIL PROTECTED]>wrote: >>>> >>>>> Cool. Then go ahead :) >>>>> >>>>> Just in case you need something in realtime, you can have a look at >>>>> Impala.(I know nobody likes to get preached, but just in case ;) ). >>>>> >>>>> Best Regards, >>>>> Tariq >>>>> +91-9741563634 >>>>> https://mtariq.jux.com/ >>>>> >>>>> >>>>> On Mon, Dec 24, 2012 at 7:00 PM, Ibrahim Yakti <[EMAIL PROTECTED]>wrote: >>>>> >>>>>> Thanks Mohammad, No, we do not have any plans to replace our RDBMS >>>>>> with Hive. Hadoop/Hive will be used as Data Warehouse & batch processing >>>>>> computing, as I said we want to use Hive for analytical queries. >>>>>> >>>>>> >>>>>> -- >>>>>> Ibrahim >>>>>> >>>>>> >>>>>> On Mon, Dec 24, 2012 at 4:19 PM, Mohammad Tariq <[EMAIL PROTECTED]>wrote: >>>>>> >>>>>>> Hello Ibrahim, >>>>>>> >>>>>>> A quick questio. Are you planning to replace your SQL DB with >>>>>>> Hive? If that is the case, I would not suggest to do that. Both are meant >>>>>>> for entirely different purposes. Hive is for batch processing and not for >>>>>>> real time system. So if you are requirements involve real time things, you >>>>>>> need to think before moving ahead. >>>>>>> >>>>>>> Yes, Sqoop is 'the' tool. It is primarily meant for this purpose. >>>>>>> >>>>>>> HTH >>>>>>> >>>>>>> Best Regards, >>>>>>> Tariq >>>>>>> +91-9741563634 >>>>>>> https://mtariq.jux.com/ >>>>>>> >>>>>>> >>>>>>> On Mon, Dec 24, 2012 at 6:38 PM, Ibrahim Yakti <[EMAIL PROTECTED]>wrote: >>>>>>> >>>>>>>> Hi All, >>>>>>>> >>>>>>>> We are new to hadoop and hive, we are trying to use hive to >>>>>>>> run analytical queries and we are using sqoop to import data into hive, in >>>>>>>> our RDBMS the data updated very frequently and this needs to be reflected >>>>>>>> to hive. Hive does not support update/delete but there are many workarounds >>>>>>>> to do this task. >>>>>>>> >>>>>>>> What's in our mind is importing all the tables into hive as is, >>>>>>>> then we build the required tables for reporting. >>>>>>>> >>>>>>>> My questions are: >>>>>>>> >>>>>>>> 1. What is the best way to reflect MySQL updates into Hive with +
Ibrahim Yakti 2012-12-24, 14:28
-
Re: Reflect MySQL updates into HiveJeremiah Peschka 2012-12-24, 14:22
If it were me, I would find a way to identify the partitions that have
modified data and then re-load a subset of the partitions (only the ones with changes) on a regular basis. Instead of updating/deleting data, you'll be re-loading specific partitions as an all or nothing action. On Monday, December 24, 2012, Ibrahim Yakti wrote: > This already done, but Hive does not support update nor deletion of data, > so when I import the data after specific "last_update_time" records, hive > will append it not replace. > > > -- > Ibrahim > > > On Mon, Dec 24, 2012 at 5:03 PM, Mohammad Tariq <[EMAIL PROTECTED]>wrote: > > You can use Apache Oozie to schedule your imports. > > Alternatively, you can have an additional column in your SQL table, say > LastUpdatedTime or something. As soon as there is a change in this column > you can start the import from this point. This way you don't have to import > all the things everytime there is a change in your table. You just have to > move only the most recent data, say only the 'delta' amount of data. > > Best Regards, > Tariq > +91-9741563634 > https://mtariq.jux.com/ > > > On Mon, Dec 24, 2012 at 7:08 PM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote: > > My question was how to reflect MySQL updates to hadoop/hive, this is our > problem now. > > > -- > Ibrahim > > > On Mon, Dec 24, 2012 at 4:35 PM, Mohammad Tariq <[EMAIL PROTECTED]>wrote: > > Cool. Then go ahead :) > > Just in case you need something in realtime, you can have a look at > Impala.(I know nobody likes to get preached, but just in case ;) ). > > Best Regards, > Tariq > +91-9741563634 > https://mtariq.jux.com/ > > > On Mon, Dec 24, 2012 at 7:00 PM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote: > > Thanks Mohammad, No, we do not have any plans to replace our RDBMS with > Hive. Hadoop/Hive will be used as Data Warehouse & batch processing > computing, as I said we want to use Hive for analytical queries. > > > -- > Ibrahim > > > On Mon, Dec 24, 2012 at 4:19 PM, Mohammad Tariq <[EMAIL PROTECTED]>wrote: > > Hello Ibrahim, > > A quick questio. Are you planning to replace your SQL DB with Hive? > If that is the case, I would not suggest to do that. Both are meant for > entirely different purposes. Hive is for batch processing and not for real > time system. So if you are requirements involve real time things, you need > to think before moving ahead. > > Yes, Sqoop is 'the' tool. It is primarily meant for this purpose. > > HTH > > Best Regards, > Tariq > +91-9741563634 > https://mtariq.jux.com/ > > > On Mon, Dec 24, 2012 at 6:38 PM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote: > > Hi All, > > We are new to hadoop and hive, we are trying to use hive to > run analytical queries and we are using sqoop to import data into hive, in > our RDBMS the data updated very frequently and this needs to be reflected > to hive. Hive does not support update/delete but there are many workarounds > to do this task. > > What's in our mind is importing all the > > -- --- Jeremiah Peschka Founder, Brent Ozar Unlimited Microsoft SQL Server MVP +
Jeremiah Peschka 2012-12-24, 14:22
-
Re: Reflect MySQL updates into HiveEdward Capriolo 2012-12-24, 14:28
You can only do the last_update idea if this is an insert only dataset.
If your table takes updates you need a different strategy. 1) full dumps every interval. 2) Using a storage handler like hbase or cassandra that takes update operations On Mon, Dec 24, 2012 at 9:22 AM, Jeremiah Peschka < [EMAIL PROTECTED]> wrote: > If it were me, I would find a way to identify the partitions that have > modified data and then re-load a subset of the partitions (only the ones > with changes) on a regular basis. Instead of updating/deleting data, you'll > be re-loading specific partitions as an all or nothing action. > > On Monday, December 24, 2012, Ibrahim Yakti wrote: > >> This already done, but Hive does not support update nor deletion of data, >> so when I import the data after specific "last_update_time" records, hive >> will append it not replace. >> >> >> -- >> Ibrahim >> >> >> On Mon, Dec 24, 2012 at 5:03 PM, Mohammad Tariq <[EMAIL PROTECTED]>wrote: >> >> You can use Apache Oozie to schedule your imports. >> >> Alternatively, you can have an additional column in your SQL table, say >> LastUpdatedTime or something. As soon as there is a change in this column >> you can start the import from this point. This way you don't have to import >> all the things everytime there is a change in your table. You just have to >> move only the most recent data, say only the 'delta' amount of data. >> >> Best Regards, >> Tariq >> +91-9741563634 >> https://mtariq.jux.com/ >> >> >> On Mon, Dec 24, 2012 at 7:08 PM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote: >> >> My question was how to reflect MySQL updates to hadoop/hive, this is our >> problem now. >> >> >> -- >> Ibrahim >> >> >> On Mon, Dec 24, 2012 at 4:35 PM, Mohammad Tariq <[EMAIL PROTECTED]>wrote: >> >> Cool. Then go ahead :) >> >> Just in case you need something in realtime, you can have a look at >> Impala.(I know nobody likes to get preached, but just in case ;) ). >> >> Best Regards, >> Tariq >> +91-9741563634 >> https://mtariq.jux.com/ >> >> >> On Mon, Dec 24, 2012 at 7:00 PM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote: >> >> Thanks Mohammad, No, we do not have any plans to replace our RDBMS with >> Hive. Hadoop/Hive will be used as Data Warehouse & batch processing >> computing, as I said we want to use Hive for analytical queries. >> >> >> -- >> Ibrahim >> >> >> On Mon, Dec 24, 2012 at 4:19 PM, Mohammad Tariq <[EMAIL PROTECTED]>wrote: >> >> Hello Ibrahim, >> >> A quick questio. Are you planning to replace your SQL DB with Hive? >> If that is the case, I would not suggest to do that. Both are meant for >> entirely different purposes. Hive is for batch processing and not for real >> time system. So if you are requirements involve real time things, you need >> to think before moving ahead. >> >> Yes, Sqoop is 'the' tool. It is primarily meant for this purpose. >> >> HTH >> >> Best Regards, >> Tariq >> +91-9741563634 >> https://mtariq.jux.com/ >> >> >> On Mon, Dec 24, 2012 at 6:38 PM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote: >> >> Hi All, >> >> We are new to hadoop and hive, we are trying to use hive to >> run analytical queries and we are using sqoop to import data into hive, in >> our RDBMS the data updated very frequently and this needs to be reflected >> to hive. Hive does not support update/delete but there are many workarounds >> to do this task. >> >> What's in our mind is importing all the >> >> > > -- > --- > Jeremiah Peschka > Founder, Brent Ozar Unlimited > Microsoft SQL Server MVP > > +
Edward Capriolo 2012-12-24, 14:28
-
Re: Reflect MySQL updates into HiveMohammad Tariq 2012-12-24, 14:31
Good points by Edward. I specially love the point no. 2.
Best Regards, Tariq +91-9741563634 https://mtariq.jux.com/ On Mon, Dec 24, 2012 at 7:58 PM, Edward Capriolo <[EMAIL PROTECTED]>wrote: > You can only do the last_update idea if this is an insert only dataset. > > If your table takes updates you need a different strategy. > 1) full dumps every interval. > 2) Using a storage handler like hbase or cassandra that takes update > operations > > > > On Mon, Dec 24, 2012 at 9:22 AM, Jeremiah Peschka < > [EMAIL PROTECTED]> wrote: > >> If it were me, I would find a way to identify the partitions that have >> modified data and then re-load a subset of the partitions (only the ones >> with changes) on a regular basis. Instead of updating/deleting data, you'll >> be re-loading specific partitions as an all or nothing action. >> >> On Monday, December 24, 2012, Ibrahim Yakti wrote: >> >>> This already done, but Hive does not support update nor deletion of >>> data, so when I import the data after specific "last_update_time" records, >>> hive will append it not replace. >>> >>> >>> -- >>> Ibrahim >>> >>> >>> On Mon, Dec 24, 2012 at 5:03 PM, Mohammad Tariq <[EMAIL PROTECTED]>wrote: >>> >>> You can use Apache Oozie to schedule your imports. >>> >>> Alternatively, you can have an additional column in your SQL table, say >>> LastUpdatedTime or something. As soon as there is a change in this column >>> you can start the import from this point. This way you don't have to import >>> all the things everytime there is a change in your table. You just have to >>> move only the most recent data, say only the 'delta' amount of data. >>> >>> Best Regards, >>> Tariq >>> +91-9741563634 >>> https://mtariq.jux.com/ >>> >>> >>> On Mon, Dec 24, 2012 at 7:08 PM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote: >>> >>> My question was how to reflect MySQL updates to hadoop/hive, this is our >>> problem now. >>> >>> >>> -- >>> Ibrahim >>> >>> >>> On Mon, Dec 24, 2012 at 4:35 PM, Mohammad Tariq <[EMAIL PROTECTED]>wrote: >>> >>> Cool. Then go ahead :) >>> >>> Just in case you need something in realtime, you can have a look at >>> Impala.(I know nobody likes to get preached, but just in case ;) ). >>> >>> Best Regards, >>> Tariq >>> +91-9741563634 >>> https://mtariq.jux.com/ >>> >>> >>> On Mon, Dec 24, 2012 at 7:00 PM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote: >>> >>> Thanks Mohammad, No, we do not have any plans to replace our RDBMS with >>> Hive. Hadoop/Hive will be used as Data Warehouse & batch processing >>> computing, as I said we want to use Hive for analytical queries. >>> >>> >>> -- >>> Ibrahim >>> >>> >>> On Mon, Dec 24, 2012 at 4:19 PM, Mohammad Tariq <[EMAIL PROTECTED]>wrote: >>> >>> Hello Ibrahim, >>> >>> A quick questio. Are you planning to replace your SQL DB with Hive? >>> If that is the case, I would not suggest to do that. Both are meant for >>> entirely different purposes. Hive is for batch processing and not for real >>> time system. So if you are requirements involve real time things, you need >>> to think before moving ahead. >>> >>> Yes, Sqoop is 'the' tool. It is primarily meant for this purpose. >>> >>> HTH >>> >>> Best Regards, >>> Tariq >>> +91-9741563634 >>> https://mtariq.jux.com/ >>> >>> >>> On Mon, Dec 24, 2012 at 6:38 PM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote: >>> >>> Hi All, >>> >>> We are new to hadoop and hive, we are trying to use hive to >>> run analytical queries and we are using sqoop to import data into hive, in >>> our RDBMS the data updated very frequently and this needs to be reflected >>> to hive. Hive does not support update/delete but there are many workarounds >>> to do this task. >>> >>> What's in our mind is importing all the >>> >>> >> >> -- >> --- >> Jeremiah Peschka >> Founder, Brent Ozar Unlimited >> Microsoft SQL Server MVP >> >> > +
Mohammad Tariq 2012-12-24, 14:31
-
Re: Reflect MySQL updates into HiveIbrahim Yakti 2012-12-24, 14:29
Edward can you explain more please? you suggesting that I should use HBase
for such tasks instead of hive? -- Ibrahim On Mon, Dec 24, 2012 at 5:28 PM, Edward Capriolo <[EMAIL PROTECTED]>wrote: > You can only do the last_update idea if this is an insert only dataset. > > If your table takes updates you need a different strategy. > 1) full dumps every interval. > 2) Using a storage handler like hbase or cassandra that takes update > operations > > > > On Mon, Dec 24, 2012 at 9:22 AM, Jeremiah Peschka < > [EMAIL PROTECTED]> wrote: > >> If it were me, I would find a way to identify the partitions that have >> modified data and then re-load a subset of the partitions (only the ones >> with changes) on a regular basis. Instead of updating/deleting data, you'll >> be re-loading specific partitions as an all or nothing action. >> >> On Monday, December 24, 2012, Ibrahim Yakti wrote: >> >>> This already done, but Hive does not support update nor deletion of >>> data, so when I import the data after specific "last_update_time" records, >>> hive will append it not replace. >>> >>> >>> -- >>> Ibrahim >>> >>> >>> On Mon, Dec 24, 2012 at 5:03 PM, Mohammad Tariq <[EMAIL PROTECTED]>wrote: >>> >>> You can use Apache Oozie to schedule your imports. >>> >>> Alternatively, you can have an additional column in your SQL table, say >>> LastUpdatedTime or something. As soon as there is a change in this column >>> you can start the import from this point. This way you don't have to import >>> all the things everytime there is a change in your table. You just have to >>> move only the most recent data, say only the 'delta' amount of data. >>> >>> Best Regards, >>> Tariq >>> +91-9741563634 >>> https://mtariq.jux.com/ >>> >>> >>> On Mon, Dec 24, 2012 at 7:08 PM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote: >>> >>> My question was how to reflect MySQL updates to hadoop/hive, this is our >>> problem now. >>> >>> >>> -- >>> Ibrahim >>> >>> >>> On Mon, Dec 24, 2012 at 4:35 PM, Mohammad Tariq <[EMAIL PROTECTED]>wrote: >>> >>> Cool. Then go ahead :) >>> >>> Just in case you need something in realtime, you can have a look at >>> Impala.(I know nobody likes to get preached, but just in case ;) ). >>> >>> Best Regards, >>> Tariq >>> +91-9741563634 >>> https://mtariq.jux.com/ >>> >>> >>> On Mon, Dec 24, 2012 at 7:00 PM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote: >>> >>> Thanks Mohammad, No, we do not have any plans to replace our RDBMS with >>> Hive. Hadoop/Hive will be used as Data Warehouse & batch processing >>> computing, as I said we want to use Hive for analytical queries. >>> >>> >>> -- >>> Ibrahim >>> >>> >>> On Mon, Dec 24, 2012 at 4:19 PM, Mohammad Tariq <[EMAIL PROTECTED]>wrote: >>> >>> Hello Ibrahim, >>> >>> A quick questio. Are you planning to replace your SQL DB with Hive? >>> If that is the case, I would not suggest to do that. Both are meant for >>> entirely different purposes. Hive is for batch processing and not for real >>> time system. So if you are requirements involve real time things, you need >>> to think before moving ahead. >>> >>> Yes, Sqoop is 'the' tool. It is primarily meant for this purpose. >>> >>> HTH >>> >>> Best Regards, >>> Tariq >>> +91-9741563634 >>> https://mtariq.jux.com/ >>> >>> >>> On Mon, Dec 24, 2012 at 6:38 PM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote: >>> >>> Hi All, >>> >>> We are new to hadoop and hive, we are trying to use hive to >>> run analytical queries and we are using sqoop to import data into hive, in >>> our RDBMS the data updated very frequently and this needs to be reflected >>> to hive. Hive does not support update/delete but there are many workarounds >>> to do this task. >>> >>> What's in our mind is importing all the >>> >>> >> >> -- >> --- >> Jeremiah Peschka >> Founder, Brent Ozar Unlimited >> Microsoft SQL Server MVP >> >> > +
Ibrahim Yakti 2012-12-24, 14:29
-
Re: Reflect MySQL updates into HiveEdward Capriolo 2012-12-24, 14:37
Hive can not easily handle updates. The most creative way I saw this done
was someone managed to capture all updates and then use union queries which rewrote the same hive table with the newest value. original + union delta + column with latest timestamp = new original But that is a lot of processing especially when you may not have man updates. Hive has storage handlers that let you lay a table over hbase and cassandra data. Store your data in those systems, they take updates, then use hive to query those. On Mon, Dec 24, 2012 at 9:29 AM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote: > Edward can you explain more please? you suggesting that I should use HBase > for such tasks instead of hive? > > > -- > Ibrahim > > > On Mon, Dec 24, 2012 at 5:28 PM, Edward Capriolo <[EMAIL PROTECTED]>wrote: > >> You can only do the last_update idea if this is an insert only dataset. >> >> If your table takes updates you need a different strategy. >> 1) full dumps every interval. >> 2) Using a storage handler like hbase or cassandra that takes update >> operations >> >> >> >> On Mon, Dec 24, 2012 at 9:22 AM, Jeremiah Peschka < >> [EMAIL PROTECTED]> wrote: >> >>> If it were me, I would find a way to identify the partitions that have >>> modified data and then re-load a subset of the partitions (only the ones >>> with changes) on a regular basis. Instead of updating/deleting data, you'll >>> be re-loading specific partitions as an all or nothing action. >>> >>> On Monday, December 24, 2012, Ibrahim Yakti wrote: >>> >>>> This already done, but Hive does not support update nor deletion of >>>> data, so when I import the data after specific "last_update_time" records, >>>> hive will append it not replace. >>>> >>>> >>>> -- >>>> Ibrahim >>>> >>>> >>>> On Mon, Dec 24, 2012 at 5:03 PM, Mohammad Tariq <[EMAIL PROTECTED]>wrote: >>>> >>>> You can use Apache Oozie to schedule your imports. >>>> >>>> Alternatively, you can have an additional column in your SQL table, say >>>> LastUpdatedTime or something. As soon as there is a change in this column >>>> you can start the import from this point. This way you don't have to import >>>> all the things everytime there is a change in your table. You just have to >>>> move only the most recent data, say only the 'delta' amount of data. >>>> >>>> Best Regards, >>>> Tariq >>>> +91-9741563634 >>>> https://mtariq.jux.com/ >>>> >>>> >>>> On Mon, Dec 24, 2012 at 7:08 PM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote: >>>> >>>> My question was how to reflect MySQL updates to hadoop/hive, this is >>>> our problem now. >>>> >>>> >>>> -- >>>> Ibrahim >>>> >>>> >>>> On Mon, Dec 24, 2012 at 4:35 PM, Mohammad Tariq <[EMAIL PROTECTED]>wrote: >>>> >>>> Cool. Then go ahead :) >>>> >>>> Just in case you need something in realtime, you can have a look at >>>> Impala.(I know nobody likes to get preached, but just in case ;) ). >>>> >>>> Best Regards, >>>> Tariq >>>> +91-9741563634 >>>> https://mtariq.jux.com/ >>>> >>>> >>>> On Mon, Dec 24, 2012 at 7:00 PM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote: >>>> >>>> Thanks Mohammad, No, we do not have any plans to replace our RDBMS with >>>> Hive. Hadoop/Hive will be used as Data Warehouse & batch processing >>>> computing, as I said we want to use Hive for analytical queries. >>>> >>>> >>>> -- >>>> Ibrahim >>>> >>>> >>>> On Mon, Dec 24, 2012 at 4:19 PM, Mohammad Tariq <[EMAIL PROTECTED]>wrote: >>>> >>>> Hello Ibrahim, >>>> >>>> A quick questio. Are you planning to replace your SQL DB with >>>> Hive? If that is the case, I would not suggest to do that. Both are meant >>>> for entirely different purposes. Hive is for batch processing and not for >>>> real time system. So if you are requirements involve real time things, you >>>> need to think before moving ahead. >>>> >>>> Yes, Sqoop is 'the' tool. It is primarily meant for this purpose. >>>> >>>> HTH >>>> >>>> Best Regards, >>>> Tariq >>>> +91-9741563634 >>>> https://mtariq.jux.com/ >>>> >>>> >>>> On Mon, Dec 24, 2012 at 6:38 PM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote: +
Edward Capriolo 2012-12-24, 14:37
-
Re: Reflect MySQL updates into HiveIbrahim Yakti 2012-12-24, 14:41
Bottom line: use sqoop to import data into HBase/Cassandra for storage and
use Hive to query the data using external tables, did I miss anything? -- Ibrahim On Mon, Dec 24, 2012 at 5:37 PM, Edward Capriolo <[EMAIL PROTECTED]>wrote: > Hive can not easily handle updates. The most creative way I saw this done > was someone managed to capture all updates and then use union queries which > rewrote the same hive table with the newest value. > > original + union delta + column with latest timestamp = new original > > But that is a lot of processing especially when you may not have man > updates. Hive has storage handlers that let you lay a table over hbase and > cassandra data. Store your data in those systems, they take updates, then > use hive to query those. > > > On Mon, Dec 24, 2012 at 9:29 AM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote: > >> Edward can you explain more please? you suggesting that I should use >> HBase for such tasks instead of hive? >> >> >> -- >> Ibrahim >> >> >> On Mon, Dec 24, 2012 at 5:28 PM, Edward Capriolo <[EMAIL PROTECTED]>wrote: >> >>> You can only do the last_update idea if this is an insert only dataset. >>> >>> If your table takes updates you need a different strategy. >>> 1) full dumps every interval. >>> 2) Using a storage handler like hbase or cassandra that takes update >>> operations >>> >>> >>> >>> On Mon, Dec 24, 2012 at 9:22 AM, Jeremiah Peschka < >>> [EMAIL PROTECTED]> wrote: >>> >>>> If it were me, I would find a way to identify the partitions that have >>>> modified data and then re-load a subset of the partitions (only the ones >>>> with changes) on a regular basis. Instead of updating/deleting data, you'll >>>> be re-loading specific partitions as an all or nothing action. >>>> >>>> On Monday, December 24, 2012, Ibrahim Yakti wrote: >>>> >>>>> This already done, but Hive does not support update nor deletion of >>>>> data, so when I import the data after specific "last_update_time" records, >>>>> hive will append it not replace. >>>>> >>>>> >>>>> -- >>>>> Ibrahim >>>>> >>>>> >>>>> On Mon, Dec 24, 2012 at 5:03 PM, Mohammad Tariq <[EMAIL PROTECTED]>wrote: >>>>> >>>>> You can use Apache Oozie to schedule your imports. >>>>> >>>>> Alternatively, you can have an additional column in your SQL table, >>>>> say LastUpdatedTime or something. As soon as there is a change in this >>>>> column you can start the import from this point. This way you don't have to >>>>> import all the things everytime there is a change in your table. You just >>>>> have to move only the most recent data, say only the 'delta' amount of data. >>>>> >>>>> Best Regards, >>>>> Tariq >>>>> +91-9741563634 >>>>> https://mtariq.jux.com/ >>>>> >>>>> >>>>> On Mon, Dec 24, 2012 at 7:08 PM, Ibrahim Yakti <[EMAIL PROTECTED]>wrote: >>>>> >>>>> My question was how to reflect MySQL updates to hadoop/hive, this is >>>>> our problem now. >>>>> >>>>> >>>>> -- >>>>> Ibrahim >>>>> >>>>> >>>>> On Mon, Dec 24, 2012 at 4:35 PM, Mohammad Tariq <[EMAIL PROTECTED]>wrote: >>>>> >>>>> Cool. Then go ahead :) >>>>> >>>>> Just in case you need something in realtime, you can have a look at >>>>> Impala.(I know nobody likes to get preached, but just in case ;) ). >>>>> >>>>> Best Regards, >>>>> Tariq >>>>> +91-9741563634 >>>>> https://mtariq.jux.com/ >>>>> >>>>> >>>>> On Mon, Dec 24, 2012 at 7:00 PM, Ibrahim Yakti <[EMAIL PROTECTED]>wrote: >>>>> >>>>> Thanks Mohammad, No, we do not have any plans to replace our RDBMS >>>>> with Hive. Hadoop/Hive will be used as Data Warehouse & batch processing >>>>> computing, as I said we want to use Hive for analytical queries. >>>>> >>>>> >>>>> -- >>>>> Ibrahim >>>>> >>>>> >>>>> On Mon, Dec 24, 2012 at 4:19 PM, Mohammad Tariq <[EMAIL PROTECTED]>wrote: >>>>> >>>>> Hello Ibrahim, >>>>> >>>>> A quick questio. Are you planning to replace your SQL DB with >>>>> Hive? If that is the case, I would not suggest to do that. Both are meant >>>>> for entirely different purposes. Hive is for batch processing and not for +
Ibrahim Yakti 2012-12-24, 14:41
|