-
Bulk loading disadvantages
Sever Fundatureanu, 2012-07-26, 16:39

Hello,

For the bulk loading process, the HBase documentation mentions that in a 2nd stage "the appropriate Region Server adopts the HFile, moving it into its storage directory and making the data available to clients." But in my experience the files also remain in the original location from which they are "adopted". So I guess the data is actually copied into the HBase directory, right? This means that, compared to online importing, bulk loading essentially needs twice the disk space on HDFS, right?

Another problem is data locality immediately after bulk loading through MR. I understand that locality is obtained over time through compactions and splits. However, you don't get this problem when importing online, right?

Thanks in advance,
Sever
-
Re: Bulk loading disadvantages
Sateesh Lakkarsu, 2012-07-26, 16:47

> For the bulkloading process, the HBase documentation mentions that in
> a 2nd stage "the appropriate Region Server adopts the HFile, moving it
> into its storage directory and making the data available to clients."
> But from my experience the files also remain in the original location
> from where they are "adopted". So I guess the data is actually copied
> into the HBase directory right? This means that, compared to the
> online importing, when bulk loading you essentially need twice the
> disk space on HDFS, right?

Yes, if you are generating HFiles on one cluster and loading into a separate HBase cluster. If they are co-located, it's just an HDFS mv.

> Another problem is with data locality immediately after bulk loading
> through MR. I understand that the locality is obtained in time through
> compactions and splits. However you don't get this problem while
> importing online, right?

Yes
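
The copy-versus-move distinction in this reply can be put into numbers. The following is only a rough sketch: the replication factor of 3 and the 10 GiB input are assumptions chosen for illustration (neither appears in the thread), and `peakRawBytes` is a hypothetical helper, not part of HBase or Hadoop.

```java
// Rough estimate of raw HDFS usage right after a bulk load.
// All inputs are illustrative assumptions, not values from the thread.
final class BulkLoadSpace {

    /**
     * @param hfileBytes  logical size of the generated HFiles
     * @param replication HDFS replication factor (assumed)
     * @param copied      true if the load copies the files,
     *                    false if it only renames (moves) them
     */
    static long peakRawBytes(long hfileBytes, int replication, boolean copied) {
        // After a copy, source and destination exist side by side
        // until the source directory is deleted.
        long copies = copied ? 2L : 1L;
        return hfileBytes * replication * copies;
    }

    public static void main(String[] args) {
        long tenGiB = 10L * 1024 * 1024 * 1024;
        System.out.println("move: " + peakRawBytes(tenGiB, 3, false) + " bytes");
        System.out.println("copy: " + peakRawBytes(tenGiB, 3, true) + " bytes");
    }
}
```

In the copy case the extra space is only needed until the source directory is removed, which matches the later observation in this thread that the table remains readable after the source is deleted.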
-
Re: Bulk loading disadvantages
Sever Fundatureanu, 2012-07-26, 22:46

On Thu, Jul 26, 2012 at 6:47 PM, Sateesh Lakkarsu <[EMAIL PROTECTED]> wrote:
> Yes, if you are generating HFiles on one cluster and loading into a
> separate hbase cluster. If they are co-located, its just a hdfs mv.

Hmm, both the HFile generation and the HBase cluster run on top of the same HDFS cluster. I did a "du" on both the source HDFS directory and the destination "/hbase" directory and got the same sizes (+- a few bytes). I deleted the source directory from HDFS and then scanned the table without any problems. Maybe there is a config parameter I'm missing?

Sever
-
Re: Bulk loading disadvantages
Anil Gupta, 2012-07-27, 03:40

Hi Sever,

That's very interesting. Which Hadoop and HBase versions are you using? I am going to run bulk loads tomorrow. If you can tell me which directories in HDFS you compared with /hbase/$table, I will try to check the same.

Best Regards,
Anil
-
Re: Bulk loading disadvantages
Bijeet Singh, 2012-07-27, 06:17

Anil,

The two directories in question here are:

1. the HDFS location where the MapReduce job creates the HFiles
2. the directory pointed to by hbase.rootdir in your HBase configuration - the default value is /hbase.

Inside the HBase root directory, there are per-table subdirectories. So for the kind of comparison that you mentioned, you need to look in the directory <hbase.rootdir>/<table-name> and in the directory where you are creating the HFiles.

Bijeet
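
To make the <hbase.rootdir>/<table-name> layout concrete, here is a small sketch that splits a store-file URI into its components. The table/region/family/file layout is inferred from the region server log paths quoted later in this thread (HBase 0.94.0) and may differ in other versions. `StoreFilePath` is a hypothetical helper written with plain `java.net.URI`; it is not an HBase class.

```java
import java.net.URI;

// Splits a store-file URI into the path components seen in the thread's logs.
// Assumed layout: <hbase.rootdir>/<table>/<region>/<family>/<hfile>
final class StoreFilePath {
    final String table;
    final String region;
    final String family;
    final String file;

    private StoreFilePath(String table, String region, String family, String file) {
        this.table = table;
        this.region = region;
        this.family = family;
        this.file = file;
    }

    /** rootDir must not end with a slash, e.g. "hdfs://host:8020/hbase". */
    static StoreFilePath parse(String rootDir, String storeFile) {
        String rootPath = URI.create(rootDir).getPath();
        String filePath = URI.create(storeFile).getPath();
        if (!filePath.startsWith(rootPath + "/")) {
            throw new IllegalArgumentException("not under " + rootDir + ": " + storeFile);
        }
        String[] parts = filePath.substring(rootPath.length() + 1).split("/");
        if (parts.length != 4) {
            throw new IllegalArgumentException("expected table/region/family/file: " + storeFile);
        }
        return new StoreFilePath(parts[0], parts[1], parts[2], parts[3]);
    }

    /** The per-table directory to compare against the HFile output directory. */
    static String tableDir(String rootDir, String table) {
        return rootDir + "/" + table;
    }

    public static void main(String[] args) {
        String root = "hdfs://fs0.cm.cluster:8020/hbase";
        StoreFilePath p = parse(root,
            root + "/String2Id_bsbm/9028c6a70b30a089b4312c622729e98e/F/c4bbf70a6654422db81884f15f34c712");
        System.out.println("table dir: " + tableDir(root, p.table));
        System.out.println("family:    " + p.family);
    }
}
```

The directory returned by `tableDir` is the one to run "du" on when comparing against the MapReduce output directory.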
-
Re: Bulk loading disadvantages
Sever Fundatureanu, 2012-07-27, 11:17

Hi Anil,

I am using HBase 0.94.0 with Hadoop 1.0.0. The directories are indeed the ones mentioned by Bijeet. I can also add that I am doing the 2nd stage programmatically by calling doBulkLoad(org.apache.hadoop.fs.Path sourceDir, HTable table) on a LoadIncrementalHFiles object.

Best,
Sever

--
Sever Fundatureanu
Vrije Universiteit Amsterdam
E-mail: [EMAIL PROTECTED]
-
Re: Bulk loading disadvantages
Sever Fundatureanu, 2012-07-27, 13:46

After digging a bit I've found my problem comes from the following lines in the Store class:

    void bulkLoadHFile(String srcPathStr) throws IOException {
      Path srcPath = new Path(srcPathStr);

      // Move the file if it's on another filesystem
      FileSystem srcFs = srcPath.getFileSystem(conf);
      if (!srcFs.equals(fs)) {
        LOG.info("File " + srcPath + " on different filesystem than " +
            "destination store - moving to this filesystem.");
        Path tmpPath = getTmpPath();
        FileUtil.copy(srcFs, srcPath, fs, tmpPath, false, conf);
        LOG.info("Copied to temporary path on dst filesystem: " + tmpPath);
        srcPath = tmpPath;
      }

The equality for the 2 filesystems fails in my case and I get the following log:

    2012-07-27 14:47:25,321 INFO org.apache.hadoop.hbase.regionserver.Store: File hdfs://fs0.cm.cluster:8020/user/sfu200/outputBsbm/string2Id/F/e6cf2d1b69354e268b79597bf3855357 on different filesystem than destination store - moving to this filesystem.
    2012-07-27 14:47:27,286 INFO org.apache.hadoop.hbase.regionserver.Store: Copied to temporary path on dst filesystem: hdfs://fs0.cm.cluster:8020/hbase/String2Id_bsbm/9028c6a70b30a089b4312c622729e98e/.tmp/90f6b193e6fd48ba8e814c968179abb9
    2012-07-27 14:47:27,286 DEBUG org.apache.hadoop.hbase.regionserver.Store: Renaming bulk load file hdfs://fs0.cm.cluster:8020/hbase/String2Id_bsbm/9028c6a70b30a089b4312c622729e98e/.tmp/90f6b193e6fd48ba8e814c968179abb9 to hdfs://fs0.cm.cluster:8020/hbase/String2Id_bsbm/9028c6a70b30a089b4312c622729e98e/F/c4bbf70a6654422db81884f15f34c712
    2012-07-27 14:47:27,297 INFO org.apache.hadoop.hbase.regionserver.StoreFile: HFile Bloom filter type for c4bbf70a6654422db81884f15f34c712: NONE, but ROW specified in column family configuration
    2012-07-27 14:47:27,297 INFO org.apache.hadoop.hbase.regionserver.Store: Moved hfile hdfs://fs0.cm.cluster:8020/hbase/String2Id_bsbm/9028c6a70b30a089b4312c622729e98e/.tmp/90f6b193e6fd48ba8e814c968179abb9 into store directory hdfs://fs0.cm.cluster:8020/hbase/String2Id_bsbm/9028c6a70b30a089b4312c622729e98e/F - updating store file list.
    2012-07-27 14:47:27,297 INFO org.apache.hadoop.hbase.regionserver.Store: Successfully loaded store file hdfs://fs0.cm.cluster:8020/hbase/String2Id_bsbm/9028c6a70b30a089b4312c622729e98e/.tmp/90f6b193e6fd48ba8e814c968179abb9 into store F (new location: hdfs://fs0.cm.cluster:8020/hbase/String2Id_bsbm/9028c6a70b30a089b4312c622729e98e/F/c4bbf70a6654422db81884f15f34c712)

In my hbase-site.xml I have:

    <property>
      <name>hbase.rootdir</name>
      <value>hdfs://fs0.cm.cluster:8020/hbase</value>
      <description>The directory shared by RegionServers.
      </description>
    </property>

and in my hdfs-site.xml I have:

    <property>
      <name>fs.default.name</name>
      <value>hdfs://fs0.cm.cluster:8020</value>
    </property>

As you can see they point to the same namenode. So I really don't understand why the above check fails.

Regards,
Sever

--
Sever Fundatureanu
Vrije Universiteit Amsterdam
E-mail: [EMAIL PROTECTED]
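
In the log above, the source file and the destination store both live under hdfs://fs0.cm.cluster:8020, so a comparison based on the URI scheme and authority would treat them as one filesystem. The sketch below shows such a comparison using only `java.net.URI`. `sameFileSystem` is a hypothetical helper, not what the Store class does and not an HBase or Hadoop API. It only illustrates that the `srcFs.equals(fs)` test must be comparing something other than these two URI parts. One possible explanation, offered as an assumption to check against the Hadoop 1.0.0 source, is that `FileSystem` does not override `equals`, in which case the test is an object-identity check that fails whenever the two sides are different instances.

```java
import java.net.URI;

// Compares two locations by URI scheme and authority only.
// This is a sketch; it does not resolve default ports or hostname aliases.
final class SameFileSystem {

    static boolean sameFileSystem(String a, String b) {
        URI ua = URI.create(a);
        URI ub = URI.create(b);
        return equalsIgnoreCase(ua.getScheme(), ub.getScheme())
            && equalsIgnoreCase(ua.getAuthority(), ub.getAuthority());
    }

    // Null-safe, case-insensitive comparison (scheme and host are case-insensitive).
    private static boolean equalsIgnoreCase(String x, String y) {
        return x == null ? y == null : x.equalsIgnoreCase(y);
    }

    public static void main(String[] args) {
        String src = "hdfs://fs0.cm.cluster:8020/user/sfu200/outputBsbm/string2Id";
        String dst = "hdfs://fs0.cm.cluster:8020/hbase";
        System.out.println("same filesystem: " + sameFileSystem(src, dst));
    }
}
```

A check of this kind would let a same-cluster bulk load take the rename path instead of the copy path, at the cost of treating two spellings of the same namenode address as different filesystems.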
-
Re: Bulk loading disadvantages
Alex Baranau, 2012-07-27, 14:01

> Another problem is with data locality immediately after bulk loading
> through MR.

You might find this recent discussion about that useful: [1]

Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr

[1] The start is here: http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201207.mbox/%3CCAA7+[EMAIL PROTECTED]%3E but the thread then breaks because "FWD"/"RES" prefixes were added to the subject. You can also find it here: http://search-hadoop.com/?q=bulk+import+and+data+locality