I'm biased, but I'd recommend checking out Sqoop (
http://github.com/cloudera/sqoop) for moving data from relational databases into
HDFS/Hive/HBase and Flume (http://github.com/cloudera/flume) for moving log
files into HDFS/Hive/HBase.
For moving large sets of files into HDFS, I think distcp (
http://archive.cloudera.com/cdh/3/hadoop-0.20.2+320/distcp.html) is your
best bet.
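For the scripted approach you describe below, a minimal sketch might look like
the following. Everything here is hypothetical -- the staging directory, the
/feeds HDFS layout, and the Sqoop connect string are made up for illustration --
and the script echoes each command rather than executing it, since binaries and
cluster layout vary by install:

```shell
#!/bin/sh
# Hypothetical sketch of a scheduled (e.g. cron-driven) load script.
# RUN=echo prints each command instead of executing it; unset RUN on a
# real cluster where the hadoop/sqoop binaries are on the PATH.
RUN=echo

STAGING=/data/staging            # hypothetical: where fetch jobs drop feed files
DAY=$(date +%Y/%m/%d)            # partition HDFS directories by date

# Make the day's directory, then push each staged file into HDFS.
$RUN hadoop fs -mkdir /feeds/"$DAY"
for f in "$STAGING"/*; do
  [ -e "$f" ] || continue        # skip when the staging dir is empty
  $RUN hadoop fs -put "$f" /feeds/"$DAY"/
done

# For a relational source, a Sqoop import of a table into the same
# date-partitioned layout (connect string and table name are made up):
$RUN sqoop import --connect jdbc:mysql://dbhost/sales --table orders \
    --target-dir /feeds/"$DAY"/orders
```

The point of the date-based target directory is that later MapReduce jobs can
select a day's worth of input just by path, without scanning everything.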
On Fri, Jul 16, 2010 at 4:51 AM, Urckle <[EMAIL PROTECTED]> wrote:
> Hadoop version: 0.20.2
> MR coding will be done in java.
> Just starting out with my first Hadoop setup. I would like to know whether
> there are any best-practice ways to load data into the DFS. I have
> (obviously) put data files into HDFS manually using the shell commands while
> playing with it at setup, but going forward I will want to retrieve large
> numbers of data feeds from remote, third-party locations and throw them
> into Hadoop for later analysis. What is the best way to automate this? Is
> it to gather the retrieved files into known locations to be mounted, and
> then automate putting them into HDFS via a script? Or is there some other
> practice? I've not been able to find a specific use case yet... all the docs
> cover the basic fs commands without giving much detail about more advanced setups.
> thanks for any info