-
Best way to write files to hdfs (from a Python app)
Bjoern Schiessle 2010-08-09, 16:18
Hi all,
I am developing a web application with Django (Python) which should access an hbase database and store large files in hdfs.
I wonder what the best way is to write files to hdfs from my Django app. Basically I thought about two ways, but maybe you know a better option:
1. First store the file on the local file system and then move it to hdfs with the thrift interface. (Downside: always needs enough space on the web application server.)
2. Use hdfs-fuse to mount the hdfs file system and write the file directly to hdfs. (Downside: I don't know how well hdfs-fuse is supported, and I'm not sure it is a good idea to mount the file system and run large operations on it.)
Since I'm new to hdfs and Hadoop in general, I'm not sure which way is the best and least error-prone.
What would be your recommendation?
Thanks a lot! Björn
-
Re: Best way to write files to hdfs (from a Python app)
Philip Zeyliger 2010-08-09, 23:35
Hi Bjoern,
To give you an example of how this may be done, HUE, under the covers, pipes your data to 'bin/hadoop fs -Dhadoop.job.ugi=user,group put - path'. (That's from memory, but it's approximately right; the full python code is at http://github.com/cloudera/hue/blob/master/desktop/libs/hadoop/src/hadoop/fs/hadoopfs.py#L692)
Cheers,
-- Philip
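A minimal Python sketch of the piping approach Philip describes, assuming a `hadoop` launcher is on the PATH of the web server and that `hadoop fs -put - <path>` reads the file content from standard input. The function names and the destination path are hypothetical, and this is not HUE's actual code:

```python
import shutil
import subprocess


def build_put_command(dest_path, hadoop_bin="hadoop"):
    """Return the argv for 'hadoop fs -put - <dest_path>' (upload from stdin)."""
    return [hadoop_bin, "fs", "-put", "-", dest_path]


def put_stream(fileobj, dest_path, hadoop_bin="hadoop"):
    """Stream a binary file-like object into HDFS without a local temp file."""
    proc = subprocess.Popen(build_put_command(dest_path, hadoop_bin),
                            stdin=subprocess.PIPE)
    try:
        # Copy in chunks so a large upload never sits fully in memory.
        shutil.copyfileobj(fileobj, proc.stdin)
    finally:
        proc.stdin.close()
    returncode = proc.wait()
    if returncode != 0:
        raise RuntimeError("hadoop fs -put exited with %d" % returncode)
```

In a Django view, `fileobj` could be the uploaded file object, so the data goes from the request straight to the `hadoop` process and the web server needs no scratch space for it.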
-
Re: Best way to write files to hdfs (from a Python app)
Bjoern Schiessle 2010-08-10, 12:06
Hi Philip,
On Mon, 9 Aug 2010 16:35:07 -0700 Philip Zeyliger wrote:
> To give you an example of how this may be done, HUE, under the covers,
> pipes your data to 'bin/hadoop fs -Dhadoop.job.ugi=user,group put -
> path'. (That's from memory, but it's approximately right; the full
> python code is at
> http://github.com/cloudera/hue/blob/master/desktop/libs/hadoop/src/hadoop/fs/hadoopfs.py#L692)
Thank you! If I understand it correctly, this only works if my python app runs on the same server as hadoop, right? I would like to run the python app on a different server, hence my two ideas: (1) Thrift or (2) hdfs-fuse.
Thrift seems to be able to store only string content in hdfs, but no binary files. At least I couldn't find an interface for a simple put operation.
So at the moment I'm not sure how to continue. Any ideas?
Thanks, Björn
-
Re: Best way to write files to hdfs (from a Python app)
Philip Zeyliger 2010-08-10, 16:39
On Tue, Aug 10, 2010 at 5:06 AM, Bjoern Schiessle <[EMAIL PROTECTED]> wrote:
> Thank you! If I understand it correctly this only works if my python app
> runs on the same server as hadoop, right?
It works only if your python app has network connectivity to your namenode. You can access an explicitly specified HDFS by passing -Dfs.default.name=hdfs://<namenode>:<namenode_port>/ . (The default is read from hadoop-site.xml (or perhaps hdfs-site.xml), and, I think, defaults to file:///).
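As a rough illustration of passing the namenode explicitly, the sketch below builds a `hadoop fs` command line with the `-Dfs.default.name=...` option placed before the file system command. The helper names are hypothetical, and the host and port are placeholders to be replaced with your own namenode address:

```python
import subprocess


def namenode_option(host, port):
    """Build the -D option that points the client at a specific namenode."""
    return "-Dfs.default.name=hdfs://%s:%d/" % (host, port)


def build_fs_command(fs_args, host, port, hadoop_bin="hadoop"):
    """Return argv for 'hadoop fs -Dfs.default.name=... <fs_args>'."""
    return [hadoop_bin, "fs", namenode_option(host, port)] + list(fs_args)


def list_root(host, port):
    """Run 'hadoop fs -ls /' against the given namenode and return its output."""
    return subprocess.check_output(build_fs_command(["-ls", "/"], host, port))
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems when paths contain spaces.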
-
Re: Best way to write files to hdfs (from a Python app)
Travis Crawford 2010-08-11, 04:17
Has anyone tried using swig to wrap libhdfs? I spent some time today doing this, and it seems like it could be a great solution, but it's also a fair amount of work (especially having never used swig before). If this seems generally worthwhile I could finish it up.
Or is the thrift interface the API to use? Is anyone successfully using it?
I'm primarily interested in building some filesystem management + reporting tools, so being slower than the Java interface is not problematic. I'd prefer not to parse the command-line tool output though :)
--travis
-
Re: Best way to write files to hdfs (from a Python app)
Bjoern Schiessle 2010-08-11, 11:39
On Tue, 10 Aug 2010 09:39:17 -0700 Philip Zeyliger wrote:
> It works only if your python app has network connectivity to your
> namenode. You can access an explicitly specified HDFS by passing
> -Dfs.default.name=hdfs://<namenode>:<namenode_port>/
> . (The default is read from hadoop-site.xml (or perhaps hdfs-site.xml),
> and, I think, defaults to file:///).
Thank you. This sounds really good! I tried it but I still have a problem.
The namenode is defined in hadoop/conf/core-site.xml. On the namenode it looks like:
    <property>
      <name>fs.default.name</name>
      <value>hdfs://hadoopserver:9000</value>
    </property>
I have now copied the whole hadoop directory to the client where the python app runs.
If I run "hadoop fs -ls /" I get a message that it can't connect to the server, and hadoop tries to connect again and again:
    10/08/11 12:06:34 INFO ipc.Client: Retrying connect to server: hadoopserver/129.69.216.55:9000. Already tried 0 time(s).
    10/08/11 12:06:35 INFO ipc.Client: Retrying connect to server: hadoopserver/129.69.216.55:9000. Already tried 1 time(s).
From the client I can access the web interface of the namenode (hadoopserver:50070). "Browse the file system" links to http://pcmoholynagy:50070/nn_browsedfscontent.jsp but if I click on the link I get redirected to http://localhost:50075/browseDirectory.jsp?namenodeInfoPort=50070&dir=%2F which of course can't be accessed by the client. If I replace "localhost" with "hadoopserver" it works.
Maybe the wrong redirection also causes the problem when I call "bin/hadoop fs -ls /"?
I have tried to find something by reading the documentation and by googling, but I couldn't find a solution.
Any ideas?
Thanks! Björn
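When a copied configuration is in play, it can help to confirm which namenode address the client will actually use. The sketch below reads `fs.default.name` out of a core-site.xml document with the standard library; the function names are hypothetical, and the sample assumes the usual `<configuration>` root element around the property quoted in the message above:

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse


def read_property(xml_text, name):
    """Return the value of a Hadoop config property, or None if it is absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


def namenode_address(xml_text):
    """Return (host, port) parsed from fs.default.name."""
    parsed = urlparse(read_property(xml_text, "fs.default.name"))
    return parsed.hostname, parsed.port


SAMPLE = """<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://hadoopserver:9000</value>
  </property>
</configuration>"""

print(namenode_address(SAMPLE))
```

For a real installation, the text would come from reading the client's own copy of hadoop/conf/core-site.xml instead of the inline sample.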
-
Re: Best way to write files to hdfs (from a Python app)
Jeff Hammerbacher 2010-08-11, 17:40
Hey Björn,
You also mention that your app will be accessing data stored in HBase. There's a Python client for the Avro HBase gateway at http://github.com/hammer/pyhbase. If you try it out, let me know how it goes.
Thanks,
Jeff
-
Re: Best way to write files to hdfs (from a Python app)
Bjoern Schiessle 2010-08-12, 12:01
Hey Jeff,
On Wed, 11 Aug 2010 10:40:29 -0700 Jeff Hammerbacher wrote:
> You also mention that your app will be accessing data stored in HBase.
> There's a Python client for the Avro HBase gateway at
> http://github.com/hammer/pyhbase. If you try it out, let me know how it
> goes.
What's the difference between Avro and Thrift? Are there any specific reasons to prefer one over the other?
I tried to find some documentation about Avro, but it seems that this is quite a new project.
best wishes, Björn
-
Re: Best way to write files to hdfs (from a Python app)
David Rosenstrauch 2010-08-12, 13:43
On 08/12/2010 08:01 AM, Bjoern Schiessle wrote:
> What's the difference between Avro and Thrift? Are there any specific
> reasons to prefer one of the other?
>
> I tried to find some documentation about Avro, but it seems that this is
> a quite new project.
This blog post is a good intro: http://www.searchenginecaffe.com/2009/07/hadoop-data-serialization-battle.html
Avro is going to be supported natively in Hadoop going forward, so if you're on the fence, I'd choose Avro. I've been using Avro for about a month now (just for serialization, not RPC) and I've been pretty happy with it.
HTH,
DR
-
Re: Best way to write files to hdfs (from a Python app)
Bjoern Schiessle 2010-08-12, 12:23
Hi,
I read various mailing list archives and played a little bit with my configuration. It seems others have had similar problems (remote access to the namenode) in the past.
I'm now one step further. On both the hadoop server and the client which doesn't run any hadoop daemon I have replaced the hostname with the actual IP of the server. Modified configuration files: core-site.xml, masters, slaves, mapred-site.xml.
Now I can access the namenode and file system from the client with the web interface. Also "telnet hadoopserver 9000" works.
But running "bin/hadoop fs -ls /" at the client still gives me:
    10/08/12 14:08:11 INFO ipc.Client: Retrying connect to server: /129.69.216.55:9000. Already tried 0 time(s).
    10/08/12 14:08:12 INFO ipc.Client: Retrying connect to server: /129.69.216.55:9000. Already tried 1 time(s).
    10/08/12 14:08:13 INFO ipc.Client: Retrying connect to server: /129.69.216.55:9000. Already tried 2 time(s).
    ...
This error doesn't generate any log messages. Is it possible to get a more verbose output for debugging?
Any idea what could be wrong?
Thanks a lot! Björn
-
Re: Best way to write files to hdfs (from a Python app) (problem solved)
Bjoern Schiessle 2010-08-12, 14:31
Hi all,
I have solved the problem. The problem wasn't Hadoop but my network setup. I was using my laptop (via wlan) as a client, and it seems the wlan gateway has a firewall which blocked the connection. After I connected my laptop by wire, everything worked! :-)
Best wishes & thanks a lot for all your help and useful mails! Björn
-- Björn Schießle Support Free Software, join FSFE's Fellowship (fellowship.fsfe.org) Buy books and support Free Software (wiki.fsfe.org/SupportPrograms)
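A blocked port like the one described in this message can be checked for from Python before involving Hadoop at all. This is only a sketch under the assumption that the namenode RPC port is reachable over plain TCP; the function name is hypothetical, and a successful connect is just a first step, since it does not prove that the Hadoop client itself will work:

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

Running `can_connect("hadoopserver", 9000)` from the machine and network the web application will really use gives a quick hint whether a firewall or routing issue sits between the client and the namenode.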
-
Re: Best way to write files to hdfs (from a Python app)
stu24mail@... 2010-08-10, 16:02
Hello Bjoern,
Thrift works with binary data - at least with hbase. I have a C# app that people can use to put (and get) binary files in hbase via Thrift. I'll send example code later.
I also have java apps that upload files to hdfs directly and are not on the server, but they do need access to copies of the config files. They just use the standard java hdfs api.
Best,
-stu
-
Re: Best way to write files to hdfs (from a Python app)
Bjoern Schiessle 2010-08-12, 12:04
On Tue, 10 Aug 2010 16:02:04 +0000 [EMAIL PROTECTED] wrote:
> Thrift works with binary data - at least with hbase. I have a C# app
> that people can use to put binary (and get) files in hbase via Thrift.
> I'll send example code later. I also have java apps that upload files
> to hdfs directly and are not on the server - but they do need access to
> the copies of the config files. But they just use the standard java
> hdfs api.
That could be interesting. Can you send me some example code?
How large are the binary files you store in hbase? I have read that for large files (> 10MB) hdfs is the better place to store binary data.
best wishes, Björn
-
Re: Best way to write files to hdfs (from a Python app)
Stuart Smith 2010-08-12, 18:46
Hello Bjoern,
Loading binary data into HBase isn't terribly different from loading other data. In fact - just a warning - what gives me the most trouble is strings, because I have to deal with Windows clients, which don't do UTF8. So you have to be careful loading and retrieving strings on Windows clients.
Caveats aside, here is some example code (not in python, but I think the relevant data types should map easily).
Open a connection:
    // Mailing lists say TBufferedTransport is important (vs. just TSocket).
    TBufferedTransport transport = new TBufferedTransport(new TSocket(host, port));
    TProtocol protocol = new TBinaryProtocol(transport, true, true);
    Hbase.Client client = new Hbase.Client(protocol);
    transport.Open();
Write the file:
    // Get your file buffer (buf), and ..
    // (encoder is assumed to be a text encoding such as System.Text.Encoding.UTF8)
    Mutation mut = new Mutation();
    mut.Column = encoder.GetBytes(/*some column*/);
    mut.Value = buf;
    // A row update is a list of mutations; here it holds just the one cell.
    List<Mutation> newRow = new List<Mutation>();
    newRow.Add(mut);
    client.mutateRow(encoder.GetBytes(/*tablename*/), encoder.GetBytes(/*row key*/), newRow);
    transport.Close();
So not so bad - once you figured it out - thrift documentation is a bit sparse ;)
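For a Python app, the same steps might map roughly as follows. This is a sketch, not tested against a live HBase: it assumes Python bindings generated from Hbase.thrift, whose `mutateRow` takes a table name, a row key and a list of mutations (some HBase versions add further parameters). The helper name `store_file` and the table, row and column names in the comment are hypothetical:

```python
def store_file(client, make_mutation, table, row_key, column, data):
    """Write `data` (bytes) into a single cell through an HBase Thrift client.

    `client` is anything with a mutateRow(table, row, mutations) method and
    `make_mutation` is the Mutation class from the generated bindings.
    """
    mutation = make_mutation(column=column, value=data)
    client.mutateRow(table, row_key, [mutation])


# Wiring with the real libraries might look like this (module names are an
# assumption about the generated code; not executed here):
#
#   from thrift.transport import TSocket, TTransport
#   from thrift.protocol import TBinaryProtocol
#   from hbase import Hbase
#   from hbase.ttypes import Mutation
#
#   transport = TTransport.TBufferedTransport(TSocket.TSocket(host, port))
#   client = Hbase.Client(TBinaryProtocol.TBinaryProtocol(transport))
#   transport.open()
#   store_file(client, Mutation, b"files", b"row1", b"data:content", payload)
#   transport.close()
```

Keeping the Thrift wiring outside the helper makes it possible to exercise the write path with a stub client, which is useful while the real gateway is not yet reachable.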
I've stored files up to 500 MB in HBase - but I wouldn't recommend it. Hbase handles it fine, but I just had a M/R task throw an OOME when processing a large cell. My rule of thumb has been to store anything up to 64MB in Hbase - basically up to the chunk size of the hdfs file system. Basically the *lower* limit for hdfs is my upper limit for hbase.
That said, the hbase FAQ says 10 MB, as you mentioned. But that's an average size. My average size is closer to 300 KB, but there's a lot of variance/deviation around that number. For your average I would definitely follow the Hbase FAQ's advice of 10 MB. For the max - 64 MB?
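The rule of thumb above could be encoded in the application as a simple size check. The 64 MB threshold is the per-file maximum suggested in this message, not an HBase limit, and the function and constant names are made up for illustration:

```python
MB = 1024 * 1024

# Suggested upper bound for a single HBase cell, per the rule of thumb above.
HBASE_MAX_CELL_BYTES = 64 * MB


def choose_store(size_bytes, hbase_max=HBASE_MAX_CELL_BYTES):
    """Return 'hbase' for files up to the threshold and 'hdfs' for larger ones."""
    return "hbase" if size_bytes <= hbase_max else "hdfs"
```

The threshold is a parameter so it can be lowered, for example toward the 10 MB average mentioned in the HBase FAQ, if large cells cause memory trouble in practice.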
Take care,
-stu