I have been working on getting various frameworks running on my MapR cluster that is also running Mesos. Basically, while I know that there is a package from MapR (for Drill), I am trying to find a way to better separate the storage layer from the compute layer.
This isn't a dig on MapR, or any of the Hadoop distributions; it's only that I want the flexibility to try things, and to have an R&D team working with the data in an environment where they can try out new frameworks, etc. This combination has been very good to me (maybe not to MapR support, who received lots of quirky questions from me, but they have been helpful in furthering my understanding of this space!).
The next project I wanted to play with was Drill. I found https://github.com/mhausenblas/dromedar (thanks, Michael!) as a basic start to a Drill-on-Mesos approach. I read through the code and I understand it, but I wanted to see it at a more basic level.
So I just figured out how to run drillbits in Marathon (manually for now). For anyone wanting to play along at home, this actually works VERY well. I used MapR FS to host my Drill package, and I set up a conf directory (multiple conf directories, actually; I set it up so I could launch different "sized" drillbits). I have been able to get things running, and performing well, on my small test cluster.
For those who may be interested here are some of my notes.
- I compiled Drill 1.2.0-SNAPSHOT from source. I ran into some compile issues that Jacques was able to help me through. Basically, Java 1.8 isn't supported for building yet (it fails some tests), but there is a workaround for that.
- I took the built package and placed it in MapR FS. Every node mounts MapR FS at the same NFS location. I could be using an hdfs (maprfs) based tarball, but I haven't done that yet; I am just playing around, and the NFS mounting of MapR FS sure is handy in this regard.
- At first I created a single-sized drillbit; the Marathon JSON is like this:
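(The original JSON block did not survive in the archive. A minimal sketch of what such a Marathon app definition could look like, reconstructed only from the fields discussed below — the install path, conf directory, and cpu value are placeholders of mine, not John's actual values:)

```json
{
  "id": "drillpot",
  "cmd": "/mapr/mycluster/drill/apache-drill-1.2.0-SNAPSHOT/bin/runbit",
  "cpus": 2,
  "mem": 6144,
  "instances": 1,
  "env": {
    "DRILL_CONF_DIR": "/mapr/mycluster/drill/conf-large"
  },
  "constraints": [["hostname", "UNIQUE"]]
}
```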
So I can walk you through this. The first field is the command, obviously. I use runbit instead of drillbit.sh start because I want this process to stay running (from Marathon's perspective). If I used drillbit.sh, it uses nohup and backgrounds the process, so Mesos/Marathon thinks it died and tries to start another.
cpus: obvious, maybe a bit small, but I have a small cluster.
mem: I set mem to 6144 (6GB). In my drill-env.sh, I set max direct memory to 6GB and max heap to 3GB. I wasn't sure if I needed to set my Marathon memory to 9GB, or if the heap is counted inside the direct memory. I could use some pointers here.
id: This is the id of my cluster in drill-override.conf. I did this so HAProxy would let me connect to the cluster via drillpot.marathon.mesos, and it worked pretty well!
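(For reference, the cluster id lives in drill-override.conf; a minimal sketch — the id matches the Marathon app id above, while the ZooKeeper quorum is a placeholder, not an actual value from this cluster:)

```hocon
drill.exec: {
  cluster-id: "drillpot",
  zk.connect: "zk1:2181,zk2:2181,zk3:2181"
}
```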
instances: I started with one, but could scale up with Marathon.
constraints: I only wanted one drillbit per node because of port conflicts. If I want to be multi-tenant and have more than one drillbit per node, I would need to figure out how to abstract the ports. That is something I could potentially do in a framework for Mesos. At the same time, I wonder: when a drillbit registers with a cluster, could it just "report" its ports in the ZooKeeper information? This is intriguing, because if it did, we could let it pull random ports offered to it by Mesos, register that information, and away we go.
Once I posted this to Marathon, all was good: bits started, queries were had by all! It worked well. Some challenges:
1. Ports (as mentioned above): I am not managing those, so port conflicts could occur.
2. I should use a tarball for Marathon; this would allow Drill to work on Mesos without the MapR requirement.
3. Logging. I have the default logback.xml in the conf directory, and I am getting file-not-found issues in the stderr of the Mesos tasks. This isn't killing Drill, and it still works, but I should organize my logging better.
Hopeful for the future:
1. It would be neat to have a framework that did the actual running of the bits, perhaps something that could scale up and down based on query usage. I played around with some smaller drillbits (similar to how Myriad defines profiles) so I could have a Drill cluster of 2 large bits and 2 small bits on my 5-node cluster. That worked, but it was lots of manual work. A framework would be handy for managing that.
2. Other? I know this isn't a production thing, but I could see going from this to something a subset of production users could use in MapR/Mesos (or just Mesos). I just wanted to share some of my thought processes and show a way that various tools can integrate. Always happy to talk shop with folks on this stuff if anyone has any questions.
John
Great write up and information! Will be interesting to see how this evolves.
A quick note: memory allocation is additive, so you have to allocate for direct plus heap memory. Drill uses direct memory for data structures/operations, and this is the one that will grow with larger data sets, etc.
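In concrete terms for the setup described above, that means the drill-env.sh values and the Marathon mem should line up roughly like this (a sketch; the variable names are from the stock drill-env.sh of that era, and the exact JVM overhead headroom is my assumption):

```shell
# drill-env.sh -- per-drillbit memory limits (sketch)
export DRILL_MAX_DIRECT_MEMORY="6G"   # direct memory, used for data structures/operations
export DRILL_MAX_HEAP="3G"            # JVM heap

# Since the two are additive, the Marathon "mem" for this profile should
# cover 6G + 3G (plus a little JVM overhead), i.e. roughly mem: 9216.
```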
On Jul 16, 2015, at 5:23 AM, John Omernik <[EMAIL PROTECTED]> wrote:
I played with that, and the performance I was getting in Docker was about half what I was getting natively. I think that was occurring because when I ran it in Docker, I needed to install the MapR client in the container too, whereas when I run it in Marathon, it uses the node's access to the disks. In places where performance hits like this occur, I am comfortable not Dockering all the things and allowing for the tarball method. Perhaps Mesos could find a way to cache locally? (Note: putting it in MapR FS still has it loading pretty quickly.)
John

On Thu, Jul 16, 2015 at 11:44 AM, Timothy Chen <[EMAIL PROTECTED]> wrote:
The nice thing about the approach you are taking, and adding a Docker deployment with something like Drill, is that you really don't care where those Docker instances land in your cluster: you can build your configuration into your Docker image, you are off and running, and you should have no problem dynamically spinning up a few more instances whenever you want. Should hopefully simplify administration.
On Thu, Jul 16, 2015 at 2:08 PM, John Omernik <[EMAIL PROTECTED]> wrote:

*Jim Scott*
Director, Enterprise Strategy & Architecture
+1 (347) 746-9281