Re: [incubator-blur] The new Adhoc command is working though there are a few things hard coded that need to be pulled into the API. (753ab41)
How about this?

public abstract class Command<T1, T2> implements Serializable {

  public abstract void mergeFinal(Iterable<T2> results, BlurContext<T2> context) throws IOException;
  public abstract void mergeLocal(Iterable<T1> results, BlurContext<T2> context) throws IOException;
  public abstract void processIndex(BlurIndex blurIndex, BlurContext<T1> context) throws IOException;

}

Where BlurContext<T> looks like:

public abstract class BlurContext<T> implements Serializable {

  public abstract void write(T object) throws IOException;
  public abstract void progress();
  public abstract void incCounter(String counter);
  public abstract void setCounter(String counter, long num);

  public abstract Object[] getArgs();
  public abstract void setArgs(Object[] args);
}
Probably looks really familiar.. :)
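
To make the three phases concrete, here is a minimal, self-contained sketch of how a row-counting command might flow through them. FakeIndex, ListContext and the run() driver are hypothetical stand-ins invented for this example (they are not Blur classes), the IOException declarations are dropped for brevity, and Serializable is omitted. This toy driver also collects results into lists eagerly, which is exactly what the Iterable signature would let a real implementation avoid.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CommandSketch {

  /** Hypothetical stand-in for BlurIndex: just a list of row ids. */
  public static class FakeIndex {
    final List<String> rows;
    public FakeIndex(String... rows) { this.rows = Arrays.asList(rows); }
  }

  /** Hypothetical in-memory context that collects whatever is written to it. */
  public static class ListContext<T> {
    public final List<T> written = new ArrayList<T>();
    public void write(T object) { written.add(object); }
  }

  /** Same three phases as the proposed Command, against the stand-in types. */
  public abstract static class Command<T1, T2> {
    public abstract void processIndex(FakeIndex index, ListContext<T1> context);
    public abstract void mergeLocal(Iterable<T1> results, ListContext<T2> context);
    public abstract void mergeFinal(Iterable<T2> results, ListContext<T2> context);
  }

  /** Counts rows: one Integer per index, one Long per server, one Long overall. */
  public static class CountRows extends Command<Integer, Long> {
    public void processIndex(FakeIndex index, ListContext<Integer> context) {
      context.write(index.rows.size());
    }
    public void mergeLocal(Iterable<Integer> results, ListContext<Long> context) {
      long sum = 0;
      for (int r : results) sum += r;
      context.write(sum);
    }
    public void mergeFinal(Iterable<Long> results, ListContext<Long> context) {
      long sum = 0;
      for (long r : results) sum += r;
      context.write(sum);
    }
  }

  /** Toy driver: each inner list plays the role of one server's indexes. */
  public static <T1, T2> T2 run(Command<T1, T2> command, List<List<FakeIndex>> servers) {
    ListContext<T2> perServer = new ListContext<T2>();
    for (List<FakeIndex> indexes : servers) {
      ListContext<T1> perIndex = new ListContext<T1>();
      for (FakeIndex index : indexes) command.processIndex(index, perIndex);
      command.mergeLocal(perIndex.written, perServer);
    }
    ListContext<T2> result = new ListContext<T2>();
    command.mergeFinal(perServer.written, result);
    return result.written.get(0);
  }

  public static void main(String[] args) {
    List<List<FakeIndex>> servers = Arrays.asList(
        Arrays.asList(new FakeIndex("a", "b"), new FakeIndex("c")),
        Arrays.asList(new FakeIndex("d", "e", "f")));
    System.out.println(run(new CountRows(), servers)); // prints 6
  }
}
```

Note how T1 (Integer) only exists between processIndex and mergeLocal, while T2 (Long) is what travels onward to mergeFinal.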

By providing the Iterable interface, our implementation behind the scenes
could run each call to processIndex on demand as the results are iterated,
so we don't have to realize the full List<T1> like the current
implementation does. It's a step in the right direction: real memory usage
is now contained within the Command as opposed to message passing. It's not
total streaming, but we have removed one complete copy of intermediate
results from RAM.
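
As a rough illustration of that idea, the sketch below wraps a list of indexes in an Iterable whose iterator only invokes the per-index work when next() is called. It assumes Java 8 or later, and the names LazyResults, lazily and the computed counter are made up for the example; this is not how Blur implements it.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

public class LazyResults {

  /** Counts how many per-index results have been computed (demonstration only). */
  public static int computed = 0;

  /**
   * Returns an Iterable whose iterator applies processIndex to one index
   * at a time, and only when next() is called.
   */
  public static <I, R> Iterable<R> lazily(final List<I> indexes,
                                          final Function<I, R> processIndex) {
    return () -> new Iterator<R>() {
      private final Iterator<I> source = indexes.iterator();

      public boolean hasNext() {
        return source.hasNext();
      }

      public R next() {
        I index = source.next();
        computed++;
        return processIndex.apply(index);
      }
    };
  }

  /** Stand-in for mergeLocal: sums partial results as they stream by. */
  public static long mergeLocal(Iterable<Long> results) {
    long total = 0;
    for (long r : results) total += r;
    return total;
  }

  public static void main(String[] args) {
    List<String> shards = Arrays.asList("shard-0", "shard-1", "shard-2");
    Iterable<Long> partials = lazily(shards, s -> (long) s.length());
    System.out.println(computed);             // prints 0: nothing has run yet
    System.out.println(mergeLocal(partials)); // prints 21
    System.out.println(computed);             // prints 3
  }
}
```

Only one partial result is alive at a time here, whereas a List-based signature would force all of them to be held before merging starts.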

I also like the BlurContext idea more and more. We might not know all the
things we want to expose as hooks (blockcache, tmp disk access,
blurConfig??) up front, but this gives us an API-compatible way to extend
that without junking the core interface.
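
A small sketch of why that works: if every callback receives a context object, new hooks can be added to the context later without touching the callback signature. Everything here (Context, Step, getConfig, setConfig) is a hypothetical simplification for illustration, not the proposed Blur API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ContextExtensionSketch {

  /** Hypothetical context; imagine only write() existed in "version 1". */
  public static class Context<T> {
    public final List<T> written = new ArrayList<T>();
    private final Map<String, String> config = new HashMap<String, String>();

    public void write(T object) { written.add(object); }

    // Hypothetical hooks added in "version 2". Nothing in Step changes.
    public void setConfig(String key, String value) { config.put(key, value); }
    public String getConfig(String key, String fallback) {
      String value = config.get(key);
      return value == null ? fallback : value;
    }
  }

  /** The core callback keeps a single, stable signature. */
  public interface Step<T> {
    void process(String input, Context<T> context);
  }

  /** Written against "version 1": only uses write(). */
  public static final Step<String> UPPER = new Step<String>() {
    public void process(String input, Context<String> context) {
      context.write(input.toUpperCase());
    }
  };

  /** Written against "version 2": also reads a config value. */
  public static final Step<String> PREFIXED = new Step<String>() {
    public void process(String input, Context<String> context) {
      context.write(context.getConfig("prefix", "") + input);
    }
  };

  public static void main(String[] args) {
    Context<String> context = new Context<String>();
    context.setConfig("prefix", "blur:");
    UPPER.process("row", context);    // old-style step still works unchanged
    PREFIXED.process("row", context); // new-style step uses the added hook
    System.out.println(context.written); // prints [ROW, blur:row]
  }
}
```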

One last thing: while talking with Aaron, he mentioned maybe separating
what the shard server does from the controller server. This might give us
more freedom to integrate with other bulk processing/streaming engines,
which ideally will hit the shards directly and not pull data back via the
controllers. I'm not sure how that would look yet; it's hard to get out of
the mindset that shards and controllers look the same API-wise.

Anyways, hopefully this will spawn more ideas!

On Thu, Jul 31, 2014 at 1:30 PM, Tim Williams <[EMAIL PROTECTED]> wrote: