Pig >> mail # user >> Anybody using custom Serializer/Deserializer in Pig Streaming?


Re: Anybody using custom Serializer/Deserializer in Pig Streaming?
Nice summary, Koji. I wish we had an object holding a byte[] and a length,
instead of a bare byte[], as the return type of serialize() and the parameter
of deserialize(). That would enable buffer reuse and cut down on some of the
copying.
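
To make the idea concrete, here is a rough sketch of such a holder. ByteSpan and its methods are hypothetical names, not anything that exists in Pig; the point is only that a caller can hand over a larger, reusable buffer together with the count of valid bytes, so nobody has to trim it to an exact-size array first.

```java
import java.nio.charset.StandardCharsets;

class ByteSpanDemo {
    public static void main(String[] args) {
        // A reusable buffer that is larger than the record it currently holds.
        byte[] buffer = new byte[64];
        byte[] record = "a\tb".getBytes(StandardCharsets.UTF_8);
        System.arraycopy(record, 0, buffer, 0, record.length);

        // Wrap the buffer without trimming it to an exact-size array.
        ByteSpan span = new ByteSpan(buffer, record.length);
        System.out.println(span.length());
        System.out.println(new String(span.bytes(), 0, span.length(), StandardCharsets.UTF_8));
    }
}

// Hypothetical holder (not a Pig class): a backing array plus the number of
// valid bytes at its start. The array is shared with the caller, not copied.
final class ByteSpan {
    private final byte[] bytes;
    private final int length;

    ByteSpan(byte[] bytes, int length) {
        if (length < 0 || length > bytes.length) {
            throw new IllegalArgumentException("length out of range: " + length);
        }
        this.bytes = bytes;
        this.length = length;
    }

    byte[] bytes() { return bytes; }   // may be longer than length()
    int length() { return length; }
}
```

With something like this, the receiver could pass the Text's backing array and its length straight to deserialize(), and the serializer could return its internal buffer without an exact-size copy. Both would need API changes.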

At least one copy can be eliminated without any API changes by adding a new
function StorageUtil.textToTuple(bytes, fieldDel), so that deserialize() no
longer has to wrap the bytes in a Text:

    @Override
    public Tuple deserialize(byte[] bytes) throws IOException {
        // Remove this copy and construct the tuple directly from the bytes.
        Text val = new Text(bytes);
        return StorageUtil.textToTuple(val, fieldDel);
    }

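As a rough illustration of what a bytes-based variant might do, the sketch below scans the raw bytes for the delimiter with no intermediate Text. It deliberately uses plain Java types (List<byte[]>) instead of Pig's Tuple and DataByteArray, and splitFields is a hypothetical name, so treat it as the shape of the idea rather than the actual StorageUtil code.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FieldSplitDemo {
    // Hypothetical stand-in for a bytes-based textToTuple: walks the first
    // `length` bytes of buf and cuts a field at every delimiter.
    static List<byte[]> splitFields(byte[] buf, int length, byte delim) {
        List<byte[]> fields = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < length; i++) {
            if (buf[i] == delim) {
                // One copy per field; this mirrors the DataByteArray copy.
                fields.add(Arrays.copyOfRange(buf, start, i));
                start = i + 1;
            }
        }
        // Whatever follows the last delimiter is the final field (may be empty).
        fields.add(Arrays.copyOfRange(buf, start, length));
        return fields;
    }

    public static void main(String[] args) {
        byte[] line = "a\tbc\t".getBytes(StandardCharsets.UTF_8);
        for (byte[] field : splitFields(line, line.length, (byte) '\t')) {
            System.out.println("[" + new String(field, StandardCharsets.UTF_8) + "]");
        }
    }
}
```

Taking a length argument also means the caller would not need an exact-size array, although that part goes beyond the no-API-change version.
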
Regards,
Rohini
On Wed, Mar 20, 2013 at 11:51 AM, Koji Noguchi <[EMAIL PROTECTED]> wrote:

> Hi.
>
> Do you know anyone using custom serializer/deserializer in pig streaming?
>
> I was looking at http://wiki.apache.org/pig/PigStreamingFunctionalSpec and
> was impressed by the various features it supports.
> Then, looking at the code, I was sad to see a lot of additional data copying
> done to support those features, when the simple case should be one copy out
> to stdin and another copy in from stdout.
>
> So far, this is my understanding: two extra copies on the sender side and
> three extra copies on the receiver side.
>
> Assuming Default(Input/Output)Handler + PigStreaming, then
>
> PigInputHandler.putNext(Tuple t)
> --> serializer.serialize(t)
> -->--> COPY to out(ByteArrayOutputStream)
> -->--> COPY by out.toByteArray()
> --> write to stdin (copy but necessary)
>
> Streaming
>
> --> OutputHandler.getNext()
> -->--> Text value = readLine(stdin)   (copy but necessary)
> -->--> System.arraycopy(value.getBytes(), 0, newBytes, 0, value.getLength());
>        COPY just because deserialize requires an exact-size byte array?
> -->--> deserializer.deserialize(byte[])
> -->-->--> Text val = new Text(bytes);
>           COPY since Text somehow does not want to reuse the byte array
> -->-->--> StorageUtil.textToTuple(val, fieldDel)
> -->-->-->--> Create ArrayList of DataByteArrays    COPY.
>
> Now wondering if we can simplify it somehow.
>
> Thanks,
> Koji
>
>
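
On the sender side of the trace above, one possible way to drop the out.toByteArray() copy is ByteArrayOutputStream.writeTo(OutputStream), which streams the internal buffer directly to the target. The sketch below is plain Java with no Pig classes; sendRecord and the String record are hypothetical stand-ins, and it assumes the serializer could write into a caller-supplied stream, which the current serialize() signature returning byte[] does not allow.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

class WriteToDemo {
    // Hypothetical helper: serializes one record into a reusable scratch
    // buffer, then streams that buffer to the target without first
    // materializing an exact-size array via toByteArray().
    static void sendRecord(ByteArrayOutputStream scratch, String record, OutputStream target)
            throws IOException {
        scratch.reset();                                         // reuse the backing array
        scratch.write(record.getBytes(StandardCharsets.UTF_8)); // stand-in for tuple serialization
        scratch.write('\n');                                     // record delimiter
        scratch.writeTo(target);                                 // the one necessary copy
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream scratch = new ByteArrayOutputStream();
        // Stand-in for the streaming process's stdin.
        ByteArrayOutputStream processStdin = new ByteArrayOutputStream();
        sendRecord(scratch, "a\tb", processStdin);
        sendRecord(scratch, "c\td", processStdin);
        System.out.print(new String(processStdin.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

Because reset() keeps the backing array, the scratch buffer is allocated once and reused for every record.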