Pig >> mail # user >> Anybody using custom Serializer/Deserializer in Pig Streaming?

Anybody using custom Serializer/Deserializer in Pig Streaming?

Does anyone know of someone using a custom serializer/deserializer in Pig streaming?

I was looking at http://wiki.apache.org/pig/PigStreamingFunctionalSpec and was impressed by the various features it supports.
Then, looking at the code, I was sad to see how much additional data copying is done to support those features, when the simple case should need only one copy out to stdin and another copy in from stdout.

So far, this is my understanding: there are 2 extra copies on the sender side and 3 extra copies on the receiver side.

Assuming Default(Input/Output)Handler + PigStreaming, the call path is:

PigInputHandler.putNext(Tuple t)
--> serializer.serialize(t)
-->--> COPY to out(ByteArrayOutputStream)
-->--> COPY by out.toByteArray()
--> write to stdin (copy but necessary)
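
To make the sender-side cost concrete, here is a minimal JDK-only sketch (not Pig code). The class and method names are hypothetical, and it assumes the serialized tuple is sitting in a ByteArrayOutputStream as in the trace above. toByteArray() allocates a new array and copies into it, while writeTo() hands the internal buffer straight to the target stream. Whether Pig could use this directly depends on the serializer interface, which I have not checked beyond what is traced here.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Pig.
final class SenderCopyDemo {

    // Mirrors the traced path: toByteArray() allocates a fresh array
    // and copies the buffered bytes into it before the write.
    static void writeWithExtraCopy(ByteArrayOutputStream buffer, OutputStream stdin)
            throws IOException {
        byte[] copy = buffer.toByteArray(); // extra COPY
        stdin.write(copy);
    }

    // Alternative: writeTo() writes the internal buffer to the target
    // stream without building an intermediate array.
    static void writeWithoutExtraCopy(ByteArrayOutputStream buffer, OutputStream stdin)
            throws IOException {
        buffer.writeTo(stdin);
    }

    public static void main(String[] args) throws IOException {
        byte[] tuple = "a\tb\tc\n".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        buffer.write(tuple, 0, tuple.length);

        // Stand-ins for the streaming process's stdin.
        ByteArrayOutputStream sinkA = new ByteArrayOutputStream();
        ByteArrayOutputStream sinkB = new ByteArrayOutputStream();
        writeWithExtraCopy(buffer, sinkA);
        writeWithoutExtraCopy(buffer, sinkB);

        // Both paths deliver the same bytes to the process.
        System.out.println(java.util.Arrays.equals(sinkA.toByteArray(), sinkB.toByteArray()));
    }
}
```

Both methods deliver identical bytes; the difference is only the intermediate allocation and copy per tuple.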


--> OutputHandler.getNext()
-->--> Text value = readLine(stdin)   (copy but necessary)
-->--> System.arraycopy(value.getBytes(), 0, newBytes, 0, value.getLength());   COPY just because deserialize requires an exact-size byte array?
-->-->deserializer.deserialize(byte [])
-->-->-->  Text val = new Text(bytes); COPY since Text does not reuse the byte array it is given
-->-->-->  StorageUtil.textToTuple(val, fieldDel)
-->-->-->--> Create ArrayList of DataByteArrays    COPY.
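
On the receiver side, the exact-size copy and the second Text wrap could in principle be skipped if the field splitting worked on (buffer, length) directly. The following is a JDK-only sketch under that assumption. splitFields is a hypothetical, simplified stand-in for StorageUtil.textToTuple, not its actual behavior, and plain byte[] stands in for DataByteArray. Each field is still copied once into its own array, which corresponds to the last COPY in the trace.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; not part of Pig.
final class ReceiverCopyDemo {

    // Splits the valid region buf[0, length) on the delimiter.
    // The backing array may be longer than 'length', so no
    // exact-size copy of the whole line is needed first.
    static List<byte[]> splitFields(byte[] buf, int length, byte delimiter) {
        List<byte[]> fields = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < length; i++) {
            if (buf[i] == delimiter) {
                fields.add(Arrays.copyOfRange(buf, start, i)); // one copy per field
                start = i + 1;
            }
        }
        fields.add(Arrays.copyOfRange(buf, start, length)); // last field
        return fields;
    }

    public static void main(String[] args) {
        // Simulate a reused line buffer: 16 bytes of capacity,
        // only the first 5 bytes ("a<TAB>b<TAB>c") are valid.
        byte[] line = "a\tb\tc".getBytes(StandardCharsets.UTF_8);
        byte[] backing = new byte[16];
        System.arraycopy(line, 0, backing, 0, line.length);

        for (byte[] field : splitFields(backing, line.length, (byte) '\t')) {
            System.out.println(new String(field, StandardCharsets.UTF_8));
        }
    }
}
```

With this shape the only copies left on the receive path would be the read from the process and the per-field copy.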

I'm now wondering whether we can simplify this somehow.