Hi,

I'm trying to parallelize reading the columns of a Parquet file, so I can serialize them back to a standard format as quickly as possible.

I'm currently using pool.map(), with a top-level reader/binarizer function that accepts a tuple of the file path, the column to be read, and some other data.

It looks something like this:
(prettier coloring on Stack Overflow: https://stackoverflow.com/questions/48177823/using-multiprocessing-with-pyarrows-hdfsclient)

import pyarrow as pa
import pyarrow.parquet as pq
from multiprocessing import Pool

def binarizer(file_data_tuple):
    ''' Read a Parquet column from a file, binarize and return '''
    path, col_name, col_meta, native = file_data_tuple
    if not native:
        # Either this or using a top level hdfs_con
        hdfs_con = pa.hdfs.connect(*hadoop_params)
    read_pq = pq.read_table if native else hdfs_con.read_parquet
    # Read the column, serialize and return
    arrow_col = read_pq(path, columns=(col_name,))
    bin_col = imported_binarizing_function(arrow_col)
    return bin_col

def read_binarize_parallel(filepaths):
    ''' Set up parallel reading and binarizing of a parquet file '''
    # list of tuples containing the filepath, column name, meta, and mode
    pool_params = [(), ...]
    pool = Pool()
    for file in filepaths:
        binary_columns = pool.map(binarizer, pool_params)
        chunk = b''.join(binary_columns)
        send_over_socket(chunk)

When this is used in "native" mode, reading with pyarrow from the local disk, it works fine and saves a nice amount of time (a little less than 4x on my machine) for the binarizing part.

However, when I try to do the same but read from HDFS, I get the following error stream (also formatted more readably on Stack Overflow):

[libprotobuf ERROR google/protobuf/message_lite.cc:123] Can't parse message of type "Hdfs.Internal.RpcResponseHeaderProto" because it is missing required fields: callId, status
[libprotobuf ERROR google/protobuf/message_lite.cc:123] Can't parse message of type "Hdfs.Internal.RpcResponseHeaderProto" because it is missing required fields: callId, status
[libprotobuf ERROR google/protobuf/message_lite.cc:123] Can't parse message of type "Hdfs.Internal.RpcResponseHeaderProto" because it is missing required fields: callId, status
[libprotobuf ERROR google/protobuf/message_lite.cc:123] Can't parse message of type "Hdfs.Internal.RpcResponseHeaderProto" because it is missing required fields: callId, status
2018-01-09 21:41:47.939006, p10007, th139965275871040, ERROR Failed to invoke RPC call "getFileInfo" on server "192.168.0.101:9000":
RpcChannel.cpp: 703: HdfsRpcException: RPC channel to "192.168.0.101:9000" got protocol mismatch: RPC channel cannot find pending call: id = 3.	@	Unknown
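One variant I've been considering is opening the connection once per worker process with a Pool initializer, instead of per call, so that no HDFS connection state is shared with (or inherited from) the parent process. A minimal sketch of that pattern, with a plain dict standing in for the pa.hdfs.connect() call (connection parameters, paths, and column names here are made up for illustration):

```python
import os
from multiprocessing import Pool

# Per-process "connection": created inside each worker by the
# initializer rather than inherited from the parent process.
_hdfs_con = None

def init_worker():
    # In the real code this would be e.g.:
    #   _hdfs_con = pa.hdfs.connect(*hadoop_params)
    global _hdfs_con
    _hdfs_con = {"pid": os.getpid()}  # stand-in for an HDFS connection

def binarize_one(file_data_tuple):
    # Stand-in worker: unpack the task tuple and "read" using the
    # per-process connection set up by init_worker().
    path, col_name = file_data_tuple
    return f"{path}:{col_name}:worker={_hdfs_con['pid']}"

if __name__ == "__main__":
    params = [("part-0.parquet", "c0"), ("part-0.parquet", "c1")]
    with Pool(2, initializer=init_worker) as pool:
        results = pool.map(binarize_one, params)
    print(results)
```

Here each worker builds its own connection after it is started, which I understand is the usual way to give every process in the pool an independent handle to an external service.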

I'd also be happy for any feedback on how to achieve some gains over the Hadoop read.
I'm not a Python expert, so pardon any noobness in this question :)

Thanks!
Eli
