I've been playing with Sqoop, and it seems to fit my use case (to export
some log data from HDFS to Microsoft SQL Server).
The documentation shows that Sqoop exports and imports between tables
of similar schema. However, my data export is more complicated. Allow
me to describe it. I have JSON strings stored in Hadoop SequenceFiles,
with each string indexed by timestamp. Each JSON string is similar
to the following:
Each string represents an array of objects. The "Unique_Key" and
"Timestamp" of each object correspond to one row in a SQL table (let's
call it Table A). Each object also contains an "Inner_Array", and each
element of this Inner_Array needs to go into a second SQL table
(Table B), linked back to Table A by using the Unique_Key as a foreign
key.
So, the schemas of the two SQL tables will be:

Table A: Unique_Key (Primary Key) | TimeStamp
Table B: Unique_Key (Foreign Key) | Name | Value
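
To make the mapping concrete, here is a minimal Python sketch of the
flattening I have in mind. It is only an illustration, not Sqoop code:
the sample record is hypothetical (only the field names Unique_Key,
Timestamp, Inner_Array, Name and Value come from the description above,
the values are invented), and the function name flatten_record is made
up for this example.

```python
import json

# Hypothetical sample record: only the field names come from the
# description above; the values are invented for illustration.
SAMPLE = """
[
  {"Unique_Key": "k1", "Timestamp": "2011-01-01T00:00:00",
   "Inner_Array": [{"Name": "a", "Value": "1"},
                   {"Name": "b", "Value": "2"}]},
  {"Unique_Key": "k2", "Timestamp": "2011-01-01T00:00:05",
   "Inner_Array": []}
]
"""


def flatten_record(json_string):
    """Split one JSON string into rows for Table A and Table B."""
    table_a_rows = []  # (Unique_Key, TimeStamp)
    table_b_rows = []  # (Unique_Key, Name, Value)
    for obj in json.loads(json_string):
        key = obj["Unique_Key"]
        table_a_rows.append((key, obj["Timestamp"]))
        # Each inner element becomes one Table B row, carrying the
        # parent's Unique_Key as the foreign key.
        for element in obj.get("Inner_Array", []):
            table_b_rows.append((key, element["Name"], element["Value"]))
    return table_a_rows, table_b_rows


rows_a, rows_b = flatten_record(SAMPLE)
```

Each input string yields one row per object for Table A and one row per
Inner_Array element for Table B, so the Table A rows would have to be
written before the Table B rows that reference them.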
If I wanted to implement this functionality in Sqoop (placing nested
JSON into multiple tables), it seems I would first need to implement a
"JSON parser" in lib and add schema mapping specifications to the
configuration. We would also need to provide an option for parser
selection. Is there anything I am missing? Any comments? Is someone
already implementing this functionality?
Thanks for your patient reading,