Json Loader - Array of objects - Loading results in empty data set
Hello,

I am new to this list. I have tried to solve this problem for the last 48 hours, but I am stuck. I hope someone here can point me in the right direction.

I have problems using the Pig JsonLoader and am wondering whether I am doing something wrong or have run into a different problem.

The first half of this post is to show that I know at least something about what I am talking about and that I did my homework. During my research I found a lot about elephant-bird, but there seems to be a conflict with Cloudera, so I am stuck there as well. If you trust me already, you can jump directly to the second half of my post ;-).

The desired solution should work both on Cloudera and on Amazon EMR.

To prove that the basic case works, I have this data file:

```
$ cat a.json
{"DataASet":{"A1":1,"A2":4,"DataBSets":[{"B1":"1","B2":"1"},{"B1":"2","B2":"2"}]}}
$ ./jq '.' a.json
{
  "DataASet": {
    "A1": 1,
    "A2": 4,
    "DataBSets": [
      {
        "B1": "1",
        "B2": "1"
      },
      {
        "B1": "2",
        "B2": "2"
      }
    ]
  }
}
$
```

I am using this Pig script to load it:

``` Pig
a = load 'a.json' using JsonLoader('
     DataASet: (
       A1:int,
       A2:int,
       DataBSets: {
         (
           B1:chararray,
           B2:chararray
         )
       }
     )
');
```

In grunt everything seems ok.

```
grunt> describe a;
a: {DataASet: (A1: int,A2: int,DataBSets: {(B1: chararray,B2: chararray)})}
grunt> dump a;
((1,4,{(1,1),(2,2)}))
grunt>
```

So far so good.

Real Problem
My real data (gigabytes of it) looks a little different. Each element of the array is an object that wraps another object under the key "DataBSet".

```
$ ./jq '.' b.json
{
  "DataASet": {
    "A1": 1,
    "A2": 4,
    "DataBSets": [
      {
        "DataBSet": {
          "B1": "1",
          "B2": "1"
        }
      },
      {
        "DataBSet": {
          "B1": "2",
          "B2": "2"
        }
      }
    ]
  }
}
$ cat b.json
{"DataASet":{"A1":1,"A2":4,"DataBSets":[{"DataBSet":{"B1":"1","B2":"1"}},{"DataBSet":{"B1":"2","B2":"2"}}]}}
$
```

I am trying to load this JSON with the following schema:

``` Pig
b = load 'b.json' using JsonLoader('
     DataASet: (
       A1:int,
       A2:int,
       DataBSets: {
         DataBSet: (
           B1:chararray,
           B2:chararray
         )
       }
     )
');
```

Again it looks good so far in grunt.

```
grunt> describe b;
b: {DataASet: (A1: int,A2: int,DataBSets: {DataBSet: (B1: chararray,B2: chararray)})}
```

I expect something like this when dumping b:

```
((1,4,{((1,1)),((2,2))}))
```

But I get this:

```
grunt> dump b;
()
grunt>
```

Obviously I am doing something wrong. The empty result suggests that the schema does not match the input line.
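
One variant that might be worth trying is sketched below. It is untested and rests on an assumption: that JsonLoader reads every bag element as a tuple whose fields are the keys of the corresponding JSON object. Under that assumption, `DataBSet` would have to be declared as a field inside the element tuple rather than as the name of the tuple itself. The alias `c` is only illustrative.

``` Pig
-- Assumption (not verified): each element of the DataBSets bag is a tuple
-- with a single field, DataBSet, which is itself a tuple of (B1, B2).
c = load 'b.json' using JsonLoader('
     DataASet: (
       A1:int,
       A2:int,
       DataBSets: {
         (
           DataBSet: (
             B1:chararray,
             B2:chararray
           )
         )
       }
     )
');
```

If the assumption holds, each bag element would carry one tuple-typed field, which is the extra level of nesting I expect to see in the dump.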

Any hints? Thanks in advance.

Kind regards.

Ralf