Pig, mail # user - how to best process key-value pairs with Pig


Re: how to best process key-value pairs with Pig
Bill Graham 2012-03-23, 15:37
If you have to have one row per entity, you could store the data using Avro
or JSON. Both would allow you to associate a Map of key/values with your
entity. AvroStorage in piggybank and JsonLoader in Pig would help if you
were to store the entire row as Avro or JSON. If you just want to store a
field as a serialized object, then you could write a UDF to do the same.
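To make the suggestion concrete, here is a rough sketch (in Python rather than Pig Latin; the field names are invented for illustration) of what storing an entire entity, attribute map included, as one JSON line might look like. A JsonLoader-style loader can then hand the attributes back as a single map field instead of N key-value rows:

```python
import json

# Hypothetical entity with its attributes as a map (names are illustrative).
entity = {
    "id": "8a9e202b-4da6-4cc0-958b-0000bd4c2c9d",
    "type": "order",
    "attrs": {"prop1": "xyz", "prop3": "20120312 04:38:02.140"},
}

# One JSON object per line: the whole entity, attributes included,
# lands in a single row instead of one row per attribute.
line = json.dumps(entity, sort_keys=True)
print(line)

# Reading it back recovers the attribute map directly.
parsed = json.loads(line)
print(parsed["attrs"]["prop1"])
```

The same shape works for Avro: the `attrs` field becomes an Avro map of strings in the record schema.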

On Wed, Mar 21, 2012 at 10:32 PM, shan s <[EMAIL PROTECTED]> wrote:

> The numbers 100 and 20 denote metadata counts. The number of data instances
> is large. Moreover, given the denormalized form, it can’t take advantage of
> indexes.
>
> The data is currently denormalized, in the sense that instead of having 100
> sparse columns, the data is stored as key-value pairs in a 3-column table:
> one row for every attribute of an entity, resulting in N rows for an entity,
> where N = the number of attributes the entity has.
>
> I guess there are two options for converting this data to text files to
> yield one row for each entity.
> 1. Use sparse columns: add a column for each possible property/attribute of
> an entity. This means adding new columns to the file at ETL time and
> maintaining the schema.
> 2. Translate to key=value pairs, and handle the complexity of the parsing
> in the Pig scripts.
>
> For option#2, are there any tools, UDFs which makes parsing/processing of
> key-value pair easier?
> Example of a converted line (shown wrapped here; it is one line in the file):
> 8a9e202b-4da6-4cc0-958b-0000bd4c2c9d,prop1=xyz,prop2=9cd72489-6c03-489a-92cd-c9f938a7b223,prop3=20120312 04:38:02.140,prop4=20120312 04:38:02.140,prop5=e689968f-2c64-457b-a0ba-5f0122687172,prop6=5ce12c5b-2c82-4fbe-961e-fd04de96a8ae
>
> In other words, I need to query this data with predicates like
> WHERE prop4 > now()
> WHERE prop2 = ‘9cd72489-6c03-489a-92cd-c9f938a7b223’
>
> Do I need to write UDFs, or are there pre-existing tools that I can use to
> do this?
> Thanks!
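One answer to the parsing question above: a key=value line like the example is straightforward to split in a small Python function, which could then be wrapped as a Pig UDF (Pig supports Python/Jython UDFs). A rough sketch, taking the first comma-separated field as the entity id, as in the example line:

```python
def parse_kv_line(line):
    """Split 'id,k1=v1,k2=v2,...' into (id, dict of properties).

    Assumes the values themselves contain no commas — true of the
    example line in this thread. A sketch, not a robust parser.
    """
    parts = line.strip().split(",")
    entity_id, kv_parts = parts[0], parts[1:]
    props = {}
    for part in kv_parts:
        key, _, value = part.partition("=")
        props[key] = value
    return entity_id, props

entity_id, props = parse_kv_line(
    "8a9e202b-4da6-4cc0-958b-0000bd4c2c9d,"
    "prop1=xyz,prop6=5ce12c5b-2c82-4fbe-961e-fd04de96a8ae"
)
print(entity_id, props["prop1"], props["prop6"])
```

With the attributes in a map, the WHERE-style filters above become ordinary comparisons on `props['prop2']` and friends inside a Pig FILTER.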
>
> On Thu, Mar 22, 2012 at 9:29 AM, Bill Graham <[EMAIL PROTECTED]> wrote:
>
>> What about denormalizing and just representing these as 4-tuples of (id,
>> type, name, value) in a text file? You could always then group by type if
>> you need to get back to distinct types.
>>
>> Are you joining against a larger dataset? I ask just because 10x200 values
>> is not a lot and can be done without Hadoop.
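The 4-tuple shape suggested here maps directly to Pig's GROUP BY. As a plain-Python illustration of the same regrouping (the tuple values below are invented):

```python
from collections import defaultdict

# Denormalized rows: (id, type, name, value) — values are invented.
rows = [
    ("e1", "user", "name", "alice"),
    ("e1", "user", "age", "30"),
    ("e2", "order", "total", "9.99"),
]

# Regroup to one bag of attributes per type, as GROUP rows BY type would.
by_type = defaultdict(list)
for entity_id, etype, name, value in rows:
    by_type[etype].append((entity_id, name, value))

print(sorted(by_type))  # the distinct types
```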
>>
>>
>> On Wed, Mar 21, 2012 at 11:49 AM, shan s <[EMAIL PROTECTED]> wrote:
>>
>> > In the relational database we have a large amount of key-value data in 2
>> > tables. Let’s call them Entity and EntityAttribute.
>> >
>> >
>> >
>> > Table: Entity                  Columns: EntityID, EntityType
>> >
>> > Table: EntityAttribute         Columns: EntityID, PropertyName,
>> > PropertyValue.
>> >
>> >
>> >
>> > These entities are loosely related to each other, hence are under a
>> single
>> > roof.
>> >
>> > There are approx. 100 attributes among entities and 20 different entity
>> > types.
>> >
>> >
>> >
>> > My questions are:
>> >
>> > - What is the best way to represent this kind of key-value pair data
>> > for processing with Pig?
>> >
>> > - Do I represent it as key=value pairs in the text files? If so, how
>> > would I process such data in Pig?
>> >
>> > - Any pointers to UDFs that help with key-value pairs would be great.
>> >
>> >
>> >
>> > Many Thanks,
>> >
>> > Shan
>> >
>>
>>
>>
>>
>
>
--
*Note that I'm no longer using my Yahoo! email address. Please email me at
[EMAIL PROTECTED] going forward.*