Hadoop, mail # user - Re: Is it possible to implement transpose with PigLatin/any other MR language?


madhu phatak 2012-06-21, 09:00
Subir S 2012-06-25, 19:55
Re: Is it possible to implement transpose with PigLatin/any other MR language?
Robert Evans 2012-06-22, 14:47
@Subir

You can do it.  Here is some pseudo code for it in map/reduce.  It abuses Map/Reduce a little to be more performant, but it is definitely doable.  At the end you will get one file for each reducer you have configured.  If you want a single file, you can concatenate all of the files together, ordered by file name.  You should be able to do it in Pig too, but you will need an input format that gives you the offset, and you may need to sort by the offset inside the bag that the reducer is handed.  That may cause Pig to have performance issues if it cannot keep the entire bag in memory to sort, which is why I did it in MR instead.
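
For the "concatenate all of the files together" step, here is a minimal sketch in plain Java, assuming the reducer outputs have already been copied to a local directory and use the default "part-" file name prefix. The class name ConcatParts and the method concat are made up for illustration. If the files are still on HDFS, `hadoop fs -getmerge` may be a simpler route.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ConcatParts {
    // Appends every "part-*" file in dir to target, in file-name order.
    // Assumption: target is not itself a "part-*" file inside dir.
    static void concat(Path dir, Path target) throws IOException {
        List<Path> parts;
        try (Stream<Path> files = Files.list(dir)) {
            parts = files
                    .filter(p -> p.getFileName().toString().startsWith("part-"))
                    .sorted(Comparator.comparing((Path p) -> p.getFileName().toString()))
                    .collect(Collectors.toList());
        }
        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path part : parts) {
                Files.copy(part, out); // stream each part file onto the end of target
            }
        }
    }
}
```

File-name order only matches row order because the partitioner hands each reducer a contiguous range of columns, lowest range first.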

//Assuming Text input format where the key is the offset into the original input file, and there is only one input file. If there is more than one input file you need a way to include the ordering of the input files in the offset.
Map(LongWritable offset, Text line) {
    //The -1 limit keeps trailing empty fields so every row yields the same number of columns
    String[] parts = line.toString().split(",", -1);
    for(int i = 0; i < parts.length; i++) {
        collect(new ColumnOffsetKey(offset.get(), i), new Text(parts[i]));
    }
}

//We need to know the max Columns ahead of time to get total order partitioning to work
int partition(ColumnOffsetKey key) {
  return (int)(((double)key.column/MaxColumns)*numPartitions);
}

//You probably want to put in a binary comparator for performance reasons
int compare(ColumnOffsetKey key1, ColumnOffsetKey key2) {
    //First sort by column(which will become the new row) next sort by offset which will tell us the new column ordering
    if(key1.column > key2.column) {
      return 1;
    } else if(key1.column < key2.column) {
      return -1;
    } else if(key1.offset > key2.offset) {
      return 1;
    } else if (key1.offset < key2.offset) {
      return -1;
    }
    return 0;
}

StringBuffer currentRow = null;
long currentRowNum = -1;

Reduce(ColumnOffsetKey key, Iterable<Text> part) {
    //This is a bit ugly because we need to detect changes to the row; there is probably a cleaner way to do this
    for(Text value : part) {
        if(currentRowNum != key.column) {
            //Output the currentRow if needed
            if(currentRow != null) {
                collect(null, new Text(currentRow.toString()));
            }
            currentRow = new StringBuffer();
            currentRowNum = key.column;
        } else {
            currentRow.append(',');
        }
        currentRow.append(value.toString());
    }
}

//This is called at the end of the reducer in the new API, or something like it; I don't remember the method name off the top of my head
cleanup() {
    if(currentRow != null) {
        collect(null, new Text(currentRow.toString()));
    }
}
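
To make the data flow above easier to follow, here is a small single-process sketch in plain Java with no Hadoop dependency. It only imitates the map, partition, sort and reduce steps in memory. The names TransposeSketch, Cell and transpose are made up for illustration, and the offset is approximated as a running character count, assuming single-byte characters and "\n" line endings. It also assumes every input row has the same number of columns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class TransposeSketch {
    // Hypothetical stand-in for ColumnOffsetKey plus the field value it carries.
    static final class Cell {
        final long column;
        final long offset;
        final String value;

        Cell(long column, long offset, String value) {
            this.column = column;
            this.offset = offset;
            this.value = value;
        }
    }

    // Same arithmetic as partition() above: each partition gets a contiguous column range.
    static int partition(long column, long maxColumns, int numPartitions) {
        return (int) (((double) column / maxColumns) * numPartitions);
    }

    static List<String> transpose(List<String> lines, int numPartitions) {
        // "Map": emit one cell per field, keyed by (column, offset of the line).
        List<Cell> cells = new ArrayList<>();
        long offset = 0;
        long maxColumns = 0;
        for (String line : lines) {
            String[] parts = line.split(",", -1);
            maxColumns = Math.max(maxColumns, parts.length);
            for (int i = 0; i < parts.length; i++) {
                cells.add(new Cell(i, offset, parts[i]));
            }
            offset += line.length() + 1; // approximation of the byte offset of the next line
        }

        // "Shuffle": route every cell to its partition.
        List<List<Cell>> buckets = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) {
            buckets.add(new ArrayList<>());
        }
        for (Cell c : cells) {
            buckets.get(partition(c.column, maxColumns, numPartitions)).add(c);
        }

        // "Sort" and "Reduce": order by column, then offset, and start a new
        // output row whenever the column changes.
        Comparator<Cell> byColumnThenOffset =
                Comparator.comparingLong((Cell c) -> c.column).thenComparingLong(c -> c.offset);
        List<String> out = new ArrayList<>();
        for (List<Cell> bucket : buckets) {
            bucket.sort(byColumnThenOffset);
            StringBuilder currentRow = null;
            long currentRowNum = -1;
            for (Cell c : bucket) {
                if (c.column != currentRowNum) {
                    if (currentRow != null) {
                        out.add(currentRow.toString());
                    }
                    currentRow = new StringBuilder();
                    currentRowNum = c.column;
                } else {
                    currentRow.append(',');
                }
                currentRow.append(c.value);
            }
            if (currentRow != null) {
                out.add(currentRow.toString()); // plays the role of cleanup()
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("a,b,c", "d,e,f", "g,h,i");
        transpose(input, 2).forEach(System.out::println);
    }
}
```

With two partitions, columns 0 and 1 land in the first bucket and column 2 in the second, so reading the buckets in order gives the transposed rows in their original column order.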

On 6/22/12 5:35 AM, "Subir S" <[EMAIL PROTECTED]> wrote:

Thank you for the inputs!

@Norbert,
 But a Group By column number clause also does not guarantee that the order of
columns is preserved. Even the row number would have to be known, so that
in the end we can sort each row by row number using a
nested FOREACH. But since sorting is not preserved after that FOREACH, the
data in the row may again be in the wrong order for other operations.

To me it seems like it is not possible to do this in MR.
On Fri, Jun 22, 2012 at 12:56 AM, Robert Evans <[EMAIL PROTECTED]> wrote:

> That may be true if you have multiple reduces; I have not read through the
> code very closely.  So you can run it with a single reduce, or you can
> write a custom partitioner to do it.  You only need to know the length of
> the column, and then you can divide them up appropriately, kind of like how
> the total order partitioner does it.
>
> --Bobby Evans
>
> On 6/21/12 1:15 PM, "Norbert Burger" <[EMAIL PROTECTED]> wrote:
>
> While it may be fine for many cases, if I'm reading the Nectar code
> correctly, that transpose doesn't guarantee anything about the order of
> rows within each column.  In other words, transposing:
>
> a - b - c
> d - e - f
> g - h - i
>
> may give you different permutations of "a - d - g" as the first row,
> depending on shuffle order.  You can trivially avoid this with one
Robert Evans 2012-06-21, 19:26