Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Plain View
Pig, mail # dev - Review Request: PIG-3215 [piggybank] Add LTSVLoader to load LTSV files


+
Taku Miyakawa 2013-02-28, 22:53
+
Jonathan Coveney 2013-03-01, 14:54
+
Cheolsoo Park 2013-03-04, 01:46
+
Taku Miyakawa 2013-03-03, 06:35
+
Jonathan Coveney 2013-03-04, 15:05
Copy link to this message
-
Re: Review Request: PIG-3215 [piggybank] Add LTSVLoader to load LTSV files
Taku Miyakawa 2013-03-05, 23:10


> On March 1, 2013, 2:54 p.m., Jonathan Coveney wrote:
> > contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/LTSVLoader.java, line 157
> > <https://reviews.apache.org/r/9685/diff/1/?file=263710#file263710line157>
> >
> >     In the case where they do not give it a schema, I think that they shouldn't have to actually specify that it is a map. It's implicit. Furthermore, I think that instead of passing the Schema as an argument, if they do an "as (schema)" THEN it should attempt to return a Schema in accordance with what you provide.
> >    
> >     Open to argument, though.
>
> Taku Miyakawa wrote:
>     > In the case where they do not give it a schema, I think that they shouldn't have to actually specify that it is a map.
>    
>     Yes ":map[]" is redundant. I will remove it.
>    
>     > if they do an "as (schema)" THEN it should attempt to return a Schema in accordance with what you provide.
>    
>     Does this code meet your point?
>    
>     <code>
>     access = LOAD 'access.log' USING org.apache.pig.piggybank.storage.LTSVLoader() AS (host:chararray, ua:chararray);
>     ua = FOREACH access GENERATE ua;
>     dump ua;
>     </code>
>    
>     It will be good, but I could not find a way to implement the function in that manner. LoadFunc seems to have no method which can get an input schema specified by users.
>
> Cheolsoo Park wrote:
>     >> It will be good, but I could not find a way to implement the function in that manner. LoadFunc seems to have no method which can get an input schema specified by users.
>    
>     You don't need access to the schema given by the AS clause inside LoadFunc. You can just make LoadMetaData.getSchema() return null and load every field as bytearray in LoadFunc.getNext(). Then, Pig inserts a FOREACH operator after LOAD and does type conversion using the schema given by the AS clause.
>    
>     This is what happens when PigStorage is not given a JSON schema file, but the output schema is defined in the AS clause.
>
> Jonathan Coveney wrote:
>     Cheolsoo is totally correct. That said, this should probably be documented better. Also, the fact that we rely on an implicit foreach is so gross... but this is how you do it!

I think the function is similar to JsonLoader, it also takes a schema as an argument.

An LTSV file itself contains field names in each row.  A column consists of a name and a value (<name>:<value>).  And users may want to select fields by their names.  That is why the loader function should know field names, and it takes a schema as an argument.

Let me explain in detail.

In an LTSV file, orders of columns may differ for each row (although it is not a common case).  And a row may lack some columns.  In this example, the second row lacks "ua" column, and the third row has a different order of columns, but it's OK.

{data}
host:pc.example.com req:GET /index.html ua:Opera/9.80
host:user.example.net req:GET /favicon.ico
ua:Mozilla/5.0 req:GET /news.html host:workstation.example.org
{/data}

To load "host" column and "ua" column into Pig fields, the loader function should know their names, because it cannot depend on positions of columns.

{code}
access = LOAD 'access.log' USING LTSVLoader('host, ua');
DUMP access;

-- Output:
-- (pc.example.com, Opera/9.80)
-- (user.example.net,)
-- (workstation.example.org, Mozilla/5.0)
{/code}
> On March 1, 2013, 2:54 p.m., Jonathan Coveney wrote:
> > contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/LTSVLoader.java, line 576
> > <https://reviews.apache.org/r/9685/diff/1/?file=263710#file263710line576>
> >
> >     Why doesn't this support pushProjection as well?

It was my sloth. I will implement it.
> On March 1, 2013, 2:54 p.m., Jonathan Coveney wrote:
> > contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/LTSVLoader.java, line 547
> > <https://reviews.apache.org/r/9685/diff/1/?file=263710#file263710line547>
> >
> >     It is probably useful to LOG.debug (or LOG.warn) once for each label that comes up that isn't used. Right now it would be silent, and people might want to know.

LOG.debug for the first occurrence of a unused column seems good. I will implement in that way.
- Taku
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9685/#review17239
On Feb. 28, 2013, 10:53 p.m., Taku Miyakawa wrote:
+
Jonathan Coveney 2013-03-19, 16:45
+
Taku Miyakawa 2013-03-05, 23:38
+
Taku Miyakawa 2013-03-05, 23:42
+
Taku Miyakawa 2013-03-07, 22:46
+
Taku Miyakawa 2013-03-07, 23:31
+
Taku Miyakawa 2013-03-09, 08:18