-Re: Review Request: PIG-3215 [piggybank] Add LTSVLoader to load LTSV files
Jonathan Coveney 2013-03-19, 16:45
> On March 1, 2013, 2:54 p.m., Jonathan Coveney wrote:
> > contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/LTSVLoader.java, line 157
> > <https://reviews.apache.org/r/9685/diff/1/?file=263710#file263710line157>
> > In the case where they do not give it a schema, I think that they shouldn't have to actually specify that it is a map. It's implicit. Furthermore, I think that instead of passing the Schema as an argument, if they do an "as (schema)" THEN it should attempt to return a Schema in accordance with what you provide.
> > Open to argument, though.
> Taku Miyakawa wrote:
> > In the case where they do not give it a schema, I think that they shouldn't have to actually specify that it is a map.
> Yes ":map" is redundant. I will remove it.
> > if they do an "as (schema)" THEN it should attempt to return a Schema in accordance with what you provide.
> Does this code meet your point?
> access = LOAD 'access.log' USING org.apache.pig.piggybank.storage.LTSVLoader() AS (host:chararray, ua:chararray);
> ua = FOREACH access GENERATE ua;
> dump ua;
> It will be good, but I could not find a way to implement the function in that manner. LoadFunc seems to have no method which can get an input schema specified by users.
> Cheolsoo Park wrote:
> >> It will be good, but I could not find a way to implement the function in that manner. LoadFunc seems to have no method which can get an input schema specified by users.
> You don't need access to the schema given by the AS clause inside LoadFunc. You can just make LoadMetaData.getSchema() return null and load every field as bytearray in LoadFunc.getNext(). Then, Pig inserts a FOREACH operator after LOAD and does type conversion using the schema given by the AS clause.
> This is what happens when PigStorage is not given a JSON schema file, but the output schema is defined in the AS clause.
> Jonathan Coveney wrote:
> Cheolsoo is totally correct. That said, this should probably be documented better. Also, the fact that we rely on an implicit foreach is so gross... but this is how you do it!
> Taku Miyakawa wrote:
> I think the function is similar to JsonLoader, it also takes a schema as an argument.
> An LTSV file itself contains field names in each row. A column consists of a name and a value (<name>:<value>). And users may want to select fields by their names. That is why the loader function should know field names, and it takes a schema as an argument.
> Let me explain in detail.
> In an LTSV file, orders of columns may differ for each row (although it is not a common case). And a row may lack some columns. In this example, the second row lacks "ua" column, and the third row has a different order of columns, but it's OK.
> host:pc.example.com req:GET /index.html ua:Opera/9.80
> host:user.example.net req:GET /favicon.ico
> ua:Mozilla/5.0 req:GET /news.html host:workstation.example.org
> To load "host" column and "ua" column into Pig fields, the loader function should know their names, because it cannot depend on positions of columns.
> access = LOAD 'access.log' USING LTSVLoader('host, ua');
> DUMP access;
> -- Output:
> -- (pc.example.com, Opera/9.80)
> -- (user.example.net,)
> -- (workstation.example.org, Mozilla/5.0)
Hmm, ask on the dev@ mailing list about this. I understand your reasoning and it is reasonable, but that said, this syntax irks me. I think there has to be a way to do this, but I am not sure what it is. If there isn't let's add one :) But there should be. Once we nail that, I will give it another go but it looks close to a +1.
This is an automatically generated e-mail. To reply, visit:
On March 9, 2013, 8:18 a.m., Taku Miyakawa wrote: