Dealing with changing file format
Mohit Anchlia 2012-07-02, 21:09
I am wondering what's the right way to design reading input and writing output when the file format may change over time. For instance, we might start with "field1,field2,field3" but at some point add a new field4 to the input. What's the best way to deal with such scenarios? Keep a timestamped catalog of changes?
Re: Dealing with changing file format
Robert Evans 2012-07-02, 21:17
There are several different ways. One is to use something like HCatalog to track the format and location of the dataset. This may be overkill for your problem, but it will grow with you. Another is to store the schema with the data when it is written out. Your code may then need to dynamically adjust to when the field is there and when it is not.
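A minimal sketch of the second option, assuming the "schema" is simply a CSV header line written at the top of each file. The names `read_records`, `OLD_FILE`, `NEW_FILE` and `DEFAULTS` are made up for illustration and are not part of Hadoop or HCatalog; only the Python standard library is used.

```python
import csv
import io


def read_records(text, defaults=None):
    """Parse CSV text whose first line is its own header (schema stored with the data)."""
    defaults = defaults or {}
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
        # Start from the defaults, then overlay whatever columns this file actually has.
        record = dict(defaults)
        record.update(row)
        yield record


# Hypothetical file contents: one written before field4 existed, one after.
OLD_FILE = "field1,field2,field3\na,b,c\n"
NEW_FILE = "field1,field2,field3,field4\nd,e,f,g\n"

# Value to use when a file predates field4 (an assumption; choose what suits the job).
DEFAULTS = {"field4": None}

old_rows = list(read_records(OLD_FILE, DEFAULTS))
new_rows = list(read_records(NEW_FILE, DEFAULTS))
```

Because each file describes itself, the reading code does not need to know when the format changed; it only needs a default for fields that older files lack.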
--Bobby Evans
On 7/2/12 4:09 PM, "Mohit Anchlia" <[EMAIL PROTECTED]> wrote:
>I am wondering what's the right way to go about designing reading input and
>output where file format may change over period. For instance we might
>start with "field1,field2,field3" but at some point we add new field4 in
>the input. What's the best way to deal with such scenarios? Keep a catalog
>of changes that timestamped?
Re: Dealing with changing file format
Harsh J 2012-07-03, 02:10
In addition to what Robert says, using a schema-based approach such as Apache Avro can also help here. The schemas in Avro can evolve over time if done right, while not breaking old readers.
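To illustrate what "evolve if done right" means, here is a rough sketch that mimics two of Avro's schema resolution rules in plain Python: a reader field that the writer did not have is filled from its declared default, and a writer field that the reader does not know is ignored. The two schemas are written in Avro's JSON schema syntax, but `resolve` is a hypothetical helper, not the Avro library API, and the record name `Event` is an assumption.

```python
import json

# Version 1 of the schema: what the old files were written with.
SCHEMA_V1 = json.loads("""
{"type": "record", "name": "Event", "fields": [
  {"name": "field1", "type": "string"},
  {"name": "field2", "type": "string"},
  {"name": "field3", "type": "string"}
]}
""")

# Version 2 adds field4 WITH a default, which is what keeps the change compatible.
SCHEMA_V2 = json.loads("""
{"type": "record", "name": "Event", "fields": [
  {"name": "field1", "type": "string"},
  {"name": "field2", "type": "string"},
  {"name": "field3", "type": "string"},
  {"name": "field4", "type": ["null", "string"], "default": null}
]}
""")


def resolve(record, reader_schema):
    """Project an already-decoded record (a dict) onto the reader's schema."""
    out = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            out[name] = record[name]        # present in the written data
        elif "default" in field:
            out[name] = field["default"]    # missing: fall back to the default
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return out                              # fields unknown to the reader are dropped


old_record = {"field1": "a", "field2": "b", "field3": "c"}
new_record = {"field1": "d", "field2": "e", "field3": "f", "field4": "g"}

new_reader_on_old_data = resolve(old_record, SCHEMA_V2)
old_reader_on_new_data = resolve(new_record, SCHEMA_V1)
```

Adding a field without a default would make `resolve(old_record, ...)` raise here, which corresponds to the kind of change that breaks readers of older data.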
On Tue, Jul 3, 2012 at 2:47 AM, Robert Evans <[EMAIL PROTECTED]> wrote:
> There are several different ways. One of the ways is to use something
> like Hcatalog to track the format and location of the dataset. This may
> be overkill for your problem, but it will grow with you. Another is to
> store the scheme with the data when it is written out. Your code may need
> to the dynamically adjust to when the field is there and when it is not.
>
> --Bobby Evans
>
> On 7/2/12 4:09 PM, "Mohit Anchlia" <[EMAIL PROTECTED]> wrote:
>
>>I am wondering what's the right way to go about designing reading input and
>>output where file format may change over period. For instance we might
>>start with "field1,field2,field3" but at some point we add new field4 in
>>the input. What's the best way to deal with such scenarios? Keep a catalog
>>of changes that timestamped?
-- Harsh J
Re: Dealing with changing file format
Mohit Anchlia 2012-07-03, 04:40
On Mon, Jul 2, 2012 at 7:10 PM, Harsh J <[EMAIL PROTECTED]> wrote:
> In addition to what Robert says, using a schema-based approach such as
> Apache Avro can also help here. The schemas in Avro can evolve over
> time if done right, while not breaking old readers.
Thanks! Is there a good example of this that I can look at?
> On Tue, Jul 3, 2012 at 2:47 AM, Robert Evans <[EMAIL PROTECTED]> wrote:
> > There are several different ways. One of the ways is to use something
> > like Hcatalog to track the format and location of the dataset. This may
> > be overkill for your problem, but it will grow with you. Another is to
> > store the scheme with the data when it is written out. Your code may need
> > to the dynamically adjust to when the field is there and when it is not.
> >
> > --Bobby Evans
> >
> > On 7/2/12 4:09 PM, "Mohit Anchlia" <[EMAIL PROTECTED]> wrote:
> >
> >>I am wondering what's the right way to go about designing reading input and
> >>output where file format may change over period. For instance we might
> >>start with "field1,field2,field3" but at some point we add new field4 in
> >>the input. What's the best way to deal with such scenarios? Keep a catalog
> >>of changes that timestamped?
>
> --
> Harsh J