Re: The NCDC Weather Data for Hadoop the Definitive Guide
To avoid creating folders recursively, follow the steps below:
1. Create a folder on your local drive
  (I created "/home/sujit/Desktop/Data/")

2. Create the script below and run it:

cd /home/sujit/Desktop/Data/
for i in {1901..2012}
do
wget -r --no-parent --reject "index.html*" http://ftp3.ncdc.noaa.gov/pub/data/noaa/$i/
done
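Once the per-year directories are mirrored, the many small per-station .gz files can be combined into one file per year, which is easier for Hadoop to process than thousands of tiny inputs. A minimal sketch, assuming the mirror landed under "/home/sujit/Desktop/Data/pub/data/noaa/" as the script above would produce; the "combined" output directory name is my own choice:

```shell
# Combine all station files for each year into a single gzip per year.
# gzip streams can be concatenated and still decompress as one stream.
DATA=/home/sujit/Desktop/Data/pub/data/noaa
OUT=/home/sujit/Desktop/Data/combined
mkdir -p "$OUT"
for i in $(seq 1901 2012); do
  [ -d "$DATA/$i" ] || continue
  cat "$DATA/$i"/*.gz > "$OUT/$i.gz"
done
```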

On Fri, Nov 16, 2012 at 1:01 PM, Sujit Dhamale <[EMAIL PROTECTED]> wrote:

> Hi,
> If needed, you can run the script below to store the data on your local system:
> for i in {1901..2012}
> do
> cd /home/ubuntu/work/
> wget -r -np -nH --cut-dirs=3 -R "index.html*" http://ftp3.ncdc.noaa.gov/pub/data/noaa/$i/
> mkdir -p /home/ubuntu/work/files
> cp $i/*.gz /home/ubuntu/work/files/
> rm -r $i/
> done
> On Mon, Feb 13, 2012 at 3:43 PM, Andy Doddington <[EMAIL PROTECTED]> wrote:
>> OK, well for starters, I think you can safely ignore the PDF data; to
>> paraphrase Star Wars: “that isn’t the data
>> in which you are interested”.
>> Page 16 of the book describes the data format and refers to a data store
>> that contains directories for each year from
>> 1901 to 2001. It also shows the naming of .gz files within a sample
>> directory (1990). The files in this directory have
>> names "010010-99999-1990.gz", "010014-99999-1990.gz",
>> "010015-99999-1990.gz", and so on…
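Each of these files holds fixed-width records, so a quick sanity check against the format described in the book is possible from the shell. A sketch, assuming one of the files named above has been downloaded into the current directory; the column range for the year (characters 16-19) matches the book's parsing code, but should be verified against the format document on the FTP site:

```shell
# Print the year field (characters 16-19 of the first fixed-width record)
# from one station file; offsets assumed from the book's sample code.
gunzip -c 010010-99999-1990.gz | head -1 | cut -c 16-19
```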
>> Referring back to the NCDC web site, at the link below (
>> http://www.ncdc.noaa.gov) and clicking on the ‘Free Data’
>> link on the left-hand side of the screen brings up a new screen, as shown
>> below:
>> Clicking again on the ‘Free Data’ link in the middle section of this page
>> brings up another page, listing the available
>> data sets:
>> As this page notes, although some of this data needs to be paid for,
>> there is at least one ‘free’ option within
>> each section. For simplicity, I went for the first one - the one labelled
>> “3505 FTP data access” - which the comment
>> says is free. I used anonymous FTP and found that this site contained
>> directories for each year from 1901 to 2012.
>> I expect the additional directories reflect the fact that time has moved
>> on since the book was written :-)
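After mirroring, it is worth confirming that all the year directories Andy describes (1901 through 2012) actually came down. A small sketch; the local "pub/data/noaa" path is an assumption matching the wget layout in the scripts earlier in the thread:

```shell
# Report any year directories missing from a local mirror of pub/data/noaa/.
base=pub/data/noaa
for y in $(seq 1901 2012); do
  [ -d "$base/$y" ] || echo "missing: $y"
done
```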
>> There are also several text or pdf files that provide further information
>> on the contents of the site. I suggest you
>> read some of these to get more details. One of these is called
>> "ish-format-document.pdf" and it seems to describe
>> the document format in some detail. If you open this, you can check
>> whether it matches the format expected by
>> the Hadoop sample code. There is also a ‘software’ directory, which
>> contains various bits of code that might
>> prove useful.
>> On drilling down into the directory for 1990, I get the following list of
>> files:
>> Which looks close enough to the file names in the Hadoop book - I’d
>> guess that these are the correct files.
>> Given the passage of time, it is still possible that the file format has
>> changed to make it incompatible with the
>> Hadoop code. However, it shouldn’t be that difficult to modify the code
>> to suit the new format (which is very
>> well documented, as already noted).
>> Good luck!
>>  Andy
>> ——————————————
>> On 12 Feb 2012, at 08:50, Bing Li wrote:
>> Andy,
>> Since there is a lot of data in the free-data area of the site, I cannot figure
>> out which one is the one discussed in the book. Any format differences might
>> cause the source code to get exceptions. Some data is even in PDF format!
>> Thanks so much!
>> Bing
>> On Sun, Feb 12, 2012 at 4:35 PM, Andy Doddington <[EMAIL PROTECTED]> wrote:
>> According to Page 15 of the book, this data is available from the US
>> National Climatic Data Center, at
>> http://www.ncdc.noaa.gov. Once you get to this site, there is a menu of
>> links on the left-hand side of the
>> page, listed under the heading ‘Data & Products’. I suspect that the entry
>> labelled ‘Free Data’ is the most
>> likely area you need to investigate :-)