MapReduce >> mail # user >> map reduce and sync


Re: map reduce and sync
It looks like getSplits in FileInputFormat is ignoring 0-length files.
That would also explain the weird behavior of tail, which seems to always
jump to the start since the file length is 0.

So, basically, sync doesn't update the file length, and any code based on
file size is unreliable.

Am I right?

How can I get around this?

Lucas
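
The effect described above can be illustrated with a small, self-contained
model. This is only a sketch, not Hadoop's actual getSplits code: FileEntry,
Split, and plan_splits are hypothetical names, and the assumption is that
splits are planned purely from the reported file length, so a file that has
been sync'ed but still reports length 0 contributes no input.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FileEntry:
    """Hypothetical stand-in for a file listing entry (not a Hadoop class)."""
    path: str
    reported_length: int  # length the file listing reports
    readable_bytes: int   # bytes a reader could actually fetch after sync


@dataclass
class Split:
    """Hypothetical stand-in for an input split."""
    path: str
    start: int
    length: int


def plan_splits(files: List[FileEntry], split_size: int) -> List[Split]:
    """Plan splits from the reported length only, skipping empty files."""
    splits = []
    for f in files:
        if f.reported_length == 0:
            # Assumption: a file that reports length 0 yields no input,
            # even if sync'ed bytes are readable.
            continue
        offset = 0
        while offset < f.reported_length:
            length = min(split_size, f.reported_length - offset)
            splits.append(Split(f.path, offset, length))
            offset += length
    return splits


files = [
    FileEntry("closed.log", reported_length=150, readable_bytes=150),
    # Still open: data was sync'ed, but the reported length is 0.
    FileEntry("open-synced.log", reported_length=0, readable_bytes=80),
]
splits = plan_splits(files, split_size=100)
for s in splits:
    print(s)
```

In this model, the 80 readable bytes of open-synced.log never reach a mapper,
because nothing in the planning step looks past the reported length.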

On Mon, Feb 25, 2013 at 12:38 PM, Lucas Bernardi <[EMAIL PROTECTED]> wrote:

> I didn't notice, thanks for the heads up.
>
>
> On Mon, Feb 25, 2013 at 4:31 AM, Harsh J <[EMAIL PROTECTED]> wrote:
>
>> Just an aside (I've not tried to look at the original issue yet), but
>> Scribe has not been maintained (nor has seen a release) in over a year
>> now -- looking at the commit history. Same case with both Facebook and
>> Twitter's fork.
>>
>> On Mon, Feb 25, 2013 at 7:16 AM, Lucas Bernardi <[EMAIL PROTECTED]> wrote:
>> > Yeah, I looked at Scribe. It looks good but sounds like too much for my
>> > problem. I'd rather make it work the simple way. Could you please post
>> > your code? Maybe I'm doing something wrong on the sync side. Maybe a
>> > buffer size, block size, or some other parameter is different...
>> >
>> > Thanks!
>> > Lucas
>> >
>> >
>> > On Sun, Feb 24, 2013 at 10:31 PM, Hemanth Yamijala
>> > <[EMAIL PROTECTED]> wrote:
>> >>
>> >> I am using the same version of Hadoop as you.
>> >>
>> >> Can you look at something like Scribe, which AFAIK fits the use case
>> >> you describe.
>> >>
>> >> Thanks
>> >> Hemanth
>> >>
>> >>
>> >> On Sun, Feb 24, 2013 at 3:33 AM, Lucas Bernardi <[EMAIL PROTECTED]> wrote:
>> >>>
>> >>> That is exactly what I did, but in my case it is as if the file were
>> >>> empty: the job counters say no bytes were read.
>> >>> I'm using Hadoop 1.0.3. Which version did you try?
>> >>>
>> >>> What I'm trying to do is just some basic analytics on a product search
>> >>> system. There is a search service; every time a user performs a search,
>> >>> the search string and the results are stored in this file, and the file
>> >>> is sync'ed. I'm actually using Pig to do some basic counts. It doesn't
>> >>> work, as I described, because the file looks empty to the MapReduce
>> >>> components. I thought it was about Pig, but I wasn't sure, so I tried a
>> >>> simple MR job and used the word count to test whether the MapReduce
>> >>> components actually see the sync'ed bytes.
>> >>>
>> >>> Of course, if I close the file, everything works perfectly, but I don't
>> >>> want to close the file every so often, since that means I would have to
>> >>> create another one (since there is no append support), and that would
>> >>> end up with too many tiny files, something we know is bad for MR
>> >>> performance. I don't want to add more parts to this (like a file-merging
>> >>> tool). I think using sync is a clean solution, since we don't care about
>> >>> write performance, so I'd rather keep it like this if I can make it work.
>> >>>
>> >>> Any idea besides hadoop version?
>> >>>
>> >>> Thanks!
>> >>>
>> >>> Lucas
>> >>>
>> >>>
>> >>>
>> >>> On Sat, Feb 23, 2013 at 11:54 AM, Hemanth Yamijala
>> >>> <[EMAIL PROTECTED]> wrote:
>> >>>>
>> >>>> Hi Lucas,
>> >>>>
>> >>>> I tried something like this but got different results.
>> >>>>
>> >>>> I wrote code that opened a file on HDFS, wrote a line, and called sync.
>> >>>> Without closing the file, I ran a wordcount with that file as input. It
>> >>>> worked fine and was able to count the words that were sync'ed (even
>> >>>> though the file length seems to come as 0, as you noted, in fs -ls).
>> >>>>
>> >>>> So, I'm not sure what's happening in your case. In the MR job, do the
>> >>>> job counters indicate no bytes were read?
>> >>>>
>> >>>> On a different note, though, if you can describe a little more what you
>> >>>> are trying to accomplish, we could probably work out a better solution.
>> >>>>
>> >>>> Thanks
>> >>>> hemanth
>> >>>>
>> >>>>
>> >>>> On Sat, Feb 23, 2013 at 7:15 PM, Lucas Bernardi <[EMAIL PROTECTED]>