MapReduce user mailing list - map reduce and sync


Lucas Bernardi 2013-02-21, 21:17
Hemanth Yamijala 2013-02-23, 07:37
Lucas Bernardi 2013-02-23, 13:45
Hemanth Yamijala 2013-02-23, 14:54
Hemanth Yamijala 2013-02-25, 01:31
Lucas Bernardi 2013-02-25, 01:46
Re: map reduce and sync
Harsh J 2013-02-25, 07:31
Just an aside (I've not looked at the original issue yet), but Scribe
has not been maintained (nor seen a release) in over a year now,
judging by the commit history. The same goes for both Facebook's and
Twitter's forks.

On Mon, Feb 25, 2013 at 7:16 AM, Lucas Bernardi <[EMAIL PROTECTED]> wrote:
> Yeah, I looked at Scribe; it looks good but sounds like too much for my problem.
> I'd rather make it work the simple way. Could you please post your code? Maybe
> I'm doing something wrong on the sync side. Maybe a buffer size, block
> size or some other parameter is different...
>
> Thanks!
> Lucas
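
To rule out the parameter-mismatch hypothesis above, the relevant values can be pinned down explicitly at create time instead of being inherited from the cluster config. A minimal sketch on Hadoop 1.0.x (the path and values are illustrative, not from either poster's code):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CreateWithExplicitParams {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // create(path, overwrite, bufferSize, replication, blockSize):
            // fixing these removes io.file.buffer.size / dfs.block.size
            // differences between the two environments as a variable.
            FSDataOutputStream out = fs.create(
                    new Path("/tmp/sync-test.txt"), true, 4096, (short) 3,
                    64L * 1024 * 1024);
            out.writeBytes("test\n");
            out.sync(); // flush to the datanodes without closing the file
        }
    }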
>
>
> On Sun, Feb 24, 2013 at 10:31 PM, Hemanth Yamijala
> <[EMAIL PROTECTED]> wrote:
>>
>> I am using the same version of Hadoop as you.
>>
>> Could you look at something like Scribe? AFAIK it fits the use case you
>> describe.
>>
>> Thanks
>> Hemanth
>>
>>
>> On Sun, Feb 24, 2013 at 3:33 AM, Lucas Bernardi <[EMAIL PROTECTED]> wrote:
>>>
>>> That is exactly what I did, but in my case it is as if the file were
>>> empty; the job counters say no bytes were read.
>>> I'm using Hadoop 1.0.3. Which version did you try?
>>>
>>> What I'm trying to do is just some basic analytics on a product search
>>> system. There is a search service; every time a user performs a search, the
>>> search string and the results are stored in this file, and the file is
>>> sync'ed. I'm actually using Pig to do some basic counts. It doesn't work,
>>> like I described, because the file looks empty to the map reduce
>>> components. I thought it was about Pig, but I wasn't sure, so I tried a
>>> simple MR job and used the word count to test whether the map reduce
>>> components actually see the sync'ed bytes.
>>>
>>> Of course if I close the file, everything works perfectly, but I don't
>>> want to close the file every so often, since that means I should create
>>> another one (since there is no append support), and that would end up with
>>> too many tiny files, something we know is bad for MR performance, and I
>>> don't want to add more parts to this (like a file merging tool). I think
>>> using sync is a clean solution, since we don't care about writing
>>> performance, so I'd rather keep it like this if I can make it work.
>>>
>>> Any idea besides hadoop version?
>>>
>>> Thanks!
>>>
>>> Lucas
>>>
>>>
>>>
>>> On Sat, Feb 23, 2013 at 11:54 AM, Hemanth Yamijala
>>> <[EMAIL PROTECTED]> wrote:
>>>>
>>>> Hi Lucas,
>>>>
>>>> I tried something like this but got different results.
>>>>
>>>> I wrote code that opened a file on HDFS, wrote a line, and called sync.
>>>> Without closing the file, I ran a wordcount with that file as input. It
>>>> worked fine and was able to count the words that were sync'ed (even though
>>>> the file length shows as 0, as you noted, in fs -ls).
>>>>
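A minimal sketch of a test along these lines, assuming Hadoop 1.0.x (the paths and the wordcount invocation are illustrative; only FSDataOutputStream.sync() is taken from the thread):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SyncVisibilityTest {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FSDataOutputStream out = fs.create(new Path("/tmp/sync-test/input.txt"));
            out.writeBytes("one line of words to count\n");
            out.sync(); // flush to the datanodes; the file stays open
            // While this process waits (stream still open), run a wordcount
            // over the directory from another shell, e.g.:
            //   hadoop jar hadoop-examples-1.0.3.jar wordcount /tmp/sync-test /tmp/sync-out
            System.in.read(); // press Enter to exit once the job has finished
        }
    }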
>>>> So, not sure what's happening in your case. In the MR job, do the job
>>>> counters indicate no bytes were read?
>>>>
>>>> On a different note though, if you can describe a little more of what you
>>>> are trying to accomplish, we could probably work out a better solution.
>>>>
>>>> Thanks
>>>> hemanth
>>>>
>>>>
>>>> On Sat, Feb 23, 2013 at 7:15 PM, Lucas Bernardi <[EMAIL PROTECTED]>
>>>> wrote:
>>>>>
>>>>> Hello Hemanth, thanks for answering.
>>>>> The file is opened by a separate process, not map reduce related at all.
>>>>> You can think of it as a servlet receiving requests; every time a request
>>>>> is received, it is written to this file and
>>>>> org.apache.hadoop.fs.FSDataOutputStream.sync() is invoked.
>>>>>
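A minimal sketch of such a writer, assuming Hadoop 1.0.x (the class and method names below are illustrative; only FSDataOutputStream.sync() is from the thread):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Stands in for the request-handling process described above:
    // one long-lived output stream, one sync() per logged request.
    public class SearchLogWriter {
        private final FSDataOutputStream out;

        public SearchLogWriter(Path logFile) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            out = fs.create(logFile);
        }

        // Called once per search request; the file is never closed.
        public synchronized void log(String query, String results) throws Exception {
            out.writeBytes(query + "\t" + results + "\n");
            out.sync(); // push the record to the datanodes
        }
    }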
>>>>> At the same time, I want to run a map reduce job over this file. Simply
>>>>> running the word count example doesn't seem to work; it is as if the file
>>>>> were empty.
>>>>>
>>>>> hadoop fs -tail works just fine, and reading the file using
>>>>> org.apache.hadoop.fs.FSDataInputStream also works OK.
>>>>>
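For comparison, a direct read of the same file might look like this (a sketch; the path is illustrative):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TailSyncedFile {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FSDataInputStream in = fs.open(new Path("/logs/searches.log"));
            BufferedReader reader = new BufferedReader(new InputStreamReader(in, "UTF-8"));
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // sync'ed-but-unclosed data shows up here
            }
            reader.close();
        }
    }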
>>>>> One last thing: the web interface doesn't show the contents, and
>>>>> hadoop fs -ls says the file is empty.
>>>>>
>>>>> What am I doing wrong?

Harsh J
Lucas Bernardi 2013-02-25, 22:03
Lucas Bernardi 2013-03-04, 16:09