Re: map reduce and sync
Ok, so I found a workaround for this issue, I share it here for others:
The key problem is that Hadoop won't update the file size until the file is
closed, so FileInputFormat sees never-closed files as empty files and
generates no splits for the map reduce job.
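
Just to illustrate (a quick sketch against the Hadoop 1.x API; the path and
data here are made up): a writer can sync so that readers see the data, but
the reported length stays at zero until close():

    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/logs/searches.log"); // hypothetical path
    FSDataOutputStream out = fs.create(path);
    out.write("search record\n".getBytes());
    out.sync(); // data becomes visible to readers (hflush() in later versions)

    // ...but the NameNode still reports length 0 until close(), so
    // FileInputFormat.getSplits() generates no splits:
    long len = fs.getFileStatus(path).getLen(); // 0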

To fix this problem I changed the way the file length is calculated by
overriding the listStatus method in a new InputFormat implementation, which
inherits from FileInputFormat:

    @Override
    protected List<FileStatus> listStatus(JobContext job) throws IOException {
        List<FileStatus> listStatus = super.listStatus(job);
        List<FileStatus> result = Lists.newArrayList();
        DFSClient dfsClient = null;
        try {
            dfsClient = new DFSClient(job.getConfiguration());
            for (FileStatus fileStatus : listStatus) {
                long len = fileStatus.getLen();
                if (len == 0) {
                    // The NameNode reports length 0 for files that were synced
                    // but never closed; ask a DFSInputStream for the real length.
                    DFSInputStream open = dfsClient.open(fileStatus.getPath().toUri().getPath());
                    long fileLength = open.getFileLength();
                    open.close();
                    // Rebuild the FileStatus with the corrected length.
                    FileStatus fileStatus2 = new FileStatus(fileLength,
                            fileStatus.isDir(), fileStatus.getReplication(),
                            fileStatus.getBlockSize(), fileStatus.getModificationTime(),
                            fileStatus.getAccessTime(), fileStatus.getPermission(),
                            fileStatus.getOwner(), fileStatus.getGroup(),
                            fileStatus.getPath());
                    result.add(fileStatus2);
                } else {
                    result.add(fileStatus);
                }
            }
        } finally {
            if (dfsClient != null) {
                dfsClient.close();
            }
        }
        return result;
    }

This worked just fine for me.
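
In case it helps, hooking the new format into a job is just the standard
InputFormat wiring (a sketch; SyncAwareTextInputFormat is whatever you name
your FileInputFormat subclass, and the paths are made up):

    Job job = new Job(conf, "search-log-analytics");
    job.setInputFormatClass(SyncAwareTextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path("/logs"));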

What do you think?

Thanks!
Lucas

On Mon, Feb 25, 2013 at 7:03 PM, Lucas Bernardi <[EMAIL PROTECTED]> wrote:

> It looks like getSplits in FileInputFormat is ignoring zero-length files...
> That also would explain the weird behavior of tail, which seems to always
> jump to the start since file length is 0.
>
> So, basically, sync doesn't update the file length, so any code based on
> file size is unreliable.
>
> Am I right?
>
> How can I get around this?
>
> Lucas
>
>
> On Mon, Feb 25, 2013 at 12:38 PM, Lucas Bernardi <[EMAIL PROTECTED]> wrote:
>
>> I didn't notice, thanks for the heads up.
>>
>>
>> On Mon, Feb 25, 2013 at 4:31 AM, Harsh J <[EMAIL PROTECTED]> wrote:
>>
>>> Just an aside (I've not tried to look at the original issue yet), but
>>> Scribe has not been maintained (nor has seen a release) in over a year
>>> now -- looking at the commit history. Same is the case with both
>>> Facebook's and Twitter's forks.
>>>
>>> On Mon, Feb 25, 2013 at 7:16 AM, Lucas Bernardi <[EMAIL PROTECTED]>
>>> wrote:
>>> > Yeah I looked at Scribe, looks good but sounds like too much for my
>>> > problem. I'd rather make it work the simple way. Could you please
>>> > post your code, maybe I'm doing something wrong on the sync side.
>>> > Maybe a buffer size, block size or some other parameter is
>>> > different...
>>> >
>>> > Thanks!
>>> > Lucas
>>> >
>>> >
>>> > On Sun, Feb 24, 2013 at 10:31 PM, Hemanth Yamijala
>>> > <[EMAIL PROTECTED]> wrote:
>>> >>
>>> >> I am using the same version of Hadoop as you.
>>> >>
>>> >> Can you look at something like Scribe, which AFAIK fits the use
>>> >> case you describe.
>>> >>
>>> >> Thanks
>>> >> Hemanth
>>> >>
>>> >>
>>> >> On Sun, Feb 24, 2013 at 3:33 AM, Lucas Bernardi <[EMAIL PROTECTED]>
>>> >> wrote:
>>> >>>
>>> >>> That is exactly what I did, but in my case it is as if the file
>>> >>> were empty, the job counters say no bytes read.
>>> >>> I'm using Hadoop 1.0.3; which version did you try?
>>> >>>
>>> >>> What I'm trying to do is just some basic analytics on a product
>>> >>> search system. There is a search service; every time a user performs
>>> >>> a search, the search string and the results are stored in this file,
>>> >>> and the file is