MapReduce user mailing list: map reduce and sync


Lucas Bernardi 2013-02-21, 21:17
Hemanth Yamijala 2013-02-23, 07:37
Lucas Bernardi 2013-02-23, 13:45
Hemanth Yamijala 2013-02-23, 14:54
Hemanth Yamijala 2013-02-25, 01:31
Lucas Bernardi 2013-02-25, 01:46
Harsh J 2013-02-25, 07:31
Lucas Bernardi 2013-02-25, 22:03

Re: map reduce and sync
Ok, so I found a workaround for this issue, and I'll share it here for others.
The key problem is that Hadoop won't update the file size until the file is
closed, so FileInputFormat sees never-closed files as empty and generates no
splits for the MapReduce job.
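
To make the symptom concrete, here is a minimal sketch of the behavior
(Hadoop 1.x API; the path and class name are made up for illustration):

    // Hypothetical demo: sync() pushes the bytes to the DataNodes, but the
    // length the NameNode reports stays 0 until the file is closed.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SyncLengthDemo {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path p = new Path("/tmp/sync-demo.log"); // hypothetical path
            FSDataOutputStream out = fs.create(p);
            out.writeBytes("search=foo results=42\n");
            out.sync(); // data reaches the DataNodes...
            System.out.println(fs.getFileStatus(p).getLen()); // ...but prints 0
            out.close();
            System.out.println(fs.getFileStatus(p).getLen()); // real length now
        }
    }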

To fix this problem I changed the way the file length is calculated,
overriding the listStatus method in a new InputFormat implementation which
inherits from FileInputFormat:

    // Assumed imports: org.apache.hadoop.fs.FileStatus,
    // org.apache.hadoop.hdfs.DFSClient,
    // org.apache.hadoop.hdfs.DFSClient.DFSInputStream (Hadoop 1.x),
    // and Guava's Lists.
    @Override
    protected List<FileStatus> listStatus(JobContext job) throws IOException {
        List<FileStatus> listStatus = super.listStatus(job);
        List<FileStatus> result = Lists.newArrayList();
        DFSClient dfsClient = null;
        try {
            dfsClient = new DFSClient(job.getConfiguration());
            for (FileStatus fileStatus : listStatus) {
                long len = fileStatus.getLen();
                if (len == 0) {
                    // The NameNode reports length 0 while the file is still
                    // open, so ask an input stream, which also checks the
                    // last block being written.
                    DFSInputStream open = dfsClient.open(fileStatus.getPath().toUri().getPath());
                    long fileLength = open.getFileLength();
                    open.close();
                    FileStatus fileStatus2 = new FileStatus(fileLength,
                            fileStatus.isDir(), fileStatus.getReplication(),
                            fileStatus.getBlockSize(), fileStatus.getModificationTime(),
                            fileStatus.getAccessTime(), fileStatus.getPermission(),
                            fileStatus.getOwner(), fileStatus.getGroup(), fileStatus.getPath());
                    result.add(fileStatus2);
                } else {
                    result.add(fileStatus);
                }
            }
        } finally {
            if (dfsClient != null) {
                dfsClient.close();
            }
        }
        return result;
    }

This worked just fine for me.
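
For reference, wiring such a custom format into a job would look roughly like
this; SyncAwareInputFormat is a placeholder name for the subclass above, and
the job name and input path are made up:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class JobSetup {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "search-log-analytics");
            job.setJarByClass(JobSetup.class);
            // Swap in the subclass whose listStatus reports real lengths.
            job.setInputFormatClass(SyncAwareInputFormat.class);
            FileInputFormat.addInputPath(job, new Path("/logs/searches"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }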

What do you think?

Thanks!
Lucas

On Mon, Feb 25, 2013 at 7:03 PM, Lucas Bernardi <[EMAIL PROTECTED]> wrote:

> It looks like getSplits in FileInputFormat is ignoring 0-length files....
> That also would explain the weird behavior of tail, which seems to always
> jump to the start since file length is 0.
>
> So, basically, sync doesn't update the file length, so any code based on
> file size is unreliable.
>
> Am I right?
>
> How can I get around this?
>
> Lucas
>
>
> On Mon, Feb 25, 2013 at 12:38 PM, Lucas Bernardi <[EMAIL PROTECTED]> wrote:
>
>> I didn't notice, thanks for the heads up.
>>
>>
>> On Mon, Feb 25, 2013 at 4:31 AM, Harsh J <[EMAIL PROTECTED]> wrote:
>>
>>> Just an aside (I've not tried to look at the original issue yet), but
>>> Scribe has not been maintained (nor has seen a release) in over a year
>>> now, looking at the commit history. The same goes for both Facebook's
>>> and Twitter's forks.
>>>
>>> > On Mon, Feb 25, 2013 at 7:16 AM, Lucas Bernardi <[EMAIL PROTECTED]>
>>> > wrote:
>>> > Yeah, I looked at Scribe; it looks good but sounds like too much for my
>>> > problem. I'd rather make it work the simple way. Could you please post
>>> > your code? Maybe I'm doing something wrong on the sync side. Maybe a
>>> > buffer size, block size or some other parameter is different...
>>> >
>>> > Thanks!
>>> > Lucas
>>> >
>>> >
>>> > On Sun, Feb 24, 2013 at 10:31 PM, Hemanth Yamijala
>>> > <[EMAIL PROTECTED]> wrote:
>>> >>
>>> >> I am using the same version of Hadoop as you.
>>> >>
>>> >> Can you look at something like Scribe, which AFAIK fits the use case
>>> >> you describe?
>>> >>
>>> >> Thanks
>>> >> Hemanth
>>> >>
>>> >>
>>> >> On Sun, Feb 24, 2013 at 3:33 AM, Lucas Bernardi <[EMAIL PROTECTED]>
>>> >> wrote:
>>> >>>
>>> >>> That is exactly what I did, but in my case it is as if the file were
>>> >>> empty; the job counters say no bytes were read.
>>> >>> I'm using Hadoop 1.0.3. Which version did you try?
>>> >>>
>>> >>> What I'm trying to do is just some basic analytics on a product search
>>> >>> system. There is a search service; every time a user performs a search,
>>> >>> the search string and the results are stored in this file, and the file is