The list is a bad idea. When you have millions of rows per region,
are you going to put millions of them in memory in your list?
Your MR job will scan the entire table, row by row. If you modify the
current row, the scanner will not look at it again when it moves on
to the next one. So there is no real issue with that.
Also, instead of doing the puts one by one, I recommend buffering
them (let's say, 100 at a time) and sending them as a batch. Don't
forget to push the remaining ones at the end of the job. The drawback
is that if the MR job crashes, you will have some rows already
processed and not marked as processed.
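
To make the batching idea concrete, here is a minimal sketch in plain Java. It is only an illustration, not tested against a cluster. BatchBuffer and BatchBufferDemo are hypothetical names, not HBase classes, and the sink here just records batch sizes. In a real mapper I would assume the sink calls table.put(List<Put>), add() is called from map(), and flush() is called from cleanup().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper (not part of HBase): collects items and hands them
// to a sink in fixed-size batches.
class BatchBuffer<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink;
    private final List<T> pending;

    BatchBuffer(int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.sink = sink;
        this.pending = new ArrayList<>(batchSize);
    }

    // Call once per processed row, e.g. from map().
    void add(T item) {
        pending.add(item);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    // Call at the end of the job, e.g. from cleanup(), to push what is left.
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(pending)); // copy so the sink owns its list
        pending.clear();
    }
}

class BatchBufferDemo {
    public static void main(String[] args) {
        // Stand-in sink: assumption for the demo, it only records batch sizes.
        List<Integer> batchSizes = new ArrayList<>();
        BatchBuffer<String> buffer =
            new BatchBuffer<>(100, batch -> batchSizes.add(batch.size()));

        for (int i = 0; i < 250; i++) {
            buffer.add("row-" + i);
        }
        buffer.flush(); // pushes the remaining 50

        System.out.println(batchSizes); // prints [100, 100, 50]
    }
}
```

With this, memory stays bounded by the batch size instead of growing with the number of rows. At most one batch of rows can be processed but not yet marked if the task dies.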
2013/6/22 Rohit Kelkar <[EMAIL PROTECTED]>:
> I have a usecase where I push data in my HTable in waves followed by
> Mapper-only processing. Currently once a row is processed in map I
> immediately mark it as processed=true. For this inside the map I execute a
> table.put(isprocessed=true). I am not sure if modifying the table like this
> is a good idea. I am also concerned that I am modifying the same table that
> I am running the MR job on.
> So I am thinking of another approach where I accumulate the processed rows
> in a list (or a better compact data structure) and use the cleanup method
> of the MR job to execute all the table.put(isprocessed=true) at once.
> What is the suggested best practice?
> - R