Read issues while healing is in progress

Hello Experts,

I was introduced to RHGS recently and am getting to know it step by step. As I understand it, a dispersed volume with 4+2 erasure coding tolerates the failure of at most 2 bricks. Please correct me if I am wrong.

When a brick is offline, heal entries start to pile up. When the bad brick comes back online, the healing process actually starts healing the data and continues until all 6 blocks (4 data + 2 parity) are in sync.

My question is: at any point in this situation, whether 2 bricks are offline and no healing is possible, or the bricks are back online and healing is in progress, is there any possibility that an application reading the file gets a read error?

Responses

Yes, your understanding is correct. A 4+2 configuration will continue serving I/O even with two bricks (the maximum) failed.

Yes. As soon as the offline bricks come back online, the healing process is triggered.

No. As long as you have 4 chunks of data available, you won't (should not) get an I/O error.

Hello Bipin, thanks for your prompt response. In theory, yes, I understand your point, but on production setups we see I/O errors while healing is going on, and we are not sure whether healing is the cause. If the heal count is not going down and we also have I/O errors, does that mean the file is corrupted beyond the redundancy factor? That is why I asked the question: if a file can be healed, it should get healed successfully at some point. Can we expect I/O errors before the file is healed successfully?

No, I/O errors should not appear as long as you have 4 correct data chunks, even if some heal is pending. If one or more of the 6 chunks are corrupted, you might get I/O errors. You need to find out 1) why healing is not progressing, and 2) how many such files there are, and then work on those files to understand the issue.
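
The counting rule in this reply can be written down as a small check. This is only a sketch of the arithmetic, not Gluster code: the function name `is_readable` and its parameters are made up for illustration, and it assumes the volume already knows which fragments are healthy. Silent corruption, where a bad chunk is not recognized as bad, is not modeled here and is exactly the case where I/O errors can still show up.

```python
def is_readable(healthy_fragments: int, data: int = 4, redundancy: int = 2) -> bool:
    """Return True if a dispersed file can still be reconstructed.

    Hypothetical helper: a data+redundancy volume needs at least `data`
    healthy fragments out of data+redundancy to serve a read.
    """
    total = data + redundancy
    if not 0 <= healthy_fragments <= total:
        raise ValueError(f"healthy_fragments must be between 0 and {total}")
    return healthy_fragments >= data


# 4+2 volume: walk through 0..6 unhealthy fragments (offline or pending heal).
for bad in range(7):
    status = "readable" if is_readable(6 - bad) else "I/O error expected"
    print(bad, "unhealthy ->", status)
```

With the defaults, reads succeed with up to 2 unhealthy fragments and fail from the third one on, which matches the "at most 2 bricks" limit discussed above.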

I would suggest reaching out to RH support or the upstream community for help, depending on the bits you are using, with the following:
- logs / sos-report
- the files where you see failures (I/O errors)
- getfattr -d -m. -ehex output for those files, collected from all the bricks
- gluster volume heal <VOLNAME> info output
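
To keep the collection consistent across many failing files, the commands requested above can be generated rather than typed by hand. The following is a rough sketch that only builds the command strings and does not run anything. The helper name `diagnostic_commands`, the volume name `myvol`, and the brick paths are all assumptions for illustration; the getfattr command has to be run on each brick server against the file's path on that brick.

```python
import shlex


def diagnostic_commands(volume: str, brick_file_paths: list[str]) -> list[str]:
    """Build the heal-info and xattr-dump commands for one affected file.

    Hypothetical helper: `brick_file_paths` holds the path of the same file
    on every brick (six entries for a 4+2 volume).
    """
    cmds = [f"gluster volume heal {shlex.quote(volume)} info"]
    for path in brick_file_paths:
        # Dump all extended attributes of the brick copy in hex.
        cmds.append(f"getfattr -d -m . -e hex {shlex.quote(path)}")
    return cmds


# Made-up volume name and brick layout, for illustration only.
paths = [f"/bricks/brick{i}/data/file.bin" for i in range(1, 7)]
for cmd in diagnostic_commands("myvol", paths):
    print(cmd)
```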