Master instance crash in replicated fail-over AMQ (version 7.6)

We have deployed AMQ (version 7.6) with replicated fail-over in our environment. The master instance is crashing with the following error:

# JRE version: Java(TM) SE Runtime Environment (8.0_202-b08) (build 1.8.0_202-b08)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.202-b08 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# j  org.apache.activemq.artemis.nativo.jlibaio.LibaioContext.done(Lorg/apache/activemq/artemis/nativo/jlibaio/SubmitInfo;)V+1
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#

'journal-type' is set to ASYNCIO.
'journal-max-io' is not configured.

Responses

Thanks for raising your problem John. Can you provide any more details as to the circumstances under which the broker crashes? Was it during startup or at runtime while serving clients? What protocols do the clients use to connect to the broker? Any details on your use case may be relevant. Is the problem reproducible on your end? If yes, instructions on how to reproduce the problem would be very helpful.

Best Regards, Torsten Mielke

The issue is not consistent and not easily reproducible in our environment. The broker crashed at runtime under normal load, not at startup. We use the JMS API to consume from queues and publish messages to queues. It is an HA setup with replicated fail-over. Only the master instance crashed, and fail-over worked fine at the same time. journal-type is configured as 'ASYNCIO' because the broker is deployed on a Linux machine.

      <persistence-enabled>true</persistence-enabled>

      <!-- this could be ASYNCIO, MAPPED, NIO
           ASYNCIO: Linux Libaio
           MAPPED: mmap files
           NIO: Plain Java Files
       -->
      <journal-type>ASYNCIO</journal-type>

      <paging-directory>data/paging</paging-directory>

      <bindings-directory>data/bindings</bindings-directory>

      <journal-directory>data/journal</journal-directory>

      <large-messages-directory>data/large-messages</large-messages-directory>

      <journal-datasync>true</journal-datasync>

      <journal-min-files>2</journal-min-files>

      <journal-pool-files>10</journal-pool-files>

      <journal-device-block-size>4096</journal-device-block-size>

      <journal-file-size>10M</journal-file-size>

      <acceptors>
         <!-- Acceptor for every supported protocol -->
         <acceptor name="artemis">tcp://<ip>:61616?tcpSendBufferSize=1048576;tcpReceiveBufferSize=1048576;protocols=CORE,AMQP,STOMP,HORNETQ,MQTT,OPENWIRE;useEpoll=true;amqpCredits=1000;amqpLowCredits=300;amqpDuplicateDetection=true</acceptor>

      </acceptors>

Interesting... Can you please confirm the exact version of your operating system? Also, you may want to enable core dumps so that a core file is written the next time the JVM crashes.

If you happen to have an active support subscription for Red Hat AMQ, I suggest opening a support case and moving the discussion and investigation there.

I am not sure if you have access to https://access.redhat.com/solutions/5080901. It links to bug https://issues.redhat.com/browse/ENTMQBR-3642 which was raised just 3 days ago.

The operating system is RHEL 7.7 and the Java version is Java HotSpot(TM) 8.0_202-b08 (build 1.8.0_202-b08). The solution at https://access.redhat.com/solutions/5080901 suggests switching from 'ASYNCIO' to 'NIO' as a workaround. Is the same workaround applicable here?

Hi Rony! Could you provide the kernel version that you are using on RHEL 7.7?

Hi Tiago, the kernel version is Linux 3.10.0-1127.el7.x86_64

That's the problematic version.

You should be able to fix the issue by downgrading to Linux 3.10.0-1062.18.1.el7.x86_64 #1 SMP Wed Feb 12 14:08:31 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

If you can confirm that this fixes it, that would be very useful information for our tickets.

Clebert, thank you so much for the prompt response. The issue is not reproducible in our local environment; it only happens sporadically at the customer's on-premises site. Unfortunately we won't be able to try out a different Linux kernel version for now. We will switch the journal type to NIO as an easy workaround.

You can see the same version you mentioned in my extended post, which has all the technical details of the issue.

You should be able to solve this by downgrading your kernel to:

Linux 3.10.0-1062.18.1.el7.x86_64 #1 SMP Wed Feb 12 14:08:31 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

I'm still waiting for the kernel version you are using.

Yes the same workaround should be applicable in your case as well. That change may affect journal performance marginally but typically it won't be noticeable.

I hope that helps. Torsten Mielke
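For anyone applying the NIO workaround by script, here is a minimal sketch. The helper name `switch_journal_to_nio` and the example path are assumptions for illustration, not part of AMQ. It only rewrites the `<journal-type>` element shown in the configuration earlier in this thread and keeps a backup copy of the original file.

```shell
#!/bin/sh
# Hypothetical helper (not an AMQ tool): change the journal type in a
# broker.xml from ASYNCIO to NIO, keeping a backup of the original file.
switch_journal_to_nio() {
    broker_xml="$1"
    if [ ! -f "$broker_xml" ]; then
        echo "no such file: $broker_xml" >&2
        return 1
    fi
    # Keep the untouched original next to the edited file.
    cp "$broker_xml" "$broker_xml.bak" || return 1
    # Rewrite only the exact ASYNCIO element; everything else is copied as-is.
    sed 's#<journal-type>ASYNCIO</journal-type>#<journal-type>NIO</journal-type>#' \
        "$broker_xml.bak" > "$broker_xml"
}

# Example call (the path is an assumption; adjust it to your broker instance):
# switch_journal_to_nio /path/to/broker-instance/etc/broker.xml
```

The broker most likely needs a restart before the new journal type takes effect, so it is worth trying this on a test instance first.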

This issue is a recent bug introduced in RHEL 7.8. Based on the tests I performed, you shouldn't see this on RHEL 7.7 unless you upgraded the kernel.

https://issues.redhat.com/browse/ENTMQBR-3642

Here is the version where the issue has regressed: Linux 3.10.0-1136.el7.x86_64 #1 SMP Fri Apr 17 11:40:59 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux

The issue is that artemis-native uses a reaping algorithm (common to certain databases) that reads from a ring buffer shared with the kernel. This is known as reaping events from the kernel:

https://github.com/apache/activemq-artemis-native/blob/81006cd5b4947d09f27f85729ef6e98e24cc6031/src/main/c/org_apache_activemq_artemis_nativo_jlibaio_LibaioContext.c#L110-L148

If no events are found by the reaping procedure, I then perform a libaio syscall to the kernel: https://github.com/apache/activemq-artemis-native/blob/81006cd5b4947d09f27f85729ef6e98e24cc6031/src/main/c/org_apache_activemq_artemis_nativo_jlibaio_LibaioContext.c#L155

On certain kernel versions this leads to duplicate events. When I then perform the call linked here, the object has already been destroyed, which causes the crash:

https://github.com/apache/activemq-artemis-native/blob/81006cd5b4947d09f27f85729ef6e98e24cc6031/src/main/c/org_apache_activemq_artemis_nativo_jlibaio_LibaioContext.c#L824-L826

I am making a release of activemq-artemis-native today that will detect this situation and, instead of crashing, stop using the reaping method. I will also add a system property to control this in advance.

In the short term, you can either move to RHEL 8.x, where this issue does not exist, or change the journal type to NIO.

@Torsten: Can you please confirm your kernel version?

For some reason links are not working here. I added the same comment to this JIRA, and you can probably follow it better there:

https://issues.redhat.com/browse/ENTMQBR-3642

Clebert, Rony is the affected user, he needs to confirm the kernel version. I am in RH Support trying to help :-) Rony, can you confirm the kernel version please?

Clebert, it is RHEL 7.7 with kernel version Linux 3.10.0-1127.el7.x86_64.

Rony, I tried on RHEL 7.7 with kernel 3.10.0-1062.18.1.el7.x86_64 and it looks like the problem does not occur. You may try to downgrade your kernel to that version if that is possible for you. I suggest trying it in a test environment first, before doing the steps below on a production server.

This command lists all kernel packages available for installation:

$ yum --showduplicates list kernel

If kernel-3.10.0-1062.18.1.el7 is available, you may try to install it using:

$ yum install kernel-3.10.0-1062.18.1.el7

Then select it at boot time, as I think it will not be the default one because it is an older version.

I hope it helps.
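To check a host quickly against the kernel versions mentioned in this thread, something like the following sketch may help. The function name `classify_kernel` and its output labels are made up for illustration, and the lists contain only the versions reported above, so "unknown" means nothing more than "not mentioned in this thread".

```shell
#!/bin/sh
# Hypothetical helper: classify a kernel release string using only the
# versions reported in this thread. This is not an official affected list.
classify_kernel() {
    case "$1" in
        # Reported as problematic / regressed in this thread.
        3.10.0-1127.*|3.10.0-1136.*) echo "reported-affected" ;;
        # Reported as working in this thread.
        3.10.0-1062.18.1.*)          echo "reported-ok" ;;
        # Anything else was not discussed here.
        *)                           echo "unknown" ;;
    esac
}

# Classify the kernel of the current host.
classify_kernel "$(uname -r)"
```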

Hi Tiago, thank you so much for the prompt response. The issue is not reproducible in our local environment; it only happens sporadically at the customer's on-premises site. Unfortunately we won't be able to try out a different Linux kernel version for now. We will switch the journal type to NIO as an easy workaround.

Thank you.

I have access to a workaround for the kernel issue, and I'm producing a new version of artemis-native.

We will attach the new version to the KCS when available:

https://access.redhat.com/solutions/5080901