SSH refused on RHEL 5.4

The server was up and running, and no recent change or upgrade had been made.
The configuration on the pair server is identical; however, only the SSH connection to db2 is refused, while SSH to db1 works.

[oracle@to5uspdb1 ~]$ ssh -v 10.128.36.19
OpenSSH_4.3p2, OpenSSL 0.9.8e-fips-rhel5 01 Jul 2008
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Applying options for *
debug1: Connecting to 10.128.36.19 [10.128.36.19] port 22.
debug1: Connection established.
debug1: identity file /opt/oracle/.ssh/identity type -1
debug1: identity file /opt/oracle/.ssh/id_rsa type 1
debug1: identity file /opt/oracle/.ssh/id_dsa type 2
debug1: loaded 3 keys
debug1: Remote protocol version 2.0, remote software version OpenSSH_4.3
debug1: match: OpenSSH_4.3 pat OpenSSH*
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_4.3
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: server->client aes128-cbc hmac-md5 none
debug1: kex: client->server aes128-cbc hmac-md5 none
debug1: SSH2_MSG_KEX_DH_GEX_REQUEST(1024<1024<8192) sent
debug1: expecting SSH2_MSG_KEX_DH_GEX_GROUP
Connection closed by 10.128.36.19
[oracle@to5uspdb1 ~]$

xxx# ssh root@to5uspdb2
key_load_public: invalid format
ssh_packet_read: Connection closed
xxx#

What is the solution, and how can we apply it, given that there is no SSH access?

thanks,

Responses

I assume 10.128.36.19 = to5uspdb2, is this correct? When a network interface is configured with an incorrect MTU value, the SSH connection negotiation will typically break down at exactly this point. (I guess that's the point where the first SSH protocol messages longer than the standard Ethernet MTU are exchanged.)

If that is the cause, the misconfiguration might be on one of the servers, or in the network hardware the connection passes through. For example, the network segment might be configured for jumbo frames, but the switch to which to5uspdb2 is connected might be missing that essential bit of configuration. Perhaps the network administrator did configure the switch correctly but failed to save the new configuration, and the switch has since been rebooted for whatever reason.

To troubleshoot this, you may need a network administrator to help you. If you don't have remote console access to that system, you may also need to go physically to the server and log in to the server console - or find someone else who can do that for you.
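
One quick test you could run from db1, assuming plain ICMP ping is permitted between the two servers: send a packet that is exactly 1500 bytes on the wire and may not be fragmented. If small pings succeed but this one fails, an MTU or Path MTU Discovery problem between the hosts is likely (1472 bytes of payload + 8 bytes of ICMP header + 20 bytes of IP header = 1500 bytes):

# ping -M do -s 1472 -c 3 10.128.36.19

You can repeat the test with a larger payload (for example -s 8972 for a 9000-byte frame) if jumbo frames are supposed to be in use on that segment.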

Hi Matti, I appreciate your response. Yes, you are correct about the hostname. Other than communicating with the appropriate network team, what action needs to be taken on the server console?

thanks,

On the console, the MTU values for each network interface should be verified. Either the ifconfig or ip link commands can be used for this:

# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 08:00:27:27:90:2b brd ff:ff:ff:ff:ff:ff
...

[root@rhel511 ~]# ifconfig
eth0      Link encap:Ethernet  HWaddr 08:00:27:27:90:2B  
          inet addr:192.168.42.17  Bcast:192.168.42.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:84 errors:0 dropped:0 overruns:0 frame:0
          TX packets:65 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:13692 (13.3 KiB)  TX bytes:8634 (8.4 KiB)
...

In this example, there is mtu 1500 in ip link output, and MTU:1500 in ifconfig output for eth0. 1500 is the standard value in Ethernet networks. If the network segment is configured for jumbo frames (which can improve the performance of large data transfers) the MTU value might be something like 9000 or close to it. On the other hand, sometimes a reduced MTU value might be necessary; ask your network administrator for the correct value.
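
If you prefer, the current MTU of every interface can also be read directly from sysfs; this is just another way of seeing the same values as above:

# for i in /sys/class/net/*; do echo "$i: $(cat $i/mtu)"; done
/sys/class/net/eth0: 1500
/sys/class/net/lo: 16436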

If the MTU value for any network interface on the server needs to be corrected, that will be the next step on the server console. For example, if the MTU value for eth0 is 9000 and the network administrator says it should be 1500, one of the following commands will fix it temporarily:

# ifconfig eth0 mtu 1500
or
# ip link set eth0 mtu 1500

If a server-side MTU mismatch is causing the SSH connection difficulties, the server should now be reachable over SSH. But the change is not persistent yet; the next step would be to verify that the correct MTU is also specified in the /etc/sysconfig/network-scripts/ifcfg-* files. For example, if the MTU for the eth0 interface should be 1300, there should be a line like this in /etc/sysconfig/network-scripts/ifcfg-eth0:

MTU=1300
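
For reference, a complete RHEL 5 ifcfg-eth0 with a static address might then look something like this (the addresses below are placeholders, not your actual values):

DEVICE=eth0
BOOTPROTO=static
IPADDR=192.168.42.17
NETMASK=255.255.255.0
ONBOOT=yes
MTU=1300

After editing the file, the setting takes effect on the next reboot, or immediately after restarting the interface with ifdown eth0 followed by ifup eth0 (note that this briefly drops connectivity on that interface).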

But if the SSH connection still does not work, the next step on the console would be to check the /var/log/secure log file for SSH connection errors: if the sshd daemon rejects a connection, the rejection reason will be stored in this log file. The rejection reason is not sent to the SSH client that is attempting to connect, as the client has not been authenticated yet and sshd assumes the client might be malicious. The rejection reason should be useful for continued troubleshooting.
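
For example, on the console you could watch the most recent sshd messages while retrying the connection from db1; the exact messages depend on what sshd is unhappy about:

# tail -f /var/log/secure
or, to see only the last few sshd entries:
# grep sshd /var/log/secure | tail -n 20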

(Side note: If a smaller-than-normal MTU value is needed, it might indicate that at some point an over-zealous firewall administrator has blocked the ICMP Fragmentation Needed messages. That breaks Path MTU Discovery (PMTUD), and without PMTUD you may see problems just like the one you're experiencing when accessing these servers through VPNs or from various cloud services. Blocking ICMP Fragmentation Needed messages has been ill-advised for more than 20 years now.)
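
If you want to check whether those ICMP Fragmentation Needed messages ever reach the server, a packet capture on the console would show them; this assumes tcpdump is installed and eth0 is the interface in question (type 3, code 4 is "destination unreachable, fragmentation needed"):

# tcpdump -n -i eth0 'icmp[icmptype] == 3 and icmp[icmpcode] == 4'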

Just to bring up a few points here: there hasn't been any configuration change on the server's network interface. No MTU value is explicitly configured on the pair node either, yet it works, and db2 is supposed to have the same configuration. So if the root cause is the MTU value on the switch, once the network team corrects it on their side (it is not yet confirmed whether the issue is on the network side or not), should I be able to get my SSH remote access back?

Yes, you should.

Hi Matti, I rebooted the node and got SSH access back for a few hours before it went down again. Now I have also lost console access through the ILOM. These are the logs on the server in /var/log/messages:

Jul 15 18:46:39 to5uspdb2 kernel: attempt to access beyond end of device
Jul 15 18:46:39 to5uspdb2 kernel: sdb1: rw=0, want=697810624, limit=209712447
Jul 15 18:46:39 to5uspdb2 kernel: attempt to access beyond end of device
Jul 15 18:46:39 to5uspdb2 kernel: sdb1: rw=0, want=697810120, limit=209712447
Jul 15 18:46:39 to5uspdb2 kernel: Buffer I/O error on device sdb1, logical block 87226264
Jul 15 18:46:39 to5uspdb2 kernel: attempt to access beyond end of device
Jul 15 18:46:39 to5uspdb2 kernel: sdb1: rw=0, want=697817792, limit=209712447
Jul 15 18:46:39 to5uspdb2 kernel: attempt to access beyond end of device
Jul 15 18:46:39 to5uspdb2 kernel: sdb1: rw=0, want=697817608, limit=209712447
Jul 15 18:46:39 to5uspdb2 kernel: Buffer I/O error on device sdb1, logical block 87227200
Jul 15 18:46:39 to5uspdb2 kernel: attempt to access beyond end of device
Jul 15 18:46:39 to5uspdb2 kernel: sdb1: rw=0, want=697818048, limit=209712447
Jul 15 18:46:39 to5uspdb2 kernel: attempt to access beyond end of device
Jul 15 18:46:39 to5uspdb2 kernel: sdb1: rw=0, want=697818304, limit=209712447
Jul 15 18:46:39 to5uspdb2 kernel: attempt to access beyond end of device
Jul 15 18:46:39 to5uspdb2 kernel: sdb1: rw=0, want=697817800, limit=209712447
Jul 15 18:46:39 to5uspdb2 kernel: Buffer I/O error on device sdb1, logical block 87227224
Jul 15 18:46:39 to5uspdb2 kernel: attempt to access beyond end of device
Jul 15 18:46:39 to5uspdb2 kernel: sdb1: rw=0, want=655557080, limit=209712447
Jul 15 18:46:39 to5uspdb2 kernel: attempt to access beyond end of device
Jul 15 18:46:39 to5uspdb2 kernel: sdb1: rw=0, want=655557064, limit=209712447
Jul 15 18:46:39 to5uspdb2 kernel: Buffer I/O error on device sdb1, logical block 81944632
Jul 15 18:46:39 to5uspdb2 kernel: attempt to access beyond end of device
Jul 15 18:46:39 to5uspdb2 kernel: sdb1: rw=0, want=655090304, limit=209712447
Jul 15 18:46:39 to5uspdb2 kernel: attempt to access beyond end of device
Jul 15 18:46:39 to5uspdb2 kernel: sdb1: rw=0, want=655090560, limit=209712447
Jul 15 18:46:39 to5uspdb2 kernel: attempt to access beyond end of device
Jul 15 18:46:39 to5uspdb2 kernel: sdb1: rw=0, want=655090056, limit=209712447
Jul 15 18:46:39 to5uspdb2 kernel: Buffer I/O error on device sdb1, logical block 81886256
Jul 15 18:46:41 to5uspdb2 kernel: attempt to access beyond end of device
Jul 15 18:46:41 to5uspdb2 kernel: sdb1: rw=0, want=655574296, limit=209712447
Jul 15 18:46:41 to5uspdb2 kernel: attempt to access beyond end of device
Jul 15 18:46:41 to5uspdb2 kernel: sdb1: rw=0, want=655574280, limit=209712447
Jul 15 18:46:41 to5uspdb2 kernel: Buffer I/O error on device sdb1, logical block 81946784
Jul 15 18:46:41 to5uspdb2 kernel: attempt to access beyond end of device
Jul 15 18:46:41 to5uspdb2 kernel: sdb1: rw=0, want=655557376, limit=209712447
Jul 15 18:46:41 to5uspdb2 kernel: attempt to access beyond end of device
Jul 15 18:46:41 to5uspdb2 kernel: sdb1: rw=0, want=655557632, limit=209712447
Jul 15 18:46:41 to5uspdb2 kernel: attempt to access beyond end of device
Jul 15 18:46:41 to5uspdb2 kernel: sdb1: rw=0, want=655557128, limit=209712447
Jul 15 18:46:41 to5uspdb2 kernel: Buffer I/O error on device sdb1, logical block 81944640
Jul 15 18:46:41 to5uspdb2 kernel: attempt to access beyond end of device
Jul 15 18:46:41 to5uspdb2 kernel: sdb1: rw=0, want=695451264, limit=209712447
Jul 15 18:46:41 to5uspdb2 kernel: attempt to access beyond end of device
Jul 15 18:46:41 to5uspdb2 kernel: sdb1: rw=0, want=695451016, limit=209712447
Jul 15 18:46:41 to5uspdb2 kernel: Buffer I/O error on device sdb1, logical block 86931376
Jul 15 18:46:44 to5uspdb2 kernel: attempt to access beyond end of device
Jul 15 18:46:44 to5uspdb2 kernel: sdb1: rw=0, want=688804096, limit=209712447
Jul 15 18:46:44 to5uspdb2 kernel: attempt to access beyond end of device
Jul 15 18:46:44 to5uspdb2 kernel: sdb1: rw=0, want=688803848, limit=209712447
Jul 15 18:46:44 to5uspdb2 kernel: Buffer I/O error on device sdb1, logical block 86100480
Jul 15 20:03:59 to5uspdb2 kernel: mptbase: ioc0: RAID STATUS CHANGE for PhysDisk 1 id=2
Jul 15 20:03:59 to5uspdb2 kernel: mptbase: ioc0:   SMART data received, ASC/ASCQ = 5dh/00h
Jul 31 14:43:17 to5uspdb2 syslogd 1.4.1: restart.
Jul 31 14:43:17 to5uspdb2 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Jul 31 14:43:17 to5uspdb2 kernel: Linux version 2.6.18-164.el5 (mockbuild@x86-003.build.bos.redhat.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-46)) #1 SMP Tue Aug 18 15:51:48 EDT 2009
---------------------------------------------
[root@to5uspdb2 log]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1              15G  3.2G   11G  24% /
/dev/sda3             9.5G  2.0G  7.1G  22% /var
/dev/sda2              46G   20G   24G  45% /opt
tmpfs                 7.9G     0  7.9G   0% /dev/shm
/dev/sdf1             495M   89M  406M  18% /ocluster
/dev/sdb1             467G   73G  394G  16% /backup
/dev/sdc1             200G   40G  161G  20% /reporting
/dev/scd0             3.4G  3.4G     0 100% /media/RHEL_5.4 x86_64 DVD

If you originally lost access to the system on July 15 or so, those messages might well be related. It looks like the system is trying to access /dev/sdb1 beyond the end of that partition. That might be caused by corrupted filesystem metadata, or possibly corrupted data in RAM: in other words, the system may have been thinking the partition was bigger than it actually is. Whatever it is, something was very wrong with that system on that day. A hardware problem of some kind is possible, and it might be the root cause of your SSH connection problems too.
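
One way to sanity-check that on the console, assuming /dev/sdb1 holds an ext3 filesystem as is typical on RHEL 5, is to compare the partition size the kernel sees with the size the filesystem metadata claims:

# blockdev --getsz /dev/sdb1
# dumpe2fs -h /dev/sdb1 | grep -i 'block count'

The first command prints the partition size in 512-byte sectors (it should match the "limit=209712447" value in the kernel messages); the second prints the filesystem's own block count from the superblock. If the filesystem believes it is larger than the partition, the metadata is corrupted.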

These messages indicate that your hardware RAID might be having some sort of problem:

Jul 15 20:03:59 to5uspdb2 kernel: mptbase: ioc0: RAID STATUS CHANGE for PhysDisk 1 id=2
Jul 15 20:03:59 to5uspdb2 kernel: mptbase: ioc0:   SMART data received, ASC/ASCQ = 5dh/00h

ASC/ASCQ pair 5dh/00h means "Failure Prediction Threshold Exceeded", so it means at least one of your disks might be failing or about to fail.

You may have to reboot the node again, and then the next step should probably be verifying the status of your hardware RAID sets. Use the RAID controller management tools provided by your hardware vendor to get more information about their state.
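
For example, if the controller turns out to be an LSI MegaRAID managed with the MegaCli utility (adjust to whatever tool your vendor ships), the physical disks and logical drives could be checked with something like:

# MegaCli -PDList -aALL | grep -iE 'slot|firmware state|predictive'
# MegaCli -LDInfo -Lall -aALL | grep -iE 'state|size'

Look for disks whose Firmware state is not "Online" or whose Predictive Failure Count is non-zero, and for logical drives that show "Degraded" rather than "Optimal".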

If you are using RAID 1 or RAID 5, the problem with hardware RAID controllers is that they sometimes work too well in a certain sense: if you don't specifically monitor them, the system will keep working even if one disk has failed and you won't see a problem if you don't know what to check for... until another disk in the same RAID set fails, at which point the RAID set can no longer recover your data.

Hi Matti, a new drive has been ordered. Slot 0 has the faulty disk; we have two built-in disks in total, running RAID 1. From Oracle's perspective it is more or less plug and play: take the faulty disk out and insert the new one. There are a few actions to take to prepare the disk using the MegaRAID controller, which we will be performing remotely. Do you have any other tips to make this work go as smoothly as possible, such as any pre-checks or post-checks you could list?
