SSH refused on RHEL 5.4
The server was up and running, and no recent changes or upgrades had been made.
The configuration on the paired server is identical; however, only the SSH connection to db2 is refused, while SSH to db1 works.
[oracle@to5uspdb1 ~]$ ssh -v 10.128.36.19
OpenSSH_4.3p2, OpenSSL 0.9.8e-fips-rhel5 01 Jul 2008
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Applying options for *
debug1: Connecting to 10.128.36.19 [10.128.36.19] port 22.
debug1: Connection established.
debug1: identity file /opt/oracle/.ssh/identity type -1
debug1: identity file /opt/oracle/.ssh/id_rsa type 1
debug1: identity file /opt/oracle/.ssh/id_dsa type 2
debug1: loaded 3 keys
debug1: Remote protocol version 2.0, remote software version OpenSSH_4.3
debug1: match: OpenSSH_4.3 pat OpenSSH*
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_4.3
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: server->client aes128-cbc hmac-md5 none
debug1: kex: client->server aes128-cbc hmac-md5 none
debug1: SSH2_MSG_KEX_DH_GEX_REQUEST(1024<1024<8192) sent
debug1: expecting SSH2_MSG_KEX_DH_GEX_GROUP
Connection closed by 10.128.36.19
[oracle@to5uspdb1 ~]$
xxx# ssh root@to5uspdb2
key_load_public: invalid format
ssh_packet_read: Connection closed
xxx#
What is the solution, and how can we apply it since there is no SSH access?
thanks,
Responses
I assume 10.128.36.19 = to5uspdb2, is this correct? When a network interface is configured with an incorrect MTU value, the SSH connection negotiation will typically break down at exactly this point. (I guess that's the point where the first SSH protocol messages longer than the standard Ethernet MTU are exchanged.)
If that is the cause, the misconfiguration might be on one of the servers, or in the network hardware the connection passes through. For example, the network segment might be configured for jumbo frames, but the switch to which to5uspdb2 is connected might be missing that essential bit of configuration. Perhaps the network administrator did configure the switch correctly but failed to save the new configuration - and now the switch may have been rebooted, for whatever reason.
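A quick way to check for an MTU problem is to ping db2 from the working node with the don't-fragment bit set. The sizes below are assumptions based on a standard 1500-byte Ethernet MTU (1472 bytes of ICMP payload plus 28 bytes of headers fills a 1500-byte frame):
[oracle@to5uspdb1 ~]$ ping -c 3 10.128.36.19
[oracle@to5uspdb1 ~]$ ping -c 3 -M do -s 1472 10.128.36.19
If the small pings succeed but the full-size don't-fragment pings fail or report that fragmentation is needed, some hop (or the other host) is using a smaller MTU than expected.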
To troubleshoot this, you may need a network administrator to help you. If you don't have remote console access to that system, you may also need to go physically to the server and log in to the server console - or find someone else who can do that for you.
On the console, the MTU values for each network interface should be verified. Either the ifconfig or ip link commands can be used for this:
# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast qlen 1000
link/ether 08:00:27:27:90:2b brd ff:ff:ff:ff:ff:ff
...
[root@rhel511 ~]# ifconfig
eth0 Link encap:Ethernet HWaddr 08:00:27:27:90:2B
inet addr:192.168.42.17 Bcast:192.168.42.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:84 errors:0 dropped:0 overruns:0 frame:0
TX packets:65 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:13692 (13.3 KiB) TX bytes:8634 (8.4 KiB)
...
In this example, there is mtu 1500 in ip link output, and MTU:1500 in ifconfig output for eth0.
1500 is the standard value in Ethernet networks. If the network segment is configured for jumbo frames (which can improve the performance of large data transfers) the MTU value might be something like 9000 or close to it. On the other hand, sometimes a reduced MTU value might be necessary; ask your network administrator for the correct value.
If the MTU value for any network interface on the server needs to be corrected, that will be the next step on the server console. For example, if the MTU value for eth0 is 9000 and the network administrator says it should be 1500, one of the following commands will fix it temporarily:
# ifconfig eth0 mtu 1500
or
# ip link set eth0 mtu 1500
If a server-side MTU mismatch is causing the difficulties in SSH connections, the server should now be reachable over SSH.
But the change is not persistent yet; the next step would be to verify that the correct MTU is specified in the /etc/sysconfig/network-scripts/ifcfg-* files too. For example, if the MTU for eth0 interface should be 1300, there should be a line like this in /etc/sysconfig/network-scripts/ifcfg-eth0:
MTU=1300
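For example, a minimal ifcfg-eth0 might then look something like the following; only the MTU line is the relevant change here, the other lines are placeholders and whatever is already in the file on your system should be left as-is:
DEVICE=eth0
ONBOOT=yes
BOOTPROTO=static
# ...existing settings stay as they are; the MTU line is the only addition/change
MTU=1300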
But if the SSH connection still does not work, the next step on the console would be to check the /var/log/secure log file for SSH connection errors: if the sshd daemon rejects a connection, the rejection reason will be stored in this log file. The rejection reason is not sent to the SSH client that is attempting to connect, as the client has not been authenticated yet and sshd assumes the client might be malicious. The rejection reason should be useful for continued troubleshooting.
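For example, you could watch the log on the console while retrying the SSH connection from the other node:
# tail -f /var/log/secure
or list just the recent sshd entries:
# grep sshd /var/log/secure | tail -n 50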
(Side note: If a smaller-than-normal MTU value is needed, it might indicate that at some point an over-zealous firewall administrator has blocked ICMP Fragmentation Needed messages. This breaks the Path MTU Discovery (PMTUD) mechanism. Without PMTUD, you can see problems just like the one you're experiencing when accessing these servers through VPNs or from various cloud services. Blocking ICMP Fragmentation Needed messages has been ill-advised for more than 20 years now.)
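If you suspect the problem is somewhere along the network path rather than on the server itself, tracepath (from the iputils package) attempts to discover the usable MTU hop by hop; for example, from the working node:
[oracle@to5uspdb1 ~]$ tracepath 10.128.36.19
On a healthy local network it should report pmtu 1500 (or 9000 on a jumbo-frame segment).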
If you originally lost access to the system on July 15 or so, those messages might well be related. It looks like the system is trying to access /dev/sdb1 beyond the end of that partition. That might be caused by corrupted filesystem metadata, or possibly corrupted data in RAM: in other words, the system may have thought the partition was bigger than it actually is. Whatever it is, something was very wrong with that system on that day. A hardware problem of some kind is possible, and it might be the root cause of your SSH connection problems too.
This message indicates your hardware RAID might be having some sort of a problem:
Jul 15 20:03:59 to5uspdb2 kernel: mptbase: ioc0: RAID STATUS CHANGE for PhysDisk 1 id=2
Jul 15 20:03:59 to5uspdb2 kernel: mptbase: ioc0: SMART data received, ASC/ASCQ = 5dh/00h
ASC/ASCQ pair 5dh/00h means "Failure Prediction Threshold Exceeded", so it means at least one of your disks might be failing or about to fail.
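To see whether these events have been recurring, you could search the kernel log on the console (standard RHEL 5 log locations assumed):
# grep -i mptbase /var/log/messages
# grep -i 'SMART data' /var/log/messages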
You may have to reboot the node again, and then the next step should probably be verifying the status of your hardware RAID sets. Use the RAID controller management tools provided by your hardware vendor to get more information about their state.
If you are using RAID 1 or RAID 5, the problem with hardware RAID controllers is that they sometimes work too well in a certain sense: if you don't specifically monitor them, the system will keep working even if one disk has failed and you won't see a problem if you don't know what to check for... until another disk in the same RAID set fails, at which point the RAID set can no longer recover your data.
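The mptbase messages suggest an LSI MPT-family controller. If the optional mpt-status utility happens to be installed, it can give a quick view of the logical volume and physical disk states; the vendor's own management tool remains the authoritative source:
# mpt-status -p
# mpt-status -i 0
(The -p option probes for the controller's volume ID; -i 0 is just an example, use the ID reported by the probe.)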
