Failover dhcpd pair stuck in partner-down/shutdown state


Failover dhcpd pair stuck in partner-down/shutdown state

Eugene Grosbein
Hi!

I run two ISC DHCP Servers version 4.3.5 in failover mode.

They had been running just fine for several years, being upgraded from time to time,
until recently I found that the first one was in the "partner-down" state
and the second in the "shutdown" state, despite the tcp/647 control connection
being in perfectly working order, with data flowing over it according to tcpdump.

They had been running in this state for a very long time (over a year), and
I have no old logs to check due to log rotation. At that point, the
second server was appending "not responding (shut down)" to the DHCPDISCOVER/DHCPREQUEST
lines written to its log.

I tried to resolve the issue by stopping the second dhcpd completely
and starting it again. At startup, it wrote to the log:

dhcpd: failover peer default: I move from shutdown to startup

Then it opened its tcp/647 control connection to its peer,
exchanged some data over the connection, and appended this to the dhcpd.leases file:

        failover peer "default" state {
          my state shutdown at 4 2017/03/30 02:17:13;
          partner state partner-down at 4 2017/03/30 02:17:13;
          mclt 60;
        }
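A minimal Python sketch of how a stanza like the one above could be read back when diagnosing this kind of problem. It assumes, by analogy with lease declarations (dhcpd.leases(5) says the last declaration in the file is the current one), that the last "failover peer ... state" stanza for a peer is the one that counts. The function name last_failover_states is hypothetical, and the regular expressions only cover the flat stanza shape shown here.

```python
import re

# One failover state stanza as appended to dhcpd.leases, e.g.
#   failover peer "default" state {
#     my state shutdown at 4 2017/03/30 02:17:13;
#     partner state partner-down at 4 2017/03/30 02:17:13;
#     mclt 60;
#   }
# Assumption: no nested braces inside the stanza.
STANZA_RE = re.compile(r'failover\s+peer\s+"([^"]+)"\s+state\s*\{(.*?)\}', re.S)
STATE_RE = re.compile(r'(my|partner)\s+state\s+([\w-]+)\s+at\s+([^;]+);')


def last_failover_states(leases_text):
    """Return {peer_name: {"my": (state, time), "partner": (state, time)}}.

    Stanzas are processed in file order, so a later stanza for the same
    peer replaces an earlier one (assumed "last one wins" semantics).
    """
    result = {}
    for name, body in STANZA_RE.findall(leases_text):
        states = {}
        for who, state, when in STATE_RE.findall(body):
            states[who] = (state, when.strip())
        result[name] = states
    return result
```

Run against the excerpt above, it would report "my" as shutdown and "partner" as partner-down for peer "default", which is the pair of states the server kept returning to.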

Then it wrote to the log:

dhcpd: failover peer default: I move from startup to shutdown

And things settled back into the same state.

Restarting the first server did not help either.

I was forced to stop both servers for a short time, manually delete all
the "failover" records quoted above from both dhcpd.leases files,
and start the servers again. Only then did both servers reach the "normal" state
(editing only one of the dhcpd.leases files did not help).
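The manual cleanup described above can be sketched in Python. This is an illustration, not a supported ISC tool: the function names are hypothetical, it assumes the flat stanza shape quoted earlier, and, as noted above, the cleanup only helped when it was done on both servers' dhcpd.leases files with both dhcpd processes stopped. A backup copy is written before the file is rewritten.

```python
import re
import shutil

# One "failover peer ... state { ... }" stanza plus trailing whitespace.
# Assumes the flat shape quoted above (no nested braces inside the stanza).
STANZA_RE = re.compile(r'failover\s+peer\s+"[^"]+"\s+state\s*\{.*?\}\s*', re.S)


def strip_failover_state(leases_text):
    """Return (new_text, removed) with every failover state stanza removed;
    lease declarations and all other text are left as they were."""
    return STANZA_RE.subn("", leases_text)


def strip_file(path):
    """Rewrite a leases file in place after saving a copy to path + ".bak".

    Only meant to be run while dhcpd is stopped on this server.
    Returns the number of stanzas removed.
    """
    shutil.copy2(path, path + ".bak")
    with open(path) as f:
        text = f.read()
    new_text, removed = strip_failover_state(text)
    with open(path, "w") as f:
        f.write(new_text)
    return removed
```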

My question: why did the servers get stuck in the partner-down/shutdown state "forever",
unable to leave it without manual intervention, despite a perfectly working
control TCP connection? Is this problem fixed in recent versions?

Here is the dhcpd.conf of the first server:

# default ports tcp/647

failover peer "default" {
        primary;
        address 62.231.191.161;
        peer address 62.231.191.174;
        max-response-delay 60;
        max-unacked-updates 10;
        mclt 60;
        split 128;
        auto-partner-down 60;
        load balance max seconds 3;
}

subnet 62.231.191.160 netmask 255.255.255.252 {}
include "/usr/local/etc/dhcpd.master";

The second server uses the same configuration except for the IP addresses,
and it uses an identical dhcpd.master file containing the rest of the configuration.
_______________________________________________
dhcp-users mailing list
[hidden email]
https://lists.isc.org/mailman/listinfo/dhcp-users

Re: Failover dhcpd pair stuck in partner-down/shutdown state

Bob Harold

On Tue, Dec 25, 2018 at 4:24 AM Eugene Grosbein <[hidden email]> wrote:


When you say "Second server uses same configuraton", I hope you did not accidentally mark both as "primary".
Here is the config on one of my pairs, for comparison:

-------- first server ------------

failover peer "mydhcppair1"
{
primary;
address 141.211.147.232;
port 847;
peer address 141.211.147.248;
peer port 647;
max-response-delay 60;
max-unacked-updates 10;
mclt 1800;
split 128;
load balance max seconds 3;
}


-------- second server ------------

failover peer "mydhcppair1"
{
secondary;
address X.X.X.248;
port 647;
peer address X.X.X.232;
peer port 847;
max-response-delay 60;
max-unacked-updates 10;
load balance max seconds 3;
}

Note that "mclt" and "split" can only be specified on the primary.

-- 
Bob Harold
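Bob's note can be turned into a quick consistency check for a failover pair. Below is a Python sketch, assuming each server's failover peer block has already been transcribed by hand into a dict (the dict layout and the name check_pair are hypothetical; nothing here parses dhcpd.conf). Besides the mclt/split rule, it checks that each side's address and port match what its partner calls its peer address and peer port, taking 647 as the default, as in the "# default ports tcp/647" comment in Eugene's config.

```python
DEFAULT_PORT = 647  # per the "# default ports tcp/647" comment in the configs


def role(cfg):
    """Return "primary", "secondary", or None if neither flag is set."""
    if cfg.get("primary"):
        return "primary"
    if cfg.get("secondary"):
        return "secondary"
    return None


def check_pair(a, b):
    """Return a list of human-readable problems found in a failover pair.

    a and b are dicts such as {"primary": True, "address": "...",
    "peer address": "...", "mclt": 60}; missing ports default to 647.
    """
    problems = []
    roles = sorted([role(a), role(b)], key=str)
    if roles != ["primary", "secondary"]:
        problems.append(f"roles are {roles}; need one primary and one secondary")
    # Rule from the note above: mclt and split belong on the primary only.
    for label, cfg in (("a", a), ("b", b)):
        if role(cfg) == "secondary":
            for opt in ("mclt", "split"):
                if opt in cfg:
                    problems.append(f"{label}: '{opt}' set on the secondary")
    # Each side's own address/port must be the partner's peer address/port.
    for me, other, label in ((a, b, "a"), (b, a, "b")):
        if me.get("address") != other.get("peer address"):
            problems.append(f"{label}: address does not match partner's peer address")
        if me.get("port", DEFAULT_PORT) != other.get("peer port", DEFAULT_PORT):
            problems.append(f"{label}: port does not match partner's peer port")
    return problems
```

Given Eugene's two configs transcribed this way, the only item it flags is mclt on the secondary; whether dhcpd itself treats that as a problem is a separate question.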



Re: Failover dhcpd pair stuck in partner-down/shutdown state

Eugene Grosbein
02.01.2019 22:24, Bob Harold wrote:

> When you say " Second server uses same configuraton ", I hope you did not accidentally mark both as "primary".

No, the second one is secondary, just like in your configuration.

> Note that "mclt" and "split" can only be specified on the primary.

Hmm, I have "mclt" in the config of the secondary server by accident,
and it does not complain... Could that be the culprit?

Also, both of my servers use the same default control port.

# default ports tcp/647
failover peer "default" {
        secondary;
        address 62.231.191.174;
        peer address 62.231.191.161;
        max-response-delay 60;
        max-unacked-updates 10;
        mclt 60;
        auto-partner-down 60;
        load balance max seconds 3;
}

subnet 62.231.191.172 netmask 255.255.255.252 {}
include "/usr/local/etc/dhcpd.master";


Re: Failover dhcpd pair stuck in partner-down/shutdown state

Bob Harold

On Wed, Jan 2, 2019 at 11:26 AM Eugene Grosbein <[hidden email]> wrote:


Your config looks correct to me.  I don't have any other clues.

-- 
Bob Harold
 


Re: Failover dhcpd pair stuck in partner-down/shutdown state

Thomas Markwalder
In reply to this post by Eugene Grosbein
Hello:

I suspect that at some point in the past one of the servers was put into
the shutdown state by setting its state to shutdown (8) via omshell.
This caused the other server to toggle to partner-down (4). The
servers will stay that way until you take them through recovery by
setting the partner-down peer's state to recover (6). When a server is
set to the shutdown state, it remains there until you intervene. This is
intended to allow you to do maintenance and the like with minimal issues.

Regards,

Thomas Markwalder
ISC Software Engineering
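The recovery step Thomas describes can be written down as a small Python helper. It only decides which server to change and to which numeric local-state code, using the three codes from the message (4 partner-down, 6 recover, 8 shutdown); actually making the change is done with omshell against the server's OMAPI listener, which this sketch does not attempt. The helper name recovery_step and the server names in the usage line are hypothetical.

```python
# Numeric local-state codes as quoted in the message above (only these three).
STATE_CODES = {"partner-down": 4, "recover": 6, "shutdown": 8}


def recovery_step(state_by_server):
    """Return (server_name, local_state_code) for the manual recovery step,
    or None when the pair is not stuck in the shutdown/partner-down pattern.

    state_by_server maps each of the two server names to its failover state.
    """
    if len(state_by_server) != 2:
        return None
    if set(state_by_server.values()) != {"shutdown", "partner-down"}:
        return None
    for name, state in state_by_server.items():
        if state == "partner-down":
            # Per the message: move the partner-down peer to recover (6).
            return name, STATE_CODES["recover"]
    return None  # not reached given the checks above


# The pair described at the start of the thread.
print(recovery_step({"first": "partner-down", "second": "shutdown"}))  # ('first', 6)
```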


On 12/25/18 4:23 AM, Eugene Grosbein wrote:



Re: Failover dhcpd pair stuck in partner-down/shutdown state

Eugene Grosbein
In reply to this post by Eugene Grosbein
03.01.2019 0:38, Bruce Hudson wrote:

>     This may be a silly question but, have you tried to change the server
> state manually? You said you restarted the server. However, as you pointed
> out, a "failover peer" stanza is added to the lease file and should put
> the server back into its previous state as soon as the lease file is
> processed. Was there an existing "failover peer" stanza in your file?

Yes, it was.

>     One way to restore your state would be through OMAPI, assuming you
> have the listener configured. You want to run something like the code
> below on each server.

My question was not how to restore the state but why it does not restore itself automatically.

>     The other way might be to shut down the server and edit the lease file
> before restarting it.

That's what I did, as described in my first post.

Re: Failover dhcpd pair stuck in partner-down/shutdown state

Eugene Grosbein
In reply to this post by Thomas Markwalder
03.01.2019 3:08, Thomas Markwalder wrote:

> Hello:
>
> I suspect that at some point in the past one of the servers was put into the shutdown state
> by setting it's state to shutdown (8) via omshell.  This caused the other server to toggle to partner-down (4).
> They servers will stay that way until you take them through recovery by setting the partner-down peer's state to recover (6).
> When a server is set to shutdown state it remains that until you intervene.
> This is intended to allow you to do maintenance and what not with minimal issues.

I have never learned how to use omshell, have never used it, and it is not even configured here.



Re: Failover dhcpd pair stuck in partner-down/shutdown state

Thomas Markwalder
There are also a handful of error conditions that can cause a server to
transition to shut_down.

On 1/2/19 11:15 PM, Eugene Grosbein wrote:

> 03.01.2019 3:08, Thomas Markwalder wrote:
>
>> Hello:
>>
>> I suspect that at some point in the past one of the servers was put into the shutdown state
>> by setting it's state to shutdown (8) via omshell.  This caused the other server to toggle to partner-down (4).
>> They servers will stay that way until you take them through recovery by setting the partner-down peer's state to recover (6).
>> When a server is set to shutdown state it remains that until you intervene.
>> This is intended to allow you to do maintenance and what not with minimal issues.
> I've never learned how to use omshell, never used it and it is not configured here even.
>
>


Re: Failover dhcpd pair stuck in partner-down/shutdown state

Eugene Grosbein
03.01.2019 20:26, Thomas Markwalder wrote:
> On 1/2/19 11:15 PM, Eugene Grosbein wrote:
>> 03.01.2019 3:08, Thomas Markwalder wrote:
>>> They servers will stay that way until you take them through recovery by setting the partner-down peer's state to recover (6).
>>> When a server is set to shutdown state it remains that until you intervene.
>>> This is intended to allow you to do maintenance and what not with minimal issues.

Thanks, I did not know this and could not find a manual page documenting it.

>> I've never learned how to use omshell, never used it and it is not configured here even.
> There are also a handful of error conditions that can cause a server to transition to shut_down.

Can you please explain which error conditions can lock a server in the shut_down state?


Re: Failover dhcpd pair stuck in partner-down/shutdown state

Eugene Grosbein
In reply to this post by Eugene Grosbein
04.01.2019 4:02, Bruce Hudson wrote:

> On Thu, Jan 03, 2019 at 11:12:49AM +0700, Eugene Grosbein wrote:
>  
>> My question was not how to restore state but why it does not restore automatically?
>
>     My apologies, I guess I skimmed your first post too quickly. The short
> answer is "design choice". There is a problem in redundant systems called
> "partitioning" that cannot generally be solved with only two peers. To avoid it
> you need "half + 1" in the working set. ISC's answer was to make the process
> manual.

Thank you for explaining this. The problem is that the server(s) got stuck in this state
all by themselves, due to some obscure error.
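Bruce's "half + 1" rule fits in a few lines of Python (the function names are illustrative). With two peers and a broken link, each side sees only itself, 1 of 2, which is short of the majority of 2, so neither side can prove on its own that it should take over. With three nodes split 2 + 1, the pair still holds a majority.

```python
def majority(total_nodes):
    """Smallest strict majority of total_nodes: "half + 1", half rounded down."""
    return total_nodes // 2 + 1


def can_act_alone(visible_nodes, total_nodes):
    """True if a partition that sees visible_nodes (itself included)
    holds a strict majority and may act without the others."""
    return visible_nodes >= majority(total_nodes)


# Two DHCP peers with the link cut: each side sees only itself.
print(can_act_alone(1, 2))  # False
# Three nodes split 2 + 1: the pair may proceed, the lone node may not.
print(can_act_alone(2, 3), can_act_alone(1, 3))  # True False
```

This is why a two-server failover pair cannot safely decide on its own that the partner is really down, and why the thread describes that transition as an operator's decision.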



Re: Failover dhcpd pair stuck in partner-down/shutdown state

Eugene Grosbein
04.01.2019 21:26, Bruce Hudson wrote:

> On Fri, Jan 04, 2019 at 10:28:04AM +0700, Eugene Grosbein wrote:
>>
>> Thank you for explaining this. The problem is, the server(s) got stuck in
>> this state due to some obscure error all by itself.
>
>     That can be possible for the "shutdown" state but not "partner-down".
> The only way to get to that state is manually. So, who else was around
> almost 2 years ago (March 30, 2017) at 9:17AM?

Sadly, I have no old logs.
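Here is the arithmetic behind Bruce's "9:17AM". Per dhcpd.leases(5), lease-file times are written in UTC by default, prefixed by the day of the week (0 = Sunday), and Eugene's mail headers show a +0700 offset. The Python sketch below (parse_lease_time is a hypothetical helper) checks the weekday and converts the stanza's timestamp.

```python
from datetime import datetime, timedelta, timezone


def parse_lease_time(s):
    """Parse a dhcpd.leases time such as "4 2017/03/30 02:17:13".

    The leading digit is the day of week (0 = Sunday); the date and time
    are taken as UTC, the leases-file default. Raises ValueError if the
    day-of-week digit does not match the date.
    """
    dow, rest = s.split(" ", 1)
    t = datetime.strptime(rest, "%Y/%m/%d %H:%M:%S").replace(tzinfo=timezone.utc)
    # Python counts Monday=0 .. Sunday=6; dhcpd counts Sunday=0 .. Saturday=6.
    if (t.weekday() + 1) % 7 != int(dow):
        raise ValueError(f"day-of-week {dow} does not match {rest}")
    return t


stamp = parse_lease_time("4 2017/03/30 02:17:13")
local = stamp.astimezone(timezone(timedelta(hours=7)))  # +0700, as in the mail headers
print(local.strftime("%Y-%m-%d %H:%M:%S %z"))  # 2017-03-30 09:17:13 +0700
```

The weekday digit 4 checks out as a Thursday, and the converted time matches the 9:17 AM Bruce asked about.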
