Failover state changes

kraishak

Hi,
I have been running ISC DHCP in standalone mode, and it works fine that way. When I added DHCP failover to my existing setup, it caused problems.

The server stays in the partner-down state for a long time, which made me panic. How can I reduce this time and get both servers into the normal/normal state?
FYI: I did some reading and found the max-unacked-updates parameter, which was set to 10 on my first attempt. I thought it might be the cause and increased it to 5000, because my config covers nearly 800 subnets, which is a lot of data, but no luck.
Has anyone faced the same issue while adding failover, or does anyone have a suggestion for reducing the time it takes the failover states to recover to normal/normal?

State changes on the primary:
cat dhcpd.leases | egrep "my state|partner state"
  my state partner-down at 5 2019/11/29 06:23:00;
  partner state recover-done at 5 2019/11/29 06:52:59;
  my state normal at 5 2019/11/29 06:52:59;
  partner state recover-done at 5 2019/11/29 06:52:59;
  my state normal at 5 2019/11/29 06:52:59;
  partner state normal at 5 2019/11/29 06:53:00;
State changes on the failover (secondary) server:
cat dhcpd.leases | egrep "my state|partner state"
  my state recover at 5 2019/11/29 06:22:59;
  partner state communications-interrupted at 5 2019/11/29 06:23:00;
  my state recover at 5 2019/11/29 06:22:59;
  partner state communications-interrupted at 5 2019/11/29 06:23:00;
  my state recover at 5 2019/11/29 06:22:59;
  partner state partner-down at 5 2019/11/29 06:23:00;
  my state recover-wait at 5 2019/11/29 06:22:59;
  partner state partner-down at 5 2019/11/29 06:23:00;
  my state recover-done at 5 2019/11/29 06:52:59;
  partner state partner-down at 5 2019/11/29 06:23:00;
  my state recover-done at 5 2019/11/29 06:52:59;
  partner state normal at 5 2019/11/29 06:53:00;
  my state normal at 5 2019/11/29 06:53:00;
  partner state normal at 5 2019/11/29 06:53:00;
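
The egrep output above repeats many lines, which makes the sequence hard to follow. A minimal sketch of turning it into a de-duplicated timeline, assuming the lease file always writes state lines in the form shown above. The names STATE_LINE, parse_states and SAMPLE are made up for this illustration and are not part of ISC DHCP; the sample is an excerpt of the secondary's output.

```python
import re
from datetime import datetime

# Matches lines such as:  my state partner-down at 5 2019/11/29 06:23:00;
# (assumed format, copied from the egrep output in this thread)
STATE_LINE = re.compile(
    r"^\s*(my|partner) state ([a-z-]+) at \d (\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2});"
)


def parse_states(text):
    """Return (who, state, timestamp) tuples in first-seen order, exact repeats dropped."""
    seen = set()
    events = []
    for line in text.splitlines():
        match = STATE_LINE.match(line)
        if not match:
            continue
        who, state, stamp = match.groups()
        event = (who, state, datetime.strptime(stamp, "%Y/%m/%d %H:%M:%S"))
        if event not in seen:
            seen.add(event)
            events.append(event)
    return events


# Excerpt of the secondary's lease-file output from this post.
SAMPLE = """\
  my state recover at 5 2019/11/29 06:22:59;
  partner state communications-interrupted at 5 2019/11/29 06:23:00;
  my state recover at 5 2019/11/29 06:22:59;
  partner state partner-down at 5 2019/11/29 06:23:00;
  my state recover-wait at 5 2019/11/29 06:22:59;
  my state recover-done at 5 2019/11/29 06:52:59;
  partner state normal at 5 2019/11/29 06:53:00;
  my state normal at 5 2019/11/29 06:53:00;
"""

for who, state, when in parse_states(SAMPLE):
    print(f"{when:%H:%M:%S}  {who:<7}  {state}")
```

Running the same function over the full dhcpd.leases file on each server should give one line per distinct state record, which is easier to compare between the two peers.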

On the primary server
========================================
failover peer "peer5" {
        primary;
        address YYY.YYY.YY.YYY;
        port 647;
        peer address YYY.YYY.YY.YYY;
        peer port 647;
        max-response-delay 30;
        max-unacked-updates 5000;
        load balance max seconds 3;
        mclt 1800;
        split 128;
}
On the failover (secondary) server
==========================================
failover peer "peer5" {
        secondary;
        address YYY.YYY.YY.YYY;
        port 647;
        peer address YYY.YYY.YY.YYY;
        peer port 647;
        max-response-delay 30;
        max-unacked-updates 5000;
        load balance max seconds 3;
}
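
The two blocks above differ only in role, mclt and split. As far as I understand dhcpd.conf(5), mclt is set on the primary only, and the load-balance split is likewise a primary-side setting, so the configs shown look consistent on that point. A minimal sketch of checking this mechanically follows; parse_failover_block and check_pair are hypothetical helper names, and the parser only handles the simple statements that appear in this thread.

```python
# Statements in this thread's configs whose keyword spans several words.
MULTIWORD = ("load balance max seconds", "peer address", "peer port")


def parse_failover_block(text):
    """Parse one 'failover peer' block into {statement keyword: value string}."""
    body = text[text.index("{") + 1 : text.rindex("}")]
    settings = {}
    for stmt in body.split(";"):
        stmt = " ".join(stmt.split())  # collapse newlines and indentation
        if not stmt:
            continue
        key = next((k for k in MULTIWORD if stmt.startswith(k + " ")), stmt.split()[0])
        settings[key] = stmt[len(key):].strip()
    return settings


def check_pair(primary, secondary):
    """Return a list of role-related problems; an empty list means none were found."""
    problems = []
    if "primary" not in primary:
        problems.append("first block is not marked primary")
    if "secondary" not in secondary:
        problems.append("second block is not marked secondary")
    if "mclt" not in primary:
        problems.append("primary does not set mclt")
    for key in ("mclt", "split"):
        if key in secondary:
            problems.append(f"secondary sets {key}")
    return problems


PRIMARY = """failover peer "peer5" {
        primary;
        address YYY.YYY.YY.YYY;
        port 647;
        peer address YYY.YYY.YY.YYY;
        peer port 647;
        max-response-delay 30;
        max-unacked-updates 5000;
        load balance max seconds 3;
        mclt 1800;
        split 128;
}"""

SECONDARY = """failover peer "peer5" {
        secondary;
        address YYY.YYY.YY.YYY;
        port 647;
        peer address YYY.YYY.YY.YYY;
        peer port 647;
        max-response-delay 30;
        max-unacked-updates 5000;
        load balance max seconds 3;
}"""

print(check_pair(parse_failover_block(PRIMARY), parse_failover_block(SECONDARY)))
```

This only checks which statements appear on which side; it does not validate addresses, ports, or whether the values are sensible.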

Recovery took nearly 30 minutes, which is a problem in my environment. Are there any tunable parameters?

Thanks in advance

_______________________________________________
dhcp-users mailing list
[hidden email]
https://lists.isc.org/mailman/listinfo/dhcp-users

Re: Failover state changes

Steven Carr
Why are you putting the system into partner-down? You only need to do
this if the partner is actually down.

30 minutes recovery time (RECOVER-WAIT) is because your MCLT value is
1800. This is normal. This is how failover works when recovering from
partner-down (hence why you should avoid using partner-down unless
absolutely necessary).
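
The arithmetic can be checked against the timestamps from the first post. A minimal sketch, using only the two recover-wait and recover-done timestamps from the secondary's lease file and the mclt value from the primary's config; the variable names are chosen for this illustration.

```python
from datetime import datetime

FMT = "%Y/%m/%d %H:%M:%S"
MCLT_SECONDS = 1800  # 'mclt 1800;' in the primary's failover peer block

# Timestamps from the secondary's dhcpd.leases output in the first post.
entered_recover_wait = datetime.strptime("2019/11/29 06:22:59", FMT)
entered_recover_done = datetime.strptime("2019/11/29 06:52:59", FMT)

waited = (entered_recover_done - entered_recover_wait).total_seconds()
print(f"recover-wait lasted {waited:.0f} s; mclt is {MCLT_SECONDS} s")
```

The difference is 1800 seconds, the same as the configured MCLT.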

If you're updating the configuration then just update the config on
both systems and restart them. If you're using canned init scripts
then make sure they aren't doing anything stupid like causing the
systems to go into partner-down.

Highly recommend getting a copy of the DHCP Handbook and reading the
section on DHCP Failover.

Steve




On Fri, 29 Nov 2019 at 07:09, Kraishak Mahtha <[hidden email]> wrote:

> [snip]

Re: Failover state changes

kraishak
Steven, thanks for the response.
This is exactly what happened: when I added failover to my production
system, the peer went into the recover and recover-wait states for a long
time and did not offer leases, which made me panic. I restored the old
standalone (no failover) config, which worked fine for the time being. Now
I want to set up failover in my sandbox environment with a similar config
and check the duration.

So I have these questions:
Does the DHCP server offer IP addresses while it is in the recover or
recover-wait state?
How can the time needed to reach normal/normal be reduced, so that I do not
face the same issue again? If MCLT is the deciding parameter, can we reduce
it to 5 or 10 minutes, and would that be an advisable value?

> Why are you putting the system into partner-down? You only need to do
> this if the partner is actually down.

No, the steps I follow in my test environment are these: I added failover
to the primary, which takes a long time, so I am trying different parameter
values. Each time, I stop dhcpd on the failover server, delete the lease
file, touch it to recreate it, and restart dhcpd on the failover server. I
did not put any peer into the partner-down state manually; I only stop
dhcpd on the failover server and start it again after deleting and
recreating the lease file. I am not sure whether this process triggers the
partner-down state.




--
Sent from: http://isc-dhcp-users.2343191.n4.nabble.com/

Re: Failover state changes

kraishak
Hi Team,

When a failover server is newly added to an existing setup, or when the
failover server has been down and is brought back after some time to rejoin
the cluster, its state goes recover -> recover-wait -> recover-done ->
normal. I have a few concerns about this process:
1) The time taken to move from recover to normal is not predictable. I ran
multiple trials in my sandbox environment with the same servers; sometimes
it takes exactly the MCLT duration, and sometimes three or four times that
value. What are the deciding parameters for the state changes?
2) Does the value of max-unacked-updates play a role in speeding up the
failover state changes?


Thanks in advance,
kraishak





Re: Failover state changes

Chris Buxton
I can answer the latter question. Yes, max-unacked-updates affects recovery time when there are a large number of dynamic addresses, whether they have been assigned to client devices or not. Every available lease, whatever state, must be synchronized with the peer. Increasing the number of leases that can be sent at a time definitely appears to help speed this process along. I use a value of 1000 for this purpose, rather than the default of 10, and I've never seen any problems resulting.
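
To get a feel for why the window size matters, here is a back-of-envelope sketch. It assumes a deliberately simplified model in which the sender transmits max-unacked-updates lease updates, waits for acknowledgements, and repeats; this is not a description of how dhcpd actually schedules updates. The pool size of 250 dynamic addresses per subnet is a made-up figure, and update_batches is a hypothetical helper name.

```python
import math


def update_batches(lease_count, max_unacked_updates):
    """Number of send-then-wait rounds in the simplified window model."""
    return math.ceil(lease_count / max_unacked_updates)


# 800 subnets is from this thread; 250 dynamic addresses per subnet is an assumption.
leases = 800 * 250

for window in (10, 1000, 5000):
    print(f"max-unacked-updates {window:>5}: {update_batches(leases, window)} rounds")
```

Under these assumptions the default of 10 needs 20000 rounds, while 1000 needs 200 and 5000 needs 40, so the gain from raising the value flattens out quickly. Real synchronization time also depends on network latency and disk writes, which this model ignores.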

Chris Buxton

> On Dec 18, 2019, at 2:08 AM, kraishak <[hidden email]> wrote:
> [snip]


Re: Failover state changes

kraishak
Hi Chris,

Thanks for your valuable reply and suggestion. I tried setting
max-unacked-updates to 1000, but recovery still takes a long time. Would
there be any impact if I set it higher, say 10000? I just want to
cross-check.
Also, sometimes both of my failover peers get stuck in the recovery states,
e.g. the primary in recover and the failover server in recover-wait, for
more than 3.5 hours. In that case I stopped the dhcpd service on the
failover server, deleted the lease file, and restarted it. This time
recovery took the MCLT duration, but I am not sure whether this is the
correct approach. Has anyone else seen failover peers stuck in the recover
or recover-wait state for this long?
If so, what is the best way to fix it without impacting the environment for
so long?


