Restarting DHCP safely whilst avoiding partner-down state

Terry Burton
Hi,

I'm attempting to write a systemd .service file for my own uses of ISC
DHCP. However, if it can be made sufficiently generic then I would
intend to push this upstream or at least into distributions.

It needs to be suitable for managing failover pairs and I'm struggling
with the age-old problem of restarting a dhcpd instance. From reading
around there does not currently appear to be a method for restarting
dhcpd that is both *safe* and *useful* in such a setup.


Restarting with signals:

From AA-01043 (Last Updated: 2015-03-18): "kill is the recommended
option, except where there is a high turnover of leases and the
production environment requires a high degree of reliability from
DHCP. In that case, we'd suggest that administrators consider using
OMAPI to control the daemon instead and to request a graceful
shutdown. The reason for this is that there is the slight possibility
that by using kill, administrators may stop dhcpd in the middle of
appending a lease to the leases file (in which case it may become
corrupted). This risk, while tiny, may be significant enough for some
administrators to prefer to use OMAPI instead."

In other words this is recommending that casual users take the risk
that their service might not recover after restarting. This may be
unlikely but it's still dangerous advice! The documentation does
indicate that a feature for "gentle shutdown" in response to a signal
was added in the 4.2 time frame and then subsequently disabled:

"Added support for gentle shutdown after signal is received. [ISC-Bugs
#32692] [ISC-Bugs 34945]"
"Disable the gentle shutdown functionality until we can determine the
best way to present it to remove or reduce the side effects. [ISC-Bugs
#36066]"

Is it still the case that kill isn't suitable for production purposes?


With OMAPI:

You can shut down cleanly via OMAPI ("set state=2", etc.); however, the
effect on the failover protocol is less ideal than with signals.

An OMAPI shutdown places the partner into the "partner-down" state, making
it become active for all leases in the failover pools, which isn't
ideal when briefly restarting an instance. Contrast this with the effect
of restarting an instance with kill, which is to briefly place the
partner into the "communications-interrupted" state, from which it
immediately reverts to "normal" once the restarted instance is available
(with auto-partner-down taking care of things if the instance does
not recover).


Is there a safe way to restart DHCP that has minimal impact on the
failover protocol?


Thanks,

Terry
_______________________________________________
dhcp-users mailing list
[hidden email]
https://lists.isc.org/mailman/listinfo/dhcp-users

Re: Restarting DHCP safely whilst avoiding partner-down state

Anderson, Charles R
FWIW, we've been using the "kill" method for over a decade without any
noticeable side-effects (the default init.d scripts from RHEL 6
(actually Scientific Linux 6) dhcp package).  We've never had to
manually clean up a corrupted lease file.  We restart the services
automatically on a 20 minute cycle, as needed.  We do one, then
immediately do the other.  We do not wait to restart the other, and we
do not monitor to see if failover has reconnected and rebalanced
before restarting the other, but since we are SSH-ing into each server
to do the restart, there might be enough of a built-in delay between
restarting each server.

I don't know if a corrupted lease file would cause a failure to start
the dhcp server, or if it would just go unnoticed, perhaps with a log
message.  But like I said, we've never had a failure to start the
server that was caused by a lease file issue.

Our script does test the config file before doing the restart:

#!/bin/bash
echo -n "Testing DHCP configuration: "
if sudo /etc/rc.d/init.d/dhcpd configtest; then
        echo "Restarting DHCP"
        sudo /etc/rc.d/init.d/dhcpd restart
else
        echo "FAIL: Not restarting DHCP"
fi

which in CentOS 6 does the following:

exec=/usr/sbin/dhcpd
configtest() {
    [ -x $exec ] || return 5
    [ -f $config ] || return 6
    $exec -q -t -cf $config
    RETVAL=$?
    if [ $RETVAL -eq 1 ]; then
        $exec -t -cf $config
    else
        echo "Syntax: OK" >&2
    fi
    return $RETVAL
}


On Fri, May 13, 2016 at 02:00:03PM +0100, Terry Burton wrote:

> Is there a safe way to restart DHCP that has minimal impact on the
> failover protocol?

Re: Restarting DHCP safely whilst avoiding partner-down state

Steve van der Burg
Here we push out new configs to a partner pair from a central server.  The config for one of the partners contains an extra file (dhcpd.i.am.secondary).  Each of the partners runs this every minute (perl script):

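  # (localtime)[1] is the current minute: the secondary acts only on
  # odd minutes, the primary only on even minutes.
  if ( -e "$spath/dhcpd.i.am.secondary" ) {
     exit if (localtime)[1] % 2 == 0;
  }
  else {
     exit if (localtime)[1] % 2 == 1;
  }

  # ... continue (test new config, kill running server, start new one, etc)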

So the config change, stop, start, etc, can only happen on odd minutes for one server and even minutes for the other.  As long as startup time is less than a minute (and it's much, much less than that) it all works smoothly.
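
The odd/even gate above can also be sketched in shell. This is only an illustration of the same idea, not the Perl script in use: the function name may_act and its role argument are made up here (the original decides the role from the presence of dhcpd.i.am.secondary).

```shell
#!/bin/sh
# Sketch of a minute-parity gate (hypothetical helper, not from the thread).
# Usage: may_act MINUTE ROLE
#   ROLE "secondary" may act on odd minutes; any other role on even minutes.
# Returns 0 when the host may proceed, non-zero otherwise.
may_act() {
    minute=${1#0}          # strip one leading zero so "08" is not read as octal
    minute=${minute:-0}    # "0" stripped to empty means minute zero
    role=$2
    parity=$((minute % 2))
    if [ "$role" = "secondary" ]; then
        [ "$parity" -eq 1 ]
    else
        [ "$parity" -eq 0 ]
    fi
}
```

From cron, something like `may_act "$(date +%M)" secondary || exit 0` would sit in front of the test/kill/start steps.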

...Steve

--
Steve van der Burg
Information Technology Services
London Health Sciences Centre
& St. Joseph's Health Care London
(519) 685-8500 ext 35559
[hidden email]

Re: Restarting DHCP safely whilst avoiding partner-down state

Terry Burton
In reply to this post by Anderson, Charles R
On 13 May 2016 at 14:22, Chuck Anderson <[hidden email]> wrote:

> FWIW, we've been using the "kill" method for over a decade without any
> noticeable side-effects (the default init.d scripts from RHEL 6
> (actually Scientific Linux 6) dhcp package).  We've never had to
> manually clean up a corrupted lease file.  We restart the services
> automatically on a 20 minute cycle, as needed.  We do one, then
> immediately do the other.  We do not wait to restart the other, and we
> do not monitor to see if failover has reconnected and rebalanced
> before restarting the other, but since we are SSH-ing into each server
> to do the restart, there might be enough of a built-in delay between
> restarting each server.

Thanks Chuck,

That's exactly our experience with SCPing the config from our IPAM
host, then using SSH to test and restart, for each instance. It has
*never* failed in practice despite our doing this as often as every
5 minutes.

But since I have the incentive to migrate our sys-v init scripts to
systemd and produce something useful to others I am trying to set the
bar higher and do the "right thing".

> I don't know if a corrupted lease file would cause a failure to start
> the dhcp server, or if it would just go unnoticed, perhaps with a log
> message.  But like I said, we've never had a failure to start the
> server that was caused by a lease file issue.

In our experience lease files corrupted by other means can cause a
failure to start. I don't recall whether that was due to mere
truncation though...
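
One way to check for this ahead of a restart, as a rough sketch: dhcpd's -t flag tests the configuration and -T tests the lease database in a similar way (both per the dhcpd man page). The function name preflight and the default paths below are assumptions, not taken from any distribution's scripts.

```shell
#!/bin/sh
# Sketch: validate the config and the lease file before restarting.
# DHCPD, CONF and LEASES are assumed defaults; adjust for your system.
DHCPD=${DHCPD:-/usr/sbin/dhcpd}
CONF=${CONF:-/etc/dhcp/dhcpd.conf}
LEASES=${LEASES:-/var/lib/dhcpd/dhcpd.leases}

# Returns 0 if both checks pass, 1 if the config test fails,
# 2 if the lease file test fails.
preflight() {
    # -q suppresses the startup banner; -t parses the configuration only.
    "$DHCPD" -q -t -cf "$CONF" || return 1
    # -T also parses the lease database named by -lf.
    "$DHCPD" -q -T -cf "$CONF" -lf "$LEASES" || return 2
    return 0
}
```

A restart wrapper could then call `preflight || exit 1` before stopping the running server, so that a damaged lease file is noticed while the old process is still serving.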

Thanks again for sharing your scripts.


Re: Restarting DHCP safely whilst avoiding partner-down state

Terry Burton
In reply to this post by Steve van der Burg
On 13 May 2016 at 15:10, Steve van der Burg <[hidden email]> wrote:

> Here we push out new configs to a partner pair from a central server.  The config for one of the partners contains an extra file (dhcpd.i.am.secondary).  Each of the partners runs this every minute (perl script):
>
>   if ( -e "$spath/dhcpd.i.am.secondary" ) {
>      exit if (localtime)[1] % 2 == 0;
>   }
>   else {
>      exit if (localtime)[1] % 2 == 1;
>   }
>
>   ... continue (test new config, kill running server, start new one, etc)
>
> So the config change, stop, start, etc, can only happen on odd minutes for one server and even minutes for the other.  As long as startup time is less than a minute (and it's much, much less than that) it all works smoothly.

Thanks Steve. We've also been pushing configs around then
synchronously restarting servers back-to-back (without sleeping) for
several years without incident.

It makes me a little suspicious about whether just killing the process
is indeed unsafe... But then maybe we've been lucky.

As mentioned I want to improve on what distributions are currently
doing so I'm deliberately setting the bar high and it would be great
if ISC could provide a single, approved, safe shutdown/restart
mechanism or describe what is required to develop such a mechanism.
Unfortunately the detail of Bug #36066 (retracting support for gentle
shutdown) isn't available as it would be interesting to see what
issues were encountered with the previous approach.


Re: Restarting DHCP safely whilst avoiding partner-down state

Pallissard, Matthew
I just tested this and it seemed to work for me.

#dhcpd4.service
[Unit]
Description=IPv4 DHCP server
After=network.target

[Service]
Type=forking
PIDFile=/run/dhcpd4.pid
ExecStart=/usr/bin/dhcpd -4 -q -cf /etc/dhcpd.conf -pf /run/dhcpd4.pid
ExecStop=/path/to/shutdown/script.sh

[Install]
WantedBy=multi-user.target
                           

#!/bin/sh
# /path/to/shutdown/script.sh
# copy-pasted from https://kb.isc.org/article/AA-00475/0/Sending-a-Server-Shutdown-Message-Via-OMAPI.html

#  uses omshell to connect to a dhcp server on the
#  local machine, create a control object, set the
#  state of the control object, and update the
#  running server to cause that server to shut down
#  gracefully.
#
#  per dhcpd man page, server shutdown can take
#  several seconds as the server waits for close
#  on all OMAPI connections.  Watching log files
#  for shutdown messages is recommended.

omshell << END_OF_INPUT > /dev/null 2> /dev/null
server localhost
port 7911
key omapi_key Ofakekeyfakekeyfakekey==
connect
new control
open
set state=2
update
END_OF_INPUT

echo "done sending shutdown instruction to dhcp server.."

Matt Pallissard

On 05/13/2016 09:33 AM, Terry Burton wrote:

> As mentioned I want to improve on what distributions are currently
> doing so I'm deliberately setting the bar high and it would be great
> if ISC could provide a single, approved, safe shutdown/restart
> mechanism or describe what is required to develop such a mechanism.
> Unfortunately the detail of Bug #36066 (retracting support for gentle
> shutdown) isn't available as it would be interesting to see what
> issues were encountered with the previous approach.

Re: Restarting DHCP safely whilst avoiding partner-down state

Terry Burton
On 13 May 2016 at 15:37, Pallissard, Matthew
<[hidden email]> wrote:
> I just tested this and it seemed to work for me.

Do you not find, if you tail the log on the partner, that it transitions
to "partner-down" rather than "communications-interrupted"?

Thanks!


> #dhcpd4.service
> [Unit]
> Description=IPv4 DHCP server
> After=network.target
>
> [Service]
> Type=forking
> PIDFile=/run/dhcpd4.pid
> ExecStart=/usr/bin/dhcpd -4 -q -cf /etc/dhcpd.conf -pf /run/dhcpd4.pid
> ExecStop=/path/to/shutdown/script.sh
>
> [Install]
> WantedBy=multi-user.target
>
> #/path/to/shutdown/script.sh
> #copy-pasted from
> https://kb.isc.org/article/AA-00475/0/Sending-a-Server-Shutdown-Message-Via-OMAPI.html
> #
> #!/bin/sh
>
> #  uses omshell to connect to a dhcp server on the
> #  local machine, create a control object, set the
> #  state of the control object, and update the
> #  running server to cause that server to shut down
> #  gracefully.
> #
> #  per dhcpd man page, server shutdown can take
> #  several seconds as the server waits for close
> #  on all OMAPI connections.  Watching log files
> #  for shutdown messages is recommended.
>
> omshell << END_OF_INPUT > /dev/null 2> /dev/null
> server localhost
> port 7911
> key omapi_key Ofakekeyfakekeyfakekey==
> connect
> new control
> open
> set state=2
> update
> END_OF_INPUT
>
> echo "done sending shutdown instruction to dhcp server.."
>
> Matt Pallissard
>
>
> On 05/13/2016 09:33 AM, Terry Burton wrote:
>>
>> On 13 May 2016 at 15:10, Steve van der Burg <[hidden email]>
>> wrote:
>>>
>>> Here we push out new configs to a partner pair from a central server.
>>> The config for one of the partners contains an extra file
>>> (dhcpd.i.am.secondary).  Each of the partners runs this every minute (perl
>>> script):
>>>
>>>   if ( -e "$spath/dhcpd.i.am.secondary" ) {
>>>      exit if (localtime)[1] % 2 == 0;
>>>   }
>>>   else {
>>>      exit if (localtime)[1] % 2 == 1;
>>>   }
>>>
>>>   ... continue (test new config, kill running server, start new one, etc)
>>>
>>> So the config change, stop, start, etc, can only happen on odd minutes
>>> for one server and even minutes for the other.  As long as startup time is
>>> less than a minute (and it's much, much less than that) it all works
>>> smoothly.
>>
>>
>> Thanks Steve. We've also been pushing configs around then
>> synchronously restarting servers back-to-back (without sleeping) for
>> several years without incident.
>>
>> It makes me a little suspicious about whether just killing the process
>> is indeed unsafe... But then maybe we've been lucky.
>>
>> As mentioned I want to improve on what distributions are currently
>> doing so I'm deliberately setting the bar high and it would be great
>> if ISC could provide a single, approved, safe shutdown/restart
>> mechanism or describe what is required to develop such a mechanism.
>> Unfortunately the detail of Bug #36066 (retracting support for gentle
>> shutdown) isn't available as it would be interesting to see what
>> issues were encountered with the previous approach.
>>
>>
>>> Chuck Anderson <[hidden email]> wrote:
>>>>
>>>> FWIW, we've been using the "kill" method for over a decade without any
>>>> noticeable side-effects (the default init.d scripts from RHEL 6
>>>> (actually Scientific Linux 6) dhcp package).  We've never had to
>>>> manually clean up a corrupted lease file.  We restart the services
>>>> automatically on a 20 minute cycle, as needed.  We do one, then
>>>> immediately do the other.  We do not wait to restart the other, and we
>>>> do not monitor to see if failover has reconnected and rebalanced
>>>> before restarting the other, but since we are SSH-ing into each server
>>>> to do the restart, there might be enough of a built-in delay between
>>>> restarting each server.
>>>>
>>>> I don't know if a corrupted lease file would cause a failure to start
>>>> the dhcp server, or if it would just go unnoticed, perhaps with a log
>>>> message.  But like I said, we've never had a failure to start the
>>>> server that was caused by a lease file issue.
>>>>
>>>> Our script does test the config file before doing the restart:
>>>>
>>>> #!/bin/bash
>>>> echo -n "Testing DHCP configuration: "
>>>> if sudo /etc/rc.d/init.d/dhcpd configtest; then
>>>>         echo "Restarting DHCP"
>>>>         sudo /etc/rc.d/init.d/dhcpd restart
>>>> else
>>>>         echo "FAIL: Not restarting DHCP"
>>>> fi
>>>>
>>>> which in CentOS 6 does the following:
>>>>
>>>> exec=/usr/sbin/dhcpd
>>>> configtest() {
>>>>     [ -x $exec ] || return 5
>>>>     [ -f $config ] || return 6
>>>>     $exec -q -t -cf $config
>>>>     RETVAL=$?
>>>>     if [ $RETVAL -eq 1 ]; then
>>>>         $exec -t -cf $config
>>>>     else
>>>>         echo "Syntax: OK" >&2
>>>>     fi
>>>>     return $RETVAL
>>>> }

Re: Restarting DHCP safely whilst avoiding partner-down state

Pallissard, Matthew
I just tested it on a standalone box I use for testing to see if it brought down dhcp cleanly.  Give it a whirl on a test environment and let us know how it goes.


Matt Pallissard


Re: Restarting DHCP safely whilst avoiding partner-down state

Anderson, Charles R
In reply to this post by Terry Burton
On Fri, May 13, 2016 at 03:23:25PM +0100, Terry Burton wrote:
> On 13 May 2016 at 14:22, Chuck Anderson <[hidden email]> wrote:
> > I don't know if a corrupted lease file would cause a failure to start
> > the dhcp server, or if it would just go unnoticed, perhaps with a log
> > message.  But like I said, we've never had a failure to start the
> > server that was caused by a lease file issue.
>
> In our experience leases files corrupted by other means can cause a
> failure to start. I don't recall whether that was due to mere
> truncation though...

There is also the -T parameter to test the lease file:

       The -T flag can be used to test the lease database file in a similar way.

It might be a good idea to also use this test before restarting.
While it won't fix a corrupted lease file, it may prevent you from
losing all DHCP service due to a failure to restart.
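
A minimal sketch of such a pre-restart check, assuming the -t and -T flags behave as the man page describes. The paths in DHCPD, CONFIG and LEASES are placeholders and will differ per distribution, and preflight is a hypothetical helper name. Whether -T gives a trustworthy answer while the running daemon still has the lease file open is not something this sketch settles.

```shell
#!/bin/sh
# Sketch only: all three paths are assumptions; override them for your system.
DHCPD=${DHCPD:-/usr/sbin/dhcpd}
CONFIG=${CONFIG:-/etc/dhcp/dhcpd.conf}
LEASES=${LEASES:-/var/lib/dhcpd/dhcpd.leases}

# Return 0 only if both the configuration (-t) and the lease file (-T) parse.
# Returns 1 on a configuration failure and 2 on a lease file failure.
preflight() {
    "$DHCPD" -q -t -cf "$CONFIG" || { echo "config test failed" >&2; return 1; }
    "$DHCPD" -q -T -cf "$CONFIG" -lf "$LEASES" || { echo "lease test failed" >&2; return 2; }
    return 0
}
```

A restart wrapper could then run "preflight" first and only restart the service when it returns 0.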

Re: Restarting DHCP safely whilst avoiding partner-down state

Terry Burton
On 13 May 2016 at 15:57, Chuck Anderson <[hidden email]> wrote:

> There is also the -T parameter to test the lease file:
>
>        The -T flag can be used to test the lease database file in a similar way.
>
> It might be a good idea to also use this test before restarting.
> While it won't fix a corrupted lease file, it may prevent you from
> losing all DHCP service due to a failure to restart.

I think this will require the leases file to be closed at the point of
testing, i.e. the daemon has already exited.

For the more general issue with systemd verifying the configuration
see: https://lists.freedesktop.org/archives/systemd-devel/2016-May/036481.html

Re: Restarting DHCP safely whilst avoiding partner-down state

Terry Burton
In reply to this post by Pallissard, Matthew
On 13 May 2016 at 15:48, Pallissard, Matthew
<[hidden email]> wrote:
> I just tested it on a standalone box I use for testing to see if it brought
> down dhcp cleanly.  Give it a whirl on a test environment and let us know
> how it goes.

Testing shows that, as alluded to in my original description of the
problem, the OMAPI method is unworkable for restarting a failover
pair.

In its dying breath the server transitions from normal -> shutdown -> recover:

May 13 16:34:12 dhcp1 dhcpd: failover peer dhcp1: I move from normal to shutdown
May 13 16:34:17 dhcp1 dhcpd: failover peer dhcp1: I move from shutdown to recover

Upon start it resumes its recover state, sends an "update request all"
message and enters recover-wait (pushing the peer into partner-down):

May 13 16:34:20 dhcp1 dhcpd: failover peer dhcp1: I move from recover to startup
May 13 16:34:20 dhcp1 dhcpd: Server starting service.
May 13 16:34:20 dhcp1 dhcpd: failover peer dhcp1: peer moves from normal to communications-interrupted
May 13 16:34:20 dhcp1 dhcpd: failover peer dhcp1: I move from startup to recover
May 13 16:34:20 dhcp1 dhcpd: Sent update request all message to dhcp1
May 13 16:34:20 dhcp1 dhcpd: failover peer dhcp1: peer update completed.
May 13 16:34:22 dhcp1 dhcpd: failover peer dhcp1: peer update completed.
May 13 16:34:22 dhcp1 dhcpd: failover peer dhcp1: I move from recover to recover-wait

This is far too disruptive for a frequent reload of the configuration
of both instances in a pair.
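
For anyone repeating this test, a rough sketch of how the peer's transitions could be pulled out of the log. It assumes syslog lines in the unwrapped "peer moves from X to Y" form shown above; last_peer_state and avoided_partner_down are hypothetical helper names, not part of dhcpd.

```shell
#!/bin/sh
# Print the most recent state the failover peer moved to.
# Reads syslog lines on stdin; prints nothing if no transition is found.
last_peer_state() {
    awk '/failover peer .*: peer moves from / { state = $NF }
         END { if (state != "") print state }'
}

# Succeed only if the peer never entered partner-down in the log on stdin.
avoided_partner_down() {
    ! grep -q 'peer moves from .* to partner-down'
}
```

Feeding the log excerpt from the restart window to avoided_partner_down gives a pass/fail answer to the question of whether a given restart method was gentle enough.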



Re: Restarting DHCP safely whilst avoiding partner-down state

Sten Carlsen
In reply to this post by Terry Burton
Restarting the DHCP server has been discussed here a number of times. AFAICT the official method is to kill it and start it again.


-- 
Best regards

Sten Carlsen

No improvements come from shouting:

       "MALE BOVINE MANURE!!!" 


Re: Restarting DHCP safely whilst avoiding partner-down state

Anderson, Charles R
In reply to this post by Terry Burton
On Fri, May 13, 2016 at 04:02:23PM +0100, Terry Burton wrote:

> I think this will require the leases file to be closed at the point of
> testing, i.e. the daemon has already exited.
>
> For the more general issue with systemd verifying the configuration
> see: https://lists.freedesktop.org/archives/systemd-devel/2016-May/036481.html

Is there a way to signal dhcpd to write out the lease file so it can
be checked?

It seems that dhcpd needs a journaling mechanism similar to named,
where it writes the changes to a .jnl file and periodically
incorporates those changes into the main zone file.

Re: Restarting DHCP safely whilst avoiding partner-down state

dave c
Are folks forgetting that the default action of the kill command is to send the TERM signal?
That signal should tell the daemon to do an orderly shutdown, close the leases file cleanly,
send whatever signals to the partner that are required and then exit when everything is ready.

The concerns I am seeing below would apply if folks were issuing a kill -9 to stop the
service, at which point the leases file could potentially be corrupted.

As for a journal for the leases file, that could be created, but then it would break the methods
currently used to monitor and process the leases file. Today, dhcpd seems to append each new
lease, so it is always adding to the end of the file. Once an hour it saves the active leases
file that it has been appending to by renaming it, and writes out a brand new file from scratch
containing all active leases from memory.

I learned the hard way what happens when dhcpd has read-write access to the leases file but no
permission to create new files in the enclosing directory... the leases file will grow forever
and never be rewritten :( Or at least it will grow until the next restart, because the leases
file gets rewritten as part of the startup process while the daemon is still running as root,
before it sheds its privileges to the dhcp user. I had a cron job restarting my daemon until
I realized what I had allowed to happen :)
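
To make the permission point above concrete, here is a minimal sketch of the general "write a fresh file, then rename it into place" pattern. It is not dhcpd's actual code: the function name `rewrite_leases` and the `.new` suffix are made up for illustration, and this variant renames the new file over the old one rather than keeping the old file as a backup as described above. The point it illustrates is that creating the replacement file requires write access to the directory, not only to the existing leases file.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper, NOT taken from dhcpd: replace `path` with
 * `contents` by writing a temporary file in the same directory and
 * renaming it over the original. Returns 0 on success, -1 on failure. */
int rewrite_leases(const char *path, const char *contents)
{
    char tmp[4096];
    size_t len = strlen(contents);
    size_t off = 0;

    /* ".new" is an assumed suffix chosen for this sketch. */
    int n = snprintf(tmp, sizeof tmp, "%s.new", path);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;

    /* This open() is the step that fails when the process may write
     * the existing file but may not create files in the directory. */
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* write() may write fewer bytes than requested, so loop. */
    while (off < len) {
        ssize_t w = write(fd, contents + off, len - off);
        if (w < 0) {
            close(fd);
            unlink(tmp);
            return -1;
        }
        off += (size_t)w;
    }

    /* Flush to disk before the rename makes the new file visible. */
    if (fsync(fd) != 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    if (close(fd) != 0) {
        unlink(tmp);
        return -1;
    }

    /* rename() replaces the old file in a single step. */
    if (rename(tmp, path) != 0) {
        unlink(tmp);
        return -1;
    }
    return 0;
}
```

A caller that sees -1 here keeps the old file untouched, which matches the "grows forever" symptom: appends keep working while every rewrite attempt fails.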

So it sounds like a lot of angst over nothing... a TERM signal is defined as closing all
processes and threads cleanly, writing out the last bits of data and stopping things in an
orderly fashion. So seems that issuing kill {dhcpd pid} would be perfectly acceptable to close
things down even in a partner scenario.

What I don't yet have a clear handle on is the timing considerations of a partner system being
manipulated by external command and control processes e.g. adding a new vlan definition to both
servers and restarting them at the same time or within seconds of each other.

Do I need to incorporate a delay as was done by one of the earlier posters on this thread or is
that precaution an unneeded complication? What happens when both partners are restarted at the
same time? Does it delay the startup and cause DHCP responses to be ignored until they work
things out among themselves?

I am seeing reports in this thread from both extremes... one who forces a delay with even/odd
minute detection and another who seems to not care how closely in time the two restart.

That's the question I believe we should be caring about here...

Thanks,
Dave

On 5/13/16 13:02, Chuck Anderson wrote:

> On Fri, May 13, 2016 at 04:02:23PM +0100, Terry Burton wrote:
>> On 13 May 2016 at 15:57, Chuck Anderson <[hidden email]> wrote:
>>> On Fri, May 13, 2016 at 03:23:25PM +0100, Terry Burton wrote:
>>>> On 13 May 2016 at 14:22, Chuck Anderson <[hidden email]> wrote:
>>>>> I don't know if a corrupted lease file would cause a failure to start
>>>>> the dhcp server, or if it would just go unnoticed, perhaps with a log
>>>>> message.  But like I said, we've never had a failure to start the
>>>>> server that was caused by a lease file issue.
>>>>
>>>> In our experience leases files corrupted by other means can cause a
>>>> failure to start. I don't recall whether that was due to mere
>>>> truncation though...
>>>
>>> There is also the -T parameter to test the lease file:
>>>
>>>        The -T flag can be used to test the lease database file in a similar way.
>>>
>>> It might be a good idea to also use this test before restarting.
>>> While it won't fix a corrupted lease file, it may prevent you from
>>> losing all DHCP service due to a failure to restart.
>>
>> I think this will require the leases file to be closed at the point of
>> testing, i.e. the daemon has already exited.
>>
>> For the more general issue with systemd verifying the configuration
>> see: https://lists.freedesktop.org/archives/systemd-devel/2016-May/036481.html
>
> Is there a way to signal dhcpd to write out the lease file so it can
> be checked?
>
> It seems that dhcpd needs a journaling mechanism similar to named,
> where it writes the changes to a .jnl file and periodically
> incorporates those changes into the main zone file.

--
Dave Calafrancesco

Re: Restarting DHCP safely whilst avoiding partner-down state

Simon Hobson
In reply to this post by Anderson, Charles R
Chuck Anderson <[hidden email]> wrote:

> Is there a way to signal dhcpd to write out the lease file so it can
> be checked?

Surely a simple change would be to not act on a normal kill signal in the middle of a lease file write? Capture that the signal arrived, and act on it as soon as the complete lease has been written.
That one change alone would completely remove the "wrote half a lease to the file" issue.


Re: Restarting DHCP safely whilst avoiding partner-down state

Terry Burton
In reply to this post by Sten Carlsen
On 13 May 2016 at 18:37, Sten Carlsen <[hidden email]> wrote:
> There have been a number of discussions about restarting the DHCP
> server. AFAICT the official method is to kill it and start again.

This is not the case. See the recent article AA-01043 [1] quoted in my
initial post:

"kill is the recommended
option, except where there is a high turnover of leases and the
production environment requires a high degree of reliability from
DHCP. In that case, we'd suggest that administrators consider using
OMAPI to control the daemon instead and to request a graceful
shutdown. The reason for this is that there is the slight possibility
that by using kill, administrators may stop dhcpd in the middle of
appending a lease to the leases file (in which case it may become
corrupted). This risk, while tiny, may be significant enough for some
administrators to prefer to use OMAPI instead."


[1] https://kb.isc.org/article/AA-01043/0/Recommendations-for-restarting-a-DHCP-failover-pair.html



Re: Restarting DHCP safely whilst avoiding partner-down state

Terry Burton
In reply to this post by Anderson, Charles R
On 13 May 2016 at 19:02, Chuck Anderson <[hidden email]> wrote:

> On Fri, May 13, 2016 at 04:02:23PM +0100, Terry Burton wrote:
>> On 13 May 2016 at 15:57, Chuck Anderson <[hidden email]> wrote:
>> > On Fri, May 13, 2016 at 03:23:25PM +0100, Terry Burton wrote:
>> >> On 13 May 2016 at 14:22, Chuck Anderson <[hidden email]> wrote:
>> >> > I don't know if a corrupted lease file would cause a failure to start
>> >> > the dhcp server, or if it would just go unnoticed, perhaps with a log
>> >> > message.  But like I said, we've never had a failure to start the
>> >> > server that was caused by a lease file issue.
>> >>
>> >> In our experience leases files corrupted by other means can cause a
>> >> failure to start. I don't recall whether that was due to mere
>> >> truncation though...
>> >
>> > There is also the -T parameter to test the lease file:
>> >
>> >        The -T flag can be used to test the lease database file in a similar way.
>> >
>> > It might be a good idea to also use this test before restarting.
>> > While it won't fix a corrupted lease file, it may prevent you from
>> > losing all DHCP service due to a failure to restart.
>>
>> I think this will require the leases file to be closed at the point of
>> testing, i.e. the daemon has already exited.
>>
>> For the more general issue with systemd verifying the configuration
>> see: https://lists.freedesktop.org/archives/systemd-devel/2016-May/036481.html
>
> Is there a way to signal dhcpd to write out the lease file so it can
> be checked?

Then you're examining the crime scene before the murder is committed.

> It seems that dhcpd needs a journaling mechanism similar to named,
> where it writes the changes to a .jnl file and periodically
> incorporates those changes into the main zone file.

That would seem to be a good idea, however it may not be necessary. By
default dhcpd calls fsync after writing a lease into the leases file
and before issuing the DHCP response to the client. If the server can
detect and remove an incomplete lease at the end of the file then all
should be well. Perhaps it does?...
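
For illustration only, this is roughly what "detect and remove an incomplete lease at the end of the file" could look like. It is a sketch under a strong assumption that has not been checked against dhcpd's parser: every complete top-level record ends with a line consisting solely of `}`. The function name `complete_prefix_len` is hypothetical, and a real implementation would also have to cope with statements that are not brace-delimited.

```c
#include <stddef.h>

/* Return the length of the longest prefix of buf[0..len) that ends
 * right after a line consisting solely of "}". Everything after that
 * point is treated as an incomplete trailing record.
 * ASSUMPTION (not verified against dhcpd): each top-level record is
 * closed by "}" alone on its own line; indented closing braces of
 * nested blocks are therefore ignored. */
size_t complete_prefix_len(const char *buf, size_t len)
{
    size_t keep = 0;         /* end of the last complete record */
    size_t line_start = 0;   /* offset where the current line begins */

    for (size_t i = 0; i < len; i++) {
        if (buf[i] != '\n')
            continue;
        /* The current line spans [line_start, i). */
        if (i - line_start == 1 && buf[line_start] == '}')
            keep = i + 1;    /* keep up to and including the newline */
        line_start = i + 1;
    }
    return keep;
}
```

A recovery step could read the file, compute this length, and truncate the file to it (for example with `ftruncate`) before parsing, so that a half-written final lease is dropped rather than treated as corruption.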

Re: Restarting DHCP safely whilst avoiding partner-down state

Terry Burton
In reply to this post by Simon Hobson
On 13 May 2016 at 19:26, Simon Hobson <[hidden email]> wrote:
> Chuck Anderson <[hidden email]> wrote:
>
>> Is there a way to signal dhcpd to write out the lease file so it can
>> be checked?
>
> Surely a simple change would be to not act on a normal kill signal in the middle of a lease file write ? Capture that the signal arrived, and act on it as soon as the complete lease has been written.
> That one change alone would completely remove the "wrote half a lease to the file" issue.

I couldn't agree more...

Re: Restarting DHCP safely whilst avoiding partner-down state

Terry Burton
In reply to this post by dave c
On 13 May 2016 at 19:25, dave c <[hidden email]> wrote:
> Are folks forgetting that the default action of the kill command is to send
> the TERM signal? That signal should tell the daemon to do an orderly
> shutdown, close the leases file cleanly, send whatever signals to the
> partner that are required and then exit when everything is ready.
>
> All the concern I am seeing below would be true if folks were issuing a kill
> -9 to stop the service. At which point the leases file would get potentially
> corrupted.
<...snip...>
> So it sounds like a lot of angst over nothing... a TERM signal is defined as
> closing all processes and threads cleanly, writing out the last bits of data
> and stopping things in an orderly fashion. So seems that issuing kill {dhcpd
> pid} would be perfectly acceptable to close things down even in a partner
> scenario.

Where do you get the definition of a SIGTERM causing a graceful
shutdown (other than by convention) and if this were the case for ISC
DHCP then why the warning about truncated leases given in AA-01043?

The effect of receiving a handleable signal is to immediately jump
into the trap handler if one is configured for that signal, otherwise
to die.

Unless a handler takes care to ensure that everything is consistent
before it exits, SIGTERM, SIGINT, etc. are potentially dangerous.

The release notes indicate that a "gentle shutdown" feature was added
in the past and then subsequently removed because the semantics chosen
caused operational issues - but what these were isn't known because
the associated bug report isn't publicly available.

I need to find time to understand the current codebase, but what I'd
like to know is the intended semantics and what issues would be
encountered in implementing them in the way that Simon Hobson suggests.

> What I don't yet have a clear handle on is the timing considerations of a
> partner system being manipulated by external command and control processes
> e.g. adding a new vlan definition to both servers and restarting them at the
> same time or within seconds of each other.
>
> Do I need to incorporate a delay as was done by one of the earlier posters
> on this thread or is that precaution an unneeded complication? What happens
> when both partners are restarted at the same time? Does it delay the startup
> and cause DHCP responses to be ignored until they work things out among
> themselves?
>
> I am seeing reports in this thread from both extremes... one who forces a
> delay with even/odd minute detection and another who seems to not care how
> closely in time the two restart.
>
> That's the question I believe we should be caring about here...

That's *your question* (perhaps an interesting one, no offence
intended) but it is not the one I'm asking here so feel free to open a
new thread.


Thanks,

Terry



Re: Restarting DHCP safely whilst avoiding partner-down state

Terry Burton
On 13 May 2016 at 20:06, Terry Burton <[hidden email]> wrote:

> On 13 May 2016 at 19:25, dave c <[hidden email]> wrote:
>> Are folks forgetting that the default action of the kill command is to send
>> the TERM signal? That signal should tell the daemon to do an orderly
>> shutdown, close the leases file cleanly, send whatever signals to the
>> partner that are required and then exit when everything is ready.
>>
>> All the concern I am seeing below would be true if folks were issuing a kill
>> -9 to stop the service. At which point the leases file would get potentially
>> corrupted.
> <...snip...>
>> So it sounds like a lot of angst over nothing... a TERM signal is defined as
>> closing all processes and threads cleanly, writing out the last bits of data
>> and stopping things in an orderly fashion. So seems that issuing kill {dhcpd
>> pid} would be perfectly acceptable to close things down even in a partner
>> scenario.
>
> Where do you get the definition of a SIGTERM causing a graceful
> shutdown (other than by convention) and if this were the case for ISC
> DHCP then why the warning about truncated leases given in AA-01043?
>
> The effect of receiving a handleable signal is to immediately jump
> into the trap handler if one is configured for that signal, otherwise
> to die.
>
> Unless a handler takes care to ensure that everything is consistent
> before it exits, SIGTERM, SIGINT, etc. are potentially dangerous.
>
> The release notes indicate that a "gentle shutdown" feature was added
> in the past and then subsequently removed because the semantics chosen
> caused operational issues - but what these were isn't known because
> the associated bug report isn't publicly available.
>
> I need to find time to understand the current codebase, but what I'd
> like to know is the intended semantics and what issues would be
> encountered in implementing them in the way that Simon Hobson suggests.

So currently there are no trap handlers for SIGTERM or SIGINT and
therefore no cleanup whatsoever at exit.

There is a compiled-out option ENABLE_GENTLE_SHUTDOWN which installs
handlers for these signals but when this was activated it implemented
the harmful semantics of putting the server through a
recovery+partner-down transition which isn't useful for a quick
configuration reload:

/* Enable the gentle shutdown signal handling.  Currently this
   means that on SIGINT or SIGTERM a client will release its
   address and a server in a failover pair will go through
   partner down.  Both of which can be undesireable in some
   situations.  We plan to revisit this feature and may
   make non-backwards compatible changes including the
   removal of this define.  Use at your own risk.  */
/* #define ENABLE_GENTLE_SHUTDOWN */

#if defined(ENABLE_GENTLE_SHUTDOWN)
        /* no signal handlers until we deal with the side effects */
        /* install signal handlers */
        signal(SIGINT, dhcp_signal_handler);   /* control-c */
        signal(SIGTERM, dhcp_signal_handler);  /* kill */
#endif

Having a more basic signal handler that defers the exit in order to
finish writing out an outstanding lease seems better. Perhaps one
could even differentiate these exit semantics based on SIGINT vs
SIGTERM.

If someone who can speak for ISC is able to indicate whether this
would be a sensible approach then I am happy to work up a patch.

