Catastrophic failure and recovery

Catastrophic failure and recovery

Gregory Sloop
So, in the case I'm interested in here, I've got a pair of peers [failover].
[ISC/We really should pick a different name than failover, because it's essentially load-balancing with redundancy, but I digress :) ]

Now while I'm using two peers, I think the question I'm asking about will be the same regardless of peers or a single server...

So, let's assume the DHCP server [or a peer] dies. Assume we lost a disk.
Assume I've got configs, but no leases file.

What's the best recovery method?

---
I assume we'll simply put the configurations back on a "new" server. [or peer]
Turn it on and bring it up. [In the peer setup, let it communicate with the other peer.]

Since it won't have a record of any leases [that the dead-peer/old-server actually leased] we'll have a bit of a mess.
But, we'd hope that most machines would already have a lease, and would ask for renewal of that lease.
The server, I think, would generally grant that lease renewal on the same IP. [Even though it has no record of it initially.]

"New" machines that have just powered up may/will ask for new addresses, and may "steal" a lease from an active client.
However, if the DHCP server can use ping-check [and is set to do so] AND the station isn't firewalled or otherwise prevented from receiving/responding to the ping-check, then the DHCP server will realize there's an active client using the address and will avoid leasing that address.

If the active lease is on a machine that's off and returns to the network [before the end of the lease] I'm not sure of the result. I *think* it will attempt to confirm the lease when it comes back on, will get a NAK and be forced to get a new lease.

Thus, generally, using best practices, the result of a catastrophic loss of a DHCP server shouldn't be too disruptive.
[Provided it can be replaced fairly quickly before too many machines lose their current lease.]
The above setup will be a lot cleaner if there's not much/any IP address churn - that is, if for a particular pool there are enough addresses to give every machine an address simultaneously. If there's a lot of churn it will be substantially messier, and machines will see far less stability in IP address assignment. [But there wasn't a lot of stability to start with, so we've probably only increased the churn rate some.]

Does that sound about right?
I'm sure there are use cases I'm not considering because I don't have those configurations - but am I missing anything serious?

---
On a side note - is it worth capturing [backing up] the leases file, say at an interval of 0.5 times the lease length? [The idea would be to have a reasonably current leases file that might be 80%+ right. Or is this likely to cause more problems than no leases file at all?]
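
A minimal sketch of such a backup job, in Python. Everything in it is an assumption for illustration: the helper names (`backup_interval`, `backup_leases`) and the paths in the usage note are hypothetical, and the half-lease interval is simply the figure floated above, not an ISC recommendation. It does not answer whether restoring a stale copy is safer than starting with no leases file.

```python
import shutil
import time
from pathlib import Path


def backup_interval(lease_seconds: int, factor: float = 0.5) -> float:
    """Seconds to wait between backups: a fraction of the lease length."""
    return lease_seconds * factor


def backup_leases(leases: Path, backup_dir: Path) -> Path:
    """Copy the leases file to a UTC-timestamped file and return its path."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime())
    target = backup_dir / f"{leases.name}.{stamp}"
    shutil.copy2(leases, target)  # copy2 also preserves the modification time
    return target
```

With a 4-hour lease (14400 seconds), `backup_interval(14400)` gives 7200, so a timer or cron job every two hours could call something like `backup_leases(Path("/var/lib/dhcp/dhcpd.leases"), Path("/backup/dhcp"))`. Both paths are placeholders; the real location of the leases file depends on the installation.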

Pointers to FAQ/Docs etc gladly accepted!

TIA
-Greg.
_______________________________________________
dhcp-users mailing list
[hidden email]
https://lists.isc.org/mailman/listinfo/dhcp-users

Re: Catastrophic failure and recovery

perl-list
What you describe is how it would work if you didn't have failover set up at all. With failover set up, the "new" server, when it connects to the existing one, will get a list of all the current leases and such. It will then enter the "recover" period, during which it won't hand out any leases. The "recover" period lasts for the MCLT (maximum client lead time, from the failover configuration). Once that period has passed, both servers will operate as normal.
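
As a rough illustration of that timing, here is a small Python sketch. The function name `recover_done` and the 3600-second MCLT are made up for the example; the only rule it encodes is the one described above (no leases handed out until MCLT has elapsed), counted here from the moment the rebuilt peer has received the lease list.

```python
from datetime import datetime, timedelta


def recover_done(synced_at: datetime, mclt_seconds: int) -> datetime:
    """Earliest time the rebuilt peer would leave "recover" and serve normally."""
    return synced_at + timedelta(seconds=mclt_seconds)


# Hypothetical values: lease list received at 18:00:00, mclt of one hour.
print(recover_done(datetime(2018, 6, 25, 18, 0, 0), 3600))  # 2018-06-25 19:00:00
```

During that window the surviving peer is the only one answering clients, so the MCLT setting bounds how long the pair runs without redundancy after a rebuild.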


Re: Catastrophic failure and recovery

Gregory Sloop
Ok, I do see somewhat similar lease information in each lease file [from each peer.]
[I thought each peer essentially only kept track of its own leases, with some communication/coordination data.]

Yet the details in the lease records don't match exactly.

So, here's an example:

[This lease example is from the peer who has not issued the lease.]
starts 1 2018/06/25 17:13:25;
ends 1 2018/06/25 21:13:25;
tstp 6 2018/06/23 04:26:08;
tsfp 1 2018/06/25 23:13:25;
atsfp 1 2018/06/25 23:13:25;
cltt 3 2018/06/06 00:02:22;
binding state active;
next binding state expired;

[This lease example is from the peer who *has* issued the lease.]
starts 1 2018/06/25 17:13:25;
ends 1 2018/06/25 21:13:25;
tstp 1 2018/06/25 23:13:25;
tsfp 1 2018/06/25 23:13:25;
atsfp 1 2018/06/25 23:13:25;
cltt 1 2018/06/25 17:13:25;
binding state active;
next binding state expired;
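
To make the mismatch easier to see, here is a short Python sketch that compares the two records field by field. The parser is a toy written for exactly these eight statements (the function names are hypothetical); it is not a general dhcpd.leases parser.

```python
NON_ISSUING = """
starts 1 2018/06/25 17:13:25;
ends 1 2018/06/25 21:13:25;
tstp 6 2018/06/23 04:26:08;
tsfp 1 2018/06/25 23:13:25;
atsfp 1 2018/06/25 23:13:25;
cltt 3 2018/06/06 00:02:22;
binding state active;
next binding state expired;
"""

ISSUING = """
starts 1 2018/06/25 17:13:25;
ends 1 2018/06/25 21:13:25;
tstp 1 2018/06/25 23:13:25;
tsfp 1 2018/06/25 23:13:25;
atsfp 1 2018/06/25 23:13:25;
cltt 1 2018/06/25 17:13:25;
binding state active;
next binding state expired;
"""


def parse_fields(text: str) -> dict:
    """Map each statement keyword to its value for the records above."""
    fields = {}
    for line in text.strip().splitlines():
        stmt = line.strip().rstrip(";")
        if "binding state" in stmt:
            # "binding state active" -> key "binding state", value "active"
            key, value = stmt.rsplit(" ", 1)
        else:
            # "tstp 6 2018/06/23 04:26:08" -> key "tstp", value is the rest
            key, value = stmt.split(" ", 1)
        fields[key] = value
    return fields


def differing_fields(a: dict, b: dict) -> dict:
    """Return {keyword: (value_in_a, value_in_b)} for keywords that differ."""
    return {k: (a[k], b[k]) for k in a if k in b and a[k] != b[k]}


diff = differing_fields(parse_fields(NON_ISSUING), parse_fields(ISSUING))
for key, (left, right) in diff.items():
    print(f"{key}: {left}  vs  {right}")
```

Only `tstp` and `cltt` differ; `starts`, `ends`, `tsfp`, `atsfp` and both binding states are identical on the two peers.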

So, let's assume the peer that did issue the lease dies and we set up a "new" peer with only the configuration.
The new peer will gather enough data, simply from communicating with the still-active peer [the one that didn't issue the lease], to know that it is the peer responsible for this lease, recover the prior lease data, and rebuild its lease file properly. Correct?

[That seems reasonable, given what I see in the leases file - but just wanting to be sure I've not assumed something incorrectly from the response.]

A few follow-up questions:
--Why is: tstp 6 2018/06/23 04:26:08; in one vs tstp 1 2018/06/25 23:13:25; in the other?
[The docs I see say that this "indicates what time the peer has been told the lease expires." But this would seem to indicate that the two peers think the lease expires at different times.]

--I can't find any documentation describing what the numbers after starts, ends, tstp, tsfp, atsfp, etc. mean. [1, 6, 3, etc.]

Thanks again!

--
Gregory Sloop, Principal: Sloop Network & Computer Consulting
Voice: 503.251.0452 x82
http://www.sloop.net
---

Re: Catastrophic failure and recovery

perl-list
starts and ends are the important numbers. Both peers are aware of when the lease starts and ends.

tsfp, atsfp and cltt all have to do with failover, I believe, though I don't remember what they are for. I think you will find that they are all offset by the amount of MCLT from your config file. A search of the mailing list archives would probably let you find out what those are for, as I know it's been discussed on here some time in the past.
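
A quick way to check the offset idea against the records posted earlier in this thread is to subtract the timestamps. The sketch below (Python, with a hypothetical helper `lease_time`) does that for `ends` and `tsfp`. It only shows the size of the gap; the thread never states the `mclt` value from the configuration, so whether the gap equals MCLT is not confirmed here.

```python
from datetime import datetime


def lease_time(value: str) -> datetime:
    """Parse 'N YYYY/MM/DD HH:MM:SS', ignoring the leading number."""
    _, stamp = value.split(" ", 1)
    return datetime.strptime(stamp, "%Y/%m/%d %H:%M:%S")


ends = lease_time("1 2018/06/25 21:13:25")
tsfp = lease_time("1 2018/06/25 23:13:25")
offset = tsfp - ends
print(offset)  # 2:00:00
```

So in the posted records `tsfp` (and `atsfp`) sit exactly two hours after `ends`. Note that `cltt` on the issuing peer equals `starts`, so it does not follow the same offset.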


Re: Catastrophic failure and recovery

glenn.satchell
man dhcpd.leases

explains the format of the leases file and what all the different fields are.
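
As I read that man page, dates are written as "weekday year/month/day hour:minute:second", where the weekday is a number from 0 to 6 with 0 being Sunday, and times are in UTC by default. It also describes tstp as the time the peer has been told the lease expires, tsfp as the expiry time the peer has acknowledged, atsfp as the actual time sent from the failover partner, and cltt as the client's last transaction time. A small Python check of the weekday reading against the values posted in this thread (the helper name `weekday_number` is made up for the example):

```python
from datetime import datetime


def weekday_number(stamp: str) -> int:
    """Day of week for 'YYYY/MM/DD HH:MM:SS', with 0 = Sunday ... 6 = Saturday."""
    dt = datetime.strptime(stamp, "%Y/%m/%d %H:%M:%S")
    return (dt.weekday() + 1) % 7  # Python's weekday() counts Monday as 0


for line in [
    "tstp 6 2018/06/23 04:26:08",
    "tsfp 1 2018/06/25 23:13:25",
    "cltt 3 2018/06/06 00:02:22",
]:
    keyword, number, stamp = line.split(" ", 2)
    print(keyword, number, weekday_number(stamp) == int(number))
```

All three lines match: 6 is a Saturday, 1 a Monday and 3 a Wednesday, so the number carries no failover meaning of its own.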

regards,
-glenn
