failover, partner-down state, MCLT and rewind binding

Gregory Sloop
So, I'm looking for a little more understanding. I had an outage last week that didn't go well.
I've had sort-of-similar problems in the past with this setup, and I *think* I know some of what happened this time, but I'd like confirmation.

After the last similar outage, I knew we needed to put the surviving peer in "partner-down" mode, and this, along with the new "rewind state

Here's the config, in general terms.
---
We're doing failover. [ISC dhcpd 4.2.4 on Ubuntu 14.04]

In this specific case, the "free" pool is quite small in comparison to the number of clients. [And this isn't easily fixable - we're addressing it, but there are limitations. Let's, for now, ignore that changes might make the situation more tolerable and focus on what's happening.]

But, since the pool is quite small, in a communication-interrupted state, the surviving peer can easily run out of leases in its part of the split pool. Thus, helping the surviving server do as well as possible in a bad situation is my goal. [Not enough free pool addresses on the surviving peer to handle all the renewals that will come over from the down server.]

Thus, here's what happened:
So the primary and then the secondary went down hard - a few minutes apart.

Fairly quickly, we got the secondary back up.
It handled all its assigned leases fine, as far as I can tell. It then handed out all the remaining pool leases it had. Then it ran out of leases.

I expected it to start recovering the leases from the peer [primary] at this point, and extend [rewind] the leases the primary had already issued.

While I'm not sure how the "rewind" provision worked - we were getting "peer holds all free leases" messages for quite a while.

I didn't think this should be happening - but then I looked at the MCLT time and it was set to 1800 [30m]

One issue to address, I assume:
I assume [now] that I should set MCLT as low as I can while still allowing a single server to handle the load. For example, if I set MCLT to 30 seconds, I need to be sure that a single server [or the slower of the pair] can handle the load of every client renewing every 15s, plus whatever I consider a safety margin. If my servers and network can handle this load, there's no real disadvantage to setting low MCLT times. [I'd perhaps tend to speculate that MCLT times ought to be one-fourth as long [or less] as the regular lease time - again if we can handle that load. This would allow the peer, when going into partner-down mode, to recover leases fairly quickly, in relation to how fast we might expect lease expirations.] Is that reasoning sound?
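As a rough way to put numbers on that reasoning, here is a minimal Python sketch. It assumes clients renew at half the lease time (which is where the 15s figure for a 30s MCLT comes from) and that every client holds an MCLT-length lease. The client count and the safety margin are hypothetical, not figures from this setup.

```python
def renewals_per_second(clients, mclt_seconds, renew_fraction=0.5):
    """Average renewals/s if every client holds an MCLT-length lease.

    renew_fraction is the point in the lease at which clients renew
    (0.5 = half the lease time; an assumption about client behaviour).
    """
    if clients < 0 or mclt_seconds <= 0 or not 0 < renew_fraction <= 1:
        raise ValueError("need clients >= 0, mclt_seconds > 0, 0 < renew_fraction <= 1")
    return clients / (mclt_seconds * renew_fraction)


def required_capacity(clients, mclt_seconds, safety_margin=0.5):
    """Renewals/s a single server should sustain, including a safety margin."""
    return renewals_per_second(clients, mclt_seconds) * (1 + safety_margin)


if __name__ == "__main__":
    clients = 3000  # hypothetical client count
    for mclt in (30, 60, 300, 1800):
        rate = renewals_per_second(clients, mclt)
        need = required_capacity(clients, mclt)
        print(f"MCLT {mclt:>4}s: {rate:7.1f} renewals/s, plan for {need:7.1f}/s")
```

Comparing the "plan for" figure against what the slower server of the pair can actually sustain gives a first indication of how low the MCLT could go.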

(The lesson for me here is: I believe I have my MCLT time set too high. At this point, both leases and MCLT are 30 minutes. I can handle a lot more load, so I'm leaning toward a 1-5m MCLT time and perhaps lengthening lease times to help bridge an outage.)

So, my additional questions generally relate to:
1) When can the surviving DHCP failover peer recover the unused pool addresses the "down" server had?

I don't see any specific answer in the list or elsewhere, but it seems logical that this would be the MCLT time. But how does it calculate that time? Again, I assume it would be from the time the second server goes into partner down mode + MCLT. Does this time get "reset" each time the surviving server is restarted? [I'm assuming there's no record of this written to disk. And, the server would have to assume stuff might have happened while it was being restarted - so each time it goes into partner down mode (which it will do when it gets restarted), it will have to wait MCLT time again before starting recovery.]

2) Expired/expiring leases: when can the surviving server recover these, or rewind them?

Recover to be used for another client:
This appears to be whatever the original lease was for [Is that STOS?] + MCLT. Once this time has passed, then the surviving peer who is in "partner-down" mode can take that lease and recover it for use.

Rewind: Does rewind work in "partner-down" mode - or only in communications-interrupted? [Not clear to me from docs.]
I assume that the surviving server would issue "rewind" leases for the MCLT time, as many times as needed, until the peer recovers. But I haven't seen any discussion about how the rewind state actually works. I'd be glad to be pointed at a discussion, if one exists.
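The timing assumed in questions 1 and 2 can be written down as a small Python sketch. It encodes only the assumptions stated above (the peer's free addresses become usable MCLT after entering partner-down; an expired lease becomes reusable MCLT after its original end time), not confirmed dhcpd behaviour. The function names and timestamps are made up for illustration.

```python
from datetime import datetime, timedelta


def free_pool_reclaim_time(partner_down_start, mclt):
    """Assumed earliest time the survivor may use the peer's free addresses."""
    return partner_down_start + mclt


def lease_reclaim_time(lease_end, mclt):
    """Assumed earliest time an expired lease may be given to another client."""
    return lease_end + mclt


if __name__ == "__main__":
    mclt = timedelta(seconds=1800)  # the MCLT mentioned above (30 minutes)
    entered_partner_down = datetime(2015, 11, 3, 10, 0)  # hypothetical timestamp
    lease_end = datetime(2015, 11, 3, 9, 50)  # hypothetical timestamp

    print("peer's free pool usable at:", free_pool_reclaim_time(entered_partner_down, mclt))
    print("expired lease reusable at: ", lease_reclaim_time(lease_end, mclt))

    # If the wait really does restart whenever the survivor restarts
    # (the open question in 1), a restart 20 minutes in delays the
    # free-pool time by the same 20 minutes.
    restarted = entered_partner_down + timedelta(minutes=20)
    print("after a restart:           ", free_pool_reclaim_time(restarted, mclt))
```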

However, I'd have expected rewind to work better in our case than it did - or at least better than it appeared. So,

I hope that's not too disjointed. I've also looked at the docs in an attempt to understand things - but what I've written above is the best I have from the docs and list discussion. Thanks for your help, in advance.

-Greg
_______________________________________________
dhcp-users mailing list
[hidden email]
https://lists.isc.org/mailman/listinfo/dhcp-users

Re: failover, partner-down state, MCLT and rewind binding

Cathy Almond
On 10/11/2015 01:43, Gregory Sloop wrote:
> So, I'm looking for a little more understanding. I had an outage last
> week that didn't work out so well.
> I've had sort-of-similar problems in the past with this setup, and I
> *think* I know some of what happened this time, but wanting confirmation.
>
> After the last similar outage, I knew we needed to put the surviving
> peer in "partner-down" mode, and this, along with the new "rewind state

Hi Greg,

I've just (belatedly, remembering that they had been written, but not
able to find them readily), made the two KB articles below published/public.

Hoping that they help - particularly around tweaking the default
failover settings and why a very small MCLT is not necessarily a good idea.

https://kb.isc.org/article/AA-00268/31/DHCP-Failover-and-MCLT-configuration-implications.html

https://kb.isc.org/article/AA-00327/31/Why-are-the-lease-times-short-and-random-during-communication-interrupted-state.html

Cathy


Re: failover, partner-down state, MCLT and rewind binding

Simon Hobson
In reply to this post by Gregory Sloop
Gregory Sloop <[hidden email]> wrote:

> Fairly quickly, got the secondary back up.
> It handled all its assigned leases fine, as far as I can tell. It then handed out all the remaining pool leases it had. Then it ran out of leases.
>
> I expected it to start recovering the leases from the peer [primary] at this point, and extend [rewind] the leases the primary had already issued.
>
> While I'm not sure how the "rewind" provision worked - we were getting "peer holds all free leases" messages for quite a while.
>
> I didn't think this should be happening - but then I looked at the MCLT time and it was set to 1800 [30m]

It is not clear: did you set that server into "partner down" mode?

If not, then AIUI what you saw was normal.

What you need to do in these situations is to manually set the surviving server into partner down mode - then it can have free use of the entire range and will "just work" as though it's a single server. When the peer comes back up, they should automatically return to normal mode.



Re: failover, partner-down state, MCLT and rewind binding

Gregory Sloop

SH> Gregory Sloop <[hidden email]> wrote:

>> Fairly quickly, got the secondary back up.
>> It handled all its assigned leases fine, as far as I can tell. It then handed out all the remaining pool leases it had. Then it ran out of leases.

>> I expected it to start recovering the leases from the peer [primary] at this point, and extend [rewind] the leases the primary had already issued.

>> While I'm not sure how the "rewind" provision worked - we were getting "peer holds all free leases" messages for quite a while.

>> I didn't think this should be happening - but then I looked at the MCLT time and it was set to 1800 [30m]

SH> It is not clear, did you set that server into "partner down" mode ?

I did set the "still up" server in partner-down mode. I verified it was in partner down mode via the OMAPI tool too.

But given what I read in the docs, about MCLT, I think it was operating normally - at least as far as the free address pool, and reclaiming addresses from the "down" server. [i.e. It still has to wait the MCLT time before reclaiming them. Since my MCLT time was as long as my lease time, leases were expiring and the clients unable to get another address for 30m+ because the partner-down server couldn't reclaim the split pool for 30 minutes after going into partner down mode. (And we didn't get the secondary server up and into partner down mode until after most all of the active leases had actually expired. And yes, it is probably an indication that a longer DHCP lease time would be helpful, in addition to shortening the MCLT.)]

I still need/want more information about how the rewind process works, if anyone has it.

[Is there a way, in the log files, to determine if and/or what clients got a "rewind" lease extension? I don't have a copy of the leases file at the time of the problem, so I can't look there for any data. So, I'm hoping there's evidence of "rewind" activity in the log files, which I can retrospectively review.]

TIA

-Greg



Re: failover, partner-down state, MCLT and rewind binding

Gregory Sloop
In reply to this post by Cathy Almond

CA> On 10/11/2015 01:43, Gregory Sloop wrote:
>> So, I'm looking for a little more understanding. I had an outage last
>> week that didn't work out so well.
>> I've had sort-of-similar problems in the past with this setup, and I
>> *think* I know some of what happened this time, but wanting confirmation.

>> After the last similar outage, I knew we needed to put the surviving
>> peer in "partner-down" mode, and this, along with the new "rewind state

CA> Hi Greg,

CA> I've just (belatedly, remembering that they had been written, but not
CA> able to find them readily), made the two KB articles below published/public.

CA> Hoping that they help - particularly around tweaking the default
CA> failover settings and why a very small MCLT is not necessarily a good idea.

CA> https://kb.isc.org/article/AA-00268/31/DHCP-Failover-and-MCLT-configuration-implications.html

CA> https://kb.isc.org/article/AA-00327/31/Why-are-the-lease-times-short-and-random-during-communication-interrupted-state.html

CA> Cathy

Thanks Cathy. But those documents don't add a lot of light to the discussion.

Specifically:

1) I understand why the lease extensions are different times and not just the MCLT time. BUT - the lease extensions should *NEVER* be shorter than the MCLT time, right? [They'll be of varying lengths, because there will/may be varying time left on the original leases - from before the fail-over pair went down. But once all the "original" leases have expired, all the remaining leases should be MCLT time, right?]

2) I *think* what I read essentially verifies what I said about MCLT times. If you use really short MCLT times, it's going to put extra load on the environment [network and servers] even in regular mode. [This is because the initial lease, even when running in "communication-normal" mode, is for the MCLT time. _After that initial lease_, however, clients will get the regular DHCP lease time.] So, I realize that _really_ short MCLT times can adversely impact the performance of your servers both in communications-normal mode and in interrupted mode [as well as other recovery or failure modes]. However, I don't see any reason, _other than performance_, to select longer MCLT times. Do I understand that correctly?

So, is there some reason/benefit, other than performance [load on network, clients and servers] to select longer MCLT times?

And as a corollary, I think MCLT times, provided your server can handle the load, should be some small fraction of the DHCP lease time. My initial thought - which wasn't encumbered by a lot of deep thought - is around 20-25% of the regular DHCP lease time. That would mean that the server/network should be able to sustain about four to five times the regular load in a failure situation.
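The "four to five times" figure follows from simple division, sketched below in Python. The sketch assumes that renewal load scales inversely with lease length and that, in a failure situation, every client ends up on MCLT-length leases. The 1800s lease time is the current setting mentioned earlier in the thread; the function name is made up for illustration.

```python
def load_multiplier(lease_seconds, mclt_seconds):
    """How many times the normal renewal load to expect on MCLT-length leases."""
    if lease_seconds <= 0 or mclt_seconds <= 0:
        raise ValueError("lease_seconds and mclt_seconds must be positive")
    return lease_seconds / mclt_seconds


if __name__ == "__main__":
    lease = 1800  # regular lease time in seconds (30 minutes)
    for fraction in (1.0, 0.25, 0.20):
        mclt = lease * fraction
        factor = load_multiplier(lease, mclt)
        print(f"MCLT at {fraction:.0%} of lease ({mclt:.0f}s): about {factor:.0f}x normal renewal load")
```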

Other than answers to the above direct questions - I'm happy for any wider ranging discussion on the thread.

-Greg



Re: failover, partner-down state, MCLT and rewind binding

Gregory Sloop
Top posting.

So there wasn't a lot of follow-up. Let me bump this and I'll summarize and clarify. [In order of importance to me.]

1) MCLT time. Is there any benefit, other than performance/load on the server, to pick longer MCLT times vs shorter?

1A) It seems to me that in a tight lease situation, having the MCLT time be some fairly small fraction of the DHCP lease time makes sense. Anecdotally, something like 10-20% seems about right. [Again, assuming your DHCP servers can handle the load.] Is there any comment or observation that might illuminate that thinking from anyone?

2) How does the newish lease rewind feature work? Does rewind work only in communications interrupted mode? [vs partner down mode]

3) Is there a way, other than examining the leases file, to see if a lease was "rewound" - say by reviewing the log files?

Thanks again.

Re: failover, partner-down state, MCLT and rewind binding

glenn.satchell
Yes, a small ratio of MCLT to lease time is right. I use 30 min for MCLT
and 24 hours for lease time.

I have seen discussion about really short lease times (< 1 min) doing
crazy things with some clients, i.e. the client has not settled and the
lease expires, so I wouldn't make your MCLT too short. You need to balance
this with how many clients you have and expect them all to try to renew
during the MCLT period. So really work out how many renews per second one
of your servers can cope with, multiply that by the number of clients and
it will give you the minimum MCLT that your server can handle.

On the other hand MCLT only affects new leases, and partner-down
situations. Normal operations will hand out default-length leases.

Don't know about the last two points.

regards,
-glenn

On Tue, November 17, 2015 4:52 am, Gregory Sloop wrote:

> Top posting.
>
> So there wasn't a lot of follow-up. Let me bump this and I'll summarize
> and clarify. [In order of importance to me.]
>
> 1) MCLT time. Is there any benefit, other than performance/load on the
> server, to pick longer MCLT times vs shorter?
>
> 1A) It seems to me that in a tight lease situation, having the MCLT time
> be some fairly small fraction of the DHCP lease time makes sense.
> Anecdotally, something like 10-20% seems about right. [Again, assuming
> your DHCP servers can handle the load.] Is there any comment or
> observation that might illuminate that thinking from anyone?
>
> 2) How does the newish lease rewind feature work? Does rewind work only in
> communications interrupted mode? [vs partner down mode]
>
> 3) Is there a way, other than examining the leases file, to see if a lease
> was "rewound" - say by reviewing the log files?
>
> Thanks again.

Re: failover, partner-down state, MCLT and rewind binding

Simon Hobson
Glenn Satchell <[hidden email]> wrote:

> So really work out how many renews per second one of your servers can cope with, multiply that by the number of clients and it will give you the minimum MCLT that your server can handle.

I think you meant divide there - easy mistake to make when manually rearranging a formula in your head.
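With the division in place, the rule of thumb can be sketched in Python as below. The figures are hypothetical, and the optional renew_fraction parameter is an added assumption (clients typically renew partway through a lease, not at its end), not something from Glenn's post.

```python
def minimum_mclt_seconds(clients, renews_per_second, renew_fraction=1.0):
    """Smallest MCLT (in seconds) that one server can sustain.

    With renew_fraction=1.0 this is the corrected rule: clients divided by
    the renewals per second the server can cope with. A smaller value models
    clients that renew after that fraction of the lease, which raises the
    minimum accordingly.
    """
    if clients < 0 or renews_per_second <= 0 or not 0 < renew_fraction <= 1:
        raise ValueError("need clients >= 0, renews_per_second > 0, 0 < renew_fraction <= 1")
    return clients / (renews_per_second * renew_fraction)


if __name__ == "__main__":
    clients = 6000  # hypothetical client count
    capacity = 50   # hypothetical renewals/s one server can handle
    print("renew at lease end:   ", minimum_mclt_seconds(clients, capacity), "s")
    print("renew at half of lease:", minimum_mclt_seconds(clients, capacity, 0.5), "s")
```

Whatever this yields, the earlier caution about very short leases (under a minute) upsetting some clients still applies.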
