DHCP pair messed up, second one only running cant get primary up.

DHCP pair messed up, second one only running cant get primary up.

Rob Morin

We have two ISC dhcpd servers running as a failover pair, version 4.3.3-P1 (compiled locally), on Ubuntu 14.04 64-bit.

Each server has 500 GB of RAID 1 storage, 8 GB of RAM, and a quad-core Intel(R) Xeon(R) CPU E31225 @ 3.10GHz.

The dhcpd.leases file sits in /ramdisk, a 4 GB RAM disk, so that the file can be written to very quickly. This gives us virtually 0.0 wa (I/O wait) time when viewing with the top command.

We increased the LEASE_HASH size to 1800017 and enabled debugging with REPORT_HASH_PERFORMANCE 1 in the dhcpd.h file.

We are using 6,657 /24 subnets in our pools file.

We give out millions of IPs each day.
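
For scale, here is a rough back-of-the-envelope sketch comparing the pool size with the hash size. It assumes every one of the 6,657 subnets hands out .5 through .254, as in the pools snippet further down; the special pools probably differ, so treat the figures as illustrative only.

```python
# Rough sizing sketch. The subnet count and hash size come from the post;
# the uniform .5-.254 range per subnet is an assumption.
SUBNETS = 6657
FIRST_HOST, LAST_HOST = 5, 254
LEASE_HASH_SIZE = 1800017  # value quoted in the post

addresses_per_subnet = LAST_HOST - FIRST_HOST + 1
total_addresses = SUBNETS * addresses_per_subnet
entries_per_bucket = total_addresses / LEASE_HASH_SIZE

print(f"addresses per subnet: {addresses_per_subnet}")
print(f"total pool addresses: {total_addresses}")
print(f"average entries per hash bucket: {entries_per_bucket:.2f}")
```

Under that assumption the hash table has slightly more buckets than there are pool addresses.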

Please see below for config files.

 

This service had been running fine for the last 346 days. Last Saturday, for an unknown reason, the dhcp-1 server had issues, so we turned it off, and the dhcp-2 server took over dhcp-1's part just fine.

So, using OMAPI, I told dhcp-2 that its partner was down in order to keep dhcp-2 working on its own. The last log file entry in syslog on dhcp-2 was:

 

In recent days dhcp-2 has been having some difficulty. Comparing a tcpdump capture to the dhcpd logs, we see DISCOVER requests coming in but no offers going back out. This is sporadic, but frequent enough to make users call in.
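
One way to quantify this from the logs alone is to list the clients that sent a DISCOVER and never show up in an OFFER line. The sketch below is only a starting point: the two log message shapes are what dhcpd typically writes to syslog, not lines taken from this post, and the sample MAC addresses are made up.

```python
import re
from collections import Counter

# Assumed syslog message shapes (typical for ISC dhcpd, not from the post):
#   DHCPDISCOVER from aa:bb:cc:dd:ee:ff via 10.32.0.1
#   DHCPOFFER on 10.32.0.17 to aa:bb:cc:dd:ee:ff via 10.32.0.1
DISCOVER_RE = re.compile(r"DHCPDISCOVER from ([0-9a-f:]{17})", re.I)
OFFER_RE = re.compile(r"DHCPOFFER on \S+ to ([0-9a-f:]{17})", re.I)


def unanswered_discovers(log_lines):
    """Return {mac: discover_count} for clients that never got an OFFER."""
    discovers = Counter()
    offered = set()
    for line in log_lines:
        match = DISCOVER_RE.search(line)
        if match:
            discovers[match.group(1).lower()] += 1
            continue
        match = OFFER_RE.search(line)
        if match:
            offered.add(match.group(1).lower())
    return {mac: n for mac, n in discovers.items() if mac not in offered}


# Hypothetical sample lines, for illustration only.
sample = [
    "DHCPDISCOVER from 00:11:22:33:44:55 via 10.32.0.1",
    "DHCPOFFER on 10.32.0.17 to 00:11:22:33:44:55 via 10.32.0.1",
    "DHCPDISCOVER from 66:77:88:99:aa:bb via 10.57.255.1",
    "DHCPDISCOVER from 66:77:88:99:AA:BB via 10.57.255.1",
]
print(unanswered_discovers(sample))
```

Grouping the unanswered clients by relay address (the "via" field) would show whether the problem follows particular subnets.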

The last time we saw this issue was when the LEASE_HASH size was too low. We are not sure if this is the same issue; maybe it is because we are running on only one server?

Also, the dhcpd.leases files grow too big for the /ramdisk, so every 10 minutes we cat /dev/null into the /ramdisk/dhcpd.lease! file to save space.

 

So currently only dhcp-2 is running.

 

Tonight I want to try simply stopping the dhcpd services on both servers, deleting the leases files and "touch"-ing new ones, then rebooting the primary first and the secondary a few minutes later.

What do you think?

Here are my configs, and thanks. :)

 

DHCP-1 dhcpd.conf file
-----------
authoritative;
log-facility local7;
db-time-format local;

option domain-name "dev"; # TODO

# DNS internal
option domain-name-servers xxx.xx.xx.210, xxx.xx.xx.220;

default-lease-time 1200; # 20 minutes to match the default Tim Hortons' session duration
max-lease-time 3600; # 1h

# Include EITHER the primary configuration
include "/usr/local/etc/dhcp/dhcpd_primary.conf";
# OR the secondary configuration
#include "/etc/dhcp/dhcpd_secondary.conf";

# No service for the local networks
subnet xxx.xx.0.0 netmask 255.255.255.0 { }
subnet xxx.xx.128.0 netmask 255.255.255.0 { }
subnet xxx.xx.129.0 netmask 255.255.255.0 { }

# All IP ranges for TDL stores
# This file should be automatically generated using the command:
#       ./make_ranges.pl < ranges > dhcpd_pools.conf
include "/usr/local/etc/dhcp/dhcpd_pools.conf";

# Non-standard IP ranges (i.e. big stores)
include "/usr/local/etc/dhcp/dhcpd_special_pools.conf";

pid-file-name "/run/dhcpd.pid";

ddns-update-style none;

omapi-port 7911;
omapi-key omapi_key;

key omapi_key {
     algorithm hmac-md5;
     secret xxxxxxxxxxxxxxxxy==;
}

 

DHCP-1 dhcpd_primary.conf
## PRIMARY
failover peer "dhcp-failover" {
  primary; # declare this to be the primary server
  address xxx.xx.xx.9;
  port 647;
  peer address xxx.xx.xx.11;
  peer port 647;
  max-response-delay 30;
  max-unacked-updates 10;
  load balance max seconds 3;
  mclt 1800;
  split 128;
}

 

 

DHCP-2 dhcp-2.conf
----
authoritative;
log-facility local7;
db-time-format local;

option domain-name "tdl"; # TODO

# DV DNS internal
option domain-name-servers XXX.XX.XX.210, XXX.xx.xx.220;

default-lease-time 1200; # 20 minutes to match the default Tim Hortons' session duration
max-lease-time 3600; # 1h

### The below commented as we are to be independent server - Rob Jan 28th 2016
# Include EITHER the primary configuration
#include "/etc/dhcp/dhcpd_primary.conf";
# OR the secondary configuration
include "/usr/local/etc/dhcpd_secondary.conf";

# No service for the local networks
subnet xxx.xx.0.0 netmask 255.255.255.0 { }
subnet xxx.xx.128.0 netmask 255.255.255.0 { }
subnet xxx.xx.129.0 netmask 255.255.255.0 { }

# All IP ranges for TDL stores
# This file should be automatically generated using the command:
#       ./make_ranges.pl < ranges > dhcpd_pools.conf
include "/usr/local/etc/dhcpd_pools.conf";

# Non-standard IP ranges (i.e. big stores)
include "/etc/dhcp/dhcpd_special_pools.conf";

pid-file-name "/run/dhcp-server/dhcpd.pid";

ddns-update-style none;

omapi-port 7911;
omapi-key omapi_key;

key omapi_key {
     algorithm hmac-md5;
     secret xxxxxxxxxxxxxxx==;
}

 

DHCP-2 dhcpd_secondary.conf
---
## SECONDARY
failover peer "dhcp-failover" {
  secondary;
  address XXX.xx.128.11;
  port 647;
  peer address xxx.xx.128.9;
  peer port 647;
  max-response-delay 30;
  max-unacked-updates 10;
  load balance max seconds 3;
}

 

DHCP pools file snippet (over 6,000 subnets)
--
subnet 10.32.0.0 netmask 255.255.255.0 {
  option routers 10.32.0.1;
  pool {
        failover peer "dhcp-failover";
        range 10.32.0.5 10.32.0.254;
  }
}

... too long to list :)

subnet 10.57.255.0 netmask 255.255.255.0 {
  option routers 10.57.255.1;
  pool {
        failover peer "dhcp-failover";
        range 10.57.255.5 10.57.255.254;
  }
}
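
The make_ranges.pl script mentioned in the config comments is not shown, so the following is only a guess at the kind of generation it does: a small Python sketch that emits one stanza per /24 in the same shape as the snippet above. The function name and its defaults (router on .1, range .5 to .254) are assumptions read off the two stanzas shown.

```python
import ipaddress


def pool_stanza(cidr, peer="dhcp-failover", first=5, last=254):
    """Render one subnet/pool stanza for a network such as 10.32.0.0/24."""
    net = ipaddress.ip_network(cidr)
    base = int(net.network_address)

    def host(offset):
        # Address at the given offset from the network address.
        return str(ipaddress.ip_address(base + offset))

    return (
        f"subnet {net.network_address} netmask {net.netmask} {{\n"
        f"  option routers {host(1)};\n"
        f"  pool {{\n"
        f'        failover peer "{peer}";\n'
        f"        range {host(first)} {host(last)};\n"
        f"  }}\n"
        f"}}\n"
    )


print(pool_stanza("10.32.0.0/24"))
```

Generating the file this way keeps all of the stanzas uniform, which matters when the same pools file has to be identical on both failover peers.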

 

 

 

 

Rob Morin

Gestionnaire des systèmes | Senior Systems Administrator

Tel: 514 385-4448 #174                        

DATAVALET.COM

5275, chemin Queen-Mary, Montréal (Québec) H3W 1Y3 Canada

CE COURRIEL AINSI QUE CES DOCUMENTS JOINTS peuvent contenir des renseignements confidentiels et privilégiés. Si vous n’êtes pas le destinataire désigné, veuillez nous en informer immédiatement et effacer toute copie. Merci.

THIS EMAIL AND THE DOCUMENTS ATTACHED may contain privileged or confidential information. If the reader of this message is not the intended recipient, please notify the sender immediately and delete the original message. Thank you.

 


_______________________________________________
dhcp-users mailing list
[hidden email]
https://lists.isc.org/mailman/listinfo/dhcp-users

Re: DHCP pair messed up, second one only running cant get primary up.

Simon Hobson
Rob Morin <[hidden email]> wrote:

> Also the dhcpd.leases files grow too big for the /ramdisk, so we are each 10 mins catting /dev/null into /ramdisk/dhcpd.lease! file to save space.

I can't help with the other problems, but pray you don't have to stop the DHCP server at any time before it has re-written the compacted leases file! Losing the leases file is "bad" in a big way.

I would suggest, though, that if you lengthen the lease time (by a considerable amount) it will dramatically reduce the rate of growth of the leases file. With a lease length of 20 minutes, you'll have a renewal every 10 minutes (roughly), so that's 6 lease updates to the leases file per hour for each active client!

For example, if you were to increase the lease time to (say) 4 hours, then your leases file would contain one record per lease (in practical terms, every address in your pools) plus one update per hour for roughly 1/2 the active clients.

So your lease file size will change from the total of IP ranges + 6x the number of active clients, to the total of IP ranges plus 1/2 the active clients.
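
The arithmetic above can be written out as a small sketch. It assumes clients renew at half the lease time, which is the rule of thumb used in this message, and it counts records accumulated over one hour; the pool and client counts in the example are made up.

```python
def renewals_per_hour(lease_seconds, renew_fraction=0.5):
    """Renewals one active client generates per hour.

    Assumes the client renews once renew_fraction of the lease has
    elapsed (half the lease time by default).
    """
    return 3600 / (lease_seconds * renew_fraction)


def records_after_one_hour(pool_addresses, active_clients, lease_seconds):
    """One record per pool address plus one per renewal in that hour."""
    return pool_addresses + active_clients * renewals_per_hour(lease_seconds)


# Hypothetical site: 1000 pool addresses, 400 active clients.
for lease in (1200, 4 * 3600):
    total = records_after_one_hour(1000, 400, lease)
    print(f"lease {lease:>5}s -> about {total:.0f} records")
```

With these made-up numbers the 20-minute lease produces roughly three times as many records per hour as the 4-hour lease.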


Is there a reason for having such short leases? Longer leases bring much more stability and much more leeway in dealing with DHCP service issues!

Also, for consideration, you can have more than 2 servers in failover - but only 2 per pool. So it's possible to have (say) 3 servers sharing the load as A+B, B+C, and C+A. More complexity, but more scope for server failure without losing DHCP service - and more load sharing. Of course, you can also just split pools across an even number of servers as A+B, C+D, etc.
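
To make the A+B, B+C, C+A layout concrete, here is a small sketch that spreads pools round-robin across the three pairs and counts how many pools each server takes part in. The server names and pool list are hypothetical; in a real setup each pair would be its own failover peer declaration, and each pool would reference exactly one of them.

```python
from collections import Counter
from itertools import cycle


def assign_pools(pools, pairs):
    """Round-robin each pool onto one failover pair (two servers per pool)."""
    return {pool: pair for pool, pair in zip(pools, cycle(pairs))}


pairs = [("A", "B"), ("B", "C"), ("C", "A")]
pools = [f"10.32.{n}.0/24" for n in range(6)]  # hypothetical pools

assignment = assign_pools(pools, pairs)
load = Counter(server for pair in assignment.values() for server in pair)
print(load)
```

Each server ends up in two of the three pairs, so losing any one server still leaves every pool with one live peer.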


Re: DHCP pair messed up, second one only running cant get primary up.

Rob Morin
In reply to this post by Rob Morin
Sorry, I had a typo in my email: we cat /dev/null into the dhcpd.leases~ file, not the active file.



RE: DHCP pair messed up, second one only running cant get primary up.

Rob Morin
In reply to this post by Simon Hobson
Our lease time is governed by our client, which is huge. That cannot be changed. :)


Re: DHCP pair messed up, second one only running cant get primary up.

Simon Hobson
In reply to this post by Rob Morin

On 14 Jan 2017, at 02:39, Rob Morin <[hidden email]> wrote:

> Sorry I had a typo in my email we cat /dev/null into dhcpd.leases~ file not the active file

You could just delete it!
The server will recreate it when it does its cleanup. The process it uses is:
 Write out all lease info to new file
 Rename leases to leases~
 Rename new file to leases

Once this is done, there is no need for the leases~ file.
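
The three steps above can be sketched as follows. This is an illustration of the write-then-rename pattern only, not dhcpd's actual code; the function name and the sample lease records are made up.

```python
import os
import tempfile


def rewrite_leases(leases_path, records):
    """Mimic the three-step rewrite described above (not dhcpd's real code)."""
    directory = os.path.dirname(leases_path) or "."
    # 1. Write all current lease info to a new file in the same directory.
    fd, new_path = tempfile.mkstemp(dir=directory, prefix="dhcpd.leases.")
    with os.fdopen(fd, "w") as new_file:
        new_file.writelines(records)
        new_file.flush()
        os.fsync(new_file.fileno())
    # 2. Rename leases to leases~ (the old file becomes the backup).
    if os.path.exists(leases_path):
        os.replace(leases_path, leases_path + "~")
    # 3. Rename the new file to leases.
    os.replace(new_path, leases_path)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dhcpd.leases")
    rewrite_leases(path, ["lease 10.32.0.5 { }\n"])
    rewrite_leases(path, ["lease 10.32.0.6 { }\n"])
    print(sorted(os.listdir(tmp)))
```

Because the backup is created by a rename, leases~ holds the complete previous file until the next rewrite replaces it.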


Re: DHCP pair messed up, second one only running cant get primary up.

Timothe Litt
In reply to this post by Rob Morin

> Date: Sat, 14 Jan 2017 12:25:48 +0000
> From: Simon Hobson <[hidden email]>
>
> On 14 Jan 2017, at 02:39, Rob Morin <[hidden email]> wrote:
>
>> Sorry I had a typo in my email we cat /dev/null into dhcpd.leases~ file not the active file
> You could just delete it!
> The server will recreate it when it does its cleanup. The process it uses is:
>  Write out all lease info to new file
>  Rename leases to leases~
>  Rename new file to leases
>
> Once this is done, there is no need for the leases~ file.
>
>
Strictly speaking, this is true.  However, in the event of disk
corruption or an inopportune crash, leases~ can be useful for recovery.
Recognizing that space on a ramdisk is valuable, you might wish to mv
leases~ to a magnetic disk rather than deleting it.  This has the
additional benefit of putting the backup on a different spindle,
avoiding a single point of failure.

Disk failures are a lot less frequent than they once were, but they do
still happen...and when they do, the more state is available (even if
somewhat stale), the less painful the recovery.



Timothe Litt
ACM Distinguished Engineer
--------------------------
This communication may not represent the ACM or my employer's views,
if any, on the matters discussed.



Re: DHCP pair messed up, second one only running cant get primary up.

Simon Hobson
Timothe Litt <[hidden email]> wrote:

>> Once this is done, there is no need for the leases~ file.

> Strictly speaking, this is true.  However, in the event of disk
> corruption or an inopportune crash, leases~ can be useful for recovery.
> Recognizing that space on a ramdisk is valuable, you might wish to mv
> leases~ to a magnetic disk rather than deleting it.  This has the
> additional benefit of putting the backup on a different spindle,
> avoiding a single point of failure.

Good point, especially as they are using a RAM disk, which is not exactly non-volatile.


Re: DHCP pair messed up, second one only running cant get primary up.

Drew Derbyshire
As an aside ...

The thought of putting recovery state/log files (i.e. the DHCP leases
file and its backup) on a RAM disk to make top look pretty leaves me
dismayed. That leaves zero local protection against a system crash (or
misguided deliberate reboot).

And yes, the server has been running 346 days -- Doesn't matter.
Services can be five nines reliable, but hardware won't be.

An SSD, or if you prefer multiple SSD units configured as a RAID, will
give you the same order of performance without the premortem fodder.

The machine should be audited for other critical files which are written
to volatile storage, and those files moved to the SSD or other storage as well.

-ahd-

Re: DHCP pair messed up, second one only running cant get primary up.

Bob Harold

On Mon, Jan 16, 2017 at 12:14 PM, Drew Derbyshire <[hidden email]> wrote:
> As an aside ...
>
> The thought of putting recovery state/log files (i.e. the DHCP leases file and its backup) on a RAM disk to make top look pretty leaves me dismayed. That leaves zero local protection against a system crash (or misguided deliberate reboot).
>
> And yes, the server has been running 346 days -- Doesn't matter. Services can be five nines reliable, but hardware won't be.
>
> An SSD, or if you prefer multiple SSD units configured as a raid, will give you the same order of performance without the premortem fodder.
>
> The machine should be audited for other critical files which are written to volatile storage, and moved to the SSD or other storage as well.
>
> -ahd-


I agree that putting state in RAM is a concern, but with a failover pair there should be a duplicate copy on the other server of all but the very latest changes. There is a risk when one server is rebooting, so having a copy of the backup lease file on real disk would help.
I think SSD (particularly write speed) is still orders of magnitude slower than RAM.

That said, I would never want to have both servers rebooted at the same time, or even to restart DHCP on both at the same time. Restart one, allow it to sync data and get to normal-normal, then restart the other.
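
A restart script could check for normal-normal before touching the second server. The sketch below reads the failover state stanzas that dhcpd records in its leases file and looks at the most recent one. The stanza layout and the state names are assumptions based on how such files usually look, and the sample text is invented, so verify against a real leases file before relying on it.

```python
import re

# Assumed stanza shape in dhcpd.leases (timestamp format may differ):
#   failover peer "dhcp-failover" state {
#     my state normal at 6 2017/01/14 09:35:00;
#     partner state normal at 6 2017/01/14 09:35:00;
#   }
STATE_RE = re.compile(
    r'failover peer "(?P<name>[^"]+)" state \{(?P<body>.*?)\}', re.S)
MY_RE = re.compile(r"my state (\S+) at")
PARTNER_RE = re.compile(r"partner state (\S+) at")


def last_failover_state(leases_text, peer_name):
    """Return (my_state, partner_state) from the last matching stanza."""
    result = None
    for stanza in STATE_RE.finditer(leases_text):
        if stanza.group("name") != peer_name:
            continue
        mine = MY_RE.search(stanza.group("body"))
        partner = PARTNER_RE.search(stanza.group("body"))
        if mine and partner:
            result = (mine.group(1), partner.group(1))
    return result


def safe_to_restart_other(leases_text, peer_name="dhcp-failover"):
    """True only when the latest recorded state is normal-normal."""
    return last_failover_state(leases_text, peer_name) == ("normal", "normal")


# Invented sample, for illustration only.
sample = '''
failover peer "dhcp-failover" state {
  my state recover at 6 2017/01/14 09:00:00;
  partner state partner-down at 6 2017/01/14 08:59:00;
}
failover peer "dhcp-failover" state {
  my state normal at 6 2017/01/14 09:35:00;
  partner state normal at 6 2017/01/14 09:35:00;
}
'''
print(safe_to_restart_other(sample))
```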

-- 
Bob Harold




RE: DHCP pair messed up, second one only running cant get primary up.

Rob Morin

Hey all, the re-pairing I did early Saturday morning went fine.

What I did was:

Stopped dhcpd on both servers.

Deleted the dhcpd.leases file, which sits on a 4 GB ramdisk, on both servers.

Rebooted the primary server. I did a reboot rather than just starting dhcpd because I had kernel updates to do, so I figured I'd kill two birds with one stone!

Before dhcpd starts, the init.d script does a "touch /ramdisk/dhcpd.leases".

dhcpd then started up fresh and all worked fine on the primary; it started giving out leases about 4 minutes after it started up.

Then, after the MCLT time, I did the same thing to the secondary. It came up flawlessly, and both started giving out leases in a load-balancing way just fine.

 

BTW, I do copy the /ramdisk/dhcpd.leases file to disk every 5 minutes via cron, to be safe.
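
The cron job itself is not shown. As a sketch of one way such a copy could be done, the function below copies to a temporary name on the destination disk and then renames it into place, so the on-disk backup is never a half-written file. The function name is hypothetical, and a copy taken while dhcpd is mid-write may still end with an incomplete record.

```python
import os
import shutil
import tempfile


def snapshot_leases(src, dest):
    """Copy src to dest without ever exposing a partial copy at dest."""
    dest_dir = os.path.dirname(dest) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, prefix=".leases-snapshot.")
    os.close(fd)
    try:
        shutil.copy2(src, tmp_path)  # copy onto the persistent disk
        os.replace(tmp_path, dest)   # atomic swap on the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


# Demonstration with temporary directories standing in for /ramdisk
# and the real disk.
with tempfile.TemporaryDirectory() as ram, tempfile.TemporaryDirectory() as disk:
    src = os.path.join(ram, "dhcpd.leases")
    with open(src, "w") as f:
        f.write("lease 10.32.0.5 { }\n")
    snapshot_leases(src, os.path.join(disk, "dhcpd.leases.bak"))
    print(os.listdir(disk))
```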

 

Other than this reboot it ran for almost a year with zero issues.

 

Thanks for all the comments.

Have a great day!

 

 


Re: DHCP pair messed up, second one only running cant get primary up.

Drew Derbyshire
In reply to this post by Bob Harold
On 1/16/17 11:48 AM, Bob Harold wrote:

> I agree that putting state in ram is a concern, but with a failover pair, there should be a duplicate copy on the other server of all but the very latest changes.  There is a risk when one server is rebooting, so having a copy of the backup lease file on real disk would help.
> I think SSD (particularly write speed) is still orders of magnitude slower than ram.
>
> That said, I would not ever want to have both servers rebooted at the same time, or even restart DHCP at the same time.  Restart one, allow it to sync data and get to normal-normal, then restart the other.

I would suggest the performance difference between RAM and SSD is meaningless to the client.  But the whole environment (like how much traffic the one DHCP server is handling) has a subtle strangeness about it which makes me nervous.

But not my servers. Not my pager going off.

Good to hear both servers are back online.  We'll leave it at that.


