Computers and the networks they use to communicate are complex things, and from time to time something stops working, or doesn’t work as well as it used to, and you have to figure out why. Troubleshooting is a bit of an art, and in this post I will go over various troubleshooting stories and how to avoid rebuilding things from scratch when you just forgot to check a checkbox on some configuration screen.
The most important part of troubleshooting is knowing there is a problem in the first place, and having as much information as possible to help figure out what is wrong. You need a system that is checking your entire network and can alert you when something goes wrong. We have installed Nagios, an open source network watchdog program. It checks all our backbone radios and routers every 5 minutes, and individual member endpoints every 15 minutes. So if a radio goes down, Nagios will alert us to that fact when it happens.
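For a sense of what this looks like, here is a minimal sketch of a Nagios host definition for one of the backbone radios mentioned later in this post. The host name is real; the address, template name, and thresholds are made-up illustrations, not our actual config.

```cfg
# Hypothetical Nagios host entry -- address and template are invented.
define host {
    use                 generic-host      ; inherit defaults from a template
    host_name           tillman-pb-5-a
    address             10.0.5.1          ; hypothetical management IP
    check_command       check-host-alive  ; ICMP ping check
    check_interval      5                 ; minutes between checks
    max_check_attempts  3                 ; retries before alerting
}
```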
When you are alerted to a problem, the next useful piece of information is what has been happening in the recent past. For this we have another open source system installed called Cacti, a data logging and graphing program. It polls every router, radio, and member router every 5 minutes and records how long it takes to contact (ping time) and how long the system has been up (uptime); if it’s a radio, it also records the signal and noise values as well as the raw bitrate. It also records how much data has been transferred recently (bandwidth). All of these metrics are very useful in troubleshooting, and without these two systems we would really be working in the dark when something went wrong.
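The simplest of these metrics, ping time, comes straight out of the ordinary `ping` command (Cacti gathers most of its data over SNMP, but RTT polling boils down to something like this). Here is a small illustrative parser for the Linux-style `ping` output line; the sample lines are made up for the example.

```python
import re
from typing import Optional

def parse_rtt_ms(ping_line: str) -> Optional[float]:
    """Extract the round-trip time in ms from one line of `ping` output.

    Returns None if the line has no RTT (e.g. a timeout), which a poller
    would log as a missed check.
    """
    m = re.search(r"time[=<]([\d.]+)\s*ms", ping_line)
    return float(m.group(1)) if m else None

# Example lines in the format `ping` prints on Linux:
up = "64 bytes from 10.0.5.1: icmp_seq=1 ttl=64 time=2.31 ms"
down = "Request timeout for icmp_seq 1"

print(parse_rtt_ms(up))    # 2.31
print(parse_rtt_ms(down))  # None
```

A real poller would run this on a 5-minute timer and append the result to a round-robin database, which is exactly what Cacti's RRDtool backend does.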
Start with the basics
When we see a radio go down, the first thing we do is confirm that there is power to the radio. Lack of power is far more common than many people realize.
We have battery backups on all our backbone radios, so sometimes the power loss happened several hours before. One case is the infamous sheep. When we were first building things out and installing one of our relay points in the middle of Tom’s field, we had a temporary extension cord running out into the field to power the relay point. In the middle of the night we got an alert that “tillman-pb-5-a” was down. In the morning we went out to the field and found the extension cord unplugged. Apparently a sheep had scratched up against it during the day and unplugged it; about 8 hours later the battery died, and the radio went offline.
Another time, we got an alert that “shipstad-ar” was down. This is a member wifi router, and I knew it was probably not on a battery backup. The rest of the equipment at the Shipstads’ was still up, but my gut said it was now running on battery backup. When I called the Shipstads’ house, I found out that they had been doing something in the garage and had popped a breaker turning on a heater or something. This had turned off the wifi router, which had also stopped sending POE power to the rest of the relay point equipment at that location. Events like this are why we are working on being able to monitor grid power at our relay points, so we know when there is a power problem and we are running on battery backup.
Sometimes the power problem is not a lack of power, but not enough of it (meaning not enough amps).
Early on in building out our network, I put 3 radios up in one of my trees: two NanoStations and one Rocket. To make life easy, I ran one wire up the tree. NanoStations have 2 network ports and allow you to daisy-chain another device. Down on the ground, I tested running one POE cable into a NanoStation, then from its secondary port to the 2nd NanoStation, then from that radio’s secondary port into the Rocket. Everything turned on and lit up, and I was able to log in to each radio over the one wire.
But when it all went up in the tree and we started running traffic over the radios, everything started rebooting. It turns out you can only daisy-chain once: when the radios started sending traffic, they pulled more amps than were available, and so things rebooted. The lesson was to run one wire for each radio up the tree.
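The arithmetic behind this failure is simple, and worth sketching. The numbers below are illustrative assumptions, not measured values: a typical 24 V passive POE injector might supply 0.5 A (12 W), and each radio might draw around 5 W under load.

```python
# Illustrative POE power-budget check for daisy-chained radios.
# These ratings are assumptions for the example, not measured values.
INJECTOR_VOLTS = 24.0   # passive POE supply voltage
INJECTOR_AMPS = 0.5     # max current the injector can source
RADIO_WATTS = 5.0       # rough per-radio draw under traffic load

def chain_ok(n_radios: int) -> bool:
    """True if n daisy-chained radios fit within the injector's budget."""
    needed_amps = n_radios * RADIO_WATTS / INJECTOR_VOLTS
    return needed_amps <= INJECTOR_AMPS

for n in (1, 2, 3):
    print(n, chain_ok(n))  # with these numbers, 3 radios exceed the budget
```

With these assumed figures, two radios squeak by but a third pushes the chain past the injector's current limit, which matches the "works idle, reboots under load" behavior: idle draw fits, peak draw doesn't.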
But there was another location where we only needed 2 radios in the tree, and this worked well, even under load. Then we needed to add a 3rd radio at that location, which meant installing a TOUGHSwitch instead of running the 2 radios directly from the Tycon Power charge controller. So we moved the 2 daisy-chained radios to one port on the TOUGHSwitch, and put the 3rd radio on another port. Then we started getting random alerts that the 2 radios were going down. It turns out you can’t run a NanoStation chained to another radio off a single TOUGHSwitch port. So we ran a dedicated wire to each radio, each with its own port on the TOUGHSwitch.
After you have made sure there is power to a location, you should next check that there is not a problem with the physical wire that carries the power to the radio.
One time a radio went offline because a small branch came down and must have hit the ethernet cable. The cable had not been completely “clicked” into the network port on the back of the radio, so it popped out.
There were several times when, even though the radio had power lights and was connecting upstream, network packets were not flowing downstream. This was usually a problem with the crimping of the cat5 end. I had one of these recently: even though I have done hundreds of these crimps, every once in a while I’m not paying attention, it’s a little dark, or I’m talking to someone, and one wire goes in the wrong place. Then things either don’t work, or they half work.
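The "one wire in the wrong place" failure comes down to the T568B pin order, which is worth writing out. The pinout below is the real standard; the checking helper around it is just an illustration of how a single swapped pair breaks the crimp.

```python
# The standard T568B wire order for an RJ45 end, pins 1 through 8.
T568B = [
    "white/orange", "orange",
    "white/green",  "blue",
    "white/blue",   "green",
    "white/brown",  "brown",
]

def first_miswire(crimped):
    """Return the 1-based pin number of the first wire out of place,
    or None if the order matches T568B."""
    for pin, (got, want) in enumerate(zip(crimped, T568B), start=1):
        if got != want:
            return pin
    return None

# A crimp with the blue pair swapped on pins 4 and 5 -- easy to do in the dark:
bad = T568B.copy()
bad[3], bad[4] = bad[4], bad[3]

print(first_miswire(T568B))  # None
print(first_miswire(bad))    # 4
```

A swap like this can still pass a quick power-light check while mangling the data pairs, which is exactly the "half works" symptom described above; a cheap cable tester that flashes each pin in order catches it immediately.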
One time we had a faulty POE power brick, where the little sprung wires inside the brick that connect tightly to the cable end were stuck, and only completed the connection if you pushed the cable end into the POE brick really hard. That caused an intermittent power problem.
Then we just had another case where a member had a faulty power strip: if you touched it wrong, it would turn off the power.
After ruling out all the power-related issues, we then get to the programming of the radios… which will be another post, because this one has gotten really long.