Horror Hack-A-Thon - Horror Stories

In the days before Indeni, IT professionals were left to learn about issues only after they had already happened. As a result, almost every person in the industry seems to have at least one horror story about living through a networking nightmare!


Share your most technically challenging Horror Stories here before October 31st to be part of the Horror Hack-A-Thon.


Let the Hacking begin!

A few years ago I showed up, filled with excitement, at the offices of our first-ever user. Our software was just an alpha back then, and I came in with an HP ProLiant server pre-installed with it. My contact at the customer walked with me to the data center and pointed to the rack we were going to set up the server in.

10 minutes later, the server was racked and hooked up to power. Now, we only needed to connect it to the network. We ran a network cable from the server to the top-of-rack switch, but alas - the switch didn't have any available ports! My contact looked at the switch, looked at the cables hanging down from it, and picked one cable that he thought wasn't connected to anything.


We disconnected that cable from the ToR switch and connected the Indeni server.


Within 30 seconds, three guys came running into the data center. Their faces were pale white. Apparently we had taken down the out-of-band monitoring and management network by disconnecting it from the main firewall. Oops!

A couple of years back, I worked at a branch office that relied on a single edge Check Point 1100 and a couple of Brocade distribution switches that had gone EOL 6 years prior. Let's call the IT guy "Bob". Bob was actually amazing at keeping every office alive despite a limited budget and being a one-man team.


Bob tried to update the Brocade switches with new VLANs and subnets for the engineering and sales floor. He forgot to save the changes. It just so happened that one of these switches was sitting under my desk. The next day, I accidentally tripped on the power switch, the switch power-cycled, and all of Bob's configs were lost. Keep in mind this switch also handled VoIP traffic, and it was the end of the quarter for the sales team. The next minute, I heard a sales rep punch a wall and a couple of people refer to Bob in all sorts of ways. The sales director called Bob, who, by the way, was attending his dad's funeral.


Bob never let me touch the office gear thereafter.

I've been the problem more often than the person fixing the problem. This story goes back to 2001, when I had a high school diploma, a laptop, and an Ethernet cable.


So there I was, sitting in the back room of the retail store I was managing for one of the four major cell phone companies (there were probably 5 or 6 at the time...). It was 8pm or so, I was bored out of my mind, and someone had decided to enable a web filter on the company computers, so I just plugged my personal laptop into an open port on the wall.


Around 8:30pm I get a call at the store:


"This is (name) from (company) IT in (corporate HQ city). Is this Dustin Stuflick?"

"Yes."

"Do you have a personal computer plugged into the corporate network?"

Panic ensues. "Yes."

"Do you remember the ILOVEYOU virus that shutdown our email service and other systems for a few weeks?"

"Yes."

"Your computer is broadcasting it to the network."

"Okay. I'll unplug it."

"Thanks."


IT suddenly seemed interesting.

An employee of mine was adding a VLAN to a LAG that carried traffic for ~750 customers. They (like many have done before) forgot the "add" part of the "switchport trunk allowed vlan add" command. Unfortunately, due to poor network design, management was also on this LAG. It was over an hour before we had someone in the datacenter with a console cable who could add management back to the LAG so we could restore full service.
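
For anyone who hasn't hit this one yet, here's a rough sketch of the mistake on a Cisco-style switch (the interface and VLAN numbers are made up for illustration). Leaving out "add" replaces the entire allowed-VLAN list instead of appending to it:

interface Port-channel1
 ! Intended: append VLAN 300 to the trunk
 switchport trunk allowed vlan add 300
 ! What was typed: replaces the whole allowed list with just VLAN 300,
 ! dropping every other VLAN on the LAG, management included
 switchport trunk allowed vlan 300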

Years ago, I was working with a large retail client that had a sustained network outage at a majority of their locations. Even though it was an ISP outage, someone at a particular location took it upon themselves to try and fix it, and in doing so, got the cables switched around and some unplugged - so it also became a LAN issue for them.


Since I had no remote access, I was walking the District Manager through the usual troubleshooting steps over the phone, and he noticed an unplugged cable that he swore had been plugged into the switch at one point, so I had him plug it back in. It didn't bring up the network and he couldn't trace it because it was a tangled maze of cables, so I let it ride because he was sure it should have been plugged in at that spot.


Once we figured out the WAN/LAN cables were in the wrong ports, everything came back up, though network speeds were much slower than usual. But as long as it was up they were happy and wanted to deal with the speed issues later. I converted the spooled (faux-transaction) files and pushed them through so they got paid on those. They were also able to take live transactions at the registers, so they were very happy.


About two days later, management calls me in for an emergency meeting; it turns out all the transactions that had been spooling for several days, and all transactions afterwards, were actually charged anywhere between 5 and 20 times within seconds of each other, even though we only showed a single authorization file on the server. BIG-time problem!


We had to take credit cards down and look at everything again, but the program, system, and network settings all looked fine. The network was still running slow, which was unusual because all the other stores were fine and they shared the same ISP, managed at the corporate level. So I went back to square one. After more phone time, followed by an on-site tech dispatch that came out of the support budget, it turned out that the cable that "looked like it should have been plugged back into the switch" was actually already plugged into the switch but not connected to anything else; so by plugging it in we had looped the cable, creating a network loop and constant flapping.


It just so happened that the payment software on the server would re-contact the processor endpoint at each up/down transition, creating new and unique transaction IDs. Because the transaction IDs were unique, the processor/gateway didn't catch the duplicates and fold them into one transaction, so their system processed the duplicate transactions as unique account withdrawals. Keeping in mind that transaction amounts were anywhere from 50.00 - 500.00+ USD, people's credit cards were being maxed out, and bank accounts were overdrawn or locked because of suspected fraud. One family was on vacation and had no access to funds. A total nightmare.


Lessons learned:


1. I never use my bank debit card at a payment terminal.

2. It pays to take the time to label cables and make sure you have them zip-tied and well organized - and a sign that says DO NOT touch *cough*ahem*dustin*cough*.

3. It would have been great to have something like Indeni to tell me exactly what was going on in the switch.

Network horror story, here goes nothing. It was a dark and moonless night, there was the howl of a wolf in the distance and a cold, biting wind… oh wait, wrong type of horror story. OK, for real, here goes: a large nationwide telco had recently upgraded the software on the route servers for their national IP backbone. After the upgrade, all their route servers were running the same software version. A few days later their NOC received an alarm that a route server was not responding, then an alarm for a second route server not responding, and then a third one. Within an hour all their route servers were down!!! Of course, that crippled their national IP backbone. The culprit: a memory leak in the software the route servers had been upgraded to. All the route servers ran out of free memory and crashed within an hour's time frame. If that is not horror, I don't know what is. I bet they wished that they were using Indeni.

Lessons learned:

1. Use Indeni.

2. For really, really mission-critical services and applications, use two different vendors. In case you hit a bug in one vendor, the other vendor would still be functioning.

A guy at IBM brought his kids to work.


He had an assignment in the server room and the kids joined him. After a while they were accompanied by multiple very stressed technicians, because for some reason a lot of servers had gone down. Turns out his kids had a field day in the data center with all the buttons while the guy was doing his thing. Not very popular. :)

Where do I begin? :)


We had a certificate expire on our F5 boxes, which "took down" a website for several hours until staff figured out which device had the expired certificate. Users received an expired-certificate warning and wouldn't use the site (as they shouldn't, right?). Admins updated the web servers and the F5 boxes hours after the first report, and it took over half a dozen people to resolve the issue while the F5 admins were out of the office.


DNS resolution was failing on a remote Palo Alto Networks firewall. During an Indeni demo this showed up as an alert. Pretending we didn't have a proactive notification system in place, I ignored the alert and used it as an example to justify the product purchase. I eventually resolved 4 tickets I had spent way more time on than I should have. The acquired company's old DNS server had been taken offline without the networking team's knowledge, and it was the primary DNS server for the PA. This caused problems with the NTP servers it pointed at, so NTP was out of sync, FQDN-based firewall policies didn't work right, and dynamic firewall updates were working sporadically at best.

Back in the day, I built a suite of automated server deployment solutions based on PXE, leveraging iLO. The remote server deployments were needed because we were expanding our SaaS business (aka ASP back then) at a rapid pace…and in a global capacity. These scripts also included profile-specific hardening of each server depending on its purpose: application server, database, DNS, etc. I had just built out the Beijing site. It was time for the next set of updates based on new required specs. Oh, did I mention I was 6 ½ months pregnant with twins at the time? Well, I did the remote update and headed to a prenatal check-up. To quickly end the story, to my surprise, I was placed on immediate and strict bed rest in the hospital. I made one last call to work and was told that the Beijing data center was no more and wtf did I do?! Well, crap, I must have selected the option that said "format and deploy" versus "deploy" (something like that). That was a very, very bad day.


Hence…begins my avid mantra of document, document, document (and with screenshots)! That saved what was left of my ass that day.

I was at a company, and pretty new to Check Point firewalls. I was working with a colleague who was much more senior. Let's call him Brian.

Brian found an old license still attached to our production firewalls.

“Remove this license” he said to me.

I was a bit hesitant. It was the middle of the day and nobody had been notified of the change, but in the end I did as he said.

Just as I removed it, the person sitting in front of us went, “What happened to the connection to our production site?! I lost RDP connectivity!”

“Put it back!” Brian whispered to me and so I did.

Across from us we heard “Ah, now it's back, must have been something weird!”

After Brian had pondered this for a moment, he said to me: “It couldn't have been that, remove the license again.”

“Never!” I said, and so he went ahead and did it himself.

Just as he did it, we heard from across the room, “Now I've lost connectivity again!”

Brian quickly put the license back again and all was good. The monitoring in this company was not very good, so the connection loss was never investigated.

We had a new engineer working on the company firewalls; he'd only been there about 2 months. He was adding new lines to an ACL that affected all inbound and outbound traffic for the company. He made a mistake entering some lines and decided to remove them, but entered the command to clear the entire ACL instead of just those lines. Needless to say, he erased the entire ACL, which stopped all traffic in and out of the company, as well as vendor connections. He had to reboot the firewalls in the middle of the day to get the config working again.

I think it was about 3 more months before the IT manager would let the guy near any network gear after that.

I was working for a large network provider that also had a unified communications platform with many phone endpoints. I was doing a new build and readying configuration on an ASR platform. A new employee mistakenly put in a route export from the VRF I was building into the unified communications VRF. Later that evening, when I went to turn services up for a new customer, I started advertising a default route into the customer's environment. Because the other employee had put in a route export from the new VRF to the unified communications VRF, all traffic destined for the Internet was sent to the new firewall solution that was meant solely for a single customer, which took down phones across the US.
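
For anyone who hasn't been bitten by route-target leaks, here's a rough IOS-style sketch of the kind of misconfiguration described above (the VRF names, route targets, and AS number are invented for illustration):

ip vrf CUSTOMER-NEW
 rd 65000:100
 route-target export 65000:100
 route-target import 65000:100
 ! The mistaken line: exporting the new VRF's routes with the voice VRF's
 ! route target, so anything advertised here - including a default route -
 ! also shows up in the unified communications VRF
 route-target export 65000:200
!
ip vrf UC-VOICE
 rd 65000:200
 route-target export 65000:200
 route-target import 65000:200

Once the default route landed in the voice VRF, Internet-bound phone traffic followed it to a firewall built for a single customer, and the phones went down with it.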

This is great.


Deron's story strikes a chord that we hear too often!


Luckily, I don't have a horror story...but I found a haiku that I want to share with you guys!


"reading to unwind

footsteps. someone is coming

switch to ticket queue"

- Fehnor

Many years ago, when I was working on a service desk, we started getting calls that some people could not access the internet with their browser. They could, however, ping and telnet.

It didn't affect everybody, and those affected had in common that they had recently installed Windows updates.

We started troubleshooting but didn't find any issue.
Eventually I found the cause: each .NET installation appended to the browser's User-Agent string, and the team responsible for the firewall had set a limit on how long the User-Agent could be. Uninstalling the latest .NET version was a temporary workaround, while the permanent fix was to get the team to correct the firewall configuration. It is still unclear why you would want to limit this to X number of characters.

This one is for the "Horror Hack-A-Thon"...


It was a dark and cold early morning in October (yes, this really happened in October), before the sun had broken over the horizon, when the ring, ring, ring of my BlackBerry (yes children, there was life before the release of the iPhone) woke me from a deep sleep. It was the director of our NOC. “Hey, I need you to hop on a bridge, we have an outage that we need all hands on deck for,” he stated with a defeated tone in his voice. “Sure, which site is having an outage?” I enquired. “All of them,” he replied. You could cut the silence with a knife.


Upon getting on the bridge, I learned that all our national backbone core routers were thrashing. All the processors of the core routers were running at 100%, and any command would take minutes to respond. All BGP routes were constantly flapping. What could be going on?! There was too much churn to determine anything. I suggested, “Let's start by looking at any network changes that were planned for last night.” After reviewing the changes from that evening, only one stood out: a BGP change in Niceville (name changed to protect the innocent). One of the boxes that had the configuration change would not respond to ssh or console access at all. Finally, a good clue! However, what to do about it? After a couple of minutes the answer dawned on me. “We will have to isolate the whole city. Let's get on the far-end routers and shut down the OC48s to Niceville,” I advised. Finally, after 10 to 15 minutes the core routers settled down and returned to normal operations.


We were also able to access the console of the routers in question via dial-up. Upon reviewing the configuration, we noticed that a BGP policy was missing. With that policy gone, all OSPF routes, including the loopback addresses used for BGP next-hops, were being redistributed into BGP. A very bad thing. It turned out that the tech applying the changes had accidentally deleted the BGP policy, even though it was not part of the change.
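
The vendor and exact policy syntax weren't part of the story, but an IOS-style sketch of the kind of guard that went missing might look like this (the prefix range, names, and AS number are invented for illustration):

ip prefix-list BGP-NEXT-HOP-LOOPBACKS seq 5 permit 10.255.0.0/16 ge 32
!
route-map OSPF-TO-BGP deny 10
 ! keep the /32 loopbacks used as BGP next-hops out of BGP
 match ip address prefix-list BGP-NEXT-HOP-LOOPBACKS
route-map OSPF-TO-BGP permit 20
!
router bgp 64500
 redistribute ospf 1 route-map OSPF-TO-BGP

With the policy deleted, every OSPF route - next-hop loopbacks included - ended up in BGP, and the churn began.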


The definition of a “Network Horror Story” is when it makes the national news!!!


Many years ago, when I was living in my university's hostel, the two most popular network games were StarCraft and Quake 3. Most of the guys liked Quake, but I and a few others preferred StarCraft. We had a small 10Base-T coaxial network, and the problem was that when they were playing Quake, we couldn't play our favorite game ))

After several awful, pain-filled days, it seemed there was a way out: I got an idea.

I don't remember what tool I used, but when it was time for them to play Quake, I started sending spoofed ARP replies with wrong MAC addresses for their hosts... Can you imagine what was happening? I think you can't. They were screaming and swearing )) but couldn't understand what was going on ))

But at the same time, behind the next door, a couple of StarCraft fans were able to play their favorite game ))

I've seen my television set automatically turn the volume up without anyone having a remote control to adjust it. I told my mom that the volume just went up by itself. She replied, "Liar."

Late one night, just as the moon began to set behind the distant hills, a strange sound was heard in the distance. Could this be the return of the spooky specter that had terrified the village many years ago? I decided to jump on an all-too-familiar bridge and investigate. That strange sound turned out to be yet another application owner yelling because the NAT table on a Check Point firewall had filled up and toppled the firewall, causing a massive outage. How could this be? We were monitoring every MIB known to mankind on these devices and there were no signs of any distress before these devices continually toppled. For months this had been happening and we had no viable solution, until… Indeni (more appropriately Yoni – I like to imagine he was running around in that Ghostbusters outfit while on the call with me). It turns out there was indeed no MIB that accurately monitored the actual NAT table, and Indeni's sophisticated means of enriching the data was all we needed to capture this spooky specter. The villagers rejoiced and everyone lived happily ever after…..

The only horror I experienced related to being an IT professional was taking Cisco Networking Certification (CCNA) courses at Santa Barbara City College during high school.


I didn't make it.


