this post was submitted on 05 Jul 2024

149 points (98.1% liked)

Company brought to its knees by a cable (lemmy.sdf.org)

submitted 4 months ago* (last edited 4 months ago) by ExtremeDullard@lemmy.sdf.org to c/networking@sh.itjust.works

Yesterday around noon, the internet at my company started acting up. No matter, slowdowns happen and there's roadwork going on outside: maybe they hit the fiber or something. So we waited.

Then our Samba servers started getting flaky. And the database too. Uh oh... That's different.

We started investigating. Some machines were dropping ICMP packets like crazy, then recovered, then other machines started to become unpingable too. I fired up Wireshark and discovered an absolute flood of IGMP packets on all the trunks, mostly broadcast from Windows machines. It was so bad that two Linux machines on the same switch couldn't ping each other reliably if the switch was connected to the intranet.

So we suspected a DDOS attack initiated from within the intranet by an outside attacker. We cut off the internet, but the storm of packets kept on coming. Physically disconnecting machines from the intranet one by one didn't do a thing either.

Eventually, we started disconnecting each trunk one by one from the main router until we disconnected one and all the activity lights immediately stopped on all the ports. We reconnected it and the crazy traffic resumed.

So we went to that trunk's subrouter and did the same thing. When we found the cable that stopped all the traffic, we followed it and finally found one lonely $10 ethernet switch with... a cable with both ends plugged into the switch. We disconnected the cable and everything instantly returned to normal.

One measly cable brought the entire company to a standstill for hours! Because half of the software we have to use is cloud crap or needs to call its particular mothership to activate its licenses, many people couldn't work anymore for no good technical reason at all while we investigated the networking issue.

Anyway, I thought switches had protections against that sort of loopback connection, and routers prevented circular routes. But there's theory and there's reality. Crazy!
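For anyone wondering how one cable can do this, here is a rough sketch in Python of a single unmanaged switch with two of its ports patched together. It is a toy model under stated assumptions: `simulate_loop`, the port numbering, and the tick-based timing are all made up for illustration, and no real switch is implemented this way.

```python
from collections import Counter

def simulate_loop(num_ports, loop, ticks, ingress=0):
    """Flood ONE broadcast frame through a single unmanaged switch.

    loop is a pair of port numbers joined by one patch cable, or None.
    Returns how many copies were transmitted on each port after `ticks` rounds.
    """
    # Whatever leaves one end of the patch cable re-enters on the other end.
    peer = {} if loop is None else {loop[0]: loop[1], loop[1]: loop[0]}
    arriving = [ingress]   # ports on which a copy arrives during this tick
    sent = Counter()
    for _ in range(ticks):
        next_arriving = []
        for in_port in arriving:
            # Broadcast: flood out of every port except the ingress port.
            for out_port in range(num_ports):
                if out_port == in_port:
                    continue
                sent[out_port] += 1
                if out_port in peer:
                    next_arriving.append(peer[out_port])
        arriving = next_arriving
    return sent
```

With 4 ports and ports 1 and 2 looped, `simulate_loop(4, (1, 2), 1000)` has port 3 transmitting 1,999 copies of that single frame, versus exactly 1 copy with `loop=None`. The frame never dies because Ethernet frames carry no TTL, and every new broadcast adds another pair of copies circulating forever.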

[–] Gobo@lemmy.world 54 points 4 months ago

Yeah. This is what spanning tree and BPDU guard are for. Don't disable them on your edge.
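A minimal sketch of the BPDU guard idea in Python, assuming a simplified port model (the `EdgePort` class and its state names are hypothetical, not any vendor's API): an edge port should only ever see end devices, so the moment a spanning tree BPDU arrives on it, the port shuts itself down.

```python
from dataclasses import dataclass

@dataclass
class EdgePort:
    name: str
    bpdu_guard: bool = True
    state: str = "forwarding"   # becomes "err-disabled" once the guard trips

    def receive(self, frame_type: str) -> bool:
        """Return True if the port keeps processing the frame."""
        if self.state == "err-disabled":
            return False        # a disabled port drops everything
        if frame_type == "bpdu" and self.bpdu_guard:
            # A BPDU on an edge port means a switch (or a loop) is attached.
            self.state = "err-disabled"
            return False
        return True
```

If a cheap downstream switch with a looped cable reflects the upstream switch's own BPDUs back at it, a guarded port like this would take itself out of service instead of passing the storm along. Whether a given unmanaged switch forwards BPDUs at all varies, so treat that as an assumption.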

[–] mlg@lemmy.world 28 points 4 months ago (2 children)

Lol imagine the poor dude in his office who was just bored and thought "what if I plug this cable back into the hub, probably won't do anything"

[–] ExtremeDullard@lemmy.sdf.org 37 points 4 months ago* (last edited 4 months ago) (2 children)

Actually this happened in the lab. I know exactly who did this because he told me: we were discussing what had happened and he said "Oh yeah, Daniel and I needed to connect this Windows machine to the intranet quick because we had something urgent to do, and we connected all the ends of the nest of ethernet cables at random until the machine connected. And then we left everything as it was." But bad luck for us, their machine was connected, but so was that fatal cable on both ends. It just happened that their machine kept working well enough for them to finish what they were doing without noticing the problems right away.

And in case you wonder, there's no penalty in our company for owning up to honest mistakes, so that's why he readily admitted to it. Only people who never do anything never do anything wrong.

[–] Randelung@lemmy.world 16 points 4 months ago

That's a healthy attitude! The blame game is useless in most cases.

[–] GreyEyedGhost@lemmy.ca 4 points 4 months ago

I do hope you taught him the many better ways of doing this. I absolutely agree with making an environment where mistakes are easily owned up to (I made a mistake that ended up costing my employer over $10k in the last year), but if it isn't coupled with turning those into learning experiences (here's why you don't do that, here's why this is a better solution) then you just have a lot of mistakes happening over and over again.

[–] Socsa@sh.itjust.works 6 points 4 months ago

In my experience it's either someone doing it on purpose, or someone accidentally pulling the wrong cable out of a rat's nest.

[–] oleorun@real.lemmy.fan 25 points 4 months ago

This got me too once. I was in the server room replacing old 110 punch panels/blocks with 8P8C connections. I lost track of cable connections, a mistake I have learned from, and I looped a patch cable into the same switch. Within moments the entire network went down.

Forty-five minutes later and we figured out the loop.

Another lesson learned: HP Procurve switches did not have Spanning Tree enabled by default.

Anyway, mistakes happen, especially in IT. It's all part of the learning experience. My boss was the coolest, chillest guy in the world so I learned and moved on.

[–] ramble81@lemm.ee 16 points 4 months ago (1 children)

I really hope you meant “switch” when saying “hub”. I haven’t seen a hub used in decades. Also your switch should have some level of STP protection enabled to prevent that. Even if someone had a hub with a routing loop, STP would have disabled the ports.

[–] dan@upvote.au 14 points 4 months ago (1 children)

Basic unmanaged switches often don't have any sort of protection, and on some fancier managed switches it's disabled by default (no idea why)

[–] Jajcus@sh.itjust.works 12 points 4 months ago* (last edited 4 months ago) (1 children)

no idea why

Because it makes initial connection much slower. Dumb switch - you insert a cable and it works. STP-enabled switch: you insert a cable and it takes a while until the port is enabled (unless you do extra configuration, appropriate for your network topology). This is annoying and for inexperienced users it could seem like the switch 'does not work'. It is easier to sell a switch without such a feature enabled by default.
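To put numbers on "takes a while": with classic IEEE 802.1D spanning tree and the default 15-second forward delay, a port spends one forward delay listening and another learning before it forwards anything. A small Python sketch (function names are illustrative; rapid spanning tree and vendor defaults behave differently):

```python
FORWARD_DELAY_S = 15  # classic IEEE 802.1D default forward delay, in seconds

def port_timeline(edge_port=False, forward_delay=FORWARD_DELAY_S):
    """Return [(seconds_after_link_up, state), ...] for a newly connected port."""
    if edge_port:
        # The "extra configuration" case: a port declared as an edge port
        # skips listening and learning and forwards immediately.
        return [(0, "forwarding")]
    return [
        (0, "listening"),
        (forward_delay, "learning"),
        (2 * forward_delay, "forwarding"),
    ]

def seconds_until_forwarding(edge_port=False, forward_delay=FORWARD_DELAY_S):
    """Time before the port passes user traffic."""
    return port_timeline(edge_port, forward_delay)[-1][0]
```

So a default port is dark for 30 seconds after you plug in, which is long enough for a DHCP request to time out and for a user to decide the switch is broken.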

[–] NaibofTabr@infosec.pub 2 points 4 months ago

the tyranny of the default strikes again

[–] AstridWipenaugh@lemmy.world 15 points 4 months ago

I was diagnosing a network bottleneck at a customer site that didn't make any sense. Literally everything had gigabit connections except one block of cubicles, but all the devices were connected to the same subnet router for that part of the building. Started tracing wires like you did and found that someone didn't have a long enough cable when building the office and installed a 10 megabit Linksys switch in the drop ceiling to connect two short cables. Rather than fix the cable, the customer just went to Best Buy and bought a gigabit Linksys switch to replace it... A multi-million dollar operation is being held together by a $10 switch...

[–] ipha@lemm.ee 14 points 4 months ago (2 children)

But there’s theory and there’s reality.

Mood. I can't count the times I've found issues that shouldn't be possible, but are clearly happening.

[–] oleorun@real.lemmy.fan 13 points 4 months ago (1 children)

We used to use Malwarebytes Corporate Edition at work.

One afternoon all of our web servers stopped responding to traffic on port 443. I could RDC into the servers, and I could ping them, but most traffic wasn't being passed properly.

Despite not having made any changes, I did everything I could think of to get them to work. I tried moving them to different switches, different static IPs, Wireshark showed packets flowing, but no web traffic.

I left the office. It was around 8 PM and I had been banging my head on my desk trying to figure out what the hell was going on.

I came back around 10 PM, mind clear and stomach topped off. I worked a few more minutes, then heard the Outlook ding.

Mass email from Malwarebytes CEO. Bad update. Blocked all class B IP addresses by mistake (guess which class we used). Mea culpa. So sorry. New update fixes things.

I immediately uninstalled MWB CE and boom. Services restored.

The next week we got our licenses refunded by our VAR and we never used that product again.

[–] possiblylinux127@lemmy.zip 0 points 4 months ago

Uninstalling antivirus should be step one

[–] NaibofTabr@infosec.pub 3 points 4 months ago

"In theory, theory and practice are the same. But in practice..."

[–] Orbituary@lemmy.world 10 points 4 months ago

Just reading the title of the post I knew what happened. I read through the whole thing because your story was good and I was in suspense to figure out if it was a router or VoIP phone that was the culprit.

Had this happen at work about a decade ago.

[–] dan@upvote.au 9 points 4 months ago (2 children)

By "hub", do you mean switch? I haven't seen a hub in a very long time. I don't think I've ever seen a 1Gbps one.

[–] ExtremeDullard@lemmy.sdf.org 3 points 4 months ago

Yeah I keep calling them hubs incorrectly...

[–] possiblylinux127@lemmy.zip 1 points 4 months ago* (last edited 4 months ago) (1 children)

There is such a thing as a small 1Gbps hub designed to handle just a small network. They scare me, as they are cheap on Amazon and could theoretically bring a network to its knees if a random user finds a port that isn't authenticated.

[–] Nougat@fedia.io 4 points 4 months ago

For the passers-by, in very simple terms:

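A switch maintains a table of the MAC addresses of the devices attached to it and the port each one was seen on (the MAC address table; the ARP [Address Resolution Protocol] table that maps IPs to MAC addresses lives on hosts and routers). When a frame comes into the switch for a specific destination MAC, the switch looks up in that table where the destination can be found, and only sends the frame out on the port the destination device (or next hop towards that device) is connected to. Broadcasts, and frames for destinations it hasn't learned yet, still go out of every other port.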

A hub doesn't do any of that. Every packet that comes into the hub gets sent out of every port on the hub, to every device connected to the hub. It's on the connected devices to discard packets that aren't addressed to them. On anything but a very small and relatively slow network, this would create an unnecessarily large amount of traffic, not to mention the security issue around sending packets to devices they're not addressed to.
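The difference can be sketched in a few lines of Python. This is a simplified model, not real firmware: the `Hub` and `Switch` classes are hypothetical, and the switch here learns source MAC addresses per port and never ages them out.

```python
BROADCAST = "ff:ff:ff:ff:ff:ff"

class Hub:
    def __init__(self, num_ports):
        self.num_ports = num_ports

    def forward(self, in_port, src, dst):
        """A hub repeats every frame out of every port except the ingress."""
        return [p for p in range(self.num_ports) if p != in_port]

class Switch(Hub):
    def __init__(self, num_ports):
        super().__init__(num_ports)
        self.mac_table = {}   # learned: source MAC address -> port

    def forward(self, in_port, src, dst):
        self.mac_table[src] = in_port          # learn where the sender lives
        if dst != BROADCAST and dst in self.mac_table:
            out_port = self.mac_table[dst]
            # Known destination: one port only (or nowhere if it's local).
            return [] if out_port == in_port else [out_port]
        # Broadcast or unknown destination: flood, exactly like a hub.
        return super().forward(in_port, src, dst)
```

Note the last branch: for broadcasts a switch behaves just like a hub, which is why a learning switch on its own does nothing to stop a loop.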

[–] Socsa@sh.itjust.works 7 points 4 months ago

Yup, the good old "loopback FU."

Routers do have some protections which can mitigate this, but the entire problem is broadcast flooding, which can't really be dealt with at layer 2, or even at layer 3 within the same segment. Most places will have no broadcast forwarding between segments, but even if you detect unusual broadcast activity and ban that class of traffic, you break other things. A lot of times it is ARP floods, so it doesn't happen when the network is static and converged until someone plugs a new laptop in, and then everyone assumes it's that laptop.
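To illustrate the "you break other things" trade-off, here is a hedged Python sketch of a broadcast rate limiter in the spirit of storm control. The fixed one-second window and the `storm_control` function are assumptions for illustration; real implementations are vendor-specific and often express the threshold as a percentage of link bandwidth.

```python
def storm_control(frames, limit_per_second):
    """Drop broadcast frames beyond a per-second budget.

    frames: iterable of (timestamp_seconds, is_broadcast) in arrival order.
    Returns the list of frames that were forwarded.
    """
    counts = {}      # one-second window -> broadcasts seen in that window
    forwarded = []
    for timestamp, is_broadcast in frames:
        if is_broadcast:
            window = int(timestamp)
            counts[window] = counts.get(window, 0) + 1
            if counts[window] > limit_per_second:
                continue   # over budget: dropped, legitimate or not
        forwarded.append((timestamp, is_broadcast))
    return forwarded
```

The limiter cannot tell a looped frame from a laptop's genuine ARP or DHCP request, so during a storm the new laptop's broadcasts are just as likely to be the ones dropped.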

[–] dukatos@lemm.ee 6 points 4 months ago

Managed switches are not expensive and have death loop protection.

[–] deadbeef@lemmy.nz 6 points 4 months ago

Most hubs didn't protect you from anything in particular.

Most of them would forward everything to every port, and some really insane ones would strip out the spanning tree frames that could have prevented a loop.

It's been a long time since I did anything that goes as far into a network as the desktop, but 15+ years ago we had a customer ring up with the same sort of complaint. After we followed the breadcrumbs on site we found a little 8 port hub (that we hadn't supplied) plugged into two wall ports that went to two different Cisco edge switches in the server room, two Cisco phones also with their passthrough ports both patched into the same switch, and then two desktop PCs.

Amazing.

[–] pastermil@sh.itjust.works 6 points 4 months ago (2 children)

Does that kind of loop really mess with things? ELI5 please!

Also, what do you mean a lonely switch? Does it have that loop and a port connected to another switch in the network?

[–] stoy@lemmy.zip 13 points 4 months ago

IT tech here, yes, yes it can.

Network infrastructure is incredibly smart while also being dumb in other ways.

To do an ELI5 answer:

Imagine you have a container of pearls that you need to sort, red, green and blue pearls all need to be dropped into a red, green or blue hole.

The container is being refilled, but slow enough that it only gets a new pearl once you have sorted the previous.

The holes are connected to pipes going to separate buckets.

Everything is fine, but then someone adds a new hole that is multicolored and tells you that all pearls should go there.

You tell your friends that you have a faster way to deal with the pearls and to send you their pearls.

The new hole also has a pipe, but that is connected to the container that receives pearls, so every time you drop a pearl into the new hole, it appears in the container again.

So now you have a situation where you not only get your normal amount of pearls, but everyone else's pearls, and you also get every pearl you send back again.

You are smart, quickly realize that something is wrong, and call your teacher for help. Networking gear doesn't have the capability to understand that something is wrong; it just looks at each pearl and not the big picture.

If we go back to the real world, we have developed tools to deal with this situation: protocols like spanning tree let switches talk to each other and figure out whether there is a physical loop before sending traffic through it.

There are other tools as well, but they all need to be configured and to be honest, it is easily forgotten or made a low priority since it happens rarely.

It is something that is often implemented after a big outage.
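For the curious, the core idea behind spanning tree can be sketched in Python: pick a root switch, keep just enough links to reach every switch, and block the rest. This is a deliberate simplification under stated assumptions: real spanning tree elects the root by bridge ID and uses path costs exchanged in BPDUs, while the hypothetical `blocked_links` below treats every link as equal cost, takes the lowest name as root, and assumes the network is connected.

```python
from collections import deque

def blocked_links(links):
    """links: list of (switch_a, switch_b) cables. Returns the cables to block."""
    if not links:
        return []
    neighbors = {}
    for index, (a, b) in enumerate(links):
        neighbors.setdefault(a, []).append((b, index))
        neighbors.setdefault(b, []).append((a, index))
    root = min(neighbors)            # stand-in for "lowest bridge ID wins"
    visited = {root}
    tree = set()                     # indexes of links kept in the tree
    queue = deque([root])
    while queue:                     # breadth-first walk outward from the root
        node = queue.popleft()
        for peer, index in sorted(neighbors[node]):
            if peer not in visited:
                visited.add(peer)
                tree.add(index)
                queue.append(peer)
    # Any cable that is not needed to reach a switch would close a loop.
    return [link for index, link in enumerate(links) if index not in tree]
```

A triangle of three switches gets exactly one link blocked, and a cable with both ends in the same switch is always blocked, since it can never lead anywhere new.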

[–] Socsa@sh.itjust.works 1 points 4 months ago (1 children)

Certain types of broadcast traffic always get re-broadcast out of every port on a switch. So if you directly connect two ports, and you get some broadcast coming into the switch, that broadcast will loop forever across that loopback, and then get propagated repeatedly until it hits a broadcast boundary. It's surprisingly difficult to prevent even with managed switches unless you are willing to hand-manage every port and significantly restrict the kind of network services which can flow through it.

Some devices can detect these loops and break them, but that can have other unintended impacts if your network is designed (some would argue poorly) around using dumb switches to multiply limited Ethernet drops at the edge.

[–] possiblylinux127@lemmy.zip 1 points 4 months ago

You can MAC-lock the port

[–] Crackhappy@lemmy.world 5 points 4 months ago

Good troubleshooting discipline.

[–] possiblylinux127@lemmy.zip 5 points 4 months ago

If you are using a hub then that's expected as they tend to be one of the main sources of floods on a network.

If you have managed switches make sure you turn on loop protection and alerting. Ideally you should immediately know when something like that happens.

Also bonus if you set up VLANs with different subnets. From there practice least privilege and block all forward traffic by default.

[–] Randelung@lemmy.world 4 points 4 months ago (1 children)

Our Unifi network collapsed and I have no clue why. One theory was the automatic WiFi bridges that might have acted as loops.

[–] Brickhead92@lemmy.world 1 points 4 months ago* (last edited 4 months ago)

Yeah, I've had a wireless uplink between two UniFi APs on the same switch, the only non-UniFi switch, come up by itself and cause a loop. Luckily that switch only had a 1G uplink to the next switch, while all the rest were 10G links, so it mostly only affected devices on that switch.

Edit: thought I'd just say that since then, I always disable wireless uplink on all APs, and the global system setting, unless it's actually used, and only on the APs that need it.

[–] stringere@sh.itjust.works 3 points 4 months ago

I managed to accomplish this at my first IT job, but I used broadcast with Symantec Ghost on a 10 port 100k/1mb hub to bring our office down without knowing any better! They bought me a 10/100 switch to push laptop images with after that incident.

[–] mindbleach@sh.itjust.works 3 points 4 months ago

Accidental ring architecture.

It is surprising the switch doesn't occasionally check for zero-ping echo between plugs.
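Some managed switches do offer a loop-protection feature roughly along those lines. A hedged Python sketch of the idea, where the switch sends a uniquely tagged probe out of each port and flags any port where its own probe comes back (`detect_loops` and `make_wiring` are hypothetical names, and `make_wiring` merely stands in for the physical cabling):

```python
def detect_loops(num_ports, send_probe):
    """Return the sorted list of ports that are part of a loop.

    send_probe(out_port, probe_id) models the wire: it returns the port
    on which our own probe re-entered the switch, or None if it never did.
    """
    looped = set()
    for out_port in range(num_ports):
        probe_id = f"probe-{out_port}"       # unique, so we recognize our own frame
        came_back_on = send_probe(out_port, probe_id)
        if came_back_on is not None:
            looped.update((out_port, came_back_on))
    return sorted(looped)

def make_wiring(cables):
    """Build a send_probe function from (port_a, port_b) patch cables."""
    both_ways = {}
    for a, b in cables:
        both_ways[a] = b
        both_ways[b] = a
    return lambda out_port, probe_id: both_ways.get(out_port)
```

For example, `detect_loops(8, make_wiring([(3, 5)]))` returns `[3, 5]`, and the switch could then disable one of those two ports. A cheap unmanaged switch has no logic like this at all, which is the whole problem.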

[–] Nooodel@lemmy.world 3 points 4 months ago

Turns out a large excellence cluster technical university can do the same and bring down an entire campus for 2 days. Everything is in one big intranet, has main lines with high throughput routed to a large network node and one backup line from the local internet provider. It killed the main lines and thousands of staff plus some tens of thousands of students were connected through a household class fiber connection. That was fun :)