this post was submitted on 19 Jul 2024
132 points (100.0% liked)

TechTakes


The machines, now inaccessible, are arguably more secure than before.

[–] m@blat.at 8 points 5 months ago (3 children)

@sailor_sega_saturn And given enough time and enough scale even the most improbably weird things will eventually happen. Update file corrupted by a storage controller that flips a couple of bits at random after every 720 hours of uptime but only if it’s 23.682 seconds after the hour? Weirder shit has happened.

[–] YourNetworkIsHaunted@awful.systems 16 points 5 months ago

I once helped one of my company's customers troubleshoot an issue that had seen the same ridiculous edge case error happen three times over the course of a few years. At one point the actual sustaining developer we worked with was able to narrow it down to a specific bit that was getting flipped somehow, and pitched cosmic radiation as a plausible explanation given how rarely this kind of thing impacted other customers.

It was at this point that we remembered that the customer was either a university with a nuclear physics lab or a hospital with a nuclear medicine program (can't remember now, ironically enough), and that the server rack lived right next to it.

[–] mawhrin@awful.systems 11 points 5 months ago* (last edited 5 months ago)

some twenty-four years ago i managed, amongst others, a company's samba and print server (that was at the time when all the company's servers were beige boxes with less memory and disk than the laptop i'm using to type this – and still they served a few hundred employees).

the machine developed a strange custom of hard-resetting itself, which we initially tracked to specific files being sent for printing; the behaviour was fully reproducible.

as it happened, it was a hardware fault somewhere between the mainboard and the integrated SCSI card; installing a separate SCSI card and reconnecting the disks and backup tape device fixed the problem. (i did not have the budget for a new server, no.)

establishing the actual cause took me fucking weeks.

[–] yacc143@mastodon.social 4 points 5 months ago (1 children)

@m @sailor_sega_saturn
Builds failing, but only at the new office, and only if you tried to build from scratch.

Funny thing: the Windows network crew that operated the network, and suddenly had to carry NFS over UDP on it, never really realized that their switches were only capable of half-duplex operation but announced full-duplex. The Linux boxes took them at their word and fully used that, and the big UDP packets that NFS sends under load got corrupted.

[–] yacc143@mastodon.social 4 points 5 months ago (1 children)

@m @sailor_sega_saturn
It took a f%cking night shift by the CTO (German company, so the CTO had a PhD in C.S. and still remembered hacking C++ code) and the resident external IT consultant. They were working on the C++ code, got frustrated with the builds crashing, and literally debugged the whole shebang to discover that, besides a ton of C++ memory bugs, we also had a network issue.

[–] yacc143@mastodon.social 4 points 5 months ago

@m @sailor_sega_saturn And philosophically: I've now spent a decade in "automatic data entry from 3rd parties", i.e. ETL (a nice phrase for industry-level web scraping and data clean-up).

Literally, from what I've seen, nothing is unthinkable in IT. (Sometimes, as I've also done website development, one wonders what the f%ck the dear colleague was thinking while (s)he developed THAT. Or I want the drugs they were on; that must have been a great trip.)