Around the same time, Cloudflare’s chief technology officer Dane Knecht explained in an apologetic X post that a latent bug was responsible.

“In short, a latent bug in a service underpinning our bot mitigation capability started to crash after a routine configuration change we made. That cascaded into a broad degradation to our network and other services. This was not an attack,” Knecht wrote, referring to a bug that went undetected in testing and had not previously caused a failure.

  • FauxLiving@lemmy.world
    8 hours ago

    If you want a technical breakdown that isn’t “lol AI bad”:

    https://blog.cloudflare.com/18-november-2025-outage/

    Basically, a permission change caused an automated query to return more data than was planned for. The query produced a configuration file with a large number of duplicate entries, which was pushed to production. The size of the file went over the preallocated memory limit for a downstream system, which died due to an unhandled error state resulting from the large configuration file. This caused a thread panic leading to the 5xx errors.
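The producer side of that failure can be sketched in a few lines. This is only an illustration under stated assumptions, not Cloudflare's code: `build_config`, `ConfigTooLargeError`, and the `MAX_ENTRIES` value are hypothetical names chosen here. The idea is to deduplicate the query rows and refuse to publish a file that exceeds the limit the consumer was sized for.

```python
from __future__ import annotations

from typing import Iterable

# Hypothetical limit that the downstream consumer was sized for.
MAX_ENTRIES = 200


class ConfigTooLargeError(ValueError):
    """Raised when a generated config exceeds the agreed entry limit."""


def build_config(rows: Iterable[str], max_entries: int = MAX_ENTRIES) -> list[str]:
    """Deduplicate query rows (keeping first-seen order) and enforce the limit."""
    # dict.fromkeys preserves insertion order, so duplicates collapse
    # onto their first occurrence.
    entries = list(dict.fromkeys(rows))
    if len(entries) > max_entries:
        # Fail at generation time instead of shipping an oversized file.
        raise ConfigTooLargeError(
            f"{len(entries)} entries exceeds limit of {max_entries}; not publishing"
        )
    return entries
```

With this check, a query that suddenly returns duplicated rows yields the same file as before, and a genuinely oversized result stops the rollout before it reaches production.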

    It seems that CrowdStrike isn’t alone this year in the ‘A bad config file nearly kills the Internet’ club.

    • AldinTheMage@ttrpg.network
      3 hours ago

      So the actual outage comes down to pre-allocating memory, but not actually having error handling to gracefully fail if that limit is or will be exceeded… Bad day for whoever shows up on the git blame for that function
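A minimal sketch of what failing gracefully could look like on the consumer side, assuming a fixed-capacity table. `FeatureTable`, `reload`, and the `CAPACITY` value are hypothetical and not taken from the post-mortem. An oversized update is rejected and the last known good data stays active, instead of the error going unhandled.

```python
from dataclasses import dataclass, field

# Hypothetical number of preallocated slots.
CAPACITY = 4


@dataclass
class FeatureTable:
    capacity: int = CAPACITY
    active: list = field(default_factory=list)

    def reload(self, entries) -> bool:
        """Swap in new entries if they fit; otherwise keep the current table.

        Returns True when the update was applied, False when it was rejected.
        """
        if len(entries) > self.capacity:
            # Over the preallocated limit: keep serving the old table.
            return False
        self.active = list(entries)
        return True


table = FeatureTable()
table.reload(["a", "b"])      # fits, becomes the active table
table.reload(list("abcdef"))  # too large, rejected; ["a", "b"] stays active
```

The caller still has to act on the `False` result (log it, alert, retry later), but the process keeps serving traffic with the previous configuration.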

      • hue2hri19@lemmy.sdf.org
        2 hours ago

        This is the wrong take. Git blame only shows who wrote the line. What about the people who reviewed the code?

        • floquant@lemmy.dbzer0.com
          2 hours ago

          Plus the guys who are hired to ensure that systems don’t fail even under inexperienced or malicious employees, management who designs and enforces the whole system, etc… “one guy fucked up and needs to be fired” is just a toxic mentality that doesn’t actually address the chain of conditions that led to the situation