• Stallone@lemmy.world
    1 year ago

    I’m not so sure. A lot of businesses and individuals are training AI models right now, and sites like Reddit or Twitter are huge, very attractive collections of user-generated content. It’s not the most outrageous assumption that they’ll try to get that data for free by scraping instead of paying for API access.

    • sergih123@eslemmy.es
      1 year ago

      I don’t think, however, that it’s that hard to distinguish an AI scraper from an actual user, since an AI scraper would be pulling huge amounts of data, which the average user doesn’t. Correct me if I’m wrong. What do you think?

      • noodle@feddit.uk
        1 year ago

        No, you’re correct. Service accounts can consume data way faster than a human user ever could. A smart business always implements rate limits; otherwise anyone could bankrupt it with a simple curl command. It could even bankrupt itself in testing with a simple loop!
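
        To make "rate limits" concrete, here is a minimal sketch of a per-credential token bucket. It is only an illustration, not how Twitter or anyone else actually does it: the class names, the `api_key` parameter, and the numbers are all made up, and a real service would keep this state in shared storage rather than in one process.

```python
import time


class TokenBucket:
    """Allows bursts of up to `capacity` requests, refilled at `rate` per second."""

    def __init__(self, rate, capacity, now=None):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full so a new client can burst once
        self.updated = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        # Refill for the time that has passed, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class RateLimiter:
    """Keeps one bucket per credential (hypothetical `api_key` string)."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.buckets = {}

    def allow(self, api_key, now=None):
        now = time.monotonic() if now is None else now
        bucket = self.buckets.get(api_key)
        if bucket is None:
            bucket = self.buckets[api_key] = TokenBucket(self.rate, self.capacity, now)
        return bucket.allow(now)
```

        A request handler would call `limiter.allow(api_key)` and answer with HTTP 429 when it returns `False`. The `now` argument exists only so the behaviour can be checked without waiting on a real clock.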

        This can be fixed in many ways, not just by putting limits on credentials but also on source addresses. If a certain address or range of addresses seems to be running multiple service accounts and pulling huge amounts of data, you can deny requests from those IPs.
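
        A rough sketch of the source-address idea: group requests by network, and deny a network once it has both used many accounts and pulled a lot of data. Everything here is an assumption for illustration (the `NetworkGuard` name, the /24 and /64 grouping, the thresholds), and the counters are cumulative, whereas a real system would count over a time window.

```python
import ipaddress
from collections import defaultdict


class NetworkGuard:
    """Tracks bytes served and accounts seen per source network."""

    def __init__(self, max_bytes, max_accounts, v4_prefix=24, v6_prefix=64):
        self.max_bytes = max_bytes
        self.max_accounts = max_accounts
        self.v4_prefix = v4_prefix
        self.v6_prefix = v6_prefix
        self.bytes_served = defaultdict(int)
        self.accounts = defaultdict(set)
        self.denied = set()

    def _network(self, ip):
        # Map a single address onto its enclosing range, e.g. 203.0.113.5 -> 203.0.113.0/24.
        addr = ipaddress.ip_address(ip)
        prefix = self.v4_prefix if addr.version == 4 else self.v6_prefix
        return ipaddress.ip_network(f"{addr}/{prefix}", strict=False)

    def record(self, ip, account, num_bytes):
        net = self._network(ip)
        self.bytes_served[net] += num_bytes
        self.accounts[net].add(account)
        # Deny only when BOTH signals fire: many accounts and heavy data use.
        too_much_data = self.bytes_served[net] > self.max_bytes
        too_many_accounts = len(self.accounts[net]) > self.max_accounts
        if too_much_data and too_many_accounts:
            self.denied.add(net)

    def is_denied(self, ip):
        return self._network(ip) in self.denied
```

        Requiring both conditions is a deliberate choice in this sketch: an office or university behind one address range has many accounts but modest traffic per account, so account count alone would produce false positives.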

        In short, this AI angle smells like BS to save face. Musk effectively fired the SRE team who looked after critical infrastructure. It was their job to ensure service reliability, so it should not be a surprise that Twitter now has issues with service reliability.

        • Billiam@lemmy.world
          1 year ago

          “They could bankrupt themselves in testing with a simple loop!”

          You mean exactly like what Twitter did this past weekend?

    • Veddit@lemmy.world
      1 year ago

      But also, hasn’t that ship already sailed for several AI companies? They’ve already trained their models, so there’s no need to scrape everything again. They can just reuse what they got last time for their core training; it’s only the last couple of months or years that they’re missing.