• ParadoxSeahorse@lemmy.world
    6 days ago

    I really like the idea of signing the model with a dataset hash. Each legally licensable piece of source material could provide a hash, maybe?

    In terms of outputs, it’s really difficult to judge how transformative a model is without transparency of dataset. We’ve obviously seen prompts regurgitate known works verbatim; it could be even more prevalent than it appears, hidden by obscurity rather than by transformation. More than meets the eye.

    • altkey@lemmy.dbzer0.com
      6 days ago

      > Each legally licensable piece of source material could provide a hash, maybe?

      We could generate a hash sum for every piece, but I don’t see how it would help. The only application I can think of is verifying that the database of works hasn’t been modified between stages A and B. But with the hash of a single piece alone, we can’t tell whether it was included in the dataset, prosecute cases of its misuse, etc. For licensing purposes it wouldn’t hurt to obtain it, I guess, but I don’t know how it would be applied to prove something. Actually, I think I do now*.
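To make the distinction above concrete, here is a minimal sketch, assuming SHA-256 and a dataset small enough to hold in memory. The names `piece_hash`, `dataset_digest`, and the sample byte strings are made up for illustration; no real dataset publishes anything in this form as far as this thread establishes. It shows that a single dataset-wide digest detects a change between stages A and B, while checking whether one piece is included needs the list of per-piece hashes, not just the combined digest.

```python
import hashlib


def piece_hash(data: bytes) -> str:
    """SHA-256 of one work, as a hex string."""
    return hashlib.sha256(data).hexdigest()


def dataset_digest(pieces: list[bytes]) -> str:
    """Combine per-piece hashes into one digest (order-independent)."""
    h = hashlib.sha256()
    for ph in sorted(piece_hash(p) for p in pieces):
        h.update(bytes.fromhex(ph))
    return h.hexdigest()


# Hypothetical dataset contents at two stages.
stage_a = [b"work one", b"work two", b"work three"]
stage_b = [b"work one", b"work two", b"work three (edited)"]

digest_a = dataset_digest(stage_a)
digest_b = dataset_digest(stage_b)

# Same works in a different order give the same digest.
digest_a_reordered = dataset_digest(list(reversed(stage_a)))

# Membership can only be checked against the per-piece manifest,
# which the dataset owner would have to publish or expose somehow.
manifest_a = {piece_hash(p) for p in stage_a}
work_two_listed = piece_hash(b"work two") in manifest_a
unknown_listed = piece_hash(b"some other work") in manifest_a
```

The combined digest alone answers "did anything change?"; only the manifest answers "is my work in there?", which is where the look-up idea further down comes in.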

      > In terms of outputs, it’s really difficult to judge how transformative a model is without transparency of dataset.

      True. That’s why I assume everything in the dataset is involved in every creation.

      It’s probably a level of fight only accessible to the likes of Disney with their bottomless pockets, but if they do their lawsuit thing frequently enough (correctly assuming the likeness of Mickey is in every graphical dataset), there’s hope that LLM owners and dataset brokers would become more transparent about the data they obtain and use, thus helping everyone.

      One tool I could see being created (here’s the asterisk*) is a standard look-up webpage where you can search a closed commercial dataset (or many of them at once) by hash or by providing a file**. Hashes suck for this because they change completely whenever the file is even slightly modified. But if it’s a known copy that has circulated the web for a while, the hash can serve as a unique identifier for that one thing.
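A rough sketch of what such a look-up could do behind the scenes, assuming SHA-256 and an in-memory index. `DatasetIndex`, its methods, and the dataset names are all hypothetical; this is not any existing service's API. It also shows the weakness mentioned above: an exact copy is found, a copy with one changed byte is not.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class DatasetIndex:
    """Hypothetical index mapping file hashes to the datasets containing them."""

    def __init__(self) -> None:
        self._by_hash: dict[str, set[str]] = {}

    def add(self, dataset_name: str, data: bytes) -> None:
        self._by_hash.setdefault(sha256_hex(data), set()).add(dataset_name)

    def lookup_hash(self, digest: str) -> set[str]:
        """Search by a hash the user already has."""
        return set(self._by_hash.get(digest.lower(), set()))

    def lookup_file(self, data: bytes) -> set[str]:
        """Search by uploading the file itself; it is hashed server-side."""
        return self.lookup_hash(sha256_hex(data))


index = DatasetIndex()
original = b"a widely circulated image, byte for byte"
index.add("dataset_a", original)
index.add("dataset_b", original)

exact_hits = index.lookup_file(original)

# One changed byte (e.g. a re-save or a crop) yields a different hash.
modified = original + b"!"
modified_hits = index.lookup_file(modified)
```

Here `exact_hits` contains both dataset names while `modified_hits` is empty, which is why this only works for known copies that circulate unchanged.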

      Asterisk two** - I imagine that if something like that ever happens, it’d be a captcha-, ad-, and JS-ridden nightmare. If there were ever a bill covering this whole thing, the look-up site should be included too, with a requirement to provide an API for that resource and limits on how awful it can be.