Codeberg: army of AI crawlers are extremely slowing us; AI crawlers learned how to solve the Anubis challenges.

Pro@programming.dev · 2 days ago

Codeberg: army of AI crawlers are extremely slowing us; AI crawlers learned how to solve the Anubis challenges.

Net_Runner :~$@lemmy.zip · 2 hours ago

I use Anubis on my personal website, not because I think anything I’ve written is important enough that companies would want to scrape it, but as a “fuck you” to those companies regardless

That the bots are learning to get around it is disheartening, Anubis was a pain to setup and get running

nialv7@lemmy.world · 8 hours ago

We had a trust based system for so long. No one is forced to honor robots.txt, but most big players did. Almost restores my faith in humanity a little bit. And then AI companies came and destroyed everything. This is we can’t have nice things.

Shapillon@lemmy.world · 8 hours ago

Big players are the ones behind most AIs though.

katy ✨@piefed.blahaj.zone · 11 hours ago

reminder to donate to codeberg and forgejo :)

prole@lemmy.blahaj.zone · 14 hours ago

Tech bros just actively making the internet worse for everyone.

ShaggySnacks@lemmy.myserv.one · 14 hours ago

Tech bros just actively making ~~the internet~~ society worse for everyone.

FTFY.

mfed1122@discuss.tchncs.de · edit-2 14 hours ago

Okay what about…what about uhhh… Static site builders that render the whole page out as an image map, making it visible for humans but useless for crawlers 🤔🤔🤔

lapping6596@lemmy.world · 14 hours ago

Accessibility gets throw out the window?

mfed1122@discuss.tchncs.de · 14 hours ago

I wasn’t being totally serious, but also, I do think that while accessibility concerns come from a good place, there is some practical limitation that must be accepted when building fringe and counter-cultural things. Like, my hidden rebel base can’t have a wheelchair accessible ramp at the entrance, because then my base isn’t hidden anymore. It sucks that some solutions can’t work for everyone, but if we just throw them out because it won’t work for 5% of people, we end up with nothing. I’d rather have a solution that works for 95% of people than no solution at all. I’m not saying that people who use screen readers are second-class citizens. If crawlers were vision-based then I might suggest matching text to background colors so that only screen readers work to understand the site. Because something that works for 5% of people is also better than no solution at all. We need to tolerate having imperfect first attempts and understand that more sophisticated infrastructure comes later.

But yes my image map idea is pretty much a joke nonetheless

NeilBrü@lemmy.world · 8 hours ago

Computer vision models can read/parse pixel geometry.

Echo Dot@feddit.uk · 14 hours ago

AI is pretty good at OCR now. I think that would just make it worse for humans while making very little difference to the AI.

mfed1122@discuss.tchncs.de · 14 hours ago

The crawlers are likely not AI though, but yes OCR could be done effectively without AI anyways. This idea ultimately boils down to the same hope Anubis had of making the processing costs large enough to not be worth it.

nymnympseudonym@lemmy.world · 11 hours ago

OCR could be done effectively without AI

OCR has been neural nets even before convolutional networks emerged in the 2010s

mfed1122@discuss.tchncs.de · 10 hours ago

Yeah you’re right, I was using AI in the colloquial modern sense. My mistake. It actually drives me nuts when people do that. I should have said “without compute-heavy AI”.

nymnympseudonym@lemmy.world · 8 hours ago

My mistake

hold on I am still somewhat new to Fedi & not fully used to people being polite

interdimensionalmeme@lemmy.ml · 13 hours ago

Do you know how trivial it is to screenshot a website and push it through an OCR ?

This battle is completely unwinnable, just put a full dumb.zip of the public data on the front door and nobody will waste their time with a scrapper.

Is the data public or is it not ? At this point all that you’re doing anyway is entrench the power of openai, google and facebook while starving any possible alternative.

Anubis will never work, no version of anubis will ever be anything more than a temporary speed bump.

mfed1122@discuss.tchncs.de · 12 hours ago

Yeah, I do. I’m just grasping at straws. But you’re right, the only real solution, ironically, is to have non-open sites where you need accounts to view content. I wouldn’t mind seeing some private phpbb forums though.

Baleine@jlai.lu · 14 hours ago

Humans that don’t see:

interdimensionalmeme@lemmy.ml · edit-2 13 hours ago

Just provide a full dump.zip plus incremental daily dumps and they won’t have to scrape ?
Isn’t that an obvious solution ? I mean, it’s public data, it’s out there, do you want it public or not ?
Do you want it only on openai and google but nowhere else ? If so then good luck with the piranhas

dwzap@lemmy.world · 9 hours ago

The Wikimedia Foundation does just that, and still, their infrastructure is under stress because of AI scrapers.

Dumps or no dumps, these AI companies don’t care. They feel like they’re entitled to taking or stealing what they want.

interdimensionalmeme@lemmy.ml · edit-2 8 hours ago

That’s crazy, it makes no sense, it takes as much bandwidth and processing power on the scraper side to process and use the data as it takes to serve it.

They also have an open API that makes scraper entirely unnecessary too.

Here are the relevant quotes from the article you posted

“Scraping has become so prominent that our outgoing bandwidth has increased by 50% in 2024.”

“At least 65% of our most expensive requests (the ones that we can’t serve from our caching servers and which are served from the main databases instead) are performed by bots.”

“Over the past year, we saw a significant increase in the amount of scraper traffic, and also of related site-stability incidents: Site Reliability Engineers have had to enforce on a case-by-case basis rate limiting or banning of crawlers repeatedly to protect our infrastructure.”

And it’s wikipedia ! The entire data set is trained INTO the models already, it’s not like encyclopedic facts change that often to begin with !

The only thing I imagine is that it is part of a larger ecosystem issue, there the rare case where a dump and API access is so rare, and so untrust worthy that the scrapers are just using scrape for everything, rather than taking the time to save bandwidth by relying on dumps.

Maybe it’s consequences from the 2023 API wars, where it was made clear that data repositories would be leveraging their place as pool of knowledge to extract rent from search and AI and places like wikipedia and other wikis and forums are getting hammered as a result of this war.

If the internet wasn’t becoming a warzone, there really wouldn’t be a need for more than one scraper to scrape a site, even if the site was hostile, like facebook, it only need to be scraped once and then the data could be shared over a torrent swarm efficiently.

0x0@lemmy.zip · 8 hours ago

they won’t have to scrape ?

They don’t have to scrape; especially if robots.txt tells them not to.

it’s public data, it’s out there, do you want it public or not ?

Hey, she was wearing a miniskirt, she wanted it, right?

interdimensionalmeme@lemmy.ml · 8 hours ago

No no no, you don’t get to invoke grape imagery to defend copyright.

I know, it hurts when the human shields like wikipedia and the openwrt forums are getting hit, especially when they hand over the goods in dumps. But behind those human shields stand facebook, xitter, amazon, reddit and the rest of big tech garbage and I want tanks to run through them.

So go back to your drawing board and find a solution the tech platform monopolist are made to relinquish our data back to use and the human shields also survive.

My own mother is prisoner in the Zuckerberg data hive and the only way she can get out is brute zucking force into facebook’s poop chute.

0x0@lemmy.zip · 8 hours ago

find a solution the tech platform monopolist are made to relinquish our data

Luigi them.
Can’t use laws against them anyway…

Rose@slrpnk.net · 12 hours ago

The problem isn’t that the data is already public.

The problem is that the AI crawlers want to check on it every 5 minutes, even if you try to tell all crawlers that the file is updated daily, or that the file hasn’t been updated in a month.

AI crawlers don’t care about robots.txt or other helpful hints about what’s worth crawling or not, and hints on when it’s good time to crawl again.

interdimensionalmeme@lemmy.ml · 12 hours ago

Yeah but there’s would be scrappers if the robots file just pointed to a dump file.

Then the scraper could just do a spot check a few dozen random page and check the dump is actually up to date and complete and then they’d know they don’t need to waste any time there and move on.

Leon@pawb.social · 11 hours ago

Given that they already ignore robots.txt I don’t think we can assume any sort of good manners on their part. These AI crawlers are like locusts, scouring and eating everything in their path,

interdimensionalmeme@lemmy.ml · 11 hours ago

Crawlers are expensive and annoying to run, not to mention unreliable and produce low quality data. If there really were a site dump available, I don’t see why it would make sense to crawl the website, except to spot check the dump is actually complete. This used to be standard and it came with open API access for all before the silicon valley royals put the screws on everyone

Leon@pawb.social · 3 hours ago

I wish I was still capable of the same belief in the goodness of others.

Mr. Satan@lemmy.zip · 8 hours ago

Dunno, I feel you’re giving way too much credit to these companies.
They have the resources. Why bother with a more proper solution when a single crawler solution works on all the sites they want?

Is there even standardization for providing site dumps? If not, every site could require a custom software solution to use the dump. And I can guarantee you no one will bother with implementing any dump checking logic.

If you have contrary examples I’d love to see some references or sources.

interdimensionalmeme@lemmy.ml · 8 hours ago

The internet came together to define the robots file standard, it could just as easily come with a standard API for database dumps. But decided on war since the 2023 API wars and now we’re going to see all the small websites die while facebook gets even more powerful.

Mr. Satan@lemmy.zip · 5 minutes ago

Well there you have it. Although I still feel weird that it’s somehow “the internet” that’s supposed to solve a problem that’s fully caused AI companies and their web crawlers.
If a crawler keeps spamming and breaking a site I see it as nothing short of a DOS attack.

Not to mention that robots.txt is completely voluntary and, as far as I know, mostly ignored by these companies. So then what makes you think that any them are acting in good faith?

To me that is the core issue and why your position feels so outlandish. It’s like having a bully at school that constantly takes your lunch and your solution being: “Just bring them a lunch as well, maybe they’ll stop.”

Zink@programming.dev · 8 hours ago

My guess is that sociopathic “leaders” are burning their resources (funding and people) as fast as possible in the hopes that even a 1% advantage might be the thing that makes them the next billionaire rather than just another asshole nobody.

Spoiler for you bros: It will never be enough.

qaz@lemmy.world · 8 hours ago

I think the issue is that the scrapers are fully automatically collecting text, jumping from link to link like a search engine indexer.

thatonecoder@lemmy.ca · 21 hours ago

I know this is the most ridiculous idea, but we need to pack our bags and make a new internet protocol, to separate us from the rest, at least for a while. Either way, most “modern” internet things (looking at you, JavaScript) are not modern at all, and starting over might help more than any of us could imagine.

Pro@programming.dev · edit-2 21 hours ago

Like Gemini?

From official Website:

Gemini is a new internet technology supporting an electronic library of interconnected text documents. That’s not a new idea, but it’s not old fashioned either. It’s timeless, and deserves tools which treat it as a first class concept, not a vestigial corner case. Gemini isn’t about innovation or disruption, it’s about providing some respite for those who feel the internet has been disrupted enough already. We’re not out to change the world or destroy other technologies. We are out to build a lightweight online space where documents are just documents, in the interests of every reader’s privacy, attention and bandwidth.

0x0@lemmy.zip · 8 hours ago

It’s not the most well thought-out, from a technical perspective, but it’s pretty damn cool. Gemini pods are a freakin’ rabbi hole.

cwista@lemmy.world · 15 hours ago

Won’t the bots just adapt and move there too?

vacuumflower@lemmy.sdf.org · 14 hours ago

I’ve personally played with Gemini a few months ago, and now want a new Internet as opposed to a new Web.

Replace IP protocols with something better. With some kind of relative addressing, and delay-tolerant synchronization being preferred to real-time connections between two computers. So that there were no permanent global addresses at all, and no centralized DNS.

With the main “Web” over that being just replicated posts with tags hyperlinked by IDs, with IDs determined by content. Structured, like semantic web, so that a program could easily use such a post as directory of other posts or a source of text or retrieve binary content.

With user identities being a kind of post content, and post authorship being too a kind of post content or maybe tag content, cryptographically signed.

Except that would require to resolve post dependencies and retrieve them too with some depth limit, not just the post one currently opens, because, if it’d be like with bittorrent, half the hyperlinks in found posts would soon become dead, and also user identities would possibly soon become dead, making authorship check impossible.

And posts (suppose even sites of that flatweb) being found by tags, maybe by author tag, maybe by some “channel” tag, maybe by “name” tag, one can imagine plenty of things.

The main thing is to replace “clients connecting to a service” with “persons operating on messages replicated on the network”, with networked computers sharing data like echo or ripples on the water. In what would be the general application layer for such a system.

OK, this is very complex to do and probably stupid.

It’s also not exactly the same level as IP protocols, so this can work over the Internet, just like the Internet worked just fine, for some people, over packet radio and UUCP or FTN email gates and copper landlines. Just for the Internet to be the main layer in terms of which we find services, on the IP protocols, TCP, UDP, ICMP, all that, and various ones and DNS on application layer, - that I consider wrong, it’s too hierarchical. So it’s not a “replacement”.

IndustryStandard@lemmy.world · 13 hours ago

IP is the most robust and best protocol humanity ever invented. No other protocol survived the test of time this well. How would you even go about replacing it with decentralization? Something needs to route the PC to the server

vacuumflower@lemmy.sdf.org · 13 hours ago

Something needs to route the PC to the server

I don’t want client-server model. I want sharing model. Like with Briar.

The only kind of “servers” might be relays, like in NOSTR, or machines running 24/7 like Briar mailbox.

IP. How would I go about replacing it? I don’t know, I think Yggdrasil authors have written something about their routing model, but 1) it’s represented as ipv6, so IP, 2) it’s far over my head, 3) read the previous, I don’t really want to replace it as much as not to make it the main common layer.

nymnympseudonym@lemmy.world · 11 hours ago

client-server model. I want sharing model. Like with Briar

Guess what

Briar itself, and every pure P2P decentralized network where all nodes are identical… are built on Internet Sockets which inherently require one party (“server”) to start listening on a port, and another party (“client”) to start the conversation.

Briar uses TCP/IP, but it uses Tor routing, which is IMO a smart thing to do

vacuumflower@lemmy.sdf.org · 3 hours ago

I’m talking about Briar used over BT.

nymnympseudonym@lemmy.world · 58 minutes ago

Even AF_BLUETOOTH sockets are… sockets, where one machine ("server’) opens to listen, and the other (“client”) initiates the stream

thatonecoder@lemmy.ca · 20 hours ago

Yep! That was exactly the protocol on my mind. One thing, though, is that the Fediverse would need to be ported to Gemini, or at least for a new protocol to be created for Gemini.

Echo Dot@feddit.uk · 14 hours ago

If it becomes popular enough that it’s used by a lot of people then the bots will move over there too.

They are after data, so they will go where it is.

One of the reasons that all of the bots are suddenly interested in this site is that everyone’s moving away from GitHub, suddenly there’s lots of appealing tasty data for them to gobble up.

This is how you get bots, Lana

thatonecoder@lemmy.ca · 11 hours ago

Yes, I know. But, while trying to find a way to bomb the AI datacenters (/s, hopefully it doesn’t come to this), we can stall their attacks.

b000rg@midwest.social · 14 hours ago

It shouldn’t be too hard, and considering private key authentication, you could even use a single sign-in for multiple platforms/accounts, and use the public key as an identifier to link them across platforms. I know there’s already a couple proof-of-concept Gemini forums/BBSs out there already. Maybe they just need a popularity boost?

Monument@lemmy.sdf.org · 17 hours ago

Increasingly, I’m reminded of this: Paul Bunyan vs. the spam bot (or how Paul Bunyan triggered the singularity to win a bet). It’s a medium-length read from the old internet, but fun.

SufferingSteve@feddit.nu · edit-2 2 days ago

There once was a dream of the semantic web, also known as web2. The semantic web could have enabled easy to ingest information of webpages, removing soo much of the computation required to get the information. Thus preventing much of the AI crawling cpu overhead.

What we got as web2 instead was social media. Destroying facts and making people depressed at a newer before seen rate.

Web3 was about enabling us to securely transfer value between people digitally and without middlemen.

What crypto gave us was fraud, expensive jpgs and scams. The term web is now even so eroded that it has lost much of its meaning. The information age gave way for the misinformation age, where everything is fake.

NeilBrü@lemmy.world · 7 hours ago

The Simulacrum

Marshezezz@lemmy.blahaj.zone · 2 days ago

Capitalism is grand, innit. Wait, not grand, I meant to say cancer

Serinus@lemmy.world · 1 day ago

I feel like half of the blame capitalism gets is valid, but the other half is just society. I don’t care what kind of system you’re under, you’re going to have to deal with other people.

Oh, and if you try the system where you don’t have to deal with people, that just means other people end up handling you.

nymnympseudonym@lemmy.world · edit-2 11 hours ago

I would give this reddit gold

Instant easy complaints help-i’m-oppressed-by-Capitalism today sound an awful lot like the instant easy complaints help-i’m-oppressed-by-Communism I used to hear from rednecks

Ask someone who starved & died under either system how obviously superior it is, you will find millions on either side

0x0@lemmy.zip · 8 hours ago

What if i’m being repressed instead?

null@lemmy.nullspace.lol · edit-2 9 hours ago

Also consider that Socialism is totally legal under Capitalism. Want to start a co-op? Go for it. Want to legislate and implement socialized healthcare? Many Capitalist countries have.

Under Communism, Capitalism must be illegal and stamped out by force. Want to start a business making shoes and hire someone to work for an agreed upon wage? Illegal.

When the goal involves guaranteeing positive rights, I’m not sure how it can be achieved without coercion. Which is how any socialist policies get implemented under capitalism anyways.

nymnympseudonym@lemmy.world · edit-2 8 hours ago

ofc a lot depends on the precise definitions of terms, state corporatism is not synonymous with generally free markets operating under the regulation of a basically diverse inclusive democracy – but both are referred to as “capitalism”

Marshezezz@lemmy.blahaj.zone · 11 hours ago

Could you clarify on what you mean with “dealing with people”? I’m not really sure the point you’re trying to make with that

Serinus@lemmy.world · 9 hours ago

The complaint that got blamed on capitalism was:

The information age gave way for the misinformation age, where everything is fake.

and if there’s one entity/person most responsible for that, it’s Putin or the GOP. Most of it is political, and very little to do with capitalism itself. Except that capitalism surrounds and is intertwined with everything.

Still, if you get rid of capitalism, it doesn’t get rid of politics. I’d argue that the root of the issue is the GOP trying to hoard power (money and otherwise), and power is going to exist with or without capitalism. Is North Korea capitalist? Do they have issues with disinfo?

This Christian Sharia Law movement doesn’t exist for money.

Amju Wolf@pawb.social · 21 hours ago

In this case it is purely fault of the money incentive though. Noone would spend so much effort and computation power on AI if they didn’t think it could make them money.

The funniest part is though that it’s only theoretical anyway, everyone is only losing on it and they’re most likely never gonna make it back.

kazerniel@lemmy.world · 20 hours ago

It matters a lot though what kind of goal the system incentivises. Imagine if it was people’s happiness and freedom instead of quarterly profits.

nymnympseudonym@lemmy.world · 11 hours ago

Imagine if it was people’s happiness and freedom instead of quarterly profits

Whose happiness and freedom?
How is it to be measured?
Capitalists honestly believe that free trade is the best albeit flawed way to do both of the above

It’s definitely valid to disagree about point #3, but then you need to give a better model for #1 and #2

Marshezezz@lemmy.blahaj.zone · 11 hours ago

That’s the part people never really seem to understand. It makes sense though because we’re subjected to the system from birth and it’s all a lot of people know so they can’t grasp the idea of a world outside of that so it can sometimes be difficult to get through to people on that

null@lemmy.nullspace.lol · 19 hours ago

The neat part is that anything bad that happens under capitalism is capitalism’s fault, but anything good that happens is actually socialism happening in spite of capitalism, somehow.

Marshezezz@lemmy.blahaj.zone · 11 hours ago

Could you give some examples?

null@lemmy.nullspace.lol · 11 hours ago

Socialized healthcare

Marshezezz@lemmy.blahaj.zone · 11 hours ago

And in what way does capitalism socialize healthcare in the United States?

null@lemmy.nullspace.lol · 10 hours ago

Socialized healthcare exists there – at least until the current administration finishes ripping it away

tourist@lemmy.world · 1 day ago

Web3 was about enabling us to securely transfer value between people digitally and without middlemen.

It’s ironic that the middlemen showed up anyway and busted all the security of those transfers

You want some bipcoin to buy weed drugs on the slip road? Don’t bother figuring out how to set up that wallet shit, come to our nifty token exchange where you can buy and sell all kinds of bipcoins

oh btw every government on the planet showed up and dug through our insecure records. hope you weren’t actually buying shroom drugs on the slip rod

also we got hacked, you lost all your bipcoins sorry

At least, that’s my recollection of events. I was getting my illegal narcotics the old fashioned way.

prole@lemmy.blahaj.zone · edit-2 14 hours ago

You want some bipcoin to buy weed drugs on the slip road? Don’t bother figuring out how to set up that wallet shit, come to our nifty token exchange where you can buy and sell all kinds of bipcoins

Maybe I’m slow today, but what is this referencing? Most dark web sites use Monero. Is there some centralized token that people used instead?

Edit: Oh, I guess you’re referring to Mt.Gox? I mean yeah, people were pretty stupid for keeping their bitcoin in exchange wallets (and sending it right to the drug dealers directly from there? Real dumb). That’s always a bad idea. I don’t think they transferred it there instead of something else, they just never took custody of the coins after buying them on the exchange.

WhyJiffie@sh.itjust.works · 7 hours ago

I think it refers to custudial wallets and that it’s hard to obtain the useful coins without a KYC exchange (that also most often works as a custudial wallet).

nymnympseudonym@lemmy.world · 11 hours ago

Monero

Satoshi was right and Crypto absolutely has valid use cases. What if your government doesn’t want you accessing meds you need at prices you can afford? What if your government doesn’t like your sexual orientation, but you want a subscription to a dating site? What if your government throws up unjust export controls or tariffs that suddenly make you and your business impossible?

Crypto’s best killer use case is uncensorable, untraceable money

Bitcoin is neither of those things. There is a reason people buy heroin with Monero. It actually does what crypto is supposed to do, which means it could safeguard your Grindr XTRA subscription.

prole@lemmy.blahaj.zone · 4 hours ago

Yeah Monero is just cool, tech-wise

raspberriesareyummy@lemmy.world · 1 day ago

also we got hacked, you lost all your bipcoins sorry

aaaaaaaaand - it’s gone!

Chee_Koala@lemmy.world · 1 day ago

the old fashioned way.

A whole swath of trained toads using a special made tube network?

0x0@lemmy.zip · 8 hours ago

Swallows actually.

tourist@lemmy.world · 1 day ago

getting into a car with a stranger who said he was 15 minutes away two hours ago

youmaynotknow@lemmy.zip · 9 hours ago

You were there too? 😜

prole@lemmy.blahaj.zone · 14 hours ago

Holy shit do I not miss that lifestyle

Quetzalcutlass@lemmy.world · 1 day ago

Nah, they clearly meant in liquid form.

very_well_lost@lemmy.world · 1 day ago

Liquid narcotics, you say?

muusemuuse@sh.itjust.works · 1 day ago

Sound like it went the same way everything else went. The less money is involved the more trustworthy it is.

vacuumflower@lemmy.sdf.org · 1 day ago

Much drama.

I agree about semantic web, but the issue is with all of the Internet. Both its monopoly as the medium of communication, and its architecture.

And if we go semantic for webpages, allowing the clients to construct representation, then we can go further, to separate data from medium, making messages and identities exist in a global space, as they (sort of, need a better solution) do in Usenet.

About the Internet itself being the problem - that’s because it’s hierarchical, despite appearances, and nobody understands it well. Especially since new systems of this kind are not being built often, to say the least, so the majority of people using the Internet doesn’t even think about it as a system. It takes it for given that this is the only paradigm for the global network. And that it’s application-neutral, which may not be true.

20 years ago, when I was a kid, people would think and imagine all kinds of things about the Internet and about the future and about ways all this can break, and these were normal people, not tech types, and one would think with time we wouldn’t become more certain, as it becomes bigger and bigger.

OK, I’m just having an overvalued idea that the Internet is poisoned. Bad sleep, nasty weather, too much sweets eaten. Maybe that movement of packets on the IP protocol can somehow give someone free computation, with enough machines under their control, by using counters in the network stack as registers, or maybe something else.

GreenShimada@lemmy.world · 1 day ago

Mr. Internet, tear down these walls! (for all these walled gardens)

Return the internet to the wild. Let it run feral like dinosaurs on an island.

Let the grannies and idiots stick themselves in the reservations and asylums run by billionaires.

Let’s all make Neocities pages about our hobbies and dirtiest, innermost thoughts. With gifs all over.

Furbag@lemmy.world · 1 day ago

I’m down with that. Web 1.5? Let’s do it. I’ll get my Geocities page up and then we can rev up that hit counter.

mfed1122@discuss.tchncs.de · 14 hours ago

https://homestarrunner.com/toons/backtoawebsite

“Lemme get that hit counter!”

kameecoding@lemmy.world · 1 day ago

Web3 was about enabling us to securely transfer value between people digitally and without middlemen

I don’t think it ever was that, I think folding ideas has the best explanation of what it was meant to be, it was meant to be a way to grab power, away from those who already have it

https://youtu.be/YQ_xWvX1n9g

hansolo@lemmy.today · 1 day ago

Preach!

zbyte64@awful.systems · 1 day ago

Is there nightshade but for text and code? Maybe my source headers should include a bunch of special characters that then give a prompt injection. And sprinkle some nonsensical code comments before the real code comment.

qaz@lemmy.world · 8 hours ago

There are glitch tokens but I think those only effect it when using it.

KubeRoot@discuss.tchncs.de · 23 hours ago

I think the issue is that text uses comparatively very little information, so you can’t just inject invisible changes by changing the least insignificant bits - you’d need to change the actual phrasing/spelling of your text/code, and that’d be noticable.

Honytawk@feddit.nl · 22 hours ago

Maybe like a bunch of white text at 2pt?

Not visible to the user, but fully readable by crawlers.

ramjambamalam@lemmy.ca · 18 hours ago

If a bot can’t read it, nor can a visually impaired user.

Apytele@sh.itjust.works · 17 hours ago

Well if it’s a prompt injection to fuck with llms you don’t want any users having to read it anyway, vision impaired or no.

ramjambamalam@lemmy.ca · 15 hours ago

You missed my point. A prompt injection to fuck with LLMs would be read by a visually impaired user’s screen reader.

londos@lemmy.world · 1 day ago

Can there be a challenge that actually does some maliciously useful compute? Like make their crawlers mine bitcoin or something.

0x0@lemmy.zip · edit-2 8 hours ago

Anubis does that (the computation part). You may’ve seen it already.

nymnympseudonym@lemmy.world · 11 hours ago

The Monero community spent a long time trying to find a “useful PoW” function. The problem is that most computations that are useful are not also easy to verify as correct. javascript optimization was one direction that got pursued pretty far.

But at the end of the day, a crypto that actually intends to withstand attacks from major governments requires a system that is decentralized, trustless, and verifiable, and the only solutions that have been found to date involve algorithms for which a GPU or even custom ASIC confers no significant advantage over a consumer-grade CPU.

raspberriesareyummy@lemmy.world · 1 day ago

Did you just say use the words “useful” and “bitcoin” in the same sentence? o_O

polle@feddit.org · 1 day ago

The saddest part is, we thought crypto was the biggest waste of energy ever and then the LLMs entered the chat.

1rre@discuss.tchncs.de · 1 day ago

At least LLMs produce something, even if it’s slop, all crypto does is… What does crypto even do again?

prole@lemmy.blahaj.zone · edit-2 14 hours ago

Monero allows you to make untraceable transactions. That can be useful.

The encryption schemes involved (or what I understand of them, at least) are pretty rad imo. That’s why it interests me.

Still, it’s proof of work, which is not great.

1rre@discuss.tchncs.de · 14 hours ago

Sure, Monero is good for privacy-focused applications, but it’s a fraction of the market and the larger coins aren’t particularly any less tracable than virtual temporary payment cards, so Monero (and other privacy-centric coins) get overshadowed by the garbage coins.

Same with AI, where non-LLM models are having a huge impact in medicine, chemistry, space exploration and more, but because tech bros are shouting about the objectively less useful ones, it brings down the reputation of the entire industry.

Honytawk@feddit.nl · 22 hours ago

It gives people with already too much money a way to invest by gambling without actually helping society.

baguettefish@discuss.tchncs.de · 17 hours ago

for the biggest crypto investors it isn’t even really gambling. they use celebrities to hype a memecoin and then rug pull and split the profits harvested from the celebrity’s fans.

xiwi@lemmy.dbzer0.com · 1 day ago

Crypto does drug sales and fraud!

Echo Dot@feddit.uk · 23 hours ago

It also makes it’s fans poorer, which at least is funny, especially since they never learn

SeptugenarianSenate@leminal.space · 1 day ago

Blockchain m8 gg

raspberriesareyummy@lemmy.world · 1 day ago

ouch. I never made that comparison, but that is on point.

kameecoding@lemmy.world · 1 day ago

Bro couldn’t even bring himself to mention protein folding because that’s too socialist I guess.

andallthat@lemmy.world · edit-2 1 day ago

LLMs can’t do protein folding. A specifically-trained Machine Learning model called AlphaFold did. Here’s the paper.

Developing, training and fine tuning that model was a research effort led by two guys who got a Nobel for it. Alphafold can’t do conversation or give you hummus recipes, it knows shit about the structure of human language but can identify patterns in the domain where it has been specifically and painstakingly trained.

It wasn’t “hey chatGPT, show me how to fold a protein” is all I’m saying and the “superhuman reasoning capabilities” of current LLMs are still falling ridiculously short of much simpler problems.

mobotsar@sh.itjust.works · 7 hours ago

Crawlers aren’t LLMs; they can do arbitrary computations (whatever the target demands to access resources).

patatahooligan@lemmy.world · 19 hours ago

The crawlers for LLM are not themselves LLMs.

kameecoding@lemmy.world · 22 hours ago

They can’t bitcoin mine either, so technical feasibility wasn’t the goal of my reply

londos@lemmy.world · edit-2 1 day ago

You’re 100% right. I just grasped at the first example I could think of where the crawlers could do free work. Yours is much better. Left is best.

Jolteon@lemmy.zip · 1 day ago

deleted by creator

NeilBrü@lemmy.world · edit-2 21 hours ago

Hey dipshits:

The number of mouth-breathers who think every fucking “AI” is a fucking LLM is too damn high.

AlphaFold is not a language model. It is specifically designed to predict the 3D structure of proteins, using a neural network architecture that reasons over a spatial graph of the protein’s amino acids.

Every artificial intelligence is not a deep neural network algorithm.
Every deep neural network algorithm is not a generative adversarial network.
Every generative adversarial network is not a language model.
Every language model is not a large language model.

Fucking fart-sniffing twats.

$ ./end-rant.sh

Honytawk@feddit.nl · 22 hours ago

We are talking about LLMs you dipshit AI phobe

londos@lemmy.world · 1 day ago

I went back and added “malicious” because I knew it wasn’t useful in reality. I just wanted to express the AI crawlers doing free work. But you’re right, bitcoin sucks.

raspberriesareyummy@lemmy.world · 1 day ago

To be fair: it’s a great tool for scamming people (think ransomware) :/

DeathByBigSad@sh.itjust.works · 1 day ago

Great for money laundering.

Echo Dot@feddit.uk · 23 hours ago

Is it? Don’t you risk losing a rather large percentage of the value.

Just by cars or something as they are much better at keeping their value. Also if somebody asks where did you get all this money from you can just point to the car and say, I sold that.

T156@lemmy.world · 1 day ago

Not without making real users also mine bitcoin/avoiding the site because their performance tanked.

zifk@sh.itjust.works · 2 days ago

Anubis isn’t supposed to be hard to avoid, but expensive to avoid. Not really surprised that a big company might be willing to throw a bunch of cash at it.

sudo@programming.dev · edit-2 1 day ago

This is what I’ve kept saying about POW being a shit bot management tactic. Its a flat tax across all users, real or fake. The fake users are making money to access your site and will just eat the added expense. You can raise the tax to cost more than what your data is worth to them, but that also affects your real users. Nothing about Anubis even attempts to differentiate between bots and real users.

If the bots take the time, they can set up a pipeline to solve Anubis tokens outside of the browser more efficiently than real users.

black_flag@lemmy.dbzer0.com · 1 day ago

Yeah but ai companies are losing money so in the long run Anubis seems like it should eventually return to working.

r00ty@kbin.life · 1 day ago

It’s the usual enshittification tactic. Make AI cheap so companies fire tech workers. Keep it cheap long enough that we all have established careers as McDonald’s branch managers, then whack up the prices once they’re locked in.

sudo@programming.dev · edit-2 1 day ago

Costs of solving PoW for Anubis is absolutely not a factor in any AI companies budget. Just the costs of answering one question is millions of times more expensive than running sha256sum for Anubis.

Just in case you’re being glib and mean the businesses will go under regardless of Anubis: most of these are coming from China. China absolutely will keep running these companies at a loss for the sake of strategic development.

black_flag@sh.itjust.works · 1 day ago

Thanks for the info 👍 would not have thought Anubis would be so irrelevant

OpenPassageways@lemmy.zip · 1 day ago

What the alternative?

sudo@programming.dev · 1 day ago

Not much for open source solutions. A simple captcha however would cost scrapers more to crack than Anubis.

But when it comes to “real” bot management solutions: The least invasive solutions will try to match User-Agent and other headers against the TLS fingerprint and block if they don’t match. More invasive solutions will fingerprint your browser and even your GPU, then either block you or issue you a tracking cookie which is often pinned to your IP and user-agent. Both of those solutions require a large base of data to know what real and fake traffic actually looks like. Only large hosting providers like CloudFlare and Akamai have that data and can provide those sorts of solutions.

randomblock1@lemmy.world · 1 day ago

No, it’s expensive to comply (at a massive scale), but easy to avoid. Just change the user agent. There’s even a dedicated extension for bypassing Anubis.

Even then AI servers have plenty of compute, it realistically doesn’t cost much. Maybe like a thousandth of a cent per solve? They’re spending billions on GPU power, they don’t care.

I’ve been saying this since day 1 of Anubis but nobody wants to hear it.

T156@lemmy.world · 1 day ago

The website would also have to display to users at the end of the day. It’s a similar problem as trying to solve media piracy. Worst comes to it, the crawlers could read the page like a person would.

Spaz@lemmy.world · edit-2 4 hours ago

Is there a migration tool? If not would be awesome to migrate everything including issues and stuff. Bet even more people would move.

BlameTheAntifa@lemmy.world · 24 hours ago

Codeberg has very good migration tools built in. You need to do one repo at a time, but it can move issues, releases, and everything.

dodos@lemmy.world · 1 day ago

There are migration tools, but not a good bulk one that I could find. It worked for my repos except for my unreal engine fork.

PhilipTheBucket@piefed.social · 2 days ago

I feel like at some point it needs to be active response. Phase 1 is a teergrube type of slowness to muck up the crawlers, with warnings in the headers and response body, and then phase 2 is a DDOS in response or maybe just a drone strike and cut out the middleman. Once you’ve actively evading Anubis, fuckin’ game on.

TurboWafflz@lemmy.world · 2 days ago

I think the best thing to do is to not block them when they’re detected but poison them instead. Feed them tons of text generated by tiny old language models, it’s harder to detect and also messes up their training and makes the models less reliable. Of course you would want to do that on a separate server so it doesn’t slow down real users, but you probably don’t need much power since the scrapers probably don’t really care about the speed

xthexder@l.sw0.com · 2 days ago

I love catching bots in tarpits, it’s actually quite fun

31ank@ani.social · edit-2 2 days ago

Some guy also used zip bombs against AI crawlers, don’t know if it still works. Link to the lemmy post

phx@lemmy.ca · 1 day ago

Yeah that was my thought. Don’t reject them, that’s obvious and they’ll work around it. Feed them shit data - but not too obviously shit - and they’ll not only swallow it but eventually build up to levels where it compromises them.

I’ve suggested the same for plain old non-AI data stealing. Make the data useless to them and cost more work to separate good from bad, and they’ll eventually either sod off or die.

A low power AI actually seems like a good way to generate a ton of believable - but bad - data that can be used to fight the bad AI’s. It doesn’t need to be done real-time either as datasets can be generated in advance

SorteKanin@feddit.dk · 22 hours ago

A low power AI actually seems like a good way to generate a ton of believable - but bad - data that can be used to fight the bad AI’s.

Even “high power” AIs would produce bad data. It’s currently well known that feeding AI data to an AI model decreases model quality and if repeated, it just becomes worse and worse. So yea, this is definitely viable.

phx@lemmy.ca · 13 hours ago

Yup. It was more my thought that a low power over could produce sufficient results while requiring less resources. Something that can run on a desktop computer could still produce a database with reams of believable garbage that would take a lot of resources from the attacking AI to sort through, or otherwise corrupt its own harvested cache

sudo@programming.dev · edit-2 2 days ago

The problem is primarily the resource drain on the server and tarpitting tactics usually increase that resource burden by maintaining the open connections.

SorteKanin@feddit.dk · 22 hours ago

The idea is that eventually they would stop scraping you cause the data is bad or huge. But it’s a long term thing, it doesn’t help in the moment.

Monument@lemmy.sdf.org · 17 hours ago

The promise of money — even diminishing returns — is too great. There’s a new scraper spending big on resources every day while websites are under assault.

In the paraphrased words of the finance industry: AI can stay stupid longer than most websites can stay solvent.

The Infinite Nematode@feddit.uk · 2 days ago

Wasn’t this called black ice in Neuromancer? Security systems that actively tried to harm the hacker?

traches@sh.itjust.works · 2 days ago

These crawlers come from random people’s devices via shady apps. Each request comes from a different IP

AmbitiousProcess (they/them)@piefed.social · 2 days ago

Most of these AI crawlers are from major corporations operating out of datacenters with known IP ranges, which is why they do IP range blocks. That’s why in Codeberg’s response, they mention that after they fixed the configuration issue that only blocked those IP ranges on non-Anubis routes, the crawling stopped.

For example, OpenAI publishes a list of IP ranges that their crawlers can come from, and also displays user agents for each bot.

Perplexity also publishes IP ranges, but Cloudflare later found them bypassing no-crawl directives with undeclared crawlers. They did use different IPs, but not from “shady apps.” Instead, they would simply rotate ASNs, and request a new IP.

The reason they do this is because it is still legal for them to do so. Rotating ASNs and IPs within that ASN is not a crime. However, maliciously utilizing apps installed on people’s devices to route network traffic they’re unaware of is. It also carries much higher latency, and could even allow for man-in-the-middle attacks, which they clearly don’t want.

PhilipTheBucket@piefed.social · 2 days ago

Honestly, man, I get what you’re saying, but also at some point all that stuff just becomes someone else’s problem.

This is what people forget about the social contract: It goes both ways, it was an agreement for the benefit of all. The old way was that if you had a problem with someone, you showed up at their house with a bat / with some friends. That wasn’t really the way, and so we arrived at this deal where no one had to do that, but then people always start to fuck over other people involved in the system thinking that that “no one will show up at my place with a bat, whatever I do” arrangement is a law of nature. It’s not.

sudo@programming.dev · 1 day ago

Or your TV or IOT devices. Residential proxies are extremely shady businesses.

amelaxxx@piefed.social · 2 days ago

PhilipTheBucket@piefed.social · 2 days ago

Is that really true? I guess I have no reason to doubt it, I just hadn’t heard it before.

sudo@programming.dev · 1 day ago

Here’s one example of a proxy provider offering to pay developers to inject their proxies into their apps. (“100% ethical proxies” because they signed a ToS). Another is BrightData proxies traffic through users of their free HolaVPN.

IOT and smart TVs are also obvious suspects.

NuXCOM_90Percent@lemmy.zip · 2 days ago

Yes. A nonprofit organization in Germany is going to be launching drone strikes globally. That is totally a better world.

Its also important to understand that a significant chunk of these botnets are just normal people with viruses/compromised machines. And the fastest way to launch a DDOS attack is to… rent the same botnet from the same blackhat org to attack itself. And while that would be funny, I would also rather orgs I donate to not giving that money to blackhat orgs. But that is just me.

bleistift2@sopuli.xyz · edit-2 2 days ago

https://en.wikipedia.org/wiki/Sarcasm, or maybe https://en.wikipedia.org/wiki/Hyperbole

amelaxxx@piefed.social · 2 days ago

Right

oeuf@slrpnk.net · 1 day ago

Crazy. DDoS attacks are illegal here in the UK.

rdri@lemmy.world · 1 day ago

So, sue the attackers?

BlameTheAntifa@lemmy.world · 24 hours ago

The problem is that hundreds of bad actors doing the same thing independently of one another means it does not qualify as a DDoS attack. Maybe it’s time we start legally restricting bots and crawlers, though.