Fedi admins of Lemmy, how do you keep your servers up to date without increasing downtime?

skymtf@lemmy.blahaj.zone · 2 years ago

Fedi admins of Lemmy, how do you keep your servers up to date without increasing downtime?

hitagi@ani.social · 2 years ago

An LTS distro like Debian or Ubuntu doesn’t update too frequently. I’ve never tried livepatching. I install updates on the weekends and reboot only if necessary. Downtime is usually about a minute, and my uptime monitors don’t usually catch it.
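For the "reboot only if necessary" part, here is a minimal Python sketch. It assumes the Debian/Ubuntu convention of a marker file at /var/run/reboot-required that appears when an installed update needs a reboot. The function name `reboot_needed` is made up for illustration.

```python
from pathlib import Path

# Assumption: Debian/Ubuntu convention. This marker file appears when an
# installed update (a new kernel, for example) needs a reboot to take effect.
REBOOT_MARKER = Path("/var/run/reboot-required")


def reboot_needed(marker: Path = REBOOT_MARKER) -> bool:
    """Return True if the system has flagged that a reboot is pending."""
    return marker.exists()
```

A weekend update script could call `reboot_needed()` after installing packages and schedule a reboot only when it returns True.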

intensely_human@lemm.ee · 2 years ago

One way this problem can be solved is by having multiple servers, with one being the production server. The second server is usually the failover backup, but when it’s time to upgrade, it can go down for the work. When it’s ready, you use a load balancer to send all requests to it. That switchover is effectively instantaneous: one millisecond requests are being handled by the old server, the next millisecond they’re going to the new server.
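A toy Python sketch of that switchover, not a real load balancer: the class `Switchover` and the backend names are hypothetical, and a real setup would change the upstream in a proxy rather than a variable in a script. It only shows that the swap is a single atomic step from the point of view of incoming requests.

```python
import threading


class Switchover:
    """Toy load balancer that routes every request to one active backend."""

    def __init__(self, active: str, standby: str) -> None:
        self._lock = threading.Lock()
        self._active = active
        self._standby = standby

    def route(self) -> str:
        """Return the backend that should handle the next request."""
        with self._lock:
            return self._active

    def swap(self) -> None:
        """Atomically make the upgraded standby the active backend."""
        with self._lock:
            self._active, self._standby = self._standby, self._active


lb = Switchover(active="old-server", standby="new-server")
before = lb.route()  # handled by the old server
lb.swap()            # the upgraded server takes over
after = lb.route()   # handled by the new server
```

After the swap, the old server becomes the standby, so it can be upgraded next or kept around for a rollback.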

It’s kind of like teleporting a new car into place as a way of avoiding a pit stop in a race. Terrible analogy but there you go.

4am@lemm.ee · 2 years ago

How do you handle the database schema changes during updates? Have two databases and disable replication during the update? How do you sync changes that occur to the backup while/after the main is upgraded?

rglullis@communick.news · 2 years ago

I think OP is asking about system updates. Zero-downtime application deployment is a different thing.

intensely_human@lemm.ee · 2 years ago

You’re swapping in entire new servers, servers with the system updates applied.

intensely_human@lemm.ee · 2 years ago

I should preface this by saying I’m reporting what I’ve heard from other engineers. I myself have not worked on high-uptime systems.

The only high-uptime system I built was very simple and that simplicity was the strategy. Basically it was built once, never changed, then was up continuously from 2017 to 2022 before it was retired.

I’m not sure how I’d handle a schema change. I’m tempted to say I could have a new db with the migrations applied, then I’d replay the log of all transactions from my production database to the new database, applying a pure function to alter it as necessary to match the new structure.

In this way I’d have a lagging real-time copy of the database ready to swap in.

Then it would be a matter of applying enough resources to get that gap tiny, but any moment of switchover is going to have some unprocessed transactions.

That log of unprocessed transactions, assuming I can’t eliminate it, will need to be played into the new database before at least some subset of its related data will be valid, meaning a time gap between switchover and availability.
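A rough Python sketch of the replay idea above, under heavy assumptions: the log is a plain list of rows, the new database is a dict, and the schema change (splitting a `name` column) is invented for illustration. `migrate_row` and `replay` are hypothetical names, not part of any real database tool.

```python
Row = dict[str, object]


def migrate_row(old: Row) -> Row:
    """Pure function mapping an old-schema row to the new schema.

    Hypothetical migration: split "name" into "first_name" and "last_name".
    """
    first, _, last = str(old["name"]).partition(" ")
    return {"id": old["id"], "first_name": first, "last_name": last}


def replay(log: list[Row], new_db: dict[object, Row], start: int = 0) -> int:
    """Apply log entries from position `start` on; return the next position."""
    for entry in log[start:]:
        row = migrate_row(entry)
        new_db[row["id"]] = row
    return len(log)


log: list[Row] = [{"id": 1, "name": "Ada Lovelace"}]
new_db: dict[object, Row] = {}
position = replay(log, new_db)               # lagging copy catches up

log.append({"id": 2, "name": "Alan Turing"})  # a write lands on production
backlog = len(log) - position                # unprocessed at switchover
position = replay(log, new_db, position)     # drain the backlog
```

The `backlog` value is the gap described above: anything still in it at the moment of switchover has to be replayed before the affected records are valid in the new database.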

I could maybe get the time gap scoped to only a few records, so that most records would work flawlessly for users but these particular records would have their own “locked for maintenance” screens.
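One way to picture that scoping, as a speculative Python sketch: only records touched by log entries that haven’t been replayed yet get locked, and everything else reads normally. The names `locked_ids` and `read`, and the dict-based stand-in for a database, are assumptions for illustration.

```python
def locked_ids(log: list[dict], applied: int) -> set:
    """IDs touched by log entries that have not been replayed yet."""
    return {entry["id"] for entry in log[applied:]}


def read(record_id, new_db: dict, locked: set):
    """Serve a record, or a maintenance notice if it is still pending."""
    if record_id in locked:
        return "locked for maintenance"
    return new_db.get(record_id)


pending_log = [{"id": 1}, {"id": 2}, {"id": 2}]
locked = locked_ids(pending_log, applied=1)   # only the first entry is applied
records = {1: "fresh", 2: "stale"}
first = read(1, records, locked)    # served normally
second = read(2, records, locked)   # still has pending changes
```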

Or I could just silently update the UI for any data that changes as a result of being generated in the post-migration schema, and make it clear to my users that my system isn’t guaranteed up to date except right after a page refresh. I.e., sometimes changes in data can take a few seconds to be reflected at the edges.

This is all me speculating though, I haven’t done it myself.

I think I could probably produce a system that would work well, with the user understanding that “sometimes there’s stale data and you gotta refresh”, to have no down time and a few data mismatches resolvable down to a little UI update delay based on polling or push notifications.

But I don’t know if I could produce a system that would go through a schema update, with no downtime, while also never displaying incorrect data.

Maybe in a high-reliability scenario the best thing would be to put up a banner saying “system migration in progress; data error window is open until this message is gone” and have the page auto refresh.

There could be something much simpler I’m missing though. I’d love to hear from some devops or sysadmins who can speak about it from experience.

Hazzard@lemm.ee · 2 years ago

Not much of an addition, but you’re absolutely right: in most systems that are expected to be highly available, there are standard maintenance times, an agreement in place, and no critical use of the system is permitted to be scheduled in that regular time period. Any deployments are limited to that window, in case a rollback, data sync, etc. is necessary.

All of that is in addition to the type of high availability stuff you’re describing.

Shaolin Shrimp@lemmy.ml · 2 years ago

In my experience, database schema changes require all connections to drop, but they tend to happen a lot less often than other updates.

oleorun@lemmy.world · 2 years ago

My lemmy.fan instance died. Something broke with federation and I’ve never been able to get it running again, even with a new database and a new subdomain. I gave up for now, at least until better error checking and recovery is implemented.

thisisawayoflife@lemmy.world · 2 years ago

Perform automatic updates and reboot when necessary.

If one is serious about hosting this, it’s best to isolate the services. One container or VM for each service, with (probably) physical hosts for the DBs.

A schema change is more involved, but back up, then update. If you have a read-only DB, it should sync the changes when reconnected.

Realistically, federated data will be re-sent if the recipient doesn’t respond, so a few minutes of downtime is not the end of the world. At least that’s how Mastodon works. I’m not sure about Lemmy, but I’m presuming it operates in a similar fashion.
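The re-send behaviour can be sketched generically in Python as retry with exponential backoff. This is not Mastodon’s or Lemmy’s actual delivery code; `deliver_with_retry`, the attempt count, and the delays are all made-up values for illustration.

```python
import time
from typing import Callable


def deliver_with_retry(
    send: Callable[[], bool],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `send` until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        if send():
            return True
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return False


# Simulate a recipient that is down for the first two delivery attempts.
outcomes = iter([False, False, True])
delays: list[float] = []
delivered = deliver_with_retry(lambda: next(outcomes), sleep=delays.append)
```

With retries like this on the sending side, a recipient that is down for a short reboot just receives the activity a little later.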

skymtf@lemmy.blahaj.zone · 2 years ago

Couldn’t auto updates break things if they aren’t tested?

thisisawayoflife@lemmy.world · 2 years ago

They should be tested upstream of you, assuming you aren’t using customized (e.g. roll-your-own) versions of any of the ancillary software (php, pgsql, redis, etc). Generally configs are either merged or not adopted, and you can restrict version upgrades to non-major releases if there’s a chance of breakage between them (e.g. moving from pgsql 9 to 10, etc).
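The "non-major releases only" rule can be sketched in Python like this. It assumes plain dotted numeric version strings and treats the first component as the major version, which is a simplification (PostgreSQL before 10, for instance, counted the first two components as the major version). `is_safe_upgrade` is a hypothetical helper, not a package-manager API.

```python
def parse(version: str) -> tuple[int, ...]:
    """Turn a dotted numeric version like "9.6.24" into (9, 6, 24)."""
    return tuple(int(part) for part in version.split("."))


def is_safe_upgrade(current: str, candidate: str) -> bool:
    """Allow an upgrade only if it is newer and keeps the same major version.

    Assumption: the first component is the major version.
    """
    cur, cand = parse(current), parse(candidate)
    return cand[0] == cur[0] and cand > cur


minor_ok = is_safe_upgrade("9.6.1", "9.6.24")   # same major, newer
major_ok = is_safe_upgrade("9.6.24", "10.0")    # major bump, held back
```

In practice this kind of restriction usually lives in the package manager’s own pinning or auto-update configuration rather than in a script.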

Samæ@lemmy.menf.in · 2 years ago

Keep the instance small, with all users in the same timezone. Use NixOS and let it update every night, automatically and safely. It’s good enough for a small service, since downtime is mostly when people are sleeping.