Update on the recent server failure issue

Between Saturday the 21st of June 2025 and Monday the 23rd of June 2025, we experienced a failure on our plugins server.

As a result, data fetching for applications using the Airtable plugin, SQL plugin, and REST API plugin with CORS enabled in production was disrupted. This also impacted access to the editor for applications relying on these plugins.

The issue has now been resolved, and the server is back to normal and functioning properly.

The issue went undetected by our monitoring tools because the server was going down and restarting at a high frequency, preventing alerts from being triggered.
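To illustrate why rapid restarts can slip past a simple downtime alert (this is a hypothetical sketch, not WeWeb's actual monitoring stack, and the thresholds and names are made up for the example): a check that only fires after a sustained outage never triggers when the process keeps coming back up within a few minutes, whereas a restart-rate check catches the flapping.

```ts
// Hypothetical flap-detection sketch (illustrative thresholds only).
const WINDOW_MS = 15 * 60 * 1000;   // look-back window: 15 minutes
const MAX_RESTARTS_IN_WINDOW = 3;   // more restarts than this = unstable

const restartTimestamps: number[] = [];

// Called by the process supervisor each time the service restarts.
export function recordRestart(now: number = Date.now()): void {
  restartTimestamps.push(now);
}

// A naive downtime alert: never fires if each individual outage stays
// short because the server keeps restarting.
export function sustainedOutageAlert(downtimeMs: number): boolean {
  return downtimeMs > 5 * 60 * 1000;
}

// A restart-rate alert: fires when the service restarts too often inside
// the look-back window, even though each outage is brief.
export function flappingAlert(now: number = Date.now()): boolean {
  const cutoff = now - WINDOW_MS;
  while (restartTimestamps.length > 0 && restartTimestamps[0] < cutoff) {
    restartTimestamps.shift(); // drop restarts that fell out of the window
  }
  return restartTimestamps.length >= MAX_RESTARTS_IN_WINDOW;
}
```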

The team began investigating the issue on Sunday evening and found the root cause and a solution approximately 10 hours later.

This resolution time was unacceptably long, especially given that production was affected, something that had never happened in our seven-year history. We deeply apologize to all users impacted by this incident.

Regarding the root cause of the problem: there hadn’t been any changes to this service on our end since April 2nd, 2025. It appears the problem stemmed from a change made by Supabase, which introduced a new type of error that was thrown en masse at our servers, ultimately causing them to fail.

Here is what we are planning to do so that this doesn’t happen again:

  1. We are updating our server monitoring approach to ensure that instabilities like this can be detected and reported immediately.
  2. We put in place a status page so that our users can track the status of servers at any point in time.
  3. This plugins server has proven to be a single point of failure for a significant portion of our users. As a result, we’re planning a major revamp of our backend architecture to provide each user with dedicated computing units. This will ensure that one user’s issue won’t impact others.

Feel free to reply here in the thread or to reach out to me at raphael@weweb.io if you would like more information or want to discuss it with me personally.

We deeply apologize for this issue. Please rest assured that we are taking all necessary measures to ensure it does not happen again.

19 Likes

We need status notifications please, by subscription via email or Slack, to help us with monitoring.

4 Likes

I also had a bunch of issues with the xano auth plugin over the last few days. Is that related or not?

Thanks for the detailed breakdown of what happened and the steps you’re taking to resolve it. I for one am happy with the steps you’re all taking to ensure this doesn’t happen again (the server status pages were a great touch), and will continue to use WeWeb for app development!

2 Likes

What were the issues?

Good idea, I’ll see if we can do it

1 Like

Thanks for your trust @Dan_OOT, much appreciated.

2 Likes

It’s excellent that WeWeb is taking this seriously. One issue that emerged this weekend, and for which I didn’t see a solution, is the apparent lack of a WeWeb team handling critical matters on weekends. I know it’s not easy, and it costs more to have someone on a weekend shift, but we can’t just be stranded like that for almost three days.

Hey @tomerer2000, thank you for your feedback.

The root cause of this weekend’s delayed response wasn’t primarily a staffing issue, but rather a critical failure in our detection systems. Our monitoring failed to trigger the proper alerts that would have notified our on-call team about the severity of the situation.

As we have rapidly scaled, we have been receiving an increasing volume of support tickets and issue reports. However, without proper correlation through our detection systems, distinguishing critical issues from isolated problems has proven challenging.

This incident did expose gaps in our monitoring and escalation process, which we are taking very seriously. We are now implementing the improvements mentioned above.

While we can’t change what happened this weekend, we are committed to ensuring our detection and response systems match the reliability you all deserve.

2 Likes

I understand that this issue could have been caught by better detection systems, but I also believe it could have been fixed, or at least work could have begun, if there had been a way to contact a member of staff. From what I hear from the people affected, they had no way of reaching a single person.

So while, yes, in this case better detection systems would have helped, I think not having a single person available is not good enough. If an error slips through in an update and isn’t recognised before the weekend, a better server monitoring system isn’t going to help. Someone needs to be available.

3 Likes

I agree with this. This causes a lot of unnecessary issues.

Users did reach out to us (community, support, e-mail, LinkedIn, etc.), and we saw the messages over the weekend. We added trackers to the projects with apparent failures to monitor them and figure out where the problem came from. We understood where it came from on Sunday night, tested a fix, and it went live on Monday morning. We could have been faster if the monitoring system had pointed out the error straight away; what took time was:

  1. Realizing there was a problem affecting many users (not an isolated issue).
  2. Having to add trackers, monitor the error and figure out where it came from (it was not obvious since nothing changed on our side).

Nevertheless, you are right that it could have been resolved faster. But to be clear, for this kind of error, that will only be possible through a better monitoring system.

2 Likes