Every time someone says something like this I have to explain CDC and regular old backups. There’s no way in hell Reddit doesn’t keep cold and hot backups of their shit. And while Reddit is unlikely to be doing CDC for soc2 or other compliance reasons, it’s the easiest method to capture data for analytics purposes.
CDC stands for change data capture. It’s generally done with databases by streaming the change log or ref log to a bucket or a service like Kafka where you can fast forward and rewind the log queue to see the state of the DB at any point in time. Even if you edit your comments it’s likely sitting in a Kafka topic or a snowflake bucket outside of the DB or cache used for the presentation layer.
Zero large scale websites operate with a truly single data store. There is always another layer that your user operations don’t impact
Yes, that’s certainly possible, but it’s also out of my control. I have basically three options:
Delete account - we know this doesn’t delete comments
Delete comment - “seems” to delete comments, but we’ve seen comments get restored - so probably using a “deleted” flag
Edit comment with nonsense and when delete - should poison comment if they’re just using the deleted flag
That’s it. There’s no guarantee it works, but it has a much higher chance of working than the other two.
And there’s a good chance they delete old backups. Hosting every edit is expensive, so there’s a decent chance they clean up old data after some months.
In 2019 the total size of the text stored by Reddit was only 50TB. A Petabyte of data in cold storage is only 12k a year so even if they 500x in size since 2019 (very unlikely) it’s a drop in their ARR. given they sell the data for advertising and for AI, they are not deleting it. Reddit also self hosts a lot of their infra (they used to present their architecture at kubecon) so the storage costs would be even lower
Every time someone says something like this I have to explain CDC and regular old backups. There’s no way in hell Reddit doesn’t keep cold and hot backups of their shit. And while Reddit is unlikely to be doing CDC for soc2 or other compliance reasons, it’s the easiest method to capture data for analytics purposes.
CDC stands for change data capture. It’s generally done with databases by streaming the change log or ref log to a bucket or a service like Kafka where you can fast forward and rewind the log queue to see the state of the DB at any point in time. Even if you edit your comments it’s likely sitting in a Kafka topic or a snowflake bucket outside of the DB or cache used for the presentation layer.
Zero large scale websites operate with a truly single data store. There is always another layer that your user operations don’t impact
Yes, that’s certainly possible, but it’s also out of my control. I have basically three options:
That’s it. There’s no guarantee it works, but it has a much higher chance of working than the other two.
And there’s a good chance they delete old backups. Hosting every edit is expensive, so there’s a decent chance they clean up old data after some months.
In 2019 the total size of the text stored by Reddit was only 50TB. A Petabyte of data in cold storage is only 12k a year so even if they 500x in size since 2019 (very unlikely) it’s a drop in their ARR. given they sell the data for advertising and for AI, they are not deleting it. Reddit also self hosts a lot of their infra (they used to present their architecture at kubecon) so the storage costs would be even lower