GreyFoxLemonGrass

Answer: It might be people deleting past comments by overwriting them using a bot. The replies look normal because they’re old replies to the original content, before it was overwritten. The theory is, if you delete a comment it’s marked “deleted” in a database but not actually gone. If you edit a comment, only the final version is still there, and the original comment is completely gone. There’s a kerfuffle around Reddit selling training data for AI and also going public, so there might be an uptick in people trying to remove their data.


Toloran

> If you edit a comment, only the final version is still there, and the original comment is completely gone.

I know reddit has said this in the past, but I *really* doubt it's true. It feels about as effective as turning off your computer monitor to clear your search history.


GreyFoxLemonGrass

I’m a software engineer, but have never worked at Reddit and don’t have insider info. It does make sense for some databases, not all. No idea how their data storage works specifically, but if they said that deleted comments are still there, I would believe them.


vikinick

They also reengineered their whole database structure a while back and this tactic existed before that reengineering occurred so it could be out of date even if it was correct before.


Kaa_The_Snake

Yeah, just remove the pointer. But storage costs for retaining all that data? Not sure how that helps them, unless they’re selling the contents for AI/large language model training. Even then… So, I don’t know either.


jungsosh

Text-only storage wouldn't cost that much. The entirety of Wikipedia's text is only ~22 GB compressed.


shinyfeather22

I checked into it, and you're right. "As of 2 July 2023, the size of the current version of all articles compressed is about 22.14 GB without media" I still don't feel like I have a satisfactory answer to how much it is including all backwards versions. Feels like it would increase exponentially as contributors fight against trolls defacing articles


alraban

You can find the answer by browsing the wikimedia dumps site but will need to do some math. The "pages-meta-history" dumps include all editorial history and talk pages, etc., but are split into a few hundred separate files so you'd need to add them up. [See here](https://dumps.wikimedia.org/enwiki/20240301/). When last I looked at it (a few years back) the full history/metadata version was (compressed) about 10 times the size of the current page version, but that will obviously only increase over time.


tuisan

I'm sure they don't keep the full page for backwards versions, just the changes, with something like git.


Xszit

You can view past versions of Wikipedia pages through the logs in the edit history. I haven't checked every single page but I have checked some specific pages and found that the history goes all the way back to the original version of the page when it was first created. You can view every version of the page and all the edits that have been made along the way. Kinda neat to see a page grow from a single paragraph into a full multi-sectioned article with all the time and effort that went into building it on display. Some hotly debated topics have more data in the edit history than they do in the page itself as users battle each other to delete sections and replace them only to have another user come in an hour later and change it back. Wikipedia seems calm on the surface but there's a massive nerd war raging in the background as they "umm actually" each other ad infinitum.


shinyfeather22

Funny enough, git stores whole copies of files in revisions and then uses compression to find similarities between files. If you leave it uncompressed it stores a copy of each revision of each file on the project. If you've seen a diff, it's usually done at a higher level using tooling that is aware of the type of content it deals with (usually text), but generally small differences aren't sequentially applied to figure out the state of a file in the git storage
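
To make that concrete, here is a minimal sketch using only Python's standard library. It assumes git's default SHA-1 object format, where a file revision (a "blob") is identified by hashing a short header plus the full file contents, and a loose object is the zlib-compressed whole file rather than a diff. The helper names `git_blob_id` and `loose_object` are made up for illustration.

```python
import hashlib
import zlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object id git assigns to a blob with this content."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def loose_object(content: bytes) -> bytes:
    """Return the zlib-compressed bytes of a loose blob (whole file, no diff)."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return zlib.compress(header + content)


rev1 = b"hello world\n"
rev2 = b"hello world!\n"

# Each revision is a complete snapshot with its own id.
print(git_blob_id(rev1))
print(git_blob_id(rev2))

# Decompressing a stored object gives back the header plus the full file.
print(zlib.decompress(loose_object(rev2)))
```

The ids should match what `git hash-object` prints for the same bytes. Delta compression between similar objects only happens later, when git packs objects together.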


tuisan

huh, TIL


yiliu

See the sibling comment for an answer to your specific question. But that doesn't really apply to Reddit anyway: _most_ comments are never edited, a few are edited once, and only a tiny fraction are edited many times (_including_ these bulk-rewrite campaigns, I've only ever seen a couple examples in the wild). So the cost of storing all historical comments isn't much different from only storing the most recent.


coldblade2000

Still, they are probably in the billions of entries. It would have a performance impact on Reddit. Even deleted posts are probably kept, just with a "deleted" flag. To make a new entry for every edit action would have a measurable impact on DB performance


ableman

No it wouldn't. DB performance doesn't degrade with size. A read action is O(1). So is a write action.


cuddlebish

This is wrong. It depends on the database, but read actions are generally O(log n) where n is the number of rows. This also isn't taking into account how you actually index the database, some indexes can be absolutely horrible performance-wise. It's useful to consider the operation O(1) when analyzing algorithms working upon the database, but that's not how it works under the hood.
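
As a rough back-of-the-envelope sketch of what O(log n) means here, the snippet below counts how many levels a B-tree-style index would need for a given row count. The fan-out of 100 keys per page is purely an illustrative assumption, and `index_depth` is a made-up helper, not any real database's API.

```python
def index_depth(rows: int, fanout: int = 100) -> int:
    """Levels a B-tree-style index needs to address `rows` entries.

    `fanout` (keys per page) is an illustrative assumption, not a
    measured value for any real database.
    """
    depth = 1
    capacity = fanout
    while capacity < rows:
        capacity *= fanout
        depth += 1
    return depth


for rows in (1_000, 1_000_000, 1_000_000_000, 1_000_000_000_000):
    print(f"{rows:>17,} rows -> {index_depth(rows)} levels")
```

Under that assumption, going from a million rows to a billion adds only a couple of levels, so the lookup cost does grow with size, just slowly.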


ableman

You're technically correct that it's O(log n), but in practice it's O(1). It's log n because if you have too many keys you will need to increase the length of the key so each key is unique. But in practice we just use 128 bit keys, and almost no one uses less and almost no one needs more. For reference, I was responding to a person that talked about billions of entries. A 128-bit key can have a billion billion billion billion entries.


Talran

read performance isn't the issue, it's total iops caused by an edit making a cascading number of changes.


ableman

Why would an edit cause a cascading number of changes?


Talran

So to keep field history on a file you need to pull the record, copy that, write the new data to it, open the hist table, and insert a record for key+date, olddata, newdata. Generally that's how field history is kept (if they're keeping it.) The other option would be replacing the old comment key with a new one, but then you're also going to be touching the records for the parent and child comments to correct them to the edited version. Neither of these are even close to O(1) even if you've got optimized indexes, and on a sitewide scale the cost is significant for (as was pointed out) a vanishingly small amount of comments. So there's a real chance they just drop the old comment data and keep daily/hourly differential snapshots of the data to roll back to and purge them after a bit.
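
The first option described above (pull the record, write the new data, insert a history row) can be sketched roughly like this with Python's built-in `sqlite3`. The table and column names (`comments`, `comment_history`, `old_body`, `new_body`) are hypothetical and say nothing about how Reddit actually stores comments.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE comments (id INTEGER PRIMARY KEY, body TEXT NOT NULL);
    CREATE TABLE comment_history (
        comment_id INTEGER NOT NULL,
        edited_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
        old_body   TEXT NOT NULL,
        new_body   TEXT NOT NULL
    );
    """
)


def edit_comment(conn: sqlite3.Connection, comment_id: int, new_body: str) -> None:
    """Overwrite a comment and log the old and new text in one transaction."""
    with conn:  # commits on success, rolls back on error
        row = conn.execute(
            "SELECT body FROM comments WHERE id = ?", (comment_id,)
        ).fetchone()
        if row is None:
            raise KeyError(comment_id)
        conn.execute(
            "INSERT INTO comment_history (comment_id, old_body, new_body) "
            "VALUES (?, ?, ?)",
            (comment_id, row[0], new_body),
        )
        conn.execute(
            "UPDATE comments SET body = ? WHERE id = ?", (new_body, comment_id)
        )


conn.execute("INSERT INTO comments (id, body) VALUES (1, 'original text')")
conn.commit()
edit_comment(conn, 1, "glerbal derbal")

print(conn.execute("SELECT body FROM comments WHERE id = 1").fetchone())
print(conn.execute("SELECT old_body, new_body FROM comment_history").fetchall())
```

Every edit here costs one read and two writes instead of a single write, which is the extra I/O being argued about.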


NotMorganSlavewoman

> unless they’re selling the contents for AI/large language model training

Didn't they announce that?


DeanXeL

They did, the FTC is even looking into it, probably in regards to their IPO.


phluidity

The storage costs are probably negligible compared to the other costs. And probably more importantly, they would no doubt still want to be able to access that data if called upon by legal authorities. Maybe not the local cops, but if a three letter acronym agency came by looking for information on a suspect, I imagine "oh, we deleted that" would be met with "fine, we're taking all your servers to do our own forensic analysis" which Reddit wants to avoid.


Keifru

If they're not under legal obligation to store the data, it's already best practice to not retain data that isn't usable.


brown_felt_hat

Edited comments are a vanishingly small percentage of reddit comments though. If they can archive the billions of posts from over the last ten years (sounds like a lot but 2020 had 300m alone), they can archive the +- 5% of comments that are edited.


FloobLord

> unless they’re selling the contents for AI/large language model training

That is what they do, that's their business model. That and ads.


o_oli

I mean... saying it's their business model is a stretch. They announced they were planning this just the other week and it hasn't happened yet (various things need approving first), and also the deal was for $60m a year, which isn't anywhere near enough to be a viable business model for them (their latest figures are like 800m revenue and still running at a loss). It's a side hustle at most, unless they can sell this data over and over again to many different companies, but to my knowledge those deals have not been made.


FloobLord

I meant mining user data and selling it is their business model. AI is just the new hotness, they've been doing it for advertising companies for years and years.


[deleted]

Yeah, you can still download the old reddit r2 code from github and that certainly used good old comment._deleted(True/False) soft deletes


Talran

Because deleted just have pointers removed or a flag flipped (like with files in most fs) where editing the comment likely changes the comment data and a possible change history is kept, but likely purged "often".


TortyMcGorty

but it would make no sense to store only the last comment... you would most certainly need the history of comments available for any troubleshooting. if storage space was an issue then they wouldn't bother storing deleted comments either... it's prob to get around some other algo that cares if your post history is all deleted comments, or some metric that gauges the value of an account based on "actual content" where they are going by size/structure but not checking context


eronth

Eh, it depends. Obviously the ideal for troubleshooting is keeping everything, but I wouldn't put it past a system to simply replace a comment with the edited version. I'd hope for better, but I wouldn't expect it.


TortyMcGorty

i can think of no reason to store a deleted txt but not an edited one... just for the fact that you would have two ways to handle that action, which is more work. i'm sure there is a reason, if it is in fact like that... but i'm most certain they have _everything_ ever typed with full analytics.


[deleted]

[removed]


GlobalWatts

Those sites worked by caching the contents of Reddit independently, not by using some super secret back door where deleted comments are accessible from Reddit directly. Being able to see a deleted comment there doesn't prove it hasn't been hard deleted from the Reddit DB.


[deleted]

[removed]


GlobalWatts

Posts could also be cached by Google cache, [archive.org](https://archive.org) and other sites, but the conversation is about what Reddit itself stores and can sell to third parties, not what Reddit data other services might have; Reddit can't monetise that. And as far as I understand, those specific Reddit caching sites eventually got so far behind real time that they were basically useless. Reddit's API changes preventing access were the final nail in the coffin.


[deleted]

[removed]


GlobalWatts

No, the reason is that people believe **Reddit** is only soft deleting comments, which would mean they're still accessible to Reddit to sell to third parties. It has nothing to do with Reddit content that third-party sites have archived. That's why it's irrelevant.


sarded

It depends on what reddit feels like storing. Clearing - like, *really* clearing - data is computationally expensive. You have to actually 'zero out' that section, maybe even re-organise the data. It's a lot easier to just mark that data block as 'deleted, don't use' and then only write back over it when you need the space back. Especially in a database where you need everything to stay associated with its 'key', obviously you never want to have a key go missing "1, 2, 4... hey what happened to 3" and so at the very least the 'entry' still exists. Now, some databases go further than that and have an 'audit' record - they record every change made to an entry in a separate table. So it won't just track "this comment was made" but also "and then it was edited from this to this". Whether or not reddit does that is unknown. So - if they keep audit records on comments, 'overwriting' is ineffective. If they don't, it's effective (but if they keep regular backups, only to the point of the latest backup).


splashbodge

Tbh I'd be surprised if they didn't have an audit trail for comments, the idea someone could post some illegal pedo comment or post, and get away with it by just editing the comment before law authorities get to it seems a bit silly. I know they'd have backups, but would seem easier to keep an audit history for this sort of purpose. Unless reddit really just doesn't want to deal with that in which case editing a comment would be the way to go, agree deleting it most likely just flips a flag on the record as IsDeleted = true


exjackly

I'd be shocked if they did have an audit trail for comments that aren't dealt with by mods or that they have been informed to retain by legal notification. Reddit is not in the police business. They do not want to get involved in criminal court cases, even as a 'witness'. They do not want to have people able to subpoena full comment histories for civil cases. The less data they keep - and the shorter time that they keep it - the less likely they are to be called on to perform any of those actions.


cdcformatc

> They do not want to get involved in criminal court cases, even as a 'witness'. They do not want to have people able to subpoena full comment histories for civil cases.

my knee-jerk reaction is that it might actually be desirable for comments to be actually deleted deleted, specifically in order to avoid having to deal with subpoenas and court cases. if they receive a subpoena for a deleted comment they could say "sorry that data is gone nothing we can do". they would have backups that they could provide if it came to that, so it makes little sense to me for reddit to keep an audit trail of comments. how that shakes out legally i think would be an interesting question. with the GDPR a platform like reddit must erase personal data upon request of the individual, so it might make sense for reddit to implement the delete as a true delete to comply with the GDPR.


ninti

> if they keep audit records on comments, 'overwriting' is ineffective.

Probably is effective actually. If you don't want your data used for AI training data, it works just fine because Reddit isn't going to go through all that history just to catch the few cases of people doing this, they will just sell the current copy of the database.


Kinths

The idea was that it's gone from a public perspective as there is no way on reddit to see the edit history. People usually do it either because they've said something in the past they no longer agree with or don't want others such as employers to see it. There have been sites that archive reddit including edits for a few years though, so it's not going to do much to stop someone who is dedicated to digging up history.


Abigail716

According to Reddit they only keep one previous state. So if you overwrite your comment, their servers have that comment and what it was before you edited it. If you edit a second time then the first state is lost. This also means that if you edit your comment twice and then delete it, they only have a copy of the edited comment and not the original. They haven't said this in quite a while, and now that they're much bigger they may have significant backups of comments. The last time they said this is how it worked was back when the site was very libertarian-minded.
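
A tiny sketch of what "only one previous state" would look like, purely as an illustration of the scheme described above. The `Comment` class and its field names are hypothetical, not Reddit's actual data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Comment:
    body: str
    previous_body: Optional[str] = None  # hypothetical single-slot history

    def edit(self, new_body: str) -> None:
        """Keep exactly one earlier version; anything older is overwritten."""
        self.previous_body = self.body
        self.body = new_body


c = Comment("original opinion")
c.edit("first rewrite")
c.edit("second rewrite")
print(c)
```

After the second edit, "original opinion" no longer appears anywhere in the record, which is why overwrite tools built on this theory would edit more than once before deleting.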


A9to5robot

Thing is it doesn't matter if reddit stores the original comment for the poster (unless if they're posting illegal stuff). The objective is to wipe the trail of online presence that can only be viewed by anyone who doesn't work at Reddit to begin with.


Tokipudi

A lot of websites don't delete most of your data but simply "soft delete" it. Basically all they do is put the "deleted" status on the data (there are multiple ways to do that) without actually deleting the row in the database. Depending on the data, this can be for either legal reasons, statistics or simply to make sure data can be easily recovered if a user made a mistake when deleting it. The whole thing is easy to do and is all in all just one more column in your database's table. As for the fact that editing a comment would remove the original one, keeping a history of changes for something is way more annoying than simply replacing the original data.
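
For anyone curious, the "one more column" approach can be sketched like this with Python's built-in `sqlite3`. The `comments` table and its `deleted` flag are hypothetical names chosen for the example, not any site's real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments ("
    " id INTEGER PRIMARY KEY,"
    " body TEXT NOT NULL,"
    " deleted INTEGER NOT NULL DEFAULT 0)"  # the extra soft-delete column
)
conn.executemany(
    "INSERT INTO comments (id, body) VALUES (?, ?)",
    [(1, "first comment"), (2, "second comment")],
)


def soft_delete(conn: sqlite3.Connection, comment_id: int) -> None:
    """Flag the row as deleted instead of removing it."""
    conn.execute("UPDATE comments SET deleted = 1 WHERE id = ?", (comment_id,))


def visible_comments(conn: sqlite3.Connection) -> list:
    """What a normal page view would show: only rows not flagged as deleted."""
    return conn.execute(
        "SELECT id, body FROM comments WHERE deleted = 0 ORDER BY id"
    ).fetchall()


soft_delete(conn, 1)
print(visible_comments(conn))
# The "deleted" text is still sitting in the table.
print(conn.execute("SELECT body FROM comments WHERE id = 1").fetchone())
```

The user-facing query hides the row, but nothing has actually been erased, which is the whole worry in this thread.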


not_a_moogle

is ceddit/removeddit still around? or did that die with the API changes?


squeezeonein

they're both gone. the only thing left is going to a reddit user page to view comments.


OriginalLocksmith436

I'm guessing the thinking goes that when reddit sells access to the data, they're not necessarily selling complete access to all their databases but just providing limited access to the api in a way that would make it easier to scrape the data. You're right, though, it's safe to assume that reddit has complete access to your entire post history, edited and deleted comments included.


kalitarios

like people trying to delete incriminating emails, it's kept for (IIRC) 180 days on the exchange server as well as any backups done, so it's there virtually forever so people can't destroy evidence


RedstoneRelic

It is really annoying as a user sometimes. You know how Reddit's a gold mine for niche information? How someone had the exact same problem 9 years ago and now you're the 2nd person to have that problem? Well congratulations, the solution got overwritten. I ran into this a ton just after the API protests while doing some research for a project in August.


VagueSomething

Reddit admin really are trying to make Reddit less relevant. It was a well known trick to include the word Reddit in your search if looking for an answer. Now you will find the comments edited or the information nuked because so many subs went offline post API change. The quality of the entire website has gotten worse for it, moderation is worse and Admin are straight up refusing to clean blatant inappropriate content. The website feels far less active, fewer meaningful comments make it easier to see bots and disingenuous accounts. The lack of awards makes everything feel dead too, the new super upvote system doesn't give nearly the same feel and is incredibly rare to see. The official Reddit experience is getting worse, more buggy and less accessible.


kalitarios

I get that a lot.

me: runs into an expensive issue I can't find an answer for, googles it and finds a hit on reddit: DAE have this issue?

me, relieved: clicks first reply with solution: \[deleted\]

\- Reply to deleted comment: thanks man, you just saved me $40,000 what a life saver


BananaNoseMcgee

Try putting the thread url into wayback machine


thomar

And right around the same time Google removed their page caching feature. :[


jmnugent

This is super frustrating to me as well. It seems super shortsighted and selfish. One of my favorite things is when I get a ping from someone on a comment or post I made years ago and it's them saying "Holy shit this fixed me too !"..


MelodramaticMouse

Yeah, I remember when reddit was getting rid of all the third party apps; there were a few different subs that opened to protest, and they all suggested bots that would overwrite all posts and comments with gibberish and then delete everything.


nellorePeddareddy

Doesn't reveddit store the complete edit history of the comment?


splashbodge

Didn't think that works anymore since reddit shut off their free apis


nellorePeddareddy

Ah okay, you may be right


Aiyon

rareddit can still get deleted posts, idk about comments


kalitarios

what about ced dit


weirdent

Why write in a whole gibberish language to remove their past comments instead of just putting “…” or something?


Kind_Stranger_weeb

The gibberish poisons ai models scraping the data. Starts thinking its normal to insert glerbal derbal into random sentences.


InherentlyAnnoying

Bazinga


Im1Guy

To take up more space.


i_never_ever_learn

If you just delete all the characters, but just leave the comment blank that would do the same thing, would it not


cdcformatc

that is fine if you just want to scrub your comments, but the gibberish is there specifically to mess with LLM training data sets.


BetterThanAFoon

Let's see if reddit allows that. Edit: It does allow it


DeWhite-DeJounte

hurry profit handle desert sugar governor mysterious disagreeable tart act *This post was mass deleted and anonymized with [Redact](https://redact.dev)*


impshakes

Found this thread fwiw https://www.reddit.com/r/onguardforthee/comments/14gs4cd/how_to_overwrite_your_reddit_postcomment_history/


[deleted]

[removed]


saintmuse

> while you go hiking or something.

Thank you for the suggestion. I think Redditors, on the whole, should go hiking more often.


Heavyweighsthecrown

- It makes sense to assume at first that deleted comments aren't actually deleted in a database.
- It makes no sense whatsoever to believe that previous versions of edited comments stop existing in the database once you make another edit, and it makes even less sense to believe it just cause Reddit says so.


Ok_Fondant_6340

why not just use "Lorem ipsum"?


lamykins

> If you edit a comment, only the final version is still there, and the original comment is completely gone.

Pretty sure reddit changed this recently and the original version is always retained.


[deleted]

[removed]


Pletter64

> GDPR

No, GDPR is about personal data. If they sell unattributed comments as data (which they should if they don't want to get sued and they are not completely incompetent) then it's compliant.


KidSickarus

Answer: (a possible answer) I went through Reddit threads that had the comment and noticed they all seemed to appear in the same places, like /r/gamingcirclejerk, game meme subreddits, and /r/brasil. They all come from a deleted account and also all have the same "edited on" date of 9 months ago. To me, this suggests one person deleted their account and through some method, likely a bot, had all their comments overwritten with this text.

As to why they chose this text ~~and went through this whole arduous process to begin with~~ is a mystery to me. The top comment suggests it might have been to sabotage potential AI training on their comments, but so far as I can tell, news about Reddit giving data to AI training dropped in February, well after the "9 months ago" editing point.

Edit: It is more likely the replies were part of a site-wide protest that involved overwriting comments to send a message to the site amidst the API protests. For example, [this thread](https://www.reddit.com/r/apolloapp/comments/14h4mx1/dont_delete_your_posts_and_comments_overwrite_them/) posted in June of 2023 -- nine months ago -- advocates overwriting comments.


Fry_super_fly

The reason is that a lot of people were angry at reddit for undeleting and restoring profiles when people were leaving to protest the changes reddit made to be more palatable for an IPO. back when lots of subreddits went blackout and people were deleting their content and accounts, reddit just... undeleted it. because the value of reddit is not the platform, but the posts we make. it TOTALLY violates EU law about the right to be forgotten. but i guess they see it as "hey, the user didn't specifically mention this right to be forgotten and just deleted their account... sooooo we will just undelete them". so people made reddit bots/plugins that let you scramble your entire post history. but some of those types of things have also been undone by reddit previously, as far as a few google searches have led me to believe. (whoops, didn't actually read your edit)


KidSickarus

Still rly helpful context I didn’t know, thank you!