What amazing work! I am very interested in doing research with Tor and a dataset like this could make my job a heck of a lot easier. I have a legal question though: Are your scrapes text only? Before I work with this dataset, I want to make sure that there's no possibility it contains illegal images (child porn).
They are generally not text only. I feel that images are useful to allow browsing the markets as they were and may be highly valuable in their own right as research material, so I tried to collect images where applicable. (The forums usually did not support any kind of image upload other than avatars, so this is more relevant to the markets than forum scrapes.)
As far as CP goes, there should be essentially zero CP anywhere in the archive. DNM users almost universally loathe CP, and no market has ever dared to permit sales. (You may find this funny: CP is so taboo, on the DNMs like elsewhere, that it's been used in at least one attack - SR2's DoctorClu/Brian Farrell infamously attacked a rival market's forum by posting CP to it.)
> DNM users almost universally loathe CP, and no market has ever dared to permit sales.
These users are willing to do so many other illegal things, but the thought of being known as a pedophile or supporting pedophilia in any way, is completely abhorrent to them? Interesting datapoint.
> These users are willing to do so many other illegal things, but the thought of being known as a pedophile or supporting pedophilia in any way, is completely abhorrent to them?
I'm pretty sure the distinction is that voluntary transactions have no victims, and DNM folks care more about morality and ethics than legalities.
It is interesting, isn't it? You might expect there to be a 'general factor of criminality/antisociality/violence' akin to how we find a general factor of intelligence in psychological things, but as far as I can tell, drug use or sales seem to be largely orthogonal to other kinds of crimes - most of the DNM users will never buy credit card dumps and rip off retailers (carding is disliked by a lot of DNM users, although not enough to totally ostracize it like CP), most DNM users will never download CP, most DNM users will never beat or rape someone, etc. I have no issue buying and using illegal drugs, so I'm a criminal, but I'm not the same kind of criminal as, say, the Vallejo kidnappers (BTW, if you haven't read the complaint, it's an amazing read if you're into true-crime stories: http://www1.icsi.berkeley.edu/~nweaver/vallejo.pdf ).
This probably has a lot to do with why the War on Drugs has been such a failure and why legalizing does not seem to unleash crime waves.
It's quite usual for career criminals to have a detailed, carefully worked-out ethical system, where this is a victimless crime but that is reprehensible.
The largest use of DNMs is for the sale of illegal recreational drugs. Regardless of your feelings on the matter, I hope that it's clear that crimes like that fall into a different ethical category than child pornography.
Actually, this is an interesting topic. Poisoning a dataset. CP would work for private security investigators, and to poison against government investigators you could use leaked classified secrets.
Could you work around this by operating on the files on a VPS you don't own, streaming a very low-res ('Basilisk'-proof - https://en.wikipedia.org/wiki/BLIT_(short_story) ) remote desktop image?
Possession laws are pretty strict and hard to decode. I wouldn't want to be the test case in court. The idea of "poisoning" a dataset is an interesting theoretical. But in practice, I just want to judge the likelihood that the dataset is poisoned by the presence of images. If it is then there's not much I can do with it.
Nonsense. Gwern doesn't need to do anything for anyone.
It's an interesting issue, and a way investigators may be attacked, but it's their responsibility alone. There exists data. This is that data. The data may bite. Touch the data at your own risk.
Guess what, laws aren't universal! Unless gwern has a complete understanding of your jurisdiction and can somehow guess how you plan to use the data, he cannot know what is legal and what isn't. The burden lies on you.
It's a general black market, not just drugs. For example, one of the sites described on that page is PEDOFUNDING, "A crowdfunding site for child pornography." Now the dump isn't supposed to contain any images, but it's hard to be 100% sure. In any case, whatever risk there might be seems to be clearly implied in the name and description there.
Lower down on the page, he says he did scrape at least one site with such images, although he specifically only took text. Can't verify that this was the case for all scraped sites.