
I'm working on text classification. I have a decent classifier that's especially suited to author identification. I can think of a few good uses for it; the first one I'm trying to commercialize is academic anti-cheating.


You are probably aware of it already, but a lot of universities (mine included) use MOSS (Measure Of Software Similarity) to detect plagiarism in CS classes. Link: http://theory.stanford.edu/~aiken/moss/
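
For background, MOSS is based on document fingerprinting via winnowing over k-gram hashes (Schleimer, Wilkerson & Aiken, SIGMOD 2003). A toy Python sketch of the idea, with illustrative parameters rather than MOSS's actual ones:

    import hashlib

    def kgram_hashes(text, k=5):
        # Normalize case and whitespace, then hash every k-character gram.
        text = "".join(text.lower().split())
        return [int(hashlib.md5(text[i:i + k].encode()).hexdigest(), 16)
                for i in range(len(text) - k + 1)]

    def winnow(hashes, window=4):
        # Keep the minimum hash in each sliding window; the selected
        # hashes form the document's fingerprint.
        return {min(hashes[i:i + window])
                for i in range(len(hashes) - window + 1)}

    a = winnow(kgram_hashes("The quick brown fox jumps over the lazy dog"))
    b = winnow(kgram_hashes("A quick brown fox jumped over a lazy dog"))
    print(len(a & b) / len(a | b))  # Jaccard overlap of fingerprints

Shared fingerprint hashes survive reordering and small edits far better than exact string matching does.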


I am. There are also a large number of services that detect plagiarism in essays, but most (all?) only detect direct copying from published sources and sometimes re-use of an essay previously turned in by another student.

I'm targeting custom-essay services like http://essaymill.com ("our writing, your success"), as well as students paying other students to write their papers.


Detecting and punishing cheating in those circumstances sounds like a Hard Problem. In particular, when your software says 2 essays were probably written by the same student, but both students deny it, how can they reasonably be punished, since there is no proof?


Questions like "why did you say X?" usually reveal whether a person is actually familiar with what they claim to have written. It's imperfect, certainly, and I would never recommend punishing a student based entirely on an algorithm's result, but I think I can provide a tool to drastically cut down on this sort of academic fraud.

I intend to make it very clear to customers that they should not punish students based only on information provided by my software.


> Questions like "why did you say X?" usually reveal whether a person is actually familiar with what they claim to have written.

That's true.

If an algorithm flags someone as a possible cheat multiple times, it may be worth inspecting that person further.

> I intend to make it very clear to customers that they should not punish students based only on information provided by my software.

Good idea. Customers should, however, publicise that anti-cheating software is in use.


I ran a site that crawled Gnutella/LimeWire for student papers. That's something you could consider adding to your database, and it's quite easy, since the LimeWire code is open source and the Gnutella protocol spec is public. You could write your own client or modify LimeWire.


That sounds fascinating. How did you come up with the algorithms to use?


I experimented with existing text classification algorithms for an author identification project I was doing for fun. What I'm currently using is somewhere between KNN and SVM, but I'm not done tweaking it yet. I'm also working on boosting results using different feature sets.
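
For a concrete flavor, here's roughly what a generic baseline looks like in scikit-learn: KNN over character n-gram features. (This is a standard setup for illustration, not my actual classifier.)

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline

    # Toy training data: (essay, author). Real use needs several known
    # writing samples per student.
    essays = [
        "The results clearly demonstrate that the policy failed.",
        "In my opinion, the evidence suggests a different story.",
        "The results clearly show that the approach succeeded.",
        "I believe the evidence points to an alternative view.",
    ]
    authors = ["alice", "bob", "alice", "bob"]

    # Character n-grams capture stylistic habits (function words,
    # punctuation, spelling) that tend to identify authors.
    model = make_pipeline(
        TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
        KNeighborsClassifier(n_neighbors=3),
    )
    model.fit(essays, authors)
    print(model.predict(["The results clearly indicate that it worked."]))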


You might try looking at the BLEU metric. It's designed to measure similarity between a machine-translated text and a human reference translation, but it could be a good starting point for detecting plagiarism too.
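
For anyone who wants to experiment, NLTK has an implementation. A quick sketch with made-up essay fragments (the smoothing choice is just illustrative):

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    essay_a = "the industrial revolution transformed european society".split()
    essay_b = "the industrial revolution radically transformed society in europe".split()

    # BLEU scores n-gram overlap between a candidate and reference text;
    # smoothing avoids zero scores when higher-order n-grams don't match.
    score = sentence_bleu([essay_a], essay_b,
                          smoothing_function=SmoothingFunction().method1)
    print("BLEU similarity: %.3f" % score)

High pairwise scores across a batch of submissions would be worth a human look, nothing more.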



