26 April 2006
A group at the Indiana School of Informatics has developed a software tool to detect whether a document is "human written and authentic or not." The idea was inspired by the successful attempt of MIT students in 2004 to get a computer-generated paper accepted at a conference (see here). Their program collated random fragments of computer-science speak into a short paper that was accepted at a major conference without revision. (That program is online and you can generate your own paper, though unfortunately it only writes computer science articles.)
The new tool lets users paste in a piece of text and then assesses whether the content is likely to be authentic or just gibberish. The program tries to identify human-style writing, which is apparently characterized by certain repetition patterns, and seems to do rather well. It is not clear whether this works as well for social-science articles: the first paragraphs of a recent health economics article (which shall remain unnamed) only have a 35.5% chance of being authentic. Hmm...
So is this just a joke, or useful programming? The authors say it could be used to differentiate authentic websites from bogus ones, or to identify different types of text (articles vs. blogs, for example). I wonder what the algorithms behind such technology are, and whether this will lead to an arms race between fakers and detectors: if a detector can recognize human-written text, couldn't the faking software use it to make its own output pass?
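The detector's actual algorithm isn't described anywhere I've seen, but one crude stand-in for measuring "repetition patterns" is a compression test: repetitive, machine-collated text compresses much better than varied human prose. The function below is purely illustrative -- my own sketch, not the tool's method.

```python
import gzip

def repetition_score(text: str) -> float:
    """Fraction of bytes saved by gzip-compressing the text.

    Highly repetitive (machine-collated) text compresses well and
    scores high; varied human prose compresses less and scores lower.
    """
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return 1.0 - len(gzip.compress(raw)) / len(raw)
```

A real detector would presumably train a statistical model on word-level patterns rather than raw bytes; this only shows that "repetitiveness" is something a program can put a number on.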
If further tweaked, could this have an application in the social sciences? Maybe we could use the faking software to search existing papers, collate them smartly, and use that to identify patterns and generate new ideas? Maybe everyone should run their papers through detector software before submitting them to a journal or presenting at a workshop? And students, watch out! No more random collating at 3am to meet the next-day deadline!
PS: this blog entry has been classified as "inauthentic with a 26.3% chance of being an authentic text"...
In the last entry I wrote that China is the new exciting trend for researchers interested in development issues. There are now a number of surveys available, and it is getting easier to obtain data. (For a short list, see here.) However, two key issues remain pervasive: language difficulties and little sharing of experience.
While some Chinese surveys are available in English translation, it is still difficult to fully understand their context. China is a very interesting yet peculiar place. It clearly helps to work with someone who speaks (and reads!) the language, though even then you may miss things -- and there is much there that can surprise you.
More annoying, however, is the lack of sharing of information and data. This problem has two parts. For existing data, people seem to struggle with the same problems but don't pass their solutions on to others. In the case of the China Health and Nutrition Survey, for example, numerous papers have been written on different aspects of it, and the key variables are being cleaned over and over again. Apart from the time that goes into this, it can lead to inconsistent results.
The other lack of sharing concerns existing data and ongoing surveys. There are now a lot of people who either have or are currently collecting data in China, but it is rather difficult even to find out about existing sources. If you're lucky, you've come across an article that uses one. If you're not, you might only discover one after you've put in your funding application.
To really start exploring the exciting opportunities that China may offer for research, these problems need to be fixed. I can understand that people don't necessarily want to hand over their data, but too little seems to be known about existing surveys, even among researchers who have worked on China for a long time. And as for cleaning existing data and reporting problems, it just seems a waste not to share. I wonder whether there are similar experiences from other countries?