
Need for DetectGPT to protect ChatGPT

Some Background

ChatGPT has attracted a lot of attention because of its ability to generate text which, while being grammatically correct most of the time, also happens to be very convincing, to the extent that unless you are _actually_ aware of the _reality_ you might believe that what the machine learning model (estimated to have cost 1.2 billion USD to build, with English text data contributed by you and me) says is factually correct.

This has sparked a debate on whether students will outsource their school homework to ChatGPT. Let us keep aside the fact that even before ChatGPT attracted our eyeballs there was homework outsourcing; it just meant paying an actual _human_ to do the task.

Note that using ChatGPT to do homework is interpreted as cheating. Teachers always think that students are out there to find an easy way to accomplish the task assigned to them, never questioning their own inability to frame tasks that would actually require the students to demonstrate their learning.

Consequently, there has been an engaging debate on identifying text generated by ChatGPT to _help_ teachers determine if the student is indeed the author of the text or if s/he used the services of an AI bot (the underlying assumption being that they were unfair in completing their task)!

DetectGPT is an outcome of this: a portal (I was not aware there was a portal for this until now!) that will allow one to differentiate between human-generated text and machine-generated text. A plagiarism checker, if you wish.
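For the curious, here is a minimal sketch of the idea behind such detectors: DetectGPT compares the log-probability a language model assigns to a passage against the log-probability of slightly perturbed versions of the same passage, since machine-generated text tends to sit near a local peak. Everything below is an illustrative assumption rather than the official implementation: GPT-2 (via the Hugging Face transformers library) stands in for the scoring model, and a crude random word-dropping step stands in for the paper's mask-and-refill perturbation.

```python
# Minimal sketch of a DetectGPT-style curvature test -- NOT the official code.
# Assumptions: GPT-2 as the scoring model; random word-dropping as a crude
# stand-in for the paper's mask-and-refill perturbations.
import random

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()


def avg_log_prob(text: str) -> float:
    """Average per-token log-probability of `text` under the scoring model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    return -out.loss.item()  # loss is the mean negative log-likelihood


def perturb(text: str, drop_rate: float = 0.1) -> str:
    """Crude perturbation: randomly drop a small fraction of the words."""
    words = text.split()
    kept = [w for w in words if random.random() > drop_rate]
    return " ".join(kept) if kept else text


def detectgpt_score(text: str, n_perturbations: int = 20) -> float:
    """Positive score: the text sits near a local peak of log-probability,
    which the DetectGPT idea associates with machine-generated text."""
    original = avg_log_prob(text)
    perturbed = [avg_log_prob(perturb(text)) for _ in range(n_perturbations)]
    return original - sum(perturbed) / len(perturbed)


if __name__ == "__main__":
    sample = ("ChatGPT has attracted a lot of attention because of its "
              "ability to generate grammatically correct, convincing text.")
    print(f"curvature score: {detectgpt_score(sample):.3f}")
```

In practice the choice of scoring model, the quality of the perturbations, and the decision threshold all matter a great deal; the sketch is only meant to make the intuition concrete.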

First we research and build ChatGPT, then we realize that it can be used unfairly (by students!), so we come up with a brand new research problem of detecting ChatGPT-generated text, along with newspaper headlines like "Stanford introduces DetectGPT to help educators fight back against ChatGPT".

 

"Solved research problems lend themselves to new research problems" --Anonymous

could not be more true!

Going Forward

 

Note that, according to OpenAI, the creators of ChatGPT, ChatGPT has digested almost all of the visible English text data that was available on the internet. So at some point in the near future the current version of ChatGPT is going to lag in terms of being informative.
 
For example, say a company released a new oral vaccine for COVID-19 after ChatGPT was released; ChatGPT would not be aware of it, because the corresponding information would not have been digested by the model.
 
Sooner or later, for the above reason, the creators of ChatGPT will see a need for their large language model to be upgraded.
 
See the picture above: "A" is when ChatGPT was released and, say, "E" is the point at which the upgrade is deemed necessary; the distance between "A" and "E" is the time between two versions of ChatGPT!
 
In the period between A and E, the world is doing what it does best: generating lots of text data. Except that there is now an additional source (you guessed right, ChatGPT) generating data. This is shown by the "red" line. Its text generation starts with the introduction of ChatGPT (point "A") and is likely to overtake the "usual modes of" text generation by humans in terms of the volume of text generated.
The hump on the red line indicates the hype: more and more people trying things out with ChatGPT and then putting the generated text out on the internet to shout about their findings!
So at point E, when a ChatGPT upgrade is envisioned, the upgraded version would digest more of its own generated text (pictorially, the area between the two red lines) than text that has come from the usual sources (humans; pictorially, the area between the two blue lines). In some sense it would be learning from itself!
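To make this area argument concrete, here is a toy back-of-the-envelope calculation. Every number in it is a made-up assumption for illustration (hypothetical growth rates and text-volume units, with the hype-then-decline shape of the red line ignored for simplicity); it only shows how, if machine output grows quickly while human output stays roughly constant, the share of ChatGPT-generated text among all new text can cross 50% before point E is reached.

```python
# Toy illustration of the area-between-the-curves argument.
# All numbers are hypothetical assumptions, not measurements.
human_per_month = 1.0   # hypothetical units of human-written text per month
bot_rate = 0.1          # hypothetical ChatGPT output at point "A"
bot_growth = 1.35       # hypothetical month-on-month growth during the hype

human_total, bot_total = 0.0, 0.0
for month in range(1, 25):          # say 24 months between points "A" and "E"
    human_total += human_per_month  # area between the blue lines so far
    bot_total += bot_rate           # area between the red lines so far
    bot_rate *= bot_growth
    if month % 6 == 0:
        share = bot_total / (human_total + bot_total)
        print(f"month {month:2d}: ChatGPT share of new text = {share:.0%}")
```

Under these made-up numbers the ChatGPT-generated share of new text overtakes the human share roughly halfway between A and E, which is exactly the learning-from-itself scenario described above.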
 
This scenario could lead the upgraded ChatGPT into chaos, especially if the output of the current ChatGPT is bullshit (as mentioned in this article). If more and more bullshit text generated by ChatGPT is used to train the next version of ChatGPT, the output is going to be even more bullshit (whatever that means), making the upgrade unusable ...

Unless
  1. After the initial hype, the text generated by ChatGPT reduces (pictorially, the area between the red dotted lines, denoted by the length between points "H" and "I") compared to the usual modes of text generation by humans (pictorially, the area between the blue lines, denoted by the length between points "G" and "J"), so that the next version of ChatGPT learns from "new" and "usual" text data rather than from data generated by itself.
  2. There is a way to identify the text generated by ChatGPT, namely DetectGPT! This mechanism can then be used to filter out text generated by ChatGPT, thereby allowing the upgraded version of ChatGPT to learn only from human-generated text and not be biased by the text generated by its older version (a sketch of such a filtering step follows this list). All of this can happen only if DetectGPT works to perfection; however, like all explorations, the ability to achieve good accuracy takes time.
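As referenced in #2, here is a hedged sketch of what that filtering step could look like when the next training corpus is being assembled. The `detector_score` function is a hypothetical placeholder (it returns a random number purely so the example runs end to end); in reality it would be a DetectGPT-style scorer such as the one sketched earlier.

```python
# Sketch of condition #2: screen the raw corpus with a detector before the
# next training round. `detector_score` is a hypothetical stand-in, not a
# real API; replace it with an actual DetectGPT-style scorer.
import random
from typing import Iterable, Iterator


def detector_score(text: str) -> float:
    """Hypothetical placeholder: higher means 'more likely machine-generated'."""
    return random.random()


def keep_human_text(corpus: Iterable[str], threshold: float = 0.5) -> Iterator[str]:
    """Yield only the documents the detector judges to be human-written."""
    for doc in corpus:
        if detector_score(doc) < threshold:
            yield doc


if __name__ == "__main__":
    raw_corpus = [
        "A blog post written by a person.",
        "A paragraph that ChatGPT generated and someone pasted online.",
    ]
    training_corpus = list(keep_human_text(raw_corpus))
    print(f"kept {len(training_corpus)} of {len(raw_corpus)} documents for the upgrade")
```

Every false negative in such a filter lets machine-generated text leak back into the training data, which is why the detector's accuracy, and not just its existence, decides whether the upgrade stays usable.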

While time will tell about #1, there is a definite need to put all our research might behind exploring methods to detect ChatGPT-generated text data. This is probably the only way that ChatGPT will survive.
 
What do you think?

Comments

Shoeb Shaikh said…
It's like someone invented the AK-47 for better military operations and defence but never realized that it could be used by terrorists to harm innocent people.

Every day around the world, inventors are using AI and creating innovative models/apps without knowing their consequences. The inventors of ChatGPT would never have thought about how students might use it.

Software governance is needed before any AI model/app is made public. It's time that the world puts restrictions in place before it gets uncontrollable and massive.
