
Need for DetectGPT to protect ChatGPT

Some Background

ChatGPT has attracted a lot of attention because of its ability to generate text which, while being grammatically correct most of the time, also happens to be very convincing, to the extent that unless you are _actually_ aware of the _reality_ you might believe that what the machine learning model (estimated to have cost 1.2 billion USD to build, with English text data contributed by you and me) says is factually correct.

This has sparked a debate on whether students will outsource their school homework to ChatGPT. Let us keep aside the fact that even before ChatGPT attracted our eyeballs there was homework outsourcing; it just meant paying an actual _human_ to do the task.

Note that using ChatGPT to do homework is interpreted as cheating. Teachers always think that students are out there to find an easy way to accomplish the task assigned to them, never questioning their own inability to frame tasks that would actually require the students to demonstrate their learning.

Consequently, there has been an engaging debate on identifying text generated by ChatGPT to _help_ teachers determine if the student is indeed the author of the text or if s/he used the services of an AI bot (the underlying assumption being that they were unfair in completing their task)!

DetectGPT is an outcome of this: a portal (I was not aware there was a portal for this until now!) that will allow one to differentiate between human-generated text and machine-generated text. A plagiarism checker, if you wish.
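For the curious, here is a minimal sketch of the idea behind such detectors: DetectGPT compares the log-probability a language model assigns to a passage against the log-probability of slightly perturbed versions of the same passage, since machine-generated text tends to sit near a local peak. Everything below is an illustrative assumption rather than the official implementation: GPT-2 (via the Hugging Face transformers library) stands in for the scoring model, and a crude random word-dropping step stands in for the paper's mask-and-refill perturbation.

```python
# Minimal sketch of a DetectGPT-style curvature test -- NOT the official code.
# Assumptions: GPT-2 as the scoring model; random word-dropping as a crude
# stand-in for the paper's mask-and-refill perturbations.
import random

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()


def avg_log_prob(text: str) -> float:
    """Average per-token log-probability of `text` under the scoring model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    return -out.loss.item()  # loss is the mean negative log-likelihood


def perturb(text: str, drop_rate: float = 0.1) -> str:
    """Crude perturbation: randomly drop a small fraction of the words."""
    words = text.split()
    kept = [w for w in words if random.random() > drop_rate]
    return " ".join(kept) if kept else text


def detectgpt_score(text: str, n_perturbations: int = 20) -> float:
    """Positive score: the text sits near a local peak of log-probability,
    which the DetectGPT idea associates with machine-generated text."""
    original = avg_log_prob(text)
    perturbed = [avg_log_prob(perturb(text)) for _ in range(n_perturbations)]
    return original - sum(perturbed) / len(perturbed)


if __name__ == "__main__":
    sample = ("ChatGPT has attracted a lot of attention because of its "
              "ability to generate grammatically correct, convincing text.")
    print(f"curvature score: {detectgpt_score(sample):.3f}")
```

In practice the choice of scoring model, the quality of the perturbations, and the decision threshold all matter a great deal; the sketch is only meant to make the intuition concrete.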

First we research and build ChatGPT, then we realize that it can be used unfairly (by students!), so we come up with a brand new research problem of detecting ChatGPT-generated text, along with newspaper headlines like "Stanford introduces DetectGPT to help educators fight back against ChatGPT".

 

"Solved research problems lend themselves to new research problems" --Anonymous

could not be more true!

Going Forward

 

Note that, according to OpenAI, the creators of ChatGPT, ChatGPT has digested almost all of the visible English text data that was available on the internet. So at some point in the near future the current version of ChatGPT is going to lag in terms of being informative.
 
For example, say a company released a new oral vaccine for COVID-19 after ChatGPT was released; ChatGPT would not be aware of it, because the corresponding information would not have been digested by the model.
 
Sooner or later, for the above reason, the creators of ChatGPT will see a need for their large language model to be upgraded.
 
See the picture above: "A" is when ChatGPT was released and, say, "E" is the point at which the upgrade is deemed necessary; the distance between "A" and "E" is the time between two versions of ChatGPT!
 
In the period between A and E, the world is doing what it does best: generating lots of text data. Except that there is now an additional source (you guessed right, ChatGPT) generating data. This is shown by the "red" line. Its text generation starts with the introduction of ChatGPT (point "A") and is likely to overtake the "usual modes of" text generation by humans in terms of the volume of text generated.
The hump on the red line indicates the hype: more and more people trying things out with ChatGPT and then putting the generated text out on the internet to shout about their findings!
So at point E, when a ChatGPT upgrade is envisioned, the upgraded version would digest more of its own generated text (pictorially, the area between the two red lines) than text that has come from the usual sources (humans; pictorially, the area between the two blue lines). In some sense it would be learning from itself!
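To make this area argument concrete, here is a toy back-of-the-envelope calculation. Every number in it is a made-up assumption for illustration (hypothetical growth rates and text-volume units, with the hype-then-decline shape of the red line ignored for simplicity); it only shows how, if machine output grows quickly while human output stays roughly constant, the share of ChatGPT-generated text among all new text can cross 50% before point E is reached.

```python
# Toy illustration of the area-between-the-curves argument.
# All numbers are hypothetical assumptions, not measurements.
human_per_month = 1.0   # hypothetical units of human-written text per month
bot_rate = 0.1          # hypothetical ChatGPT output at point "A"
bot_growth = 1.35       # hypothetical month-on-month growth during the hype

human_total, bot_total = 0.0, 0.0
for month in range(1, 25):          # say 24 months between points "A" and "E"
    human_total += human_per_month  # area between the blue lines so far
    bot_total += bot_rate           # area between the red lines so far
    bot_rate *= bot_growth
    if month % 6 == 0:
        share = bot_total / (human_total + bot_total)
        print(f"month {month:2d}: ChatGPT share of new text = {share:.0%}")
```

Under these made-up numbers the ChatGPT-generated share of new text overtakes the human share roughly halfway between A and E, which is exactly the learning-from-itself scenario described above.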
 
This scenario could lead the upgraded ChatGPT into chaos, especially if the output of the current ChatGPT is bullshit (as mentioned in this article). If more and more bullshit text generated by ChatGPT is used to train the next version of ChatGPT, the output is going to be even more bullshit (whatever that means), making the upgrade unusable ...

Unless
  1. After the initial hype, the text generated by ChatGPT reduces (pictorially, the area between the red dotted lines, denoted by the length between points "H" and "I") compared to the usual modes of text generation by humans (pictorially, the area between the blue lines, denoted by the length between points "G" and "J"), so that the next version of ChatGPT learns from "new" and "usual" text data rather than from data generated by itself.
  2. There is a way to identify the text generated by ChatGPT, namely DetectGPT! This mechanism can then be used to filter out text generated by ChatGPT, thereby allowing the upgraded version of ChatGPT to learn only from human-generated text and not be biased by the text generated by its older version (a sketch of such a filtering step follows this list). All of this can happen only if DetectGPT works to perfection; however, like all explorations, the ability to achieve good accuracy takes time.
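As referenced in #2, here is a hedged sketch of what that filtering step could look like when the next training corpus is being assembled. The `detector_score` function is a hypothetical placeholder (it returns a random number purely so the example runs end to end); in reality it would be a DetectGPT-style scorer such as the one sketched earlier.

```python
# Sketch of condition #2: screen the raw corpus with a detector before the
# next training round. `detector_score` is a hypothetical stand-in, not a
# real API; replace it with an actual DetectGPT-style scorer.
import random
from typing import Iterable, Iterator


def detector_score(text: str) -> float:
    """Hypothetical placeholder: higher means 'more likely machine-generated'."""
    return random.random()


def keep_human_text(corpus: Iterable[str], threshold: float = 0.5) -> Iterator[str]:
    """Yield only the documents the detector judges to be human-written."""
    for doc in corpus:
        if detector_score(doc) < threshold:
            yield doc


if __name__ == "__main__":
    raw_corpus = [
        "A blog post written by a person.",
        "A paragraph that ChatGPT generated and someone pasted online.",
    ]
    training_corpus = list(keep_human_text(raw_corpus))
    print(f"kept {len(training_corpus)} of {len(raw_corpus)} documents for the upgrade")
```

Every false negative in such a filter lets machine-generated text leak back into the training data, which is why the detector's accuracy, and not just its existence, decides whether the upgrade stays usable.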

While time will tell about #1, there is a definite need to put all our research might behind exploring methods to detect ChatGPT-generated text data. This is probably the only way that ChatGPT will survive.
 
What do you think?

Comments

Shoeb Shaikh said…
It's like someone invented the AK-47 for better military operations and defence but never realized that it could be used by terrorists to harm innocent people.

Every day around the world, inventors are using AI and creating innovative models/apps without knowing their consequences. The inventors of ChatGPT would never have thought about how students might use it.

Software governance is needed before any AI model/app is made public. It's time that the world puts restrictions in place before it gets uncontrollable and massive.
