Jul 12 / Punit Bhatia and Jon Gillham

Plagiarism, Copyright & AI


In today's digital age, where content is king and data flows endlessly, the role of artificial intelligence (AI) in shaping our creative and informational landscape cannot be overstated. But with this technological prowess comes a slew of challenges, particularly in the realms of plagiarism and copyright. In this episode of the FIT4PRIVACY Podcast, Punit Bhatia sits down with Jon Gillham to delve into the intricate web of AI's impact on content creation, the complexities of identifying original material in a sea of data, and the reliability of algorithms in determining factual accuracy.

Transcript of the Conversation

Punit 00:00
AI and plagiarism? Well, the AI world comes with its own challenges. And one of the challenges is: how do you ensure that the content being produced and given to a university, to a professor, to a company, to a professional, to an entrepreneur is genuine, factual, and not generated by AI? And more importantly, that it's true, not false. That's an interesting challenge that AI gives us. And we're going to discuss this with none other than Jon Gillham, who is an entrepreneur himself and is doing a lot of work in this space through his company Originality.ai, helping people check whether content is factual, whether it was generated by AI, and also its readability. So let's go and talk to him.


Hello, and welcome to the fit4privacy podcast with Punit Bhatia. This is the podcast for those who care about their privacy. Here, your host Punit Bhatia has conversations with industry leaders about their perspectives, ideas and opinions relating to privacy, data protection and related matters. Be aware that the views and opinions expressed in this podcast are not legal advice. Let us get started.


Punit  01:29

So here we are. Welcome, Jon. Welcome to Fit4Privacy podcast.


Jon  01:39

Thanks Punit! Yeah, happy to be here and talk about it.

Punit  01:43

It's a pleasure to have you. And let's maybe start with a very simple question, because everyone has his or her own definition. What do you think? Or how do you define artificial intelligence?


Jon  01:55

Yeah, I think for my world, which is mostly focused on the world of text, it's having a machine generate content in a way that is almost indecipherable from a human's, sharing a seemingly intelligent thought.


Punit  02:16

Okay, that's very simple and crisp. Very nice. When we talk about AI, the issue is it brings in bias, hallucination and many, many other challenges. Of course, everyone talks about bias and hallucination, so let's park them. But apparently, it also creates challenges in content, challenges in copyright, and so on. Can you talk about some of the challenges as you see them?


Jon  02:43

Beyond the challenges you just mentioned, there are two main challenges when people are using AI and publishing that content on the web. Challenge one is that if they have hired writers, they might be happy to pay them $100 an article, $1,000 an article, whatever their rates might be, but they're not happy to find out that the content was copied and pasted out of ChatGPT in five seconds. So that fairness of value is one problem. And the second problem is that mass publishing of AI-generated content has been shown to cause Google to take manual actions and penalize sites. So unleashing massive volumes of AI-generated content is a very quick way to get a website into a lot of trouble.


Punit  03:37

Indeed, that's a very big challenge. And I think it also links to universities, where professors ask students to write assignments, and the students ask ChatGPT, or something similar, to write them. And the professor doesn't know whether it was well written by the student or well written by ChatGPT. And typically, we call it plagiarism, right?


Jon  03:58

Yeah, AI plagiarism. It's a giant problem in academia. The number of people plagiarizing in the classic sense, copying and pasting someone's existing work, has massively dropped, because who would do that when it's so easily caught, and when they could use AI to create the content instead? So yeah, it's a huge problem within academia right now, and a really challenging one to solve.


Punit  04:29

So is it a copyright issue? Is it a plagiarism issue? Or is it a privacy issue? What is it in your view?


Jon  04:35

Yeah, I think for sure it's a plagiarism issue. In the case of academia, it's an academic disciplinary issue, an academic dishonesty issue. So definitely plagiarism. Copyright? I think that's still to be determined. OpenAI and most large language model providers are trying to make it pretty clear that they're happy to have whoever created the content own the output. But the USPTO and the copyright governing bodies are not so open. There have been some recent cases where they have been reluctant to grant copyright on AI-generated work. So I think that's still going to play out; it's not settled. And the question is going to become really nuanced in the very near future: what is AI versus what is human, and when we have a sort of cyborg writing experience, is that copyrightable work or not? So: plagiarism, simple; copyright, nuanced; privacy, it shouldn't be an issue. As long as the LLMs don't produce information that reveals private information, I don't see a privacy risk with the use of generative AI for text creation.


Punit  05:56

Okay, so talking about plagiarism, I think there are two contexts. One is the students, but that's a different ballgame. There's also this context: you or I, or any other entrepreneur, hire somebody to write a copy, a text, maybe an article, maybe an email, maybe your social media text, and somebody generates it with artificial intelligence and gives it to us, and we don't know how accurate or how factual it is. And then we end up doing the work. How do we address that?

Jon  06:31

Yes, so that was exactly the use case we were struggling with when we started the current company, Originality. Before that, we ran a content marketing business, where we would hire writers for entrepreneurs that wanted content to rank on their site. We had a policy that writers were not to use AI, but we didn't have any good control mechanism in place to ensure that happened. And so that's where we launched Originality, an AI detection tool to help ensure that website owners and entrepreneurs hiring writers were aware of the content they were getting. You know, we're not against AI content; we think there's a great use case for it within marketing. But we think the entrepreneur, the person purchasing the content, should be the one making the decision on when it gets used and when it doesn't.


Punit  07:23

But the challenge also is, any checks that you run on the system, whether for plagiarism or AI-generated content or whatever, have what you might call false outcomes or false positives, isn't that so?


Jon  07:40

Yeah, so detectors are themselves a sort of AI, called classifiers. The way you evaluate how effective a classifier is, is you feed it known AI content and known human content, and then count the number of times it correctly identified AI content as AI and human content as human. One challenge with the use of these detectors is that no classifier will ever be 100.00000% accurate; that's just not going to happen. It's similar to a weather prediction: if it says it's going to rain, most of the time it will rain, but not always. So although classifiers can be very accurate, you know, 99% accurate on GPT-4 content, there is still a false positive rate. A false positive is when the tool incorrectly identifies human-written content as AI-generated, and in our case, with web content, that happens around 2.5% to 3.5% of the time. And that's unfortunate; it's painful; we hate them. If we had a magic wand to improve our tech, that's the number we would drive down.
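The evaluation Jon describes can be sketched in a few lines. This is a minimal illustration of the benchmark arithmetic, not Originality.ai's actual evaluation harness; the label names and function are hypothetical.

```python
# Minimal sketch: scoring a binary AI-text classifier against a
# labeled benchmark. Labels and predictions are "ai" or "human".
def evaluate(labels, predictions):
    """Return the AI detection rate (known-AI samples flagged as AI)
    and the false positive rate (human text wrongly flagged as AI)."""
    ai_total = sum(1 for l in labels if l == "ai")
    human_total = sum(1 for l in labels if l == "human")
    true_ai = sum(1 for l, p in zip(labels, predictions)
                  if l == "ai" and p == "ai")
    false_pos = sum(1 for l, p in zip(labels, predictions)
                    if l == "human" and p == "ai")
    return {
        "ai_detection_rate": true_ai / ai_total,
        "false_positive_rate": false_pos / human_total,
    }
```

For example, a detector that catches every AI sample but flags one of two human samples would score a 100% detection rate with a 50% false positive rate, which is exactly why detection accuracy alone is a misleading headline number.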


Punit  09:04

So if I get it right, there is a challenge in identifying whether content is original or generated by AI?


Jon  09:12



Punit  09:13

You can use tools, you can use some engines, which help you validate or verify whether it is original or not, but that's not the be-all and end-all solution, because there is a false positive rate, say 5% or 10%, which happens. So the only way is to do it through many channels, be certain, and also use your own human intelligence.


Jon  09:37

Yeah, so we see sub-5% in terms of false positive rate, but you're exactly right, that is the challenge you're facing. A tool like Originality or another classifier helps you identify content as likely AI-generated, and then there are a few different things you can do to know with more certainty. We have a free Chrome extension that allows you to visualize the creation process. Basically, you get to almost watch the writer write: the tool extracts the metadata that exists inside a Google document and recreates the creation process. So you see the document trend upward over time as the writer writes it, versus seeing it copied and pasted in quickly with a couple of changes. If the tool identified the content as highly likely AI, and all you see in the creation process is a copy and paste, you can be pretty confident that the tool identified it correctly.
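The signal Jon describes, gradual typing versus one big paste, can be sketched as a simple heuristic over revision metadata. The `Revision` shape and threshold below are assumptions for illustration, not the actual Originality.ai extension's logic.

```python
# Hypothetical sketch: replaying a document's revision metadata to
# distinguish gradual typing from a single large paste.
from dataclasses import dataclass

@dataclass
class Revision:
    timestamp: float   # seconds since document creation
    chars_added: int   # net characters added in this revision

def looks_pasted(revisions, paste_threshold=500):
    """Flag the document if most of its text arrived in a few large
    jumps rather than accumulating gradually, keystroke by keystroke."""
    total = sum(r.chars_added for r in revisions)
    if total == 0:
        return False
    pasted = sum(r.chars_added for r in revisions
                 if r.chars_added >= paste_threshold)
    return pasted / total > 0.8
```

A document typed over fifty small revisions would pass, while one where 1,900 of 1,950 characters arrived in a single revision would be flagged, mirroring the "all you see is a copy and paste" case.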


Punit  10:36

Right. But then there's also another challenge with content generated by AI. I mean, of course, you want to know if it's generated by AI, and okay, say you have 90-95% certainty that it is or it isn't. The other challenge is: is it factual? Is it really based on facts? Because, you know, there was that incident in one of the US courts: the lawyer asked it something, got some text, and submitted the response. And then the law being referred to never existed. So there was no factual basis for the content. Is that something that can also be identified through these algorithms?


Jon  11:19

Yeah, it is definitely a challenge. And what's unique about this challenge is that in the past, when copy editors got a piece of content, their internal skepticism would be higher when the content was poorly written. What's happening now is that ChatGPT and all these LLMs are so phenomenal at intelligently, and convincingly, guessing and making factual errors. So that is definitely a big challenge. We're developing a tool, still in beta, called a fact checking aid. It uses retrieval of existing knowledge in the world, plus an LLM, to evaluate each factual statement made in an article and judge whether that statement is true or false. Now, this is still AI evaluating AI. It improves the efficacy of fact checking, but it doesn't reach a high enough level that we'd say you no longer need a human in the loop. It's a fact checking aid to make that process a lot faster: you get all the relevant Google links and an LLM's opinion on whether or not the statement is factually correct.
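The retrieve-then-judge pipeline Jon outlines can be sketched end to end. Here the in-memory knowledge base stands in for web retrieval and a crude word-overlap score stands in for the LLM's verdict; both are illustrative assumptions, not Originality.ai's pipeline.

```python
# Hypothetical sketch of a retrieval-plus-judgment fact-checking aid:
# retrieve supporting evidence for a claim, then score the claim
# against it, deferring to a human when support is weak.
KNOWLEDGE_BASE = [
    "The Eiffel Tower is located in Paris, France.",
    "Water boils at 100 degrees Celsius at sea level.",
]

def retrieve(claim, k=1):
    """Rank knowledge snippets by word overlap with the claim."""
    words = set(claim.lower().split())
    scored = sorted(KNOWLEDGE_BASE,
                    key=lambda s: len(words & set(s.lower().split())),
                    reverse=True)
    return scored[:k]

def judge(claim, evidence):
    """Toy stand-in for an LLM verdict: 'likely true' only when the
    claim shares most of its words with the retrieved evidence."""
    words = set(claim.lower().split())
    overlap = max(len(words & set(e.lower().split())) for e in evidence)
    return "likely true" if overlap / len(words) > 0.5 else "needs human review"

def fact_check(claim):
    evidence = retrieve(claim)
    return judge(claim, evidence), evidence
```

Note that even this toy version returns the evidence alongside the verdict, which matches Jon's point: the aid surfaces sources and an opinion, and the human makes the final call.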



Punit  12:37

Okay. So essentially, what we are saying is that these are tools that assist a human in being more sure, but human intelligence still needs to be applied on top. Let's put it like that: the collective intelligence of the AI tool plus the human in play is what determines the reality, and the tools can help up to a certain extent.


Jon  13:03

Correct, yes. It's the intelligence of the LLM plus the collective knowledge of the top-ranking articles on Google for that given fact. That's all fed to the human, who provides their evaluation by checking the sources and reviewing the LLM's opinion, to determine whether or not the statement is factually accurate.

Punit  13:29

And while this is all about the content being produced, you also mentioned early on this challenge of a multitude of content being published on the web, as website content, as blogs, as articles, which is polluting, let's say, the web and creating a challenge to say what is real content and what is fake content. Is that something you also see as a real challenge?


Jon  13:59

Yes. I'll go pretty high level with the answer, but then try to answer it specifically. It's a question for society right now: where do we want AI content? Are we okay with reading AI content in different locations? Take AI-generated product reviews: I think most people are on the same page that no, I don't want to read an AI-generated review, I want to read a human-generated review. We helped a journalist review an Amazon Kindle book about foraging, by an author published on Amazon, and we helped identify that the book was AI-generated. In the book, it suggested checking whether a mushroom was poisonous by tasting it, which is obviously terrible advice that could really lead to death. So it's a bit of a societal question: is this a problem or not? My theory is that it is. I don't want to read reviews generated by an AI; I want to read reviews written by a human who experienced the product and is providing honest feedback. We've looked at the rate at which AI is, for lack of a better word, polluting different pockets of the internet. In Google search results, we've seen it go from around 6% of results being AI-generated, increasing sort of exponentially since ChatGPT launched, to 12-13% of Google search results today. And then the question becomes: if all we see in Google search results is AI-generated content, if that day comes, would we still go to Google, or would we just go to the AI that created the content and has more context about our own situation, and ask the question there? So that's a long-winded answer, but societally, it depends where we want to take it. There are some pockets where we definitely don't want it, but what we're seeing is that it is currently getting everywhere.


Punit  16:06

So if I understand correctly, you're saying about 15% of the content on the web is AI-generated, which is roughly 1 in 6 articles, 1 in 6 emails, 1 in 6 websites being generated. And we don't know whether it's real, whether it's fact-checked or not. And that's where I think your company Originality.ai comes in. Tell us, what do you do in that context?


Jon  16:33

Yeah, that's correct. Originality is a tool for copy editors, meaning anybody that is getting a piece of content from a writer and looking to publish that content on the web. An entrepreneur could be acting as a copy editor when they hire a writer, get a piece of content, and publish it. We help make sure that content meets the requirements you're after, so that you can hit publish with integrity: make sure it's not AI-generated, make sure it's not plagiarized, make sure it's fact-checked, make sure the readability level is where you want it to be. So it's a tool to help copy editors work with writers and hit publish with integrity.


Punit  17:14

That's a very nice thing. And you are offering, as you shared with us, the plagiarism checker, the fact checker, the readability checker, and also the AI checker, all these aspects, is that right?


Jon  17:25

That's correct, yeah. And we continue to build out more features that help editors.


Punit  17:30

Oh, that's a wonderful thing you're doing. How did you come about the idea of Originality.ai?


Jon  17:36

Yeah, it really stemmed from when we were running our content marketing business. We were working with writers, working with editors, working with clients, and there were no really good controls in place to show that content met these requirements: not AI-generated, not plagiarized, at the right readability level. And so we set out to build that tool set for editors.


Punit  18:02

Okay, that's a good service you're doing, because this is one of the real challenges. And what would be your message to anyone who is a professional, maybe an entrepreneur, a small or medium-sized company, or even a large company producing content? How can they stay on top of their game when producing content? Of course, they want to leverage AI to an extent, but in the right way.


Jon  18:28

Exactly, exactly what you said: in the right way, and knowing when they're doing it. There are great use cases for AI content to help websites. I think companies should be making that decision in a heads-up way, knowing that they're making the decision themselves, and not just putting their head in the sand, hoping they're not getting a ton of AI content, and then finding out later that all the writers were just using generative AI and that resulted in their website taking a punishment. So I think the key thing for people is to understand whether or not AI is being used on their site, and make that decision for themselves, not outsource it to the writers.


Punit  19:15

Makes sense. Makes sense. And if I may ask, do you use AI in your own personal day-to-day life in some way or the other? Are there any use cases you want to share?


Jon  19:25

Yes, I use it for writing, which is kind of funny considering what we just talked about. But you know, I'm an engineer; I wish I could communicate in spreadsheets all the time. So I have a custom GPT that's been trained on my own writing style, and I'll dump my own thoughts in and say, write it like me. I use that quite a bit. And then I use it a lot for quick data analysis. Those are probably my two most common use cases: quick data analysis about some topic, to explore something and understand it more thoroughly, and then writing.


Punit  20:02

That makes sense. And I think it does ease things up; it gives you a quick start, because sometimes you're struggling, thinking, what do I write? How do I write it? And then it gives you a jumpstart. And of course, if you apply human intelligence on top and write it in your own style, your own tone of voice and language, then it's a good tool. But if you copy and paste, of course, that's not the ideal approach.


Jon  20:25



Punit  20:26

So with that, Jon, I think it was wonderful and insightful to learn about all these challenges that AI brings, and how you and many others are solving these challenges emerging in the new world. In that sense, thank you so much for sharing your insights. It was a privilege to have you.


Jon  20:45

Yeah, thanks. It's an important topic that I think society is going to be wrestling with for many years to come. So yeah, there's lots of excitement, also some concerns. Happy to be involved in that conversation.


Punit  21:01

It's a start, as they say.


FIT4Privacy  21:04

Thanks for listening. If you liked the show, feel free to share it with a friend and write a review. If you have already done so, thank you so much. And if you did not like the show, don't bother and forget about it. Take care and stay safe. FIT4Privacy helps you create a culture of privacy and manage risks by creating, defining and implementing a privacy strategy that includes delivering scenario-based training for your staff. We also help those who are looking to get certified in CIPP/E, CIPM, and CIPT through on-demand courses that help you prepare and practice for the certification exam. Want to know more? Visit www.fit4privacy.com. That's www.fit4privacy.com. If you have questions or suggestions, drop an email at hello(at)fit4privacy.com.


AI content generation presents both opportunities and challenges for content creators. While AI can be a helpful tool for brainstorming, writing assistance, and data analysis, it's important to ensure the content is original, factual, and aligned with your brand voice. Jon Gillham's company, Originality.ai, provides a valuable suite of tools to help businesses address these challenges and ensure their content is published with integrity. As AI continues to evolve, it's crucial to stay informed and leverage these tools to stay ahead of the curve.


Jon Gillham is an early adopter of generative AI content for SEO purposes at scale. Jon understood the wave that was coming, which ChatGPT and GPT-4 have fully unleashed. He also recognized the need for a modern plagiarism-checking solution that delivered advanced features such as scan history, detection scores, shareable results, and team access, all of which are now integral parts of his tool. Jon's work has garnered attention from renowned publications like The New York Times, The Guardian, and Axios. His expertise in AI content detection is shaping the narrative of the digital era.

Punit Bhatia is one of the leading privacy experts who works independently and has worked with professionals in over 30 countries. Punit works with business and privacy leaders to create an organizational culture with high AI & privacy awareness and compliance as a business priority, by creating and implementing an AI & privacy strategy and policy.

Punit is the author of books “Be Ready for GDPR” which was rated as the best GDPR Book, “AI & Privacy – How to Find Balance”, “Intro To GDPR”, and “Be an Effective DPO”. Punit is a global speaker who has spoken at over 50 global events. Punit is the creator and host of the FIT4PRIVACY Podcast. This podcast has been featured amongst top GDPR and privacy podcasts.

As a person, Punit is an avid thinker and believes in thinking, believing, and acting in line with one's values to have joy in life. He has developed a philosophy named 'ABC for joy of life', which he passionately shares. Punit is based out of Belgium, the heart of Europe.



Listen to the top ranked EU GDPR based privacy podcast...

Stay connected with the views of leading data privacy professionals and business leaders in today's world on a broad range of topics like setting global privacy programs for private sector companies, role of Data Protection Officer (DPO), EU Representative role, Data Protection Impact Assessments (DPIA), Records of Processing Activity (ROPA), security of personal information, data security, personal security, privacy and security overlaps, prevention of personal data breaches, reporting a data breach, securing data transfers, privacy shield invalidation, new Standard Contractual Clauses (SCCs), guidelines from European Commission and other bodies like European Data Protection Board (EDPB), implementing regulations and laws (like EU General Data Protection Regulation or GDPR, California's Consumer Privacy Act or CCPA, Canada's Personal Information Protection and Electronic Documents Act or PIPEDA, China's Personal Information Protection Law or PIPL, India's Personal Data Protection Bill or PDPB), different types of solutions, even new laws and legal framework(s) to comply with a privacy law and much more.