Protecting originality: Detecting plagiarism and AI-generated content using Machine Learning
Traditionally, plagiarism mainly meant copying content verbatim from other sources without citing them. However, the emergence of ChatGPT, Bard, and other large language models makes it possible to generate text on any subject automatically. In today’s digital landscape, the prevalence of plagiarism and AI-generated content (AIGC) poses a serious threat to educational institutions, as there are no reliable tools to validate the authenticity of work, and recent research revealed that content generated by the GPT-4 model was more difficult to detect as AI-generated than content from GPT-3.5. In addition, research conducted by the University of California, Berkeley revealed that 40-70% of students admitted to cheating.
A preliminary state-of-the-art review of AI-generated content detection and plagiarism detection identified the following research gaps. Existing plagiarism detection algorithms can tackle literal plagiarism such as clone plagiarism and synonym plagiarism. However, most of them struggle with intelligent plagiarism techniques such as paraphrasing and idea plagiarism. Existing AIGC detection algorithms mostly fail to identify paraphrased versions of content generated by generative language models as AI-generated. There are almost no implementations tackling both use cases: plagiarism detection and AIGC detection. Existing research is mainly focused on the English language, and there is very little research on multilingual text.
This thesis endeavors to address several research questions. Firstly, the research assesses the efficiency of existing algorithms for plagiarism detection and AI-generated content detection. Secondly, the study seeks to find out the best technical and non-technical approaches to tackle the problem of AI-generated plagiarism. Thirdly, the project intends to find out which machine learning techniques, including architecture, feature engineering, and deep learning algorithms, would yield the optimal results for detecting originality in textual documents across multilingual environments and effectively address plagiarism and AI-generated content detection. Additionally, the project will help to develop the most effective policies, guidance, good practices, and procedures in academia to avoid, detect, and mitigate such plagiarism.
This research project aims to help public and private institutions solve a massive problem: validating that all researchers and students submit original work, which is essential for maintaining the academic integrity of educational institutions and research organizations. With the help of generative AI and paraphrasing tools, people have been able to violate the intellectual property of many authors and researchers. By developing effective algorithms for detecting plagiarism and AI-generated content, this research project will help protect the intellectual property rights of authors and researchers. Thus, we propose the development of machine learning solutions to detect both traditional and AI-generated plagiarism, as well as other techniques and guidance, to ensure the authenticity of academic conduct. The project will primarily focus on the English language and consider other languages in its second phase.
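As a rough illustration of the research gap on paraphrasing noted earlier, the following minimal sketch scores the overlap of word trigrams between two texts, a simple lexical measure of the kind that catches literal copying. It is not the project's method: the function names, the trigram size, and the example sentences are all assumptions made for illustration.

```python
import re


def word_ngrams(text, n=3):
    """Return the set of word n-grams in a text (lowercased, punctuation dropped)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard_overlap(a, b, n=3):
    """Jaccard similarity between the word n-gram sets of two texts."""
    grams_a, grams_b = word_ngrams(a, n), word_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Hypothetical example texts (not taken from any dataset).
source = "Machine learning models can detect copied passages in student essays."
literal_copy = "Machine learning models can detect copied passages in student essays."
paraphrase = "Duplicated text within pupils' written work is identifiable by trained algorithms."

literal_score = jaccard_overlap(source, literal_copy)
paraphrase_score = jaccard_overlap(source, paraphrase)

print(f"literal copy: {literal_score:.2f}")
print(f"paraphrase:   {paraphrase_score:.2f}")
```

In this toy case the literal copy shares every trigram with the source, while the paraphrase shares none even though it conveys the same idea, which is why purely lexical matching is unlikely to be sufficient for paraphrased or idea-level plagiarism.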