Solving the Data Imbalance Problem for News Classification with the Naïve Bayes Algorithm

ณัฐนันท์ จันโจ้ก; สิทธิพร ตันติบริรักษ์

Please use this identifier to cite or link to this item: https://ir.swu.ac.th/jspui/handle/123456789/15753

Title:	การแก้ไขปัญหาความไม่สมดุลของข้อมูลสำหรับการจำแนกข่าว ด้วยขั้นตอนวิธีนาอีฟเบย์
Other Titles:	Solving The Data Imbalance Problem For News Classification By Naïve Bayes Algorithm
Advisor:	กำพล วรดิษฐ์
Authors:	ณัฐนันท์ จันโจ้ก สิทธิพร ตันติบริรักษ์
Keywords:	News Classification; Naïve Bayes; Python; Data Imbalances; Naïve Bayes Algorithm; Imbalanced Datasets
Issue Date:	2562 BE (2019 CE)
Publisher:	Department of Electrical Engineering, Srinakharinwirot University
Abstract:	This engineering project studies and tests methods for correcting class imbalance in news data, consisting of English-language articles, for news classification with the Naïve Bayes algorithm, which classifies items using conditions derived from the attributes of the data in each category. The amount of data in each category affects classification: when categories contain unequal or widely differing numbers of items, the classifier makes errors and cannot separate the data correctly. News data are imbalanced in this way because the number of articles published each day differs from category to category.
The project examines three kinds of technique for correcting the imbalance: increasing the data (over-sampling), reducing the data (under-sampling), and hybrid techniques. The data are then analyzed according to their attributes to build a Naïve Bayes model, and the model is tested. Predictions are summarized in a confusion matrix, and the model is evaluated with four measures: accuracy, precision, recall, and F1-score. The experiments conclude that random over-sampling gives the best results for Naïve Bayes news classification, because increasing the amount of the original data increases the number of important attributes available for each category and yields high accuracy. Data reduction techniques remove important attributes, so the resulting model is less accurate than a model built from the original, uncorrected data. The Tomek links technique was unable to correct the data imbalance.
URI:	https://ir.swu.ac.th/jspui/handle/123456789/15753
Appears in Collections:	EleEng-Senior Projects
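
The workflow summarized in the abstract (random over-sampling, then Naïve Bayes classification, then precision, recall, and F1-score) can be illustrated with a minimal Python sketch. This is not the authors' implementation, which sits in the restricted PDF. The function names, the toy corpus, the multinomial model, and the add-one smoothing are all assumptions made for illustration, and only the standard library is used.

```python
import math
import random
from collections import Counter, defaultdict


def random_oversample(docs, labels, seed=0):
    """Duplicate randomly chosen minority-class documents until every
    class has as many documents as the largest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for doc, label in zip(docs, labels):
        by_class[label].append(doc)
    target = max(len(items) for items in by_class.values())
    out_docs, out_labels = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_docs.extend(items + extra)
        out_labels.extend([label] * target)
    return out_docs, out_labels


def train_naive_bayes(docs, labels):
    """Collect the counts needed by a multinomial Naive Bayes model."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab


def predict(model, doc):
    """Return the class with the highest log posterior for one document."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in doc.lower().split():
            if word in vocab:  # ignore words never seen in training
                # add-one (Laplace) smoothing, an assumption of this sketch
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall, and F1 for one class from confusion counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up toy corpus: three "politics" documents and one "sports" document.
docs = ["election vote senate", "vote policy law", "senate law debate", "match goal team"]
labels = ["politics", "politics", "politics", "sports"]

balanced_docs, balanced_labels = random_oversample(docs, labels)
model = train_naive_bayes(balanced_docs, balanced_labels)
```

Over-sampling should be applied to the training split only; duplicating documents before splitting would leak copies of training items into the test set and inflate the reported scores.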

Files in This Item:

File	Description	Size	Format
Eng_Nattanan_J.pdf (Restricted Access)		2.42 MB	PDF
