Pendekatan Teknik Pengecaman Entiti Nama Bagi Capaian Berita Jenayah Bahasa Melayu (Named Entity Recognition Approach for Malay Crime News Retrieval)
Abstract
Pengekstrakan maklumat merupakan satu proses bagi mendapatkan konsep penting dalam mewakili kandungan teks dari dokumen yang tidak berstruktur. Pada masa kini, terdapat banyak dokumen yang tidak berstruktur seperti teks berita, artikel blog, forum, tweet serta mikro blog dari rangkaian sosial. Dokumen-dokumen ini amat sukar untuk difahami oleh komputer. Oleh itu, kajian berkaitan pengekstrakan maklumat menjadi sangat penting bagi mengatasi permasalah ini. Salah satu teknik pengekstrakan yang banyak digunakan ialah pengecaman entiti nama. Kajian ini dijalankan bagi mengimplementasikan teknik pengecaman entiti nama dari sumber dokumen berita jenayah bahasa Melayu. Objektif utama kajian ini adalah untuk membangunkan sistem prototaip model pengekstrakan maklumat berita jenayah dalam bahasa Melayu dengan menggunakan teknik pengecaman entiti nama melalui pendekatan berasaskan peraturan. Kajian ini dilakukan dengan mewujudkan korpus berita jenayah dalam bahasa Melayu yang diperolehi dari sumber arkib berita BERNAMA. Korpus ini kemudiannya diteliti secara manual oleh pakar bahasa bagi mengecam entiti nama seperti individu, organisasi, lokasi, tarikh, masa, kewangan, peratusan, jenayah dan senjata. Dalam masa yang sama, sistem prototaip dibangunkan serta diuji dengan korpus yang sama dan hasil dari pengujian ini dibandingkan dengan keputusan pakar. Secara keseluruhannya, ujian sistem prototaip ini menunjukkan hasil yang baik dengan nilai dapatan bagi recall sebanyak 78.67%, manakala bagi precision ialah sebanyak 71.11% dan F-measure sebanyak 74.7%. Hasil dari kajian ini diharap dapat menyumbang kepada pengetahuan mengenai keberkesanan teknik pengecaman entiti nama bagi berita jenayah bahasa Melayu dan seterusnya dapat membantu para penyelidik, polis, peguam serta pihak berkuasa yang terlibat dalam bidang jenayah menyelesaikan jenayah dengan lebih cepat dan berkesan.
Kata kunci: pengekstrakan maklumat; pengecaman entiti nama; Bahasa Melayu, berita jenayah, pendekatan berasaskan peraturan.
Abstract
Information extraction is the process of obtaining the important concepts that represent the textual content of unstructured documents. At present, there are many unstructured documents, such as news texts, blog articles, forums, tweets and micro-blogs from social networks. These documents are very difficult for a computer to understand. Studies on information extraction are therefore very important for overcoming this problem. One widely used extraction technique is named entity recognition. This study implements named entity recognition on Malay-language crime news documents. Its main objective is to develop a prototype system of an information extraction model for Malay crime news, using named entity recognition through a rule-based approach. The study was carried out by creating a corpus of Malay crime news obtained from the BERNAMA news archive. The corpus was then examined manually by language experts to identify named entities such as person, organization, location, date, time, monetary value, percentage, crime and weapon. At the same time, a prototype system was developed and tested on the same corpus, and the test results were compared with the experts' decisions. Overall, the prototype system performed well, with a recall of 78.67%, a precision of 71.11% and an F-measure of 74.7%. The results of this study are expected to contribute knowledge about the effectiveness of named entity recognition for Malay crime news. This could in turn help researchers, police, lawyers and the authorities involved in the field of crime to solve crimes more quickly and effectively.
Keywords: information extraction; named entity recognition; Malay language; crime news; rule-based approach
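The three scores reported in the abstract are related: the F-measure is conventionally the harmonic mean of precision and recall. The short sketch below checks the reported figures against one another. It assumes the balanced F1 form, which the abstract does not state explicitly, and the function name `f_measure` is purely illustrative.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Return the F-beta score for a precision/recall pair.

    With beta = 1 this is the harmonic mean of precision and recall
    (assumed here; the abstract does not say which variant was used).
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Figures taken from the abstract, expressed as fractions.
reported_precision = 0.7111
reported_recall = 0.7867

score = f_measure(reported_precision, reported_recall)
print(f"F-measure: {score * 100:.2f}%")
```

Under that assumption the computed value agrees with the reported 74.7% to one decimal place, so the three figures are mutually consistent.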
DOI: http://dx.doi.org/10.17576/gema-2018-1804-14
eISSN : 2550-2131
ISSN : 1675-8021