Teknik Pengukuhan Perangkak Tumpuan melalui Modul Pengesan Bahasa bagi Capaian Web Bahasa Melayu (Focused Crawler Enhancement Technique with Language Detection Module for Malay Web Retrieval)

Masnizah Mohd, Wan Fariza Paizi@Fauzi, Amri Jasin

Abstract


Perangkak ialah antara komponen utama dalam seni bina sistem capaian maklumat atau enjin gelintar. Ia berfungsi mengumpul laman web yang relevan bertujuan untuk diuruskan melalui pengindeksan maklumat pautan dan kandungan. Perangkak tumpuan adalah aplikasi perangkak yang direka khas untuk memilih dan mengumpul laman web yang mempunyai kaitan tentang domain atau pertanyaan khusus di Internet. Perangkak yang baik mampu memberikan keputusan maklumat yang tepat, pantas, luas dan relevan kepada pengguna semasa proses pencarian maklumat menggunakan enjin gelintar. Kesukaran malah ketidakupayaan mengesan pautan serta kandungan berbahasa Melayu merupakan antara isu utama. Kesannya ialah ada di antara kandungan laman web Bahasa Melayu tidak dapat diindeks seterusnya diproses untuk capaian maklumat. Malah kekurangan perangkak yang khusus bagi carian laman web Bahasa Melayu sebagai bahasa carian utama menjadi pendorong utama penyelidikan ini. Maka objektif utama kajian ini adalah untuk mengenalpasti strategi merangkak yang baik untuk perangkak tertumpu memilih pautan yang relevan dan berkualiti berdasarkan pertanyaan Bahasa Melayu. Perangkak tumpuan yang digunakan dalam penyelidikan ini telah melalui pengubahsuaian hasil daripada gabungan beberapa teknik pengukuhan merangkak. Hasil pengujian yang berulang menunjukkan kehadiran modul pengukuhan perangkak tumpuan telah memberi keputusan yang baik iaitu berupaya mengesan laman web bahasa Melayu yang tepat. Penyelidikan ini juga menjadi titik tolak kepada perkembangan pencarian maklumat berdasarkan pertanyaan Bahasa Melayu di Internet, di samping dapat memartabatkan Bahasa Melayu di dunia siber.

 

Kata Kunci: perangkak; capaian maklumat; Bahasa Melayu; enjin gelintar; web

 

ABSTRACT

 

A crawler is one of the major components in the architecture of an information retrieval system or search engine. Its function is to gather relevant web pages so that they can be managed through the indexing of links and content. A focused crawler is a crawler application designed to select and collect web pages that are relevant to a specific domain or query on the Internet. A good crawler can provide accurate, fast, extensive and relevant results to the user during information seeking with a search engine. The difficulty of detecting, or outright inability to detect, Malay-language links and content is one of the main issues. As a result, some Malay web content cannot be indexed and subsequently processed for information retrieval. The lack of crawlers dedicated to searching Malay websites, with Malay as the primary search language, motivated this research. The main objective of this study is to identify good crawling strategies that allow a focused crawler to select relevant, high-quality links based on Malay-language queries. The focused crawler employed in this research was modified by combining several crawling enhancement techniques. Repeated testing indicates that the focused crawler enhancement module gives good results: it detects Malay-language web pages accurately. This research is also a starting point for the development of information retrieval based on Malay-language queries on the Internet, and it helps raise the prominence of the Malay language in cyberspace.

 

Keywords: crawler; information retrieval; Malay language; search engine; web








DOI: http://dx.doi.org/10.17576/gema-2018-1803-10



 

 

 

eISSN : 2550-2131

ISSN : 1675-8021