The Development of Malaysian Corpus of Financial English (MaCFE)

Roslan Sadjirin; Roslina Abdul Aziz; Noli Maishara Nordin; Mohd Rozaidi Ismail; Norzie Diana Baharum

doi:10.17576/gema-2018-1803-05

Abstract

This paper presents the processes involved in the design and development of the Malaysian Corpus of Financial English (MaCFE), a specialized corpus containing a wide range of online documents (i.e., communiqués) from various financial institutions in Malaysia. It describes in detail the collection and selection of data and the preprocessing of raw data, which includes digitizing, cleansing and tagging. The paper also introduces the MaCFE user interface and its built-in linguistic analysis features. MaCFE was designed and developed to give corpus linguistics researchers an avenue to explore the field, and to give ESP/EAP practitioners in Malaysia a resource for developing locally based ESP/EAP curricula and teaching and learning materials. It would also serve as a learning resource for future financial professionals in their training. The MaCFE corpus contains approximately 4.3 million words from 1,472 electronic documents retrieved from the official websites of banks and financial institutions. At present, users can query the MaCFE database using its built-in concordancer. In the future, its language-data-processing facilities will be expanded to include tools for keyword, wordlist and collocation queries.
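As an illustration of what a concordance query returns, the following is a minimal keyword-in-context (KWIC) sketch in Python. It is not the MaCFE implementation: the function names `tokenize` and `kwic`, the `width` parameter, the tokenization rule and the sample sentence are all assumptions made for this example.

```python
import re

def tokenize(text):
    """Split text into word tokens, keeping internal apostrophes and hyphens."""
    return re.findall(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*", text)

def kwic(tokens, node, width=4):
    """Return (left, keyword, right) tuples for each case-insensitive match of node.

    `width` is the number of context tokens kept on each side (hypothetical default).
    """
    hits = []
    target = node.lower()
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits

# Invented sample text, not taken from the corpus.
sample = ("The bank offers Islamic financing to customers. "
          "Each customer may contact the Bank for details on financing rates.")
tokens = tokenize(sample)
lines = kwic(tokens, "bank", width=3)

# Print the hits with the keyword aligned in a central column.
for left, kw, right in lines:
    print(f"{left:>30}  {kw}  {right}")
```

Each hit keeps the keyword in its original casing with a fixed window of context on either side, which is the usual layout of concordance lines. A production concordancer would typically index the corpus in advance rather than scan the token list on every query.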

Keywords

corpus linguistics; specialized corpus; financial English; ESP; EAP

