Deep Learning Applied on Arabic language for punctuation marks prediction

Abdelkarim Aboutaib; Imad Zeroual; Ahmad EL Allaoui

doi:10.56294/sctconf2023472

Authors

Abdelkarim Aboutaib, L-STI, T-IDMS, FST Errachidia, Moulay Ismail University of Meknes, Morocco. ORCID: https://orcid.org/0009-0002-8760-1775
Imad Zeroual, L-STI, T-IDMS, FST Errachidia, Moulay Ismail University of Meknes, Morocco. ORCID: https://orcid.org/0000-0002-4454-6369
Ahmad EL Allaoui, L-STI, T-IDMS, FST Errachidia, Moulay Ismail University of Meknes, Morocco. ORCID: https://orcid.org/0000-0002-8897-3565

DOI:

https://doi.org/10.56294/sctconf2023472

Keywords:

Deep Learning, Bi-LSTM, NLP, Attention

Abstract

Arabic text without explicit punctuation is hard to interpret, because its meaning depends heavily on semantics and context; punctuation marks must therefore be reintroduced to clarify sentence structure and meaning. We investigate the impact of sentence length on punctuation prediction in Arabic language processing, using Deep Neural Networks (DNNs), specifically Bi-Directional Long Short-Term Memory (Bi-LSTM) models. Our study goes beyond restoration, aiming to accurately predict punctuation marks in unprocessed text. The investigation focuses on five primary punctuation marks: the period (.), question mark (?), comma (,), colon (:), and exclamation mark (!). It contributes to a more comprehensive understanding of predicting diverse punctuation marks in Arabic texts and achieves 85% accuracy. This research not only advances our understanding of Arabic language processing but also serves as a broader exploration of the relationship between sentence length and punctuation prediction.
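Punctuation prediction of this kind is commonly framed as sequence labeling: each word receives a label naming the mark that follows it, and a tagger such as a Bi-LSTM learns to predict those labels from unpunctuated text. The sketch below illustrates only the data-preparation step for the five marks named in the abstract. The label names, the `to_labeled_tokens` helper, and the mapping of the Arabic comma (،) and question mark (؟) onto the same classes as their Latin counterparts are assumptions made for illustration, not details taken from the paper.

```python
# Illustrative sketch only: label names and the mark mapping are assumptions,
# not the authors' actual preprocessing.
MARKS = {
    ".": "PERIOD",
    "?": "QUESTION", "؟": "QUESTION",  # Latin and Arabic question marks
    ",": "COMMA", "،": "COMMA",        # Latin and Arabic commas
    ":": "COLON",
    "!": "EXCLAMATION",
}
NO_MARK = "O"  # label for a word that is not followed by a mark


def to_labeled_tokens(text):
    """Split punctuated text into (word, label) pairs.

    Each word is labeled with the mark that directly follows it,
    or NO_MARK when no mark follows.
    """
    pairs = []
    strip_chars = "".join(MARKS)
    for chunk in text.split():
        # Peel trailing punctuation off the whitespace-separated chunk.
        word = chunk.rstrip(strip_chars)
        trailing = chunk[len(word):]
        if not word:
            # Stand-alone mark: attach it to the previous unlabeled word.
            if pairs and pairs[-1][1] == NO_MARK:
                pairs[-1] = (pairs[-1][0], MARKS[trailing[0]])
            continue
        label = MARKS[trailing[0]] if trailing else NO_MARK
        pairs.append((word, label))
    return pairs


if __name__ == "__main__":
    for word, label in to_labeled_tokens("هل أنت بخير؟ نعم، شكرا."):
        print(word, label)
```

The words alone would form the model input and the labels the training targets; the number of pairs per sentence gives the sentence length whose effect the study examines.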

