
Authorship attribution using author profiling classifiers

Published online by Cambridge University Press: 19 January 2022

Caio Deutsch
Affiliation:
School of Arts, Sciences and Humanities, University of São Paulo, Av. Arlindo Bettio 1000, São Paulo, Brazil
Ivandré Paraboni*
Affiliation:
School of Arts, Sciences and Humanities, University of São Paulo, Av. Arlindo Bettio 1000, São Paulo, Brazil
*Corresponding author. E-mail: ivandre@usp.br

Abstract

Authorship attribution, the computational task of identifying the author of a given text document within a set of possible candidates, has been attracting interest in Natural Language Processing research for many years. At the same time, significant advances have also been observed in the related field of author profiling, that is, the computational task of learning author demographics, such as gender and age, from text. The close relation between the two topics, both of which focus on gaining knowledge about the individual who wrote a piece of text, suggests that research in these fields may benefit from each other. To illustrate this, the present work addresses the issue of author identification with the aid of author profiling methods, adding demographics predictions to an authorship attribution architecture that may be particularly suitable to extensions of this kind, namely, a stack of classifiers devoted to different aspects of the input text (words, characters and text distortion patterns). The enriched model is evaluated across a range of text domains, languages and author profiling estimators, and its results are shown to compare favourably to those obtained by a standard authorship attribution method that does not have access to author demographics predictions.

Type
Article
Copyright
© The Author(s), 2022. Published by Cambridge University Press
