
TwitterNEED: A hybrid approach for named entity extraction and disambiguation for tweet*

Published online by Cambridge University Press:  10 July 2015

MENA B. HABIB
Affiliation:
Database Chair, University of Twente, Enschede, the Netherlands e-mail: m.b.habib@ewi.utwente.nl, m.vankeulen@ewi.utwente.nl
MAURICE VAN KEULEN
Affiliation:
Database Chair, University of Twente, Enschede, the Netherlands e-mail: m.b.habib@ewi.utwente.nl, m.vankeulen@ewi.utwente.nl

Abstract

Twitter is a rich source of continuously and instantly updated information. The shortness and informality of tweets pose challenges for Natural Language Processing tasks. In this paper, we present TwitterNEED, a hybrid approach for Named Entity Extraction and Named Entity Disambiguation for tweets. We believe that disambiguation can help to improve the extraction process. This mimics the way humans understand language and reduces error propagation in the whole system. Our extraction approach aims for high extraction recall first, after which a Support Vector Machine attempts to filter out false positives among the extracted candidates using features derived from the disambiguation phase in addition to other word shape and Knowledge Base features. For Named Entity Disambiguation, we obtain a list of entity candidates from the YAGO Knowledge Base in addition to top-ranked pages from the Google search engine for each extracted mention. We use a Support Vector Machine to rank the candidate pages according to a set of URL and context similarity features. For evaluation, five data sets are used to evaluate the extraction approach, and three of them are also used to evaluate the disambiguation approach and the combined extraction and disambiguation approach. Experiments show better results than the competing systems DBpedia Spotlight, Stanford Named Entity Recognition, and the AIDA disambiguation system.
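
The extract-then-filter idea in the abstract can be illustrated with a minimal Python sketch. It is not the TwitterNEED implementation: the candidate generator (all token n-grams), the three features, the `kb_lookup` dictionary standing in for Knowledge Base and disambiguation scores, and the hand-set linear weights standing in for the trained Support Vector Machine are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Candidate:
    mention: str                 # surface string taken from the tweet
    features: Dict[str, float]   # feature name -> value


def generate_candidates(tweet: str, max_len: int = 3) -> List[str]:
    """High-recall step: every token n-gram up to max_len is a candidate."""
    tokens = tweet.split()
    spans = []
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            spans.append(" ".join(tokens[i:i + n]))
    return spans


def featurize(mention: str, kb_lookup: Dict[str, float]) -> Candidate:
    """Combine a word-shape feature with (hypothetical) KB and linking scores."""
    words = mention.split()
    key = mention.lower()
    return Candidate(mention, {
        "capitalized": float(all(w[:1].isupper() for w in words)),
        "in_kb": float(key in kb_lookup),
        # Assumed stand-in for a feature fed back from the disambiguation phase.
        "best_link_score": kb_lookup.get(key, 0.0),
    })


def keep(candidate: Candidate, weights: Dict[str, float], bias: float) -> bool:
    """Linear decision rule; a trained SVM would supply weights and bias."""
    score = bias + sum(weights[name] * value
                       for name, value in candidate.features.items())
    return score > 0


def extract(tweet: str, kb_lookup: Dict[str, float],
            weights: Dict[str, float], bias: float) -> List[str]:
    """Generate many candidates, then filter out likely false positives."""
    candidates = [featurize(m, kb_lookup) for m in generate_candidates(tweet)]
    return [c.mention for c in candidates if keep(c, weights, bias)]


# Toy data, invented for this example only.
KB = {"obama": 0.9, "new york": 0.8, "york": 0.2}
WEIGHTS = {"capitalized": 1.0, "in_kb": 1.0, "best_link_score": 2.0}
BIAS = -2.5

if __name__ == "__main__":
    print(extract("Watching Obama speak in New York tonight", KB, WEIGHTS, BIAS))
```

In this toy setting the weak link score of "York" keeps it below the decision threshold while "New York" passes, which is the kind of false-positive filtering the abstract attributes to disambiguation-derived features.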

Type
Articles
Copyright
Copyright © Cambridge University Press 2015 


Footnotes

*

The authors would like to thank Zhemin Zhu for sharing his CRF model (Zhu et al. 2013) and assisting us in applying it. This work is supported by the Dutch national research program COMMIT.

References

Abeel, T., Van de Peer, Y., and Saeys, Y. 2009. Java-ml: a machine learning library. Journal of Machine Learning Research 10: 931–4.
Basave, A. E. C., Varga, A., Rowe, M., Stankovic, M., and Dadzie, A.-S. 2013. Making sense of microposts (#msm2013) concept extraction challenge. In Making Sense of Microposts (#MSM2013) Concept Extraction Challenge, Rio de Janeiro, Brazil, pp. 1–15.
Bontcheva, K., Derczynski, L., Funk, A., Greenwood, M., Maynard, D., and Aswani, N. 2013. Twitie: An open-source information extraction pipeline for microblog text. In Proceedings of the International Conference on Recent Advances in Natural Language Processing. Association for Computational Linguistics, Hissar, Bulgaria, pp. 83–90.
Bunescu, R. C., and Pasca, M. 2006. Using encyclopedic knowledge for named entity disambiguation. In EACL, Trento, Italy, pp. 9–16.
Cano Basave, A. E., Rizzo, G., Varga, A., Rowe, M., Stankovic, M., and Dadzie, A.-S. 2014. Making sense of microposts (#microposts2014) named entity extraction & linking challenge. In Proceedings of the 4th Workshop on Making Sense of Microposts (#Microposts2014), Seoul, South Korea, pp. 54–60.
Castillo, C., Mendoza, M., and Poblete, B. 2011. Information credibility on twitter. In Proceedings of the 20th International Conference on World Wide Web, Hyderabad, India. ACM, pp. 675–84.
Chang, C.-C. and Lin, C.-J. 2011. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2 (3): 27:1–27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Christoforaki, M., Erunse, I., and Yu, C. 2011. Searching social updates for topic-centric entities. In Proceedings of the 1st International Workshop on Searching and Integrating New Web Data Sources – Very Large Data Search (VLDS), Seattle, WA, USA, pp. 34–9.
Cucerzan, S. 2007. Large-scale named entity disambiguation based on Wikipedia data. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Prague, Czech Republic, pp. 708–16.
Cunningham, H., Maynard, D., Bontcheva, K., and Tablan, V. 2002. GATE: a framework and graphical development environment for robust NLP tools and applications. In Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL’02), Philadelphia, Pennsylvania, USA, pp. 168–75.
Dann, S. 2010. Twitter content classification. First Monday 15 (12), http://firstmonday.org/ojs/index.php/fm/article/viewArticle/2745/2681.
Davis, A., Veloso, A., da Silva, A. S., Meira, W. Jr, and Laender, A. H. F. 2012. Named entity disambiguation in streaming data. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers – Volume 1, ACL ’12, Jeju Island, Korea, pp. 815–24.
Delgado, A. D., Martínez, R., Pérez García-Plaza, A., and Fresno, V. 2012. Unsupervised Real-Time company name disambiguation in twitter. In Workshop on Real-Time Analysis and Mining of Social Streams (RAMSS), Palo Alto, California, USA, pp. 25–8.
Derczynski, L. and Bontcheva, K. 2013. Mining social media with linked open data, entity recognition, and event extraction. In Proceedings of the 3rd Workshop on Data Extraction and Object Search (DEOS 2013), Oxford, UK.
Dice, L. R. 1945. Measures of the amount of ecologic association between species. Ecology 26 (3): 297–302.
Finkel, J. R., Grenager, T., and Manning, C. 2005. Incorporating non-local information into information extraction systems by gibbs sampling. In ACL, University of Michigan, USA, pp. 363–70.
Gimpel, K., Schneider, N., O’Connor, B., Das, D., Mills, D., Eisenstein, J., Heilman, M., Yogatama, D., Flanigan, J., and Smith, N. A. 2011. Part-of-speech tagging for twitter: annotation, features, and experiments. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short papers – Volume 2, HLT ’11, Portland, Oregon, USA, pp. 42–7.
Gupta, P., Goel, A., Lin, J., Sharma, A., Wang, D., and Zadeh, R. 2013. Wtf: the who to follow service at twitter. In Proceedings of the 22nd International Conference on World Wide Web, WWW ’13, Rio de Janeiro, Brazil, pp. 505–14.
Habib, M. B. and van Keulen, M. 2012a. Improving toponym disambiguation by iteratively enhancing certainty of extraction. In Proceedings of the 4th International Conference on Knowledge Discovery and Information Retrieval, KDIR 2012, Barcelona, Spain. SciTePress, pp. 399–410.
Habib, M. B. and van Keulen, M. 2012b. Unsupervised improvement of named entity extraction in short informal context using disambiguation clues. In Proc. of the Workshop on Semantic Web and Information Extraction (SWAIE 2012), Galway, Ireland, pp. 1–10.
Habib, M. B. and van Keulen, M. 2013. A hybrid approach for robust multilingual toponym extraction and disambiguation. In IIS, Warsaw, Poland, pp. 1–15.
Hoffart, J., Yosef, M. A., Bordino, I., Fürstenau, H., Pinkal, M., Spaniol, M., Taneva, B., Thater, S., and Weikum, G. 2011. Robust disambiguation of named entities in text. In Proceedings of EMNLP 2011, Edinburgh, Scotland, UK, pp. 782–92.
Howard, P. and Hussain, M. 2013. Democracy’s Fourth Wave?: Digital Media and the Arab Spring, Oxford Studies in Digital Politics. USA: OUP.
Jung, J. J. 2012. Online named entity recognition method for microtexts in social networking services: a case study of twitter. Expert Systems with Applications 39 (9): 8066–70.
Kulkarni, S., Singh, A., Ramakrishnan, G., and Chakrabarti, S. 2009. Collective annotation of wikipedia entities in web text. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’09, Paris, France, pp. 457–66.
Li, C., Weng, J., He, Q., Yao, Y., Datta, A., Sun, A., and Lee, B.-S. 2012. Twiner: named entity recognition in targeted twitter stream. In SIGIR, Portland, Oregon, USA, pp. 721–30.
Li, L., Yu, Z., Zou, J., Su, L., Xian, Y., and Mao, C. 2009. Research on the method of entity homepage recognition. Journal of Computational Information Systems (JCIS) 5 (4): 1617–24.
Lin, T., Mausam, and Etzioni, O. 2012. Entity linking at web scale. In Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction (AKBC-WEKEX), Montreal, Canada, pp. 84–8.
Locke, B., and Martin, J. 2009. Named entity recognition: adapting to microblogging. Senior Thesis, University of Colorado.
MacKay, D. J., and Peto, L. C. B. 1994. A hierarchical dirichlet language model. Natural Language Engineering 1: 1–19.
Marsh, E., and Perzanowski, D. 1998. Muc-7 evaluation of ie technology: overview of results. In Proceedings of the 7th Message Understanding Conference (MUC-7).
McCallum, A., and Li, W. 2003. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In Proceedings of CoNLL 2003, Edmonton, Canada, pp. 188–91.
Mendes, P. N., Jakob, M., García-Silva, A., and Bizer, C. 2011. Dbpedia spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems, I-Semantics ’11, New York, NY, USA. ACM, pp. 1–8.
Ritter, A., Clark, S., Mausam, and Etzioni, O. 2011. Named entity recognition in tweets: An experimental study. In Proceedings of EMNLP 2011, Edinburgh, Scotland, UK, pp. 1524–34.
Rizzo, G. and Troncy, R. 2011. Nerd: Evaluating named entity recognition tools in the web of data. In ISWC’11, Workshop on Web Scale Knowledge Extraction (WEKEX’11), Bonn, Germany.
Spina, D., Amigó, E., and Gonzalo, J. 2011. Filter keywords and majority class strategies for company name disambiguation in twitter. In Proceedings of the 2nd International Conference on Multilingual and Multimodal Information Access Evaluation, CLEF’11, Amsterdam, The Netherlands, pp. 50–61.
Srinivasan, H., Chen, J., and Srihari, R. 2009. Cross document person name disambiguation using entity profiles. In Proceedings of the Text Analysis Conference (TAC) Workshop, Gaithersburg, Maryland, USA.
Steiner, T., Verborgh, R., Gabarró Vallés, J., and Van de Walle, R. 2013. Adding meaning to social network microposts via multiple named entity disambiguation apis and tracking their data provenance. International Journal of Computer Information Systems and Industrial Management 5: 69–78.
Suchanek, F. M., Kasneci, G., and Weikum, G. 2007. Yago: a core of semantic knowledge. In Proc. of the 16th International Conference on World Wide Web, WWW ’07, Banff, Alberta, Canada, pp. 697–706.
Sullivan, S. J., Schneiders, A. G., Cheang, C.-W., Kitto, E., Lee, H., Redhead, J., Ward, S., Ahmed, O. H., and McCrory, P. R. 2012. What’s happening? A content analysis of concussion-related traffic on twitter. British Journal of Sports Medicine 46 (4): 258–63.
Sutton, C. and McCallum, A. 2005. Piecewise training of undirected models. In Proceedings of UAI, Edinburgh, Scotland, UK, pp. 568–75.
Verma, M., Divya, and Sofat, S. 2014. Article: Techniques to detect spammers in twitter - a survey. International Journal of Computer Applications 85 (10): 27–32.
Wang, C., Chakrabarti, K., Cheng, T., and Chaudhuri, S. 2012. Targeted disambiguation of ad-hoc, homogeneous sets of named entities. In Proceedings of the 21st International Conference on World Wide Web, WWW ’12, Lyon, France, pp. 719–28.
Wang, K., Thrasher, C., Viegas, E., Li, X., and Hsu, B.-J. P. 2010. An overview of microsoft web n-gram corpus and applications. In Proceedings of the NAACL HLT 2010, Los Angeles, California, USA, pp. 45–8.
Westerveld, T., Kraaij, W., and Hiemstra, D. 2002. Retrieving web pages using content, links, urls and anchors. In Proceedings of the 10th Text REtrieval Conference, TREC 2001, vol. SP 500, Gaithersburg, Maryland, USA, pp. 663–72.
Winkels, M. 2013. The global social network landscape: a country-by-country guide to social network usage. http://www.optimediaintelligence.es/noticias_archivos/719_20130715123913.pdf.
Wu, T.-F., Lin, C.-J., and Weng, R. C. 2004. Probability estimates for multi-class classification by pairwise coupling. Journal of Machine Learning Research 5: 975–1005.
Yerva, S. R., Miklós, Z., and Aberer, K. 2012. Entity-based classification of twitter messages. IJCSA 9 (1): 88–115.
Yosef, M., Hoffart, J., Bordino, I., Spaniol, M., and Weikum, G. 2011. Aida: An online tool for accurate disambiguation of named entities in text and tables. Proc. of the VLDB Endowment 4 (12): 1450–53.
Zhai, C. and Lafferty, J. 2001. A study of smoothing methods for language models applied to ad hoc information retrieval. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’01, New Orleans, Louisiana, USA, pp. 334–42.
Zhu, Z., Hiemstra, D., Apers, P. M. G., and Wombacher, A. 2012. Separate training for conditional random fields using co-occurrence rate factorization. Technical Report TR-CTIT-12-29, Centre for Telematics and Information Technology, University of Twente, Enschede.
Zhu, Z., Hiemstra, D., Apers, P. M. G., and Wombacher, A. 2013. Closed form maximum likelihood estimator of conditional random fields. Technical Report TR-CTIT-13-03, Centre for Telematics and Information Technology, University of Twente, Enschede.