
20 - AI, Human–Robot Interaction, and Natural Language Processing

from Part V - Advances in Multimodal and Technological Context-Based Research

Published online by Cambridge University Press:  30 November 2023

Jesús Romero-Trillo
Affiliation:
Universidad Autónoma de Madrid

Summary

An AI-driven (or AI-assisted) speech or dialogue system can, from an engineering perspective, be decomposed into a pipeline comprising some subset of three distinct processing activities: (1) speech processing, which turns sampled acoustic sound waves into enriched phonetic information through automatic speech recognition (ASR), and vice versa through text-to-speech (TTS); (2) natural language processing (NLP), which operates at both syntactic and semantic levels to get at the meanings of words as well as of the enriched phonetic information; and (3) dialogue processing, which ties the two together so that the system can function within the specified latency and semantic constraints. This perspective allows for at least three levels of context. The lowest level is phonetic, where the fundamental components of speech are built from a time-ordered sequence of acoustic symbols (analyzed in ASR or generated in TTS). The next level up is word- or character-level context, normally framed as sequence-to-sequence modeling. The highest level of context typically used today keeps track of a conversation or topic. An even higher level, generally missing today but essential in the future, is that of our beliefs, desires, and intentions.
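The decomposition described in the summary can be pictured as a chain of stages, each working at its own level of context. The following is only a minimal sketch under loud assumptions: every name in it (`DialogueContext`, `toy_asr`, `toy_nlp`, `toy_dialogue`, `toy_tts`, `run_turn`) is hypothetical and invented for illustration, and each stage is a trivial stand-in rather than a real recognizer, parser, dialogue manager, or synthesizer. It is not the chapter's implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DialogueContext:
    """Conversation-level context: a running history of user turns (hypothetical)."""
    history: List[str] = field(default_factory=list)


def toy_asr(symbols: List[str]) -> str:
    """Stand-in for ASR: joins a time-ordered symbol sequence into text."""
    return "".join(symbols)


def toy_nlp(text: str) -> List[str]:
    """Stand-in for NLP: lowercases the text and splits it into word tokens."""
    return text.lower().split()


def toy_dialogue(tokens: List[str], ctx: DialogueContext) -> str:
    """Stand-in for dialogue processing: records the turn and drafts a reply."""
    ctx.history.append(" ".join(tokens))
    return f"turn {len(ctx.history)}: heard {len(tokens)} word(s)"


def toy_tts(text: str) -> List[str]:
    """Stand-in for TTS: turns reply text back into a symbol sequence."""
    return list(text)


def run_turn(symbols: List[str], ctx: DialogueContext) -> List[str]:
    """Runs one user turn through all four toy stages in order."""
    text = toy_asr(symbols)             # phonetic-level context
    tokens = toy_nlp(text)              # word-level context
    reply = toy_dialogue(tokens, ctx)   # conversation-level context
    return toy_tts(reply)
```

Calling `run_turn` repeatedly with the same `DialogueContext` shows how the conversation-level state persists across turns while the lower stages stay stateless. A real system would typically run only the subset of stages it needs, for example skipping `toy_asr` and `toy_tts` for a text-only chatbot.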

Type: Chapter
Publisher: Cambridge University Press
Print publication year: 2023


