Artificial intelligence in audiobooks: applications and perspectives
Abstract
The use of artificial intelligence (AI) techniques in audiobooks has expanded the possibilities for accessibility, personalization, and immersion, covering aspects from speech recognition and voice synthesis to interactive multimodal experiences and personalized recommendations, in addition to enhancing content retrieval and broadening access to information. This study aimed to identify studies on the use of AI in audiobooks in the academic literature. To this end, a literature review was conducted in the Scopus, Web of Science, ACM Digital Library, IEEE Xplore, and SciELO databases between May and August 2025, resulting in the selection and analysis of 35 articles. The results reveal that the studies fall into four categories: (i) speech recognition; (ii) voice synthesis and personalization; (iii) voice-based experiences; and (iv) generative AI and LLMs. Technical studies focused on automatic speech recognition and voice synthesis predominate, while voice-based experiences and LLM applications are still emerging, indicating future trends. Audiobooks are also frequently used as technical corpora for model development, with few studies aimed at directly improving the user experience, and there is a scarcity of research in the field of Information Science. It can be concluded that, despite recent advances, gaps remain: the lack of user-centered studies, the predominant use of audiobooks as technical corpora, and limited attention to ethical and social aspects. This overview provides theoretical and practical support for future research in the area.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
References
Aldeneh, Z., Perez, M. & Mower Provost, E. (2021). Learning paralinguistic features from audiobooks through style voice conversion. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. https://aclanthology.org/2021.naacl-main.377/ DOI: https://doi.org/10.18653/v1/2021.naacl-main.377
Bardin, L. (2011). Análise de conteúdo. Edições 70.
Barr, A. & Feigenbaum, E. A. (1981). The handbook of artificial intelligence. William Kaufmann Inc.
Bertulfo, L. C., Razon, J. M., San Juan, J. L., Sambrano, J. C. & Medina, R. P. (2017). Gabay Tinig: A 3D interactive audiobook with voice recognition for visually-impaired and blind preschool students using mobile technologies. Proceedings of the 3rd International Conference on Communication and Information Processing (ICCIP ’17). https://dl.acm.org/doi/10.1145/3162957.3162979 DOI: https://doi.org/10.1145/3162957.3162979
Biber, D., Conrad, S. & Reppen, R. (1998). Corpus linguistics: investigating language structure and use. Cambridge University Press. DOI: https://doi.org/10.1017/CBO9780511804489
Borko, H. (1968). Information science: What is it? American documentation, 19(1). https://onlinelibrary.wiley.com/doi/abs/10.1002/asi.5090190103 DOI: https://doi.org/10.1002/asi.5090190103
Brewer, R. N. & Piper, A. M. (2017). XPress: Rethinking design for aging and accessibility through an IVR blogging system. Proceedings of the ACM on Human-Computer Interaction, 1(CSCW). https://dl.acm.org/doi/10.1145/3139354 DOI: https://doi.org/10.1145/3139354
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., ... Amodei, D. (2020). Language models are few-shot learners. In Advances in neural information processing systems (NeurIPS 2020). https://arxiv.org/abs/2005.14165
Chalamandaris, A., Raptis, S., Karabetsos, S. & Tsiakoulis, P. (2014). Using audio books for training a text-to-speech system. In N. Calzolari et al. (Eds.), Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14). European Language Resources Association (ELRA). https://aclanthology.org/L14-1645/ DOI: https://doi.org/10.63317/3izdxbcmh47r
Chen, L., Braunschweiler, N. & Gales, M. J. F. (2015). Speaker and expression factorization for audiobook data: Expressiveness and transplantation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(4). https://ieeexplore.ieee.org/document/6995936 DOI: https://doi.org/10.1109/TASLP.2014.2385478
Chen, X. et al. (2023). StyleSpeech: self-supervised style enhancing with VQ-VAE-based pre-training for expressive audiobook speech synthesis. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). https://doi.org/10.48550/arXiv.2312.12181 DOI: https://doi.org/10.1109/ICASSP48485.2024.10446352
Choi, D. H. (2019). LYRA: an interactive storyteller. 2019 IEEE International Conference on Computational Science and Engineering (CSE) and IEEE International Conference on Embedded and Ubiquitous Computing (EUC). IEEE. https://ieeexplore.ieee.org/document/8919562 DOI: https://doi.org/10.1109/CSE/EUC.2019.00037
Cotton, K., de Vries, K. & Tatar, K. (2024). Singing for the missing: bringing the body back to AI voice and speech technologies. Proceedings of the 9th International Conference on Movement and Computing. ACM. https://dl.acm.org/doi/10.1145/3658852.3659065 DOI: https://doi.org/10.1145/3658852.3659065
Cruz, J. R. D., Villanueva, C. M., Abarro, J. A. & Villanueva, J. L. (2020). Talkie: an assistive web-based educational application using audio files and speech technology for the visually impaired. Proceedings of the 2020 The 6th International Conference on Frontiers of Educational Technologies. ACM. https://dl.acm.org/doi/10.1145/3404709.3404748 DOI: https://doi.org/10.1145/3404709.3404748
Desai, S., Lundy, M. & Chin, J. (2023). “A painless way to learn”: designing an interactive storytelling voice user interface to engage older adults in informal health information learning. CUI ’23: Proceedings of the 5th International Conference on Conversational User Interfaces (No. 5). ACM. https://doi.org/10.1145/3571884.3597141
Gebreegziabher, N. H. & Nürnberger, A. (2020). A light-weight convolutional neural network based speech recognition for spoken content retrieval task. IEEE International Conference on Systems, Man, and Cybernetics (SMC), Toronto, ON, Canada. IEEE. https://doi.org/10.1109/SMC42975.2020.9282956
Gibadullin, R. F., Perukhin, M. Y. & Llin, A. V. (2021). Speech recognition and machine translation using neural networks. International Conference on Industrial Engineering, Applications and Manufacturing (ICIEAM), Sochi, Russia. IEEE. https://doi.org/10.1109/ICIEAM51226.2021.9446474
Godambe, T., Singh, R., Sitaram, S. & Choudhury, M. (2016). Developing a unit selection voice given audio without corresponding text. EURASIP Journal on audio, speech, and music processing, (6). https://doi.org/10.1186/s13636-016-0084-y
Gonçalves, S. S. & Silva, P. N. (2025). Requisitos funcionais para recuperação de informação em audiolivros: uma análise nas plataformas. Informação & informação, 30(1), 354-372. https://doi.org/10.5433/1981-8920.2025v30n1p354
Goodfellow, I., Bengio, Y. & Courville, A. (2016). Deep learning. MIT Press. https://www.deeplearningbook.org/
Have, I. & Pedersen, B. (2019). The audiobook circuit in digital publishing: voicing the silent revolution. New media & society, 22(3), 409-428. https://doi.org/10.1177/1461444819863407
Jennes, I., Blanckaert, E. & Van den Broeck, W. (2023). Immersion or disruption: Readers’ evaluation of and requirements for (3D-) audio as a tool to support immersion in digital reading practices. IMX ’23: Proceedings of the 2023 ACM International Conference on Interactive Media Experiences. ACM. https://dl.acm.org/doi/10.1145/3573381.3596151 DOI: https://doi.org/10.1145/3573381.3596151
Jin, Z. et al. (2024). SpeechCraft: a fine-grained expressive speech dataset with natural language description. Proceedings of the 32nd ACM International Conference on Multimedia. ACM. https://dl.acm.org/doi/10.1145/3664647.3681674 DOI: https://doi.org/10.1145/3664647.3681674
Jurafsky, D. & Martin, J. H. (2023). Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition. Prentice Hall.
Kryvenchuk, Y. & Duda, O. (2024). Audio book creation system using artificial intelligence. IEEE 19th International Conference on Computer Science and Information Technologies (CSIT), Lviv, Ukraine. IEEE. https://doi.org/10.1109/CSIT65290.2024.10982688
Laban, P., Dusek, O., Sharma, R. & Rieser, V. (2022). NewsPod: automatic and interactive news podcasts. 27th International Conference on Intelligent User Interfaces. ACM. https://dl.acm.org/doi/10.1145/3490099.3511147 DOI: https://doi.org/10.1145/3490099.3511147
Lourenço, C. de A. (2005). Modelagem de dados como ferramenta de análise de padrões de metadados em bibliotecas digitais: O padrão de metadados brasileiro para teses e dissertações segundo o modelo entidade-relacionamento (Tese de doutorado). Escola de Ciência da Informação, Universidade Federal de Minas Gerais, Belo Horizonte. https://repositorio.ufmg.br/handle/1843/EARM-6ZGNZC
Mamiya, Y., et al. (2013). Lightly supervised GMM VAD to use audiobook for speech synthesiser. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. https://ieeexplore.ieee.org/document/6639220 DOI: https://doi.org/10.1109/ICASSP.2013.6639220
McEnery, T. & Hardie, A. (2012). Corpus linguistics. Cambridge University Press. DOI: https://doi.org/10.1093/oxfordhb/9780199276349.013.0024
Nadai, M. de, Silva, T., Faria, R. & Ribeiro, L. (2024). Personalized audiobook recommendations at Spotify through graph neural networks. WWW ’24: Companion Proceedings of the ACM Web Conference. ACM. https://dl.acm.org/doi/10.1145/3589335.3648339 DOI: https://doi.org/10.1145/3589335.3648339
Oliveira, D. T. de & Nascimento Silva, P. (2024). Representação e recuperação de dados governamentais abertos: uma revisão de literatura. RDBCI: Revista digital de biblioteconomia e ciência da informação, (22), e024029. https://doi.org/10.20396/rdbci.v22i00.8675828
Oumard, C., Kreimeier, J. & Götzelmann, T. (2022). Implementation and evaluation of a voice user interface with offline speech processing for people who are blind or visually impaired. PETRA ’22: Proceedings of the 15th International Conference on Pervasive Technologies Related to Assistive Environments. ACM. https://dl.acm.org/doi/10.1145/3529190.3529197 DOI: https://doi.org/10.1145/3529190.3529197
Page, M. J., et al. (2021). The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ, 372, n71. https://doi.org/10.1136/bmj.n71
Panayotov, V., Chen, G., Povey, D. & Khudanpur, S. (2015). Librispeech: An ASR corpus based on public domain audio books. 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia. IEEE. https://doi.org/10.1109/ICASSP.2015.7178964
Park, T. H. & Tsuruoka, T. (2019). Generative bookscapes: towards immersive and interactive book reading. International Computer Music Conference, New York City Electroacoustic Music Festival. Fulcrum. https://www.fulcrum.org/epubs/9880vt18d?locale=en#page=3
Pathak, A., Sharma, V., Singh, R. & Choudhury, M. (2024). Emotion-aware text to speech: bridging sentiment analysis and voice synthesis. 3rd International Conference for Innovation in Technology (INOCON), Bangalore, India. IEEE. https://doi.org/10.1109/INOCON60754.2024.10512224
Penha, G., Santos, L., Almeida, F. & Hauff, C. (2025). Contextualizing Spotify’s audiobook list recommendations with descriptive shelves. In C. Hauff et al. (Eds.), Advances in information retrieval (Lecture notes in computer science, 15576). Springer. https://doi.org/10.1007/978-3-031-88720-8_26
Pinheiro, M. & Oliveira, H. (2022). Inteligência artificial: estudos e usos na ciência da informação no Brasil. Revista ibero-americana de ciência da informação, 15(3), 950-968. https://doi.org/10.26512/rici.v15.n3.2022.42767
Porcheron, M., Fischer, J. E., Reeves, S. & Sharples, S. (2018). Voice interfaces in everyday life. CHI ’18: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (Paper 640). ACM. https://dl.acm.org/doi/10.1145/3173574.3174214 DOI: https://doi.org/10.1145/3173574.3174214
Rubery, M. (2016). The untold story of the talking book. Harvard University Press. DOI: https://doi.org/10.4159/9780674974555
Russell, S. & Norvig, P. (2016). Inteligência artificial: uma abordagem moderna. GEN LTC.
Saracevic, T. (1996). Ciência da informação: origem, evolução e relações. Perspectivas em ciência da informação, 1(1). http://hdl.handle.net/20.500.11959/brapci/37415
Serralheiro, A., Ferreira, C. & Costa, P. (2002). Word alignment in digital talking books using WFSTs. In M. Agosti & C. Thanos (Eds.), Research and advanced technology for digital libraries. Springer. https://link.springer.com/chapter/10.1007/3-540-45747-X_37 DOI: https://doi.org/10.1007/3-540-45747-X_37
Schittine, D. (2022). Audiolivros: Desafios de produção, voz do narrador e público-leitor. Scripta, 26(56), 256-269. https://doi.org/10.5752/P.2358-3428.2022v26n56p256-269
Silva, M. B. da & Neves, D. A. de B. (2013). A aplicação da teoria facetada em banco de dados, através da modelagem conceitual. In M. E. B. C. de Albuquerque, L. S. M. A. da Silva, & R. C. C. de Araújo (Eds.), Representação da informação: um universo multifacetado. Editora da UFPB. https://doi.org/10.22477/vii.widat.206
Sodhi, S. S., Singh, A., Ghosh, S. & Shrivastava, M. (2021). Mondegreen: A post-processing solution to speech recognition error correction for voice search queries. Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. ACM. https://dl.acm.org/doi/10.1145/3447548.3467156 DOI: https://doi.org/10.1145/3447548.3467156
Souza Gonçalves, S. & Nascimento Silva, P. (2026). Dados da pesquisa: inteligência artificial em audiolivros: aplicações e perspectivas. Mendeley Data, V2. https://doi.org/10.17632/vtb3ygt62k.3
Sri, K. S., Mounika, C. & Yamini, K. (2022). Audiobooks that converts text, image, PDF-audio & speech-text: For physically challenged & improving fluency. International Conference on Inventive Computation Technologies (ICICT), Nepal. IEEE. https://doi.org/10.1109/ICICT54344.2022.9850872
Subramanian, V., Patel, A., Rao, P. & Kumar, S. (2024). Voice modulation in audiobook narration. 2024 11th International Conference on Soft Computing & Machine Intelligence (ISCMI). IEEE. https://ieeexplore.ieee.org/document/10851662 DOI: https://doi.org/10.1109/ISCMI63661.2024.10851662
Székely, É., O’Connor, N. & Gobl, C. (2012). Synthesizing expressive speech from amateur audiobook recordings. 2012 IEEE Spoken Language Technology Workshop (SLT). IEEE. https://ieeexplore.ieee.org/document/6424239 DOI: https://doi.org/10.1109/SLT.2012.6424239
Tóth, L., Grósz, T., Gosztolya, G. & Hoffmann, I. (2010). Speech recognition experiments with audiobooks. Acta cybernetica, 19(4), 669-682. https://cyber.bibl.u-szeged.hu/index.php/actcybern/article/view/3792
Vít, J. & Matoušek, J. (2016). Unit-selection speech synthesis adjustments for audiobook-based voices. In P. Sojka, A. Horák, I. Kopeček, & K. Pala (Eds.), Text, speech, and dialogue. Springer. https://link.springer.com/chapter/10.1007/978-3-319-45510-5_38 DOI: https://doi.org/10.1007/978-3-319-45510-5_38
Xiao, Y., Li, H., Zhou, K., Zhang, J. & Liu, Y. (2024). Contrastive context-speech pretraining for expressive text-to-speech synthesis. Proceedings of the 32nd ACM International Conference on Multimedia (MM ’24), October 28-November 1, 2024, Melbourne, VIC, Australia. ACM. https://dl.acm.org/doi/10.1145/3664647.3681348 DOI: https://doi.org/10.1145/3664647.3681348
Yahagi, Y., Tanaka, M., Saito, T. & Nakamura, S. (2025). PaperWave: listening to research papers as conversational podcasts scripted by LLM. Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems. ACM. https://dl.acm.org/doi/10.1145/3706599.3706664 DOI: https://doi.org/10.1145/3706599.3706664
Yang, L., Krause, M., Seipp, K. & Ricci, F. (2018). Understanding user interactions with podcast recommendations delivered via voice. Proceedings of the 12th ACM Conference on Recommender Systems. ACM. https://dl.acm.org/doi/10.1145/3240323.3240389 DOI: https://doi.org/10.1145/3240323.3240389
Zhang, Z., Wu, Y., Li, X. & Chen, S. (2024). SpeechLM: enhanced speech pre-training with unpaired textual data. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 32. IEEE. https://ieeexplore.ieee.org/document/10476749 DOI: https://doi.org/10.1109/TASLP.2024.3379877
Zhao, Z. & McEwen, R. (2022). Let’s read a book together: a long-term study on the usage of pre-school children with their home companion robot. 17th ACM/IEEE International Conference on Human-Robot Interaction (HRI), Sapporo, Japan. IEEE. https://doi.org/10.1109/HRI53351.2022.9889672