Artificial intelligence in audiobooks: applications and perspectives
Abstract
The use of artificial intelligence (AI) techniques in audiobooks has expanded the possibilities for accessibility, personalization, and immersion, covering aspects from speech recognition and voice synthesis to interactive multimodal experiences and personalized recommendations, in addition to enhancing content retrieval and broadening access to information. This study aimed to identify studies on the use of AI in audiobooks in the academic literature. To this end, a literature review was conducted in the Scopus, Web of Science, ACM Digital Library, IEEE Xplore, and SciELO databases between May and August 2025, resulting in the selection and analysis of 35 articles. The results reveal that the studies fall into four categories: (i) speech recognition; (ii) voice synthesis and personalization; (iii) voice-based experiences; and (iv) generative AI and LLMs. Technical studies focused on automatic speech recognition and voice synthesis predominate, while voice-based experiences and LLM applications are still emerging, indicating future trends. Audiobooks are also frequently used as technical corpora for model development, with few studies aimed at directly improving the user experience, and there is a scarcity of research in the field of Information Science. It can be concluded that, despite recent advances, gaps remain: the lack of user-centered studies, the predominant use of audiobooks as technical corpora, and limited attention to ethical and social aspects. This overview provides theoretical and practical support for future research in the area.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
References
Aldeneh, Z., Perez, M. & Mower Provost, E. (2021). Learning paralinguistic features from audiobooks through style voice conversion. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. https://aclanthology.org/2021.naacl-main.377/ DOI: https://doi.org/10.18653/v1/2021.naacl-main.377
Bardin, L. (2011). Análise de conteúdo. Edições 70.
Barr, A. & Feigenbaum, E. A. (1981). The handbook of artificial intelligence. William Kaufmann Inc.
Bertulfo, L. C., Razon, J. M., San Juan, J. L., Sambrano, J. C. & Medina, R. P. (2017). Gabay Tinig: A 3D interactive audiobook with voice recognition for visually-impaired and blind preschool students using mobile technologies. Proceedings of the 3rd International Conference on Communication and Information Processing (ICCIP ’17). https://dl.acm.org/doi/10.1145/3162957.3162979 DOI: https://doi.org/10.1145/3162957.3162979
Biber, D., Conrad, S. & Reppen, R. (1998). Corpus linguistics: investigating language structure and use. Cambridge University Press. DOI: https://doi.org/10.1017/CBO9780511804489
Borko, H. (1968). Information science: What is it? American documentation, 19(1). https://onlinelibrary.wiley.com/doi/abs/10.1002/asi.5090190103 DOI: https://doi.org/10.1002/asi.5090190103
Brewer, R. N. & Piper, A. M. (2017). XPress: Rethinking design for aging and accessibility through an IVR blogging system. Proceedings of the ACM on Human-Computer Interaction, 1(CSCW). https://dl.acm.org/doi/10.1145/3139354 DOI: https://doi.org/10.1145/3139354
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., ... Amodei, D. (2020). Language models are few-shot learners. In Advances in neural information processing systems (NeurIPS 2020). https://arxiv.org/abs/2005.14165
Chalamandaris, A., Raptis, S., Karabetsos, S. & Tsiakoulis, P. (2014). Using audio books for training a text-to-speech system. In N. Calzolari et al. (Eds.), Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14). European Language Resources Association (ELRA). https://aclanthology.org/L14-1645/ DOI: https://doi.org/10.63317/3izdxbcmh47r
Chen, L., Braunschweiler, N. & Gales, M. J. F. (2015). Speaker and expression factorization for audiobook data: Expressiveness and transplantation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(4). https://ieeexplore.ieee.org/document/6995936 DOI: https://doi.org/10.1109/TASLP.2014.2385478
Chen, X. et al. (2023). StyleSpeech: self-supervised style enhancing with VQ-VAE-based pre-training for expressive audiobook speech synthesis. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). https://doi.org/10.48550/arXiv.2312.12181 DOI: https://doi.org/10.1109/ICASSP48485.2024.10446352
Choi, D. H. (2019). LYRA: an interactive storyteller. 2019 IEEE International Conference on Computational Science and Engineering (CSE) and IEEE International Conference on Embedded and Ubiquitous Computing (EUC). IEEE. https://ieeexplore.ieee.org/document/8919562 DOI: https://doi.org/10.1109/CSE/EUC.2019.00037
Cotton, K., de Vries, K. & Tatar, K. (2024). Singing for the missing: bringing the body back to AI voice and speech technologies. Proceedings of the 9th International Conference on Movement and Computing. ACM. https://dl.acm.org/doi/10.1145/3658852.3659065 DOI: https://doi.org/10.1145/3658852.3659065
Cruz, J. R. D., Villanueva, C. M., Abarro, J. A. & Villanueva, J. L. (2020). Talkie: an assistive web-based educational application using audio files and speech technology for the visually impaired. Proceedings of the 2020 The 6th International Conference on Frontiers of Educational Technologies. ACM. https://dl.acm.org/doi/10.1145/3404709.3404748 DOI: https://doi.org/10.1145/3404709.3404748
Desai, S., Lundy, M. & Chin, J. (2023). “A painless way to learn”: designing an interactive storytelling voice user interface to engage older adults in informal health information learning. CUI ’23: Proceedings of the 5th International Conference on Conversational User Interfaces (No. 5). ACM. https://doi.org/10.1145/3571884.3597141
Gebreegziabher, N. H. & Nürnberger, A. (2020). A light-weight convolutional neural network based speech recognition for spoken content retrieval task. IEEE International Conference on Systems, Man, and Cybernetics (SMC), Toronto, ON, Canada. IEEE. https://doi.org/10.1109/SMC42975.2020.9282956
Gibadullin, R. F., Perukhin, M. Y. & Llin, A. V. (2021). Speech recognition and machine translation using neural networks. International Conference on Industrial Engineering, Applications and Manufacturing (ICIEAM), Sochi, Russia. IEEE. https://doi.org/10.1109/ICIEAM51226.2021.9446474
Godambe, T., Singh, R., Sitaram, S. & Choudhury, M. (2016). Developing a unit selection voice given audio without corresponding text. EURASIP Journal on audio, speech, and music processing, (6). https://doi.org/10.1186/s13636-016-0084-y
Gonçalves, S. S. & Silva, P. N. (2025). Requisitos funcionais para recuperação de informação em audiolivros: uma análise nas plataformas. Informação & informação, 30(1), 354-372. https://doi.org/10.5433/1981-8920.2025v30n1p354
Goodfellow, I., Bengio, Y. & Courville, A. (2016). Deep learning. MIT Press. https://www.deeplearningbook.org/
Have, I. & Pedersen, B. (2019). The audiobook circuit in digital publishing: voicing the silent revolution. New media & society, 22(3), 409-428. https://doi.org/10.1177/1461444819863407
Jennes, I., Blanckaert, E. & Van den Broeck, W. (2023). Immersion or disruption: Readers’ evaluation of and requirements for (3D-) audio as a tool to support immersion in digital reading practices. IMX ’23: Proceedings of the 2023 ACM International Conference on Interactive Media Experiences. ACM. https://dl.acm.org/doi/10.1145/3573381.3596151 DOI: https://doi.org/10.1145/3573381.3596151
Jin, Z. et al. (2024). SpeechCraft: a fine-grained expressive speech dataset with natural language description. Proceedings of the 32nd ACM International Conference on Multimedia. ACM. https://dl.acm.org/doi/10.1145/3664647.3681674 DOI: https://doi.org/10.1145/3664647.3681674
Jurafsky, D. & Martin, J. H. (2023). Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition. Prentice Hall.
Kryvenchuk, Y. & Duda, O. (2024). Audio book creation system using artificial intelligence. IEEE 19th International Conference on Computer Science and Information Technologies (CSIT), Lviv, Ukraine. IEEE. https://doi.org/10.1109/CSIT65290.2024.10982688
Laban, P., Dusek, O., Sharma, R. & Rieser, V. (2022). NewsPod: automatic and interactive news podcasts. 27th International Conference on Intelligent User Interfaces. ACM. https://dl.acm.org/doi/10.1145/3490099.3511147 DOI: https://doi.org/10.1145/3490099.3511147
Lourenço, C. de A. (2005). Modelagem de dados como ferramenta de análise de padrões de metadados em bibliotecas digitais: O padrão de metadados brasileiro para teses e dissertações segundo o modelo entidade-relacionamento (Tese de doutorado). Escola de Ciência da Informação, Universidade Federal de Minas Gerais, Belo Horizonte. https://repositorio.ufmg.br/handle/1843/EARM-6ZGNZC
Mamiya, Y., et al. (2013). Lightly supervised GMM VAD to use audiobook for speech synthesiser. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. https://ieeexplore.ieee.org/document/6639220 DOI: https://doi.org/10.1109/ICASSP.2013.6639220
McEnery, T. & Hardie, A. (2012). Corpus linguistics. Cambridge University Press. DOI: https://doi.org/10.1093/oxfordhb/9780199276349.013.0024
Nadai, M. de, Silva, T., Faria, R. & Ribeiro, L. (2024). Personalized audiobook recommendations at Spotify through graph neural networks. WWW ’24: Companion Proceedings of the ACM Web Conference. ACM. https://dl.acm.org/doi/10.1145/3589335.3648339 DOI: https://doi.org/10.1145/3589335.3648339
Oliveira, D. T. de & Nascimento Silva, P. (2024). Representação e recuperação de dados governamentais abertos: uma revisão de literatura. RDBCI: Revista digital de biblioteconomia e ciência da informação, (22), e024029. https://doi.org/10.20396/rdbci.v22i00.8675828
Oumard, C., Kreimeier, J. & Götzelmann, T. (2022). Implementation and evaluation of a voice user interface with offline speech processing for people who are blind or visually impaired. PETRA ’22: Proceedings of the 15th International Conference on Pervasive Technologies Related to Assistive Environments. ACM. https://dl.acm.org/doi/10.1145/3529190.3529197 DOI: https://doi.org/10.1145/3529190.3529197
Page, M. J., et al. (2021). The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ, 372, n71. https://doi.org/10.1136/bmj.n71
Panayotov, V., Chen, G., Povey, D. & Khudanpur, S. (2015). Librispeech: An ASR corpus based on public domain audio books. 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia. IEEE. https://doi.org/10.1109/ICASSP.2015.7178964
Park, T. H. & Tsuruoka, T. (2019). Generative bookscapes: towards immersive and interactive book reading. International Computer Music Conference, New York City Electroacoustic Music Festival. Fulcrum. https://www.fulcrum.org/epubs/9880vt18d?locale=en#page=3
Pathak, A., Sharma, V., Singh, R. & Choudhury, M. (2024). Emotion-aware text to speech: bridging sentiment analysis and voice synthesis. 3rd International Conference for Innovation in Technology (INOCON), Bangalore, India. IEEE. https://doi.org/10.1109/INOCON60754.2024.10512224
Penha, G., Santos, L., Almeida, F. & Hauff, C. (2025). Contextualizing Spotify’s audiobook list recommendations with descriptive shelves. In C. Hauff et al. (Eds.), Advances in information retrieval (Lecture notes in computer science, 15576). Springer. https://doi.org/10.1007/978-3-031-88720-8_26
Pinheiro, M. & Oliveira, H. (2022). Inteligência artificial: estudos e usos na ciência da informação no Brasil. Revista ibero-americana de ciência da informação, 15(3), 950-968. https://doi.org/10.26512/rici.v15.n3.2022.42767
Porcheron, M., Fischer, J. E., Reeves, S. & Sharples, S. (2018). Voice interfaces in everyday life. CHI ’18: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (Paper 640). ACM. https://dl.acm.org/doi/10.1145/3173574.3174214 DOI: https://doi.org/10.1145/3173574.3174214
Rubery, M. (2016). The untold story of the talking book. Harvard University Press. DOI: https://doi.org/10.4159/9780674974555
Russell, S. & Norvig, P. (2016). Inteligência artificial: uma abordagem moderna. GEN LTC.
Saracevic, T. (1996). Ciência da informação: origem, evolução e relações. Perspectivas em ciência da informação, 1(1). http://hdl.handle.net/20.500.11959/brapci/37415
Serralheiro, A., Ferreira, C. & Costa, P. (2002). Word alignment in digital talking books using WFSTs. In M. Agosti & C. Thanos (Eds.), Research and advanced technology for digital libraries. Springer. https://link.springer.com/chapter/10.1007/3-540-45747-X_37 DOI: https://doi.org/10.1007/3-540-45747-X_37
Schittine, D. (2022). Audiolivros: Desafios de produção, voz do narrador e público-leitor. Scripta, 26(56), 256-269. https://doi.org/10.5752/P.2358-3428.2022v26n56p256-269
Silva, M. B. da & Neves, D. A. de B. (2013). A aplicação da teoria facetada em banco de dados, através da modelagem conceitual. In M. E. B. C. de Albuquerque, L. S. M. A. da Silva, & R. C. C. de Araújo (Eds.), Representação da informação: um universo multifacetado. Editora da UFPB. https://doi.org/10.22477/vii.widat.206
Sodhi, S. S., Singh, A., Ghosh, S. & Shrivastava, M. (2021). Mondegreen: A post-processing solution to speech recognition error correction for voice search queries. Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. ACM. https://dl.acm.org/doi/10.1145/3447548.3467156 DOI: https://doi.org/10.1145/3447548.3467156
Souza Gonçalves, S. & Nascimento Silva, P. (2026). Dados da pesquisa: inteligência artificial em audiolivros: aplicações e perspectivas. Mendeley Data, V2. https://doi.org/10.17632/vtb3ygt62k.3
Sri, K. S., Mounika, C. & Yamini, K. (2022). Audiobooks that converts text, image, PDF-audio & speech-text: For physically challenged & improving fluency. International Conference on Inventive Computation Technologies (ICICT), Nepal. IEEE. https://doi.org/10.1109/ICICT54344.2022.9850872
Subramanian, V., Patel, A., Rao, P. & Kumar, S. (2024). Voice modulation in audiobook narration. 2024 11th International Conference on Soft Computing & Machine Intelligence (ISCMI). IEEE. https://ieeexplore.ieee.org/document/10851662 DOI: https://doi.org/10.1109/ISCMI63661.2024.10851662
Székely, É., O’Connor, N. & Gobl, C. (2012). Synthesizing expressive speech from amateur audiobook recordings. 2012 IEEE Spoken Language Technology Workshop (SLT). IEEE. https://ieeexplore.ieee.org/document/6424239 DOI: https://doi.org/10.1109/SLT.2012.6424239
Tóth, L., Grósz, T., Gosztolya, G. & Hoffmann, I. (2010). Speech recognition experiments with audiobooks. Acta cybernetica, 19(4), 669-682. https://cyber.bibl.u-szeged.hu/index.php/actcybern/article/view/3792
Vít, J. & Matoušek, J. (2016). Unit-selection speech synthesis adjustments for audiobook-based voices. In P. Sojka, A. Horák, I. Kopeček, & K. Pala (Eds.), Text, speech, and dialogue. Springer. https://link.springer.com/chapter/10.1007/978-3-319-45510-5_38 DOI: https://doi.org/10.1007/978-3-319-45510-5_38
Xiao, Y., Li, H., Zhou, K., Zhang, J. & Liu, Y. (2024). Contrastive context-speech pretraining for expressive text-to-speech synthesis. Proceedings of the 32nd ACM International Conference on Multimedia (MM ’24), October 28-November 1, 2024, Melbourne, VIC, Australia. ACM. https://dl.acm.org/doi/10.1145/3664647.3681348 DOI: https://doi.org/10.1145/3664647.3681348
Yahagi, Y., Tanaka, M., Saito, T. & Nakamura, S. (2025). PaperWave: listening to research papers as conversational podcasts scripted by LLM. Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems. ACM. https://dl.acm.org/doi/10.1145/3706599.3706664 DOI: https://doi.org/10.1145/3706599.3706664
Yang, L., Krause, M., Seipp, K. & Ricci, F. (2018). Understanding user interactions with podcast recommendations delivered via voice. Proceedings of the 12th ACM Conference on Recommender Systems. ACM. https://dl.acm.org/doi/10.1145/3240323.3240389 DOI: https://doi.org/10.1145/3240323.3240389
Zhang, Z., Wu, Y., Li, X. & Chen, S. (2024). SpeechLM: enhanced speech pre-training with unpaired textual data. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 32. IEEE. https://ieeexplore.ieee.org/document/10476749 DOI: https://doi.org/10.1109/TASLP.2024.3379877
Zhao, Z. & McEwen, R. (2022). Let’s read a book together: a long-term study on the usage of pre-school children with their home companion robot. 17th ACM/IEEE International Conference on Human-Robot Interaction (HRI), Sapporo, Japan. IEEE. https://doi.org/10.1109/HRI53351.2022.9889672