Artificial intelligence in audiobooks: applications and perspectives
Abstract
The use of Artificial Intelligence (AI) techniques in the context of audiobooks has expanded the possibilities for accessibility, personalization, and immersion, enabling everything from speech recognition and voice synthesis to interactive multimodal experiences and personalized recommendations, while also enhancing content retrieval and broadening access to information. This study aimed to identify, in the academic literature, research on the use of AI in audiobooks. To that end, a literature review was conducted in the Scopus, Web of Science, ACM Digital Library, IEEE Xplore, and SciELO databases between May and August 2025, resulting in the selection and analysis of 35 articles. The results show that the studies concentrate in four categories: (i) speech recognition; (ii) voice synthesis and personalization; (iii) voice-based experiences; and (iv) generative AI and LLMs. Technical studies focused on Automatic Speech Recognition and voice synthesis predominate, while voice-based experiences and LLM applications are still emerging, pointing to future trends. Audiobooks are also frequently used as a technical corpus for model development, with few studies aimed at directly improving the user experience, and research in Information Science remains scarce. We conclude that, despite recent advances, there are gaps related to the lack of user-centered studies and the predominant use of audiobooks as a technical corpus, as well as limited attention to ethical and social aspects. This overview provides theoretical and practical support for future research in the field.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
References
Aldeneh, Z., Perez, M. & Mower Provost, E. (2021). Learning paralinguistic features from audiobooks through style voice conversion. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. https://doi.org/10.18653/v1/2021.naacl-main.377
Bardin, L. (2011). Análise de conteúdo. Edições 70.
Barr, A. & Feigenbaum, E. A. (1981). The handbook of artificial intelligence. William Kaufmann Inc.
Bertulfo, L. C., Razon, J. M., San Juan, J. L., Sambrano, J. C. & Medina, R. P. (2017). Gabay Tinig: A 3D interactive audiobook with voice recognition for visually-impaired and blind preschool students using mobile technologies. Proceedings of the 3rd International Conference on Communication and Information Processing (ICCIP ’17). https://doi.org/10.1145/3162957.3162979
Biber, D., Conrad, S. & Reppen, R. (1998). Corpus linguistics: investigating language structure and use. Cambridge University Press. https://doi.org/10.1017/CBO9780511804489
Borko, H. (1968). Information science: What is it? American documentation, 19(1). https://doi.org/10.1002/asi.5090190103
Brewer, R. N. & Piper, A. M. (2017). XPress: Rethinking design for aging and accessibility through an IVR blogging system. Proceedings of the ACM on Human-Computer Interaction, 1(CSCW). https://doi.org/10.1145/3139354
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., ... Amodei, D. (2020). Language models are few-shot learners. Advances in neural information processing systems (NeurIPS 2020). https://arxiv.org/abs/2005.14165
Chalamandaris, A., Raptis, S., Karabetsos, S. & Tsiakoulis, P. (2014). Using audio books for training a text-to-speech system. In N. Calzolari et al. (Eds.), Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14). European Language Resources Association (ELRA). https://aclanthology.org/L14-1645/ DOI: https://doi.org/10.63317/3izdxbcmh47r
Chen, L., Braunschweiler, N. & Gales, M. J. F. (2015). Speaker and expression factorization for audiobook data: Expressiveness and transplantation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(4). https://doi.org/10.1109/TASLP.2014.2385478
Chen, X. et al. (2024). StyleSpeech: self-supervised style enhancing with VQ-VAE-based pre-training for expressive audiobook speech synthesis. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). https://doi.org/10.1109/ICASSP48485.2024.10446352
Choi, D. H. (2019). LYRA: an interactive and interactive storyteller. 2019 IEEE International Conference on Computational Science and Engineering (CSE) and IEEE International Conference on Embedded and Ubiquitous Computing (EUC). IEEE. https://doi.org/10.1109/CSE/EUC.2019.00037
Cotton, K., de Vries, K. & Tatar, K. (2024). Singing for the missing: bringing the body back to AI voice and speech technologies. Proceedings of the 9th International Conference on Movement and Computing. ACM. https://doi.org/10.1145/3658852.3659065
Cruz, J. R. D., Villanueva, C. M., Abarro, J. A. & Villanueva, J. L. (2020). Talkie: an assistive web-based educational application using audio files and speech technology for the visually impaired. Proceedings of the 2020 6th International Conference on Frontiers of Educational Technologies. ACM. https://doi.org/10.1145/3404709.3404748
Desai, S., Lundy, M. & Chin, J. (2023). “A painless way to learn”: designing an interactive storytelling voice user interface to engage older adults in informal health information learning. CUI ’23: Proceedings of the 5th International Conference on Conversational User Interfaces (No. 5). ACM. https://doi.org/10.1145/3571884.3597141
Gebreegziabher, N. H. & Nürnberger, A. (2020). A light-weight convolutional neural network based speech recognition for spoken content retrieval task. IEEE International Conference on Systems, Man, and Cybernetics (SMC), Toronto, ON, Canada. IEEE. https://doi.org/10.1109/SMC42975.2020.9282956
Gibadullin, R. F., Perukhin, M. Y. & Llin, A. V. (2021). Speech recognition and machine translation using neural networks. International Conference on Industrial Engineering, Applications and Manufacturing (ICIEAM), Sochi, Russia. IEEE. https://doi.org/10.1109/ICIEAM51226.2021.9446474
Godambe, T., Singh, R., Sitaram, S. & Choudhury, M. (2016). Developing a unit selection voice given audio without corresponding text. EURASIP Journal on audio, speech, and music processing, (6). https://doi.org/10.1186/s13636-016-0084-y
Gonçalves, S. S. & Silva, P. N. (2025). Requisitos funcionais para recuperação de informação em audiolivros: uma análise nas plataformas. Informação & informação, 30(1), 354-372. https://doi.org/10.5433/1981-8920.2025v30n1p354
Goodfellow, I., Bengio, Y. & Courville, A. (2016). Deep learning. MIT Press. https://www.deeplearningbook.org/
Have, I. & Pedersen, B. (2019). The audiobook circuit in digital publishing: voicing the silent revolution. New media & society, 22(3), 409-428. https://doi.org/10.1177/1461444819863407
Jennes, I., Blanckaert, E. & Van den Broeck, W. (2023). Immersion or disruption: Readers’ evaluation of and requirements for (3D-) audio as a tool to support immersion in digital reading practices. IMX ’23: Proceedings of the 2023 ACM International Conference on Interactive Media Experiences. ACM. https://doi.org/10.1145/3573381.3596151
Jin, Z. et al. (2024). SpeechCraft: a fine-grained expressive speech dataset with natural language description. Proceedings of the 32nd ACM International Conference on Multimedia. ACM. https://doi.org/10.1145/3664647.3681674
Jurafsky, D. & Martin, J. H. (2023). Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition. Prentice Hall.
Kryvenchuk, Y. & Duda, O. (2024). Audio book creation system using artificial intelligence. IEEE 19th International Conference on Computer Science and Information Technologies (CSIT), Lviv, Ukraine. IEEE. https://doi.org/10.1109/CSIT65290.2024.10982688
Laban, P., Dusek, O., Sharma, R. & Rieser, V. (2022). NewsPod: automatic and interactive news podcasts. 27th International Conference on Intelligent User Interfaces. ACM. https://doi.org/10.1145/3490099.3511147
Lourenço, C. de A. (2005). Modelagem de dados como ferramenta de análise de padrões de metadados em bibliotecas digitais: O padrão de metadados brasileiro para teses e dissertações segundo o modelo entidade-relacionamento (Tese de doutorado). Escola de Ciência da Informação, Universidade Federal de Minas Gerais, Belo Horizonte. https://repositorio.ufmg.br/handle/1843/EARM-6ZGNZC
Mamiya, Y., et al. (2013). Lightly supervised GMM VAD to use audiobook for speech synthesiser. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. https://doi.org/10.1109/ICASSP.2013.6639220
McEnery, T. & Hardie, A. (2012). Corpus linguistics. Cambridge University Press. https://doi.org/10.1093/oxfordhb/9780199276349.013.0024
Nadai, M. de, Silva, T., Faria, R. & Ribeiro, L. (2024). Personalized audiobook recommendations at Spotify through graph neural networks. WWW ’24: Companion Proceedings of the ACM Web Conference. ACM. https://doi.org/10.1145/3589335.3648339
Oliveira, D. T. de & Nascimento Silva, P. (2024). Representação e recuperação de dados governamentais abertos: uma revisão de literatura. RDBCI: Revista digital de biblioteconomia e ciência da informação, (22), e024029. https://doi.org/10.20396/rdbci.v22i00.8675828
Oumard, C., Kreimeier, J. & Götzelmann, T. (2022). Implementation and evaluation of a voice user interface with offline speech processing for people who are blind or visually impaired. PETRA ’22: Proceedings of the 15th International Conference on Pervasive Technologies Related to Assistive Environments. ACM. https://doi.org/10.1145/3529190.3529197
Page, M. J., et al. (2021). The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ, 372, n71. https://doi.org/10.1136/bmj.n71
Panayotov, V., Chen, G., Povey, D. & Khudanpur, S. (2015). Librispeech: An ASR corpus based on public domain audio books. 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia. IEEE. https://doi.org/10.1109/ICASSP.2015.7178964
Park, T. H. & Tsuruoka, T. (2019). Generative bookscapes: towards immersive and interactive book reading. International Computer Music Conference, New York City Electroacoustic Music Festival. Fulcrum. https://www.fulcrum.org/epubs/9880vt18d?locale=en#page=3
Pathak, A., Sharma, V., Singh, R. & Choudhury, M. (2024). Emotion-aware text to speech: bridging sentiment analysis and voice synthesis. 3rd International Conference for Innovation in Technology (INOCON), Bangalore, India. IEEE. https://doi.org/10.1109/INOCON60754.2024.10512224
Penha, G., Santos, L., Almeida, F. & Hauff, C. (2025). Contextualizing Spotify’s audiobook list recommendations with descriptive shelves. In C. Hauff et al. (Eds.), Advances in information retrieval (Lecture notes in computer science, 15576). Springer. https://doi.org/10.1007/978-3-031-88720-8_26
Pinheiro, M. & Oliveira, H. (2022). Inteligência artificial: estudos e usos na ciência da informação no Brasil. Revista ibero-americana de ciência da informação, 15(3), 950-968. https://doi.org/10.26512/rici.v15.n3.2022.42767
Porcheron, M., Fischer, J. E., Reeves, S. & Sharples, S. (2018). Voice interfaces in everyday life. CHI ’18: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (Paper 640). ACM. https://doi.org/10.1145/3173574.3174214
Rubery, M. (2016). The untold story of the talking book. Harvard University Press. https://doi.org/10.4159/9780674974555
Russell, S. & Norvig, P. (2016). Inteligência artificial: uma abordagem moderna. GEN LTC.
Saracevic, T. (1996). Ciência da informação: origem, evolução e relações. Perspectivas em ciência da informação, 1(1). http://hdl.handle.net/20.500.11959/brapci/37415
Serralheiro, A., Ferreira, C. & Costa, P. (2002). Word alignment in digital talking books using WFSTs. In M. Agosti & C. Thanos (Eds.), Research and advanced technology for digital libraries. Springer. https://doi.org/10.1007/3-540-45747-X_37
Schittine, D. (2022). Audiolivros: Desafios de produção, voz do narrador e público-leitor. Scripta, 26(56), 256-269. https://doi.org/10.5752/P.2358-3428.2022v26n56p256-269
Silva, M. B. da & Neves, D. A. de B. (2013). A aplicação da teoria facetada em banco de dados, através da modelagem conceitual. In M. E. B. C. de Albuquerque, L. S. M. A. da Silva, & R. C. C. de Araújo (Eds.), Representação da informação: um universo multifacetado. Editora da UFPB. https://doi.org/10.22477/vii.widat.206
Sodhi, S. S., Singh, A., Ghosh, S. & Shrivastava, M. (2021). Mondegreen: A post-processing solution to speech recognition error correction for voice search queries. Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. ACM. https://doi.org/10.1145/3447548.3467156
Souza Gonçalves, S. & Nascimento Silva, P. (2026). Dados da pesquisa: inteligência artificial em audiolivros: aplicações e perspectivas. Mendeley Data, V2. https://doi.org/10.17632/vtb3ygt62k.3
Sri, K. S., Mounika, C. & Yamini, K. (2022). Audiobooks that converts text, image, PDF-audio & speech-text: For physically challenged & improving fluency. International Conference on Inventive Computation Technologies (ICICT), Nepal. IEEE. https://doi.org/10.1109/ICICT54344.2022.9850872
Subramanian, V., Patel, A., Rao, P. & Kumar, S. (2024). Voice modulation in audiobook narration. 2024 11th International Conference on Soft Computing & Machine Intelligence (ISCMI). IEEE. https://doi.org/10.1109/ISCMI63661.2024.10851662
Székely, É., O’Connor, N. & Gobl, C. (2012). Synthesizing expressive speech from amateur audiobook recordings. 2012 IEEE Spoken Language Technology Workshop (SLT). IEEE. https://doi.org/10.1109/SLT.2012.6424239
Tóth, L., Grósz, T., Gosztolya, G. & Hoffmann, I. (2010). Speech recognition experiments with audiobooks. Acta cybernetica, 19(4), 669-682. https://cyber.bibl.u-szeged.hu/index.php/actcybern/article/view/3792
Vít, J. & Matoušek, J. (2016). Unit-selection speech synthesis adjustments for audiobook-based voices. In P. Sojka, A. Horák, I. Kopeček, & K. Pala (Eds.), Text, speech, and dialogue. Springer. https://doi.org/10.1007/978-3-319-45510-5_38
Xiao, Y., Li, H., Zhou, K., Zhang, J. & Liu, Y. (2024). Contrastive context-speech pretraining for expressive text-to-speech synthesis. Proceedings of the 32nd ACM International Conference on Multimedia (MM ’24), October 28-November 1, 2024, Melbourne, VIC, Australia. ACM. https://doi.org/10.1145/3664647.3681348
Yahagi, Y., Tanaka, M., Saito, T. & Nakamura, S. (2025). PaperWave: listening to research papers as conversational podcasts scripted by LLM. Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems. ACM. https://doi.org/10.1145/3706599.3706664
Yang, L., Krause, M., Seipp, K. & Ricci, F. (2018). Understanding user interactions with podcast recommendations delivered via voice. Proceedings of the 12th ACM Conference on Recommender Systems. ACM. https://doi.org/10.1145/3240323.3240389
Zhang, Z., Wu, Y., Li, X. & Chen, S. (2024). SpeechLM: enhanced speech pre-training with unpaired textual data. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 32. IEEE. https://doi.org/10.1109/TASLP.2024.3379877
Zhao, Z. & McEwen, R. (2022). Let’s read a book together: a long-term study on the usage of pre-school children with their home companion robot. 17th ACM/IEEE International Conference on Human-Robot Interaction (HRI), Sapporo, Japan. IEEE. https://doi.org/10.1109/HRI53351.2022.9889672