This article was published in Modern Film Technology (《現(xiàn)代電影技術(shù)》), Issue 9, 2024.
Expert Commentary
In recent years, AI-generated content (AIGC) technology has advanced rapidly. Its mainstream model frameworks are built on deep neural networks and have evolved from early GANs and VAEs toward Transformer, Diffusion, and DiT (Diffusion Transformer) architectures. Among these, text generation with large language models (LLMs) has steadily matured, leading and driving the development of image and audio generation, with ever-stronger controllability serving increasingly personalized creative needs. Music is an indispensable expressive element of film, and as AIGC technology develops and finds application, AI music generation is gradually becoming a transformative force in film scoring. It has so far branched into two technical routes, symbolic generation and audio generation, but existing methods pay insufficient attention to control conditions such as musical genre, which to some extent limits improvements in generation quality and diversity. The article "Research on Film Music Generation Based on a Multi-Granularity Attention Transformer" generates symbolic music from scratch using encoded genre information as a conditional input. Drawing on the repetitive, periodic structure of music, it adopts a Transformer architecture with a multi-granularity attention mechanism to capture musical structure and context, and introduces a genre classification discriminator whose output genre probabilities provide style control over the generated music. Compared with similar methods, this approach achieves notable improvements in genre control and in the quality and structure of the generated music, though its practicality still leaves room for improvement and warrants further research.
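The post does not reproduce the paper's implementation. Purely as an illustration of what bar-level multi-granularity attention could look like, the following NumPy sketch builds a boolean attention mask in which each note token attends fine-grained to its own bar and to "structure-related" earlier bars (offsets chosen to echo musical repetition periods), and coarse-grained to a per-bar summary token for all other past bars. The function name, token layout, and offsets are hypothetical, in the spirit of Museformer-style fine/coarse attention, not the paper's actual design.

```python
import numpy as np

def multi_granularity_mask(n_bars, tokens_per_bar, related_offsets=(1, 2, 4, 8)):
    """Boolean attention mask for bar-level multi-granularity attention.

    Token layout per bar (hypothetical): `tokens_per_bar` note tokens
    followed by one summary token aggregating the bar.  A query token
    attends fine-grained (all tokens) to its own bar and to earlier bars
    at `related_offsets`, and coarse-grained (summary token only) to all
    other earlier bars.  True = attention allowed.
    """
    stride = tokens_per_bar + 1          # note tokens + 1 summary token per bar
    n = n_bars * stride
    mask = np.zeros((n, n), dtype=bool)
    for q_bar in range(n_bars):
        q_lo, q_hi = q_bar * stride, (q_bar + 1) * stride
        for k_bar in range(q_bar + 1):   # past and current bars only
            k_lo = k_bar * stride
            if k_bar == q_bar or (q_bar - k_bar) in related_offsets:
                # fine-grained: every token of the related bar is visible
                mask[q_lo:q_hi, k_lo:k_lo + stride] = True
            else:
                # coarse-grained: only that bar's summary token is visible
                mask[q_lo:q_hi, k_lo + tokens_per_bar] = True
    # enforce causality (no attending to future positions within a bar)
    idx = np.arange(n)
    mask &= idx[None, :] <= idx[:, None]
    return mask
```

Such a mask keeps attention cost low for distant, unrelated bars while preserving full resolution for the bars most likely to be repeated or varied, which is the structural intuition the commentary describes.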
—— Wang Cui (王萃)
Professor-level Senior Engineer
Deputy Director, High and New Technology Research Department, China Film Science and Technology Research Institute (Film Technology Quality Inspection Institute of the Central Propaganda Department)
About the Authors
Xiong Xiaoyu (熊曉鈺)
Master's student (class of 2021), Shanghai Film Academy, Shanghai University. Research interests: deep learning, film music generation.
Xie Zhifeng (謝志峰)
Associate professor and doctoral supervisor, Shanghai Film Academy, Shanghai University, and Shanghai Film Special Effects Engineering Technology Center. Research interests: advanced film technology, artificial intelligence.
Huang Dengyun (黃登云)
Master's student (class of 2023), Shanghai Film Academy, Shanghai University. Research interests: deep learning, film music generation.
Zhu Yonghua (朱永華)
Associate professor and master's supervisor, Shanghai Film Academy, Shanghai University. Research interests: artificial intelligence, computer applications.
Abstract
Automatic film music generation is a current research hotspot in artificial intelligence. Many deep learning music generation algorithms can produce pleasant film scores, but they often neglect style controls such as genre during generation. To address this, this paper proposes a film music generation method based on a multi-granularity attention Transformer, which generates music from scratch according to a target genre. Building on a multi-granularity attention Transformer that models musical structure, the method introduces an adversarial learning mechanism: a genre auxiliary-classifier discriminator with both a genre classification loss and a generative adversarial loss strengthens the model's control over genre information. Subjective and objective experiments on a purpose-built symbolic music dataset with genre annotations show that the method outperforms previous approaches in both generated music quality and genre control, facilitating automatic generation of film scores conditioned on a target genre.
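The abstract's combination of a generative adversarial loss with an auxiliary genre-classification loss follows the general pattern of auxiliary-classifier GANs. The paper's exact formulation is not given here; as a minimal sketch under that assumption (all function names and shapes are hypothetical), the discriminator side of such an objective might look like:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def discriminator_loss(real_src, fake_src, real_cls_logits, genre_labels):
    """ACGAN-style objective for a genre auxiliary-classifier discriminator.

    real_src / fake_src : real/fake scores in (0, 1) for real and
                          generated music sequences
    real_cls_logits     : genre-classification logits for real sequences
    genre_labels        : integer genre ids for the real sequences
    """
    eps = 1e-8
    # adversarial term: push real scores toward 1, generated toward 0
    adv = -np.mean(np.log(real_src + eps)) - np.mean(np.log(1.0 - fake_src + eps))
    # auxiliary term: cross-entropy on the genre prediction
    probs = softmax(real_cls_logits)
    cls = -np.mean(np.log(probs[np.arange(len(genre_labels)), genre_labels] + eps))
    return adv + cls
```

The generator would be trained against the mirrored objective, so that fooling the discriminator also requires matching the target genre; this is how the classification head provides the style control the abstract describes.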
Keywords
music generation; genre control; generative adversarial network; Transformer; film music
1 Introduction
2 Related Work
2.1 Deep Learning-Based Symbolic Music Generation
2.2 Controllable Music Generation
3 Proposed Method
3.1 Overall Network Framework
3.2 Data Representation
3.3 Multi-Granularity Attention Transformer
3.4 Genre Auxiliary-Classifier Discriminator
4 Experimental Results and Analysis
4.1 Dataset
4.2 Experimental Settings
4.3 Objective Evaluation
4.4 Subjective Evaluation
5 Conclusion
Supervising authority: China Film Administration
Sponsor: Film Technology Quality Inspection Institute
ISSN: 1673-3215
CN: 11-5336/TB
Submission system: ampt.crifst.ac.cn
Official website: www.crifst.ac.cn
Distribution: 010-63245081