This article was published in Modern Film Technology (《現(xiàn)代電影技術(shù)》), Issue 9, 2024.
Expert Commentary
In recent years, AI-generated content (AIGC) technology has advanced rapidly. Its mainstream model frameworks are built on deep neural networks and have evolved from early GANs and VAEs toward Transformer, Diffusion, and DiT (Diffusion Transformer) architectures. Among these, text generation with large language models (LLMs) has steadily matured, leading and driving the development of image and audio generation, with ever-stronger controllability serving increasingly personalized creative needs. Music is an indispensable expressive element of film, and as AIGC technology develops and finds application, AI music generation is gradually becoming a transformative force in film scoring. It has so far branched into two technical routes, symbolic generation and audio generation, but existing methods pay insufficient attention to control conditions such as musical genre, which to some extent limits improvements in generation quality and diversity. The article "Research on Film Music Generation Based on a Multi-Granularity Attention Transformer" generates symbolic music from scratch using encoded genre information as a conditional input. Drawing on the repetitive, periodic structure of music, it adopts a Transformer architecture with a multi-granularity attention mechanism to capture musical structure and context, and introduces a genre classification discriminator whose output genre probabilities provide style control over the generated music. Compared with similar methods, this approach achieves notable improvements in genre control and in the quality and structure of the generated music, though its practicality still leaves room for improvement and warrants further research.
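The post does not reproduce the paper's implementation. Purely as an illustration of what bar-level multi-granularity attention could look like, the following NumPy sketch builds a boolean attention mask in which each note token attends fine-grained to its own bar and to "structure-related" earlier bars (offsets chosen to echo musical repetition periods), and coarse-grained to a per-bar summary token for all other past bars. The function name, token layout, and offsets are hypothetical, in the spirit of Museformer-style fine/coarse attention, not the paper's actual design.

```python
import numpy as np

def multi_granularity_mask(n_bars, tokens_per_bar, related_offsets=(1, 2, 4, 8)):
    """Boolean attention mask for bar-level multi-granularity attention.

    Token layout per bar (hypothetical): `tokens_per_bar` note tokens
    followed by one summary token aggregating the bar.  A query token
    attends fine-grained (all tokens) to its own bar and to earlier bars
    at `related_offsets`, and coarse-grained (summary token only) to all
    other earlier bars.  True = attention allowed.
    """
    stride = tokens_per_bar + 1          # note tokens + 1 summary token per bar
    n = n_bars * stride
    mask = np.zeros((n, n), dtype=bool)
    for q_bar in range(n_bars):
        q_lo, q_hi = q_bar * stride, (q_bar + 1) * stride
        for k_bar in range(q_bar + 1):   # past and current bars only
            k_lo = k_bar * stride
            if k_bar == q_bar or (q_bar - k_bar) in related_offsets:
                # fine-grained: every token of the related bar is visible
                mask[q_lo:q_hi, k_lo:k_lo + stride] = True
            else:
                # coarse-grained: only that bar's summary token is visible
                mask[q_lo:q_hi, k_lo + tokens_per_bar] = True
    # enforce causality (no attending to future positions within a bar)
    idx = np.arange(n)
    mask &= idx[None, :] <= idx[:, None]
    return mask
```

Such a mask keeps attention cost low for distant, unrelated bars while preserving full resolution for the bars most likely to be repeated or varied, which is the structural intuition the commentary describes.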
—— Wang Cui (王萃)
Professor-level Senior Engineer
Deputy Director, High and New Technology Research Department, China Film Science and Technology Research Institute (Film Technology Quality Inspection Institute of the Central Propaganda Department)
About the Authors
Xiong Xiaoyu (熊曉鈺)
Master's student (class of 2021), Shanghai Film Academy, Shanghai University. Research interests: deep learning, film music generation.
Xie Zhifeng (謝志峰)
Associate professor and doctoral supervisor, Shanghai Film Academy, Shanghai University, and Shanghai Film Special Effects Engineering Technology Center. Research interests: advanced film technology, artificial intelligence.
Huang Dengyun (黃登云)
Master's student (class of 2023), Shanghai Film Academy, Shanghai University. Research interests: deep learning, film music generation.
Zhu Yonghua (朱永華)
Associate professor and master's supervisor, Shanghai Film Academy, Shanghai University. Research interests: artificial intelligence, computer applications.
Abstract
Automatic film music generation is a current research hotspot in artificial intelligence. Many deep learning music generation algorithms can produce pleasant film scores, but they often neglect style controls such as genre during generation. To address this, this paper proposes a film music generation method based on a multi-granularity attention Transformer, which generates music from scratch according to a target genre. Building on a multi-granularity attention Transformer that models musical structure, the method introduces an adversarial learning mechanism: a genre auxiliary-classifier discriminator with both a genre classification loss and a generative adversarial loss strengthens the model's control over genre information. Subjective and objective experiments on a purpose-built symbolic music dataset with genre annotations show that the method outperforms previous approaches in both generated music quality and genre control, facilitating automatic generation of film scores conditioned on a target genre.
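The abstract's combination of a generative adversarial loss with an auxiliary genre-classification loss follows the general pattern of auxiliary-classifier GANs. The paper's exact formulation is not given here; as a minimal sketch under that assumption (all function names and shapes are hypothetical), the discriminator side of such an objective might look like:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def discriminator_loss(real_src, fake_src, real_cls_logits, genre_labels):
    """ACGAN-style objective for a genre auxiliary-classifier discriminator.

    real_src / fake_src : real/fake scores in (0, 1) for real and
                          generated music sequences
    real_cls_logits     : genre-classification logits for real sequences
    genre_labels        : integer genre ids for the real sequences
    """
    eps = 1e-8
    # adversarial term: push real scores toward 1, generated toward 0
    adv = -np.mean(np.log(real_src + eps)) - np.mean(np.log(1.0 - fake_src + eps))
    # auxiliary term: cross-entropy on the genre prediction
    probs = softmax(real_cls_logits)
    cls = -np.mean(np.log(probs[np.arange(len(genre_labels)), genre_labels] + eps))
    return adv + cls
```

The generator would be trained against the mirrored objective, so that fooling the discriminator also requires matching the target genre; this is how the classification head provides the style control the abstract describes.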
Keywords
music generation; genre control; generative adversarial network; Transformer; film music
1 Introduction
2 Related Work
2.1 Deep Learning-Based Symbolic Music Generation
2.2 Controllable Music Generation
3 Proposed Method
3.1 Overall Network Framework
3.2 Data Representation
3.3 Multi-Granularity Attention Transformer
3.4 Genre Auxiliary-Classifier Discriminator
4 Experimental Results and Analysis
4.1 Dataset
4.2 Experimental Settings
4.3 Objective Evaluation
4.4 Subjective Evaluation
5 Conclusion
Supervising authority: China Film Administration
Sponsor: Film Technology Quality Inspection Institute
ISSN: 1673-3215
CN: 11-5336/TB
Submission system: ampt.crifst.ac.cn
Official website: www.crifst.ac.cn
Distribution: 010-63245081