Academic Journal of Computing & Information Science, 2025, 8(10); doi: 10.25236/AJCIS.2025.081005.
Junnuo Wang
New York University, 70 Washington Sq South, New York, NY, USA
Chord-conditioned melody generation remains limited by coarse harmonic encodings that collapse extended, altered, and modal chords into a few root–quality labels, preventing models from learning nuanced tension and resolution behavior. This paper introduces a theory-structured framework that enriches harmonic conditioning and guides decoding in a modular Transformer architecture. First, a Theory-Structured Harmonic Embedding decomposes each chord into additive Root, Quality, Extension, and Tension components, yielding interpretable sub-embeddings without incurring a combinatorial chord vocabulary. Second, a Harmony-Aware Soft Constrained Decoding scheme adjusts pitch logits at inference time using music-theoretic priors on chord-tone preference, tension validity, non-chord-tone resolution, and scale adherence, controlled by a single constraint-strength parameter. Experiments on the Enhanced Wikifonia Leadsheet Dataset compare a CMT-style baseline, an EC2-VAE model, and three ablation variants. The full model significantly improves Chord Tone Ratio, Tension Correctness, and Non-Chord-Tone Resolution, while maintaining corpus-level pitch and rhythm statistics as measured by MGEval KLD and overlap area. These results demonstrate that explicit harmonic structure and theory-aware decoding jointly yield melodies that are both stylistically faithful and more music-theoretically aligned.
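The additive decomposition described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the component vocabulary sizes (12 roots, 8 qualities, 6 extensions, 10 tensions), the embedding width, and the function names are all assumptions made for the example; the key idea it demonstrates is that a chord vector is the sum of per-component sub-embeddings rather than a lookup in a combinatorial chord vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32  # illustrative embedding width

# One lookup table per theory component (sizes are assumptions):
root_emb = rng.normal(size=(12, d))      # 12 chromatic roots
quality_emb = rng.normal(size=(8, d))    # maj, min, dom, dim, ...
ext_emb = rng.normal(size=(6, d))        # triad, 7th, 9th, 11th, 13th, ...
tension_emb = rng.normal(size=(10, d))   # alterations such as b9, #11, b13

def chord_vector(root, quality, ext, tensions):
    """Additive composition: one sub-embedding per component, summed.
    `tensions` is a (possibly empty) set of alteration indices, so
    extended and altered chords reuse the same small tables."""
    v = root_emb[root] + quality_emb[quality] + ext_emb[ext]
    for t in tensions:
        v = v + tension_emb[t]
    return v

# e.g. an altered dominant: root index 0, dominant quality,
# 13th extension, with two tensions (indices are illustrative)
v = chord_vector(0, 4, 5, [2, 6])
```

Because the composition is additive, adding or removing a tension shifts the chord vector by exactly that tension's sub-embedding, which is what keeps the representation interpretable.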
Keywords: Symbolic Music, Melody Generation, Chord Conditioning
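The decoding scheme described in the abstract can likewise be sketched as a logit adjustment applied before sampling. The specific bonus and penalty magnitudes, the mask construction, and the function name below are assumptions for illustration; what the sketch preserves from the paper's description is that several music-theoretic priors (chord-tone preference, scale adherence, and so on) are folded into the pitch logits and scaled by a single constraint-strength parameter.

```python
import numpy as np

def harmony_adjusted_logits(logits, chord_tone_mask, scale_mask, alpha=1.0):
    """Soft-constrained decoding sketch: nudge pitch logits with
    theory-derived bonuses/penalties instead of hard-masking pitches.
    `alpha` plays the role of the single constraint-strength parameter:
    alpha = 0 recovers the unconstrained model."""
    bonus = np.zeros_like(logits)
    bonus += 1.0 * chord_tone_mask         # reward chord tones
    bonus -= 1.0 * (1.0 - scale_mask)      # penalize out-of-scale pitches
    return logits + alpha * bonus
```

Because the adjustment is additive rather than a hard mask, non-chord tones remain reachable (allowing passing tones that later resolve), and sweeping `alpha` trades stylistic freedom against theoretic compliance.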
Junnuo Wang. Theory-Structured Harmonic Embeddings for Chord-Conditioned Melody Generation. Academic Journal of Computing & Information Science (2025), Vol. 8, Issue 10: 30-36. https://doi.org/10.25236/AJCIS.2025.081005.
[1] Y.-J. Shih, S.-L. Wu, F. Zalkow, M. Müller, and Y.-H. Yang, “Theme Transformer: Symbolic music generation with theme-conditioned Transformer,” IEEE Trans. Multimedia, vol. 25, pp. 3495–3508, 2023, doi: 10.1109/TMM.2022.3161851.
[2] A. Vaswani et al., “Attention is all you need,” in Advances in Neural Information Processing Systems, vol. 30, 2017.
[3] S. Ji, X. Yang, and J. Luo, “A survey on deep learning for symbolic music generation: Representations, algorithms, evaluations, and challenges,” ACM Comput. Surv., vol. 56, no. 1, pp. 1–39, Jan. 2024, doi: 10.1145/3597493.
[4] K. Choi, J. Park, W. Heo, S. Jeon, and J. Park, “Chord conditioned melody generation with Transformer based decoders,” IEEE Access, vol. 9, pp. 42071–42080, 2021, doi: 10.1109/ACCESS.2021.3065831.
[5] B. Genchel, A. Pati, and A. Lerch, “Explicitly conditioned melody generation: A case study with interdependent RNNs,” arXiv preprint arXiv:1907.05208, Jul. 2019, doi: 10.48550/arXiv.1907.05208.
[6] Y.-C. Yeh et al., “Automatic melody harmonization with triad chords: A comparative study,” J. New Music Res., vol. 50, no. 1, pp. 1–15, 2021, doi: 10.1080/09298215.2021.1873392.
[7] S. E. Ni-Hahn, “Machine learning and music theory: Models for hierarchical music generation and analysis,” Ph.D. dissertation, Dept. Electr. Comput. Eng., Duke Univ., Durham, NC, USA, 2025.
[8] S.-L. Wu and Y.-H. Yang, “The Jazz Transformer on the front line: Exploring the shortcomings of AI-composed music through quantitative measures,” arXiv preprint arXiv:2008.01307, Aug. 2020, doi: 10.48550/arXiv.2008.01307.
[9] L.-C. Yang, S.-Y. Chou, and Y.-H. Yang, “MidiNet: A convolutional generative adversarial network for symbolic-domain music generation,” arXiv preprint arXiv:1703.10847, Jul. 2017, doi: 10.48550/arXiv.1703.10847.
[10] M. Kaliakatsos-Papakostas, D. Makris, K. Soiledis, K.-T. Tsamis, V. Katsouros, and E. Cambouropoulos, “HarmonyTok: Comparing methods for harmony tokenization for machine learning,” Information, vol. 16, no. 9, p. 759, Sep. 2025, doi: 10.3390/info16090759.
[11] Y.-S. Huang and Y.-H. Yang, “Pop music Transformer: Beat-based modeling and generation of expressive pop piano compositions,” in Proc. 28th ACM Int. Conf. Multimedia, Seattle, WA, USA, Oct. 2020, pp. 1180–1188, doi: 10.1145/3394171.3413671.
[12] H.-W. Dong, K. Chen, S. Dubnov, J. McAuley, and T. Berg-Kirkpatrick, “Multitrack music Transformer,” in ICASSP 2023 – IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), Jun. 2023, pp. 1–5, doi: 10.1109/ICASSP49357.2023.10094628.
[13] L. Min, J. Jiang, G. Xia, and J. Zhao, “Polyffusion: A diffusion model for polyphonic score generation with internal and external controls,” arXiv preprint arXiv:2307.10304, Jul. 2023, doi: 10.48550/arXiv.2307.10304.
[14] S. Li and Y. Sung, “MelodyDiffusion: Chord-conditioned melody generation using a Transformer-based diffusion model,” Mathematics, vol. 11, no. 8, p. 1915, 2023, doi: 10.3390/math11081915.
[15] A. Gu, K. Goel, and C. Ré, “Efficiently modeling long sequences with structured state spaces,” arXiv preprint arXiv:2111.00396, 2022, doi: 10.48550/arXiv.2111.00396.
[16] A. Agostinelli et al., “MusicLM: Generating music from text,” arXiv preprint arXiv:2301.11325, Jan. 2023, doi: 10.48550/arXiv.2301.11325.
[17] P. Dhariwal, H. Jun, C. Payne, J. W. Kim, A. Radford, and I. Sutskever, “Jukebox: A generative model for music,” arXiv preprint arXiv:2005.00341, Apr. 2020, doi: 10.48550/arXiv.2005.00341.
[18] R. Yang, D. Wang, Z. Wang, T. Chen, J. Jiang, and G. Xia, “Deep music analogy via latent representation disentanglement,” arXiv preprint arXiv:1906.03626, Oct. 2019, doi: 10.48550/arXiv.1906.03626.
[19] A. Roberts, J. Engel, C. Raffel, C. Hawthorne, and D. Eck, “A hierarchical latent vector model for learning long-term structure in music,” in Proc. 35th Int. Conf. Machine Learning, Jul. 2018, pp. 4364–4373.
[20] G. Hadjeres, F. Pachet, and F. Nielsen, “DeepBach: A steerable model for Bach chorales generation,” in Proc. 34th Int. Conf. Machine Learning, Jul. 2017, pp. 1362–1371.
[21] S. Lattner, M. Grachten, and G. Widmer, “Imposing higher-level structure in polyphonic music generation using convolutional restricted Boltzmann machines and constraints,” J. Creat. Music Syst., vol. 2, pp. 1–31, Nov. 2020, doi: 10.3316/informit.668426282761995.
[22] A. C. Salem, M. Shokri, and J. Devaney, “Chord-conditioned melody and bass generation,” arXiv preprint arXiv:2511.08755, Nov. 2025, doi: 10.48550/arXiv.2511.08755.
[23] L.-C. Yang and A. Lerch, “On the evaluation of generative models in music,” Neural Comput. Appl., vol. 32, no. 9, pp. 4385–4416, 2020, doi: 10.1007/s00521-018-3849-7.
[24] M. S. Cuthbert and C. Ariza, “music21: A toolkit for computer-aided musicology and symbolic music data,” in Proc. 11th Int. Soc. Music Inf. Retrieval Conf. (ISMIR), Utrecht, The Netherlands, Aug. 2010, pp. 637–642, doi: 10.5281/zenodo.1416114.
[25] C. Harte, M. Sandler, S. Abdallah, and E. Gómez, “Symbolic representation of musical chords: A proposed syntax for text annotations,” in Proc. 6th Int. Conf. Music Inf. Retrieval (ISMIR), London, U.K., Sep. 2005, pp. 66–71.
[26] C.-Z. A. Huang et al., “Music Transformer,” arXiv preprint arXiv:1809.04281, Dec. 2018, doi: 10.48550/arXiv.1809.04281.
[27] F. Simonetta, F. Carnovalini, N. Orio, and A. Rodà, “Symbolic music similarity through a graph-based representation,” in Proc. Audio Mostly 2018 on Sound in Immersion and Emotion (AM ’18), New York, NY, USA, 2018, pp. 1–7, doi: 10.1145/3243274.3243301.
[28] D. Temperley, The Cognition of Basic Musical Structures. Cambridge, MA, USA: MIT Press, 2001.
[29] D. L. Brodbeck, “Review of Harmony and Voice Leading, by Edward Aldwell and Carl Schachter,” Perspect. New Music, vol. 21, no. 1–2, pp. 425–430, 1982/83.
[30] M. Levine, The Jazz Theory Book. Petaluma, CA, USA: Sher Music, 1995.