Matcha compared to Vits #97
Comments
Hi! That is a cool experiment. Did you fine-tune the vocoder too? I ask because VITS has a built-in vocoder, as it is an end-to-end TTS system. Matcha, on the other hand, is an acoustic model that learns to map text to a log-mel-spectrogram. Currently, we use off-the-shelf neural vocoders, namely HiFiGAN, without fine-tuning them on Matcha's log-mel-spectrogram output.

I think to fix this you will have to fine-tune the vocoder. One way to do this is to extract alignments, use these extracted alignments during generation instead of the duration predictor's outputs, save the resulting log-mel-spectrograms, and then fine-tune the vocoder on them. An easier experiment might be to switch the vocoder: you can swap HiFiGAN for BigVGAN off the shelf. They use the same STFT parameters, so you don't need to retrain Matcha with different STFT settings. Hope this helps, let me know if you have more questions :)

One side note: Matcha also has a temperature parameter. The higher the temperature, the more variance there will be in the generated output. It is used only during inference/generation, so you can easily play with it. However, I still feel this is a vocoder artefact, as end-to-end models have a waveform generation objective to optimise, while acoustic models do not.
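For anyone trying the alignment-based route, here is a minimal sketch of what "use the extracted alignments instead of the duration predictor's outputs" could look like. It assumes the alignment is a hard 0/1 matrix of shape (tokens, frames) in which every frame belongs to exactly one token. The function names and toy values are hypothetical and are not part of the Matcha-TTS codebase.

```python
import numpy as np


def durations_from_alignment(alignment: np.ndarray) -> np.ndarray:
    """Count how many frames each token covers in a hard alignment.

    alignment: 0/1 matrix of shape (n_tokens, n_frames); each column
    (frame) is assumed to contain exactly one 1.
    """
    return alignment.sum(axis=1).astype(int)


def expand_by_durations(token_features: np.ndarray, durations: np.ndarray) -> np.ndarray:
    """Repeat each token's feature vector durations[i] times.

    Returns an array of shape (sum(durations), feature_dim), i.e. one
    row per output frame.
    """
    return np.repeat(token_features, durations, axis=0)


# Toy example: 3 tokens aligned to 6 frames of ground-truth audio.
alignment = np.array([
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
])
# Hypothetical per-token encoder outputs (feature_dim = 2).
token_features = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])

durations = durations_from_alignment(alignment)
frame_features = expand_by_durations(token_features, durations)
print(durations, frame_features.shape)
```

Because the durations come from the alignment with the real recording, the generated log-mel-spectrogram has the same number of frames as the ground-truth audio, so each saved spectrogram can be paired with its original waveform when fine-tuning the vocoder.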
OK, thanks! I will try to fine-tune the vocoder.
I replicated the results of VITS and Matcha-TTS on a single-speaker Chinese dataset and found that the timbre similarity of Matcha-TTS is lower than that of VITS, especially in the high-frequency details of the spectrum. Below are the spectrograms of VITS and Matcha-TTS. Is there any way to improve the timbre similarity of Matcha-TTS?