Matcha compared to Vits #97

Open
yygg678 opened this issue Sep 22, 2024 · 2 comments

Comments

@yygg678

yygg678 commented Sep 22, 2024

I replicated the results of VITS and Matcha-TTS on a single-speaker Chinese dataset and found that the timbre similarity of Matcha-TTS is lower than that of VITS, especially in the high-frequency details of the spectrum. Below are the spectrograms of VITS and Matcha-TTS. Is there any way to improve the timbre similarity of Matcha-TTS?
[Spectrogram: VITS]
[Spectrogram: Matcha-TTS]

@shivammehta25
Owner

Hi! That is a cool experiment.

Did you fine-tune the vocoder too? The reason I am asking is that VITS has a built-in vocoder, as it is an end-to-end TTS system. Matcha, on the other hand, is an acoustic model that learns to map text to a log-mel-spectrogram. Currently, we have been using an off-the-shelf neural vocoder, namely HiFi-GAN, without fine-tuning it on Matcha's log-mel-spectrogram output.

I think to fix this, you will have to fine-tune the vocoder. One way to do this would be to extract the alignments, use these extracted alignments (instead of the duration predictor's outputs) to generate and save the log-mel-spectrograms, and then fine-tune the vocoder on them.
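For concreteness, here is a minimal sketch of what that dump-and-fine-tune step could look like. The loader (`load_matcha`), the alignment-conditioned synthesis call (`synthesise_with_alignments`), and the batch keys are placeholders for whatever your local training setup exposes, not the exact Matcha-TTS API:

```python
import numpy as np
import torch
from pathlib import Path

out_dir = Path("finetune_mels")
out_dir.mkdir(exist_ok=True)

model = load_matcha("matcha.ckpt").cuda().eval()      # placeholder loader for your checkpoint

with torch.no_grad():
    for batch in train_dataloader:                    # your single-speaker training set
        # Condition on the ground-truth alignments extracted during training
        # (not the duration predictor), so each generated mel stays
        # time-aligned with its original waveform.
        mel = synthesise_with_alignments(             # placeholder for the alignment-conditioned call
            model,
            batch["text_ids"].cuda(),
            durations=batch["durations"].cuda(),
        )                                             # -> [B, 80, T] log-mels
        for fname, m in zip(batch["filenames"], mel.cpu().numpy()):
            np.save(out_dir / f"{fname}.npy", m)
```

Those `.npy` files, paired with the original waveforms, are then what you fine-tune the vocoder on (if I remember correctly, the official HiFi-GAN training script has a fine-tuning mode that reads pre-computed mels from a directory instead of recomputing them from the audio).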

One easier experiment might be to try switching the vocoder. You can switch from HiFi-GAN to BigVGAN off the shelf; they use the same STFT parameters, so you don't need to retrain Matcha with different STFT settings.
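Because both vocoders consume the same log-mel representation, the swap is just a different generator module at inference time. A minimal sketch, assuming hypothetical `load_bigvgan` / `load_hifigan` helpers for however you construct each generator locally:

```python
import torch

mel = ...  # [B, 80, T] log-mel produced by Matcha, same STFT settings as in training

vocoder = load_bigvgan("bigvgan_generator.pt").cuda().eval()  # or load_hifigan(...); placeholder loaders
with torch.no_grad():
    audio = vocoder(mel.cuda())          # [B, 1, n_samples]
audio = audio.squeeze(1).clamp(-1.0, 1.0).cpu()
```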

Hope this helps, let me know if you have more questions :)

One side note: Matcha also has a temperature parameter. The higher the temperature, the more variance there will be in the generated output. It is only used during inference/generation, so you can easily play with it. However, I still feel this is a vocoder artefact, as end-to-end models have a waveform-generation objective to optimise, while acoustic models do not.
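A quick way to hear the effect is to sweep the temperature at synthesis time. The call below mirrors the `model.synthesise(...)` signature used in the synthesis notebook as I recall it (argument names may differ slightly in your checkout), and `tokenise` / `save_wav` are hypothetical helpers:

```python
import torch

text_ids, text_lengths = tokenise("Your test sentence")   # hypothetical text front-end
with torch.no_grad():
    for temperature in (0.3, 0.667, 1.0):
        out = model.synthesise(
            text_ids,
            text_lengths,
            n_timesteps=10,             # ODE solver steps
            temperature=temperature,    # higher -> more variance in the mel
            length_scale=1.0,
        )
        wav = vocoder(out["mel"])       # reuse whichever vocoder you settled on
        save_wav(f"sample_temp_{temperature}.wav", wav)    # hypothetical helper
```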

@yygg678
Author

yygg678 commented Sep 29, 2024

ok, thanks! I will try to fine-tune the vocoder.

2 participants