import math
from typing import List, Optional, Tuple

import torch
import torch.nn.functional as F
from torch import nn, Tensor

__all__ = [
    "ResBlock",
    "MelResNet",
    "Stretch2d",
    "UpsampleNetwork",
    "WaveRNN",
]


class ResBlock(nn.Module):
    r"""ResNet block based on *Efficient Neural Audio Synthesis* :cite:`kalchbrenner2018efficient`.

    Args:
        n_freq: the number of bins in a spectrogram. (Default: ``128``)

    Examples
        >>> resblock = ResBlock()
        >>> input = torch.rand(10, 128, 512)  # a random spectrogram
        >>> output = resblock(input)  # shape: (10, 128, 512)
    """

    def __init__(self, n_freq: int = 128) -> None:
        super().__init__()

        self.resblock_model = nn.Sequential(
            nn.Conv1d(in_channels=n_freq, out_channels=n_freq, kernel_size=1, bias=False),
            nn.BatchNorm1d(n_freq),
            nn.ReLU(inplace=True),
            nn.Conv1d(in_channels=n_freq, out_channels=n_freq, kernel_size=1, bias=False),
            nn.BatchNorm1d(n_freq),
        )

    def forward(self, specgram: Tensor) -> Tensor:
        r"""Pass the input through the ResBlock layer.
        Args:
            specgram (Tensor): the input sequence to the ResBlock layer (n_batch, n_freq, n_time).

        Return:
            Tensor shape: (n_batch, n_freq, n_time)
        """
        return self.resblock_model(specgram) + specgram
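
# A quick shape check for ResBlock (an illustrative sketch; shapes follow the class
# docstring above): both 1x1 convolutions preserve (n_batch, n_freq, n_time), which
# is what makes the `+ specgram` skip connection in forward() well-defined.
#
#     >>> _block = ResBlock(n_freq=128)
#     >>> _block(torch.rand(10, 128, 512)).shape
#     torch.Size([10, 128, 512])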


class MelResNet(nn.Module):
    r"""MelResNet layer uses a stack of ResBlocks on spectrogram.

    Args:
        n_res_block: the number of ResBlock in stack. (Default: ``10``)
        n_freq: the number of bins in a spectrogram. (Default: ``128``)
        n_hidden: the number of hidden dimensions of resblock. (Default: ``128``)
        n_output: the number of output dimensions of melresnet. (Default: ``128``)
        kernel_size: the number of kernel size in the first Conv1d layer. (Default: ``5``)

    Examples
        >>> melresnet = MelResNet()
        >>> input = torch.rand(10, 128, 512)  # a random spectrogram
        >>> output = melresnet(input)  # shape: (10, 128, 508)
    """

    def __init__(
        self, n_res_block: int = 10, n_freq: int = 128, n_hidden: int = 128, n_output: int = 128, kernel_size: int = 5
    ) -> None:
        super().__init__()

        ResBlocks = [ResBlock(n_hidden) for _ in range(n_res_block)]

        self.melresnet_model = nn.Sequential(
            nn.Conv1d(in_channels=n_freq, out_channels=n_hidden, kernel_size=kernel_size, bias=False),
            nn.BatchNorm1d(n_hidden),
            nn.ReLU(inplace=True),
            *ResBlocks,
            nn.Conv1d(in_channels=n_hidden, out_channels=n_output, kernel_size=1),
        )

    def forward(self, specgram: Tensor) -> Tensor:
        r"""Pass the input through the MelResNet layer.
        Args:
            specgram (Tensor): the input sequence to the MelResNet layer (n_batch, n_freq, n_time).

        Return:
            Tensor shape: (n_batch, n_output, n_time - kernel_size + 1)
        """
        return self.melresnet_model(specgram)
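
# Worked length arithmetic (a sketch; numbers from the class docstring above): the
# first Conv1d has no padding, so the valid convolution trims kernel_size - 1 = 4
# frames, while the ResBlock stack and the final 1x1 Conv1d keep the length, hence
# 512 -> 508 in the example.
#
#     >>> _melresnet = MelResNet()
#     >>> _melresnet(torch.rand(10, 128, 512)).shape
#     torch.Size([10, 128, 508])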
	zMelResNet.forwardr,   r   r   r   r-   r%   r    r    r   r!   r   4   s"    r   c                       s@   e Zd ZdZdededdf fddZdedefd	d
Z  ZS )r	   a  Upscale the frequency and time dimensions of a spectrogram.

    Args:
        time_scale: the scale factor in time dimension
        freq_scale: the scale factor in frequency dimension

    Examples
        >>> stretch2d = Stretch2d(time_scale=10, freq_scale=5)

        >>> input = torch.rand(10, 100, 512)  # a random spectrogram
        >>> output = stretch2d(input)  # shape: (10, 500, 5120)
    """

    def __init__(self, time_scale: int, freq_scale: int) -> None:
        super().__init__()

        self.freq_scale = freq_scale
        self.time_scale = time_scale

    def forward(self, specgram: Tensor) -> Tensor:
        r"""Pass the input through the Stretch2d layer.

        Args:
            specgram (Tensor): the input sequence to the Stretch2d layer (..., n_freq, n_time).

        Return:
            Tensor shape: (..., n_freq * freq_scale, n_time * time_scale)
        )Zrepeat_interleaver9   r8   r#   r    r    r!   r$   s   s   
zStretch2d.forwardr%   r    r    r   r!   r	   _   s    r	   c                       sh   e Zd ZdZ					ddee dededed	ed
eddf fddZdedeeef fddZ	  Z
S )r
   a  Upscale the dimensions of a spectrogram.

    Args:
        upsample_scales: the list of upsample scales.
        n_res_block: the number of ResBlock in stack. (Default: ``10``)
        n_freq: the number of bins in a spectrogram. (Default: ``128``)
        n_hidden: the number of hidden dimensions of resblock. (Default: ``128``)
        n_output: the number of output dimensions of melresnet. (Default: ``128``)
        kernel_size: the number of kernel size in the first Conv1d layer. (Default: ``5``)

    Examples
        >>> upsamplenetwork = UpsampleNetwork(upsample_scales=[4, 4, 16])
        >>> input = torch.rand(10, 128, 10)  # a random spectrogram
        >>> output = upsamplenetwork(input)  # shape: (10, 128, 1536), (10, 128, 1536)
    """

    def __init__(
        self,
        upsample_scales: List[int],
        n_res_block: int = 10,
        n_freq: int = 128,
        n_hidden: int = 128,
        n_output: int = 128,
        kernel_size: int = 5,
    ) -> None:
        super().__init__()

        total_scale = 1
        for upsample_scale in upsample_scales:
            total_scale *= upsample_scale
        self.total_scale: int = total_scale

        self.indent = (kernel_size - 1) // 2 * total_scale
        self.resnet = MelResNet(n_res_block, n_freq, n_hidden, n_output, kernel_size)
        self.resnet_stretch = Stretch2d(total_scale, 1)

        up_layers = []
        for scale in upsample_scales:
            stretch = Stretch2d(scale, 1)
            conv = nn.Conv2d(
                in_channels=1, out_channels=1, kernel_size=(1, scale * 2 + 1), padding=(0, scale), bias=False
            )
            torch.nn.init.constant_(conv.weight, 1.0 / (scale * 2 + 1))
            up_layers.append(stretch)
            up_layers.append(conv)
        self.upsample_layers = nn.Sequential(*up_layers)

    def forward(self, specgram: Tensor) -> Tuple[Tensor, Tensor]:
        r"""Pass the input through the UpsampleNetwork layer.

        Args:
            specgram (Tensor): the input sequence to the UpsampleNetwork layer (n_batch, n_freq, n_time)

        Return:
            Tensor shape: (n_batch, n_freq, (n_time - kernel_size + 1) * total_scale),
                          (n_batch, n_output, (n_time - kernel_size + 1) * total_scale)
        where total_scale is the product of all elements in upsample_scales.
        """
        resnet_output = self.resnet(specgram).unsqueeze(1)
        resnet_output = self.resnet_stretch(resnet_output)
        resnet_output = resnet_output.squeeze(1)

        specgram = specgram.unsqueeze(1)
        upsampling_output = self.upsample_layers(specgram)
        upsampling_output = upsampling_output.squeeze(1)[:, :, self.indent : -self.indent]

        return upsampling_output, resnet_output


class WaveRNN(nn.Module):
    r"""WaveRNN model from *Efficient Neural Audio Synthesis* :cite:`wavernn`
    based on the implementation from `fatchord/WaveRNN <https://github.com/fatchord/WaveRNN>`_.

    The original implementation was introduced in *Efficient Neural Audio Synthesis*
    :cite:`kalchbrenner2018efficient`. The input channels of waveform and spectrogram have to be 1.
    The product of `upsample_scales` must equal `hop_length`.

    See Also:
        * `Training example <https://github.com/pytorch/audio/tree/release/0.12/examples/pipeline_wavernn>`__
        * :class:`torchaudio.pipelines.Tacotron2TTSBundle`: TTS pipeline with pretrained model.

    Args:
        upsample_scales: the list of upsample scales.
        n_classes: the number of output classes.
        hop_length: the number of samples between the starts of consecutive frames.
        n_res_block: the number of ResBlock in stack. (Default: ``10``)
        n_rnn: the dimension of RNN layer. (Default: ``512``)
        n_fc: the dimension of fully connected layer. (Default: ``512``)
        kernel_size: the number of kernel size in the first Conv1d layer. (Default: ``5``)
        n_freq: the number of bins in a spectrogram. (Default: ``128``)
        n_hidden: the number of hidden dimensions of resblock. (Default: ``128``)
        n_output: the number of output dimensions of melresnet. (Default: ``128``)

    Example
        >>> wavernn = WaveRNN(upsample_scales=[5,5,8], n_classes=512, hop_length=200)
        >>> waveform, sample_rate = torchaudio.load(file)
        >>> # waveform shape: (n_batch, n_channel, (n_time - kernel_size + 1) * hop_length)
        >>> specgram = MelSpectrogram(sample_rate)(waveform)  # shape: (n_batch, n_channel, n_freq, n_time)
        >>> output = wavernn(waveform, specgram)
        >>> # output shape: (n_batch, n_channel, (n_time - kernel_size + 1) * hop_length, n_classes)
    """

    def __init__(
        self,
        upsample_scales: List[int],
        n_classes: int,
        hop_length: int,
        n_res_block: int = 10,
        n_rnn: int = 512,
        n_fc: int = 512,
        kernel_size: int = 5,
        n_freq: int = 128,
        n_hidden: int = 128,
        n_output: int = 128,
    ) -> None:
        super().__init__()

        self.kernel_size = kernel_size
        self._pad = (kernel_size - 1 if kernel_size % 2 else kernel_size) // 2
        self.n_rnn = n_rnn
        self.n_aux = n_output // 4
        self.hop_length = hop_length
        self.n_classes = n_classes
        self.n_bits: int = int(math.log2(self.n_classes))

        total_scale = 1
        for upsample_scale in upsample_scales:
            total_scale *= upsample_scale
        if total_scale != self.hop_length:
            raise ValueError(f"Expected: total_scale == hop_length, but found {total_scale} != {hop_length}")

        self.upsample = UpsampleNetwork(upsample_scales, n_res_block, n_freq, n_hidden, n_output, kernel_size)
        self.fc = nn.Linear(n_freq + self.n_aux + 1, n_rnn)

        self.rnn1 = nn.GRU(n_rnn, n_rnn, batch_first=True)
        self.rnn2 = nn.GRU(n_rnn + self.n_aux, n_rnn, batch_first=True)

        self.relu1 = nn.ReLU(inplace=True)
        self.relu2 = nn.ReLU(inplace=True)

        self.fc1 = nn.Linear(n_rnn + self.n_aux, n_fc)
        self.fc2 = nn.Linear(n_fc + self.n_aux, n_fc)
        self.fc3 = nn.Linear(n_fc, self.n_classes)

    def forward(self, waveform: Tensor, specgram: Tensor) -> Tensor:
        r"""Pass the input through the WaveRNN model.

        Args:
            waveform: the input waveform to the WaveRNN layer (n_batch, 1, (n_time - kernel_size + 1) * hop_length)
            specgram: the input spectrogram to the WaveRNN layer (n_batch, 1, n_freq, n_time)

        Return:
            Tensor: shape (n_batch, 1, (n_time - kernel_size + 1) * hop_length, n_classes)
        """
        if waveform.size(1) != 1:
            raise ValueError("Require the input channel of waveform is 1")
        if specgram.size(1) != 1:
            raise ValueError("Require the input channel of specgram is 1")
        # remove channel dimension until the end
        waveform, specgram = waveform.squeeze(1), specgram.squeeze(1)

        batch_size = waveform.size(0)
        h1 = torch.zeros(1, batch_size, self.n_rnn, dtype=waveform.dtype, device=waveform.device)
        h2 = torch.zeros(1, batch_size, self.n_rnn, dtype=waveform.dtype, device=waveform.device)
        # output of upsample:
        # specgram: (n_batch, n_freq, (n_time - kernel_size + 1) * total_scale)
        # aux: (n_batch, n_output, (n_time - kernel_size + 1) * total_scale)
        specgram, aux = self.upsample(specgram)
        specgram = specgram.transpose(1, 2)
        aux = aux.transpose(1, 2)

        aux_idx = [self.n_aux * i for i in range(5)]
        a1 = aux[:, :, aux_idx[0] : aux_idx[1]]
        a2 = aux[:, :, aux_idx[1] : aux_idx[2]]
        a3 = aux[:, :, aux_idx[2] : aux_idx[3]]
        a4 = aux[:, :, aux_idx[3] : aux_idx[4]]

        x = torch.cat([waveform.unsqueeze(-1), specgram, a1], dim=-1)
        x = self.fc(x)
        res = x
        x, _ = self.rnn1(x, h1)

        x = x + res
        res = x
        x = torch.cat([x, a2], dim=-1)
        x, _ = self.rnn2(x, h2)

        x = x + res
        x = torch.cat([x, a3], dim=-1)
        x = self.fc1(x)
        x = self.relu1(x)

        x = torch.cat([x, a4], dim=-1)
        x = self.fc2(x)
        x = self.relu2(x)
        x = self.fc3(x)

        # bring back channel dimension
        return x.unsqueeze(1)

    @torch.jit.export
    def infer(self, specgram: Tensor, lengths: Optional[Tensor] = None) -> Tuple[Tensor, Optional[Tensor]]:
        r"""Inference method of WaveRNN.

        This function currently only supports multinomial sampling, which assumes the
        network is trained on cross entropy loss.

        Args:
            specgram (Tensor):
                Batch of spectrograms. Shape: `(n_batch, n_freq, n_time)`.
            lengths (Tensor or None, optional):
                Indicates the valid length of each audio in the batch.
                Shape: `(batch, )`.
                When the ``specgram`` contains spectrograms with different durations,
                by providing ``lengths`` argument, the model will compute
                the corresponding valid output lengths.
                If ``None``, it is assumed that all the audio in ``waveforms``
                have valid length. Default: ``None``.

        Returns:
            (Tensor, Optional[Tensor]):
            Tensor
                The inferred waveform of size `(n_batch, 1, n_time)`.
                1 stands for a single channel.
            Tensor or None
                If ``lengths`` argument was provided, a Tensor of shape `(batch, )`
                is returned.
                It indicates the valid length in time axis of the output Tensor.
        """

        device = specgram.device
        dtype = specgram.dtype

        specgram = torch.nn.functional.pad(specgram, (self._pad, self._pad))
        specgram, aux = self.upsample(specgram)
        if lengths is not None:
            lengths = lengths * self.upsample.total_scale

        output: List[Tensor] = []
        b_size, _, seq_len = specgram.size()

        h1 = torch.zeros((1, b_size, self.n_rnn), device=device, dtype=dtype)
        h2 = torch.zeros((1, b_size, self.n_rnn), device=device, dtype=dtype)
        x = torch.zeros((b_size, 1), device=device, dtype=dtype)

        aux_split = [aux[:, self.n_aux * i : self.n_aux * (i + 1), :] for i in range(4)]

        for i in range(seq_len):

            m_t = specgram[:, :, i]

            a1_t, a2_t, a3_t, a4_t = [a[:, :, i] for a in aux_split]

            x = torch.cat([x, m_t, a1_t], dim=1)
            x = self.fc(x)
            _, h1 = self.rnn1(x.unsqueeze(1), h1)

            x = x + h1[0]
            inp = torch.cat([x, a2_t], dim=1)
            _, h2 = self.rnn2(inp.unsqueeze(1), h2)

            x = x + h2[0]
            x = torch.cat([x, a3_t], dim=1)
            x = F.relu(self.fc1(x))

            x = torch.cat([x, a4_t], dim=1)
            x = F.relu(self.fc2(x))

            logits = self.fc3(x)

            posterior = F.softmax(logits, dim=1)

            x = torch.multinomial(posterior, 1).float()
            # Transform label [0, 2 ** n_bits - 1] to waveform [-1, 1]
            x = 2 * x / (2**self.n_bits - 1.0) - 1.0

            output.append(x)

        return torch.stack(output).permute(1, 2, 0), lengths
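

if __name__ == "__main__":
    # Smoke test (an illustrative sketch, not part of the torchaudio module): the
    # upsample scales multiply to hop_length (5 * 5 * 8 = 200), as WaveRNN.__init__
    # requires, and n_classes=512 gives 9-bit labels (log2(512) = 9).
    model = WaveRNN(upsample_scales=[5, 5, 8], n_classes=512, hop_length=200)

    n_time = 16  # arbitrary number of spectrogram frames for this sketch
    specgram = torch.rand(2, 1, 128, n_time)
    # forward() expects a waveform aligned with the upsampled spectrogram:
    # (n_time - kernel_size + 1) * hop_length = (16 - 5 + 1) * 200 = 2400 samples.
    waveform = torch.rand(2, 1, (n_time - 5 + 1) * 200)

    output = model(waveform, specgram)
    print(output.shape)  # torch.Size([2, 1, 2400, 512])

    # infer() pads the spectrogram itself and samples one value per output frame,
    # so every input frame yields hop_length samples: 16 * 200 = 3200.
    wave, _ = model.infer(specgram.squeeze(1))
    print(wave.shape)  # torch.Size([2, 1, 3200])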