DeepDet: YAMNet with BottleNeck Attention Module (BAM) for TTS synthesis detection

Mahum, Rabbia; Irtaza, Aun; Javed, Ali; Mahmoud, Haitham A.; Hassan, Haseeb

doi:10.1186/s13636-024-00335-9

EURASIP Journal on Audio, Speech, and Music Processing

Table 1 Layer-wise details of our proposed MobileNet


Type	Activations	Learnable	Stride/channel	Total learnable
Image input	96 × 64 × 1	–	–	0
Convolution 2D (Conv)	48 × 32 × 32	Weights: 3 × 3 × 1 × 32 Bias: 1 × 1 × 32	32 3 × 3 × 1 convolutions Stride: [2 2] Padding: same	320
Instance normalization	48 × 32 × 32	Offset: 1 × 1 × 32 Scale: 1 × 1 × 32	32 Channels	64
ReLU	48 × 32 × 32	–	–	0
Grouped convolution depthwise (GConv DW)	48 × 32 × 32	Weights: 3 × 3 × 1 × 1 × 32 Bias: 1 × 1 × 1 × 32	32 groups of 1 3 × 3 × 1 convolutions Stride: [1 1] Padding: same	320
Instance normalization	48 × 32 × 32	Offset: 1 × 1 × 32 Scale: 1 × 1 × 32	–	64
ReLU	48 × 32 × 32	–	–	0
Conv	48 × 32 × 64	Weights: 1 × 1 × 32 × 64 Bias: 1 × 1 × 64	64 1 × 1 × 32 convolutions Stride: [1 1] Padding: same	2112 128 0
GConv DW	24 × 16 × 64	Weights: 3 × 3 × 1 × 1 × 64 Bias: 1 × 1 × 1 × 64	64 groups of 1 3 × 3 × 1 convolutions Stride: [2 2] Padding: same	640 128 0
Conv	24 × 16 × 128	Weights: 1 × 1 × 64 × 128 Bias: 1 × 1 × 128	128 1 × 1 × 64 convolutions Stride: [1 1] Padding: same	8320 256 0
GConv DW	24 × 16 × 128	Weights: 3 × 3 × 1 × 1 × 128 Bias: 1 × 1 × 1 × 128	128 groups of 1 3 × 3 × 1 convolutions Stride: [1 1] Padding: same	1280 256 0
Conv	24 × 16 × 128	Weights: 1 × 1 × 128 × 128 Bias: 1 × 1 × 128	128 1 × 1 × 128 convolutions Stride: [1 1] Padding: same	16,512 256 0
GConv DW	12 × 8 × 128	Weights: 3 × 3 × 1 × 1 × 128 Bias: 1 × 1 × 1 × 128	128 groups of 1 3 × 3 × 1 convolutions Stride: [2 2] Padding: same	1280 256 0
Conv	12 × 8 × 256	Weights: 1 × 1 × 128 × 256 Bias: 1 × 1 × 256	256 1 × 1 × 128 convolutions Stride: [1 1] Padding: same	33,024 512 0
GConv DW	12 × 8 × 256	Weights: 3 × 3 × 1 × 1 × 256 Bias: 1 × 1 × 1 × 256	256 groups of 1 3 × 3 × 1 convolutions Stride: [1 1] Padding: same	2560 512 0
Conv	12 × 8 × 256	Weights: 1 × 1 × 256 × 256 Bias: 1 × 1 × 256	256 1 × 1 × 256 convolutions Stride: [1 1] Padding: same	65,792 512 0
GConv DW	6 × 4 × 256	Weights: 3 × 3 × 1 × 1 × 256 Bias: 1 × 1 × 1 × 256	256 groups of 1 3 × 3 × 1 convolutions Stride: [2 2] Padding: same	2560 512 0
Conv	6 × 4 × 512	Weights: 1 × 1 × 256 × 512 Bias: 1 × 1 × 512	512 1 × 1 × 256 convolutions Stride: [1 1] Padding: same	131,584 1024 0
GConv DW	6 × 4 × 512	Weights: 3 × 3 × 1 × 1 × 512 Bias: 1 × 1 × 1 × 512	512 groups of 1 3 × 3 × 1 convolutions Stride: [1 1] Padding: same	5120 1024 0
Conv	6 × 4 × 512	Weights: 1 × 1 × 512 × 512 Bias: 1 × 1 × 512	512 1 × 1 × 512 convolutions Stride: [1 1] Padding: same	262,656 1024 0
GConv DW	6 × 4 × 512	Weights: 3 × 3 × 1 × 1 × 512 Bias: 1 × 1 × 1 × 512	512 groups of 1 3 × 3 × 1 convolutions Stride: [1 1] Padding: same	5120 1024 0
Conv	6 × 4 × 512	Weights: 1 × 1 × 512 × 512 Bias: 1 × 1 × 512	512 1 × 1 × 512 convolutions Stride: [1 1] Padding: same	262,656 1024 0
GConv DW	6 × 4 × 512	Weights: 3 × 3 × 1 × 1 × 512 Bias: 1 × 1 × 1 × 512	512 groups of 1 3 × 3 × 1 convolutions Stride: [1 1] Padding: same	5120 1024 0
Conv	6 × 4 × 512	Weights: 1 × 1 × 512 × 512 Bias: 1 × 1 × 512	512 1 × 1 × 512 convolutions Stride: [1 1] Padding: same	262,656 1024 0
GConv DW	6 × 4 × 512	Weights: 3 × 3 × 1 × 1 × 512 Bias: 1 × 1 × 1 × 512	512 groups of 1 3 × 3 × 1 convolutions Stride: [1 1] Padding: same	5120 1024 0
Conv	6 × 4 × 512	Weights: 1 × 1 × 512 × 512 Bias: 1 × 1 × 512	512 1 × 1 × 512 convolutions Stride: [1 1] Padding: same	262,656 1024 0
GConv DW	6 × 4 × 512	Weights: 3 × 3 × 1 × 1 × 512 Bias: 1 × 1 × 1 × 512	512 groups of 1 3 × 3 × 1 convolutions Stride: [1 1] Padding: same	5120 1024 0
Conv	6 × 4 × 512	Weights: 1 × 1 × 512 × 512 Bias: 1 × 1 × 512	512 1 × 1 × 512 convolutions Stride: [1 1] Padding: same	262,656 1024 0
GConv DW	3 × 2 × 512	Weights: 3 × 3 × 1 × 1 × 512 Bias: 1 × 1 × 1 × 512	512 groups of 1 3 × 3 × 1 convolutions Stride: [2 2] Padding: same	5120 1024 0
Conv	3 × 2 × 1024	Weights: 1 × 1 × 512 × 1024 Bias: 1 × 1 × 1024	1024 1 × 1 × 512 convolutions Stride: [1 1] Padding: same	525,312 2048 0
GConv DW	3 × 2 × 1024	Weights: 3 × 3 × 1 × 1 × 1024 Bias: 1 × 1 × 1 × 1024	1024 groups of 1 3 × 3 × 1 convolutions Stride: [1 1] Padding: same	10,240 2048 0
Conv	3 × 2 × 1024	Weights: 1 × 1 × 1024 × 1024 Bias: 1 × 1 × 1024	1024 1 × 1 × 1024 convolutions Stride: [1 1] Padding: same	1,049,600 2048 0
Conv	3 × 2 × 1024	Weights: 1 × 1 × 1024 × 1024 Bias: 1 × 1 × 1024	1024 1 × 1 × 1024 convolutions Stride: [1 1] Padding: same	1,049,600 2048 0
Avg. Pooling	1 × 1 × 1024	–	–	0
FC Layer	1 × 1 × 2	Weights: 2 × 1024 Bias: 2 × 1	–	2050
Softmax	1 × 1 × 2	–	Binary classifier	0
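
The "Total learnable" and "Activations" columns can be recomputed from the weight and bias shapes listed in each row. The sketch below is not taken from the article; the function names are hypothetical and chosen for illustration only. It assumes that a standard convolution has kh × kw × c_in × c_out weights plus one bias per output channel, that a depthwise grouped convolution has one kh × kw filter and one bias per channel, that instance normalization has one offset and one scale per channel, and that "same" padding yields an output size of ceil(input / stride). From the first 1 × 1 Conv row onward, the three numbers in the last column appear to be the counts for the convolution, the instance normalization that follows it, and the ReLU, in that order.

```python
import math


def conv2d_params(kh, kw, c_in, c_out):
    """Learnables of a standard convolution: weights plus one bias per filter."""
    return kh * kw * c_in * c_out + c_out


def depthwise_params(kh, kw, channels):
    """Learnables of a depthwise convolution: one kh x kw filter and one bias per channel."""
    return kh * kw * channels + channels


def instance_norm_params(channels):
    """Learnables of instance normalization: one offset and one scale per channel."""
    return 2 * channels


def dense_params(n_in, n_out):
    """Learnables of a fully connected layer: weight matrix plus one bias per output."""
    return n_in * n_out + n_out


def same_padding_output(size, stride):
    """Spatial size after a convolution with 'same' padding (assumed ceil(size / stride))."""
    return math.ceil(size / stride)


# A few rows of Table 1 recomputed under the assumptions above.
print(conv2d_params(3, 3, 1, 32))            # first Conv row: 320
print(depthwise_params(3, 3, 32))            # first GConv DW row: 320
print(instance_norm_params(32))              # instance normalization, 32 channels: 64
print(conv2d_params(1, 1, 32, 64))           # first 1 x 1 Conv row: 2112
print(conv2d_params(1, 1, 1024, 1024))       # last Conv rows: 1049600
print(same_padding_output(96, 2), same_padding_output(64, 2))  # 96 x 64 input at stride 2: 48 32
```

The same helpers can be applied to any other row by substituting the kernel size and channel counts given in its "Learnable" column.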
