Attentive Statistics Pooling for Deep Speaker Embedding

Koji Okabe; Takafumi Koshinaka; Koichi Shinoda

doi:10.21437/Interspeech.2018-993

Publication Information

Title

Japanese:
English:	Attentive Statistics Pooling for Deep Speaker Embedding

Author

Japanese:	岡部浩司, 越仲孝文, 篠田浩一.
English:	Koji Okabe, Takafumi Koshinaka, Koichi Shinoda.

Language

English

Journal/Book name

Japanese:
English:	Proc. Interspeech 2018

Volume, Number, Page

pp. 2252-2256

Published date

Sept. 4, 2018

Publisher

Japanese:
English:	ISCA

Conference name

Japanese:
English:	Interspeech 2018

Conference site

Japanese:	ハイデラバード
English:	Hyderabad

Official URL

https://www.isca-speech.org/archive/Interspeech_2018/pdfs/0993.pdf

DOI

https://doi.org/10.21437/Interspeech.2018-993

Abstract

This paper proposes attentive statistics pooling for deep speaker embedding in text-independent speaker verification. In conventional speaker embedding, frame-level features are averaged over all frames of a single utterance to form an utterance-level feature. Our method uses an attention mechanism to assign different weights to different frames and generates not only weighted means but also weighted standard deviations. In this way, it can capture long-term variations in speaker characteristics more effectively. An evaluation on the NIST SRE 2012 and VoxCeleb data sets shows that it reduces equal error rates (EERs) by 7.5% and 8.1%, respectively, compared with the conventional method.
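
The pooling step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `attentive_stats_pooling`, the parameters `W`, `b`, and `v`, the single-hidden-layer attention scorer, and the `tanh` nonlinearity are all assumptions made for illustration, since the abstract does not specify them. The sketch only shows how per-frame attention weights could yield a weighted mean and a weighted standard deviation that are concatenated into one utterance-level vector.

```python
import numpy as np


def attentive_stats_pooling(frames, W, b, v):
    """Pool frame-level features into one utterance-level vector.

    frames: (T, D) array of frame-level features for one utterance.
    W, b, v: hypothetical attention parameters with shapes (D, A), (A,), (A,).
    Returns (embedding, weights), where embedding has shape (2 * D,).
    """
    # Scalar attention score per frame (assumed scorer: one hidden layer + tanh).
    scores = np.tanh(frames @ W + b) @ v           # shape (T,)

    # Softmax over frames, shifted by the max for numerical stability.
    scores = scores - scores.max()
    weights = np.exp(scores)
    weights = weights / weights.sum()              # shape (T,), sums to 1

    # Weighted mean and weighted standard deviation over frames.
    mean = weights @ frames                        # shape (D,)
    var = weights @ (frames ** 2) - mean ** 2
    std = np.sqrt(np.clip(var, 1e-12, None))       # clip guards against tiny negatives

    return np.concatenate([mean, std]), weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(200, 16))            # 200 frames, 16-dim features
    W = rng.normal(size=(16, 8))
    b = np.zeros(8)
    v = rng.normal(size=8)
    embedding, weights = attentive_stats_pooling(frames, W, b, v)
    print(embedding.shape, weights.sum())
```

When every frame receives the same score (for example, with `v` set to zeros), the weights become uniform and the result reduces to ordinary mean and standard deviation pooling, which corresponds to the unweighted case the abstract contrasts against.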
