Fepstrum Features: Design and Application to Conversational Speech Recognition

In this paper, we present the Fepstrum features,
a principled approach to estimating the modulation spectrum
of speech signals from their Hilbert envelopes in a nonparametric
way. The importance of the modulation spectrum
as a feature in automatic speech recognition (ASR) has
been established by several researchers over the past two to three
decades. However, traditionally, in the speech recognition
literature, modulation spectrum features have been extracted
as the DCT/DFT of the log Mel-filter energies over 10 to 15
frames. These Mel-filter energies are in turn computed from the
short-term spectrum (with a 20 to 30 ms long primary window).
We show that this approach leads to a crude approximation of
the modulation spectrum in the Mel-filter bands. Further, we
show that the log of a particular Mel filter's Hilbert envelope
(obtained over a primary analysis window of 100 ms) yields
a principled estimate of the amplitude modulation (AM) signal in
that band. The lower DCT coefficients (in the range 0 to 25 Hz)
of the AM signal form the fepstrum features. To assess
the effectiveness of the fepstrum features, we have performed
conversational telephony speech (CTS) recognition experiments
on the Switchboard (SWB) corpus using a recently developed
LVCSR library (IBM IrlTK). Our experiments indicate that
the fepstrum features, in simple concatenation with the short-term
spectral envelope features (MFCC), provide up to 2.5%
absolute improvement in phoneme recognition accuracy and a
2.5% to 3.5% absolute improvement in word recognition accuracy
on a 1.5-hour SWB test set with a 2,300-word vocabulary. We
also provide the details of our IrlTK LVCSR acoustic modeling
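
The per-band computation described above (Hilbert envelope over a 100 ms window, log, DCT, keep the coefficients up to 25 Hz) can be sketched as follows. This is a minimal illustration under stated assumptions, not the report's implementation: the Mel filtering step is omitted and the input is assumed to be an already band-limited signal; the function names (analytic_signal, dct2, fepstrum_band), the unnormalized DCT-II, the log floor, and the 8 kHz sampling rate in the demo are all choices made here for illustration.

```python
import numpy as np

def analytic_signal(x):
    """Analytic signal via the FFT: zero the negative frequencies and
    double the positive ones (the usual discrete Hilbert construction)."""
    n = len(x)
    spectrum = np.fft.fft(x)
    weights = np.zeros(n)
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0
        weights[1:n // 2] = 2.0
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * weights)

def dct2(x):
    """Unnormalized DCT-II evaluated directly from its definition."""
    n = len(x)
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    return np.cos(np.pi * k * (2 * m + 1) / (2 * n)) @ x

def fepstrum_band(band_signal, sample_rate, max_mod_hz=25.0, floor=1e-8):
    """Illustrative fepstrum-style features for one band-limited window.

    Assumption: band_signal is one analysis window (e.g. 100 ms) of the
    output of a single Mel filter; the filtering itself is not shown.
    """
    n = len(band_signal)
    # Hilbert envelope = magnitude of the analytic signal (the AM estimate).
    envelope = np.abs(analytic_signal(band_signal))
    # The floor avoids log(0); its value is an arbitrary choice here.
    log_env = np.log(np.maximum(envelope, floor))
    coeffs = dct2(log_env)
    # DCT-II coefficient k corresponds to k * sample_rate / (2 * n) Hz,
    # so keep every k whose frequency is at most max_mod_hz.
    n_keep = int(np.floor(max_mod_hz * 2 * n / sample_rate)) + 1
    return coeffs[:n_keep]

# Demo on a synthetic AM signal: 1 kHz carrier, 10 Hz modulation,
# one 100 ms window at an assumed 8 kHz sampling rate.
fs = 8000
t = np.arange(int(0.1 * fs)) / fs
envelope = 1.0 + 0.5 * np.cos(2 * np.pi * 10 * t)
x = envelope * np.cos(2 * np.pi * 1000 * t)
feats = fepstrum_band(x, fs)  # 6 coefficients: 0, 5, 10, 15, 20, 25 Hz
```

With a 100 ms window the DCT coefficients are spaced 5 Hz apart, so the 0 to 25 Hz range corresponds to only six coefficients per band in this sketch; a longer window would give a finer modulation-frequency resolution.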

By: Vivek Tyagi

Published in: RI11009 in 2011


