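I'm using azure-speech to recognize an audio stream, based on Speech_recognition_samples.cpp. From the RecognitionResult class I can only get Text and m_duration, but how do I get the start time and end time of a recognized result? I know e.Result->Offset() returns an offset, but I'm still confused by it. My code is: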
void recognizeSpeech() {
    std::shared_ptr<SpeechConfig> config = SpeechConfig::FromSubscription("****", "****");
    config->RequestWordLevelTimestamps();
    auto pushStream = AudioInputStream::CreatePushStream();
    std::cout << "created push\n" << std::endl;
    auto audioInput = AudioConfig::FromStreamInput(pushStream);
    auto recognizer = SpeechRecognizer::FromConfig(config, audioInput);
    promise<void> recognitionEnd;
    recognizer->Recognizing.Connect([](const SpeechRecognitionEventArgs& e)
    {
        cout << "Recognizing:" << e.Result->Text << std::endl
             << " Offset=" << e.Result->Offset() << std::endl
             << " Duration=" << e.Result->Duration() << std::endl;
    });
    recognizer->Recognized.Connect([](const SpeechRecognitionEventArgs& e)
    {
        if (e.Result->Reason == ResultReason::RecognizedSpeech)
        {
            cout << "RECOGNIZED: Text=" << e.Result->Text << std::endl
                 << " Offset=" << e.Result->Offset() << std::endl
                 << " Duration=" << e.Result->Duration() << std::endl;
        }
        else if (e.Result->Reason == ResultReason::NoMatch)
        {
            cout << "NOMATCH: Speech could not be recognized." << std::endl;
        }
    });
    recognizer->Canceled.Connect([&recognitionEnd](const SpeechRecognitionCanceledEventArgs& e)
    {
        switch (e.Reason)
        {
        case CancellationReason::EndOfStream:
            cout << "CANCELED: Reach the end of the file." << std::endl;
            break;
        case CancellationReason::Error:
            cout << "CANCELED: ErrorCode=" << (int)e.ErrorCode << std::endl;
            cout << "CANCELED: ErrorDetails=" << e.ErrorDetails << std::endl;
            recognitionEnd.set_value();
            break;
        default:
            cout << "CANCELED: received unknown reason." << std::endl;
        }
    });
    recognizer->SessionStopped.Connect([&recognitionEnd](const SessionEventArgs& e)
    {
        cout << "Session stopped.";
        recognitionEnd.set_value(); // Notify to stop recognition.
    });
    WavFileReader reader(FILE_NAME);
    vector<uint8_t> buffer(1000);
    recognizer->StartContinuousRecognitionAsync().wait();
    int readSamples = 0;
    while((readSamples = reader.Read(buffer.data(), (uint32_t)buffer.size())) != 0)
    {
        pushStream->Write(buffer.data(), readSamples);
    }
    pushStream->Close();
    recognitionEnd.get_future().get();
    recognizer->StopContinuousRecognitionAsync().get();
}
The output is:
Recognizing:my
Offset=6800000
Duration=2700000
Recognizing:my voice is
Offset=6800000
Duration=8500000
Recognizing:my voice is my
Offset=6800000
Duration=9800000
Recognizing:my voice is my passport
Offset=6800000
Duration=14400000
Recognizing:my voice is my passport verify me
Offset=6800000
Duration=26100000
RECOGNIZED: Text=My voice is my passport, verify me.
Offset=6800000
Duration=28100000
CANCELED: Reach the end of the file.
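Why is the offset always 6800000 for every result? I expected it to keep increasing, for example: "my" would start at offset 0 and end at 100000, and "my voice is" would start at 0 and end at 200000. Then I could get the start time and end time of "my voice is" within the sentence. As it stands, how can I get the start time and end time within the sentence for each result?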