Speech to Text

Convert speech to text automatically with high accuracy.

Overview

Marsview Automatic Speech Recognition (ASR) technology accurately converts speech into text in live or batch mode. The API can be deployed in the cloud or on-premises. Get superior accuracy, speaker separation, punctuation, casing, word-level time markers, and more.

Marsview ASR Features

| Feature | Description |
| --- | --- |
| Speech-to-Text | Accurately converts speech into text in live or batch mode. |
| Automatic Punctuation | Accurately adds punctuation to the transcribed text. |
| Custom Vocabulary | Boosts domain-specific terminology, proper nouns, and abbreviations via a simple list or taxonomy of words and phrases. |
| Speaker Separation | Automatically detects the number of speakers in your audio file; each word in the transcription text can be associated with its speaker. |
| PII Detection | Transcript content containing Personally Identifiable Information (PII), such as phone numbers and social security numbers, can be redacted. Redacted text is replaced with special characters (#, *). |
| Sentence Topics | The most relevant topics, concepts, and discussion points from the conversation are generated based on the overall scope of the discussion. (For a detailed description of the available topic types, see the Topics section.) |

Model Types available

| Model | Use cases | Parameter |
| --- | --- | --- |
| Default | Best for all types of data and accents (English only). | Default |
| Custom language or accent | Tailored to your data; you can expect a large improvement over the default model. | Contact us for more info: [email protected] |

Input Type Supported: Video, Audio

post
Compute Metadata

https://api.marsview.ai/v1/conversation/compute
This method is used to upload an audio or video file on which metadata is to be computed. The settings object can be used to enable or disable metadata from different models. See the Overview section for the list of available models.
Request
Headers

| Header | Required | Type | Example |
| --- | --- | --- | --- |
| appSecret | required | string | <sample-app-secret> |
| appId | required | string | <sample-app-Id> |
| Content-Type | optional | string | application/json |
Body Parameters

| Parameter | Required | Type | Description |
| --- | --- | --- | --- |
| automatic_punctuation | optional | boolean | Enable or disable automatic punctuation on the generated transcript (defaults to true). |
| custom_vocabulary | optional | array | List of strings for custom vocabulary that should be boosted during transcription. |
| pii_detection | optional | boolean | PII is redacted in the transcript when set to true. |
| speech_to_text.enable | required | boolean | Speech to Text is computed when speech_to_text.enable is set to true under the settings object. |
Response
200: OK
A Transaction ID is returned in the JSON body once the processing job is launched successfully. This Transaction ID can be used to check the status of the job or to fetch its results once the metadata is computed.
{
  "status": true,
  "transaction_id": "32dcef1a-5724-4df8-a4a5-fb43c047716b"
}
400: Bad Request
This usually happens when the settings for computing the metadata are not configured correctly. Check the request object and the dependencies required to compute certain metadata objects (for example, Speech to Text has to be enabled for Action Items to be enabled).
{
  "status": false,
  "error": {
    "code": "MCST07",
    "message": "DependencyError: action_items depends on speech_to_text"
  }
}
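Dependency errors like the one above can also be caught client-side before a job is submitted. A minimal sketch: the `DEPENDENCIES` mapping below contains only the speech_to_text → action_items relationship documented here and is otherwise illustrative, and `check_dependencies` is not part of any Marsview SDK.

```python
# Known dependency from the error example above; other entries would follow
# the same pattern (this mapping is illustrative, not an official list).
DEPENDENCIES = {"action_items": "speech_to_text"}


def check_dependencies(settings):
    """Return DependencyError-style messages for a settings object."""
    errors = []
    for feature, prerequisite in DEPENDENCIES.items():
        feature_on = settings.get(feature, {}).get("enable")
        prereq_on = settings.get(prerequisite, {}).get("enable")
        if feature_on and not prereq_on:
            errors.append(f"DependencyError: {feature} depends on {prerequisite}")
    return errors
```

Running this check locally avoids a round trip that would end in a 400 response.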

Example API Call

Request

CURL
curl --request POST 'https://api.marsview.ai/v1/conversation/compute' \
--header 'appSecret: 32dcef1a-5724-4df8-a4a5-fb43c047716b' \
--header 'appId: 1ZrKT0tTv7rVWX-qNAKLc' \
--header 'Content-Type: application/json' \
--data-raw '{
  "settings": {
    "speech_to_text": {
      "enable": true,
      "pii_detection": false,
      "custom_vocabulary": ["Marsview", "OmeoDataPlatform"],
      "sentence_topics": true
    },
    "speaker_separation": {
      "enable": true
    }
  }
}'

Response

Given below is a sample response JSON when the status code is 200.

{
  "status": true,
  "transaction_id": "32dcef1a-5724-4df8-a4a5-fb43c047716b",
  "message": "Compute job for file-id: 32dcef1a-5724-4df8-a4a5-fb43c047716b launched successfully"
}
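The curl call above can also be issued from Python. Below is a minimal sketch using only the standard library, assuming the same endpoint, headers, and settings object; the helper names `build_compute_payload` and `launch_compute` are illustrative, not part of any Marsview SDK.

```python
import json
import urllib.request

API_URL = "https://api.marsview.ai/v1/conversation/compute"


def build_compute_payload(custom_vocabulary=None, pii_detection=False,
                          sentence_topics=True, speaker_separation=True):
    """Assemble the settings object described in the Body Parameters above."""
    return {
        "settings": {
            "speech_to_text": {
                "enable": True,
                "pii_detection": pii_detection,
                "custom_vocabulary": custom_vocabulary or [],
                "sentence_topics": sentence_topics,
            },
            "speaker_separation": {"enable": speaker_separation},
        }
    }


def launch_compute(app_id, app_secret, payload):
    """POST the payload and return the parsed JSON response (network call)."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "appId": app_id,
            "appSecret": app_secret,
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


payload = build_compute_payload(custom_vocabulary=["Marsview", "OmeoDataPlatform"])
# launch_compute("<sample-app-Id>", "<sample-app-secret>", payload)
```

The returned JSON carries the `transaction_id` shown in the sample response above.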

post
Request Metadata

https://api.marsview.ai/v1/conversation/fetch
This method is used to fetch specific metadata for a particular file_id. It can also be used for long polling to track compute progress via the status object.
Request
Headers

| Header | Required | Type | Example |
| --- | --- | --- | --- |
| Content-Type | optional | string | application/json |
| appId | optional | string | <sample-app-id> |
| appSecret | optional | string | <sample-app-secret> |

Body Parameters

| Parameter | Required | Type | Description |
| --- | --- | --- | --- |
| fileID | optional | string | fileId of the audio/video file. |
| data.speech_to_text | optional | boolean | Returns transcript data for the file_id once computed. |
Response
200: OK
The output consists of two objects. The data object returns the requested metadata once it is computed. The status object shows the current state of the requested metadata; the status for each metadata field can take the values "Queued", "Processing", or "Completed". Shown below are cases where the STT job is in the "Queued" state and in the "Completed" state.
QUEUED STATE
{
  "status": {
    "speech_to_text": "Queued"
  },
  "data": {
    "speech_to_text": {}
  }
}
COMPLETED STATE
{
  "status": {
    "speech_to_text": "Completed"
  },
  "data": {
    "speech_to_text": {
      "sentences": [
        ...
        {
          "sentence": "Be sure to check out the support document at marsview.ai",
          "start_time": "172200.0",
          "end_time": "175100.0",
          "speakers": [
            "2"
          ],
          "topics": [
            {
              "topic": "support document",
              "type": "AI Generated"
            }
          ]
        },
        {
          "sentence": "Sure, Thats what i was looking for, Thank You!",
          "start_time": "175100.0",
          "end_time": "177300.0",
          "speakers": [
            "1"
          ],
          "topics": []
        },
        ...
      ]
    }
  }
}
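Because each metadata field moves from "Queued" through "Processing" to "Completed", clients typically poll this endpoint until the status flips. A minimal polling sketch is shown below; the HTTP call is abstracted into a `fetch_fn` callable (which would wrap a POST to /v1/conversation/fetch), and the helper name `poll_speech_to_text` is illustrative.

```python
import time


def poll_speech_to_text(fetch_fn, interval_s=5.0, max_attempts=60):
    """Long-poll until status.speech_to_text reports "Completed".

    fetch_fn is any callable returning the parsed JSON body of the
    /v1/conversation/fetch response (e.g. an HTTP POST wrapper).
    """
    for _ in range(max_attempts):
        body = fetch_fn()
        state = body.get("status", {}).get("speech_to_text")
        if state == "Completed":
            return body["data"]["speech_to_text"]
        time.sleep(interval_s)
    raise TimeoutError("speech_to_text did not complete in time")
```

Injecting `fetch_fn` keeps the retry loop independent of the HTTP client and easy to test.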

Speech To Text Response Object Fields

| Field | Description |
| --- | --- |
| sentences | A list of sentence objects in order of occurrence in the input audio/video file. |
| sentence | The Marsview STT-generated transcript for that particular chunk. |
| start_time | Start time of the chunk in milliseconds. |
| end_time | End time of the chunk in milliseconds. |
| speakers | Speaker number for that particular chunk (refer to the Speaker Separation section of the documentation for more information). |
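As an example of consuming these fields, the sketch below walks the sentences list and renders a speaker-labeled transcript with mm:ss timestamps. The helper name `format_transcript` is illustrative; it assumes only the field layout documented above.

```python
def format_transcript(speech_to_text):
    """Render sentences as "[mm:ss] Speaker N: text" lines."""
    lines = []
    for item in speech_to_text.get("sentences", []):
        ms = float(item["start_time"])        # milliseconds, per the field table
        mm, ss = divmod(int(ms // 1000), 60)  # whole minutes and seconds
        speakers = ", ".join(item.get("speakers", [])) or "?"
        lines.append(f"[{mm:02d}:{ss:02d}] Speaker {speakers}: {item['sentence']}")
    return "\n".join(lines)
```

Applied to the first sentence in the sample response above, this yields `[02:52] Speaker 2: Be sure to check out the support document at marsview.ai`.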
