Google Wake Words in US English

Project Overview

Objective

Our company built a dataset of audio clips in which speakers say the “Hey Google” or “OK Google” wake word in US English. The dataset supports training and evaluating wake word detection for voice assistants, and it reflects our experience in gathering and annotating high-quality data for machine learning models.

Scope

We gathered audio recordings from US English speakers with a range of accents, captured in a range of contexts. Our team annotated each recording with wake word markers, together with speaker and recording-condition metadata.

Sources

Voice Assistant Users: We collaborated with Google Assistant users who consented to contribute audio clips of themselves saying “Hey Google” or “OK Google” in different contexts.
Voice Actors: We hired professional voice actors to record scripted wake word utterances for added diversity and control.
Public Domain Recordings: We extracted instances of the “Hey Google” or “OK Google” wake words in US English from publicly available audio recordings.

Data Collection Metrics

Total Audio Clips Collected and Annotated: 60,000 clips
User Contributions: 36,000
Voice Actor Recordings: 12,000
Public Domain Extracts: 12,000
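
As a quick consistency check on the figures above, the following sketch sums the per-source counts and derives each source's share of the dataset. The dictionary keys are illustrative names chosen for this example, not identifiers from the project.

```python
# Clip counts per source, copied from the metrics above.
# The key names are hypothetical labels used only in this sketch.
clips_by_source = {
    "user_contributions": 36_000,
    "voice_actor_recordings": 12_000,
    "public_domain_extracts": 12_000,
}

# The per-source counts should add up to the reported total.
total = sum(clips_by_source.values())

# Fraction of the dataset contributed by each source.
shares = {name: count / total for name, count in clips_by_source.items()}

print(f"total clips: {total}")
for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```

The counts sum to the reported 60,000 clips, with user contributions making up 60% and the other two sources 20% each.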

Annotation Process

Stages

Wake Word Annotation: We accurately identified and marked the “Hey Google” or “OK Google” wake words in each audio clip.
Speaker Demographics: Our team collected and annotated demographic metadata, including age, accent, and gender, for each speaker.
Recording Conditions: We documented and annotated the recording conditions of each clip, such as background noise and acoustic environment.
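
The three annotation stages above could be combined into one record per clip. The sketch below is a minimal illustration under assumptions: the class name, field names, value formats, and example values are all hypothetical, and the actual annotation schema used in the project is not described in this case study.

```python
from dataclasses import dataclass, asdict
import json

# Wake words covered by the dataset, lowercased for comparison.
WAKE_WORDS = {"hey google", "ok google"}


@dataclass
class WakeWordAnnotation:
    """Hypothetical per-clip annotation record (all field names are assumptions)."""

    clip_id: str
    source: str              # e.g. "user", "voice_actor", "public_domain"
    wake_word: str           # which wake word is spoken in the clip
    start_s: float           # wake word start time within the clip, in seconds
    end_s: float             # wake word end time within the clip, in seconds
    speaker_age_range: str   # speaker demographics
    speaker_accent: str
    speaker_gender: str
    background_noise: str    # recording conditions
    environment: str

    def __post_init__(self):
        # Reject records whose wake word label is outside the expected set.
        if self.wake_word not in WAKE_WORDS:
            raise ValueError(f"unexpected wake word: {self.wake_word!r}")
        # The marker must be a non-empty interval starting at or after zero.
        if not 0 <= self.start_s < self.end_s:
            raise ValueError("wake word marker must satisfy 0 <= start_s < end_s")


# Made-up example values, for illustration only.
example = WakeWordAnnotation(
    clip_id="clip_00001",
    source="user",
    wake_word="ok google",
    start_s=0.42,
    end_s=1.10,
    speaker_age_range="25-34",
    speaker_accent="Southern US",
    speaker_gender="female",
    background_noise="low",
    environment="kitchen",
)

print(json.dumps(asdict(example), indent=2))
```

Validating the wake word label and the time markers when the record is created catches malformed annotations before they reach later review stages.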

Annotation Metrics

Audio Clips with Wake Word Annotations: 60,000
Speaker Demographic Metadata: 60,000
Recording Condition Metadata: 60,000

Quality Assurance

Stages

Annotation Verification: We employed automated tools and human reviewers to ensure the accuracy of wake word annotations.
User Consent: We maintained strict privacy standards, ensuring that every user-contributed audio clip was covered by the contributor's explicit consent for use.
Privacy Compliance: We adhered to privacy regulations, including data retention policies and opt-out options for contributors.

QA Metrics

Annotation Validation Cases: 6,000 (10% of total)
Privacy Audits: 36,000 (all user-contributed clips)
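
One way to draw a validation subset of the size reported above is a reproducible random sample. This is a sketch under assumptions: the case study does not say how the validation cases were selected, and the function name, clip ID format, and seed below are hypothetical.

```python
import random


def select_validation_sample(clip_ids, fraction=0.10, seed=0):
    """Return a reproducible random subset of clip IDs for manual review."""
    rng = random.Random(seed)  # fixed seed so the same sample can be re-drawn
    sample_size = round(len(clip_ids) * fraction)
    return rng.sample(clip_ids, sample_size)


# Hypothetical clip IDs standing in for the 60,000 annotated clips.
clip_ids = [f"clip_{i:05d}" for i in range(60_000)]

validation_ids = select_validation_sample(clip_ids)
print(len(validation_ids))
```

With 60,000 clips and a 10% fraction, the sample contains 6,000 clips, matching the number of annotation validation cases. A stratified sample by source or accent would be a reasonable alternative if even coverage across groups matters.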

Conclusion

The Google Wake Words Dataset in US English reflects our expertise in data collection and annotation. Its 60,000 annotated clips are a resource for advancing voice recognition and natural language processing technologies.