Use R to Transcribe Zoom Audio Files for Use with zoomGroupStats

The zoomGroupStats functions that I’ve been building over the past few months have, to date, relied heavily on the transcription that is created automatically when a meeting is recorded to the Zoom Cloud. This is an excellent option if your account has access to Cloud Recording; however, it can be an obstacle if you want meeting leaders to record their own meetings (locally) and send you the file. In a recent project, for example, I had many meeting leaders who accidentally recorded their meetings locally, which left me without a transcript of the meeting.

This week I’ve started building a set of functions that take in an audio file from a Zoom meeting (they could also take in the video file, but that is unnecessary) and output the same transcript object that the processZoomTranscript function in zoomGroupStats produces. These functions rely on AWS Transcribe and S3. There are currently just two functions: one that launches a transcription job (since jobs run asynchronously) and one that parses the output of a finished transcription job.
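Because the jobs run asynchronously, the parsing function should only be called once a job has finished. Here is a minimal sketch of one way to wait for that. The names waitForTranscription, awsJobStatus, and getStatus are hypothetical (they are not part of zoomGroupStats), and the default status lookup assumes the paws get_transcription_job call and that your AWS credentials are already configured.

```r
# Hypothetical helpers (not part of zoomGroupStats): wait for a job to finish.

# Look up the status of a transcription job. Assumes the paws package is
# installed and AWS credentials are configured.
awsJobStatus = function(jobName) {
    svc = paws::transcribeservice()
    job = svc$get_transcription_job(TranscriptionJobName = jobName)
    job$TranscriptionJob$TranscriptionJobStatus
}

# Poll until the job reaches a terminal state or we run out of checks.
# getStatus can be any function that takes a job name and returns a status string.
waitForTranscription = function(jobName, getStatus = awsJobStatus, waitSeconds = 30, maxChecks = 120) {
    for(check in seq_len(maxChecks)) {
        status = getStatus(jobName)
        # COMPLETED and FAILED are the terminal states of a transcription job
        if(status %in% c("COMPLETED", "FAILED")) {
            return(status)
        }
        Sys.sleep(waitSeconds)
    }
    # The job never reached a terminal state within maxChecks checks
    return("TIMEOUT")
}
```

With something like this in place, you could call processZoomAudio only when waitForTranscription(jobName) returns "COMPLETED". Passing the status lookup in as an argument also makes it possible to try the loop without touching AWS.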

Note that these functions currently use the default segmenting algorithm in AWS Transcribe. From reviewing several transcriptions, I don’t think it is very good. If your work requires utterance-level analysis (e.g., average utterance length), I would consider defining your own segmentation approach. The functions can output a simple text file transcript (set writeTranscript=TRUE), so you could use that to do a custom segmentation.
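As one illustration of a custom segmentation, the sketch below merges runs of consecutive segments from the same speaker into a single turn. It is only one possible approach, not what zoomGroupStats does. The function name mergeSpeakerTurns is hypothetical, and it assumes a data frame with the column names listed in the OUTPUT section of processZoomAudio and no missing values in user_name.

```r
# Hypothetical helper: merge consecutive utterances by the same speaker.
# Assumes columns utterance_start_seconds, utterance_end_seconds,
# user_name, and utterance_message, with no NA values in user_name.
mergeSpeakerTurns = function(utterances) {
    if(nrow(utterances) == 0) {
        return(utterances)
    }
    utterances = utterances[order(utterances$utterance_start_seconds), ]

    # A new turn starts whenever the speaker differs from the previous row
    speakers = utterances$user_name
    newTurn = c(TRUE, speakers[-1] != speakers[-length(speakers)])
    turnId = cumsum(newTurn)

    turns = data.frame(
        utterance_id = unique(turnId),
        utterance_start_seconds = as.numeric(tapply(utterances$utterance_start_seconds, turnId, min)),
        utterance_end_seconds = as.numeric(tapply(utterances$utterance_end_seconds, turnId, max)),
        user_name = as.character(tapply(speakers, turnId, function(x) x[1])),
        utterance_message = as.character(tapply(utterances$utterance_message, turnId, paste, collapse = " ")),
        stringsAsFactors = FALSE
    )
    # Length of each merged turn in seconds
    turns$utterance_time_window = turns$utterance_end_seconds - turns$utterance_start_seconds
    return(turns)
}
```

Whether merging by speaker is appropriate depends on your research question. Long pauses within a turn, for example, might be better treated as breaks between utterances.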

############################################################
# transcribeZoomAudio Function
############################################################

# Zoom Audio File Processing, Function to launch transcription jobs
# This function starts an audio transcription job only; it does not output anything of use. However,
# it is useful for batch uploading audio files and starting transcription jobs for them.

# This can be done with a local file (uploads to a specified s3 bucket) or with a file that already
# exists in an s3 bucket

# example call:             transcribeZoomAudio(fileLocation="local", bucketName="my-transcription-bucket", filePath="mylocalfile.m4a", jobName="mylocalfile.m4a", numSpeakers=3, languageCode="en-US")

# INPUT ARGUMENTS:
# fileLocation:             either "local" or "s3" - if local, then this function will upload the file to the specified bucket
# bucketName:               name of an existing s3 bucket that you are using for storing audio files to transcribe and finished transcriptions
# filePath:                 the path to the local file or to the s3 file (depending on whether it is "local" or "s3")
# jobName:                  the name of the transcription job for aws -- I set this to the same as the filename (without path) for convenience
# numSpeakers:              this helps AWS identify the speakers in the clip - specify how many speakers you expect
# languageCode:             the code for the language (e.g., en-US)

# OUTPUT:
# None

transcribeZoomAudio = function(fileLocation, bucketName, filePath, jobName, numSpeakers, languageCode) {
    require(paws)

    # First, if the file location is local, then upload it into the
    # designated s3 bucket
    if(fileLocation == "local") {
        localFilePath = filePath
        svc = s3()
        upload_file = file(localFilePath, "rb")
        upload_file_in = readBin(upload_file, "raw", n = file.size(localFilePath))
        svc$put_object(Body = upload_file_in, Bucket = bucketName, Key = jobName)
        filePath = paste("s3://", bucketName, "/",jobName, sep="")
        close(upload_file)
    }

    svc = transcribeservice()  
    svc$start_transcription_job(TranscriptionJobName = jobName, LanguageCode = languageCode, Media = list(MediaFileUri = filePath), OutputBucketName = bucketName, Settings = list(ShowSpeakerLabels=TRUE, MaxSpeakerLabels=numSpeakers))
}


############################################################
# processZoomAudio Function
############################################################

# Zoom Audio File Processing, process finished transcriptions
# This function parses the JSON transcription completed by AWS transcribe.
# The output is the same as the processZoomTranscript function.

# example call:             audio.out = processZoomAudio(bucketName = "my-transcription-bucket", jobName = "mylocalfile.m4a", localDir = "path-to-local-directory-for-output", speakerNames = c("Tom Smith", "Jamal Jones", "Jamika Jensen"), recordingStartDateTime = "2020-06-20 17:00:00", writeTranscript=TRUE)

# INPUT ARGUMENTS:
# bucketName:               name of the s3 bucket where the finished transcript is stored
# jobName:                  name of the transcription job (see above - i usually set this to the filename of the audio)
# localDir:                 a local directory where you can save the aws json file and also a plain text file of the transcribed text
# speakerNames:             a vector with the Zoom user names of the speakers, in the order in which they appear in the audio clip.
# recordingStartDateTime:   the date/time that the meeting recording started
# writeTranscript:          a boolean to indicate whether you want to output a plain text file of the transcript
# languageCode:             the code for the language of the transcript (e.g., en-US); defaults to "en-US"

# OUTPUT:
# utterance_id:             an incremented numeric identifier for a marked speech utterance
# utterance_start_seconds:  the number of seconds from the start of the recording (when it starts)
# utterance_start_time:     the timestamp for the start of the utterance
# utterance_end_seconds:    the number of seconds from the start of the recording (when it ends)
# utterance_end_time:       the timestamp for the end of the utterance
# utterance_time_window:    the number of seconds that the utterance took
# user_name:                the name attached to the utterance
# utterance_message:        the text of the utterance
# utterance_language:       the language code for the transcript



processZoomAudio = function(bucketName, jobName, localDir, speakerNames=c(), recordingStartDateTime, writeTranscript, languageCode="en-US") {
    require(paws)
    require(jsonlite)

    transcriptName = paste(jobName, "json", sep=".")
    svc = s3()
    transcript = svc$get_object(Bucket = bucketName, Key = transcriptName)
    # Write the binary component of the downloaded object to the local path
    writeBin(transcript$Body, con = paste(localDir, transcriptName, sep="/"))
    tr.json = fromJSON(paste(localDir, transcriptName, sep="/"))

    if(writeTranscript) {
        outTranscript = paste(localDir, "/", jobName, ".txt", sep="")
        write(tr.json$results$transcripts$transcript, outTranscript)
    }

    # This IDs the words as AWS broke out the different segments of speech
    # (speaker_labels is the full name of this element in the AWS Transcribe JSON)
    speaker_segments = tr.json$results$speaker_labels$segments
    for(i in seq_along(speaker_segments$items)){

        res.line = speaker_segments$items[[i]]
        res.line$segment_id = i
        if(i == 1) {
            res.out = res.line
        } else {
            res.out = rbind(res.out, res.line)
        }

    }

    segments = res.out
    segment_cuts = speaker_segments[,c("start_time", "speaker_label", "end_time")]

    # Pull this apart to just get the word/punctuation with the most confidence
    # Not currently dealing with any of the alternatives that AWS could give
    for(i in 1:length(tr.json$results$items$alternatives)) {

        res.line = tr.json$results$items$alternatives[[i]]

        if(i == 1) {
            res.out = res.line
        } else {
            res.out = rbind(res.out, res.line)
        }

    }

    words = cbind(res.out, tr.json$results$items[,c("start_time", "end_time", "type")])
    words = words[words$type == "pronunciation", ]
    words_segments = merge(words, segments, by=c("start_time", "end_time"), all.x=T)

    words_segments$start_time = as.numeric(words_segments$start_time)
    words_segments$end_time = as.numeric(words_segments$end_time)

    words_segments = words_segments[order(words_segments$start_time), ]
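    segment_ids = unique(words_segments$segment_id)
    # Drop the NA that appears when a word did not match any segment
    segment_ids = segment_ids[!is.na(segment_ids)]

    segment_cuts$utterance_id = NA
    segment_cuts$utterance_message = NA
    for(utterance_id in segment_ids) {
        # segment_id is the row index of the segment in segment_cuts, so index by it directly
        segment_cuts[utterance_id, "utterance_id"] = utterance_id
        words_in_segment = which(words_segments$segment_id == utterance_id)
        segment_cuts[utterance_id, "utterance_message"] = paste0(words_segments[words_in_segment, "content"], collapse=" ")
    }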

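    if(length(speakerNames) > 0) {
        user_names = data.frame(0:(length(speakerNames)-1), speakerNames, stringsAsFactors=F)
        names(user_names) = c("speaker_label", "user_name")
        user_names$speaker_label = paste("spk",user_names$speaker_label, sep="_")
        segment_cuts = merge(segment_cuts, user_names, by="speaker_label", all.x=T)
    } else {
        # No names supplied, so leave user_name missing (marked UNIDENTIFIED below)
        segment_cuts$user_name = NA
    }

    # Rename and convert the time columns by name rather than by position,
    # because the column order depends on whether the merge above ran
    names(segment_cuts)[names(segment_cuts) == "start_time"] = "utterance_start_seconds"
    names(segment_cuts)[names(segment_cuts) == "end_time"] = "utterance_end_seconds"
    segment_cuts$utterance_start_seconds = as.numeric(segment_cuts$utterance_start_seconds)
    segment_cuts$utterance_end_seconds = as.numeric(segment_cuts$utterance_end_seconds)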
    segment_cuts = segment_cuts[order(segment_cuts$utterance_start_seconds), ]

    # Now turn these into actual datetime values
    recordingStartDateTime = as.POSIXct(recordingStartDateTime)
    segment_cuts$utterance_start_time = recordingStartDateTime + segment_cuts$utterance_start_seconds
    segment_cuts$utterance_end_time = recordingStartDateTime + segment_cuts$utterance_end_seconds

    # Create a time window (in seconds) for the utterances -- how long is each in seconds
    segment_cuts$utterance_time_window = as.numeric(difftime(segment_cuts$utterance_end_time, segment_cuts$utterance_start_time, units="secs"))

    # Prepare the output file
    res.out = segment_cuts[, c("utterance_id", "utterance_start_seconds", "utterance_start_time", "utterance_end_seconds", "utterance_end_time", "utterance_time_window", "user_name", "utterance_message")]

    # Mark as unidentified any user with a blank username
    res.out$user_name = ifelse(res.out$user_name == "" | is.na(res.out$user_name), "UNIDENTIFIED", res.out$user_name)      

    # Add the language code
    res.out$utterance_language = languageCode

    return(res.out)    

}
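
Once you have the output of processZoomAudio, a per-speaker summary is a quick sanity check on the transcription. The sketch below is an illustration only. The function name speakerSummary is hypothetical, and it assumes the user_name and utterance_time_window columns described in the OUTPUT section above.

```r
# Hypothetical helper: summarize speaking time for each speaker.
# Assumes columns user_name and utterance_time_window (in seconds).
speakerSummary = function(utterances) {
    counts = aggregate(utterance_time_window ~ user_name, data = utterances, FUN = length)
    totals = aggregate(utterance_time_window ~ user_name, data = utterances, FUN = sum)
    names(counts)[2] = "num_utterances"
    names(totals)[2] = "total_seconds"

    res = merge(counts, totals, by = "user_name")
    # Average utterance length in seconds
    res$mean_seconds = res$total_seconds / res$num_utterances
    # Sort so that the person who spoke the most comes first
    res = res[order(-res$total_seconds), ]
    return(res)
}
```

For example, speakerSummary(audio.out) would return one row per speaker. Keep in mind that the averages depend on the default AWS segmentation discussed earlier.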

Using R to Analyze Zoom Recordings

UPDATE: 2021-05-13: zoomGroupStats is now available as a package on CRAN.


# To use the stable release version of zoomGroupStats, use:
install.packages("zoomGroupStats")

# To use the development version, which will be regularly updated with new functionality, use:
library(devtools)
install_github("https://github.com/andrewpknight/zoomGroupStats")

You can stay up-to-date on the latest functionality on http://zoomgroupstats.org.

UPDATE: 2021-04-30: In response to prodding from several folks who have been using the functions, I have started to build these R functions for analyzing Zoom meetings as a package. I’m grateful for all of the feedback and suggestions for features and modifications to this project. Although things are still in flux, you can now access a package version of zoomGroupStats. The easiest way is to use the devtools package and install the current development version from my GitHub repository. To do so, you can run:


library(devtools)
install_github("https://github.com/andrewpknight/zoomGroupStats", force=TRUE)
library(zoomGroupStats)

I have created a multi-part guide for using this package to conduct research using Zoom and analyze data from Zoom, which I will continue to extend and elaborate. To keep up-to-date on this work, please use the dedicated package website http://zoomgroupstats.org.

Please keep the feedback and comments coming!

UPDATE: 2021-02-12: I’ve updated several items in the package. I’m also going to move the development and updates over to GitHub to ease version control and documentation. For now, here is a post that is from a recent live tutorial session I facilitated. Stay tuned!

UPDATE: 2020-07-27: The new file contains some alpha-stage functions for doing audio transcription and for conducting a windowed conversation analysis. I haven’t yet tested these functions extensively or commented the windowed conversation analysis. Once COVID teaching planning eases, I’ll get back to updating this further.

UPDATE: 2020-04-14: I added a new function (textConversationAnalysis) that gives some very basic and descriptive conversation metrics from either the video transcript or the chat file.

You can always access the most recent version of the functions by including the statement source("http://apknight.org/zoomGroupStats.R") at the top of your code.

Alternatively, you could go to http://apknight.org/zoomGroupStats.R and copy/paste the code into your editor.

ORIGINAL POST FOLLOWS:

In response to the shift to so many online meetings, I created a set of R functions to help do research using web-based meetings. In brief, these functions use the output of recorded sessions (e.g., video feed, transcript file, chat file) to do things like sentiment analysis, face analysis, and emotional expression analysis. In the coming week, I will extend these to do basic conversation analysis (e.g., who speaks/chats most, turn-taking).

I went overboard in commenting the code so that hopefully others can use them. But, if you’re still having trouble getting them to work, please don’t hesitate to reach out to me.

You can directly access the functions here:
http://apknight.org/zoomGroupStats.R

After reviewing this, you could call these functions using the following statement in R: source("http://apknight.org/zoomGroupStats.R")

The effects of group affect on social integration and task performance

Knight, A. P., & Eisenkraft, N. (2015). Positive is usually good, negative is not always bad: The effects of group affect on social integration and task performance. Journal of Applied Psychology, 100, 1214-1227.

Abstract. Grounded in a social functional perspective, this article examines the conditions under which group affect influences group functioning. Using meta-analysis, the authors leverage heterogeneity across 39 independent studies of 2,799 groups to understand how contextual factors—group affect source (exogenous or endogenous to the group) and group life span (one-shot or ongoing)—moderate the influence of shared feelings on social integration and task performance. As predicted, results indicate that group positive affect has consistent positive effects on social integration and task performance regardless of contextual idiosyncrasies. The effects of group negative affect, on the other hand, are context-dependent. Shared negative feelings promote social integration and task performance when stemming from an exogenous source or experienced in a 1-shot group, but undermine social integration and task performance when stemming from an endogenous source or experienced in an ongoing group. The authors discuss implications of their findings and highlight directions for future theory and research on group affect.