August 5, 2021 — Josh
You don't need to install anything on macOS to start having fun with sound.
You can play an audio file using afplay (Audio File Play):
$ afplay example.mp3
Once the song is finished, control will be returned to your shell (or you can press CTRL-C to stop it early).
Background Audio:
If you want to continue using the shell without waiting for the process to finish, you can run afplay in the background:
$ afplay example.mp3 &
[1] 7946
In this example, 7946 is the process ID (PID) of the afplay instance that's now running. If you want to stop it before the song ends, you can use kill:
$ kill 7946
If you don't remember the PID, you can also run killall afplay, or:
$ jobs
[1]+ Running afplay example.mp3 &
$ fg 1 # then hit CTRL-C
(These utilities are part of a system known as job control.)
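In a script there's no prompt to print the PID for you, but the shell keeps it in $! right after you background something. Here's a minimal sketch of that idea. play_for is a made-up helper name (nothing that ships with macOS), and it works with any command, not just afplay.

```shell
# play_for (hypothetical helper): run a command in the background,
# let it go for N seconds, then stop it.
play_for() {
    local seconds="$1"
    shift
    "$@" &                      # e.g. afplay example.mp3
    local pid=$!                # PID of the job we just started
    sleep "$seconds"
    kill "$pid" 2>/dev/null     # harmless if it already finished
    wait "$pid" 2>/dev/null     # reap it so the shell doesn't print a notice
    return 0
}
```

Something like play_for 10 afplay example.mp3 should then play roughly the first ten seconds and hand the shell back.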
Although afplay doesn't say so directly, I believe it supports the following file formats:
.3gp, .3g2, .aac, .adts, .ac3, .aifc, .aiff, .aif, .amr, .m4a, .m4r, .m4b, .caf, .ec3, .flac, .mp1, .mpeg, .mpa, .mp2, .mp3, .mp4, .snd, .au, .sd2, .wav
So basically anything besides OGG files and some Windows stuff? (BTW, I found this information by running afconvert -hf, so it's possible that the same does not apply to afplay.)
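Other than that, it doesn't do much: it can play songs at different rates (i.e. slower or faster), set the playback volume, and stop after a given number of seconds (see the help output in the Reference at the bottom).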
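The help output in the Reference section lists -r (rate) and -t (time), so a quick "preview" helper might look like the sketch below. preview and the PLAYER variable are names made up for illustration (PLAYER only exists so you can dry-run it with echo), and the default rate of 1 assumes that 1 means normal speed.

```shell
# preview (hypothetical helper): play part of a file at a given rate.
#   $1 = file, $2 = seconds for -t (default 10), $3 = rate for -r (default 1)
# PLAYER is a made-up override; it defaults to afplay.
preview() {
    "${PLAYER:-afplay}" -t "${2:-10}" -r "${3:-1}" "$1"
}
```

For example, preview example.mp3 5 2 asks afplay for 5 seconds at double rate, and PLAYER=echo preview example.mp3 5 2 just prints the arguments it would have passed.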
You can view the metadata of an audio file using afinfo (Audio File Info):
$ afinfo example.mp3
File: /Users/Josh/example.mp3
File type ID: MPG3
Num Tracks: 1
----
Data format: 2 ch, 44100 Hz, '.mp3' (0x00000000) 0 bits/channel, 0 bytes/packet, 1152 frames/packet, 0 bytes/frame
no channel layout.
estimated duration: 330.893061 sec
audio bytes: 6612816
audio packets: 12667
bit rate: 159878 bits per second
packet size upper bound: 1052
maximum packet size: 835
audio data file offset: 526
optimized
audio 14590464 valid frames + 576 priming + 1344 remainder = 14592384
----
The estimated duration field comes in handy often. Hint: afinfo example.mp3 | grep "duration:" | cut -d' ' -f3 will get you the duration of an audio file in seconds.
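If you'd rather see minutes and seconds, the same pipeline can be wrapped up like this. It's only a sketch: duration_from_afinfo and to_mmss are hypothetical helper names, and it assumes the duration line looks like the sample output above.

```shell
# duration_from_afinfo (hypothetical): read afinfo output on stdin and
# print the estimated duration in seconds.
duration_from_afinfo() {
    grep "duration:" | cut -d' ' -f3
}

# to_mmss (hypothetical): turn seconds (decimals allowed) into m:ss.
to_mmss() {
    local secs="${1%.*}"        # drop the fractional part
    secs="${secs:-0}"           # treat something like ".5" as 0
    printf '%d:%02d\n' $((secs / 60)) $((secs % 60))
}
```

Usage would be something like to_mmss "$(afinfo example.mp3 | duration_from_afinfo)", which for the file above prints 5:30.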
Honestly I've never used this one (afconvert) and it supposedly doesn't work with mp3 files. This StackExchange thread has good directions.
FFMPEG is not deprecated! (meme)
FFMPEG is another command-line utility capable of playing audio (and also video). It's huge and much more powerful than the af* utilities I've shown you, but it's not installed by default, so it's no fun :P. Regardless, here are some FFMPEG commands that provide functionality similar to what's shown above:
# -- afplay example.mp3 --
$ ffplay -nodisp -loglevel panic example.mp3
# -- afinfo example.mp3 --
$ ffprobe example.mp3
# -- afinfo example.mp3 | grep "duration:" | cut -d' ' -f3 --
$ ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 example.mp3
# convert mp3 file to a wav file
$ ffmpeg -i example.mp3 example.wav
This one is one of my favorites to play with: the say utility converts text to audible speech.
$ say hello world
As if that isn't fun enough, here's some other stuff you can do with say:
# say 'hello world' very slowly [1]
$ say -r 1 hello world
# save output to a file
$ say hello world -o hello.aif
# list available voices
$ say -v '?'
Alex en_US # Most people recognize me by my voice.
Alice it_IT # Salve, mi chiamo Alice e sono una voce italiana.
Alva sv_SE # Hej, jag heter Alva. Jag är en svensk röst.
Amelie fr_CA # Bonjour, je m’appelle Amelie. Je suis une voix canadienne.
...
# say something in a russian accent [2]
$ say -v Yuri hello there, eye am yuri
# have a conversation... [3]
$ say -v Yuri milena, what is your favorite color ; say -v Milena it is blue of course
# make a beat real quick [4]
$ while true ; do say -r 200 supW ; done
# interactive! [5]
$ say --interactive=/green spending each day the color of the leaves
[1]:
[2]:
[3]:
[4]:
[5]:
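A common trick is to have say tell you when a long-running command has finished. A rough sketch, where announce is a hypothetical helper name and the spoken words are arbitrary:

```shell
# announce (hypothetical helper): run a command, then report the result
# on the terminal and, if say is available, out loud.
announce() {
    local msg rc
    "$@"
    rc=$?
    if [ "$rc" -eq 0 ]; then
        msg="done"
    else
        msg="failed"
    fi
    echo "$msg"
    if command -v say >/dev/null 2>&1; then
        say "$msg"
    fi
    return "$rc"                # pass the command's exit status through
}
```

So announce make (or any other slow command) prints and speaks "done" or "failed" depending on the exit status.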
say
in a song that I made (I tried to experiment with autotune to make it sound like singing):
But wait, there's more. Apple's speech synthesis supports the TUNE format, which allows you to "shape the overall melody and timing of an utterance... for example ... to make an utterance sound as if it is spoken with emotion".
To demonstrate this, create a file named apple.txt (or whatever) with the following contents:
[[inpt TUNE]]
~
AA {D 120; P 176.9:0 171.4:22 161.7:61}
r {D 60; P 166.7:0}
~
y {D 210; P 161.0:0}
UW {D 70; P 178.5:0}
_
S {D 290; P 173.3:0 178.2:8 184.9:19 222.9:81}
1AX {D 280; P 234.5:0 246.1:39}
r {D 170; P 264.2:0}
~
y {D 200; P 276.9:0 274.9:17 271.0:50}
UW {D 40; P 265.0:0 264.3:50}
_
b {D 140; P 263.6:0 263.5:13 263.3:60}
r {D 110; P 263.1:0 260.4:43}
1UX {D 30; P 256.8:0 256.8:6}
S {D 190; P 256.1:0}
t {D 20; P 252.0:0 253.6:47}
~
y {D 30; P 255.5:0 257.8:45}
AO {D 40; P 260.6:0 260.0:56}
r {D 40; P 259.5:0}
_
t {D 190; P 251.3:0 250.0:16 245.9:68}
1IY {D 260; P 243.4:0 248.1:8 286.1:72 288.5:84}
T {D 220; P 291.6:0 262.8:27 220.0:67 184.8:100}
? {D 300}
[[inpt TEXT]]
Then, once you call say -f apple.txt, you will hear this:
So cool!!! This example was taken from Apple's Speech Synthesis Programming Guide.
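Since a TUNE file is just text, you could also generate one from a script. The sketch below only mimics the line shape seen in the example (a phoneme followed by a D value and P value:position pairs). tune_line and make_tune are made-up names, and I haven't verified that say accepts this trimmed-down output.

```shell
# tune_line (hypothetical helper): print one TUNE entry with a single pitch point.
#   $1 = phoneme, $2 = D value, $3 = P value (placed at position 0)
tune_line() {
    printf '%s {D %s; P %s:0}\n' "$1" "$2" "$3"
}

# make_tune (hypothetical): the first word of the example above, flattened
# to one pitch point per phoneme.
make_tune() {
    echo '[[inpt TUNE]]'
    echo '~'
    tune_line AA 120 176.9
    tune_line r 60 166.7
    echo '[[inpt TEXT]]'
}
```

You would then try it with make_tune > flat.txt followed by say -f flat.txt (flat.txt is just an example filename).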
The full source code is here, but let's build a skeleton as an example.
#!/bin/bash

stop_song() {
    if [ -n "$PID" ]; then
        # the song may have already ended, so ignore "no such process" errors
        kill "$PID" 2>/dev/null
        # to avoid stderr output whenever we stop a song
        wait "$PID" 2>/dev/null
    fi
}

cleanup() {
    stop_song
    clear
    exit 0
}

# catch ctrl-c press so we can stop the audio before quitting
trap cleanup INT

# <-- START HERE :P
# invoke the player by providing a path to a directory with audio files
if [ $# -ne 1 ]; then
    echo "Usage: $0 <dir>"
    exit 1
fi

MUSIC_DIR="$1"
# bail out if the directory doesn't exist
cd "$MUSIC_DIR" || exit 1

# the 2> /dev/null is just to avoid extraneous output coming up on the screen
NUM_SONGS=$(ls *.mp3 2> /dev/null | wc -l)
if [ "$NUM_SONGS" -eq 0 ]; then
    echo "Found no MP3 files in $MUSIC_DIR"
    exit 1
fi

# array of filenames of all mp3 files in directory
MUSIC_LIST=(*.mp3)

# play songs one by one until user quits
INDEX=0
while true; do
    SONG_NAME="${MUSIC_LIST[$INDEX]}"
    # we already cd'd into MUSIC_DIR, so the filename alone is the path
    # (prepending a relative MUSIC_DIR here would point at the wrong place)
    SONG_PATH="$SONG_NAME"
    DURATION=$(afinfo "$SONG_PATH" | grep "duration:" | cut -d' ' -f3)
    # chop off decimal portion (`read -t` only takes integers)
    DURATION=${DURATION%.*}

    # play the song and capture the process id so we can stop it later
    afplay "$SONG_PATH" &
    PID=$!

    # print pretty stuff (put le ascii art here)
    clear
    echo "==="
    echo "Playing Song #$INDEX:"
    echo "$SONG_NAME"
    echo "==="
    echo "(n)ext, (p)rev, (q)uit"

    # show menu until the song finishes or user selects an option
    read -n 1 -t "$DURATION" CHOICE

    # basically just increment or decrement the index depending on choice
    if [ "$CHOICE" = "q" ]; then
        cleanup
    elif [ "$CHOICE" = "n" ] || [ -z "$CHOICE" ]; then
        INDEX=$((INDEX + 1))
        if [ $INDEX -eq $NUM_SONGS ]; then INDEX=0; fi
    elif [ "$CHOICE" = "p" ]; then
        INDEX=$((INDEX - 1))
        if [ $INDEX -lt 0 ]; then INDEX=0; fi
    fi
    stop_song
done
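The skeleton wraps back to the first song after the last one, but pressing p on the first song just stays put. If you'd prefer it to wrap in both directions, modulo arithmetic is one way to do it. This is only a sketch, and next_index and prev_index are hypothetical helper names:

```shell
# next_index / prev_index (hypothetical helpers): step through a playlist
# of $2 songs starting from index $1, wrapping around at both ends.
next_index() {
    echo $(( ($1 + 1) % $2 ))
}

prev_index() {
    # add $2 before taking the remainder so the result never goes negative
    echo $(( ($1 - 1 + $2) % $2 ))
}
```

Inside the loop that would look like INDEX=$(prev_index "$INDEX" "$NUM_SONGS") in place of the decrement-and-clamp lines.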
Try this at home!
Reference
$ afplay --help
Audio File Play
Version: 2.0
Copyright 2003-2013, Apple Inc. All Rights Reserved.
Specify -h (-help) for command options
Usage:
afplay [option...] audio_file
Options: (may appear before or after arguments)
{-v | --volume} VOLUME
set the volume for playback of the file
{-h | --help}
print help
{ --leaks}
run leaks analysis
{-t | --time} TIME
play for TIME seconds
{-r | --rate} RATE
play at playback rate
{-q | --rQuality} QUALITY
set the quality used for rate-scaled playback (default is 0 - low quality, 1 - high quality)
{-d | --debug}
debug print output
$ afinfo --help
Audio File Info
Version: 2.0
Copyright 2003-2016, Apple Inc. All Rights Reserved.
Specify -h (-help) for command options
Usage:
afinfo [option...] audio_file(s)
Options: (may appear before or after arguments)
{-h --help}
print help
{-b --brief}
print a brief (one line) description of the audio file
{-r --real}
get the estimated duration after obtaining the real packet count
{ --leaks }
run leaks at the end
{ -i --info }
print contents of the InfoDictionary
{ -u --userprop } 4-cc
find and print a property or user data property (as string or bytes) [does not print to xml]
{ -x --xml }
print output in xml format
{ --warnings }
print warnings if any (by default warnings are not printed in non-xml output mode)
$ afconvert --help
Audio File Convert
Version: 2.0
Copyright 2003-2013, Apple Inc. All Rights Reserved.
Specify -h (-help) for command options
Usage:
afconvert [option...] input_file [output_file]
Options may appear before or after the direct arguments. If output_file
is not specified, a name is generated programmatically and the file
is written into the same directory as input_file.
afconvert input_file [-o output_file [option...]]...
Output file options apply to the previous output_file. Other options
may appear anywhere.
General options:
{ -d | --data } data_format[@sample_rate][/format_flags][#frames_per_packet]
[-][BE|LE]{F|[U]I}{8|16|24|32|64} (PCM)
e.g. BEI16 F32@44100
or a data format appropriate to file format (see -hf)
format_flags: hex digits, e.g. '80'
Frames per packet can be specified for some encoders, e.g.: samr#12
A format of "0" specifies the same format as the source file,
with packets copied exactly.
A format of "N" specifies the destination format should be the
native format of the lossless encoded source file (alac, FLAC only)
{ -c | --channels } number_of_channels
add/remove channels without regard to order
{ -m | --channelmap } list of input channels in output
set a channel map, mapping which input channel goes to each output channel.
channel number starts at zero. -1 makes a silent output channel.
For example, to reverse a stereo stream: -m 1 0
{ -l | --channellayout } layout_tag
layout_tag: name of a constant from CoreAudioTypes.h
(prefix "kAudioChannelLayoutTag_" may be omitted)
if specified once, applies to output file; if twice, the first
applies to the input file, the second to the output file
{ -b | --bitrate } total_bit_rate_bps
e.g. 256000 will give you roughly:
for stereo source: 128000 bits per channel
for 5.1 source: 51000 bits per channel
(the .1 channel consumes few bits and can be discounted in the
total bit rate calculation)
{ -q | --quality } codec_quality
codec_quality: 0-127
{ -r | --src-quality } src_quality
src_quality (sample rate converter quality): 0-127 (default is 127)
{ --src-complexity } src_complexity
src_complexity (sample rate converter complexity): line, norm, bats minp
{ -s | --strategy } strategy
bitrate allocation strategy for encoding an audio track
0 for CBR, 1 for ABR, 2 for VBR_constrained, 3 for VBR
--prime-method method
decode priming method (see AudioConverter.h)
--prime-override samples_prime samples_remain
can be used to override the priming information stored in the source
file to the specified values. If -1 is specified for either, the value
in the file is used.
--no-filler
don't page-align audio data in the output file
--soundcheck-generate
analyze audio, add SoundCheck data to the output file
--media-kind "media kind string"
media kinds are: "Audio Ad", "Video Ad"
--anchor-loudness
set a single precision floating point value to
indicate the anchor loudness of the content in dB
Note that for MP4 and M4* file types, this requires that the
--soundcheck-generate option is also enabled.
--anchor-generate
Analyze audio and add dialogue anchor level data to output file
Note that for MP4 and M4* file types, this requires that the
--soundcheck-generate option is also enabled.
--generate-hash
generate an SHA-1 hash of the input audio data and add it to the output file.
--codec-manuf codec_manuf
specify the codec with the specified 4-character component manufacturer
code
--dither algorithm
algorithm: 1-2
--mix
enable channel downmixing
{ -u | --userproperty } property value
set an arbitrary AudioConverter property to a given value
property is a four-character code; value can be a signed
32-bit integer or a single precision floating point value.
e.g. '-u vbrq <sound_quality>' sets the sound quality level
(<sound_quality>: 0-127)
May not be used in a transcoding situation.
-ud property value
identical to -u except only applies to a decoder. Fails if there is no
decoder.
-ue property value
identical to -u except only applies to an encoder. Fails if there is no
encoder.
Input file options:
--decode-formatid data_format_id
For input audio files with multiple data format layers (e.g. AAC_HE),
specify by format id (e.g. 'aach') which layer of the input file to
decode.
--read-track track_index
For input files containing multiple tracks, the index (0..n-1)
of the track to read and convert.
--offset number_of_frames
the starting offset in the input file in frames. (The first frame is
frame zero.)
--soundcheck-read
read SoundCheck data from source file and set it on any destination
file(s) of appropriate filetype (.m4a, .caf).
--copy-hash
copy an SHA-1 hash chunk, if present, from the source file to the output file.
--gapless-before filename
file coming before the current input file of a gapless album
--gapless-after filename
file coming after the current input file of a gapless album
Output file options:
-o filename
specify an (additional) output file.
{ -f | --file } file_format
use -hf for a complete list of supported file/data formats
--condensed-framing field_size_in_bits
specify storage size in bits for externally framed packet sizes.
Supported value is 16 for aac in m4a and m4b file format.
Other options:
{ -v | --verbose }
print progress verbosely
{ -t | --tag }
If encoding to CAF, store the source file's format and name in a user
chunk. If decoding from CAF, use the destination format and filename
found in a user chunk.
{ --leaks }
run leaks at the end of the conversion
{ --profile }
collect and print performance information
Help options:
{ -hf | --help-formats }
print a list of supported file/data formats
{ -h | --help }
print this help
SAY(1) Speech Synthesis Manager SAY(1)
NAME
say - Convert text to audible speech
SYNOPSIS
say [-v voice] [-r rate] [-o outfile [audio format options] | -n name:port | -a device] [-f file | string ...]
DESCRIPTION
This tool uses the Speech Synthesis manager to convert input text to
audible speech and either play it through the sound output device
chosen in System Preferences or save it to an AIFF file.
OPTIONS
string
Specify the text to speak on the command line. This can consist of
multiple arguments, which are considered to be separated by spaces.
-f file, --input-file=file
Specify a file to be spoken. If file is - or neither this parameter
nor a message is specified, read from standard input.
-v voice, --voice=voice
Specify the voice to be used. Default is the voice selected in
System Preferences. To obtain a list of voices installed in the
system, specify '?' as the voice name.
-r rate, --rate=rate
Speech rate to be used, in words per minute.
-o out.aiff, --output-file=file
Specify the path for an audio file to be written. AIFF is the
default and should be supported for most voices, but some voices
support many more file formats.
-n name, --network-send=name
-n name:port, --network-send=name:port
-n :port, --network-send=:port
-n :, --network-send=:
Specify a service name (default "AUNetSend") and/or IP port to be
used for redirecting the speech output through AUNetSend.
-a ID, --audio-device=ID
-a name, --audio-device=name
Specify, by ID or name prefix, an audio device to be used to play
the audio. To obtain a list of audio output devices, specify '?' as
the device name.
--progress
Display a progress meter during synthesis.
-i, --interactive, --interactive=markup
Print the text line by line during synthesis, highlighting words as
they are spoken. Markup can be one of
o A terminfo capability as described in terminfo(5), e.g. bold,
smul, setaf 1.
o A color name, one of black, red, green, yellow, blue, magenta,
cyan, or white.
o A foreground and background color from the above list,
separated by a slash, e.g. green/black. If the foreground color
is omitted, only the background color is set.
If markup is not specified, it defaults to smso, i.e. reverse
video.
If the input is a TTY, text is spoken line by line, and the output
file, if specified, will only contain audio for the last line of the
input. Otherwise, text is spoken all at once.
AUDIO FORMATS
Starting in MacOS X 10.6, file formats other than AIFF may be
specified, although not all third party synthesizers may initially
support them. In simple cases, the file format can be inferred from the
extension, although generally some of the options below are required
for finer grained control:
--file-format=format
The format of the file to write (AIFF, caff, m4af, WAVE).
Generally, it's easier to specify a suitable file extension for the
output file. To obtain a list of writable file formats, specify '?'
as the format name.
--data-format=format
The format of the audio data to be stored. Formats other than
linear PCM are specified by giving their format identifiers (aac,
alac). Linear PCM formats are specified as a sequence of:
Endianness (optional)
One of BE (big endian) or LE (little endian). Default is native
endianness.
Data type
One of F (float), I (integer), or, rarely, UI (unsigned
integer).
Sample size
One of 8, 16, 24, 32, 64.
Most available file formats only support a subset of these sample
formats.
To obtain a list of audio data formats for a file format specified
explicitly or by file name, specify '?' as the format name.
The format identifier optionally can be followed by @samplerate and
/hexflags for the format.
--channels=channels
The number of channels. This will generally be of limited use, as
most speech synthesizers produce mono audio only.
--bit-rate=rate
The bit rate for formats like AAC. To obtain a list of valid bit
rates, specify '?' as the rate. In practice, not all of these bit
rates will be available for a given format.
--quality=quality
The audio converter quality level between 0 (lowest) and 127
(highest).
ERRORS
say returns 0 if the text was spoken successfully, otherwise non-zero.
Diagnostic messages will be printed to standard error.
EXAMPLES
say Hello, World
say -v Alex -o hi -f hello_world.txt
say --interactive=/green spending each day the color of the leaves
say -o hi.aac 'Hello, [[slnc 200]] World'
say -o hi.m4a --data-format=alac Hello, World.
say -o hi.caf --data-format=LEF32@8000 Hello, World
say -v '?'
say --file-format=?
say --file-format=caff --data-format=?
say -o hi.m4a --bit-rate=?
SEE ALSO
"Speech Synthesis Programming Guide"
1.0 2017-02-16 SAY(1)