Audio quality matters more than ever. Human hearing is remarkably discerning, yet the audibility and clarity of dialogue remain a challenge.
By Carlos Pantsios Markhauser, PhD *
In recent years a new television technology known as Ultra High Definition Television (UHDTV) has entered the world market, offering four times as many pixels per image (8.3 Mpix) as 1080p HDTV (2 Mpix). UHDTV also brings other notable features:
1) a significantly higher dynamic range,
2) a better temporal reproduction of the images (by means of a higher temporal frequency),
3) a substantially larger color reproduction (thanks to an expanded color space) and,
4) more details (resolution) in the reproduced images.
Despite these widely publicized video advantages, there is little awareness that UHDTV is also accompanied by a significant change in its sound system.
A new sound experience comes with UHDTV
First, it is worth highlighting how differently human beings perceive audio and video, that is, the difference between the experiences each produces. For example, in practice a viewer can comfortably watch two or more images on the same television screen at once. Television images are bounded by nature and usually two-dimensional.
Gaps caused by transmission or processing errors in the video do not entirely prevent the viewer from understanding the distorted images, although such losses are undoubtedly annoying. By contrast, it is genuinely difficult to understand several audio streams presented to the listener simultaneously.
Stereo audio is an unbounded experience (provided the listener sits in the right place), and gaps in the audio quickly destroy the listener's ability to follow what is happening. Moreover, distorted audio can cause the listener real physical discomfort.
Factors that improve the audio experience
These differences in perception show that a significant number of factors must be considered to substantially improve the audio experience. Three areas deserve attention here:
Area 1: Interactivity is known to be widely valued by audiences, but the audio equivalent of a second screen simply does not work. How, then, can we create richer interaction beyond the conventional volume control?
Area 2: Audio is now "immersive", but can this experience be improved further? Could a true 3D audio experience succeed where stereoscopic 3D images could not?
It is equally important to ask whether this more immersive experience can be delivered without burdening the production work and the distribution of finished programs with added complexity and cost. Finally, can all of this be done in a way that also serves users listening in mono, in stereo, or on headphones?
Area 3: Audio quality itself. Human hearing is highly discerning, yet the audibility and clarity of dialogue remain a challenge. The key question here is how the audio experience can be adapted and personalized so that it works well across different preferences, a range of technologies and a variety of listening environments.
Great efforts are currently being made to find techniques that satisfactorily address these three important areas:
1) interaction,
2) immersion and
3) adaptation (also known as personalization).
The technology that has shown the best results so far, while remaining backwards compatible with today's channel-based technologies, is object-based audio (audio objects).
In the conventional world, a program's audio content is represented in a channel-based format: a set of signals, stored in a file or distributed as streams, corresponds to each program. The Broadcast Wave Format (BWF) does not currently define what each signal in the file represents, and neither does the Microsoft Wave format on which it is based.
The speaker arrangement is assumed from the number of channels available, and the speaker positions are likewise inferred from the channel count. For example, a program with two audio channels implies a stereo format; the signals correspond to the left and right speakers, which should be placed 60 degrees apart. This system quickly runs into problems when there are more than two channels.
For content in 5.1 format there are several conventions for ordering the channels, and there is no reliable way to tell from the file alone which one has been used. RF64 is a multichannel-compatible BWF format that uses a channel mask to map channels to speaker positions through descriptive labels, e.g. SPEAKER_FRONT_LEFT. This allows the speaker positions to be determined, while channel-ordering identifiers and metadata stored in XML are used to describe the channels. The EBUCore metadata set allows the content of a given file to be defined with greater precision.
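The channel-mask idea can be sketched in a few lines. The bit values below follow Microsoft's WAVE_FORMAT_EXTENSIBLE `dwChannelMask` convention, on which BWF builds; the flag list here is abbreviated to the six positions of a 5.1 layout for illustration.

```python
# Minimal sketch: decoding a WAVE_FORMAT_EXTENSIBLE-style channel mask
# into descriptive speaker labels. Bit values follow Microsoft's
# dwChannelMask convention; the list is truncated for illustration.

SPEAKER_FLAGS = [
    (0x01, "SPEAKER_FRONT_LEFT"),
    (0x02, "SPEAKER_FRONT_RIGHT"),
    (0x04, "SPEAKER_FRONT_CENTER"),
    (0x08, "SPEAKER_LOW_FREQUENCY"),
    (0x10, "SPEAKER_BACK_LEFT"),
    (0x20, "SPEAKER_BACK_RIGHT"),
]

def decode_channel_mask(mask: int) -> list:
    """Return the speaker label for every bit set in the mask, in order."""
    return [name for bit, name in SPEAKER_FLAGS if mask & bit]

# A 5.1 layout is conventionally mask 0x3F (all six bits above set).
print(decode_channel_mask(0x3F))
```

Given the mask, a reader of the file knows which physical speaker each channel feeds, which is exactly the ambiguity that plain multichannel WAV leaves open.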
For many years researchers have been working on audio formats that are independent of the speaker configuration. One of them is the object-based format, which describes the components of a scene with time-varying metadata, providing maximum flexibility. For broadcasters this solution is very attractive: programs can be produced once and then distributed in different formats that are generated automatically. The new BWF allows scenes and audio objects to be represented, making it possible for broadcasters to transport and exchange programs produced in these formats.
This audio technology has evolved rapidly in recent years, giving rise to new standards. Object-based audio describes a general presentation of the audio, structured as individual elements (objects), each with its own metadata describing its relationships, behavior and associations. The metadata tells a "renderer" in the AV system how best to assemble the audio objects into the desired presentation for the speaker arrangement that is available.
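The core idea can be illustrated with a deliberately simplified sketch (not any standard's actual algorithm): an object carries positional metadata, and the renderer turns that metadata into per-speaker gains for whatever layout it finds. Here the layout is just a stereo pair at ±30 degrees and the metadata a single azimuth angle, panned with a constant-power law.

```python
import math

# Hypothetical sketch of object rendering: metadata in, speaker gains out.
# The "layout" is a stereo pair at -30/+30 degrees; a real renderer would
# handle arbitrary arrangements and richer metadata.

def render_gains(azimuth_deg: float) -> tuple:
    """Map an object's azimuth (-30..+30 deg) to (left, right) gains."""
    # Normalise the azimuth to 0..1 across the stereo pair, then apply
    # a constant-power (sine/cosine) pan law so energy stays constant.
    pos = (max(-30.0, min(30.0, azimuth_deg)) + 30.0) / 60.0
    return math.cos(pos * math.pi / 2), math.sin(pos * math.pi / 2)

# An object panned hard left puts all of its energy in the left speaker.
left, right = render_gains(-30.0)
print(round(left, 3), round(right, 3))  # 1.0 0.0
```

The point of the architecture is that the same object and metadata could be handed to a different renderer targeting 5.1, headphones or an asymmetrical living-room setup, and each would compute its own gains.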
Conceptually this approach is very powerful and flexible, but achieving a practical implementation requires deciding which problems to tackle first.
Proposing concepts and solutions
One of the most important concepts in object-based audio technology is the "renderer". It is defined by the Forum for Advanced Media in Europe (FAME), an organization devoted to research and development in Ultra High Definition (UHD), Virtual Reality (VR) and other new technologies.
In real life it will most likely be necessary to transcode between different object-based presentations. This is because high-end dramatic productions will work with a very large number of objects (possibly hundreds or more), real workflows generally operate on smaller subsets, and bandwidth limitations will force the use of fewer objects for affordable delivery to the home.
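One illustrative way to think about reducing an object count to fit a delivery budget (a sketch only, not any standard's transcoding method) is to repeatedly merge the spatially closest pair of objects, preserving total energy and taking an energy-weighted position:

```python
# Illustrative sketch: shrink a list of audio objects to a fixed budget.
# Each object is an (azimuth_degrees, gain) pair; the closest adjacent
# pair is merged until the budget is met. The merged object keeps the
# combined energy and an energy-weighted average position.

def reduce_objects(objects, budget):
    objs = sorted(objects)  # sort by azimuth
    while len(objs) > budget:
        # Find the adjacent pair with the smallest angular gap.
        i = min(range(len(objs) - 1), key=lambda k: objs[k + 1][0] - objs[k][0])
        (a1, g1), (a2, g2) = objs[i], objs[i + 1]
        e1, e2 = g1 * g1, g2 * g2
        merged = ((a1 * e1 + a2 * e2) / (e1 + e2), (e1 + e2) ** 0.5)
        objs[i:i + 2] = [merged]
    return objs

# Four objects squeezed into a budget of two.
print(reduce_objects([(-30, 1.0), (-28, 1.0), (20, 1.0), (25, 1.0)], 2))
```

Real transcoders must also reconcile time-varying metadata, interactivity flags and dialogue objects that cannot be merged, which is what makes this a genuine engineering problem rather than a one-liner.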
Likewise, it must be possible to evaluate the quality of the audio renderings produced by different implementations; so far no technique exists for doing so. Established methods such as Multiple Stimuli with Hidden Reference and Anchor (MUSHRA) do not apply here, because the interest now lies in evaluating the "immersiveness" of the production material rather than the errors that may appear in it.
The definition above also makes it clear that the renderer needs both the audio and the metadata in order to render.
The real power of this flexible approach is that renderers can be developed to take a single published version and realize it in the best possible way for a range of platforms, devices and situations. This, however, creates a new challenge: the creative team will then have only a vague idea of how the audio program will sound at home.
This raises the question of whether reference renderers and monitoring arrangements are needed to allow a representative evaluation of the production in question. Beyond reproducing object-based audio over professionally configured speakers, the renderer designer faces the even harder challenge of producing great sound on the asymmetrical speaker arrangements commonly found in homes.
Today the consumer market offers a new generation of 4K (UHDTV) television sets that are still equipped with conventional broadcast audio technology. The latest audio solutions, however, are not tied to UHDTV and can apply equally to standard TV receivers and to optical discs.
As a result, object-based audio technologies are emerging in many places. For example, Dolby puts objects at the heart of its ATMOS solution for cinema (including home theater) and is introducing its object-based technology as part of the AC-4 standard. DTS has launched its Multi-Dimensional Audio (MDA) format. Fairlight has implemented ATMOS and MDA support in its 3DAW audio tools.
The BBC recently demonstrated several examples of immersive, personalized and interactive developments based on audio objects at the IBC 2014 exhibition, and MPEG-H has been built to be "object ready", delivering 3D audio not only for broadcasting but also for gaming and video conferencing.
Great changes await audio in the near future, and we must prepare for them adequately.
* Carlos Pantsios Markhauser is a Telecommunications Engineer and holds a Master's in Communications from Simón Bolívar University, with a specialization in satellite and network telecommunications; he also studied at The George Washington University School of Engineering & Applied Science and holds a specialization in Digital Telecommunications from the University of Colorado Boulder. He teaches at the postgraduate level in the telecommunications schools of Simón Bolívar University and Andrés Bello Catholic University, and works as a professional consultant on TV projects based in Argentina.