
Build & train an AI model to classify sound and other senses


Introduction

Overview

With the advancements in the field of Artificial Intelligence, researchers are able to achieve breakthroughs in many fields. Although most AI methodologies were invented in the previous century, they were never properly utilised before. Now, with numerous ways to collect data, the sheer amount of data accessible, and the available computing power, scientists are able to apply artificial intelligence and achieve remarkable results.

One of the fields where AI is able to make drastic changes is health, thanks to its ability to process large amounts of data and draw conclusions that humans are unable to. Models can predict which diseases a patient is susceptible to by comparing their genome against millions of other patients' genomes. In this way models are able to detect very early stages of cancer or even diabetes and recommend appropriate measures to prevent such diseases before they result in serious problems.

For all of these remarkable results to take place, one of the most important yet often overlooked aspects of such projects is the data. It is very easy to work with flawed data, which can be a result of human error – something that occurs very often. Another issue may be the lack of preprocessing, which introduces a lot of noise for the model and leads to poor results. Hence this is an important task which should not be overlooked.

With all these variables in mind, the aim is to help patients, for example patients with depression. They usually receive a couple of hours of therapy per week, during which a therapist tries to uncover the source of their pain and discomfort; however, a couple of hours a week is usually not enough for such a severe condition to be diagnosed and treated. Ideally, the most effective method would be to monitor the patient 24/7 and analyse their behaviour. That method, on the other hand, is very costly, hence the best solution is artificial intelligence. The goal is to be able to monitor a patient without intruding on their day-to-day life and later analyse their behaviour to uncover any unusual patterns that a therapist would not otherwise be able to spot.

Aim

The aim of the project is to build and train a model which will be able to classify sound and other senses. The model is intended to improve activity recognition for a person under psychiatric care in their own home. People under psychiatric care need a significantly greater amount of attention in order to observe their behaviour and its variations. Hence the most feasible approach is a multi-dimensional monitor that records a patient and their behaviour. It offers a greater depth of analysis, which is where the issue with such patients usually lies.

Hence there are two main objectives: the first is to collect data and properly extract features from it, and the second is to train an appropriate model on the collected data.

Structure

This work reports on the current progress of the project. Currently several methods are being researched, which are described in the feature extraction section. Furthermore, the report includes a current status update along with a brief outline of the further steps, and lastly a risk analysis.

Feature Extraction

Feature extraction is one of the most important aspects of the project. It enables the important properties of the data to be extracted and focused on, while the irrelevant data, which would create a lot of noise, is cut out. This is the part that most strongly influences the quality of the results that the model outputs.

As sound classification is still not properly explored in the field of AI, resources are very limited. Hence the best way to gain knowledge is to research speech recognition and try to apply its techniques to sound classification with some alterations.

Background Literature

Several different resources were consulted to gain more insight into this topic. These sources included research papers from IEEE, various tutorials and video lectures, along with scientific forums.

Extraction Methodologies

Mel Frequency Cepstral Coefficient (MFCC)

Feature extraction involves identifying the components of the audio signal that are useful for identifying the activity being performed and discarding all other parts, such as background noise. MFCCs were first introduced by Davis and Mermelstein in the 1980s, and are still one of the best performing methods to date.

The main steps required to compute MFCCs are:

Frame a signal into shorter frames

Calculate the periodogram estimate of the power spectrum for every single frame

Apply the Mel filterbank to the power spectra, sum the energy in each filter

Extract the logarithm of all filterbank energies

Take the DCT of the log filterbank energies

Keep DCT coefficients 2-13, and abandon the rest

An audio signal is not constant – it changes all the time. Hence it is important to simplify it: the assumption is made that over small intervals the audio signal does not vary a lot; usually such frames last 15 ms – 45 ms. A frame that is much shorter does not contain enough information to draw a proper conclusion from, while a longer frame contains too many features, which can confuse the model and lead to two different conclusions, sometimes neither of which is correct. After this separation, the power spectrum is calculated for every single frame. This, in theory, simulates the human cochlea – an organ in the ear that vibrates at different spots depending on the frequency of the incoming sounds. Depending on the location in the cochlea that vibrates (there are small hairs there that vibrate), different nerves fire, informing the brain which frequencies are present in the environment.

The periodogram estimate performs an equivalent job by identifying which frequencies are present in the frame.
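As an illustration of the framing and periodogram steps, the sketch below uses NumPy; the 25 ms frame length, 10 ms hop and 16 kHz sample rate are example values for illustration rather than final project choices.

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25, hop_ms=10):
    """Split a 1-D signal into short overlapping frames (25 ms frames, 10 ms hop assumed)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    return np.stack([signal[i * hop_len:i * hop_len + frame_len] for i in range(n_frames)])

def periodogram(frames, n_fft=512):
    """Periodogram estimate of the power spectrum of every frame."""
    window = np.hamming(frames.shape[1])            # taper each frame to reduce spectral leakage
    spectrum = np.fft.rfft(frames * window, n_fft)  # one-sided DFT of each frame
    return (np.abs(spectrum) ** 2) / n_fft          # power spectrum estimate per frame

# Example on a synthetic one-second 440 Hz tone sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
power_spec = periodogram(frame_signal(np.sin(2 * np.pi * 440 * t), sr))
print(power_spec.shape)   # (number of frames, n_fft // 2 + 1)
```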

Nonetheless, a periodogram spectral estimate still contains a lot of information that is irrelevant for a sound recognition model. In particular, the cochlea is usually unable to distinguish between two closely spaced frequencies, and this effect becomes more pronounced as the frequencies increase. This is why it is important to take sections of periodogram bins and combine them to get an estimate of how much energy actually exists in various frequency regions. This is performed by the Mel filterbank. The very first filter is extremely narrow and gives a clear indication of how much energy exists near 0 Hertz. As the frequencies get higher the filters get wider, since fine variations matter less there and will not influence the model much; the main point of interest is only an estimate of how much energy occurs at each spot. The Mel scale specifies exactly how to space the filterbanks and how wide they should be.
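If a library such as librosa is available, a filterbank of this shape can be generated directly, as in the sketch below; the choice of 26 filters, a 512-point FFT and a 16 kHz sample rate is an assumption made for illustration.

```python
import numpy as np
import librosa

sr, n_fft, n_mels = 16000, 512, 26                      # assumed values for illustration

# Rows are triangular Mel-spaced filters, columns are FFT bins:
# narrow filters near 0 Hz, progressively wider filters at higher frequencies.
mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
print(mel_fb.shape)                                     # (26, 257)

# Applying it to the periodogram frames from the previous sketch gives one
# energy value per filter per frame:
# filterbank_energies = power_spec @ mel_fb.T           # shape (n_frames, 26)
```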

After obtaining all of the filterbank energies, it is advised to take their logarithm. This is also motivated by human hearing: humans do not hear loudness on a linear scale. Roughly speaking, to double the perceived volume of a sound, around nine times as much energy must be put into it, which means that large variations in energy may not produce a very different sound from the original one. This compression operation makes the features match more closely what humans actually hear. Furthermore, the logarithm allows the use of cepstral mean subtraction, which is a channel normalisation technique; this is the reason a logarithm is used rather than a cube root.

Finally, it is required to compute the DCT (Discrete Cosine Transform) of the log filterbank energies. There are two reasons for this: the filterbanks usually overlap, and the filterbank energies are therefore highly correlated with each other. The DCT de-correlates the energies, which yields diagonal covariance matrices that can be used to model the features in various classifiers, which is the second part of the project. Nonetheless, it is essential to keep in mind that only twelve of the twenty-six DCT coefficients are kept; the rest are disregarded. This is because the higher DCT coefficients represent fast changes in the filterbank energies, and these fast changes degrade the model's performance, so a minor improvement is gained by disregarding them.
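Continuing the sketch, the log and DCT steps might look as follows; the placeholder filterbank energies stand in for the output of the previous step, and the 26-filter, 12-coefficient split matches the numbers quoted above.

```python
import numpy as np
from scipy.fftpack import dct

# Placeholder filterbank energies (n_frames x 26); in practice these come from
# applying the Mel filterbank to the periodogram of each frame.
filterbank_energies = np.random.rand(98, 26) + 1e-10

log_energies = np.log(filterbank_energies)                  # compress dynamic range (log, not cube root)
cepstra = dct(log_energies, type=2, axis=1, norm='ortho')   # de-correlate the filterbank energies

mfcc = cepstra[:, 1:13]     # keep DCT coefficients 2-13, discard the higher, fast-changing ones
print(mfcc.shape)           # (n_frames, 12)
```

In practice a ready-made routine such as librosa.feature.mfcc wraps all of these steps; the manual version is shown only to make the correspondence with the listed steps explicit.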

Mel Scale

The Mel scale relates the perceived frequency, or pitch, of a pure tone to its actual measured frequency. Humans are much better at discerning small changes in pitch at low frequencies than at high frequencies. Incorporating this scale makes the features match more closely what humans actually hear.

There are two formulas required in this process. The first converts from frequency (in Hz) to the Mel scale; a commonly used formulation is:

M(f) = 2595 · log10(1 + f / 700)

The second converts from the Mel scale back to frequency (the inversion of the first formula):

f(m) = 700 · (10^(m / 2595) − 1)
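Both conversions are straightforward to express in code; the sketch below uses the 2595·log10 formulation given above.

```python
import math

def hz_to_mel(f_hz):
    """Convert a frequency in Hz to the Mel scale."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse conversion: Mel value back to frequency in Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

print(hz_to_mel(1000.0))            # roughly 1000 mels, the scale's reference point
print(mel_to_hz(hz_to_mel(440.0)))  # 440.0, confirming the two functions are inverses
```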

Discrete Fourier Transform

Another useful feature may be a spectrogram of every sound. A spectrogram shows how the frequency content of a signal changes over time and can be calculated from the time-domain signal. To obtain it, the Discrete Fourier Transform (DFT) is used. The DFT is a mathematical method for performing Fourier analysis on a discrete signal, which can be a small sample. It converts a finite list of equally spaced samples of a function into the list of coefficients of a finite combination of complex sinusoids, ordered by their frequencies, as if those sinusoids had been sampled at the same rate. The main aspect is to properly understand the correlation between a signal and sinusoids of various frequencies.

The DFT is an algorithm that takes a signal as input and determines its "frequency content". A "signal" here is an arbitrary sequence of numbers. An audio signal's fluctuations over time can be depicted as a graph: the x-axis is time, and the y-axis is the amplitude. The DFT does mathematically what the human ear does physically: it decomposes a signal into its component frequencies.

The result contains a list of sinusoid functions, identified by frequency, and each sinusoid has an associated amplitude and phase. The phase of a sinusoid is where it begins relative to a chosen zero point, which may vary depending on the specific case. Phase is measured as an angle, usually in degrees or radians, indicating a part of a complete oscillation. Two sinusoids with the same phase are said to be "in phase". A sinusoid with a phase of π radians is the numerical opposite of a sinusoid with a phase of 0 radians; these signals are said to be "out of phase" and, if combined, would cancel each other out. The human ear is largely "phase deaf": two sinusoids with different phases are perceived as the same sound, and two spectrally similar sounds whose frequency components all have different phases will sound the same. Hence the phase component of the Fourier representation is often discarded. In conclusion, this method may be reliable to use; however, it is vital to research other methods before diving deeper into this one.
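As a sketch of this idea, a magnitude spectrogram (with the phase discarded, as discussed above) can be computed with SciPy's STFT routine; the test tone, window length and overlap below are illustrative assumptions.

```python
import numpy as np
from scipy.signal import stft

sr = 16000                                    # assumed sample rate
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 440 * t)          # placeholder 440 Hz tone

# Short-Time Fourier Transform: a DFT taken over successive short windows.
freqs, times, Z = stft(signal, fs=sr, nperseg=400, noverlap=240)

spectrogram = np.abs(Z)                       # keep the magnitude only; the phase is discarded
print(spectrogram.shape)                      # (frequency bins, time frames)
```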

Multi-resolution transforms

Fourier transforms are not always the best way to proceed. They have a drawback: the representation is based completely in the frequency domain. Using the Fourier transform, the result only yields information about the frequency behaviour of the signal, without indicating when that particular behaviour actually occurred, unless methodologies such as the Short-Time Fourier Transform (STFT) are used. Fortunately, libraries such as TensorFlow do support the STFT; however, it is vital to also research other approaches in order to opt for the one that yields the best quality of results. Multi-resolution techniques take into consideration the spectral makeup of the signal at various different time resolutions, capturing the low-frequency information about the signal over a large window and the high-frequency information over a smaller window. There is a basis function, called a wavelet, which can be thought of as a windowed sinusoid. Wavelets are designed to be orthogonal, so that a transform using them is reversible. In the discrete wavelet transform, the wavelet is stretched to fill the entire time frame of the signal, analysing how much low-frequency information is present in the frame. The wavelet is then scaled to fit half of the frame, and used twice to analyse the first half and the second half of the frame for slightly higher-frequency information, which is localised to each half. Continuing in this way, the entire frequency spectrum is covered. Hence high-frequency information is highly localised in time, while low-frequency information is localised to a lesser degree.

Multi-resolution transforms, such as the wavelet transform, try to cross the boundary between a purely time-domain representation and a purely frequency-domain representation. They do not correspond to "time" information or "frequency" information; rather, the information that they extract from the signal is a kind of time-frequency hybrid. Methods can be employed to extract time or frequency information from a multi-resolution representation such as the wavelet transform.
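A multi-resolution decomposition of this kind can be sketched with the PyWavelets package (one possible choice, not a project commitment); the test signal and the Daubechies-4 wavelet are illustrative assumptions.

```python
import numpy as np
import pywt  # PyWavelets

sr = 16000
t = np.arange(sr) / sr
# Placeholder signal: a slow 50 Hz tone with a short 3 kHz burst in the middle.
signal = np.sin(2 * np.pi * 50 * t)
signal[8000:8400] += 0.5 * np.sin(2 * np.pi * 3000 * t[8000:8400])

# Multi-level discrete wavelet transform with a Daubechies-4 wavelet.
coeffs = pywt.wavedec(signal, 'db4', level=5)

# coeffs[0] is the coarse low-frequency approximation over the whole frame;
# coeffs[1:] are detail coefficients at successively finer time resolutions,
# so the high-frequency burst shows up only in the finest-level coefficients.
for level, c in enumerate(coeffs):
    print(level, c.shape)
```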

Chroma Feature Analysis and Synthesis

Chroma features are a powerful representation for music audio in which the entire spectrum is projected onto twelve bins representing the twelve distinct semitones (or chroma) of the musical octave. In music and audio, notes that are exactly one octave apart are perceived as very similar, so knowing the distribution of chroma, even without the absolute frequency (that is, the original octave), can give useful musical information about the audio. Furthermore, it may also reveal perceived musical similarity that is not apparent in the original spectra and would otherwise be lost.

A chord is defined as the simultaneous sounding of two or more different notes, and chord recognition is one of the central tasks in the field of music information retrieval. Many techniques follow a similar pipeline. In the first step, the audio sample is converted into a sequence of chroma-based audio features. These features are often further processed, for example by applying smoothing filters to even out temporal outliers or by applying logarithmic compression in order to enhance small but perceptually relevant spectral components. The next step involves pattern matching techniques that map the chroma features to chord labels corresponding to the various musical chords under consideration.
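A rough sketch of this pipeline is given below, assuming librosa for the chroma features, a simple median filter for the smoothing step, and two hand-written chord templates for the pattern matching; the file name and the templates are purely illustrative.

```python
import numpy as np
import librosa
from scipy.ndimage import median_filter

# Load an audio clip (hypothetical file) and compute 12-bin chroma features.
y, sr = librosa.load("sample.wav")                 # assumed example recording
chroma = librosa.feature.chroma_stft(y=y, sr=sr)   # shape: (12 semitone bins, time frames)

# Smooth along time to even out temporal outliers, as described above.
chroma_smooth = median_filter(chroma, size=(1, 9))

# Very simple pattern matching against two illustrative chord templates
# (bins ordered C, C#, D, ..., B).
templates = {
    "C major": np.array([1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0], dtype=float),  # C, E, G
    "A minor": np.array([1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0], dtype=float),  # A, C, E
}
mean_chroma = chroma_smooth.mean(axis=1)
scores = {name: float(mean_chroma @ tpl) / np.linalg.norm(tpl) for name, tpl in templates.items()}
print(max(scores, key=scores.get))                 # best-matching template for the whole clip
```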

These were the main methods considered for feature extraction in this project; however, others are also being researched. If a better method is found, it will be used.

Current Project Status

At the present moment, several techniques have already been researched in order to extract the best features; the previous chapter highlights the most important ones. The next step is to pick the best method, research it further, and implement it on a sample dataset. Several datasets have been found; however, further analysis is required to pick the one best suited to this project. The requirements are household sounds, minimal background noise, and a clear distinction between the labels.

After a basic implementation has been created, there will be a testing process to find the shortest sample time that gives the best results. The aim of this is to be as unintrusive to the patients as possible and respect their privacy, while trying to improve their health. These steps are planned to be completed by the beginning of Semester B.

After that the following steps are required:

Researching the best classification techniques

Implementing the best models and picking one

Training the model thoroughly

Testing the model against new data

Improving the model

Further testing of the improved model

Writing the final report

In order to be better prepared for the further steps, the book "Deep Learning" by Ian Goodfellow, Yoshua Bengio and Aaron Courville is being studied to determine whether deep learning is actually the best method, which will help to save time when it comes to picking a suitable model. Furthermore, there is a very high chance of using Markov models, which are also currently being researched.

Based on the initial project specification, it is reasonable to conclude that the project is on track. A minor change has been introduced: rather than collecting data, it is better to find pre-collected data on the internet. The reason for this is that collecting data is a very time-consuming process, which could slow the project down. Data collection is not the main focus of the project, but it has the potential to take the most time. For example, it would be necessary to collect 1,000 sounds of a kettle in various different environments to obtain the best results, whereas it is easier to find such an open-source dataset and focus on the feature extraction and model classification.

Risk Assessment

With the use of a proper risk assessment, a student is able to predict possible failures and try to prevent them – "prevention is better than cure". It is important to analyse when they may occur and how likely they are to occur.

Bibliography

  • Swartz, M. E., & Brown, P. R. (1998, December 7). Use of mathematically enhanced spectral analysis and spectral contrast techniques for the liquid chromatographic and capillary electrophoretic detection and identification of pharmaceutical compounds. Retrieved December 8, 2017, from http://onlinelibrary.wiley.com/doi/10.1002/(SICI)1520-636X(1996)8:1%3C67::AID-CHIR12%3E3.0.CO;2-Q/abstract
  • Davis, S. Mermelstein, P. (1980) Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences. In IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 28 No. 4, pp. 357-366
  • X. Huang, A. Acero, and H. Hon. Spoken Language Processing: A guide to theory, algorithm, and system development. Prentice Hall, 2001.
  • Chang-Hsing Lee, Jau-Ling Shih, Kun-Ming Yu, Hwai-San Lin, “Automatic Music Genre Classification Based on Modulation Spectral Analysis of Spectral and Cepstral Features”, IEEE Transactions on Multimedia, vol. 11, pp. 670-682, 2009, ISSN 1520-9210.
  • Jiajun Zhu, Lie Lu, “Perceptual Visualization of a Music Collection”, IEEE International Conference on Multimedia and Expo (ICME 2005), pp. 1058-1061, 2005.
  • S. Wegener, M. Haller, J.J. Burred, T. Sikora, S. Essid, and G. Richard. On the robustness of audio features for musical instrument classification. In Proceedings of the European Signal Processing Conference, 2008.
  • D.D. O’Shaughnessy. Automatic speech recognition: History, methods and challenges. Pattern Recognition, 41(10):2965–2979, 2008.
  • M. Goto. A chorus-section detecting method for musical audio signals. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 437– 440, Hong Kong, China, 2003.
  • M. A. Bartsch and G. H. Wakefield. Audio thumbnailing of popular music using chroma-based representations. IEEE Transactions on Multimedia, 7(1):96–104, Feb. 2005.
  • F. Kurth and M. Müller. Efficient Index-Based Audio Matching. IEEE Transactions on Audio, Speech, and Language Processing, 16(2):382–395, Feb. 2008.
  • A. P. Klapuri, A. J. Eronen, and J. Astola. Analysis of the meter of acoustic musical signals. IEEE Transactions on Audio, Speech and Language Processing, 14(1):342–355, 2006.
  • Google, TensorFlow Documentation, 1 December 2017, https://www.tensorflow.org/api_docs/python/tf/contrib/signal/stft
  • Bailey, David H.; Swarztrauber, Paul N. (1994), “A fast method for the numerical evaluation of continuous Fourier and Laplace transforms” (PDF), SIAM Journal on Scientific Computing, 15 (5): 1105–1110, doi:10.1137/0915067.
  • Campbell, George; Foster, Ronald (1948), Fourier Integrals for Practical Applications, New York: D. Van Nostrand Company, Inc.
  • Boashash, B., ed. (2003), Time-Frequency Signal Analysis and Processing: A Comprehensive Reference, Oxford: Elsevier Science, ISBN 0-08-044335-4.
  • Min Xu; et al. (2004). “HMM-based audio keyword generation”. In Kiyoharu Aizawa; Yuichi Nakamura; Shin’ichi Satoh. Advances in Multimedia Information Processing – PCM 2004: 5th Pacific Rim Conference on Multimedia (PDF). Springer. ISBN 3-540-23985-5.
  • M.P. Wachowiak, G.S. Rash, P.M. Quesada, and A.H. Desoky. 2000. Waveletbased noise removal for biomechanical signals: a comparative study. IEEE Trans. Biomed. Eng. 47(3):360-368, March
