Hi All!

Following up on this PR: https://github.com/apache/incubator-mxnet/pull/13241
I would appreciate comments or feedback on the API design:
https://cwiki.apache.org/confluence/display/MXNET/Gluon+-+Audio
The comments on the PR were mostly about librosa and its performance being a
blocker if and when the designed API is tested with bigger ASR models such as
DeepSpeech 2 and DeepSpeech 3. I would appreciate the community sharing its
expertise on loading audio data and on the feature extraction currently used
with bigger ASR models. If anything in the design can be changed or improved
to help performance, I'll be happy to look into it.

Thanks and regards,
Gaurav Gireesh

On Thu, Nov 15, 2018 at 10:47 AM Gaurav Gireesh <[email protected]> wrote:

> Hi Lai!
> Thank you for your comments! Below are the answers to your queries:
>
> 1) That's a good suggestion. I have added an example in the pull request:
> https://github.com/apache/incubator-mxnet/pull/13241/commits/eabb68256d8fd603a0075eafcd8947d92e7df27f
> I would be happy to include a dataset similar to MNIST to support that. I
> have come across an example dataset used in the TensorFlow speech
> recognition example here
> <https://www.tensorflow.org/tutorials/sequences/audio_recognition>; this
> could be included.
>
> 2) Thank you for the suggestion; I shall look into the FFT operator you
> have pointed out. However, there are other kinds of features, such as MFCC
> and mel spectrograms, that are popular for audio feature extraction and
> would find utility if implemented. I am not sure we have operators for
> those.
>
> 3) The references look good too. I shall look into them. Thank you for
> bringing them to my attention.
>
> Regards,
> Gaurav
>
> On Tue, Nov 13, 2018 at 11:22 AM Lai Wei <[email protected]> wrote:
>
>> Hi Gaurav,
>>
>> Thanks for starting this. I see the PR is out
>> <https://github.com/apache/incubator-mxnet/pull/13241>; I left some
>> initial reviews. Good work!
>>
>> In addition to Sandeep's queries, I have the following:
>> 1. Can we include a simple classic audio dataset for users to directly
>> import and try out, like MNIST in vision? (e.g.
>> http://pytorch.org/audio/datasets.html#yesno)
>> 2. LibROSA provides some good audio feature extraction, and we can use
>> it for now, but it is slow because of the conversions between NDArray
>> and NumPy. In the long term, can we make the transforms use MXNet
>> operators and turn them into hybrid blocks? For example, the MXNet FFT
>> operator
>> <https://mxnet.apache.org/api/python/ndarray/contrib.html?highlight=fft#mxnet.ndarray.contrib.fft>
>> can be used in a hybrid block transformer, which will be a lot faster.
>>
>> Some additional references of users already running MXNet on audio; we
>> should aim to make the file load/preprocess/transform process easier and
>> more automated:
>> 1. https://github.com/chen0040/mxnet-audio
>> 2. https://github.com/shuokay/mxnet-wavenet
>>
>> Looking forward to seeing this feature out.
>> Thanks!
>>
>> Best Regards,
>>
>> Lai
>>
>> On Tue, Nov 13, 2018 at 9:09 AM sandeep krishnamurthy <
>> [email protected]> wrote:
>>
>>> Thanks, Gaurav, for starting this initiative. The design document is
>>> detailed and gives all the information. Starting in "contrib" is a good
>>> idea while we expect a few rough edges and cleanups to follow.
>>>
>>> I had the following queries:
>>> 1. Is there any analysis comparing LibROSA with other libraries w.r.t.
>>> features, performance, and community usage in the audio domain?
>>> 2. What is the recommendation for the LibROSA dependency: part of the
>>> MXNet PyPI package, or asking the user to install it if required? I
>>> prefer the latter, similar to protobuf in ONNX-MXNet.
>>> 3. I see LibROSA is a fully Python-based library. Will that dependency
>>> block future use cases where we want to implement the transformations
>>> as operators and allow cross-language support?
>>> 4. In the performance design considerations, the difference between
>>> lazy=True and lazy=False is too scary (8 minutes vs. 4 hours!). This
>>> requires some more analysis. If we know that turning a flag on or off
>>> causes a 24x performance degradation, should we provide that control
>>> to the user at all? What is the impact on memory usage?
>>> 5. I see LibROSA has an ISC license
>>> (https://github.com/librosa/librosa/blob/master/LICENSE.md), which
>>> says it is free to use with the same license notification. I am not
>>> sure whether this is OK; I request other committers/mentors to advise.
>>>
>>> Best,
>>> Sandeep
>>>
>>> On Fri, Nov 9, 2018 at 5:45 PM Gaurav Gireesh <[email protected]>
>>> wrote:
>>>
>>>> Dear MXNet Community,
>>>>
>>>> I recently started looking into some simple multi-class sound
>>>> classification tasks with audio data and realized that, as a user, I
>>>> would like MXNet to have an out-of-the-box feature that lets us load
>>>> audio data (at least one file format), extract features (or apply
>>>> some common transforms/feature extraction), and train a model using
>>>> an audio dataset. This could be a first step towards building and
>>>> supporting APIs similar to what we have for "vision" use cases in
>>>> MXNet.
>>>>
>>>> Below is the design proposal:
>>>>
>>>> Gluon - Audio Design Proposal
>>>> <https://cwiki.apache.org/confluence/display/MXNET/Gluon+-+Audio>
>>>>
>>>> I would highly appreciate your taking the time to review and provide
>>>> feedback, comments, and suggestions.
>>>> Looking forward to your support.
>>>>
>>>> Best Regards,
>>>>
>>>> Gaurav Gireesh
>>>
>>> --
>>> Sandeep Krishnamurthy
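[Editor's note] Lai's suggestion to move feature extraction from LibROSA into
a batched operator inside a hybrid block boils down to expressing the work as
one array operation over all frames at once. The sketch below is a NumPy-only
illustration of that core step (framing a waveform, then a single batched
FFT); the function names are hypothetical, and `np.fft.rfft` stands in for
the on-device `mx.nd.contrib.fft` call the thread refers to.

```python
import numpy as np

def frame_signal(signal, frame_length=512, hop_length=256):
    """Split a 1-D waveform into overlapping frames (no padding)."""
    num_frames = 1 + (len(signal) - frame_length) // hop_length
    idx = (np.arange(frame_length)[None, :]
           + hop_length * np.arange(num_frames)[:, None])
    return signal[idx]

def magnitude_spectrogram(signal, frame_length=512, hop_length=256):
    """Windowed framewise FFT -> magnitude spectrogram."""
    frames = frame_signal(signal, frame_length, hop_length)
    frames = frames * np.hanning(frame_length)
    # One batched FFT over all frames -- the step that an MXNet operator
    # such as mx.nd.contrib.fft could perform on-device inside a
    # HybridBlock transform, avoiding NDArray<->NumPy round trips.
    spectrum = np.fft.rfft(frames, axis=1)
    return np.abs(spectrum)

if __name__ == "__main__":
    t = np.arange(16000) / 16000.0
    wave = np.sin(2 * np.pi * 440.0 * t)  # 1 s of a 440 Hz tone at 16 kHz
    spec = magnitude_spectrogram(wave)
    print(spec.shape)  # (num_frames, frame_length // 2 + 1)
```

Because every step is a whole-array operation with no Python loop over
frames, the same structure ports directly to symbolic operators in a
hybridized block.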

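[Editor's note] Sandeep's point 4 (the lazy=True / lazy=False gap) can be
illustrated with a toy dataset. The class name, flag, and structure below are
hypothetical, for illustration only, not the proposal's actual API: eager
mode pays the whole preprocessing cost once at construction (the long
start-up in the thread, plus memory for every transformed clip), while lazy
mode starts instantly but re-runs the transform on every access.

```python
class AudioDataset:
    """Toy dataset showing eager vs. lazy transform application.

    lazy=False: every clip is transformed once, up front (slow start-up,
    cheap per-item access, transformed features held in memory).
    lazy=True: nothing happens up front; the transform runs on every
    __getitem__ call, so the cost is paid again on each epoch.
    """

    def __init__(self, clips, transform, lazy=True):
        self._transform = transform
        self._lazy = lazy
        # Eager mode pays the full preprocessing cost here, once.
        self._clips = clips if lazy else [transform(c) for c in clips]

    def __len__(self):
        return len(self._clips)

    def __getitem__(self, i):
        item = self._clips[i]
        return self._transform(item) if self._lazy else item
```

This also makes the memory question concrete: with lazy=False the dataset
stores transformed features for every clip, so the flag trades start-up time
and memory against repeated per-epoch compute.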