Hi there, I'm Matt Sergei (please call me Matt). Rather than send a pretty generic bid, I'd like to help you with at least one concrete idea for an implementation.
Your solution could be implemented using the MPEG-7 specification, specifically its Audio part. (My graduation work used MPEG-7 for describing images, and I have since learned on my own how to use it for audio, too.) So, following your requirements:
1. Code that converts the audio input using MPEG-7 would create an XML file describing it. The data includes the bytes of sampled data, stored as text values.
2. Matching is done with simple pattern matching. The text values of the frequency samples can be converted to numbers first and then matched.
3. The results returned are the best matches, ranked by percentage, of course. The beauty of having a good metadata description of a particular audio file is that it can be saved with, for example, the artist's/performer's name, genre, etc. So the results can also return other songs by the same artist and/or audio with similar sound patterns.
4. I have done some data model training with R in the past, but would probably implement this in another programming language (Python being the logical choice; I'm also used to JavaScript, PHP, Java and C).
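To make points 1 to 3 a bit more concrete, here is a minimal Python sketch of the idea, not a finished design. It assumes a simplified XML layout without namespaces rather than the real MPEG-7 schema. The `Raw` tag and the function names `read_values`, `similarity_percent` and `rank_matches` are placeholders I made up for illustration. Cosine similarity stands in for the "simple pattern matching" step; the actual measure would still need to be chosen.

```python
import math
import xml.etree.ElementTree as ET


def read_values(xml_text, tag="Raw"):
    """Collect whitespace-separated numbers from every <tag> element.

    Assumption: a simplified, namespace-free XML layout, not real MPEG-7.
    """
    root = ET.fromstring(xml_text)
    values = []
    for element in root.iter(tag):
        values.extend(float(token) for token in (element.text or "").split())
    return values


def similarity_percent(a, b):
    """Cosine similarity over the common length of a and b, scaled to 0-100."""
    n = min(len(a), len(b))
    a, b = a[:n], b[:n]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        return 0.0  # an all-zero (or empty) vector matches nothing
    return max(0.0, dot / norm) * 100.0


def rank_matches(query, library):
    """Return (name, percent) pairs sorted from best to worst match."""
    scored = [(name, similarity_percent(query, values))
              for name, values in library.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Usage would look like `rank_matches(read_values(query_xml), library)`, where `library` maps a song name to its list of numbers. The metadata mentioned in point 3 (artist, genre) could be stored next to each entry and used to pull in related songs.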
Regarding the applications, I know basic app development with Android Studio (though I'm not up to date on its current state). My preference would be Kivy, a Python-based GUI framework that can export Android apps as well.
I'd like to know your thoughts. I would be able to start sometime in May.
Best regards, and I hope you're safe,
Matt