Natural Language Interface to Songs DB
We have a MySQL database of songs along with their meta-information, grouped into different tables. Our objective is to develop a user interface that lets users search for songs with natural language queries and play the selected songs. We are developing this project in C++ using the WxWidgets GUI framework. The team is composed of two final-year students, Sahana and Sarayu of the CSE department at SRM University, mentored by Sudarsun of ARC.
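As a rough illustration of the idea, the sketch below turns a simple English request into a parameterized SQL query. It is written in Python for brevity rather than in the project's C++. The table name `songs`, its columns (`title`, `artist`, `genre`, `year`), the genre list and the function `build_query` are all hypothetical, since the actual schema is not described here. The `%s` placeholders follow the parameter style used by MySQL drivers for Python such as MySQL Connector/Python.

```python
import re

# Hypothetical schema: songs(title, artist, genre, year).
# None of these names come from the project; they only illustrate the idea.
KNOWN_GENRES = {"melody", "folk", "classical", "rock"}

def build_query(text):
    """Translate a simple English request into (sql, params)."""
    words = text.lower()
    clauses, params = [], []

    # "by <artist>", read up to the next keyword or the end of the request
    m = re.search(r"\bby ([a-z .]+?)(?= from\b| in\b|$)", words)
    if m:
        clauses.append("artist LIKE %s")
        params.append("%" + m.group(1).strip() + "%")

    # a four-digit year such as 1985 or 2007
    m = re.search(r"\b(19|20)\d{2}\b", words)
    if m:
        clauses.append("year = %s")
        params.append(int(m.group(0)))

    # the first known genre keyword that appears in the request
    for genre in sorted(KNOWN_GENRES):
        if re.search(r"\b" + genre + r"\b", words):
            clauses.append("genre = %s")
            params.append(genre)
            break

    sql = "SELECT title, artist FROM songs"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params

sql, params = build_query("play melody songs by asha from 1985")
print(sql)
print(params)
```

Keeping the user's words in the parameter list rather than in the SQL string avoids injection problems. A real interface would need a far richer grammar than these three patterns.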
Beat Count Estimator (Thaalam)
Every song has a Thaalam (or beat count). The higher the beat count, the faster the song; similarly, a lower beat count means a slower song. Beat count could be a very interesting and useful parameter by which to pre-classify songs. With huge song databases, beat-count-based pre-classification helps minimize the search space and hence speeds up searching. Using FFT and other DSP techniques, the beat count could be derived from the song and used as a principal feature for classification. The team was formed on 23rd October 2007 with Vikas Bharadwaj of SRM University and Sudarsun of ARC.
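One possible way to derive a beat count with Fourier analysis is sketched below in Python. It assumes the song has already been reduced to an onset-strength envelope (one loudness-change value per frame), which is not shown. Instead of a full FFT, it evaluates single DFT coefficients at each candidate tempo and picks the strongest. The function name `estimate_bpm`, the 40 to 200 BPM search range and the synthetic test signal are assumptions made for illustration, not the project's method.

```python
import cmath
import math

def estimate_bpm(envelope, frame_rate, bpm_min=40, bpm_max=200):
    """Return the candidate tempo (BPM) with the largest DFT magnitude."""
    n = len(envelope)
    mean = sum(envelope) / n
    centered = [x - mean for x in envelope]  # remove the DC component
    best_bpm, best_mag = None, -1.0
    for bpm in range(bpm_min, bpm_max + 1):
        freq = bpm / 60.0  # candidate beat frequency in Hz
        # One DFT coefficient evaluated at the candidate beat frequency
        coeff = sum(x * cmath.exp(-2j * math.pi * freq * t / frame_rate)
                    for t, x in enumerate(centered))
        if abs(coeff) > best_mag:
            best_bpm, best_mag = bpm, abs(coeff)
    return best_bpm

# Synthetic envelope: 10 seconds at 100 frames/s with a pulse every
# 0.5 s, which corresponds to 120 beats per minute.
envelope = [1.0 if t % 50 == 0 else 0.0 for t in range(1000)]
print(estimate_bpm(envelope, frame_rate=100))
```

Real recordings are much less clean than this pulse train: harmonics and half-tempo or double-tempo peaks usually have to be resolved with additional heuristics.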
Tamil TTS
An open-source Text-to-Speech implementation for the Tamil language is to be developed, which would read out TSCII- and Unicode-formatted text documents containing classical and colloquial Tamil text. The basis of the TTS is building a library of phonemes and their n-grams, with optimal chaining to give better Tamil speech synthesis. Joseph of SRM University and Sudarsun of ARC have been working on this project since 27th October 2007.
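The chaining step could be sketched as follows in Python. This assumes that "optimal" means covering the phoneme sequence with the fewest recorded units, so that the synthesized speech has the fewest joins; the project may well use a different criterion. The function `chain_units`, the romanized phoneme labels and the small unit library are hypothetical.

```python
def chain_units(phonemes, library):
    """Split a phoneme sequence into the fewest units found in `library`.

    `library` is a set of tuples, each an n-gram of phoneme labels for
    which a recorded unit exists. Returns a list of tuples, or None if
    the sequence cannot be covered.
    """
    n = len(phonemes)
    max_len = max((len(unit) for unit in library), default=0)
    # best[i] = fewest units needed to cover phonemes[:i]
    best = [0] + [None] * n
    back = [0] * (n + 1)
    for end in range(1, n + 1):
        for size in range(1, min(max_len, end) + 1):
            start = end - size
            unit = tuple(phonemes[start:end])
            if best[start] is not None and unit in library:
                if best[end] is None or best[start] + 1 < best[end]:
                    best[end] = best[start] + 1
                    back[end] = start
    if best[n] is None:
        return None
    # Walk the back-pointers to recover the chosen units in order
    units, end = [], n
    while end > 0:
        start = back[end]
        units.append(tuple(phonemes[start:end]))
        end = start
    return units[::-1]

# Hypothetical library: single phonemes plus a few longer n-grams
library = {("v",), ("a",), ("n",), ("k",), ("m",),
           ("v", "a"), ("n", "a", "k"), ("k", "a", "m")}
print(chain_units(["v", "a", "n", "a", "k", "k", "a", "m"], library))
```

Longer n-grams are preferred automatically because they reduce the unit count. A fuller cost function could also weigh how smoothly adjacent units join.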
Audio Language Detection
Given an audio snippet, be it speech, song or anything else, the objective is to detect the language spoken in it. We want to explore the acoustic properties of Indian languages and use them to detect the language. For example, languages like Oriya and Bengali have a lot of "O" sounds, while languages like Kannada have a lot of "ha" sounds and words often end with a short "a" (kuril) sound. Ajay Sundar of SRM University and Sudarsun of ARC have been working on this project since 27th October 2007.
Latent Dirichlet Allocation (LDA)
Latent Dirichlet allocation (LDA) is a topic model for text or other discrete data. LDA allows you to analyze a corpus and extract the topics that combine to form its documents. It is a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. This project is currently hosted at SourceForge.net under YRIF projects.
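To make the "finite mixture over topics" concrete, the Python sketch below runs LDA's generative process in the forward direction: draw a topic mixture for a document from a Dirichlet prior, then draw a topic and a word for every position. The two toy topics, their word probabilities and the function names are made up for illustration. The harder inverse problem of inferring topics from a corpus is not shown.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(alpha, topics, length, rng):
    """Generate one document following LDA's generative process.

    topics[k] maps each word to its probability under topic k.
    """
    theta = sample_dirichlet(alpha, rng)  # per-document topic mixture
    words = []
    for _ in range(length):
        # Pick a topic for this position according to the mixture
        k = rng.choices(range(len(topics)), weights=theta)[0]
        # Pick a word according to that topic's word distribution
        vocab = list(topics[k])
        weights = [topics[k][word] for word in vocab]
        words.append(rng.choices(vocab, weights=weights)[0])
    return theta, words

# Two toy topics (hypothetical word probabilities)
topics = [
    {"song": 0.5, "beat": 0.3, "tempo": 0.2},
    {"tamil": 0.5, "word": 0.3, "stem": 0.2},
]
rng = random.Random(0)
theta, words = generate_document([0.5, 0.5], topics, 10, rng)
print(theta)
print(words)
```

The vector `theta` is the per-document topic-probability representation mentioned above.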
பல்லாங்குழி (Pallanghuzhi)
பல்லாங்குழி (Pallanghuzhi) is a traditional South Indian game played by women and children using pebbles and a wooden board. It is a game for two, which can be played between networked computers, over the web, or as a standalone game by one player against the computer. The game is to be developed in Linux/C++ as an OSS project, which we plan to license under the GPL. Students with a good background in C++, Linux and Qt/WxWidgets are invited.
Inter-Indian Language Translator
Inter-Indian Language Translator is a TUI/GUI tool that converts text from any Indian language to any other Indian language. We want to start with a Tamil-to-Telugu translator as the first stage of development, which we expect to expand first to more target languages and then to more source languages. Students with a command of Linux, Perl/Python/C++, WxWidgets/Qt, and Tamil and/or Telugu are invited.
Indian Language Stemmer
Stemming is the process of finding the root word of an inflected word form using syntactic rules. Stemming algorithms have been implemented for many languages, but not for Indian languages. We want to start the chain by implementing a stemming algorithm for the Tamil language and gradually expand to other languages. Stemming plays a vital role in the development of Information Retrieval systems and text mining. The Porter algorithm is a popular English stemmer. Students with a linguistic background and programming experience in C/C++ are invited.
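The simplest form of rule-based stemming is suffix stripping, sketched below in Python. This is a toy with a handful of English suffixes, chosen only because they are easy to follow; it is not the Porter algorithm, and it says nothing about Tamil morphology. A Tamil stemmer would need its own table of inflectional suffixes and rules. The suffix list, the `min_stem` parameter and the function name are assumptions.

```python
# Toy English suffix list, longest first. This is NOT the Porter algorithm.
SUFFIXES = ["ingly", "edly", "ing", "ed", "ly", "es", "s"]

def stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["playing", "played", "songs", "sing", "bus"]:
    print(w, "->", stem(w))
```

The minimum stem length keeps short words such as "sing" and "bus" from being cut down to fragments.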
வட்டெழுத்து (Grandham) OCR
South India has a wealth of temples with inscriptions written in an ancient form of the Tamil language called Grandham. Inscriptions written on palm leaves and stones take a rounded shape, without dots, for most of the symbols. We intend to develop a tool that converts Grandham (வட்டெழுத்து) to modern Tamil and vice versa, with an OCR plugin, so that photographs of Grandham inscriptions can be converted to modern Tamil writing. Students with an inclination towards Tamil literature and experience with C++/Linux are invited to participate.
Plagiarism Detector
Plagiarism can be defined as the deliberate use of another person's work in your own work, as if it were your own, without adequate acknowledgement of the original source. If this is done in work that you submit for assessment, then you are attempting to deceive the examiners. In other words, plagiarism is cheating: trying to claim the credit for something that is not your work (as defined at http://helios.bto.ed.ac.uk). The plagiarism detector compares a test document against a reference document and gives a similarity score. When the score crosses a preset threshold, the detector triggers the plagiarism alarm. The detector is to be implemented as a standalone GUI tool that can learn similarity features from training documents. The detector should also implement similarity measurement based on rewritable rules. Upon successful standalone implementation, the concept is to be ported to a web-based detector tool.
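The score-and-threshold step described above might look like the Python sketch below. It uses Jaccard similarity over three-word shingles, which is one common choice and not necessarily the measure the project will adopt. The function names and the threshold of 0.5 are arbitrary assumptions. Learning similarity features from training documents and rule-based measurement are not covered.

```python
import re

def shingles(text, n=3):
    """Return the set of overlapping n-word sequences in `text`."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def similarity(test_doc, reference_doc, n=3):
    """Jaccard similarity of the two documents' shingle sets (0.0 to 1.0)."""
    a, b = shingles(test_doc, n), shingles(reference_doc, n)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def is_plagiarised(test_doc, reference_doc, threshold=0.5):
    """Raise the alarm when the score reaches the (assumed) threshold."""
    return similarity(test_doc, reference_doc) >= threshold

reference = "the quick brown fox jumps over the lazy dog"
suspect = "the quick brown fox jumps over the lazy cat"
print(similarity(suspect, reference))      # 6 shared of 8 distinct shingles
print(is_plagiarised(suspect, reference))
```

Word shingles catch copied passages even when a few words are swapped, but they miss paraphrase, which is where learned features or rewriting rules would come in.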