Hi,
This is an interesting project. I have done several machine learning projects, mainly in Python (pandas, sklearn and pylearn2), and I think the difficulty depends almost entirely on the data, so let me ask some questions:
- The attached doc says that the best models are based on random forests and neural networks. Which metric are you using to validate the models?
- What results have you obtained so far?
- Are there separate datasets for testing, or are the results obtained by cross-validation?
- It's hard to estimate the issues that can arise (covariate shift, outliers, ...) without a dataset. Would you be able to provide some kind of anonymized dataset?
Regarding the web application, I think it also depends on the datasets. For example, if there is covariate shift between universities/colleges, a separate model may need to be trained for each university, provided there is enough data.
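To make the covariate shift point concrete, here is a minimal sketch of how I might compare one feature between two universities. It is only an illustration: the function name `ks_statistic` and the sample values are made up, and the two-sample Kolmogorov-Smirnov statistic is just one possible check, not necessarily the one I would end up using on your data.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    max_gap = 0.0
    # Both empirical CDFs are step functions, so the largest gap
    # is always reached at one of the observed values.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # share of sample_a <= x
        cdf_b = bisect_right(b, x) / len(b)  # share of sample_b <= x
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


# Made-up values of a single feature for two hypothetical universities.
university_a = [5.1, 6.0, 6.4, 7.2, 7.9, 8.3]
university_b = [7.5, 8.0, 8.4, 8.8, 9.1, 9.6]
shift = ks_statistic(university_a, university_b)
print(f"KS statistic between universities: {shift:.3f}")
```

A value close to 0 would suggest the feature is distributed similarly in both universities, while a value close to 1 would suggest a strong shift and would be an argument for training per university.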
Please feel free to ask any questions you have.
Best regards,
Raul Rios