[OTDev] Validation: Efficiency
Martin Guetlein martin.guetlein at googlemail.com
Fri Feb 25 14:21:56 CET 2011
On Fri, Feb 25, 2011 at 2:19 PM, Nina Jeliazkova <jeliazkova.nina at gmail.com> wrote:
>
> On 25 February 2011 15:11, Martin Guetlein <martin.guetlein at googlemail.com> wrote:
>>
>> On Fri, Feb 25, 2011 at 1:32 PM, Nina Jeliazkova <jeliazkova.nina at gmail.com> wrote:
>> >
>> > On 25 February 2011 14:26, Martin Guetlein <martin.guetlein at googlemail.com> wrote:
>> >>
>> >> On Fri, Feb 25, 2011 at 12:53 PM, Nina Jeliazkova <jeliazkova.nina at gmail.com> wrote:
>> >> > Andreas,
>> >> >
>> >> > On 25 February 2011 13:28, Andreas Maunz <andreas at maunz.de> wrote:
>> >> >
>> >> >> Nina,
>> >> >>
>> >> >> you are right (I think it still is the case that datasets are redundant).
>> >> >> However, with different model parameters, which will probably be used a lot
>> >> >> in validation, new datasets will be created.
>> >> >> I think it would be definitely necessary to not store data redundantly (as
>> >> >> you indicated), but that might be only part of the solution.
>> >> >> So it may still be necessary to compress the amount of policies needed.
>> >> >>
>> >> >
>> >> > Well, thinking further
>> >> >
>> >> > 1) I would implement validation splits (at least at our services) as
>> >> > logical splits of the same dataset, assigning some tags, similar to what is
>> >> > in the mutagenicity Benchmark dataset (look for column "Set"
>> >> > http://apps.ideaconsult.net:8080/ambit2/feature/28956 )
>> >> >
>> >> > http://apps.ideaconsult.net:8080/ambit2/dataset/2344?max=100
>> >> >
>> >> > and introduce searching similar to the queries below (restricted to the
>> >> > property in question)
>> >> >
>> >> > Training set
>> >> > http://apps.ideaconsult.net:8080/ambit2/dataset/2344?search=TRAIN
>> >> >
>> >> > Crossvalidation sets
>> >> > http://apps.ideaconsult.net:8080/ambit2/dataset/2344?search=CV1
>> >> > http://apps.ideaconsult.net:8080/ambit2/dataset/2344?search=CV2
>> >> > http://apps.ideaconsult.net:8080/ambit2/dataset/2344?search=CV3
>> >> > ...
>> >> >
>> >> > Thus, everything is in the original dataset (or a single copy of it on
>> >> > another dataset service) and there is no need for additional policies.
>> >> >
>> >> > Different features, calculated during a validation run, would be specified
>> >> > via the feature_uris[] parameter on the same dataset URI.
>> >> >
>> >> > http://apps.ideaconsult.net:8080/ambit2/dataset/2344?search=CV3&feature_uris[]=..
>> >>
>> >> My concerns regarding using one dataset for everything:
>> >>
>> >> * This would allow the algorithm to remove the search string and
>> >> request data that is supposed to be unseen. What do people think about
>> >> that?
>> >>
>> >
>> > I am not sure you really prevent this with the current approach - what
>> > prevents the algorithm from retrieving any other data from any dataset
>> > service? Removing the search string changes the URI, so the algorithm
>> > service could just as well change the entire URI and receive any other
>> > dataset it has access to.
>> >
>> >>
>> >> * Still 10 models would be created (and 10 validations, but I could
>> >> try to solve this internally), so we would not end up with 1 policy
>> >> for a crossvalidation.
>> >>
>> > Unless predictions are stored in the same dataset.

I pressed the wrong reply button, here is what I sent to Nina and her reply:

>> I counted the 10 models as resources with policies. If they produce a
>> new prediction dataset each (not store it in the same dataset), it's
>> 20.
>>
>> One more thought / concern:
>> What if features have to be created (and selected) on each training
>> fold and stored in the same single dataset? This could lead to
>> problems: the super-service / feature creation model has to make sure
>> not to mix things up (like reusing features created on a previous fold).
>
> If each model generates a new feature URI (as it should), this is not an
> issue. And a new model generates a new prediction (a new column in a dataset),
> and this naturally means a different object and a different URI. Otherwise, it
> is blurring the meaning of a resource, in RDF at least.
>
> A feature generated from a different model should have a different
> ot:hasSource, pointing to that model. If one feature is shared between models,
> that introduces an inconsistency, as there will be pointers to two models and
> the superservice will not know which one to use for calculation. I hope we
> defined ot:hasSource as single-valued in the ontology; if not, we should
> correct it.
>
> Nina
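
The logical-split addressing proposed in the thread can be sketched roughly as follows. This is only an illustration under assumptions: the dataset URI, the `search` parameter, the `feature_uris[]` parameter and the TRAIN / CV1, CV2, ... tags are taken from the messages above, while the helper names `split_uri` and `crossvalidation_uris` are hypothetical and not part of any OpenTox or AMBIT API.

```python
from urllib.parse import urlencode

# Dataset URI quoted in the thread; every split is addressed on this one resource.
DATASET_URI = "http://apps.ideaconsult.net:8080/ambit2/dataset/2344"


def split_uri(dataset_uri, tag, feature_uris=()):
    """Build the URI of one logical split (hypothetical helper).

    `tag` is the value searched for in the split column, e.g. "TRAIN" or "CV1".
    `feature_uris` optionally restricts the columns via feature_uris[].
    """
    params = [("search", tag)]
    params += [("feature_uris[]", uri) for uri in feature_uris]
    # safe="[]" keeps the brackets of feature_uris[] unescaped;
    # parameters are joined with "&", and the values are percent-encoded.
    return dataset_uri + "?" + urlencode(params, safe="[]")


def crossvalidation_uris(dataset_uri, folds):
    """Return the split URIs for folds CV1..CVn (hypothetical helper)."""
    return [split_uri(dataset_uri, f"CV{i}") for i in range(1, folds + 1)]


if __name__ == "__main__":
    print(split_uri(DATASET_URI, "TRAIN"))
    for uri in crossvalidation_uris(DATASET_URI, 3):
        print(uri)
```

With this scheme a 10-fold crossvalidation only ever names one dataset resource. Whether the models and their prediction columns need policies of their own is the separate question raised above.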