[OTDev] Validation: Efficiency
Martin Guetlein martin.guetlein at googlemail.com
Fri Feb 25 13:33:05 CET 2011
- Previous message: [OTDev] Validation: Efficiency
- Next message: [OTDev] Validation: Efficiency
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On Fri, Feb 25, 2011 at 1:21 PM, Nina Jeliazkova
<jeliazkova.nina at gmail.com> wrote:
> Martin,
>
> What do you think about introducing a resource for "groups of datasets"? It
> could be used as a placeholder for the URIs of several datasets, and use
> some wildcards on the policy server to ensure that only one policy for the
> group of datasets is needed.
>
> Regards,
> Nina

Would this not just shift the A&A work to adding and deleting those
wildcards (because we would need one wildcard for every dataset)?
Not sure if this would be faster, Andreas?

Regards,
Martin

> On 25 February 2011 14:00, Martin Guetlein
> <martin.guetlein at googlemail.com> wrote:
>>
>> > On 25 February 2011 13:02, Andreas Maunz <andreas at maunz.de> wrote:
>> >
>> >> Dear all,
>> >>
>> >> Since a single validation of a model on a dataset creates multiple
>> >> resources (currently > 50), and since everything in OpenTox is
>> >> decentralized (i.e. linked via URIs) and referenceable, we are facing
>> >> the problem that a prohibitively high load is currently placed on the
>> >> A&A services, because a policy must be created and requested multiple
>> >> times (and eventually deleted) for each of the resources.
>> >>
>> >> For example, the spike at the very right of http://tinyurl.com/6amuo8x
>> >> is produced by a single validation. Moreover, the validation service is
>> >> very slow; the A&A-related part alone takes at least several minutes.
>> >> All of this is caused by the number of individual policies that have to
>> >> be created.
>> >>
>> >> Martin argues that there currently seems to be no API-compliant way of
>> >> improving performance: one option would be to collect all URIs and
>> >> create a single policy covering all of them at the end of the
>> >> validation. However, there is no way of notifying the services involved
>> >> in a validation not to create policies in the first place.
>> >> Also, without policies, there would be no way for the validation to
>> >> access the resources, since the default (without an associated policy)
>> >> is "deny".
>> >>
>> >> We consider this issue high priority; it should be dealt with before
>> >> everyone starts using validation in production. Perhaps we need an API
>> >> extension that allows the collection strategy discussed above, or are
>> >> there other suggestions?
>> >>
>> >> Best regards
>> >> Andreas
>>
>> On Fri, Feb 25, 2011 at 12:06 PM, Nina Jeliazkova
>> <jeliazkova.nina at gmail.com> wrote:
>> > Andreas,
>> >
>> > I have not thought about it in detail, but bearing in mind the
>> > differences between the dataset implementations at Freiburg and ours, I
>> > think part of the problem is that (AFAIK) your implementation makes a
>> > full copy of the dataset on each run, instead of reusing the same URIs
>> > (e.g. the same records in the database).
>> >
>> > So maybe this is just implementation-specific?
>> >
>> > Nina
>>
>> Hi Nina, all,
>>
>> I will try to explain my validation point of view on things:
>>
>> When Andreas talks about a validation, he means a 10-fold
>> crossvalidation. This is where the 50+ resources (and therefore
>> policies) come from: the dataset is split into 10 training and 10 test
>> datasets; 10 models are built; the predictions of the models are stored
>> in 10 prediction datasets; the results of each prediction on the folds
>> are stored in 10 single validations; and the crossvalidation itself is
>> 1 resource. (Creating reports adds further resources, 1 per report.)
>>
>> I have implemented it so that no redundant training/test dataset folds
>> are created when a crossvalidation with the same parameters (dataset,
>> num-of-folds, random-seed, ...) has been performed before. The problem
>> is that ToxCreate uploads a new dataset for each validation, so nothing
>> can be reused there.
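The resource arithmetic in the crossvalidation breakdown above works out as follows. This is a minimal illustrative sketch; the helper function is hypothetical and not part of any OpenTox API, but the per-fold counts come directly from the mail:

```python
def crossvalidation_resources(folds=10, reports=0):
    """Count the resources (and thus policies) one crossvalidation creates."""
    training_sets = folds      # one training dataset per fold
    test_sets = folds          # one test dataset per fold
    models = folds             # one model built per fold
    prediction_sets = folds    # one prediction dataset per fold
    validations = folds        # one single validation per fold
    crossvalidation = 1        # the crossvalidation resource itself
    return (training_sets + test_sets + models + prediction_sets
            + validations + crossvalidation + reports)

print(crossvalidation_resources())  # 51 resources for a plain 10-fold run
```

Each of these resources needs its own policy under the current scheme, which is where the 50+ policy creations (and later deletions) per validation come from.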
>> What Nina does with her dataset service is provide views on
>> datasets/subsets of datasets by specifying a feature_uris[] or
>> compound_uris[] parameter. I decided not to use this, to prevent
>> algorithms from cheating. Using it would save the 20 training and test
>> folds out of the roughly 50 resources.
>>
>> The problem is that the deletion of policies is very slow. So A&A does
>> not (considerably) slow down the actual validation; only deleting old
>> validations takes long. This is why we are trying to think of ways to
>> store all resources under one policy.
>>
>> Martin
>>
>> >> _______________________________________________
>> >> Development mailing list
>> >> Development at opentox.org
>> >> http://www.opentox.org/mailman/listinfo/development

--
Dipl-Inf. Martin Gütlein
Phone:
+49 (0)761 203 8442 (office)
+49 (0)177 623 9499 (mobile)
Email:
guetlein at informatik.uni-freiburg.de
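The dataset views mentioned above are selected via feature_uris[] / compound_uris[] query parameters on a dataset URI. A minimal sketch of how such a view URI could be assembled, assuming a generic REST dataset service (all URIs below are hypothetical examples, not real OpenTox endpoints):

```python
from urllib.parse import urlencode

def dataset_view_uri(dataset_uri, feature_uris=(), compound_uris=()):
    """Build a 'view' on a dataset by appending feature_uris[] /
    compound_uris[] query parameters instead of copying the data."""
    params = [("feature_uris[]", f) for f in feature_uris]
    params += [("compound_uris[]", c) for c in compound_uris]
    return dataset_uri + "?" + urlencode(params) if params else dataset_uri

# Hypothetical dataset and compound URIs, for illustration only:
uri = dataset_view_uri(
    "http://example.org/dataset/1",
    compound_uris=["http://example.org/compound/7"],
)
```

The design point in the thread: such a view is a single addressable resource (one URI, one policy), whereas materializing 10 training and 10 test folds creates 20 separate resources, each needing its own policy.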