[OTDev] Validation: Efficiency

Fri Feb 25 13:00:39 CET 2011

> On 25 February 2011 13:02, Andreas Maunz <andreas at maunz.de> wrote:
>
>> Dear all,
>>
>> since a single validation of a model on a dataset creates multiple
>> resources (currently > 50), and since everything in OpenTox is
>> decentralized (i.e. linked via URIs) and referenceable, we are
>> facing the problem that a prohibitively high load is currently placed on the
>> AA services, because a policy must be created and requested multiple times
>> (and eventually deleted) for each of the resources.
>>
>> For example, the spike at the far right of http://tinyurl.com/6amuo8x is
>> produced by a single validation. Moreover, the validation service is very
>> slow: the AA-related part alone takes at least several minutes. All this is
>> caused by the number of individual policies that have to be created.
>>
>> Martin argues that there currently seems to be no API-compliant way of
>> improving performance. One way could be to collect all URIs and create a
>> policy covering all of them at the end of the validation. However, there is
>> no way of telling the services involved in a validation not to create
>> policies in the first place. Also, without policies, there would be no way
>> for validation to access the resources, since the default (without an
>> associated policy) is "deny".
>>
>> We consider this issue high priority, which should be dealt with before
>> everyone starts using validation in production. Perhaps we would need an API
>> extension that allows the collection strategy discussed before, or are there
>> other suggestions?
>>
>> Best regards
>> Andreas

On Fri, Feb 25, 2011 at 12:06 PM, Nina Jeliazkova
<jeliazkova.nina at gmail.com> wrote:
> Andreas,
>
> I have not thought about it in detail, but bearing in mind the differences
> between the dataset implementation at Freiburg and ours, I think part of the
> problem is that (AFAIK) your implementation makes a full copy of the dataset
> on each run, regardless of using the same URIs (e.g. as the same records in
> the database).
>
> So maybe this is just implementation-specific?
>
> Nina

Hi Nina, all

I will try to explain things from the validation point of view:

Andreas was referring to a 10-fold crossvalidation when he talked about
a validation. This is where the 50+ resources (and therefore policies)
come from when performing a crossvalidation:
The dataset is split into 10 training and 10 test datasets. 10 models
are built. The predictions of the models are stored in 10 prediction
datasets. The results of each prediction on the folds are stored in 10
single validations. The actual crossvalidation is 1 resource.
(Creating reports adds new resources (1 per report).)
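
To make the tally explicit, here is a small Python sketch that just adds up
the numbers listed above. The function name and the dictionary keys are made
up for illustration; they are not part of any OpenTox API.

```python
def crossvalidation_resources(num_folds=10, num_reports=0):
    """Tally the resources created by one k-fold crossvalidation.

    Illustrative only: the categories follow the description in this mail,
    the names are not taken from any OpenTox service.
    """
    return {
        "training datasets": num_folds,
        "test datasets": num_folds,
        "models": num_folds,
        "prediction datasets": num_folds,
        "single validations": num_folds,
        "crossvalidation": 1,
        "reports": num_reports,
    }


counts = crossvalidation_resources(num_folds=10)
print(sum(counts.values()))  # 51
```

With the current setup each of these resources gets its own policy, so the
same number applies to policies.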

I implemented it so that no redundant training / test dataset folds are
created when a crossvalidation with equal params (dataset, num-of-folds,
random-seed, ...) was performed before. The problem is that
ToxCreate uploads a new dataset for each validation, so nothing can
be reused here.
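
A rough sketch of the reuse idea, assuming the folds are fully determined by
dataset URI, number of folds and random seed. The class FoldCache, its
methods and the way the split is computed are hypothetical and only meant to
show why a freshly uploaded dataset (new URI) can never hit the cache; this
is not the actual validation service code.

```python
import random


class FoldCache:
    """Remembers fold splits by the parameters that determine them."""

    def __init__(self):
        self._splits = {}
        self.splits_created = 0

    def get_split(self, dataset_uri, compound_ids, num_folds, random_seed):
        # Equal params -> same key -> the stored split is returned as-is.
        key = (dataset_uri, num_folds, random_seed)
        if key not in self._splits:
            self._splits[key] = self._make_split(
                compound_ids, num_folds, random_seed
            )
            self.splits_created += 1
        return self._splits[key]

    @staticmethod
    def _make_split(compound_ids, num_folds, random_seed):
        shuffled = list(compound_ids)
        random.Random(random_seed).shuffle(shuffled)
        folds = []
        for i in range(num_folds):
            # Fold i uses every num_folds-th compound as its test set.
            test = shuffled[i::num_folds]
            test_set = set(test)
            train = [c for c in shuffled if c not in test_set]
            folds.append((train, test))
        return folds


cache = FoldCache()
compounds = list(range(100))
first = cache.get_split("dataset/A", compounds, 10, 1)
again = cache.get_split("dataset/A", compounds, 10, 1)  # reused
other = cache.get_split("dataset/B", compounds, 10, 1)  # new URI, new split
print(cache.splits_created)  # 2
```

The third call stands for the ToxCreate case: same content, but a new dataset
URI, so the key differs and the folds are created again.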

What Nina is doing with her dataset services is to provide views on
datasets/subsets of datasets by specifying a feature_uris[] or
compound_uris[] parameter. I decided not to use this, to prevent
algorithms from cheating. Using this would save the 20 training and
test folds out of the roughly 50 resources.
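
For illustration, such a view could be addressed roughly as below. Only the
parameter names feature_uris[] and compound_uris[] come from the description
above; the helper dataset_view_uri and the example.org URIs are placeholders,
and the exact URI layout of a real dataset service may differ.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def dataset_view_uri(dataset_uri, compound_uris=(), feature_uris=()):
    """Build a URI for a subset view of a dataset (hypothetical helper)."""
    params = [("compound_uris[]", c) for c in compound_uris]
    params += [("feature_uris[]", f) for f in feature_uris]
    if not params:
        return dataset_uri
    # urlencode percent-encodes the brackets and the nested URIs.
    return dataset_uri + "?" + urlencode(params)


view = dataset_view_uri(
    "http://example.org/dataset/1",
    compound_uris=[
        "http://example.org/compound/1",
        "http://example.org/compound/2",
    ],
)
# Decoding the query string gives back the selected compounds.
selected = parse_qs(urlsplit(view).query)["compound_uris[]"]
print(selected)
```

A fold would then be a view on the original dataset instead of a new dataset
resource with its own policy.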

The problem is that the deletion of policies is very slow. So A&A does
not (considerably) slow down the actual validation; only
deleting old validations takes long. This is why we are trying to think of
ways to cover all resources with one policy.
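
To show what one policy for all resources would mean in terms of requests,
here is a toy comparison. FakePolicyService is an in-memory stand-in that only
counts calls; it does not reflect the real A&A interface, and as discussed in
this thread there is currently no API-compliant way to do the grouped
variant.

```python
class FakePolicyService:
    """In-memory stand-in for an A&A service; it only counts requests."""

    def __init__(self):
        self.policies = {}
        self.requests = 0
        self._next_id = 0

    def create_policy(self, uris):
        self.requests += 1
        self._next_id += 1
        self.policies[self._next_id] = frozenset(uris)
        return self._next_id

    def delete_policy(self, policy_id):
        self.requests += 1
        del self.policies[policy_id]


def one_policy_per_resource(service, uris):
    # Current situation: every resource gets (and later loses) its own policy.
    policy_ids = [service.create_policy([uri]) for uri in uris]
    for policy_id in policy_ids:
        service.delete_policy(policy_id)


def one_policy_for_all(service, uris):
    # Collection strategy: a single policy covers all collected URIs.
    policy_id = service.create_policy(uris)
    service.delete_policy(policy_id)


uris = [f"resource-{i}" for i in range(51)]  # placeholder URIs

single = FakePolicyService()
one_policy_per_resource(single, uris)

grouped = FakePolicyService()
one_policy_for_all(grouped, uris)

print(single.requests, grouped.requests)  # 102 2
```

The toy numbers say nothing about how long a single request takes, only that
deleting an old crossvalidation would need one delete instead of about 50.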

Martin

>> _______________________________________________
>> Development mailing list
>> Development at opentox.org
>> http://www.opentox.org/mailman/listinfo/development
>>

-- 
Dipl-Inf. Martin Gütlein
Phone:
+49 (0)761 203 8442 (office)
+49 (0)177 623 9499 (mobile)
Email:
guetlein at informatik.uni-freiburg.de