[OTDev] Validation: Efficiency
Nina Jeliazkova jeliazkova.nina at gmail.com
Fri Feb 25 18:59:23 CET 2011

On 25 February 2011 18:59, Christoph Helma <helma at in-silico.ch> wrote:

> > On 25 February 2011 18:16, Christoph Helma <helma at in-silico.ch> wrote:
> >
> > > > Nina Jeliazkova wrote on 02/25/2011 01:32 PM:
> > > > >
> > > > > On 25 February 2011 14:26, Martin Guetlein
> > > > > <martin.guetlein at googlemail.com> wrote:
> > > > >
> > > > > * Still 10 models would be created (and 10 validations, but I could
> > > > > try to solve this internally), so we would not end up with 1 policy
> > > > > for a crossvalidation.
> > > > >
> > > > > Unless predictions are stored in the same dataset.
> > > >
> > > > It sounds feasible to me. What do you think, Christoph?
> > >
> > > For efficiency reasons (and implementation simplicity) I prefer to keep
> > > datasets in small and manageable chunks. I am quite convinced that
> > > aggregating everything in a single dataset will not scale well. Lets
> > > assume a larger dataset with several 1000 compounds and several
> > > 1000-10000 class sensitive descriptors. Adding features for each
> > > validation fold would increase the dataset 11 times and with such a size
> > > I assume that all search/subset operations will be extremely slow. I do
> > > not even dare thinking about serialising such a monster to rdfxml.
> > >
> >
> > Not impossible, try our monsters :) . Searching on indexed field is usually
> > not a problem.
>
> I will have to try again, but I remember that in the past downloading e.g. the
> complete CPDB was quite time consuming. If this has improved let me know
> what you have done!
>

Tell me if this is better than before (there are still things left to
optimize) . This is run from a remote machine.

$ time curl -H "Accept:application/rdf+xml" http://apps.ideaconsult.net:8080/ambit2/dataset/10?max=2000 1> cpdbas.rdf
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 12.5M    0 12.5M    0     0  2043k      0 --:--:--  0:00:06 --:--:-- 2210k

real    0m6.295s
user    0m0.036s
sys     0m0.136s

Subset should take less time (as well as different mime type).

> > > > Both approaches have their own pros and cons.
> > >
> > > Yes and both serve different purposes, it would be nice to have a choice.
> > >
> > > @Martin: Would it help with AA to have "sets of datasets" accessible
> > > through URIs like /dataset/{set_id}/{dataset_id}.
> > >
> >
> > Yes, let's try a construct like this. I would prefer some other top level
> > term, instead of dataset, as it will be impossible to distinguish set ids
> > from dataset ids, e.g.
> >
> > /set/{id}/dataset/{id}
>
> Sorry, /dataset should only indicate that we are talking about the
> dataset service, so my proposal was
> {dataset_service_uri}/{set_id}/{dataset_id}
>

Mapped to our services, there is a need for top level "noun"

http://host:port/ambit2/{set_id}/{dataset_id}
http://host:port/ambit2/dataset/{set_id}/{dataset_id}
http://host:port/ambit2/set/{set_id}/dataset/{dataset_id}

Best regards,
Nina

> Best regards,
> Christoph
>
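
The concern about distinguishing set ids from dataset ids can be illustrated with a small routing sketch. This is only an illustration under assumptions: the route tables, the regular expressions, and the `classify` helper are hypothetical and are not taken from the AMBIT or OpenTox services. It compares a scheme where sets and datasets share the `/dataset` noun with one where sets get their own top-level `/set` noun.

```python
import re

# Hypothetical route tables (illustrative only, not the real service routes).
# Scheme A: sets and datasets share the top-level noun "dataset".
SHARED_NOUN = {
    "dataset": re.compile(r"^/dataset/(?P<dataset_id>[^/]+)$"),
    "set": re.compile(r"^/dataset/(?P<set_id>[^/]+)$"),
    "dataset_in_set": re.compile(
        r"^/dataset/(?P<set_id>[^/]+)/(?P<dataset_id>[^/]+)$"
    ),
}

# Scheme B: sets get their own top-level noun "set".
SEPARATE_NOUN = {
    "dataset": re.compile(r"^/dataset/(?P<dataset_id>[^/]+)$"),
    "set": re.compile(r"^/set/(?P<set_id>[^/]+)$"),
    "dataset_in_set": re.compile(
        r"^/set/(?P<set_id>[^/]+)/dataset/(?P<dataset_id>[^/]+)$"
    ),
}


def classify(path, routes):
    """Return the sorted names of all routes whose pattern matches path."""
    return sorted(name for name, pattern in routes.items() if pattern.match(path))


if __name__ == "__main__":
    # With a shared noun, a single id segment matches two routes.
    print(classify("/dataset/10", SHARED_NOUN))
    # With a separate noun, each path matches exactly one route.
    print(classify("/dataset/10", SEPARATE_NOUN))
    print(classify("/set/3/dataset/10", SEPARATE_NOUN))
```

In this sketch a path with one id segment under the shared noun matches both the set route and the dataset route, so the server cannot tell which resource is meant without extra information, while the separate-noun scheme gives each path exactly one match.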