[OTDev] ToxCreate integration of Ambit classification datasets
Christoph Helma helma at in-silico.ch
Tue Mar 22 19:43:19 CET 2011
Dear Nina, Vedrin, All,

I had a look at feature http://apps.ideaconsult.net:8080/ambit2/feature/21573 from http://apps.ideaconsult.net:8080/ambit2/dataset/9, which raises some interesting questions:

IMHO "Canc" is clearly a nominal feature, but its representation tells me that it is both a nominal and a numeric feature (maybe because the classes are represented as "1.0", "2.0" and "3.0").

To call the correct (classification or regression) algorithms, however, I need to know unambiguously:

1. the feature type (Numeric or Nominal)
2. the "true" and "false" classes for binary classifications

I assume that 1. can be solved easily by making NumericFeature and NominalFeature disjoint.

Guessing the "true" and "false" classes is harder, because there are many ways to indicate them in real-world datasets. In our services we currently check for common cases with regular expressions (e.g. active/inactive, 1/0, toxic/nontoxic, ...), but this will not work for all possible feature values.

I have no definitive solution for problem 2, just a few thoughts:

a) Present a list of classes and let the user assign the true and false classes
   + can be used for all datasets/features (also for the discretization of NumericFeatures)
   - needs human intervention (not suited for automated model creation)
   - the same step has to be repeated every time a dataset is used
   - might be error prone, might lead to suboptimal results from inexperienced users

b) Standardize the allowed values for NominalFeatures
   + unambiguous, automated processing possible
   - needs human curation of imported datasets

I tend to favor b) as a long-term solution. What's your opinion?

Another question: If I expand our regexp hack and implement a) as a fallback, I would need to write new feature values into a dataset. Would you prefer to

- overwrite the old values in the original dataset (original information is lost)
- add a new feature (with modified values) to the original dataset (original information untouched, but might destroy the dataset if handled improperly)
- create a new consolidated dataset (IMHO safest)

Best regards,
Christoph
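
Illustrative sketch (not part of the original message): the regular-expression check with a fallback to option a) might look roughly like the Python below. This is not the actual ToxCreate code. The function name `guess_binary_classes` and both patterns are hypothetical, and the patterns cover only the label pairs named in the message (active/inactive, 1/0, toxic/nontoxic).

```python
import re

# Hypothetical patterns: only the example label pairs from the message.
TRUE_PATTERN = re.compile(r"^(active|toxic|1(\.0)?)$", re.IGNORECASE)
FALSE_PATTERN = re.compile(r"^(inactive|non-?toxic|0(\.0)?)$", re.IGNORECASE)


def guess_binary_classes(values):
    """Return {"true": label, "false": label}, or None if no safe guess exists."""
    # Collect the distinct class labels that occur in the feature.
    classes = sorted({str(v).strip() for v in values})
    if len(classes) != 2:
        return None  # not binary, e.g. "1.0", "2.0", "3.0"
    true_hits = [c for c in classes if TRUE_PATTERN.match(c)]
    false_hits = [c for c in classes if FALSE_PATTERN.match(c)]
    # Accept the guess only if exactly one label matches each pattern.
    if len(true_hits) == 1 and len(false_hits) == 1:
        return {"true": true_hits[0], "false": false_hits[0]}
    return None  # ambiguous: the caller would ask the user, as in option a)


print(guess_binary_classes(["active", "inactive", "active"]))
print(guess_binary_classes(["1.0", "2.0", "3.0"]))
```

Returning `None` rather than a best guess keeps the automated path conservative: any feature whose labels are not recognized falls through to manual assignment.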