FMiner
Contact: Andreas Maunz
Categories: Descriptor calculation
Exposed methods:
Description:
News
29 Apr 2009: The Backbone Refinement Class paper (co-authored by Christoph Helma and Stefan Kramer) has been accepted for presentation at the KDD 2009 conference on Knowledge Discovery and Data Mining (Jun 28 - Jul 1 2009 in Paris) and for inclusion in the conference proceedings.
30 Apr 2009: The paper has been selected for oral presentation at MLG 2009.
10 Jul 2009: KDD conference proceedings are online.
Abstract
We present a new approach to large-scale graph mining based on so-called backbone refinement classes. The method efficiently mines tree-shaped subgraph descriptors under minimum frequency and significance constraints, using classes of fragments to reduce feature set size and running times. The classes are defined in terms of fragments sharing a common backbone. The method is able to optimize structural inter-feature entropy rather than occurrences, the latter being characteristic of open or closed fragment mining. In the experiments, the proposed method reduces feature set sizes by more than 90% compared to complete tree mining and by more than 30% compared to open tree mining. Evaluation using cross-validation runs shows that the classification accuracy of the resulting descriptors is similar to that of the complete set of trees but significantly better than that of open trees. Compared to open or closed fragment mining, a large part of the search space can be pruned due to an improved statistical constraint (dynamic upper bound adjustment); the experiments confirm this with lower running times than ordinary (static) upper bound pruning. Further analysis using large-scale datasets yields insight into important properties of the proposed descriptors, such as the dataset coverage and the class size represented by each descriptor. A final cross-validation run confirms that the novel descriptors render feasible large training sets that previously might have been intractable.
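The interplay of the minimum frequency constraint, the significance constraint and upper bound pruning can be illustrated with a small sketch. This is not FMiner's implementation: it assumes a chi-square test on a 2x2 contingency table as the significance measure and uses the generic convexity-based static bound from the statistical pattern mining literature. The dynamic adjustment described in the abstract is not shown. All function names, the default minimum frequency and the threshold 3.84 (the 95% critical value for one degree of freedom) are illustrative assumptions.

```python
def chi_square(x, y, n, m):
    """Chi-square statistic of a fragment's 2x2 contingency table.

    x: compounds containing the fragment
    y: active compounds containing the fragment
    n: all compounds in the dataset
    m: all active compounds in the dataset
    """
    # Degenerate tables carry no class information.
    if x == 0 or x == n or m == 0 or m == n:
        return 0.0
    a = y                    # fragment present, active
    b = x - y                # fragment present, inactive
    c = m - y                # fragment absent, active
    d = (n - m) - (x - y)    # fragment absent, inactive
    return n * (a * d - b * c) ** 2 / (x * (n - x) * m * (n - m))


def static_upper_bound(x, y, n, m):
    """Upper bound on chi-square for every refinement of a fragment.

    A refinement (larger fragment) occurs in a subset of the compounds,
    so it covers at most y actives and at most x - y inactives.
    Chi-square is convex, so its maximum over that region is reached at
    one of the two extreme cases: only the actives remain, or only the
    inactives remain.
    """
    return max(chi_square(y, y, n, m), chi_square(x - y, 0, n, m))


def can_prune(x, y, n, m, min_frequency=2, min_chi_square=3.84):
    """True if neither the fragment nor any refinement can be kept."""
    if x < min_frequency:
        # Refinements can only be rarer than the fragment itself.
        return True
    return static_upper_bound(x, y, n, m) < min_chi_square
```

For example, in a hypothetical dataset of 100 compounds with 50 actives, a fragment found in 4 compounds (2 of them active) can be pruned, because no refinement of it can reach the threshold, whereas a fragment found in 40 compounds (35 active) must be explored further.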
Co-occurrence-based 2D embedding of molecules and backbone refinement class features, showing close to perfect separation of target classes from top left to bottom right. Activating features are green and deactivating features are red; active instances are blue and inactive instances are salmon. Data: CPDB salmonella mutagenicity; Euclidean embedding: Schulz et al. A Flash-animated version indicating occurrences is also available.
License
LibFminer is licensed under the terms of the GNU General Public License (GPL, see LICENSE). LibFminer is derived from (i.e. includes code from) the following project, licensed under the GPL:
- Siegfried Nijssen and Joost Kok. A Quickstart in Frequent Structure Mining Can Make a Difference. Proceedings of the SIGKDD, 2004 (http://www.liacs.nl/home/snijssen/gaston/)
These licensing conditions mean essentially that your published program may only use (i.e., link to) and/or derive code from LibFminer under the condition that your source code is also freely available. This is to secure public availability and freedom of use.
Type of Descriptor:
Substructural descriptors (acyclic substructures). Wildcards and other more advanced features of the SMARTS language are currently not used. Results can be used in all fingerprint-based similarity and distance measures.
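As an illustration of how such substructural descriptors feed a fingerprint-based measure, the following minimal sketch computes the Tanimoto similarity between two compounds, each represented by the set of fragments that match it. The function name and the example SMARTS strings are hypothetical and are not taken from FMiner output.

```python
def tanimoto(fragments_a, fragments_b):
    """Tanimoto (Jaccard) similarity of two fragment sets.

    Each argument is an iterable of fragment identifiers, for example
    the SMARTS patterns that match a compound.
    """
    a, b = set(fragments_a), set(fragments_b)
    if not a and not b:
        # Convention chosen here: two empty fingerprints are not similar.
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical fingerprints: fragments matching two compounds.
compound_1 = {"c:c", "C-N", "C=O"}
compound_2 = {"c:c", "C=O", "C-Cl"}
similarity = tanimoto(compound_1, compound_2)  # 2 shared / 4 distinct
```

The corresponding distance is simply one minus the similarity.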
Interfaces: Standalone application
Priority: Medium
Development status: Stable
Homepage: http://www.maunz.de/libfminer-doc/
Dependencies:
External components: OpenBabel, GSL - GNU Scientific Library
Technical details
Data: No
Software: Yes
Programming language(s): C++; bindings can be generated automatically for Ruby, Java, Python and many others
Operating system(s): Linux (Windows support planned)
Input format: SMILES strings, gSpan format
Output format: SMARTS patterns
License: GPL
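
For readers unfamiliar with the gSpan input format, the sketch below parses the widely used gSpan text layout, in which a line `t # <id>` starts a graph, `v <id> <label>` declares a vertex and `e <from> <to> <label>` declares an edge. Whether FMiner expects exactly this dialect, and how it encodes atom and bond labels, is an assumption that should be checked against the documentation on the homepage. The function name and the sample data are illustrative.

```python
def parse_gspan(text):
    """Parse graphs from text in the common gSpan layout.

    Returns a list of dicts with the keys "id", "vertices"
    (vertex id -> label) and "edges" (list of (from, to, label)).
    Labels are kept as strings; unknown line types are ignored.
    """
    graphs = []
    current = None
    for raw_line in text.splitlines():
        parts = raw_line.split()
        if not parts:
            continue  # skip blank lines
        kind = parts[0]
        if kind == "t":
            # "t # <graph id>" opens a new graph.
            current = {"id": parts[2], "vertices": {}, "edges": []}
            graphs.append(current)
        elif kind == "v" and current is not None:
            current["vertices"][int(parts[1])] = parts[2]
        elif kind == "e" and current is not None:
            current["edges"].append((int(parts[1]), int(parts[2]), parts[3]))
    return graphs


# Hypothetical sample: two small graphs with numeric labels.
sample = """t # 0
v 0 6
v 1 8
e 0 1 1
t # 1
v 0 6
"""
parsed = parse_gspan(sample)
```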

