ml4chem.data package

Submodules

ml4chem.data.handler module

class ml4chem.data.handler.Data(images, purpose=None)[source]

Bases: object

A Data class

An adequate data structure is essential for developing machine-learning models. In general, a model receives a data set (X) and a target vector (y). This class is intended to arrange both in a format that can be vectorized and used not only with neural networks but also with support vector machines.

The central object here is the data set.

Parameters
  • images (list or object) – List of images. The supported format is that of ASE.

  • purpose (str) – Whether the data is for training or inference. Supported strings are “training” and “inference”.

get_data(purpose=None)[source]

A method to get data

Parameters

purpose (str) – The purpose of the data, so that the structure is prepared accordingly. Supported values are ‘training’ and ‘inference’.

Returns

  • self.images (dict) – Ordered dictionary of images, in the same order as the self.targets list.

  • self.targets (list) – Targets used for training the model.
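The ordering guarantee described above can be illustrated with a minimal sketch. It does not use ML4Chem; the function name, the integer keys, the string stand-ins for images, and the energy values are all made up for illustration.

```python
from collections import OrderedDict


def pair_images_with_targets(records):
    """Return (images, targets) that share the same order.

    `records` is a list of (image, target) tuples. Here the images are
    plain strings standing in for ASE objects (an assumption).
    """
    images = OrderedDict()
    targets = []
    for index, (image, target) in enumerate(records):
        images[index] = image   # the key choice is hypothetical
        targets.append(target)
    return images, targets


# Illustrative, made-up values.
records = [("H2O", -14.2), ("CO2", -22.9), ("CH4", -24.0)]
images, targets = pair_images_with_targets(records)

# The n-th image in the dictionary corresponds to the n-th target.
for (key, image), target in zip(images.items(), targets):
    print(key, image, target)
```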

get_total_number_atoms()[source]

Get the total number of atoms

get_unique_element_symbols(images=None, purpose=None)[source]

Unique element symbols in the data set

Parameters
  • images (list) – List of images (ASE objects).

  • purpose (str) – The supported categories are: ‘training’, ‘inference’.
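As a rough sketch of what collecting unique element symbols involves, the snippet below works on plain lists of chemical symbols, such as those returned by ASE's Atoms.get_chemical_symbols(). The function name is hypothetical, and sorting the result is an assumption rather than documented ML4Chem behavior.

```python
def unique_element_symbols(images):
    """Collect the sorted set of element symbols across all images.

    Each image is represented as a plain list of chemical symbols,
    a stand-in for an ASE Atoms object.
    """
    symbols = set()
    for image in images:
        symbols.update(image)
    return sorted(symbols)  # sorting is an assumption, for reproducibility


water = ["O", "H", "H"]
methane = ["C", "H", "H", "H", "H"]
print(unique_element_symbols([water, methane]))  # ['C', 'H', 'O']
```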

is_valid_structure(images)[source]

Check if the data has a valid structure

Parameters

images (list of atoms) – List of images.

Returns

valid – Whether or not the structure is valid.

Return type

bool

prepare_images(images, purpose=None)[source]

Prepare images for use with ML4Chem

Parameters
  • images (list or object) – List of images.

  • purpose (str) – The purpose of the data, so that the structure is prepared accordingly. Supported values are ‘training’ and ‘inference’.

to_pandas()[source]

Convert data to pandas DataFrame

ml4chem.data.parser module

class ml4chem.data.parser.SinglePointCalculator(implemented_properties=None)[source]

Bases: ase.calculators.calculator.Calculator

A SinglePointCalculator class

This class creates a fake calculator that is used to populate calc.results dictionaries in ASE objects.

Parameters

implemented_properties (list) – List with supported properties.

static get_forces(atoms)[source]

Get atomic forces

Parameters

atoms (obj) – ASE Atoms object.

Returns

forces – The atomic forces of the molecule.

static get_potential_energy(atoms)[source]

Get the potential energy

Parameters

atoms (obj) – ASE Atoms object.

Returns

energy – The energy of the molecule.
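The idea of a "fake" calculator, one that returns stored values instead of computing them, can be sketched without ASE. The class below is a hypothetical stand-in, not the ML4Chem or ASE implementation; its name and methods are assumptions.

```python
class FakeCalculator:
    """Hypothetical stand-in that only stores precomputed results."""

    def __init__(self, **results):
        # Mirrors the idea of a `results` dictionary holding properties.
        self.results = dict(results)

    def get_property(self, name):
        # Nothing is computed; a missing property is an error.
        if name not in self.results:
            raise KeyError(f"property {name!r} was not stored")
        return self.results[name]


# Illustrative, made-up values.
calc = FakeCalculator(energy=-1.5, forces=[[0.0, 0.0, 0.1], [0.0, 0.0, -0.1]])
print(calc.get_property("energy"))
print(calc.get_property("forces"))
```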

ml4chem.data.parser.ani_to_ase(hdf5file, data_keys, trajfile=None)[source]

Convert ANI data to ASE Atoms objects

Parameters
  • hdf5file (hdf5, list) – hdf5 file loaded using pyanitools (or list of them).

  • data_keys (list) – List of keys to extract data.

  • trajfile (str, optional) – Name of trajectory file to be saved, by default None.

Returns

atoms – A list of Atoms objects.

Return type

list

ml4chem.data.parser.cjson_parser(cjsonfile, trajfile=None)[source]

Parse CJSON files

Parameters
  • cjsonfile (str) – Path to the CJSON file.

  • trajfile (str, optional) – Name of trajectory file to be saved, by default None.

Returns

atoms – A list of Atoms objects.

Return type

list
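For orientation, the sketch below reads atomic numbers and flat 3D coordinates from a Chemical JSON string using only the standard library. It assumes the Open Chemistry layout ("atoms" → "elements" → "number" and "atoms" → "coords" → "3d"); the files that cjson_parser expects may differ, and the function name and the small symbol table are hypothetical.

```python
import json

# Small subset of the periodic table, enough for the example.
ATOMIC_SYMBOLS = {1: "H", 6: "C", 7: "N", 8: "O"}


def read_cjson_atoms(text):
    """Return (symbols, positions) from a Chemical JSON string."""
    data = json.loads(text)
    numbers = data["atoms"]["elements"]["number"]
    flat = data["atoms"]["coords"]["3d"]
    if len(flat) != 3 * len(numbers):
        raise ValueError("expected three coordinates per atom")
    # Regroup the flat coordinate list into (x, y, z) triples.
    positions = [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]
    symbols = [ATOMIC_SYMBOLS[n] for n in numbers]
    return symbols, positions


# Illustrative, made-up geometry.
example = json.dumps({
    "chemical json": 0,
    "atoms": {
        "elements": {"number": [8, 1, 1]},
        "coords": {"3d": [0.0, 0.0, 0.0, 0.96, 0.0, 0.0, -0.24, 0.93, 0.0]},
    },
})
symbols, positions = read_cjson_atoms(example)
print(symbols)
print(positions)
```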

ml4chem.data.parser.cjson_to_ase(cjson)[source]
ml4chem.data.parser.get_total_energy(cjson)[source]

ml4chem.data.preprocessing module

ml4chem.data.serialization module

ml4chem.data.utils module

ml4chem.data.utils.ase_to_xyz(atoms, comment='', file=True)[source]

Convert an ASE object to XYZ format

This function is useful for storing XYZ data in a DataFrame.
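As a reminder of what the XYZ format looks like, the sketch below builds an XYZ string from plain symbols and positions: the atom count, a comment line, then one line per atom. It is not the ML4Chem implementation; the function name and the six-decimal formatting are assumptions.

```python
def to_xyz(symbols, positions, comment=""):
    """Build an XYZ-format string from symbols and Cartesian positions."""
    lines = [str(len(symbols)), comment]
    for symbol, (x, y, z) in zip(symbols, positions):
        lines.append(f"{symbol} {x:.6f} {y:.6f} {z:.6f}")
    return "\n".join(lines) + "\n"


# Illustrative, made-up geometry.
xyz = to_xyz(
    ["O", "H", "H"],
    [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)],
    comment="water",
)
print(xyz)
```

Because the result is a single string, it can be stored in one DataFrame cell per structure.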

ml4chem.data.utils.split_data(images, training_name='training_images.traj', test_name='test_images.traj', randomize=True, test_set=20, logfile='data_split.log')[source]

Split data into training and test sets

Parameters
  • images (str or object) – A path to an ASE trajectory file or a list of Atoms objects.

  • training_name (str, optional) – Name of the training set trajectory file, by default ‘training_images.traj’

  • test_name (str, optional) – Name of the test set trajectory file, by default ‘test_images.traj’

  • randomize (bool, optional) – Randomize indices of images, by default True

  • test_set (int, optional) – Percentage of the data to be used as the test set, by default 20

  • logfile (str, optional) – Log file name, by default ‘data_split.log’
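The percentage-based split described by the parameters above can be sketched on a plain list. This is not the ML4Chem implementation: the function name and the seed argument are hypothetical, rounding down the test-set size is an assumption, and writing trajectory and log files is left out.

```python
import random


def split_by_percentage(items, test_percent=20, randomize=True, seed=None):
    """Return (training, test) lists; test holds about test_percent % of items."""
    indices = list(range(len(items)))
    if randomize:
        # A local Random instance keeps the global random state untouched.
        random.Random(seed).shuffle(indices)
    n_test = len(items) * test_percent // 100  # rounds down (assumption)
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]


items = list(range(10))
training, test = split_by_percentage(items, test_percent=20, seed=0)
print(len(training), len(test))  # 8 2
```

With randomize=False the first n_test items form the test set, which makes the split reproducible without a seed.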

Module contents