twaml.data¶

This module contains a class (and functions to load it) that abstracts datasets, using pandas.DataFrames as the payload, for feeding machine learning frameworks and for general data exploration.

Classes¶

dataset A class to define a dataset with a pandas.DataFrame as the payload of the class.

Creation Functions¶

`from_root`(input_files[, name, tree_name, …])	Initialize a dataset from ROOT files
`from_pytables`(file_name[, name, tree_name, …])	Initialize a dataset from pytables output generated from `dataset.to_pytables()`
`from_h5`(file_name, name, columns[, …])	Initialize a dataset from generic h5 input (loosely expected to be from the ATLAS Analysis Release utility `ttree2hdf5`)

Utility Free Functions¶

scale_weight_sum(to_update, reference) Scale the weights of the to_update dataset such that the sum of weights is equal to the sum of weights of the reference dataset.

Details¶

class twaml.data.dataset[source]¶

Bases: object

A class to define a dataset with a pandas.DataFrame as the payload of the class. The twaml.data module provides a set of static functions to construct a dataset. The class constructor should be used only in very special cases.

datasets should always be constructed using one of three functions:

from_root()

from_pytables()

from_h5()

Methods

`aggressively_strip`()	Drop all columns that should never be used in a classifier.
`append`(other)	Append a dataset to an existing one
`auxlabel_asarray`()	retrieve a homogeneous array of auxiliary labels (or `None` if no auxlabel)
`change_weights`(wname)	Change the main weight of the dataset
`has_payload`()	check if dataframe and weights are non empty
`keep_columns`(cols)	Drop all columns not included in `cols`
`keep_weights`(weights)	Drop all columns from the aux weights frame that are not in `weights`
`label_asarray`()	retrieve a homogeneous array of labels (or `None` if no label)
`rm_chargeflavor_columns`()	Drop all columns that are related to charge and flavor
`rm_columns`(cols)	Remove columns from the dataset
`rm_meta_columns`()	Drop all columns that are considered meta data from the payload
`rm_region_columns`()	Drop all columns that are prefixed with `reg`, like `reg2j2b`
`rm_weight_columns`()	Remove all payload df columns which begin with `weight_`
`selected_datasets`(selections)	Based on a dictionary of selections, break the dataset into a set of multiple (finer grained) datasets.
`selection_masks`(selections)	Based on a dictionary of selections, calculate masks (boolean arrays) for each selection
`to_pytables`(file_name[, to_hdf_kw])	Write dataset to disk as a pytables h5 file

Attributes

`auxweights`	dataframe of auxiliary event weights
`df`	the payload of the dataset class
`dsid`	retrieve the DSID from the wtloop_metas information (if available)
`files`	list of files which make up the dataset
`initial_state`	retrieve initial state from the wtloop_metas information (if available)
`shape`	shape of dataset (n events, n features)
`weights`	array of event weights
`wtloop_metas`	dictionary of metadata information (one for each file making up the dataset)

name¶

Name for the dataset

Type:	str

weight_name¶

Name of the branch which the weight array originates from

Type:	str

tree_name¶

Name of the ROOT tree the dataset originated from (all of our datasets had to come from a ROOT tree at some point)

Type:	str

selection_formula¶

A string (in pandas.DataFrame.eval() form) that all data in the dataset had to satisfy

Type:	Optional[str]

label¶

Optional dataset label (as an int)

Type:	Optional[int]

auxlabel¶

Optional auxiliary label (as an int) - sometimes we need two labels

Type:	Optional[int]

TeXlabel¶

LaTeX formatted name (for plot labels)

Type:	Optional[str]

__add__(other)[source]¶

Add two datasets together

We perform concatenations of the dataframes and weights to generate a new dataset with the combined payload.

If one dataset has aux weights and the other doesn’t, the aux weights are dropped.

Return type:	`dataset`
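
The concatenation rule described above can be sketched with plain pandas and NumPy. This is a minimal illustration and not the twaml implementation: the helper name `combine_payloads`, the use of `ignore_index=True`, and the sample column names are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def combine_payloads(df_a, w_a, aux_a, df_b, w_b, aux_b):
    """Concatenate two payloads; aux weights survive only if both sides have them."""
    df = pd.concat([df_a, df_b], ignore_index=True)
    weights = np.concatenate([w_a, w_b])
    if aux_a is not None and aux_b is not None:
        aux = pd.concat([aux_a, aux_b], ignore_index=True)
    else:
        aux = None  # one side lacks aux weights, so they are dropped
    return df, weights, aux


# Hypothetical inputs: the first payload has aux weights, the second does not.
df1 = pd.DataFrame({"pT_lep1": [30.0, 45.0]})
df2 = pd.DataFrame({"pT_lep1": [52.0]})
aux1 = pd.DataFrame({"weight_sys_radLo": [0.9, 1.1]})

df, weights, aux = combine_payloads(
    df1, np.array([1.0, 2.0]), aux1,
    df2, np.array([0.5]), None,
)
print(df.shape, weights.sum(), aux)
```

Because only one input carries aux weights, the combined result has none, which mirrors the rule stated for `__add__` and `append`.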

aggressively_strip()[source]¶

Drop all columns that should never be used in a classifier.

This calls the following functions:

rm_meta_columns()
rm_region_columns()
rm_chargeflavor_columns()
rm_weight_columns()

Return type:	`None`
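
The combined effect of the four calls can be approximated on a bare pandas.DataFrame. The sketch below is an assumption-laden illustration (the helper `strip_for_classifier` is hypothetical and the column lists are copied from the descriptions of the individual `rm_*` methods on this page); it is not the library's code.

```python
import pandas as pd

# Column groups as described for rm_meta_columns() and rm_chargeflavor_columns().
META_COLUMNS = ["runNumber", "eventNumber", "randomRunNumber"]
CHARGEFLAVOR_COLUMNS = ["elmu", "elel", "mumu", "OS", "SS"]


def strip_for_classifier(df):
    """Drop meta, region, charge/flavor and weight columns in place."""
    to_drop = [
        c for c in df.columns
        if c in META_COLUMNS
        or c in CHARGEFLAVOR_COLUMNS
        or c.startswith("reg")      # region flags such as reg2j2b
        or c.startswith("weight_")  # weights kept in the payload
    ]
    df.drop(columns=to_drop, inplace=True)


# Hypothetical one-event payload with one column from each group.
df = pd.DataFrame({
    "pT_lep1": [30.0], "runNumber": [1], "reg2j2b": [True],
    "OS": [True], "weight_nominal": [0.7],
})
strip_for_classifier(df)
print(list(df.columns))
```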

append(other)[source]¶

Append a dataset to an existing one

We perform concatenations of the dataframes and weights to update the existing dataset’s payload.

If one dataset has aux weights and the other doesn’t, the aux weights are dropped.

Parameters:	other (twaml.data.dataset) – The dataset to append
Return type:	`None`

auxlabel_asarray()[source]¶

retrieve a homogeneous array of auxiliary labels (or None if no auxlabel)

Return type:	`Optional`[`numpy.ndarray`]

auxweights¶

dataframe of auxiliary event weights

Return type:	`pandas.DataFrame`

change_weights(wname)[source]¶

Change the main weight of the dataset

This function swaps the current main weight array of the dataset with one in the auxweights frame (based on its name in the auxweights frame).

Parameters:	wname (`str`) – name of weight in `auxweights` DataFrame to turn into the main weight.
Return type:	`None`
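
One way to picture the swap is shown below. It is a sketch under stated assumptions: the helper `swap_main_weight` is hypothetical, it returns new objects instead of mutating a dataset, and it assumes that the old main weight is moved into the aux frame under its original name.

```python
import numpy as np
import pandas as pd


def swap_main_weight(weights, weight_name, auxweights, wname):
    """Return (new_weights, new_weight_name, new_auxweights) after the swap."""
    aux = auxweights.copy()
    new_weights = aux[wname].to_numpy()
    aux.drop(columns=[wname], inplace=True)
    aux[weight_name] = weights  # assumption: old main weight moves into the aux frame
    return new_weights, wname, aux


# Hypothetical two-event dataset with a single aux weight.
main = np.array([1.0, 2.0])
aux = pd.DataFrame({"weight_sys_radLo": [0.9, 1.8]})

new_main, new_name, new_aux = swap_main_weight(
    main, "weight_nominal", aux, "weight_sys_radLo"
)
print(new_name, new_main, list(new_aux.columns))
```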

df¶

the payload of the dataset class

Return type:	`pandas.DataFrame`

dsid¶

retrieve the DSID from the wtloop_metas information (if available)

This will return 999999 if wtloop_metas information is unavailable, or a set if multiple different DSIDs are found.

Return type:	`int`

files¶

list of files which make up the dataset

Return type:	`List`[`PosixPath`]

has_payload()[source]¶

check if dataframe and weights are non empty

Return type:	`bool`

initial_state¶

retrieve initial state from the wtloop_metas information (if available)

This will return ‘unknown’ if wtloop_metas information is unavailable, or a set if multiple different initial states are found.

Return type:	`str`

keep_columns(cols)[source]¶

Drop all columns not included in cols

Parameters:	cols (List[str]) – Columns to keep
Return type:	`None`

keep_weights(weights)[source]¶

Drop all columns from the aux weights frame that are not in weights

Parameters:	weights (List[str]) – Weights to keep in the aux weights frame
Return type:	`None`

label_asarray()[source]¶

retrieve a homogeneous array of labels (or None if no label)

Return type:	`Optional`[`numpy.ndarray`]
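
A homogeneous label array can be pictured as one copy of the integer label per event. The sketch below assumes that the array length equals the number of events and that an integer dtype is used; the helper name `label_array` is hypothetical.

```python
import numpy as np


def label_array(label, n_events):
    """Return an array filled with `label`, or None when no label is set."""
    if label is None:
        return None
    return np.full(n_events, label, dtype=np.int64)


labels = label_array(1, 3)
no_labels = label_array(None, 3)
print(labels, no_labels)
```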

rm_chargeflavor_columns()[source]¶

Drop all columns that are related to charge and flavor

This would be [elmu, elel, mumu, OS, SS]. Internally this is done by calling pandas.DataFrame.drop() with inplace on the payload.

Return type:	`None`

rm_columns(cols)[source]¶

Remove columns from the dataset

Internally this is done by calling pandas.DataFrame.drop() with inplace on the payload.

Parameters:	cols (List[str]) – List of column names to remove
Return type:	`None`

rm_meta_columns()[source]¶

Drop all columns that are considered meta data from the payload

This includes runNumber, eventNumber, randomRunNumber

Internally this is done by calling pandas.DataFrame.drop() with inplace on the payload.

Return type:	`None`

rm_region_columns()[source]¶

Drop all columns that are prefixed with reg, like reg2j2b

Internally this is done by calling pandas.DataFrame.drop() with inplace on the payload.

Return type:	`None`

rm_weight_columns()[source]¶

Remove all payload df columns which begin with weight_

If you are reading a dataset that was created retaining weights in the main payload, this is a useful function to remove them. The design of twaml.data.dataset expects weights to be separated from the payload’s main dataframe.

Internally this is done by calling pandas.DataFrame.drop() with inplace on the payload.

Return type:	`None`

selected_datasets(selections)[source]¶

Based on a dictionary of selections, break the dataset into a set of multiple (finer grained) datasets.

Warning

For large datasets this can get memory intensive quickly. A good alternative is selection_masks() combined with the __getitem__ implementation.

Parameters:	selections (`Dict`[`str`, `str`]) – Dictionary of selections in the form `{ name : selection }`.

Examples

A handful of selections with all requiring OS and elmu to be true, while changing the reg{...} requirement.

>>> selections = { '1j1b' : '(reg1j1b == True) & (OS == True) & (elmu == True)',
...                '2j1b' : '(reg2j1b == True) & (OS == True) & (elmu == True)',
...                '2j2b' : '(reg2j2b == True) & (OS == True) & (elmu == True)',
...                '3j1b' : '(reg3j1b == True) & (OS == True) & (elmu == True)'}
>>> selected_datasets = ds.selected_datasets(selections)

Return type:	`Dict`[`str`, `dataset`]

selection_masks(selections)[source]¶

Based on a dictionary of selections, calculate masks (boolean arrays) for each selection

Parameters:	selections (`Dict`[`str`, `str`]) – Dictionary of selections in the form `{ name : selection }`.
Return type:	`Dict`[`str`, `numpy.ndarray`]
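
The mask calculation can be illustrated on a bare pandas.DataFrame. This is a minimal sketch that assumes each selection string is evaluated with pandas.DataFrame.eval(); the helper `masks_from_selections` and the sample columns are hypothetical and stand in for a real dataset payload.

```python
import pandas as pd


def masks_from_selections(df, selections):
    """Evaluate each selection string against df and return boolean arrays."""
    return {name: df.eval(sel).to_numpy() for name, sel in selections.items()}


# Hypothetical three-event payload.
df = pd.DataFrame({
    "reg1j1b": [True, False, True],
    "reg2j1b": [False, True, False],
    "OS": [True, True, False],
    "pT_lep1": [30.0, 45.0, 52.0],
})
selections = {
    "1j1b": "(reg1j1b == True) & (OS == True)",
    "2j1b": "(reg2j1b == True) & (OS == True)",
}

masks = masks_from_selections(df, selections)
selected = df[masks["1j1b"]]  # apply one mask without copying every selection
print(masks["1j1b"], selected["pT_lep1"].tolist())
```

Keeping only the masks and applying them on demand avoids holding a full copy of the payload for every selection, which is the memory concern raised for selected_datasets().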

shape¶

shape of dataset (n events, n features)

Return type:	`Tuple`

to_pytables(file_name, to_hdf_kw=None)[source]¶

Write dataset to disk as a pytables h5 file

This method saves a file using a strict twaml-compatible naming scheme. An existing dataset label is not stored. The properties of the class that are serialized to disk (and the associated key for each item):

df as {name}_payload
weights as {name}_{weight_name}
auxweights as {name}_auxweights
wtloop_metas as {name}_wtloop_metas

These properties are wrapped in a pandas DataFrame (if they are not already) to be stored in a .h5 file. from_pytables() is designed to read in this output, so the standard use case is to call this function to store a dataset that was initialized via from_root().

Internally this function uses pandas.DataFrame.to_hdf() on a number of structures.

Parameters:

file_name (`str`) – output file name
to_hdf_kw (Optional[dict]) – dict of keyword arguments fed to `pd.DataFrame.to_hdf()`

Examples

>>> ds = twaml.data.from_root("file.root", name="myds",
...                           detect_weights=True, wtloop_meta=True)
>>> ds.to_pytables("output.h5")
>>> ds_again = twaml.data.from_pytables("output.h5")
>>> ds_again.name
'myds'

Return type:	`None`
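
The key naming scheme listed above can be written out as a small helper. This is only an illustration of the `{name}_...` pattern; the function `pytables_keys` is hypothetical and not part of twaml.

```python
def pytables_keys(name, weight_name):
    """Map each serialized property to the key it is stored under."""
    return {
        "df": f"{name}_payload",
        "weights": f"{name}_{weight_name}",
        "auxweights": f"{name}_auxweights",
        "wtloop_metas": f"{name}_wtloop_metas",
    }


keys = pytables_keys("myds", "weight_nominal")
for prop, key in keys.items():
    print(prop, "->", key)
```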

weights¶

array of event weights

Return type:	`numpy.ndarray`

wtloop_metas¶

dictionary of metadata information (one for each file making up the dataset)

Return type:	`Optional`[`Dict`[`str`, `Dict`[`str`, `Any`]]]

twaml.data.from_root(input_files, name=None, tree_name='WtLoop_nominal', weight_name='weight_nominal', branches=None, selection=None, label=None, auxlabel=None, allow_weights_in_df=False, aggressively_strip=False, auxweights=None, detect_weights=False, nthreads=None, wtloop_meta=False, TeXlabel=None)[source]¶

Initialize a dataset from ROOT files

Parameters:

input_files (Union[str, List[str]]) – Single or list of ROOT input file(s) to use
name (Optional[str]) – Name of the dataset (if none use first file name)
tree_name (str) – Name of the tree in the file to use
weight_name (str) – Name of the weight branch
branches (Optional[List[str]]) – List of branches to store in the dataset, if None use all
selection (Optional[str]) – A string passed to pandas.DataFrame.eval to apply a selection based on branch/column values. e.g. (reg1j1b == True) & (OS == True) requires the reg1j1b and OS branches to be True.
label (Optional[int]) – Give the dataset an integer label
auxlabel (Optional[int]) – Give the dataset an integer auxiliary label
allow_weights_in_df (bool) – Allow “^weight_\w+” branches in the payload dataframe
aggressively_strip (bool) – Call twaml.data.dataset.aggressively_strip() during construction
auxweights (Optional[List[str]]) – Auxiliary weights to store in a second dataframe.
detect_weights (bool) – If True, fill the auxweights df with all “^weight_” branches. If auxweights is not None, this option is ignored.
nthreads (Optional[int]) – Number of threads to use reading the ROOT tree (see uproot.TTreeMethods_pandas.df)
wtloop_meta (bool) – Grab and store the WtLoop_meta YAML entries, stored as a dictionary of the form { str(filename) : dict(yaml) } in the wtloop_metas attribute.
TeXlabel (Optional[str]) – A LaTeX format label for the dataset
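
The “^weight_” pattern used by detect_weights can be illustrated with the standard `re` module. The sketch below is an assumption about the behavior, not the library's code: the helper `detect_weight_branches` is hypothetical, and excluding the main weight branch from the aux weights is a guess made for the example.

```python
import re


def detect_weight_branches(branches, weight_name):
    """Pick branches matching ^weight_, leaving out the main weight branch."""
    pattern = re.compile(r"^weight_")
    return [b for b in branches if pattern.match(b) and b != weight_name]


# Hypothetical branch list from a ROOT tree.
branches = ["pT_lep1", "weight_nominal", "weight_sys_radLo",
            "weight_sys_radHi", "njets"]
aux_names = detect_weight_branches(branches, "weight_nominal")
print(aux_names)
```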

Examples

Example with a single file and two branches:

>>> ds1 = twaml.data.from_root(["file.root"], name="myds",
...                            branches=["pT_lep1", "pT_lep2"], label=1)

Example with multiple input_files and a selection (uses all branches). The selection requires nbjets == 1 and njets >= 1; the dataset is then given the label 5.

>>> flist = ["file1.root", "file2.root", "file3.root"]
>>> ds = twaml.data.from_root(flist, selection='(nbjets == 1) & (njets >= 1)')
>>> ds.label = 5

Example using aux weights

>>> ds = twaml.data.from_root(flist, name="myds", weight_name="weight_nominal",
...                           auxweights=["weight_sys_radLo", "weight_sys_radHi"])

Example where we detect aux weights automatically

>>> ds = twaml.data.from_root(flist, name="myds", weight_name="weight_nominal",
...                           detect_weights=True)

Example using a ThreadPoolExecutor (16 threads):

>>> ds = twaml.data.from_root(flist, name="myds", nthreads=16)

Return type:	`dataset`

twaml.data.from_pytables(file_name, name='auto', tree_name='none', weight_name='auto', label=None, auxlabel=None, TeXlabel=None)[source]¶

Initialize a dataset from pytables output generated from dataset.to_pytables()

The payload is extracted from the .h5 pytables files using the name of the dataset and the weight name. If the name of the dataset doesn’t exist in the file, the load will fail. Aux weights are retrieved if available.

Parameters:

file_name (str) – Name of h5 file containing the payload
name (str) – Name of the dataset inside the h5 file. If "auto" (default), we attempt to determine the name automatically from the h5 file.
tree_name (str) – Name of tree where dataset originated (only for reference)
weight_name (str) – Name of the weight array inside the h5 file. If "auto" (default), we attempt to determine the name automatically from the h5 file.
label (Optional[int]) – Give the dataset an integer label
auxlabel (Optional[int]) – Give the dataset an integer auxiliary label
TeXlabel (Optional[str]) – LaTeX formatted label

Examples

Creating a dataset from pytables where everything is auto detected:

>>> ds1 = twaml.data.from_pytables("ttbar.h5")
>>> ds1.label = 1  # add a label to the dataset after the fact

Return type:	`dataset`

twaml.data.from_h5(file_name, name, columns, tree_name='WtLoop_nominal', weight_name='weight_nominal', label=None, auxlabel=None, TeXlabel=None)[source]¶

Initialize a dataset from generic h5 input (loosely expected to be from the ATLAS Analysis Release utility ttree2hdf5)

The name of the HDF5 dataset inside the file is assumed to be tree_name. The name argument is something you choose.

Parameters:

file_name (str) – Name of h5 file containing the payload
name (str) – Name of the dataset you would like to define
columns (List[str]) – Names of columns (branches) to include in payload
tree_name (str) – Name of tree dataset originates from (HDF5 dataset name)
weight_name (str) – Name of the weight array inside the h5 file
label (Optional[int]) – Give the dataset an integer label
auxlabel (Optional[int]) – Give the dataset an integer auxiliary label
TeXlabel (Optional[str]) – LaTeX form label

Examples

>>> ds = twaml.data.from_h5("file.h5", "dsname", ["pT_lep1", "pT_lep2"],
...                         TeXlabel=r"$tW$",
...                         tree_name="WtLoop_EG_RESOLUTION_ALL__1up")

Return type:	`dataset`

twaml.data.scale_weight_sum(to_update, reference)[source]¶

Scale the weights of the to_update dataset such that the sum of weights is equal to the sum of weights of the reference dataset.

Parameters:

to_update (`dataset`) – dataset with weights to be scaled
reference (`dataset`) – dataset to scale to
Return type:	`None`
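
The arithmetic behind the scaling is a single multiplicative factor, the ratio of the two weight sums. The sketch below works on bare NumPy arrays and returns a new array, whereas the documented function updates the to_update dataset and returns None; the helper name `scaled_to_reference` is hypothetical.

```python
import numpy as np


def scaled_to_reference(weights, reference_weights):
    """Scale weights so their sum matches the sum of reference_weights."""
    factor = reference_weights.sum() / weights.sum()
    return weights * factor


# Hypothetical weight arrays: sums are 6.0 and 12.0, so the factor is 2.0.
w = np.array([1.0, 2.0, 3.0])
ref = np.array([4.0, 8.0])

scaled = scaled_to_reference(w, ref)
print(scaled, scaled.sum())
```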