twaml.data¶
This module contains a class (and functions to load it) which abstracts datasets using pandas.DataFrames as the payload for feeding machine learning frameworks and other general data investigation.
Classes¶
dataset | A class to define a dataset with a pandas.DataFrame as the payload of the class. |
Creation Functions¶
from_root (input_files[, name, tree_name, …]) | Initialize a dataset from ROOT files |
from_pytables (file_name[, name, tree_name, …]) | Initialize a dataset from pytables output generated from dataset.to_pytables() |
from_h5 (file_name, name, columns[, …]) | Initialize a dataset from generic h5 input (loosely expected to be from the ATLAS Analysis Release utility ttree2hdf5) |
Utility Free Functions¶
scale_weight_sum (to_update, reference) | Scale the weights of the to_update dataset such that its sum of weights is equal to the sum of weights of the reference dataset. |
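The scaling performed by scale_weight_sum is simple arithmetic; a minimal sketch of the idea with plain numpy arrays standing in for the two datasets' weights (hypothetical values, not the twaml API itself):

```python
import numpy as np

# hypothetical event weights for the dataset to update and the reference
to_update_w = np.array([0.5, 1.5, 2.0])
reference_w = np.array([1.0, 1.0])

# scale so the sum of to_update's weights equals the reference's sum
to_update_w = to_update_w * (reference_w.sum() / to_update_w.sum())

print(to_update_w.tolist())  # [0.25, 0.75, 1.0]
```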
Details¶
-
class twaml.data.dataset[source]¶
Bases: object
A class to define a dataset with a pandas.DataFrame as the payload of the class. The twaml.data module provides a set of static functions to construct a dataset. The class constructor should be used only in very special cases; datasets should always be constructed using one of the three creation functions.
Methods
aggressively_strip () | Drop all columns that should never be used in a classifier. |
append (other) | Append a dataset to an existing one |
auxlabel_asarray () | Retrieve a homogeneous array of auxiliary labels (or None if no auxlabel) |
change_weights (wname) | Change the main weight of the dataset |
has_payload () | Check if dataframe and weights are non-empty |
keep_columns (cols) | Drop all columns not included in cols |
keep_weights (weights) | Drop all columns from the aux weights frame that are not in weights |
label_asarray () | Retrieve a homogeneous array of labels (or None if no label) |
rm_chargeflavor_columns () | Drop all columns that are related to charge and flavor |
rm_columns (cols) | Remove columns from the dataset |
rm_meta_columns () | Drop all columns that are considered metadata from the payload |
rm_region_columns () | Drop all columns that are prefixed with reg, like reg2j2b |
rm_weight_columns () | Remove all payload df columns which begin with weight_ |
selected_datasets (selections) | Based on a dictionary of selections, break the dataset into a set of multiple (finer grained) datasets. |
selection_masks (selections) | Based on a dictionary of selections, calculate masks (boolean arrays) for each selection |
to_pytables (file_name[, to_hdf_kw]) | Write dataset to disk as a pytables h5 file |
Attributes
auxweights | dataframe of auxiliary event weights |
df | the payload of the dataset class |
dsid | retrieve the DSID from the wtloop_metas information (if available) |
files | list of files which make up the dataset |
initial_state | retrieve initial state from the wtloop_metas information (if available) |
shape | shape of dataset (n events, n features) |
weights | array of event weights |
wtloop_metas | dictionary of metadata information (one for each file making up the dataset) |
-
selection_formula¶
A string (in pandas.DataFrame.eval() form) that all of the data in the dataset had to satisfy
Type: Optional[str]
-
__add__ (other)[source]¶
Add two datasets together
We perform concatenations of the dataframes and weights to generate a new dataset with a new combined payload.
If one dataset has aux weights and the other doesn't, the aux weights are dropped.
Return type: dataset
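The concatenation semantics can be illustrated with plain pandas and numpy (a sketch with hypothetical column and weight values, not the twaml implementation itself):

```python
import numpy as np
import pandas as pd

# hypothetical payloads and weights of two datasets, ds1 and ds2
df1 = pd.DataFrame({"pT_lep1": [50.0, 60.0]})
df2 = pd.DataFrame({"pT_lep1": [70.0]})
w1 = np.array([1.0, 1.2])
w2 = np.array([0.9])

# ds3 = ds1 + ds2 concatenates payloads and weights row-wise
combined_df = pd.concat([df1, df2], ignore_index=True)
combined_w = np.concatenate([w1, w2])

print(len(combined_df), combined_w.shape)  # 3 (3,)
```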
-
aggressively_strip ()[source]¶
Drop all columns that should never be used in a classifier.
This calls the following functions:
Return type: None
-
append (other)[source]¶
Append a dataset to an existing one
We perform concatenations of the dataframes and weights to update the existing dataset's payload.
If one dataset has aux weights and the other doesn't, the aux weights are dropped.
Parameters: other (twaml.data.dataset) – The dataset to append
Return type: None
-
auxlabel_asarray ()[source]¶
Retrieve a homogeneous array of auxiliary labels (or None if no auxlabel)
Return type: Optional[numpy.ndarray]
-
auxweights¶
dataframe of auxiliary event weights
Return type: pandas.DataFrame
-
change_weights (wname)[source]¶
Change the main weight of the dataset
This function will swap the current main weight array of the dataset with one in the auxweights frame (based on its name in the auxweights frame).
Parameters: wname (str) – name of weight in the auxweights DataFrame to turn into the main weight.
Return type: None
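The swap itself can be sketched with a plain numpy array and pandas frame (hypothetical weight names and values, not the twaml internals):

```python
import numpy as np
import pandas as pd

weights = np.array([1.0, 1.0, 1.0])                       # current main weight
auxweights = pd.DataFrame({"weight_sys_radLo": [1.1, 0.9, 1.0]})

# swap: the named aux weight becomes the main weight, and the old main
# weight is kept in the aux frame under its own (hypothetical) name
new_main = auxweights.pop("weight_sys_radLo").to_numpy()
auxweights["weight_nominal"] = weights
weights = new_main

print(weights.tolist())  # [1.1, 0.9, 1.0]
```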
-
df¶
the payload of the dataset class
Return type: pandas.DataFrame
-
dsid
¶ retrieve the DSID from the wtloop_metas information (if available)
This will return 999999 if wtloop_metas information is unavailable, or a set if multiple different DSIDs are found.
Return type: int
-
initial_state
¶ retrieve initial state from the wtloop_metas information (if available)
This will return ‘unknown’ if wtloop_metas information is unavailable, or a set if multiple different initial states are found.
Return type: str
-
keep_columns (cols)[source]¶
Drop all columns not included in cols
Parameters: cols (List[str]) – Columns to keep
Return type: None
-
keep_weights (weights)[source]¶
Drop all columns from the aux weights frame that are not in weights
Parameters: weights (List[str]) – Weights to keep in the aux weights frame
Return type: None
-
label_asarray ()[source]¶
Retrieve a homogeneous array of labels (or None if no label)
Return type: Optional[numpy.ndarray]
-
rm_chargeflavor_columns ()[source]¶
Drop all columns that are related to charge and flavor
These are [elmu, elel, mumu, OS, SS]. Internally this is done by calling pandas.DataFrame.drop() with inplace on the payload.
Return type: None
-
rm_columns (cols)[source]¶
Remove columns from the dataset
Internally this is done by calling pandas.DataFrame.drop() with inplace on the payload.
Parameters: cols (List[str]) – List of column names to remove
Return type: None
-
rm_meta_columns ()[source]¶
Drop all columns that are considered metadata from the payload
This includes runNumber, eventNumber, randomRunNumber.
Internally this is done by calling pandas.DataFrame.drop() with inplace on the payload.
Return type: None
-
rm_region_columns ()[source]¶
Drop all columns that are prefixed with reg, like reg2j2b
Internally this is done by calling pandas.DataFrame.drop() with inplace on the payload.
Return type: None
-
rm_weight_columns ()[source]¶
Remove all payload df columns which begin with weight_
If you are reading a dataset that was created retaining weights in the main payload, this is a useful function to remove them. The design of twaml.data.dataset expects weights to be separated from the payload's main dataframe.
Internally this is done by calling pandas.DataFrame.drop() with inplace on the payload.
Return type: None
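The prefix-based drop can be sketched with plain pandas (illustrative column names, not the twaml internals):

```python
import pandas as pd

df = pd.DataFrame({"pT_lep1": [50.0],
                   "weight_sys_radLo": [1.1],
                   "weight_sys_radHi": [0.9]})

# drop every payload column whose name begins with "weight_"
df.drop(columns=[c for c in df.columns if c.startswith("weight_")],
        inplace=True)

print(list(df.columns))  # ['pT_lep1']
```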
-
selected_datasets (selections)[source]¶
Based on a dictionary of selections, break the dataset into a set of multiple (finer grained) datasets.
Warning
For large datasets this can get memory intensive quickly. A good alternative is selection_masks() combined with the __getitem__ implementation.
Parameters: selections (Dict[str, str]) – Dictionary of selections in the form { name : selection }.
Examples
A handful of selections, all requiring OS and elmu to be true while changing the reg{...} requirement.
>>> selections = { '1j1b' : '(reg1j1b == True) & (OS == True) & (elmu == True)',
...                '2j1b' : '(reg2j1b == True) & (OS == True) & (elmu == True)',
...                '2j2b' : '(reg2j2b == True) & (OS == True) & (elmu == True)',
...                '3j1b' : '(reg3j1b == True) & (OS == True) & (elmu == True)'}
>>> selected_datasets = ds.selected_datasets(selections)
Return type: Dict[str, dataset]
-
selection_masks (selections)[source]¶
Based on a dictionary of selections, calculate masks (boolean arrays) for each selection
Parameters: selections (Dict[str, str]) – Dictionary of selections in the form { name : selection }.
Return type: Dict[str, numpy.ndarray]
-
shape
¶ shape of dataset (n events, n features)
Return type: Tuple
-
to_pytables (file_name, to_hdf_kw=None)[source]¶
Write dataset to disk as a pytables h5 file
This method saves a file using a strict twaml-compatible naming scheme. An existing dataset label is not stored. The properties of the class that are serialized to disk (and the associated key for each item):
df as {name}_payload
weights as {name}_{weight_name}
auxweights as {name}_auxweights
wtloop_metas as {name}_wtloop_metas
These properties are wrapped in a pandas DataFrame (if they are not already) to be stored in a .h5 file. The from_pytables() function is designed to read in this output, so the standard use case is to call this function to store a dataset that was initialized via from_root().
Internally this function uses pandas.DataFrame.to_hdf() on a number of structures.
Parameters:
- file_name (str) – output file name
- to_hdf_kw – dict of keyword arguments fed to pandas.DataFrame.to_hdf()
Examples
>>> ds = twaml.dataset.from_root("file.root", name="myds",
...                              detect_weights=True, wtloop_meta=True)
>>> ds.to_pytables("output.h5")
>>> ds_again = twaml.dataset.from_pytables("output.h5")
>>> ds_again.name
'myds'
Return type: None
-
weights¶
array of event weights
Return type: numpy.ndarray
-
twaml.data.from_root (input_files, name=None, tree_name='WtLoop_nominal', weight_name='weight_nominal', branches=None, selection=None, label=None, auxlabel=None, allow_weights_in_df=False, aggressively_strip=False, auxweights=None, detect_weights=False, nthreads=None, wtloop_meta=False, TeXlabel=None)[source]¶
Initialize a dataset from ROOT files
Parameters:
- input_files (Union[str, List[str]]) – Single or list of ROOT input file(s) to use
- name (Optional[str]) – Name of the dataset (if none use first file name)
- tree_name (str) – Name of the tree in the file to use
- weight_name (str) – Name of the weight branch
- branches (Optional[List[str]]) – List of branches to store in the dataset; if None use all
- selection (Optional[str]) – A string passed to pandas.DataFrame.eval to apply a selection based on branch/column values, e.g. (reg1j1b == True) & (OS == True) requires the reg1j1b and OS branches to be True.
- label (Optional[int]) – Give the dataset an integer label
- auxlabel (Optional[int]) – Give the dataset an integer auxiliary label
- allow_weights_in_df (bool) – Allow "^weight_\w+" branches in the payload dataframe
- aggressively_strip (bool) – Call twaml.data.dataset.aggressively_strip() during construction
- auxweights (Optional[List[str]]) – Auxiliary weights to store in a second dataframe.
- detect_weights (bool) – If True, fill the auxweights df with all "^weight_" branches. If auxweights is not None, this option is ignored.
- nthreads (Optional[int]) – Number of threads to use reading the ROOT tree (see uproot.TTreeMethods_pandas.df)
- wtloop_meta (bool) – Grab and store the WtLoop_meta YAML entries, stored as a dictionary of the form { str(filename) : dict(yaml) } in the class variable wtloop_metas.
- TeXlabel (Optional[str]) – A LaTeX format label for the dataset
Examples
Example with a single file and two branches:
>>> ds1 = dataset.from_root(["file.root"], name="myds",
...                         branches=["pT_lep1", "pT_lep2"], label=1)
Example with multiple input_files and a selection (uses all branches). The selection requires nbjets == 1 and njets >= 1; the label is then set to 5.
>>> flist = ["file1.root", "file2.root", "file3.root"]
>>> ds = dataset.from_root(flist, selection='(nbjets == 1) & (njets >= 1)')
>>> ds.label = 5
Example using aux weights:
>>> ds = dataset.from_root(flist, name="myds", weight_name="weight_nominal",
...                        auxweights=["weight_sys_radLo", "weight_sys_radHi"])
Example where we detect aux weights automatically:
>>> ds = dataset.from_root(flist, name="myds", weight_name="weight_nominal",
...                        detect_weights=True)
Example using a ThreadPoolExecutor (16 threads):
>>> ds = dataset.from_root(flist, name="myds", nthreads=16)
Return type: dataset
-
twaml.data.from_pytables (file_name, name='auto', tree_name='none', weight_name='auto', label=None, auxlabel=None, TeXlabel=None)[source]¶
Initialize a dataset from pytables output generated from dataset.to_pytables()
The payload is extracted from the .h5 pytables files using the name of the dataset and the weight name. If the name of the dataset doesn't exist in the file you'll crash. Aux weights are retrieved if available.
Parameters:
- file_name (str) – Name of h5 file containing the payload
- name (str) – Name of the dataset inside the h5 file. If "auto" (default), we attempt to determine the name automatically from the h5 file.
- tree_name (str) – Name of tree where dataset originated (only for reference)
- weight_name (str) – Name of the weight array inside the h5 file. If "auto" (default), we attempt to determine the name automatically from the h5 file.
- label (Optional[int]) – Give the dataset an integer label
- auxlabel (Optional[int]) – Give the dataset an integer auxiliary label
- TeXlabel (Optional[str]) – LaTeX formatted label
Examples
Creating a dataset from pytables where everything is auto detected:
>>> ds1 = dataset.from_pytables("ttbar.h5")
>>> ds1.label = 1  ## add label to the dataset after the fact
Return type: dataset
-
twaml.data.from_h5 (file_name, name, columns, tree_name='WtLoop_nominal', weight_name='weight_nominal', label=None, auxlabel=None, TeXlabel=None)[source]¶
Initialize a dataset from generic h5 input (loosely expected to be from the ATLAS Analysis Release utility ttree2hdf5)
The name of the HDF5 dataset inside the file is assumed to be tree_name. The name argument is something you choose.
Parameters:
- file_name (str) – Name of h5 file containing the payload
- name (str) – Name of the dataset you would like to define
- columns (List[str]) – Names of columns (branches) to include in payload
- tree_name (str) – Name of tree dataset originates from (HDF5 dataset name)
- weight_name (str) – Name of the weight array inside the h5 file
- label (Optional[int]) – Give the dataset an integer label
- auxlabel (Optional[int]) – Give the dataset an integer auxiliary label
- TeXlabel (Optional[str]) – LaTeX form label
Examples
>>> ds = dataset.from_h5("file.h5", "dsname", TeXlabel=r"$tW$",
...                      tree_name="WtLoop_EG_RESOLUTION_ALL__1up")
Return type: dataset