twaml¶
Python tooling for the \(tW\) analysis.
Getting Started¶
Requirements¶
It is highly recommended to use the Anaconda Python distribution. Most of the required libraries (outside of the bleeding-edge machine learning packages) are included with Anaconda environments.
The simplest way to get up and running with twaml is to use the environment.yml file.
$ cd /path/to/twaml
$ conda env create -f environment.yml
$ conda activate twaml
The bare requirements for data handling, plotting, and testing (enforced by requirements.txt; see the file for versions) are:
- uproot
- pandas
- scikit-learn
- matplotlib
- h5py
- pytables
- numexpr (to ensure pandas.eval acceleration)
Since twaml adopts a “live at the head” philosophy, requirement versions may be fluid.
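numexpr appears in the list above because twaml selection strings are evaluated with pandas.DataFrame.eval, which numexpr accelerates. A minimal illustrative sketch (the toy column names mirror examples later on this page):
import numpy as np
import pandas as pd

# build a toy dataframe with a flavor flag and a lepton pT column
df = pd.DataFrame({"elmu": np.random.choice([True, False], size=10000),
                   "pT_lep1": np.random.uniform(0.0, 100.0, size=10000)})
# with numexpr installed, pandas uses it automatically to accelerate eval
mask = df.eval("(elmu == True) & (pT_lep1 > 50)")
selected = df[mask]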
For training and testing models (not enforced by requirements.txt) you’ll probably want:
- tensorflow
- pytorch
- xgboost
For building the documentation:
- sphinx
- sphinx_rtd_theme
- sphinx-autodoc-typehints
- sphinxcontrib-programoutput
Base Setup in a venv¶
A base setup without the machine learning libraries just requires a pip installation of twaml.
$ python3 -m venv ~/.venvs/twaml-venv
$ source ~/.venvs/twaml-venv/bin/activate
$ cd /path/to/twaml
$ pip install .
This will make the twaml.data and twaml.viz APIs accessible.
Example GPU Anaconda Setup¶
Start with a fresh Anaconda virtual environment:
$ cd /path/to/twaml
$ conda env create -f environment.yml
$ conda activate twaml
$ conda install pytorch torchvision cuda100 -c pytorch ## requires recent NVIDIA Linux drivers
$ conda install tensorflow-gpu ## or just tensorflow
$ pip install .
Command Line Applications¶
twaml-root2pytables¶
Convert a set of ROOT files into a single pytables HDF5 file.
$ twaml-root2pytables --help
usage: twaml-root2pytables [-h] -i INPUT_FILES [INPUT_FILES ...] -n NAME -o
OUT_FILE [-b BRANCHES [BRANCHES ...]]
[--tree-name TREE_NAME] [--weight-name WEIGHT_NAME]
[--auxweights AUXWEIGHTS [AUXWEIGHTS ...]]
[--selection SELECTION] [--detect-weights]
[--nthreads NTHREADS]
Convert ROOT files to a pytables hdf5 dataset via twaml.data.root_dataset and
twaml.data.dataset.to_pytables
optional arguments:
-h, --help show this help message and exit
-i INPUT_FILES [INPUT_FILES ...], --input-files INPUT_FILES [INPUT_FILES ...]
input ROOT files
-n NAME, --name NAME dataset name (required when reading back into
twaml.data.dataset)
-o OUT_FILE, --out-file OUT_FILE
Output h5 file (existing file will be overwritten)
-b BRANCHES [BRANCHES ...], --branches BRANCHES [BRANCHES ...]
branches to save (defaults to all)
--tree-name TREE_NAME
tree name
--weight-name WEIGHT_NAME
weight branch name
--auxweights AUXWEIGHTS [AUXWEIGHTS ...]
extra auxiliary weights to save
--selection SELECTION
A selection string or YAML file containing a map of
selections (see `selection` argument docs in
`twaml.data.from_root`)
--detect-weights detect weights in the dataset, --auxweights overrides
this
--nthreads NTHREADS number of threads to use via ThreadPoolExecutor
An example that uses the default --tree-name and --weight-name, saves only the branches b1 and b2, and requires the branch elmu to be true and pT_lep1 to be greater than 50 for an event to be kept:
$ twaml-root2pytables -i file.root -n myds -o file.h5 --branches b1 b2 \
--selection "(elmu == True) & (pT_lep1 > 50)"
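The file written above can be read back with the twaml.data API documented below; a minimal sketch (the file and dataset names follow the example above):
>>> from twaml.data import from_pytables
>>> ds = from_pytables("file.h5")  # name and weight name detected automatically
>>> ds.name
'myds'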
The docs for the function that is being called:
- twaml._apps.root2pytables()[source]¶ Command line application which converts a set of ROOT files into a pytables HDF5 file via the twaml.data.from_root() function and the twaml.data.dataset.to_pytables() member function of the twaml.data.dataset class.
API Reference¶
tW analysis machine learning: twaml
This Python package contains modules for handling the machine learning requirements of the ATLAS Full Run II tW analysis.
twaml.data¶
The data module provides a thin wrapper for working with persistent ROOT files, persistent h5 files, and transient pandas DataFrames.
- class twaml.data.dataset[source]¶ Bases: object
A class to define a dataset with a pandas.DataFrame as the payload of the class. The twaml.data module provides a set of static functions to construct a dataset; the class constructor should be used only in very special cases. datasets should always be constructed using one of three functions: from_root(), from_pytables(), or from_h5().
- files¶ List of files delivering the dataset. Type: List[PosixPath]
- weights¶ The array of event weights. Type: numpy.ndarray
- df¶ The payload of the class, a dataframe. Type: pandas.DataFrame
- auxweights¶ Auxiliary weights to have access to. Type: Optional[pandas.DataFrame]
- shape¶ Shape of the main payload dataframe. Type: Tuple
- selection_formula¶ A string (in pandas.DataFrame.eval() form) that all data in the dataset had to satisfy. Type: Optional[str]
- __add__(other)[source]¶ Add two datasets together.
We perform concatenations of the dataframes and weights to generate a new dataset with a combined payload.
If one dataset has aux weights and the other doesn’t, the aux weights are dropped.
Return type: dataset
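For example, a minimal sketch combining two previously constructed datasets (ds1 and ds2 are assumed names):
>>> combined = ds1 + ds2  # payloads and weights are concatenated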
- aggressively_strip()[source]¶ Drop all columns that should never be used in a classifier.
This calls the rm_* column-dropping functions documented below.
Return type: None
- append(other)[source]¶ Append a dataset to an existing one.
We perform concatenations of the dataframes and weights to update the existing dataset’s payload.
If one dataset has aux weights and the other doesn’t, the aux weights are dropped.
Parameters: other (twaml.data.dataset) – The dataset to append
Return type: None
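For example, a minimal sketch growing one dataset in place (ds1 and ds2 are assumed names):
>>> ds1.append(ds2)  # ds1 now carries ds2's payload and weights as well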
- apply_selections(selections)[source]¶ Based on a dictionary of selections, break the dataset into a set of multiple (finer grained) datasets.
Parameters: selections (Dict[str, str]) – Dictionary of selections in the form { name : selection }.
Returns: A dictionary of datasets satisfying the selections
Return type: Dict[str, dataset]
Examples
A handful of selections, all requiring OS and elmu to be true while changing the reg{...} requirement.
>>> selections = {'1j1b': '(reg1j1b == True) & (OS == True) & (elmu == True)',
...               '2j1b': '(reg2j1b == True) & (OS == True) & (elmu == True)',
...               '2j2b': '(reg2j2b == True) & (OS == True) & (elmu == True)',
...               '3j1b': '(reg3j1b == True) & (OS == True) & (elmu == True)'}
>>> selected_datasets = ds.apply_selections(selections)
- auxlabel_asarray()[source]¶ Retrieve a homogeneous array of auxiliary labels (or None if no auxlabel).
Return type: Optional[numpy.ndarray]
- change_weights(wname)[source]¶ Change the main weight of the dataset.
This function will swap the current main weight array of the dataset with one in the auxweights frame (based on its name in the auxweights frame).
Parameters: wname (str) – Name of the weight in the auxweights DataFrame to turn into the main weight
Return type: None
- keep_columns(cols)[source]¶ Drop all columns not included in cols.
Parameters: cols (List[str]) – Columns to keep
Return type: None
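For example, a minimal sketch keeping only the two lepton pT columns used in examples elsewhere on this page:
>>> ds.keep_columns(["pT_lep1", "pT_lep2"])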
- keep_weights(weights)[source]¶ Drop all columns from the aux weights frame that are not in weights.
Parameters: weights (List[str]) – Weights to keep in the aux weights frame
Return type: None
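For example, a minimal sketch keeping a single aux weight (weight name borrowed from the from_root examples below):
>>> ds.keep_weights(["weight_sys_radLo"])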
- label_asarray()[source]¶ Retrieve a homogeneous array of labels (or None if no label).
Return type: Optional[numpy.ndarray]
- rm_chargeflavor_columns()[source]¶ Drop all columns that are related to charge and flavor.
This would be [elmu, elel, mumu, OS, SS]. Internally this is done by calling pandas.DataFrame.drop() with inplace on the payload.
Return type: None
- rm_columns(cols)[source]¶ Remove columns from the dataset.
Internally this is done by calling pandas.DataFrame.drop() with inplace on the payload.
Parameters: cols (List[str]) – List of column names to remove
Return type: None
- rm_meta_columns()[source]¶ Drop all columns that are considered metadata from the payload.
This includes runNumber, eventNumber, and randomRunNumber.
Internally this is done by calling pandas.DataFrame.drop() with inplace on the payload.
Return type: None
- rm_region_columns()[source]¶ Drop all columns that are prefixed with “reg”, e.g. “reg2j2b”.
Internally this is done by calling pandas.DataFrame.drop() with inplace on the payload.
Return type: None
- rm_weight_columns()[source]¶ Remove all payload df columns which begin with weight_.
If you are reading a dataset that was created retaining weights in the main payload, this is a useful function for removing them. The design of twaml.data.dataset expects weights to be separated from the payload’s main dataframe.
Internally this is done by calling pandas.DataFrame.drop() with inplace on the payload.
Return type: None
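For example, a minimal sketch cleaning up a dataset that was constructed with allow_weights_in_df=True:
>>> ds.rm_weight_columns()  # payload df no longer carries weight_* columns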
- to_pytables(file_name)[source]¶ Write the dataset to disk as a pytables h5 file (with a strict twaml-compatible naming scheme).
An existing dataset label is not stored. The properties of the class that are serialized to disk:
df as {name}_payload
weights as {name}_{weight_name}
auxweights as {name}_auxweights
wtloop_metas as {name}_wtloop_metas
These properties are wrapped in a pandas DataFrame (if they are not already) to be stored in a .h5 file. from_pytables() is designed to read in this output, so the standard use case is to call this function to store a dataset that was initialized via from_root().
Internally this function uses pandas.DataFrame.to_hdf() on a number of structures.
Parameters: file_name (str) – Output file name
Examples
>>> ds = twaml.dataset.from_root("file.root", name="myds",
...                              detect_weights=True, wtloop_meta=True)
>>> ds.to_pytables("output.h5")
>>> ds_again = twaml.dataset.from_pytables("output.h5")
>>> ds_again.name
'myds'
Return type: None
- twaml.data.from_root(input_files, name=None, tree_name='WtLoop_nominal', weight_name='weight_nominal', branches=None, selection=None, label=None, auxlabel=None, allow_weights_in_df=False, aggressively_strip=False, auxweights=None, detect_weights=False, nthreads=None, wtloop_meta=False, TeXlabel=None)[source]¶ Initialize a dataset from ROOT files.
Parameters:
- input_files (Union[str, List[str]]) – Single or list of ROOT input file(s) to use
- name (Optional[str]) – Name of the dataset (if None, use the first file name)
- tree_name (str) – Name of the tree in the file to use
- weight_name (str) – Name of the weight branch
- branches (Optional[List[str]]) – List of branches to store in the dataset; if None, use all
- selection (Optional[str]) – A string passed to pandas.DataFrame.eval to apply a selection based on branch/column values, e.g. (reg1j1b == True) & (OS == True) requires the reg1j1b and OS branches to be True
- label (Optional[int]) – Give the dataset an integer label
- auxlabel (Optional[int]) – Give the dataset an integer auxiliary label
- allow_weights_in_df (bool) – Allow “^weight_\w+” branches in the payload dataframe
- aggressively_strip (bool) – Call twaml.data.dataset.aggressively_strip() during construction
- auxweights (Optional[List[str]]) – Auxiliary weights to store in a second dataframe
- detect_weights (bool) – If True, fill the auxweights df with all “^weight_” branches; if auxweights is not None, this option is ignored
- nthreads (Optional[int]) – Number of threads to use when reading the ROOT tree (see uproot.TTreeMethods_pandas.df)
- wtloop_meta (bool) – Grab and store the WtLoop_meta YAML entries, stored as a dictionary of the form { str(filename) : dict(yaml) } in the class variable wtloop_metas
- TeXlabel (Optional[str]) – A LaTeX format label for the dataset
Examples
Example with a single file and two branches:
>>> ds1 = dataset.from_root(["file.root"], name="myds",
...                         branches=["pT_lep1", "pT_lep2"], label=1)
Example with multiple input_files and a selection (uses all branches). The selection requires nbjets == 1 and njets >= 1; the label is then set to 5.
>>> flist = ["file1.root", "file2.root", "file3.root"]
>>> ds = dataset.from_root(flist, selection='(nbjets == 1) & (njets >= 1)')
>>> ds.label = 5
Example using aux weights:
>>> ds = dataset.from_root(flist, name="myds", weight_name="weight_nominal",
...                        auxweights=["weight_sys_radLo", "weight_sys_radHi"])
Example where we detect aux weights automatically:
>>> ds = dataset.from_root(flist, name="myds", weight_name="weight_nominal",
...                        detect_weights=True)
Example using a ThreadPoolExecutor (16 threads):
>>> ds = dataset.from_root(flist, name="myds", nthreads=16)
Return type: dataset
- twaml.data.from_pytables(file_name, name='auto', tree_name='none', weight_name='auto', label=None, auxlabel=None, TeXlabel=None)[source]¶ Initialize a dataset from pytables output generated by dataset.to_pytables.
The payload is extracted from the .h5 pytables file using the name of the dataset and the weight name. If the name of the dataset doesn’t exist in the file, an exception is raised. Aux weights are retrieved if available.
Parameters:
- file_name (str) – Name of the h5 file containing the payload
- name (str) – Name of the dataset inside the h5 file; if "auto" (default), we attempt to determine the name automatically from the h5 file
- tree_name (str) – Name of the tree where the dataset originated (only for reference)
- weight_name (str) – Name of the weight array inside the h5 file; if "auto" (default), we attempt to determine the name automatically from the h5 file
- label (Optional[int]) – Give the dataset an integer label
- auxlabel (Optional[int]) – Give the dataset an integer auxiliary label
- TeXlabel (Optional[str]) – LaTeX formatted label
Examples
Creating a dataset from pytables where everything is auto detected:
>>> ds1 = dataset.from_pytables("ttbar.h5")
>>> ds1.label = 1  ## add a label to the dataset after the fact
Return type: dataset
- twaml.data.from_h5(file_name, name, columns, tree_name='WtLoop_nominal', weight_name='weight_nominal', label=None, auxlabel=None, TeXlabel=None)[source]¶ Initialize a dataset from generic h5 input (loosely expected to be from the ATLAS Analysis Release utility ttree2hdf5).
The name of the HDF5 dataset inside the file is assumed to be tree_name. The name argument is something you choose.
Parameters:
- file_name (str) – Name of the h5 file containing the payload
- name (str) – Name of the dataset you would like to define
- columns (List[str]) – Names of columns (branches) to include in the payload
- tree_name (str) – Name of the tree the dataset originates from (the HDF5 dataset name)
- weight_name (str) – Name of the weight array inside the h5 file
- label (Optional[int]) – Give the dataset an integer label
- auxlabel (Optional[int]) – Give the dataset an integer auxiliary label
- TeXlabel (Optional[str]) – LaTeX form label
Examples
>>> ds = dataset.from_h5("file.h5", "dsname", TeXlabel=r"$tW$",
...                      tree_name="WtLoop_EG_RESOLUTION_ALL__1up")
Return type: dataset
twaml.viz¶
The viz module provides visualization tools.
twaml.viz
A module to aid in visualizing our datasets.
- twaml.viz.compare_columns(ds1, ds2, columns=None, names=None, colors=None, density=True, **subplots_kw)[source]¶ Generate a set of histograms comparing the distributions of a set of columns in two different datasets.
Parameters:
- ds1 (twaml.data.dataset) – The first dataset
- ds2 (twaml.data.dataset) – The second dataset
- columns (Optional[List[str]]) – Columns to plot; if None, plot all
- names (Optional[Tuple[str, str]]) – Names for the legend; if None, use the datasets’ name attributes
- colors (Optional[Tuple[str, str]]) – Colors for the histograms
- density (bool) – Fed to the density parameter of matplotlib.pyplot.hist
- subplots_kw (Dict) – All additional keywords to send to matplotlib.pyplot.subplots
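A minimal usage sketch (dataset and column names assumed; figsize is passed through to matplotlib.pyplot.subplots):
>>> from twaml import viz
>>> viz.compare_columns(ds1, ds2, columns=["pT_lep1", "pT_lep2"],
...                     names=("tW", "ttbar"), figsize=(10, 6))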
- twaml.viz.compare_distributions(dist1, dist2, bins=None, titles=['dist1', 'dist2'], colors=['C0', 'C1'], ratio=True, weight1=None, weight2=None, **subplots_kw)[source]¶ Compare two histogrammed distributions with matplotlib.
Parameters:
- dist1 – Any mpl-histogrammable object (np.ndarray, pd.Series, etc.)
- dist2 – Any mpl-histogrammable object (np.ndarray, pd.Series, etc.)
- bins (np.ndarray) – Define the bin edges
- titles (List[str]) – Labels for the distributions
- ratio (bool) – Add a ratio plot
- weight1 (Optional[np.ndarray]) – Weights associated with dist1
- weight2 (Optional[np.ndarray]) – Weights associated with dist2
- subplots_kw (Dict) – All additional keywords to send to matplotlib.pyplot.subplots
Returns:
- fig (matplotlib.figure.Figure)
- ax (matplotlib.axes.Axes or array of them) – A single matplotlib.axes.Axes object, or an array of Axes objects if more than one subplot was created; the dimensions of the resulting array can be controlled with the squeeze keyword of matplotlib.pyplot.subplots
- h1 – The return of matplotlib.axes.Axes.hist for dist1
- h2 – The return of matplotlib.axes.Axes.hist for dist2
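A minimal usage sketch comparing a single weighted column from two datasets (dataset and column names assumed):
>>> from twaml import viz
>>> fig, ax, h1, h2 = viz.compare_distributions(
...     ds1.df["pT_lep1"], ds2.df["pT_lep1"],
...     weight1=ds1.weights, weight2=ds2.weights,
...     titles=["tW", "ttbar"])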
twaml.utils¶
The utils module provides simple utilities and constants.
twaml.utils
A module providing some utilities.
Tests¶
Running the tests requires pytest. Tests are implemented for the non-modeling parts of twaml, that is, data handling and plotting.
To run the tests, simply run pytest and you’ll see something of this form:
$ cd /path/to/twaml
$ pytest
========================== test session starts =======================
platform darwin -- Python 3.6.8, pytest-4.0.2, py-1.7.0, pluggy-0.8.0
rootdir: /Users/ddavis/ATLAS/analysis/twaml, inifile:
collected 14 items
tests/test_data.py ............ [ 85%]
tests/test_viz.py .. [100%]
========================= 14 passed in 1.24 seconds ==================