# Usage

The system consists of nine files. To operate the system as a whole, only `pipeline.py` needs to be run. The other files can also be used independently of one another; if so, please read the individual section for each component below.

## The Pipeline

If you have not trained any models, you can set up the pipeline by calling `Pipeline.setup`, which will, by default, generate reflectivity data, convert it to images, train a classifier and train regressors on the data. You will need to provide the layers for which you wish to set the pipeline up, e.g. `[1, 2, 3]` for up to three-layer structures. You can also modify the number of curves generated (which significantly impacts runtime), the chunk size for h5 storage, and the number of epochs for which to train the classifier and regressors. The `show_plots` parameter controls whether the confusion matrix and regression plots are shown after training, and the `xray` and `noisy` parameters specify whether to use a neutron or x-ray probe and whether to apply noise to the data, respectively.

If you have already generated data and converted it to images, set the `generate_data` parameter to `False` to train on your own data. Likewise, if you have already trained a classifier, it can be loaded by setting the `train_classifier` parameter to `False`. Finally, if you have already trained the regressors for each layer, these can be loaded instead of training from scratch by setting the `train_regressor` parameter to `False`.
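A minimal sketch of a setup call, assuming the parameter names described above; the actual `Pipeline.setup` signature may differ, and the import is deferred so this sketch can be read without the project installed.

```python
def setup_pipeline():
    # Imported lazily; `pipeline` is this repository's pipeline.py module.
    from pipeline import Pipeline

    # Parameter names below follow the description in the text and are
    # assumptions about the actual signature.
    Pipeline.setup(
        layers=[1, 2, 3],       # build for up to three-layer structures
        xray=False,             # neutron probe (True for x-ray)
        noisy=True,             # apply sample and background noise
        show_plots=True,        # show confusion matrix and regression plots
        generate_data=True,     # set False to reuse previously generated images
        train_classifier=True,  # set False to load a trained classifier
        train_regressor=True,   # set False to load trained regressors
    )
```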

To run the pipeline, call the `Pipeline.run` method. You must provide a path to a directory containing `.dat` files to predict on (these should be CSV files of the form X, Y, Error). You will also need to provide the file path of a trained classifier as well as a dictionary of file paths of trained regressors for each layer you wish to classify. When the pipeline runs, the provided classifier predicts the number of layers; this prediction determines which regressor is used to predict each layer's SLD and thickness via dropout prediction. These predictions are then fed into refnx models and plotted against the given data. Fitting can optionally be applied by setting the `fit` parameter to `True`, in which case the predictions are used as initial estimates for the fitting algorithm for each of the models. The `n_iter` parameter controls the number of times predictions are performed for each data point when using the dropout predictor for Bayesian-like predictions. To run the pipeline with x-ray data and x-ray-trained models, set the `xray` parameter to `True`.
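The expected `.dat` layout can be illustrated with the standard library alone: comma-separated rows of X, Y, Error (momentum transfer Q, reflectivity R, and its uncertainty). This is only a sketch of the file format, not the repository's own parser.

```python
import csv
import io

# Three example rows in the X, Y, Error format expected of .dat files.
sample = "0.01,0.95,0.001\n0.02,0.60,0.002\n0.03,0.31,0.003\n"

# Parse each row into floats and split into Q, R and error columns.
rows = [tuple(float(v) for v in row) for row in csv.reader(io.StringIO(sample))]
q, r, err = zip(*rows)
```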

## Synthetic Data Generation

Data can be generated with refnx using the `generate_refnx.py` file. Specifically, `NeutronGenerator.generate` generates a specified number of refnx structure objects with a given number of layers using a neutron probe. The code is currently set up to use a silicon substrate with an SLD of 2.047x10^-6 Å^-2. A random roughness in the range [2, 8] Å is applied between each layer in each sample to simulate real data, and thickness choices are biased towards thinner layers. The number of Q values is set to 500, the scale is 1 and the resolution is 2%. SLDs and thicknesses are generated within [-1, 10]x10^-6 Å^-2 and [20, 1000] Å bounds respectively.
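A sketch of sampling layer parameters within the stated bounds. The log-uniform draw below is just one way to bias thickness choices towards thinner layers; the repository's exact distribution may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_layer():
    # Log-uniform thickness in [20, 1000] Å favours thinner layers
    # (an assumed stand-in for the repository's biasing scheme).
    thickness = np.exp(rng.uniform(np.log(20), np.log(1000)))
    sld = rng.uniform(-1, 10)      # SLD bounds, in units of 10^-6 Å^-2
    roughness = rng.uniform(2, 8)  # interfacial roughness in Å
    return thickness, sld, roughness

layers = [sample_layer() for _ in range(1000)]
```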

To generate with an x-ray probe, use `XRayGenerator.generate`. Again, you need only provide the number of curves to generate and the number of layers for each curve. refnx uses material densities when defining layers for x-ray probes, so H2O is used as the material for each layer and its density is varied between [0.5, 16] g cm^-3 to produce a range of SLDs comparable to the neutron generation code. For example, the substrate density is set to 2.1 g cm^-3 to produce the SLD of silicon. The radiation wavelength is set to 1.54 Å.

The `NeutronGenerator.save` or `XRayGenerator.save` method can then be used to store these structures as reflectivity curves in h5 format. You will need to provide the path of a directory to save the data to, the file name and the list of structure objects. The option to add sample and background noise is also available with the `noisy` parameter; this requires the included `directbeam_noise.dat` sample. The maximum Q value is 0.3, and the background is 0 for neutron data and 1e-9 for x-ray data.
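An illustrative noise model in the spirit described above: a constant background plus signal-dependent sample noise. The repository derives its sample noise from the included `directbeam_noise.dat`; the Gaussian stand-in here is purely an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def add_noise(r, background=1e-9, rel_scale=0.05):
    # Constant background level (1e-9 for x-ray, 0 for neutron data).
    noisy = r + background
    # Sample noise proportional to the signal (assumed Gaussian here; the
    # repository uses the measured direct beam instead).
    noisy += rng.normal(0, rel_scale * noisy)
    return noisy

r = np.logspace(0, -6, 500)  # idealised reflectivity over 500 Q values
r_noisy = add_noise(r)
```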

## Creating Images

Images are required as input to the classifier and regressors, and `generate_data.py` facilitates creation of these images. The data is split into training, validation and test sets for use in these models.

The `generate_images` function can be called with the path of a directory containing h5 files (in the correct format) to generate images for. Also provided to this function is a list of the layers for which images are to be generated for the corresponding file. For example, passing `[1, 2, 3]` will create images for any files containing 'one', 'two' or 'three' in their file names in the given data path; these are then saved together in the given save path directory. This allows for creation both of files with curves of a specific layer count, as required for regression, and of files containing curves of multiple different layer counts, as required for classification. If your data uses an x-ray probe, set the `xray` parameter to `True`.

Depending on the number of curves being generated, the chunk size for the h5 files can be modified for a potential speedup. Please note that this process can take some time with large numbers of curves and can generate large files (~1.5 GB for 10,000 curves). During creation of these images of the input reflectivity curves, targets are scaled to be between 0 and 1 to speed up training, and the data is shuffled. Images are greyscale arrays of values between 0 and 1. To optimise storage, these values are multiplied by a constant and stored as 16-bit unsigned integers, which avoids unnecessary 64-bit floating point storage. When these images are later loaded, they are divided by the same constant to obtain a very close approximation of the original image. This rounding has negligible impact on training performance but massively reduces file sizes.
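The storage trick can be sketched in a few lines: scale [0, 1] greyscale values up on save, divide by the same constant on load. The exact constant the repository uses is not stated; `2**16 - 1` (the uint16 maximum) is assumed here.

```python
import numpy as np

SCALE = 2**16 - 1  # assumed scaling constant for 16-bit storage

image = np.random.default_rng(2).random((64, 64))  # greyscale values in [0, 1]
stored = (image * SCALE).astype(np.uint16)         # 2 bytes/pixel instead of 8
recovered = stored.astype(np.float64) / SCALE      # close approximation on load
```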

## Merging Files

To merge `train.h5`, `validate.h5` and `test.h5` files, call the `merge` function in `merge_data.py`. After creating separate h5 files for each layer for each of the regressors, these files can be merged to produce a combined file for training the classifier. This allows data to be reused between the classifier and the regressors, and also allows images to be generated on separate machines before being combined. Please note that the number of curves in each of the files must be the same for them to be merged; this avoids reading and writing chunks of differing sizes for each file.
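The merge logic amounts to concatenating per-layer datasets after checking that the curve counts match. A minimal sketch with in-memory arrays standing in for the h5 datasets (the repository's `merge` operates on files and chunks):

```python
import numpy as np

def merge(datasets):
    # All inputs must hold the same number of curves so chunks line up.
    counts = {d.shape[0] for d in datasets}
    if len(counts) != 1:
        raise ValueError("all files must contain the same number of curves")
    return np.concatenate(datasets, axis=0)

one_layer = np.zeros((100, 64, 64))  # e.g. images from one-layer curves
two_layer = np.ones((100, 64, 64))   # e.g. images from two-layer curves
combined = merge([one_layer, two_layer])
```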

## Classification

To perform classification, call the `classify` function in `classification.py`. You will need to provide a path to the data (`train.h5`, `validate.h5` and `test.h5`) to test and/or train on. Optionally, a save path can be provided to save a newly trained model to. A load path can also be provided to load an existing classifier for further training and evaluation; the `train` parameter indicates whether to perform training. Hyperparameters that can be adjusted include the number of epochs to train for, the learning rate, the batch size and the dropout rate. The classifier supports up to three-layer classification.

The `show_plots` option of the `classify` function can be set to `True` to generate a confusion matrix when evaluating against a test set. This calls the `ConfusionMatrixPrinter.pretty_plot` method in `confusion_matrix_pretty_print.py`. The confusion matrix is the result of predicting on the given test set and shows how each individual example was classified against its ground truth label.
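What the confusion matrix reports can be computed in a few lines: counts of predicted layer class against ground-truth label. A minimal numpy sketch (the repository renders this via `ConfusionMatrixPrinter.pretty_plot` rather than computing it this way):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=3):
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1  # rows: ground truth, columns: prediction
    return cm

# Layer counts encoded as classes 0, 1, 2; one two-layer example misclassified.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2]
cm = confusion_matrix(y_true, y_pred)
```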

## Regression

To perform regression, call the `regress` function in `regression.py`. You will need to provide a path to data for a specific layer count (`train.h5`, `validate.h5` and `test.h5`) to test and/or train on, along with the layer for which the regressor is being used. Optionally, a save path can be provided to save a newly trained model to. A load path can also be provided to load an existing regressor for further training and evaluation; the `train` parameter indicates whether to perform training. Hyperparameters that can be adjusted include the number of epochs to train for, the learning rate, the batch size and the dropout rate.

The `show_plots` option of the `regress` function can be set to `True` to generate a prediction plot when evaluating against a test set. The plot shows the predicted depths and SLDs against ground truth values for each layer. If you are performing regression on x-ray data, set the `xray` parameter to `True`.

## Plotting with Error Bars

To generate the plots used in the paper of predictions against ground truths with error bars, run the code in `plotting.py`. The `Plotter.kpd_plot` method creates the plot using the `KerasDropoutPredicter` (KDP) class. The KDP takes trained models and uses dropout at test time to make Bayesian-like predictions. The `n_iter` parameter controls the number of predictions made for each example; the paper uses 100 predictions per point with 160 points (defined by a `batch_size` of 20 and `steps` of 8). The results are unscaled, and then the mean and standard deviation are taken; these form the predictions and associated errors respectively. If you are running on x-ray data, set the `xray` parameter to `True`.
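The aggregation step can be sketched directly: with dropout active at test time, each forward pass gives a different prediction, and the mean and standard deviation over `n_iter` passes form the prediction and its error bar. A stand-in stochastic model replaces the Keras network here.

```python
import numpy as np

rng = np.random.default_rng(3)

def dropout_predict(model, x, n_iter=100):
    # Run n_iter stochastic forward passes and aggregate them.
    samples = np.stack([model(x) for _ in range(n_iter)])  # (n_iter, n_points)
    return samples.mean(axis=0), samples.std(axis=0)

# Hypothetical noisy model standing in for a network with test-time dropout.
noisy_model = lambda x: 2.0 * x + rng.normal(0, 0.1, size=x.shape)
x = np.linspace(0, 1, 160)  # 160 points, as in the paper
mean, err = dropout_predict(noisy_model, x)
```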

## Generating Time-Varying Data

To create the time-varying dataset or to generate similar data, use the `generate_time_varying.py` file. Calling the `TimeVarying.generate` method simulates an experiment with two layers on a silicon substrate in which the top layer's thickness changes over time. Specifically, the first layer's thickness varies between [100, 850] Å in 50 Å increments, while the second layer has a fixed 100 Å thickness. The SLDs of the first and second layers are 2.5x10^-6 Å^-2 and 5.0x10^-6 Å^-2 respectively. A separate `.dat` file of (Q, R) values is created for each thickness value, with realistic noise added. `TimeVarying.predict` can then be used with the file path of a two-layer regressor to perform predictions on the time-varying dataset. This method also produces plots comparing the ground truth SLDs and thicknesses of the two layers against the predicted values for each file in the dataset.
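The simulated thickness schedule described above, one `.dat` file per value, can be written out directly:

```python
# Top-layer thicknesses from 100 Å to 850 Å in 50 Å increments.
thicknesses = list(range(100, 851, 50))
```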