
 home   |   syllabus   |   groups   |   moodle   |   video   |   review   |   © 2022


AI ❤️ SE

What is AI?

  • Machine intelligence
  • Our focus: machine learning, and specifically, data-driven algorithms.
  • In particular, we will focus on applied ML.
  • We will use the term "AI" to refer to machine learning algorithms.

AI4SE: The different perspectives

The AI perspective

  • Most SE tasks can be formulated as an optimization problem.
  • We can then use optimization theory in AI to solve the problem.
  • There are several papers that do this, for example:
    • Software effort estimation [1-3]
    • Performance optimization at the software architecture level [4]
    • Intrusion detection [5]
    • Fault prediction and software quality modeling [6, 7]
    • Testing vision-based control systems [8]
    • Automated microservice partitioning [9]

  • Optimization can be black-box (e.g., using a neural network) or white-box (e.g., using a linear programming solver); for state-of-the-art results, the former is more popular.
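To make the "SE task as optimization" framing concrete, here is a toy black-box sketch: a made-up objective scoring a candidate microservice partition, maximized by the simplest possible optimizer (random search). The objective and its parameters are hypothetical, purely for illustration.

```python
import random

# Hypothetical objective: score a candidate microservice partition.
# In a real system this would measure, e.g., coupling and cohesion.
def score(config):
    return -(config["num_services"] - 5) ** 2 - config["max_coupling"]

# Black-box optimization at its simplest: random search over configurations.
def random_search(n_trials=1000):
    best, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = {"num_services": random.randint(2, 20),
                  "max_coupling": random.uniform(0.0, 1.0)}
        s = score(config)
        if s > best_score:
            best, best_score = config, s
    return best, best_score

print(random_search())
```

A white-box solver would instead exploit the structure of the objective (e.g., hand it to a linear programming solver); random search treats it as a closed box.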

The SE perspective

  • SE problems are first-class citizens.
  • We can use AI to solve them, but we're more interested in the underlying problem and pragmatics.
  • Closer to the Working Backwards approach.

In either case, the knowledge assumptions are different. When looking through the AI perspective, optimization, designing loss functions, and the theoretical results are "obvious". When looking through the SE perspective, the underlying problem is "obvious", and the theoretical results are less important.

What can AI do for SE?

Once framed as an optimization problem, nearly anything. See:

  • Recognizing actionable static code warnings [10]
  • Self-admitted technical debt detection [11]
  • Human-in-the-loop optimization [12]
  • Automated microservice partitioning [9, 13]
  • Assisting developers in writing code [14]
  • Defect prediction [15]

Case Study 1: GHOST

  • Our case study: defect prediction
  • Our bias: we will use the AI lens to look at the problem.

Step 1: Obtain data

  • Use the PROMISE repository (from 2005!) [16]
  • This is tabular data, containing 21 attributes--lines of code, number of public methods, depth of inheritance tree, etc.
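A minimal loading sketch, assuming one of the PROMISE defect datasets has been saved locally as a CSV; the path and label column name here are hypothetical.

```python
import pandas as pd

df = pd.read_csv("promise/jm1.csv")        # hypothetical local path to a PROMISE dataset
X = df.drop(columns=["defects"]).values    # the 21 static-code metrics
y = df["defects"].astype(int).values       # binary label: defective or not
```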

Step 2: Try a classical model

  • Classical models run significantly faster than deep learners, and can often perform as well as deep learners.
  • You can use algorithms like DODGE [17] to tune multiple learners with multiple configs each in minutes.
  • Moreover, classical models are more interpretable, which can be preferable [18].
  • That said: classical learners, like any learner in a non-convex landscape, can fall into local optima--moreover, neural networks distilled into soft decision trees can perform better than decision trees trained directly on the data, due to dark knowledge [19].
    • Brownie points: I've noticed this is true, at least for defect prediction, even when you distill into a hard decision tree--can you code this up? (A starting sketch follows this list.)
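A starting point for that brownie point: train a neural net, then fit a hard decision tree to the net's predictions instead of the raw labels. Note that [19] distills into a *soft* tree using soft targets; this sketch uses hard labels, per the brownie point. The architecture and depth are arbitrary choices, not taken from any paper.

```python
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

def distill_to_hard_tree(X, y):
    # Teacher: a small feedforward net trained on the raw labels.
    teacher = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500).fit(X, y)
    # Student: a hard decision tree fit to the teacher's predictions,
    # so it inherits the net's smoother boundary ("dark knowledge").
    student = DecisionTreeClassifier(max_depth=5)
    return student.fit(X, teacher.predict(X))
```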

Step 3: Use a deep learner

  • Why? Some preliminary results with classical learners were poor.
  • Obviously, we need to use hyper-parameter optimization (why? because hyper-parameters affect the number of piecewise linear regions of the decision boundary; see [20]).
    • Brownie points: derive a greedy approach to find the optimal theoretical architecture based on Theorem 5 of this paper, given the number of linear regions required to separate the data. (And if you can find a way to compute the number of linear regions needed, you're strongly encouraged to join the lab!)
  • And by the way, we need to oversample, since we have imbalanced data; the trivial way to do this is to use a weighted loss function:

$$ \mathcal{L}^1 = w \sum_{\substack{i=1 \\ y_i = c_0}}^{m} \mathcal{L}(y_i, \hat{y}_i) + \sum_{\substack{i=1 \\ y_i \neq c_0}}^{m} \mathcal{L}(y_i, \hat{y}_i) $$

where $w > 1$ up-weights the minority class $c_0$ and $\mathcal{L}$ is the per-sample loss.

(Credits: I am the author of the paper this equation is from, and I permit myself to use it.)

  • In fact, oversampling is as good as any other method to deal with imbalanced data, at least for convolutional networks [21]--but there's no reason it shouldn't work for feedforward networks, right?
  • Weighted losses are also computationally efficient (why?).
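As a concrete (hedged) sketch of $\mathcal{L}^1$: Keras's `class_weight` argument scales each sample's loss term by a per-class weight, which is exactly the weighted-loss trick above. The architecture and weight value are arbitrary placeholders; the defective class is assumed to be label 1, and `X, y` are reused from the Step 1 sketch.

```python
from tensorflow import keras

# A minimal feedforward learner for the 21 PROMISE metrics.
model = keras.Sequential([
    keras.Input(shape=(21,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

w = 5.0  # hypothetical weight for the minority (defective) class
model.fit(X, y, epochs=50, class_weight={0: 1.0, 1: w})
```

(No new rows are materialized here, which is a hint for the efficiency question above.)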

Step 4: Define a preprocessing operator

  • The above didn't work well at all, so we need a new hypothesis.
  • This hypothesis is that the decision boundary gets too close to some data samples, so we may have samples that are accidentally on the wrong side of it (motivation: SVM margins).

(Figure: an SVM's maximum-margin decision boundary. Credits: [22])

  • Note here the difference between this method of pushing data samples away vs. the SVM way: the SVM solves a convex optimization problem to find the optimal decision boundary, while this method pushes the samples away computationally, as a preprocessing step. This makes the SVM more efficient, but you may need a softer margin to get the same results.
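A hedged sketch of the boundary-pushing idea (not the exact operator from the paper): surround each minority-class sample with jittered copies, so that any boundary the learner draws must keep its distance from the originals. The `radius` and `copies` knobs are hypothetical.

```python
import numpy as np

def fuzzy_sample(X, y, minority=1, radius=0.05, copies=2):
    # Add perturbed copies of each minority sample around the original,
    # pushing the learned decision boundary away from it.
    Xm = X[y == minority]
    new_X, new_y = [X], [y]
    for _ in range(copies):
        jitter = np.random.uniform(-radius, radius, size=Xm.shape)
        new_X.append(Xm + jitter)
        new_y.append(np.full(len(Xm), minority))
    return np.vstack(new_X), np.concatenate(new_y)
```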

Step 5: Iterate on the solution

  • We can apply fuzzy sampling to the majority class as well.
  • Fuzzy sampling creates the opposite class imbalance problem (ironically, the same problem we had before), but we can use SMOTE to solve it (see the sketch after this list).
    • Brownie points: How many times can you apply fuzzy sampling (and then end it with SMOTE) before you get worse results? Can you explain why too many rounds of fuzzy sampling is bad, aside from the computational cost?
    • Brownie points: Can you find an issue with this approach? Hint: something here is unnecessary.
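Putting Steps 4 and 5 together, a sketch of the iterate-then-rebalance loop, using imblearn's SMOTE and the `fuzzy_sample` sketch from Step 4. The shrinking radius per round is an assumption for illustration, not the paper's schedule.

```python
from imblearn.over_sampling import SMOTE

def fuzzy_then_smote(X, y, rounds=2, radius=0.05, minority=1):
    # Repeated fuzzy sampling over-inflates the (former) minority class...
    for _ in range(rounds):
        X, y = fuzzy_sample(X, y, minority=minority, radius=radius)
        radius /= 2  # assumed: shrink the jitter each round
    # ...so finish by rebalancing with SMOTE.
    return SMOTE().fit_resample(X, y)
```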

Congrats--you are now the SOTA for defect prediction! Until the next guy comes along, that is.

Lessons learned

  • We were very AI-centric about our approach--the actual data itself, or its source, never impacted our decisions.
  • We did need to come up with a novel preprocessing operator, perhaps because we failed to include domain knowledge.
  • Although we used a feature set from 2005, we beat state-of-the-art approaches that used language models to mine features.

Case Study 2: SNEAK

Options, options, options

  • The number of options to tune can be daunting.
  • In software configuration, for example, MySQL has 461 options, which means about $10^{138}$ possible configurations [23] (461 binary choices alone give $2^{461} \approx 10^{138}$). Most of these have little to no effect.
  • This is true for machine learning, but especially so for deep learning:
    • Learning rate [24]
    • Activation function [25-26] (the latter activation function has its own Twitter account--no, really)
    • Batch normalization [27] or dropout [28], but not both [29].
    • Initialization [30, 41]
    • Loss functions
  • How do we tackle this?
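One standard tactic (see [31-34]): hand the search space to a hyper-parameter optimizer. A minimal hyperopt sketch follows; the space and the toy objective are placeholders, not a recommendation.

```python
from hyperopt import fmin, hp, tpe

# A tiny, hypothetical search space: learning rate and network depth.
space = {
    "lr": hp.loguniform("lr", -10, 0),
    "layers": hp.choice("layers", [1, 2, 3]),
}

def objective(params):
    # Placeholder: in practice, train a model here and return validation loss.
    return (params["lr"] - 0.01) ** 2 + params["layers"]

best = fmin(objective, space, algo=tpe.suggest, max_evals=50)
print(best)
```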

But why is AI4SE not more widespread?

  • The AI community is not very interested in SE problems.
    • This point was generated by Codex, but I ask you: how wrong is it?
  • The SE community is not very interested in AI.
    • This point was also generated by Codex, but at ICSE 2022, all our papers were rejected, partly for this reason. I'd rewrite this point as: there is some resistance within the SE community to AI--and the politics of academia are out of scope of this class.
    • Note: we do not talk about ICSE '22.
  • Researchers are notoriously bad at writing production-grade code or documentation. Moreover, companies expect additional things such as community support and a license permitting commercial use without requiring them to open-source their own product.

But what about SE4AI?

  • It exists, and it's called MLOps.
  • We cannot possibly cover everything in this class, but we refer you to Full-Stack Deep Learning.

Why?

  • Experimenting with models on datasets creates many artifacts.
  • You may also want to use an older idea from your tinkering to try new things.
  • This creates a need for a way to track and manage these artifacts.
  • Most MLOps frameworks today will offer the following features:
    • Experiment tracking
    • Model versioning
    • Model deployment
    • Visualization
    • Reproducing experiments

How?

There are many contenders in this field, but the most popular ones are:

  • MLFlow
  • Weights & Biases
  • DVC
  • Comet
  • ClearML
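For a taste of what these tools do, a minimal MLFlow tracking sketch; the parameter names, metric values, and artifact file are placeholders.

```python
import mlflow

with mlflow.start_run(run_name="defect-prediction-baseline"):
    mlflow.log_param("learner", "decision_tree")  # experiment tracking
    mlflow.log_param("max_depth", 5)
    # ... train the model, evaluate it, save it to model.pkl ...
    mlflow.log_metric("recall", 0.72)             # placeholder value
    mlflow.log_artifact("model.pkl")              # versioning: log any file the run produced
```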

X vs Y

  • Decision trees vs. SVMs (a code sketch contrasting the two follows these tables)

|                        | Decision tree                                                | SVM                                                     |
| ---------------------- | ------------------------------------------------------------ | ------------------------------------------------------- |
| What is the same       | Classification algorithms                                    | Classification algorithms                               |
| What is different      | Based on maximizing information gain/split information/etc.  | Can use the kernel trick; based on convex optimization  |
| When to use            | Need interpretability                                        | Need the power of the kernel trick                      |
| When decision is wrong | Need more complex boundaries                                 | Need interpretability                                   |

  • Deep learning vs. classical machine learning

|                        | Deep learning                                                                              | Classical ML                                                                    |
| ---------------------- | ------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------- |
| What is the same       | Both are learning algorithms                                                                | Both are learning algorithms                                                     |
| What is different      | Layered structure; minimizes a loss function; automatically extracts hierarchical features  | Hierarchical structure; may not be loss-based; needs manually provided features  |
| When to use            | Larger datasets                                                                             | Smaller datasets; need interpretability                                          |
| When decision is wrong | Cannot figure out why it doesn't work; takes too long                                       | Need a more complex boundary; dataset features are complex                       |

  • AI4SE vs. SE4AI

|                        | AI4SE                   | SE4AI                           |
| ---------------------- | ----------------------- | ------------------------------- |
| What is the same       | Both involve AI and SE  | Both involve AI and SE          |
| What is different      | AI is a supporting tool | SE is a supporting tool         |
| When to use            | Smaller-scale AI        | Larger-scale AI                 |
| When decision is wrong | No or poor ROI          | Wasting time on the engineering |

  • Manual tracking vs. MLOps

|                        | Manual tracking                   | MLOps                                                           |
| ---------------------- | --------------------------------- | --------------------------------------------------------------- |
| What is the same       | Both address the SE4AI problem    | Both address the SE4AI problem                                  |
| What is different      | Manual organization of artifacts  | Uses SE concepts like version control, automatic logging, etc.  |
| When to use            | Smaller-scale AI                  | Larger-scale AI                                                 |
| When decision is wrong | Artifacts are difficult to manage | Too much overhead                                               |
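To see the first table in code, a small sklearn sketch on synthetic data: the SVM gets the kernel trick, and the tree gets a boundary you can read. All settings here are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3).fit(X_tr, y_tr)
svm = SVC(kernel="rbf").fit(X_tr, y_tr)    # kernel trick: nonlinear boundary

print("tree accuracy:", tree.score(X_te, y_te))
print("svm accuracy:", svm.score(X_te, y_te))
print(export_text(tree))                   # interpretability: the tree prints as rules
```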

References

[1] Kumar, K. Vinay, et al. "Software development cost estimation using wavelet neural networks." Journal of Systems and Software 81.11 (2008): 1853-1867.
[2] Minku, Leandro L., and Xin Yao. "Software effort estimation as a multiobjective learning problem." ACM Transactions on Software Engineering and Methodology (TOSEM) 22.4 (2013): 1-32.
[3] Sarro, Federica, Alessio Petrozziello, and Mark Harman. "Multi-objective software effort estimation." 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE). IEEE, 2016.
[4] Du, Xin, et al. "An evolutionary algorithm for performance optimization at software architecture level." 2015 IEEE Congress on Evolutionary Computation (CEC). IEEE, 2015.
[5] Sadiq, Ali Safaa, et al. "An efficient ids using hybrid magnetic swarm optimization in wanets." IEEE Access 6 (2018): 29041-29053.
[6] De Carvalho, Andre B., Aurora Pozo, and Silvia Regina Vergilio. "A symbolic fault-prediction model based on multiobjective particle swarm optimization." Journal of Systems and Software 83.5 (2010): 868-882.
[7] Liu, Yi, Taghi M. Khoshgoftaar, and Naeem Seliya. "Evolutionary optimization of software quality modeling with multiple repositories." IEEE Transactions on Software Engineering 36.6 (2010): 852-864.
[8] Abdessalem, Raja Ben, et al. "Testing vision-based control systems using learnable evolutionary algorithms." 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE). IEEE, 2018.
[9] Desai, Utkarsh, Sambaran Bandyopadhyay, and Srikanth Tamilselvam. "Graph neural network to dilute outliers for refactoring monolith application." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 35. No. 1. 2021.
[10] Yedida, Rahul, et al. "How to Find Actionable Static Analysis Warnings." arXiv preprint arXiv:2205.10504 (2022).
[11] Tu, Huy, and Tim Menzies. "DebtFree: minimizing labeling cost in self-admitted technical debt identification using semi-supervised learning." Empirical Software Engineering 27.4 (2022): 1-37.
[12] Lustosa, Andre, et al. "SNEAK: Faster Interactive Search-based SE." arXiv preprint arXiv:2110.02922 (2021).
[13] Yedida, Rahul, et al. "An Expert System for Redesigning Software for Cloud Applications." arXiv preprint arXiv:2109.14569 (2021).
[14] Chen, Mark, et al. "Evaluating large language models trained on code." arXiv preprint arXiv:2107.03374 (2021).
[15] Yedida, Rahul, and Tim Menzies. "On the value of oversampling for deep learning in software defect prediction." IEEE Transactions on Software Engineering (2021).
[16] Shirabad, J. Sayyad, and Tim J. Menzies. "The PROMISE repository of software engineering databases." School of Information Technology and Engineering, University of Ottawa, Canada 24 (2005): 3.
[17] Agrawal, Amritanshu, et al. "How to “dodge” complex software analytics." IEEE Transactions on Software Engineering 47.10 (2019): 2182-2194.
[18] Rudin, Cynthia. "Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead." Nature Machine Intelligence 1.5 (2019): 206-215.
[19] Frosst, Nicholas, and Geoffrey Hinton. "Distilling a neural network into a soft decision tree." arXiv preprint arXiv:1711.09784 (2017).
[20] Montufar, Guido F., et al. "On the number of linear regions of deep neural networks." Advances in neural information processing systems 27 (2014).
[21] Buda, Mateusz, Atsuto Maki, and Maciej A. Mazurowski. "A systematic study of the class imbalance problem in convolutional neural networks." Neural networks 106 (2018): 249-259.
[22] Larhmam, CC BY-SA 4.0, via Wikimedia Commons.
[23] Xu, Tianyin, et al. "Hey, you have given me too many knobs!: Understanding and dealing with over-designed configuration in system software." Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering. 2015.
[24] Yedida, Rahul, Snehanshu Saha, and Tejas Prashanth. "Lipschitzlr: Using theoretically computed adaptive learning rates for fast convergence." Applied Intelligence 51.3 (2021): 1460-1478.
[25] Saha, Snehanshu, et al. "Evolution of novel activation functions in neural network training for astronomy data: habitability classification of exoplanets." The European Physical Journal Special Topics 229.16 (2020): 2629-2738.
[26] Klambauer, Günter, et al. "Self-normalizing neural networks." Advances in neural information processing systems 30 (2017).
[27] Ioffe, Sergey, and Christian Szegedy. "Batch normalization: Accelerating deep network training by reducing internal covariate shift." International conference on machine learning. PMLR, 2015.
[28] Srivastava, Nitish, et al. "Dropout: a simple way to prevent neural networks from overfitting." The journal of machine learning research 15.1 (2014): 1929-1958.
[29] Li, Xiang, et al. "Understanding the disharmony between dropout and batch normalization by variance shift." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019.
[30] He, Kaiming, et al. "Delving deep into rectifiers: Surpassing human-level performance on imagenet classification." Proceedings of the IEEE international conference on computer vision. 2015.
[31] Bergstra, James, et al. "Algorithms for hyper-parameter optimization." Advances in neural information processing systems 24 (2011).
[32] Bergstra, James, and Yoshua Bengio. "Random search for hyper-parameter optimization." Journal of machine learning research 13.2 (2012).
[33] Bergstra, James, Daniel Yamins, and David Cox. "Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures." International conference on machine learning. PMLR, 2013.
[34] Bergstra, James, et al. "Hyperopt: a python library for model selection and hyperparameter optimization." Computational Science & Discovery 8.1 (2015): 014008.
[35] Loshchilov, Ilya, and Frank Hutter. "CMA-ES for hyperparameter optimization of deep neural networks." arXiv preprint arXiv:1604.07269 (2016).
[36] Falkner, Stefan, Aaron Klein, and Frank Hutter. "BOHB: Robust and efficient hyperparameter optimization at scale." International Conference on Machine Learning. PMLR, 2018.
[37] Pedregosa, Fabian. "Hyperparameter optimization with approximate gradient." International conference on machine learning. PMLR, 2016.
[38] Bardenet, Rémi, et al. "Collaborative hyperparameter tuning." International conference on machine learning. PMLR, 2013.
[39] Lindauer, Marius, et al. "SMAC3: A Versatile Bayesian Optimization Package for Hyperparameter Optimization." Journal of Machine Learning Research 23 (2022): 54-1.
[40] Liu, Chenxi, et al. "Auto-deeplab: Hierarchical neural architecture search for semantic image segmentation." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019.
[41] Zhang, Hongyi, Yann N. Dauphin, and Tengyu Ma. "Fixup initialization: Residual learning without normalization." arXiv preprint arXiv:1901.09321 (2019).
[42] Elsken, Thomas, Jan Hendrik Metzen, and Frank Hutter. "Neural architecture search: A survey." The Journal of Machine Learning Research 20.1 (2019): 1997-2017.
[43] Liu, Chenxi, et al. "Progressive neural architecture search." Proceedings of the European conference on computer vision (ECCV). 2018.
[44] Zoph, Barret, and Quoc V. Le. "Neural architecture search with reinforcement learning." arXiv preprint arXiv:1611.01578 (2016).
[45] Tan, Mingxing, et al. "Mnasnet: Platform-aware neural architecture search for mobile." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.
[46] Pham, Hieu, et al. "Efficient neural architecture search via parameters sharing." International conference on machine learning. PMLR, 2018.
[47] Jin, Haifeng, Qingquan Song, and Xia Hu. "Auto-keras: An efficient neural architecture search system." Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining. 2019.
[48] Mellor, Joe, et al. "Neural architecture search without training." International Conference on Machine Learning. PMLR, 2021.
[49] Li, Liam, and Ameet Talwalkar. "Random search and reproducibility for neural architecture search." Uncertainty in artificial intelligence. PMLR, 2020.
[50] Zhou, Hongpeng, et al. "Bayesnas: A bayesian approach for neural architecture search." International conference on machine learning. PMLR, 2019.
[51] Wen, Wei, et al. "Neural predictor for neural architecture search." European Conference on Computer Vision. Springer, Cham, 2020.
[52] Gong, Xinyu, et al. "Autogan: Neural architecture search for generative adversarial networks." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019.
[53] Xie, Sirui, et al. "SNAS: stochastic neural architecture search." arXiv preprint arXiv:1812.09926 (2018).
[54] Nayman, Niv, et al. "Xnas: Neural architecture search with expert advice." Advances in neural information processing systems 32 (2019).
[55] Yang, Zhaohui, et al. "Cars: Continuous evolution for efficient neural architecture search." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.