Language models are few-shot learners T Brown, B Mann, N Ryder, M Subbiah, JD Kaplan, P Dhariwal, ... Advances in neural information processing systems 33, 1877-1901, 2020 | 30392* | 2020 |
Deep reinforcement learning from human preferences PF Christiano, J Leike, T Brown, M Martic, S Legg, D Amodei Advances in neural information processing systems 30, 2017 | 2102 | 2017 |
Extracting training data from large language models N Carlini, F Tramer, E Wallace, M Jagielski, A Herbert-Voss, K Lee, ... 30th USENIX Security Symposium (USENIX Security 21), 2633-2650, 2021 | 1209 | 2021 |
Scaling laws for neural language models J Kaplan, S McCandlish, T Henighan, TB Brown, B Chess, R Child, ... arXiv preprint arXiv:2001.08361, 2020 | 1105 | 2020 |
Adversarial patch TB Brown, D Mané, A Roy, M Abadi, J Gilmer arXiv preprint arXiv:1712.09665, 2017 | 926 | 2017 |
Fine-tuning language models from human preferences DM Ziegler, N Stiennon, J Wu, TB Brown, A Radford, D Amodei, ... arXiv preprint arXiv:1909.08593, 2019 | 859 | 2019 |
Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models J Kaplan, S McCandlish, T Henighan, TB Brown, B Chess arXiv preprint arXiv:2001.08361 2, 1557-1566, 2020 | 749 | 2020 |
Training a helpful and harmless assistant with reinforcement learning from human feedback Y Bai, A Jones, K Ndousse, A Askell, A Chen, N DasSarma, D Drain, ... arXiv preprint arXiv:2204.05862, 2022 | 681 | 2022 |
Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models J Kaplan, S McCandlish, T Henighan, TB Brown, B Chess arXiv preprint arXiv:2001.08361 1 (2), 4, 2020 | 628 | 2020 |
Constitutional ai: Harmlessness from ai feedback Y Bai, S Kadavath, S Kundu, A Askell, J Kernion, A Jones, A Chen, ... arXiv preprint arXiv:2212.08073, 2022 | 583 | 2022 |
Technical report on the cleverhans v2. 1.0 adversarial examples library N Papernot, F Faghri, N Carlini, I Goodfellow, R Feinman, A Kurakin, ... arXiv preprint arXiv:1610.00768, 2016 | 531* | 2016 |
Aurko Roy, Martın Abadi, and Justin Gilmer. Adversarial patch TB Brown, D Mané arXiv preprint arXiv:1712.09665 2 (3), 4, 2017 | 486 | 2017 |
cleverhans v2. 0.0: an adversarial machine learning library N Papernot, I Goodfellow, R Sheatsley, R Feinman, P McDaniel arXiv preprint arXiv:1610.00768 10, 2016 | 319 | 2016 |
Scaling laws for autoregressive generative modeling T Henighan, J Kaplan, M Katz, M Chen, C Hesse, J Jackson, H Jun, ... arXiv preprint arXiv:2010.14701, 2020 | 235 | 2020 |
Language models (mostly) know what they know S Kadavath, T Conerly, A Askell, T Henighan, D Drain, E Perez, ... arXiv preprint arXiv:2207.05221, 2022 | 223 | 2022 |
Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned D Ganguli, L Lovitt, J Kernion, A Askell, Y Bai, S Kadavath, B Mann, ... arXiv preprint arXiv:2209.07858, 2022 | 214 | 2022 |
A general language assistant as a laboratory for alignment A Askell, Y Bai, A Chen, D Drain, D Ganguli, T Henighan, A Jones, ... arXiv preprint arXiv:2112.00861, 2021 | 214 | 2021 |
In-context learning and induction heads C Olsson, N Elhage, N Nanda, N Joseph, N DasSarma, T Henighan, ... arXiv preprint arXiv:2209.11895, 2022 | 189 | 2022 |
Predictability and surprise in large generative models D Ganguli, D Hernandez, L Lovitt, A Askell, Y Bai, A Chen, T Conerly, ... Proceedings of the 2022 ACM Conference on Fairness, Accountability, and …, 2022 | 172 | 2022 |
A mathematical framework for transformer circuits N Elhage, N Nanda, C Olsson, T Henighan, N Joseph, B Mann, A Askell, ... Transformer Circuits Thread 1, 1, 2021 | 149 | 2021 |