Christian Engelmann
Senior Scientist and Group Leader, Intelligent Systems and Facilities, Oak Ridge National Laboratory
Verified email at ornl.gov
Title
Cited by
Year
Proactive fault tolerance for HPC with Xen virtualization
AB Nagarajan, F Mueller, C Engelmann, SL Scott
Proceedings of the 21st annual international conference on Supercomputing, 23-32, 2007
498 2007
Addressing failures in exascale computing
M Snir, RW Wisniewski, JA Abraham, SV Adve, S Bagchi, P Balaji, J Belak, ...
The International Journal of High Performance Computing Applications 28 (2…, 2014
429 2014
Detection and correction of silent data corruption for large-scale high-performance computing
D Fiala, F Mueller, C Engelmann, R Riesen, K Ferreira, R Brightwell
SC'12: Proceedings of the International Conference on High Performance…, 2012
343 2012
Proactive process-level live migration in HPC environments
C Wang, F Mueller, C Engelmann, SL Scott
SC'08: Proceedings of the 2008 ACM/IEEE conference on Supercomputing, 1-12, 2008
229 2008
Combining partial redundancy and checkpointing for HPC
J Elliott, K Kharbas, D Fiala, F Mueller, K Ferreira, C Engelmann
2012 IEEE 32nd International Conference on Distributed Computing Systems…, 2012
187 2012
Proactive fault tolerance using preemptive migration
C Engelmann, GR Vallee, T Naughton, SL Scott
2009 17th Euromicro International Conference on Parallel, Distributed and…, 2009
128 2009
A job pause service under LAM/MPI+BLCR for transparent fault tolerance
C Wang, F Mueller, C Engelmann, SL Scott
2007 IEEE International Parallel and Distributed Processing Symposium, 1-10, 2007
115 2007
Functional partitioning to optimize end-to-end performance on many-core architectures
M Li, SS Vazhkudai, AR Butt, F Meng, X Ma, Y Kim, C Engelmann, ...
SC'10: Proceedings of the 2010 ACM/IEEE International Conference for High…, 2010
114 2010
The case for modular redundancy in large-scale high performance computing systems
C Engelmann, HH Ong, SL Scott
Proceedings of the 8th IASTED international conference on parallel and…, 2009
102 2009
Failures in large scale systems: Long-term measurement, analysis, and implications
S Gupta, T Patel, C Engelmann, D Tiwari
Proceedings of the International Conference for High Performance Computing…, 2017
100 2017
NVMalloc: Exposing an aggregate SSD store as a memory partition in extreme-scale machines
C Wang, SS Vazhkudai, X Ma, F Meng, Y Kim, C Engelmann
2012 IEEE 26th International Parallel and Distributed Processing Symposium…, 2012
93 2012
System-level virtualization for high performance computing
G Vallee, T Naughton, C Engelmann, H Ong, SL Scott
16th Euromicro Conference on Parallel, Distributed and Network-Based…, 2008
83 2008
High-end computing resilience: Analysis of issues facing the HEC community and path-forward for research and development
N DeBardeleben, J Laros, JT Daly, SL Scott, C Engelmann, B Harrod
Whitepaper, Dec, 2009
80 2009
Super-scalable algorithms for computing on 100,000 processors
C Engelmann, A Geist
International Conference on Computational Science, 313-321, 2005
78 2005
A framework for proactive fault tolerance
G Vallee, K Charoenpornwattana, C Engelmann, A Tikotekar, ...
2008 Third International Conference on Availability, Reliability and…, 2008
75 2008
Redundant execution of HPC applications with MR-MPI
C Engelmann, S Böhm
Proceedings of the 10th IASTED International Conference on Parallel and…, 2011
71 2011
Scaling to a million cores and beyond: Using light-weight simulation to understand the challenges ahead on the road to exascale
C Engelmann
Future Generation Computer Systems 30, 59-65, 2014
69 2014
Hybrid checkpointing for MPI jobs in HPC environments
C Wang, F Mueller, C Engelmann, SL Scott
2010 IEEE 16th International Conference on Parallel and Distributed Systems…, 2010
66 2010
xSim: The extreme-scale simulator
S Böhm, C Engelmann
2011 International Conference on High Performance Computing & Simulation…, 2011
63 2011
Proactive process-level live migration and back migration in HPC environments
C Wang, F Mueller, C Engelmann, SL Scott
Journal of Parallel and Distributed Computing 72 (2), 254-267, 2012
58 2012