Multi-Processor System-on-Chip 1 - Liliana Andrade
2.6. References
Bodin, B., Munier-Kordon, A., and Dupont de Dinechin, B. (2013). Periodic schedules for cyclo-static dataflow. The 11th IEEE Symposium on Embedded Systems for Real-time Multimedia, Montreal, QC, Canada, 105–114.
Bodin, B., Munier-Kordon, A., and Dupont de Dinechin, B. (2016). Optimal and fast throughput evaluation of CSDF. Proceedings of the 53rd Annual Design Automation Conference. Austin, USA, 160:1–160:6.
Brunie, N. (2017). Modified fused multiply and add for exact low precision product accumulation. 24th IEEE Symposium on Computer Arithmetic. London, United Kingdom, 106–113.
Carmichael, Z., Langroudi, H.F., Khazanov, C., Lillie, J., Gustafson, J.L., and Kudithipudi, D. (2019). Performance-efficiency trade-off of low-precision numerical formats in deep neural networks. Proceedings of the Conference for Next Generation Arithmetic. New York, USA, 3:1–3:9.
CAST (2016). Multi-core Processors, Technical Report CAST-32A, FAA [Online]. Available: https://www.faa.gov/aircraft/air_cert/design_approvals/air_software/cast/cast_papers/.
Cavicchioli, R., Capodieci, N., Solieri, M., and Bertogna, M. (2019). Novel methodologies for predictable CPU-To-GPU command offloading. Proceedings of the 31st Euromicro Conference on Real-Time Systems. Stuttgart, Germany, vol. 133 of LIPIcs, 22:1–22:22.
Chung, E., Fowers, J., Ovtcharov, K., Papamichael, M., Caulfield, A., Massengill, T., Liu, M., Ghandi, M., Lo, D., Reinhardt, S., Alkalay, S., Angepat, H., Chiou, D., Forin, A., Burger, D., Woods, L., Weisz, G., Haselman, M., and Zhang, D. (2018). Serving DNNs in real time at datacenter scale with Project Brainwave. IEEE Micro, 38, 8–20.
CNX (2019). Autoware.AI-Software-Architecture [Online]. Available: https://www.cnx-software.com/wp-content/uploads/2019/02/Autoware.AI-Software-Architecture.png.
Davis, R.I., Altmeyer, S., Indrusiak, L.S., Maiza, C., Nélis, V., and Reineke, J. (2018). An extensible framework for multicore response time analysis. Real-Time Systems, 54(3), 607–661.
de Dinechin, F., Forget, L., Muller, J.-M., and Uguen, Y. (2019). Posits: The good, the bad and the ugly. Proceedings of the Conference for Next Generation Arithmetic. Association for Computing Machinery, New York, USA.
Dupont de Dinechin, B. (2004). From machine scheduling to VLIW instruction scheduling. ST Journal of Research, 1(2).
Dupont de Dinechin, B. (2014). Using the SSA-Form in a code generator. 23rd International Conference on Compiler Construction, vol. 8409 of Lecture Notes in Computer Science, Springer, 1–17.
Dupont de Dinechin, B., and Graillat, A. (2017). Feed-forward routing for the wormhole switching network-on-chip of the Kalray MPPA2 processor. Proceedings of the 10th International Workshop on Network on Chip Architectures. Cambridge, USA, 10:1–10:6.
Dupont de Dinechin, B., de Ferrière, F., Guillon, C., and Stoutchinin, A. (2000). Code generator optimizations for the ST120 DSP-MCU core. Proceedings of the 2000 International Conference on Compilers, Architectures and Synthesis for Embedded Systems, CASES, San Jose, USA, 93–102.
Dupont de Dinechin, B., Ayrignac, R., Beaucamps, P., Couvert, P., Ganne, B., de Massas, P. G., Jacquet, F., Jones, S., Chaisemartin, N. M., Riss, F., and Strudel, T. (2013). A clustered manycore processor architecture for embedded and accelerated applications. IEEE High Performance Extreme Computing Conference, Waltham, USA, 1–6.
Dupont de Dinechin, B., van Amstel, D., Poulhiès, M., and Lager, G. (2014). Time-critical computing on a single-chip massively parallel processor. Design, Automation and Test in Europe Conference and Exhibition, Dresden, Germany, 1–6.
Dupont de Dinechin, M., Schuh, M., Moy, M., and Maïza, C. (2020). Scaling up the memory interference analysis for hard real-time many-core systems. Design, Automation and Test in Europe Conference and Exhibition, Grenoble, France, 1–4.
Firesmith, D. (2017). Multicore Processing [Online]. Available: https://insights.sei.cmu.edu/sei_blog/2017/08/multicore-processing.html.
Fisher, J. A., Faraboschi, P., and Young, C. (2005). Embedded Computing: A VLIW Approach to Architecture, Compilers and Tools. Morgan Kaufmann Publishers Inc., San Francisco, USA.
Forsberg, B., Palossi, D., Marongiu, A., and Benini, L. (2017). GPU-accelerated real-time path planning and the predictable execution model. Procedia Computer Science – International Conference on Computational Science, Zurich, Switzerland, 108, 2428–2432.
Graillat, A., Moy, M., Raymond, P., and Dupont de Dinechin, B. (2018). Parallel code generation of synchronous programs for a many-core architecture. Design, Automation and Test in Europe Conference and Exhibition, Dresden, Germany, 1139–1142.
Graillat, A., Maiza, C., Moy, M., Raymond, P., and Dupont de Dinechin, B. (2019). Response time analysis of dataflow applications on a many-core processor with shared-memory and network-on-chip. Proceedings of the 27th International Conference on Real-Time Networks and Systems. Toulouse, France, 61–69.
Gschwind, M. (2016). Workload acceleration with the IBM POWER vector–scalar architecture. IBM Journal of Research and Development, 60(2–3).
Gustafson, J.L. (2017). Beyond floating point: Next-generation computer arithmetic [Online]. Available: https://web.stanford.edu/class/ee380/Abstracts/170201-slides.pdf.
Gustafson, J.L. and Yonemoto, I.T. (2017). Beating floating point at its own game: Posit arithmetic. Supercomputing Frontiers and Innovations, 4(2), 71–86.
Halbwachs, N., Caspi, P., Raymond, P., and Pilaud, D. (1991). The synchronous data flow programming language LUSTRE. Proceedings of the IEEE, 79(9), 1305–1320.
Hascoët, J., Dupont de Dinechin, B., de Massas, P.G., and Ho, M.Q. (2017). Asynchronous one-sided communications and synchronizations for a clustered manycore processor. Proceedings of the 15th IEEE/ACM Symposium on Embedded Systems for Real-Time Multimedia, Seoul, Republic of Korea, 51–60.
Hascoët, J., Dupont de Dinechin, B., Desnos, K., and Nezan, J. (2018). A distributed framework for low-latency OpenVX over the RDMA NoC of a clustered manycore. 2018 IEEE High Performance Extreme Computing Conference HPEC, Waltham, USA, 1–7.
Huang, M., Men, L., and Lai, C. (2013). Accelerating mean shift segmentation algorithm on hybrid CPU/GPU platforms. In Modern Accelerator Technologies for Geographic Information Science, Shi, X., Kindratenko, V. and Yang, C. (eds). Springer, New York.
Intel (2018). BFLOAT16 – Hardware Numerics Definition Revision 1.0. November 2018.
Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A.G., Adam, H., and Kalenichenko, D. (2018). Quantization and training of neural networks for efficient integer-arithmetic-only inference. IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA, 2704–2713.
Jia, Z., Maggioni, M., Staiger, B., and Scarpazza, D.P. (2018). Dissecting the NVIDIA Volta GPU architecture via microbenchmarking. ArXiv, abs/1804.06826.
Johnson, J. (2018). Rethinking floating point for deep learning. ArXiv, abs/1811.01721.
Kanduri, A., Rahmani, A.M., Liljeberg, P., Hemani, A., Jantsch, A., and Tenhunen, H. (2017). A Perspective on Dark Silicon. Springer International Publishing.
Kästner, D., Pister, M., Gebhard, G., Schlickling, M., and Ferdinand, C. (2013). Confidence in timing. SAFECOMP 2013 - Workshop SASSUR (Next Generation of System Assurance Approaches for Safety-Critical Systems) of the 32nd International Conference on Computer Safety, Reliability and Security, Toulouse, France.
Krishnamoorthi, R. (2018). Quantizing deep convolutional networks for efficient inference: A whitepaper. ArXiv, abs/1806.08342.
Lee, E.A., Reineke, J., and Zimmer, M. (2017). Abstract PRET Machines. IEEE Real-Time Systems Symposium, RTSS, Paris, France, December 5–8, 1–11.
NVIDIA (2020). Programming Tensor Cores in CUDA 9 [Online]. Available: https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/.
Pagetti, C., Saussié, D., Gratia, R., Noulard, E., and Siron, P. (2014). The ROSACE case study: From Simulink specification to multi/many-core execution. 20th IEEE Real-Time and Embedded Technology and Applications Symposium. Berlin, Germany, 309–318.
Perret, Q., Maurère, P., Noulard, E., Pagetti, C., Sainrat, P., and Triquet, B. (2016). Temporal isolation of hard real-time applications on many-core processors. IEEE Real-Time and Embedded Technology and Applications Symposium. Vienna, Austria, April 11–14, 37–47.
Resmerita, D., Farias, R.C., Dupont de Dinechin, B., and Fillatre, L. (2020). Benchmarking alternative floating-point formats for deep learning inference. Conférence francophone d’informatique en Parallélisme, Architecture et Système.
Rihani, H., Moy, M., Maiza, C., Davis, R.I., and Altmeyer, S. (2016). Response time analysis of synchronous data flow programs on a many-core processor. Proceedings of the 24th International Conference on Real-Time Networks and Systems. Brest, France, 67–76.
Rodriguez, A., Ziv, B., Fomenko, E., Meiri, E., and Shen, H. (2018). Lower numerical precision deep learning inference and training. Intel AI Developer Program, 1–19 [Online]. Available: https://software.intel.com/content/www/us/en/develop/articles/lower-numerical-precision-deep-learning-inference-and-training.html.
Rovder, S., Cano, J., and O’Boyle, M. (2019). Optimising convolutional neural networks inference on low-powered GPUs. 12th International Workshop on Programmability and Architectures for Heterogeneous Multicores. Valencia, Spain.
Saidi, S., Ernst, R., Uhrig, S., Theiling, H., and Dupont de Dinechin, B. (2015). The shift to multicores in real-time and safety-critical systems. International Conference on Hardware/Software Codesign and System Synthesis. Amsterdam, The Netherlands, October 4–9, 220–229.
Wilhelm, R. and Reineke, J. (2012). Embedded systems: Many cores - Many problems. 7th IEEE International Symposium on Industrial Embedded Systems. Karlsruhe, Germany, June 20–22, 176–180.
For a color version of all figures in this book, see www.iste.co.uk/andrade/multi1.zip.
1. Numbers in each pair denote, respectively, the bit-width of the multiplicands and the accumulator.
2. Motivated by saving the silicon area and not constrained by the architecture.
4. Passing the OpenCL 1.2 conformance with PoCL is work in progress.
5. https://www.ansys.com/products/embedded-software/ansys-scade-suite.