Who am I?

Personal Details

Utkarsh Mathur
  • Name: Utkarsh Mathur
  • Location: San Jose, CA
  • Email:
    u7karsh [AT] yahoo [DOT] co [DOT] in
    umathur [AT] alumni [DOT] ncsu [DOT] edu
  • LinkedIn   
  • GitHub   

About Me

Currently working at Nvidia as a Sr. Architect, CPU. An avid skydiving enthusiast!

Past experience includes Marvell Semiconductors (ThunderX4 architecture team) and Cadence Design Systems (Verification IP for HDMI, I2C, MHL, and USB Type-C).

M.S. thesis in Computer Engineering from NCSU under the guidance of Prof. Eric Rotenberg

GPU security research under Prof. Huiyang Zhou

Research interests: Computer Architecture, General Purpose Computation on Graphics Processors (GPGPU), and architectural support for security.

My Professional Background

Work Experience

January 2021 - Present


Sr. Architect, CPU

  • Working on some core microarch stuff!
  • More info on this work coming soon ;)

August 2019 - January 2021

Marvell Semiconductors

Sr. Architecture Engineer

  • Part of the ThunderX4 server processor architecture team working on the performance model and RTL
  • Led the L1D prefetcher project. Was responsible for both performance studies and RTL implementation
  • Worked on different structures in the load store unit to improve overall performance
  • Worked on different branch prediction schemes and improving fetch bandwidth
  • Owned the tracing and simpoints infrastructure for our in-house trace based simulator

May 2018 - May 2019

NC State University

Research Assistant

Teaching Assistant

October 2015 - July 2017

Cadence Design Systems

R&D Engineer

  • Developed Verification IP for the HDMI, MHL, and I2C protocols
  • Co-created a component-based methodology with the aim to have more scalable and flexible architecture of Verification IPs and reduced time to market
  • Created features like Consumer Electronics Control (CEC) Physical Layer and DDC from scratch

July 2015 - October 2015

Cadence Design Systems

R&D Intern

  • Learnt various methodologies and trained in tools like ClearCase for Code Version Control
  • Tested HDMI verification IP and fixed major performance-related bugs

May 2014 - August 2014


(Defence R&D Organization)


  • Developed Data Logger using BL2120 SBC to replace their old firmware and increased recording time by a factor of 10
  • Designed schematic layout for data logging systems using dsPIC33F microcontroller for analyzing data recorded by various sensors during free fall to improve parachute designs
  • Invented a model for non-contact distance measurement of objects using Image Processing to help them determine terminal velocity of freely falling payloads

March 2014 - April 2014



  • Taught undergraduate level image processing for Robovito. Managed responsibility for lectures, workshops, and project

June 2013 - July 2013

NaMPET Laboratory
IIT Kanpur

Summer Intern

  • Developed a software suite in assembly language for the DSP processor TMS320F2812 & incorporated a package of routines to implement any filter of Order 2
  • Enabled production of 50Hz three phase sinusoids with various control mechanisms for consistent voltage

My research work?

Patents / Thesis / Publications

U Mathur. HW Cain. Effective prefetch throttling based on stream length. US Patent #63/045,681. Filed June 29, 2020. Pending

U. Mathur. Post-Silicon Microarchitecture (PSM) Implementation of Checkpointed Early Load Retirement (CLEAR). M.S. Thesis, Department of Electrical and Computer Engineering, North Carolina State University, March 2019. [NCSU library: on-line thesis]

C. Kumar, A. Chaudhary, S. Bhawalkar, U. Mathur, S. Jain, A. Vastrad, and E. Rotenberg. Post-Silicon Microarchitecture. IEEE Computer Architecture Letters (CAL), 19(1):26-29, Jan.-June 1, 2020. (Date of Publication: 09 March 2020.) [pdf]

Z. Lin, U. Mathur, and H. Zhou, "Scatter-and-Gather Revisited: High-Performance Side-Channel-Resistant AES on GPUs", The 12th workshop on General-Purpose Computation on Graphics Processing Units (GPGPU-2019), 2019 [pdf][source code]

Mathur, U., Sharma, R., Srivastava, N., "Script independent angular skew detection and correction algorithms", 2013 International Conference on Signal Processing and Communication (ICSC-2013), IEEE. pp. 466-469, 12-14 Dec. 2013. DOI: 10.1109/ICSPCom.2013.6719835. [IEEE Xplore, SCOPUS]

Sharma, R., Mathur, U., Srivastava, N., "Angular Skew Correction Algorithm for Handwritten Hindi Text", International Journal of Advance Computer Research; Jun 2013, Vol. 3 Issue 2, p43. [Impact factor: 1.863]

My latest work


October 2018 - December 2018

Side Channel Resistant Implementation of AES on GPUs

  • Proposed and implemented a high throughput masking based design of AES to mitigate timing based side channel attacks
  • Studied performance of variants like S-Box, T-Table, Bitsliced, Rivain-Prouff, H-Table, and the proposed design for different RNG configurations

September 2018 - December 2018

OS Design - XINU OS

  • Implemented virtual memory abstraction (including virtual stack space) on x86 using 4KB pages
  • Implemented (1) multi-level feedback queue for process scheduling; (2) spin-lock, guard-lock, try-lock, and priority inheritance
GitHub: https://github.com/u7karsh/os_virtual_mem
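
As a rough illustration of what 4KB paging implies for address translation, the sketch below splits a virtual address into its page directory index, page table index, and page offset. It assumes classic 32-bit, non-PAE x86 two-level paging (10 + 10 + 12 bits); the function name and constants are made up for this example and are not taken from the repository above.

```python
PAGE_SHIFT = 12   # 4 KB pages -> 12 bits of page offset
INDEX_BITS = 10   # assumption: 1024 entries per directory/table (32-bit, non-PAE)


def split_virtual_address(vaddr):
    """Return (directory index, table index, page offset) for a 32-bit address."""
    index_mask = (1 << INDEX_BITS) - 1
    offset = vaddr & ((1 << PAGE_SHIFT) - 1)
    table_index = (vaddr >> PAGE_SHIFT) & index_mask
    dir_index = (vaddr >> (PAGE_SHIFT + INDEX_BITS)) & index_mask
    return dir_index, table_index, offset


# Example: 0x00403025 falls in directory entry 1, table entry 3, offset 0x25
print(split_virtual_address(0x00403025))
```

A page fault handler would use the first two values to locate (or allocate) the page table entry, and the offset is carried over unchanged into the physical address.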

October 2018

Quantum Security

  • Designed and implemented an FPGA synthesizable module for Binary Learning with Errors (LWE), a Quantum Secure Module
  • Performed power based side channel analysis on the designed module to extract secret information

March 2018 - May 2018

Load Latency Hiding using LSB (slice) and LVP

  • Proposed and implemented a micro-arch for continual flow of instructions during retirement stall on L2 miss at head of ROB
  • Implemented LSB in 721sim (cycle-accurate superscalar simulator) to maintain fake-retired load dependent instructions
  • Implemented re-insertion of the instructions in slice to issue queue on load value misprediction
  • Implemented hierarchical store queue with membership test buffer (MTB) to prevent it from becoming a cycle time bottleneck

February 2018 - May 2018

Multipath Execution for Divergent Control Flow in GPUs

  • The current SIMT stack approach serializes the execution of divergent control flow. We implemented a technique proposed by ElTantawy et al. (A scalable multi-path microarchitecture for efficient GPU control flow), which allows interleaved execution of divergent paths.
  • Implemented split table and reconvergence table in GPGPU-sim to allow interleaved execution of divergent paths
  • Modified scoreboard logic to handle dependencies from diverged paths correctly

February 2018 - April 2018

Pipelined LC3 Microcontroller Functional Verification

  • Implemented a layered verification model for LC3 microcontroller in SystemVerilog, including sequencer, driver, monitor, and environment
  • The package included a coverage plan and several test cases for functional completeness and correctness
GitHub: https://github.com/u7karsh/745_lc3_verif

November 2017 - December 2017

Dynamic Instruction Scheduling

  • Developed a simulator for an out-of-order superscalar processor based on Tomasulo's algorithm that fetches, dispatches, and issues N instructions per cycle with integrated two level caches. Perfect branch prediction was assumed
GitHub: https://github.com/u7karsh/dynamic_scheduler_ece563

October 2017

Synthesizable Convolutional Neural Network

  • Developed a synthesizable Verilog design for two staged convolutional neural network arithmetic. Design generates 8-bit output vectors for object classification
  • Two parameterized architectures were proposed and developed, one throughput-oriented and the other area-oriented.
GitHub: https://github.com/u7karsh/cnn_ece564

September 2017 - October 2017

Branch Predictor and Cache Simulator

  • Developed a generic cache simulator for WTWNA, WTWA and WBWA policies which could be used to instantiate any level of memory hierarchy with the option to augment victim cache. Replacement policies like LRU, LFU and LRFU were also incorporated
  • Worked on a cache simulator for MESI, MOESI and MSI cache coherence protocols
  • Developed a simulator for branch predictor with different configurations like GShare, BiModal, Hybrid with an option to add BTB
GitHub: https://github.com/u7karsh/cache_simulator_ece563
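
For readers unfamiliar with GShare, here is a minimal sketch of the idea: the branch PC is XORed with a global history register to index a table of 2-bit saturating counters. This is a generic textbook-style version, not the code from the repository above; the class name, default sizes, the `pc >> 2` alignment shift, and the weakly-taken initial state are all assumptions made for illustration.

```python
class GShare:
    """Toy GShare predictor: PC bits XOR global history index 2-bit counters."""

    def __init__(self, index_bits=10, history_bits=10):
        # Keeping history no wider than the index keeps the XOR in range.
        assert history_bits <= index_bits
        self.index_bits = index_bits
        self.history_bits = history_bits
        self.history = 0
        self.counters = [2] * (1 << index_bits)  # assumption: start weakly taken

    def _index(self, pc):
        pc_bits = (pc >> 2) & ((1 << self.index_bits) - 1)  # drop alignment bits
        return pc_bits ^ self.history

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2  # True means "taken"

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the actual outcome into the global history register.
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask
```

Setting `history_bits=0` makes the index depend on the PC alone, which is effectively a bimodal predictor; a hybrid scheme would add a chooser table that selects between the two.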

January 2015 - May 2015

Reconfigurable computing using Field Programmable Gate Array

  • Built a computer System on Chip (SoC) on Papilio Pro (Spartan 6) based on Zilog 80 core with 4KB paged Memory Management Unit (64KB virtual, 64MB physical address space) along with 8MB SDRAM with 16KB 4-way associative cache and communication protocol Universal Asynchronous Receiver Transmitter (UART)
  • Created several programs in assembly to demonstrate features like user input using UART module, arithmetic operations, etc.

March 2015 - May 2015


  • Created a hardware cryptographic tool using C, Java, Python to securely upload and download files on a cloud
  • To ensure data security & integrity, the system used session keys and authentication tokens generated from the hardware using the Milenage algorithm that is used in GSM

January 2015 - March 2015

FPGA Place and Route

  • Developed open source tool chain using C++ for implementing a place & route mechanism for iCE40 FPGA
  • Tool chain used already existing FPGA synthesizer (YOSYS) to develop a fully open source FPGA compilation flow

August 2014 - December 2014

Hardware Signal Processing Toolbox

  • Designed a low-cost (~$30) tool using dsPIC33EP microcontroller for facilitating 3-channel hardware signal processing (up to 1.0MHz bandwidth) that interfaced with MATLAB & Java

October 2014 - December 2014

Pose Invariant Face Recognition

  • Dataset of 35 males was created manually by taking 5 photographs of each male in different poses
  • Features were selected using the embedded method for deep learning
  • Images were cropped & pre-processed using Gabor filter & Histogram Equalization after being converted to gray scale
  • Keeping all the frontal face images in test set, linear Support Vector Machine achieved an accuracy of 74%

May 2014 - July 2014


  • Centralized & digitized the fee collection data of Jaypee Youth Club by deploying a central CentOS-based TCP server that hosted a Java-based GUI application for fee collection, receipt printing & securely uploading data to the server using Python
  • An automated email to every fee depositor as a receipt confirmation was an added feature

July 2013 - May 2014

Micro Electro Mechanical Systems (MEMS)

  • Modelled cantilever-based clamped-free resonators (one with magnetostatic actuation and piezoresistive detection, one with electrostatic actuation and electrostatic detection) using Verilog-A at the Centre for Microelectromechanical Systems (MEMS) design, funded by the National Program on Micro & Smart Systems (NPMASS) initiated by the Aeronautical Development Agency (ADA), Govt. of India.

August 2013 - September 2013

Grid Solving Robot

  • Robot capable of solving a grid by finding & traversing the shortest path using Dijkstra's algorithm from one node to another without crossing the blocked/restricted nodes
  • Robot had the capability to pick objects from target node autonomously
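
The path-finding step can be sketched roughly as follows. This is an illustrative Dijkstra search on a rectangular grid with unit-cost moves in four directions, skipping blocked nodes; the function name, the `(row, col)` node representation, and the grid layout are assumptions for this example and not the robot's actual firmware.

```python
import heapq


def shortest_path(rows, cols, start, goal, blocked):
    """Dijkstra on a grid; returns the list of nodes from start to goal, or None."""
    blocked = set(blocked)
    dist = {start: 0}
    prev = {}
    heap = [(0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            break
        if d > dist[node]:
            continue  # stale heap entry
        r, c = node
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if not (0 <= nr < rows and 0 <= nc < cols) or nxt in blocked:
                continue
            nd = d + 1  # assumption: every move costs the same
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))
    if goal not in dist:
        return None  # goal is cut off by blocked nodes
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]


# Example: go around a blocked node at (0, 1) on a 3x3 grid
print(shortest_path(3, 3, (0, 0), (0, 2), {(0, 1)}))
```

With uniform move costs this gives the same result as a breadth-first search; Dijkstra becomes necessary once edges carry different costs, for example if turning is more expensive than driving straight.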

May 2013 - September 2014


  1. Impressions 2013 and 2014: Designed official website of techno-cultural festival at JIIT, Noida
  2. Jaypee Model United Nations 2014: Designed official website for JMUN 2014
  3. ICSC 2013 and 2015: Designed payment gateway for International Conference on Signal Processing and Communication using JavaServer Pages.

Website link: http://u7karsh.com/projects

April 2013 - May 2013

Foot Mouse

  • Developed a system using an accelerometer interfaced with a microcontroller that could easily be attached to a person's shoe
  • Used this module & C++ interface to control the location of mouse pointer on the Windows platform & performed a left/right click operation
  • Versatile for games like FIFA & Pro Evolution Soccer

June 2012 - August 2012

OpenCV-Based Marker Detection Library

  • A robust intensity invariant marker detection framework was developed using OpenCV in C++ to train multiple markers & use it to detect & obtain the Pose Matrices of all markers in a video stream
  • The Pose Matrix could then be used in a variety of augmented reality applications

What I'm best at

Skills & Knowledge



C/C++, CUDA, Verilog, SystemVerilog, Python, Assembly for RISC-V, Java, PHP, HTML, SQL, CQL


GPGPU-sim, 721sim (cycle-accurate RISC-V superscalar simulator), MATLAB, ModelSim, Synopsys Design Vision, LaTeX


UVM (Universal Verification Methodology)

Version Control

ClearCase, Perforce, Git

Operating Systems

Unix/Linux ❤, Windows



Papilio Pro (Spartan 6)


TMS320F2812, 8051


dsPIC30 and dsPIC33 family, and ATmega328

Development Boards

Arduino, Raspberry Pi and Galileo board

What have I achieved?

Honours and Awards


Merit Certificate holder from CBSE (Science) for being in the top 0.1% of students who appeared in All India Senior School Certificate Examination (AISSCE 2009)


January 2014
Secured 1st prize in HoverOn, a line following hovercraft event at Asia's largest Science and Technical Festival, IIT Bombay, 2014



January 2014
Finalists in Magneto, a gesture controlled robotic event at Asia's largest Science and Technical Festival, IIT Bombay, 2014


March 2013
Awarded 2nd prize in Trailblazer, a manual + autonomous line following bot event, Impressions JIIT, 2013


September 2012
Received 1st prize in RoboWars, Neuron, Malaviya National Institute of Technology (MNIT), 2012

Line Follower

September 2012
Recipient of 2nd prize in Line Follower, Neuron MNIT, 2012


September 2012
Awarded 3rd prize in Rescuer, a manual + autonomous grid solving bot event, Neuron MNIT, 2012


March 2012
Secured 2nd prize in Connexions, a manual + autonomous line following bot event, Impressions JIIT, 2012