4 hours ago

HPC Specialist

DRW

Montreal / Remote

📍 On-site

Category: Engineering
Subcategory: DevOps / SRE
Type: Full-time


DRW is a diversified trading firm with over three decades of experience bringing sophisticated technology and exceptional people together to operate in markets around the world. We value autonomy and the ability to quickly pivot to capture opportunities, so we operate using our own capital and trade at our own risk.

Headquartered in Chicago with offices throughout the U.S., Canada, Europe, and Asia, we trade a variety of asset classes including Fixed Income, ETFs, Equities, FX, Commodities and Energy across all major global markets. We have also leveraged our expertise and technology to expand into three non-traditional strategies: real estate, venture capital and cryptoassets.

We operate with respect, curiosity and open minds. The people who thrive here share our belief that it's not just what we do that matters; it's how we do it. DRW is a place of high expectations, integrity, innovation and a willingness to challenge consensus.

We are looking for an HPC Specialist to join our AI and Multi Asset Systematic Strategies team. This team builds and operates GPU infrastructure that powers AI and ML workloads. You'll work on the infrastructure stack from bare metal to model serving, combining systems engineering, performance optimization, and infrastructure automation to solve complex problems at the intersection of hardware, networking, and distributed systems.

Responsibilities:

  • Deploy, maintain, and optimize GPU infrastructure for large-scale LLM inference workloads, including provisioning, configuration, and deployment of GPU server fleets.
  • Architect and implement distributed serving solutions for multi-node, multi-GPU model deployments.
  • Manage GPU-enabled Kubernetes clusters for LLM and ML workloads.
  • Configure network infrastructure including load balancers, firewalls, and inter-node communication for GPU clusters.
  • Implement and optimize storage solutions for model weights and inference caches.
  • Troubleshoot performance bottlenecks across the stack: hardware, drivers, networking, and application layer.
  • Research and evaluate emerging GPU technologies, model serving frameworks, and infrastructure optimizations.
  • Collaborate with ML engineers to profile model performance and implement inference acceleration techniques.
  • Drive reliability improvements through monitoring, alerting, capacity planning, and incident response.

Requirements:

  • Bachelor's or Master's degree in Computer Science, Systems Engineering, or related field.
  • 5+ years in DevOps, SRE, or infrastructure engineering roles.
  • Strong experience with GPU infrastructure, model serving frameworks (vLLM, SGLang), and GPU driver management.
  • Hands-on experience optimizing deep learning workloads (inference or training) on GPU clusters.
  • Deep Linux systems knowledge including network configuration, storage optimization, and Kubernetes orchestration.
  • Experience with infrastructure as code tools (Ansible, Terraform, or similar).
  • Strong understanding of distributed systems, networking protocols (TCP/IP, HTTP/2), and load balancing.
  • Proficiency in Python and Bash scripting for automation.
  • Experience with monitoring and observability tools (Prometheus, Grafana, or similar).

For more information about DRW's processing activities and our use of job applicants' data, please view our Privacy Notice at https://drw.com/privacy-notice.

California residents, please review the California Privacy Notice for information about certain legal rights at https://drw.com/california-privacy-notice.

DRW

DRW identifies and capitalizes on global trading and investment opportunities through a diversified strategy spanning multiple asset classes and markets, with timeframes from seconds to years. The firm combines the dynamism of a startup with the stability of an established company, emphasizing technology, research, and risk management. Its culture promotes continuous learning, high standards, curiosity, and collaboration among talented and dedicated professionals.

1,001 - 5,000 employees
Chicago, Illinois, US
Privately Held
trading
research
technology