Posted on 28 June 2025 by Reginald Dykes
Job Title: Senior Cloud & AI Infrastructure Engineer
Location: Addison, TX
Job Type: Full-time
We are seeking a highly skilled Senior Cloud & AI Infrastructure Engineer to join our advanced AI systems team. This role focuses on architecting, deploying, and optimizing hybrid cloud solutions and machine learning pipelines to power large language model (LLM) inference and generative AI applications. The ideal candidate will have deep experience with multi-cloud infrastructure, scalable API development, GPU provisioning, and distributed data workflows.
Responsibilities:
Architect and implement hybrid cloud solutions across AWS, Azure, and GCP, prioritizing scalability, availability, and cost efficiency.
Develop and maintain RESTful APIs using FastAPI and Swagger to support real-time LLM inference and serve scalable model pipelines.
Optimize and deploy generative AI models (e.g., LLaMA, Mistral, OpenAI GPT), including retrieval-augmented generation (RAG) pipelines using Ray and VectorAI.
Automate infrastructure provisioning using Terraform, Ansible, and Crossplane to support multi-cloud deployments.
Enable reproducible machine learning workflows using MLflow, DVC, and VectorAI, supporting experiment tracking and model versioning.
Provision and manage GPU-accelerated infrastructure, with a target of improving LLM training throughput by up to 50%.
Utilize vector databases such as Milvus and Pinecone for efficient semantic search, alongside Apache Iceberg for dataset lineage and table management.
Design and orchestrate real-time and batch data workflows using Apache Airflow, Spark, and Flink for scalable data processing.
Implement observability and monitoring solutions using Prometheus, Datadog, and Splunk to ensure system reliability and performance.
Build insightful dashboards and metrics pipelines to drive operational visibility, performance tuning, and rapid debugging.
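To illustrate the retrieval step behind the semantic-search and RAG responsibilities above, here is a minimal sketch using plain-Python cosine similarity over an in-memory list. In practice a dedicated vector database such as Milvus or Pinecone would hold the embeddings; the document IDs and vectors below are purely illustrative assumptions, not part of this role's actual stack.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, store, k=2):
    # store: list of (doc_id, embedding) pairs -- a stand-in for a vector DB.
    scored = [(doc_id, cosine_similarity(query_vec, emb)) for doc_id, emb in store]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

if __name__ == "__main__":
    store = [
        ("doc-a", [1.0, 0.0, 0.0]),
        ("doc-b", [0.9, 0.1, 0.0]),
        ("doc-c", [0.0, 1.0, 0.0]),
    ]
    print(top_k([1.0, 0.0, 0.0], store))
```

A real RAG pipeline would replace the list with an indexed vector store and feed the retrieved documents into the LLM prompt; the ranking logic, however, follows this same nearest-neighbor pattern.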
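The dependency-ordered batch workflows described in the orchestration bullet can be sketched with the standard library's graphlib, which resolves task ordering the way an Airflow scheduler does for a DAG. The task names here are illustrative assumptions only, not a real pipeline.

```python
from graphlib import TopologicalSorter

# A toy DAG standing in for an Airflow-style workflow:
# extract -> {transform, validate} -> load.
dag = {
    "transform": {"extract"},
    "validate": {"extract"},
    "load": {"transform", "validate"},
}

def run_order(dag):
    # Return one valid execution order respecting all task dependencies.
    return list(TopologicalSorter(dag).static_order())

if __name__ == "__main__":
    print(run_order(dag))
```

Airflow adds scheduling, retries, and distributed execution on top of this core idea, but every valid run of a DAG must respect exactly this topological ordering.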
Requirements:
5+ years of experience in cloud infrastructure engineering, DevOps, or MLOps roles.
Demonstrated expertise with AWS, GCP, and Azure in production-grade environments.
Proven experience with FastAPI, Swagger, and scalable API design.
Solid background in LLM optimization, model serving, and RAG implementations.
Hands-on experience with Terraform, Ansible, Crossplane, and GPU provisioning strategies.
Familiarity with ML lifecycle tools (MLflow, DVC) and vector stores (Milvus, Pinecone).
Strong understanding of data orchestration frameworks (Airflow, Spark, Flink).
Deep knowledge of observability tools and building telemetry pipelines.
Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
Preferred Qualifications:
Experience with container orchestration (Kubernetes, Ray Serve).
Contributions to open-source AI or ML infrastructure projects.
Strong communication skills and the ability to lead cross-functional initiatives.