<About the Job>
We are looking for a highly motivated and skilled AI Infrastructure Engineer with strong hands-on experience in Kubernetes (K8s), particularly in supporting AI/ML workflows. In this role, you will be instrumental in designing, implementing, and maintaining robust, scalable, and high-performance Kubernetes-based infrastructure that supports the entire lifecycle of our AI applications—from data processing and model training to deployment and monitoring. You will work closely with data scientists, AI/ML engineers, and DevOps teams to ensure seamless integration of AI/ML workloads within cloud-native environments. The ideal candidate has a deep understanding of container orchestration, distributed systems, and MLOps practices, and is passionate about building efficient, reliable platforms that enable rapid AI innovation. This is a unique opportunity to work at the intersection of AI and cloud infrastructure, contributing to next-generation systems that power intelligent applications at scale.
<Job Responsibilities>
.Design & Architecture: Design, build, and scale a reliable and efficient Kubernetes platform optimized for AI/ML workloads. This includes provisioning GPUs, managing resources, and ensuring optimal performance for computationally intensive tasks.
.Infrastructure Management: Manage the entire Kubernetes cluster lifecycle—from provisioning and configuration to ongoing maintenance, monitoring, and troubleshooting, ensuring high availability and scalability.
.Deployment & Automation: Develop and implement CI/CD pipelines to automate the deployment, scaling, and updating of machine learning models and AI services. Ensure seamless integration with MLOps tools such as Kubeflow, MLflow, and Argo Workflows.
.Performance Optimization: Continuously monitor and optimize system performance, focusing on resource utilization, latency reduction, and the overall efficiency of AI workloads. Ensure high availability and minimal downtime for AI services.
.Collaboration & Guidance: Work closely with data scientists, ML engineers, and cross-functional teams to understand their infrastructure requirements and provide technical solutions to meet workload demands effectively.
.Security & Compliance: Implement best practices for cluster security, including network policies, access controls, and vulnerability management to safeguard sensitive data and maintain compliance.
.Cost & Resource Efficiency: Manage resources effectively to optimize cost while maintaining high-performance infrastructure for AI model training, inference, and data processing.
<Skills & Qualifications>
.Kubernetes Expertise: Hands-on experience with Kubernetes (K8s) architecture, including deploying applications, managing resources, and troubleshooting complex cluster issues in a production environment.
.Containerization & Linux Environment: Strong knowledge of container technologies such as Docker, along with hands-on experience in Linux environments. Expertise in container orchestration and deployment practices is highly valued.
.AI Workloads: Deep understanding of GPU scheduling and performance optimization, including strategies for resource allocation, workload balancing, and maximizing throughput for AI/ML tasks.
.Automation & CI/CD: Practical experience building and managing CI/CD pipelines using tools such as GitLab CI, Jenkins, GitHub Actions, or ArgoCD to automate deployments.
.Programming & Scripting: Proficiency in at least one scripting language (e.g., Python, Bash) is a must.
.Networking: Knowledge of container networking and service mesh technologies (e.g., Istio, Linkerd) is highly desirable.