Job Description:
As a Site Reliability Engineer / System Administrator at THCloud.AI, you will be responsible for maintaining the reliability, scalability, and efficiency of our AI and blockchain infrastructure across on-premise and multi-cloud environments. You will drive automation and operational excellence by designing, implementing, and managing CI/CD pipelines, monitoring system health, and addressing potential issues before they affect performance.
Key Responsibilities:
1. Maintain, monitor, and troubleshoot the company's cloud, blockchain, AI, and associated business systems across on-premise and multi-cloud environments.
2. Deploy and manage applications on Linux platforms and virtualized infrastructure (Proxmox, VMware, OpenShift), handling system installations, configurations, and ongoing maintenance tasks.
3. Develop, implement, and manage CI/CD pipelines using tools such as GitHub Actions, Ansible, and Kubernetes to ensure seamless and efficient deployment workflows.
4. Design high-availability systems with load balancing (HAProxy, Nginx), caching (Redis), and failover configurations.
5. Conduct daily monitoring, data backup, and recovery, using open-source monitoring tools (Prometheus, Grafana, Loki) for performance reporting, issue tracking, and proactive health checks.
6. Perform anomaly detection, root cause analysis, and automated alerting to address and prevent system failures and performance bottlenecks.
7. Automate operational tasks and improve system resilience through scripting (Bash, Python, or Golang) and configuration management tools.
8. Maintain and optimize infrastructure components such as Docker, Kubernetes, databases (PostgreSQL, MySQL), and distributed storage (Ceph, MinIO).
9. Set up VPN, VPC, and secure networking for client environments, ensuring proper isolation and security.
10. Collaborate with cross-functional teams to support infrastructure improvements, incident response, and operational resilience.