15 Site Reliability Engineer jobs in Vietnam
Senior Site Reliability Engineer
Posted 2 days ago
Job Description
- Design, implement, and maintain highly available and scalable production systems.
- Develop and manage infrastructure automation using tools like Terraform, Ansible, or Chef.
- Implement and manage container orchestration platforms (e.g., Kubernetes, Docker Swarm).
- Set up and maintain robust monitoring, logging, and alerting systems (e.g., Prometheus, Grafana, ELK stack).
- Lead incident response efforts, troubleshoot complex issues, and conduct thorough post-mortems.
- Define and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs).
- Automate operational tasks and reduce manual intervention (toil reduction).
- Collaborate with development teams to ensure the reliability and performance of new features and services.
- Participate in on-call rotation to provide 24/7 support for critical systems.
- Contribute to capacity planning and performance tuning.
- Ensure security best practices are implemented across the infrastructure.
- Document system architecture, operational procedures, and incident reports.
- Bachelor's degree in Computer Science, Engineering, or a related field; Master's degree is a plus.
- 5+ years of experience in Site Reliability Engineering, DevOps, or Systems Engineering.
- Proven experience with cloud platforms such as AWS, Azure, or GCP.
- Expertise in scripting languages (e.g., Python, Go, Bash).
- Strong understanding of networking concepts (TCP/IP, HTTP, DNS, load balancing).
- Experience with CI/CD tools and practices (e.g., Jenkins, GitLab CI).
- Familiarity with containerization technologies (Docker, Kubernetes).
- Excellent troubleshooting, problem-solving, and analytical skills.
- Strong communication and collaboration skills, with the ability to explain technical concepts clearly.
- Experience with databases (SQL and NoSQL) and their administration.
- On-call experience and ability to work under pressure.
Senior Site Reliability Engineer
Posted today
Job Description
Optimizely fosters an inclusive and diverse culture with a global team of 1500+ people spread across the US, Europe, Dubai, Australia, Singapore, Bangladesh, and Vietnam. Our unique work environment focuses on flexibility, trust, teamwork, diversity, and moving fast.
We genuinely believe that our people make all the difference, and once we find the best talent, we go out of our way to nurture them. If you are looking to work on the next generation of digital technologies in a fast-paced and growing environment with industry leaders, Optimizely is the place for you!
**Responsibilities**:
- Define a roadmap for all engineering teams to utilize fully automated, self-service, highly scalable, cost-efficient, observable, auditable and reliable infrastructure services as standard practice.
- Drive the execution of this roadmap across the engineering organization, collaborating with SREs and senior engineers across engineering while also performing hands-on work on the most critical challenges.
- Provide expert technical guidance and ongoing engineering design review to teams planning and implementing large migrations, service-oriented architecture, broad architectural shifts, and capacity growth.
- Build a metrics-driven operational culture standardizing our practices for SLO definition and review as well as for logging, monitoring, alerting, and on-call practices.
- Make iterative improvements to blameless incident management processes, root cause analyses, outage prevention, and service recovery strategies across the engineering organization.
- Partner closely with Security, Quality, and Product teams to achieve high priority security, privacy, compliance, reliability, and business-continuity objectives on our overall roadmap.
- Propose and drive large improvements to production systems to achieve a significant impact on our business and engineering teams.
- Mentor and coach engineers to be curious and effective at discovering and solving technical challenges.
**Knowledge and Experience**:
- You have proven experience (6+ years) demonstrating hands-on technical leadership and business impact in combining software engineering skills with systems engineering skills to solve complex automation and reliability challenges.
- You have deep technical experience with various cloud providers, containerization technologies, automated deployment frameworks, orchestration frameworks, monitoring, logging, alerting, system internals, networking, databases, distributed systems, and service-oriented architecture.
- You have the skills to implement load, stress, performance, and reliability testing standards at scale to improve service, platform, and infrastructure resiliency.
- You promote openness, diversity of opinions, and inclusive discussions at all times to evaluate a wide variety of ideas and perspectives in solving challenging problems.
- You demonstrate clear decision making and good trade-offs in complex situations comprising multiple opinions, needs, teams, technologies, cloud providers, and architectural settings.
- You communicate effectively with stakeholders ranging from executives to junior engineers across the breadth and depth of the engineering organization.
- You exemplify high accountability, integrity, and resilience to maintain focus on both big-picture goals and milestones to get there.
- You enable the engineering organization to innovate and deliver with greater speed and safety.
**Education**:
BS CS or equivalent industry experience
**Competencies**:
- Displaying Technical Expertise
- Critical Thinking
- Testing and Troubleshooting
- Demonstrating Initiative
- Utilizing Feedback
**About Us**:
- 5 working days per week with flexible working time and no overtime.
- Annual luxury kick-off vacation.
- International, professional, creative working environment and talented teams.
- Onsite opportunities in Europe and the US.
- Cultural, sports, and arts clubs and activities sponsored and/or supported by the company (e.g., football, gym, swimming, guitar, English).
- Powerful workstation: Core i7-9700, 16-32 GB RAM, 02 x QHD 2560x1440 monitors (2K resolution).
- 100% official salary during the probation period, 13th month salary, annual salary raises.
- 12 days of annual leave and 3 days of company holidays (New Year's Eve 31/12, Juneteenth day 18/6, Work Anniversary).
- Up to 03 extra paid-leave days per year.
- Social, health, and unemployment insurance are based on 100% of gross salary and fully paid by the company.
- Extra bonus of $60 per special occasion (Birthday, Labor Day, National Day, Solar New Year, Lunar New Year).
- Lunch allowance of $30 per month.
Remote Lead Site Reliability Engineer
Posted today
Job Description
Responsibilities:
- Design, build, and maintain scalable and reliable production systems.
- Develop and implement automation strategies for deployment, monitoring, and incident response.
- Identify and address performance bottlenecks and proactively mitigate risks.
- Lead troubleshooting efforts and conduct post-mortems for incidents.
- Collaborate with software engineers to ensure reliability is designed into new features.
- Develop and maintain system monitoring, alerting, and logging infrastructure.
- Manage CI/CD pipelines and optimize deployment processes.
- Mentor and guide junior SRE team members.
- Contribute to architectural discussions and technology selection.
- Ensure system security and compliance with industry standards.
- Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
- 5+ years of experience in Site Reliability Engineering, DevOps, or System Administration.
- Expertise in cloud platforms such as AWS, Azure, or GCP.
- Proficiency in at least one scripting language (e.g., Python, Go, Bash).
- Experience with containerization technologies like Docker and Kubernetes.
- Strong understanding of networking concepts (TCP/IP, DNS, HTTP).
- Experience with infrastructure-as-code tools (e.g., Terraform, Ansible).
- Proven ability to diagnose and resolve complex system issues.
- Excellent communication and collaboration skills for remote teamwork.
- Experience with monitoring tools (e.g., Prometheus, Grafana, ELK stack).
Senior Site Reliability Engineer (Remote)
Posted today
Job Description
Responsibilities:
- Design, implement, and manage highly available and scalable systems.
- Develop and maintain infrastructure automation tools and scripts.
- Build and manage CI/CD pipelines for efficient software deployment.
- Implement and optimize monitoring, alerting, and logging systems.
- Lead incident response and conduct post-mortems to prevent future issues.
- Collaborate with development teams to ensure system reliability and performance.
- Conduct capacity planning and performance tuning.
- Automate operational tasks and reduce manual toil.
- Contribute to the design and architecture of new systems and features.
- Mentor junior SREs and share best practices.
- Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
- 5+ years of experience in Site Reliability Engineering, DevOps, or a similar role.
- Strong experience with cloud platforms such as AWS, Azure, or GCP.
- Proficiency in scripting and programming languages like Python, Go, or Java.
- Experience with containerization technologies (Docker, Kubernetes).
- Expertise in infrastructure as code (IaC) tools (Terraform, Ansible).
- Knowledge of monitoring tools (Prometheus, Grafana, Datadog).
- Strong understanding of networking, operating systems, and distributed systems.
- Excellent problem-solving, analytical, and debugging skills.
- Ability to work effectively in a remote team and manage complex projects.
Senior Site Reliability Engineer (SRE)
Posted today
Job Description
Key Responsibilities:
- Design, implement, and manage scalable and reliable cloud-based infrastructure (e.g., AWS, Azure, GCP).
- Develop and maintain automation tools and scripts for deployment, monitoring, and incident management.
- Implement and enforce best practices for system monitoring, alerting, and logging.
- Participate in on-call rotation to respond to and resolve production incidents.
- Conduct root cause analysis for production issues and implement preventative measures.
- Collaborate with development teams to improve application reliability and performance throughout the software development lifecycle.
- Manage and optimize CI/CD pipelines for efficient and safe software deployments.
- Develop and maintain infrastructure as code (IaC) using tools like Terraform or Ansible.
- Contribute to capacity planning and performance tuning of systems.
- Document system architecture, operational procedures, and incident post-mortems.
- Stay current with emerging technologies and industry best practices in SRE and cloud computing.
- Mentor junior engineers and promote a culture of reliability and operational excellence.
Qualifications:
- Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
- Minimum of 6 years of experience in system administration, DevOps, or Site Reliability Engineering.
- Proficiency with cloud platforms (AWS, Azure, or GCP) and containerization technologies (Docker, Kubernetes).
- Strong scripting skills (e.g., Python, Bash, Go).
- Experience with monitoring tools (e.g., Prometheus, Grafana, Datadog) and logging systems (e.g., ELK stack).
- Familiarity with CI/CD tools and practices (e.g., Jenkins, GitLab CI).
- Solid understanding of networking concepts (TCP/IP, DNS, HTTP).
- Experience with configuration management tools (e.g., Ansible, Chef, Puppet).
- Ability to work independently and manage priorities in a remote, fast-paced environment.
Remote Senior Site Reliability Engineer
Posted 2 days ago
Job Description
Key Responsibilities:
- Design, build, and maintain reliable, scalable, and high-performance infrastructure.
- Develop and implement automation for operational tasks, deployments, and incident response.
- Monitor system health, performance, and availability, and establish effective alerting mechanisms.
- Participate in on-call rotations and manage production incidents.
- Conduct root cause analysis for production issues and implement preventative measures.
- Manage cloud infrastructure resources and optimize for cost and performance.
- Collaborate with software engineering teams to improve the reliability and deployability of applications.
- Develop and maintain infrastructure-as-code using tools like Terraform or Ansible.
- Perform capacity planning and performance tuning.
- Contribute to disaster recovery planning and testing.
- Bachelor's degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience.
- 5+ years of experience in Site Reliability Engineering, DevOps, or Systems Engineering.
- Proven experience with cloud platforms such as AWS, GCP, or Azure.
- Strong proficiency in at least one scripting language (e.g., Python, Go, Bash).
- Hands-on experience with containerization technologies like Docker and orchestration tools like Kubernetes.
- Solid understanding of networking concepts (TCP/IP, DNS, HTTP, load balancing).
- Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
- Familiarity with infrastructure-as-code tools (e.g., Terraform, Ansible, Chef, Puppet).
- Experience in building and managing CI/CD pipelines.
- Excellent problem-solving skills and the ability to work under pressure.
- Strong communication and collaboration skills, especially in a remote environment.
Principal Site Reliability Engineer (Zalopay
Posted today
Job Description
- Eliminating toil through automation across all layers: infrastructure provisioning, configuration management, deployment, testing, and operation, both on premises and in public clouds (Google Cloud and AWS).
- Retooling our infrastructure to provide an agile, cloud-based foundation with a common infrastructure management and automation framework.
- Interfacing directly with senior staff members within the organization to discuss and assess compliance with IT policies, standards and procedures, suggest opportunities for improvement, and report on the status of specific. Work with development teams throughout the software life cycle ensuring sustainable software releases.
- Practicing sustainable incident response and blameless postmortems.
- Helping to build a methodology for managing infrastructure and platform costs.
- Training junior SRE members.
- Managing a small SRE team (4-6 members) to drive automation, scalability, high availability, and performance of ZaloPay.
**Requirements**:
- Bachelor’s degree with five or more years of work experience.
- Six or more years of SRE relevant work experience.
- Experience in systems architecture; in-depth knowledge of SRE, IT operations, and cloud; coding and scripting experience with Golang, Java, and Python; and experience with automation tools (Terraform, Ansible).
- Strong experience with Google Cloud and AWS environments, with working knowledge of standard cloud services, features, and tools, and certification in appropriate areas.
- Strong experience automating the provisioning of software dependencies on premises.
- Experience building disaster recovery solutions is preferred.
**Preferred**
- Five or more years of experience working with middleware technologies such as Kafka/RabbitMQ, Spring Boot, Redis, Elasticsearch, MySQL, and etcd.
- Automation experience and the ability to code or script at an advanced level.
- Experience in cloud and container platform strategy, design, architecture, and migration.
- Experience designing and implementing CI/CD DevOps solutions with Jenkins pipelines, Python, Git, Shell, YAML, Kubernetes, and Docker.
- Configuration Management experience with Chef, Puppet, Ansible or Python.
- Experience serving as both a mentor and advocate for your team.
- Experience performing analytics on previous incidents and usage patterns to better predict issues and take proactive actions.
DevOps/site Reliability Engineer (Remote
Posted today
Job Description
- Act as a cloud system admin (AWS primarily and knowledge of multi-cloud infrastructure).
- Monitoring and maintaining networks and servers.
- Creating and automating alerting and monitoring system logs.
- Building tools to mitigate weaknesses in incident management or software delivery.
- Troubleshooting Support Escalation requests.
- Upgrading, installing and configuring new hardware and software to meet company objectives.
- Implementing security protocols and procedures to prevent potential threats.
- Creating user accounts and performing access control.
- Performing diagnostic tests and debugging procedures to optimize computer systems.
- Documenting processes, as well as backing up and archiving data.
- Developing data retrieval and recovery procedures.
- Designing and implementing efficient end-user feedback and error reporting systems.
- Supervising and mentoring IT department employees, as well as providing IT support.
- Keeping up to date with advancements and best practices in IT administration.
**Requirements**:
- Bachelor's degree in Computer Science, Information Technology, Information Systems, or similar.
- Applicable professional qualification, such as Microsoft, Oracle, or Cisco certification.
- At least two years' experience in a similar role.
- Extensive experience with IT systems, networks, and related technologies.
- Solid knowledge of best practices in IT administration and system security.
- Exceptional leadership, organizational, and time management skills.
- Strong analytical and problem-solving skills.
- Excellent interpersonal and communication skills.
Token Metrics helps crypto investors build profitable portfolios using artificial intelligence based crypto indices, rankings, and price predictions.
Token Metrics has a diverse set of customers, from retail investors and traders to crypto fund managers, in more than 50 countries.
Site Reliability Engineer Lead (Linux)
Posted 7 days ago
Job Description
Our Site Reliability Engineer Lead (Linux) responsibilities will include, but are not limited to:
- Lead a team of experienced Site Reliability Engineers (Linux) in both Vietnam and the US.
- Design, deploy/install, configure, automate, and maintain systems infrastructure and applications.
- Investigate and troubleshoot system/application behavior and take the necessary action to fix it.
- Be the reference for UNIX-like OSes, Docker, and Kubernetes, and handle internal/external escalations and on-call duty.
- Being a technical expert for the team leader regarding activities and projects
- Understand customer demands and propose the best solutions.
- Automated provisioning, configuration management in cloud environments or on-premises
- CI/CD of applications
- Automate repetitive tasks and maintain scripts for the same.
- Work closely with the Project Manager to collect information and understand customers' needs, then help them adopt the right solution.
- Drive root cause analysis and implement permanent fixes across the landscape.
Site Reliability Engineer Lead (Linux)
Posted 15 days ago
Job Description
Our Site Reliability Engineer Lead (Linux) responsibilities will include, but are not limited to:
- Lead a team of experienced Site Reliability Engineers (Linux) in both Vietnam and the US.
- Design, deploy/install, configure, automate, and maintain systems infrastructure and applications.
- Investigate and troubleshoot system/application behavior and take the necessary action to fix it.
- Be the reference for UNIX-like OSes, Docker, and Kubernetes, and handle internal/external escalations and on-call duty.
- Being a technical expert for the team leader regarding activities and projects
- Understand customer demands and propose the best solutions.
- Automated provisioning, configuration management in cloud environments or on-premises
- CI/CD of applications
- Automate repetitive tasks and maintain scripts for the same.
- Work closely with the Project Manager to collect information and understand customers' needs, then help them adopt the right solution.
- Drive root cause analysis and implement permanent fixes across the landscape.