15 Site Reliability Engineer jobs in Vietnam

Senior Site Reliability Engineer

100000 An Cu, An Giang WhatJobs

Posted 2 days ago

Job Description

full-time

Our client is looking for a highly experienced Senior Site Reliability Engineer (SRE) to join their growing team based in **Hanoi, Hanoi, VN**. This critical role focuses on ensuring the availability, performance, scalability, and security of our client's production systems and services. You will be responsible for designing, building, and operating large-scale, distributed systems, automating infrastructure management, and implementing robust monitoring and alerting solutions. The ideal candidate will have a strong background in systems engineering, software development, and a deep understanding of cloud computing platforms and DevOps practices. You will work closely with development teams to foster a culture of reliability and ownership throughout the software lifecycle. Responsibilities include defining SLOs/SLIs, managing incident response, conducting post-mortems, and driving initiatives to reduce toil and improve system resilience. This position requires hands-on expertise with infrastructure as code (IaC) tools, containerization technologies, and CI/CD pipelines. Collaboration, communication, and a proactive approach to problem-solving are essential. You will be instrumental in maintaining the high availability and performance standards that our users expect. This role offers a significant opportunity to impact the core infrastructure of a dynamic technology company.

Responsibilities:

Design, implement, and maintain highly available and scalable production systems.
Develop and manage infrastructure automation using tools like Terraform, Ansible, or Chef.
Implement and manage container orchestration platforms (e.g., Kubernetes, Docker Swarm).
Set up and maintain robust monitoring, logging, and alerting systems (e.g., Prometheus, Grafana, ELK stack).
Lead incident response efforts, troubleshoot complex issues, and conduct thorough post-mortems.
Define and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs).
Automate operational tasks and reduce manual intervention (toil reduction).
Collaborate with development teams to ensure the reliability and performance of new features and services.
Participate in on-call rotation to provide 24/7 support for critical systems.
Contribute to capacity planning and performance tuning.
Ensure security best practices are implemented across the infrastructure.
Document system architecture, operational procedures, and incident reports.

Qualifications:

Bachelor's degree in Computer Science, Engineering, or a related field; Master's degree is a plus.
5+ years of experience in Site Reliability Engineering, DevOps, or Systems Engineering.
Proven experience with cloud platforms such as AWS, Azure, or GCP.
Expertise in scripting languages (e.g., Python, Go, Bash).
Strong understanding of networking concepts (TCP/IP, HTTP, DNS, load balancing).
Experience with CI/CD tools and practices (e.g., Jenkins, GitLab CI).
Familiarity with containerization technologies (Docker, Kubernetes).
Excellent troubleshooting, problem-solving, and analytical skills.
Strong communication and collaboration skills, with the ability to explain technical concepts clearly.
Experience with databases (SQL and NoSQL) and their administration.
On-call experience and ability to work under pressure.

This on-site role in Hanoi is crucial for maintaining our robust technological infrastructure.

Senior Site Reliability Engineer

Hanoi, Hanoi Optimizely

Posted today

Job Description

Optimizely is focused on unlocking the boundless potential of our clients and employees. We are a category leader in Digital Experience Platform (DXP) and have the pleasure of serving over 9,000 brands, from global organizations such as Visa, Sky, Yamaha, and Wall Street Journal to tech innovators like Atlassian, DocuSign, FitBit, and Zillow.

Optimizely fosters an inclusive and diverse culture with a global team of 1500+ people spread across the US, Europe, Dubai, Australia, Singapore, Bangladesh, and Vietnam. Our unique work environment focuses on flexibility, trust, teamwork, diversity, and moving fast.

We genuinely believe that our people make all the difference, and once we find the best talent, we go out of our way to nurture them. If you are looking to work on the next generation of digital technologies in a fast-paced and growing environment with industry leaders, Optimizely is the place for you!

**Responsibilities**:

- Define a roadmap for all engineering teams to utilize fully automated, self-service, highly scalable, cost-efficient, observable, auditable and reliable infrastructure services as standard practice.
- Drive the execution of this roadmap across the engineering organization, collaborating with SREs and senior engineers across engineering while also performing hands-on work on the most critical challenges.
- Provide expert technical guidance and ongoing engineering design review to teams planning and implementing large migrations, service-oriented architecture, broad architectural shifts, and capacity growth.
- Build a metrics-driven operational culture standardizing our practices for SLO definition and review as well as for logging, monitoring, alerting, and on-call practices.
- Make iterative improvements to blameless incident management processes, root cause analyses, outage prevention, and service recovery strategies across the engineering organization.
- Partner closely with Security, Quality, and Product teams to achieve high priority security, privacy, compliance, reliability, and business-continuity objectives on our overall roadmap.
- Propose and drive large improvements to production systems to achieve a significant impact to our business and engineering teams.
- Mentor and coach engineers to be curious and effective at discovering and solving technical challenges.

**Knowledge and Experience**:

- You have proven experience (6+ years) demonstrating hands-on technical leadership and business impact in combining software engineering skills with systems engineering skills to solve complex automation and reliability challenges.
- You have deep technical experience with various cloud providers, containerization technologies, automated deployment frameworks, orchestration frameworks, monitoring, logging, alerting, system internals, networking, databases, distributed systems, and service-oriented architecture.
- You have the skills to implement load, stress, performance, and reliability testing standards at scale to improve service, platform, and infrastructure resiliency.
- You promote openness, diversity of opinions, and inclusive discussions at all times to evaluate a wide variety of ideas and perspectives in solving challenging problems.
- You demonstrate clear decision making and good trade-offs in complex situations comprising multiple opinions, needs, teams, technologies, cloud providers, and architectural settings.
- You communicate effectively with stakeholders ranging from executives to junior engineers across the breadth and depth of the engineering organization.
- You exemplify high accountability, integrity, and resilience to maintain focus on both big-picture goals and milestones to get there.
- You enable the engineering organization to innovate and deliver with greater speed and safety.

**Education**:
BS CS or equivalent industry experience

**Competencies**:

- Displaying Technical Expertise
- Critical Thinking
- Testing and Troubleshooting
- Demonstrating Initiative
- Utilizing Feedback

**About Us**:

- 5 working days /week with flexible working time and no overtime.
- Annual luxury Kick-off vacation.
- International, professional, creative working environment and talented teams
- Onsite opportunities in Europe and US.
- Shared cultural, sports, and art clubs and activities, sponsored and/or supported by the Company (e.g., football, gym, swimming, guitar, English).
- Powerful workstation: Core i7-9700, 16-32 GB RAM, 02 x QHD 2560x1440 monitors (2K resolution).
- 100% official salary during the probation period, 13th month salary, annual salary raises.
- 12 days of annual leave and 3 days of company holidays (New Year's Eve 31/12, Juneteenth day 18/6, Work Anniversary).
- Up to 03 extra paid-leave days per year.
- Social, Health, and Unemployment Insurance are based on 100% of gross salary and fully paid by the Company.
- Extra bonus of $60 per special occasion (Birthday, Labor Day, National Day, Solar New Year, Lunar New Year).
- Lunch allowance of $30 per month.

Remote Lead Site Reliability Engineer

30000 Haiphong, Haiphong WhatJobs

Posted today

Job Description

full-time

Our client is seeking an experienced and highly skilled Lead Site Reliability Engineer to join their dynamic and innovative team. This is a fully remote position, allowing you to work from anywhere within our operational framework. You will play a critical role in ensuring the availability, performance, scalability, and security of our client's cutting-edge digital platforms and infrastructure. As a Lead SRE, you will be responsible for designing, implementing, and maintaining robust systems, automating operational tasks, and developing strategies to prevent downtime and resolve complex technical issues. This role involves deep collaboration with development, QA, and operations teams to foster a culture of shared responsibility for reliability. You will mentor junior engineers, contribute to architectural decisions, and champion SRE best practices. Your expertise in cloud technologies, containerization, and infrastructure-as-code will be essential. We are looking for a proactive problem-solver who thrives in a challenging, fast-paced environment and is passionate about building resilient systems. This remote-first role emphasizes asynchronous communication and effective collaboration across distributed teams.
Responsibilities:

Design, build, and maintain scalable and reliable production systems.
Develop and implement automation strategies for deployment, monitoring, and incident response.
Identify and address performance bottlenecks and proactively mitigate risks.
Lead troubleshooting efforts and conduct post-mortems for incidents.
Collaborate with software engineers to ensure reliability is designed into new features.
Develop and maintain system monitoring, alerting, and logging infrastructure.
Manage CI/CD pipelines and optimize deployment processes.
Mentor and guide junior SRE team members.
Contribute to architectural discussions and technology selection.
Ensure system security and compliance with industry standards.

Qualifications:

Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
5+ years of experience in Site Reliability Engineering, DevOps, or System Administration.
Expertise in cloud platforms such as AWS, Azure, or GCP.
Proficiency in at least one scripting language (e.g., Python, Go, Bash).
Experience with containerization technologies like Docker and Kubernetes.
Strong understanding of networking concepts (TCP/IP, DNS, HTTP).
Experience with infrastructure-as-code tools (e.g., Terraform, Ansible).
Proven ability to diagnose and resolve complex system issues.
Excellent communication and collaboration skills for remote teamwork.
Experience with monitoring tools (e.g., Prometheus, Grafana, ELK stack).

This is an exceptional opportunity to shape the future of our client's infrastructure.

Senior Site Reliability Engineer (Remote)

500000 Hoa Sơn WhatJobs

Posted today

Job Description

full-time

Our client is seeking a highly experienced Senior Site Reliability Engineer (SRE) to join their innovative technology team on a fully remote basis. In this critical role, you will be instrumental in ensuring the reliability, scalability, and performance of our production systems and infrastructure. You will design, build, and maintain robust systems, automate operational tasks, and implement best practices in site reliability engineering. The ideal candidate will have a deep understanding of distributed systems, cloud computing platforms (e.g., AWS, GCP, Azure), and extensive experience with infrastructure as code (IaC) tools, CI/CD pipelines, and monitoring solutions. Your responsibilities will include developing and implementing strategies to improve system availability, latency, and efficiency; proactively identifying and resolving performance bottlenecks; and leading incident response efforts to minimize downtime. This remote-first position requires a proactive, analytical, and collaborative mindset. You will be expected to work autonomously, manage complex technical challenges, and mentor junior engineers. We are looking for an individual with a strong coding background (e.g., Python, Go, Java) and a passion for automation and operational excellence. You will contribute to the design of resilient architectures, participate in capacity planning, and drive improvements in our observability stack. Your expertise will be crucial in maintaining the stability and performance of our critical services. The ability to effectively communicate technical concepts and solutions to diverse audiences is essential. This is an exceptional opportunity to work with cutting-edge technologies, solve challenging problems, and shape the future of our platform in a flexible, remote work environment.

Responsibilities:

Design, implement, and manage highly available and scalable systems.
Develop and maintain infrastructure automation tools and scripts.
Build and manage CI/CD pipelines for efficient software deployment.
Implement and optimize monitoring, alerting, and logging systems.
Lead incident response and conduct post-mortems to prevent future issues.
Collaborate with development teams to ensure system reliability and performance.
Conduct capacity planning and performance tuning.
Automate operational tasks and reduce manual toil.
Contribute to the design and architecture of new systems and features.
Mentor junior SREs and share best practices.

Qualifications:

Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
5+ years of experience in Site Reliability Engineering, DevOps, or a similar role.
Strong experience with cloud platforms such as AWS, Azure, or GCP.
Proficiency in scripting and programming languages like Python, Go, or Java.
Experience with containerization technologies (Docker, Kubernetes).
Expertise in infrastructure as code (IaC) tools (Terraform, Ansible).
Knowledge of monitoring tools (Prometheus, Grafana, Datadog).
Strong understanding of networking, operating systems, and distributed systems.
Excellent problem-solving, analytical, and debugging skills.
Ability to work effectively in a remote team and manage complex projects.

Senior Site Reliability Engineer (SRE)

25000 Thai Binh, Thai Binh WhatJobs

Posted today

Job Description

full-time

Our client is seeking a highly experienced Senior Site Reliability Engineer (SRE) to ensure the performance, scalability, and reliability of their critical infrastructure and applications. This is a fully remote position, enabling you to contribute to our robust systems from any location. The ideal candidate will have a deep understanding of distributed systems, cloud computing, automation, and operational excellence. You will be responsible for designing, building, and maintaining highly available and fault-tolerant systems, as well as proactively identifying and resolving potential issues.

Key Responsibilities:

Design, implement, and manage scalable and reliable cloud-based infrastructure (e.g., AWS, Azure, GCP).
Develop and maintain automation tools and scripts for deployment, monitoring, and incident management.
Implement and enforce best practices for system monitoring, alerting, and logging.
Participate in on-call rotation to respond to and resolve production incidents.
Conduct root cause analysis for production issues and implement preventative measures.
Collaborate with development teams to improve application reliability and performance throughout the software development lifecycle.
Manage and optimize CI/CD pipelines for efficient and safe software deployments.
Develop and maintain infrastructure as code (IaC) using tools like Terraform or Ansible.
Contribute to capacity planning and performance tuning of systems.
Document system architecture, operational procedures, and incident post-mortems.
Stay current with emerging technologies and industry best practices in SRE and cloud computing.
Mentor junior engineers and promote a culture of reliability and operational excellence.

The successful candidate will possess strong troubleshooting and problem-solving skills, with a proactive approach to anticipating and preventing system failures. Excellent communication and collaboration abilities are essential for working effectively with distributed teams. A deep understanding of system architecture, networking, and security principles is required.

Qualifications:

Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
Minimum of 6 years of experience in system administration, DevOps, or Site Reliability Engineering.
Proficiency with cloud platforms (AWS, Azure, or GCP) and containerization technologies (Docker, Kubernetes).
Strong scripting skills (e.g., Python, Bash, Go).
Experience with monitoring tools (e.g., Prometheus, Grafana, Datadog) and logging systems (e.g., ELK stack).
Familiarity with CI/CD tools and practices (e.g., Jenkins, GitLab CI).
Solid understanding of networking concepts (TCP/IP, DNS, HTTP).
Experience with configuration management tools (e.g., Ansible, Chef, Puppet).
Ability to work independently and manage priorities in a remote, fast-paced environment.

Join our client's cutting-edge engineering team and contribute to building and maintaining world-class, reliable systems. This remote role offers a challenging and rewarding opportunity for passionate SRE professionals.

Remote Senior Site Reliability Engineer

200000 Phuong Son WhatJobs

Posted 2 days ago

Job Description

full-time

Our client is seeking an experienced and highly motivated Senior Site Reliability Engineer to join their distributed, fully remote team. This role is critical for ensuring the availability, performance, scalability, and security of our client's production systems and infrastructure. You will be responsible for designing, implementing, and automating solutions that enhance system reliability, operational efficiency, and disaster recovery capabilities. Working in a remote-first environment, you'll collaborate with development and operations teams to proactively identify and address potential issues before they impact users. The ideal candidate will have a deep understanding of system administration, networking, cloud computing (preferably AWS or GCP), and infrastructure-as-code principles. You will play a key role in defining and implementing SRE best practices, including monitoring, alerting, capacity planning, and incident response. This is a challenging opportunity to contribute to a high-growth technology company, work with modern tools and technologies, and make a significant impact on the stability and performance of our services. We encourage candidates who are passionate about automation, system resilience, and continuous improvement to apply. Your expertise in scripting languages (Python, Bash), containerization (Docker, Kubernetes), and CI/CD pipelines will be essential for success.

Key Responsibilities:

Design, build, and maintain reliable, scalable, and high-performance infrastructure.
Develop and implement automation for operational tasks, deployments, and incident response.
Monitor system health, performance, and availability, and establish effective alerting mechanisms.
Participate in on-call rotations and manage production incidents.
Conduct root cause analysis for production issues and implement preventative measures.
Manage cloud infrastructure resources and optimize for cost and performance.
Collaborate with software engineering teams to improve the reliability and deployability of applications.
Develop and maintain infrastructure-as-code using tools like Terraform or Ansible.
Perform capacity planning and performance tuning.
Contribute to disaster recovery planning and testing.

Qualifications:

Bachelor's degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience.
5+ years of experience in Site Reliability Engineering, DevOps, or Systems Engineering.
Proven experience with cloud platforms such as AWS, GCP, or Azure.
Strong proficiency in at least one scripting language (e.g., Python, Go, Bash).
Hands-on experience with containerization technologies like Docker and orchestration tools like Kubernetes.
Solid understanding of networking concepts (TCP/IP, DNS, HTTP, load balancing).
Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
Familiarity with infrastructure-as-code tools (e.g., Terraform, Ansible, Chef, Puppet).
Experience in building and managing CI/CD pipelines.
Excellent problem-solving skills and the ability to work under pressure.
Strong communication and collaboration skills, especially in a remote environment.

This position is primarily associated with **Thai Nguyen, Thai Nguyen, VN**, but is a fully remote role, allowing you to work from anywhere.

Principal Site Reliability Engineer (ZaloPay)

Ho Chi Minh City VNG

Posted today

Job Description

- Implementing SRE automation, developing automation across the stack, and optimizing operations hours by reducing manual operations.
- Eliminating toil through automation across all layers: infrastructure provisioning, configuration management, deployment, testing, and operation, both on premises and in public clouds (Google Cloud and AWS).
- Working on retooling our infrastructure to provide an agile, cloud-based foundation with a common infrastructure management and automation framework.
- Interfacing directly with senior staff members within the organization to discuss and assess compliance with IT policies, standards and procedures, suggest opportunities for improvement, and report on the status of specific. Work with development teams throughout the software life cycle ensuring sustainable software releases.
- Practicing sustainable incident response and blameless postmortems.
- Helping to build a methodology to manage infrastructure and platform costs.
- Training junior SRE members.
- Managing a small SRE team (4-6 members) to drive automation, scalability, high availability, and performance of ZaloPay.

**Requirements**:

- Bachelor’s degree with five or more years of work experience.
- Six or more years of SRE relevant work experience.
- Experience in systems architecture; in-depth knowledge of SRE, IT operations, and cloud; coding and scripting experience with Golang, Java, and Python; and experience with automation tools (Terraform, Ansible).
- Strong experience with Google Cloud and AWS environments, with working knowledge of standard cloud services, features, and tools, and certification in appropriate areas.
- Strong experience automating the provisioning of dependency software on premises.
- Experience building disaster recovery solutions is preferred.

**Preferred**
- Five or more years of experience working with middleware technologies like Kafka/RabbitMQ, Spring Boot, Redis, Elasticsearch, MySQL, and etcd.
- Automation experience and ability to code or script at an advanced level.
- Experience in Cloud & Container platform Strategies, Design, Architecture and Migration.
- Experience designing and implementing CI/CD DevOps solutions with Jenkins pipelines, Python, Git, Shell, YAML, Kubernetes, and Docker.
- Configuration Management experience with Chef, Puppet, Ansible or Python.
- Experience serving as both a mentor and advocate for your team.
- Experience performing analytics on previous incidents and usage patterns to better predict issues and take proactive actions.

DevOps/Site Reliability Engineer (Remote)

Hanoi, Hanoi Token Metrics

Posted today

Job Description

**Responsibilities**:

- Act as a cloud system admin (AWS primarily and knowledge of multi-cloud infrastructure).
- Monitoring and maintaining networks and servers.
- Creating and automating alerting and monitoring system logs.
- Building tools to mitigate weaknesses in incident management or software delivery.
- Troubleshooting Support Escalation requests.
- Upgrading, installing and configuring new hardware and software to meet company objectives.
- Implementing security protocols and procedures to prevent potential threats.
- Creating user accounts and performing access control.
- Performing diagnostic tests and debugging procedures to optimize computer systems.
- Documenting processes, as well as backing up and archiving data.
- Developing data retrieval and recovery procedures.
- Designing and implementing efficient end-user feedback and error reporting systems.
- Supervising and mentoring IT department employees, as well as providing IT support.
- Keeping up to date with advancements and best practices in IT administration.

**Requirements**:

- Bachelor's degree in Computer Science, Information Technology, Information Systems, or similar.
- Applicable professional qualification, such as Microsoft, Oracle, or Cisco certification.
- At least two years' experience in a similar role.
- Extensive experience with IT systems, networks, and related technologies.
- Solid knowledge of best practices in IT administration and system security.
- Exceptional leadership, organizational, and time management skills.
- Strong analytical and problem-solving skills.
- Excellent interpersonal and communication skills.

Token Metrics helps crypto investors build profitable portfolios using artificial intelligence based crypto indices, rankings, and price predictions.

Token Metrics has a diverse set of customers, from retail investors and traders to crypto fund managers, in more than 50 countries.

Site Reliability Engineer Lead (Linux)

700000 Ho Chi Minh, Ho Chi Minh Aperia Solutions Vietnam Co Ltd

Posted 7 days ago

Job Description

full-time

The Site Reliability Engineer Lead (Linux) responsibilities include, but are not limited to:

Lead a team of experienced Site Reliability Engineers (Linux) in both Vietnam and the US.
Investigate and troubleshoot system and application behavior and take the action needed to fix issues.
Serve as the reference for UNIX-like OS, Docker, and Kubernetes, and handle internal and external escalations and on-call duty.
Serve as a technical expert for the team leader regarding activities and projects.
Understand customer demands and propose the best solutions.
Automate provisioning and configuration management in cloud environments or on-premises.
CI/CD of applications.
Automate repetitive tasks and maintain the scripts for them.
Work closely with the Project Manager to collect information and understand customers’ needs, then help them adopt the right solution.

Site Reliability Engineer Lead (Linux)

700000 Ho Chi Minh, Ho Chi Minh Aperia Solutions Vietnam Co Ltd

Posted 15 days ago

Job Description

full-time

The Site Reliability Engineer Lead (Linux) responsibilities include, but are not limited to:

Lead a team of experienced Site Reliability Engineers (Linux) in both Vietnam and the US.
Investigate and troubleshoot system and application behavior and take the action needed to fix issues.
Serve as the reference for UNIX-like OS, Docker, and Kubernetes, and handle internal and external escalations and on-call duty.
Serve as a technical expert for the team leader regarding activities and projects.
Understand customer demands and propose the best solutions.
Automate provisioning and configuration management in cloud environments or on-premises.
CI/CD of applications.
Automate repetitive tasks and maintain the scripts for them.
Work closely with the Project Manager to collect information and understand customers’ needs, then help them adopt the right solution.
