243 Condition Monitoring jobs in Vietnam

Reliability Engineer

Hanoi, Hanoi Microsoft Corporation

Posted 21 days ago

Job Description

Microsoft is a world leader in the design of hardware devices and entertainment devices. We are currently looking for a creative and talented individual with a passion for technology to drive reliability and qualification of hardware products to advance Hardware's leadership position in exceeding our consumers' durability expectations.
This key position is in our Quality **Reliability Engineering** organization, based in Vietnam.
The ideal candidate will have a solid reliability and simulation background, process/manufacturing experience in the consumer electronics (electromechanical) industry, and effectiveness in supplier quality management, with in-depth knowledge of reliability testing methodology and reliability analysis.
To qualify for this exciting opportunity, this candidate must possess effective communication, organizational, technical and documentation skills. You must function well in a fast-paced collaborative environment and be able to apply critical thinking and strong problem solving skills to complex production environment scenarios to ensure high availability.
Microsoft's mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.
**Responsibilities**
+ Candidate will be responsible for monitoring product performance in the field and will work closely with manufacturing partners and component vendors to perform failure analysis and drive corrective actions.
+ Provide reliability guidance to Contract manufacturers and suppliers for release to manufacture phase and lab qualifications.
+ Develop suppliers to set up On-Going Reliability tests to monitor mass production.
+ Work with China and Redmond Reliability teams to develop and to document reliability qualification plans for new products.
+ Manage multiple design qualification activities and development schedules to improve the quality of products.
+ Evaluate and Drive effectiveness of the reliability stresses or resolve reliability issues related to products.
+ Proactively drive root cause investigation of reliability failures and work with cross-functional teams for issues closures.
+ Participate in component vendor selection activity and drive component qualification activity for components that are critical and strategic to Microsoft product requirements.
+ Understanding of the technology, materials and failure mechanisms associated with major electronic and electro-mechanical components/materials.
+ Use knowledge of process capability for electronic component production as well as system-level performance requirements to establish Critical-to-Quality performance metrics.
+ 0-25% overseas travel as needed.
**Qualifications**
**Required Qualifications:**
+ Master's Degree in Mechanical Engineering, Materials Engineering, Reliability Engineering, Electrical Engineering, or related field AND 2+ years technical engineering experience OR Bachelor's Degree in Mechanical Engineering, Materials Engineering, Reliability Engineering, Electrical Engineering, or related field AND 3+ years technical engineering experience OR 7+ years technical engineering experience.
+ Solid experience working with suppliers to set up reliability labs and run qualification plans during the development and sustaining phases.
+ Familiar with all the various Environmental, Mechanical Reliability test methodologies in ASTM /IEEE Industry Standards and understand basics of.
+ Solid experience in hardware verification, PCBA and Box Build Assemblies process controls and quality controls.
+ Effective English communication skills, verbal and written.
**Preferred Qualifications:**
+ Statistical analysis skills; familiar with tools such as Minitab or Weibull.
+ Understanding of PoF (physics of failure) with good basic failure analysis knowledge.
+ DFMEA experience.
+ Effective communication and collaboration skills to work with people from a variety of technical backgrounds.
Microsoft is an equal opportunity employer. Consistent with applicable law, all qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application process, read more about requesting accommodations.

Reliability Engineer

Hanoi, Hanoi Công Ty TNHH AMREP Việt Nam

Posted today

Job Description

**Job description**:
(Salary: Negotiable)
Work with cross-functional team to overall manage suppliers' quality and project development and management.
Take the lead in improving suppliers' performance and developing key suppliers.
New supplier audit, evaluation and training & qualification.
Identify and report on potential failures within a process
Designing new systems and performing predictive analysis
Planning performance evaluation assessments
Conduct Failure mode and effects analysis (FMEA), Reliability hazard analysis, Dynamic reliability block-diagram analysis, Fault tree analysis, Accelerated testing, Avoidance of single point of failure (SPOF)
Perform Root cause analysis and create action plan on corrective actions
Perform functional analysis and functional failure analysis (e.g., function FMEA, FHA or FFA)
Conduct Operational hazard analysis
Creating and monitoring life cycle asset management plans
Develop the manufacturing & Test process / equipment / materials for FOL (Front Of Line) and EOL (End Of Line).
Provide technical leadership in resolving any process- or test-related issues.
Responsible for developing DOEs, Statistical Process Analysis, process specification, D/PFMEA and process control plan.
Process capability analysis, validation, improvement and qualification.
Inspection, measurement, tester validation and qualification through MSA.

**Position level**: Staff/Specialist

**Employment type**: Full-time

**Benefits**:
Interesting bonus scheme
Employee Benefit insurance
A great working environment where you are working with the best people in the industry

**Minimum education requirement**: University degree

**Job requirements**:
Bachelor's degree in Electrical/Electronics/Electro-Mechanical engineering or equivalent practical experience.
At least 6 years of work experience in Quality/Reliability Engineering and manufacturing engineering.
Experience with electronic products such as smartphones, tablets, PCs, and smart wearable devices, and their manufacturing processes.
Good logical thinking, quality knowledge and process methodology.
Excellent communication and coordination skills.
Good written and spoken English skills.

**Gender requirement**: Male/Female

**Industry**: Electromechanical, Electronics

University degree
No requirement

Reliability Engineer, eero

Hanoi, Hanoi Amazon

Posted 7 days ago

Job Description

Description
The Role:
We are looking for a Reliability Engineer who is passionate about and takes great pride in launching high-quality, reliable products into the consumer market. The position will collaborate with cross-functional team members to establish product design and performance validation test methodologies and performance specifications to ensure that the product is ready for production. The ideal candidate will be responsible for system reliability testing, packaging reliability testing, accessories reliability testing, reliability calculations, statistical analysis, performance tests and field analysis of eero products from prototypes to mass production. You will partner with the Packaging Engineering, Accessories team, Product Management, Development Engineering, Material Sourcing, Manufacturing Engineering, Strategic Product Development, Manufacturing Partners and Component Suppliers to achieve key product quality, cost, and reliability goals. Specifically, this person will work with eero cross-functional engineers on new and sustaining product reliability test creation, assessments and acceptance criteria, identify critical field issues, and actively implement corrective and preventive actions with partnered CMs, JDMs and/or ODMs based in Asia.
What you'll do:
● Perform system reliability testing, packaging reliability testing, and accessories reliability testing; review testing reports and highlight the reliability results to the cross-functional team (Product Design, Hardware team, Packaging team, Design & Development, Product, Operations).
● Develop system, packaging and accessories reliability plans with goals and quantifiable results: ISTA, MTBF level, ALT, 85C/85RH etc. tests.
● Perform DFR (Design for Reliability Reviews) and DFMEA (Design Failure Mode Effects Analysis) reviews by partnering with Engineering and Manufacturers to achieve key reliability goals (i.e. design margin analysis, preferred parts, suppliers, component/system, alternative components or technologies).
● Execute reliability qualification plans by driving external labs and leveraging internal resources.
● Support system level products (routers) reliability testing, DOEs, and studying/developing new test cases.
● Verify suppliers' reliability calculations and tests at component level in partnership with Component, Supplier Quality and Supply Chain engineers.
● Write Engineering Verification Test plans, execute plans, and create test reports.
● Define a set of production reliability tests and methodologies (packaging, accessories and products), such as ORT, ESS, FMEA, DFX, etc. in order to ensure field reliable parts and products.
● Apply metrics for monitoring the field reliability performance and dynamically act on the findings with corrective actions, using best-in-class methodologies such as 8D, fish bone, 6 Sigma, DMAIC, FMEA, and SPC.
● Identify field trends and set up alerts using applications such as Weibull+ or JMP to perform sound analysis and predictions based on field data.
● Report on findings at core team meetings to reach consensus on the actions to be applied.
● Report critical issues and findings to executive leadership for directions and/or escalations.
● Analyze failures from field, production and qualification tests, providing improvement suggestions based on the failure mechanisms and root causes to Development Engineering, Manufacturers and Customer Support.
● Create a culture of continuous improvement at eero and inspire best practices by writing guidelines, providing feedback, solutions, applying innovative metrics and measurements, planning DOEs, and benchmarking the state of the art in comparable industries, technologies and companies.
Basic Qualifications
● Technical Degree (BSEE, BSME, BSCS, Physics, Industrial Engineering, other)
● 8+ years of combined experience in Packaging, Accessories and Product Reliability Engineering and Testing for New Product Introductions and Sustaining.
● 5-10 years of combined experience in consumer electronics manufacturing; experience with Sensors, RF and/or Wi-Fi based products will be a plus.
● Experience with industry standards (ISTA, IEC, UL, ASTM, ANSI, TUV, ISO, IPC, MIL, etc.)
● Demonstrated excellent leadership, communication, interpersonal skills.
● Results driven, team player, proven ability to influence design teams and cross-company teams.
● Must have the ability to thrive in a fast-paced, team-oriented environment.
● Familiarity with documentation required for manufacturing assembly & test of RF systems, particularly BOMs, Schematics, Block Diagrams, Release Notes, System Requirement Documents, Assembly Instructions, MFG Test Instructions.
● Ability to work independently on testing & diagnosis of system failures down to the board level and component level product, including test, debug and repair. Work with Design Engineers to resolve issues via reliability test results.
● Strong analytical, technical, problem-solving skills.
● Strong verbal & written communication skills, excellent interpersonal skills, ability to work in a variety of locations (office, external labs, customer sites, contract manufacturers)
Preferred Qualifications
● Familiarity with various operating systems (Mac, Windows, Linux) both GUI and Command Line
● Familiarity with various programming languages and tools (LabVIEW, C/C++/C#, Excel VBA, Python, HTML/XML, scripts, batch files, Weibull+, JMP) and test equipment such as strain gauges, environmental chambers, impact and vibration testers, power supplies, salt spray and UV testers, etc.
Our inclusive culture empowers Amazonians to deliver the best results for our customers. If you have a disability and need a workplace accommodation or adjustment during the application and hiring process, including support for the interview or onboarding process, please visit for more information. If the country/region you're applying in isn't listed, please contact your Recruiting Partner.

Senior Site Reliability Engineer

100000 An Cu, An Giang WhatJobs

Posted 2 days ago

Job Description

full-time

Our client is looking for a highly experienced Senior Site Reliability Engineer (SRE) to join their growing team based in **Hanoi, Hanoi, VN**. This critical role focuses on ensuring the availability, performance, scalability, and security of our client's production systems and services. You will be responsible for designing, building, and operating large-scale, distributed systems, automating infrastructure management, and implementing robust monitoring and alerting solutions. The ideal candidate will have a strong background in systems engineering, software development, and a deep understanding of cloud computing platforms and DevOps practices. You will work closely with development teams to foster a culture of reliability and ownership throughout the software lifecycle. Responsibilities include defining SLOs/SLIs, managing incident response, conducting post-mortems, and driving initiatives to reduce toil and improve system resilience. This position requires hands-on expertise with infrastructure as code (IaC) tools, containerization technologies, and CI/CD pipelines. Collaboration, communication, and a proactive approach to problem-solving are essential. You will be instrumental in maintaining the high availability and performance standards that our users expect. This role offers a significant opportunity to impact the core infrastructure of a dynamic technology company.

Responsibilities:

Design, implement, and maintain highly available and scalable production systems.
Develop and manage infrastructure automation using tools like Terraform, Ansible, or Chef.
Implement and manage container orchestration platforms (e.g., Kubernetes, Docker Swarm).
Set up and maintain robust monitoring, logging, and alerting systems (e.g., Prometheus, Grafana, ELK stack).
Lead incident response efforts, troubleshoot complex issues, and conduct thorough post-mortems.
Define and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs).
Automate operational tasks and reduce manual intervention (toil reduction).
Collaborate with development teams to ensure the reliability and performance of new features and services.
Participate in on-call rotation to provide 24/7 support for critical systems.
Contribute to capacity planning and performance tuning.
Ensure security best practices are implemented across the infrastructure.
Document system architecture, operational procedures, and incident reports.

Qualifications:

Bachelor's degree in Computer Science, Engineering, or a related field; Master's degree is a plus.
5+ years of experience in Site Reliability Engineering, DevOps, or Systems Engineering.
Proven experience with cloud platforms such as AWS, Azure, or GCP.
Expertise in scripting languages (e.g., Python, Go, Bash).
Strong understanding of networking concepts (TCP/IP, HTTP, DNS, load balancing).
Experience with CI/CD tools and practices (e.g., Jenkins, GitLab CI).
Familiarity with containerization technologies (Docker, Kubernetes).
Excellent troubleshooting, problem-solving, and analytical skills.
Strong communication and collaboration skills, with the ability to explain technical concepts clearly.
Experience with databases (SQL and NoSQL) and their administration.
On-call experience and ability to work under pressure.

This on-site role in Hanoi is crucial for maintaining our robust technological infrastructure.

Senior Site Reliability Engineer

Hanoi, Hanoi Optimizely

Posted today

Job Description

Optimizely is focused on unlocking the boundless potential of our clients and employees. We are a category leader in Digital Experience Platform (DXP) and have the pleasure of serving over 9,000 brands, from global organizations such as Visa, Sky, Yamaha, and Wall Street Journal to tech innovators like Atlassian, DocuSign, FitBit, and Zillow.

Optimizely fosters an inclusive and diverse culture with a global team of 1500+ people spread across the US, Europe, Dubai, Australia, Singapore, Bangladesh, and Vietnam. Our unique work environment focuses on flexibility, trust, teamwork, diversity, and moving fast.

We genuinely believe that our people make all the difference, and once we find the best talent, we go out of our way to nurture them. If you are looking to work on the next generation of digital technologies in a fast-paced and growing environment with industry leaders, Optimizely is the place for you!

**Introduction**:
**Responsibilities**:

- Define a roadmap for all engineering teams to utilize fully automated, self-service, highly scalable, cost-efficient, observable, auditable and reliable infrastructure services as standard practice.
- Drive the execution of this roadmap across the engineering organization, collaborating with SREs and senior engineers across engineering while also performing hands-on work on the most critical challenges.
- Provide expert technical guidance and ongoing engineering design review to teams planning and implementing large migrations, service-oriented architecture, broad architectural shifts, and capacity growth.
- Build a metrics-driven operational culture standardizing our practices for SLO definition and review as well as for logging, monitoring, alerting, and on-call practices.
- Make iterative improvements to blameless incident management processes, root cause analyses, outage prevention, and service recovery strategies across the engineering organization.
- Partner closely with Security, Quality, and Product teams to achieve high priority security, privacy, compliance, reliability, and business-continuity objectives on our overall roadmap.
- Propose and drive large improvements to production systems to achieve a significant impact to our business and engineering teams.
- Mentor and coach engineers to be curious and effective at discovering and solving technical challenges.

**Knowledge and Experience**:

- You have proven experience (6+ years) demonstrating hands-on technical leadership and business impact in combining software engineering skills with systems engineering skills to solve complex automation and reliability challenges.
- You have deep technical experience with various cloud providers, containerization technologies, automated deployment frameworks, orchestration frameworks, monitoring, logging, alerting, system internals, networking, databases, distributed systems, and service-oriented architecture.
- You have the skills to implement load, stress, performance, and reliability testing standards at scale to improve service, platform, and infrastructure resiliency.
- You promote openness, diversity of opinions, and inclusive discussions at all times to evaluate a wide variety of ideas and perspectives in solving challenging problems.
- You demonstrate clear decision making and good trade-offs in complex situations comprising multiple opinions, needs, teams, technologies, cloud providers, and architectural settings.
- You communicate effectively with stakeholders ranging from executives to junior engineers across the breadth and depth of the engineering organization.
- You exemplify high accountability, integrity, and resilience to maintain focus on both big-picture goals and milestones to get there.
- You enable the engineering organization to innovate and deliver with greater speed and safety.

**Education**:
BS CS or equivalent industry experience

**Competencies**:

- Displaying Technical Expertise
- Critical Thinking
- Testing and Troubleshooting
- Demonstrating Initiative
- Utilizing Feedback

**About Us**:

- 5 working days /week with flexible working time and no overtime.
- Annual luxury Kick-off vacation.
- International, professional, creative working environment and talented teams
- Onsite opportunities in Europe and US.
- Common cultural, sports, and art clubs and activities, sponsored and/or supported by the Company (e.g., football, gym, swimming, guitar, English).
- Powerful workstation: Core i7-9700, 16-32 GB RAM, 02 x QHD 2560x1440 monitors (2K resolution).
- 100% official salary during the probation period, 13th month salary, annual salary raises.
- 12 days of annual leave and 3 days of company holidays (New Year eve 31/12, Juneteenth day 18/6, Work Anniversary)
- Up to 03 extra paid-leave days per year.
- Social, Health and Unemployment Insurance are based on 100% of gross salary and fully paid by the Company.
- Extra bonus of $60 per special occasion (Birthday, Labor Day, National Day, Solar New Year, Lunar New Year).
- Lunch allowance at $30 per month.

Remote Lead Site Reliability Engineer

30000 Haiphong, Haiphong WhatJobs

Posted today

Job Description

full-time

Our client is seeking an experienced and highly skilled Lead Site Reliability Engineer to join their dynamic and innovative team. This is a fully remote position, allowing you to work from anywhere within our operational framework. You will play a critical role in ensuring the availability, performance, scalability, and security of our client's cutting-edge digital platforms and infrastructure. As a Lead SRE, you will be responsible for designing, implementing, and maintaining robust systems, automating operational tasks, and developing strategies to prevent downtime and resolve complex technical issues. This role involves deep collaboration with development, QA, and operations teams to foster a culture of shared responsibility for reliability. You will mentor junior engineers, contribute to architectural decisions, and champion SRE best practices. Your expertise in cloud technologies, containerization, and infrastructure-as-code will be essential. We are looking for a proactive problem-solver who thrives in a challenging, fast-paced environment and is passionate about building resilient systems. This remote-first role emphasizes asynchronous communication and effective collaboration across distributed teams.
Responsibilities:

Design, build, and maintain scalable and reliable production systems.
Develop and implement automation strategies for deployment, monitoring, and incident response.
Identify and address performance bottlenecks and proactively mitigate risks.
Lead troubleshooting efforts and conduct post-mortems for incidents.
Collaborate with software engineers to ensure reliability is designed into new features.
Develop and maintain system monitoring, alerting, and logging infrastructure.
Manage CI/CD pipelines and optimize deployment processes.
Mentor and guide junior SRE team members.
Contribute to architectural discussions and technology selection.
Ensure system security and compliance with industry standards.

Qualifications:

Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
5+ years of experience in Site Reliability Engineering, DevOps, or System Administration.
Expertise in cloud platforms such as AWS, Azure, or GCP.
Proficiency in at least one scripting language (e.g., Python, Go, Bash).
Experience with containerization technologies like Docker and Kubernetes.
Strong understanding of networking concepts (TCP/IP, DNS, HTTP).
Experience with infrastructure-as-code tools (e.g., Terraform, Ansible).
Proven ability to diagnose and resolve complex system issues.
Excellent communication and collaboration skills for remote teamwork.
Experience with monitoring tools (e.g., Prometheus, Grafana, ELK stack).

This is an exceptional opportunity to shape the future of our client's infrastructure.

Senior Site Reliability Engineer (Remote)

500000 Hoa Sơn WhatJobs

Posted today

Job Description

full-time

Our client is seeking a highly experienced Senior Site Reliability Engineer (SRE) to join their innovative technology team on a fully remote basis. In this critical role, you will be instrumental in ensuring the reliability, scalability, and performance of our production systems and infrastructure. You will design, build, and maintain robust systems, automate operational tasks, and implement best practices in site reliability engineering. The ideal candidate will have a deep understanding of distributed systems, cloud computing platforms (e.g., AWS, GCP, Azure), and extensive experience with infrastructure as code (IaC) tools, CI/CD pipelines, and monitoring solutions. Your responsibilities will include developing and implementing strategies to improve system availability, latency, and efficiency; proactively identifying and resolving performance bottlenecks; and leading incident response efforts to minimize downtime. This remote-first position requires a proactive, analytical, and collaborative mindset. You will be expected to work autonomously, manage complex technical challenges, and mentor junior engineers. We are looking for an individual with a strong coding background (e.g., Python, Go, Java) and a passion for automation and operational excellence. You will contribute to the design of resilient architectures, participate in capacity planning, and drive improvements in our observability stack. Your expertise will be crucial in maintaining the stability and performance of our critical services. The ability to effectively communicate technical concepts and solutions to diverse audiences is essential. This is an exceptional opportunity to work with cutting-edge technologies, solve challenging problems, and shape the future of our platform in a flexible, remote work environment.

Responsibilities:

Design, implement, and manage highly available and scalable systems.
Develop and maintain infrastructure automation tools and scripts.
Build and manage CI/CD pipelines for efficient software deployment.
Implement and optimize monitoring, alerting, and logging systems.
Lead incident response and conduct post-mortems to prevent future issues.
Collaborate with development teams to ensure system reliability and performance.
Conduct capacity planning and performance tuning.
Automate operational tasks and reduce manual toil.
Contribute to the design and architecture of new systems and features.
Mentor junior SREs and share best practices.

Qualifications:

Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
5+ years of experience in Site Reliability Engineering, DevOps, or a similar role.
Strong experience with cloud platforms such as AWS, Azure, or GCP.
Proficiency in scripting and programming languages like Python, Go, or Java.
Experience with containerization technologies (Docker, Kubernetes).
Expertise in infrastructure as code (IaC) tools (Terraform, Ansible).
Knowledge of monitoring tools (Prometheus, Grafana, Datadog).
Strong understanding of networking, operating systems, and distributed systems.
Excellent problem-solving, analytical, and debugging skills.
Ability to work effectively in a remote team and manage complex projects.

Senior Site Reliability Engineer (SRE)

25000 Thai Binh, Thai Binh WhatJobs

Posted today

Job Description

full-time

Our client is seeking a highly experienced Senior Site Reliability Engineer (SRE) to ensure the performance, scalability, and reliability of their critical infrastructure and applications. This is a fully remote position, enabling you to contribute to our robust systems from any location. The ideal candidate will have a deep understanding of distributed systems, cloud computing, automation, and operational excellence. You will be responsible for designing, building, and maintaining highly available and fault-tolerant systems, as well as proactively identifying and resolving potential issues.

Key Responsibilities:

Design, implement, and manage scalable and reliable cloud-based infrastructure (e.g., AWS, Azure, GCP).
Develop and maintain automation tools and scripts for deployment, monitoring, and incident management.
Implement and enforce best practices for system monitoring, alerting, and logging.
Participate in on-call rotation to respond to and resolve production incidents.
Conduct root cause analysis for production issues and implement preventative measures.
Collaborate with development teams to improve application reliability and performance throughout the software development lifecycle.
Manage and optimize CI/CD pipelines for efficient and safe software deployments.
Develop and maintain infrastructure as code (IaC) using tools like Terraform or Ansible.
Contribute to capacity planning and performance tuning of systems.
Document system architecture, operational procedures, and incident post-mortems.
Stay current with emerging technologies and industry best practices in SRE and cloud computing.
Mentor junior engineers and promote a culture of reliability and operational excellence.

The successful candidate will possess strong troubleshooting and problem-solving skills, with a proactive approach to anticipating and preventing system failures. Excellent communication and collaboration abilities are essential for working effectively with distributed teams. A deep understanding of system architecture, networking, and security principles is required.

Qualifications:

Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
Minimum of 6 years of experience in system administration, DevOps, or Site Reliability Engineering.
Proficiency with cloud platforms (AWS, Azure, or GCP) and containerization technologies (Docker, Kubernetes).
Strong scripting skills (e.g., Python, Bash, Go).
Experience with monitoring tools (e.g., Prometheus, Grafana, Datadog) and logging systems (e.g., ELK stack).
Familiarity with CI/CD tools and practices (e.g., Jenkins, GitLab CI).
Solid understanding of networking concepts (TCP/IP, DNS, HTTP).
Experience with configuration management tools (e.g., Ansible, Chef, Puppet).
Ability to work independently and manage priorities in a remote, fast-paced environment.

Join our client's cutting-edge engineering team and contribute to building and maintaining world-class, reliable systems. This remote role offers a challenging and rewarding opportunity for passionate SRE professionals.

Remote Senior Site Reliability Engineer

200000 Phuong Son WhatJobs

Posted 2 days ago

Job Description

full-time

Our client is seeking an experienced and highly motivated Senior Site Reliability Engineer to join their distributed, fully remote team. This role is critical for ensuring the availability, performance, scalability, and security of our client's production systems and infrastructure. You will be responsible for designing, implementing, and automating solutions that enhance system reliability, operational efficiency, and disaster recovery capabilities. Working in a remote-first environment, you'll collaborate with development and operations teams to proactively identify and address potential issues before they impact users. The ideal candidate will have a deep understanding of system administration, networking, cloud computing (preferably AWS or GCP), and infrastructure-as-code principles. You will play a key role in defining and implementing SRE best practices, including monitoring, alerting, capacity planning, and incident response. This is a challenging opportunity to contribute to a high-growth technology company, work with modern tools and technologies, and make a significant impact on the stability and performance of our services. We encourage candidates who are passionate about automation, system resilience, and continuous improvement to apply. Your expertise in scripting languages (Python, Bash), containerization (Docker, Kubernetes), and CI/CD pipelines will be essential for success.

Key Responsibilities:

Design, build, and maintain reliable, scalable, and high-performance infrastructure.
Develop and implement automation for operational tasks, deployments, and incident response.
Monitor system health, performance, and availability, and establish effective alerting mechanisms.
Participate in on-call rotations and manage production incidents.
Conduct root cause analysis for production issues and implement preventative measures.
Manage cloud infrastructure resources and optimize for cost and performance.
Collaborate with software engineering teams to improve the reliability and deployability of applications.
Develop and maintain infrastructure-as-code using tools like Terraform or Ansible.
Perform capacity planning and performance tuning.
Contribute to disaster recovery planning and testing.

Qualifications:

Bachelor's degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience.
5+ years of experience in Site Reliability Engineering, DevOps, or Systems Engineering.
Proven experience with cloud platforms such as AWS, GCP, or Azure.
Strong proficiency in at least one scripting language (e.g., Python, Go, Bash).
Hands-on experience with containerization technologies like Docker and orchestration tools like Kubernetes.
Solid understanding of networking concepts (TCP/IP, DNS, HTTP, load balancing).
Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
Familiarity with infrastructure-as-code tools (e.g., Terraform, Ansible, Chef, Puppet).
Experience in building and managing CI/CD pipelines.
Excellent problem-solving skills and the ability to work under pressure.
Strong communication and collaboration skills, especially in a remote environment.

This position is primarily associated with **Thai Nguyen, Thai Nguyen, VN**, but is a fully remote role, allowing you to work from anywhere.

Principal Site Reliability Engineer (ZaloPay)

Ho Chi Minh City VNG

Posted today

Job Description

- Implement SRE automation across the stack and reduce operational hours by cutting manual work.
- Eliminate toil through automation at every layer: infrastructure provisioning, configuration management, deployment, testing, and operations, both on premises and in public clouds (Google Cloud and AWS).
- Retool our infrastructure to provide an agile, cloud-based foundation with a common infrastructure management and automation framework.
- Interface directly with senior staff members within the organization to discuss and assess compliance with IT policies, standards, and procedures, suggest opportunities for improvement, and report on the status of specific.
- Work with development teams throughout the software life cycle to ensure sustainable software releases.
- Practice sustainable incident response and blameless postmortems.
- Help build a methodology to manage infrastructure and platform costs.
- Train junior SRE team members.
- Manage a small SRE team (4-6 members) to drive automation, scalability, high availability, and performance of ZaloPay.

**Requirements**:

- Bachelor’s degree with five or more years of work experience.
- Six or more years of SRE relevant work experience.
- Experience in systems architecture, with in-depth knowledge of SRE, IT operations, and cloud; coding and scripting experience with Golang, Java, and Python; and experience with automation tools such as Terraform and Ansible.
- Strong experience with Google Cloud and AWS environments, including working knowledge of standard cloud services, features, and tools, with certification in appropriate areas.
- Strong experience automating the provisioning of software dependencies on premises.
- Experience building disaster recovery solutions is preferred.

**Preferred**
- Five or more years of experience working with middleware technologies such as Kafka/RabbitMQ, Spring Boot, Redis, Elasticsearch, MySQL, and etcd.
- Automation experience and the ability to code or script at an advanced level.
- Experience in Cloud & Container platform Strategies, Design, Architecture and Migration.
- Experience designing and implementing CI/CD DevOps solutions with Jenkins pipelines, using Python, Git, Shell, YAML, Kubernetes, and Docker.
- Configuration Management experience with Chef, Puppet, Ansible or Python.
- Experience serving as both a mentor and advocate for your team.
- Experience performing analytics on previous incidents and usage patterns to better predict issues and take proactive actions.
