Site Reliability Engineer
Alianza
Software Engineering
Posted on Jan 23, 2026
You must have the right to work in Australia without requiring sponsorship, now or in the future.
A Site Reliability Engineer (SRE) is responsible for ensuring the reliability, performance, and scalability of Alianza’s Cloud Platform systems and infrastructure.
Key Objectives include:
- Run the production environment by monitoring availability and taking a holistic view of system health.
- Improve reliability, quality, and time-to-market of software solutions.
- Balance feature development speed and reliability with well-defined service-level objectives.
Key Responsibilities:
- Monitoring and Maintenance:
- Continuously monitor system health and performance, ensuring high availability and reliability of applications.
- Detect and automatically handle failures, and prepare disaster recovery plans.
- Automation and Improvement:
- Build and maintain software and systems to manage platform infrastructure and applications.
- Implement automation to reduce manual intervention and improve system efficiency.
- Performance Optimization:
- Measure and optimize system performance, pushing capabilities forward and innovating for continual improvement.
- Gather and analyze metrics from operating systems and applications to assist in performance tuning and fault finding.
- Collaboration and Consulting:
- Partner with development teams to improve services through rigorous testing and release procedures.
- Participate in system design consulting, platform management, and capacity planning.
- Incident Management:
- Provide primary operational support and engineering for multiple large-scale distributed software applications.
- Participate in on-call rotations to respond to incidents and ensure system reliability.
Competencies & Attributes
Competencies:
- Attention to Detail:
The ability to perform tasks with thoroughness and accuracy, ensuring all aspects of the system are meticulously managed.
- Problem-Solving Skills:
The capability to analyze complex issues, identify root causes, and develop effective solutions to ensure system reliability and performance.
- Technical Expertise:
Proficiency in understanding and applying technical knowledge related to infrastructure, code, and tools, which can be enhanced through continuous learning and experience.
- Automation Skills:
The ability to design and implement automation processes to reduce manual intervention and improve system efficiency.
- Communication Skills:
The ability to clearly convey ideas, strategies, and updates to various stakeholders, ensuring alignment and transparency across the organization.
Attributes:
- Meticulousness:
An inherent tendency to be precise and conscientious, ensuring high standards are maintained in all aspects of work.
- Resilience:
The innate ability to remain calm and composed under pressure, effectively managing stressful situations and leading the team through challenges.
- Curiosity:
A natural inclination to explore and learn new technologies and methodologies, driving innovation and continuous improvement.
- Empathy:
An inherent quality of understanding and valuing the perspectives and needs of team members and stakeholders, fostering a supportive and inclusive environment.
- Adaptability:
The ability to naturally adjust to changing circumstances and environments, ensuring effective responses to new challenges and opportunities.
Desired Skills/Qualifications
- Technical Proficiency:
- Understanding of high-level languages such as Python, Java, C/C++, Ruby, and JavaScript.
- Experience with distributed storage technologies and dynamic resource management frameworks.
- Experience with telco technology and Metaswitch software is a bonus.
- Problem-Solving Skills:
- Strong analytical skills to diagnose and resolve complex technical issues.
- Communication Skills:
- Excellent communication skills to collaborate effectively with cross-functional teams and convey technical concepts.
- Experience with Cloud Platforms:
- Hands-on experience with cloud platforms like AWS, GCP, or Azure. Understanding cloud-native applications and services is vital for modern SRE roles.
- Knowledge of Networking and Distributed Systems:
- Strong understanding of networking fundamentals and experience with distributed systems technologies such as Kafka and Kubernetes, including stream-processing platforms. This knowledge helps in managing large-scale, complex systems.