February 4, 2024

Operation

Half-Year Anniversary of the Launch of KINTO FACTORY: A Path of Challenge and Learning

Introduction

Hello, I am Nishida, a backend engineer at KINTO FACTORY.
Six months have passed since KINTO FACTORY launched. In this article, I would like to talk about the problems we have run into since we started operating the service, mainly around releases and system monitoring, and what we learned from them.

About KINTO FACTORY

I will start with a brief overview of KINTO FACTORY. The service lets you upgrade your vehicle with functions and items, both hardware and software, that suit it. With KINTO FACTORY, you can add items that previously could only be chosen when ordering a new vehicle, such as manufacturer options, or vehicle updates that would normally require a visit to a dealer. The service covers not only the KINTO lineup but also compatible Toyota, Lexus, and GR vehicles. Almost half a year has passed since it launched in the summer.

Operations

We release every two weeks, and our CI/CD pipeline and deployments are built mainly on GitHub Actions.
(Figure: deploy)

The service is monitored mainly by the development team members, who take turns, because there is no dedicated operations team or organization. We use PagerDuty's scheduling function to manage the rotation.
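
As a rough illustration of what such a rotation does, here is a minimal sketch of a round-robin on-call schedule. It is not how PagerDuty is implemented or configured. The member names, the start date, and the seven-day shift length are all assumptions made for the example.

```python
from datetime import date


def on_call_for(day: date, members: list[str], start: date, shift_days: int = 7) -> str:
    """Return who is on duty on `day` in a simple round-robin rotation.

    `start` is the first day of the first shift; each shift lasts `shift_days`.
    """
    if not members:
        raise ValueError("members must not be empty")
    if day < start:
        raise ValueError("day is before the rotation start")
    shift_index = (day - start).days // shift_days
    return members[shift_index % len(members)]


# Hypothetical team and start date, for illustration only.
members = ["alice", "bob", "carol"]
start = date(2024, 1, 1)
print(on_call_for(date(2024, 1, 10), members, start))
```

Because the person on duty is derived from the date alone, everyone can work out the same answer without a shared spreadsheet, which is the main benefit of using a scheduling tool.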

Incident detection is set up as shown in the illustration.

(Figure: Organization)

Application logs and service monitoring information are fed into OpenSearch, where user impact is determined and notifications are sent to PagerDuty. (The image shows the resulting notification.)
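
To make the flow concrete, here is a minimal Python sketch of the two steps: deciding whether a log record affects the user, and building a trigger event in the shape used by the PagerDuty Events API v2. In our setup this logic lives in OpenSearch, so the code is only a stand-in. The log fields (`level`, `status`, `service`, `message`), the rule that 5xx errors count as user impact, and the routing key are assumptions for the example.

```python
import json


def affects_user(log: dict) -> bool:
    """Assumed rule: an ERROR log with a 5xx status counts as user impact."""
    status = log.get("status")
    return (
        log.get("level") == "ERROR"
        and isinstance(status, int)
        and 500 <= status < 600
    )


def build_pagerduty_event(log: dict, routing_key: str) -> dict:
    """Build a trigger event body in the Events API v2 format."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": log["message"],
            "source": log["service"],
            "severity": "critical",
        },
    }


# Hypothetical log record and placeholder routing key.
log = {"level": "ERROR", "status": 503, "service": "order-api", "message": "Payment gateway unavailable"}
if affects_user(log):
    print(json.dumps(build_pagerduty_event(log, "YOUR_ROUTING_KEY"), indent=2))
```

The sketch only builds the event body and does not send anything over the network.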

After that, a notification is sent to Slack, and the person on monitoring duty responds. We try to get an expert involved when necessary. In addition, the details of the response are shared at the daily scrum the next day so that the whole team knows.

Challenges and Responses

Here are the problems we have encountered since we started operations and how we dealt with them.

Release preparation was a heavy burden

Because we had no templates, we had to prepare each release from scratch, which took time. In addition, the granularity of the tasks was uneven, and the release procedure was managed in Confluence, which made it cumbersome to review and modify.

⇒ Moving the release procedure manual into code management let us put it under version control, create templates, and review differences, which reduced the burden of preparing for a release.
(Figure: move_docs)
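
The idea can be sketched as follows: the procedure is rendered from a template, and the diff against the previous release shows exactly what changed. This is only an illustration using Python's standard library. The template steps, version numbers, and channel name are hypothetical and are not our actual procedure.

```python
import difflib
from string import Template

# Hypothetical release procedure template kept under version control.
RELEASE_TEMPLATE = Template("""\
# Release $version
1. Announce the release window in $channel
2. Deploy $version to staging and run smoke tests
3. Deploy $version to production
4. Confirm monitoring dashboards are green
""")


def render_procedure(version: str, channel: str) -> str:
    """Fill in the template for one release."""
    return RELEASE_TEMPLATE.substitute(version=version, channel=channel)


def procedure_diff(old: str, new: str) -> list[str]:
    """Return unified-diff lines between two rendered procedures."""
    return list(
        difflib.unified_diff(
            old.splitlines(), new.splitlines(),
            fromfile="previous", tofile="current", lineterm="",
        )
    )


previous = render_procedure("1.4.0", "#release")
current = render_procedure("1.5.0", "#release")
print("\n".join(procedure_diff(previous, current)))
```

In practice the version control system provides the diff for free; the point is that a text-based procedure makes changes reviewable in the same way as code.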

Incidents were detected frequently

The number of incidents within a few days after launch was as follows:

Because the error handling design on the application side was loose, errors were detected and treated as critical even when they did not affect the user. After we started operations, incidents occurred every day, and monitoring became a heavy burden.

⇒ Because there were too many incidents to fix all at once, we reviewed each detected error and tuned the notifications with exclusion settings: errors that were not urgent and could not be corrected immediately were excluded from notification.
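
A minimal sketch of this kind of exclusion filter is shown below. The `Alert` fields and the error codes in the exclusion list are hypothetical, and the rule that user-affecting alerts always notify is an assumption for the example rather than a description of our actual settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    error_code: str
    affects_user: bool
    message: str


# Hypothetical codes that were reviewed, judged not urgent, and could not
# be fixed immediately, so they are excluded from notification for now.
EXCLUDED_CODES = frozenset({"CACHE_MISS_RETRY", "UPSTREAM_TIMEOUT_RETRIED"})


def should_notify(alert: Alert, excluded_codes: frozenset = EXCLUDED_CODES) -> bool:
    """Notify unless the alert is a reviewed, non-urgent, excluded error."""
    if alert.affects_user:
        return True  # assumption: user-affecting errors are never suppressed
    return alert.error_code not in excluded_codes


alerts = [
    Alert("CACHE_MISS_RETRY", False, "cache miss, retried successfully"),
    Alert("PAYMENT_FAILED", True, "payment could not be completed"),
    Alert("UNKNOWN_ERROR", False, "unexpected exception"),
]
for alert in alerts:
    print(alert.error_code, should_notify(alert))
```

Unknown errors still notify by default, so the exclusion list only grows through an explicit review.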

Slow initial response to incidents

Because the response procedure and related procedures were not organized, members interpreted them differently, and responses sometimes took a long time. Some members were also not used to PagerDuty and forgot to update the incident status.

⇒ We set up an environment in which members can define workflows in Slack, start an incident response by entering a command, and handle incidents according to the procedure.

We also set up an "incident vs. team" structure in which every team member gathers and handles incidents as a mob while comparing each other's interpretations. As a result, we reduced the time from when an alert is raised to when someone responds to less than one tenth of what it was before.
(Figure: MTTA)
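
For reference, MTTA (mean time to acknowledge) is the average time between an alert being triggered and someone acknowledging it. The sketch below shows the calculation. The timestamps are made up for the example and are not our measured data.

```python
from datetime import datetime, timedelta


def mean_time_to_acknowledge(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average of (acknowledged_at - triggered_at) over all incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum((acked - triggered for triggered, acked in incidents), timedelta())
    return total / len(incidents)


# Hypothetical (triggered_at, acknowledged_at) pairs.
incidents = [
    (datetime(2024, 1, 10, 9, 0), datetime(2024, 1, 10, 9, 10)),
    (datetime(2024, 1, 11, 22, 0), datetime(2024, 1, 11, 22, 20)),
]
print(mean_time_to_acknowledge(incidents))
```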

Summary

We were busy dealing with a variety of problems right after the launch, but things have improved steadily, little by little. Looking back, members responded differently based on their experience and knowledge, and I feel that we had a smooth start because we planned in advance what kind of operations to do. We still have room for improvement, but we will keep improving our operations so that we can provide better services for our users!

Conclusion

In this article, I talked about the problems we encountered while operating the service and how we dealt with them. I hope that our experience can be of some help to you. KINTO FACTORY is also looking for new members, so if you are interested, please check out the following job opening!

We are hiring!

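[PdM (KINTO FACTORY)] Project Promotion Group / Tokyo

About KINTO FACTORY: This is a new service that upgrades vehicles in terms of both software and hardware. For customers who own Toyota and Lexus vehicles, it offers refurbishment, upgrades, personalization, and more through OTA and hardware updates, providing cars that keep evolving even after purchase. The role involves developing a cutting-edge service in the mobility industry.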