Detailed solution introduction
When training AI models, the massive parallel computing power of GPUs shortens training time. MyelinTek MLSteam fully supports today's highest-end GPU cards, such as NVIDIA's A100 and AMD's MI200 series, whether in the SXM4 form factor with NVLink or the OAM form factor with Infinity Fabric.
Hardware is of course important, but integrating it with software uses hardware resources more effectively and shortens the time spent building the software development environment.
For AI missions, MyelinTek addresses three tasks and four aspects for IT engineers, AI developers, and inferencing operators:
The 1st task – model training
Establish goals and use deep learning to teach the computer an inferencing mission, for example real-time translation or object detection.
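To make the "learning" step concrete, here is a minimal sketch in plain Python: gradient descent on a toy linear task. The data, learning rate, and epoch count are illustrative assumptions; a real training job would use a deep-learning framework on the GPU, but the loop has the same shape (predict, measure error, adjust parameters).

```python
# Minimal sketch of supervised training: gradient descent on a toy
# linear task (target function y = 2x + 1). The model "learns" the
# parameters w and b from examples rather than being programmed.
data = [(x, 2.0 * x + 1.0) for x in range(10)]  # (input, target) pairs
w, b = 0.0, 0.0   # model parameters, initialized to zero
lr = 0.01         # learning rate (a hyperparameter)

for epoch in range(2000):
    grad_w = grad_b = 0.0
    for x, y in data:
        err = (w * x + b) - y            # prediction error on one example
        grad_w += 2 * err * x / len(data)
        grad_b += 2 * err / len(data)
    w -= lr * grad_w                     # gradient-descent update
    b -= lr * grad_b

print(round(w, 2), round(b, 2))          # converges toward 2.0 and 1.0
```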
The 2nd task – model inferencing
The model generated after training is deployed to front-end devices for inferencing, for example defect detection, license plate recognition, or work-safety detection.
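A sketch of the inferencing side: the shipped artifact is essentially the learned parameters plus a forward pass. The weights and decision threshold below are hypothetical placeholders standing in for a real trained model file.

```python
# Sketch of inference on a front-end device. In practice the weights
# are loaded from the trained model artifact; these values and the
# defect threshold are hypothetical placeholders.
WEIGHTS = {"w": 2.0, "b": 1.0}
THRESHOLD = 9.0                  # decision threshold chosen at training time

def predict(x: float) -> float:
    """Forward pass of the toy linear model."""
    return WEIGHTS["w"] * x + WEIGHTS["b"]

def is_defect(sensor_reading: float) -> bool:
    """Flag a part as defective when the model score exceeds the threshold."""
    return predict(sensor_reading) > THRESHOLD

print(is_defect(3.0), is_defect(5.0))  # scores 7.0 and 11.0
```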
The 3rd task – model retraining
As data changes over time, model accuracy degrades: new concepts appear faster than they can be defined, new data keeps arriving, and the meaning of old concepts shifts. This phenomenon is called data drift. To counter it, transfer the newly collected data to the data center, adjust the definitions, and retrain to optimize the model and keep it accurate.
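One common way to decide when retraining is due is to score drift numerically. The sketch below implements the Population Stability Index (PSI), a standard drift metric; the rule of thumb that PSI above 0.2 signals significant drift, and the toy datasets, are illustrative assumptions rather than part of MLSteam itself.

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index: compares the binned distribution of
    newly collected data against the training data. A common rule of
    thumb treats PSI > 0.2 as significant drift worth retraining for."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0
    def hist(xs):
        counts = [0] * bins
        for x in xs:
            i = min(int((x - lo) / width), bins - 1)
            counts[i] += 1
        # tiny epsilon avoids log(0) when a bin is empty
        return [(c + 1e-6) / len(xs) for c in counts]
    e, a = hist(expected), hist(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

train = [i / 100 for i in range(100)]        # data the model was trained on
fresh = [0.5 + i / 200 for i in range(100)]  # newer, shifted data
print(psi(train, train) < 0.01, psi(train, fresh) > 0.2)  # True True
```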
1. Hardware aspect –
Integrate hardware resources by clustering the dataset storage systems, GPU computing systems, and management servers into a data-center-level cluster. IT or MIS staff can then monitor, allocate, and manage all hardware resources to maximize their utilization. Briefly speaking, the company gets the highest efficiency out of every penny spent on hardware.
2. Software aspect –
Integrate the tools required for model development, such as data-cleaning templates, labeling tools, development environments (such as JupyterLab for hyperparameter adjustment), and model templates (such as YOLO and CIFAR), so that model developers can start AI model training immediately. On the GPU side, we support NVIDIA CUDA and MIG as well as AMD ROCm, so developers are not bound to a single GPU vendor.
3. Time aspect –
Combining the above two points, administrators can set an upper limit on each user's hardware resources without causing resource contention, while developers can fine-tune hyperparameters, train several models at the same time, and select the best experimental result. In short, each developer's resources are isolated, so there are no queuing problems, and a single user can also run experiments with different parameters in parallel to save time and cost. In addition, the pipeline feature lets users duplicate the development process of a successful model, simplifying the tedious setup of the development environment. In other words, users can treat successful cases as SOPs and then fine-tune them to the needs of their own projects for rapid development.
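The pattern of running several hyperparameter experiments concurrently and keeping the best result can be sketched as below. The toy `train()` function and the learning-rate grid are illustrative assumptions standing in for real training jobs dispatched to the user's resource slice.

```python
from concurrent.futures import ThreadPoolExecutor

def train(lr, steps=500):
    """Toy stand-in for one training experiment: descend the loss
    (w - 3)^2 with the given learning rate; return (final loss, lr)."""
    w = 0.0
    for _ in range(steps):
        w -= lr * 2 * (w - 3.0)
    return (w - 3.0) ** 2, lr

grid = [0.5, 0.1, 0.01, 0.001]      # candidate learning rates
with ThreadPoolExecutor() as pool:  # experiments run concurrently
    results = list(pool.map(train, grid))

best_loss, best_lr = min(results)   # keep the best experimental result
print(best_lr)
```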
4. Deployment aspect –
When the trained model is deployed to front-end devices, the operator can make adjustments in a no-code or low-code manner, staying close to the inferencing site. In addition, for deployment to third-party operators, a model-protection mechanism can prevent full exposure of the model's know-how. Automatic retraining is also under development, to further lower the threshold for MLOps.