Only watched the demo, but judging from the fact that there are several agent-decided steps in the whole model generation process, I think it'd be useful for Plexe to ask the user in between whether they're happy with the plan for the next steps, so it's more interactive and not just a single, large one-shot.
E.g. telling the user what features the model plans to use, and the user being able to request any changes before that step is executed.
Also wanted to ask how you plan to scale to more advanced (case-specific) models? I see this as a quick and easy way to get the more trivial models working, especially for less ML-experienced people, but I'm curious what would change for more complicated models or demanding users?
Agree. We've designed a mechanism to enable any of the agents to ask for input from the user, but we haven't implemented it yet. Especially for more complex use cases, or use cases where the datasets are large and training runs are long, being able to interrupt (or guide) the agents' work would really help avoid "wasted" one-shot runs.
Regarding more complicated models and demanding users, I think we'd need:
1. More visibility into the training runs; log more metrics to MLflow, visualise the state of the multi-agent system so the user knows "who is doing what", etc.
2. Give the user more control over the process, both before the building starts and during. Let the user override decisions made by the agents. This will require the mechanism I mentioned for letting both the user and the agents send each other messages during the build process.
3. Run model experiments in parallel. Currently the whole thing is "single thread", but with better parallelism (and potentially launching the training jobs on a separate Ray cluster, which we've started working on) you could throw more compute at the problem.
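To illustrate what I mean by parallel experiments, here's a rough sketch of fanning training trials out over a Ray cluster. It's not our actual implementation; the model family and the search space are just placeholders.

```python
# Rough sketch only: run candidate training jobs in parallel with Ray and keep
# the best one. The model family and parameter grid are placeholders.
import ray
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

ray.init()  # or ray.init(address="auto") to attach to an existing cluster


@ray.remote
def train_and_score(params: dict):
    """Train one candidate model and return it with its mean CV score."""
    X, y = make_regression(n_samples=2000, n_features=20, random_state=0)
    model = GradientBoostingRegressor(**params, random_state=0)
    return params, cross_val_score(model, X, y, cv=3).mean()


candidates = [{"n_estimators": n, "max_depth": d} for n in (100, 300) for d in (2, 4)]
results = ray.get([train_and_score.remote(p) for p in candidates])
best_params, best_score = max(results, key=lambda r: r[1])
print(best_params, best_score)
```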
I'm sure there are many more things that would help here, but these are the first that come to mind off the top of my head.
What are your thoughts? Anything in particular that you think a demanding user would want/need?
This is a really interesting idea! I'll be honest, it took me a minute to really get what it was doing. The GitHub page video doesn't play with any audio, so it's not clear what's happening.
Once I watched the video, I think I have a better understanding. One thing I would like to see is more of a breakdown of how this solves a problem that just a big model itself wouldn't.
Yeah, we rushed to create a "Plexe in action" video for our Readme. We'll put a link to the YouTube video on the Readme so it's easier to follow.
Using large generative models enables fast prototyping, but runs into several issues: generic LLMs have high latency and cost, and fine-tuning/distilling doesn’t address the fundamental size issue. Given these pain points, we realized the solution isn’t bigger generic models (fine-tuned or not), but rather automating the creation, deployment, and management of lightweight models built on domain-specific data. An LLM can detect if an email is malicious, but a classifier built specifically for detecting malicious emails is orders of magnitude smaller and more efficient. Plus, it's easier to retrain with more data.
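To make the size difference concrete, this is roughly the kind of lightweight, domain-specific classifier we mean. It's an illustrative sketch only; the dataset file and column names are made up.

```python
# Illustrative only: a tiny purpose-built malicious-email classifier, as opposed
# to sending every email through a general-purpose LLM. The dataset file and
# column names are assumptions for the sake of the example.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

emails = pd.read_csv("labelled_emails.csv")  # assumed columns: "body", "is_malicious"
X_train, X_test, y_train, y_test = train_test_split(
    emails["body"], emails["is_malicious"], test_size=0.2, random_state=42
)

clf = make_pipeline(TfidfVectorizer(max_features=20_000), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
# The fitted pipeline is a few megabytes and scores an email in microseconds,
# versus an LLM call per message.
```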
Does it decide based on data if it should make its own ML model or fine-tune a relevant one?
Also, does it detect issues with the training data? When I was doing NLP ML models before LLMs, the tasks that took all my time were related to data cleaning, not the training or choosing the right approach.
Currently it decides whether to make its own model or fine-tune a relevant one based primarily on the problem description. The agent's ability to analyse the data when making decisions is pretty limited right now, and something we're currently working on (i.e. let the agent look at the data whenever relevant, etc).
I guess that kind of answers your second question, too: it does not currently detect issues with the training data. But it will after the next few pull requests we have lined up!
And yes, completely agree about data cleaning vs. model building. We started from model building as that's the "easier" problem, but our aim is to add more agents to the system to also handle reviewing the data, reasoning about it, creating feature engineering jobs, etc.
I don't want to hate, what you built is really cool and should save time in a data scientist's workflow, but... we did this. It won't "automate most of the ML lifecycle." Back in ~2018 "autoML" was all the rage. It failed because creating boilerplate and training models are not the hard parts of ML. The hard parts are evaluating data quality, seeking out new data, designing features, making appropriate choices to prevent leakage, designing evaluation appropriate to the business problem, and knowing how this will all interact with the model design choices.
Yes, this is the issue. In any reasonably-sized enterprise you’re not going to have a clean CSV to plug in to a model generator. You’re either going to have 1) 50 different excel spreadsheets to wrangle and combine somehow or 2) 50+ terabytes of messy logs to process.
Creating something that can grok MNIST is certainly cool, but it’s kind of like saying leetcode is equivalent to software engineering.
Second, and more practically speaking, you are automating (what I think of as) the most fun part of ML: the creativity of framing a problem and designing a model to solve that problem.
Agree completely. We built Plexe with that first scenario in mind - the messy spreadsheet problem that's so common in enterprise. You can connect multiple data sources, and Plexe will identify what it needs based on the problem description. We're also gradually developing support for handling terabyte-scale data, though we're not there yet. We started by validating our approach on well-defined problems with clean datasets, but we've been systematically adding capabilities to handle increasingly complex scenarios.
On your second point about automating the "fun part", we see Plexe as amplifying that creativity rather than automating it. We're trying to make it easier to design experiments and evaluate results. But would love to hear your feedback on this!
Hey, one of the authors here! I completely agree with your comment. Training ML models on a clean dataset is the "easy" and fun part of an ML engineer's job.
While we do think our approach might have some advantages compared to "2018-style" AutoML (more flexibility, easier to use, potentially more intelligent solution space exploration), we know it suffers from the issue you highlighted. For the time being, this is aimed primarily at engineers who don't have ML expertise: someone who understands the business context, knows how to build data processing pipelines and web services, but might not know how to build the models.
Our next focus area is trying to apply the same agentic approach to the "data exploration" and "feature ETL engineering" part of the ML project lifecycle. Think a "data analyst agent" or "data engineering agent", with the ability to run and deploy feature processing jobs. I know it's a grand vision, and it won't happen overnight, but it's what we'd like to accomplish!
> this is aimed primarily at engineers who don't have ML expertise: someone who understands the business context, knows how to build data processing pipelines and web services, but might not know how to build the models.
I respect software engineers a lot; however, ANYONE who "doesn't know how to build models" also doesn't know what data leakage is, or how to evaluate a model more deeply than simple metrics/loss, and can easily trick themselves into building a "great" model that ends up falling on its face in prod. So apologies if I'm highly skeptical of the admittedly very very cool thing you have built. I'd love to hear your thoughts.
I think you're probably right. As an example of this challenge, I've noticed that engineers who don't have a background in ML often lack the "mental models" to understand how to think about testing ML models (i.e. statistical testing as opposed to the kind of pass/fail test cases that are used to test code).
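As a concrete illustration of that difference (a sketch of the general idea, not something plexe does today): a statistical acceptance test might bootstrap a confidence interval for accuracy and compare it to a threshold, rather than asserting a single pass/fail value. The predictions and the 0.75 threshold below are made up.

```python
# Minimal sketch of a "statistical" model test: bootstrap a 95% confidence
# interval for accuracy and require the lower bound to clear a threshold.
import numpy as np


def bootstrap_accuracy_ci(y_true, y_pred, n_boot=10_000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    accs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample with replacement
        accs.append((y_true[idx] == y_pred[idx]).mean())
    return np.quantile(accs, [alpha / 2, 1 - alpha / 2])


low, high = bootstrap_accuracy_ci([1, 0, 1, 1, 0] * 200, [1, 0, 1, 0, 0] * 200)
print(f"95% CI for accuracy: [{low:.3f}, {high:.3f}]; passes: {low > 0.75}")
```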
The way I look at this is that plexe can be useful even if it doesn't solve this fundamental problem. When a team doesn't have ML expertise, their choices are A) don't use ML, B) acquire ML expertise, or C) use ChatGPT as your predictor. Option C suffers from the same problem you mentioned, in addition to latency/scalability/cost issues and the model not being trained on your data, etc. So something like Plexe could be an improvement on option C by at least addressing the latter pain points.
Plus: we can keep throwing more compute at the agentic model building process, doing more analysis, more planning, more evaluation, more testing, etc. It still won't solve the problem you bring up, but hopefully it gets us closer to the point of "good enough to not matter" :)
Just a thought, but maybe a good angle would be to interview data analysts and ask them what the most annoying parts of their jobs are, to figure out how to automate the drudge work. If you can make their lives easier, they’ll sell the product for you.
Absolutely! When we started building this out, we knew that we had to build an agent to perform data cleaning and feature transformations. After speaking to data analysts, PMs and engineers over the last few weeks, we've received strong feedback about adding this capability to Plexe and we're actively working on it. We've already added some features related to this and hopefully will roll out the whole agent very soon!
Is there a benchmark or eval for why this might be a better approach than actually modeling the problem? If you're selling this to a non-ML person, I get the draw. But you'd still have to show why using these LLMs would be better than training it with something simpler / more lightweight.
That said, it's likely that you'll get good zero-shot performance, so the model building phase could benefit from fine-tuning the prompt given the dataset - instead of training the underlying model itself.
Just to clarify, we're not directly using the LLMs as the "predictor" models for the task. We're making the LLMs do the modeling work for you.
For example, take the classic "house price prediction" problem. We don't use an LLM to make the predictions; we use LLMs to model the problem and write code that trains an ML model to predict house prices. This would most likely end up being an xgboost regressor or something like that.
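To give a feel for it, the generated training script might look something like this. It's a hedged sketch rather than actual Plexe output; the CSV path and column names are invented.

```python
# Sketch of the kind of script the agent might write for house price prediction.
# "house_prices.csv" and the "price" column are invented for this example.
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

df = pd.read_csv("house_prices.csv")
X, y = df.drop(columns=["price"]), df["price"]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = XGBRegressor(n_estimators=500, max_depth=6, learning_rate=0.05)
model.fit(X_train, y_train)
print("validation MAE:", mean_absolute_error(y_val, model.predict(X_val)))
```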
As to your point about evals, great question! We've done some testing but haven't yet carried out a systematic eval. We intend to run this on OpenAI's MLE-Bench to quantify how well it actually does at creating models.
Hey, this is very cool. I work at a bank and we are starting to look at something like this, mainly to automate boilerplate code for experimentation and model training. However, we are a GCP shop, so I might play with this over the weekend to see if I can add support for vertex.ai experiments.
Have you thought about extending this to cover the model development lifecycle and perhaps having agents to help with EDA, model selection, explanation and feature engineering? This is where we are seeing a lot of demand from users as well, but we are starting out with experiment / pipeline / serving boilerplate.
Hey! That sounds great. Happy to help in case you face any issues while adding support for vertex.ai.
We’ve added a tool which does EDA, and when the model package is created, it contains a file called metadata.json with detailed explanations of why a model was chosen, the preprocessing steps, and technical strengths & limitations. We’re working on adding an agent for performing feature engineering, and it should be out soon!
Any review of smolagents? This combination-of-agents approach seems likely to be really useful in a lot of places, and I’m wondering if you liked it, loved it, hated it, …
Hey, I'm one of the authors of Plexe. Overall, I'd say we like smolagents: it's simple, easy to understand, and you can get a project set up very quickly. It also has some neat features, such as the "step callbacks" (functions that are executed after every step the agent takes).
However, the library does feel somewhat immature, and has some drawbacks that hinder building a production application. Some of the issues we've run into include:
1. It's not easy to customise the agents' system prompts. You have to "patch" the smolagents library's YAML templates in a hacky way.
2. There is no "shared memory" abstraction out of the box to help you manage communication between agents. We had to implement an "ObjectRegistry" class into which the agents can register objects, so that another agent can retrieve the object just by knowing the object's key string (there's a minimal sketch of the idea after this list). As we scale, we will need to build more complex communication abstractions (task queues etc). Given that communication is a key element of multi-agent systems, I would have expected a popular library like smolagents to have some kind of built-in support for it.
3. No "structured response" where you can pass a Pydantic BaseModel (or similar) to specify what structure the agent response should have.
4. "Managed agents" are always executed synchronously. If you have a hierarchy of managed agents, only one agent will ever be working at any given time. So we'll have to build an async execution mechanism ourselves.
I think we've run into some other limitations as well, but these are the first that come to my mind :) hope this helps!
Thanks - super helpful. Passing state around to agents feels like a big pain point right now. That said, just getting simple state transition libraries working with agents is a bit of a pain point as well.
Feels like there might be a good infra company in there for someone to build.
Smolagents works great for us but we did run into some limitations. For example, it lacks structured output enforcement, parallel execution, and in-built shared memory, which are crucial features for orchestrating a multi-layer agent hierarchy beyond simple chatbots. We've also been playing around with Pydantic AI due to its benefits with validation and type enforcement but haven't shifted yet.
Do you mean being able to wrap the created model in a scikit-learn Pipeline? This isn't something we've thought about and we haven't explicitly built support for it, though we could.
As of now, I think you could relatively easily wrap the plexe model, which has a `predict()` method, in a scikit-learn Estimator. You could then plug it into a Pipeline.
What do you have in mind? How would you want to use this with scikit-learn pipelines?
I think what I'm after is being able to put these in a pipeline.
I.e. if I already have some data cleaning/normalisation, some dimensionality reduction and then some fitting, being able to drop the Agent in place with an appropriate description and task.
Cleaning: Feed it a data frame and have it figure out what needs imputing etc.
The rest: Could either be separate tasks or one big task for the Agent.
Interesting! We don't currently support this explicitly.
You could wrap the Plexe-built model in a scikit-learn Estimator like I mentioned, and you can specify the desired input/output schema of the model when you start building it, so it will fit into your Pipeline.
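As a sketch of that wrapping (hypothetical; plexe doesn't ship this today, and the per-row `predict()` call is an assumption about the model interface):

```python
# Hypothetical wrapper: delegate predictions to an already-built plexe model so
# it can sit at the end of a scikit-learn Pipeline after your own cleaning and
# dimensionality-reduction steps. Assumes the built model exposes predict()
# taking one record (dict) at a time.
import pandas as pd
from sklearn.base import BaseEstimator


class PlexeEstimator(BaseEstimator):
    def __init__(self, plexe_model):
        self.plexe_model = plexe_model  # already built; no training here

    def fit(self, X, y=None):
        return self  # model building happened in plexe, so fit is a no-op

    def predict(self, X):
        rows = pd.DataFrame(X).to_dict(orient="records")
        return [self.plexe_model.predict(row) for row in rows]  # assumed interface


# Then something like:
#   Pipeline([("impute", SimpleImputer(strategy="median")),
#             ("reduce", PCA(n_components=10)),
#             ("model", PlexeEstimator(built_model))])
# where built_model is a model plexe has already produced.
```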
This is an interesting requirement for us to think about though. Maybe we'll build proper support for the "I want to use this in a Pipeline" use case :)
Plexe analyzes your data and task description, then builds custom ML models using standard Python libraries (like scikit-learn, XGBoost, etc.). If your problem is best solved by a regression model, it will build that. If classification is more appropriate, it will implement that instead.
Fine-tuning existing language models is also an option in Plexe's toolkit. For example, when we needed to classify prompt injections for LLMs, Plexe determined fine-tuning RoBERTa was the best approach. But for most structured data problems (like forecasting or recommendations), Plexe typically builds lightweight models from scratch that are trained directly on your dataset.
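For context, the fine-tuning code generated in that case looks roughly like this. It's an illustrative sketch rather than the actual generated script; the dataset file and its columns are made up.

```python
# Sketch of a RoBERTa fine-tune for prompt-injection classification. The CSV
# and its "text"/"label" columns are invented; real generated code would be
# tied to the user's dataset and schema.
import pandas as pd
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

df = pd.read_csv("prompt_injection_samples.csv")  # assumed columns: "text", "label" (0/1)
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

dataset = Dataset.from_pandas(df).map(
    lambda row: tokenizer(row["text"], truncation=True, padding="max_length", max_length=256)
)
splits = dataset.train_test_split(test_size=0.2, seed=42)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="roberta-prompt-injection",
                           num_train_epochs=3, per_device_train_batch_size=16),
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
)
trainer.train()
```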
Sorry, I think I explained that poorly. Plexe does build deep learning models automatically. When it gets a dataset and a problem description, it automatically evaluates various model architectures (NNs being one of them).
Plexe experiments with multiple approaches - from traditional algorithms like gradient boosting to deep neural networks. It runs the training jobs and compares performance metrics across different architectures to identify which solution best fits your specific data and problem constraints.
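A rough illustration of that comparison step (not Plexe's internal code; synthetic data stands in for the user's dataset):

```python
# Rough illustration of "try several architectures, keep the best": fit a
# gradient-boosting model and a small neural net on the same split and compare
# validation scores. Synthetic data is used purely for the example.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=5000, n_features=30, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

candidates = {
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
    "neural_net": MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0),
}
scores = {name: est.fit(X_train, y_train).score(X_val, y_val)
          for name, est in candidates.items()}
print(scores, "-> best:", max(scores, key=scores.get))
```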
No, not by default. In fact, the default installation of plexe doesn't include deep learning libraries.
Plexe _can_ build deep learning models using `torch` and `transformers`, and often the experimentation process will include some NN-based solutions as well, but that's just one of the ML frameworks available to the agent. It can also build models using xgboost, scikit-learn, and several others.
You can also explicitly tell Plexe not to use neural nets, if that's a requirement.
Nice execution! I built a simpler version of it a year ago (https://github.com/jmaczan/csv-to-ml). I hope you succeed with the product and push AutoML forward.
Hey, this is super cool! We found a few projects working on similar things to Plexe, but were not aware of yours. Thanks for sharing, will check it out!
You're right. We've seen the "garbage in, garbage out" problem firsthand.
We've seen the models hit typical statistical pitfalls like overfitting and data leakage during testing. We've improved by implementing strict validation protocols and guardrails around data handling. While we've fixed the agents getting stuck in recursive debugging loops, statistical validity remains an ongoing challenge. We're actively working on better detection of these issues, but ultimately, we rely on domain expertise from users for evaluating model performance.
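One example of the kind of guardrail we mean, purely illustrative rather than our exact checks: flag rows that appear in both splits, and numeric features that are suspiciously predictive of the target on their own.

```python
# Illustrative leakage guardrail, not Plexe's actual checks: report exact row
# overlap between train and test splits, and numeric features whose correlation
# with a numeric target is near-perfect (often a sign of leakage).
import numpy as np
import pandas as pd


def leakage_report(train: pd.DataFrame, test: pd.DataFrame, target: str) -> dict:
    duplicate_rows = pd.merge(train, test, how="inner").shape[0]  # identical rows in both splits
    suspicious = {}
    for col in train.columns.drop(target):
        if pd.api.types.is_numeric_dtype(train[col]):
            corr = abs(np.corrcoef(train[col], train[target])[0, 1])
            if corr > 0.99:
                suspicious[col] = round(float(corr), 4)
    return {"rows_in_both_splits": duplicate_rows, "suspiciously_predictive": suspicious}
```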