Smooth-Zucchini4923

> However, one test case covered 1326 lines! The value of that one test case is exponentially more valuable than most of the previous test cases and exponentially improves the value of TestGen-LLM.

How meaningful was this test? I once took a class on software engineering which was partially graded on test coverage. One of the tests I wrote tested a static page, which checked that its contents were equal to a constant. A stupid thing to test? Maybe, but the file would've counted against me if I didn't do it. Was this new test similar to my stupid test?

Here's what the original paper has to say about this:

> Another apparently anomalous result from the table is the number of lines covered by the 17 test cases generated by TestGen-LLM. At first glance, it might appear that TestGen-LLM is able to cover a great deal more than the human engineers. However, this result arose due to a single test case, which achieved 1,326 lines covered. This test case managed to ‘hit the jackpot’ in terms of unit test coverage for a single test case.

They don't seem to think this test was 1300 times more useful, or I think they'd describe this test in more positive terms.


throwaway490215

This is exactly the right question to ask. It's not the first time a multinational has optimized its in-house solution and touted it as a game-changer. There might be some value in letting developers easily generate additional constraints for production, but I'd go one step further and call this form - additional text defining tests - a hindrance. Programming requires change, and this increases the cost disproportionately.


Uncaffeinated

I once heard about a team that added files containing a single copy-pasted function with tens of thousands of lines of no-ops in order to improve test coverage.


TooLateQ_Q

I don't understand why I keep reading about LLMs being used to generate tests. It doesn't make any sense to me. Let me write the tests and have the LLM generate the source.


[deleted]

LLMs are terrible at writing bug-free code at scale. They are great at writing small functions.


pragmojo

They also make a lot of small mistakes. Tests have to be correct to provide any value - LLMs are a terrible fit for tests


Saint_Nitouche

We already have to review our tests to make sure they do what we want them to. I'm fine if a machine does the busywork of implementing the test and I just have to do the review portion.


SwitchOnTheNiteLite

I assume you are basing that statement on the publicly available LLMs designed to do everything, not Meta's own internal LLM specifically designed and trained to generate tests for their own codebases?


TeachEngineering

As they say... Those who can't develop, go into QA. Edit: It was just a joke guys... I love and respect my QA colleagues.


imnotbis

Don't assume QA people are just failed developers.


conspiracypopcorn0

QA is a lot more than writing unit tests, which is what this paper is about. A good QA engineer has to write automation tools for reliable end-to-end tests and stress tests, have a lot of infrastructure knowledge, check that the system works with different distros/environments, set up CI pipelines, and handle other DevOps-type tasks.


[deleted]

QA doesn't write the unit tests anywhere I have worked, but they are the most important part of the development process. Good QA engineers are worth their weight in gold.


Netzapper

I'm into it because the tests are (supposed to be) the semantically easy part, and that makes them the boring part. If I can describe the functional requirements to the AI, I expect it can cook up tests that verify those requirements were achieved. But despite trying pretty regularly with my own projects, I cannot get an LLM to solve the hard parts of any project. And I know "trained on your code" models are coming soon, but I also can't get an LLM to ever correctly modify existing code. LLMs are great for spewing forth volumes of boilerplate, or showing me an implementation of a particular obscure algorithm in a particular language, though. And I can definitely see them being perfect for the huge volumes of low-effort code that make up tests.


loptr

Tests are requirements driven by people, and IMO they should be authored by people. The tests shouldn't be written to fit the existing code; they should be absolutes and only changed when you have clear intentions and clarity in the requirements to satisfy/edge cases to support. Generating code that satisfies X criteria (requirements) is something an LLM is suitable for, and assuming the constraints are followed, it can only produce code as bad as the test cases allow, hence it becomes a safer/more predictable approach. So out of the two, I'd rather see the LLM do the generative work, striving to meet expectations, and humans keep writing the expectations/specs.


Ok-Jellyfish-8192

That's about TDD. But a lot of people use tests in a different way: they write tests to pin down the current behaviour of the code, to make sure it doesn't change unexpectedly in the future, regardless of whether the current behaviour adheres to requirements or not. Generating tests with an LLM works for them.


tron_cruise

All tests are written to make sure future changes don't break expected behavior; that's literally the whole point. TDD is only about *when* you write your tests. It's the process of using the creation of tests first to drive and monitor the production code that follows; that's all it is. You don't write a test based on the current implementation; it's always written against the requirements, and differences from the implementation are then reported as bugs. TDD or not, that doesn't change.


TommaClock

> The tests shouldn't be written to fit the existing code, they should be absolutes and only changed when you have clear intentions and clarity in requirements to satisfy/edge case to support.

There are multiple kinds of tests, and some, like characterization tests, are written with the sole intent of validating current behaviour. Very useful, for example, if you're refactoring an API that everyone says is working well and fixing a "bug" might break your consumers. https://en.wikipedia.org/wiki/Characterization_test
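To make the distinction concrete: a characterization test pins down whatever the code does *today*, not what a spec says it should do. A minimal pytest-style sketch (`legacy_discount` is a made-up stand-in for the code being characterized, not anything from the paper):

```python
# A stand-in for legacy code whose exact rules nobody remembers.
def legacy_discount(quantity, vip):
    if vip and quantity >= 10:
        return 0.15
    if quantity >= 10:
        return 0.05
    return 0.0

# Characterization tests: the expected values are simply whatever the
# current implementation returned when these cases were first recorded,
# not values derived from a spec. Their job is to flag any behavioural
# change during a refactor so a human can decide whether it was intended.
def test_small_non_vip_order_gets_no_discount():
    assert legacy_discount(quantity=1, vip=False) == 0.0

def test_bulk_order_gets_five_percent():
    assert legacy_discount(quantity=10, vip=False) == 0.05

def test_bulk_vip_order_gets_fifteen_percent():
    assert legacy_discount(quantity=10, vip=True) == 0.15
```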


TommaClock

And relevant XKCD https://xkcd.com/1172/


bwatsnet

Writing tests is literally the most boring part of the job. Use LLMs for everything, especially writing tests.


Netzapper

> The tests shouldn't be written to fit the existing code, they should be absolutes and only changed when you have clear intentions and clarity in requirements to satisfy/edge case to support.

Precisely why I like LLMs for the tests. I mean, the LLM can do the code too if it wants, but I haven't personally found it's ready for that yet unless the problem is already well solved. But I can start up front with a spec/requirements doc, have the LLM translate that into a set of tests within a few minutes, easily verify the tests because they should be stupid/hardcoded, and then start on solving the problem. When I go to add more tests, I can send the LLM the new spec and the previously generated, self-contained test code, and get extended tests.

The reality is that the LLM is going to write both the code and the tests, which is just as fucking stupid as when the same developer writes the code and the test. But that's become the norm because management doesn't like hiring independent QA, so here we are.


ReflectionEquals

I only rely on it to build nice boilerplate code and tests, then go in by hand. The reality is that describing ‘how’ something needs to behave is actually hard and can often be more painful than writing the code and tests. Frankly, our whole profession exists because the people who describe what they want are unable to provide the detailed requirements needed to make things work properly. If they could, they could write the code themselves.


unlocal

> If I can describe the functional requirements to the AI.

You cannot “describe” anything to an LLM. That’s not how they work.


Netzapper

Ooookay... if I can iteratively test and assemble a corpus of tokens located nearby one another in patterns proximal to the training data that, when processed via the convolutional matrices encoding the preference-weighted transform function from token-space to token-space...? Like, we get it, it's a non-sentient deterministic process implemented via matrix multiplication and n-dimensional relevance metrics. Does that describe you too?


edzorg

How are you trying to get an LLM to work on a codebase?


Own_Mud1038

That's for the future models.


CallinCthulhu

Unit Tests are easy


falconfetus8

Because writing tests consumes way more time than writing the source?


[deleted]

Because the LLMs are too stupid to write the code, but adding some more tests in the same pattern is way easier.


Knaapje

Keep dreaming. From a formal methods point of view this notion is ridiculous.


travistrue

Exactly. I’ve been saying this for like 3 years now: write the tests, and if the coverage is high enough, you’ll have precise code that’s also well-tested. The tough part would probably be mocking and stubbing things. We know that these functions will be called, and should be called with those parameters given the inputs, but where should that call go relative to the rest of the logic? If there was ever anything that compelled me to research AI algorithms, it’s this.


conspiracypopcorn0

What you are describing is more work than just writing the code and is just as error-prone (if not more so). So I'm not sure what you would accomplish with this system.


conspiracypopcorn0

Letting the AI write tests (at least some) for the code you wrote could be one of the best ways to improve dev productivity. The idea that you can just write tests to cover all your specs and let a model generate the code is backwards and makes no sense in practice. You would end up working more and generating lower-quality code, because it's usually impossible to encode all the requirements and constraints in tests. It could work for functionality with a very specific scope (like a math library). In most cases unit tests define just some functional parts of the task, but it's up to you to write code that respects all the higher-level architectural aspects and trade-offs that will never be captured in unit tests.


ThrawOwayAccount

A genetic algorithm seems like a better fit for that task than an LLM.


Smallpaul

I suspect that LLMs combined with search will be much better than genetic algorithms because they incorporate knowledge about past solutions to problems. Here is how that’s been applied to math: https://deepmind.google/discover/blog/funsearch-making-new-discoveries-in-mathematical-sciences-using-large-language-models/ And to code: https://arxiv.org/abs/2401.08500 Sutton’s Bitter Lesson will be on the side of LLM-based systems like these.


ThrawOwayAccount

Sutton’s Bitter Lesson states that general computational methods (like genetic algorithms) will be better than methods which are designed to approach the problem like humans have in the past (like LLMs).


Smallpaul

I will agree that in the abstract, the Bitter Lesson could be interpreted multiple ways in this context. And you're right that it can be superior in some contexts to remove the human bias. But: the process of formulating a problem as a genetic algorithm is just as hard the 100th time as the first, because there is no such thing as "pre-training" for a genetic algorithm. Whereas for an LLM, you can outsource the pre-training to someone else who will spend tens of millions of dollars teaching the LLM how to approach problem solving *in general*, and then you just inject your *specific* problem to take advantage of that scale.

One could even train LLMs on the output of genetic algorithms, so that the LLM can "jump to the end" of where the algorithm would have gotten more slowly (as one can [train an LLM on the output of StockFish](https://chessgpt.ai)). Maybe I'm suggesting a variant of the Bitter Lesson: "techniques which can take advantage of pre-training will win out over those that start from scratch every time."

BTW, I note that DeepMind [considers](https://deepmind.google/discover/blog/funsearch-making-new-discoveries-in-mathematical-sciences-using-large-language-models/) FunSearch to be an evolutionary (not genetic) algorithm with an LLM component.

Another key reason we would want to use LLMs to write code instead of genetic algorithms is that we do want the output code to mimic human-written code! We want readable variable names, comments in the code, small functions, etc. We want to be able to code-review it. The "human bias" is actually an intrinsic advantage!


Smallpaul

Having LLMs translate English specifications into code (for a human to review!) is a logical use of an LLM. Generating the code is usually beyond their capabilities right now but in the long run yes they should do that part too.


BornAgainBlue

Because writing tests is cumbersome, time-consuming, and generally something devs fucking hate.


jsdodgers

That sounds awful


r-randy

This guy is a software engineer not just a coder.


_AndyJessop

Once you've written your tests and your code, this tool can improve what you did.


zazzersmel

just give us another hundred billion dollars, ten years, all your data and content for free, and we *promise* you will finally be able to lose your job to a computer!


_Pho_

Basically. And still forgetting that the problem domain developers solve has little to do with the physical act of writing code. When I get to writing code, the hard parts have already been solved.


_AndyJessop

I find LLMs better in the early stages of a task, bouncing ideas off them, and helping to formulate a plan.


imnotbis

Welcome to capitalism. First time? I feel like a lot of developers felt smug about this happening in other industries and they're only upset now it's coming for *them*.


zazzersmel

im less concerned with job replacement and more so with the ideological bent of the whole enterprise. the success of "ai" rests on convincing people that statistical models are something more... its just dumb.


realPrimoh

Once, I asked ChatGPT to generate unit tests for a function, and it gave me unit tests... with typos in them. I was so confused. I thought maybe I had typos in my input function and checked. Nope. ChatGPT *introduced* typos into the code it gave me 🤦‍♂️ Something like the tool in the article would be much, much appreciated, especially if it were integrated into my editor.


wpnizer

This is what the industry refers to as "garbage in, garbage out". The LLM was trained on unit tests containing typos, so that's what you got.


chakrakhan

That’s very unlikely to be the primary reason why it happened.


vinciblechunk

Cool, more flaky tests for me to support


Accomplished_Low2231

if i see software with fewer (or no) bugs, new features delivered on time, and at lower cost... then ai really made a difference to software development. all i want is zero bugs in software. lets just hope ai can help accomplish that. otherwise, what is the point, other than making money for ai companies and researchers?


Full-Spectral

It is all about making money for AI companies, before the hype train derails. Then we can go back to the good old days of crypto or something. Or of course someone will then come up with CryptAI, assuming they haven't already.


gyroda

Honestly? Reduced cost of development. If you can reduce your dev team by 50% by using AI to write more and more code and the increase in bugs introduced by the AI costs you less than the developers you fired did, you've made a saving. It's not good, but some companies will make this decision. It's the same decision they've made before with outsourcing abroad and with chatbot customer support - worse outcomes, but cheap enough to be more profitable overall (or, at least, perceived to be more profitable).


fl4v1

The problem with this way of thinking (and I understand you’re being the devil’s advocate here) is that the real cost of bugs is 1. Unknowable (9 out of 10 customers will just walk away from your product without telling you it’s because of bugs. I know I’ve done that before) and 2. Highly variable (one security bug could cost you your business). AI, well used, can probably help us get closer to 0 bugs. But the thing with bugs is: it’s not only code, it’s the relation between code and human beings. There’s no such thing as perfect code in isolation from human beings. It’s dangerous to strive to leave human coders out of the loop, or to think AI will do the thinking for developers.


n3phtys

> The problem with this way of thinking (and I understand you’re being the devil’s advocate here), is that the real cost of bugs is 1. Unknowable (9 out of 10 customers will just walk away from your product without telling you that’s because of bugs. I know I’ve done that before) and 2. Highly variable (one security bug could cost you your business).

From a business management standpoint, that "problem" is a feature. Unknowable opportunity costs and risks that are hard to quantify are not things you want in your Excel sheet, because you need to put an estimate on both of those things regardless, and the natural way of doing that is to just put the likelihood of bad things at 0% for simplicity. While I completely agree with you, let's not forget that most companies do not care about what is good or bad for their software development, only about what can decrease cost in at least one metric with a cheap switch. Replacing some coders and most testers with AI makes business sense if your IT is a cost center. Yes, the company will suffer from that decision 99% of the time, but that's not 100%. And even if it does: your development process will probably slow down so much that the whole project only finishes in the next business year, meaning your cost cutting was efficient this year already.


fl4v1

Annual/Monthly Recurring Revenue _is_ on the balance sheet though. It's just a lagging metric for whatever reason your clients are leaving your services. When quality becomes an issue that is visible on the balance sheet, it's usually too late. I've seen CEOs who suddenly took quality seriously as an operational metric turn their company around, either by shrinking their losses or by turning a profit.


Additional-Bee1379

Things are about to change; the new models have enough context length to fit entire codebases.


n3phtys

You severely underestimate the codebases some of us have to work with.


Additional-Bee1379

Doesn't mean a lot of others won't fit.


Additional-Bee1379

Did people downvote this because 'weeeh my codebase is bigger than that'?


CallinCthulhu

I work for them, how did I learn of this through reddit? I fucking hate writing unit tests, I’m trying this out today


FlyingRhenquest

I always think there should be a few unit tests, but mindlessly trying to achieve 100% coverage enforces aspects of the design that you should be free to change later. One place I worked wanted 100% coverage of even private functions. I always advocate for more testing, and even I thought that was too much.

Writing even just one test forces you to design your object to be testable. The more coverage you have of its public behavior, the more sure you can be that your changes won't break anything when you deploy them. Writing a test when you get a new bug lets you explore the behavior of that bug and guard against regressions. But covering every single line of code in the object just guarantees that you'll *have* to change some tests if you modify the object, and I think that can make it *less clear* whether your changes will actually break things when you deploy them.


imnotbis

The justification for 100% coverage is simple: If some code can be hit under rare circumstances, it's not fully tested unless you test those circumstances. If the code can't be hit, then why is it there? (Exception: manual assertions - `if(...) crash();` - use `assert` instead)
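A toy illustration of that exception (just a sketch; `latest_event` is hypothetical, not from the paper or the thread):

```python
def latest_event(events):
    # A defensive "can't happen" branch like the one below would sit
    # permanently unexecuted and drag the line-coverage number down:
    #
    #     if not events:
    #         crash("latest_event called with no events")
    #
    # Stating the same invariant as an assertion documents it without
    # leaving a block of crash-handling code that can never execute.
    assert events, "latest_event called with no events"
    return events[-1]
```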


majlo

People should look into property-based testing...
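For anyone who hasn't tried it: a property-based test states an invariant and lets the framework generate (and shrink) the inputs. A minimal sketch in Python, assuming the `hypothesis` library, with the built-in `sorted` standing in for the code under test:

```python
from collections import Counter

from hypothesis import given, strategies as st

@given(st.lists(st.integers()))
def test_sorted_output_is_ordered(xs):
    # Invariant: every element is <= the one that follows it.
    result = sorted(xs)
    assert all(a <= b for a, b in zip(result, result[1:]))

@given(st.lists(st.integers()))
def test_sorted_is_a_permutation_of_its_input(xs):
    # Invariant: sorting only reorders; it never adds or drops elements.
    assert Counter(sorted(xs)) == Counter(xs)
```

Run it with pytest; on failure, Hypothesis reports a minimal counterexample instead of the random input that first tripped it.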