Help: get a full picture of input values of a Java method call systematically #14660

oriana19993926782 · 2023-11-01T17:02:52Z

Hello CodeQL Community,

I'm doing research on assessing the code-analysis capabilities of large language models (LLaMA). More specifically, I give LLaMA a method and an input value, ask it what the method returns, and compare its answer to the expected output.

The methods I intend LLaMA to analyze are those called inside assertions in unit tests. The inputs and expected outputs of such methods can take different forms:

another method call (whose inputs can in turn be method calls or variables)
local variables
global variables
a class literal, like JobInstance.class
hard-coded values, like "Hello world" or 8 (no need to get the full picture for this kind)

@Test
void assertMarshal() {
    FooYamlConfiguration actual = new FooYamlConfiguration();
    actual.setFoo("foo");
    actual.setBar("bar");
    FooYamlConfiguration nest = new FooYamlConfiguration();
    nest.setFoo("nest_foo");
    nest.setBar("nest_bar");
    actual.setNest(nest);
    assertThat(YamlEngine.marshal(actual), is(YAML));
}

In the unit test above, the method in question is marshal, its input is actual (a local variable), and the expected output is YAML (a global variable). However, I cannot simply give LLaMA the name actual as the input value, nor can I compare its answer directly to the name YAML. My final objective is to extract a full picture of both the input value and the expected output, that is, every statement that determines them.

I have made some attempts at extracting the full picture of the input variables. Here is the query I have written so far:

import java
import semmle.code.java.dataflow.DataFlow

class AssertMethodCall extends MethodAccess {
  AssertMethodCall() {
    this.getMethod().getDeclaringType().getQualifiedName().matches("%Assert%") and
    this.getEnclosingCallable() instanceof TestMethod
  }
}

from
  AssertMethodCall assertMethodCall, Callable focalM, MethodAccess methodCall, VarAccess input,
  Variable var, Expr varInit, DataFlow::Node declaredNode, DataFlow::Node endNode
  // , DataFlow::Node predNode

where
  // the method under test is a call passed as an argument to the assertion
  methodCall = assertMethodCall.getAnArgument().(MethodAccess) and
  // its input is a variable access; only methods defined in the project are kept
  input = methodCall.getArgument(_) and
  focalM = methodCall.getCallee() and
  focalM.fromSource() and
  // start from a value assigned to that variable
  var = input.getVariable() and
  varInit = var.getAnAssignedValue() and
  declaredNode = DataFlow::exprNode(varInit)

  // and DataFlow::localFlow(predNode, declaredNode)
  // and predNode.getLocation().getStartLine() < assertMethodCall.getLocation().getStartLine()
  // and predNode.getEnclosingCallable() = assertMethodCall.getEnclosingCallable()

  // follow local flow forward from the assigned value, up to the assertion line
  and DataFlow::localFlow(declaredNode, endNode)
  and endNode.getLocation().getStartLine() < assertMethodCall.getLocation().getStartLine()
  and endNode.getEnclosingCallable().getFile() = assertMethodCall.getEnclosingCallable().getFile()

select
  assertMethodCall, var, varInit
  // , predNode.(DataFlow::ExprNode).getExpr().getEnclosingStmt()
  , endNode.(DataFlow::ExprNode).asExpr().getEnclosingStmt()

Taking the previous unit test as an example, this query successfully identifies marshal as the method being tested, its input variable actual, the declaration of actual, and the subsequent statements that directly change actual (marked "✅"). However, it misses the statements that affect actual only indirectly (marked "❌"):

   FooYamlConfiguration actual = new FooYamlConfiguration(); ✅
   actual.setFoo("foo"); ✅
   actual.setBar("bar"); ✅
   FooYamlConfiguration nest = new FooYamlConfiguration(); ❌
   nest.setFoo("nest_foo"); ❌
   nest.setBar("nest_bar"); ❌
   actual.setNest(nest); ✅

So the 1st problem with the query is that it captures only the direct changes to the variable (actual) after its declaration, not the indirect ones.
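To make the 1st problem concrete, here is a minimal sketch, in plain Java rather than CodeQL, of the result I am after for assertMarshal. It is only an illustration under stated assumptions: the writes and reads sets are written by hand instead of being extracted, a mutating call such as setFoo is assumed to count as a write to its receiver, and the names BackwardSliceSketch, Stmt, slice and assertMarshalBody are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class BackwardSliceSketch {

    /** One straight-line statement with hand-written def/use sets (hypothetical model). */
    static final class Stmt {
        final String text;
        final Set<String> writes; // variables declared, assigned, or mutated through a call
        final Set<String> reads;  // variables whose values flow into the statement

        Stmt(String text, Set<String> writes, Set<String> reads) {
            this.text = text;
            this.writes = writes;
            this.reads = reads;
        }
    }

    /** Returns, in source order, the statements that can affect {@code target}. */
    static List<String> slice(List<Stmt> body, String target) {
        Set<String> relevant = new HashSet<>();
        relevant.add(target);
        List<String> kept = new ArrayList<>();
        // Walk backwards from the assertion: a statement matters if it writes a relevant variable.
        for (int i = body.size() - 1; i >= 0; i--) {
            Stmt s = body.get(i);
            if (!Collections.disjoint(s.writes, relevant)) {
                kept.add(s.text);
                relevant.addAll(s.reads); // the variables it reads become relevant too
            }
        }
        Collections.reverse(kept);
        return kept;
    }

    /** The statements of assertMarshal before the assertion, annotated by hand. */
    static List<Stmt> assertMarshalBody() {
        List<Stmt> body = new ArrayList<>();
        body.add(new Stmt("FooYamlConfiguration actual = new FooYamlConfiguration();", Set.of("actual"), Set.of()));
        body.add(new Stmt("actual.setFoo(\"foo\");", Set.of("actual"), Set.of()));
        body.add(new Stmt("actual.setBar(\"bar\");", Set.of("actual"), Set.of()));
        body.add(new Stmt("FooYamlConfiguration nest = new FooYamlConfiguration();", Set.of("nest"), Set.of()));
        body.add(new Stmt("nest.setFoo(\"nest_foo\");", Set.of("nest"), Set.of()));
        body.add(new Stmt("nest.setBar(\"nest_bar\");", Set.of("nest"), Set.of()));
        body.add(new Stmt("actual.setNest(nest);", Set.of("actual"), Set.of("nest")));
        return body;
    }

    public static void main(String[] args) {
        // Prints every statement that can affect the argument of marshal.
        slice(assertMarshalBody(), "actual").forEach(System.out::println);
    }
}
```

Slicing on actual keeps the three nest statements because actual.setNest(nest) reads nest, while slicing on nest keeps only those three. A single reverse pass is enough only for straight-line code like this test; loops, branches, or aliasing would need a fixed-point iteration, which this sketch does not attempt.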

The 2nd problem is that, even with all the predNode-related lines uncommented, the query cannot capture the statements that influence the value a variable is initialized with. For example, in the following unit test, predNode does not capture the statements before the declaration of proxy (marked "❌"), which clearly determine what proxy is.

    @Test
    void assertJdkDynamicProxyForGetTarget() {
        ElasticJob target = new TargetJob(); ❌
        ProxyFactory pf = new ProxyFactory(target); ❌
        pf.addInterface(ElasticJob.class);  ❌
        ElasticJob proxy = (ElasticJob) pf.getProxy(); ✅
        assertTrue(AopUtils.isJdkDynamicProxy(proxy));
        assertThat(AopTargetUtils.getTarget(proxy), is(target));
    }

Unfortunately, things can get more complex:

@Test
void assertUpdateJobExecutionEventWhenSuccessAndConflict() {
    JobExecutionEvent startEvent = new JobExecutionEvent("localhost", "127.0.0.1", "fake_task_id", "test_job", JobExecutionEvent.ExecutionSource.NORMAL_TRIGGER, 0); ❌
    JobExecutionEvent successEvent = startEvent.executionSuccess();  ✅
    assertTrue(storage.addJobExecutionEvent(successEvent));

In this more complex unit test, successEvent is the input variable. To extract the full picture, we need to know what startEvent and executionSuccess() are, and also what JobExecutionEvent.ExecutionSource.NORMAL_TRIGGER (an argument of the startEvent instantiation) is. This is the 3rd problem with the query: it cannot follow the further dependencies of the statements it captures, whether a method call like executionSuccess() or another variable like JobExecutionEvent.ExecutionSource.NORMAL_TRIGGER. I am quite new to CodeQL, and I have no idea how to get the full picture of startEvent in this situation.
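To illustrate what "further dependencies" means here, the following is a minimal sketch in plain Java (not CodeQL) of a worklist that collects everything successEvent transitively depends on. The "depends on" edges are written by hand for this one test rather than extracted from the code, and the names DependencyClosureSketch, edges and closure are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class DependencyClosureSketch {

    /** Hand-written "depends on" edges for the test above (hypothetical model, not extracted). */
    static Map<String, List<String>> edges() {
        Map<String, List<String>> deps = new LinkedHashMap<>();
        deps.put("successEvent",
                List.of("startEvent", "JobExecutionEvent.executionSuccess()"));
        deps.put("startEvent",
                List.of("new JobExecutionEvent(...)", "JobExecutionEvent.ExecutionSource.NORMAL_TRIGGER"));
        return deps;
    }

    /** Everything reachable from {@code root}, in breadth-first discovery order. */
    static Set<String> closure(Map<String, List<String>> deps, String root) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> work = new ArrayDeque<>();
        seen.add(root);
        work.add(root);
        while (!work.isEmpty()) {
            String current = work.remove();
            for (String next : deps.getOrDefault(current, List.of())) {
                if (seen.add(next)) { // each element is queued once, so cycles terminate
                    work.add(next);
                }
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        // Prints successEvent followed by every element it depends on, directly or indirectly.
        closure(edges(), "successEvent").forEach(System.out::println);
    }
}
```

In a real extraction, each discovered element (a method body, a constructor, an enum constant) would itself contribute new edges, which is the part I do not know how to express in a query.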

Below is another complex unit test. The method under test has two layers of inputs, and the 2nd layer uses a global variable (this is just to show that the inputs under study can take various forms, with different kinds of complexity):

1st layer: new JobExecutionEvent(...)
2nd layer: "localhost", "127.0.0.1", "fake_task_id", "test_job", JobExecutionEvent.ExecutionSource.NORMAL_TRIGGER, 0

@Test
void assertAddJobExecutionEvent() {
    assertTrue(storage.addJobExecutionEvent(new JobExecutionEvent("localhost", "127.0.0.1", "fake_task_id", "test_job", JobExecutionEvent.ExecutionSource.NORMAL_TRIGGER, 0)));
}

The Java project under study is https://github.com/apache/shardingsphere-elasticjob.git

I am struggling and completely overwhelmed by the complexity here! I would greatly appreciate any guidance on how to modify my query to get a full picture, at least for the input values. Thank you for taking the time to read this far. 💗💗
