

Why is this symlink() call returning successfully while apparently failing to create the sym-link?

+6
−0

Summary

I'm building an internal system (hardware simulator, using Rust) to help test some Python-based services that talk to hardware. (The services talk to hardware via TTYs.) To trick the Python services into "believing" they're talking to the hardware they expect, I create some PTYs, where the master side is used by my simulator and the slave side of the PTY is given over to the Python-based service.

Since the Python service looks for its PTY by using a specific name, I create a symbolic link that points back to the slave PTY mentioned above, but using the name the Python service expects, or so I thought.

The problem is that the symlink function reports success, but no link actually exists in the file system. I'm not sure what I'm missing.

My question is: Does anyone know why this is happening and how to fix it?

Details

I'm using the std::os::unix::fs::symlink function to create the sym-link. The function's documentation is clear about its usage, e.g. symlink("a.txt", "b.txt") would create a sym-link called b.txt that points back to a.txt. (Assuming a.txt already exists.)

When I use the function in the simulator, I observe the following:

  1. The function call returns successfully, and
  2. No actual sym-link can be found in the file system.

I wrote a simple test program (below) to verify my usage, and it works fine:

use std::os::unix::fs::symlink;

fn main() {
    symlink("/tmp/src.txt", "/tmp/dst.txt").expect("symlink failed");
}

The above shows that I'm using it correctly, with dst.txt being the link that points back to src.txt. Also, the above would've failed if I had the arguments inverted or didn't have the correct permissions, so it's not that, either.

I also ran the simulator with strace to check what the lower-level syscall was actually doing (e.g. maybe that was failing but the Rust std library was not handling the error?), but it shows a successful return code:

symlink("/dev/pts/5", "/home/<user>/Projects/vpanel/ttymxc4") = 0

This is despite the fact that the symlink /home/<user>/Projects/vpanel/ttymxc4 does not really exist. A simple visual inspection with the ls command in the above directory does not show the file, the Python service complains that it cannot find the symlink to its PTY, and the simulator itself reports a panic! when it tries to clean up after itself but fails to find the symlink to remove it:

Drop for ServiceProxy { pty: OpenptyResult { master: 9, slave: 10 }, fspath: "/home/<user>/Projects/vpanel/ttymxc4", timeout: 8s }
thread 'tokio-runtime-worker' panicked at 'remove_file failed: Os { code: 2, kind: NotFound, message: "No such file or directory" }', src/proxy.rs:262:35
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
task join failed: JoinError::Panic(...)

The backtrace is not very helpful.

Reference Code

This is what the relevant code looks like in the actual simulator:

// main.rs
#[tokio::main]
async fn main() -> Result<(), Error> {
    let m = App::from(load_yaml!("cli.yaml")).get_matches();
    let ptys: Vec<&str> = m
        .values_of("ptys")
        .expect("Missing PTY paths")
        .collect();

    let proxies = ptys
        .iter()
        .map(|pty| ServiceProxy::new(*pty))
        .collect();

    // ...
}
// proxy.rs
impl ServiceProxy {
    pub fn new(symlink_path: &str) -> Self {
        let pty = openpty(None, None).expect("openpty failed");
        let cstr = unsafe { CStr::from_ptr(ttyname(pty.slave)) };
        let slave_path = String::from(cstr.to_str().expect("CStr::to_str failed"));
        
        // This call claims success, but no link ever shows up
        symlink(&slave_path, &symlink_path).expect("PTY symlink failed");

        // ...
    }
    // ...
}

impl ops::Drop for ServiceProxy {
    fn drop(&mut self) {
        eprintln!("Drop for {:?}", self);
        close(self.pty.master).expect("close master failed");
        close(self.pty.slave).expect("close slave failed");
        
        // The panic shown earlier comes from here
        remove_file(&self.fspath).expect("remove_file failed");
    }
}

Remarks

I don't think any of these should make a difference, but just in case

  • the simulator is using tokio (async/await futures, tasks, etc);
  • the simulator is working with PTYs instead of "regular" files like the short Rust test/example;
  • a simple Python test script using os.symlink(...) works fine.

Update

I added the following code to the simulator, as a test, right after the symlink call:

if Path::new(&symlink_path).exists() {
    eprintln!("What?!: {}", symlink_path);
}
for p in std::fs::read_dir("/home/<user>/Projects/vpanel").unwrap() {
    eprintln!("{:?}", p.unwrap().path().display());
}

Interestingly, it lists the symlink as being present (irrelevant stuff omitted):

What?!: /home/<user>/Projects/vpanel/ttymxc4
...
"/home/<user>/Projects/vpanel/ttymxc4"
...

However, it's never listed by commands such as ls -la. To make sure there weren't any unexpected remove_file calls, I checked as follows:

$ find src -name '*.rs' | xargs grep remove_file
src/proxy.rs:    fs::remove_file,
src/proxy.rs:        remove_file(&self.fspath).expect("remove_file failed");

The only hit for an actual call in the code base is from the std::ops::Drop implementation. (The top hit is from the use std::{..., fs::remove_file, ...}; block.)

In short, there are no hidden/unexpected/accidental calls to remove_file after the symlink call; there's only the one we already knew about.


1 answer

+3
−0

Summary

I've fixed the issue and what follows is the best explanation I have so far. What I had described in the OP were some of the observations, but I'll be including more detail so that it's (hopefully) easy to follow. This includes forks, COWs, and dead children.

Additional Observations

I got Rust to list the directory contents immediately after the symlink sys-call shown in the OP, with this simple loop:

for p in std::fs::read_dir("/home/<user>/Projects/vpanel").unwrap() {
    println!("{}", p.unwrap().path().display());
}

It actually showed that the sym-link was present in the file system. As expected, the symlink behavior described in the OP was an effect, not the cause.

I discovered that the drop implementation was (unexpectedly) being called twice, which, in short, did not really make sense due to how Rust handles object lifetimes. (I don't call drop on anything manually; the proxy gets destroyed automatically based on Rust's scoping rules.) This unexpected behavior was confirmed in the debugger (hitting the breakpoint twice) and also with a simple line of code that left 2 files behind instead of 1:

// ...
std::process::Command::new("mktemp")
    .arg("drop.XXXXXXX")
    .output()
    .expect("mktemp::drop failed");

However, I never saw more than a single eprintln! message...

Getting to the Point

Originally, the simulator took some CLI options, including the full file system path for the PTY symlink and a command to execute as a child process, e.g. vpanel -p $(pwd)/ttymxc4 -c 'python3 /tmp/script.py'. The main program would fork itself, and the child process would use execvp to replace itself with the command given on the CLI. Here's a code excerpt:

#[tokio::main]
async fn main() {
    // ...
    let pty = forkpty(None, None)?;
    match pty.fork_result.is_parent() {
        true => run_simulator(panel, proxies, sigint).await?,
        false => exec_child(&m),
    }
    close(pty.master)?;
    // ...
}

// ...

fn exec_child(m: &ArgMatches) {
    let args: Vec<CString> = m.value_of("child_cmd")
        .expect("Missing child process command")
        .split(' ')
        .map(|s| CString::new(s).expect("CString from String failed"))
        .collect();
    let argv: Vec<&CStr> = args
        .iter()
        .map(|s| s.as_c_str())
        .collect();
    execvp(argv[0], &argv)
        .expect(format!("execvp failed: {:?}", argv).as_str());
}

Due to some changes in how I was testing things, I stopped using the -c ... option, but the code was still forking and trying to execvp the child, and this has some implications. When a GNU+Linux process forks, the child is a copy of the parent; thanks to copy-on-write (COW), that copy includes the parent's own memory pages.

Actual Cause

The best explanation I currently have is that once I stopped using the -c option, the exec_child function would panic! instead of replacing the process image via execvp. (This much is probably obvious, since there's no CLI command for it to replace itself with.) We never saw the child's output in our shell because we were never directly connected to its I/O streams. However, the child could still "see" the pre-existing proxy object instance created by the parent because of the copied memory pages.

When the child copy panics and exits, the proxy instance it sees goes out of scope during unwinding, and Rust runs its destructor (Drop). This drop is the first (and unexpected) breakpoint hit, and it succeeds in removing the link because the link had just been created and is actually there. This happens almost immediately after launching the main program (likely within microseconds), which explains why I could never see the sym-link in the file system when checking, yet would see it listed from Rust as mentioned previously.

The parent continues to run while this is happening, but when it terminates, it hits the breakpoint a second (expected) time, because it is now its own turn to drop the original proxy instance as it goes out of scope. However, the drop implementation can no longer find the sym-link at this point, because the now-dead child had already, though unexpectedly, removed it during the first drop. Since my shell is connected to the parent's I/O streams, I always got to see that panic! message.

In short: The parent forked a copy of itself as a child. Thanks to COW, the child shared the parent's memory pages and could see the proxy instance responsible for managing the sym-link's life cycle. The child copy failed and exited, causing Rust to drop the child's own copy of the instance. This unexpected drop caused the object to "clean up its own mess", taking the sym-link with it. Some time later, the parent terminated and its own original proxy instance was dropped in turn, causing it to "clean up" a "mess" that had already been cleaned up and no longer existed. This produced the panics shown earlier.

This is why removing the left-over forkpty call (after the -c option had been dropped) actually fixed the issue. It's not an immediately obvious or intuitive error to figure out, and it took some hindsight to understand why it happened in the first place.

The code without the left-over fork looks as follows:

#[tokio::main]
async fn main() {
    // ...
    run_simulator(panel, proxies, sigint).await?;
    // ...
}
