Welcome to Software Development on Codidact!
Why is this symlink() call returning successfully while apparently failing to create the sym-link?
Summary
I'm building an internal system (hardware simulator, using Rust) to help test some Python-based services that talk to hardware. (The services talk to hardware via TTYs.) To trick the Python services into "believing" they're talking to the hardware they expect, I create some PTYs, where the master side is used by my simulator and the slave side of the PTY is given over to the Python-based service.
Since the Python service looks for its PTY by using a specific name, I create a symbolic link that points back to the slave PTY mentioned above, but using the name the Python service expects -- or so I thought.
The problem is that the `symlink` function reports success, but no link actually exists in the file system. I'm not yet sure what I may be missing.
My question is: Does anyone know why this is happening and how to fix it?
Details
I'm using the `std::os::unix::fs::symlink` function to create the sym-link. The function's documentation is clear about its usage, e.g. `symlink("a.txt", "b.txt")` would create a sym-link called `b.txt` that points back to `a.txt`. (Assuming `a.txt` already exists.)
When I use the function in the simulator, I observe the following:
- The function call returns successfully, and
- No actual sym-link can be found in the file system.
I wrote a simple test program (below) to verify my usage, and it works fine:
```rust
use std::os::unix::fs::symlink;

fn main() {
    symlink("/tmp/src.txt", "/tmp/dst.txt").expect("symlink failed");
}
```
The above shows that I'm using it correctly, with `dst.txt` being the link that points back to `src.txt`. Also, the above would've failed if I had the arguments inverted or didn't have the correct permissions, so it's not that, either.
I also ran the simulator under `strace` to check what the lower-level syscall was actually doing (e.g. maybe that was failing but the Rust `std` library was not handling the error?), but it shows a successful return code:

```
symlink("/dev/pts/5", "/home/<user>/Projects/vpanel/ttymxc4") = 0
```
This is despite the fact that the symlink `/home/<user>/Projects/vpanel/ttymxc4` does not really exist. A simple visual inspection with the `ls` command in the above directory does not show the file, the Python service complains that it cannot find the symlink to its PTY, and the simulator itself reports a `panic!` when it tries to clean up after itself but fails to find the symlink to remove it:
```
Drop for ServiceProxy { pty: OpenptyResult { master: 9, slave: 10 }, fspath: "/home/<user>/Projects/vpanel/ttymxc4", timeout: 8s }
thread 'tokio-runtime-worker' panicked at 'remove_file failed: Os { code: 2, kind: NotFound, message: "No such file or directory" }', src/proxy.rs:262:35
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
task join failed: JoinError::Panic(...)
```
The backtrace is not very helpful.
Reference Code
This is what the relevant code looks like in the actual simulator:
```rust
// main.rs
#[tokio::main]
async fn main() -> Result<(), Error> {
    let m = App::from(load_yaml!("cli.yaml")).get_matches();
    let ptys: Vec<&str> = m
        .values_of("ptys")
        .expect("Missing PTY paths")
        .collect();
    let proxies = ptys
        .iter()
        .map(|pty| ServiceProxy::new(*pty))
        .collect();
    // ...
}
```
```rust
// proxy.rs
impl ServiceProxy {
    pub fn new(symlink_path: &str) -> Self {
        let pty = openpty(None, None).expect("openpty failed");
        let cstr = unsafe { CStr::from_ptr(ttyname(pty.slave)) };
        let slave_path = String::from(cstr.to_str().expect("CStr::to_str failed"));
        // This call claims success, but no link ever shows up
        symlink(&slave_path, &symlink_path).expect("PTY symlink failed");
        // ...
    }
    // ...
}
```
```rust
impl ops::Drop for ServiceProxy {
    fn drop(&mut self) {
        eprintln!("Drop for {:?}", self);
        close(self.pty.master).expect("close master failed");
        close(self.pty.slave).expect("close slave failed");
        // The panic shown earlier comes from here
        remove_file(&self.fspath).expect("remove_file failed");
    }
}
```
Remarks
I don't think any of these should make a difference, but just in case:

- the simulator is using `tokio` (async/await futures, tasks, etc.);
- the simulator is working with PTYs instead of "regular" files like the short Rust test/example;
- a simple Python test script using `os.symlink(...)` works fine.
Update
I added the following code to the simulator, as a test, right after the `symlink` call:

```rust
if Path::new(&symlink_path).exists() {
    eprintln!("What?!: {}", symlink_path);
}
for p in std::fs::read_dir("/home/<user>/Projects/vpanel").unwrap() {
    eprintln!("{:?}", p.unwrap().path().display());
}
```
Interestingly, it lists the symlink as being present (irrelevant entries omitted):

```
What?!: /home/<user>/Projects/vpanel/ttymxc4
...
"/home/<user>/Projects/vpanel/ttymxc4"
...
```
However, it's never listed by commands such as `ls -la` or anything. To make sure that there weren't any unexpected `remove_file` calls, I checked as follows:

```shell
$ find src -name '*.rs' | xargs grep remove_file
src/proxy.rs:    fs::remove_file,
src/proxy.rs:    remove_file(&self.fspath).expect("remove_file failed");
```

The only hit for an actual call in the code base is from the `std::ops::Drop` implementation. (The top hit is from the `use std::{..., fs::remove_file, ...};` block.)
In short, there are no hidden/unexpected/accidental calls to `remove_file` after the `symlink` call. There's only the one we already knew about.
1 answer
Summary
I've fixed the issue and what follows is the best explanation I have so far. What I had described in the OP were some of the observations, but I'll be including more detail so that it's (hopefully) easy to follow. This includes fork
s, COWs, and dead children.
Additional Observations
I got Rust to list the directory contents immediately after the `symlink` sys-call shown in the OP, with this simple loop:

```rust
for p in std::fs::read_dir("/home/<user>/Projects/vpanel").unwrap() {
    println!("{}", p.unwrap().path().display());
}
```
It actually showed that the sym-link was present in the file system. As expected, the `symlink` behavior described in the OP was an effect -- not the cause.

I discovered that the `drop` implementation was (unexpectedly) being called twice, which, in short, did not really make sense given how Rust handles object lifetimes. (I don't call `drop` on anything manually; the proxy gets destroyed automatically based on Rust's scoping rules.) This unexpected behavior was confirmed in the debugger (hitting the breakpoint twice) and also with a simple line of code that left 2 files behind instead of 1:
```rust
// ...
std::process::Command::new("mktemp")
    .arg("drop.XXXXXXX")
    .output()
    .expect("mktemp::drop failed");
```
However, I never saw more than a single `eprintln!` message...
Getting to the Point
Originally, the simulator took some CLI options, including the full file system path for the PTY symlink and also a command to execute as a child process, e.g. `vpanel -p $(pwd)/ttymxc4 -c 'python3 /tmp/script.py'`. The main program would `fork` itself, and the child process would use `execvp` to replace itself with the command given via `-c` on the CLI. Here's a code excerpt:
```rust
#[tokio::main]
async fn main() {
    // ...
    let pty = forkpty(None, None)?;
    match pty.fork_result.is_parent() {
        true => run_simulator(panel, proxies, sigint).await?,
        false => exec_child(&m),
    }
    close(pty.master)?;
    // ...
}

// ...

fn exec_child(m: &ArgMatches) {
    let args: Vec<CString> = m
        .value_of("child_cmd")
        .expect("Missing child process command")
        .split(' ')
        .map(|s| CString::new(s).expect("CString from String failed"))
        .collect();
    let argv: Vec<&CStr> = args
        .iter()
        .map(|s| s.as_c_str())
        .collect();
    execvp(argv[0], &argv)
        .expect(format!("execvp failed: {:?}", argv).as_str());
}
```
Due to some changes in how I was testing things, I stopped using the `-c ...` option, but the code was still `fork`ing and trying to `execvp` the child, and this has some implications. When a GNU+Linux process `fork`s, the child process is a copy of the parent and, due to COW, that includes the parent's own memory pages.
Actual Cause
The best explanation I currently have is that when I stopped using the `-c` option, the `exec_child` function would `panic!` instead of replacing itself with the actual command's code on `execvp`. (This much is probably obvious, since there's no CLI command for it to replace itself with.) We never saw the child's I/O in our shell because we were never directly connected to its I/O streams. However, the child would still "see" the pre-existing proxy object instance created by the parent b/c of its COW copies of the parent's memory pages.
When the child copy panics and exits, the object instance it sees also goes out of scope, and Rust's drop rules decide it's time to `drop` it. This `drop` is the first (and unexpected) breakpoint hit, and it succeeds in removing the link b/c the link had just been created and was actually there. This happens almost immediately after launching the main program (likely within microseconds), explaining why I could never see the sym-link in the file system when checking, but would see it listed from Rust as mentioned previously.
The parent continues to run while this is happening, but when it wants to terminate, it hits the breakpoint a second (and expected) time b/c it's its own turn to `drop` the (original) proxy instance as it goes out of scope. However, the `drop` implementation can no longer find the sym-link at this point, b/c that had been successfully, though unexpectedly, removed by the now-dead child on the first `drop`. Since my shell is connected to the parent's I/O streams, I always got to see that `panic!` message.
In short: The parent `fork`ed a copy of itself as a child. Thanks to COW, the child got its own copy of the parent's memory and could see the proxy object instance responsible for managing the sym-link's life cycle. The child copy failed and exited, causing Rust to `drop` the child's own copy of the instance. This unexpected `drop` then caused the object to "clean up its own mess", taking the sym-link with it. Some time later, the parent would terminate and its own original proxy instance would get `drop`ped, causing it to also "clean up" a "mess" that had already been cleaned up and no longer existed. This produced the documented panics.
This is why removing the left-over `forkpty` call (left behind after the removal of the `-c` option) actually fixed the issue. It's not an immediately obvious or intuitive error to figure out, and it required some hindsight knowledge to understand why this even happened in the first place.

The code without the left-over `fork` looks as follows:
```rust
#[tokio::main]
async fn main() {
    // ...
    run_simulator(panel, proxies, sigint).await?;
    // ...
}
```