Welcome to Software Development on Codidact!
Will you help us build our independent community of developers helping developers? We're small and trying to grow. We welcome questions about all aspects of software development, from design to code to QA and more. Got questions? Got answers? Got code you'd like someone to review? Please join us.
Transferring files from a legacy project to an existing one as varbinary
Our team is currently transferring all functionality (+ some changes) from small and very old project A (almost code freeze) to project B (actively developed). As part of the data migration, there are about 10GB of files to be transferred.
Currently, project B stores files as varbinary in the database (not OK, will be migrated in a couple of months to an internal storage that provides an API). The files migration boils down to migrating physical files of project A to varbinary in project B.
The projects reside on different web servers which access different SQL Server instances. What I know for sure now is that a linked server is configured between the two SQL Server instances.
Project A and B will run side by side for a little while, so I need to support differential migration of data (real-time not required though).
Project A's files reside on the webserver and developers are not allowed to access any production server, except through the application's APIs.
Installing a new application on the production requires a lot of headaches (all apps are installed through an old-school pipeline), so the solution should involve using the existing applications.
What I have in mind for the migration:
- add an endpoint in application A that syncs the files in project A's database
- add an endpoint in application B that sync the files in its database using the existing linked server
An alternative that should be easier to maintain would be for application B to be able to read files directly from webserver A, but I am not sure that the security rules allow it.
Any ideas about how to improve this process? Or possible pitfalls? The only thing that concerns me is the transfer of large data over the linked server.
1 answer
The approach I would take given the constraints you've stated is to make much simpler and safer changes to Project A. Namely, 1) provide an API endpoint for fetching a file, and 2) provide an API endpoint, if needed, that takes some kind of "timestamp" (real or logical) and returns a list of files (optionally with some kind of checksum e.g. a MD5 hash) that have been added since that "timestamp" including any other metadata that's required.
As you've likely already gathered, Project B can then just fetch the list and then fetch files on its own schedule. It can refetch the list whenever it reaches the end of the current list or at any other time, e.g. if the Project B application is restarted.
The benefits of this approach is that it doesn't require any schema changes to Project A (assuming there's something already that can serve as a "timestamp"), and it's purely read-only from Project A. This avoids bloating Project A's database with file data that is not used by Project A. It also lessens risk of corrupting or breaking Project A, or even of needing downtime for it. It may also simplify throttling to avoid the "syncing" from being disruptive, if necessary.
On Project B's side, it's arguably less coupling to Project A. While accessing the file list endpoint is specialized to Project A, the file data endpoint can be generic and useful for other purposes. Even for syncing with Project A, it may be handy to be able to pull individual files on-demand outside of some background "syncing" process. Indeed, it's possible you could drop the file list endpoint entirely and do the file copying entirely on-demand.
This approach is basically the "read files directly from Project A's server" just mediated by Project A.
There may be different security requirements needed for these endpoints. You might be able to do something like IP address filtering since only Project B's web server needs access to these endpoints, though you can certainly use the same authorization mechanism you use for the other endpoints with just a "Project B" user or whatever.
This approach relies on the fact that the files are immutable. More coordination and complexity would be needed otherwise.
1 comment thread