Preprints are a cornerstone of timely and open research dissemination. arXiv, the most widely used preprint service, takes this idea one step further: alongside the PDF, it publishes the LaTeX sources and other files used to create the paper, for approximately 93% of all submissions.
As is known from software repositories like GitHub, making source code publicly available carries the risk of exposing information that was never meant to be public. In this work, we study how much sensitive information is unintentionally disclosed through arXiv source files.
We systematically answer this question for all 2.7 million arXiv submissions with source files (covering all submissions with sources until the end of 2025), across three dimensions: (1) unnecessary files included in the source files, (2) metadata embedded in images and other files, and (3) irrelevant content such as LaTeX comments. Notable findings range from links to editable internal documents and exposed API keys to complete Git histories. While several sanitization tools promise to clean source files before submission, we show they fail to reliably do so. As mitigation, we introduce ALC-NG, a new open-source tool that comprehensively addresses all three dimensions.
arXiv publishes the LaTeX source files of submitted papers alongside the compiled PDF. This strategy is supposed to enable notation reuse and provide learning opportunities for the community. However, making sources publicly available simultaneously introduces three dimensions of unintentional information disclosure. To make matters worse, all submitted versions of an arXiv paper remain permanently accessible, along with their sources. Accordingly, uploading a cleaned replacement does not remove the original, and existing source mirrors (e.g., archive.org) make even author-requested takedowns largely ineffective.
The uploaded sources often include files that are never referenced during compilation: Python scripts, shell scripts, unused bibliographies, configuration files, and even entire Git repositories. These files do not affect the compiled PDF at all, yet they remain publicly downloadable.
It is well known that images and PDFs carry metadata, and the same holds for the files shipped in arXiv source archives. Metadata frequently reports author usernames, email addresses, software used, GPS coordinates from mobile photos, and file modification timestamps. These fields can reveal personal details that authors have no reason to suspect are present.
LaTeX sources frequently contain text that never appears in the compiled PDF: standard comments (lines starting with %), content after \end{document}, non-taken \if branches, and ignored arguments of custom commands. Authors routinely leave behind commented-out text, discussions among co-authors, TODO items, and in serious cases, credentials or publicly-accessible links to private documents.
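To make two of these mechanisms concrete, the following sketch collects line comments and text after \end{document} from a LaTeX string. It is a deliberately simple line-based scan, not the parser used in this work: the function name `naive_hidden_text` and the sample text are illustrative assumptions, and the scan ignores verbatim environments, \if branches, and custom commands.

```python
import re

# A '%' opens a comment only if it is not escaped as '\%'.
# Backslashes directly before it must come in pairs ('\\' is a line break).
COMMENT = re.compile(r"(?<!\\)(?:\\\\)*%(.*)$")
END_DOCUMENT = r"\end{document}"


def naive_hidden_text(source: str) -> tuple[list[str], str]:
    """Return (line comments, text after the first \\end{document})."""
    body, _, tail = source.partition(END_DOCUMENT)
    comments = []
    for line in body.splitlines():
        match = COMMENT.search(line)
        # Skip empty comments, which are commonly used to suppress whitespace.
        if match and match.group(1).strip():
            comments.append(match.group(1).strip())
    return comments, tail.strip()


sample = "\n".join([
    r"\begin{document}",
    r"Accuracy improves by 5\% overall. % TODO: rerun with seed 2",
    r"\end{document}",
    r"Old abstract that never reaches the PDF.",
])
comments, trailing = naive_hidden_text(sample)
```

Note that the escaped `\%` in the sample is correctly left alone, while a `%` inside a verbatim block would be misread as a comment. Cases like that are where line-based scanning becomes unreliable.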
We analyzed all arXiv submissions with available LaTeX source files from 1991 through December 2025: 2.7 million papers in total, covering 93% of all arXiv submissions. For each dimension of “hidden” information, we apply a dedicated detection method chosen for reliability over heuristics.
We attempt to compile each paper with pdflatex and use its built-in recorder option to log which files are accessed. Any file not accessed during compilation is classified as dangling. Results are verified against file system access timestamps. If the compilation is not successful, we rely on a regex-based approach to identify required files.
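The recorder-based classification can be sketched as follows. Running pdflatex with `-recorder` writes a `.fls` file whose lines begin with `PWD`, `INPUT`, or `OUTPUT`; the sketch below assumes that log has already been read into a string. The function names and the example paths are hypothetical, and the real pipeline additionally cross-checks file system timestamps.

```python
import posixpath


def accessed_files(fls_text: str) -> set[str]:
    """Return project-relative paths that pdflatex read, per the .fls log."""
    pwd = None
    accessed = set()
    for line in fls_text.splitlines():
        kind, _, path = line.partition(" ")
        if kind == "PWD":
            pwd = path.rstrip("/")
        elif kind == "INPUT" and pwd is not None:
            # Relative entries are resolved against the working directory.
            full = posixpath.normpath(posixpath.join(pwd, path))
            # Ignore files outside the project, e.g. TeX Live class files.
            if full.startswith(pwd + "/"):
                accessed.add(posixpath.relpath(full, pwd))
    return accessed


def dangling_files(all_files: list[str], fls_text: str) -> list[str]:
    """Files shipped in the archive that were never read during compilation."""
    return sorted(set(all_files) - accessed_files(fls_text))


fls_log = "\n".join([
    "PWD /work/paper",
    "INPUT /usr/share/texmf/tex/latex/base/article.cls",
    "INPUT main.tex",
    "INPUT ./figs/plot.pdf",
    "OUTPUT main.log",
])
archive = ["main.tex", "figs/plot.pdf", "analysis.py", ".git/config"]
dangling = dangling_files(archive, fls_log)
```

In this example, `analysis.py` and `.git/config` are reported as dangling because no `INPUT` line refers to them.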
We apply exiftool to all non-TeX files to extract embedded metadata, such as author names, GPS coordinates, software versions, timestamps, and more.
Instead of fragile regex heuristics, we parse LaTeX using an abstract syntax tree grammar. This approach reliably identifies all four mechanisms by which content can be hidden: (A) standard line comments, (B) content outside \end{document}, (C) non-taken \if branches, and (D) ignored arguments of custom commands.
We apply a locally hosted Qwen2.5-72B model to classify 113 million unique comments by content type and to flag potentially sensitive material. Because the model runs locally, no sensitive content is transmitted to third parties.
Targeted regular expressions search all source files for concrete sensitive content: API keys, passwords, private keys, URLs to internal documents, and other credentials. Subsequently, matches undergo manual validation to confirm true positives.
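A minimal sketch of such a pattern scan is shown below. The three patterns are illustrative assumptions, not the expressions used in this study: an AWS-style access key ID, a PEM private key header, and a Google Docs document URL. Every hit is only a candidate and would still require the manual validation described above.

```python
import re

# Illustrative patterns only; a real scan would use a much larger set.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
    "google_docs_url": re.compile(
        r"https://docs\.google\.com/document/d/[A-Za-z0-9_-]+"
    ),
}


def find_candidates(text: str) -> list[tuple[str, str]]:
    """Return (pattern name, matched text) pairs for manual review."""
    hits = []
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((name, match.group(0)))
    return hits


sample = (
    "% key for the eval script: AKIAIOSFODNN7EXAMPLE\n"
    "% notes: https://docs.google.com/document/d/abc_123-XYZ/edit\n"
)
candidates = find_candidates(sample)
```

The key in the sample is the placeholder value from AWS documentation, not a real credential.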
We evaluate existing sanitization tools alongside our own ALC-NG against nine test cases, covering the most prevalent methods to insert comments in LaTeX source files. This assessment highlights that most tools have shortcomings of some kind, disregarding certain methods or even breaking source compilation after attempting sanitization.
To focus on content that originates from a paper’s own authors rather than shared templates, we define “interesting” content as any item appearing in at most two submissions. We were able to successfully compile 85% of papers (2.32 million) in our TeX Live 2023 evaluation environment.
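The uniqueness criterion can be expressed in a few lines. The sketch below assumes each submission has already been reduced to a list of extracted items (comments, metadata values, or file hashes). The function name and sample data are hypothetical, and any normalization applied before counting is not specified here.

```python
from collections import defaultdict


def interesting_items(
    items_by_submission: dict[str, list[str]], max_submissions: int = 2
) -> set[str]:
    """Keep items that occur in at most `max_submissions` distinct submissions."""
    seen_in = defaultdict(set)
    for submission_id, items in items_by_submission.items():
        for item in items:
            # A set counts each submission once, even if the item repeats in it.
            seen_in[item].add(submission_id)
    return {
        item for item, subs in seen_in.items() if len(subs) <= max_submissions
    }


corpus = {
    "paper-a": ["% template header", "% TODO fix proof"],
    "paper-b": ["% template header", "% TODO fix proof"],
    "paper-c": ["% template header", "% ask Bob about Fig. 3"],
}
unique = interesting_items(corpus)
```

Here the template header appears in three submissions and is filtered out, while the two author-specific comments remain.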
Across all three dimensions, hidden information is widespread. Focusing on unique content likely originating from each paper’s own authors: 75% of submissions contain unique interesting comments, 56% contain unique metadata, and 25% have unique dangling files. Combined, 88% are affected in at least one dimension. The most common combination is comments and metadata (30%); 18% of submissions are affected in all three dimensions simultaneously.
We identify 12 million dangling files across the dataset. Most are benign leftovers (template style files, unused images), but a long tail of rare file types is predominantly dangling and carries genuine risk:
| File Type | Submissions Affected | Risk |
|---|---|---|
| README files | 3.4% | Project structure, collaborator info |
| Python scripts | 0.21% | Analysis code, possibly with hardcoded secrets |
| Shell scripts | 0.21% | Build workflows, server names |
| CSV data files | 0.20% | Unpublished research data |
| Config files | 2,774 submissions | API keys, database connection strings |
| Git repositories | 74 submissions | Full editing history, potentially with credentials or removed content |
| NFS metadata files | 40 submissions | Filesystem artifacts confirming authors tried to delete these files |
Bibliography files are a special case. They are redundant once the compiled .bbl file exists, yet 93% of submitted .bib files contain entries not referenced in the compiled document. Of these: 8.4% of submissions reuse a bibliography from a prior paper verbatim; 5.5% include large community-wide bibliography databases with more than 500 entries. Beyond unused entries, .bib files also carry 42 million non-standard extra fields, of which 29 million appear unique to individual submissions, often containing author ratings, editorial notes, or local file paths.
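One way to spot unreferenced entries is to compare the keys defined in a .bib file with the keys that made it into the compiled .bbl file. The sketch below assumes a classic BibTeX-style .bbl with `\bibitem` lines (biblatex writes a different format) and uses simple regular expressions rather than a full BibTeX parser; all names are illustrative.

```python
import re

# '@type{key,' at the start of an entry; @string/@preamble/@comment are skipped.
BIB_ENTRY = re.compile(r"@(\w+)\s*\{\s*([^,\s{}]+)\s*,")
# '\bibitem{key}' or '\bibitem[label]{key}' in a BibTeX-generated .bbl file.
BBL_ITEM = re.compile(r"\\bibitem(?:\[[^\]]*\])?\{([^}]+)\}")
NON_ENTRIES = {"string", "preamble", "comment"}


def unused_bib_keys(bib_text: str, bbl_text: str) -> list[str]:
    """Return .bib keys that do not appear in the compiled bibliography."""
    defined = {
        key
        for kind, key in BIB_ENTRY.findall(bib_text)
        if kind.lower() not in NON_ENTRIES
    }
    cited = set(BBL_ITEM.findall(bbl_text))
    return sorted(defined - cited)


bib = "\n".join([
    "@article{smith2020,",
    "  title = {A Cited Paper}",
    "}",
    "@book{doe1999, title = {An Unused Book}}",
    '@string{jan = "January"}',
])
bbl = r"\bibitem[Smith(2020)]{smith2020}" + "\nSmith. A Cited Paper.\n"
unused = unused_bib_keys(bib, bbl)
```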
Template files can reveal submission histories. Conference-specific style files (e.g., usenix.sty) in dangling positions might hint at where a paper was submitted before. In a sample of 115 submissions with dangling USENIX templates matched via dblp, 77% were ultimately published elsewhere. We identified 47 venues and 561 submissions potentially traceable this way.
90% of submissions include files with modification timestamps. Of these, 72% show their most recent file modification within one hour of the arXiv-recorded submission time (35% within five minutes), revealing fine-grained working patterns. Additionally, 93% of all images and 93% of all included PDFs carry exiftool-extractable metadata. Key disclosures beyond what is already visible in the compiled PDF:
11% of submissions expose system usernames via image or file metadata. In double-blind reviewing, this information could unblind authors.
7,326 submissions carry GPS metadata in images. 2,238 contain multiple distinct locations. 235 of those span distances up to 50 km, a range consistent with commuting between workplace and home. In a random sample of ten such submissions, nine included coordinates for both research institutions and residential areas.
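The distance between two GPS positions can be approximated with the haversine formula. The sketch below is an assumption about how such spans could be computed, not a description of the method used in this study; coordinates are (latitude, longitude) pairs in decimal degrees, and the function names are illustrative.

```python
import math
from itertools import combinations

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = (
        math.sin(dlat / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def max_span_km(points: list[tuple[float, float]]) -> float:
    """Largest pairwise distance among all locations found in one submission."""
    return max(
        (haversine_km(p, q) for p, q in combinations(points, 2)), default=0.0
    )
```

A submission whose images yield a maximum span of a few tens of kilometres would fall into the commuting-range group described above.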
60% disclose the software used to create files (e.g., application name and version). This is the most frequent metadata category, though usually of comparatively low sensitivity.
Using our AST-based parser, we extract 644 million pieces of irrelevant content (comments, unused branches, post-document text) across 95% of all submissions. After filtering for uniqueness, 114 million “interesting” items remain, present in 75% of submissions. A locally-hosted LLM classifies 113 million of these (macro F1: 0.83) to reveal what authors actually leave behind. Our structured parsing approach constitutes a reliable lower bound: it minimizes false positives while covering all commonly used comment mechanisms.
| Comment Category | Share of Comments | Submissions Affected | Example Content |
|---|---|---|---|
| Academic text | 44% | 76% | Commented-out paragraphs, alternative phrasings, removed sections |
| LaTeX markup | 30% | 92% | Disabled table code, commented math, inert formatting blocks |
| Bibliography notes | 9.1% | 62% | Notes on references, annotation fields |
| Conversational | 7.9% | 72% | Informal co-author exchanges, critiques of text quality, strategic framing decisions |
| Formatting aids | 1.7% | 64% | Section separators, visual markers |
| TODO items | 0.98% | 44% | Unfinished tasks, sometimes acknowledging methodological weaknesses not in the published text |
| Potentially sensitive (LLM-flagged) | 7.0% | — | PII, opinions about reviewers, credentials, personal discussions |
Based on bootstrapping (95% CI), the LLM estimates that at least 3.6% of comments contain sensitive content. This statement should be treated carefully, though: sensitivity perception varies by individual, and the model is deliberately conservative to minimize false positives.
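For readers unfamiliar with the technique, a percentile bootstrap for a proportion can be sketched as follows. This is a generic illustration under stated assumptions, not the exact procedure used here: `labels` is a hypothetical list of 0/1 sensitivity flags for a sample of comments, and the resample count and seed are arbitrary choices.

```python
import random


def bootstrap_ci(
    labels: list[int], n_resamples: int = 1000, alpha: float = 0.05, seed: int = 0
) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for the share of 1-labels."""
    rng = random.Random(seed)
    n = len(labels)
    # Resample with replacement and record the flagged share each time.
    rates = sorted(
        sum(rng.choices(labels, k=n)) / n for _ in range(n_resamples)
    )
    lower = rates[round(alpha / 2 * n_resamples)]
    upper = rates[round((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical sample: 7 of 100 comments flagged as sensitive.
sample_labels = [1] * 7 + [0] * 93
lower, upper = bootstrap_ci(sample_labels)
```

Reporting the lower end of such an interval yields a conservative "at least" statement of the kind given above.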
Beyond semantic classification, targeted pattern matching (with manual validation) identifies concrete sensitive content:
265 API tokens in 128 submissions (26% in dangling files, the rest in comments); 4 private keys in 4 submissions; 171 passwords in 82 submissions (25% in dangling files). Their placement in commented parts or dangling files confirms that the authors most likely did not intend to share them publicly.
3,948 submissions link to Google Docs. Of these, at least 1,119 grant viewing access and 699 grant editing access. Through manual analysis, we identify at least 200 cases exposing sensitive content: peer-review materials, cover letters, rebuttals, meeting minutes with Zoom links, student assignments, and shift schedules. Additionally, 4,272 submissions link to Google Drive; 26,000 to Overleaf projects.
127 million unique URLs across 905,000 submissions. 29,000 submissions reference FTP URLs; 349 reference SSH connection strings. Publicly-accessible FTP servers still online may serve sensitive content or expose internal infrastructure.
18 cases where discovered Google Docs links led to live survey data of study participants, a potential ethics and data protection violation for the researchers involved.
162 submissions contain prompt-injection-style patterns in commented sections, likely intended to influence AI-assisted peer review. 537 submissions embed common AI-generated text disclaimers (e.g., “As an AI language model”) in comments, indicating generated content.
3,200 submissions contain review-related keywords (“reviewer,” “rebuttal,” “not correct”) in commented sections, likely responses to peer feedback that authors forgot to remove. 269 submissions use LaTeX censoring packages (e.g., censor, pdfprivacy), suggesting attempted but ineffective sanitization.
Computer science (CS) papers contain significantly more “hidden” information than papers from other disciplines, even after normalizing for source file size (Mann-Whitney U, p < 0.0001). Within CS, security papers exceed the CS average, and A*/A-ranked security venue papers exceed lower-ranked ones. This finding likely reflects more elaborate source archives (more files, larger bibliographies, more co-authors) rather than greater carelessness.
arXiv allows submitters to upload new versions of a paper over time. 1.1 million submissions in our dataset have multiple versions (4.3 million versions in total). One might hope that authors use version updates to remove accidentally disclosed content. At scale, however, we find no consistent indication that subsequent versions reduce the amount of hidden information. If anything, later versions tend to add new unique content first, and only remove it in the long tail.
Individual examples confirm this trend: one submission initially containing many valid OpenID tokens removed them in v2, yet the original v1 remains permanently accessible. Similarly, seven respondents to our survey reported uploading cleaned replacement versions without realizing that old versions and their sources remain downloadable.
Several sanitization tools promise to “clean” LaTeX sources before arXiv submission. We test all of them against nine test cases covering relevant methods for embedding comments in LaTeX. None of the evaluated tools performs satisfactorily, and critically, all tools ignore the metadata dimension entirely.
| Tool | Dangling Files | Metadata | Comment Coverage | Papers Benefited | Sources Broken |
|---|---|---|---|---|---|
| Perl one-liner (arXiv, 2005) | None | None | Partial | 95% | 4.2% |
| arxiv_latex_cleaner (ALC, Google) | Partial | None | Partial; mishandles special environments | 91% | 19% |
| latexindent.pl | None | None | Partial | 3.7% | 9.2% |
| arXiv Cleaner (Elsa-Lab) | Partial | None | Partial; crashes on comment environments | 80% | 80% |
| ALC-NG (this work) | Complete | Complete | Exhaustive | 98% | 15% |
The most widely used tool, arxiv_latex_cleaner (ALC), produces pixel-perfect PDFs in 88% of compilable cases; ALC-NG in 87%. Despite arXiv endorsing ALC since its release in 2018, we estimate only 538 submissions (0.10%) show evidence of having been run through it. ALC also leaves residual content: after cleaning, we still find 362 sensitive regex matches in comments it failed to remove, including links to Overleaf. Partial cleaning can thus create a false sense of security.
Beyond privacy, unnecessarily submitted content costs arXiv permanent storage. Accurate sanitization of the latest versions alone could free 658 GB (22.9%) of their source storage. Considering all versions, the savings amount to 2,720 GB (23.3%), of which 97% stems from dangling files.
Given the systematic shortcomings of existing tools, we develop and open-source ALC-NG. To the best of our knowledge, it is the first tool to address all three dimensions of hidden information, combining established best practices for each.
Uses pdflatex’s built-in file access log to identify exactly which files are needed for compilation. No heuristics. Validated against file system access timestamps; we do not observe any discrepancies. It further respects arXiv-specific conventions: 00README files and /anc/ directories for intentional ancillary files.
Strips embedded metadata from images and PDFs across all supported file types. Optionally suppresses file modification timestamps. The first sanitization tool that also accounts for the metadata dimension.
Re-purposes the same AST-based detection used in our analysis to reliably identify and remove all considered types of irrelevant LaTeX content. It passes all nine synthetic comment cleanup tests; no existing tool achieves this coverage.
ALC-NG can validate its own output: optionally, it compares the original and sanitized PDFs pixel-by-pixel using pdfium to confirm that no visible content was altered. It produces pixel-perfect output in 87% of compilable cases, benefits 98% of submissions, and breaks only 15% of sources, comparable to or better than all existing tools on each metric.
We disclosed validated findings (exposed credentials, accessible documents) to 2,660 authors of affected papers and invited them to a survey on awareness and practices around arXiv source files. We received 112 complete responses (5.3% response rate). 29% identified as assistant professors, 16% full professors, 28% permanent research staff, and 18% postdocs and PhD students. 49% reported more than ten arXiv submissions.
Only 41% were already aware that arXiv publishes source files and of the associated risks before receiving our disclosure notice.
43% agreed that the disclosed content is sensitive. Conversational comments were seen as sensitive by 39%; TODO items by 38%. Respondents who reported no sensitive content had significantly fewer actual issues (Mann-Whitney U, p ≤ 0.03).
Only 28% had heard of cleaning tools; 11% used one regularly; 8.9% applied manual cleaning before submission. Seven authors uploaded cleaned versions without realizing that old versions remain accessible. 51 authors replied positively via email; several restricted access to exposed online documents.
We acknowledge potential response bias: participants were primed by our disclosure notice immediately before the survey, and authors who did not consider their findings sensitive were less likely to respond. All research was conducted under institutional ethics review. The LLM analysis used a locally-hosted model. Web resources were accessed read-only with strict rate limiting; no credentials were used or tested. We publish our findings without paper identifiers to protect affected individuals.
Remove all comments and dangling files from source files prior to distributing them, e.g., before submission to arXiv, preferably using a sanitization tool with good coverage such as ALC-NG. This step should include removing sensitive metadata from included PDF files and images.
Affected authors are likely unable to fully remove unintentionally-disclosed information due to existing source file mirrors. Consequently, corrective actions should concentrate on revoking permissions for exposed API keys and enforcing access control for affected URLs.
To protect affiliated authors and intellectual property, institutions should offer guidelines outlining recommended practice and establish comprehensive policies on the release of LaTeX sources to preprint repositories such as arXiv.
Complementing efforts by authors and institutions, preprint repositories should explicitly raise awareness of the risks resulting from “hidden” information and enforce sensible behavior.
By incorporating checks on irrelevant content that sanitization tools would remove, repositories could issue warnings for detected potentially sensitive “hidden” information in uploaded source files and automatically discard dangling files.
For already-published source files, preprint repositories could proactively check for indicators of sensitive “hidden” information and work with authors to expurgate affected submissions.