Hidden Secrets in the arXiv

Abstract

arXiv Source Files (Unintentionally) Distribute "Hidden" Information

Preprints are a cornerstone of timely and open research dissemination. arXiv, the most widely used preprint service, takes this idea one step further: alongside the PDF, it publishes the LaTeX sources and other files used to create the paper, for approximately 93% of all submissions.

As is known from software repositories like GitHub, making source code publicly available carries the risk of exposing information that was never meant to be public. In this work, we study how much sensitive information is unintentionally disclosed through arXiv source files.

We systematically answer this question for all 2.7 million arXiv submissions with source files (covering all submissions with sources until the end of 2025), across three dimensions: (1) unnecessary files included in the source files, (2) metadata embedded in images and other files, and (3) irrelevant content such as LaTeX comments. Notable findings range from links to editable internal documents and exposed API keys to complete Git histories. While several sanitization tools promise to clean source files before submission, we show they fail to reliably do so. As mitigation, we introduce ALC-NG, a new open-source tool that comprehensively addresses all three dimensions.

The Problem

Three Ways Source Files Expose Hidden Content

arXiv publishes the LaTeX source files of submitted papers alongside the compiled PDF. This strategy is supposed to enable notation reuse and provide learning opportunities for the community. However, making sources publicly available simultaneously introduces three dimensions of unintentional information disclosure. To make matters worse, all submitted versions of an arXiv paper remain permanently accessible, along with their sources. Accordingly, uploading a cleaned replacement does not remove the original, and existing source mirrors (e.g., archive.org) make even author-requested takedowns largely ineffective.

Dimension 1

Dangling Files

The uploaded sources often include files that are never referenced during compilation: Python scripts, shell scripts, unused bibliographies, configuration files, and even entire Git repositories. These files do not affect the compiled PDF at all, yet they remain publicly downloadable.

Dimension 2

Embedded Metadata

That images and PDFs carry metadata is well known, and this applies equally to the files shipped in arXiv sources. Metadata frequently reports author usernames, email addresses, software used, GPS coordinates from mobile photos, and file modification timestamps. These details can reveal personal information that authors have no reason to suspect is present.

Dimension 3

Irrelevant Content

LaTeX sources frequently contain text that never appears in the compiled PDF: standard comments (lines starting with %), content after \end{document}, non-taken \if branches, and ignored arguments of custom commands. Authors routinely leave behind commented-out text, discussions among co-authors, TODO items, and, in serious cases, credentials or publicly accessible links to private documents.

Methodology

All 2.7 Million arXiv Papers with Sources, Analyzed

We analyzed all arXiv submissions with available LaTeX source files from 1991 through December 2025: 2.7 million papers in total, covering 93% of all arXiv submissions. For each dimension of “hidden” information, we apply a dedicated detection method chosen for reliability over heuristics.

Dangling Files

pdflatex Recorder

We attempt to compile each paper with pdflatex and use its built-in recorder option to log which files are accessed. Any file not accessed during compilation is classified as dangling. Results are verified against file system access timestamps. If the compilation is not successful, we rely on a regex-based approach to identify required files.
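The recorder-based check can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: it assumes the paper was compiled with `pdflatex -recorder`, that the resulting `.fls` log lists each file read on a line starting with `INPUT`, and that files belonging to the upload appear with relative paths. The function names and the sample log are hypothetical.

```python
from pathlib import PurePosixPath


def files_read_during_build(fls_text: str) -> set[str]:
    """Collect relative paths from the INPUT lines of a recorder (.fls) log."""
    used = set()
    for line in fls_text.splitlines():
        if line.startswith("INPUT "):
            path = line[len("INPUT "):].strip()
            # Assumption: absolute paths point into the TeX distribution,
            # not into the uploaded source archive.
            if not path.startswith("/"):
                # PurePosixPath drops a leading "./" so paths compare equal.
                used.add(str(PurePosixPath(path)))
    return used


def dangling_files(uploaded: list[str], fls_text: str) -> list[str]:
    """Return uploaded files that the recorder log never lists as read."""
    used = files_read_during_build(fls_text)
    return sorted(f for f in uploaded if str(PurePosixPath(f)) not in used)


# Hand-written sample log and file listing, for illustration only.
example_fls = """PWD /tmp/paper
INPUT /usr/share/texlive/texmf-dist/tex/latex/base/article.cls
INPUT main.tex
INPUT ./figures/plot.pdf
OUTPUT main.pdf
"""
uploaded_files = ["main.tex", "figures/plot.pdf", "run_experiments.py", "notes/todo.txt"]
print(dangling_files(uploaded_files, example_fls))
```

Files that the build never reads, here the script and the notes file, are the ones reported as dangling.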

Metadata

Exiftool

We apply exiftool to all non-TeX files to extract embedded metadata, such as author names, GPS coordinates, software versions, timestamps, and more.
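As a rough sketch of how such output can be screened, the snippet below filters the JSON that `exiftool -json <file>` prints (an array with one object per file, each carrying a `SourceFile` key). The set of tags treated as sensitive is an assumption made for this example; the tag list used in the study is not given here, and the sample record is hand-written.

```python
import json

# Illustrative tag selection (an assumption, not the study's list).
SENSITIVE_TAGS = {
    "Author", "Creator", "GPSLatitude", "GPSLongitude",
    "Software", "Producer", "ModifyDate",
}


def sensitive_metadata(exiftool_json: str) -> dict[str, dict[str, object]]:
    """Map each file to the sensitive tags found in `exiftool -json` output."""
    findings = {}
    for record in json.loads(exiftool_json):
        hits = {tag: value for tag, value in record.items() if tag in SENSITIVE_TAGS}
        if hits:
            findings[record["SourceFile"]] = hits
    return findings


# Hand-written sample in the shape exiftool emits.
sample = json.dumps([
    {"SourceFile": "figures/setup.jpg", "Author": "jdoe",
     "GPSLatitude": "52 deg 31' 12.00\" N", "ImageWidth": 4032},
    {"SourceFile": "figures/plot.png", "ImageWidth": 800},
])
print(sensitive_metadata(sample))
```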

Contents

Tree-sitter AST Grammar

Instead of fragile regex heuristics, we parse LaTeX using an abstract syntax tree grammar. This approach reliably identifies all four mechanisms by which content can be hidden: (A) standard line comments, (B) content outside \end{document}, (C) non-taken \if branches, and (D) ignored arguments of custom commands.
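To make mechanisms (A) and (B) concrete, here is a deliberately simplified character scanner. It is not the Tree-sitter grammar described above and is far less robust: it ignores verbatim environments, \if branches, and custom commands, and it would mis-split a file whose \end{document} itself sits inside a comment. It only shows why an escaped \% must not be mistaken for a comment and how trailing text is separated.

```python
def hidden_content(tex: str) -> list[tuple[str, str]]:
    """Return (mechanism, text) pairs for line comments (A) and post-document text (B)."""
    found = []
    body, sep, tail = tex.partition(r"\end{document}")
    for line in body.splitlines():
        i = 0
        while i < len(line):
            if line[i] == "\\":
                i += 2  # skip the escaped character, e.g. \% or \\
                continue
            if line[i] == "%":
                comment = line[i + 1:].strip()
                if comment:
                    found.append(("A", comment))
                break  # the rest of the line is part of the comment
            i += 1
    # Anything after \end{document} is never typeset.
    if sep and tail.strip():
        found.append(("B", tail.strip()))
    return found


sample = r"""\documentclass{article}
\begin{document}
Accuracy improves by 5\% overall. % TODO: rerun, numbers look off
\end{document}
Old draft paragraph that never reaches the PDF.
"""
print(hidden_content(sample))
```

A parser-based approach avoids the blind spots listed above because it reasons about document structure rather than individual characters.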

Content Semantics

Local LLM Classification

We apply a locally hosted Qwen2.5-72B model to classify 113 million unique comments by content type and to flag potentially sensitive material. This way, we ensure that no sensitive content is transmitted to third parties.

Secret Discovery

Regex Pattern Matching

Targeted regular expressions search all source files for concrete sensitive content: API keys, passwords, private keys, URLs to internal documents, and other credentials. Subsequently, matches undergo manual validation to confirm true positives.
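A minimal sketch of this kind of scan is shown below. The three patterns are illustrative assumptions (the expressions used in the study are not listed here), and the sample uses the placeholder access key ID from AWS documentation rather than a real credential. As in the study, any match would be only a candidate that still needs manual validation.

```python
import re

# Illustrative patterns only (assumptions for this example).
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
    "google_docs_link": re.compile(r"https://docs\.google\.com/document/d/[\w-]+"),
}


def find_candidates(text: str) -> list[tuple[str, str]]:
    """Return (label, matched text) pairs; every hit still needs manual review."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group(0)))
    return hits


sample = (
    "% key for the plotting server: AKIAIOSFODNN7EXAMPLE\n"
    "% rebuttal draft: https://docs.google.com/document/d/abc123_XYZ/edit\n"
)
print(find_candidates(sample))
```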

Sanitization

Tool Evaluation

We evaluate existing sanitization tools alongside our own ALC-NG against nine test cases, covering the most prevalent methods to insert comments in LaTeX source files. This assessment highlights that most tools have shortcomings of some kind, disregarding certain methods or even breaking source compilation after attempting sanitization.

To focus on content that originates from a paper’s own authors rather than shared templates, we define “interesting” content as any item appearing in at most two submissions. We were able to successfully compile 85% of papers (2.32 million) in our TeX Live 2023 evaluation environment.
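The "at most two submissions" rule can be illustrated with a small filter. This is a sketch under the assumption that items are compared as exact strings; how the study normalizes items before counting is not stated here, and the submission identifiers and comments are made up.

```python
from collections import defaultdict


def interesting_items(
    items_by_submission: dict[str, set[str]], max_submissions: int = 2
) -> dict[str, set[str]]:
    """Keep only items that occur in at most `max_submissions` submissions."""
    seen_in = defaultdict(set)
    for submission, items in items_by_submission.items():
        for item in items:
            seen_in[item].add(submission)
    return {
        submission: {item for item in items if len(seen_in[item]) <= max_submissions}
        for submission, items in items_by_submission.items()
    }


# Made-up example: one template comment shared by three submissions.
comments = {
    "sub1": {"Uncomment for camera-ready", "TODO: ask Bob about Table 2"},
    "sub2": {"Uncomment for camera-ready"},
    "sub3": {"Uncomment for camera-ready", "reviewer 2 will hate this"},
}
print(interesting_items(comments))
```

The shared template comment is dropped everywhere, while the two author-specific comments survive.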

Figure 2. Annual arXiv submissions with and without LaTeX source files. Source availability has remained above 91% since 2013, reaching essentially all submissions in recent years.

Key Findings

What We Found Across 2.7 Million Papers

Across all three dimensions, hidden information is widespread. Focusing on unique content likely originating from each paper’s own authors: 75% of submissions contain unique interesting comments, 56% contain unique metadata, and 25% have unique dangling files. Combined, 88% are affected in at least one dimension. The most common combination is comments and metadata (30%); 18% of submissions are affected in all three dimensions simultaneously.

Figure 3. Share of arXiv submissions (with source files) affected by each dimension of hidden information, per year from 1990 to 2025, shown for “unique” content (likely author-specific) and for all detected content.

Figure 4. Complementary CDFs (1-CDF): fraction of submissions with at least the given count of hidden items. Solid lines show all detected items; dashed lines show unique (author-specific) items only.

Dimension 1

Dangling Files: 12 Million Unnecessary Files

We identify 12 million dangling files across the dataset. Most are benign leftovers (template style files, unused images), but a long tail of rare file types is predominantly dangling and carries genuine risk:

File Type	Share of Submissions	Risk
README files	3.4%	Project structure, collaborator info
Python scripts	0.21%	Analysis code, possibly with hardcoded secrets
Shell scripts	0.21%	Build workflows, server names
CSV data files	0.20%	Unpublished research data
Config files	2,774 submissions	API keys, database connection strings
Git repositories	74 submissions	Full editing history, potentially with credentials or removed content
NFS metadata files	40 submissions	Filesystem artifacts confirming authors tried to delete these files

Bibliography files are a special case. They are redundant once the compiled .bbl file exists, yet 93% of submitted .bib files contain entries not referenced in the compiled document. Of these: 8.4% of submissions reuse a bibliography from a prior paper verbatim; 5.5% include large community-wide bibliography databases with more than 500 entries. Beyond unused entries, .bib files also carry 42 million non-standard extra fields, of which 29 million appear unique to individual submissions, often containing author ratings, editorial notes, or local file paths.
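To illustrate what "non-standard extra fields" can look like, the sketch below lists field names in a .bib file that fall outside the classic BibTeX field set. Both the choice of that set as the baseline and the line-based field detection are assumptions made for this example; a real analysis would need a proper BibTeX parser.

```python
import re

# Classic BibTeX fields, used here as an assumed baseline for "standard".
STANDARD_FIELDS = {
    "address", "annote", "author", "booktitle", "chapter", "crossref",
    "edition", "editor", "howpublished", "institution", "journal", "key",
    "month", "note", "number", "organization", "pages", "publisher",
    "school", "series", "title", "type", "volume", "year",
}

# Simplification: a field is a name followed by "=" at the start of a line.
FIELD_RE = re.compile(r"^\s*([A-Za-z][\w-]*)\s*=", re.MULTILINE)


def extra_fields(bib_text: str) -> list[str]:
    """Return the sorted, de-duplicated field names outside STANDARD_FIELDS."""
    names = (match.group(1).lower() for match in FIELD_RE.finditer(bib_text))
    return sorted({name for name in names if name not in STANDARD_FIELDS})


# Made-up entry with a local file path and a private rating.
sample = """@article{doe2020,
  author = {Jane Doe},
  title = {A Title},
  year = {2020},
  file = {/home/jdoe/papers/doe2020.pdf},
  rating = {weak evaluation},
}
"""
print(extra_fields(sample))
```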

Template files can reveal submission histories. Conference-specific style files (e.g., usenix.sty) in dangling positions might hint at where a paper was submitted before. In a sample of 115 submissions with dangling USENIX templates matched via dblp, 77% were ultimately published elsewhere. We identified 47 venues and 561 submissions potentially traceable this way.

Figure 5. Distribution of file types in arXiv sources (MIME type → subtype → extension → dangling/required). Some file types (e.g., images) are often dangling; others (e.g., TeX files) rarely are.

Dimension 2

Embedded Metadata: Timestamps, GPS, and Usernames

90% of submissions include files with modification timestamps. Of these, 72% show their most recent file modification within one hour of the arXiv-recorded submission time (35% within five minutes), revealing fine-grained working patterns. Additionally, 93% of all images and 93% of all included PDFs carry exiftool-extractable metadata. Key disclosures beyond what is already visible in the compiled PDF:

Usernames: 11%

11% of submissions expose system usernames via image or file metadata. In double-blind reviewing, this information could unblind authors.

GPS coordinates: 7,326 submissions

7,326 submissions carry GPS metadata in images. 2,238 contain multiple distinct locations. 235 of those span distances up to 50 km, a range consistent with commuting between workplace and home. In a random sample of ten such submissions, nine included coordinates for both research institutions and residential areas.
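The distance between locations recorded in one submission can be estimated with the haversine formula, as in the sketch below. The study does not state here how it computed distances, so the formula, the mean Earth radius of 6371 km, and the function names are assumptions of this example; coordinates are expected in decimal degrees.

```python
import math
from itertools import combinations

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def max_pairwise_km(points: list[tuple[float, float]]) -> float:
    """Largest distance between any two (lat, lon) points; 0.0 for fewer than two."""
    return max(
        (haversine_km(a[0], a[1], b[0], b[1]) for a, b in combinations(points, 2)),
        default=0.0,
    )


# Made-up coordinates for illustration.
locations = [(52.5200, 13.4050), (52.4000, 13.0500), (52.5205, 13.4049)]
print(round(max_pairwise_km(locations), 1))
```

A submission whose largest pairwise distance stays within a few tens of kilometres would fall into the commuting-range group described above.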

Software: 60%

60% of submissions disclose the software used to create files (e.g., application name and version). This is the most frequent metadata category, though usually of comparatively low sensitivity.

Figure 6. Left: share of file modifications by hour of day, separately for weekdays and weekends (local time). Right: count of unique files by how long before arXiv submission they were last modified. Most were touched minutes before upload.

Dimension 3

Irrelevant Content: 644 Million Comments Analyzed

Using our AST-based parser, we extract 644 million pieces of irrelevant content (comments, unused branches, post-document text) across 95% of all submissions. After filtering for uniqueness, 114 million “interesting” items remain, present in 75% of submissions. A locally-hosted LLM classifies 113 million of these (macro F1: 0.83) to reveal what authors actually leave behind. Our structured parsing approach constitutes a reliable lower bound: it minimizes false positives while covering all commonly used comment mechanisms.

Comment Category	Share of Comments	Submissions Affected	Example Content
Academic text	44%	76%	Commented-out paragraphs, alternative phrasings, removed sections
LaTeX markup	30%	92%	Disabled table code, commented math, inert formatting blocks
Bibliography notes	9.1%	62%	Notes on references, annotation fields
Conversational	7.9%	72%	Informal co-author exchanges, critiques of text quality, strategic framing decisions
Formatting aids	1.7%	64%	Section separators, visual markers
TODO items	0.98%	44%	Unfinished tasks, sometimes acknowledging methodological weaknesses not in the published text
Potentially sensitive (LLM-flagged)	7.0%	—	PII, opinions about reviewers, credentials, personal discussions

Based on bootstrapping (95% CI), the LLM estimates that at least 3.6% of comments contain sensitive content. This statement should be treated carefully, though: sensitivity perception varies by individual, and the model is deliberately conservative to minimize false positives.
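For readers unfamiliar with the technique, a percentile bootstrap for a proportion can be sketched as below. This is a generic illustration, not the study's procedure: the resample count, the percentile method, and the synthetic labels (1 = flagged as sensitive, 0 = not flagged) are all assumptions.

```python
import random


def bootstrap_ci(
    labels: list[int], n_resamples: int = 2000, alpha: float = 0.05, seed: int = 0
) -> tuple[float, float]:
    """Percentile bootstrap interval for the share of 1-labels."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(labels)
    # Resample with replacement and record the flagged share each time.
    shares = sorted(sum(rng.choices(labels, k=n)) / n for _ in range(n_resamples))
    lower = shares[int((alpha / 2) * n_resamples)]
    upper = shares[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Synthetic sample: 70 flagged comments out of 1,000.
labels = [1] * 70 + [0] * 930
low, high = bootstrap_ci(labels)
print(f"95% interval for the flagged share: {low:.3f} to {high:.3f}")
```

Reporting the lower end of such an interval gives a conservative "at least" figure of the kind quoted above.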

Pattern-Based Detection: Specific Sensitive Findings

Beyond semantic classification, targeted pattern matching (with manual validation) identifies concrete sensitive content:

Credentials and Keys

265 API tokens in 128 submissions (26% in dangling files, the rest in comments); 4 private keys in 4 submissions; 171 passwords in 82 submissions (25% in dangling files). Their placement in commented parts or dangling files confirms that the authors most likely did not intend to share them publicly.

Editable Online Documents

3,948 submissions link to Google Docs. Of these, at least 1,119 grant viewing access and 699 grant editing access. Through manual analysis, we identify at least 200 cases exposing sensitive content: peer-review materials, cover letters, rebuttals, meeting minutes with Zoom links, student assignments, and shift schedules. Additionally, 4,272 submissions link to Google Drive; 26,000 to Overleaf projects.

Protocols and URLs

127 million unique URLs across 905,000 submissions. 29,000 submissions reference FTP URLs; 349 reference SSH connection strings. Publicly accessible FTP servers that are still online may serve sensitive content or expose internal infrastructure.

Participant Survey Data

18 cases where discovered Google Docs links led to live survey data of study participants, a potential ethics and data protection violation for the researchers involved.

Hidden LLM Instructions

162 submissions contain prompt-injection-style patterns in commented sections, likely intended to influence AI-assisted peer review. 537 submissions embed common AI-generated text disclaimers (e.g., “As an AI language model”) in comments, indicating generated content.

Review-Related Content

3,200 submissions contain review-related keywords (“reviewer,” “rebuttal,” “not correct”) in commented sections, likely responses to peer feedback that authors forgot to remove. 269 submissions use LaTeX censoring packages (e.g., censor, pdfprivacy), suggesting attempted but ineffective sanitization.

Figure 7. Top-10 URL protocols (left) and domain categories (right, classified via Cloudflare) found in arXiv LaTeX comments, by total occurrences and by unique URLs. Beyond prevalent HTTPS, sources reference FTP servers, SSH strings, and file-sharing services.

Research Fields

Computer Science Papers Are Disproportionately Affected

Computer science (CS) papers contain significantly more “hidden” information than papers from other disciplines, even after normalizing for source file size (Mann-Whitney U, p < 0.0001). Within CS, security papers exceed the CS average, and A*/A-ranked security venue papers exceed lower-ranked ones. This finding likely reflects more elaborate source archives (more files, larger bibliographies, more co-authors) rather than greater carelessness.

Figures 8 & 9. Prevalence of hidden information by arXiv research field and by CORE ranking among matched security papers, for all and for unique content. Math/Physics papers tend to have fewer dangling files, likely due to simpler source archives; A*/A-ranked security papers have disproportionately more.

Security researchers are presumably more aware of information disclosure risks than average researchers, yet their papers are disproportionately affected. This observation reinforces that the issue stems from the structure of the tooling and workflow rather than from individual carelessness.

Paper Versions

Updating a Paper Does Not Fix the Problem

arXiv allows submitters to upload new versions of a paper over time. 1.1 million submissions in our dataset have multiple versions (4.3 million versions in total). One might hope that authors use version updates to remove accidentally disclosed content. At scale, however, we find no consistent indication that subsequent versions reduce the amount of hidden information. If anything, later versions tend to add new unique content first, and only remove it in the long tail.

Individual examples confirm this trend: one submission initially containing many valid OpenID tokens removed them in v2, yet the original v1 remains permanently accessible. Similarly, seven respondents to our survey reported uploading cleaned replacement versions without realizing that old versions and their sources remain downloadable.

Figure 10. Average hidden-item counts by paper version, normalized to v1 = 100 (left axis), with the fraction of papers reaching at least that version as 1−CDF (right axis, log scale). Later versions consistently accumulate more unique content rather than removing it.

Sanitization Tools

Why Existing Cleaning Tools Fall Short

Several sanitization tools promise to “clean” LaTeX sources before arXiv submission. We test all of them against nine test cases covering relevant methods for embedding comments in LaTeX. None of the evaluated tools performs satisfactorily, and critically, all tools ignore the metadata dimension entirely.

Tool	Dangling Files	Metadata	Comment Coverage	Papers Benefited	Sources Broken
Perl one-liner (arXiv, 2005)	None	None	Partial	95%	4.2%
arxiv_latex_cleaner (ALC, Google)	Partial	None	Partial; mishandles special environments	91%	19%
latexindent.pl	None	None	Partial	3.7%	9.2%
arXiv Cleaner (Elsa-Lab)	Partial	None	Partial; crashes on comment environments	80%	80%
ALC-NG (this work)	Complete	Complete	Exhaustive	98%	15%

The most widely used tool, arxiv_latex_cleaner (ALC), produces pixel-perfect PDFs in 88% of compilable cases; ALC-NG does so in 87%. Despite arXiv endorsing ALC since its release in 2018, we estimate only 538 submissions (0.10%) show evidence of having been run through it. ALC also leaves residual content: after cleaning, we still find 362 sensitive regex matches in comments it failed to remove, including links to Overleaf, illustrating that partial cleaning can create a false sense of security.

Storage Implications

Beyond privacy, unnecessarily submitted content costs arXiv permanent storage. Accurate sanitization of the latest versions alone could free 658 GB (22.9%) of their source storage. Considering all versions, the savings amount to 2,720 GB (23.3%), of which 97% stems from dangling files.

Our Solution

ALC-NG: Next-Generation LaTeX Sanitization

Given the systematic shortcomings of existing tools, we develop and open-source ALC-NG. To the best of our knowledge, it is the first tool to address all three dimensions of hidden information, combining established best practices for each.

Dimension 1 · Files

pdflatex Recorder

Uses pdflatex’s built-in file access log to identify exactly which files are needed for compilation. No heuristics. Validated against file system access timestamps; we do not observe any discrepancies. It further respects arXiv-specific conventions: 00README files and /anc/ directories for intentional ancillary files.

Dimension 2 · Metadata

Exiftool Integration

Strips embedded metadata from images and PDFs across all supported file types. Optionally suppresses file modification timestamps. The first sanitization tool that also accounts for the metadata dimension.

Dimension 3 · Content

Tree-sitter AST Grammar

Re-purposes the same AST-based detection used in our analysis to reliably identify and remove all considered types of irrelevant LaTeX content. It passes all nine synthetic comment cleanup tests; no existing tool achieves this coverage.

ALC-NG can validate its own output: optionally, it compares the original and sanitized PDFs pixel-by-pixel using pdfium to confirm that no visible content was altered. It produces pixel-perfect output in 87% of compilable cases, benefits 98% of submissions, and breaks only 15% of sources, comparable to or better than all existing tools on each metric.

Get ALC-NG at alc-ng.de: open source, designed to be usable by inexperienced LaTeX users, and ready to run before your next arXiv submission.

Responsible Disclosure

Notifying 2,660 Authors: What We Learned

We disclosed validated findings (exposed credentials, accessible documents) to 2,660 authors of affected papers and invited them to a survey on awareness and practices around arXiv source files. We received 112 complete responses (5.3% response rate). 29% identified as assistant professors, 16% full professors, 28% permanent research staff, and 18% postdocs and PhD students. 49% reported more than ten arXiv submissions.

Prior awareness

Only 41% were already aware that arXiv publishes source files and of the associated risks before receiving our disclosure notice.

Sensitivity perceptions

43% agreed that the disclosed content is sensitive. Conversational comments were seen as sensitive by 39%; TODO items by 38%. Respondents who reported no sensitive content had significantly fewer actual issues (Mann-Whitney U, p ≤ 0.03).

Cleaning practices

Only 28% had heard of cleaning tools; 11% used one regularly; 8.9% applied manual cleaning before submission. Seven authors uploaded cleaned versions without realizing that old versions remain accessible. 51 authors replied positively via email; several restricted access to exposed online documents.

We acknowledge potential response bias: participants were primed by our disclosure notice immediately before the survey, and authors who did not consider their findings sensitive were less likely to respond. All research was conducted under institutional ethics review. The LLM analysis used a locally-hosted model. Web resources were accessed read-only with strict rate limiting; no credentials were used or tested. We publish our findings without paper identifiers to protect affected individuals.

Recommendations

What Should Change

For Authors

Sanitize sources before public distribution

Remove all comments and dangling files from source files prior to distributing them, e.g., before submission to arXiv, preferably using a sanitization tool with good coverage such as ALC-NG. This step should include removing sensitive metadata from included PDF files and images.

Act on already-exposed information

Affected authors are likely unable to fully remove unintentionally-disclosed information due to existing source file mirrors. Consequently, corrective actions should concentrate on revoking permissions for exposed API keys and enforcing access control for affected URLs.

For Institutions

Offer guidelines and establish comprehensible policies

To protect affiliated authors and intellectual properties, institutions should offer guidelines outlining recommended practice and establish comprehensive policies on the release of LaTeX sources to preprint repositories such as arXiv.

For Preprint Repositories

Raise awareness and enforce sensible behavior

Complementing efforts by authors and institutions, preprint repositories should explicitly raise awareness of the risks resulting from “hidden” information and enforce sensible behavior.

Warn authors at submission time

By incorporating checks on irrelevant content that sanitization tools would remove, repositories could issue warnings for detected potentially sensitive “hidden” information in uploaded source files and automatically discard dangling files.

Proactively scan already-published sources

For already-published source files, preprint repositories could proactively check for indicators of sensitive “hidden” information and work with authors to expurgate affected submissions.
