diff --git a/README.md b/README.md index 862f88c..59200d3 100644 --- a/README.md +++ b/README.md @@ -1,37 +1,104 @@ -RSS to Bluesky - in Python --------------------------- +# post2bsky -This is a proof-of-concept implementation for posting RSS/Atom content to Bluesky. Some hacking may be required. Issues and pull requests welcome to improve the system. +post2bsky is a Python-based tool for automatically posting content from RSS feeds and Twitter accounts to Bluesky (AT Protocol). It supports both RSS-to-Bluesky and Twitter-to-Bluesky synchronization, with configurable workflows for various sources. +## Features -## Built with: +- **RSS to Bluesky**: Parse RSS feeds and post new entries to Bluesky with proper formatting and media handling. +- **Twitter to Bluesky**: Scrape tweets from specified Twitter accounts and repost them to Bluesky, including media attachments. +- **Daemon Mode**: Run as a background service for continuous monitoring and posting. +- **Configurable Workflows**: Use YAML-based workflows to define sources, schedules, and posting rules. +- **Media Support**: Handle images, videos, and other media from feeds and tweets. +- **Deduplication**: Prevent duplicate posts using state tracking. +- **Logging**: Comprehensive logging for monitoring and debugging. -* [arrow](https://arrow.readthedocs.io/) - Time handling for humans -* [atproto](https://github.com/MarshalX/atproto) - AT protocol implementation for Python. The API of the library is still unstable, but the version is pinned in requirements.txt -* [fastfeedparser](https://github.com/kagisearch/fastfeedparser) - For feed parsing with a unified API -* [httpx](https://www.python-httpx.org/) - For grabbing remote media +## Installation +1. Clone the repository: + ```bash + git clone https://github.com/yourusername/post2bsky.git + cd post2bsky + ``` -## Features: +2. 
Install Python dependencies: + ```bash + pip install -r requirements.txt + ``` -* Deduplication: The script queries the target timeline and only posts RSS items that are more recent than the latest top-level post by the handle. -* Filters: Easy to extend code to support filters on RSS contents for simple transformations and limiting cross-posts. -* Minimal rich-text support (links): Rich text is represented in a typed hierarchy in the AT protocol. This script currently performs post-processing on filtered string content of the input feeds to support links as long as they stand as a single line in the text. This definitely needs some improvement. -* Threading for long posts -* Tags -* Image references: Can forward image links from RSS to Bsky +3. Set up environment variables: + Create a `.env` file with your Bluesky credentials: + ``` + BSKY_USERNAME=your_bluesky_handle + BSKY_PASSWORD=your_bluesky_password + ``` -## Usage and configuration + For Twitter scraping, additional setup may be required (see Configuration). -1. Start by installing the required libraries `pip install -r requirements.txt` -2. Copy the configuration file and then edit it `cp config.json.sample config.json` -3. Run the script like `python rss2bsky.py` +## Configuration -The configuration file accepts the configuration of: +### RSS Feeds +Use `rss2bsky.py` to post from RSS feeds. Configure the feed URL and other options via command-line arguments. -* a feed URL -* bsky parameters for a handle, username, and password - * Handle is like name.bsky.social - * Username is the email address associated with the account. - * Password is your password. If you have a literal quote it can be escaped with a backslash like `\"` -* sleep - the amount of time to sleep while running +Example: +```bash +python rss2bsky.py --feed-url https://example.com/rss --bsky-handle your_handle +``` + +### Twitter Accounts +Use `twitter2bsky_daemon.py` for Twitter-to-Bluesky posting. 
It requires browser automation for scraping. + +Configure Twitter accounts in the script or via environment variables. + +### Workflows +The `workflows/` directory contains Jenkins pipeline configurations for automated runs. Each `.yml` file defines a pipeline for a specific source (e.g., `324.yml` for the 324 RSS feed). + +To run a workflow manually, use the `sync_runner.sh` script or execute the Python scripts directly. + +## Usage + +### Running RSS Sync +```bash +python rss2bsky.py [options] +``` + +Options: +- `--feed-url`: URL of the RSS feed +- `--bsky-handle`: Your Bluesky handle +- Other options for filtering, formatting, etc. + +### Running Twitter Daemon +```bash +python twitter2bsky_daemon.py [options] +``` + +Options: +- Configure Twitter accounts and Bluesky credentials +- Run in daemon mode for continuous operation + +### Using Sync Runner +```bash +./sync_runner.sh +``` + +This script can be used to run multiple syncs or integrate with cron jobs. + +## Dependencies + +All Python dependencies are listed in `requirements.txt`. Key packages include: +- `atproto`: For Bluesky API interaction +- `fastfeedparser`: For RSS parsing +- `playwright`: For browser automation (Twitter scraping) +- `beautifulsoup4`: For HTML parsing +- And many others for media processing, logging, etc. + +## License + +This project is licensed under the GNU General Public License v3.0. See [LICENSE](LICENSE) for details. + +## Contributing + +Contributions are welcome! Please open issues or submit pull requests on GitHub. + +## Disclaimer + +This tool is for personal use and automation. Ensure compliance with the terms of service of Bluesky, Twitter, and any RSS sources you use. Respect rate limits and avoid spamming. 
\ No newline at end of file diff --git a/twitter2bsky_daemon-22.py b/twitter2bsky_daemon.py similarity index 93% rename from twitter2bsky_daemon-22.py rename to twitter2bsky_daemon.py index 09b6ffc..4333566 100644 --- a/twitter2bsky_daemon-22.py +++ b/twitter2bsky_daemon.py @@ -8,6 +8,7 @@ import httpx import time import os import subprocess +import tempfile from urllib.parse import urlparse from dotenv import load_dotenv from atproto import Client, client_utils, models @@ -21,6 +22,9 @@ SCRAPE_TWEET_LIMIT = 30 DEDUPE_BSKY_LIMIT = 30 TWEET_MAX_AGE_DAYS = 3 +STATE_MAX_ENTRIES = 5000 +STATE_MAX_AGE_DAYS = 180 + # --- Logging Setup --- logging.basicConfig( format="%(asctime)s [%(levelname)s] %(message)s", @@ -261,6 +265,27 @@ def build_text_media_key(normalized_text, media_fingerprint): return hashlib.sha256(f"{normalized_text}||{media_fingerprint}".encode("utf-8")).hexdigest() +def safe_remove_file(path): + if path and os.path.exists(path): + try: + os.remove(path) + logging.debug(f"🧹 Removed temp file: {path}") + except Exception as e: + logging.warning(f"⚠️ Could not remove temp file {path}: {e}") + + +def build_temp_video_output_path(tweet): + """ + Create a unique temp mp4 path for this tweet. 
+ """ + canonical_url = canonicalize_tweet_url(tweet.tweet_url) or "" + seed = canonical_url or f"{tweet.created_on}_{tweet.text[:50]}" + suffix = hashlib.sha256(seed.encode("utf-8")).hexdigest()[:12] + + temp_dir = tempfile.gettempdir() + return os.path.join(temp_dir, f"twitter2bsky_{suffix}.mp4") + + # --- Local State Management --- def default_state(): return { @@ -357,47 +382,59 @@ def candidate_matches_state(candidate, state): if canonical_tweet_url and canonical_tweet_url in posted_tweets: return True, "state:tweet_url" - for _, record in posted_tweets.items(): + for record in posted_tweets.values(): if record.get("text_media_key") == text_media_key: return True, "state:text_media_fingerprint" - for _, record in posted_tweets.items(): + for record in posted_tweets.values(): if record.get("normalized_text") == normalized_text: return True, "state:normalized_text" return False, None -def prune_state(state, max_entries=5000): +def prune_state(state, max_entries=STATE_MAX_ENTRIES, max_age_days=STATE_MAX_AGE_DAYS): """ Keep state file from growing forever. - Prunes oldest records by posted_at if necessary. 
+ Prunes: + - entries older than max_age_days + - entries beyond max_entries, keeping newest first + - orphan posted_by_bsky_uri keys """ posted_tweets = state.get("posted_tweets", {}) + cutoff = arrow.utcnow().shift(days=-max_age_days) - if len(posted_tweets) <= max_entries: - return state + kept_items = [] - sortable = [] for key, record in posted_tweets.items(): - posted_at = record.get("posted_at") or "" - sortable.append((key, posted_at)) + posted_at_raw = record.get("posted_at") + keep = True - sortable.sort(key=lambda x: x[1], reverse=True) - keep_keys = {key for key, _ in sortable[:max_entries]} + if posted_at_raw: + try: + posted_at = arrow.get(posted_at_raw) + if posted_at < cutoff: + keep = False + except Exception: + pass - new_posted_tweets = {} - for key, record in posted_tweets.items(): - if key in keep_keys: - new_posted_tweets[key] = record + if keep: + kept_items.append((key, record)) - new_posted_by_bsky_uri = {} - for bsky_uri, key in state.get("posted_by_bsky_uri", {}).items(): - if key in keep_keys: - new_posted_by_bsky_uri[bsky_uri] = key + kept_items.sort(key=lambda item: item[1].get("posted_at") or "", reverse=True) + kept_items = kept_items[:max_entries] + + keep_keys = {key for key, _ in kept_items} + + state["posted_tweets"] = {key: record for key, record in kept_items} + + posted_by_bsky_uri = state.get("posted_by_bsky_uri", {}) + state["posted_by_bsky_uri"] = { + bsky_uri: key + for bsky_uri, key in posted_by_bsky_uri.items() + if key in keep_keys + } - state["posted_tweets"] = new_posted_tweets - state["posted_by_bsky_uri"] = new_posted_by_bsky_uri return state @@ -898,12 +935,8 @@ def download_and_crop_video(video_url, output_path): return None finally: - for path in [temp_input, temp_output]: - if os.path.exists(path): - try: - os.remove(path) - except Exception: - pass + safe_remove_file(temp_input) + safe_remove_file(temp_output) def candidate_matches_existing_bsky(candidate, recent_bsky_posts): @@ -938,6 +971,8 @@ def 
sync_feeds(args): logging.info("🔄 Starting sync cycle...") try: state = load_state(STATE_PATH) + state = prune_state(state) + save_state(state, STATE_PATH) tweets = scrape_tweets_via_playwright( args.twitter_username, @@ -1028,7 +1063,7 @@ def sync_feeds(args): return new_posts = 0 - state_file = "twitter_browser_state.json" + browser_state_file = "twitter_browser_state.json" with sync_playwright() as p: browser = p.chromium.launch( @@ -1043,8 +1078,8 @@ def sync_feeds(args): ), "viewport": {"width": 1920, "height": 1080}, } - if os.path.exists(state_file): - context_kwargs["storage_state"] = state_file + if os.path.exists(browser_state_file): + context_kwargs["storage_state"] = browser_state_file context = browser.new_context(**context_kwargs) @@ -1078,7 +1113,7 @@ def sync_feeds(args): logging.warning("⚠️ Tweet has video marker but no tweet URL. Skipping video.") continue - temp_video_path = "temp_video.mp4" + temp_video_path = build_temp_video_output_path(tweet) try: real_video_url = extract_video_url_from_tweet_page(context, tweet.tweet_url) @@ -1099,8 +1134,9 @@ def sync_feeds(args): video_embed = build_video_embed(video_blob, dynamic_alt) finally: - if os.path.exists(temp_video_path): - os.remove(temp_video_path) + safe_remove_file(temp_video_path) + safe_remove_file(temp_video_path.replace(".mp4", "_source.mp4")) + safe_remove_file(temp_video_path.replace(".mp4", "_cropped.mp4")) try: post_result = None @@ -1116,7 +1152,7 @@ def sync_feeds(args): bsky_uri = getattr(post_result, "uri", None) remember_posted_tweet(state, candidate, bsky_uri=bsky_uri) - state = prune_state(state, max_entries=5000) + state = prune_state(state) save_state(state, STATE_PATH) recent_bsky_posts.insert(0, { @@ -1186,4 +1222,4 @@ def main(): if __name__ == "__main__": - main() + main() \ No newline at end of file
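
Reviewer note: the new two-stage pruning introduced above (age cutoff, then a size cap keeping newest entries, then orphan-URI cleanup) can be exercised in isolation. The sketch below mirrors the patched `prune_state` logic, with one substitution so it runs standalone: stdlib `datetime` stands in for `arrow`, and the state dict is a stub. It is illustrative, not the implementation from the patch.

```python
from datetime import datetime, timedelta, timezone

STATE_MAX_ENTRIES = 5000
STATE_MAX_AGE_DAYS = 180


def prune_state(state, max_entries=STATE_MAX_ENTRIES, max_age_days=STATE_MAX_AGE_DAYS):
    """Age out old records, cap total size (newest first), and drop
    posted_by_bsky_uri keys whose target record was pruned."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)

    kept = []
    for key, record in state.get("posted_tweets", {}).items():
        raw = record.get("posted_at")
        keep = True
        if raw:
            try:
                if datetime.fromisoformat(raw) < cutoff:
                    keep = False
            except Exception:
                pass  # records with unparseable timestamps are kept, as in the patch
        if keep:
            kept.append((key, record))

    # Newest first; `or ""` guards against a missing or None posted_at in the sort key.
    kept.sort(key=lambda item: item[1].get("posted_at") or "", reverse=True)
    kept = kept[:max_entries]

    keep_keys = {key for key, _ in kept}
    state["posted_tweets"] = dict(kept)
    state["posted_by_bsky_uri"] = {
        uri: key
        for uri, key in state.get("posted_by_bsky_uri", {}).items()
        if key in keep_keys
    }
    return state
```

A quick check of the behavior: a record dated 2020 is dropped by the age cutoff, a fresh record survives, and the URI index is rewritten to reference only surviving keys.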