Added new yml

2026-04-05 09:04:46 +02:00
parent cce6ff558c
commit 6fdd588179
2 changed files with 164 additions and 61 deletions
--- a/README.md
+++ b/README.md
@@ -1,37 +1,104 @@
-RSS to Bluesky - in Python
--------------------------
+# post2bsky

-This is a proof-of-concept implementation for posting RSS/Atom content to Bluesky. Some hacking may be required. Issues and pull requests welcome to improve the system.
+post2bsky is a Python-based tool for automatically posting content from RSS feeds and Twitter accounts to Bluesky (AT Protocol). It supports both RSS-to-Bluesky and Twitter-to-Bluesky synchronization, with configurable workflows for various sources.

+## Features

-## Built with:
+- **RSS to Bluesky**: Parse RSS feeds and post new entries to Bluesky with proper formatting and media handling.
+- **Twitter to Bluesky**: Scrape tweets from specified Twitter accounts and repost them to Bluesky, including media attachments.
+- **Daemon Mode**: Run as a background service for continuous monitoring and posting.
+- **Configurable Workflows**: Use YAML-based workflows to define sources, schedules, and posting rules.
+- **Media Support**: Handle images, videos, and other media from feeds and tweets.
+- **Deduplication**: Prevent duplicate posts using state tracking.
+- **Logging**: Comprehensive logging for monitoring and debugging.

-* [arrow](https://arrow.readthedocs.io/) - Time handling for humans
-* [atproto](https://github.com/MarshalX/atproto) - AT protocol implementation for Python. The API of the library is still unstable, but the version is pinned in requirements.txt
-* [fastfeedparser](https://github.com/kagisearch/fastfeedparser) - For feed parsing with a unified API
-* [httpx](https://www.python-httpx.org/) - For grabbing remote media
+## Installation

+1. Clone the repository:
+   ```bash
+   git clone https://github.com/yourusername/post2bsky.git
+   cd post2bsky
+   ```

-## Features:
+2. Install Python dependencies:
+   ```bash
+   pip install -r requeriments.txt
+   ```

-* Deduplication: The script queries the target timeline and only posts RSS items that are more recent than the latest top-level post by the handle.
-* Filters: Easy to extend code to support filters on RSS contents for simple transformations and limiting cross-posts.
-* Minimal rich-text support (links): Rich text is represented in a typed hierarchy in the AT protocol. This script currently performs post-processing on filtered string content of the input feeds to support links as long as they stand as a single line in the text. This definitely needs some improvement.
-* Threading for long posts
-* Tags
-* Image references: Can forward image links from RSS to Bsky
+3. Set up environment variables:
+   Create a `.env` file with your Bluesky credentials:
+   ```
+   BSKY_USERNAME=your_bluesky_handle
+   BSKY_PASSWORD=your_bluesky_password
+   ```

-## Usage and configuration
+   For Twitter scraping, additional setup may be required (see Configuration).

-1. Start by installing the required libraries `pip install -r requirements.txt`
-2. Copy the configuration file and then edit it `cp config.json.sample config.json`
-3. Run the script like `python rss2bsky.py`
+## Configuration

-The configuration file accepts the configuration of:
+### RSS Feeds
+Use `rss2bsky.py` to post from RSS feeds. Configure the feed URL and other options via command-line arguments.

-* a feed URL
-* bsky parameters for a handle, username, and password
-  * Handle is like name.bsky.social
-  * Username is the email address associated with the account.
-  * Password is your password. If you have a literal quote it can be escaped with a backslash like `\"`
-* sleep - the amount of time to sleep while running
+Example:
+```bash
+python rss2bsky.py --feed-url https://example.com/rss --bsky-handle your_handle
+```
+
+### Twitter Accounts
+Use `twitter2bsky_daemon.py` for Twitter-to-Bluesky posting. It requires browser automation for scraping.
+
+Configure Twitter accounts in the script or via environment variables.
+
+### Workflows
+The `workflows/` directory contains Jenkins pipeline configurations for automated runs. Each `.yml` file defines a pipeline for a specific source (e.g., `324.yml` for the 324 RSS feed).
+
+To run a workflow manually, use the `sync_runner.sh` script or execute the Python scripts directly.
+
+## Usage
+
+### Running RSS Sync
+```bash
+python rss2bsky.py [options]
+```
+
+Options:
+- `--feed-url`: URL of the RSS feed
+- `--bsky-handle`: Your Bluesky handle
+- Other options for filtering, formatting, etc.
+
+### Running Twitter Daemon
+```bash
+python twitter2bsky_daemon.py [options]
+```
+
+Options:
+- Configure Twitter accounts and Bluesky credentials
+- Run in daemon mode for continuous operation
+
+### Using Sync Runner
+```bash
+./sync_runner.sh
+```
+
+This script can be used to run multiple syncs or integrate with cron jobs.
+
+## Dependencies
+
+All Python dependencies are listed in `requeriments.txt`. Key packages include:
+- `atproto`: For Bluesky API interaction
+- `fastfeedparser`: For RSS parsing
+- `playwright`: For browser automation (Twitter scraping)
+- `beautifulsoup4`: For HTML parsing
+- And many others for media processing, logging, etc.
+
+## License
+
+This project is licensed under the GNU General Public License v3.0. See [LICENSE](LICENSE) for details.
+
+## Contributing
+
+Contributions are welcome! Please open issues or submit pull requests on GitHub.
+
+## Disclaimer
+
+This tool is for personal use and automation. Ensure compliance with the terms of service of Bluesky, Twitter, and any RSS sources you use. Respect rate limits and avoid spamming.
--- a/twitter2bsky_daemon-22.py
+++ b/twitter2bsky_daemon-22.py
@@ -8,6 +8,7 @@ import httpx
 import time
 import os
 import subprocess
+import tempfile
 from urllib.parse import urlparse
 from dotenv import load_dotenv
 from atproto import Client, client_utils, models
@@ -21,6 +22,9 @@ SCRAPE_TWEET_LIMIT = 30
 DEDUPE_BSKY_LIMIT = 30
 TWEET_MAX_AGE_DAYS = 3

+STATE_MAX_ENTRIES = 5000
+STATE_MAX_AGE_DAYS = 180
+
 # --- Logging Setup ---
 logging.basicConfig(
    format="%(asctime)s [%(levelname)s] %(message)s",
@@ -261,6 +265,27 @@ def build_text_media_key(normalized_text, media_fingerprint):
    return hashlib.sha256(f"{normalized_text}||{media_fingerprint}".encode("utf-8")).hexdigest()


+def safe_remove_file(path):
+    if path and os.path.exists(path):
+        try:
+            os.remove(path)
+            logging.debug(f"🧹 Removed temp file: {path}")
+        except Exception as e:
+            logging.warning(f"⚠️ Could not remove temp file {path}: {e}")
+
+
+def build_temp_video_output_path(tweet):
+    """
+    Create a unique temp mp4 path for this tweet.
+    """
+    canonical_url = canonicalize_tweet_url(tweet.tweet_url) or ""
+    seed = canonical_url or f"{tweet.created_on}_{tweet.text[:50]}"
+    suffix = hashlib.sha256(seed.encode("utf-8")).hexdigest()[:12]
+
+    temp_dir = tempfile.gettempdir()
+    return os.path.join(temp_dir, f"twitter2bsky_{suffix}.mp4")
+
+
 # --- Local State Management ---
 def default_state():
    return {
@@ -357,47 +382,59 @@ def candidate_matches_state(candidate, state):
    if canonical_tweet_url and canonical_tweet_url in posted_tweets:
        return True, "state:tweet_url"

-    for _, record in posted_tweets.items():
+    for record in posted_tweets.values():
        if record.get("text_media_key") == text_media_key:
            return True, "state:text_media_fingerprint"

-    for _, record in posted_tweets.items():
+    for record in posted_tweets.values():
        if record.get("normalized_text") == normalized_text:
            return True, "state:normalized_text"

    return False, None


-def prune_state(state, max_entries=5000):
+def prune_state(state, max_entries=STATE_MAX_ENTRIES, max_age_days=STATE_MAX_AGE_DAYS):
    """
    Keep state file from growing forever.
-    Prunes oldest records by posted_at if necessary.
+    Prunes:
+    - entries older than max_age_days
+    - entries beyond max_entries, keeping newest first
+    - orphan posted_by_bsky_uri keys
    """
    posted_tweets = state.get("posted_tweets", {})
+    cutoff = arrow.utcnow().shift(days=-max_age_days)

-    if len(posted_tweets) <= max_entries:
-        return state
+    kept_items = []

-    sortable = []
    for key, record in posted_tweets.items():
-        posted_at = record.get("posted_at") or ""
-        sortable.append((key, posted_at))
+        posted_at_raw = record.get("posted_at")
+        keep = True

-    sortable.sort(key=lambda x: x[1], reverse=True)
-    keep_keys = {key for key, _ in sortable[:max_entries]}
+        if posted_at_raw:
+            try:
+                posted_at = arrow.get(posted_at_raw)
+                if posted_at < cutoff:
+                    keep = False
+            except Exception:
+                pass

-    new_posted_tweets = {}
-    for key, record in posted_tweets.items():
-        if key in keep_keys:
-            new_posted_tweets[key] = record
+        if keep:
+            kept_items.append((key, record))

-    new_posted_by_bsky_uri = {}
-    for bsky_uri, key in state.get("posted_by_bsky_uri", {}).items():
-        if key in keep_keys:
-            new_posted_by_bsky_uri[bsky_uri] = key
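+    # `or ""` keeps the sort safe when a record stores posted_at as None (as the removed code did).
+    kept_items.sort(key=lambda item: item[1].get("posted_at") or "", reverse=True)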
+    kept_items = kept_items[:max_entries]
+
+    keep_keys = {key for key, _ in kept_items}
+
+    state["posted_tweets"] = {key: record for key, record in kept_items}
+
+    posted_by_bsky_uri = state.get("posted_by_bsky_uri", {})
+    state["posted_by_bsky_uri"] = {
+        bsky_uri: key
+        for bsky_uri, key in posted_by_bsky_uri.items()
+        if key in keep_keys
+    }

-    state["posted_tweets"] = new_posted_tweets
-    state["posted_by_bsky_uri"] = new_posted_by_bsky_uri
    return state


@@ -898,12 +935,8 @@ def download_and_crop_video(video_url, output_path):
        return None

    finally:
-        for path in [temp_input, temp_output]:
-            if os.path.exists(path):
-                try:
-                    os.remove(path)
-                except Exception:
-                    pass
+        safe_remove_file(temp_input)
+        safe_remove_file(temp_output)


 def candidate_matches_existing_bsky(candidate, recent_bsky_posts):
@@ -938,6 +971,8 @@ def sync_feeds(args):
    logging.info("🔄 Starting sync cycle...")
    try:
        state = load_state(STATE_PATH)
+        state = prune_state(state)
+        save_state(state, STATE_PATH)

        tweets = scrape_tweets_via_playwright(
            args.twitter_username,
@@ -1028,7 +1063,7 @@ def sync_feeds(args):
            return

        new_posts = 0
-        state_file = "twitter_browser_state.json"
+        browser_state_file = "twitter_browser_state.json"

        with sync_playwright() as p:
            browser = p.chromium.launch(
@@ -1043,8 +1078,8 @@ def sync_feeds(args):
                ),
                "viewport": {"width": 1920, "height": 1080},
            }
-            if os.path.exists(state_file):
-                context_kwargs["storage_state"] = state_file
+            if os.path.exists(browser_state_file):
+                context_kwargs["storage_state"] = browser_state_file

            context = browser.new_context(**context_kwargs)

@@ -1078,7 +1113,7 @@ def sync_feeds(args):
                                logging.warning("⚠️ Tweet has video marker but no tweet URL. Skipping video.")
                                continue

-                            temp_video_path = "temp_video.mp4"
+                            temp_video_path = build_temp_video_output_path(tweet)

                            try:
                                real_video_url = extract_video_url_from_tweet_page(context, tweet.tweet_url)
@@ -1099,8 +1134,9 @@ def sync_feeds(args):
                                video_embed = build_video_embed(video_blob, dynamic_alt)

                            finally:
-                                if os.path.exists(temp_video_path):
-                                    os.remove(temp_video_path)
+                                safe_remove_file(temp_video_path)
+                                safe_remove_file(temp_video_path.replace(".mp4", "_source.mp4"))
+                                safe_remove_file(temp_video_path.replace(".mp4", "_cropped.mp4"))

                try:
                    post_result = None
@@ -1116,7 +1152,7 @@ def sync_feeds(args):
                    bsky_uri = getattr(post_result, "uri", None)

                    remember_posted_tweet(state, candidate, bsky_uri=bsky_uri)
-                    state = prune_state(state, max_entries=5000)
+                    state = prune_state(state)
                    save_state(state, STATE_PATH)

                    recent_bsky_posts.insert(0, {
@@ -1186,4 +1222,4 @@ def main():


 if __name__ == "__main__":
-    main()
+    main()
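
Review note: the reworked `prune_state` prunes in three steps (age cutoff, newest-first cap, orphan `posted_by_bsky_uri` cleanup). The sketch below is a standalone approximation of that flow for experimenting outside the daemon. It is not the code from this commit: the name `prune_state_sketch` and the `now` parameter are hypothetical, and it swaps `arrow` for the standard-library `datetime`, so it only accepts ISO 8601 timestamps that `datetime.fromisoformat` can parse.

```python
from datetime import datetime, timedelta, timezone


def prune_state_sketch(state, max_entries=5000, max_age_days=180, now=None):
    """Approximate the three pruning steps described in the prune_state docstring."""
    # `now` is a hypothetical parameter added so the cutoff can be fixed in tests.
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)

    kept = []
    for key, record in state.get("posted_tweets", {}).items():
        raw = record.get("posted_at")
        keep = True
        if raw:
            try:
                posted_at = datetime.fromisoformat(raw)
                # Assumption: naive timestamps are treated as UTC.
                if posted_at.tzinfo is None:
                    posted_at = posted_at.replace(tzinfo=timezone.utc)
                if posted_at < cutoff:
                    keep = False
            except (ValueError, TypeError):
                # Unparseable timestamps are kept, mirroring the diff's behavior.
                pass
        if keep:
            kept.append((key, record))

    # Newest first; `or ""` sorts records without a timestamp last.
    kept.sort(key=lambda item: item[1].get("posted_at") or "", reverse=True)
    kept = kept[:max_entries]
    keep_keys = {key for key, _ in kept}

    state["posted_tweets"] = dict(kept)
    # Drop URI index entries that point at pruned (or never-existing) keys.
    state["posted_by_bsky_uri"] = {
        uri: key
        for uri, key in state.get("posted_by_bsky_uri", {}).items()
        if key in keep_keys
    }
    return state
```

One design point worth noting: sorting on the raw string only orders entries correctly when all timestamps share the same ISO 8601 format and offset, which is also true of the version in the diff.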